---
layout: "guides"
page_title: "Using Prometheus to Monitor Nomad Metrics"
sidebar_current: "guides-operations-monitoring-prometheus"
description: |-
  It is possible to collect metrics on Nomad with Prometheus after enabling
  telemetry on Nomad servers and clients.
---

# Using Prometheus to Monitor Nomad Metrics

This guide explains how to configure [Prometheus][prometheus] to integrate with
a Nomad cluster and Prometheus [Alertmanager][alertmanager]. While this guide
introduces the basics of enabling [telemetry][telemetry] and alerting, a Nomad
operator can go much further by customizing dashboards and integrating
different [receivers][receivers] for alerts.

## Reference Material

- [Configuring Prometheus][configuring prometheus]
- [Telemetry Stanza in Nomad Agent Configuration][telemetry stanza]
- [Alerting Overview][alerting]

## Estimated Time to Complete

25 minutes

## Challenge

Think of a scenario where a Nomad operator needs to deploy Prometheus to
collect metrics from a Nomad cluster. The operator must enable telemetry on
the Nomad servers and clients as well as configure Prometheus to use Consul
for service discovery. The operator must also configure Prometheus
Alertmanager so notifications can be sent out to a specified
[receiver][receivers].

## Solution

Deploy Prometheus with a configuration that accounts for a highly dynamic
environment. Integrate service discovery into the configuration file to avoid
using hard-coded IP addresses. Place the Prometheus deployment behind
[fabio][fabio], which will allow easy access to the Prometheus web interface:
the Nomad operator can hit any of the client nodes at the `/` path.

## Prerequisites

To perform the tasks described in this guide, you need to have a Nomad
environment with Consul installed. You can use this
[repo](https://github.com/hashicorp/nomad/tree/master/terraform#provision-a-nomad-cluster-in-the-cloud)
to easily provision a sandbox environment. This guide will assume a cluster
with one server node and three client nodes.

-> **Please Note:** This guide is for demo purposes and only uses a single
server node. In a production cluster, 3 or 5 server nodes are recommended. The
alerting rules defined in this guide are for instructional purposes. Please
refer to [Alerting Rules][alertingrules] for more information.

## Steps

### Step 1: Enable Telemetry on Nomad Servers and Clients

Add the stanza below in your Nomad client and server configuration files. If
you have used the provided repo in this guide to set up a Nomad cluster, the
configuration file will be `/etc/nomad.d/nomad.hcl`. After making this change,
restart the Nomad service on each server and client node.

```hcl
telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}
```

### Step 2: Create a Job for Fabio

Create a job for fabio and name it `fabio.nomad`:

```hcl
job "fabio" {
  datacenters = ["dc1"]
  type = "system"

  group "fabio" {
    task "fabio" {
      driver = "docker"
      config {
        image = "fabiolb/fabio"
        network_mode = "host"
      }

      resources {
        cpu    = 100
        memory = 64
        network {
          mbits = 20
          port "lb" {
            static = 9999
          }
          port "ui" {
            static = 9998
          }
        }
      }
    }
  }
}
```

To learn more about fabio and the options used in this job file, see
[Load Balancing with Fabio][fabio-lb]. For the purpose of this guide, it is
important to note that the `type` option is set to [system][system] so that
fabio will be deployed on all client nodes. We have also set `network_mode` to
`host` so that fabio will be able to use Consul for service discovery.
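Before moving on, you can spot-check that the telemetry changes from Step 1
took effect. Nomad serves its metrics over the HTTP API, and passing
`format=prometheus` returns them in Prometheus's exposition format. A minimal
sketch, assuming you can reach a node's HTTP port (4646 is the default;
`<node IP>` is a placeholder for any server or client address):

```shell
# Fetch metrics in Prometheus exposition format from a Nomad agent.
# A response full of lines like "nomad_client_uptime ..." confirms that
# prometheus_metrics = true is active on that node.
$ curl http://<node IP>:4646/v1/metrics?format=prometheus
```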
### Step 3: Run the Fabio Job

We can now register our fabio job:

```shell
$ nomad job run fabio.nomad
==> Monitoring evaluation "7b96701e"
    Evaluation triggered by job "fabio"
    Allocation "d0e34682" created: node "28d7f859", group "fabio"
    Allocation "238ec0f7" created: node "510898b6", group "fabio"
    Allocation "9a2e8359" created: node "f3739267", group "fabio"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7b96701e" finished with status "complete"
```

At this point, you should be able to visit any one of your client nodes at
port `9998` and see the web interface for fabio. The routing table will be
empty since we have not yet deployed anything that fabio can route to.
Accordingly, if you visit any of the client nodes at port `9999` at this
point, you will get a `404` HTTP response. That will change soon.

### Step 4: Create a Job for Prometheus

Create a job for Prometheus and name it `prometheus.nomad`:

```hcl
job "prometheus" {
  datacenters = ["dc1"]
  type = "service"

  group "monitoring" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }

    task "prometheus" {
      template {
        change_mode = "noop"
        destination = "local/prometheus.yml"
        data = <<EOH
---
global:
  scrape_interval:     5s
  evaluation_interval: 5s

scrape_configs:

  - job_name: 'nomad_metrics'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['nomad-client', 'nomad']

    relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)http(.*)'
      action: keep

    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']
EOH
      }

      driver = "docker"
      config {
        image = "prom/prometheus:latest"
        volumes = [
          "local/prometheus.yml:/etc/prometheus/prometheus.yml"
        ]
        port_map {
          prometheus_ui = 9090
        }
      }

      resources {
        network {
          mbits = 10
          port "prometheus_ui" {}
        }
      }

      service {
        name = "prometheus"
        tags = ["urlprefix-/"]
        port = "prometheus_ui"
        check {
          name     = "prometheus_ui port alive"
          type     = "http"
          path     = "/-/healthy"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```

Notice we are using the [template][template] stanza to generate the Prometheus
configuration, pulling in [environment][env] variables. Here, the
`NOMAD_IP_prometheus_ui` variable in the [consul_sd_configs][consul_sd_config]
section lets Prometheus use Consul to detect and scrape targets. This works in
our example because Consul is installed alongside Nomad, and it spares us from
hard-coding IP addresses. If you did not use the repo provided in this guide
to create your cluster, be sure to point this section at a Consul server you
have set up.

The [volumes][volumes] option mounts the configuration file that the template
stanza dynamically creates into the Prometheus container.

### Step 5: Run the Prometheus Job

We can now register our job for Prometheus:

```shell
$ nomad job run prometheus.nomad
==> Monitoring evaluation "4e6b7127"
    Evaluation triggered by job "prometheus"
    Evaluation within deployment: "d3a651a7"
    Allocation "9725af3d" created: node "28d7f859", group "monitoring"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "4e6b7127" finished with status "complete"
```

Prometheus is now deployed. You can visit any of your client nodes at port
`9999` to reach the web interface. There is only one instance of Prometheus
running in the Nomad cluster, but you are automatically routed to it
regardless of which node you visit because fabio is deployed and running on
the cluster as well.

At the top menu bar, click on `Status` and then `Targets`. You should see all
of your Nomad nodes (servers and clients) show up as targets. Please note that
the IP addresses will be different in your cluster.

[![Prometheus Targets][prometheus-targets]][prometheus-targets]

Let's use Prometheus to query how many jobs are running in our Nomad cluster.
On the main page, type `nomad_nomad_job_summary_running` into the query
section. You can also select the query from the drop-down list.

[![Running Jobs][running-jobs]][running-jobs]

You can see that the value for our fabio job is `3` since it uses the
[system][system] scheduler type and we are running three Nomad clients in our
demo cluster. The value for our Prometheus job, on the other hand, is `1`
since we have only deployed one instance of it. To see descriptions of the
other available metrics, visit the [telemetry][telemetry] section.

### Step 6: Deploy Alertmanager

Now that we have enabled Prometheus to collect metrics from our cluster and
have seen the state of our jobs, let's deploy [Alertmanager][alertmanager].
Keep in mind that Prometheus sends alerts to Alertmanager. It is then
Alertmanager's job to send out notifications on those alerts to any designated
[receiver][receivers].
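As a point of reference, receivers are defined in Alertmanager's own
configuration file, not in the Nomad job. The sketch below is a minimal,
hypothetical `alertmanager.yml` that routes every alert to a single webhook
receiver; the receiver name and URL are placeholders, and the stock
`prom/alertmanager` image used in this guide ships with a default
configuration, so no such file is required for the steps that follow:

```yaml
route:
  # Send every alert to the receiver named below.
  receiver: 'example-webhook'

receivers:
  - name: 'example-webhook'
    webhook_configs:
      # Placeholder endpoint; real setups would use a dedicated config
      # block (slack_configs, pagerduty_configs, etc.) instead.
      - url: 'http://example.com/alert-hook'
```

With that context in mind, let's deploy Alertmanager itself.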
Create a job for Alertmanager and name it `alertmanager.nomad`:

```hcl
job "alertmanager" {
  datacenters = ["dc1"]
  type = "service"

  group "alerting" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }

    task "alertmanager" {
      driver = "docker"
      config {
        image = "prom/alertmanager:latest"
        port_map {
          alertmanager_ui = 9093
        }
      }

      resources {
        network {
          mbits = 10
          port "alertmanager_ui" {}
        }
      }

      service {
        name = "alertmanager"
        tags = ["urlprefix-/alertmanager strip=/alertmanager"]
        port = "alertmanager_ui"
        check {
          name     = "alertmanager_ui port alive"
          type     = "http"
          path     = "/-/healthy"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```

### Step 7: Configure Prometheus to Integrate with Alertmanager

Now that we have deployed Alertmanager, let's slightly modify our Prometheus
job configuration so that Prometheus can recognize and send alerts to it. Note
that there are some rules in the configuration that refer to a web server we
will deploy soon.

Below is the same Prometheus configuration we detailed above, but we have
added some sections that hook Prometheus into Alertmanager and set up some
alerting rules.

```hcl
job "prometheus" {
  datacenters = ["dc1"]
  type = "service"

  group "monitoring" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }

    task "prometheus" {
      template {
        change_mode = "noop"
        destination = "local/webserver_alert.yml"
        data = <<EOH
---
groups:
- name: prometheus_alerts
  rules:
  - alert: Webserver down
    expr: absent(up{job="webserver"})
    for: 10s
    labels:
      severity: critical
    annotations:
      description: "Our webserver is down."
EOH
      }

      template {
        change_mode = "noop"
        destination = "local/prometheus.yml"
        data = <<EOH
---
global:
  scrape_interval:     5s
  evaluation_interval: 5s

alerting:
  alertmanagers:
  - consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['alertmanager']

rule_files:
  - "webserver_alert.yml"

scrape_configs:

  - job_name: 'alertmanager'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['alertmanager']

  - job_name: 'nomad_metrics'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['nomad-client', 'nomad']

    relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)http(.*)'
      action: keep

    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']

  - job_name: 'webserver'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['webserver']

    metrics_path: /metrics
EOH
      }

      driver = "docker"
      config {
        image = "prom/prometheus:latest"
        volumes = [
          "local/webserver_alert.yml:/etc/prometheus/webserver_alert.yml",
          "local/prometheus.yml:/etc/prometheus/prometheus.yml"
        ]
        port_map {
          prometheus_ui = 9090
        }
      }

      resources {
        network {
          mbits = 10
          port "prometheus_ui" {}
        }
      }

      service {
        name = "prometheus"
        tags = ["urlprefix-/"]
        port = "prometheus_ui"
        check {
          name     = "prometheus_ui port alive"
          type     = "http"
          path     = "/-/healthy"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```

Notice we have added a second [template][template] stanza that generates
`webserver_alert.yml`, an alerting rule which fires if Prometheus detects that
our web server service has disappeared. The `prometheus.yml` configuration now
also contains an `alerting` block that uses Consul service discovery to locate
Alertmanager, a `rule_files` block that points Prometheus at the new rule, and
additional scrape jobs for Alertmanager and the web server.

Re-register the modified Prometheus job:

```shell
$ nomad job run prometheus.nomad
```

### Step 8: Deploy Web Server

Create a job for our web server and name it `webserver.nomad`:

```hcl
job "webserver" {
  datacenters = ["dc1"]
  type = "service"

  group "webserver" {
    task "server" {
      driver = "docker"
      config {
        image = "hashicorp/demo-prometheus-instrumentation:latest"
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "http" {}
        }
      }

      service {
        name = "webserver"
        port = "http"

        tags = [
          "testweb",
          "urlprefix-/webserver strip=/webserver",
        ]

        check {
          type     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "2s"
        }
      }
    }
  }
}
```

Run the web server job:

```shell
$ nomad job run webserver.nomad
```

The web server registers itself with Consul as `webserver` and exposes metrics
for Prometheus to scrape. If you go back to the `Targets` page in the
Prometheus web interface, you should see the new `alertmanager` and
`webserver` targets listed.

[![New Targets][new-targets]][new-targets]

You can also click on `Alerts` in the top menu bar; the `Webserver down` alert
is listed but inactive since our web server is running.

[![Alerts][alerts]][alerts]

### Step 9: Stop the Web Server

Stop the web server job so that we can trigger the alert we defined:

```shell
$ nomad job stop webserver
```

Go back to the `Alerts` page in the Prometheus web interface. After a short
delay, you will see that the `Webserver down` alert is active.

[![Active Alert][active-alerts]][active-alerts]

Recall that we configured fabio to route the `/alertmanager` path to
Alertmanager, so you can reach the Alertmanager web interface at any client
node: `< client node IP >:9999/alertmanager`. You should see that Alertmanager
has received the alert.

[![Alertmanager Web UI][alertmanager-webui]][alertmanager-webui]

## Next Steps

Read more about Prometheus [Alertmanager][alertmanager] and how to configure
it to send out notifications to a [receiver][receivers] of your choice.

[active-alerts]: /assets/images/active-alert.png
[alerts]: /assets/images/alerts.png
[alerting]: https://prometheus.io/docs/alerting/overview/
[alertingrules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[alertmanager-webui]: /assets/images/alertmanager-webui.png
[configuring prometheus]: https://prometheus.io/docs/introduction/first_steps/#configuring-prometheus
[consul_sd_config]: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cconsul_sd_config%3E
[env]: /docs/runtime/environment.html
[fabio]: https://fabiolb.net/
[fabio-lb]: /guides/load-balancing/fabio.html
[new-targets]: /assets/images/new-targets.png
[prometheus]: https://prometheus.io/docs/introduction/overview/
[prometheus-targets]: /assets/images/prometheus-targets.png
[receivers]: https://prometheus.io/docs/alerting/configuration/#%3Creceiver%3E
[running-jobs]: /assets/images/running-jobs.png
[system]: /docs/schedulers.html#system
[telemetry]: /docs/configuration/telemetry.html
[telemetry stanza]: /docs/configuration/telemetry.html
[template]: /docs/job-specification/template.html
[volumes]: /docs/drivers/docker.html#volumes
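If you would rather confirm the alert from the command line than the web
interface, Alertmanager also serves its current alerts over an HTTP API. A
quick sketch, assuming the fabio route configured in Step 6 (`/alertmanager`)
and the v1 API path exposed by Alertmanager releases current at the time of
writing:

```shell
# List the alerts Alertmanager currently holds, routed through fabio.
# Replace <client node IP> with the address of any Nomad client.
$ curl http://<client node IP>:9999/alertmanager/api/v1/alerts
```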