---
layout: "docs"
page_title: "Overview"
sidebar_current: "docs-telemetry-overview"
description: |-
  Overview of runtime metrics available in Nomad along with monitoring and
  alerting.
---

# Telemetry Overview

The Nomad client and server agents collect a wide range of runtime metrics
related to the performance of the system. On the server side, leaders and
followers have metrics in common as well as metrics that are specific to their
roles. Clients have separate metrics for the host and for
allocations/tasks, both of which have to be [explicitly
enabled][telemetry-stanza]. There are also runtime metrics that are common to
all servers and clients.

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Nomad supports [gauges, counters, and
timers][metric-types].

There are three ways to obtain metrics from Nomad:

* Query the [/metrics API endpoint][metrics-api-endpoint] to return metrics for
  the current Nomad process (as of Nomad 0.7). This endpoint supports Prometheus
  formatted metrics.
* Send the USR1 signal to the Nomad process. This will dump the current
  telemetry information to STDERR (on Linux).
* Configure Nomad to automatically forward metrics to a third-party provider.
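
As a rough sketch of the first option, the helpers below build the endpoint URL
and pull gauge values out of a decoded JSON response. The agent address
(`http://127.0.0.1:4646`), the `/v1/metrics` path, the `format=prometheus`
query parameter, and the `Gauges`/`Name`/`Value` field names are assumptions
based on a default agent setup, so check them against the API documentation for
your Nomad version. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed default agent address; adjust for your cluster.
DEFAULT_ADDR = "http://127.0.0.1:4646"


def metrics_url(addr=DEFAULT_ADDR, prometheus=False):
    """Build the metrics endpoint URL, optionally asking for Prometheus format."""
    url = addr.rstrip("/") + "/v1/metrics"
    if prometheus:
        url += "?" + urlencode({"format": "prometheus"})
    return url


def gauge_values(payload, name):
    """Return the value of every gauge called `name` in a decoded JSON payload."""
    return [g["Value"] for g in payload.get("Gauges", []) if g.get("Name") == name]


def fetch_metrics(addr=DEFAULT_ADDR):
    """Fetch and decode the JSON metrics for the agent at `addr` (needs a running agent)."""
    with urlopen(metrics_url(addr)) as resp:
        return json.load(resp)
```

With an agent running locally, `gauge_values(fetch_metrics(), "nomad.runtime.num_goroutines")`
would return the current goroutine gauge readings, if the payload has the
assumed shape.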

Nomad 0.7 added support for [tagged metrics][tagged-metrics], improving the
integrations with [Datadog][datadog-telem] and [Prometheus][prometheus-telem].
Metrics can also be forwarded to [Statsite][statsite-telem],
[StatsD][statsd-telem], and [Circonus][circonus-telem].

## Alerting

The recommended practice for alerting is to leverage the alerting capabilities
of your monitoring provider. Nomad’s intention is to surface metrics that enable
users to configure the necessary alerts using their existing monitoring systems
as a scaffold, rather than to natively support alerting. Here are a few common
patterns:

* Export metrics from Nomad to Prometheus using the [StatsD
  exporter][statsd-exporter], define [alerting rules][alerting-rules] in
  Prometheus, and use [Alertmanager][alertmanager] for summarization and
  routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is
  supported for [Datadog][datadog-alerting].

* Periodically submit test jobs into Nomad to determine if your application
  deployment pipeline is working end-to-end. This pattern is well-suited to
  batch processing workloads.

* Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios
  monitor when a new Nomad job is added. When a job is removed, remove the
  Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a
  job-specific alerting system.

* Write a script that looks at the history of each batch job to determine
  whether or not the job is in an unhealthy state, updating your monitoring
  system as appropriate. In many cases, it may be ok if a given batch job fails
  occasionally, as long as it goes back to passing.
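
The last pattern can be sketched as a small check over a job's run history. This
is only an illustration: the history format (a list of booleans, oldest first)
and the threshold of three consecutive failures are assumptions, and collecting
the history from Nomad is left out.

```python
def is_unhealthy(history, max_consecutive_failures=3):
    """Decide whether a batch job looks unhealthy.

    `history` is a list of booleans, oldest run first, where True means the
    run succeeded. Occasional failures are tolerated; the job is flagged only
    when the most recent runs have all failed.
    """
    streak = 0
    for succeeded in reversed(history):
        if succeeded:
            break  # the job went back to passing, so stop counting
        streak += 1
    return streak >= max_consecutive_failures
```

A job whose latest run passed is never flagged, which matches the idea that an
occasional failure is tolerable as long as the job recovers.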

# Key Performance Indicators

The sections below cover a number of important metrics.

## Consensus Protocol (Raft)

Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking issues
between the servers or insufficient CPU resources. Users in cloud environments
often bump their servers up to the next instance class with improved networking
and CPU to stabilize leader elections. The `nomad.raft.leader.lastContact` metric
is a general indicator of Raft latency which can be used to observe how Raft
timing is performing and guide the decision to upgrade to more powerful servers.
`nomad.raft.leader.lastContact` should not get too close to the leader lease
timeout of 500ms.
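
A minimal sketch of such a check, assuming you already have recent
`nomad.raft.leader.lastContact` samples in milliseconds from your monitoring
system. The 50% warning fraction is an arbitrary illustrative choice, not a
recommendation from this document.

```python
LEASE_TIMEOUT_MS = 500.0  # leader lease timeout stated above


def last_contact_alert(samples_ms, warn_fraction=0.5):
    """Return True when the worst recent sample reaches the warning threshold.

    `warn_fraction` is the assumed share of the lease timeout at which to warn.
    """
    if not samples_ms:
        return False
    return max(samples_ms) >= warn_fraction * LEASE_TIMEOUT_MS
```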

## Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable indicator
that membership is unstable.

## Scheduling

The following metrics allow an operator to observe changes in throughput at the
various points in the scheduling process (evaluation, scheduling/planning, and
placement):

* **nomad.broker.total_blocked** - The number of blocked evaluations.
* **nomad.worker.invoke_scheduler.\<type\>** - The time to run the scheduler of
  the given type.
* **nomad.plan.evaluate** - The time to evaluate a scheduler Plan.
* **nomad.plan.submit** - The time to submit a scheduler Plan.
* **nomad.plan.queue_depth** - The number of scheduler Plans waiting to be
  evaluated.

Upticks in any of the above metrics indicate a decrease in scheduler throughput.
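
One simple way to turn "uptick" into something testable is to compare a recent
window of samples against the earlier baseline. The sketch below assumes a plain
list of numeric samples for a single metric. The window size and the 2x factor
are illustrative assumptions, and most monitoring systems offer their own
equivalent.

```python
from statistics import mean


def is_uptick(samples, recent=5, factor=2.0):
    """Return True when the mean of the last `recent` samples is at least
    `factor` times the mean of all earlier samples."""
    if len(samples) <= recent:
        return False  # not enough history to form a baseline
    baseline = mean(samples[:-recent])
    current = mean(samples[-recent:])
    if baseline == 0:
        return current > 0  # any activity is an uptick over a zero baseline
    return current >= factor * baseline
```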

## Capacity

The importance of monitoring resource availability is workload specific. Batch
processing workloads often operate under the assumption that the cluster should
be at or near capacity, with queued jobs running as soon as adequate resources
become available. Clusters that are primarily responsible for long-running
services with an uptime requirement may want to maintain headroom at 20% or
more. The following metrics can be used to assess capacity across the cluster on
a per-client basis:

* **nomad.client.allocated.cpu**
* **nomad.client.unallocated.cpu**
* **nomad.client.allocated.disk**
* **nomad.client.unallocated.disk**
* **nomad.client.allocated.iops**
* **nomad.client.unallocated.iops**
* **nomad.client.allocated.memory**
* **nomad.client.unallocated.memory**
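
As a sketch of the headroom calculation, the functions below take one
allocated/unallocated pair per client (for example the CPU or memory gauges
above) and report which clients fall under a minimum headroom. The input shape
and function names are hypothetical, and the 20% default simply mirrors the
figure mentioned above.

```python
def headroom(allocated, unallocated):
    """Fraction of a resource that is still unallocated (0.0 to 1.0)."""
    total = allocated + unallocated
    if total <= 0:
        return 0.0
    return unallocated / total


def low_headroom_clients(clients, minimum=0.20):
    """Return the sorted names of clients whose headroom is below `minimum`.

    `clients` maps a client name to an (allocated, unallocated) pair for a
    single resource.
    """
    return sorted(
        name
        for name, (allocated, unallocated) in clients.items()
        if headroom(allocated, unallocated) < minimum
    )
```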

## Task Resource Consumption

The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per-task basis. For user-facing services, it is common to alert
when the CPU is at or above the reserved resources for the task.

## Job and Task Status

We do not currently surface metrics for job and task/allocation status, although
we will consider adding metrics where it makes sense.

## Runtime Metrics

Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure:

* **nomad.runtime.num_goroutines**
* **nomad.runtime.heap_objects**
* **nomad.runtime.alloc_bytes**

It is recommended to alert on upticks in any of the above, server memory usage
in particular.


[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/telemetry/metrics.html#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry.html#circonus
[collection-interval]: /docs/configuration/telemetry.html#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry.html#datadog
[prometheus-telem]: /docs/configuration/telemetry.html#prometheus
[metrics-api-endpoint]: /api/metrics.html
[metric-types]: /docs/telemetry/metrics.html#metric-types
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry.html#statsd
[statsite-telem]: /docs/configuration/telemetry.html#statsite
[tagged-metrics]: /docs/telemetry/metrics.html#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry.html

[WIP] Add telemetry overview section (#5529)

* re-arrange telemetry docs and add overview with navigation

* update job and task status section

* fix navigation

* Update website/source/docs/telemetry/overview.html.md

Co-Authored-By: Chris Baker <cgbaker@hashicorp.com>

* Update website/source/docs/telemetry/overview.html.md

Co-Authored-By: Chris Baker <cgbaker@hashicorp.com>

* Update website/source/docs/telemetry/overview.html.md

Co-Authored-By: Chris Baker <cgbaker@hashicorp.com>

* Update website/source/docs/telemetry/metrics.html.md

Co-Authored-By: Chris Baker <cgbaker@hashicorp.com>

* Update website/source/docs/telemetry/metrics.html.md

Co-Authored-By: Chris Baker <cgbaker@hashicorp.com>

* fix formatting for nomad.plan.evaluate metric

* clarifications on collection interval and namespace label

* fix typo

* Update website/source/docs/telemetry/overview.html.md

Co-Authored-By: Chris Baker <cgbaker@hashicorp.com>

* Update website/source/docs/telemetry/overview.html.md

Co-Authored-By: Chris Baker <cgbaker@hashicorp.com>

* Update website/source/docs/telemetry/overview.html.md

Co-Authored-By: Chris Baker <cgbaker@hashicorp.com>

2019-06-19 19:25:14 +00:00