---
layout: docs
page_title: Telemetry Overview
sidebar_title: Telemetry
description: |-
  Overview of runtime metrics available in Nomad along with monitoring and
  alerting.
---
# Telemetry Overview
The Nomad client and server agents collect a wide range of runtime metrics
related to the performance of the system. Operators can use this data to gain
real-time visibility into their cluster and improve performance. Additionally,
Nomad operators can set up monitoring and alerting based on these metrics in
order to respond to any changes in the cluster state.

On the server side, leaders and followers have metrics in common as well as
metrics that are specific to their roles. Clients have separate metrics for the
host and for allocations/tasks, both of which have to be [explicitly
enabled][telemetry-stanza]. There are also runtime metrics that are common to
all servers and clients.

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Nomad supports [gauges, counters, and
timers][metric-types].

There are three ways to obtain metrics from Nomad:
- Query the [/metrics API endpoint][metrics-api-endpoint] to return metrics for
  the current Nomad process (as of Nomad 0.7). This endpoint supports
  Prometheus-formatted metrics.
- Send the USR1 signal to the Nomad process. This will dump the current
  telemetry information to STDERR (on Linux).
- Configure Nomad to automatically forward metrics to a third-party provider.

Nomad 0.7 added support for [tagged metrics][tagged-metrics], improving the
integrations with [Datadog][datadog-telem] and [Prometheus][prometheus-telem].
Metrics can also be forwarded to [Statsite][statsite-telem],
[StatsD][statsd-telem], and [Circonus][circonus-telem].
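
As an illustration of the first option, the following is a minimal Python
sketch of polling the metrics endpoint. It assumes the agent's HTTP API is
reachable at an address such as `http://127.0.0.1:4646`, that the endpoint is
served at `/v1/metrics`, and that the JSON response carries a `Gauges` list
whose entries have `Name`, `Value`, and `Labels` fields. The helper names are
hypothetical; check the response layout against your Nomad version before
relying on it.

```python
import json
import urllib.request


def metrics_url(agent_addr, prometheus=False):
    """Build the metrics URL for an agent address (assumed path: /v1/metrics)."""
    url = agent_addr.rstrip("/") + "/v1/metrics"
    if prometheus:
        # Ask for Prometheus-formatted output instead of JSON.
        url += "?format=prometheus"
    return url


def gauges_by_name(payload):
    """Group gauges from a decoded JSON payload as name -> [(labels, value)].

    The "Gauges", "Name", "Value", and "Labels" keys are assumptions about
    the response layout.
    """
    grouped = {}
    for gauge in payload.get("Gauges") or []:
        labels = gauge.get("Labels") or {}
        grouped.setdefault(gauge["Name"], []).append((labels, gauge["Value"]))
    return grouped


def fetch_gauges(agent_addr, timeout=5.0):
    """Fetch and group the gauges of a single agent (performs an HTTP request)."""
    with urllib.request.urlopen(metrics_url(agent_addr), timeout=timeout) as resp:
        return gauges_by_name(json.load(resp))
```

Because the endpoint reports metrics for the current process only, a script
like this would need to be pointed at each agent of interest.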
## Alerting
The recommended practice for alerting is to leverage the alerting capabilities
of your monitoring provider. Nomad's intention is to surface metrics that enable
users to configure the necessary alerts using their existing monitoring systems
as a scaffold, rather than to natively support alerting. Here are a few common
patterns:
- Export metrics from Nomad to Prometheus using the [StatsD
exporter][statsd-exporter], define [alerting rules][alerting-rules] in
Prometheus, and use [Alertmanager][alertmanager] for summarization and
routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is
supported for [Datadog][datadog-alerting].
- Periodically submit test jobs into Nomad to determine if your application
deployment pipeline is working end-to-end. This pattern is well-suited to
batch processing workloads.
- Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios
monitor when a new Nomad job is added. When a job is removed, remove the
Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a
job-specific alerting system.
- Write a script that looks at the history of each batch job to determine
whether or not the job is in an unhealthy state, updating your monitoring
system as appropriate. In many cases, it may be OK if a given batch job fails
occasionally, as long as it goes back to passing.
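
The last pattern can be sketched in Python as follows. This is only one
possible health rule, not a Nomad feature: it assumes you have already
collected the outcomes of a job's recent runs into a list of booleans (oldest
first), and the `max_consecutive_failures` threshold is a hypothetical
parameter to tune for your workload.

```python
def is_unhealthy(history, max_consecutive_failures=3):
    """Return True if the most recent runs form a long enough failure streak.

    history: list of booleans, oldest first; True means the run succeeded.
    An occasional failure followed by a success resets the streak, so a job
    that "goes back to passing" is treated as healthy.
    """
    streak = 0
    for succeeded in reversed(history):
        if succeeded:
            break
        streak += 1
    return streak >= max_consecutive_failures
```

For example, `is_unhealthy([True, False, True, False, True])` is `False`
because the job recovered after each failure, while
`is_unhealthy([True, False, False, False])` is `True`.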
# Key Performance Indicators
The sections below cover a number of important metrics.
## Consensus Protocol (Raft)
Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking issues
between the servers or insufficient CPU resources. Users in cloud environments
often bump their servers up to the next instance class with improved networking
and CPU to stabilize leader elections. The `nomad.raft.leader.lastContact` metric
is a general indicator of Raft latency which can be used to observe how Raft
timing is performing and guide the decision to upgrade to more powerful servers.
`nomad.raft.leader.lastContact` should not get too close to the leader lease
timeout of 500ms.
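
A check against the lease timeout might look like the following Python sketch.
The 500ms timeout comes from the text above; the `warn_fraction` threshold and
the status names are hypothetical choices for illustration, and the function
assumes the metric value has already been retrieved in milliseconds.

```python
LEADER_LEASE_TIMEOUT_MS = 500.0  # leader lease timeout stated above


def last_contact_status(last_contact_ms, warn_fraction=0.5):
    """Classify a nomad.raft.leader.lastContact reading.

    warn_fraction is an assumed threshold: readings at or above this fraction
    of the lease timeout are flagged before they reach the timeout itself.
    """
    if last_contact_ms >= LEADER_LEASE_TIMEOUT_MS:
        return "critical"
    if last_contact_ms >= warn_fraction * LEADER_LEASE_TIMEOUT_MS:
        return "warning"
    return "ok"
```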
## Federated Deployments (Serf)
Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable indicator
that membership is unstable.
## Scheduling
The following metrics allow an operator to observe changes in throughput at the
various points in the scheduling process (evaluation, scheduling/planning, and
placement):
- **nomad.broker.total_blocked** - The number of blocked evaluations.
- **nomad.worker.invoke_scheduler.\<type\>** - The time to run the scheduler of
the given type.
- **nomad.plan.evaluate** - The time to evaluate a scheduler Plan.
- **nomad.plan.submit** - The time to submit a scheduler Plan.
- **nomad.plan.queue_depth** - The number of scheduler Plans waiting to be
  evaluated.

Upticks in any of the above metrics indicate a decrease in scheduler throughput.
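
What counts as an uptick depends on the cluster. One simple approach, sketched
below in Python, compares the latest sample with the mean of the samples just
before it. The `baseline_window` and `factor` parameters are hypothetical
defaults, and the function assumes you already have a time-ordered list of
values for a single metric.

```python
from statistics import mean


def is_uptick(samples, baseline_window=5, factor=2.0):
    """Return True if the latest sample exceeds factor x the recent baseline.

    samples: numeric values for one metric, oldest first. The baseline is the
    mean of the baseline_window samples immediately preceding the latest one.
    """
    if len(samples) < baseline_window + 1:
        return False  # not enough history to form a baseline
    baseline = mean(samples[-(baseline_window + 1):-1])
    latest = samples[-1]
    if baseline == 0:
        return latest > 0
    return latest > factor * baseline
```

The same helper applies to any gauge or timer value, for example a series of
`nomad.plan.queue_depth` readings.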
## Capacity
The importance of monitoring resource availability is workload-specific. Batch
processing workloads often operate under the assumption that the cluster should
be at or near capacity, with queued jobs running as soon as adequate resources
become available. Clusters that are primarily responsible for long-running
services with an uptime requirement may want to maintain headroom at 20% or
more. The following metrics can be used to assess capacity across the cluster on
a per-client basis:
- **nomad.client.allocated.cpu**
- **nomad.client.unallocated.cpu**
- **nomad.client.allocated.disk**
- **nomad.client.unallocated.disk**
- **nomad.client.allocated.iops**
- **nomad.client.unallocated.iops**
- **nomad.client.allocated.memory**
- **nomad.client.unallocated.memory**
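
As a sketch of how the allocated/unallocated pairs can be turned into a
headroom check, the following Python computes the unallocated fraction of a
resource per client and lists clients below the 20% figure mentioned above.
The function names and the input mapping are hypothetical; the code assumes
the values for one resource (for example CPU) have already been collected for
each client.

```python
def headroom(allocated, unallocated):
    """Fraction of a resource that is still unallocated on one client."""
    total = allocated + unallocated
    if total <= 0:
        return 0.0  # treat a client reporting no capacity as having no headroom
    return unallocated / total


def clients_below_headroom(clients, minimum=0.20):
    """Return the sorted names of clients whose headroom is below minimum.

    clients: mapping of client name -> (allocated, unallocated) for one
    resource, e.g. values of nomad.client.allocated.cpu and
    nomad.client.unallocated.cpu.
    """
    return sorted(
        name
        for name, (allocated, unallocated) in clients.items()
        if headroom(allocated, unallocated) < minimum
    )
```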
## Task Resource Consumption
The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per-task basis. For user-facing services, it is common to alert
when the CPU is at or above the reserved resources for the task.
## Job and Task Status
We do not currently surface metrics for job and task/allocation status, although
we will consider adding metrics where it makes sense.
## Runtime Metrics
Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure:
- **nomad.runtime.num_goroutines**
- **nomad.runtime.heap_objects**
- **nomad.runtime.alloc_bytes**

It is recommended to alert on upticks in any of the above, server memory usage
in particular.

[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/telemetry/metrics#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry#circonus
[collection-interval]: /docs/configuration/telemetry#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry#datadog
[prometheus-telem]: /docs/configuration/telemetry#prometheus
[metrics-api-endpoint]: /1/api-docs/metrics
[metric-types]: /docs/telemetry/metrics#metric-types
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry#statsd
[statsite-telem]: /docs/configuration/telemetry#statsite
[tagged-metrics]: /docs/telemetry/metrics#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry