---
layout: docs
page_title: Telemetry Overview
sidebar_title: Telemetry
description: |-
  Overview of runtime metrics available in Nomad along with monitoring and
  alerting.
---

# Telemetry Overview

The Nomad client and server agents collect a wide range of runtime metrics
related to the performance of the system. Operators can use this data to gain
real-time visibility into their cluster and improve performance. Additionally,
Nomad operators can set up monitoring and alerting based on these metrics in
order to respond to any changes in the cluster state.

On the server side, leaders and followers have metrics in common as well as
metrics that are specific to their roles. Clients have separate metrics for the
host and for allocations/tasks, both of which have to be [explicitly
enabled][telemetry-stanza]. There are also runtime metrics that are common to
all servers and clients.

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Note that Nomad supports [gauges, counters, and
timers][metric-types].

There are three ways to obtain metrics from Nomad:

- Query the [/metrics API endpoint][metrics-api-endpoint] to return metrics for
  the current Nomad process (as of Nomad 0.7). This endpoint supports
  Prometheus-formatted metrics.
- Send the USR1 signal to the Nomad process. This dumps the current
  telemetry information to STDERR (on Linux).
- Configure Nomad to automatically forward metrics to a third-party provider.

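As a minimal sketch of the first option, the Python below builds the metrics
URL and indexes the gauges of a JSON response by name. It assumes an agent
reachable at `http://127.0.0.1:4646` (a hypothetical address), the
`/v1/metrics` path with an optional `format=prometheus` query parameter, and a
JSON body containing a `Gauges` list whose entries carry `Name` and `Value`
fields. Check the API documentation for your Nomad version before relying on
these shapes; the function names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed address of a local Nomad agent; adjust for your cluster.
DEFAULT_ADDR = "http://127.0.0.1:4646"


def metrics_url(addr=DEFAULT_ADDR, prometheus=False):
    """Build the metrics endpoint URL, optionally asking for Prometheus format."""
    url = addr.rstrip("/") + "/v1/metrics"
    if prometheus:
        url += "?" + urlencode({"format": "prometheus"})
    return url


def gauges_by_name(payload):
    """Map gauge name -> list of values (one per label set) from a JSON payload.

    The payload shape (a "Gauges" list of {"Name", "Value"} objects) is an
    assumption about the JSON response, not a guaranteed contract.
    """
    result = {}
    for gauge in payload.get("Gauges") or []:
        result.setdefault(gauge["Name"], []).append(gauge["Value"])
    return result


def fetch_gauges(addr=DEFAULT_ADDR, timeout=5.0):
    """Fetch and index gauges from a running agent (requires network access)."""
    with urlopen(metrics_url(addr), timeout=timeout) as resp:
        return gauges_by_name(json.load(resp))
```

Keeping the parsing separate from the HTTP call makes it possible to exercise
`gauges_by_name` against a saved response without a running agent.
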
Nomad 0.7 added support for [tagged metrics][tagged-metrics], improving the
integrations with [Datadog][datadog-telem] and [Prometheus][prometheus-telem].
Metrics can also be forwarded to [Statsite][statsite-telem],
[StatsD][statsd-telem], and [Circonus][circonus-telem].

## Alerting

The recommended practice for alerting is to leverage the alerting capabilities
of your monitoring provider. Nomad aims to surface metrics that let users
configure the necessary alerts in their existing monitoring systems, rather
than to support alerting natively. Here are a few common patterns:

- Export metrics from Nomad to Prometheus using the [StatsD
  exporter][statsd-exporter], define [alerting rules][alerting-rules] in
  Prometheus, and use [Alertmanager][alertmanager] for summarization and
  routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is
  supported for [Datadog][datadog-alerting].

- Periodically submit test jobs into Nomad to determine if your application
  deployment pipeline is working end-to-end. This pattern is well-suited to
  batch processing workloads.

- Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios
  monitor when a new Nomad job is added. When a job is removed, remove the
  Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a
  job-specific alerting system.

- Write a script that looks at the history of each batch job to determine
  whether or not the job is in an unhealthy state, updating your monitoring
  system as appropriate. In many cases, it may be acceptable for a given batch
  job to fail occasionally, as long as it goes back to passing.

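The last pattern can be sketched in a few lines of Python. This is one possible
health rule, not a Nomad feature: it assumes you have already collected each
job's run history as a list of booleans (oldest first, `True` for a passing
run), and it treats a job as unhealthy only after a hypothetical threshold of
three consecutive recent failures, so that an occasional failure followed by a
pass does not trigger an alert.

```python
def trailing_failures(history):
    """Count consecutive failed runs at the end of a history (oldest first)."""
    count = 0
    for succeeded in reversed(history):
        if succeeded:
            break
        count += 1
    return count


def is_unhealthy(history, max_consecutive_failures=3):
    """Flag a job only when its most recent runs have all failed.

    The threshold is an illustrative default; tune it per job.
    """
    return trailing_failures(history) >= max_consecutive_failures
```

For example, `is_unhealthy([True, False, True])` stays `False` because the job
went back to passing, while three failures in a row at the end of the history
flip it to `True`.
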
# Key Performance Indicators

The sections below cover a number of important metrics.

## Consensus Protocol (Raft)

Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking issues
between the servers or insufficient CPU resources. Users in cloud environments
often bump their servers up to the next instance class with improved networking
and CPU to stabilize leader elections. The `nomad.raft.leader.lastContact` metric
is a general indicator of Raft latency that can be used to observe how Raft
timing is performing and to guide the decision to upgrade to more powerful
servers. `nomad.raft.leader.lastContact` should not get too close to the leader
lease timeout of 500ms.

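A minimal sketch of how that guidance might be turned into an alert level,
written in Python. The 500ms lease timeout comes from the text above; the
warning fraction of 50% and the level names are assumptions chosen for
illustration, since "too close" is not defined more precisely here.

```python
# Leader lease timeout stated in the documentation, in milliseconds.
LEADER_LEASE_TIMEOUT_MS = 500.0


def last_contact_status(last_contact_ms, warn_fraction=0.5):
    """Classify a nomad.raft.leader.lastContact sample.

    warn_fraction is a hypothetical threshold: the share of the lease
    timeout at which a sample is considered worth a warning.
    """
    if last_contact_ms >= LEADER_LEASE_TIMEOUT_MS:
        return "critical"
    if last_contact_ms >= warn_fraction * LEADER_LEASE_TIMEOUT_MS:
        return "warning"
    return "ok"
```

In practice you would apply this to a high percentile of the timer over a
window rather than to single samples, which can be noisy.
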
## Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable indicator
that membership is unstable.

## Scheduling

The following metrics allow an operator to observe changes in throughput at the
various points in the scheduling process (evaluation, scheduling/planning, and
placement):

- **nomad.broker.total_blocked** - The number of blocked evaluations.
- **nomad.worker.invoke_scheduler.\<type\>** - The time to run the scheduler of
  the given type.
- **nomad.plan.evaluate** - The time to evaluate a scheduler Plan.
- **nomad.plan.submit** - The time to submit a scheduler Plan.
- **nomad.plan.queue_depth** - The number of scheduler Plans waiting to be
  evaluated.

Upticks in any of the above metrics indicate a decrease in scheduler throughput.

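What counts as an "uptick" depends on your workload. As one hedged
illustration, the Python sketch below compares the mean of the most recent
samples of a metric against the mean of the earlier samples and reports an
uptick when the recent mean exceeds the baseline by a chosen factor. The window
size and the factor of 2 are hypothetical defaults, and most monitoring systems
offer more robust anomaly detection than this.

```python
from statistics import mean


def is_uptick(samples, recent=3, factor=2.0):
    """Return True if the last `recent` samples average more than
    `factor` times the average of all earlier samples.

    samples is a list of numeric readings, oldest first.
    """
    if recent < 1:
        raise ValueError("recent must be at least 1")
    if len(samples) <= recent:
        return False  # not enough history to form a baseline
    baseline = mean(samples[:-recent])
    latest = mean(samples[-recent:])
    if baseline == 0:
        # Any activity after a flat-zero baseline counts as an uptick.
        return latest > 0
    return latest > factor * baseline
```

For instance, a `nomad.plan.queue_depth` series of `[1, 1, 1, 1, 5, 6, 7]`
would be flagged, while a series that oscillates between 4 and 5 would not.
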
## Capacity

The importance of monitoring resource availability is workload specific. Batch
processing workloads often operate under the assumption that the cluster should
be at or near capacity, with queued jobs running as soon as adequate resources
become available. Clusters that are primarily responsible for long-running
services with an uptime requirement may want to maintain headroom at 20% or
more. The following metrics can be used to assess capacity across the cluster on
a per-client basis:

- **nomad.client.allocated.cpu**
- **nomad.client.unallocated.cpu**
- **nomad.client.allocated.disk**
- **nomad.client.unallocated.disk**
- **nomad.client.allocated.iops**
- **nomad.client.unallocated.iops**
- **nomad.client.allocated.memory**
- **nomad.client.unallocated.memory**

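The headroom check can be sketched as a small Python calculation over these
gauges. The sketch assumes that, for a given resource, the allocated and
unallocated values of a client add up to its schedulable capacity, and that you
have already gathered them into a dictionary keyed by a client identifier of
your choosing; both the data shape and the function names are illustrative.

```python
def headroom(allocated, unallocated):
    """Fraction of a client's capacity that is still unallocated."""
    total = allocated + unallocated
    if total <= 0:
        return 0.0  # no reported capacity: treat as no headroom
    return unallocated / total


def clients_below_headroom(clients, minimum=0.20):
    """Return the sorted IDs of clients whose headroom is under `minimum`.

    clients maps a client ID to an (allocated, unallocated) pair for one
    resource, for example the allocated and unallocated CPU gauges.
    """
    return sorted(
        client_id
        for client_id, (allocated, unallocated) in clients.items()
        if headroom(allocated, unallocated) < minimum
    )
```

Running the check per resource (CPU, memory, disk) rather than on a combined
figure avoids one plentiful resource masking a shortage in another.
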
## Task Resource Consumption

The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per-task basis. For user-facing services, it is common to alert
when CPU usage is at or above the resources reserved for the task.

## Job and Task Status

We do not currently surface metrics for job and task/allocation status, although
we will consider adding metrics where it makes sense.

## Runtime Metrics

Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure:

- **nomad.runtime.num_goroutines**
- **nomad.runtime.heap_objects**
- **nomad.runtime.alloc_bytes**

It is recommended to alert on upticks in any of the above, server memory usage
in particular.

[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/telemetry/metrics#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry#circonus
[collection-interval]: /docs/configuration/telemetry#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry#datadog
[prometheus-telem]: /docs/configuration/telemetry#prometheus
[metrics-api-endpoint]: /api-docs/metrics
[metric-types]: /docs/telemetry/metrics#metric-types
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry#statsd
[statsite-telem]: /docs/configuration/telemetry#statsite
[tagged-metrics]: /docs/telemetry/metrics#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry