---
layout: "docs"
page_title: "Overview"
sidebar_current: "docs-telemetry-overview"
description: |-
  Overview of runtime metrics available in Nomad along with monitoring and
  alerting.
---

# Telemetry Overview

The Nomad client and server agents collect a wide range of runtime metrics
related to the performance of the system. On the server side, leaders and
followers have metrics in common as well as metrics that are specific to their
roles. Clients have separate metrics for the host and for allocations/tasks,
both of which must be [explicitly enabled][telemetry-stanza]. There are also
runtime metrics that are common to all servers and clients.

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Nomad supports [gauges, counters, and
timers][metric-types].

There are three ways to obtain metrics from Nomad:

* Query the [/metrics API endpoint][metrics-api-endpoint] to return metrics for
  the current Nomad process (as of Nomad 0.7). This endpoint supports Prometheus
  formatted metrics.
* Send the USR1 signal to the Nomad process. This will dump the current
  telemetry information to STDERR (on Linux).
* Configure Nomad to automatically forward metrics to a third-party provider.
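
As a minimal sketch of the first option, the Python snippet below builds the
endpoint URL and flattens the gauges from a JSON response. It assumes the
endpoint is served at `/v1/metrics` on the agent's HTTP address
(`http://127.0.0.1:4646` in the usage note), that `?format=prometheus` selects
the Prometheus format, and that the JSON body carries a `Gauges` list whose
entries have `Name` and `Value` fields. Verify these against the API
documentation for your Nomad version. The helper names are illustrative and
not part of Nomad.

```python
import json
import urllib.request


def metrics_url(agent_addr, prometheus=False):
    """Build the metrics endpoint URL (path and query string are assumptions)."""
    url = agent_addr.rstrip("/") + "/v1/metrics"
    if prometheus:
        url += "?format=prometheus"
    return url


def gauges_by_name(payload):
    """Flatten the assumed `Gauges` list into a {name: value} dict.

    If several gauges share a name (for example, tagged metrics that differ
    only by label), the last one in the list wins.
    """
    return {g["Name"]: g["Value"] for g in payload.get("Gauges", [])}


def fetch_gauges(agent_addr, timeout=5.0):
    """Fetch JSON metrics from a running agent and return its gauges."""
    with urllib.request.urlopen(metrics_url(agent_addr), timeout=timeout) as resp:
        return gauges_by_name(json.load(resp))
```

Against a running agent, `fetch_gauges("http://127.0.0.1:4646")` would return a
dictionary keyed by metric name.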

Nomad 0.7 added support for [tagged metrics][tagged-metrics], improving the
integrations with [Datadog][datadog-telem] and [Prometheus][prometheus-telem].
Metrics can also be forwarded to [Statsite][statsite-telem],
[StatsD][statsd-telem], and [Circonus][circonus-telem].

## Alerting

The recommended practice for alerting is to leverage the alerting capabilities
of your monitoring provider. Nomad’s intention is to surface metrics that enable
users to configure the necessary alerts using their existing monitoring systems
as a scaffold, rather than to natively support alerting. Here are a few common
patterns:

* Export metrics from Nomad to Prometheus using the [StatsD
  exporter][statsd-exporter], define [alerting rules][alerting-rules] in
  Prometheus, and use [Alertmanager][alertmanager] for summarization and
  routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is
  supported for [Datadog][datadog-alerting].

* Periodically submit test jobs into Nomad to determine if your application
  deployment pipeline is working end-to-end. This pattern is well-suited to
  batch processing workloads.

* Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios
  monitor when a new Nomad job is added. When a job is removed, remove the
  Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a
  job-specific alerting system.

* Write a script that looks at the history of each batch job to determine
  whether or not the job is in an unhealthy state, updating your monitoring
  system as appropriate. In many cases, it may be acceptable for a given batch
  job to fail occasionally, as long as it goes back to passing.
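
The last pattern can be sketched in a few lines of Python. This is one possible
health rule, not something Nomad provides: it treats a batch job as unhealthy
only when its most recent runs have all failed, so an occasional failure that
is followed by a success is ignored. The run history format (a list of
booleans, oldest first) and the threshold of three consecutive failures are
assumptions chosen for illustration.

```python
def trailing_failures(history):
    """Count consecutive failed runs at the end of `history`.

    `history` is a list of booleans, oldest run first, where True means the
    run succeeded.
    """
    count = 0
    for succeeded in reversed(history):
        if succeeded:
            break
        count += 1
    return count


def batch_job_unhealthy(history, max_consecutive_failures=3):
    """Flag a job whose latest runs have all failed (threshold is arbitrary)."""
    return trailing_failures(history) >= max_consecutive_failures
```

A script built on this would collect each job's run history from wherever it is
recorded and push the result of `batch_job_unhealthy` to the monitoring system.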

# Key Performance Indicators

The sections below cover a number of important metrics.

## Consensus Protocol (Raft)

Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking issues
between the servers or insufficient CPU resources. Users in cloud environments
often bump their servers up to the next instance class with improved networking
and CPU to stabilize leader elections. The `nomad.raft.leader.lastContact` metric
is a general indicator of Raft latency which can be used to observe how Raft
timing is performing and guide the decision to upgrade to more powerful servers.
`nomad.raft.leader.lastContact` should not get too close to the leader lease
timeout of 500ms.
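
To make "too close" concrete, a check along the following lines could classify
an observed `nomad.raft.leader.lastContact` value against the 500ms lease
timeout. The 50% and 80% thresholds and the function name are assumptions for
illustration. Nomad does not define them, so tune them to your environment.

```python
LEADER_LEASE_TIMEOUT_MS = 500.0  # leader lease timeout stated above


def last_contact_status(last_contact_ms, warn_fraction=0.5, critical_fraction=0.8):
    """Classify a lastContact reading relative to the leader lease timeout.

    The fractions are illustrative thresholds, not Nomad defaults.
    """
    ratio = last_contact_ms / LEADER_LEASE_TIMEOUT_MS
    if ratio >= critical_fraction:
        return "critical"
    if ratio >= warn_fraction:
        return "warning"
    return "ok"
```

With these defaults, a reading of 100ms is classified as `ok`, 300ms as
`warning`, and 450ms as `critical`.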

## Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable indicator
that membership is unstable.

## Scheduling

The following metrics allow an operator to observe changes in throughput at the
various points in the scheduling process (evaluation, scheduling/planning, and
placement):

* **nomad.broker.total_blocked** - The number of blocked evaluations.
* **nomad.worker.invoke_scheduler.\<type\>** - The time to run the scheduler of
  the given type.
* **nomad.plan.evaluate** - The time to evaluate a scheduler Plan.
* **nomad.plan.submit** - The time to submit a scheduler Plan.
* **nomad.plan.queue_depth** - The number of scheduler Plans waiting to be
  evaluated.

Upticks in any of the above metrics indicate a decrease in scheduler throughput.
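
What counts as an uptick is left to the operator. One simple heuristic, sketched
below in Python, compares the mean of the most recent samples of a metric with
the mean of the earlier samples. The window size, the multiplier, and the
function name are assumptions for illustration. A monitoring system's own
anomaly or threshold rules would normally take the place of this code.

```python
from statistics import mean


def is_uptick(samples, recent=3, factor=2.0):
    """Return True if the recent mean exceeds `factor` times the baseline mean.

    `samples` is a list of metric values, oldest first. The last `recent`
    values form the current window and everything before them is the baseline.
    """
    if len(samples) <= recent:
        return False  # not enough history to form a baseline
    baseline = mean(samples[:-recent])
    current = mean(samples[-recent:])
    if baseline == 0:
        return current > 0
    return current > factor * baseline
```

For example, a `nomad.plan.queue_depth` series of `[1, 1, 1, 1, 5, 6, 7]` is
flagged with the default settings, while `[2, 2, 2, 2, 2, 3, 2]` is not.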

## Capacity

The importance of monitoring resource availability is workload-specific. Batch
processing workloads often operate under the assumption that the cluster should
be at or near capacity, with queued jobs running as soon as adequate resources
become available. Clusters that are primarily responsible for long-running
services with an uptime requirement may want to maintain headroom at 20% or
more. The following metrics can be used to assess capacity across the cluster on
a per-client basis:

* **nomad.client.allocated.cpu**
* **nomad.client.unallocated.cpu**
* **nomad.client.allocated.disk**
* **nomad.client.unallocated.disk**
* **nomad.client.allocated.iops**
* **nomad.client.unallocated.iops**
* **nomad.client.allocated.memory**
* **nomad.client.unallocated.memory**
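
Headroom for a resource can be derived from each allocated/unallocated pair as
`unallocated / (allocated + unallocated)`. The Python sketch below applies that
formula per client and reports resources that fall under the 20% figure
mentioned above. The input layout (a dict of client name to per-resource
`(allocated, unallocated)` tuples) and the function names are assumptions for
illustration.

```python
def headroom(allocated, unallocated):
    """Fraction of a resource that is still unallocated on a client."""
    total = allocated + unallocated
    if total == 0:
        return 0.0  # no capacity reported, so treat as no headroom
    return unallocated / total


def clients_below_headroom(clients, minimum=0.20):
    """List (client, resource, headroom) entries under the minimum.

    `clients` maps a client name to {resource: (allocated, unallocated)}.
    """
    low = []
    for name, resources in clients.items():
        for resource, (allocated, unallocated) in resources.items():
            fraction = headroom(allocated, unallocated)
            if fraction < minimum:
                low.append((name, resource, fraction))
    return low
```

Given a client with 3000 of 4000 CPU units allocated and 7000 of 8000 memory
units allocated, only the memory entry is reported, because its headroom is
12.5% while CPU headroom is 25%.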

## Task Resource Consumption

The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per-task basis. For user-facing services, it is common to alert
when CPU usage is at or above the CPU resources reserved for the task.

## Job and Task Status

We do not currently surface metrics for job and task/allocation status, although
we will consider adding metrics where it makes sense.

## Runtime Metrics

Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure:

* **nomad.runtime.num_goroutines**
* **nomad.runtime.heap_objects**
* **nomad.runtime.alloc_bytes**

It is recommended to alert on upticks in any of the above, server memory usage
in particular.

[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/telemetry/metrics.html#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry.html#circonus
[collection-interval]: /docs/configuration/telemetry.html#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry.html#datadog
[prometheus-telem]: /docs/configuration/telemetry.html#prometheus
[metrics-api-endpoint]: /api/metrics.html
[metric-types]: /docs/telemetry/metrics.html#metric-types
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry.html#statsd
[statsite-telem]: /docs/configuration/telemetry.html#statsite
[tagged-metrics]: /docs/telemetry/metrics.html#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry.html