---
layout: docs
page_title: Metrics
sidebar_title: Metrics
description: Learn about the different metrics available in Nomad.
---

# Metrics

The Nomad agent collects various runtime metrics about the performance of
different libraries and subsystems. These metrics are aggregated on a ten
second interval and are retained for one minute.

This data can be accessed via an HTTP endpoint or by sending a signal to the
Nomad process.

As of Nomad version 0.7, this data is available via HTTP at `/metrics`. See
[Metrics](/api-docs/metrics) for more information.
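For instance, the endpoint can be queried with any HTTP client. The sketch below assumes a local agent listening on Nomad's default HTTP port, 4646:

```shell
# Assumes a local agent on Nomad's default HTTP port (4646).
NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"

# Fetch the current in-memory metrics snapshot as JSON.
# "|| true" keeps the sketch harmless when no agent is running.
curl -s "${NOMAD_ADDR}/v1/metrics" || true
```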

To view this data by sending a signal to the Nomad process: on Unix the
signal is `USR1`, while on Windows it is `BREAK`. Once Nomad receives the
signal, it will dump the current telemetry information to the agent's `stderr`.
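On a Unix system this can be done from a shell. The snippet below is a sketch that assumes a single agent process named `nomad`:

```shell
# Find the local Nomad agent and ask it to dump telemetry to stderr.
NOMAD_PID="$(pgrep -x nomad || true)"
if [ -n "$NOMAD_PID" ]; then
  kill -USR1 "$NOMAD_PID" # agent prints its current telemetry to stderr
else
  echo "no nomad agent running"
fi
```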

This telemetry information can be used for debugging or otherwise getting a
better view of what Nomad is doing.

Telemetry information can be streamed to both
[statsite](https://github.com/armon/statsite) and statsd by providing the
appropriate configuration options.

To configure the telemetry output, please see the [agent
configuration](/docs/configuration/telemetry).
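As a sketch, a minimal `telemetry` stanza wiring up both sinks might look like the following; the collector host names are placeholders for your own infrastructure:

```shell
# Write a minimal telemetry stanza. statsite_address and statsd_address
# are the agent configuration keys for the two sinks; the host names
# below are placeholders.
cat > telemetry.hcl <<'EOF'
telemetry {
  statsite_address = "statsite.example.com:8125"
  statsd_address   = "statsd.example.com:8125"
}
EOF
```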

Below is sample output of a telemetry dump:

```text
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
```

## Key Metrics

When telemetry is being streamed to statsite or statsd, `interval` is defined to
be their flush interval. Otherwise, the interval can be assumed to be 10 seconds
when retrieving metrics via the signals described above.

<table>
  <thead>
    <tr><th>Metric</th><th>Description</th><th>Unit</th><th>Type</th></tr>
  </thead>
  <tbody>
    <tr><td><code>nomad.runtime.num_goroutines</code></td><td>Number of goroutines and general load pressure indicator</td><td># of goroutines</td><td>Gauge</td></tr>
    <tr><td><code>nomad.runtime.alloc_bytes</code></td><td>Memory utilization</td><td># of bytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.runtime.heap_objects</code></td><td>Number of objects on the heap. General memory pressure indicator</td><td># of heap objects</td><td>Gauge</td></tr>
    <tr><td><code>nomad.raft.apply</code></td><td>Number of Raft transactions</td><td>Raft transactions / `interval`</td><td>Counter</td></tr>
    <tr><td><code>nomad.raft.replication.appendEntries</code></td><td>Raft transaction commit time</td><td>ms / Raft Log Append</td><td>Timer</td></tr>
    <tr><td><code>nomad.raft.leader.lastContact</code></td><td>Time since last contact to leader. General indicator of Raft latency</td><td>ms / Leader Contact</td><td>Timer</td></tr>
    <tr><td><code>nomad.broker.total_ready</code></td><td>Number of evaluations ready to be processed</td><td># of evaluations</td><td>Gauge</td></tr>
    <tr><td><code>nomad.broker.total_unacked</code></td><td>Evaluations dispatched for processing but incomplete</td><td># of evaluations</td><td>Gauge</td></tr>
    <tr><td><code>nomad.broker.total_blocked</code></td><td>Evaluations that are blocked until an existing evaluation for the same job completes</td><td># of evaluations</td><td>Gauge</td></tr>
    <tr><td><code>nomad.plan.queue_depth</code></td><td>Number of scheduler Plans waiting to be evaluated</td><td># of plans</td><td>Gauge</td></tr>
    <tr><td><code>nomad.plan.submit</code></td><td>Time to submit a scheduler Plan. Higher values cause lower scheduling throughput</td><td>ms / Plan Submit</td><td>Timer</td></tr>
    <tr><td><code>nomad.plan.evaluate</code></td><td>Time to validate a scheduler Plan. Higher values cause lower scheduling throughput. Similar to <code>nomad.plan.submit</code> but does not include RPC time or time in the Plan Queue</td><td>ms / Plan Evaluation</td><td>Timer</td></tr>
    <tr><td><code>nomad.worker.invoke_scheduler.&lt;type&gt;</code></td><td>Time to run the scheduler of the given type</td><td>ms / Scheduler Run</td><td>Timer</td></tr>
    <tr><td><code>nomad.worker.wait_for_index</code></td><td>Time waiting for Raft log replication from leader. High delays result in lower scheduling throughput</td><td>ms / Raft Index Wait</td><td>Timer</td></tr>
    <tr><td><code>nomad.heartbeat.active</code></td><td>Number of active heartbeat timers. Each timer represents a Nomad Client connection</td><td># of heartbeat timers</td><td>Gauge</td></tr>
    <tr><td><code>nomad.heartbeat.invalidate</code></td><td>The length of time it takes to invalidate a Nomad Client due to failed heartbeats</td><td>ms / Heartbeat Invalidation</td><td>Timer</td></tr>
    <tr><td><code>nomad.rpc.query</code></td><td>Number of RPC queries</td><td>RPC Queries / `interval`</td><td>Counter</td></tr>
    <tr><td><code>nomad.rpc.request</code></td><td>Number of RPC requests being handled</td><td>RPC Requests / `interval`</td><td>Counter</td></tr>
    <tr><td><code>nomad.rpc.request_error</code></td><td>Number of RPC requests being handled that result in an error</td><td>RPC Errors / `interval`</td><td>Counter</td></tr>
  </tbody>
</table>

## Client Metrics

The Nomad client emits metrics related to the resource usage of the allocations
and tasks running on it, as well as the node itself. Operators have to
explicitly turn on publishing host and allocation metrics by setting
`publish_allocation_metrics` and `publish_node_metrics` to `true`.

By default the collection interval is 1 second, but it can be changed by
setting the `collection_interval` key in the `telemetry` configuration block.

Please see the [agent configuration](/docs/configuration/telemetry)
page for more details.
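Putting these settings together with a custom interval, a client's `telemetry` stanza could look like the following sketch:

```shell
# Enable host and allocation metrics and slow collection to 5 seconds.
cat > client-telemetry.hcl <<'EOF'
telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
  collection_interval        = "5s"
}
EOF
```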

As of Nomad 0.9, Nomad emits additional labels for
[parameterized](/docs/job-specification/parameterized) and
[periodic](/docs/job-specification/periodic) jobs. Nomad emits the parent job
ID as a new label, `parent_id`. The labels `dispatch_id` and `periodic_id` are
also emitted, containing the ID of the specific invocation of the parameterized
or periodic job respectively. For example, a dispatch job with the ID
`myjob/dispatch-1312323423423` will have the following labels:

<table>
  <thead>
    <tr><th>Label</th><th>Value</th></tr>
  </thead>
  <tbody>
    <tr><td>job</td><td><code>myjob/dispatch-1312323423423</code></td></tr>
    <tr><td>parent_id</td><td>myjob</td></tr>
    <tr><td>dispatch_id</td><td>1312323423423</td></tr>
  </tbody>
</table>

## Host Metrics (post Nomad version 0.7)

Starting in version 0.7, Nomad emits [tagged metrics][tagged-metrics] in the
format below:

<table>
  <thead>
    <tr><th>Metric</th><th>Description</th><th>Unit</th><th>Type</th><th>Labels</th></tr>
  </thead>
  <tbody>
    <tr><td><code>nomad.client.allocated.cpu</code></td><td>Total amount of CPU shares the scheduler has allocated to tasks</td><td>MHz</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.unallocated.cpu</code></td><td>Total amount of CPU shares free for the scheduler to allocate to tasks</td><td>MHz</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.allocated.memory</code></td><td>Total amount of memory the scheduler has allocated to tasks</td><td>Megabytes</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.unallocated.memory</code></td><td>Total amount of memory free for the scheduler to allocate to tasks</td><td>Megabytes</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.allocated.disk</code></td><td>Total amount of disk space the scheduler has allocated to tasks</td><td>Megabytes</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.unallocated.disk</code></td><td>Total amount of disk space free for the scheduler to allocate to tasks</td><td>Megabytes</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.allocated.network</code></td><td>Total amount of bandwidth the scheduler has allocated to tasks on the given device</td><td>Megabits</td><td>Gauge</td><td>node_id, datacenter, device</td></tr>
    <tr><td><code>nomad.client.unallocated.network</code></td><td>Total amount of bandwidth free for the scheduler to allocate to tasks on the given device</td><td>Megabits</td><td>Gauge</td><td>node_id, datacenter, device</td></tr>
    <tr><td><code>nomad.client.host.memory.total</code></td><td>Total amount of physical memory on the node</td><td>Bytes</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.host.memory.available</code></td><td>Total amount of memory available to processes which includes free and cached memory</td><td>Bytes</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.host.memory.used</code></td><td>Amount of memory used by processes</td><td>Bytes</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.host.memory.free</code></td><td>Amount of memory which is free</td><td>Bytes</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.uptime</code></td><td>Uptime of the host running the Nomad client</td><td>Seconds</td><td>Gauge</td><td>node_id, datacenter</td></tr>
    <tr><td><code>nomad.client.host.cpu.total</code></td><td>Total CPU utilization</td><td>Percentage</td><td>Gauge</td><td>node_id, datacenter, cpu</td></tr>
    <tr><td><code>nomad.client.host.cpu.user</code></td><td>CPU utilization in the user space</td><td>Percentage</td><td>Gauge</td><td>node_id, datacenter, cpu</td></tr>
    <tr><td><code>nomad.client.host.cpu.system</code></td><td>CPU utilization in the system space</td><td>Percentage</td><td>Gauge</td><td>node_id, datacenter, cpu</td></tr>
    <tr><td><code>nomad.client.host.cpu.idle</code></td><td>Idle time spent by the CPU</td><td>Percentage</td><td>Gauge</td><td>node_id, datacenter, cpu</td></tr>
    <tr><td><code>nomad.client.host.disk.size</code></td><td>Total size of the device</td><td>Bytes</td><td>Gauge</td><td>node_id, datacenter, disk</td></tr>
    <tr><td><code>nomad.client.host.disk.used</code></td><td>Amount of space which has been used</td><td>Bytes</td><td>Gauge</td><td>node_id, datacenter, disk</td></tr>
    <tr><td><code>nomad.client.host.disk.available</code></td><td>Amount of space which is available</td><td>Bytes</td><td>Gauge</td><td>node_id, datacenter, disk</td></tr>
    <tr><td><code>nomad.client.host.disk.used_percent</code></td><td>Percentage of disk space used</td><td>Percentage</td><td>Gauge</td><td>node_id, datacenter, disk</td></tr>
    <tr><td><code>nomad.client.host.disk.inodes_percent</code></td><td>Disk space consumed by the inodes</td><td>Percentage</td><td>Gauge</td><td>node_id, datacenter, disk</td></tr>
    <tr><td><code>nomad.client.allocs.start</code></td><td>Number of allocations starting</td><td>Integer</td><td>Counter</td><td>node_id, job, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.running</code></td><td>Number of allocations starting to run</td><td>Integer</td><td>Counter</td><td>node_id, job, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.failed</code></td><td>Number of allocations failing</td><td>Integer</td><td>Counter</td><td>node_id, job, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.restart</code></td><td>Number of allocations restarting</td><td>Integer</td><td>Counter</td><td>node_id, job, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.complete</code></td><td>Number of allocations completing</td><td>Integer</td><td>Counter</td><td>node_id, job, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.destroy</code></td><td>Number of allocations being destroyed</td><td>Integer</td><td>Counter</td><td>node_id, job, task_group</td></tr>
  </tbody>
</table>

Nomad 0.9 adds an additional `node_class` label from the client's
`NodeClass` attribute. This label is set to the string "none" if empty.

## Host Metrics (deprecated post Nomad 0.7)

The metrics below are emitted by Nomad in versions prior to 0.7. Post-0.7 these
metrics can still be emitted in this format (as well as in the new format
detailed above), but any new metrics will only be available in the new format.

<table>
  <thead>
    <tr><th>Metric</th><th>Description</th><th>Unit</th><th>Type</th></tr>
  </thead>
  <tbody>
    <tr><td><code>nomad.client.allocated.cpu.&lt;HostID&gt;</code></td><td>Total amount of CPU shares the scheduler has allocated to tasks</td><td>MHz</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.unallocated.cpu.&lt;HostID&gt;</code></td><td>Total amount of CPU shares free for the scheduler to allocate to tasks</td><td>MHz</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.allocated.memory.&lt;HostID&gt;</code></td><td>Total amount of memory the scheduler has allocated to tasks</td><td>Megabytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.unallocated.memory.&lt;HostID&gt;</code></td><td>Total amount of memory free for the scheduler to allocate to tasks</td><td>Megabytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.allocated.disk.&lt;HostID&gt;</code></td><td>Total amount of disk space the scheduler has allocated to tasks</td><td>Megabytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.unallocated.disk.&lt;HostID&gt;</code></td><td>Total amount of disk space free for the scheduler to allocate to tasks</td><td>Megabytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.allocated.network.&lt;Device-Name&gt;.&lt;HostID&gt;</code></td><td>Total amount of bandwidth the scheduler has allocated to tasks on the given device</td><td>Megabits</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.unallocated.network.&lt;Device-Name&gt;.&lt;HostID&gt;</code></td><td>Total amount of bandwidth free for the scheduler to allocate to tasks on the given device</td><td>Megabits</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.memory.&lt;HostID&gt;.total</code></td><td>Total amount of physical memory on the node</td><td>Bytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.memory.&lt;HostID&gt;.available</code></td><td>Total amount of memory available to processes which includes free and cached memory</td><td>Bytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.memory.&lt;HostID&gt;.used</code></td><td>Amount of memory used by processes</td><td>Bytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.memory.&lt;HostID&gt;.free</code></td><td>Amount of memory which is free</td><td>Bytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.uptime.&lt;HostID&gt;</code></td><td>Uptime of the host running the Nomad client</td><td>Seconds</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.cpu.&lt;HostID&gt;.&lt;CPU-Core&gt;.total</code></td><td>Total CPU utilization</td><td>Percentage</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.cpu.&lt;HostID&gt;.&lt;CPU-Core&gt;.user</code></td><td>CPU utilization in the user space</td><td>Percentage</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.cpu.&lt;HostID&gt;.&lt;CPU-Core&gt;.system</code></td><td>CPU utilization in the system space</td><td>Percentage</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.cpu.&lt;HostID&gt;.&lt;CPU-Core&gt;.idle</code></td><td>Idle time spent by the CPU</td><td>Percentage</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.disk.&lt;HostID&gt;.&lt;Device-Name&gt;.size</code></td><td>Total size of the device</td><td>Bytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.disk.&lt;HostID&gt;.&lt;Device-Name&gt;.used</code></td><td>Amount of space which has been used</td><td>Bytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.disk.&lt;HostID&gt;.&lt;Device-Name&gt;.available</code></td><td>Amount of space which is available</td><td>Bytes</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.disk.&lt;HostID&gt;.&lt;Device-Name&gt;.used_percent</code></td><td>Percentage of disk space used</td><td>Percentage</td><td>Gauge</td></tr>
    <tr><td><code>nomad.client.host.disk.&lt;HostID&gt;.&lt;Device-Name&gt;.inodes_percent</code></td><td>Disk space consumed by the inodes</td><td>Percentage</td><td>Gauge</td></tr>
  </tbody>
</table>

## Allocation Metrics

<table>
  <thead>
    <tr><th>Metric</th><th>Description</th><th>Unit</th><th>Type</th><th>Labels</th></tr>
  </thead>
  <tbody>
    <tr><td><code>nomad.client.allocs.memory.allocated</code></td><td>Amount of memory allocated by the task</td><td>Bytes</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.memory.rss</code></td><td>Amount of RSS memory consumed by the task</td><td>Bytes</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.memory.cache</code></td><td>Amount of memory cached by the task</td><td>Bytes</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.memory.swap</code></td><td>Amount of memory swapped by the task</td><td>Bytes</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.memory.usage</code></td><td>Total amount of memory used by the task</td><td>Bytes</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.memory.max_usage</code></td><td>Maximum amount of memory ever used by the task</td><td>Bytes</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.memory.kernel_usage</code></td><td>Amount of memory used by the kernel for this task</td><td>Bytes</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.memory.kernel_max_usage</code></td><td>Maximum amount of memory ever used by the kernel for this task</td><td>Bytes</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.cpu.allocated</code></td><td>Total CPU resources allocated by the task across all cores</td><td>Percentage</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.cpu.total_percent</code></td><td>Total CPU resources consumed by the task across all cores</td><td>Percentage</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.cpu.system</code></td><td>Total CPU resources consumed by the task in the system space</td><td>Percentage</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.cpu.user</code></td><td>Total CPU resources consumed by the task in the user space</td><td>Percentage</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.cpu.throttled_time</code></td><td>Total time that the task was throttled</td><td>Nanoseconds</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.cpu.throttled_periods</code></td><td>Total number of CPU periods that the task was throttled</td><td>Integer</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
    <tr><td><code>nomad.client.allocs.cpu.total_ticks</code></td><td>CPU ticks consumed by the process in the last collection interval</td><td>Integer</td><td>Gauge</td><td>alloc_id, host, job, namespace, task, task_group</td></tr>
  </tbody>
</table>

## Job Summary Metrics

Job summary metrics are emitted by the Nomad leader server.

<table>
  <thead>
    <tr><th>Metric</th><th>Description</th><th>Unit</th><th>Type</th><th>Labels</th></tr>
  </thead>
  <tbody>
    <tr><td><code>nomad.job_summary.queued</code></td><td>Number of queued allocations for a job</td><td>Integer</td><td>Gauge</td><td>job, task_group</td></tr>
    <tr><td><code>nomad.job_summary.complete</code></td><td>Number of complete allocations for a job</td><td>Integer</td><td>Gauge</td><td>job, task_group</td></tr>
    <tr><td><code>nomad.job_summary.failed</code></td><td>Number of failed allocations for a job</td><td>Integer</td><td>Gauge</td><td>job, task_group</td></tr>
    <tr><td><code>nomad.job_summary.running</code></td><td>Number of running allocations for a job</td><td>Integer</td><td>Gauge</td><td>job, task_group</td></tr>
    <tr><td><code>nomad.job_summary.starting</code></td><td>Number of starting allocations for a job</td><td>Integer</td><td>Gauge</td><td>job, task_group</td></tr>
    <tr><td><code>nomad.job_summary.lost</code></td><td>Number of lost allocations for a job</td><td>Integer</td><td>Gauge</td><td>job, task_group</td></tr>
  </tbody>
</table>

## Job Status Metrics

Job status metrics are emitted by the Nomad leader server.

<table>
  <thead>
    <tr><th>Metric</th><th>Description</th><th>Unit</th><th>Type</th></tr>
  </thead>
  <tbody>
    <tr><td><code>nomad.job_status.pending</code></td><td>Number of pending jobs</td><td>Integer</td><td>Gauge</td></tr>
    <tr><td><code>nomad.job_status.running</code></td><td>Number of running jobs</td><td>Integer</td><td>Gauge</td></tr>
    <tr><td><code>nomad.job_status.dead</code></td><td>Number of dead jobs</td><td>Integer</td><td>Gauge</td></tr>
  </tbody>
</table>

## Metric Types

<table>
  <thead>
    <tr><th>Type</th><th>Description</th><th>Quantiles</th></tr>
  </thead>
  <tbody>
    <tr><td>Gauge</td><td>Gauge types report an absolute number at the end of the aggregation interval</td><td>false</td></tr>
    <tr><td>Counter</td><td>Counts are incremented and flushed at the end of the aggregation interval and then are reset to zero</td><td>true</td></tr>
    <tr><td>Timer</td><td>Timers measure the time to complete a task and will include quantiles, means, standard deviation, etc. per interval</td><td>true</td></tr>
  </tbody>
</table>

## Tagged Metrics

As of version 0.7, Nomad emits metrics in a tagged format. Each metric can
support more than one tag, meaning that it is possible to match over metrics
for datapoints such as a particular datacenter, and return all metrics with
this tag. Nomad supports labels for namespaces as well.
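For example, once a metrics snapshot has been saved from the agent, the tags can be used to filter it. The sketch below fakes a one-gauge snapshot in the general shape returned by the `/v1/metrics` endpoint (the exact field layout is an assumption here; in practice you would save real agent output) and greps for a datacenter label:

```shell
# Fake a tiny snapshot; a real one would come from the agent itself,
# e.g. curl -s http://127.0.0.1:4646/v1/metrics > metrics.json
cat > metrics.json <<'EOF'
{"Gauges":[{"Name":"nomad.client.uptime","Value":42,"Labels":{"datacenter":"dc1","node_id":"f1189265"}}]}
EOF

# Keep only metrics tagged with the datacenter of interest.
grep -o '"datacenter":"dc1"' metrics.json
```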

[tagged-metrics]: /docs/telemetry/metrics#tagged-metrics