---
layout: docs
page_title: Metrics
sidebar_title: Metrics
description: Learn about the different metrics available in Nomad.
---

# Metrics

The Nomad agent collects various runtime metrics about the performance of different libraries and subsystems. These metrics are aggregated on a ten second interval and are retained for one minute. This data can be accessed via an HTTP endpoint or by sending a signal to the Nomad process.

As of Nomad version 0.7, this data is available via HTTP at `/metrics`. See [Metrics](/api/metrics) for more information.

To view this data by sending a signal to the Nomad process: on Unix the signal is `USR1`, while on Windows it is `BREAK`. Once Nomad receives the signal, it dumps the current telemetry information to the agent's `stderr`.

This telemetry information can be used for debugging or otherwise getting a better view of what Nomad is doing. Telemetry information can be streamed to both [statsite](https://github.com/armon/statsite) and statsd by providing the appropriate configuration options. To configure the telemetry output, see the [agent configuration](/docs/configuration/telemetry).
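As a sketch of what that configuration can look like, the agent's `telemetry` block might resemble the following (the sink addresses are placeholders, and the `publish_*` flags are optional):

```hcl
telemetry {
  # Placeholder addresses -- point these at your own statsite/statsd sinks
  statsite_address = "statsite.example.com:8125"
  statsd_address   = "statsd.example.com:8125"

  # Also emit per-node and per-allocation resource metrics
  publish_node_metrics       = true
  publish_allocation_metrics = true
}
```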
Below is sample output of a telemetry dump:

```text
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
```

## Key Metrics

When telemetry is being streamed to statsite or statsd, `interval` is defined to be their flush interval. Otherwise, the interval can be assumed to be 10 seconds when retrieving metrics via the signals described above.
| Metric | Description | Unit | Type |
|---|---|---|---|
| `nomad.runtime.num_goroutines` | Number of goroutines and general load pressure indicator | # of goroutines | Gauge |
| `nomad.runtime.alloc_bytes` | Memory utilization | # of bytes | Gauge |
| `nomad.runtime.heap_objects` | Number of objects on the heap. General memory pressure indicator | # of heap objects | Gauge |
| `nomad.raft.apply` | Number of Raft transactions | Raft transactions / `interval` | Counter |
| `nomad.raft.replication.appendEntries` | Raft transaction commit time | ms / Raft Log Append | Timer |
| `nomad.raft.leader.lastContact` | Time since last contact to leader. General indicator of Raft latency | ms / Leader Contact | Timer |
| `nomad.broker.total_ready` | Number of evaluations ready to be processed | # of evaluations | Gauge |
| `nomad.broker.total_unacked` | Evaluations dispatched for processing but incomplete | # of evaluations | Gauge |
| `nomad.broker.total_blocked` | Evaluations that are blocked until an existing evaluation for the same job completes | # of evaluations | Gauge |
| `nomad.plan.queue_depth` | Number of scheduler Plans waiting to be evaluated | # of plans | Gauge |
| `nomad.plan.submit` | Time to submit a scheduler Plan. Higher values cause lower scheduling throughput | ms / Plan Submit | Timer |
| `nomad.plan.evaluate` | Time to validate a scheduler Plan. Higher values cause lower scheduling throughput. Similar to `nomad.plan.submit` but does not include RPC time or time in the Plan Queue | ms / Plan Evaluation | Timer |
| `nomad.worker.invoke_scheduler.<type>` | Time to run the scheduler of the given type | ms / Scheduler Run | Timer |
| `nomad.worker.wait_for_index` | Time waiting for Raft log replication from leader. High delays result in lower scheduling throughput | ms / Raft Index Wait | Timer |
| `nomad.heartbeat.active` | Number of active heartbeat timers. Each timer represents a Nomad client connection | # of heartbeat timers | Gauge |
| `nomad.heartbeat.invalidate` | The length of time it takes to invalidate a Nomad client due to failed heartbeats | ms / Heartbeat Invalidation | Timer |
| `nomad.rpc.query` | Number of RPC queries | RPC Queries / `interval` | Counter |
| `nomad.rpc.request` | Number of RPC requests being handled | RPC Requests / `interval` | Counter |
| `nomad.rpc.request_error` | Number of RPC requests being handled that result in an error | RPC Errors / `interval` | Counter |
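Counter metrics above are reported per `interval` (the sink's flush interval, or 10 seconds by default). A small sketch of normalizing such a sample into a per-second rate; the sample values are hypothetical:

```python
# Normalize a Counter sample (e.g. nomad.rpc.query) into a per-second rate.
# Counters are flushed and reset each aggregation interval, so dividing the
# flushed sum by the interval length yields the average rate over it.
def per_second(counter_sum: float, interval_s: float) -> float:
    return counter_sum / interval_s

# Hypothetical: 120 RPC queries flushed over the default 10-second interval
rate = per_second(120.0, 10.0)
print(rate)  # 12.0 queries/sec
```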
Dispatched instances of a parameterized job carry additional labels derived from the job name. For example:

| Label | Value |
|---|---|
| job | `myjob/dispatch-1312323423423` |
| parent_id | `myjob` |
| dispatch_id | `1312323423423` |
## Client Metrics

| Metric | Description | Unit | Type | Labels |
|---|---|---|---|---|
| `nomad.client.allocated.cpu` | Total amount of CPU shares the scheduler has allocated to tasks | MHz | Gauge | node_id, datacenter |
| `nomad.client.unallocated.cpu` | Total amount of CPU shares free for the scheduler to allocate to tasks | MHz | Gauge | node_id, datacenter |
| `nomad.client.allocated.memory` | Total amount of memory the scheduler has allocated to tasks | Megabytes | Gauge | node_id, datacenter |
| `nomad.client.unallocated.memory` | Total amount of memory free for the scheduler to allocate to tasks | Megabytes | Gauge | node_id, datacenter |
| `nomad.client.allocated.disk` | Total amount of disk space the scheduler has allocated to tasks | Megabytes | Gauge | node_id, datacenter |
| `nomad.client.unallocated.disk` | Total amount of disk space free for the scheduler to allocate to tasks | Megabytes | Gauge | node_id, datacenter |
| `nomad.client.allocated.network` | Total amount of bandwidth the scheduler has allocated to tasks on the given device | Megabits | Gauge | node_id, datacenter, device |
| `nomad.client.unallocated.network` | Total amount of bandwidth free for the scheduler to allocate to tasks on the given device | Megabits | Gauge | node_id, datacenter, device |
| `nomad.client.host.memory.total` | Total amount of physical memory on the node | Bytes | Gauge | node_id, datacenter |
| `nomad.client.host.memory.available` | Total amount of memory available to processes, including free and cached memory | Bytes | Gauge | node_id, datacenter |
| `nomad.client.host.memory.used` | Amount of memory used by processes | Bytes | Gauge | node_id, datacenter |
| `nomad.client.host.memory.free` | Amount of memory which is free | Bytes | Gauge | node_id, datacenter |
| `nomad.client.uptime` | Uptime of the host running the Nomad client | Seconds | Gauge | node_id, datacenter |
| `nomad.client.host.cpu.total` | Total CPU utilization | Percentage | Gauge | node_id, datacenter, cpu |
| `nomad.client.host.cpu.user` | CPU utilization in the user space | Percentage | Gauge | node_id, datacenter, cpu |
| `nomad.client.host.cpu.system` | CPU utilization in the system space | Percentage | Gauge | node_id, datacenter, cpu |
| `nomad.client.host.cpu.idle` | Idle time spent by the CPU | Percentage | Gauge | node_id, datacenter, cpu |
| `nomad.client.host.disk.size` | Total size of the device | Bytes | Gauge | node_id, datacenter, disk |
| `nomad.client.host.disk.used` | Amount of space which has been used | Bytes | Gauge | node_id, datacenter, disk |
| `nomad.client.host.disk.available` | Amount of space which is available | Bytes | Gauge | node_id, datacenter, disk |
| `nomad.client.host.disk.used_percent` | Percentage of disk space used | Percentage | Gauge | node_id, datacenter, disk |
| `nomad.client.host.disk.inodes_percent` | Disk space consumed by the inodes | Percentage | Gauge | node_id, datacenter, disk |
| `nomad.client.allocs.start` | Number of allocations starting | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.running` | Number of allocations starting to run | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.failed` | Number of allocations failing | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.restart` | Number of allocations restarting | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.complete` | Number of allocations completing | Integer | Counter | node_id, job, task_group |
| `nomad.client.allocs.destroy` | Number of allocations being destroyed | Integer | Counter | node_id, job, task_group |
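The `allocated.*`/`unallocated.*` gauge pairs partition each node's schedulable capacity, so a utilization percentage can be derived from them. A sketch with hypothetical sample values:

```python
# Derive node utilization from an allocated/unallocated gauge pair, e.g.
# nomad.client.allocated.cpu and nomad.client.unallocated.cpu.
# The sample values below are made up for illustration.
def utilization_pct(allocated: float, unallocated: float) -> float:
    total = allocated + unallocated  # full schedulable capacity
    return 0.0 if total == 0 else 100.0 * allocated / total

# Hypothetical node: 1600 MHz allocated, 800 MHz still free
print(round(utilization_pct(1600.0, 800.0), 1))  # 66.7
```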
## Host Metrics

| Metric | Description | Unit | Type |
|---|---|---|---|
| `nomad.client.allocated.cpu.<HostID>` | Total amount of CPU shares the scheduler has allocated to tasks | MHz | Gauge |
| `nomad.client.unallocated.cpu.<HostID>` | Total amount of CPU shares free for the scheduler to allocate to tasks | MHz | Gauge |
| `nomad.client.allocated.memory.<HostID>` | Total amount of memory the scheduler has allocated to tasks | Megabytes | Gauge |
| `nomad.client.unallocated.memory.<HostID>` | Total amount of memory free for the scheduler to allocate to tasks | Megabytes | Gauge |
| `nomad.client.allocated.disk.<HostID>` | Total amount of disk space the scheduler has allocated to tasks | Megabytes | Gauge |
| `nomad.client.unallocated.disk.<HostID>` | Total amount of disk space free for the scheduler to allocate to tasks | Megabytes | Gauge |
| `nomad.client.allocated.network.<Device-Name>.<HostID>` | Total amount of bandwidth the scheduler has allocated to tasks on the given device | Megabits | Gauge |
| `nomad.client.unallocated.network.<Device-Name>.<HostID>` | Total amount of bandwidth free for the scheduler to allocate to tasks on the given device | Megabits | Gauge |
| `nomad.client.host.memory.<HostID>.total` | Total amount of physical memory on the node | Bytes | Gauge |
| `nomad.client.host.memory.<HostID>.available` | Total amount of memory available to processes, including free and cached memory | Bytes | Gauge |
| `nomad.client.host.memory.<HostID>.used` | Amount of memory used by processes | Bytes | Gauge |
| `nomad.client.host.memory.<HostID>.free` | Amount of memory which is free | Bytes | Gauge |
| `nomad.client.uptime.<HostID>` | Uptime of the host running the Nomad client | Seconds | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.total` | Total CPU utilization | Percentage | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.user` | CPU utilization in the user space | Percentage | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.system` | CPU utilization in the system space | Percentage | Gauge |
| `nomad.client.host.cpu.<HostID>.<CPU-Core>.idle` | Idle time spent by the CPU | Percentage | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.size` | Total size of the device | Bytes | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.used` | Amount of space which has been used | Bytes | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.available` | Amount of space which is available | Bytes | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.used_percent` | Percentage of disk space used | Percentage | Gauge |
| `nomad.client.host.disk.<HostID>.<Device-Name>.inodes_percent` | Disk space consumed by the inodes | Percentage | Gauge |
## Allocation Metrics

| Metric | Description | Unit | Type |
|---|---|---|---|
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.rss` | Amount of RSS memory consumed by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.cache` | Amount of memory cached by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.swap` | Amount of memory swapped by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.max_usage` | Maximum amount of memory ever used by the task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_usage` | Amount of memory used by the kernel for this task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_max_usage` | Maximum amount of memory ever used by the kernel for this task | Bytes | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_percent` | Total CPU resources consumed by the task across all cores | Percentage | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.system` | Total CPU resources consumed by the task in the system space | Percentage | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.user` | Total CPU resources consumed by the task in the user space | Percentage | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.throttled_time` | Total time that the task was throttled | Nanoseconds | Gauge |
| `nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_ticks` | CPU ticks consumed by the process in the last collection interval | Integer | Gauge |
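These allocation metric names encode the job, task group, allocation ID, and task as dotted path components. A sketch of recovering those components (the sample metric name is hypothetical, and this assumes none of the embedded names contain a literal `.`, which the dotted encoding cannot distinguish):

```python
# Split a non-labeled allocation metric name of the form
# nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.<measurement...>
# back into its components.
def parse_alloc_metric(name: str) -> dict:
    parts = name.split(".")
    prefix, rest = parts[:3], parts[3:]
    if prefix != ["nomad", "client", "allocs"] or len(rest) < 5:
        raise ValueError(f"not an allocation metric: {name}")
    job, task_group, alloc_id, task = rest[:4]
    measurement = ".".join(rest[4:])  # e.g. "memory.rss" or "cpu.user"
    return {"job": job, "task_group": task_group,
            "alloc_id": alloc_id, "task": task, "metric": measurement}

# Hypothetical sample name
sample = "nomad.client.allocs.example.cache.5b0b02a1.redis.memory.rss"
print(parse_alloc_metric(sample))
```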
## Job Metrics

| Metric | Description | Unit | Type | Labels |
|---|---|---|---|---|
| `nomad.job_summary.queued` | Number of queued allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.complete` | Number of complete allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.failed` | Number of failed allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.running` | Number of running allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.starting` | Number of starting allocations for a job | Integer | Gauge | job, task_group |
| `nomad.job_summary.lost` | Number of lost allocations for a job | Integer | Gauge | job, task_group |
| Metric | Description | Unit | Type |
|---|---|---|---|
| `nomad.job_status.pending` | Number of pending jobs | Integer | Gauge |
| `nomad.job_status.running` | Number of running jobs | Integer | Gauge |
| `nomad.job_status.dead` | Number of dead jobs | Integer | Gauge |
## Metric Types

| Type | Description | Quantiles |
|---|---|---|
| Gauge | Gauges report an absolute number at the end of the aggregation interval | false |
| Counter | Counters are incremented and flushed at the end of the aggregation interval, then reset to zero | true |
| Timer | Timers measure the time to complete a task and include quantiles, means, standard deviation, etc., per interval | true |
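The difference between the three types can be sketched with a toy per-interval aggregator. This mirrors the semantics in the table above, not Nomad's actual implementation (which is built on a separate metrics library):

```python
import statistics

# Toy aggregator: gauges keep the last absolute value, counters accumulate
# and reset on flush, timers collect samples and report summary statistics.
class Interval:
    def __init__(self):
        self.gauges, self.counters, self.timers = {}, {}, {}

    def set_gauge(self, name, value):
        self.gauges[name] = value  # last absolute value wins

    def incr_counter(self, name, value=1.0):
        self.counters[name] = self.counters.get(name, 0.0) + value

    def add_sample(self, name, ms):
        self.timers.setdefault(name, []).append(ms)

    def flush(self):
        """Emit this interval's values; counters and timers then reset."""
        report = dict(self.gauges)  # gauges: absolute value at flush time
        for name, total in self.counters.items():
            report[name] = {"count": total}
        for name, samples in self.timers.items():
            report[name] = {
                "count": len(samples),
                "min": min(samples),
                "mean": statistics.mean(samples),
                "max": max(samples),
            }
        self.counters.clear()
        self.timers.clear()
        return report

# Usage with metric names drawn from the tables above
iv = Interval()
iv.set_gauge("nomad.runtime.num_goroutines", 56.0)
iv.incr_counter("nomad.raft.apply")
iv.add_sample("nomad.plan.submit", 1.8)
print(iv.flush())
```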