---
layout: "docs"
page_title: "Telemetry"
sidebar_current: "docs-agent-telemetry"
description: |-
  Learn about the telemetry data available in Nomad.
---

Telemetry

The Nomad agent collects various runtime metrics about the performance of different libraries and subsystems. These metrics are aggregated on a ten-second interval and retained for one minute.
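
To make the interval and retention figures concrete, the following is a minimal sketch of a buffer that sums values into ten-second buckets and keeps only the most recent minute. It is an illustration only, not Nomad's implementation: the `IntervalBuffer` class and its methods are hypothetical names, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque

INTERVAL_SECONDS = 10   # aggregation interval stated above
RETENTION_SECONDS = 60  # retention period stated above


class IntervalBuffer:
    """Hypothetical helper: one summed bucket per interval, oldest dropped first."""

    def __init__(self, interval=INTERVAL_SECONDS, retention=RETENTION_SECONDS):
        self.interval = interval
        # 60 s / 10 s means at most 6 buckets are retained.
        self.buckets = deque(maxlen=retention // interval)

    def add(self, timestamp, value):
        # Align the timestamp to the start of its interval.
        start = int(timestamp // self.interval) * self.interval
        if self.buckets and self.buckets[-1][0] == start:
            # Same interval as the newest bucket: accumulate.
            self.buckets[-1][1] += value
        else:
            # New interval: append a bucket; the deque evicts the oldest if full.
            self.buckets.append([start, value])

    def snapshot(self):
        # Return (interval_start, total) pairs, oldest first.
        return [(start, total) for start, total in self.buckets]
```

With these defaults, a value recorded at second 5 and another at second 9 land in the same bucket, and anything older than six intervals is no longer reported.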

To view this data, you must send a signal to the Nomad process: on Unix, this is `USR1`, while on Windows it is `BREAK`. Once Nomad receives the signal, it dumps the current telemetry information to the agent's stderr.
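
On Unix, the signal can be sent from a script as well as from a shell. The sketch below uses Python's standard `os.kill` and `signal.SIGUSR1`. The function name `request_telemetry_dump` is hypothetical, and finding the agent's PID is left to the caller.

```python
import os
import signal


def request_telemetry_dump(pid):
    """Send SIGUSR1 to the process with the given PID (Unix only).

    The dump itself is written by the agent to its own stderr, so this
    function returns nothing; look for the output wherever the agent's
    stderr is captured.
    """
    os.kill(pid, signal.SIGUSR1)
```

For example, `request_telemetry_dump(12345)` would signal a Nomad agent running as PID 12345 (a made-up value).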

This telemetry information can be used for debugging or otherwise getting a better view of what Nomad is doing.

Telemetry information can also be streamed to statsite and statsd by setting the appropriate configuration options.

To configure the telemetry output, see the agent configuration.

Below is sample output of a telemetry dump:

[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
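
Each line of the dump follows the same shape: a timestamp, a one-letter marker, a quoted metric name, and either a single value or a list of `Key: number` fields. The sketch below parses that shape into a dictionary. It is based only on the sample above, so other line formats are not handled, and the function name `parse_dump_line` and the `"Value"` key are choices made for this example.

```python
import re

# [timestamp][marker] 'metric.name': rest-of-line
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\[(?P<kind>[GCS])\] '(?P<name>[^']+)': (?P<rest>.*)$"
)
# Fields such as "Count: 3" or "Mean: 0.037"
FIELD_RE = re.compile(r"(\w+): (-?\d+(?:\.\d+)?)")


def parse_dump_line(line):
    """Parse one telemetry dump line; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rest = match.group("rest")
    if match.group("kind") == "G":
        # Lines marked G in the sample carry a single number.
        values = {"Value": float(rest)}
    else:
        # Lines marked C and S carry "Key: number" pairs.
        values = {key: float(number) for key, number in FIELD_RE.findall(rest)}
    return {
        "timestamp": match.group("timestamp"),
        "kind": match.group("kind"),
        "name": match.group("name"),
        "values": values,
    }
```

Feeding the agent's captured stderr through this function line by line gives records that can be filtered by name or marker.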

Key Metrics

When telemetry is streamed to statsite or statsd, `interval` is their flush interval. Otherwise, when metrics are retrieved with the signals described above, the interval can be assumed to be 10 seconds.

| Metric | Description | Unit | Type |
| --- | --- | --- | --- |
| `nomad.runtime.num_goroutines` | Number of goroutines and general load pressure indicator | # of goroutines | Gauge |
| `nomad.runtime.alloc_bytes` | Memory utilization | # of bytes | Gauge |
| `nomad.runtime.heap_objects` | Number of objects on the heap. General memory pressure indicator | # of heap objects | Gauge |
| `nomad.raft.apply` | Number of Raft transactions | Raft transactions / `interval` | Counter |
| `nomad.raft.replication.appendEntries` | Raft transaction commit time | ms / Raft Log Append | Timer |
| `nomad.raft.leader.lastContact` | Time since last contact to leader. General indicator of Raft latency | ms / Leader Contact | Timer |
| `nomad.broker.total_ready` | Number of evaluations ready to be processed | # of evaluations | Gauge |
| `nomad.broker.total_unacked` | Evaluations dispatched for processing but incomplete | # of evaluations | Gauge |
| `nomad.broker.total_blocked` | Evaluations that are blocked until an existing evaluation for the same job completes | # of evaluations | Gauge |
| `nomad.plan.queue_depth` | Number of scheduler Plans waiting to be evaluated | # of plans | Gauge |
| `nomad.plan.submit` | Time to submit a scheduler Plan. Higher values cause lower scheduling throughput | ms / Plan Submit | Timer |
| `nomad.plan.evaluate` | Time to validate a scheduler Plan. Higher values cause lower scheduling throughput. Similar to `nomad.plan.submit` but does not include RPC time or time in the Plan Queue | ms / Plan Evaluation | Timer |
| `nomad.worker.invoke_scheduler.` | Time to run the scheduler of the given type | ms / Scheduler Run | Timer |
| `nomad.worker.wait_for_index` | Time waiting for Raft log replication from leader. High delays result in lower scheduling throughput | ms / Raft Index Wait | Timer |
| `nomad.heartbeat.active` | Number of active heartbeat timers. Each timer represents a Nomad Client connection | # of heartbeat timers | Gauge |
| `nomad.heartbeat.invalidate` | The length of time it takes to invalidate a Nomad Client due to failed heartbeats | ms / Heartbeat Invalidation | Timer |
| `nomad.rpc.query` | Number of RPC queries | RPC Queries / `interval` | Counter |
| `nomad.rpc.request` | Number of RPC requests being handled | RPC Requests / `interval` | Counter |
| `nomad.rpc.request_error` | Number of RPC requests being handled that result in an error | RPC Errors / `interval` | Counter |
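
Because counters are reported per `interval`, comparing values from sinks with different flush intervals requires normalizing them first. A minimal sketch of that arithmetic follows. The function name `per_second_rate` is hypothetical, and the 10-second default applies only to the signal-based dump, as described above.

```python
def per_second_rate(count, interval_seconds=10.0):
    """Convert a per-interval counter value into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return count / interval_seconds
```

For instance, a `nomad.rpc.query` count of 2 over the default 10-second interval corresponds to 0.2 queries per second.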

Client Metrics

The Nomad client emits metrics related to the resource usage of the node itself and of the allocations and tasks running on it. By default, the collection interval is 1 second, but it can be changed by setting the `collection_interval` key in the telemetry configuration block.

Host Metrics

| Metric | Description | Unit | Type |
| --- | --- | --- | --- |
| `nomad.client.host.memory..total` | Total amount of physical memory on the node | Bytes | Gauge |
| `nomad.client.host.memory..available` | Total amount of memory available to processes, which includes free and cached memory | Bytes | Gauge |
| `nomad.client.host.memory..used` | Amount of memory used by processes | Bytes | Gauge |
| `nomad.client.host.memory..free` | Amount of memory which is free | Bytes | Gauge |
| `nomad.client.uptime.` | Uptime of the host running the Nomad client | Seconds | Gauge |
| `nomad.client.host.cpu...total` | Total CPU utilization | Percentage | Gauge |
| `nomad.client.host.cpu...user` | CPU utilization in the user space | Percentage | Gauge |
| `nomad.client.host.cpu...system` | CPU utilization in the system space | Percentage | Gauge |
| `nomad.client.host.cpu...idle` | Idle time spent by the CPU | Percentage | Gauge |
| `nomad.client.host.disk...size` | Total size of the device | Bytes | Gauge |
| `nomad.client.host.disk...used` | Amount of space which has been used | Bytes | Gauge |
| `nomad.client.host.disk...available` | Amount of space which is available | Bytes | Gauge |
| `nomad.client.host.disk...used_percent` | Percentage of disk space used | Percentage | Gauge |
| `nomad.client.host.disk...inodes_percent` | Disk space consumed by the inodes | Percentage | Gauge |

Allocation Metrics

| Metric | Description | Unit | Type |
| --- | --- | --- | --- |
| `nomad.client.allocs.....memory.rss` | Amount of RSS memory consumed by the task | Bytes | Gauge |
| `nomad.client.allocs.....memory.cache` | Amount of memory cached by the task | Bytes | Gauge |
| `nomad.client.allocs.....memory.swap` | Amount of memory swapped by the task | Bytes | Gauge |
| `nomad.client.allocs.....memory.max_usage` | Maximum amount of memory ever used by the task | Bytes | Gauge |
| `nomad.client.allocs.....memory.kernel_usage` | Amount of memory used by the kernel for this task | Bytes | Gauge |
| `nomad.client.allocs.....memory.kernel_max_usage` | Maximum amount of memory ever used by the kernel for this task | Bytes | Gauge |
| `nomad.client.allocs.....cpu.total_percent` | Total CPU resources consumed by the task across all cores | Percentage | Gauge |
| `nomad.client.allocs.....cpu.system` | Total CPU resources consumed by the task in the system space | Percentage | Gauge |
| `nomad.client.allocs..<TaskGroup>...cpu.user` | Total CPU resources consumed by the task in the user space | Percentage | Gauge |
| `nomad.client.allocs.....cpu.throttled_time` | Total time that the task was throttled | Nanoseconds | Gauge |
| `nomad.client.allocs.....cpu.total_ticks` | CPU ticks consumed by the process in the last collection interval | Integer | Gauge |

Metric Types

| Type | Description | Quantiles |
| --- | --- | --- |
| Gauge | Gauge types report an absolute number at the end of the aggregation interval | false |
| Counter | Counts are incremented and flushed at the end of the aggregation interval and then are reset to zero | true |
| Timer | Timers measure the time to complete a task and will include quantiles, means, standard deviation, etc. per interval | true |
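
The difference between the three types is easiest to see in how each behaves when an interval ends. The sketch below models that behaviour under simplifying assumptions: `IntervalAggregator` and its methods are hypothetical names, quantiles are omitted for brevity, and nothing here reflects how Nomad or its sinks are actually implemented.

```python
import statistics


class IntervalAggregator:
    """Toy model of gauge, counter, and timer semantics over one interval."""

    def __init__(self):
        self.gauges = {}    # name -> last value set
        self.counters = {}  # name -> running sum for the current interval
        self.timers = {}    # name -> list of samples for the current interval

    def set_gauge(self, name, value):
        # A gauge is an absolute number: the newest value replaces the old one.
        self.gauges[name] = value

    def incr_counter(self, name, amount=1.0):
        # A counter accumulates until the interval is flushed.
        self.counters[name] = self.counters.get(name, 0.0) + amount

    def add_timing(self, name, milliseconds):
        # A timer keeps every sample so that statistics can be derived.
        self.timers.setdefault(name, []).append(milliseconds)

    def flush(self):
        """Report the interval, then reset counters and timers."""
        report = {
            "gauges": dict(self.gauges),
            "counters": dict(self.counters),
            "timers": {
                name: {
                    "count": len(samples),
                    "min": min(samples),
                    "mean": statistics.mean(samples),
                    "max": max(samples),
                    # Sample standard deviation needs at least two samples.
                    "stddev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
                }
                for name, samples in self.timers.items()
            },
        }
        self.counters.clear()
        self.timers.clear()
        return report
```

After a flush, counters and timers start from empty, while a gauge keeps reporting its last absolute value until it is set again.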