---
layout: docs
page_title: Telemetry
description: Learn about the telemetry data available in Vault.
---

# Telemetry

The Vault server process collects various runtime metrics about the performance of different libraries and subsystems. These metrics are aggregated on a ten-second interval and are retained in memory for one minute. To monitor Vault and collect durable metrics, telemetry from Vault must be stored in metrics aggregation software.

To view the raw data, you must send a signal to the Vault process: on Unix-style operating systems, this is `USR1`, while on Windows it is `BREAK`. When the Vault process receives this signal, it dumps the current telemetry information to the process's `stderr`.
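As a rough illustration of the signal step on a Unix-style system, the sketch below sends `USR1` to a process by PID from Python. The function name `request_telemetry_dump` and its `send` parameter are hypothetical names introduced for this example only; how you obtain the Vault PID is left to your environment, and the Windows `BREAK` case is not covered.

```python
import os
import signal


def request_telemetry_dump(pid, send=os.kill):
    """Ask the process with the given PID to dump telemetry to its stderr.

    Assumes a Unix-style system where signal.SIGUSR1 is available.
    `send` defaults to os.kill and is only a parameter so the call can be
    exercised without signalling a real process.
    """
    send(pid, signal.SIGUSR1)


# Hypothetical usage: vault_pid would come from your process supervisor.
# request_telemetry_dump(vault_pid)
```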

This telemetry information can be used for debugging or otherwise getting a better view of what Vault is doing.

Telemetry information can also be streamed directly from Vault to a range of metrics aggregation solutions as described in the [telemetry stanza documentation][telemetry-stanza].

The following is an example telemetry dump snippet:

```text
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.expire.num_leases': 5100.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.num_goroutines': 39.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.sys_bytes': 222746880.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.malloc_count': 109189192.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.free_count': 108408240.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.heap_objects': 780953.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_runs': 232.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.alloc_bytes': 72954392.000
[2017-12-19 20:37:50 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_pause_ns': 150293024.000
[2017-12-19 20:37:50 +0000 UTC][S] 'vault.merkle.flushDirty': Count: 100 Min: 0.008 Mean: 0.027 Max: 0.183 Stddev: 0.024 Sum: 2.681 LastUpdated: 2017-12-19 20:37:59.848733035 +0000 UTC m=+10463.692105920
[2017-12-19 20:37:50 +0000 UTC][S] 'vault.merkle.saveCheckpoint': Count: 4 Min: 0.021 Mean: 0.054 Max: 0.110 Stddev: 0.039 Sum: 0.217 LastUpdated: 2017-12-19 20:37:57.048458148 +0000 UTC m=+10460.891835029
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.alloc_bytes': 73326136.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.sys_bytes': 222746880.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.malloc_count': 109195904.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.free_count': 108409568.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.heap_objects': 786342.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_pause_ns': 150293024.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.expire.num_leases': 5100.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.num_goroutines': 39.000
[2017-12-19 20:38:00 +0000 UTC][G] 'vault.7f320e57f9fe.runtime.total_gc_runs': 232.000
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.route.rollback.consul-': Count: 1 Sum: 0.013 LastUpdated: 2017-12-19 20:38:01.968471579 +0000 UTC m=+10465.811842067
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.consul-': Count: 1 Sum: 0.073 LastUpdated: 2017-12-19 20:38:01.968502743 +0000 UTC m=+10465.811873131
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.pki-': Count: 1 Sum: 0.070 LastUpdated: 2017-12-19 20:38:01.96867005 +0000 UTC m=+10465.812041936
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.route.rollback.auth-app-id-': Count: 1 Sum: 0.012 LastUpdated: 2017-12-19 20:38:01.969146401 +0000 UTC m=+10465.812516689
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.identity-': Count: 1 Sum: 0.063 LastUpdated: 2017-12-19 20:38:01.968029888 +0000 UTC m=+10465.811400276
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.rollback.attempt.database-': Count: 1 Sum: 0.066 LastUpdated: 2017-12-19 20:38:01.969394215 +0000 UTC m=+10465.812764603
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.barrier.get': Count: 16 Min: 0.010 Mean: 0.015 Max: 0.031 Stddev: 0.005 Sum: 0.237 LastUpdated: 2017-12-19 20:38:01.983268118 +0000 UTC m=+10465.826637008
[2017-12-19 20:38:00 +0000 UTC][S] 'vault.merkle.flushDirty': Count: 100 Min: 0.006 Mean: 0.024 Max: 0.098 Stddev: 0.019 Sum: 2.386 LastUpdated: 2017-12-19 20:38:09.848158309 +0000 UTC m=+10473.691527099
```

You'll note that log entries are prefixed with the metric type as follows:

- **[C]** is a counter. Counters are cumulative metrics that are incremented when some event occurs, and are reset at the end of reporting intervals. Vault retains counters and other metrics for one minute in-memory, so to see accurate and persistent counters over time an [aggregation solution][telemetry-stanza] must be configured.
- **[G]** is a gauge. Gauges provide measurements of current values.
- **[S]** is a summary. Summaries provide sample observations of values. Vault commonly uses summaries for measuring timing duration of discrete events in the reporting interval.
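
The dump format and type prefixes described above can be picked apart with a small parser. This is a minimal sketch based only on the sample lines shown in this page: the names `parse_dump_line`, `LINE_RE`, and `FIELD_RE` are hypothetical, and counter (`[C]`) lines are assumed to use the same `Key: value` layout as summaries, which the sample does not show.

```python
import re

# Matches: [timestamp][C|G|S] 'metric.name': <rest of line>
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<kind>[CGS])\] '(?P<name>[^']+)': (?P<rest>.*)$"
)
# Matches "Key: number" pairs such as "Count: 100" or "Mean: 0.027".
FIELD_RE = re.compile(r"(\w+): (-?\d+(?:\.\d+)?)")
KINDS = {"C": "counter", "G": "gauge", "S": "summary"}


def parse_dump_line(line):
    """Parse one telemetry dump line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Drop the trailing "LastUpdated: ..." part so its date is not read as a number.
    rest = match.group("rest").split(" LastUpdated: ")[0]
    try:
        # Gauge lines carry a single bare number.
        fields = {"Value": float(rest)}
    except ValueError:
        # Summary lines carry several "Key: number" pairs.
        fields = {key: float(value) for key, value in FIELD_RE.findall(rest)}
    return {
        "timestamp": match.group("ts"),
        "type": KINDS[match.group("kind")],
        "name": match.group("name"),
        "fields": fields,
    }
```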

The following sections describe available Vault metrics. The metrics interval can be assumed to be 10 seconds when manually triggering metrics output using the signals described above. Some high-cardinality gauges, like `vault.kv.secret.count`, are emitted every 10 minutes, or at an interval configured in the `telemetry` stanza.

Some Vault metrics come with additional [labels](#metric-labels) describing the measurement in more detail, such as the namespace in which an operation takes place, or the auth method used to create a token. In the in-memory telemetry, or other telemetry engines that do not support labels, this additional information is incorporated into the metric name. In the tables below, each metric name is followed by the list of labels it supports, in the order in which they appear if flattened.

## Audit Metrics

These metrics relate to auditing.

| Metric | Description | Unit | Type |
| :--- | :--- | :--- | :--- |
| `vault.audit.log_request` | Duration of time taken by all audit log requests across all audit log devices | ms | summary |
| `vault.audit.log_response` | Duration of time taken by audit log responses across all audit log devices | ms | summary |
| `vault.audit.log_request_failure` | Number of audit log request failures. **NOTE**: This is a particularly important metric. Any non-zero value here indicates that there was a failure to make an audit log request to any of the configured audit log devices; **when Vault cannot log to any of the configured audit log devices it ceases all user operations**, and you should begin troubleshooting the audit log devices immediately if this metric continually increases. | failures | counter |
| `vault.audit.log_response_failure` | Number of audit log response failures. **NOTE**: This is a particularly important metric. Any non-zero value here indicates that there was a failure to receive a response to a request made to one of the configured audit log devices; **when Vault cannot log to any of the configured audit log devices it ceases all user operations**, and you should begin troubleshooting the audit log devices immediately if this metric continually increases. | failures | counter |

**NOTE:** In addition, there are audit metrics for each enabled audit device represented as `vault.audit.<type>.log_request`. For example, if a file audit device is enabled, its metrics would be `vault.audit.file.log_request` and `vault.audit.file.log_response`.

## Core Metrics

These metrics represent operational aspects of the running Vault instance.

| Metric | Description | Unit | Type |
| :--- | :--- | :--- | :--- |
| `vault.barrier.delete` | Duration of time taken by DELETE operations at the barrier | ms | summary |
| `vault.barrier.get` | Duration of time taken by GET operations at the barrier | ms | summary |
| `vault.barrier.put` | Duration of time taken by PUT operations at the barrier | ms | summary |
| `vault.barrier.list` | Duration of time taken by LIST operations at the barrier | ms | summary |
| `vault.cache.hit` | Number of times a value was retrieved from the LRU cache. | cache hit | counter |
| `vault.cache.miss` | Number of times a value was not in the LRU cache. This results in a read from the configured storage. | cache miss | counter |
| `vault.cache.write` | Number of times a value was written to the LRU cache. | cache write | counter |
| `vault.cache.delete` | Number of times a value was deleted from the LRU cache. This does not count cache expirations. | cache delete | counter |
| `vault.core.active` | Has value 1 when the Vault node is active, and 0 when the node is in standby. | bool | gauge |
| `vault.core.activity.fragment_size` | Number of entities or tokens (depending on the "type" label) observed by the local node. | tokens | counter |
| `vault.core.activity.segment_write` | Duration of time taken writing activity log segments to storage. | ms | summary |
| `vault.core.check_token` | Duration of time taken by token checks handled by Vault core | ms | summary |
| `vault.core.fetch_acl_and_token` | Duration of time taken by ACL and corresponding token entry fetches handled by Vault core | ms | summary |
| `vault.core.handle_request` | Duration of time taken by requests handled by Vault core | ms | summary |
| `vault.core.handle_login_request` | Duration of time taken by login requests handled by Vault core | ms | summary |
| `vault.core.in_flight_requests` | Number of in-flight requests. | requests | gauge |
| `vault.core.leadership_setup_failed` | Duration of time taken by cluster leadership setup failures which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. | ms | summary |
| `vault.core.leadership_lost` | Duration of time taken by cluster leadership losses which have occurred in a highly available Vault cluster. This should be monitored and alerted on for overall cluster leadership status. | ms | summary |
| `vault.core.license.expiration_time_epoch` | Time as epoch (seconds since Jan 1 1970) at which license will expire. | seconds | gauge |
| `vault.core.mount_table.num_entries` | Number of mounts in a particular mount table. This metric is labeled by table type (auth or logical) and whether or not the table is replicated (local or not). | objects | gauge |
| `vault.core.mount_table.size` | Size of a particular mount table. This metric is labeled by table type (auth or logical) and whether or not the table is replicated (local or not). | objects | gauge |
| `vault.core.post_unseal` | Duration of time taken by post-unseal operations handled by Vault core | ms | summary |
| `vault.core.pre_seal` | Duration of time taken by pre-seal operations | ms | summary |
| `vault.core.seal-with-request` | Duration of time taken by requested seal operations | ms | summary |
| `vault.core.seal` | Duration of time taken by seal operations | ms | summary |
| `vault.core.seal-internal` | Duration of time taken by internal seal operations | ms | summary |
| `vault.core.step_down` | Duration of time taken by cluster leadership step downs. This should be monitored and alerted on for overall cluster leadership status. | ms | summary |
| `vault.core.unseal` | Duration of time taken by unseal operations | ms | summary |
| `vault.core.unsealed` | Has value 1 when Vault is unsealed, and 0 when Vault is sealed. | bool | gauge |
| `vault.metrics.collection` (cluster,gauge) | Time taken to collect usage gauges, labeled by gauge type. | | summary |
| `vault.metrics.collection.interval` (cluster,gauge) | Current value of usage gauge collection interval. | | summary |
| `vault.metrics.collection.error` (cluster,gauge) | Errors while collecting usage gauges, labeled by gauge type. | | counter |
| `vault.rollback.attempt.<mountpoint>` | Time taken to perform a rollback operation on the given mount point. The mount point name has its forward slashes `/` replaced by `-`. For example, a rollback operation on the `auth/token` backend would be reported as `vault.rollback.attempt.auth-token-`. | ms | summary |
| `vault.route.create.<mountpoint>` | Time taken to dispatch a create operation to a backend, and for that backend to process it. The mount point name has its forward slashes `/` replaced by `-`. For example, a create operation to `ns1/secret/` would have corresponding metric `vault.route.create.ns1-secret-`. The number of samples of this metric, and the corresponding ones for other operations below, indicates how many operations were performed per mount point. | ms | summary |
| `vault.route.delete.<mountpoint>` | Time taken to dispatch a delete operation to a backend, and for that backend to process it. | ms | summary |
| `vault.route.list.<mountpoint>` | Time taken to dispatch a list operation to a backend, and for that backend to process it. | ms | summary |
| `vault.route.read.<mountpoint>` | Time taken to dispatch a read operation to a backend, and for that backend to process it. | ms | summary |
| `vault.route.rollback.<mountpoint>` | Time taken to dispatch a rollback operation to a backend, and for that backend to process it. Rollback operations are automatically scheduled to clean up partial errors. | ms | summary |
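
The `<mountpoint>` naming rule in the table above (forward slashes replaced by `-`) can be sketched as a small helper. The function name `mount_metric_name` is hypothetical, and the trailing-slash normalization is an assumption inferred from the `auth/token` example, whose metric name ends in `-`.

```python
def mount_metric_name(prefix, mount_point):
    """Build a per-mount metric name such as vault.route.create.ns1-secret-.

    Assumption: mount points are treated as ending in "/", so a path given
    without a trailing slash gets one before the slashes are replaced.
    """
    path = mount_point if mount_point.endswith("/") else mount_point + "/"
    return prefix + "." + path.replace("/", "-")


# Examples taken from the table above.
print(mount_metric_name("vault.rollback.attempt", "auth/token"))
print(mount_metric_name("vault.route.create", "ns1/secret/"))
```

A helper like this can be useful when building dashboard queries for a known list of mounts on a telemetry engine that does not support labels.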

## Runtime Metrics

These metrics collect information from Vault's Go runtime, such as memory usage information.

| Metric | Description | Unit | Type |
| :--- | :--- | :--- | :--- |
| `vault.runtime.alloc_bytes` | Number of bytes allocated by the Vault process. This could burst from time to time, but should return to a steady state value. | bytes | gauge |
| `vault.runtime.free_count` | Number of freed objects | objects | gauge |
| `vault.runtime.heap_objects` | Number of objects on the heap. This is a good general memory pressure indicator worth establishing a baseline and thresholds for alerting. | objects | gauge |
| `vault.runtime.malloc_count` | Cumulative count of allocated heap objects | objects | gauge |
| `vault.runtime.num_goroutines` | Number of goroutines. This serves as a general system load indicator worth establishing a baseline and thresholds for alerting. | goroutines | gauge |
| `vault.runtime.sys_bytes` | Number of bytes allocated to Vault. This includes what is being used by Vault's heap and what has been reclaimed but not given back to the operating system. | bytes | gauge |
| `vault.runtime.total_gc_pause_ns` | The total garbage collector pause time since Vault was last started | ns | gauge |
| `vault.runtime.gc_pause_ns` | Total duration of the last garbage collection run | ns | summary |
| `vault.runtime.total_gc_runs` | Total number of garbage collection runs since Vault was last started | operations | gauge |

## Policy Metrics

These metrics report measurements of the time spent performing policy operations.

| Metric | Description | Unit | Type |
| :--- | :--- | :--- | :--- |
| `vault.policy.get_policy` | Time taken to get a policy | ms | summary |
| `vault.policy.list_policies` | Time taken to list policies | ms | summary |
| `vault.policy.delete_policy` | Time taken to delete a policy | ms | summary |
| `vault.policy.set_policy` | Time taken to set a policy | ms | summary |

## Token, Identity, and Lease Metrics

These metrics cover measurement of token, identity, and lease operations, and counts of the number of such objects managed by Vault.

| Metric | Description | Unit | Type |
| :--- | :--- | :--- | :--- |
| `vault.expire.fetch-lease-times` | Time taken to fetch lease times | ms | summary |
| `vault.expire.fetch-lease-times-by-token` | Time taken to fetch lease times by token | ms | summary |
| `vault.expire.num_leases` | Number of all leases which are eligible for eventual expiry | leases | gauge |
| `vault.expire.num_irrevocable_leases` | Number of leases that cannot be revoked automatically | leases | gauge |
| `vault.expire.leases.by_expiration` (cluster,gauge,expiring,namespace) | Number of leases set to expire, grouped by a time interval. This time interval and total number of time intervals are configurable via `lease_metrics_epsilon` and `num_lease_metrics_buckets` in the telemetry stanza of a Vault server configuration. The default values for these are `1hr` and `168` respectively, so the metric will report the number of leases that will expire each hour from the current time to a week from the current time. One can additionally group lease expiration by namespace by setting `add_lease_metrics_namespace_labels` to `true` in the config file (default is `false`). | leases | gauge |
| `vault.expire.lease_expiration` | Count of lease expirations | leases | counter |
| `vault.expire.job_manager.total_jobs` | Total pending revocation jobs | leases | summary |
| `vault.expire.job_manager.queue_length` | Total pending revocation jobs by auth method | leases | summary |
| `vault.expire.lease_expiration.time_in_queue` | Time taken for lease to get to the front of the revoke queue | ms | summary |
| `vault.expire.lease_expiration.error` | Count of lease expiration errors | errors | counter |
| `vault.expire.revoke` | Time taken to revoke a token | ms | summary |
| `vault.expire.revoke-force` | Time taken to forcibly revoke a token | ms | summary |
| `vault.expire.revoke-prefix` | Time taken to revoke tokens on a prefix | ms | summary |
| `vault.expire.revoke-by-token` | Time taken to revoke all secrets issued with a given token | ms | summary |
| `vault.expire.renew` | Time taken to renew a lease | ms | summary |
| `vault.expire.renew-token` | Time taken to renew a token which does not need to invoke a logical backend | ms | summary |
| `vault.expire.register` | Time taken for register operations | ms | summary |
| `vault.expire.register-auth` | Time taken for register authentication operations which create lease entries without lease ID | ms | summary |
| `vault.identity.num_entities` | Number of identity entities stored in Vault | entities | gauge |
| `vault.identity.entity.active.monthly` (cluster, namespace) | Number of distinct entities that created a token during the past month, per namespace. Only available if client count is enabled. Reported at the start of each month. | entities | gauge |
| `vault.identity.entity.active.partial_month` (cluster) | Total number of distinct entities that created a token during the current month. Only available if client count is enabled. Reported periodically within each month. | entities | gauge |
| `vault.identity.entity.active.reporting_period` (cluster, namespace) | Number of distinct entities that created a token in the past N months, as defined by the client count default reporting period. Only available if client count is enabled. Reported at the start of each month. | entities | gauge |
| `vault.identity.entity.alias.count` (cluster, namespace, auth_method, mount_point) | Number of identity entity aliases stored in Vault, grouped by the auth mount that created them. This gauge is computed every 10 minutes. | aliases | gauge |
| `vault.identity.entity.count` (cluster, namespace) | Number of identity entities stored in Vault, grouped by namespace. | entities | gauge |
| `vault.identity.entity.creation` (cluster, namespace, auth_method, mount_point) | Number of identity entities created, grouped by the auth mount that created them. | entities | counter |
| `vault.identity.upsert_entity_txn` | Time taken to insert a new or modified entity into the in-memory database, and persist it to storage. | ms | summary |
| `vault.identity.upsert_group_txn` | Time taken to insert a new or modified group into the in-memory database, and persist it to storage. This operation is performed on group membership changes. | ms | summary |
| `vault.token.count` (cluster, namespace) | Number of service tokens available for use; counts all un-expired and un-revoked tokens in Vault's token store. This measurement is performed every 10 minutes. | tokens | gauge |
| `vault.token.count.by_auth` (cluster, namespace, auth_method) | Number of service tokens that were created by a particular auth method. | tokens | gauge |
| `vault.token.count.by_policy` (cluster, namespace, policy) | Number of service tokens that have a particular policy attached. If a token has more than one policy, it is counted in each policy gauge. | tokens | gauge |
| `vault.token.count.by_ttl` (cluster, namespace, creation_ttl) | Number of service tokens, grouped by the TTL range they were assigned at creation. | tokens | gauge |
| `vault.token.create` | The time taken to create a token | ms | summary |
| `vault.token.create_root` | Number of created root tokens. Does not decrease on revocation. | tokens | counter |
| `vault.token.createAccessor` | The time taken to create a token accessor | ms | summary |
| `vault.token.creation` (cluster, namespace, auth_method, mount_point, creation_ttl, token_type) | Number of service or batch tokens created. | tokens | counter |
| `vault.token.lookup` | The time taken to look up a token | ms | summary |
| `vault.token.revoke` | Time taken to revoke a token | ms | summary |
| `vault.token.revoke-tree` | Time taken to revoke a token tree | ms | summary |
| `vault.token.store` | Time taken to store an updated token entry without writing to the secondary index | ms | summary |
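
The default bucketing described for `vault.expire.leases.by_expiration` above (an interval of one hour and 168 buckets, covering one week) can be checked with a little arithmetic. This is a hypothetical sketch: `covered_window` and `bucket_index` are names made up for illustration, and the rule for assigning a lease to a bucket, including what happens to leases outside the window, is an assumption rather than Vault's documented behavior.

```python
from datetime import timedelta


def covered_window(epsilon, num_buckets):
    """Total time span covered by num_buckets intervals of length epsilon."""
    return epsilon * num_buckets


def bucket_index(time_until_expiry, epsilon, num_buckets):
    """Index of the interval a lease would fall into, or None if outside the window.

    Assumption: bucket 0 starts at the current time and each bucket is
    exactly one epsilon wide.
    """
    if time_until_expiry < timedelta(0):
        return None
    index = time_until_expiry // epsilon
    return index if index < num_buckets else None


# Defaults from the table above: one-hour intervals, 168 buckets.
print(covered_window(timedelta(hours=1), 168))
```

Under these assumptions, a lease expiring in two and a half hours would land in bucket 2.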

## Resource Quota Metrics

These metrics relate to rate limit and lease count quotas. Each metric comes with a label "name" identifying the specific quota.

| Metric | Description | Unit | Type |
| :--- | :--- | :--- | :--- |
| `vault.quota.rate_limit.violation` | Total number of rate limit quota violations | quota | counter |
| `vault.quota.lease_count.violation` | Total number of lease count quota violations | quota | counter |
| `vault.quota.lease_count.max` | Total maximum number of leases allowed by the lease count quota | lease | gauge |
| `vault.quota.lease_count.counter` | Total current number of leases generated by the lease count quota | lease | gauge |

## Merkle Tree and Write Ahead Log Metrics

These metrics relate to internal operations on Merkle Trees and Write Ahead Logs (WAL).

| Metric | Description | Unit | Type |
| :--- | :--- | :--- | :--- |
| `vault.merkle.flushDirty` | Time taken to flush any dirty pages to cold storage | ms | summary |
| `vault.merkle.flushDirty.num_pages` | Number of pages flushed | pages | gauge |
| `vault.merkle.saveCheckpoint` | Time taken to save the checkpoint | ms | summary |
| `vault.merkle.saveCheckpoint.num_dirty` | Number of dirty pages at checkpoint | pages | gauge |
| `vault.wal.deleteWALs` | Time taken to delete a Write Ahead Log (WAL) | ms | summary |
| `vault.wal.gc.deleted` | Number of Write Ahead Logs (WAL) deleted during each garbage collection run | WAL | gauge |
| `vault.wal.gc.total` | Total number of Write Ahead Logs (WAL) on disk | WAL | gauge |
| `vault.wal.loadWAL` | Time taken to load a Write Ahead Log (WAL) | ms | summary |
| `vault.wal.persistWALs` | Time taken to persist a Write Ahead Log (WAL) | ms | summary |
| `vault.wal.flushReady` | Time taken to flush a ready Write Ahead Log (WAL) to storage | ms | summary |
| `vault.wal.flushReady.queue_len` | Size of the write queue in the WAL system | WAL | summary |

## HA Metrics

These metrics are emitted on standbys when talking to the active node, and in some cases by performance standbys as well.

| Metric | Description | Unit | Type |
| :--- | :--- | :--- | :--- |
| `vault.ha.rpc.client.forward` | Time taken to forward a request from a standby to the active node | ms | summary |
| `vault.ha.rpc.client.forward.errors` | Number of standby request forwarding failures | errors | counter |
|
||
|
||
## Replication Metrics
|
||
|
||
These metrics relate to [Vault Enterprise Replication](/docs/enterprise/replication). The following metrics are not available in telemetry unless replication is in an unhealthy state: `replication.fetchRemoteKeys`, `replication.merkleDiff`, and `replication.merkleSync`.
|
||
|
||
| Metric | Description | Unit | Type |
|
||
| :------------------------------------------------------------ | :----------------------------------------------------------------------------------------------------------------------------------------- | :-------------- | :------ |
|
||
| `vault.core.replication.performance.primary` | Set to 1 if this is a performance primary, 0 if not | boolean | gauge |
|
||
| `vault.core.replication.performance.secondary` | Set to 1 if this is a performance secondary, 0 if not | boolean | gauge |
|
||
| `vault.core.replication.dr.primary` | Set to 1 if this is a DR primary, 0 if not | boolean | gauge |
|
||
| `vault.core.replication.dr.secondary` | Set to 1 if this is a DR secondary, 0 if not | boolean | gauge |
|
||
| `vault.core.performance_standby` | Set to 1 if this is a performance standby, 0 if not | boolean | gauge |
|
||
| `vault.logshipper.streamWALs.missing_guard` | Number of incidences where the starting Merkle Tree index used to begin streaming WAL entries is not matched/found | missing guards | counter |
|
||
| `vault.logshipper.streamWALs.guard_found` | Number of incidences where the starting Merkle Tree index used to begin streaming WAL entries is matched/found | found guards | counter |
|
||
| `vault.logshipper.streamWALs.scanned_entries` | Number of entries scanned in the buffer before the right one was found. | scanned entries | summary |
| `vault.logshipper.buffer.length` | Current length of the log shipper buffer | buffer entries | gauge |
| `vault.logshipper.buffer.size` | Current size in bytes of the log shipper buffer | bytes | gauge |
| `vault.logshipper.buffer.max_length` | Maximum length of the log shipper buffer | buffer entries | gauge |
| `vault.logshipper.buffer.max_size` | Maximum size in bytes of the log shipper buffer | bytes | gauge |
| `vault.replication.fetchRemoteKeys` | Time taken to fetch keys from a remote cluster participating in replication prior to Merkle Tree based delta generation | ms | summary |
| `vault.replication.merkleDiff` | Time taken to perform a Merkle Tree based delta generation between the clusters participating in replication | ms | summary |
| `vault.replication.merkleSync` | Time taken to perform a Merkle Tree based synchronization using the last delta generated between the clusters participating in replication | ms | summary |
| `vault.replication.merkle.commit_index` | The last committed index in the Merkle Tree. | sequence number | gauge |
| `vault.replication.wal.last_wal` | The index of the last WAL | sequence number | gauge |
| `vault.replication.wal.last_dr_wal` | The index of the last DR WAL | sequence number | gauge |
| `vault.replication.wal.last_performance_wal` | The index of the last Performance WAL | sequence number | gauge |
| `vault.replication.fsm.last_remote_wal` | The index of the last remote WAL | sequence number | gauge |
| `vault.replication.wal.gc` | Time taken to complete one run of the WAL garbage collection process | ms | summary |
| `vault.replication.rpc.server.auth_request` | Duration of time taken by auth request | ms | summary |
| `vault.replication.rpc.server.bootstrap_request` | Duration of time taken by bootstrap request | ms | summary |
| `vault.replication.rpc.server.conflicting_pages_request` | Duration of time taken by conflicting pages request | ms | summary |
| `vault.replication.rpc.server.echo` | Duration of time taken by echo | ms | summary |
| `vault.replication.rpc.server.forwarding_request` | Duration of time taken by forwarding request | ms | summary |
| `vault.replication.rpc.server.guard_hash_request` | Duration of time taken by guard hash request | ms | summary |
| `vault.replication.rpc.server.persist_alias_request` | Duration of time taken by persist alias request | ms | summary |
| `vault.replication.rpc.server.persist_persona_request` | Duration of time taken by persist persona request | ms | summary |
| `vault.replication.rpc.server.stream_wals_request` | Duration of time taken by stream wals request | ms | summary |
| `vault.replication.rpc.server.sub_page_hashes_request` | Duration of time taken by sub page hashes request | ms | summary |
| `vault.replication.rpc.server.sync_counter_request` | Duration of time taken by sync counter request | ms | summary |
| `vault.replication.rpc.server.upsert_group_request` | Duration of time taken by upsert group request | ms | summary |
| `vault.replication.rpc.client.conflicting_pages` | Duration of time taken by client conflicting pages request | ms | summary |
| `vault.replication.rpc.client.fetch_keys` | Duration of time taken by client fetch keys request | ms | summary |
| `vault.replication.rpc.client.forward` | Duration of time taken by client forward request | ms | summary |
| `vault.replication.rpc.client.guard_hash` | Duration of time taken by client guard hash request | ms | summary |
| `vault.replication.rpc.client.persist_alias` | Duration of time taken by client persist alias request | ms | summary |
| `vault.replication.rpc.client.register_auth` | Duration of time taken by client register auth request | ms | summary |
| `vault.replication.rpc.client.register_lease` | Duration of time taken by client register lease request | ms | summary |
| `vault.replication.rpc.client.stream_wals` | Duration of time taken by client stream wals request | ms | summary |
| `vault.replication.rpc.client.sub_page_hashes` | Duration of time taken by client sub page hashes request | ms | summary |
| `vault.replication.rpc.client.sync_counter` | Duration of time taken by client sync counter request | ms | summary |
| `vault.replication.rpc.client.upsert_group` | Duration of time taken by client upsert group request | ms | summary |
| `vault.replication.rpc.client.wrap_in_cubbyhole` | Duration of time taken by client wrap in cubbyhole request | ms | summary |
| `vault.replication.rpc.dr.server.echo` | Duration of time taken by DR echo request | ms | summary |
| `vault.replication.rpc.dr.server.fetch_keys_request` | Duration of time taken by DR fetch keys request | ms | summary |
| `vault.replication.rpc.standby.server.echo` | Duration of time taken by standby echo request | ms | summary |
| `vault.replication.rpc.standby.server.register_auth_request` | Duration of time taken by standby register auth request | ms | summary |
| `vault.replication.rpc.standby.server.register_lease_request` | Duration of time taken by standby register lease request | ms | summary |
| `vault.replication.rpc.standby.server.wrap_token_request` | Duration of time taken by standby wrap token request | ms | summary |

## Secrets Engines Metrics

These metrics relate to the supported [secrets engines][secrets-engines].

| Metric                                                                                       | Description                                                                                                                                                                | Unit   | Type    |
| :------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :----- | :------ |
| `database.Initialize`                                                                        | Time taken to initialize a database secret engine across all database secrets engines                                                                                      | ms     | summary |
| `database.<name>.Initialize`                                                                 | Time taken to initialize a database secret engine for the named database secrets engine `<name>`, for example: `database.postgresql-prod.Initialize`                       | ms     | summary |
| `database.Initialize.error`                                                                  | Number of database secrets engine initialization operation errors across all database secrets engines                                                                      | errors | counter |
| `database.<name>.Initialize.error`                                                           | Number of database secrets engine initialization operation errors for the named database secrets engine `<name>`, for example: `database.postgresql-prod.Initialize.error` | errors | counter |
| `database.Close`                                                                             | Time taken to close a database secret engine across all database secrets engines                                                                                           | ms     | summary |
| `database.<name>.Close`                                                                      | Time taken to close a database secret engine for the named database secrets engine `<name>`, for example: `database.postgresql-prod.Close`                                 | ms     | summary |
| `database.Close.error`                                                                       | Number of database secrets engine close operation errors across all database secrets engines                                                                               | errors | counter |
| `database.<name>.Close.error`                                                                | Number of database secrets engine close operation errors for the named database secrets engine `<name>`, for example: `database.postgresql-prod.Close.error`               | errors | counter |
| `database.CreateUser`                                                                        | Time taken to create a user across all database secrets engines                                                                                                            | ms     | summary |
| `database.<name>.CreateUser`                                                                 | Time taken to create a user for the named database secrets engine `<name>`                                                                                                 | ms     | summary |
| `database.CreateUser.error`                                                                  | Number of user creation operation errors across all database secrets engines                                                                                               | errors | counter |
| `database.<name>.CreateUser.error`                                                           | Number of user creation operation errors for the named database secrets engine `<name>`, for example: `database.postgresql-prod.CreateUser.error`                          | errors | counter |
| `database.RenewUser`                                                                         | Time taken to renew a user across all database secrets engines                                                                                                             | ms     | summary |
| `database.<name>.RenewUser`                                                                  | Time taken to renew a user for the named database secrets engine `<name>`, for example: `database.postgresql-prod.RenewUser`                                               | ms     | summary |
| `database.RenewUser.error`                                                                   | Number of user renewal operation errors across all database secrets engines                                                                                                | errors | counter |
| `database.<name>.RenewUser.error`                                                            | Number of user renewal operation errors for the named database secrets engine `<name>`, for example: `database.postgresql-prod.RenewUser.error`                            | errors | counter |
| `database.RevokeUser`                                                                        | Time taken to revoke a user across all database secrets engines                                                                                                            | ms     | summary |
| `database.<name>.RevokeUser`                                                                 | Time taken to revoke a user for the named database secrets engine `<name>`, for example: `database.postgresql-prod.RevokeUser`                                             | ms     | summary |
| `database.RevokeUser.error`                                                                  | Number of user revocation operation errors across all database secrets engines                                                                                             | errors | counter |
| `database.<name>.RevokeUser.error`                                                           | Number of user revocation operation errors for the named database secrets engine `<name>`, for example: `database.postgresql-prod.RevokeUser.error`                        | errors | counter |
| `secrets.pki.tidy.cert_store_current_entry`                                                  | The index of the current entry in the certificate store being verified by the tidy operation                                                                               | entry index | gauge   |
| `secrets.pki.tidy.cert_store_deleted_count`                                                  | Number of entries deleted from the certificate store                                                                                                                       | entry       | counter |
| `secrets.pki.tidy.cert_store_total_entries`                                                  | Number of entries in the certificate store to verify during the tidy operation                                                                                             | entry       | gauge   |
| `secrets.pki.tidy.duration`                                                                  | Duration of time taken by the PKI tidy operation                                                                                                                           | ms          | summary |
| `secrets.pki.tidy.failure`                                                                   | Number of times the PKI tidy operation has not completed due to errors                                                                                                     | operations  | counter |
| `secrets.pki.tidy.revoked_cert_current_entry`                                                | The index of the current revoked certificate entry in the certificate store being verified by the tidy operation                                                           | entry index | gauge   |
| `secrets.pki.tidy.revoked_cert_deleted_count`                                                | Number of entries deleted from the certificate store for revoked certificates                                                                                              | entry       | counter |
| `secrets.pki.tidy.revoked_cert_total_entries`                                                | Number of entries in the certificate store for revoked certificates to verify during the tidy operation                                                                    | entry       | gauge   |
| `secrets.pki.tidy.start_time_epoch`                                                          | Start time (as seconds since Jan 1 1970) when the PKI tidy operation is active, 0 otherwise                                                                                | seconds     | gauge   |
| `secrets.pki.tidy.success`                                                                   | Number of times the PKI tidy operation has completed successfully                                                                                                          | operations  | counter |
| `vault.secret.kv.count` (cluster, namespace, mount_point)                                    | Number of entries in each key-value secret engine.                                                                                                                         | paths  | gauge   |
| `vault.secret.lease.creation` (cluster, namespace, secret_engine, mount_point, creation_ttl) | Counts the number of leases created by secret engines.                                                                                                                     | leases | counter |

## Storage Backend Metrics

These metrics relate to the supported [storage backends][storage-backends].

| Metric                      | Description                                                                                                            | Unit | Type    |
| :-------------------------- | :--------------------------------------------------------------------------------------------------------------------- | :--- | :------ |
| `vault.azure.put`           | Duration of a PUT operation against the [Azure storage backend][azure-storage-backend]                                 | ms   | summary |
| `vault.azure.get`           | Duration of a GET operation against the [Azure storage backend][azure-storage-backend]                                 | ms   | summary |
| `vault.azure.delete`        | Duration of a DELETE operation against the [Azure storage backend][azure-storage-backend]                              | ms   | summary |
| `vault.azure.list`          | Duration of a LIST operation against the [Azure storage backend][azure-storage-backend]                                | ms   | summary |
| `vault.cassandra.put`       | Duration of a PUT operation against the [Cassandra storage backend][cassandra-storage-backend]                         | ms   | summary |
| `vault.cassandra.get`       | Duration of a GET operation against the [Cassandra storage backend][cassandra-storage-backend]                         | ms   | summary |
| `vault.cassandra.delete`    | Duration of a DELETE operation against the [Cassandra storage backend][cassandra-storage-backend]                      | ms   | summary |
| `vault.cassandra.list`      | Duration of a LIST operation against the [Cassandra storage backend][cassandra-storage-backend]                        | ms   | summary |
| `vault.cockroachdb.put`     | Duration of a PUT operation against the [CockroachDB storage backend][cockroachdb-storage-backend]                     | ms   | summary |
| `vault.cockroachdb.get`     | Duration of a GET operation against the [CockroachDB storage backend][cockroachdb-storage-backend]                     | ms   | summary |
| `vault.cockroachdb.delete`  | Duration of a DELETE operation against the [CockroachDB storage backend][cockroachdb-storage-backend]                  | ms   | summary |
| `vault.cockroachdb.list`    | Duration of a LIST operation against the [CockroachDB storage backend][cockroachdb-storage-backend]                    | ms   | summary |
| `vault.consul.put`          | Duration of a PUT operation against the [Consul storage backend][consul-storage-backend]                               | ms   | summary |
| `vault.consul.transaction`  | Duration of a Txn operation against the [Consul storage backend][consul-storage-backend]                               | ms   | summary |
| `vault.consul.get`          | Duration of a GET operation against the [Consul storage backend][consul-storage-backend]                               | ms   | summary |
| `vault.consul.delete`       | Duration of a DELETE operation against the [Consul storage backend][consul-storage-backend]                            | ms   | summary |
| `vault.consul.list`         | Duration of a LIST operation against the [Consul storage backend][consul-storage-backend]                              | ms   | summary |
| `vault.couchdb.put`         | Duration of a PUT operation against the [CouchDB storage backend][couchdb-storage-backend]                             | ms   | summary |
| `vault.couchdb.get`         | Duration of a GET operation against the [CouchDB storage backend][couchdb-storage-backend]                             | ms   | summary |
| `vault.couchdb.delete`      | Duration of a DELETE operation against the [CouchDB storage backend][couchdb-storage-backend]                          | ms   | summary |
| `vault.couchdb.list`        | Duration of a LIST operation against the [CouchDB storage backend][couchdb-storage-backend]                            | ms   | summary |
| `vault.dynamodb.put`        | Duration of a PUT operation against the [DynamoDB storage backend][dynamodb-storage-backend]                           | ms   | summary |
| `vault.dynamodb.get`        | Duration of a GET operation against the [DynamoDB storage backend][dynamodb-storage-backend]                           | ms   | summary |
| `vault.dynamodb.delete`     | Duration of a DELETE operation against the [DynamoDB storage backend][dynamodb-storage-backend]                        | ms   | summary |
| `vault.dynamodb.list`       | Duration of a LIST operation against the [DynamoDB storage backend][dynamodb-storage-backend]                          | ms   | summary |
| `vault.etcd.put`            | Duration of a PUT operation against the [etcd storage backend][etcd-storage-backend]                                   | ms   | summary |
| `vault.etcd.get`            | Duration of a GET operation against the [etcd storage backend][etcd-storage-backend]                                   | ms   | summary |
| `vault.etcd.delete`         | Duration of a DELETE operation against the [etcd storage backend][etcd-storage-backend]                                | ms   | summary |
| `vault.etcd.list`           | Duration of a LIST operation against the [etcd storage backend][etcd-storage-backend]                                  | ms   | summary |
| `vault.gcs.put`             | Duration of a PUT operation against the [Google Cloud Storage storage backend][gcs-storage-backend]                    | ms   | summary |
| `vault.gcs.get`             | Duration of a GET operation against the [Google Cloud Storage storage backend][gcs-storage-backend]                    | ms   | summary |
| `vault.gcs.delete`          | Duration of a DELETE operation against the [Google Cloud Storage storage backend][gcs-storage-backend]                 | ms   | summary |
| `vault.gcs.list`            | Duration of a LIST operation against the [Google Cloud Storage storage backend][gcs-storage-backend]                   | ms   | summary |
| `vault.gcs.lock.unlock`     | Duration of an UNLOCK operation against the [Google Cloud Storage storage backend][gcs-storage-backend] in HA mode     | ms   | summary |
| `vault.gcs.lock.lock`       | Duration of a LOCK operation against the [Google Cloud Storage storage backend][gcs-storage-backend] in HA mode        | ms   | summary |
| `vault.gcs.lock.value`      | Duration of a VALUE operation against the [Google Cloud Storage storage backend][gcs-storage-backend] in HA mode       | ms   | summary |
| `vault.mssql.put`           | Duration of a PUT operation against the [MS-SQL storage backend][mssql-storage-backend]                                | ms   | summary |
| `vault.mssql.get`           | Duration of a GET operation against the [MS-SQL storage backend][mssql-storage-backend]                                | ms   | summary |
| `vault.mssql.delete`        | Duration of a DELETE operation against the [MS-SQL storage backend][mssql-storage-backend]                             | ms   | summary |
| `vault.mssql.list`          | Duration of a LIST operation against the [MS-SQL storage backend][mssql-storage-backend]                               | ms   | summary |
| `vault.mysql.put`           | Duration of a PUT operation against the [MySQL storage backend][mysql-storage-backend]                                 | ms   | summary |
| `vault.mysql.get`           | Duration of a GET operation against the [MySQL storage backend][mysql-storage-backend]                                 | ms   | summary |
| `vault.mysql.delete`        | Duration of a DELETE operation against the [MySQL storage backend][mysql-storage-backend]                              | ms   | summary |
| `vault.mysql.list`          | Duration of a LIST operation against the [MySQL storage backend][mysql-storage-backend]                                | ms   | summary |
| `vault.postgres.put`        | Duration of a PUT operation against the [PostgreSQL storage backend][postgresql-storage-backend]                       | ms   | summary |
| `vault.postgres.get`        | Duration of a GET operation against the [PostgreSQL storage backend][postgresql-storage-backend]                       | ms   | summary |
| `vault.postgres.delete`     | Duration of a DELETE operation against the [PostgreSQL storage backend][postgresql-storage-backend]                    | ms   | summary |
| `vault.postgres.list`       | Duration of a LIST operation against the [PostgreSQL storage backend][postgresql-storage-backend]                      | ms   | summary |
| `vault.s3.put`              | Duration of a PUT operation against the [Amazon S3 storage backend][s3-storage-backend]                                | ms   | summary |
| `vault.s3.get`              | Duration of a GET operation against the [Amazon S3 storage backend][s3-storage-backend]                                | ms   | summary |
| `vault.s3.delete`           | Duration of a DELETE operation against the [Amazon S3 storage backend][s3-storage-backend]                             | ms   | summary |
| `vault.s3.list`             | Duration of a LIST operation against the [Amazon S3 storage backend][s3-storage-backend]                               | ms   | summary |
| `vault.spanner.put`         | Duration of a PUT operation against the [Google Cloud Spanner storage backend][spanner-storage-backend]                | ms   | summary |
| `vault.spanner.get`         | Duration of a GET operation against the [Google Cloud Spanner storage backend][spanner-storage-backend]                | ms   | summary |
| `vault.spanner.delete`      | Duration of a DELETE operation against the [Google Cloud Spanner storage backend][spanner-storage-backend]             | ms   | summary |
| `vault.spanner.list`        | Duration of a LIST operation against the [Google Cloud Spanner storage backend][spanner-storage-backend]               | ms   | summary |
| `vault.spanner.lock.unlock` | Duration of an UNLOCK operation against the [Google Cloud Spanner storage backend][spanner-storage-backend] in HA mode | ms   | summary |
| `vault.spanner.lock.lock`   | Duration of a LOCK operation against the [Google Cloud Spanner storage backend][spanner-storage-backend] in HA mode    | ms   | summary |
| `vault.spanner.lock.value`  | Duration of a VALUE operation against the [Google Cloud Spanner storage backend][spanner-storage-backend] in HA mode   | ms   | summary |
| `vault.swift.put`           | Duration of a PUT operation against the [Swift storage backend][swift-storage-backend]                                 | ms   | summary |
| `vault.swift.get`           | Duration of a GET operation against the [Swift storage backend][swift-storage-backend]                                 | ms   | summary |
| `vault.swift.delete`        | Duration of a DELETE operation against the [Swift storage backend][swift-storage-backend]                              | ms   | summary |
| `vault.swift.list`          | Duration of a LIST operation against the [Swift storage backend][swift-storage-backend]                                | ms   | summary |
| `vault.zookeeper.put`       | Duration of a PUT operation against the [ZooKeeper storage backend][zookeeper-storage-backend]                         | ms   | summary |
| `vault.zookeeper.get`       | Duration of a GET operation against the [ZooKeeper storage backend][zookeeper-storage-backend]                         | ms   | summary |
| `vault.zookeeper.delete`    | Duration of a DELETE operation against the [ZooKeeper storage backend][zookeeper-storage-backend]                      | ms   | summary |
| `vault.zookeeper.list`      | Duration of a LIST operation against the [ZooKeeper storage backend][zookeeper-storage-backend]                        | ms   | summary |

## Integrated Storage (Raft)

These metrics relate to Raft-based [integrated storage][integrated-storage].

| Metric                                                                 | Description                                                                                                                                                                                                      | Unit                              | Type    |
| :--------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------------- | :------ |
| `vault.raft.apply`                                                     | Number of Raft transactions occurring over the interval, which is a general indicator of the write load on the Raft servers.                                                                                     | raft transactions / interval      | counter |
| `vault.raft.barrier`                                                   | Number of times the node has started the barrier, i.e., the number of times it has issued a blocking call to ensure that all pending queued operations have been applied to the node's FSM.                      | blocks / interval                 | counter |
| `vault.raft.candidate.electSelf`                                       | Time to request a vote from a peer.                                                                                                                                                                              | ms                                | summary |
| `vault.raft.commitNumLogs`                                             | Number of logs processed for application to the FSM in a single batch.                                                                                                                                           | logs                              | gauge   |
| `vault.raft.commitTime`                                                | Time to commit a new entry to the Raft log on the leader.                                                                                                                                                        | ms                                | timer   |
| `vault.raft.compactLogs`                                               | Time to trim the logs that are no longer needed.                                                                                                                                                                 | ms                                | summary |
| `vault.raft.delete`                                                    | Time to delete file from raft's underlying storage.                                                                                                                                                              | ms                                | summary |
| `vault.raft.delete_prefix`                                             | Time to delete files under a prefix from raft's underlying storage.                                                                                                                                              | ms                                | summary |
| `vault.raft.fsm.apply`                                                 | Number of logs committed since the last interval.                                                                                                                                                                | commit logs / interval            | summary |
| `vault.raft.fsm.applyBatch`                                            | Time to apply batch of logs.                                                                                                                                                                                     | ms                                | summary |
| `vault.raft.fsm.applyBatchNum` | Number of logs applied in batch. | ms | summary |
|
||
| `vault.raft.fsm.enqueue`                                               | Time to enqueue a batch of logs for the FSM to apply.                                                                                                                                                            | ms                                | timer   |
| `vault.raft.fsm.restore`                                               | Time taken by the FSM to restore its state from a snapshot.                                                                                                                                                      | ms                                | summary |
| `vault.raft.fsm.snapshot`                                              | Time taken by the FSM to record the current state for the snapshot.                                                                                                                                              | ms                                | summary |
| `vault.raft.fsm.store_config`                                          | Time to store the configuration.                                                                                                                                                                                 | ms                                | summary |
| `vault.raft.get`                                                       | Time to retrieve file from raft's underlying storage.                                                                                                                                                            | ms                                | summary |
| `vault.raft.leader.dispatchLog`                                        | Time for the leader to write log entries to disk.                                                                                                                                                                | ms                                | timer   |
| `vault.raft.leader.dispatchNumLogs`                                    | Number of logs committed to disk in a batch.                                                                                                                                                                     | logs                              | gauge   |
| `vault.raft.list`                                                      | Time to retrieve list of keys from raft's underlying storage.                                                                                                                                                    | ms                                | summary |
| `vault.raft.peers`                                                     | Number of peers in the raft cluster configuration.                                                                                                                                                               | peers                             | gauge   |
| `vault.raft.put`                                                       | Time to persist key in raft's underlying storage.                                                                                                                                                                | ms                                | summary |
| `vault.raft.replication.appendEntries.log`                             | Number of logs replicated to a node, to bring it up to speed with the leader's logs.                                                                                                                             | logs appended / interval          | counter |
| `vault.raft.replication.appendEntries.rpc`                             | Time taken by the append entries RPC, to replicate the log entries of a leader node onto its follower node(s).                                                                                                   | ms                                | timer   |
| `vault.raft.replication.heartbeat`                                     | Time taken to invoke appendEntries on a peer, so that it doesn't time out on a periodic basis.                                                                                                                   | ms                                | timer   |
| `vault.raft.replication.installSnapshot`                               | Time taken to process the installSnapshot RPC call. This metric should only be seen on nodes which are currently in the follower state.                                                                          | ms                                | timer   |
| `vault.raft.restore`                                                   | Number of times the restore operation has been performed by the node. Here, restore refers to the action of raft consuming an external snapshot to restore its state.                                            | operation invoked / interval      | counter |
| `vault.raft.restoreUserSnapshot`                                       | Time taken by the node to restore the FSM state from a user's snapshot.                                                                                                                                          | ms                                | timer   |
| `vault.raft.rpc.appendEntries`                                         | Time taken to process an append entries RPC call from a node.                                                                                                                                                    | ms                                | timer   |
| `vault.raft.rpc.appendEntries.processLogs`                             | Time taken to process the outstanding log entries of a node.                                                                                                                                                     | ms                                | timer   |
| `vault.raft.rpc.appendEntries.storeLogs`                               | Time taken to add any outstanding logs for a node, since the last appendEntries was invoked.                                                                                                                     | ms                                | timer   |
| `vault.raft.rpc.installSnapshot`                                       | Time taken to process the installSnapshot RPC call. This metric should only be seen on nodes which are currently in the follower state.                                                                          | ms                                | timer   |
| `vault.raft.rpc.processHeartbeat`                                      | Time taken to process a heartbeat request.                                                                                                                                                                       | ms                                | timer   |
| `vault.raft.rpc.requestVote`                                           | Time taken to complete requestVote RPC call.                                                                                                                                                                     | ms                                | summary |
| `vault.raft.snapshot.create`                                           | Time taken to initialize the snapshot process.                                                                                                                                                                   | ms                                | timer   |
| `vault.raft.snapshot.persist`                                          | Time taken to dump the current snapshot taken by the node to the disk.                                                                                                                                           | ms                                | timer   |
| `vault.raft.snapshot.takeSnapshot`                                     | Total time involved in taking the current snapshot (creating one and persisting it) by the node.                                                                                                                 | ms                                | timer   |
| `vault.raft.state.follower`                                            | Number of times the node has entered the follower mode. This happens when a new node joins the cluster or after the end of a leader election.                                                                    | follower state entered / interval | counter |
| `vault.raft.transition.heartbeat_timeout`                              | Number of times the node has transitioned to the Candidate state after receiving no heartbeat messages from the last known leader.                                                                               | timeouts / interval               | counter |
| `vault.raft.transition.leader_lease_timeout`                           | Number of times a quorum of nodes could not be contacted.                                                                                                                                                        | contact failures                  | counter |
| `vault.raft.verify_leader`                                             | Number of times node checks whether it is still the leader or not.                                                                                                                                               | checks / interval                 | counter |
| `vault.raft-storage.delete`                                            | Time to insert log entry to delete path.                                                                                                                                                                         | ms                                | timer   |
| `vault.raft-storage.get`                                               | Time to retrieve value for path from FSM.                                                                                                                                                                        | ms                                | timer   |
| `vault.raft-storage.put`                                               | Time to insert log entry to persist path.                                                                                                                                                                        | ms                                | timer   |
| `vault.raft-storage.list`                                              | Time to list all entries under the prefix from the FSM.                                                                                                                                                          | ms                                | timer   |
| `vault.raft-storage.transaction`                                       | Time to insert operations into a single log.                                                                                                                                                                     | ms                                | timer   |
| `vault.raft-storage.entry_size`                                        | The total size of a Raft entry during log application in bytes.                                                                                                                                                  | bytes                             | summary |
| `vault.raft_storage.bolt.freelist.`<br/>`free_pages`                     | Number of free pages in the freelist.                                                                                                                                                                            | pages                             | gauge   |
| `vault.raft_storage.bolt.freelist.`<br/>`pending_pages`                  | Number of pending pages in the freelist.                                                                                                                                                                         | pages                             | gauge   |
| `vault.raft_storage.bolt.freelist.`<br/>`allocated_bytes`                | Total bytes allocated in free pages.                                                                                                                                                                             | bytes                             | gauge   |
| `vault.raft_storage.bolt.freelist.`<br/>`used_bytes`                     | Total bytes used by the freelist.                                                                                                                                                                                | bytes                             | gauge   |
| `vault.raft_storage.bolt.transaction.`<br/>`started_read_transactions`   | Number of started read transactions.                                                                                                                                                                             | transactions                      | gauge   |
| `vault.raft_storage.bolt.transaction.`<br/>`currently_open_read_transactions` | Number of currently open read transactions.                                                                                                                                                                 | transactions                      | gauge   |
| `vault.raft_storage.bolt.page.count`                                   | Number of page allocations.                                                                                                                                                                                      | allocations                       | gauge   |
| `vault.raft_storage.bolt.page.`<br/>`bytes_allocated`                    | Total bytes allocated.                                                                                                                                                                                           | bytes                             | gauge   |
| `vault.raft_storage.bolt.cursor.count`                                 | Number of cursors created.                                                                                                                                                                                       | cursors                           | gauge   |
| `vault.raft_storage.bolt.node.count`                                   | Number of node allocations.                                                                                                                                                                                      | nodes                             | gauge   |
| `vault.raft_storage.bolt.node.dereferences`                            | Number of node dereferences.                                                                                                                                                                                     | dereferences                      | gauge   |
| `vault.raft_storage.bolt.rebalance.count`                              | Number of node rebalances.                                                                                                                                                                                       | rebalances                        | gauge   |
| `vault.raft_storage.bolt.rebalance.time`                               | Time taken rebalancing.                                                                                                                                                                                          | ms                                | summary |
| `vault.raft_storage.bolt.split.count`                                  | Number of nodes split.                                                                                                                                                                                           | nodes                             | gauge   |
| `vault.raft_storage.bolt.spill.count`                                  | Number of nodes spilled.                                                                                                                                                                                         | nodes                             | gauge   |
| `vault.raft_storage.bolt.spill.time`                                   | Time taken spilling.                                                                                                                                                                                             | ms                                | summary |
| `vault.raft_storage.bolt.write.count`                                  | Number of writes performed.                                                                                                                                                                                      | writes                            | gauge   |
| `vault.raft_storage.bolt.write.time`                                   | Time taken writing to disk.                                                                                                                                                                                      | ms                                | summary |

## Integrated Storage (Raft) Autopilot
| Metric                              | Description                                                                                           | Unit      | Type    |
| :---------------------------------- | :---------------------------------------------------------------------------------------------------- | :-------- | :------ |
| `vault.autopilot.node.healthy`      | Set to 1 if the node_id is deemed healthy by Autopilot, 0 if not                                      | bool      | gauge   |
| `vault.autopilot.healthy`           | Set to 1 if Autopilot considers all nodes healthy                                                     | bool      | gauge   |
| `vault.autopilot.failure_tolerance` | How many nodes can be lost while maintaining quorum, i.e. number of healthy nodes in excess of quorum | nodes     | gauge   |

Since Autopilot runs only on the active node, these metrics are only emitted by the active node.
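
To make the `vault.autopilot.failure_tolerance` description concrete, here is a minimal Python sketch of "healthy nodes in excess of quorum". It assumes that quorum is a simple majority of the voting nodes; the function name and parameters are hypothetical, and this is not Autopilot's actual implementation.

```python
def failure_tolerance(total_voters: int, healthy_voters: int) -> int:
    """Illustrative only: healthy nodes in excess of quorum.

    Assumes quorum is a simple majority of the voting nodes.
    """
    quorum = total_voters // 2 + 1
    # A cluster that has already lost quorum cannot tolerate further failures.
    return max(healthy_voters - quorum, 0)


print(failure_tolerance(5, 5))  # 5 voters, quorum of 3 -> 2
print(failure_tolerance(5, 3))  # exactly at quorum -> 0
```

Under these assumptions, a gauge value of 0 would mean that losing one more node costs the cluster its quorum.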

## Integrated Storage (Raft) Leadership Changes

| Metric                          | Description                                                                                                   | Unit      | Type    |
| :------------------------------ | :------------------------------------------------------------------------------------------------------------ | :-------- | :------ |
| `vault.raft.leader.lastContact` | Measures the time since the leader was last able to contact the follower nodes when checking its leader lease | ms        | summary |
| `vault.raft.state.candidate`    | Increments whenever raft server starts an election                                                            | Elections | counter |
| `vault.raft.state.leader`       | Increments whenever raft server becomes a leader                                                              | Leaders   | counter |

**Why they're important**: Normally, your raft cluster should have a stable
leader. If there are frequent elections or leadership changes, it would likely
indicate network issues between the raft nodes, or that the raft servers
themselves are unable to keep up with the load.

**What to look for**: For a healthy cluster, you're looking for a lastContact
lower than 200ms, leader > 0 and candidate == 0. Deviations from this might
indicate flapping leadership.
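
The rule of thumb above can be expressed as a small check. This Python sketch assumes the three values have already been read from whatever metrics sink you use; the function and parameter names are hypothetical, and the 200ms threshold is the one quoted above.

```python
def leadership_looks_healthy(last_contact_ms: float,
                             leader_count: int,
                             candidate_count: int) -> bool:
    """Apply the guideline: lastContact < 200ms, leader > 0, candidate == 0."""
    return last_contact_ms < 200 and leader_count > 0 and candidate_count == 0


# A stable leader with fast follower contact.
print(leadership_looks_healthy(35.0, 1, 0))
# Slow contact and repeated elections, which may indicate flapping leadership.
print(leadership_looks_healthy(450.0, 1, 3))
```

Treat a failing check as a prompt to investigate network latency or load rather than as a definitive diagnosis.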

## Integrated Storage (Raft) Automated Snapshots

These metrics relate to the Enterprise feature [Raft Automated Snapshots](/docs/enterprise/automated-raft-snapshots).

| Metric                                      | Description                                                                                   | Unit       | Type    |
| :------------------------------------------ | :-------------------------------------------------------------------------------------------- | :--------- | :------ |
| `vault.autosnapshots.total.snapshot.size`   | For storage_type=local, space on disk used by saved snapshots                                 | bytes      | gauge   |
| `vault.autosnapshots.percent.maxspace.used` | For storage_type=local, percent used of maximum allocated space                               | percentage | gauge   |
| `vault.autosnapshots.save.errors`           | Increments whenever an error occurs trying to save a snapshot                                 | n/a        | counter |
| `vault.autosnapshots.save.duration`         | Measures the time taken saving a snapshot                                                     | ms         | summary |
| `vault.autosnapshots.last.success.time`     | Epoch time (seconds since 1970/01/01) of last successful snapshot save                        | n/a        | gauge   |
| `vault.autosnapshots.snapshot.size`         | Measures the size in bytes of snapshots                                                       | bytes      | summary |
| `vault.autosnapshots.rotate.duration`       | Measures the time taken to rotate (i.e. delete) old snapshots to satisfy configured retention | ms         | summary |
| `vault.autosnapshots.snapshots.in.storage`  | Number of snapshots in storage                                                                | n/a        | gauge   |

## Metric Labels

| Metric                 | Description                                                                                                                                                                                                                                            | Example                 |
| :--------------------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :---------------------- |
| `auth_method`          | Authorization engine type.                                                                                                                                                                                                                             | `userpass`              |
| `cluster`              | The cluster name from which the metric originated; set in the configuration file, or automatically generated when a cluster is created.                                                                                                                | `vault-cluster-d54ad07` |
| `creation_ttl`         | Time-to-live value assigned to a token or lease at creation. This value is rounded up to the next-highest bucket; the available buckets are `1m`, `10m`, `20m`, `1h`, `2h`, `1d`, `2d`, `7d`, and `30d`. Any longer TTL is assigned the value `+Inf`. | `7d`                    |
| `mount_point`          | Path at which an auth method or secret engine is mounted.                                                                                                                                                                                              | `auth/userpass/`        |
| `namespace`            | A namespace path, or `root` for the root namespace                                                                                                                                                                                                     | `ns1`                   |
| `policy`               | A single named policy                                                                                                                                                                                                                                  | `default`               |
| `secret_engine`        | The [secret engine][secrets-engines] type.                                                                                                                                                                                                             | `aws`                   |
| `token_type`           | Identifies whether the token is a batch token or a service token.                                                                                                                                                                                      | `service`               |
| `peer_id`              | Unique identifier of a raft peer.                                                                                                                                                                                                                      | `node-1`                |
| `node_id`              | Unique identifier of a raft peer, same as `peer_id`.                                                                                                                                                                                                   | `node-1`                |
| `snapshot_config_name` | For automated snapshots, the name of the configuration                                                                                                                                                                                                 | `config1`               |
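
The `creation_ttl` bucketing described in the table can be illustrated with a short Python sketch. The bucket list comes from the table; the function name is hypothetical, and treating a TTL exactly equal to a bucket boundary as belonging to that bucket is an assumption, not something this page states.

```python
from datetime import timedelta

# Buckets listed for the creation_ttl label, in ascending order.
BUCKETS = [
    ("1m", timedelta(minutes=1)),
    ("10m", timedelta(minutes=10)),
    ("20m", timedelta(minutes=20)),
    ("1h", timedelta(hours=1)),
    ("2h", timedelta(hours=2)),
    ("1d", timedelta(days=1)),
    ("2d", timedelta(days=2)),
    ("7d", timedelta(days=7)),
    ("30d", timedelta(days=30)),
]


def creation_ttl_label(ttl: timedelta) -> str:
    """Round a TTL up to the next-highest bucket; longer TTLs map to +Inf."""
    for label, limit in BUCKETS:
        # Assumption: a TTL exactly on a boundary falls into that bucket.
        if ttl <= limit:
            return label
    return "+Inf"


print(creation_ttl_label(timedelta(days=3)))   # falls in the 7d bucket
print(creation_ttl_label(timedelta(days=45)))  # longer than 30d
```

For example, a lease created with a three-day TTL would be counted under `creation_ttl="7d"` in this sketch.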

[secrets-engines]: /docs/secrets
[storage-backends]: /docs/configuration/storage
[telemetry-stanza]: /docs/configuration/telemetry
[cubbyhole-secrets-engine]: /docs/secrets/cubbyhole
[kv-secrets-engine]: /docs/secrets/kv
[ldap-auth-backend]: /docs/auth/ldap
[token-auth-backend]: /docs/auth/token
[azure-storage-backend]: /docs/configuration/storage/azure
[cassandra-storage-backend]: /docs/configuration/storage/cassandra
[cockroachdb-storage-backend]: /docs/configuration/storage/cockroachdb
[consul-storage-backend]: /docs/configuration/storage/consul
[couchdb-storage-backend]: /docs/configuration/storage/couchdb
[dynamodb-storage-backend]: /docs/configuration/storage/dynamodb
[etcd-storage-backend]: /docs/configuration/storage/etcd
[gcs-storage-backend]: /docs/configuration/storage/google-cloud-storage
[spanner-storage-backend]: /docs/configuration/storage/google-cloud-spanner
[mssql-storage-backend]: /docs/configuration/storage/mssql
[mysql-storage-backend]: /docs/configuration/storage/mysql
[postgresql-storage-backend]: /docs/configuration/storage/postgresql
[s3-storage-backend]: /docs/configuration/storage/s3
[swift-storage-backend]: /docs/configuration/storage/swift
[zookeeper-storage-backend]: /docs/configuration/storage/zookeeper
[integrated-storage]: /docs/configuration/storage/raft