From fb8739d92640a0fab326921f20e4e8cd520918b9 Mon Sep 17 00:00:00 2001
From: Michael Schurter
Date: Mon, 26 Sep 2022 14:43:34 -0700
Subject: [PATCH] docs: write a lot of words about heartbeats (#14679)

* docs: write a lot of words about heartbeats

Alternative to #14670

* Apply suggestions from code review

Co-authored-by: Tim Gross

* use descriptive title for link

* rework example of high failover ttl

Co-authored-by: Tim Gross
---
 website/content/docs/configuration/server.mdx | 131 ++++++++++++++----
 1 file changed, 105 insertions(+), 26 deletions(-)

diff --git a/website/content/docs/configuration/server.mdx b/website/content/docs/configuration/server.mdx
index 6ed1a2659..99a1bb48e 100644
--- a/website/content/docs/configuration/server.mdx
+++ b/website/content/docs/configuration/server.mdx
@@ -118,38 +118,25 @@ server {
   example section](#configuring-scheduler-config) for more details
   `default_scheduler_config` was introduced in Nomad 0.10.4.
 
-- `heartbeat_grace` `(string: "10s")` - Specifies the additional time given as a
-  grace period beyond the heartbeat TTL of nodes to account for network and
-  processing delays as well as clock skew. This is specified using a label
-  suffix like "30s" or "1h".
-
-- `license_path` `(string: "")` - Specifies the path to load a Nomad Enterprise
-  license from. This must be an absolute path (`/opt/nomad/license.hclic`). The
-  license can also be set by setting `NOMAD_LICENSE_PATH` or by setting
-  `NOMAD_LICENSE` as the entire license value. `license_path` has the highest
-  precedence, followed by `NOMAD_LICENSE` and then `NOMAD_LICENSE_PATH`.
+- `heartbeat_grace` `(string: "10s")` - Specifies the additional time given
+  beyond the heartbeat TTL of Clients to account for network and processing
+  delays and clock skew. This is specified using a label suffix like "30s" or
+  "1h". See [Client Heartbeats](#client-heartbeats) below for details.
 
 - `min_heartbeat_ttl` `(string: "10s")` - Specifies the minimum time between
-  node heartbeats. This is used as a floor to prevent excessive updates. This is
-  specified using a label suffix like "30s" or "1h". Lowering the minimum TTL is
-  a tradeoff as it lowers failure detection time of nodes at the tradeoff of
-  false positives and increased load on the leader.
+  Client heartbeats. This is used as a floor to prevent excessive updates. This
+  is specified using a label suffix like "30s" or "1h". See [Client
+  Heartbeats](#client-heartbeats) below for details.
 
-- `failover_heartbeat_ttl` `(string: "5m")` - Specifies the TTL applied to
-  heartbeats after a new leader is elected, since we no longer know the status
-  of all the heartbeats. This is specified using a label suffix like "30s" or
-  "1h".
-
-  ~> Lowering the `failover_heartbeat_ttl` is a tradeoff as it lowers failure
-  detection time of nodes at the tradeoff of false positives. False positives
-  could cause all clients to stop their allocations if a leadership transition
-  lasts longer than `heartbeat_grace + failover_heartbeat_ttl`.
+- `failover_heartbeat_ttl` `(string: "5m")` - The time by which all Clients
+  must heartbeat after a Server leader election. This is specified using a label
+  suffix like "30s" or "1h". See [Client Heartbeats](#client-heartbeats) below
+  for details.
 
 - `max_heartbeats_per_second` `(float: 50.0)` - Specifies the maximum target
   rate of heartbeats being processed per second. This allows the TTL to be
-  increased to meet the target rate. Increasing the maximum heartbeats per
-  second is a tradeoff as it lowers failure detection time of nodes at the
-  tradeoff of false positives and increased load on the leader.
+  increased to meet the target rate. See [Client
+  Heartbeats](#client-heartbeats) below for details.
 
 - `non_voting_server` `(bool: false)` - (Enterprise-only) Specifies whether
   this server will act as a non-voting member of the cluster to help provide
@@ -160,6 +147,12 @@ server {
   disallow this server from making any scheduling decisions. This defaults to
   the number of CPU cores.
 
+- `license_path` `(string: "")` - Specifies the path to load a Nomad Enterprise
+  license from. This must be an absolute path (`/opt/nomad/license.hclic`). The
+  license can also be set by setting `NOMAD_LICENSE_PATH` or by setting
+  `NOMAD_LICENSE` as the entire license value. `license_path` has the highest
+  precedence, followed by `NOMAD_LICENSE` and then `NOMAD_LICENSE_PATH`.
+
 - `plan_rejection_tracker` ([PlanRejectionTracker](#plan_rejection_tracker-parameters)) -
   Configuration for the plan rejection tracker that the Nomad leader uses to
   track the history of plan rejections.
@@ -369,6 +362,90 @@ server {
 }
 ```
 
+## Client Heartbeats ((#client-heartbeats))
+
+~> This is an advanced topic. It is most beneficial to clusters over 1,000
+   nodes or with unreliable networks or nodes (eg some edge deployments).
+
+Nomad Clients periodically heartbeat to Nomad Servers to confirm they are
+operating as expected. Nomad Clients which do not heartbeat in the specified
+amount of time are considered `down` and their allocations are marked as `lost`
+or `disconnected` (if [`max_client_disconnect`][max_client_disconnect] is set)
+and rescheduled.
+
+The various heartbeat related parameters allow you to tune the following
+tradeoffs:
+
+- The longer the heartbeat period, the longer a `down` Client's workload will
+  take to be rescheduled.
+- The shorter the heartbeat period, the more likely transient network issues,
+  leader elections, and other temporary issues could cause a perfectly
+  functional Client and its workloads to be marked as `down` and the work
+  rescheduled.
+
+While Nomad Clients can connect to any Server, all heartbeats are forwarded to
+the leader for processing.
+Since this heartbeat processing consumes resources,
+Nomad adjusts the rate at which Clients heartbeat based on cluster size. The
+goal is to try to keep the resource cost of processing heartbeats constant
+regardless of cluster size.
+
+The base formula for determining how often a Client must heartbeat is:
+
+```
+<number of Clients> / <max_heartbeats_per_second>
+```
+
+Other factors modify this base TTL:
+
+- A random factor up to `2x` is added to the base TTL to prevent the
+  [thundering herd][herd] problem where a large number of clients attempt to
+  heartbeat at exactly the same time.
+- [`min_heartbeat_ttl`](#min_heartbeat_ttl) is used as the lower bound to
+  prevent small clusters from sending excessive heartbeats.
+- [`heartbeat_grace`](#heartbeat_grace) is the amount of _extra_ time the
+  leader will wait for a heartbeat beyond the base heartbeat.
+- After a leader election all Clients are given up to `failover_heartbeat_ttl`
+  to successfully heartbeat. This gives Clients time to discover a functioning
+  Server in case they were directly connected to a leader that crashed.
+
+For example, given the default values for heartbeat parameters, different sized
+clusters will use the following TTLs for the heartbeats. Note that the `Server TTL`
+simply adds the `heartbeat_grace` parameter to the TTL Clients are given.
+
+| Clients | Client TTL  | Server TTL  | Safe after elections |
+| ------- | ----------- | ----------- | -------------------- |
+| 10      | 10s - 20s   | 20s - 30s   | yes                  |
+| 100     | 10s - 20s   | 20s - 30s   | yes                  |
+| 1000    | 20s - 40s   | 30s - 50s   | yes                  |
+| 5000    | 100s - 200s | 110s - 210s | yes                  |
+| 10000   | 200s - 400s | 210s - 410s | NO (see below)       |
+
+Regardless of size, all clients will have a Server TTL of
+`failover_heartbeat_ttl` after a leader election. It should always be larger
+than the maximum Client TTL for your cluster size in order to prevent marking
+live Clients as `down`.
+
+For clusters over 5000 Clients you should increase `failover_heartbeat_ttl`
+using the following formula:
+
+```
+(2 * (<number of Clients> / <max_heartbeats_per_second>)) + (10 * <min_heartbeat_ttl>)
+
+# For example with 6000 Clients:
+(2 * (6000 / 50)) + (10 * 10) = 340s (5m40s)
+```
+
+This ensures Clients have some additional time to failover even if they were
+told to heartbeat after the maximum interval.
+
+The actual value used should take into consideration how much tolerance your
+system has for a delay in noticing crashed Clients. For example a
+`failover_heartbeat_ttl` of 30 minutes may give even the slowest clients in the
+largest clusters ample time to heartbeat after an election. However if the
+election was due to a datacenter-wide failure affecting Clients, it will be 30
+minutes before Nomad recognizes that they are `down` and reschedules their
+work.
+
 [encryption]: https://learn.hashicorp.com/tutorials/nomad/security-gossip-encryption 'Nomad Encryption Overview'
 [server-join]: /docs/configuration/server_join 'Server Join'
 [update-scheduler-config]: /api-docs/operator/scheduler#update-scheduler-configuration 'Scheduler Config'
@@ -378,3 +455,5 @@ server {
 [`nomad operator keygen`]: /docs/commands/operator/keygen
 [search]: /docs/configuration/search
 [encryption key]: /docs/operations/key-management
+[max_client_disconnect]: /docs/job-specification/group#max-client-disconnect
+[herd]: https://en.wikipedia.org/wiki/Thundering_herd_problem
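
The TTL arithmetic described in the Client Heartbeats section of this patch can be checked with a short Python sketch. This is only a model of the documented formulas, not Nomad's implementation: every function name is hypothetical, the defaults are the documented default values in seconds, and the random factor is modeled simply as a range from 1x to 2x of the base TTL.

```python
# Hypothetical helpers modeling the documented heartbeat formulas.
# All durations are in seconds; defaults mirror the documented defaults.

def base_ttl(clients, max_heartbeats_per_second=50.0, min_heartbeat_ttl=10.0):
    """Base TTL: clients / max_heartbeats_per_second, floored at min_heartbeat_ttl."""
    return max(clients / max_heartbeats_per_second, min_heartbeat_ttl)


def client_ttl_range(clients, **params):
    """Client TTL range, treating the random factor as 1x to 2x of the base TTL."""
    base = base_ttl(clients, **params)
    return (base, 2 * base)


def server_ttl_range(clients, heartbeat_grace=10.0, **params):
    """Server TTL range: the Client TTL range plus heartbeat_grace."""
    low, high = client_ttl_range(clients, **params)
    return (low + heartbeat_grace, high + heartbeat_grace)


def safe_after_election(clients, failover_heartbeat_ttl=300.0, **params):
    """True when failover_heartbeat_ttl exceeds the maximum Client TTL."""
    return failover_heartbeat_ttl > client_ttl_range(clients, **params)[1]


def suggested_failover_ttl(clients, max_heartbeats_per_second=50.0,
                           min_heartbeat_ttl=10.0):
    """(2 * (clients / max_heartbeats_per_second)) + (10 * min_heartbeat_ttl)."""
    return 2 * (clients / max_heartbeats_per_second) + 10 * min_heartbeat_ttl


if __name__ == "__main__":
    for n in (10, 100, 1000, 5000, 10000):
        print(n, client_ttl_range(n), server_ttl_range(n), safe_after_election(n))
    print("6000 clients:", suggested_failover_ttl(6000), "seconds")
```

With the defaults, the loop reproduces the rows of the TTL table in the patch, and the last line reproduces the 340s example for 6000 Clients.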