open-nomad/client/fingerprint
Tim Gross 17aee4d69c
fingerprint: don't clear Consul/Vault attributes on failure (#14673)
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client updates its
fingerprint, which triggers evaluations for all of its jobs and for all the
system jobs in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evaluation for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.

This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
2022-09-23 14:45:12 -04:00
..
test_fixtures client/fingerprint/consul: add new attributes to consul fingerprinter 2021-06-03 12:49:22 -05:00
arch.go
arch_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
bridge.go client/fingerprint: detect unloaded dynamic bridge kernel module 2020-11-09 13:56:14 -06:00
bridge_default.go gofmt all the files 2021-10-01 10:14:28 -04:00
bridge_linux.go deps: bump gopsutil to v3.21.2 2021-03-30 16:02:51 -04:00
bridge_linux_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
cgroup.go CI: make make check clean on macOS (#14528) 2022-09-09 12:26:34 -04:00
cgroup_default.go client: enable support for cgroups v2 2022-03-23 11:35:27 -05:00
cgroup_linux.go CI: make make check clean on macOS (#14528) 2022-09-09 12:26:34 -04:00
cgroup_test.go client: enable support for cgroups v2 2022-03-23 11:35:27 -05:00
cni.go multi-interface network support 2020-06-19 09:42:10 -04:00
cni_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
consul.go fingerprint: don't clear Consul/Vault attributes on failure (#14673) 2022-09-23 14:45:12 -04:00
consul_test.go fingerprint: don't clear Consul/Vault attributes on failure (#14673) 2022-09-23 14:45:12 -04:00
cpu.go client/ar: thread through cpuset manager 2021-04-13 13:28:36 -04:00
cpu_default.go gofmt all the files 2021-10-01 10:14:28 -04:00
cpu_linux.go client: enable support for cgroups v2 2022-03-23 11:35:27 -05:00
cpu_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
env_aws.go client: Add AWS EC2 instance-life-cycle from metadata to client fingerprint (#12371) 2022-03-25 11:50:52 -04:00
env_aws_cpu.go build: update aws env cpu info 2022-08-02 07:59:58 -05:00
env_aws_test.go client: Add AWS EC2 instance-life-cycle from metadata to client fingerprint (#12371) 2022-03-25 11:50:52 -04:00
env_azure.go chore: bump golangci-lint from v1.24 to v1.39 2021-04-03 09:50:23 +02:00
env_azure_test.go testing: setting env var incompatible with parallel tests (#14405) 2022-08-30 14:49:03 -04:00
env_digitalocean.go chore: remove use of "err" a log line context key for errors. (#14433) 2022-09-01 15:06:10 +02:00
env_digitalocean_test.go testing: setting env var incompatible with parallel tests (#14405) 2022-08-30 14:49:03 -04:00
env_gce.go chore: bump golangci-lint from v1.24 to v1.39 2021-04-03 09:50:23 +02:00
env_gce_test.go testing: setting env var incompatible with parallel tests (#14405) 2022-08-30 14:49:03 -04:00
fingerprint.go add digitalocean fingerprinter 2022-02-05 22:17:36 -08:00
fingerprint_default.go gofmt all the files 2021-10-01 10:14:28 -04:00
fingerprint_linux.go CNI Implementation (#7518) 2020-06-18 11:05:29 -07:00
fingerprint_test.go
host.go fingerprint kernel architecture name (#13182) 2022-06-02 15:51:00 -04:00
host_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
memory.go deps: bump gopsutil to v3.21.2 2021-03-30 16:02:51 -04:00
memory_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
network.go fix host network reserved port fingerprint (#11728) 2021-12-22 15:29:54 -05:00
network_default.go gofmt all the files 2021-10-01 10:14:28 -04:00
network_linux.go Log network device name during fingerprinting (#11184) 2021-09-16 10:48:31 -04:00
network_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
network_windows.go Disable PowerShell profile and simplify fingerprinting link speed on Windows (#11183) 2021-09-22 11:17:47 -04:00
nomad.go client: add service discovery feature enabled attribute. 2022-03-14 12:42:01 +01:00
nomad_test.go Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
signal.go
signal_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
storage.go
storage_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
storage_unix.go gofmt all the files 2021-10-01 10:14:28 -04:00
storage_windows.go
structs.go Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
vault.go fingerprint: don't clear Consul/Vault attributes on failure (#14673) 2022-09-23 14:45:12 -04:00
vault_test.go fingerprint: don't clear Consul/Vault attributes on failure (#14673) 2022-09-23 14:45:12 -04:00
zstorage_windows.go