open-nomad/client
Tim Gross 17aee4d69c
fingerprint: don't clear Consul/Vault attributes on failure (#14673)
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint, which results in evaluations for all of its jobs and all the system
jobs in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evaluation for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters with thousands of nodes, we've seen this create a large backlog of
evaluations.

This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
2022-09-23 14:45:12 -04:00
allocdir build: run gofmt on all go source files 2022-08-16 11:14:11 -05:00
allochealth 2 small data race fixes in logmon and check tests (#14538) 2022-09-13 12:54:06 -07:00
allocrunner connect: add nomad env to envoy bootstrap (#12959) 2022-09-22 13:18:18 -05:00
allocwatcher test: use `T.TempDir` to create temporary test directory (#12853) 2022-05-12 11:42:40 -04:00
config cleanup more helper updates (#14638) 2022-09-21 14:53:25 -05:00
consul Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
devicemanager cleanup: replace TypeToPtr helper methods with pointer.Of (#14151) 2022-08-17 18:26:34 +02:00
dynamicplugins build: run gofmt on all go source files 2022-08-16 11:14:11 -05:00
fingerprint fingerprint: don't clear Consul/Vault attributes on failure (#14673) 2022-09-23 14:45:12 -04:00
interfaces artifact: fix numerous go-getter security issues 2022-05-24 16:29:39 -04:00
lib CI: make `make check` clean on macOS (#14528) 2022-09-09 12:26:34 -04:00
logmon 2 small data race fixes in logmon and check tests (#14538) 2022-09-13 12:54:06 -07:00
pluginmanager cleanup more helper updates (#14638) 2022-09-21 14:53:25 -05:00
servers feat: remove dependency to consul/lib 2022-04-09 13:22:44 +02:00
serviceregistration Add Namespace, Job and Group to envoy stats (#14311) 2022-09-22 10:38:21 -04:00
state cleanup more helper updates (#14638) 2022-09-21 14:53:25 -05:00
stats ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
structs client: add support for checks in nomad services 2022-07-12 17:09:50 -05:00
taskenv connect: interpolate task env in config values (#14445) 2022-09-02 15:00:28 -04:00
testutil client: cgroups v2 code review followup 2022-03-24 13:40:42 -05:00
vaultclient ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
acl.go client: fix data races in config handling (#14139) 2022-08-18 16:32:04 -07:00
acl_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
agent_endpoint.go client: fix data races in config handling (#14139) 2022-08-18 16:32:04 -07:00
agent_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
alloc_endpoint.go Task lifecycle restart (#14127) 2022-08-24 17:43:07 -04:00
alloc_endpoint_test.go Task lifecycle restart (#14127) 2022-08-24 17:43:07 -04:00
alloc_watcher_e2e_test.go job_hooks: add implicit constraint when using Consul for services. (#12602) 2022-04-20 14:09:13 +02:00
client.go cleanup more helper updates (#14638) 2022-09-21 14:53:25 -05:00
client_stats_endpoint.go Server side impl + touch ups 2018-02-15 13:59:02 -08:00
client_stats_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_test.go client: refactor cpuset manager initialization 2022-08-25 11:18:43 -05:00
csi_endpoint.go CSI: allow updates to volumes on re-registration (#12167) 2022-03-07 11:06:59 -05:00
csi_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
driver_manager_test.go client: fix data races in config handling (#14139) 2022-08-18 16:32:04 -07:00
enterprise_client_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
fingerprint_manager.go chore: fixup inconsistent method receiver names. (#11704) 2021-12-20 11:44:21 +01:00
fingerprint_manager_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
fs_endpoint.go cleanup: replace TypeToPtr helper methods with pointer.Of (#14151) 2022-08-17 18:26:34 +02:00
fs_endpoint_test.go raw_exec: make raw exec driver work with cgroups v2 2022-04-04 16:11:38 -05:00
gc.go chore: fix incorrect docstring formatting. 2021-08-30 11:08:12 +02:00
gc_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
heartbeatstop.go client: fix race in heartbeat tracker (#14119) 2022-08-16 09:41:08 -07:00
heartbeatstop_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
node_updater.go client: fix data races in config handling (#14139) 2022-08-18 16:32:04 -07:00
rpc.go client: fix data races in config handling (#14139) 2022-08-18 16:32:04 -07:00
rpc_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
testing.go client: fix data races in config handling (#14139) 2022-08-18 16:32:04 -07:00
util.go Revert "client: defensive against getting stale alloc updates" 2020-06-19 15:39:44 -04:00