Commit graph

12 commits

Author SHA1 Message Date
Michael Schurter b554f9344a
fingerprint: lengthen Vault check after seen (#14693)
Extension of #14673

Once Vault is initially fingerprinted, extend the period since changes
should be infrequent and the fingerprint is relatively expensive since
it is contacting a central Vault server.

Also move the period timer reset *after* the fingerprint. This is
similar to #9435 where the idea is to ensure the retry period starts
*after* the operation is attempted. 15s will be the *minimum* time
between fingerprints now instead of the *maximum* time between
fingerprints.

In the case of Vault fingerprinting, the original behavior might cause
the following:

1. Timer is reset to 15s
2. Fingerprint takes 16s
3. Timer has already elapsed so we immediately Fingerprint again

Even if fingerprinting Vault only takes a few seconds, that may very
well be due to excessive load and backing off our fingerprints is
desirable. The new bevahior ensures we always wait at least 15s between
fingerprint attempts and should allow some natural jittering based on
server load and network latency.
2022-09-26 12:14:19 -07:00
Tim Gross 17aee4d69c
fingerprint: don't clear Consul/Vault attributes on failure (#14673)
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint and result in evaluations for all its jobs and all the system jobs
in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evalution for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.

This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
2022-09-23 14:45:12 -04:00
Seth Hoenig 2631659551 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Danielle Tomlinson 66c521ca17 client: Move fingerprint structs to pkg
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically, drivers. Due to
client/config also depending on the plugin_loader.

It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
2018-12-01 17:10:39 +01:00
Alex Dadgar 8504505c0d client uses passed logger and fix fingerprinters 2018-10-16 16:53:30 -07:00
Alex Dadgar 300b1a7a15 Tests only use testlog package logger 2018-06-13 15:40:56 -07:00
Chelsea Holland Komlo b8e8064835 code review fixup 2018-01-31 18:34:03 -05:00
Chelsea Holland Komlo 7b53474a6e add applicable boolean to fingerprint response
public fields and remove getter functions
2018-01-31 13:21:45 -05:00
Chelsea Holland Komlo 7c19de797c create safe getters and setters for fingerprint response 2018-01-26 11:22:05 -05:00
Chelsea Holland Komlo 9a8344333b refactor Fingerprint to request/response construct 2018-01-24 11:54:02 -05:00
Alex Dadgar a9c786a4fe Make test Vault pick random ports 2017-07-25 17:40:59 -07:00
Alex Dadgar 8ca3a16825 Fingerprint 2016-09-01 11:10:14 -07:00