open-nomad/nomad/state
Michael Schurter ace5faf948
core: backoff considerably when worker is behind raft (#15523)
Upon dequeuing an evaluation, workers snapshot their state store at the
eval's wait index or later. This ensures we process an eval at a point
in time after it was created or updated. Processing an eval on an old
snapshot could cause any number of problems such as:

1. Since job registration atomically updates an eval and job in a single
   raft entry, scheduling against indexes before that may not have the
   eval's job or may have an older version.
2. The older the scheduler's snapshot, the higher the likelihood
   something has changed in the cluster state which will cause the plan
   applier to reject the scheduler's plan. This could waste work or
   even cause evals to fail needlessly.

However, the workers run in parallel with a new server pulling the
cluster state from a peer. During this time, which may be many minutes
long, the state store is likely far behind the minimum index required
to process evaluations.

This PR addresses this by adding a long backoff period after an eval is
nacked. If the scheduler's index catches up within the backoff period,
the worker will unblock early to dequeue the next eval.

When the server shuts down we'll get a `context.Canceled` error from the state
store method. We need to bubble this error up so that other callers can detect
it. Handle this case separately when waiting after dequeue so that we can warn
on shutdown instead of throwing an ambiguous error message with just the text
"canceled."

While there may be more precise ways to block scheduling until the
server catches up, this approach adds little risk and covers additional
cases where a server may be temporarily behind due to a spike in load or
a saturated network.

For testing, we make `raftSyncLimit` a parameter on the worker's `run` method
so that we can run backoff tests without waiting 30+ seconds. We haven't followed
through and made all the worker globals into worker parameters, because there isn't
much use for that outside of testing, but we can consider it in the future.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-01-24 08:56:35 -05:00
indexer core: add ACL token expiry state, struct, and RPC handling. (#13718) 2022-07-13 15:40:34 +02:00
paginator build: run gofmt on all go source files 2022-08-16 11:14:11 -05:00
autopilot.go autopilot: correctly return errors within state functions. (#12714) 2022-04-21 08:54:50 +02:00
autopilot_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
deployment_events_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
events.go events: add ACL binding rules to core events stream topics. (#15544) 2022-12-14 14:49:49 +01:00
events_test.go events: add ACL binding rules to core events stream topics. (#15544) 2022-12-14 14:49:49 +01:00
iterator.go csi: use node MaxVolumes during scheduling (#7565) 2020-03-31 17:16:47 -04:00
schema.go acl: add binding rule object state schema and functionality. (#15511) 2022-12-14 08:48:18 +01:00
schema_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
state_changes.go events: Use single eventsFromChanges func (#9281) 2020-11-05 13:06:52 -08:00
state_store.go core: backoff considerably when worker is behind raft (#15523) 2023-01-24 08:56:35 -05:00
state_store_acl.go acl: add ACL roles to event stream topic and resolve policies. (#14923) 2022-10-20 09:43:35 +02:00
state_store_acl_binding_rule.go acl: add binding rule object state schema and functionality. (#15511) 2022-12-14 08:48:18 +01:00
state_store_acl_binding_rule_test.go bugfix: unit test for GetACLBindingRules (#15583) 2022-12-20 15:06:09 +01:00
state_store_acl_sso.go bugfix: acl sso auth methods test failures (#15512) 2022-12-09 18:47:32 +01:00
state_store_acl_sso_test.go acl: make sure there is only one default Auth Method per type (#15504) 2022-12-09 14:46:54 +01:00
state_store_acl_test.go cleanup: rename Equals to Equal for consistency (#14759) 2022-10-10 09:28:46 -05:00
state_store_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
state_store_restore.go acl: add binding rule object state schema and functionality. (#15511) 2022-12-14 08:48:18 +01:00
state_store_restore_test.go acl: add binding rule object state schema and functionality. (#15511) 2022-12-14 08:48:18 +01:00
state_store_service_regisration_test.go cleanup: rename Equals to Equal for consistency (#14759) 2022-10-10 09:28:46 -05:00
state_store_service_registration.go cleanup: rename Equals to Equal for consistency (#14759) 2022-10-10 09:28:46 -05:00
state_store_test.go eval delete: move batching of deletes into RPC handler and state (#15117) 2022-11-14 14:08:13 -05:00
state_store_variables.go cleanup: rename Equals to Equal for consistency (#14759) 2022-10-10 09:28:46 -05:00
state_store_variables_oss.go rename SecureVariables to Variables throughout 2022-08-26 16:06:24 -04:00
state_store_variables_test.go cleanup: rename Equals to Equal for consistency (#14759) 2022-10-10 09:28:46 -05:00
testing.go CSI: allow updates to volumes on re-registration (#12167) 2022-03-07 11:06:59 -05:00