ace5faf948
Upon dequeuing an evaluation, workers snapshot their state store at the eval's wait index or later. This ensures we process an eval at a point in time after it was created or updated. Processing an eval on an old snapshot could cause any number of problems, such as:

1. Since job registration atomically updates an eval and job in a single raft entry, scheduling against indexes before that may not have the eval's job or may have an older version.
2. The older the scheduler's snapshot, the higher the likelihood something has changed in the cluster state which will cause the plan applier to reject the scheduler's plan. This could waste work or even cause evals to fail needlessly.

However, the workers run in parallel with a new server pulling the cluster state from a peer. During this time, which may be many minutes long, the state store is likely far behind the minimum index required to process evaluations.

This PR addresses this by adding an additional long backoff period after an eval is nacked. If the scheduler's index catches up within the additional backoff, it will unblock early to dequeue the next eval.

When the server shuts down we'll get a `context.Canceled` error from the state store method. We need to bubble this error up so that other callers can detect it. Handle this case separately when waiting after dequeue so that we can warn on shutdown instead of throwing an ambiguous error message with just the text "canceled."

While there may be more precise ways to block scheduling until the server catches up, this approach adds little risk and covers additional cases where a server may be temporarily behind due to a spike in load or a saturated network.

For testing, we make the `raftSyncLimit` into a parameter on the worker's `run` method so that we can run backoff tests without waiting 30+ seconds. We haven't followed through and made all the worker globals into worker parameters, because there isn't much use outside of testing, but we can consider that in the future.

Co-authored-by: Tim Gross <tgross@hashicorp.com>

||
---|---|---|
indexer | ||
paginator | ||
autopilot.go | ||
autopilot_test.go | ||
deployment_events_test.go | ||
events.go | ||
events_test.go | ||
iterator.go | ||
schema.go | ||
schema_test.go | ||
state_changes.go | ||
state_store.go | ||
state_store_acl.go | ||
state_store_acl_binding_rule.go | ||
state_store_acl_binding_rule_test.go | ||
state_store_acl_sso.go | ||
state_store_acl_sso_test.go | ||
state_store_acl_test.go | ||
state_store_oss.go | ||
state_store_restore.go | ||
state_store_restore_test.go | ||
state_store_service_regisration_test.go | ||
state_store_service_registration.go | ||
state_store_test.go | ||
state_store_variables.go | ||
state_store_variables_oss.go | ||
state_store_variables_test.go | ||
testing.go |