open-nomad

Michael Schurter ace5faf948 core: backoff considerably when worker is behind raft (#15523 )

Upon dequeuing an evaluation, workers snapshot their state store at the eval's wait index or later. This ensures we process an eval at a point in time after it was created or updated. Processing an eval on an old snapshot could cause any number of problems, such as:

1. Since job registration atomically updates an eval and job in a single raft entry, scheduling against indexes before that may not have the eval's job or may have an older version.
2. The older the scheduler's snapshot, the higher the likelihood that something has changed in the cluster state which will cause the plan applier to reject the scheduler's plan. This could waste work or even cause evals to fail needlessly.

However, the workers run in parallel with a new server pulling the cluster state from a peer. During this time, which may be many minutes long, the state store is likely far behind the minimum index required to process evaluations.

This PR addresses this by adding an additional long backoff period after an eval is nacked. If the scheduler's index catches up within the additional backoff, it will unblock early to dequeue the next eval.

When the server shuts down we'll get a `context.Canceled` error from the state store method. We need to bubble this error up so that other callers can detect it. Handle this case separately when waiting after dequeue so that we can warn on shutdown instead of throwing an ambiguous error message with just the text "canceled."

While there may be more precise ways to block scheduling until the server catches up, this approach adds little risk and covers additional cases where a server may be temporarily behind due to a spike in load or a saturated network.

For testing, we make the `raftSyncLimit` into a parameter on the worker's `run` method so that we can run backoff tests without waiting 30+ seconds. We haven't followed through and made all the worker globals into worker parameters, because there isn't much use outside of testing, but we can consider that in the future.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-01-24 08:56:35 -05:00
..
indexer	core: add ACL token expiry state, struct, and RPC handling. (#13718 )	2022-07-13 15:40:34 +02:00
paginator	build: run gofmt on all go source files	2022-08-16 11:14:11 -05:00
autopilot.go	autopilot: correctly return errors within state functions. (#12714 )	2022-04-21 08:54:50 +02:00
autopilot_test.go	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
deployment_events_test.go	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
events.go	events: add ACL binding rules to core events stream topics. (#15544 )	2022-12-14 14:49:49 +01:00
events_test.go	events: add ACL binding rules to core events stream topics. (#15544 )	2022-12-14 14:49:49 +01:00
iterator.go	csi: use node MaxVolumes during scheduling (#7565 )	2020-03-31 17:16:47 -04:00
schema.go	acl: add binding rule object state schema and functionality. (#15511 )	2022-12-14 08:48:18 +01:00
schema_test.go	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
state_changes.go	events: Use single eventsFromChanges func (#9281 )	2020-11-05 13:06:52 -08:00
state_store.go	core: backoff considerably when worker is behind raft (#15523 )	2023-01-24 08:56:35 -05:00
state_store_acl.go	acl: add ACL roles to event stream topic and resolve policies. (#14923 )	2022-10-20 09:43:35 +02:00
state_store_acl_binding_rule.go	acl: add binding rule object state schema and functionality. (#15511 )	2022-12-14 08:48:18 +01:00
state_store_acl_binding_rule_test.go	bugfix: unit test for GetACLBindingRules (#15583 )	2022-12-20 15:06:09 +01:00
state_store_acl_sso.go	bugfix: acl sso auth methods test failures (#15512 )	2022-12-09 18:47:32 +01:00
state_store_acl_sso_test.go	acl: make sure there is only one default Auth Method per type (#15504 )	2022-12-09 14:46:54 +01:00
state_store_acl_test.go	cleanup: rename Equals to Equal for consistency (#14759 )	2022-10-10 09:28:46 -05:00
state_store_oss.go	gofmt all the files	2021-10-01 10:14:28 -04:00
state_store_restore.go	acl: add binding rule object state schema and functionality. (#15511 )	2022-12-14 08:48:18 +01:00
state_store_restore_test.go	acl: add binding rule object state schema and functionality. (#15511 )	2022-12-14 08:48:18 +01:00
state_store_service_regisration_test.go	cleanup: rename Equals to Equal for consistency (#14759 )	2022-10-10 09:28:46 -05:00
state_store_service_registration.go	cleanup: rename Equals to Equal for consistency (#14759 )	2022-10-10 09:28:46 -05:00
state_store_test.go	eval delete: move batching of deletes into RPC handler and state (#15117 )	2022-11-14 14:08:13 -05:00
state_store_variables.go	cleanup: rename Equals to Equal for consistency (#14759 )	2022-10-10 09:28:46 -05:00
state_store_variables_oss.go	rename SecureVariables to Variables throughout	2022-08-26 16:06:24 -04:00
state_store_variables_test.go	cleanup: rename Equals to Equal for consistency (#14759 )	2022-10-10 09:28:46 -05:00
testing.go	CSI: allow updates to volumes on re-registration (#12167 )	2022-03-07 11:06:59 -05:00