open-nomad/nomad
Michael Schurter ace5faf948
core: backoff considerably when worker is behind raft (#15523)
Upon dequeuing an evaluation, workers snapshot their state store at the
eval's wait index or later. This ensures we process an eval at a point
in time after it was created or updated. Processing an eval on an old
snapshot could cause any number of problems such as:

1. Since job registration atomically updates an eval and job in a single
   raft entry, scheduling against indexes before that may not have the
   eval's job or may have an older version.
2. The older the scheduler's snapshot, the higher the likelihood
   something has changed in the cluster state which will cause the plan
   applier to reject the scheduler's plan. This could waste work or
   even cause evals to be failed needlessly.

However, the workers run in parallel with a new server pulling the
cluster state from a peer. During this time, which may be many minutes
long, the state store is likely far behind the minimum index required
to process evaluations.

This PR addresses this by adding an additional long backoff period after
an eval is nacked. If the scheduler's index catches up within the
additional backoff, it will unblock early to dequeue the next eval.

When the server shuts down we'll get a `context.Canceled` error from the state
store method. We need to bubble this error up so that other callers can detect
it. Handle this case separately when waiting after dequeue so that we can warn
on shutdown instead of throwing an ambiguous error message with just the text
"canceled."

While there may be more precise ways to block scheduling until the
server catches up, this approach adds little risk and covers additional
cases where a server may be temporarily behind due to a spike in load or
a saturated network.

For testing, we make the `raftSyncLimit` into a parameter on the worker's `run` method
so that we can run backoff tests without waiting 30+ seconds. We haven't followed through
and made all the worker globals into worker parameters, because there isn't much
use outside of testing, but we can consider that in the future.

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-01-24 08:56:35 -05:00
deploymentwatcher
drainer
mock rpc: add OIDC login related endpoints. 2023-01-13 13:14:29 +00:00
state core: backoff considerably when worker is behind raft (#15523) 2023-01-24 08:56:35 -05:00
stream event stream: ensure token expiry is correctly checked for subs. 2022-10-27 13:08:05 -04:00
structs consul: correctly interpret missing consul checks as unhealthy (#15822) 2023-01-19 14:01:12 -06:00
volumewatcher volumewatcher: prevent panic on nil volume (#15101) 2022-11-01 16:53:10 -04:00
acl.go Authenticate method improvements (#15734) 2023-01-10 09:46:38 -05:00
acl_endpoint.go Merge branch 'main' into sso/gh-13120-oidc-login 2023-01-18 10:05:31 +00:00
acl_endpoint_test.go rpc: add OIDC login related endpoints. 2023-01-13 13:14:29 +00:00
acl_test.go Authenticate method improvements (#15734) 2023-01-10 09:46:38 -05:00
alloc_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
alloc_endpoint_test.go
autopilot.go autopilot: include only servers from the same region (#15290) 2022-11-17 12:09:36 -05:00
autopilot_oss.go
autopilot_test.go autopilot: include only servers from the same region (#15290) 2022-11-17 12:09:36 -05:00
blocked_evals.go
blocked_evals_stats.go
blocked_evals_stats_test.go
blocked_evals_system.go
blocked_evals_test.go
client_agent_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
client_agent_endpoint_test.go
client_alloc_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
client_alloc_endpoint_test.go
client_csi_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
client_csi_endpoint_test.go remove most static RPC handlers (#15451) 2022-12-02 10:12:05 -05:00
client_fs_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
client_fs_endpoint_test.go
client_rpc.go
client_rpc_test.go
client_stats_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
client_stats_endpoint_test.go
config.go sso: add ACL auth-method HTTP API CRUD endpoints (#15338) 2022-11-23 09:38:02 +01:00
consul.go consul: Removed unused ConsulUsage.Kinds. (#11303) 2022-09-22 10:07:14 -05:00
consul_oss_test.go consul: Removed unused ConsulUsage.Kinds. (#11303) 2022-09-22 10:07:14 -05:00
consul_policy.go
consul_policy_oss_test.go
consul_policy_test.go
consul_test.go
core_sched.go variables: limit rekey eval to half the nack timeout (#15102) 2022-11-01 16:50:50 -04:00
core_sched_test.go keyring: safely handle missing keys and restore GC (#15092) 2022-11-01 15:00:50 -04:00
csi_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
csi_endpoint_test.go remove most static RPC handlers (#15451) 2022-12-02 10:12:05 -05:00
deployment_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
deployment_endpoint_test.go
deployment_watcher_shims.go
drainer_int_test.go
drainer_shims.go
encrypter.go keyring: update handle to state inside replication loop (#15227) 2022-11-17 08:40:12 -05:00
encrypter_test.go keyring: update handle to state inside replication loop (#15227) 2022-11-17 08:40:12 -05:00
endpoints_oss.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
eval_broker.go Rename nomad.broker.total_blocked metric (#15835) 2023-01-20 14:23:56 -05:00
eval_broker_test.go Rename nomad.broker.total_blocked metric (#15835) 2023-01-20 14:23:56 -05:00
eval_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
eval_endpoint_test.go eval delete: move batching of deletes into RPC handler and state (#15117) 2022-11-14 14:08:13 -05:00
event_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
event_endpoint_test.go event stream: ensure token expiry is correctly checked for subs. 2022-10-27 13:08:05 -04:00
fsm.go acl: add binding rule object state schema and functionality. (#15511) 2022-12-14 08:48:18 +01:00
fsm_oss.go
fsm_registry_oss.go
fsm_test.go acl: add binding rule object state schema and functionality. (#15511) 2022-12-14 08:48:18 +01:00
heartbeat.go remove most static RPC handlers (#15451) 2022-12-02 10:12:05 -05:00
heartbeat_test.go
job_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
job_endpoint_hook_connect.go
job_endpoint_hook_connect_test.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
job_endpoint_hook_expose_check.go
job_endpoint_hook_expose_check_test.go
job_endpoint_hook_vault.go
job_endpoint_hook_vault_oss.go
job_endpoint_hooks.go servicedisco: implicit constraint for nomad v1.4 when using nsd checks (#14868) 2022-10-11 08:21:42 -05:00
job_endpoint_hooks_test.go servicedisco: implicit constraint for nomad v1.4 when using nsd checks (#14868) 2022-10-11 08:21:42 -05:00
job_endpoint_oss.go scheduler: create placements for non-register MRD (#15325) 2022-11-25 12:45:34 -05:00
job_endpoint_oss_test.go
job_endpoint_test.go [ui] Adds meta to job list stub and displays a pack logo on the jobs index (#14833) 2022-11-02 16:58:24 -04:00
job_endpoint_validators.go
job_endpoint_validators_test.go
keyring_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
keyring_endpoint_test.go
leader.go core: add ACL binding rule to replication system. (#15555) 2022-12-16 09:08:00 +01:00
leader_oss.go
leader_test.go cleanup: remove usage of consul/sdk/testutil/retry (#15609) 2023-01-02 08:06:20 -06:00
merge.go
namespace_endpoint.go Authenticate method improvements (#15734) 2023-01-10 09:46:38 -05:00
namespace_endpoint_test.go
node_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
node_endpoint_test.go remove most static RPC handlers (#15451) 2022-12-02 10:12:05 -05:00
operator_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
operator_endpoint_test.go ci: swap freeport for portal in packages (#15661) 2023-01-03 11:25:20 -06:00
periodic.go make version checks specific to region (1.4.x) (#14912) 2022-10-17 16:23:51 -04:00
periodic_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
periodic_endpoint_test.go
periodic_test.go
plan_apply.go keyring: safely handle missing keys and restore GC (#15092) 2022-11-01 15:00:50 -04:00
plan_apply_node_tracker.go
plan_apply_node_tracker_test.go
plan_apply_oss.go
plan_apply_pool.go
plan_apply_pool_test.go
plan_apply_test.go fix panic from keyring raft entries being written during upgrade (#14821) 2022-10-06 12:47:02 -04:00
plan_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
plan_endpoint_test.go
plan_normalization_test.go
plan_queue.go Add missing timer reset (#15134) 2022-11-03 18:57:57 -04:00
plan_queue_test.go
raft_rpc.go
regions_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
regions_endpoint_test.go
rpc.go
rpc_test.go Pre forwarding authentication (#15417) 2022-12-06 14:44:03 -05:00
scaling_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
scaling_endpoint_test.go
search_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
search_endpoint_oss.go
search_endpoint_test.go
serf.go
serf_test.go
server.go Add raft snapshot configuration options (#15522) 2023-01-20 14:21:51 -05:00
server_setup.go
server_setup_oss.go
server_test.go Add raft snapshot configuration options (#15522) 2023-01-20 14:21:51 -05:00
service_registration_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
service_registration_endpoint_test.go deps: update set and test (#14680) 2022-09-26 08:28:03 -05:00
stats_fetcher.go
stats_fetcher_test.go
status_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
status_endpoint_test.go
system_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
system_endpoint_test.go
testing.go ci: swap freeport for portal in packages (#15661) 2023-01-03 11:25:20 -06:00
testing_oss.go
timetable.go
timetable_test.go
util.go make version checks specific to region (1.4.x) (#14912) 2022-10-17 16:23:51 -04:00
util_test.go make version checks specific to region (1.4.x) (#14912) 2022-10-17 16:23:51 -04:00
variables_endpoint.go provide RPCContext to all RPC handlers (#15430) 2022-12-01 10:05:15 -05:00
variables_endpoint_test.go deps: update shoenig/test to v0.6.0 (#15715) 2023-01-09 09:37:08 -06:00
vault.go vault: configure user agent on Nomad vault clients (#15745) 2023-01-10 10:39:45 -06:00
vault_test.go
vault_testing.go
worker.go core: backoff considerably when worker is behind raft (#15523) 2023-01-24 08:56:35 -05:00
worker_string_schedulerworkerstatus.go
worker_string_workerstatus.go
worker_test.go core: backoff considerably when worker is behind raft (#15523) 2023-01-24 08:56:35 -05:00