open-nomad/nomad
Luiz Aoqui b656981cf0
Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler work and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at one, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
..
deploymentwatcher ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
drainer CSI: node drain should end once only plugins remain (#12846) 2022-05-03 10:20:22 -04:00
mock SV: CAS: Implement Check and Set for Delete and Upsert (#13429) 2022-07-11 13:34:06 -04:00
state Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
stream events: fixup service events and rename topic to service. 2022-04-05 08:25:22 +01:00
structs Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
volumewatcher core: allow deleting of evaluations (#13492) 2022-07-06 16:30:11 +02:00
acl.go secure variables ACL policies (#13294) 2022-07-11 13:34:05 -04:00
acl_endpoint.go Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
acl_endpoint_test.go Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
acl_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
alloc_endpoint.go api: apply new ACL check for wildcard namespace (#13608) 2022-07-06 16:17:16 -04:00
alloc_endpoint_test.go api: apply new ACL check for wildcard namespace (#13608) 2022-07-06 16:17:16 -04:00
autopilot.go
autopilot_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
blocked_evals.go core: add tests for blocked evals math 2022-05-24 09:05:18 -05:00
blocked_evals_stats.go core: add tests for blocked evals math 2022-05-24 09:05:18 -05:00
blocked_evals_stats_test.go core: test duplicated blocked eval stats 2022-05-24 15:44:06 -04:00
blocked_evals_system.go
blocked_evals_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_agent_endpoint.go
client_agent_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_alloc_endpoint.go
client_alloc_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_csi_endpoint.go
client_csi_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_fs_endpoint.go
client_fs_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_rpc.go
client_rpc_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_stats_endpoint.go
client_stats_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
config.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
consul.go adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
consul_oss_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
consul_policy.go cleanup: purge github.com/pkg/errors 2022-04-01 19:24:02 -05:00
consul_policy_oss_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
consul_policy_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
consul_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
core_sched.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
core_sched_test.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
csi_endpoint.go CSI: skip node unpublish on GC'd or down nodes (#13301) 2022-06-09 11:33:22 -04:00
csi_endpoint_test.go CSI: skip node unpublish on GC'd or down nodes (#13301) 2022-06-09 11:33:22 -04:00
deployment_endpoint.go api: apply consistent behaviour of the reverse query parameter (#12244) 2022-03-11 19:44:52 -05:00
deployment_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
deployment_watcher_shims.go
drainer_int_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
drainer_shims.go
encrypter.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
encrypter_test.go implement Encrypt/Decrypt methods of encrypter (#13375) 2022-07-11 13:34:05 -04:00
endpoints_oss.go
eval_broker.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
eval_broker_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
eval_endpoint.go Allow wildcard for Evaluations API (#13530) 2022-07-11 16:42:17 -04:00
eval_endpoint_test.go Allow wildcard for Evaluations API (#13530) 2022-07-11 16:42:17 -04:00
event_endpoint.go
event_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
fsm.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
fsm_oss.go
fsm_registry_oss.go
fsm_test.go Secure Variables: Seperate Encrypted and Decrypted structs (#13355) 2022-07-11 13:34:05 -04:00
heartbeat.go reconciler: Handle canaries when client disconnects (#12539) 2022-04-21 10:05:58 -04:00
heartbeat_test.go heartbeat: Handle transitioning from disconnected to down (#12559) 2022-04-15 09:47:45 -04:00
job_endpoint.go api: apply new ACL check for wildcard namespace (#13608) 2022-07-06 16:17:16 -04:00
job_endpoint_hook_connect.go adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
job_endpoint_hook_connect_test.go adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
job_endpoint_hook_expose_check.go cleanup: purge github.com/pkg/errors 2022-04-01 19:24:02 -05:00
job_endpoint_hook_expose_check_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
job_endpoint_hook_vault.go cli: correctly use and validate job with vault token set 2022-05-19 12:13:34 -05:00
job_endpoint_hook_vault_oss.go Support Vault entity aliases (#12449) 2022-04-05 14:18:10 -04:00
job_endpoint_hooks.go job_hooks: add implicit constraint when using Consul for services. (#12602) 2022-04-20 14:09:13 +02:00
job_endpoint_hooks_test.go job_hooks: add implicit constraint when using Consul for services. (#12602) 2022-04-20 14:09:13 +02:00
job_endpoint_oss.go Support Vault entity aliases (#12449) 2022-04-05 14:18:10 -04:00
job_endpoint_oss_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
job_endpoint_test.go vault: revert support for entity aliases (#12723) 2022-04-22 10:46:34 -04:00
job_endpoint_validators.go cleanup: purge github.com/pkg/errors 2022-04-01 19:24:02 -05:00
job_endpoint_validators_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
keyring_endpoint.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
keyring_endpoint_test.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
leader.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
leader_oss.go
leader_test.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
merge.go
namespace_endpoint.go
namespace_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
node_endpoint.go core: emit node evals only for sys jobs in dc (#12955) 2022-07-06 14:35:18 -07:00
node_endpoint_test.go core: emit node evals only for sys jobs in dc (#12955) 2022-07-06 14:35:18 -07:00
operator_endpoint.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
operator_endpoint_test.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
periodic.go
periodic_endpoint.go
periodic_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
periodic_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
plan_apply.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
plan_apply_node_tracker.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
plan_apply_node_tracker_test.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
plan_apply_oss.go
plan_apply_pool.go
plan_apply_pool_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
plan_apply_test.go plan_apply: Add missing unit test for validating plans for disconnected clients (#12495) 2022-04-07 09:58:09 -04:00
plan_endpoint.go
plan_endpoint_test.go fix deadlock in plan_apply (#13407) 2022-06-23 12:06:27 -04:00
plan_normalization_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
plan_queue.go
plan_queue_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
raft_rpc.go
regions_endpoint.go
regions_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
rpc.go feat: remove dependency to consul/lib 2022-04-09 13:22:44 +02:00
rpc_test.go core: allow deleting of evaluations (#13492) 2022-07-06 16:30:11 +02:00
scaling_endpoint.go
scaling_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
search_endpoint.go Implement HTTP search API for Variables (#13257) 2022-07-11 13:34:05 -04:00
search_endpoint_oss.go Implement HTTP search API for Variables (#13257) 2022-07-11 13:34:05 -04:00
search_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
secure_variables_endpoint.go SV: CLI: var list command (#13707) 2022-07-12 12:49:39 -04:00
secure_variables_endpoint_oss.go implement quota tracking for secure variablees (#13453) 2022-07-11 13:34:06 -04:00
secure_variables_endpoint_test.go secure vars: fix enterprise test by upserting the namespace (#13719) 2022-07-12 12:05:52 -04:00
serf.go
serf_test.go test: use `T.TempDir` to create temporary test directory (#12853) 2022-05-12 11:42:40 -04:00
server.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
server_setup_oss.go
server_test.go test: use `T.TempDir` to create temporary test directory (#12853) 2022-05-12 11:42:40 -04:00
service_registration_endpoint.go secure variables ACL policies (#13294) 2022-07-11 13:34:05 -04:00
service_registration_endpoint_test.go workload identity (#13223) 2022-07-11 13:34:05 -04:00
stats_fetcher.go
stats_fetcher_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
status_endpoint.go
status_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
system_endpoint.go
system_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
testing.go keystore serialization (#13106) 2022-07-11 13:34:04 -04:00
testing_oss.go
timetable.go
timetable_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
util.go core: add deprecated mvn tag to serf (#12327) 2022-03-24 14:44:21 -04:00
util_test.go disconnected clients: ensure servers meet minimum required version (#12202) 2022-04-05 17:12:23 -04:00
vault.go vault: revert support for entity aliases (#12723) 2022-04-22 10:46:34 -04:00
vault_test.go vault: revert support for entity aliases (#12723) 2022-04-22 10:46:34 -04:00
vault_testing.go vault: revert support for entity aliases (#12723) 2022-04-22 10:46:34 -04:00
worker.go disconnected clients: ensure servers meet minimum required version (#12202) 2022-04-05 17:12:23 -04:00
worker_string_schedulerworkerstatus.go
worker_string_workerstatus.go
worker_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00