open-nomad/nomad
Luiz Aoqui b656981cf0
Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler worker and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to concurrent scheduling taking up the intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that it should refresh its state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even though it is always
rejected.

While the root cause is still unknown, this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at once, the operation is rate
limited to 5 nodes every 30 minutes, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
deploymentwatcher ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
drainer CSI: node drain should end once only plugins remain (#12846) 2022-05-03 10:20:22 -04:00
mock SV: CAS: Implement Check and Set for Delete and Upsert (#13429) 2022-07-11 13:34:06 -04:00
state Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
stream events: fixup service events and rename topic to service. 2022-04-05 08:25:22 +01:00
structs Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
volumewatcher core: allow deleting of evaluations (#13492) 2022-07-06 16:30:11 +02:00
acl.go secure variables ACL policies (#13294) 2022-07-11 13:34:05 -04:00
acl_endpoint.go Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
acl_endpoint_test.go Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
acl_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
alloc_endpoint.go api: apply new ACL check for wildcard namespace (#13608) 2022-07-06 16:17:16 -04:00
alloc_endpoint_test.go api: apply new ACL check for wildcard namespace (#13608) 2022-07-06 16:17:16 -04:00
autopilot.go implement MinQuorum 2020-02-16 16:04:59 -06:00
autopilot_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
blocked_evals.go core: add tests for blocked evals math 2022-05-24 09:05:18 -05:00
blocked_evals_stats.go core: add tests for blocked evals math 2022-05-24 09:05:18 -05:00
blocked_evals_stats_test.go core: test duplicated blocked eval stats 2022-05-24 15:44:06 -04:00
blocked_evals_system.go blocked_evals system evals indexed by job and node 2019-07-18 10:32:12 -04:00
blocked_evals_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_agent_endpoint.go json handles were moved to a new package in #10202 2021-04-02 13:31:10 +00:00
client_agent_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_alloc_endpoint.go Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
client_alloc_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_csi_endpoint.go CSI: volume snapshot 2021-04-01 11:16:52 -04:00
client_csi_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_fs_endpoint.go Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
client_fs_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_rpc.go core: remove all traces of unused protocol version 2022-02-18 16:12:36 -08:00
client_rpc_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
client_stats_endpoint.go server 2018-09-15 16:23:13 -07:00
client_stats_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
config.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
consul.go adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
consul_oss_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
consul_policy.go cleanup: purge github.com/pkg/errors 2022-04-01 19:24:02 -05:00
consul_policy_oss_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
consul_policy_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
consul_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
core_sched.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
core_sched_test.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
csi_endpoint.go CSI: skip node unpublish on GC'd or down nodes (#13301) 2022-06-09 11:33:22 -04:00
csi_endpoint_test.go CSI: skip node unpublish on GC'd or down nodes (#13301) 2022-06-09 11:33:22 -04:00
deployment_endpoint.go api: apply consistent behaviour of the reverse query parameter (#12244) 2022-03-11 19:44:52 -05:00
deployment_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
deployment_watcher_shims.go consul: plubming for specifying consul namespace in job/group 2021-04-05 10:03:19 -06:00
drainer_int_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
drainer_shims.go set node.StatusUpdatedAt in raft 2019-05-21 16:13:32 -04:00
encrypter.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
encrypter_test.go implement Encrypt/Decrypt methods of encrypter (#13375) 2022-07-11 13:34:05 -04:00
endpoints_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
eval_broker.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
eval_broker_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
eval_endpoint.go Allow wildcard for Evaluations API (#13530) 2022-07-11 16:42:17 -04:00
eval_endpoint_test.go Allow wildcard for Evaluations API (#13530) 2022-07-11 16:42:17 -04:00
event_endpoint.go Event Stream: Track ACL changes, unsubscribe on invalidating changes (#9447) 2020-12-01 11:11:34 -05:00
event_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
fsm.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
fsm_oss.go chore: ensure consistent file naming for non-enterprise files. 2022-01-13 11:32:16 +01:00
fsm_registry_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
fsm_test.go Secure Variables: Seperate Encrypted and Decrypted structs (#13355) 2022-07-11 13:34:05 -04:00
heartbeat.go reconciler: Handle canaries when client disconnects (#12539) 2022-04-21 10:05:58 -04:00
heartbeat_test.go heartbeat: Handle transitioning from disconnected to down (#12559) 2022-04-15 09:47:45 -04:00
job_endpoint.go api: apply new ACL check for wildcard namespace (#13608) 2022-07-06 16:17:16 -04:00
job_endpoint_hook_connect.go adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
job_endpoint_hook_connect_test.go adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
job_endpoint_hook_expose_check.go cleanup: purge github.com/pkg/errors 2022-04-01 19:24:02 -05:00
job_endpoint_hook_expose_check_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
job_endpoint_hook_vault.go cli: correctly use and validate job with vault token set 2022-05-19 12:13:34 -05:00
job_endpoint_hook_vault_oss.go Support Vault entity aliases (#12449) 2022-04-05 14:18:10 -04:00
job_endpoint_hooks.go job_hooks: add implicit constraint when using Consul for services. (#12602) 2022-04-20 14:09:13 +02:00
job_endpoint_hooks_test.go job_hooks: add implicit constraint when using Consul for services. (#12602) 2022-04-20 14:09:13 +02:00
job_endpoint_oss.go Support Vault entity aliases (#12449) 2022-04-05 14:18:10 -04:00
job_endpoint_oss_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
job_endpoint_test.go vault: revert support for entity aliases (#12723) 2022-04-22 10:46:34 -04:00
job_endpoint_validators.go cleanup: purge github.com/pkg/errors 2022-04-01 19:24:02 -05:00
job_endpoint_validators_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
keyring_endpoint.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
keyring_endpoint_test.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
leader.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
leader_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
leader_test.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
merge.go
namespace_endpoint.go Fix some errcheck errors (#9811) 2021-01-14 12:46:35 -08:00
namespace_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
node_endpoint.go core: emit node evals only for sys jobs in dc (#12955) 2022-07-06 14:35:18 -07:00
node_endpoint_test.go core: emit node evals only for sys jobs in dc (#12955) 2022-07-06 14:35:18 -07:00
operator_endpoint.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
operator_endpoint_test.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
periodic.go periodic: always reset periodic children status 2021-03-25 11:27:09 -04:00
periodic_endpoint.go dispatch-job capability to dispatch periodic jobs 2020-10-27 16:33:01 -04:00
periodic_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
periodic_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
plan_apply.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
plan_apply_node_tracker.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
plan_apply_node_tracker_test.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
plan_apply_oss.go chore: ensure consistent file naming for non-enterprise files. 2022-01-13 11:32:16 +01:00
plan_apply_pool.go Log reason a plan gets rejected per node. 2017-07-13 17:14:02 -07:00
plan_apply_pool_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
plan_apply_test.go plan_apply: Add missing unit test for validating plans for disconnected clients (#12495) 2022-04-07 09:58:09 -04:00
plan_endpoint.go fix mTLS certificate check on agent to agent RPCs (#11998) 2022-02-04 20:35:20 -05:00
plan_endpoint_test.go fix deadlock in plan_apply (#13407) 2022-06-23 12:06:27 -04:00
plan_normalization_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
plan_queue.go cleanup: prevent leaks from time.After 2022-02-02 14:32:26 -06:00
plan_queue_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
raft_rpc.go Refactor 2018-02-15 13:59:00 -08:00
regions_endpoint.go server 2018-09-15 16:23:13 -07:00
regions_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
rpc.go feat: remove dependency to consul/lib 2022-04-09 13:22:44 +02:00
rpc_test.go core: allow deleting of evaluations (#13492) 2022-07-06 16:30:11 +02:00
scaling_endpoint.go chore: fixup inconsistent method receiver names. (#11704) 2021-12-20 11:44:21 +01:00
scaling_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
search_endpoint.go Implement HTTP search API for Variables (#13257) 2022-07-11 13:34:05 -04:00
search_endpoint_oss.go Implement HTTP search API for Variables (#13257) 2022-07-11 13:34:05 -04:00
search_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
secure_variables_endpoint.go SV: CLI: var list command (#13707) 2022-07-12 12:49:39 -04:00
secure_variables_endpoint_oss.go implement quota tracking for secure variablees (#13453) 2022-07-11 13:34:06 -04:00
secure_variables_endpoint_test.go secure vars: fix enterprise test by upserting the namespace (#13719) 2022-07-12 12:05:52 -04:00
serf.go core: remove all traces of unused protocol version 2022-02-18 16:12:36 -08:00
serf_test.go test: use `T.TempDir` to create temporary test directory (#12853) 2022-05-12 11:42:40 -04:00
server.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
server_setup_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
server_test.go test: use `T.TempDir` to create temporary test directory (#12853) 2022-05-12 11:42:40 -04:00
service_registration_endpoint.go secure variables ACL policies (#13294) 2022-07-11 13:34:05 -04:00
service_registration_endpoint_test.go workload identity (#13223) 2022-07-11 13:34:05 -04:00
stats_fetcher.go core: remove all traces of unused protocol version 2022-02-18 16:12:36 -08:00
stats_fetcher_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
status_endpoint.go core: remove all traces of unused protocol version 2022-02-18 16:12:36 -08:00
status_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
system_endpoint.go chore: fix incorrect docstring formatting. 2021-08-30 11:08:12 +02:00
system_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
testing.go keystore serialization (#13106) 2022-07-11 13:34:04 -04:00
testing_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
timetable.go vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:45:21 -04:00
timetable_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
util.go core: add deprecated mvn tag to serf (#12327) 2022-03-24 14:44:21 -04:00
util_test.go disconnected clients: ensure servers meet minimum required version (#12202) 2022-04-05 17:12:23 -04:00
vault.go vault: revert support for entity aliases (#12723) 2022-04-22 10:46:34 -04:00
vault_test.go vault: revert support for entity aliases (#12723) 2022-04-22 10:46:34 -04:00
vault_testing.go vault: revert support for entity aliases (#12723) 2022-04-22 10:46:34 -04:00
worker.go disconnected clients: ensure servers meet minimum required version (#12202) 2022-04-05 17:12:23 -04:00
worker_string_schedulerworkerstatus.go Make number of scheduler workers reloadable (#11593) 2022-01-06 11:56:13 -05:00
worker_string_workerstatus.go Make number of scheduler workers reloadable (#11593) 2022-01-06 11:56:13 -05:00
worker_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00