open-nomad/command/agent
Luiz Aoqui b656981cf0
Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler worker and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

By the time the final plan reaches the leader plan applier, it may no
longer be valid because a concurrent scheduling operation has taken up
the intended resources. In these situations the plan applier will notify
the worker that the plan was rejected and that it should refresh its
state before trying again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even though it is always
rejected.

While the root cause is still unknown, this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at once, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
..
consul docs: add docs and tests for tagged_addresses 2022-05-31 13:02:48 -05:00
event
host ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
monitor ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
pprof ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
test-resources consul-template: revert function_denylist logic (#12071) 2022-04-18 13:57:56 -04:00
testdata Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
acl_endpoint.go Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
acl_endpoint_test.go Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
agent.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
agent_endpoint.go api: prevent excessive CPU load on job parse 2022-02-09 19:51:47 -05:00
agent_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
agent_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
agent_test.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
alloc_endpoint.go http: add alloc service registration agent HTTP endpoint. 2022-03-03 12:13:32 +01:00
alloc_endpoint_test.go test: move remaining tests to use ci.Parallel. 2022-03-24 08:45:13 +01:00
bindata_assetfs.go Generate files for 1.3.1 release 2022-05-24 16:29:46 -04:00
command.go core: merge reserved_ports into host_networks (#13651) 2022-07-12 14:40:25 -07:00
command_test.go feat: Warn if bootstrap_expect is even number (#12961) 2022-06-06 15:22:59 +02:00
config.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
config_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
config_parse.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
config_parse_test.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
config_test.go Track plan rejection history and automatically mark clients as ineligible (#13421) 2022-07-12 18:40:20 -04:00
csi_endpoint.go CSI: replace structs->api with serialization extension (#12583) 2022-04-15 14:29:34 -04:00
csi_endpoint_test.go CSI: replace structs->api with serialization extension (#12583) 2022-04-15 14:29:34 -04:00
deployment_endpoint.go initial base work for implementing sorting and filter across API endpoints (#12076) 2022-02-16 14:34:36 -05:00
deployment_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
eval_endpoint.go core: allow deleting of evaluations (#13492) 2022-07-06 16:30:11 +02:00
eval_endpoint_test.go core: allow deleting of evaluations (#13492) 2022-07-06 16:30:11 +02:00
event_endpoint.go return 405 on non-GET requests to /v1/event/stream (fixes #9526) (#9564) 2020-12-08 13:09:20 -05:00
event_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
fs_endpoint.go api: return 404 for alloc FS list/stat endpoints (#11482) 2021-11-17 11:15:07 -05:00
fs_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
helpers.go
helpers_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
http.go keyring HTTP API (#13077) 2022-07-11 13:34:04 -04:00
http_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
http_stdlog.go
http_stdlog_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
http_test.go parse ACL token from authorization header (#12534) 2022-06-06 15:51:02 -04:00
job_endpoint.go CSI: make plugin health_timeout configurable in csi_plugin stanza (#13340) 2022-06-14 10:04:16 -04:00
job_endpoint_test.go adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
keyring.go
keyring_endpoint.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
keyring_endpoint_test.go core job for secure variables re-key (#13440) 2022-07-11 13:34:06 -04:00
keyring_test.go test: use T.TempDir to create temporary test directory (#12853) 2022-05-12 11:42:40 -04:00
log_file.go prevent active log from being overwritten when agent starts (#11386) 2021-10-26 20:57:07 -04:00
log_file_bsd.go freebsd: build fix for ARM7 32-bit (#11854) 2022-01-14 12:25:32 -05:00
log_file_linux.go prevent active log from being overwritten when agent starts (#11386) 2021-10-26 20:57:07 -04:00
log_file_test.go test: use T.TempDir to create temporary test directory (#12853) 2022-05-12 11:42:40 -04:00
log_file_windows.go prevent active log from being overwritten when agent starts (#11386) 2021-10-26 20:57:07 -04:00
log_levels.go ci: set test log level off in gha 2022-03-25 13:43:33 -05:00
log_levels_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
metrics_endpoint.go agent: return req error if prometheus metrics are disabled. 2021-03-09 15:28:58 +01:00
metrics_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
namespace_endpoint.go
namespace_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
node_endpoint.go Add os to NodeListStub struct. (#12497) 2022-04-15 17:22:45 -07:00
node_endpoint_test.go Add os to NodeListStub struct. (#12497) 2022-04-15 17:22:45 -07:00
operator_endpoint.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
operator_endpoint_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
operator_endpoint_test.go core: allow pausing and un-pausing of leader broker routine (#13045) 2022-07-06 16:13:48 +02:00
plugins.go
region_endpoint.go
region_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
retry_join.go
retry_join_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
scaling_endpoint.go Add gocritic to golangci-lint config (#9556) 2020-12-08 12:47:04 -08:00
scaling_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
search_endpoint.go api: implement fuzzy search API 2021-04-16 16:36:07 -06:00
search_endpoint_test.go Implement HTTP search API for Variables (#13257) 2022-07-11 13:34:05 -04:00
secure_variable_endpoint.go SV: CAS: Implement Check and Set for Delete and Upsert (#13429) 2022-07-11 13:34:06 -04:00
secure_variable_endpoint_test.go SV: CAS: Implement Check and Set for Delete and Upsert (#13429) 2022-07-11 13:34:06 -04:00
service_registration_endpoint.go api: enable selecting subset of services using rendezvous hashing 2022-06-25 10:37:37 -05:00
service_registration_endpoint_test.go api: enable selecting subset of services using rendezvous hashing 2022-06-25 10:37:37 -05:00
stats_endpoint.go
stats_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
status_endpoint.go
status_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
stub_asset.go gofmt all the files 2021-10-01 10:14:28 -04:00
syslog.go
syslog_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
system_endpoint.go
system_endpoint_test.go ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
testagent.go ci: set test log level off in gha 2022-03-25 13:43:33 -05:00
testagent_oss.go gofmt all the files 2021-10-01 10:14:28 -04:00
testingutils_test.go mock: add default host network 2020-11-23 10:11:00 -06:00