open-nomad

Tim Gross 37134a4a37 eval delete: move batching of deletes into RPC handler and state (#15117 )

During unusual outage recovery scenarios on large clusters, a backlog of millions of evaluations can appear. In these cases, the `eval delete` command can put excessive load on the cluster by listing large sets of evals to extract the IDs and then sending large batches of IDs. Although the command's batch size was carefully tuned, we still need to JSON deserialize, re-serialize to MessagePack, send the log entries through raft, and get the FSM applied.

To improve performance of this recovery case, move the batching process into the RPC handler and the state store.

The design here is a little weird, so let's look at the failed options first:

* A naive solution here would be to just send the filter as the raft request and let the FSM apply delete the whole set in a single operation. Benchmarking with 1M evals on a 3 node cluster demonstrated this can block the FSM apply for several minutes, which puts the cluster at risk if there's a leadership failover (the barrier write can't be made while this apply is in-flight).
* A less naive but still bad solution would be to have the RPC handler filter and paginate, and then hand a list of IDs to the existing raft log entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and took roughly an hour to complete.

Instead, we're filtering and paginating in the RPC handler to find a page token, and then passing both the filter and page token in the raft log. The FSM apply recreates the paginator using the filter and page token to get roughly the same page of evaluations, which it then deletes. The pagination process is fairly cheap (only about 5% of the total FSM apply time), so counter-intuitively this rework ends up being much faster. A benchmark of 1M evaluations showed this blocked the FSM apply for 20-30ms at a time (typical for normal operations) and completes in less than 4 minutes.

Note that, as with the existing design, this delete is not consistent: a new evaluation inserted "behind" the cursor of the pagination will fail to be deleted.		2022-11-14 14:08:13 -05:00
contexts	rename SecureVariables to Variables throughout	2022-08-26 16:06:24 -04:00
internal/testutil	api: use errors.New not fmt.Errorf when error doesn't have format. (#14027 )	2022-08-05 17:05:47 +02:00
acl.go	acl: fix encoding expiration time in ACL token list API. (#14542 )	2022-09-12 15:50:35 +02:00
acl_test.go	acl: fix encoding expiration time in ACL token list API. (#14542 )	2022-09-12 15:50:35 +02:00
agent.go	Make number of scheduler workers reloadable (#11593 )	2022-01-06 11:56:13 -05:00
agent_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
allocations.go	Task lifecycle restart (#14127 )	2022-08-24 17:43:07 -04:00
allocations_exec.go	cleanup: prevent leaks from time.After	2022-02-02 14:32:26 -06:00
allocations_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
api.go	api: trim space of error response output	2022-08-16 15:00:38 -05:00
api_test.go	testing: setting env var incompatible with parallel tests (#14405 )	2022-08-30 14:49:03 -04:00
compose_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
constraint.go	Tag Job spec with HCLv2 tags	2020-10-21 14:05:46 -04:00
constraint_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
consul.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
consul_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
csi.go	CSI: failed allocation should not block its own controller unpublish (#14484 )	2022-09-08 13:30:05 -04:00
csi_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
deployments.go	cli: do not import structs, use API package only. (#13938 )	2022-08-02 16:33:08 +02:00
evaluations.go	eval delete: move batching of deletes into RPC handler and state (#15117 )	2022-11-14 14:08:13 -05:00
evaluations_test.go	core: allow deleting of evaluations (#13492 )	2022-07-06 16:30:11 +02:00
event_stream.go	api: add convenience string func to Topic type. (#14843 )	2022-10-19 14:12:23 +02:00
event_stream_test.go	api: add convenience string func to Topic type. (#14843 )	2022-10-19 14:12:23 +02:00
fs.go	api: document warnings for setting `api.ClientConnTimeout` (#14122 )	2022-08-15 16:06:02 -04:00
fs_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
go.mod	build(deps): bump github.com/shoenig/test from 0.4.3 to 0.4.4 in /api (#15163 )	2022-11-06 08:06:01 -06:00
go.sum	build(deps): bump github.com/shoenig/test from 0.4.3 to 0.4.4 in /api (#15163 )	2022-11-06 08:06:01 -06:00
ioutil.go	api: use errors.New not fmt.Errorf when error doesn't have format. (#14027 )	2022-08-05 17:05:47 +02:00
ioutil_test.go	api: use errors.New not fmt.Errorf when error doesn't have format. (#14027 )	2022-08-05 17:05:47 +02:00
jobs.go	[ui] Adds meta to job list stub and displays a pack logo on the jobs index (#14833 )	2022-11-02 16:58:24 -04:00
jobs_test.go	template: error on missing key (#15141 )	2022-11-04 13:23:01 -04:00
keyring.go	api: update keyring comment to reflect correct feature name. (#14558 )	2022-09-13 10:05:03 -04:00
keyring_test.go	remove root keyring install API (#14514 )	2022-09-09 08:50:35 -04:00
namespace.go	core: allow deleting of evaluations (#13492 )	2022-07-06 16:30:11 +02:00
namespace_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
nodes.go	Add os to NodeListStub struct. (#12497 )	2022-04-15 17:22:45 -07:00
nodes_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
operator.go	core: allow pausing and un-pausing of leader broker routine (#13045 )	2022-07-06 16:13:48 +02:00
operator_autopilot.go	implement MinQuorum	2020-02-16 16:04:59 -06:00
operator_ent_test.go	api: fix ENT-only test imports for moved testutil package (#12320 )	2022-03-18 10:12:28 -04:00
operator_metrics.go	Metrics gotemplate support, debug bundle features (#9067 )	2020-10-14 15:16:10 -04:00
operator_metrics_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
operator_test.go	core: allow pausing and un-pausing of leader broker routine (#13045 )	2022-07-06 16:13:48 +02:00
quota.go	rename SecureVariables to Variables throughout	2022-08-26 16:06:24 -04:00
quota_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
raw.go	core: allow deleting of evaluations (#13492 )	2022-07-06 16:30:11 +02:00
recommendations.go	added new policy capabilities for recommendations API	2020-10-28 14:32:16 +00:00
regions.go	cli: ensure `-stale` flag is respected by `nomad operator debug` (#11678 )	2021-12-15 10:44:03 -05:00
regions_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
resources.go	api: remove `mapstructure` tags from `Port` struct (#12916 )	2022-11-08 11:26:28 +01:00
resources_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
scaling.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
scaling_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
search.go	api: implement fuzzy search API	2021-04-16 16:36:07 -06:00
search_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
sentinel.go	api: use errors.New not fmt.Errorf when error doesn't have format. (#14027 )	2022-08-05 17:05:47 +02:00
sentinel_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
services.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
services_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
status.go	display server leaders per region	2016-03-17 16:04:09 -07:00
status_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
system.go	Add missing ReconcileSummaries API method	2017-08-24 11:55:10 +02:00
system_test.go	ci: use serial testing for api in CI	2022-03-17 08:35:01 -05:00
tasks.go	template: error on missing key (#15141 )	2022-11-04 13:23:01 -04:00
tasks_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
util_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
utils.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
utils_test.go	cleanup: replace TypeToPtr helper methods with pointer.Of (#14151 )	2022-08-17 18:26:34 +02:00
variables.go	[vars:api] Return fake QueryMeta on 403s with Peek() (#14661 )	2022-09-22 13:15:05 -04:00
variables_test.go	rename SecureVariables to Variables throughout	2022-08-26 16:06:24 -04:00