open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	37134a4a37	eval delete: move batching of deletes into RPC handler and state (#15117) During unusual outage recovery scenarios on large clusters, a backlog of millions of evaluations can appear. In these cases, the `eval delete` command can put excessive load on the cluster by listing large sets of evals to extract the IDs and then sending large batches of IDs. Although the command's batch size was carefully tuned, we still need to JSON deserialize, re-serialize to MessagePack, send the log entries through raft, and get the FSM applied. To improve performance of this recovery case, move the batching process into the RPC handler and the state store. The design here is a little weird, so let's look at the failed options first: * A naive solution here would be to just send the filter as the raft request and let the FSM apply delete the whole set in a single operation. Benchmarking with 1M evals on a 3 node cluster demonstrated this can block the FSM apply for several minutes, which puts the cluster at risk if there's a leadership failover (the barrier write can't be made while this apply is in-flight). * A less naive but still bad solution would be to have the RPC handler filter and paginate, and then hand a list of IDs to the existing raft log entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and took roughly an hour to complete. Instead, we're filtering and paginating in the RPC handler to find a page token, and then passing both the filter and page token in the raft log. The FSM apply recreates the paginator using the filter and page token to get roughly the same page of evaluations, which it then deletes. The pagination process is fairly cheap (only about 5% of the total FSM apply time), so counter-intuitively this rework ends up being much faster. A benchmark of 1M evaluations showed this blocked the FSM apply for 20-30ms at a time (typical for normal operations) and completes in less than 4 minutes. 
Note that, as with the existing design, this delete is not consistent: a new evaluation inserted "behind" the cursor of the pagination will fail to be deleted.	2022-11-14 14:08:13 -05:00
Tim Gross	9e1c0b46d8	API for `Eval.Count` (#15147) Add a new `Eval.Count` RPC and associated HTTP API endpoints. This API is designed to support interactive use in the `nomad eval delete` command to get a count of evals expected to be deleted before doing so. The state store operations to do this sort of thing are somewhat expensive, but it's cheaper than serializing a big list of evals to JSON. Note that although it seems like this could be done as an extra parameter and response field on `Eval.List`, having it as its own endpoint avoids having to change the response body shape and lets us avoid handling the legacy filter params supported by `Eval.List`.	2022-11-07 08:53:19 -05:00
James Rasell	bb5b510c9d	cli: do not import structs, use API package only. (#13938)	2022-08-02 16:33:08 +02:00
James Rasell	0c0b028a59	core: allow deleting of evaluations (#13492) * core: add eval delete RPC and core functionality. * agent: add eval delete HTTP endpoint. * api: add eval delete API functionality. * cli: add eval delete command. * docs: add eval delete website documentation.	2022-07-06 16:30:11 +02:00
Luiz Aoqui	15089f055f	api: add related evals to eval details (#12305) The `related` query param is used to indicate that the request should return a list of related (next, previous, and blocked) evaluations. Co-authored-by: Jasmine Dahilig <jasmine@hashicorp.com>	2022-03-17 13:56:14 -04:00
Jasmine Dahilig	8d980edd2e	add create and modify timestamps to evaluations (#5881)	2019-08-07 09:50:35 -07:00
Preetha Appan	9a5e6edf1f	Rename DelayCeiling to MaxDelay	2018-03-14 16:10:32 -05:00
Alex Dadgar	c1cc51dbee	sync	2017-10-13 14:36:02 -07:00
Alex Dadgar	84d06f6abe	Sync namespace changes	2017-09-07 17:04:21 -07:00
Alex Dadgar	d04877d23c	initial impl	2017-07-07 12:03:11 -07:00
Alex Dadgar	9011a7984c	Add metrics to show allocations on the client This PR adds the following metrics to the client: client.allocations.migrating client.allocations.blocked client.allocations.pending client.allocations.running client.allocations.terminal Also adds some missing fields to the API version of the evaluation.	2017-03-09 12:37:41 -08:00
Alex Dadgar	a37656e7d8	Add QueuedAllocations to api.Evaluation	2017-01-06 11:32:14 -08:00
Alex Dadgar	fcc57fbc66	rename SpawnedBlockedEval and simplify map safety check	2016-05-24 18:12:59 -07:00
Alex Dadgar	1feb57b047	Evals track blocked evals they create	2016-05-19 13:09:52 -07:00
Alex Dadgar	8f5f12ae81	Scheduler no longer produces failed allocations; failed alloc metrics stored in evaluation	2016-05-18 18:11:40 -07:00
Ivo Verberk	0c01ca49e6	Refactoring continued * Refactor other cli commands to new design * Add PrefixList method to api package * Add more tests	2015-12-24 20:53:37 +01:00
Ryan Uber	61b8249d08	api: sort all list responses	2015-09-17 13:10:20 -07:00
Ryan Uber	855ec7a712	api: use stub structs	2015-09-13 20:02:22 -07:00
Ryan Uber	2cbdd4c1c3	api: working on evaluations	2015-09-09 13:48:56 -07:00
Ryan Uber	1904724839	api: finishing jobs	2015-09-08 18:42:34 -07:00

20 Commits
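
The batching design described in commit 37134a4a37 (the handler finds a page token, and the log entry carries the filter plus that token instead of a list of IDs) might be sketched roughly as follows. This is a minimal in-memory illustration, not Nomad's code: `Eval`, `Store`, `DeleteRequest`, `page`, `applyDelete`, and `deleteByFilter` are all hypothetical names, the filter is reduced to a single status string, and the raft round trip is replaced by a direct function call.

```go
package main

import "sort"

// Eval is a hypothetical, trimmed-down evaluation record.
type Eval struct {
	ID     string
	Status string
}

// Store is a hypothetical in-memory stand-in for the state store.
type Store struct {
	evals map[string]Eval
}

// DeleteRequest is a hypothetical log entry. It carries the filter and
// the page bounds rather than an explicit list of IDs.
type DeleteRequest struct {
	Status    string // filter: only evals with this status match
	NextToken string // page starts at the first ID >= this token
	PerPage   int    // maximum number of evals deleted per entry
}

// page walks evals in ID order starting at token and returns up to
// perPage matching IDs, plus the ID that starts the next page
// ("" when there is no further page).
func (s *Store) page(status, token string, perPage int) ([]string, string) {
	ids := make([]string, 0, len(s.evals))
	for id := range s.evals {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	var out []string
	for _, id := range ids {
		if id < token || s.evals[id].Status != status {
			continue
		}
		if len(out) == perPage {
			return out, id
		}
		out = append(out, id)
	}
	return out, ""
}

// applyDelete plays the role of the FSM apply: it recreates the page
// from the filter and token, then deletes whatever that page contains.
func (s *Store) applyDelete(req DeleteRequest) int {
	ids, _ := s.page(req.Status, req.NextToken, req.PerPage)
	for _, id := range ids {
		delete(s.evals, id)
	}
	return len(ids)
}

// deleteByFilter plays the role of the RPC handler: it paginates only
// to find each page token and issues one small log entry per page.
func deleteByFilter(s *Store, status string, perPage int) (deleted, entries int) {
	token := ""
	for {
		ids, next := s.page(status, token, perPage)
		if len(ids) == 0 {
			break
		}
		// In a real cluster this request would be sent through raft.
		deleted += s.applyDelete(DeleteRequest{Status: status, NextToken: token, PerPage: perPage})
		entries++
		if next == "" {
			break
		}
		token = next
	}
	return deleted, entries
}
```

Each apply touches at most `PerPage` evaluations, which is the property the commit message credits for keeping individual FSM applies short. As in the commit's own caveat, the sketch is not consistent: an evaluation inserted with an ID below the current token after that page has been processed is never visited.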