open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	37134a4a37	eval delete: move batching of deletes into RPC handler and state (#15117 ) During unusual outage recovery scenarios on large clusters, a backlog of millions of evaluations can appear. In these cases, the `eval delete` command can put excessive load on the cluster by listing large sets of evals to extract the IDs and then sending larges batches of IDs. Although the command's batch size was carefully tuned, we still need to be JSON deserialize, re-serialize to MessagePack, send the log entries through raft, and get the FSM applied. To improve performance of this recovery case, move the batching process into the RPC handler and the state store. The design here is a little weird, so let's look a the failed options first: * A naive solution here would be to just send the filter as the raft request and let the FSM apply delete the whole set in a single operation. Benchmarking with 1M evals on a 3 node cluster demonstrated this can block the FSM apply for several minutes, which puts the cluster at risk if there's a leadership failover (the barrier write can't be made while this apply is in-flight). * A less naive but still bad solution would be to have the RPC handler filter and paginate, and then hand a list of IDs to the existing raft log entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and took roughly an hour to complete. Instead, we're filtering and paginating in the RPC handler to find a page token, and then passing both the filter and page token in the raft log. The FSM apply recreates the paginator using the filter and page token to get roughly the same page of evaluations, which it then deletes. The pagination process is fairly cheap (only abut 5% of the total FSM apply time), so counter-intuitively this rework ends up being much faster. A benchmark of 1M evaluations showed this blocked the FSM apply for 20-30ms at a time (typical for normal operations) and completes in less than 4 minutes. Note that, as with the existing design, this delete is not consistent: a new evaluation inserted "behind" the cursor of the pagination will fail to be deleted.	2022-11-14 14:08:13 -05:00
Tim Gross	9e1c0b46d8	API for `Eval.Count` (#15147 ) Add a new `Eval.Count` RPC and associated HTTP API endpoints. This API is designed to support interactive use in the `nomad eval delete` command to get a count of evals expected to be deleted before doing so. The state store operations to do this sort of thing are somewhat expensive, but it's cheaper than serializing a big list of evals to JSON. Note that although it seems like this could be done as an extra parameter and response field on `Eval.List`, having it as its own endpoint avoids having to change the response body shape and lets us avoid handling the legacy filter params supported by `Eval.List`.	2022-11-07 08:53:19 -05:00
Tim Gross	dab9388c75	refactor eval delete safety check (#15070 ) The `Eval.Delete` endpoint has a helper that takes a list of jobs and allocs and determines whether the eval associated with those is safe to delete (based on their state). Filtering improvements to the `Eval.Delete` endpoint are going to need this check to run in the state store itself for consistency. Refactor to push this check down into the state store to keep the eventual diff for that work reasonable.	2022-10-28 09:10:33 -04:00
Tim Gross	9c37a234e7	test: refactor EvalEndpoint_Delete (#15065 ) While working on filtering improvements to the `Eval.Delete` endpoint I noticed that this test was going to need to expand significantly and needed some refactoring to make that work nicely. In order to reduce the size of the eventual diff, I've pulled this refactoring out into its own changeset.	2022-10-27 15:29:22 -04:00
Phil Renaud	e9219a1ae0	Allow wildcard for Evaluations API (#13530 ) * Failing test and TODO for wildcard * Alias the namespace query parameter for Evals * eval: fix list when using ACLs and * namespace Apply the same verification process as in job, allocs and scaling policy list endpoints to handle the eval list when using an ACL token with limited namespace support but querying using the `` wildcard namespace. changelog: add entry for #13530 * ui: set namespace when querying eval Evals have a unique UUID as ID, but when querying them the Nomad API still expects a namespace query param, otherwise it assumes `default`. Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-07-11 16:42:17 -04:00
James Rasell	0c0b028a59	core: allow deleting of evaluations (#13492 ) * core: add eval delete RPC and core functionality. * agent: add eval delete HTTP endpoint. * api: add eval delete API functionality. * cli: add eval delete command. * docs: add eval delete website documentation.	2022-07-06 16:30:11 +02:00
James Rasell	181b247384	core: allow pausing and un-pausing of leader broker routine (#13045 ) * core: allow pause/un-pause of eval broker on region leader. * agent: add ability to pause eval broker via scheduler config. * cli: add operator scheduler commands to interact with config. * api: add ability to pause eval broker via scheduler config * e2e: add operator scheduler test for eval broker pause. * docs: include new opertor scheduler CLI and pause eval API info.	2022-07-06 16:13:48 +02:00
Luiz Aoqui	15089f055f	api: add related evals to eval details (#12305 ) The `related` query param is used to indicate that the request should return a list of related (next, previous, and blocked) evaluations. Co-authored-by: Jasmine Dahilig <jasmine@hashicorp.com>	2022-03-17 13:56:14 -04:00
Seth Hoenig	2631659551	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
Luiz Aoqui	ab8ce87bba	Add pagination, filtering and sort to more API endpoints (#12186 )	2022-03-08 20:54:17 -05:00
Luiz Aoqui	01931587ba	api: paginated results with different ordering (#12128 ) The paginator logic was built when go-memdb iterators would return items ordered lexicographically by their ID prefixes, but #12054 added the option for some tables to return results ordered by their `CreateIndex` instead, which invalidated the previous paginator assumption. The iterator used for pagination must still return results in some order so that the paginator can properly handle requests where the next_token value is not present in the results anymore (e.g., the eval was GC'ed). In these situations, the paginator will start the returned page in the first element right after where the requested token should've been. This commit moves the logic to generate pagination tokens from the elements being paginated to the iterator itself so that callers can have more control over the token format to make sure they are properly ordered and stable. It also allows configuring the paginator as being ordered in ascending or descending order, which is relevant when looking for a token that may not be present anymore.	2022-03-01 15:36:49 -05:00
Luiz Aoqui	de91954582	initial base work for implementing sorting and filter across API endpoints (#12076 )	2022-02-16 14:34:36 -05:00
Luiz Aoqui	110dbeeb9d	Add `go-bexpr` filters to evals and deployment list endpoints (#12034 )	2022-02-16 11:40:30 -05:00
Seth Hoenig	40c714a681	api: return sorted results in certain list endpoints These API endpoints now return results in chronological order. They can return results in reverse chronological order by setting the query parameter ascending=true. - Eval.List - Deployment.List	2022-02-15 13:48:28 -06:00
Tim Gross	e046bb31e9	api: respect wildcard in evaluations list API (#11710 )	2021-12-20 12:23:50 -05:00
Tim Gross	624ecab901	evaluations list pagination and filtering (#11648 ) API queries can request pagination using the `NextToken` and `PerPage` fields of `QueryOptions`, when supported by the underlying API. Add a `NextToken` field to the `structs.QueryMeta` so that we have a common field across RPCs to tell the caller where to resume paging from on their next API call. Include this field on the `api.QueryMeta` as well so that it's available for future versions of List HTTP APIs that wrap the response with `QueryMeta` rather than returning a simple list of structs. In the meantime callers can get the `X-Nomad-NextToken`. Add pagination to the `Eval.List` RPC by checking for pagination token and page size in `QueryOptions`. This will allow resuming from the last ID seen so long as the query parameters and the state store itself are unchanged between requests. Add filtering by job ID or evaluation status over the results we get out of the state store. Parse the query parameters of the `Eval.List` API into the arguments expected for filtering in the RPC call.	2021-12-10 13:43:03 -05:00
Drew Bailey	6c788fdccd	Events/msgtype cleanup (#9117 ) * use msgtype in upsert node adds message type to signature for upsert node, update tests, remove placeholder method * UpsertAllocs msg type test setup * use upsertallocs with msg type in signature update test usage of delete node delete placeholder msgtype method * add msgtype to upsert evals signature, update test call sites with test setup msg type handle snapshot upsert eval outside of FSM and ignore eval event remove placeholder upsertevalsmsgtype handle job plan rpc and prevent event creation for plan msgtype cleanup upsertnodeevents updatenodedrain msgtype msg type 0 is a node registration event, so set the default to the ignore type * fix named import * fix signature ordering on upsertnode to match	2020-10-19 09:30:15 -04:00
Drew Bailey	9d48818eb8	writetxn can return error, add alloc and job generic events. Add events table for durability	2020-10-14 12:44:39 -04:00
Drew Bailey	4793bb4e01	Events/deployment events (#9004 ) * Node Drain events and Node Events (#8980) Deployment status updates handle deployment status updates (paused, failed, resume) deployment alloc health generate events from apply plan result txn err check, slim down deployment event one ndjson line per index * consolidate down to node event + type * fix UpdateDeploymentAllocHealth test invocations * fix test	2020-10-14 12:44:37 -04:00
Seth Hoenig	f0c3dca49c	tests: swap lib/freeport for tweaked helper/freeport Copy the updated version of freeport (sdk/freeport), and tweak it for use in Nomad tests. This means staying below port 10000 to avoid conflicts with the lib/freeport that is still transitively used by the old version of consul that we vendor. Also provide implementations to find ephemeral ports of macOS and Windows environments. Ports acquired through freeport are supposed to be returned to freeport, which this change now also introduces. Many tests are modified to include calls to a cleanup function for Server objects. This should help quite a bit with some flakey tests, but not all of them. Our port problems will not go away completely until we upgrade our vendor version of consul. With Go modules, we'll probably do a 'replace' to swap out other copies of freeport with the one now in 'nomad/helper/freeport'.	2019-12-09 08:37:32 -06:00
Alex Dadgar	e779d9444b	Update nomad/eval_endpoint_test.go Co-Authored-By: schmichael <michael.schurter@gmail.com>	2019-03-05 15:19:15 -08:00
Michael Schurter	05f51499ba	nomad: compare current eval when setting WaitIndex Consider currently dequeued Evaluation's ModifyIndex when determining its WaitIndex. Normally the Evaluation itself would already be in the state store snapshot used to determine the WaitIndex. However, since the FSM applies Raft messages to the state store concurrently with Dequeueing, it's possible the currently dequeued Evaluation won't yet exist in the state store snapshot used by JobsForEval. This can be solved by always considering the current eval's modify index and using it if it is greater than all of the evals returned by the state store.	2019-03-01 15:23:39 -08:00
Alex Dadgar	4bdccab550	goimports	2019-01-22 15:44:31 -08:00
Michael Schurter	7dd7fbcda2	non-Existent -> nonexistent Reverting from #3963 https://www.merriam-webster.com/dictionary/existent	2018-03-12 11:59:33 -07:00
Josh Soref	f1d21bfdfe	spelling: enqueuing	2018-03-11 18:00:07 +00:00
Alex Dadgar	6dd1c9f49d	Refactor	2018-02-15 13:59:00 -08:00
Alex Dadgar	a6dfffa4fa	Add testing interfaces	2018-02-15 13:59:00 -08:00
Preetha Appan	40cb1d327c	Address some code review comments	2017-12-18 15:22:23 -06:00
Preetha Appan	3c36abfe14	Update eval modify index as part of plan apply.	2017-12-18 10:03:55 -06:00
Michael Schurter	15b991e039	base64 migrate token HTTP header values must be ASCII. Also constant time compare tokens and test the generate and compare helper functions.	2017-10-13 10:59:13 -07:00
Michael Schurter	84d8a51be1	SecretID -> AuthToken	2017-10-12 15:16:33 -07:00
Michael Schurter	57ff12432b	Move acl helpers from nomad/ into nomad/mock They're useful in command/agent/ tests.	2017-10-06 14:50:06 -07:00
Michael Schurter	22169a7cd4	Eval.Allocations ACL enforcement	2017-10-03 14:57:47 -07:00
Michael Schurter	b3db8f41fd	Eval.List ACL enforcement	2017-10-03 14:57:47 -07:00
Michael Schurter	fae1be5ab2	Eval.GetEval ACL enforcement	2017-10-03 14:57:47 -07:00
Michael Schurter	a66c53d45a	Remove `structs` import from `api` Goes a step further and removes structs import from api's tests as well by moving GenerateUUID to its own package.	2017-09-29 10:36:08 -07:00
Alex Dadgar	6911bd7676	Worker waits til max ModifyIndex across EvalsByJob This PR fixes a scheduling race condition in which the plan results from one invocation of the scheduler were not being considered by the next since the Worker was not waiting for the correct index. Fixes https://github.com/hashicorp/nomad/issues/3198	2017-09-14 14:28:43 -07:00
Alex Dadgar	84d06f6abe	Sync namespace changes	2017-09-07 17:04:21 -07:00
Alex Dadgar	06eddf243c	parallel nomad tests	2017-07-25 17:39:36 -07:00
Alex Dadgar	e7b128271f	Diff code fixes	2017-04-16 16:54:40 -07:00
Alex Dadgar	5be806a3df	Fix vet script and fix vet problems This PR fixes our vet script and fixes all the missed vet changes. It also fixes pointers being printed in `nomad stop <job>` and `nomad node-status <node>`.	2017-02-27 16:00:19 -08:00
Alex Dadgar	d182aac7a7	Fix nomad tests	2017-02-07 22:10:33 -08:00
Alex Dadgar	04862ca10e	Tests compile	2017-02-07 21:30:57 -08:00
Alex Dadgar	a1d08c2aba	Add scheduler version enforcement	2016-10-26 14:52:48 -07:00
Diptanu Choudhury	6193529040	Fixed more tests	2016-07-25 17:31:40 -07:00
Alex Dadgar	ecdce9a641	don't dequeue	2016-06-07 09:51:20 -07:00
Alex Dadgar	3100b4a086	Change eval_endpoint test to not retry but block longer	2016-06-03 12:02:49 -07:00
Alex Dadgar	299a0bb4b3	up timeout for dequeue in test	2016-06-03 11:36:50 -07:00
Alex Dadgar	629542f64e	flaky test	2016-05-31 23:50:14 +00:00
Alex Dadgar	18d9e89065	Reuse the same evaluation and reblock it until there is no more work to do	2016-05-24 20:10:56 -07:00

1 2

82 Commits