During unusual outage recovery scenarios on large clusters, a backlog of
millions of evaluations can appear. In these cases, the `eval delete` command can
put excessive load on the cluster by listing large sets of evals to extract the
IDs and then sending larges batches of IDs. Although the command's batch size
was carefully tuned, we still need to be JSON deserialize, re-serialize to
MessagePack, send the log entries through raft, and get the FSM applied.
To improve performance of this recovery case, move the batching process into the
RPC handler and the state store. The design here is a little weird, so let's
look a the failed options first:
* A naive solution here would be to just send the filter as the raft request and
let the FSM apply delete the whole set in a single operation. Benchmarking with
1M evals on a 3 node cluster demonstrated this can block the FSM apply for
several minutes, which puts the cluster at risk if there's a leadership
failover (the barrier write can't be made while this apply is in-flight).
* A less naive but still bad solution would be to have the RPC handler filter
and paginate, and then hand a list of IDs to the existing raft log
entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and
took roughly an hour to complete.
Instead, we're filtering and paginating in the RPC handler to find a page token,
and then passing both the filter and page token in the raft log. The FSM apply
recreates the paginator using the filter and page token to get roughly the same
page of evaluations, which it then deletes. The pagination process is fairly
cheap (only abut 5% of the total FSM apply time), so counter-intuitively this
rework ends up being much faster. A benchmark of 1M evaluations showed this
blocked the FSM apply for 20-30ms at a time (typical for normal operations) and
completes in less than 4 minutes.
Note that, as with the existing design, this delete is not consistent: a new
evaluation inserted "behind" the cursor of the pagination will fail to be
deleted.
Add a new `Eval.Count` RPC and associated HTTP API endpoints. This API is
designed to support interactive use in the `nomad eval delete` command to get a
count of evals expected to be deleted before doing so.
The state store operations to do this sort of thing are somewhat expensive, but
it's cheaper than serializing a big list of evals to JSON. Note that although it
seems like this could be done as an extra parameter and response field on
`Eval.List`, having it as its own endpoint avoids having to change the response
body shape and lets us avoid handling the legacy filter params supported by
`Eval.List`.
The `Eval.Delete` endpoint has a helper that takes a list of jobs and allocs and
determines whether the eval associated with those is safe to delete (based on
their state). Filtering improvements to the `Eval.Delete` endpoint are going to
need this check to run in the state store itself for consistency.
Refactor to push this check down into the state store to keep the eventual diff
for that work reasonable.
While working on filtering improvements to the `Eval.Delete` endpoint I noticed
that this test was going to need to expand significantly and needed some
refactoring to make that work nicely. In order to reduce the size of the
eventual diff, I've pulled this refactoring out into its own changeset.
* Failing test and TODO for wildcard
* Alias the namespace query parameter for Evals
* eval: fix list when using ACLs and * namespace
Apply the same verification process as in job, allocs and scaling
policy list endpoints to handle the eval list when using an ACL token
with limited namespace support but querying using the `*` wildcard
namespace.
* changelog: add entry for #13530
* ui: set namespace when querying eval
Evals have a unique UUID as ID, but when querying them the Nomad API
still expects a namespace query param, otherwise it assumes `default`.
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
* core: allow pause/un-pause of eval broker on region leader.
* agent: add ability to pause eval broker via scheduler config.
* cli: add operator scheduler commands to interact with config.
* api: add ability to pause eval broker via scheduler config
* e2e: add operator scheduler test for eval broker pause.
* docs: include new opertor scheduler CLI and pause eval API info.
The `related` query param is used to indicate that the request should
return a list of related (next, previous, and blocked) evaluations.
Co-authored-by: Jasmine Dahilig <jasmine@hashicorp.com>
The paginator logic was built when go-memdb iterators would return items
ordered lexicographically by their ID prefixes, but #12054 added the
option for some tables to return results ordered by their `CreateIndex`
instead, which invalidated the previous paginator assumption.
The iterator used for pagination must still return results in some order
so that the paginator can properly handle requests where the next_token
value is not present in the results anymore (e.g., the eval was GC'ed).
In these situations, the paginator will start the returned page in the
first element right after where the requested token should've been.
This commit moves the logic to generate pagination tokens from the
elements being paginated to the iterator itself so that callers can have
more control over the token format to make sure they are properly
ordered and stable.
It also allows configuring the paginator as being ordered in ascending
or descending order, which is relevant when looking for a token that may
not be present anymore.
These API endpoints now return results in chronological order. They
can return results in reverse chronological order by setting the
query parameter ascending=true.
- Eval.List
- Deployment.List
API queries can request pagination using the `NextToken` and `PerPage`
fields of `QueryOptions`, when supported by the underlying API.
Add a `NextToken` field to the `structs.QueryMeta` so that we have a
common field across RPCs to tell the caller where to resume paging
from on their next API call. Include this field on the `api.QueryMeta`
as well so that it's available for future versions of List HTTP APIs
that wrap the response with `QueryMeta` rather than returning a simple
list of structs. In the meantime callers can get the `X-Nomad-NextToken`.
Add pagination to the `Eval.List` RPC by checking for pagination token
and page size in `QueryOptions`. This will allow resuming from the
last ID seen so long as the query parameters and the state store
itself are unchanged between requests.
Add filtering by job ID or evaluation status over the results we get
out of the state store.
Parse the query parameters of the `Eval.List` API into the arguments
expected for filtering in the RPC call.
* use msgtype in upsert node
adds message type to signature for upsert node, update tests, remove placeholder method
* UpsertAllocs msg type test setup
* use upsertallocs with msg type in signature
update test usage of delete node
delete placeholder msgtype method
* add msgtype to upsert evals signature, update test call sites with test setup msg type
handle snapshot upsert eval outside of FSM and ignore eval event
remove placeholder upsertevalsmsgtype
handle job plan rpc and prevent event creation for plan
msgtype cleanup upsertnodeevents
updatenodedrain msgtype
msg type 0 is a node registration event, so set the default to the ignore type
* fix named import
* fix signature ordering on upsertnode to match
* Node Drain events and Node Events (#8980)
Deployment status updates
handle deployment status updates (paused, failed, resume)
deployment alloc health
generate events from apply plan result
txn err check, slim down deployment event
one ndjson line per index
* consolidate down to node event + type
* fix UpdateDeploymentAllocHealth test invocations
* fix test
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.
Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.
This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
Consider currently dequeued Evaluation's ModifyIndex when determining
its WaitIndex. Normally the Evaluation itself would already be in the
state store snapshot used to determine the WaitIndex. However, since the FSM
applies Raft messages to the state store concurrently with Dequeueing,
it's possible the currently dequeued Evaluation won't yet exist in the
state store snapshot used by JobsForEval.
This can be solved by always considering the current eval's modify index
and using it if it is greater than all of the evals returned by the
state store.
This PR fixes a scheduling race condition in which the plan results from
one invocation of the scheduler were not being considered by the next
since the Worker was not waiting for the correct index.
Fixes https://github.com/hashicorp/nomad/issues/3198
This PR fixes our vet script and fixes all the missed vet changes.
It also fixes pointers being printed in `nomad stop <job>` and `nomad
node-status <node>`.