Before this change trying to run `nomad operator raft {info,logs}` on an
inuse raft.db would cause the command to block until the agent using
raft.db is closed.
After this change the command will block for 1s before returning a
(hopefully) helpful error message.
This change also sets the ReadOnly mode on the underlying BoltDb to
ensure diagnostics make no changes to the underlying store. We have no
evidence this has ever occurred, but it seems like a useful safety
measure.
No changelog added since this is a minor tweak in a "new" feature (it
was hidden in previous relases).
The `nomad operator raft logs` command uses a raft helper that reads
in the logs from raft and serializes them to JSON. The previous
implementation returned the slice of all logs and then serializes the
entire object. Update the helper to stream the log entries and then
serialize them as newline-delimited JSON.
The `nomad operator raft` and `nomad operator snapshot state`
subcommands for inspecting on-disk raft state were hidden and
undocumented. Expose and document these so that advanced operators
have support for these tools.
Use the new filtering and pagination capabilities of the `Eval.List`
RPC to provide filtering and pagination at the command line.
Also includes note that `nomad eval status -json` is deprecated and
will be replaced with a single evaluation view in a future version of
Nomad.
When a cluster doesn't have a leader, the `nomad operator debug`
command can safely use stale queries to gracefully degrade the
consistency of almost all its queries. The query parameter for these
API calls was not being set by the command.
Some `api` package queries do not include `QueryOptions` because
they target a specific agent, but they can potentially be forwarded to
other agents. If there is no leader, these forwarded queries will
fail. Provide methods to call these APIs with `QueryOptions`.
Some operators use very long group/task `shutdown_delay` settings to
safely drain network connections to their workloads after service
deregistration. But during incident response, they may want to cause
that drain to be skipped so they can quickly shed load.
Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and
`nomad job stop` commands that bypasses the delay. This sets a new
desired transition state on the affected allocations that the
allocation/task runner will identify during pre-kill on the client.
Note (as documented here) that using this flag will almost always
result in failed inbound network connections for workloads as the
tasks will exit before clients receive updated service discovery
information and won't be gracefully drained.
API queries can request pagination using the `NextToken` and `PerPage`
fields of `QueryOptions`, when supported by the underlying API.
Add a `NextToken` field to the `structs.QueryMeta` so that we have a
common field across RPCs to tell the caller where to resume paging
from on their next API call. Include this field on the `api.QueryMeta`
as well so that it's available for future versions of List HTTP APIs
that wrap the response with `QueryMeta` rather than returning a simple
list of structs. In the meantime callers can get the `X-Nomad-NextToken`.
Add pagination to the `Eval.List` RPC by checking for pagination token
and page size in `QueryOptions`. This will allow resuming from the
last ID seen so long as the query parameters and the state store
itself are unchanged between requests.
Add filtering by job ID or evaluation status over the results we get
out of the state store.
Parse the query parameters of the `Eval.List` API into the arguments
expected for filtering in the RPC call.
We don't run tests over documentation PRs marked by the branch prefix
`docs-*`. With the new backport assistant, that should also include
branches with the prefix `backports/docs-*`
This changeset adds more specific recommendations as to what metrics
to monitor, and what resources should be examined during incident
response.
It also renames the "Telemetry" section to "Monitoring Nomad" to
surface the material better and distinguish it from the "Metric
Reference".
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>