Processing an evaluation is nearly a pure function over the state
snapshot, but we randomly shuffle the nodes. This means that
developers can't take a given state snapshot and pass an evaluation
through it and be guaranteed the same plan results.
But the evaluation ID is already random, so if we use this as the seed
for shuffling the nodes we can greatly reduce the sources of
non-determinism. Unfortunately golang map iteration uses a global
source of randomness and not a goroutine-local one, but arguably
if the scheduler behavior is impacted by this, that's a bug in the
iteration.
If processing a specific evaluation causes the scheduler (and
therefore the entire server) to panic, that evaluation will never
get a chance to be nack'd and cleared from the state store. It will
get dequeued by another scheduler, causing that server to panic, and
so forth until all servers are in a panic loop. This prevents the
operator from intervening to remove the evaluation or update the
state.
Recover the goroutine from the top-level `Process` methods for each
scheduler so that this condition can be detected without panicking the
server process. This will lead to a loop of recovering the scheduler
goroutine until the eval can be removed or nack'd, but that's much
better than taking a downtime.
Update the logic in the Nomad client's alloc health tracker which
erroneously marks existing healthy allocations with dead poststart ephemeral
tasks as unhealthy even if they were already successful during a previous
deployment.
This PR replaces use of time.After with a safe helper function
that creates a time.Timer to use instead. The new function returns
both a time.Timer and a Stop function that the caller must handle.
Unlike time.NewTimer, the helper function does not panic if the duration
set is <= 0.
* csi: resolve invalid claim states on read
It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.
* csi: make feasibility check errors more understandable
When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
The volumewatcher that runs on the leader needs to make RPC calls
rather than writing to raft (as we do in the deploymentwatcher)
because the unpublish workflow needs to make RPC calls to the
clients. This requires that the volumewatcher has access to the
leader's ACL token.
But when leadership transitions, the new leader creates a new leader
ACL token. This ACL token needs to be passed into the volumewatcher
when we enable it, otherwise the volumewatcher can find itself with a
stale token.
* csi: resolve invalid claim states on read
It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.
* csi: make feasibility check errors more understandable
When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
The HCL1 parser did not respect connect.sidecar_task.resources if the
connect.sidecar_service block was not set (an optimiztion that no longer
makes sense with connect gateways).
Fixes#10899
This PR sets the minimum Go version for the `api` submodule to Go 1.17.
It also upgrades
- gorilla/websocket 1.4.1 -> 1.4.2
- mitchelh/mapstructure 1.4.2 -> 1.4.3
- stretchr/testify 1.5.1 -> 1.7.0
Closes#11518#11602#11528
The volumewatcher that runs on the leader needs to make RPC calls
rather than writing to raft (as we do in the deploymentwatcher)
because the unpublish workflow needs to make RPC calls to the
clients. This requires that the volumewatcher has access to the
leader's ACL token.
But when leadership transitions, the new leader creates a new leader
ACL token. This ACL token needs to be passed into the volumewatcher
when we enable it, otherwise the volumewatcher can find itself with a
stale token.
This PR upgrades our CI images and fixes some affected tests.
- upgrade go-machine-image to premade latest ubuntu LTS (ubuntu-2004:202111-02)
- eliminate go-machine-recent-image (no longer necessary)
- manage GOPATH in GNUMakefile (see https://discuss.circleci.com/t/gopath-is-set-to-multiple-directories/7174)
- fix tcp dial error check (message seems to be OS specific)
- spot check values measured instead of specifically 'RSS' (rss no longer reported in cgroups v2)
- use safe MkdirTemp for generating tmpfiles
NOT applied: (too flakey)
- eliminate setting GOMAXPROCS=1 (build tools were also affected by this setting)
- upgrade resource type for all imanges to large (2C -> 4C)
github.com/kr/pty was moved to github.com/creack/pty
Swap this dependency so we can upgrade to the latest version
and no longer need a replace directive.
After swapping gzip handler to use the gorilla library, we
must account for a quirk in how zero/minimal length response
bodies are delivered.
The previous gzip handler was configured to compress all responses
regardless of size - even if the data was zero length or below the
network MTU. This behavior changed in [v1.1.0](c551b6c3b4 (diff-de723e6602cc2f16f7a9d85fd89d69954edc12a49134dab8901b10ee06d1879d))
which is why we could not upgrade.
The Nomad HTTP Client mutates the http.Response.Body object, making
a strong assumption that if the Content-Encoding header is set to "gzip",
the response will be readable via gzip decoder. This is no longer true
for the nytimes gzip handler, and is also not true for the gorilla gzip
handler.
It seems in practice this only makes a difference on the /v1/operator/license
endpoint which returns an empty response in OSS Nomad.
The fix here is to simply not wrap the response body reader if we
encounter an io.EOF while creating the gzip reader - indicating there
is no data to decode.
Improves `nomad debug` error messages when contacting agents that do not
have /v1/agent/host endpoints (the endpoint was added in v0.12.0)
Part of #9568 and manually tested against Nomad v0.8.7.
Hopefully isRedirectError can be reused for more cases listed in #9568
The command line client sends a specific volume ID, but this isn't
enforced at the API level and we were incorrectly using a prefix match
for volume deregistration, resulting in cases where a volume with a
shorter ID that's a prefix of another volume would be deregistered
instead of the intended volume.
When the `volume deregister` or `volume detach` commands get an ID
prefix that matches multiple volumes, show the full length of the
volume IDs in the list of volumes shown so so that the user can select
the correct one.
The size of `stat_t` fields is architecture dependent, which was
reportedly causing a build failure on FreeBSD ARM7 32-bit
systems. This changeset matches the behavior we have on Linux.
When we copy the system DNS to a task's `resolv.conf`, we should set
the permissions as world-readable so that unprivileged users within
the task can read it.
This PR exposes the following existing`consul-template` configuration options to Nomad jobspec authors in the `{job.group.task.template}` stanza.
- `wait`
It also exposes the following`consul-template` configuration to Nomad operators in the `{client.template}` stanza.
- `max_stale`
- `block_query_wait`
- `consul_retry`
- `vault_retry`
- `wait`
Finally, it adds the following new Nomad-specific configuration to the `{client.template}` stanza that allows Operators to set bounds on what `jobspec` authors configure.
- `wait_bounds`
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* Fixed name of `nomad.scheduler.allocs.reschedule` metric
* Added new metrics to metrics reference documentation
* Expanded definitions of "waiting" metrics
* Changelog entry for #10236 and #10237
## Development Environment Changes
* Added stringer to build deps
## New HTTP APIs
* Added scheduler worker config API
* Added scheduler worker info API
## New Internals
* (Scheduler)Worker API refactor—Start(), Stop(), Pause(), Resume()
* Update shutdown to use context
* Add mutex for contended server data
- `workerLock` for the `workers` slice
- `workerConfigLock` for the `Server.Config.NumSchedulers` and
`Server.Config.EnabledSchedulers` values
## Other
* Adding docs for scheduler worker api
* Add changelog message
Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>
When `volumewatcher.Watcher` starts on the leader, it starts a watch
on every volume and triggers a reap of unused claims on any change to
that volume. But if a reaping is in-flight during leadership
transitions, it will fail and the event that triggered the reap will
be dropped. Perform one reap of unused claims at the start of the
watcher so that leadership transitions don't drop this event.
The task runner prestart hooks take a `joincontext` so they have the
option to exit early if either of two contexts are canceled: from
killing the task or client shutdown. Some tasks exit without being
shutdown from the server, so neither of the joined contexts ever gets
canceled and we leak the `joincontext` (48 bytes) and its internal
goroutine. This primarily impacts batch jobs and any task that fails
or completes early such as non-sidecar prestart lifecycle tasks.
Cancel the `joincontext` after the prestart call exits to fix the
leak.
When the scheduler picks a node for each evaluation, the
`LimitIterator` provides at most 2 eligible nodes for the
`MaxScoreIterator` to choose from. This keeps scheduling fast while
producing acceptable results because the results are binpacked.
Jobs with a `spread` block (or node affinity) remove this limit in
order to produce correct spread scoring. This means that every
allocation within a job with a `spread` block is evaluated against
_all_ eligible nodes. Operators of large clusters have reported that
jobs with `spread` blocks that are eligible on a large number of nodes
can take longer than the nack timeout to evaluate (60s). Typical
evaluations are processed in milliseconds.
In practice, it's not necessary to evaluate every eligible node for
every allocation on large clusters, because the `RandomIterator` at
the base of the scheduler stack produces enough variation in each pass
that the likelihood of an uneven spread is negligible. Note that
feasibility is checked before the limit, so this only impacts the
number of _eligible_ nodes available for scoring, not the total number
of nodes.
This changeset sets the iterator limit for "large" `spread` block and
node affinity jobs to be equal to the number of desired
allocations. This brings an example problematic job evaluation down
from ~3min to ~10s. The included tests ensure that we have acceptable
spread results across a variety of large cluster topologies.
The `nomad operator raft` and `nomad operator snapshot state`
subcommands for inspecting on-disk raft state were hidden and
undocumented. Expose and document these so that advanced operators
have support for these tools.
Use the new filtering and pagination capabilities of the `Eval.List`
RPC to provide filtering and pagination at the command line.
Also includes note that `nomad eval status -json` is deprecated and
will be replaced with a single evaluation view in a future version of
Nomad.
When a cluster doesn't have a leader, the `nomad operator debug`
command can safely use stale queries to gracefully degrade the
consistency of almost all its queries. The query parameter for these
API calls was not being set by the command.
Some `api` package queries do not include `QueryOptions` because
they target a specific agent, but they can potentially be forwarded to
other agents. If there is no leader, these forwarded queries will
fail. Provide methods to call these APIs with `QueryOptions`.
Some operators use very long group/task `shutdown_delay` settings to
safely drain network connections to their workloads after service
deregistration. But during incident response, they may want to cause
that drain to be skipped so they can quickly shed load.
Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and
`nomad job stop` commands that bypasses the delay. This sets a new
desired transition state on the affected allocations that the
allocation/task runner will identify during pre-kill on the client.
Note (as documented here) that using this flag will almost always
result in failed inbound network connections for workloads as the
tasks will exit before clients receive updated service discovery
information and won't be gracefully drained.
API queries can request pagination using the `NextToken` and `PerPage`
fields of `QueryOptions`, when supported by the underlying API.
Add a `NextToken` field to the `structs.QueryMeta` so that we have a
common field across RPCs to tell the caller where to resume paging
from on their next API call. Include this field on the `api.QueryMeta`
as well so that it's available for future versions of List HTTP APIs
that wrap the response with `QueryMeta` rather than returning a simple
list of structs. In the meantime callers can get the `X-Nomad-NextToken`.
Add pagination to the `Eval.List` RPC by checking for pagination token
and page size in `QueryOptions`. This will allow resuming from the
last ID seen so long as the query parameters and the state store
itself are unchanged between requests.
Add filtering by job ID or evaluation status over the results we get
out of the state store.
Parse the query parameters of the `Eval.List` API into the arguments
expected for filtering in the RPC call.
During incident response, operators may find that automated processes
elsewhere in the organization can be generating new workloads on Nomad
clusters that are unable to handle the workload. This changeset adds a
field to the `SchedulerConfiguration` API that causes all job
registration calls to be rejected unless the request has a management
ACL token.
The `consul.client_auto_join` configuration block tells the Nomad
client whether to use Consul service discovery to find Nomad
servers. By default it is set to `true`, but contrary to the
documentation it was only respected during the initial client
registration. If a client missed a heartbeat, failed a
`Node.UpdateStatus` RPC, or if there was no Nomad leader, the client
would fallback to Consul even if `client_auto_join` was set to
`false`. This changeset returns early from the client's trigger for
Consul discovery if the `client_auto_join` field is set to `false`.
In the system scheduler, if a subset of clients are filtered by class,
we hit a code path where the `AllocMetric` has been copied, but the
`Copy` method does not instantiate the various maps. This leads to an
assignment to a nil map. This changeset ensures that the maps are
non-nil before continuing.
The `Copy` method relies on functions in the `helper` package that all
return nil slices or maps when passed zero-length inputs. This
changeset to fix the panic bug intentionally defers updating those
functions because it'll have potential impact on memory usage. See
https://github.com/hashicorp/nomad/issues/11564 for more details.
In the system scheduler, if a subset of clients are filtered by class,
we hit a code path where the `AllocMetric` has been copied, but the
`Copy` method does not instantiate the various maps. This leads to an
assignment to a nil map. This changeset ensures that the maps are
non-nil before continuing.
The `Copy` method relies on functions in the `helper` package that all
return nil slices or maps when passed zero-length inputs. This
changeset to fix the panic bug intentionally defers updating those
functions because it'll have potential impact on memory usage. See
https://github.com/hashicorp/nomad/issues/11564 for more details.
This change modifies the Nomad job register and deregister RPCs to
accept an updated option set which includes eval priority. This
param is optional and override the use of the job priority to set
the eval priority.
In order to ensure all evaluations as a result of the request use
the same eval priority, the priority is shared to the
allocReconciler and deploymentWatcher. This creates a new
distinction between eval priority and job priority.
The Nomad agent HTTP API has been modified to allow setting the
eval priority on job update and delete. To keep consistency with
the current v1 API, job update accepts this as a payload param;
job delete accepts this as a query param.
Any user supplied value is validated within the agent HTTP handler
removing the need to pass invalid requests to the server.
The register and deregister opts functions now all for setting
the eval priority on requests.
The change includes a small change to the DeregisterOpts function
which handles nil opts. This brings the function inline with the
RegisterOpts.
The QEMU driver allows arbitrary command line options, but many of
these options give access to host resources that operators may not
want to expose such as devices. Add an optional allowlist to the
plugin configuration so that operators can limit the resources for
QEMU.
* api: return 404 for alloc FS list/stat endpoints
If the alloc filesystem doesn't have a file requested by the List
Files or Stat File API, we currently return a HTTP 500 error with the
expected "file not found" error message. Return a HTTP 404 error
instead.
* update FS Handler
Previously the FS handler would interpret a 500 status as a 404
in the adapter layer by checking if the response body contained
the text or is the response status
was 500 and then throw an error code for 404.
Co-authored-by: Jai Bhagat <jaybhagat841@gmail.com>
go-getter 1.5.9 includes a patch in 1.5.6 that automatically unpacks
uncompressed tar archives. Previously Nomad only unpacked compressed
archives, but documented that it unpacked all archives.
* debug: refactor Consul API collection
* debug: refactor Vault API collection
* debug: cleanup test timing
* debug: extend test to multiregion
* debug: save cmdline flags in bundle
* debug: add cli version to output
* Add changelog entry
Enhance the CLI in order to return the host network in two flavors
(default, verbose) of the `node status` command.
Fixes: #11223.
Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
- Making RPC Upgrade mode reloadable.
- Add suggestions from code review
- remove spurious comment
- switch to require(t,...) form for test.
- Add to changelog
As we have continued to see reports of #9506 we need to elevate this log
line as it is the only way to detect when plans are being *erroneously*
rejected.
Users who see this log line repeatedly should drain and restart the node
in the log line. This seems to workaorund the issue.
Please post any details on #9506!
Log the failure error when the agent fails to start. Previously, the
agent startup failure error would be emitted to the command UI but not
logged. So it doesn't get emitted to syslog or `log_file` if they are
set, and it makes debugging much harder. Also, logging the error again
before exit makes the error more visible: previously, the operator
needed to scroll to the top to find the error.
On a sample failure, the output will look like:
```
==> WARNING: Bootstrap mode enabled! Potentially unsafe operation.
==> Loaded configuration from sample-configs/config-bad
==> Starting Nomad agent...
==> Error starting agent: setting up server node ID failed: mkdir /path-without-permission: read-only file system
2021-10-20T14:38:51.179-0400 [WARN] agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/path-without-permission/plugins
2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins
2021-10-20T14:38:51.181-0400 [DEBUG] agent.plugin_loader.docker: using client connection initialized from environment: plugin_dir=/path-without-permission/plugins
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=java type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=docker type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=mock_driver type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=exec type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [INFO] agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
2021-10-20T14:38:51.181-0400 [ERROR] agent: error starting agent: error="setting up server node ID failed: mkdir /path-without-permission: read-only file system"
```
This change adds the final `ERROR` message. It's easy to miss the `==>
Error starting agent` above.
The system scheduler should leave allocs on draining nodes as-is, but
stop node stop allocs on nodes that are no longer part of the job
datacenters.
Previously, the scheduler did not make the distinction and left system
job allocs intact if they are already running.
I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661 .
Fixes https://github.com/hashicorp/nomad/issues/11373
Pick up https://github.com/golang/snappy/pull/56 to handle arm64 architectures to fix panics. tldr; Golang 1.16 changed `memmove` implementation for arm64 requiring additional cpu registers that snappy wasn't preserving in its assembly implementation.
Other projects have experienced this issue as well, searching for `encode_arm64.s:666` on your favorite search engine will reveal some. Vault updated the dependency earlier this August: https://github.com/hashicorp/vault/pull/12371 .
I believe this issue affects Nomad 1.2.x and 1.1.x. Nomad 1.0.x use Golang 1.15 and isn't affected. However, backporting the change to 1.0.x should be harmless.
Fixed https://github.com/hashicorp/nomad/issues/11385 .
* Include region and namespace in CLI output
* Add region and prefix matching for server members
* Add namespace and region API outputs to cluster metadata folder
* Add region awareness to WaitForClient helper function
* Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice
* Refactor test client agent generation
* Add tests for region
* Add changelog
We see this error all the time
```
no handler registered for event
event.Message=, event.Annotations=, event.Timestamp=0001-01-01T00:00:00Z, event.TaskName=, event.AllocID=, event.TaskID=,
```
So we're handling an even with all default fields. I noted that this can
happen if only err is set as in
```
func (d *driverPluginClient) handleTaskEvents(reqCtx context.Context, ch chan *TaskEvent, stream proto.Driver_TaskEventsClient) {
defer close(ch)
for {
ev, err := stream.Recv()
if err != nil {
if err != io.EOF {
ch <- &TaskEvent{
Err: grpcutils.HandleReqCtxGrpcErr(err, reqCtx, d.doneCtx),
}
}
```
In this case Err fails to be serialized by the logger, see this test
```
ev := &drivers.TaskEvent{
Err: fmt.Errorf("errz"),
}
i.logger.Warn("ben test", "event", ev)
i.logger.Warn("ben test2", "event err str", ev.Err.Error())
i.logger.Warn("ben test3", "event err", ev.Err)
ev.Err = nil
i.logger.Warn("ben test4", "nil error", ev.Err)
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.643900Z","driver":"mock_driver","event":{"TaskID":"","TaskName":"","AllocID":"","Timestamp":"0001-01-01T00:00:00Z","Message":"","Annotations":null,"Err":{}}}
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test2","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644226Z","driver":"mock_driver","event err str":"errz"}
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test3","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644240Z","driver":"mock_driver","event err":"errz"}
2021-10-06T22:37:56.736Z INFO nomad.stdout {"@level":"warn","@message":"ben test4","@module":"client.driver_mgr","@timestamp":"2021-10-06T22:37:56.644252Z","driver":"mock_driver","nil error":null}
```
Note in the first example err is set to an empty object and the error is
lost.
What we want is the last two examples which call out the err field
explicitly so we can see what it is in this case