Adds a new client configuration that optionally allows clients to drain their
workloads on shutdown. The client sends the `Node.UpdateDrain` RPC targeting
itself and then monitors the drain state as seen by the server until the drain
is complete or the deadline expires. If it loses connection with the server, it
will monitor local client status instead to ensure allocations are stopped
before exiting.
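A minimal sketch of that flow using the Go `api` package (the node ID and
error handling are assumptions; the real client does this internally):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, _ := api.NewClient(api.DefaultConfig())
	nodeID := "0b62d797-6f2e-4dab-9fd3-30b4fa4cbb99" // assumed: the client's own node ID

	// Ask the servers to drain this node, mirroring `Node.UpdateDrain`.
	spec := &api.DrainSpec{Deadline: 5 * time.Minute}
	resp, _ := client.Nodes().UpdateDrain(nodeID, spec, false, nil)

	// Monitor the server-side view of the drain until it completes or the
	// deadline expires; on connection loss a client would fall back to
	// watching local alloc status instead.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()
	for msg := range client.Nodes().MonitorDrain(ctx, nodeID, resp.LastIndex, false) {
		fmt.Println(msg)
	}
}
```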
If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.
This changeset updates the behavior to wait until all drained allocs have a
terminal client status before marking the node as done draining.
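A hedged sketch of the new completion check (the helper name is an
assumption; `ClientTerminalStatus` is the `structs` helper that is true for
complete, failed, or lost):

```go
package drainer

import "github.com/hashicorp/nomad/nomad/structs"

// allAllocsTerminal reports whether every drained alloc has reached a
// terminal client status, so the node can be marked done draining.
func allAllocsTerminal(drained []*structs.Allocation) bool {
	for _, alloc := range drained {
		if !alloc.ClientTerminalStatus() {
			return false
		}
	}
	return true
}
```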
The `-deadline` and `-force` flags for the `nomad node drain` command only cause
the draining to ignore the `migrate` block's healthy deadline, max parallel,
etc. These flags don't have anything to do with the `kill_timeout` or
`shutdown_delay` options of the jobspec.
This changeset fixes the skipped E2E tests so that they validate the intended
behavior, and updates the docs for more clarity.
* [no ci] deps: update docker to 23.0.3
This PR brings our docker/docker dependency (which is hosted at github.com/moby/moby)
up to 23.0.3 (forward about 2 years). Refactored our use of docker/libnetwork to
reference the package in its new home, which is docker/docker/libnetwork (it is
no longer an independent repository). Some minor nearby test case cleanup as well.
* add changelog entry
* func: add namespace support for list deployment
* func: add wildcard to namespace filter for deployments (see the sketch after this list)
* Update deployment_endpoint.go
* style: use must instead of require or assert
* style: rename paginator to avoid clash with import
* style: add changelog entry
* fix: add missing parameter for upsert jobs
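A quick hedged sketch of the wildcard namespace filter from the Go `api`
client (setup and output are assumptions):

```go
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, _ := api.NewClient(api.DefaultConfig())

	// "*" matches all namespaces the token is allowed to read.
	deployments, _, err := client.Deployments().List(&api.QueryOptions{Namespace: "*"})
	if err != nil {
		panic(err)
	}
	for _, d := range deployments {
		fmt.Println(d.Namespace, d.JobID, d.Status)
	}
}
```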
msgpackrpc codec handles are specific to a connection and cannot be shared
between goroutines; sharing one can corrupt decoding. Fix the drainer
integration test so that we create separate codecs for the goroutines that the
test helper spins up to simulate client updates.
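A hedged sketch of the fix's shape: each goroutine dials its own connection
and builds its own codec instead of sharing one (the RPC address is an
assumption):

```go
package main

import (
	"net"
	"net/rpc"
	"sync"

	msgpackrpc "github.com/hashicorp/net-rpc-msgpackrpc"
)

func main() {
	serverAddr := "127.0.0.1:4647" // assumed RPC address
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			conn, err := net.Dial("tcp", serverAddr)
			if err != nil {
				return
			}
			// One codec per goroutine; sharing it across goroutines is
			// exactly the bug being fixed.
			codec := msgpackrpc.NewClientCodec(conn)
			client := rpc.NewClientWithCodec(codec)
			defer client.Close()
			// ... issue this goroutine's RPCs via client.Call(...) ...
		}()
	}
	wg.Wait()
}
```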
This changeset also refactors the drainer integration test to bring it up to
current idioms and library usages, make assertions more clear, and reduce
duplication.
The first start of a Consul Connect proxy sidecar triggers a run
of the envoy_version hook which modifies the task config image
entry. The modification takes into account a number of factors to
correctly populate this. Importantly, once the hook has run, it
marks itself as done so the taskrunner will not execute it again.
When the client receives a non-destructive update for the
allocation which the proxy sidecar is a member of, it updates
and overwrites the task definition within the taskrunner. In doing
so it overwrites the modification performed by the hook. If the
allocation is restarted, the envoy_version hook is skipped because
it previously marked itself as done, so the sidecar
config image is incorrect and causes a driver error.
The fix stops the hook from marking itself as done in the view of
the taskrunner, so the image modification is reapplied on restart.
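A standalone, illustrative sketch of the "done" semantics involved; the
types here are stand-ins, not the real taskrunner interfaces, and the
resolved image is just an example value:

```go
package main

import "fmt"

type prestartResponse struct {
	Done bool
}

type task struct {
	Config map[string]interface{}
}

func envoyVersionPrestart(t *task, resp *prestartResponse) {
	// Resolve the sidecar image placeholder into a concrete tag.
	t.Config["image"] = "envoyproxy/envoy:v1.25.1" // example value
	// Before the fix the hook set resp.Done = true here, so a later
	// non-destructive alloc update could overwrite Config["image"] and the
	// now-skipped hook would never repair it. The fix leaves Done false.
}

func main() {
	t := &task{Config: map[string]interface{}{"image": "envoyproxy/envoy:v${NOMAD_envoy_version}"}}
	envoyVersionPrestart(t, &prestartResponse{})
	fmt.Println(t.Config["image"])
}
```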
* api: enable support for setting original source alongside job
This PR adds support for setting job source material along with
the registration of a job.
This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.
The job source, if submitted (doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur the overhead of reading the
large string field during normal job operations.
The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.
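A minimal sketch of fetching a job's original source via the new HTTP
endpoint; the address, job ID, and version are assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// GET /v1/job/<id>/submission?version=<version>
	url := "http://127.0.0.1:4646/v1/job/example/submission?version=0"
	resp, err := http.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON-encoded job submission
}
```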
* api: avoid writing var content to disk for parsing
* api: move submission validation into RPC layer
* api: return an error if updating a job submission without namespace or job id
* api: be exact about the job index we associate a submission with (modify)
* api: reword api docs scheduling
* api: prune all but the last 6 job submissions
* api: protect against nil job submission in job validation
* api: set max job source size in test server
* api: fixups from pr
A new WaitForPlugin() is called during csiHook.Prerun,
so that on startup, clients can recover running
tasks that use CSI volumes, instead of terminating and
rescheduling them because they need a node plugin that
is "not found" *yet*, only because the plugin task has
not yet been recovered.
The `ephemeral_disk` block's `migrate` field allows for best-effort migration of
the ephemeral disk data to new nodes. The documentation says the `migrate` field
is only respected if `sticky=true`, but in fact if client ACLs are not set the
data is migrated even if `sticky=false`.
The existing behavior when client ACLs are disabled has existed since the early
implementation, so "fixing" that case now would silently break backwards
compatibility. Additionally, having `migrate` not imply `sticky` seems
nonsensical: it suggests that if we place on a new node we migrate the data but
if we place on the same node, we throw the data away!
Update so that `migrate=true` implies `sticky=true` as follows:
* The failure mode when client ACLs are enabled comes from the server not passing
along a migration token. Update the server so that it provides a migration
token whenever `migrate=true`, not just when `sticky=true`.
* Update the scheduler so that `migrate` implies `sticky`.
* Update the client so that we check for `migrate || sticky` where appropriate
(see the sketch after this list).
* Refactor the E2E tests to move them off the old framework and make the intention
of the test more clear.
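A hedged sketch of the client-side check; the `EphemeralDisk` fields match
the `structs` package, but the helper name is an assumption:

```go
package client

import "github.com/hashicorp/nomad/nomad/structs"

// shouldPreserveDisk reports whether ephemeral disk data should be kept on
// the same node or migrated to a new one: migrate now implies sticky.
func shouldPreserveDisk(d *structs.EphemeralDisk) bool {
	if d == nil {
		return false
	}
	return d.Sticky || d.Migrate
}
```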
While working on several open drain issues, I'm fixing up the E2E tests. This
subset of tests being refactored are existing ones that already work. I'm
shipping these as their own PR to keep review sizes manageable when I push up
PRs in the next few days for #9902, #12314, and #12915.
The e2e/acl package has some nice helpers for tracking and cleaning up ACL
objects, but they are currently private. Export them so I can abuse them in
other e2e tests.
This changeset provides a matrix test of ACL enforcement across several
dimensions:
* anonymous vs bogus vs valid tokens
* permitted vs not permitted by policy
* request sent to server vs sent to client (and forwarded)
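A hedged sketch of that matrix as a table-driven Go test; the helper logic
and token values are assumptions:

```go
package acl

import "testing"

func TestACLEnforcementMatrix(t *testing.T) {
	testCases := []struct {
		name      string
		token     string // anonymous (""), bogus, or valid
		permitted bool   // does the policy allow the request?
		toClient  bool   // send to client (and forward) instead of server
	}{
		{"anonymous/server", "", false, false},
		{"bogus/client", "00000000-bad0-bad0-bad0-000000000000", false, true},
		{"valid-permitted/server", "a1b2c3d4-0000-0000-0000-000000000000", true, false}, // hypothetical valid token
		// ... remaining combinations ...
	}
	for _, tc := range testCases {
		tc := tc
		t.Run(tc.name, func(t *testing.T) {
			// send the request with tc.token to server or client per
			// tc.toClient, then assert permitted requests succeed and
			// everything else is rejected
		})
	}
}
```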
The vSphere plugin is exclusive to k8s because it relies on k8s-APIs (and
crashes without them being present). Upstream unfortunately will not support
Nomad, so we shouldn't refer to it in our concept docs here.
Nomad's security model requires mTLS in order to secure client-to-server and
server-to-server communications. Configuring ACLs alone is not enough. Loudly
warn the user if mTLS is not configured in non-dev modes.
Requests without an ACL token that pass through the client's HTTP API are treated
as though they come from the client itself. This allows bypass of ACLs on RPC
requests where ACL permissions are checked (like `Job.Register`). Invalid tokens
are correctly rejected.
Fix the bypass by only setting a client ID on the identity if we have a valid node secret.
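A heavily hedged sketch of the fix's shape; the lookup, types, and identity
fields are assumptions, not Nomad's exact auth code:

```go
package auth

// Hedged sketch types; Nomad's real identity plumbing differs.
type Node struct{ ID string }

type StateLookup interface {
	NodeBySecretID(secretID string) (*Node, error)
}

type Identity struct{ ClientID string }

// resolveClientIdentity only marks a request as coming from a client when
// the supplied secret maps to a known node.
func resolveClientIdentity(state StateLookup, secretID string, identity *Identity) {
	if secretID == "" {
		return // no token: treat as anonymous, never as the client itself
	}
	node, err := state.NodeBySecretID(secretID) // hypothetical lookup
	if err != nil || node == nil {
		return // invalid secret: not a trusted client
	}
	identity.ClientID = node.ID // only set for a valid node secret
}
```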
Note that this changeset will break rate metrics for RPCs sent by clients
without a client secret such as `Node.GetClientAllocs`; these requests will be
recorded as anonymous.
Future work should:
* Ensure the node secret is sent with all client-driven RPCs except
`Node.Register` which is TOFU.
* Create a new `acl.ACL` object from client requests so that we
can enforce ACLs for all endpoints in a uniform way that's less error-prone.
These endpoints all refer to JobID by the time you get to the RPC request
layer, but the HTTP handler functions name the field JobName, which is a
different field (though often with the same value).
This update changes the behaviour when following logs from an
allocation, so that both the stdout and stderr files are streamed when
the operator supplies the follow flag. The previous behaviour is
preserved for all other flags and situations.
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
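A hedged sketch of what following both streams looks like via the Go `api`
package; the alloc ID and task name are assumptions:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, _ := api.NewClient(api.DefaultConfig())
	alloc, _, err := client.Allocations().Info("8ba85cef-26b9-4e55-a107-1f1f0d5c3b88", nil) // assumed alloc ID
	if err != nil {
		panic(err)
	}

	// Follow both log types, mirroring the new -f behaviour.
	cancel := make(chan struct{})
	stdout, outErrs := client.AllocFS().Logs(alloc, true, "redis", "stdout", "end", 0, cancel, nil)
	stderr, errErrs := client.AllocFS().Logs(alloc, true, "redis", "stderr", "end", 0, cancel, nil)

	for {
		select {
		case f := <-stdout:
			if f != nil {
				fmt.Print(string(f.Data))
			}
		case f := <-stderr:
			if f != nil {
				fmt.Print(string(f.Data))
			}
		case err := <-outErrs:
			panic(err)
		case err := <-errErrs:
			panic(err)
		}
	}
}
```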
The allocrunner has a facility for passing data written by allocrunner hooks to
taskrunner hooks. Currently the only consumers of this facility are the
allocrunner CSI hook (which writes data) and the taskrunner volume hook (which
reads that same data).
The allocrunner hook for CSI volumes doesn't set the alloc hook resources
atomically. Instead, it gets the current resources and then writes a new version
back. Because the CSI hook is currently the only writer and all readers happen
long afterwards, this should be safe but #16623 shows there's some sequence of
events during restore where this breaks down.
Refactor hook resources so that hook data is accessed via setters and getters
that hold the mutex.
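A hedged sketch of the refactor: hook data moves behind a mutex-holding
getter and setter instead of a read-modify-write on a shared struct. The
mount type here is a stand-in for the real CSI mount info:

```go
package structs

import "sync"

type MountInfo struct{ Source string }

type AllocHookResources struct {
	csiMounts map[string]*MountInfo
	mu        sync.RWMutex
}

// SetCSIMounts stores the CSI mounts written by the allocrunner CSI hook.
func (a *AllocHookResources) SetCSIMounts(m map[string]*MountInfo) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.csiMounts = m
}

// GetCSIMounts returns the CSI mounts for taskrunner hooks to read.
func (a *AllocHookResources) GetCSIMounts() map[string]*MountInfo {
	a.mu.RLock()
	defer a.mu.RUnlock()
	return a.csiMounts
}
```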
The workflow described in the docs for apt installation is deprecated. Update to
match the workflow described in the Tutorials and official packaging guide.