When claiming a CSI volume, we need to ensure the CSI node plugin is running
before we send any CSI RPCs. This extends even to the controller publish RPC
because it requires the storage provider's "external node ID" for the
client. This primarily impacts client restarts but also is a problem if the node
plugin exits (and fingerprints) while the allocation that needs a CSI volume
claim is being placed.
Unfortunately there's no mapping of volume to plugin ID available in the
jobspec, so we don't have enough information to wait on plugins until we either
get the volume from the server or retrieve the plugin ID from data we've
persisted on the client.
If we always require getting the volume from the server before making the claim,
a client restart for disconnected clients will cause all the allocations that
need CSI volumes to fail. Even while connected, checking in with the server to
verify the volume's plugin before trying to make a claim RPC is inherently racy,
so we'll leave that case as-is and it will fail the claim if the node plugin
needed to support a newly-placed allocation is flapping such that the node
fingerprint is changing.
This changeset persists a minimum subset of data about the volume and its plugin
in the client state DB, and retrieves that data during the CSI hook's prerun to
avoid re-claiming and remounting the volume unnecessarily.
This changeset also updates the RPC handler to use the external node ID from the
claim whenever it is available.
Fixes: #13028
In Nomad 1.5.3 we fixed a security bug that allowed bypass of ACL checks if the
request came thru a client node first. But this fix broke (knowingly) the
identification of many client-to-server RPCs. These will be now measured as if
they were anonymous. The reason for this is that many client-to-server RPCs do
not send the node secret and instead rely on the protection of mTLS.
This changeset ensures that the node secret is being sent with every
client-to-server RPC request. In a future version of Nomad we can add
enforcement on the server side, but this was left out of this changeset to
reduce risks to the safe upgrade path.
Sending the node secret as an auth token introduces a new problem during initial
introduction of a client. Clients send many RPCs concurrently with
`Node.Register`, but until the node is registered the node secret is unknown to
the server and will be rejected as invalid. This causes permission denied
errors.
To fix that, this changeset introduces a gate on having successfully made a
`Node.Register` RPC before any other RPCs can be sent (except for `Status.Ping`,
which we need earlier but which also ignores the error because that handler
doesn't do an authorization check). This ensures that we only send requests with
a node secret already known to the server. This also makes client startup a
little easier to reason about because we know `Node.Register` must succeed
first, and it should make for a good place to hook in future plans for secure
introduction of nodes. The tradeoff is that an existing client that has running
allocs will take slightly longer (a second or two) to transition to ready after
a restart, because the transition in `Node.UpdateStatus` is gated at the server
by first submitting `Node.UpdateAlloc` with client alloc updates.
When client nodes are restarted, all allocations that have been scheduled on the
node have their modify index updated, including terminal allocations. There are
several contributing factors:
* The `allocSync` method that updates the servers isn't gated on first contact
with the servers. This means that if a server updates the desired state while
the client is down, the `allocSync` races with the `Node.ClientGetAlloc`
RPC. This will typically result in the client updating the server with "running"
and then immediately thereafter "complete".
* The `allocSync` method unconditionally sends the `Node.UpdateAlloc` RPC even
if it's possible to assert that the server has definitely seen the client
state. The allocrunner may queue-up updates even if we gate sending them. So
then we end up with a race between the allocrunner updating its internal state
to overwrite the previous update and `allocSync` sending the bogus or duplicate
update.
This changeset adds tracking of server-acknowledged state to the
allocrunner. This state gets checked in the `allocSync` before adding the update
to the batch, and updated when `Node.UpdateAlloc` returns successfully. To
implement this we need to be able to equality-check the updates against the last
acknowledged state. We also need to add the last acknowledged state to the
client state DB, otherwise we'd drop unacknowledged updates across restarts.
The client restart test has been expanded to cover a variety of allocation
states, including allocs stopped before shutdown, allocs stopped by the server
while the client is down, and allocs that have been completely GC'd on the
server while the client is down. I've also bench tested scenarios where the task
workload is killed while the client is down, resulting in a failed restore.
Fixes#16381
Fixes#14617
Dynamic Node Metadata allows Nomad users, and their jobs, to update Node metadata through an API. Currently Node metadata is only reloaded when a Client agent is restarted.
Includes new UI for editing metadata as well.
---------
Co-authored-by: Phil Renaud <phil.renaud@hashicorp.com>
This PR adds support for specifying checks in services registered to
the built-in nomad service provider.
Currently only HTTP and TCP checks are supported, though more types
could be added later.
In order to minimize this change while keeping a simple version of the
behavior, we set `lastOk` to the current time less the intial server
connection timeout. If the client starts and never contacts the
server, it will stop all configured tasks after the initial server
connection grace period, on the assumption that we've been out of
touch longer than any configured `stop_after_client_disconnect`.
The more complex state behavior might be justified later, but we
should learn about failure modes first.
- track lastHeartbeat, the client local time of the last successful
heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs tracking when allocs are recovered from disk
- on client restart, stop those allocs after a grace period if the
servers are still partioned
In order to correctly fingerprint dynamic plugins on client restarts,
we need to persist a handle to the plugin (that is, connection info)
to the client state store.
The dynamic registry will sync automatically to the client state
whenever it receives a register/deregister call.
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incomming fingerprint and task events from each driver. The mananger
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.
Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint mananger and has been removed.
Introduce a device manager that manages the lifecycle of device plugins
on the client. It fingerprints, collects stats, and forwards Reserve
requests to the correct plugin. The manager, also handles device plugins
failing and validates their output.
* Migrated all of the old leader task tests and got them passing
* Refactor and consolidate task killing code in AR to always kill leader
tasks first
* Fixed lots of issues with state restoring
* Fixed deadlock in AR.Destroy if AR.Run had never been called
* Added a new in memory statedb for testing
* Stopping an alloc is implemented via Updates but update hooks are
*not* run.
* Destroying an alloc is a best effort cleanup.
* AllocRunner destroy hooks implemented.
* Disk migration and blocking on a previous allocation exiting moved to
its own package to avoid cycles. Now only depends on alloc broadcaster
instead of also using a waitch.
* AllocBroadcaster now only drops stale allocations and always keeps the
latest version.
* Made AllocDir safe for concurrent use
Lots of internal contexts that are currently unused. Unsure if they
should be used or removed.