The `CSIVolume` struct has references to allocations that are "denormalized"; we
don't store them on the `CSIVolume` struct but hydrate them on read. Tests
detecting potential state store corruptions found two locations where we're not
copying the volume before denormalizing:
* When garbage collecting CSI volume claims.
* When checking if it's safe to force-deregister the volume.
There are no known user-visible problems associated with these bugs but both
have the potential of mutating volume claims outside of a FSM transaction. This
changeset also cleans up state mutations in some CSI tests so as to avoid having
working tests cover up potential future bugs.
Some of the core scheduler tests need the maximum batch size for writes to be
smaller than the usual `structs.MaxUUIDsPerWriteRequest`. But they do so by
unsafely modifying the global struct, which creates test flakes in other tests.
Modify the functions under test to take a batch size parameter. Production code
will pass the global while the tests can inject smaller values. Turn the
`structs.MaxUUIDsPerWriteRequest` into a constant, and add a semgrep rule for
avoiding this kind of thing in the future.
Prior to 2409f72 the code compared the modification index of a job to itself. Afterwards, the code compared the creation index of the job to itself. In either case there should never be a case of re-parenting of allocs causing the evaluation to trivially always result in false, which leads to unreclaimable memory.
Prior to this change allocations and evaluations for batch jobs were never garbage collected until the batch job was explicitly stopped. The new `batch_eval_gc_threshold` server configuration controls how often they are collected. The default threshold is `24h`.
In order to limit how much the rekey job can monopolize a scheduler worker, we
limit how long it can run to 1min before stopping work and emitting a new
eval. But this exactly matches the default nack timeout, so it'll fail the eval
rather than getting a chance to emit a new one.
Set the timeout for the rekey eval to half the configured nack timeout.
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.
Included in the replication fix:
* Refactor the replication loop so that each key replicated in a function call
that returns an error, to make the workflow more clear and reduce nesting. Log
the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
on initializing the keyring, so there's a race condition in the keyring tests
where we can test for the existence of the root key before the keyring has
been initialize. Change this to an "eventually" test.
But these fixes aren't enough to fix#14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
The configuration knobs for root keyring garbage collection are present in the
consumer and present in the user-facing config, but we missed the spot where we
copy from one to the other. Fix this so that users can set their own thresholds.
The root key is automatically rotated every ~30d, but the function that does
both rotation and key GC was wired up such that `nomad system gc` caused an
unexpected key rotation. Split this into two functions so that `nomad system gc`
cleans up old keys without forcing a rotation, which will be done periodially
or by the `nomad operator root keyring rotate` command.
* One-time tokens are not replicated between regions, so we don't want to enforce
that the version check across all of serf, just members in the same region.
* Scheduler: Disconnected clients handling is specific to a single region, so we
don't want to enforce that the version check across all of serf, just members in
the same region.
* Variables: enforce version check in Apply RPC
* Cleans up a bunch of legacy checks.
This changeset is specific to 1.4.x and the changes for previous versions of
Nomad will be manually backported in a separate PR.
Move conflict resolution implementation into the state store with a new Apply RPC.
This also makes the RPC for secure variables much more similar to Consul's KV,
which will help us support soft deletes in a post-1.4.0 version of Nomad.
Reimplement quotas in the state store functions.
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
Two new periodic core jobs have been added which handle removing
expired local and global tokens from state. The local core job is
run on every leader; the global core job is only run on the leader
within the authoritative region.
When the `Full` flag is passed for key rotation, we kick off a core
job to decrypt and re-encrypt all the secure variables so that they
use the new key.
Extend the GC job to support periodic key rotation.
Update the GC process to safely support signed workload identity. We
can't GC any key used to sign a workload identity. Finding which key
was used to sign every allocation will be expensive, but there are not
that many keys. This lets us take a conservative approach: find the
oldest live allocation and ensure that we don't GC any key older than
that key.
Almost all GC jobs check the index of the objects being GC'd to see if
they're older than a configured threshold. This code was repeated six
times in `CoreScheduler` with only logging changes, so it seems safe
to extract it as its own method.
These API endpoints now return results in chronological order. They
can return results in reverse chronological order by setting the
query parameter ascending=true.
- Eval.List
- Deployment.List
Non-CSI garbage collection tasks on the server only log the cutoff
index in the case where it's not a forced GC from `nomad system gc`.
Do the same for CSI for consistency.
* The volume claim GC method and volumewatcher both have logic
collecting terminal allocations that duplicates most of the logic
that's now in the state store's `CSIVolumeDenormalize` method. Copy
this logic into the state store so that all code paths have the same
view of the past claims.
* Remove logic in the volume claim GC that now lives in the state
store's `CSIVolumeDenormalize` method.
* Remove logic in the volumewatcher that now lives in the state
store's `CSIVolumeDenormalize` method.
* Remove logic in the node unpublish RPC that now lives in the state
store's `CSIVolumeDenormalize` method.
* csi: resolve invalid claim states on read
It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.
* csi: make feasibility check errors more understandable
When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes#2527
RPC endpoints for the user-driven APIs (`UpsertOneTimeToken` and
`ExchangeOneTimeToken`) and token expiration (`ExpireOneTimeTokens`).
Includes adding expiration to the periodic core GC job.
The unpublish workflow requires that we know the mode (RW vs RO) if we want to
unpublish the node. Update the hook and the Unpublish RPC so that we mark the
claim for release in a new state but leave the mode alone. This fixes a bug
where RO claims were failing node unpublish.
The core job GC doesn't know the mode, but we don't need it for that workflow,
so add a mode specifically for GC; the volumewatcher uses this as a sentinel
to check whether claims (with their specific RW vs RO modes) need to be claimed.
During CSI plugin GC, we don't return an error if the volume is in use,
because this is not an error condition. If we were to return an error during a
`nomad system gc`, we would not continue on to GC volumes.
But check for the specific error message fails if the GC is performed on a
worker rather than on the leader, due to RPC forwarding wrapping the error
message. Use a less specific test so that we don't return an error.
Add a Postrun hook to send the `CSIVolume.Unpublish` RPC to the server. This
may forward client RPCs to the node plugins or to the controller plugins,
depending on whether other allocations on this node have claims on this
volume.
By making clients responsible for running the `CSIVolume.Unpublish` RPC (and
making the RPC available to a `nomad volume detach` command), the
volumewatcher becomes only used by the core GC job and we no longer need
async volume GC from job deregister and node update.
This changeset implements a periodic garbage collection of CSI volumes
with missing allocations. This can happen in a scenario where a node
update fails partially and the allocation updates are written to raft
but the evaluations to GC the volumes are dropped. This feature will
cover this edge case and ensure that upgrades from 0.11.0 and 0.11.1
get any stray claims cleaned up.
This changeset implements a periodic garbage collection of unused CSI
plugins. Plugins are self-cleaning when the last allocation for a
plugin is stopped, but this feature will cover any missing edge cases
and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins
cleaned up.
This changeset adds a subsystem to run on the leader, similar to the
deployment watcher or node drainer. The `Watcher` performs a blocking
query on updates to the `CSIVolumes` table and triggers reaping of
volume claims.
This will avoid tying up scheduling workers by immediately sending
volume claim workloads into their own loop, rather than blocking the
scheduling workers in the core GC job doing things like talking to CSI
controllers
The volume watcher is enabled on leader step-up and disabled on leader
step-down.
The volume claim GC mechanism now makes an empty claim RPC for the
volume to trigger an index bump. That in turn unblocks the blocking
query in the volume watcher so it can assess which claims can be
released for a volume.
Adds a `CSIVolumeClaim` type to be tracked as current and past claims
on a volume. Allows for a client RPC failure during node or controller
detachment without having to keep the allocation around after the
first garbage collection eval.
This changeset lays groundwork for moving the actual detachment RPCs
into a volume watching loop outside the GC eval.
The current design of `ClientCSI` RPC requires that callers in the
server know about the free-standing `nodeForControllerPlugin`
function. This makes it difficult to send `ClientCSI` RPC messages
from subpackages of `nomad` and adds a bunch of boilerplate to every
server-side caller of a controller RPC.
This changeset makes it so that the `ClientCSI` RPCs will populate and
validate the controller's client node ID if it hasn't been passed by
the caller, centralizing the logic of picking and validating
controller targets into the `nomad.ClientCSI` struct.
The `Job.Deregister` call will block on the client CSI controller RPCs
while the alloc still exists on the Nomad client node. So we need to
make the volume claim reaping async from the `Job.Deregister`. This
allows `nomad job stop` to return immediately. In order to make this
work, this changeset changes the volume GC so that the GC jobs are on a
by-volume basis rather than a by-job basis; we won't have to query
the (possibly deleted) job at the time of volume GC. We smuggle the
volume ID and whether it's a purge into the GC eval ID the same way we
smuggled the job ID previously.
The CSI plugins uses the external volume ID for all operations, but
the Client CSI RPCs uses the Nomad volume ID (human-friendly) for the
mount paths. Pass the External ID as an arg in the RPC call so that
the unpublish workflows have it without calling back to the server to
find the external ID.
The controller CSI plugins need the CSI node ID (or in other words,
the storage provider's view of node ID like the EC2 instance ID), not
the Nomad node ID, to determine how to detach the external volume.
* nomad/state/state_store: error message copy/paste error
* nomad/structs/structs: add a VolumeEval to the JobDeregisterResponse
* nomad/job_endpoint: synchronously, volumeClaimReap on job Deregister
* nomad/core_sched: make volumeClaimReap available without a CoreSched
* nomad/job_endpoint: Deregister return early if the job is missing
* nomad/job_endpoint_test: job Deregistion is idempotent
* nomad/core_sched: conditionally ignore alloc status in volumeClaimReap
* nomad/job_endpoint: volumeClaimReap all allocations, even running
* nomad/core_sched_test: extra argument to collectClaimsToGCImpl
* nomad/job_endpoint: job deregistration is not idempotent
If a volume-claiming alloc stops and the CSI Node plugin that serves
that alloc's volumes is missing, there's no way for the allocrunner
hook to send the `NodeUnpublish` and `NodeUnstage` RPCs.
This changeset addresses this issue with a redesign of the client-side
for CSI. Rather than unmounting in the alloc runner hook, the alloc
runner hook will simply exit. When the server gets the
`Node.UpdateAlloc` for the terminal allocation that had a volume claim,
it creates a volume claim GC job. This job will made client RPCs to a
new node plugin RPC endpoint, and only once that succeeds, move on to
making the client RPCs to the controller plugin. If the node plugin is
unavailable, the GC job will fail and be requeued.
* nomad/state/schema: use the namespace compound index
* scheduler/scheduler: CSIVolumeByID interface signature namespace
* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace
* scheduler/feasible: pass the captured namespace to CSIVolumeByID
* nomad/state/state_store: use namespace in csi_volume index
* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim
* nomad/core_sched: pass the namespace in volumeClaimReap
* nomad/node_endpoint_test: namespaces in Claim testing
* nomad/csi_endpoint: pass RequestNamespace to state.*
* nomad/csi_endpoint_test: appropriately failed test
* command/alloc_status_test: appropriately failed test
* node_endpoint_test: avoid notTheNamespace for the job
* scheduler/feasible_test: call SetJob to capture the namespace
* nomad/csi_endpoint: ACL check the req namespace, query by namespace
* nomad/state/state_store: remove deregister namespace check
* nomad/state/state_store: remove unused CSIVolumes
* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace
* nomad/csi_endpoint: ACL check
* nomad/state/state_store_test: remove call to state.CSIVolumes
* nomad/core_sched_test: job namespace match so claim gc works
When an alloc is marked terminal (and after node unstage/unpublish
have been called), the client syncs the terminal alloc state with the
server via `Node.UpdateAlloc RPC`.
For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC
handler at the server will emit an eval for a new core job to garbage
collect CSI volume claims. When this eval is handled on the core
scheduler, it will call a `volumeReap` method to release the claims
for all terminal allocs on the job.
The volume reap will issue a `ControllerUnpublishVolume` RPC for any
node that has no alloc claiming the volume. Once this returns (or
is skipped), the volume reap will send a new `CSIVolume.Claim` RPC
that releases the volume claim for that allocation in the state store,
making it available for scheduling again.
This same `volumeReap` method will be called from the core job GC,
which gives us a second chance to reclaim volumes during GC if there
were controller RPC failures.
This changeset adds a new core job `CoreJobCSIVolumePublicationGC` to
the leader's loop for scheduling core job evals. Right now this is an
empty method body without even a config file stanza. Later changesets
will implement the logic of volume publication GC.