In #12112 and #12113 we solved for the problem of races in releasing
volume claims, but there was a case that we missed. During a node
drain with a controller attach/detach, we can hit a race where we call
controller publish before the unpublish has completed. This is
discouraged in the spec but plugins are supposed to handle it
safely. But if the storage provider's API is slow enough and the
plugin doesn't handle the case safely, the volume can get "locked"
into a state where the provider's API won't detach it cleanly.
Check the claim before making any external controller publish RPC
calls so that Nomad is responsible for the canonical information about
whether a volume is currently claimed.
This has a couple side-effects that also had to get fixed here:
* Changing the order means that the volume will have a past claim
without a valid external node ID because it came from the client, and
this uncovered a separate bug where we didn't assert the external node
ID was valid before returning it. Fallthrough to getting the ID from
the plugins in the state store in this case. We avoided this
originally because of concerns around plugins getting lost during node
drain but now that we've fixed that we may want to revisit it in
future work.
* We should make sure we're handling `FailedPrecondition` cases from
the controller plugin the same way we handle other retryable cases.
* Several tests had to be updated because they were assuming we fail
in a particular order that we're no longer doing.
Listing snapshots was incorrectly returning nanoseconds instead of
seconds, and formatting of timestamps both list and create snapshot
was treating the timestamp as though it were nanoseconds instead of
seconds. This resulted in create timestamps always being displayed as
zero values.
Fix the unit conversion error in the command line and the incorrect
extraction in the CSI plugin client code. Beef up the unit tests to
make sure this code is actually exercised.
CSI `CreateVolume` RPC is idempotent given that the topology,
capabilities, and parameters are unchanged. CSI volumes have many
user-defined fields that are immutable once set, and many fields that
are not user-settable.
Update the `Register` RPC so that updating a volume via the API merges
onto any existing volume without touching Nomad-controlled fields,
while validating it with the same strict requirements expected for
idempotent `CreateVolume` RPCs.
Also, clarify that this state store method is used for everything, not just
for the `Register` RPC.
The `CreateSnapshot` RPC expects a plugin ID to be set by the API, but
in the common case of the `nomad volume snapshot create` command, we
don't ask the user for the plugin ID because it's available from the
volume we're snapshotting.
Change the order of the RPC so that we get the volume first and then
use the volume's plugin ID for the plugin if the API didn't set the
value.
The `CSIPlugin.List` RPC was intended to accept a prefix to filter the
list of plugins being listed. This was being accidentally being done
in the state store instead, which contributed to incorrect filtering
behavior for plugins in the `volume plugin status` command.
Move the prefix matching into the RPC so that it calls the
prefix-matching method in the state store if we're looking for a
prefix.
Update the `plugin status command` to accept a prefix for the plugin
ID argument so that it matches the expected behavior of other commands.
When using a prefix value and the * wildcard for namespace, the endpoint
would not take the prefix value into consideration due to the order in
which the checks were executed but also the logic for retrieving volumes
from the state store.
This commit changes the order to check for a prefix first and wraps the
result iterator of the state store query in a filter to apply the
prefix.
* Remove redundant schedulable check in `FreeWriteClaims`. If a volume
has been created but not yet claimed, its capabilities will be checked
in `WriteSchedulable` at both scheduling time and claim time. We don't
need to also check them in the `FreeWriteClaims` method.
* Enforce maximum volume claims for writers.
When the scheduler checks feasibility for CSI volumes, the check is
fairly loose: earlier versions of the same job are not counted as
active claims. This allows the scheduler to place new allocations
for the new version of a job, under the assumption that we'll replace
the existing allocations and their volume claims.
But when the alloc runner claims the volume, we need to enforce the
active claims even if they're for allocations of an earlier version of
the job. Otherwise we'll try to mount a volume that's currently being
unmounted, and this will cause replacement allocations to frequently
fail.
* Enforce single-node reader check for read-only volumes. When the
alloc runner makes a claim for a read-only volume, we only check that
the volume is potentially schedulable and not that it actually has
free read claims.
The plugin stub object does not include fine-grained capability checks, which
means `nomad volume status -verbose` will return ugly and verbose error
"Unimplemented" messages from the plugin if it does not support the CSI
`ListVolumes` RPC. Return a nicer error message from our RPC handler instead.
In order to support new controller RPCs, we need to fingerprint volume
capabilities in more detail and perform controller RPCs only when the specific
capability is present. This fixes a bug in Ceph support where the plugin can
only suport create/delete but we assume that it also supports attach/detach.
The CSI specification requires that we validate a list of `Capability` (access
mode + accessibility) when we create volume, but the existing volume
registration workflow incorrectly validates a single capability. The
specific capability required by a volume claim is checked at the time we make
the claim, so remove the check for `AttachmentMode`/`AcccessMode`.
* upsertaclpolicies
* delete acl policies msgtype
* upsert acl policies msgtype
* delete acl tokens msgtype
* acl bootstrap msgtype
wip unsubscribe on token delete
test that subscriptions are closed after an ACL token has been deleted
Start writing policyupdated test
* update test to use before/after policy
* add SubscribeWithACLCheck to run acl checks on subscribe
* update rpc endpoint to use broker acl check
* Add and use subscriptions.closeSubscriptionFunc
This fixes the issue of not being able to defer unlocking the mutex on
the event broker in the for loop.
handle acl policy updates
* rpc endpoint test for terminating acl change
* add comments
Co-authored-by: Kris Hicks <khicks@hashicorp.com>
The CSIVolume struct "denormalizes" allocations when it's first queried from
the state store. The CSIVolumeByID method on the state store copies the volume
before denormalizing so that we don't end up with unexpected changes. The
copying has some subtle bugs that meant that Allocations (as well as
Topologies and MountOptions) were not getting copied when expected.
Also, ensure we never write allocations attached to volumes to the state store
during claims.
* use msgtype in upsert node
adds message type to signature for upsert node, update tests, remove placeholder method
* UpsertAllocs msg type test setup
* use upsertallocs with msg type in signature
update test usage of delete node
delete placeholder msgtype method
* add msgtype to upsert evals signature, update test call sites with test setup msg type
handle snapshot upsert eval outside of FSM and ignore eval event
remove placeholder upsertevalsmsgtype
handle job plan rpc and prevent event creation for plan
msgtype cleanup upsertnodeevents
updatenodedrain msgtype
msg type 0 is a node registration event, so set the default to the ignore type
* fix named import
* fix signature ordering on upsertnode to match
Volumes using attachment mode `file-system` use the CSI filesystem API when
they're mounted, and can be passed mount options. But `block-device` mode
volumes don't have this option. When RPCs are made to plugins, we are silently
dropping the mount options we don't expect to see, but this results in a poor
operator experience when the mount options aren't honored. This changeset
makes passing mount options to a `block-device` volume a validation error.
Fixes a bug where CSI volumes with the `MULTI_NODE_MULTI_WRITER` access mode
were using the same logic as `MULTI_NODE_SINGLE_WRITER` to determine whether
the volume had writer claims available for scheduling.
Extends CSI claim endpoint test to exercise multi-reader and make sure `WriteFreeClaims`
is exercised for multi-writer in feasibility test.
The soundness guarantees of the CSI specification leave a little to be desired
in our ability to provide a 100% reliable automated solution for managing
volumes. This changeset provides a new command to bridge this gap by providing
the operator the ability to intervene.
The command doesn't take an allocation ID so that the operator doesn't have to
keep track of alloc IDs that may have been GC'd. Handle this case in the
unpublish RPC by sending the client RPC for all the terminal/nil allocs on the
selected node.
CSI plugins can require credentials for some publishing and
unpublishing workflow RPCs. Secrets are configured at the time of
volume registration, stored in the volume struct, and then passed
around as an opaque map by Nomad to the plugins.
This changeset implements a periodic garbage collection of unused CSI
plugins. Plugins are self-cleaning when the last allocation for a
plugin is stopped, but this feature will cover any missing edge cases
and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins
cleaned up.
In this changeset:
* If a Nomad client node is running both a controller and a node
plugin (which is a common case), then if only the controller or the
node is removed, the plugin was not being updated with the correct
counts.
* The existing test for plugin cleanup didn't go back to the state
store, which normally is ok but is complicated in this case by
denormalization which changes the behavior. This commit makes the
test more comprehensive.
* Set "controller required" when plugin has `PUBLISH_READONLY`. All
known controllers that support `PUBLISH_READONLY` also support
`PUBLISH_UNPUBLISH_VOLUME` but we shouldn't assume this.
* Only create plugins when the allocs for those plugins are
healthy. If we allow a plugin to be created for the first time when
the alloc is not healthy, then we'll recreate deleted plugins when
the job's allocs all get marked terminal.
* Terminal plugin alloc updates should cleanup the plugin. The client
fingerprint can't tell if the plugin is unhealthy intentionally (for
the case of updates or job stop). Allocations that are
server-terminal should delete themselves from the plugin and trigger
a plugin self-GC, the same as an unused node.
The `CSIVolumeClaim` fields were added after 0.11.1, so claims made
before that may be missing the value. Repair this when we read the
volume out of the state store.
The `NodeID` field was added after 0.11.0, so we need to ensure it's
been populated during upgrades from 0.11.0.
Follow-up for a method missed in the refactor for #7688. The
`volAndPluginLookup` method is only ever called from the server's `CSI`
RPC and never the `ClientCSI` RPC, so move it into that scope.
The current design of `ClientCSI` RPC requires that callers in the
server know about the free-standing `nodeForControllerPlugin`
function. This makes it difficult to send `ClientCSI` RPC messages
from subpackages of `nomad` and adds a bunch of boilerplate to every
server-side caller of a controller RPC.
This changeset makes it so that the `ClientCSI` RPCs will populate and
validate the controller's client node ID if it hasn't been passed by
the caller, centralizing the logic of picking and validating
controller targets into the `nomad.ClientCSI` struct.
If a volume-claiming alloc stops and the CSI Node plugin that serves
that alloc's volumes is missing, there's no way for the allocrunner
hook to send the `NodeUnpublish` and `NodeUnstage` RPCs.
This changeset addresses this issue with a redesign of the client-side
for CSI. Rather than unmounting in the alloc runner hook, the alloc
runner hook will simply exit. When the server gets the
`Node.UpdateAlloc` for the terminal allocation that had a volume claim,
it creates a volume claim GC job. This job will made client RPCs to a
new node plugin RPC endpoint, and only once that succeeds, move on to
making the client RPCs to the controller plugin. If the node plugin is
unavailable, the GC job will fail and be requeued.
* nomad/state/state_store: enforce that the volume namespace exists
* nomad/csi_endpoint_test: a couple of broken namespaces now
* nomad/csi_endpoint_test: one more test
* nomad/node_endpoint_test: use structs.DefaultNamespace
* nomad/state/state_store_test: use DefaultNamespace
* nomad/structs/csi: delete just one plugin type from a node
* nomad/structs/csi: add DeleteAlloc
* nomad/state/state_store: add deleteJobFromPlugin
* nomad/state/state_store: use DeleteAlloc not DeleteNodeType
* move CreateTestCSIPlugin to state to avoid an import cycle
* nomad/state/state_store_test: delete a plugin by deleting its jobs
* nomad/*_test: move CreateTestCSIPlugin to state
* nomad/state/state_store: update one plugin per transaction
* command/plugin_status_test: move CreateTestCSIPlugin
* nomad: csi: handle nils CSIPlugin methods, clarity
* nomad/structs/csi: split CanWrite into health, in use
* scheduler/scheduler: expose AllocByID in the state interface
* nomad/state/state_store_test
* scheduler/stack: SetJobID on the matcher
* scheduler/feasible: when a volume writer is in use, check if it's us
* scheduler/feasible: remove SetJob
* nomad/state/state_store: denormalize allocs before Claim
* nomad/structs/csi: return errors on claim, with context
* nomad/csi_endpoint_test: new alloc doesn't look like an update
* nomad/state/state_store_test: change test reference to CanWrite
* acl/policy: add the volume ACL policies
* nomad/csi_endpoint: enforce ACLs for volume access
* nomad/search_endpoint_oss: volume acls
* acl/acl: add plugin read as a global policy
* acl/policy: add PluginPolicy global cap type
* nomad/csi_endpoint: check the global plugin ACL policy
* nomad/mock/acl: PluginPolicy
* nomad/csi_endpoint: fix list rebase
* nomad/core_sched_test: new test since #7358
* nomad/csi_endpoint_test: use correct permissions for list
* nomad/csi_endpoint: allowCSIMount keeps ACL checks together
* nomad/job_endpoint: check mount permission for jobs
* nomad/job_endpoint_test: need plugin read, too
* nomad/state/schema: use the namespace compound index
* scheduler/scheduler: CSIVolumeByID interface signature namespace
* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace
* scheduler/feasible: pass the captured namespace to CSIVolumeByID
* nomad/state/state_store: use namespace in csi_volume index
* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim
* nomad/core_sched: pass the namespace in volumeClaimReap
* nomad/node_endpoint_test: namespaces in Claim testing
* nomad/csi_endpoint: pass RequestNamespace to state.*
* nomad/csi_endpoint_test: appropriately failed test
* command/alloc_status_test: appropriately failed test
* node_endpoint_test: avoid notTheNamespace for the job
* scheduler/feasible_test: call SetJob to capture the namespace
* nomad/csi_endpoint: ACL check the req namespace, query by namespace
* nomad/state/state_store: remove deregister namespace check
* nomad/state/state_store: remove unused CSIVolumes
* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace
* nomad/csi_endpoint: ACL check
* nomad/state/state_store_test: remove call to state.CSIVolumes
* nomad/core_sched_test: job namespace match so claim gc works
This changeset implements the remaining controller detach RPCs: server-to-client and client-to-controller. The tests also uncovered a bug in our RPC for claims which is fixed here; the volume claim RPC is used for both claiming and releasing a claim on a volume. We should only submit a controller publish RPC when the claim is new and not when it's being released.
* client/allocrunner/csi_hook: tag errors
* nomad/client_csi_endpoint: tag errors
* nomad/client_rpc: remove an unnecessary error tag
* nomad/state/state_store: ControllerRequired fix intent
We use ControllerRequired to indicate that a volume should use the
publish/unpublish workflow, rather than that it has a controller. We
need to check both RequiresControllerPlugin and SupportsAttachDetach
from the fingerprint to check that.
* nomad/csi_endpoint: tag errors
* nomad/csi_endpoint_test: longer error messages, mock fingerprints