The soundness guarantees of the CSI specification leave a little to be desired
for building a 100% reliable automated solution for managing volumes. This
changeset bridges that gap with a new command that gives the operator the
ability to intervene.
The command doesn't take an allocation ID so that the operator doesn't have to
keep track of alloc IDs that may have been GC'd. Handle this case in the
unpublish RPC by sending the client RPC for all the terminal/nil allocs on the
selected node.
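A rough sketch of the claim filtering this implies, with `Claim`, `Allocation`, and `nodeUnpublishForNode` as hypothetical stand-ins for the server's real types and RPC plumbing:

```go
package main

import "fmt"

// Allocation and Claim are simplified stand-ins for the allocation and volume
// claim records held in the state store (hypothetical types).
type Allocation struct {
	ID       string
	NodeID   string
	Terminal bool
}

type Claim struct {
	AllocID string
	NodeID  string
}

// nodeUnpublishForNode sends the node unpublish client RPC for every claim on
// the selected node whose allocation is terminal or missing from the state
// store (nil), so the operator never needs to supply an alloc ID.
func nodeUnpublishForNode(nodeID string, claims []*Claim, allocs map[string]*Allocation) {
	for _, claim := range claims {
		if claim.NodeID != nodeID {
			continue
		}
		alloc := allocs[claim.AllocID] // nil if the alloc has been GC'd
		if alloc == nil || alloc.Terminal {
			fmt.Printf("sending node unpublish for alloc %q on node %q\n",
				claim.AllocID, nodeID)
			// ... issue the client RPC for this claim here ...
		}
	}
}

func main() {
	claims := []*Claim{{AllocID: "a1", NodeID: "n1"}, {AllocID: "a2", NodeID: "n1"}}
	allocs := map[string]*Allocation{"a1": {ID: "a1", NodeID: "n1", Terminal: true}}
	nodeUnpublishForNode("n1", claims, allocs)
}
```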
The CSI client RPC uses error wrapping to detect the type of error bubbling up
from plugins, but if the errors we get aren't wrapped at each layer, we can't
unwrap the inner error.
Also eliminates some unused args.
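The fix relies on standard Go error wrapping; a minimal illustration of why a single layer that wraps with `%v` instead of `%w` breaks sentinel detection (`ErrVolumeNotFound` is a hypothetical sentinel):

```go
package main

import (
	"errors"
	"fmt"
)

// ErrVolumeNotFound stands in for a sentinel error the CSI client checks for
// (hypothetical name).
var ErrVolumeNotFound = errors.New("volume not found")

func main() {
	inner := fmt.Errorf("plugin rpc: %w", ErrVolumeNotFound) // wrapped with %w
	outer := fmt.Errorf("client rpc: %v", inner)              // NOT wrapped (%v)

	// The wrapped error is detectable...
	fmt.Println(errors.Is(inner, ErrVolumeNotFound)) // true
	// ...but once a layer drops %w, the sentinel can no longer be unwrapped.
	fmt.Println(errors.Is(outer, ErrVolumeNotFound)) // false
}
```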
Controller plugins that land on the same node will collide over their CSI
`mount_dir`, so give them enough room in our tests that they don't land on the
same host.
Also, version bump the EBS node plugins to match the controllers.
This change adds the ability to set the fields `success_before_passing` and
`failures_before_critical` on Consul service check definitions. These fields
require Consul v1.7.0 or later.
https://www.consul.io/docs/agent/checks#success-failures-before-passing-critical
Nomad doesn't do much besides pass the fields through to Consul.
Fixes #6913
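Nomad's role is only to copy these values into the check it registers with the Consul agent. A minimal sketch of that pass-through, assuming the `SuccessBeforePassing` and `FailuresBeforeCritical` fields on the consul/api check struct and a simplified `ServiceCheck` stand-in for Nomad's own check type:

```go
package main

import (
	"fmt"

	consulapi "github.com/hashicorp/consul/api"
)

// ServiceCheck mirrors the two new jobspec fields (simplified stand-in for
// Nomad's own check struct).
type ServiceCheck struct {
	SuccessBeforePassing   int
	FailuresBeforeCritical int
}

// toConsulCheck shows the pass-through: Nomad does no interpretation of the
// values, it simply hands them to Consul.
func toConsulCheck(sc ServiceCheck) *consulapi.AgentServiceCheck {
	return &consulapi.AgentServiceCheck{
		Name:                   "example-check",
		HTTP:                   "http://localhost:8080/health",
		Interval:               "10s",
		Timeout:                "2s",
		SuccessBeforePassing:   sc.SuccessBeforePassing,
		FailuresBeforeCritical: sc.FailuresBeforeCritical,
	}
}

func main() {
	check := toConsulCheck(ServiceCheck{SuccessBeforePassing: 3, FailuresBeforeCritical: 5})
	fmt.Printf("%+v\n", check)
}
```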
* update vault integration docs
docs/integrations/vault-integration was a copy of the learn guide. Remove that and move /docs/vault-integration to this location instead
fix link
Update website/pages/docs/integrations/vault-integration.mdx
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
* revert accidental deletion
Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>
When deregistering a client, CSI plugins running on that client may not get a
chance to fingerprint before being stopped. To avoid errors during node
deregistration, account for the case where a plugin allocation is the last
instance of the plugin and has already been deleted from the state store.
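A hedged sketch of that defensive handling, using a hypothetical `errPluginNotFound` sentinel and `lookupPlugin` accessor rather than Nomad's real state-store API:

```go
package main

import (
	"errors"
	"fmt"
)

// errPluginNotFound is a hypothetical sentinel for a plugin record that was
// already removed from the state store.
var errPluginNotFound = errors.New("plugin not found")

// lookupPlugin is a stand-in for a state-store read.
func lookupPlugin(id string) error {
	return errPluginNotFound
}

// deregisterNodePlugins tolerates a plugin that never fingerprinted or whose
// last allocation was already deleted: a missing record is not an error
// during node deregistration.
func deregisterNodePlugins(pluginIDs []string) error {
	for _, id := range pluginIDs {
		err := lookupPlugin(id)
		if errors.Is(err, errPluginNotFound) {
			continue // already gone; nothing to clean up
		}
		if err != nil {
			return err
		}
		fmt.Printf("removing node from plugin %q\n", id)
	}
	return nil
}

func main() {
	_ = deregisterNodePlugins([]string{"ebs-csi"})
}
```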
To prevent staleness, changed driver links to point to the releases page rather than a specific version.
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Postrun hooks for allocation runners don't currently block the registration of
terminal health with the servers, which is what allows system jobs to be
drained. So draining nodes with jobs that claim CSI volumes requires the
`-ignore-system` flag to ensure that the postrun hook for service jobs gets a
chance to execute.
When the client-side actions of a CSI client RPC succeed but we get
disconnected during the RPC or we fail to checkpoint the claim state, we want
to be able to retry the client RPC without getting blocked by the client-side
state (e.g. mount points) having already been cleaned up in previous calls.
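A minimal sketch of that idempotency, using `os.Remove` as a stand-in for the real unmount-and-cleanup step:

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// nodeUnpublish must be safe to retry: if a previous attempt already removed
// the mount point, a "does not exist" result is treated as success rather
// than failing the whole RPC.
func nodeUnpublish(mountPoint string) error {
	err := os.Remove(mountPoint) // stand-in for the real unmount + cleanup
	if err != nil && !errors.Is(err, os.ErrNotExist) {
		return fmt.Errorf("failed to clean up %s: %w", mountPoint, err)
	}
	return nil // already cleaned up on an earlier attempt, or removed now
}

func main() {
	// Retrying against a path that no longer exists still succeeds.
	fmt.Println(nodeUnpublish("/var/run/csi/example-mount"))
}
```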
Using the count of node claims from earlier in the `CSIVolume.Unpublish` RPC
doesn't correctly account for cases where the RPC was interrupted but
checkpointed. Instead, we'll check the current allocation count and status to
determine whether we need to send a controller unpublish.
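Roughly, the decision is made against the live claim state rather than a count captured at the start of the RPC; a sketch with a hypothetical `volumeClaim` type:

```go
package main

import "fmt"

// volumeClaim is a simplified stand-in for a volume's per-alloc claim record.
type volumeClaim struct {
	NodeID   string
	Released bool
}

// needsControllerUnpublish re-reads the volume's claims at decision time: the
// controller unpublish is only sent once no other live claim remains on the
// node, which stays correct even if an earlier attempt was interrupted after
// checkpointing.
func needsControllerUnpublish(claims []volumeClaim, nodeID string) bool {
	for _, c := range claims {
		if c.NodeID == nodeID && !c.Released {
			return false // another alloc on this node still holds the volume
		}
	}
	return true
}

func main() {
	claims := []volumeClaim{{NodeID: "n1", Released: true}}
	fmt.Println(needsControllerUnpublish(claims, "n1")) // true
}
```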
Upgrade our consul/api import to the equivalent of consul@v1.8.1, which includes
a bug fix necessary for #6913. If Consul published a proper api/ submodule tag,
we could reference that.
Add a Postrun hook to send the `CSIVolume.Unpublish` RPC to the server. This
may forward client RPCs to the node plugins or to the controller plugins,
depending on whether other allocations on this node have claims on this
volume.
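As a sketch, the hook is a thin RPC call made after the allocation's tasks have stopped; the `RPCer` interface and argument fields here are simplified stand-ins for the real client plumbing:

```go
package main

import "fmt"

// RPCer is a simplified stand-in for the client's server-RPC interface.
type RPCer interface {
	RPC(method string, args interface{}, reply interface{}) error
}

// csiHook releases this allocation's volume claims once the alloc stops.
type csiHook struct {
	rpc      RPCer
	allocID  string
	volumeID string
}

// Postrun runs after the tasks exit; the server decides whether releasing the
// claim requires only a node unpublish or a controller unpublish as well.
func (h *csiHook) Postrun() error {
	args := struct{ VolumeID, AllocID string }{h.volumeID, h.allocID}
	var reply struct{}
	return h.rpc.RPC("CSIVolume.Unpublish", &args, &reply)
}

// fakeRPC lets the sketch run without a server.
type fakeRPC struct{}

func (fakeRPC) RPC(method string, args, reply interface{}) error {
	fmt.Println("calling", method)
	return nil
}

func main() {
	h := &csiHook{rpc: fakeRPC{}, allocID: "a1", volumeID: "vol1"}
	_ = h.Postrun()
}
```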
By making clients responsible for running the `CSIVolume.Unpublish` RPC (and
making the RPC available to a `nomad volume detach` command), the
volumewatcher is used only by the core GC job and we no longer need
async volume GC from job deregister and node update.
This changeset updates `nomad/volumewatcher` to take advantage of the
`CSIVolume.Unpublish` RPC. This lets us eliminate a bunch of code and
associated tests. The raft batching code can be safely dropped, as the
characteristic times of the CSI RPCs are on the order of seconds or even
minutes, so batching up raft RPCs added complexity without any real world
performance wins.
Includes a refactor with test cleanup and dead-code elimination in volumewatcher.
The documentation encourages operators to run multiple controller plugin
instances for HA, but the client RPCs don't take advantage of this: they don't
retry when an RPC fails because the plugin is unavailable (for example, when
the node has drained or the alloc has failed but we haven't received an
updated fingerprint yet).
This changeset tries all known controllers on ready nodes before giving up,
and adds tests that exercise the client RPC routing and retries.
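A hedged sketch of that routing behavior, with `controllerPublish` standing in for forwarding the controller RPC to the plugin instance on a given client node:

```go
package main

import (
	"errors"
	"fmt"
)

// controllerPublish is a stand-in for forwarding the controller RPC to the
// plugin instance running on a specific client node.
func controllerPublish(nodeID string) error {
	if nodeID == "node-2" {
		return nil // only the second controller is actually reachable
	}
	return fmt.Errorf("plugin on %s unavailable", nodeID)
}

// tryControllers attempts the RPC against every ready node with a controller
// instance, returning on the first success instead of failing on the first
// unavailable plugin.
func tryControllers(readyNodes []string) error {
	var lastErr error
	for _, nodeID := range readyNodes {
		if err := controllerPublish(nodeID); err != nil {
			lastErr = err
			continue
		}
		return nil
	}
	if lastErr == nil {
		lastErr = errors.New("no ready controller plugin nodes")
	}
	return lastErr
}

func main() {
	fmt.Println(tryControllers([]string{"node-1", "node-2"})) // <nil>
}
```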