docs: always use -ignore-system on node drain with CSI (#8606)

Postrun hooks for allocation runners don't currently block the registration of
terminal health with the servers, which is what allows system jobs to be
drained. So draining nodes with jobs that claim CSI volumes requires the
`-ignore-system` flag to ensure that the postrun hook for service jobs gets a
chance to execute.
Tim Gross 2020-08-07 11:22:28 -04:00 committed by GitHub
parent 7d53ed88d6
commit 3169839653
3 changed files with 29 additions and 27 deletions


@@ -18,10 +18,12 @@ all allocations have terminated. Canceling the `node drain` command _will not_
cancel the drain. Drains may be canceled by using the `-disable` parameter
below.
When draining more than one node at a time, it is recommended you first disable
[scheduling eligibility][eligibility] on all nodes that will be drained. For
example if you are decommissioning an entire class of nodes, first run `node eligibility -disable` on all of their node IDs, and then run `node drain -enable`. This will ensure allocations drained from the first node are not
placed on another node about to be drained.
When draining more than one node at a time, it is recommended you first
disable [scheduling eligibility][eligibility] on all nodes that will be
drained. For example if you are decommissioning an entire class of nodes,
first run `node eligibility -disable` on all of their node IDs, and then run
`node drain -enable`. This will ensure allocations drained from the first node
are not placed on another node about to be drained.
The [node status] command complements this nicely by providing the current drain
status of a given node.
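
The two-step procedure described above could be scripted roughly as follows. This is a minimal sketch, not part of the commit: the `drain_nodes` function and the `NOMAD_CMD` variable are hypothetical names, and the `-detach` and `-yes` flags are added on the assumption that the script should not block on each drain or prompt for confirmation. By default it is a dry run that only prints the commands.

```shell
# Dry run by default: commands are printed, not executed.
# Set NOMAD_CMD=nomad to run them against a real cluster.
NOMAD_CMD="${NOMAD_CMD:-echo nomad}"

# drain_nodes NODE_ID... (hypothetical helper)
drain_nodes() {
  # Step 1: mark every node ineligible so that allocations drained from
  # one node are not placed on another node that is about to be drained.
  for node_id in "$@"; do
    $NOMAD_CMD node eligibility -disable "$node_id"
  done
  # Step 2: only then start the drains.
  for node_id in "$@"; do
    $NOMAD_CMD node drain -enable -detach -yes "$node_id"
  done
}
```

Calling `drain_nodes <id1> <id2>` prints both eligibility commands before either drain command, which is the ordering the paragraph recommends.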
@@ -65,8 +67,10 @@ operation is desired.
- `-no-deadline`: No deadline allows the allocations to drain off the node
without being force stopped after a certain deadline.
- `-ignore-system`: Ignore system allows the drain to complete without stopping
system job allocations. By default system jobs are stopped last.
- `-ignore-system`: Ignore system allows the drain to complete without
stopping system job allocations. By default system jobs are stopped
last. You should always use this flag when draining a node running
[CSI node plugins][internals-csi].
- `-keep-ineligible`: Keep ineligible will maintain the node's scheduling
ineligibility even if the drain is being disabled. This is useful when an
@@ -135,3 +139,4 @@ $ nomad node drain -self -monitor
[migrate]: /docs/job-specification/migrate
[node status]: /docs/commands/node/status
[workload migration guide]: https://learn.hashicorp.com/nomad/operating-nomad/node-draining
[internals-csi]: /docs/internals/plugins/csi
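
As an illustration of the `-ignore-system` guidance added in this file, a drain of a node running CSI node plugins might be wrapped like this. It is a sketch under assumptions: `drain_csi_node` and `NOMAD_CMD` are hypothetical names, and `-yes` is included only to skip the confirmation prompt. The default is a dry run that prints the command.

```shell
# Dry run by default; set NOMAD_CMD=nomad to execute for real.
NOMAD_CMD="${NOMAD_CMD:-echo nomad}"

# drain_csi_node NODE_ID (hypothetical helper)
drain_csi_node() {
  # -ignore-system lets the drain complete without stopping system job
  # allocations, so a node plugin run as a system job stays up while the
  # jobs that claim its volumes are being moved.
  $NOMAD_CMD node drain -enable -ignore-system -yes "$1"
}
```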


@@ -56,12 +56,12 @@ that perform both the controller and node roles in the same
instance. Not every plugin provider has or needs a controller; that's
specific to the provider implementation.
You should almost always run node plugins as Nomad `system` jobs to
ensure volume claims are released when a Nomad client is drained. Use
constraints for the node plugin jobs based on the availability of
volumes. For example, AWS EBS volumes are specific to particular
availability zones within a region. Controller plugins can be run as
`service` jobs.
You should always run node plugins as Nomad `system` jobs and use the
`-ignore-system` flag on the `nomad node drain` command to ensure that the
node plugins are still running while the node is being drained. Use
constraints for the node plugin jobs based on the availability of volumes. For
example, AWS EBS volumes are specific to particular availability zones within a
region. Controller plugins can be run as `service` jobs.
Nomad exposes a Unix domain socket named `csi.sock` inside each CSI
plugin task, and communicates over the gRPC protocol expected by the
@@ -111,17 +111,13 @@ client, and the node plugin mounts the volume to a staging area in
the Nomad data directory. Nomad will bind-mount this staged directory
into each task that mounts the volume.
This cycle is reversed when a task that claims a volume becomes
terminal. The client updates the server frequently about changes to
allocations, including terminal state. When the server receives a
terminal state for a job with volume claims, it creates a volume claim
garbage collection (GC) evaluation to be handled by the core job
scheduler. The GC job will send "detach" RPCs to the node plugin. The
node plugin unmounts the bind-mount from the allocation and unmounts
the volume from the plugin (if it's not in use by another task). The
GC job will then send "unpublish" RPCs to the controller plugin (if
any), and decrement the claim count for the volume. At this point the
volume's claim capacity has been freed up for scheduling.
This cycle is reversed when a task that claims a volume becomes terminal. The
client will send an "unpublish" RPC to the server, which will send "detach"
RPCs to the node plugin. The node plugin unmounts the bind-mount from the
allocation and unmounts the volume from the plugin (if it's not in use by
another task). The server will then send "unpublish" RPCs to the controller
plugin (if any), and decrement the claim count for the volume. At this point
the volume's claim capacity has been freed up for scheduling.
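
The ordering in the revised paragraph can be traced with a toy model. This is purely illustrative and makes no RPCs: `release_volume_claim` and its arguments are hypothetical, and the printed messages paraphrase the steps above rather than reproduce Nomad's actual RPC names or logs.

```shell
# Toy model only: prints the release steps in order.
# release_volume_claim HAS_CONTROLLER CLAIM_COUNT (hypothetical helper)
#   HAS_CONTROLLER: "yes" if the plugin provider has a controller plugin
#   CLAIM_COUNT:    claims on the volume before this release
release_volume_claim() {
  has_controller="$1"
  claim_count="$2"
  echo "client -> server: unpublish"
  echo "server -> node plugin: detach"
  echo "node plugin: unmount bind-mount from the allocation"
  echo "node plugin: unmount volume from the plugin (if not in use by another task)"
  # Not every plugin provider has a controller, so this step is optional.
  if [ "$has_controller" = "yes" ]; then
    echo "server -> controller plugin: unpublish"
  fi
  echo "claim count: $((claim_count - 1))"
}
```

The node plugin has to be running for the second, third, and fourth steps, which is why the drain should leave the plugin's system job in place.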
[csi-spec]: https://github.com/container-storage-interface/spec
[csi-drivers-list]: https://kubernetes-csi.github.io/docs/drivers.html


@@ -51,10 +51,11 @@ host. With the Docker task driver, you can use the `privileged = true`
configuration, but no other default task drivers currently have this
option.
~> **Note:** During node drains, jobs that claim volumes should be
moved before the `node` or `monolith` plugin for those
volumes. Because [`system`][system] jobs are moved last during node drains, you
should run `node` or `monolith` plugins as `system` jobs.
~> **Note:** During node drains, jobs that claim volumes must be moved before
the `node` or `monolith` plugin for those volumes. You should run `node` or
`monolith` plugins as [`system`][system] jobs and use the `-ignore-system`
flag on `nomad node drain` to ensure that the plugins are running while the
node is being drained.
## `csi_plugin` Examples