docs: always use -ignore-system on node drain with CSI (#8606)

Postrun hooks for allocation runners don't currently block the registration of terminal health with the servers, which is what allows system jobs to be drained. So draining nodes with jobs that claim CSI volumes requires the `-ignore-system` flag to ensure that the postrun hook for service jobs gets a chance to execute.
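
As a rough illustration of the drain workflow these docs describe, here is a minimal shell sketch. It assumes the node IDs are passed as arguments; the `run` and `drain_csi_nodes` helpers and the `RUN` variable are hypothetical names used only for this example. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: prints the nomad commands unless RUN=1 is set.
# `run` and `drain_csi_nodes` are hypothetical helpers, not part of Nomad.

run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"          # execute the command for real
  else
    echo "$*"     # dry run: show what would be executed
  fi
}

drain_csi_nodes() {
  # 1. Mark every node ineligible first, so allocations drained from
  #    one node are not placed on another node about to be drained.
  for id in "$@"; do
    run nomad node eligibility -disable "$id"
  done

  # 2. Drain each node while leaving system jobs (such as CSI node
  #    plugins) running, so volume-claiming jobs can unmount cleanly.
  for id in "$@"; do
    run nomad node drain -enable -ignore-system "$id"
  done
}
```

For example, `drain_csi_nodes <node-id-1> <node-id-2>` prints two `node eligibility -disable` commands followed by two `node drain -enable -ignore-system` commands; setting `RUN=1` would execute them instead.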
2020-08-07 11:22:28 -04:00 · 2020-08-07 11:22:28 -04:00 · 3169839653
parent 7d53ed88d6
commit 3169839653
3 changed files with 29 additions and 27 deletions
--- a/website/pages/docs/commands/node/drain.mdx
+++ b/website/pages/docs/commands/node/drain.mdx
@@ -18,10 +18,12 @@ all allocations have terminated. Canceling the `node drain` command _will not_
 cancel the drain. Drains may be canceled by using the `-disable` parameter
 below.

-When draining more than one node at a time, it is recommended you first disable
-[scheduling eligibility][eligibility] on all nodes that will be drained. For
-example if you are decommissioning an entire class of nodes, first run `node eligibility -disable` on all of their node IDs, and then run `node drain -enable`. This will ensure allocations drained from the first node are not
-placed on another node about to be drained.
+When draining more than one node at a time, it is recommended you first
+disable [scheduling eligibility][eligibility] on all nodes that will be
+drained. For example if you are decommissioning an entire class of nodes,
+first run `node eligibility -disable` on all of their node IDs, and then run
+`node drain -enable`. This will ensure allocations drained from the first node
+are not placed on another node about to be drained.

 The [node status] command complements this nicely by providing the current drain
 status of a given node.
@@ -65,8 +67,10 @@ operation is desired.
 - `-no-deadline`: No deadline allows the allocations to drain off the node
  without being force stopped after a certain deadline.

-- `-ignore-system`: Ignore system allows the drain to complete without stopping
-  system job allocations. By default system jobs are stopped last.
+- `-ignore-system`: Ignore system allows the drain to complete without
+  stopping system job allocations. By default system jobs are stopped
+  last. You should always use this flag when draining a node running
+  [CSI node plugins][internals-csi].

 - `-keep-ineligible`: Keep ineligible will maintain the node's scheduling
  ineligibility even if the drain is being disabled. This is useful when an
@@ -135,3 +139,4 @@ $ nomad node drain -self -monitor
 [migrate]: /docs/job-specification/migrate
 [node status]: /docs/commands/node/status
 [workload migration guide]: https://learn.hashicorp.com/nomad/operating-nomad/node-draining
+[internals-csi]: /docs/internals/plugins/csi
--- a/website/pages/docs/internals/plugins/csi.mdx
+++ b/website/pages/docs/internals/plugins/csi.mdx
@@ -56,12 +56,12 @@ that perform both the controller and node roles in the same
 instance. Not every plugin provider has or needs a controller; that's
 specific to the provider implementation.

-You should almost always run node plugins as Nomad `system` jobs to
-ensure volume claims are released when a Nomad client is drained. Use
-constraints for the node plugin jobs based on the availability of
-volumes. For example, AWS EBS volumes are specific to particular
-availability zones with a region. Controller plugins can be run as
-`service` jobs.
+You should always run node plugins as Nomad `system` jobs and use the
+`-ignore-system` flag on the `nomad node drain` command to ensure that the
+node plugins are still running while the node is being drained. Use
+constraints for the node plugin jobs based on the availability of volumes. For
+example, AWS EBS volumes are specific to particular availability zones within
+a region. Controller plugins can be run as `service` jobs.

 Nomad exposes a Unix domain socket named `csi.sock` inside each CSI
 plugin task, and communicates over the gRPC protocol expected by the
@@ -111,17 +111,13 @@ client, and the node plugin mounts the volume to a staging area in
 the Nomad data directory. Nomad will bind-mount this staged directory
 into each task that mounts the volume.

-This cycle is reversed when a task that claims a volume becomes
-terminal. The client updates the server frequently about changes to
-allocations, including terminal state. When the server receives a
-terminal state for a job with volume claims, it creates a volume claim
-garbage collection (GC) evaluation to to handled by the core job
-scheduler. The GC job will send "detach" RPCs to the node plugin. The
-node plugin unmounts the bind-mount from the allocation and unmounts
-the volume from the plugin (if it's not in use by another task). The
-GC job will then send "unpublish" RPCs to the controller plugin (if
-any), and decrement the claim count for the volume. At this point the
-volume’s claim capacity has been freed up for scheduling.
+This cycle is reversed when a task that claims a volume becomes terminal. The
+client will send an "unpublish" RPC to the server, which will send "detach"
+RPCs to the node plugin.  The node plugin unmounts the bind-mount from the
+allocation and unmounts the volume from the plugin (if it's not in use by
+another task). The server will then send "unpublish" RPCs to the controller
+plugin (if any), and decrement the claim count for the volume. At this point
+the volume’s claim capacity has been freed up for scheduling.

 [csi-spec]: https://github.com/container-storage-interface/spec
 [csi-drivers-list]: https://kubernetes-csi.github.io/docs/drivers.html
--- a/website/pages/docs/job-specification/csi_plugin.mdx
+++ b/website/pages/docs/job-specification/csi_plugin.mdx
@@ -51,10 +51,11 @@ host. With the Docker task driver, you can use the `privileged = true`
 configuration, but no other default task drivers currently have this
 option.

-~> **Note:** During node drains, jobs that claim volumes should be
-moved before the `node` or `monolith` plugin for those
-volumes. Because [`system`][system] jobs are moved last during node drains, you
-should run `node` or `monolith` plugins as `system` jobs.
+~> **Note:** During node drains, jobs that claim volumes must be moved before
+the `node` or `monolith` plugin for those volumes. You should run `node` or
+`monolith` plugins as [`system`][system] jobs and use the `-ignore-system`
+flag on `nomad node drain` to ensure that the plugins are running while the
+node is being drained.

 ## `csi_plugin` Examples