Merge pull request #4085 from hashicorp/docs-node-drain

Initial Node drain docs
2018-03-30 16:34:49 -07:00 · 2018-03-30 16:34:49 -07:00 · 3495df7da9
parent 6871a068cb 1f1a20eaed
commit 3495df7da9
8 changed files with 268 additions and 6 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,9 +1,15 @@
 ## 0.8 (Unreleased)

 __BACKWARDS INCOMPATIBILITIES:__
+ * cli: node drain now blocks until the drain completes and all allocations on
+   the draining node have stopped. Use -detach for the old behavior.
 * discovery: Prevent absolute URLs in check paths. The documentation indicated
   that absolute URLs are not allowed, but it was not enforced. Absolute URLs
   in HTTP check paths will now fail to validate. [[GH-3685](https://github.com/hashicorp/nomad/issues/3685)]
+ * drain: Draining a node no longer stops all allocations immediately: a new
+   [migrate stanza](https://www.nomadproject.io/docs/job-specification/migrate.html)
+   allows jobs to specify how quickly task groups can be drained. A `-force`
+   option can be used to emulate the old drain behavior.
 * jobspec: The default values for restart policy have changed. Restart policy mode defaults to "fail" and the
   attempts/time interval values have been changed to enable faster server side rescheduling. See
   [restart stanza](https://www.nomadproject.io/docs/job-specification/restart.html) for more information.
@ -21,6 +27,9 @@ IMPROVEMENTS:
 * core: Servers can now retry connecting to Vault to verify tokens without requiring a SIGHUP to do so [[GH-3957](https://github.com/hashicorp/nomad/issues/3957)]
 * core: Updated yamux library to pick up memory and CPU performance improvements [[GH-3980](https://github.com/hashicorp/nomad/issues/3980)]
 * core: Client stanza now supports overriding total memory [[GH-4052](https://github.com/hashicorp/nomad/issues/4052)]
+ * core: Node draining is now able to migrate allocations in a controlled
+   manner with parameters specified by the drain command and in job files using
+   the migrate stanza [[GH-4010](https://github.com/hashicorp/nomad/issues/4010)]
 * acl: Increase token name limit from 64 characters to 256 [[GH-3888](https://github.com/hashicorp/nomad/issues/3888)]
 * cli: Node status and filesystem related commands do not require direct
   network access to the Nomad client nodes [[GH-3892](https://github.com/hashicorp/nomad/issues/3892)]
--- a/command/job_init.go
+++ b/command/job_init.go
@ -159,6 +159,35 @@ job "example" {
    canary = 0
  }

+  # The migrate stanza specifies the group's strategy for migrating off of
+  # draining nodes. If omitted, a default migration strategy is applied.
+  #
+  # For more information on the "migrate" stanza, please see 
+  # the online documentation at:
+  #
+  #     https://www.nomadproject.io/docs/job-specification/migrate.html
+  #
+  migrate {
+    # Specifies the number of allocations that can be migrated at the same
+    # time. This number must be less than the total count for the group as
+    # (count - max_parallel) will be left running during migrations.
+    max_parallel = 1
+
+    # Specifies the mechanism by which allocation health is determined. The
+    # potential values are "checks" or "task_states".
+    health_check = "checks"
+
+    # Specifies the minimum time the allocation must be in the healthy state
+    # before it is marked as healthy and unblocks further allocations from being
+    # migrated. This is specified using a label suffix like "30s" or "15m".
+    min_healthy_time = "10s"
+
+    # Specifies the deadline by which the allocation must be marked as healthy.
+    # After the deadline, the allocation is automatically transitioned to
+    # unhealthy. This is specified using a label suffix like "2m" or "1h".
+    healthy_deadline = "5m"
+  }
+
  # The "group" stanza defines a series of tasks that should be co-located on
  # the same Nomad client. Any task within a group will be placed on the same
  # client.
--- a/website/source/docs/commands/job/deployments.html.md.erb
+++ b/website/source/docs/commands/job/deployments.html.md.erb
@ -8,8 +8,8 @@ description: >

 # Command: job deployments

-The `job dispatch` command is used to display the deployments for a particular
-job.
+The `job deployments` command is used to display the deployments for a
+particular job.

 ## Usage

--- a/website/source/docs/commands/node.html.md.erb
+++ b/website/source/docs/commands/node.html.md.erb
@ -18,9 +18,11 @@ Run `nomad node <subcommand> -h` for help on that subcommand. The following
 subcommands are available:

 * [`node config`][config] - View or modify client configuration details
-* [`node drain`][drain] - Toggle drain mode on a given node
+* [`node drain`][drain] - Set drain mode on a given node
+* [`node eligibility`][eligibility] - Toggle scheduling eligibility on a given node
 * [`node status`][status] - Display status information about nodes

 [config]: /docs/commands/node/config.html "View or modify client configuration details"
-[drain]: /docs/commands/node/drain.html "Toggle drain mode on a given node"
+[drain]: /docs/commands/node/drain.html "Set drain mode on a given node"
+[eligibility]: /docs/commands/node/eligibility.html "Toggle scheduling eligibility on a given node"
 [status]: /docs/commands/node/status.html "Display status information about nodes"
--- a/website/source/docs/commands/node/drain.html.md.erb
+++ b/website/source/docs/commands/node/drain.html.md.erb
@ -10,7 +10,20 @@ description: >

 The `node drain` command is used to toggle drain mode on a given node. Drain
 mode prevents any new tasks from being allocated to the node, and begins
-migrating all existing allocations away.
+migrating all existing allocations away. Allocations will be migrated according
+to their [`migrate`][migrate] stanza until the drain's deadline is reached.
+
+By default, the `node drain` command blocks until a node is done draining and
+all allocations have terminated. Canceling the `node drain` command *will not*
+cancel the drain. Drains may be canceled by using the `-disable` parameter
+below.
+
+When draining more than one node at a time, it is recommended you first disable
+[scheduling eligibility][eligibility] on all nodes that will be drained. For
+example, if you are decommissioning an entire class of nodes, first run `node
+eligibility -disable` on all of their node IDs, and then run `node drain
+-enable`. This will ensure allocations drained from the first node are not
+placed on another node about to be drained.

 The [node status](/docs/commands/node/status.html) command compliments this
 nicely by providing the current drain status of a given node.
@ -37,6 +50,19 @@ operation is desired.

 * `-enable`: Enable node drain mode.
 * `-disable`: Disable node drain mode.
+* `-deadline`: Set the deadline by which all allocations must be moved off the
+  node. Remaining allocations after the deadline are force removed from the
+  node. Defaults to 1 hour.
+* `-detach`: Return immediately instead of entering monitor mode.
+* `-force`: Force remove allocations off the node immediately.
+* `-no-deadline`: No deadline allows the allocations to drain off the node
+  without being force stopped after a certain deadline.
+* `-ignore-system`: Ignore system allows the drain to complete without stopping
+  system job allocations. By default system jobs are stopped last.
+* `-keep-ineligible`: Keep ineligible will maintain the node's scheduling
+  ineligibility even if the drain is being disabled. This is useful when an
+  existing drain is being canceled but additional scheduling on the node is not
+  desired.
 * `-self`: Drain the local node.
 * `-yes`: Automatic yes to prompts.

@ -45,11 +71,46 @@ operation is desired.
 Enable drain mode on node with ID prefix "4d2ba53b":

 ```
-$ nomad node drain -enable 4d2ba53b
+$ nomad node drain -enable f4e8a9e5
+Are you sure you want to enable drain mode for node "f4e8a9e5-30d8-3536-1e6f-cda5c869c35e"? [y/N] y
+2018-03-30T23:13:16Z: Ctrl-C to stop monitoring: will not cancel the node drain
+2018-03-30T23:13:16Z: Node "f4e8a9e5-30d8-3536-1e6f-cda5c869c35e" drain strategy set
+2018-03-30T23:13:17Z: Alloc "1877230b-64d3-a7dd-9c31-dc5ad3c93e9a" marked for migration
+2018-03-30T23:13:17Z: Alloc "1877230b-64d3-a7dd-9c31-dc5ad3c93e9a" draining
+2018-03-30T23:13:17Z: Alloc "1877230b-64d3-a7dd-9c31-dc5ad3c93e9a" status running -> complete
+2018-03-30T23:13:29Z: Alloc "3fce5308-818c-369e-0bb7-f61f0a1be9ed" marked for migration
+2018-03-30T23:13:29Z: Alloc "3fce5308-818c-369e-0bb7-f61f0a1be9ed" draining
+2018-03-30T23:13:30Z: Alloc "3fce5308-818c-369e-0bb7-f61f0a1be9ed" status running -> complete
+2018-03-30T23:13:41Z: Alloc "9a98c5aa-a719-2f34-ecfc-0e6268b5d537" marked for migration
+2018-03-30T23:13:41Z: Alloc "9a98c5aa-a719-2f34-ecfc-0e6268b5d537" draining
+2018-03-30T23:13:41Z: Node "f4e8a9e5-30d8-3536-1e6f-cda5c869c35e" drain complete
+2018-03-30T23:13:42Z: Alloc "9a98c5aa-a719-2f34-ecfc-0e6268b5d537" status running -> complete
+2018-03-30T23:13:42Z: All allocations on node "f4e8a9e5-30d8-3536-1e6f-cda5c869c35e" have stopped.
 ```

 Enable drain mode on the local node:

 ```
 $ nomad node drain -enable -self
+...
 ```
+
+Enable drain mode but do not stop system jobs:
+
+```
+$ nomad node drain -enable -ignore-system 4d2ba53b
+...
+```
+
+Disable drain mode but keep the node ineligible for scheduling. Useful for
+inspecting the current state of a misbehaving node without Nomad trying to
+start or migrate allocations:
+
+```
+$ nomad node drain -disable -keep-ineligible 4d2ba53b
+...
+```
+
+
+[eligibility]: /docs/commands/node/eligibility.html
+[migrate]: /docs/job-specification/migrate.html
--- a/website/source/docs/commands/node/eligibility.html.md.erb
+++ b/website/source/docs/commands/node/eligibility.html.md.erb
@ -0,0 +1,71 @@
+---
+layout: "docs"
+page_title: "Commands: node eligibility"
+sidebar_current: "docs-commands-node-eligibility"
+description: >
+  The node eligibility command is used to configure a node's scheduling
+  eligibility.
+---
+
+# Command: node eligibility
+
+The `node eligibility` command is used to toggle scheduling eligibility for a
+given node. By default, nodes are eligible for scheduling, meaning they can
+receive placements and run new allocations. Nodes that have their scheduling
+eligibility disabled are ineligible for new placements.
+
+The [`node drain`][drain] command automatically disables eligibility. Disabling
+a drain restores eligibility by default.
+
+Disabling scheduling eligibility is useful when draining a set of nodes: first
+disable eligibility on each node that will be drained, then drain each node.
+If you simply drain each node, allocations may get rescheduled multiple times
+as they get placed on nodes that are about to be drained.
+
+Disabling scheduling eligibility may also be useful when investigating poorly
+behaved nodes. It allows operators to investigate the current state of a node
+without the risk of additional work being assigned to it.
+
+## Usage
+
+```
+nomad node eligibility [options] <node>
+```
+
+A `-self` flag can be used to toggle eligibility of the local node. If this is
+not supplied, a node ID or prefix must be provided. If there is an exact match,
+the eligibility will be adjusted for that node. Otherwise, a list of matching
+nodes and information will be displayed.
+
+It is also required to pass one of `-enable` or `-disable`, depending on which
+operation is desired.
+
+## General Options
+
+<%= partial "docs/commands/_general_options" %>
+
+## Eligibility Options
+
+* `-enable`: Enable scheduling eligibility.
+* `-disable`: Disable scheduling eligibility.
+* `-self`: Set eligibility for the local node.
+* `-yes`: Automatic yes to prompts.
+
+## Examples
+
+Enable scheduling eligibility on node with ID prefix "574545c5":
+
+```
+$ nomad node eligibility -enable 574545c5
+Node "574545c5-c2d7-e352-d505-5e2cb9fe169f" scheduling eligibility set: eligible for scheduling
+```
+
+Disable scheduling eligibility on the local node:
+
+```
+$ nomad node eligibility -disable -self
+Node "574545c5-c2d7-e352-d505-5e2cb9fe169f" scheduling eligibility set: ineligible for scheduling
+```
+
+
+[drain]: /docs/commands/node/drain.html
--- a/website/source/docs/job-specification/migrate.html.md
+++ b/website/source/docs/job-specification/migrate.html.md
@ -0,0 +1,84 @@
+---
+layout: "docs"
+page_title: "migrate Stanza - Job Specification"
+sidebar_current: "docs-job-specification-migrate"
+description: |-
+  The "migrate" stanza specifies the group's migrate strategy. The migrate
+  strategy is used to control the job's behavior when it is being migrated off
+  of a draining node.
+---
+
+# `migrate` Stanza
+
+<table class="table table-bordered table-striped">
+  <tr>
+    <th width="120">Placement</th>
+    <td>
+      <code>job -> **migrate**</code>
+    </td>
+    <td>
+      <code>job -> group -> **migrate**</code>
+    </td>
+  </tr>
+</table>
+
+The `migrate` stanza specifies the group's strategy for migrating off of
+[draining][drain] nodes. If omitted, a default migration strategy is applied.
+If specified at the job level, the configuration will apply to all groups
+within the job. Only service jobs with a count greater than 1 support migrate
+stanzas.
+
+```hcl
+job "docs" {
+  migrate {
+    max_parallel     = 1
+    health_check     = "checks"
+    min_healthy_time = "10s"
+    healthy_deadline = "5m"
+  }
+}
+```
+
+When one or more nodes are draining, only `max_parallel` allocations will be
+stopped at a time. Node draining will not continue until replacement
+allocations have been healthy for their `min_healthy_time` or
+`healthy_deadline` is reached.
+
+Note that a node's drain [deadline][deadline] will override the `migrate`
+stanza for allocations on that node. The `migrate` stanza is for job authors to
+define how their services should be migrated, while the node drain deadline is
+for system operators to put hard limits on how long a drain may take.
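+
+As a minimal sketch, the stanza can also be set on an individual group rather
+than on the whole job. The group name `web`, the `count` of 3, and the
+parameter values below are illustrative assumptions, not recommendations:
+
+```hcl
+job "docs" {
+  group "web" {
+    # Hypothetical count: with count = 3 and max_parallel = 1,
+    # count - max_parallel = 2 allocations are left running during migration.
+    count = 3
+
+    migrate {
+      max_parallel     = 1
+      health_check     = "task_states"
+      min_healthy_time = "30s"
+      healthy_deadline = "2m"
+    }
+  }
+}
+```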
+
+## `migrate` Parameters
+
+- `max_parallel` `(int: 1)` - Specifies the number of allocations that can be
+  migrated at the same time. This number must be less than the total
+  [`count`][count] for the group as `count - max_parallel` will be left running
+  during migrations.
+
+- `health_check` `(string: "checks")` - Specifies the mechanism by which
+  allocation health is determined. The potential values are:
+
+  - "checks" - Specifies that the allocation should be considered healthy when
+    all of its tasks are running and their associated [checks][checks] are
+    healthy, and unhealthy if any of the tasks fail or not all checks become
+    healthy.  This is a superset of "task_states" mode.
+
+  - "task_states" - Specifies that the allocation should be considered healthy when
+    all its tasks are running and unhealthy if tasks fail.
+
+- `min_healthy_time` `(string: "10s")` - Specifies the minimum time the
+  allocation must be in the healthy state before it is marked as healthy and
+  unblocks further allocations from being migrated. This is specified using a
+  label suffix like "30s" or "15m".
+
+- `healthy_deadline` `(string: "5m")` - Specifies the deadline by which the
+  allocation must be marked as healthy. After the deadline, the allocation is
+  automatically transitioned to unhealthy. This is specified using a label
+  suffix like "2m" or "1h".
+
+
+[checks]: /docs/job-specification/service.html#check-parameters
+[count]: /docs/job-specification/group.html#count
+[drain]: /docs/commands/node/drain.html
+[deadline]: /docs/commands/node/drain.html#deadline
--- a/website/source/layouts/docs.erb
+++ b/website/source/layouts/docs.erb
@ -53,6 +53,9 @@
          <li<%= sidebar_current("docs-job-specification-meta")%>>
            <a href="/docs/job-specification/meta.html">meta</a>
          </li>
+          <li<%= sidebar_current("docs-job-specification-migrate")%>>
+            <a href="/docs/job-specification/migrate.html">migrate</a>
+          </li>
          <li<%= sidebar_current("docs-job-specification-network")%>>
            <a href="/docs/job-specification/network.html">network</a>
          </li>
@ -324,6 +327,9 @@
              <li<%= sidebar_current("docs-commands-node-drain") %>>
                <a href="/docs/commands/node/drain.html">drain</a>
              </li>
+              <li<%= sidebar_current("docs-commands-node-eligibility") %>>
+                <a href="/docs/commands/node/eligibility.html">eligibility</a>
+              </li>
              <li<%= sidebar_current("docs-commands-node-status") %>>
                <a href="/docs/commands/node/status.html">status</a>
              </li>