Service and Batch Job Preemption Guide (#5853)

* fix navigation issue for spread guide

* skeleton for preemption guide

* background info, challenge, and pre-reqs

* steps

* rewording of intro

* re-wording

* adding more detail to intro

* clarify use of preemption in intro

Omar Khawaja 2019-07-29 16:38:18 -04:00, committed by GitHub
3 changed files with 444 additions and 2 deletions


@@ -0,0 +1,438 @@
---
layout: "guides"
page_title: "Preemption (Service and Batch Jobs)"
sidebar_current: "guides-operating-a-job-preemption-service-batch"
description: |-
  The following guide walks the user through enabling and using preemption on
  service and batch jobs in Nomad Enterprise (0.9.3 and above).
---
# Preemption for Service and Batch Jobs

~> **Enterprise Only!** This functionality only exists in Nomad Enterprise. This
is not present in the open source version of Nomad.

Prior to Nomad 0.9, job [priority][priority] in Nomad was used only to process
scheduling requests in priority order. Preemption, introduced in Nomad 0.9,
allows Nomad to evict running allocations in order to place allocations of
higher priority. Allocations of a lower-priority job that are displaced go into
"pending" status temporarily, until the cluster has additional capacity to run
them. This is useful when operators need higher-priority tasks to run promptly
even under resource contention across the cluster.

While Nomad 0.9 introduced preemption for [system][system-job] jobs, Nomad 0.9.3
[Enterprise][enterprise] additionally allows preemption for
[service][service-job] and [batch][batch-job] jobs. This functionality can
easily be enabled by sending a [payload][payload-preemption-config] with the
appropriate options specified to the [scheduler
configuration][update-scheduler] API endpoint.
## Reference Material

- [Preemption][preemption]
- [Nomad Enterprise Preemption][enterprise-preemption]

## Estimated Time to Complete

20 minutes

## Prerequisites
To perform the tasks described in this guide, you need to have a Nomad
environment with Consul installed. You can use this
[repo](https://github.com/hashicorp/nomad/tree/master/terraform#provision-a-nomad-cluster-in-the-cloud)
to easily provision a sandbox environment. This guide assumes a cluster with
one server node and three client nodes. To simulate resource contention, the
nodes in this environment each have 1 GB of RAM (for AWS, you can choose the
[t2.micro][t2-micro] instance type). Remember that preemption for service and
batch jobs requires Nomad 0.9.3 [Enterprise][enterprise].
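
Before moving on, it may help to confirm that your cluster is healthy. Below is
a quick sanity check using standard Nomad CLI commands (output is omitted here
and will vary by environment):

```shell
# Confirm the server is up and a leader has been elected
$ nomad server members

# Confirm all three client nodes are in the "ready" state
$ nomad node status
```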
-> **Please Note:** This guide is for demo purposes and is only using a single
server node. In a production cluster, 3 or 5 server nodes are recommended.
## Steps
### Step 1: Create a Job with Low Priority
Start by creating a job with a relatively low priority and registering it in
your Nomad cluster. One of the allocations from this job will be preempted in a
subsequent deployment when there is resource contention in the cluster. Copy
the following job into a file and name it `webserver.nomad`.
```hcl
job "webserver" {
datacenters = ["dc1"]
type = "service"
priority = 40
group "webserver" {
count = 3
task "apache" {
driver = "docker"
config {
image = "httpd:latest"
port_map {
http = 80
}
}
resources {
network {
mbits = 10
port "http"{}
}
memory = 600
}
service {
name = "apache-webserver"
port = "http"
check {
name = "alive"
type = "http"
path = "/"
interval = "10s"
timeout = "2s"
}
}
}
}
}
```
Note that the [count][count] is 3 and that each allocation specifies 600 MB
of [memory][memory]. Remember that each node has only 1 GB of RAM.
### Step 2: Run the Low Priority Job
Register `webserver.nomad`:
```shell
$ nomad run webserver.nomad
==> Monitoring evaluation "1596bfc8"
Evaluation triggered by job "webserver"
Allocation "725d3b49" created: node "16653ac1", group "webserver"
Allocation "e2f9cb3d" created: node "f765c6e8", group "webserver"
Allocation "e9d8df1b" created: node "b0700ec0", group "webserver"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "1596bfc8" finished with status "complete"
```
You should be able to check the status of the `webserver` job at this point and see that an allocation has been placed on each client node in the cluster:
```shell
$ nomad status webserver
ID = webserver
Name = webserver
Submit Date = 2019-06-19T04:20:32Z
Type = service
Priority = 40
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
725d3b49 16653ac1 webserver 0 run running 1m18s ago 59s ago
e2f9cb3d f765c6e8 webserver 0 run running 1m18s ago 1m2s ago
e9d8df1b b0700ec0 webserver 0 run running 1m18s ago 59s ago
```
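
If you would like to see how much of each node's memory is now committed, you
can inspect one of the client nodes directly. A brief sketch (the node ID below
comes from the sample output above; substitute one from your own cluster and
look for the allocated-resources section in the output):

```shell
# Show detailed status for a client node, including allocated memory
# (16653ac1 is a node ID taken from the sample output above)
$ nomad node status 16653ac1
```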
### Step 3: Create a Job with High Priority
Create another job with a [priority][priority] greater than that of the job you just deployed. Copy the following into a file named `redis.nomad`:
```hcl
job "redis" {
datacenters = ["dc1"]
type = "service"
priority = 80
group "cache1" {
count = 1
task "redis" {
driver = "docker"
config {
image = "redis:latest"
port_map {
db = 6379
}
}
resources {
network {
port "db" {}
}
memory = 700
}
service {
name = "redis-cache"
port = "db"
check {
name = "alive"
type = "tcp"
interval = "10s"
timeout = "2s"
}
}
}
}
}
```
Note that this job has a priority of 80 (greater than the priority of the job
from [Step 1][step-1]) and requires 700 MB of memory. This will create resource
contention in the cluster, since each node has only 1 GB of memory and already
hosts a 600 MB allocation (600 MB + 700 MB exceeds the memory available on any
single node).
### Step 4: Try to Run `redis.nomad`
Remember that preemption for service and batch jobs is [disabled by
default][preemption-config]. This means that the `redis` job will be queued due
to resource contention in the cluster. You can verify the resource contention
before actually registering your job by running the [`plan`][plan] command:
```shell
$ nomad plan redis.nomad
+ Job: "redis"
+ Task Group: "cache1" (1 create)
+ Task: "redis" (forces create)
Scheduler dry-run:
- WARNING: Failed to place all allocations.
Task Group "cache1" (failed to place 1 allocation):
* Resources exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
```
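
Because `plan` is a dry run, it never changes cluster state, which makes it a
safe pre-flight check. Its documented exit codes also make the check
scriptable, as in this small sketch:

```shell
# nomad plan exit codes:
#   0   - no allocations would be created or destroyed
#   1   - allocations would be created or destroyed
#   255 - there was an error determining the result
$ nomad plan redis.nomad; echo "plan exit code: $?"
```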
Run the job to see that the allocation will be queued:
```shell
$ nomad run redis.nomad
==> Monitoring evaluation "1e54e283"
Evaluation triggered by job "redis"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "1e54e283" finished with status "complete" but failed to place all allocations:
Task Group "cache1" (failed to place 1 allocation):
* Resources exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
Evaluation "1512251a" waiting for additional capacity to place remainder
```
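
Note the second evaluation ID in the output; it identifies the blocked
evaluation that is waiting for additional capacity. You can inspect it directly
(the ID below is from the sample output above; use the one from your own run):

```shell
# Inspect the blocked evaluation that is waiting for capacity
$ nomad eval status 1512251a
```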
You may also verify that the allocation has been queued by checking the status of the job:
```shell
$ nomad status redis
ID = redis
Name = redis
Submit Date = 2019-06-19T03:33:17Z
Type = service
Priority = 80
...
Placement Failure
Task Group "cache1":
* Resources exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
Allocations
No allocations placed
```
You may remove this job now. In the next steps, we will enable service job preemption and re-deploy:
```shell
$ nomad stop -purge redis
==> Monitoring evaluation "153db6c0"
Evaluation triggered by job "redis"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "153db6c0" finished with status "complete"
```
### Step 5: Enable Service Job Preemption
Verify the [scheduler configuration][scheduler-configuration] with the following
command:
```shell
$ curl -s localhost:4646/v1/operator/scheduler/configuration | jq
{
"SchedulerConfig": {
"PreemptionConfig": {
"SystemSchedulerEnabled": true,
"BatchSchedulerEnabled": false,
"ServiceSchedulerEnabled": false
},
"CreateIndex": 5,
"ModifyIndex": 506
},
"Index": 506,
"LastContact": 0,
"KnownLeader": true
}
```
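
If you only care about the preemption settings, you can narrow the output with
a `jq` filter:

```shell
# Show only the preemption settings from the scheduler configuration
$ curl -s localhost:4646/v1/operator/scheduler/configuration | \
    jq '.SchedulerConfig.PreemptionConfig'
```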
Note that [BatchSchedulerEnabled][batch-enabled] and
[ServiceSchedulerEnabled][service-enabled] are both set to `false` by default.
Since we are preempting service jobs in this guide, we need to set
`ServiceSchedulerEnabled` to `true`. We will do this by directly interacting
with the [API][update-scheduler].
Create the following JSON payload and place it in a file named `scheduler.json`:
```json
{
"PreemptionConfig": {
"SystemSchedulerEnabled": true,
"BatchSchedulerEnabled": false,
"ServiceSchedulerEnabled": true
}
}
```
Note that [ServiceSchedulerEnabled][service-enabled] has been set to `true`.
Run the following command to update the scheduler configuration:
```shell
$ curl -XPOST localhost:4646/v1/operator/scheduler/configuration -d @scheduler.json
```
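
To guard against concurrent modifications, the [update
endpoint][update-scheduler] also accepts a check-and-set `cas` query parameter;
the update is applied only if the configuration's `ModifyIndex` still matches
the value you pass. A sketch using the `ModifyIndex` from the sample read
above:

```shell
# Check-and-set update: applied only if ModifyIndex is still 506
$ curl -XPOST "localhost:4646/v1/operator/scheduler/configuration?cas=506" \
    -d @scheduler.json
```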
You should now be able to check the scheduler configuration again and see that
preemption has been enabled for service jobs (output below is abbreviated):
```shell
$ curl -s localhost:4646/v1/operator/scheduler/configuration | jq
{
"SchedulerConfig": {
"PreemptionConfig": {
"SystemSchedulerEnabled": true,
"BatchSchedulerEnabled": false,
"ServiceSchedulerEnabled": true
},
...
}
```
### Step 6: Try Running `redis.nomad` Again
Now that you have enabled preemption on service jobs, deploying your `redis`
job should preempt one of the lower-priority `webserver` allocations, which
will then wait in the queue until the cluster has capacity for it. You can run
`nomad plan` to see a preview of what will happen:
```shell
$ nomad plan redis.nomad
+ Job: "redis"
+ Task Group: "cache1" (1 create)
+ Task: "redis" (forces create)
Scheduler dry-run:
- All tasks successfully allocated.
Preemptions:
Alloc ID Job ID Task Group
725d3b49-d5cf-6ba2-be3d-cb441c10a8b3 webserver webserver
...
```
Note that Nomad is indicating one of the `webserver` allocations will be
evicted.
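
If you want more detail on that allocation before proceeding, you can inspect
it by ID (the ID below comes from the sample plan output above):

```shell
# Inspect the allocation that the plan marked for preemption
$ nomad alloc status 725d3b49
```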
Now run the `redis` job:
```shell
$ nomad run redis.nomad
==> Monitoring evaluation "7ada9d9f"
Evaluation triggered by job "redis"
Allocation "8bfcdda3" created: node "16653ac1", group "cache1"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "7ada9d9f" finished with status "complete"
```
You can check the status of the `webserver` job and verify that one of the allocations has been evicted:
```shell
$ nomad status webserver
ID = webserver
Name = webserver
Submit Date = 2019-06-19T04:20:32Z
Type = service
Priority = 40
...
Summary
Task Group Queued Starting Running Failed Complete Lost
webserver 1 0 2 0 1 0
Placement Failure
Task Group "webserver":
* Resources exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
Allocations
ID Node ID Task Group Version Desired Status Created Modified
725d3b49 16653ac1 webserver 0 evict complete 4m10s ago 33s ago
e2f9cb3d f765c6e8 webserver 0 run running 4m10s ago 3m54s ago
e9d8df1b b0700ec0 webserver 0 run running 4m10s ago 3m51s ago
```
### Step 7: Stop the Redis Job
Stop the `redis` job and verify that the evicted/queued `webserver` allocation
starts running again:
```shell
$ nomad stop redis
==> Monitoring evaluation "670922e9"
Evaluation triggered by job "redis"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "670922e9" finished with status "complete"
```
You should now see from the `webserver` status that the preempted allocation has been replaced with a new one (`f623eb81` below) and the job is back to running all three allocations:
```shell
$ nomad status webserver
ID = webserver
Name = webserver
Submit Date = 2019-06-19T04:20:32Z
Type = service
Priority = 40
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
webserver 0 0 3 0 1 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
f623eb81 16653ac1 webserver 0 run running 13s ago 7s ago
725d3b49 16653ac1 webserver 0 evict complete 6m44s ago 3m7s ago
e2f9cb3d f765c6e8 webserver 0 run running 6m44s ago 6m28s ago
e9d8df1b b0700ec0 webserver 0 run running 6m44s ago 6m25s ago
```
## Next Steps
The process you learned in this guide can also be applied to
[batch][batch-enabled] jobs. Read more about preemption in Nomad
Enterprise [here][enterprise-preemption].
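
For example, to enable preemption for batch jobs as well, you could POST a
payload with `BatchSchedulerEnabled` set to `true`, following the same pattern
as Step 5:

```shell
# Enable preemption for batch jobs as well (same endpoint as in Step 5)
$ curl -XPOST localhost:4646/v1/operator/scheduler/configuration -d '{
  "PreemptionConfig": {
    "SystemSchedulerEnabled": true,
    "BatchSchedulerEnabled": true,
    "ServiceSchedulerEnabled": true
  }
}'
```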
[batch-enabled]: /api/operator.html#batchschedulerenabled-1
[batch-job]: /docs/schedulers.html#batch
[count]: /docs/job-specification/group.html#count
[enterprise]: /docs/enterprise/index.html
[enterprise-preemption]: /docs/enterprise/preemption/index.html
[memory]: /docs/job-specification/resources.html#memory
[payload-preemption-config]: /api/operator.html#sample-payload-1
[plan]: /docs/commands/job/plan.html
[preemption]: /docs/internals/scheduling/preemption.html
[preemption-config]: /api/operator.html#preemptionconfig-1
[priority]: /docs/job-specification/job.html#priority
[service-enabled]: /api/operator.html#serviceschedulerenabled-1
[service-job]: /docs/schedulers.html#service
[step-1]: #step-1-create-a-job-with-low-priority
[system-job]: /docs/schedulers.html#system
[t2-micro]: https://aws.amazon.com/ec2/instance-types/
[update-scheduler]: /api/operator.html#update-scheduler-configuration
[scheduler-configuration]: /api/operator.html#read-scheduler-configuration


@@ -1,7 +1,7 @@
 ---
 layout: "guides"
 page_title: "Spread"
-sidebar_current: "guides-advanced-scheduling"
+sidebar_current: "guides-operating-a-job-spread"
 description: |-
   The following guide walks the user through using the spread stanza in Nomad.
 ---


@@ -119,10 +119,14 @@
         <a href="/guides/operating-a-job/advanced-scheduling/affinity.html">Placement Preferences with Affinities</a>
       </li>
-      <li<%= sidebar_current("guides-spread") %>>
+      <li<%= sidebar_current("guides-operating-a-job-spread") %>>
         <a href="/guides/operating-a-job/advanced-scheduling/spread.html">Fault Tolerance with Spread</a>
       </li>
+      <li<%= sidebar_current("guides-operating-a-job-preemption-service-batch") %>>
+        <a href="/guides/operating-a-job/advanced-scheduling/preemption-service-batch.html">Preemption (Service and Batch Jobs)</a>
+      </li>
       <li<%= sidebar_current("guides-operating-a-job-external-lxc") %>>
         <a href="/guides/operating-a-job/external/lxc.html">Running LXC Applications</a>
       </li>