Service and Batch Job Preemption Guide (#5853)

* fix navigation issue for spread guide

* skeleton for preemption guide

* background info, challenge, and pre-reqs

* steps

* rewording of intro

* re-wording

* adding more detail to intro

* clarify use of preemption in intro

Omar Khawaja 2019-07-29 16:38:18 -04:00, committed by GitHub
3 changed files with 444 additions and 2 deletions


@@ -0,0 +1,438 @@
---
layout: "guides"
page_title: "Preemption (Service and Batch Jobs)"
sidebar_current: "guides-operating-a-job-preemption-service-batch"
description: |-
  The following guide walks the user through enabling and using preemption on
  service and batch jobs in Nomad Enterprise (0.9.3 and above).
---
# Preemption for Service and Batch Jobs

~> **Enterprise Only!** This functionality only exists in Nomad Enterprise. This
is not present in the open source version of Nomad.

Prior to Nomad 0.9, job [priority][priority] in Nomad was used only to process
scheduling requests in priority order. Preemption, introduced in Nomad 0.9,
allows Nomad to evict running allocations in order to place allocations of
higher priority. Allocations of a lower-priority job that are displaced go into
"pending" status temporarily, until the cluster has additional capacity to run
them. This is useful when operators need higher-priority tasks to run promptly
even under resource contention across the cluster.

While Nomad 0.9 introduced preemption for [system][system-job] jobs, Nomad 0.9.3
[Enterprise][enterprise] additionally allows preemption for
[service][service-job] and [batch][batch-job] jobs. This functionality can
easily be enabled by sending a [payload][payload-preemption-config] with the
appropriate options specified to the [scheduler
configuration][update-scheduler] API endpoint.
## Reference Material

- [Preemption][preemption]
- [Nomad Enterprise Preemption][enterprise-preemption]

## Estimated Time to Complete

20 minutes

## Prerequisites
To perform the tasks described in this guide, you need to have a Nomad
environment with Consul installed. You can use this
[repo](https://github.com/hashicorp/nomad/tree/master/terraform#provision-a-nomad-cluster-in-the-cloud)
to easily provision a sandbox environment. This guide assumes a cluster with
one server node and three client nodes. To simulate resource contention, the
nodes in this environment each have 1 GB of RAM (for AWS, you can choose the
[t2.micro][t2-micro] instance type). Remember that preemption for service and
batch jobs requires Nomad 0.9.3 [Enterprise][enterprise].
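
Before moving on, it may help to confirm that your cluster is healthy. Below is
a quick sanity check using standard Nomad CLI commands (output is omitted here
and will vary by environment):

```shell
# Confirm the server is up and a leader has been elected
$ nomad server members

# Confirm all three client nodes are in the "ready" state
$ nomad node status
```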
-> **Please Note:** This guide is for demo purposes and is only using a single
server node. In a production cluster, 3 or 5 server nodes are recommended.
## Steps
### Step 1: Create a Job with Low Priority
Start by creating a job with a relatively low priority and registering it in
your Nomad cluster. One of the allocations from this job will be preempted in a
subsequent deployment when there is resource contention in the cluster. Copy
the following job into a file and name it `webserver.nomad`.
```hcl
job "webserver" {
datacenters = ["dc1"]
type = "service"
priority = 40
group "webserver" {
count = 3
task "apache" {
driver = "docker"
config {
image = "httpd:latest"
port_map {
http = 80
}
}
resources {
network {
mbits = 10
port "http"{}
}
memory = 600
}
service {
name = "apache-webserver"
port = "http"
check {
name = "alive"
type = "http"
path = "/"
interval = "10s"
timeout = "2s"
}
}
}
}
}
```
Note that the [count][count] is 3 and that each allocation specifies 600 MB
of [memory][memory]. Remember that each node has only 1 GB of RAM.
### Step 2: Run the Low Priority Job
Register `webserver.nomad`:
```shell
$ nomad run webserver.nomad
==> Monitoring evaluation "1596bfc8"
Evaluation triggered by job "webserver"
Allocation "725d3b49" created: node "16653ac1", group "webserver"
Allocation "e2f9cb3d" created: node "f765c6e8", group "webserver"
Allocation "e9d8df1b" created: node "b0700ec0", group "webserver"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "1596bfc8" finished with status "complete"
```
You should be able to check the status of the `webserver` job at this point and see that an allocation has been placed on each client node in the cluster:
```shell
$ nomad status webserver
ID = webserver
Name = webserver
Submit Date = 2019-06-19T04:20:32Z
Type = service
Priority = 40
...
Allocations
ID Node ID Task Group Version Desired Status Created Modified
725d3b49 16653ac1 webserver 0 run running 1m18s ago 59s ago
e2f9cb3d f765c6e8 webserver 0 run running 1m18s ago 1m2s ago
e9d8df1b b0700ec0 webserver 0 run running 1m18s ago 59s ago
```
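
If you would like to see how much of each node's memory is now committed, you
can inspect one of the client nodes directly. A brief sketch (the node ID below
comes from the sample output above; substitute one from your own cluster and
look for the allocated-resources section in the output):

```shell
# Show detailed status for a client node, including allocated memory
# (16653ac1 is a node ID taken from the sample output above)
$ nomad node status 16653ac1
```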
### Step 3: Create a Job with High Priority
Create another job with a [priority][priority] greater than that of the job you just deployed. Copy the following into a file named `redis.nomad`:
```hcl
job "redis" {
datacenters = ["dc1"]
type = "service"
priority = 80
group "cache1" {
count = 1
task "redis" {
driver = "docker"
config {
image = "redis:latest"
port_map {
db = 6379
}
}
resources {
network {
port "db" {}
}
memory = 700
}
service {
name = "redis-cache"
port = "db"
check {
name = "alive"
type = "tcp"
interval = "10s"
timeout = "2s"
}
}
}
}
}
```
Note that this job has a priority of 80 (greater than the priority of the job
from [Step 1][step-1]) and requires 700 MB of memory. This will create resource
contention in the cluster, since each node has only 1 GB of memory and already
hosts a 600 MB allocation (600 MB + 700 MB exceeds the memory available on any
single node).
### Step 4: Try to Run `redis.nomad`
Remember that preemption for service and batch jobs is [disabled by
default][preemption-config]. This means that the `redis` job will be queued due
to resource contention in the cluster. You can verify the resource contention
before actually registering your job by running the [`plan`][plan] command:
```shell
$ nomad plan redis.nomad
+ Job: "redis"
+ Task Group: "cache1" (1 create)
+ Task: "redis" (forces create)
Scheduler dry-run:
- WARNING: Failed to place all allocations.
Task Group "cache1" (failed to place 1 allocation):
* Resources exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
```
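
Because `plan` is a dry run, it never changes cluster state, which makes it a
safe pre-flight check. Its documented exit codes also make the check
scriptable, as in this small sketch:

```shell
# nomad plan exit codes:
#   0   - no allocations would be created or destroyed
#   1   - allocations would be created or destroyed
#   255 - there was an error determining the result
$ nomad plan redis.nomad; echo "plan exit code: $?"
```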
Run the job to see that the allocation will be queued:
```shell
$ nomad run redis.nomad
==> Monitoring evaluation "1e54e283"
Evaluation triggered by job "redis"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "1e54e283" finished with status "complete" but failed to place all allocations:
Task Group "cache1" (failed to place 1 allocation):
* Resources exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
Evaluation "1512251a" waiting for additional capacity to place remainder
```
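
Note the second evaluation ID in the output; it identifies the blocked
evaluation that is waiting for additional capacity. You can inspect it directly
(the ID below is from the sample output above; use the one from your own run):

```shell
# Inspect the blocked evaluation that is waiting for capacity
$ nomad eval status 1512251a
```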
You may also verify that the allocation has been queued by checking the status of the job:
```shell
$ nomad status redis
ID = redis
Name = redis
Submit Date = 2019-06-19T03:33:17Z
Type = service
Priority = 80
...
Placement Failure
Task Group "cache1":
* Resources exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
Allocations
No allocations placed
```
You may remove this job now. In the next steps, we will enable service job preemption and re-deploy:
```shell
$ nomad stop -purge redis
==> Monitoring evaluation "153db6c0"
Evaluation triggered by job "redis"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "153db6c0" finished with status "complete"
```
### Step 5: Enable Service Job Preemption
Verify the [scheduler configuration][scheduler-configuration] with the following
command:
```shell
$ curl -s localhost:4646/v1/operator/scheduler/configuration | jq
{
"SchedulerConfig": {
"PreemptionConfig": {
"SystemSchedulerEnabled": true,
"BatchSchedulerEnabled": false,
"ServiceSchedulerEnabled": false
},
"CreateIndex": 5,
"ModifyIndex": 506
},
"Index": 506,
"LastContact": 0,
"KnownLeader": true
}
```
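
If you only care about the preemption settings, you can narrow the output with
a `jq` filter:

```shell
# Show only the preemption settings from the scheduler configuration
$ curl -s localhost:4646/v1/operator/scheduler/configuration | \
    jq '.SchedulerConfig.PreemptionConfig'
```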
Note that [BatchSchedulerEnabled][batch-enabled] and
[ServiceSchedulerEnabled][service-enabled] are both set to `false` by default.
Since we are preempting service jobs in this guide, we need to set
`ServiceSchedulerEnabled` to `true`. We will do this by directly interacting
with the [API][update-scheduler].
Create the following JSON payload and place it in a file named `scheduler.json`:
```json
{
"PreemptionConfig": {
"SystemSchedulerEnabled": true,
"BatchSchedulerEnabled": false,
"ServiceSchedulerEnabled": true
}
}
```
Note that [ServiceSchedulerEnabled][service-enabled] has been set to `true`.
Run the following command to update the scheduler configuration:
```shell
$ curl -XPOST localhost:4646/v1/operator/scheduler/configuration -d @scheduler.json
```
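
To guard against concurrent modifications, the [update
endpoint][update-scheduler] also accepts a check-and-set `cas` query parameter;
the update is applied only if the configuration's `ModifyIndex` still matches
the value you pass. A sketch using the `ModifyIndex` from the sample read
above:

```shell
# Check-and-set update: applied only if ModifyIndex is still 506
$ curl -XPOST "localhost:4646/v1/operator/scheduler/configuration?cas=506" \
    -d @scheduler.json
```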
You should now be able to check the scheduler configuration again and see that
preemption has been enabled for service jobs (output below is abbreviated):
```shell
$ curl -s localhost:4646/v1/operator/scheduler/configuration | jq
{
"SchedulerConfig": {
"PreemptionConfig": {
"SystemSchedulerEnabled": true,
"BatchSchedulerEnabled": false,
"ServiceSchedulerEnabled": true
},
...
}
```
### Step 6: Try Running `redis.nomad` Again
Now that you have enabled preemption on service jobs, deploying your `redis`
job should preempt one of the lower-priority `webserver` allocations, which
will then wait in the queue until the cluster has capacity for it. You can run
`nomad plan` to see a preview of what will happen:
```shell
$ nomad plan redis.nomad
+ Job: "redis"
+ Task Group: "cache1" (1 create)
+ Task: "redis" (forces create)
Scheduler dry-run:
- All tasks successfully allocated.
Preemptions:
Alloc ID Job ID Task Group
725d3b49-d5cf-6ba2-be3d-cb441c10a8b3 webserver webserver
...
```
Note that Nomad is indicating one of the `webserver` allocations will be
evicted.
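
If you want more detail on that allocation before proceeding, you can inspect
it by ID (the ID below comes from the sample plan output above):

```shell
# Inspect the allocation that the plan marked for preemption
$ nomad alloc status 725d3b49
```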
Now run the `redis` job:
```shell
$ nomad run redis.nomad
==> Monitoring evaluation "7ada9d9f"
Evaluation triggered by job "redis"
Allocation "8bfcdda3" created: node "16653ac1", group "cache1"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "7ada9d9f" finished with status "complete"
```
You can check the status of the `webserver` job and verify that one of the allocations has been evicted:
```shell
$ nomad status webserver
ID = webserver
Name = webserver
Submit Date = 2019-06-19T04:20:32Z
Type = service
Priority = 40
...
Summary
Task Group Queued Starting Running Failed Complete Lost
webserver 1 0 2 0 1 0
Placement Failure
Task Group "webserver":
* Resources exhausted on 3 nodes
* Dimension "memory" exhausted on 3 nodes
Allocations
ID Node ID Task Group Version Desired Status Created Modified
725d3b49 16653ac1 webserver 0 evict complete 4m10s ago 33s ago
e2f9cb3d f765c6e8 webserver 0 run running 4m10s ago 3m54s ago
e9d8df1b b0700ec0 webserver 0 run running 4m10s ago 3m51s ago
```
### Step 7: Stop the Redis Job
Stop the `redis` job and verify that the evicted/queued `webserver` allocation
starts running again:
```shell
$ nomad stop redis
==> Monitoring evaluation "670922e9"
Evaluation triggered by job "redis"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "670922e9" finished with status "complete"
```
You should now see from the `webserver` status that the preempted allocation has been replaced with a new one (`f623eb81` below) and the job is back to running all three allocations:
```shell
$ nomad status webserver
ID = webserver
Name = webserver
Submit Date = 2019-06-19T04:20:32Z
Type = service
Priority = 40
Datacenters = dc1
Status = running
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
webserver 0 0 3 0 1 0
Allocations
ID Node ID Task Group Version Desired Status Created Modified
f623eb81 16653ac1 webserver 0 run running 13s ago 7s ago
725d3b49 16653ac1 webserver 0 evict complete 6m44s ago 3m7s ago
e2f9cb3d f765c6e8 webserver 0 run running 6m44s ago 6m28s ago
e9d8df1b b0700ec0 webserver 0 run running 6m44s ago 6m25s ago
```
## Next Steps
The process you learned in this guide can also be applied to
[batch][batch-enabled] jobs. Read more about preemption in Nomad
Enterprise [here][enterprise-preemption].
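
For example, to enable preemption for batch jobs as well, you could POST a
payload with `BatchSchedulerEnabled` set to `true`, following the same pattern
as Step 5:

```shell
# Enable preemption for batch jobs as well (same endpoint as in Step 5)
$ curl -XPOST localhost:4646/v1/operator/scheduler/configuration -d '{
  "PreemptionConfig": {
    "SystemSchedulerEnabled": true,
    "BatchSchedulerEnabled": true,
    "ServiceSchedulerEnabled": true
  }
}'
```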
[batch-enabled]: /api/operator.html#batchschedulerenabled-1
[batch-job]: /docs/schedulers.html#batch
[count]: /docs/job-specification/group.html#count
[enterprise]: /docs/enterprise/index.html
[enterprise-preemption]: /docs/enterprise/preemption/index.html
[memory]: /docs/job-specification/resources.html#memory
[payload-preemption-config]: /api/operator.html#sample-payload-1
[plan]: /docs/commands/job/plan.html
[preemption]: /docs/internals/scheduling/preemption.html
[preemption-config]: /api/operator.html#preemptionconfig-1
[priority]: /docs/job-specification/job.html#priority
[service-enabled]: /api/operator.html#serviceschedulerenabled-1
[service-job]: /docs/schedulers.html#service
[step-1]: #step-1-create-a-job-with-low-priority
[system-job]: /docs/schedulers.html#system
[t2-micro]: https://aws.amazon.com/ec2/instance-types/
[update-scheduler]: /api/operator.html#update-scheduler-configuration
[scheduler-configuration]: /api/operator.html#read-scheduler-configuration


@@ -1,7 +1,7 @@
 ---
 layout: "guides"
 page_title: "Spread"
-sidebar_current: "guides-advanced-scheduling"
+sidebar_current: "guides-operating-a-job-spread"
 description: |-
   The following guide walks the user through using the spread stanza in Nomad.
 ---


@@ -119,10 +119,14 @@
         <a href="/guides/operating-a-job/advanced-scheduling/affinity.html">Placement Preferences with Affinities</a>
       </li>
-      <li<%= sidebar_current("guides-spread") %>>
+      <li<%= sidebar_current("guides-operating-a-job-spread") %>>
         <a href="/guides/operating-a-job/advanced-scheduling/spread.html">Fault Tolerance with Spread</a>
       </li>
+      <li<%= sidebar_current("guides-operating-a-job-preemption-service-batch") %>>
+        <a href="/guides/operating-a-job/advanced-scheduling/preemption-service-batch.html">Preemption (Service and Batch Jobs)</a>
+      </li>
       <li<%= sidebar_current("guides-operating-a-job-external-lxc") %>>
         <a href="/guides/operating-a-job/external/lxc.html">Running LXC Applications</a>
       </li>