2015-09-19 19:02:48 +00:00
|
|
|
---
|
2020-02-06 23:45:31 +00:00
|
|
|
layout: docs
|
|
|
|
page_title: Scheduling
|
|
|
|
sidebar_title: Internals
|
|
|
|
description: Learn about how scheduling works in Nomad.
|
2015-09-19 19:02:48 +00:00
|
|
|
---
|
|
|
|
|
|
|
|
# Scheduling in Nomad
|
|
|
|
|
2017-08-01 17:13:42 +00:00
|
|
|
[![Nomad Data Model][img-data-model]][img-data-model]
|
|
|
|
|
|
|
|
There are four primary "nouns" in Nomad; jobs, nodes, allocations, and
|
|
|
|
evaluations. Jobs are submitted by users and represent a _desired state_. A job
|
|
|
|
is a declarative description of tasks to run which are bounded by constraints
|
2020-02-06 23:45:31 +00:00
|
|
|
and require resources. Tasks can be scheduled on nodes in the cluster running
|
2017-08-01 17:13:42 +00:00
|
|
|
the Nomad client. The mapping of tasks in a job to clients is done using
|
|
|
|
allocations. An allocation is used to declare that a set of tasks in a job
|
|
|
|
should be run on a particular node. Scheduling is the process of determining
|
|
|
|
the appropriate allocations and is done as part of an evaluation.
|
|
|
|
|
|
|
|
An evaluation is created any time the external state, either desired or
|
|
|
|
emergent, changes. The desired state is based on jobs, meaning the desired
|
|
|
|
state changes if a new job is submitted, an existing job is updated, or a job
|
|
|
|
is deregistered. The emergent state is based on the client nodes, and so we
|
|
|
|
must handle the failure of any clients in the system. These events trigger the
|
|
|
|
creation of a new evaluation, as Nomad must _evaluate_ the state of the world
|
|
|
|
and reconcile it with the desired state.
|
2015-09-19 19:02:48 +00:00
|
|
|
|
|
|
|
This diagram shows the flow of an evaluation through Nomad:
|
|
|
|
|
2017-08-01 17:13:42 +00:00
|
|
|
[![Nomad Evaluation Flow][img-eval-flow]][img-eval-flow]
|
|
|
|
|
|
|
|
The lifecycle of an evaluation begins with an event causing the evaluation to
|
|
|
|
be created. Evaluations are created in the `pending` state and are enqueued
|
|
|
|
into the evaluation broker. There is a single evaluation broker which runs on
|
|
|
|
the leader server. The evaluation broker is used to manage the queue of pending
|
|
|
|
evaluations, provide priority ordering, and ensure at least once delivery.
|
|
|
|
|
|
|
|
Nomad servers run scheduling workers, defaulting to one per CPU core, which are
|
|
|
|
used to process evaluations. The workers dequeue evaluations from the broker,
|
|
|
|
and then invoke the appropriate scheduler as specified by the job. Nomad ships
|
|
|
|
with a `service` scheduler that optimizes for long-lived services, a `batch`
|
|
|
|
scheduler that is used for fast placement of batch jobs, a `system` scheduler
|
|
|
|
that is used to run jobs on every node, and a `core` scheduler which is used
|
2020-07-22 15:57:43 +00:00
|
|
|
for internal maintenance.
|
2017-08-01 17:13:42 +00:00
|
|
|
|
|
|
|
Schedulers are responsible for processing an evaluation and generating an
|
|
|
|
allocation _plan_. The plan is the set of allocations to evict, update, or
|
|
|
|
create. The specific logic used to generate a plan may vary by scheduler, but
|
|
|
|
generally the scheduler needs to first reconcile the desired state with the
|
|
|
|
real state to determine what must be done. New allocations need to be placed
|
|
|
|
and existing allocations may need to be updated, migrated, or stopped.
|
|
|
|
|
|
|
|
Placing allocations is split into two distinct phases, feasibility checking and
|
|
|
|
ranking. In the first phase the scheduler finds nodes that are feasible by
|
|
|
|
filtering unhealthy nodes, those missing necessary drivers, and those failing
|
|
|
|
the specified constraints.
|
|
|
|
|
|
|
|
The second phase is ranking, where the scheduler scores feasible nodes to find
|
|
|
|
the best fit. Scoring is primarily based on bin packing, which is used to
|
|
|
|
optimize the resource utilization and density of applications, but is also
|
2017-08-01 17:58:18 +00:00
|
|
|
augmented by affinity and anti-affinity rules. Nomad automatically applies a job
|
|
|
|
anti-affinity rule which discourages colocating multiple instances of a task
|
|
|
|
group. The combination of this anti-affinity and bin packing optimizes for
|
|
|
|
density while reducing the probability of correlated failures.
|
2017-07-30 20:18:21 +00:00
|
|
|
|
2017-08-01 17:13:42 +00:00
|
|
|
Once the scheduler has ranked enough nodes, the highest ranking node is
|
|
|
|
selected and added to the allocation plan.
|
2015-09-19 19:02:48 +00:00
|
|
|
|
2017-08-01 17:13:42 +00:00
|
|
|
When planning is complete, the scheduler submits the plan to the leader which
|
|
|
|
adds the plan to the plan queue. The plan queue manages pending plans, provides
|
|
|
|
priority ordering, and allows Nomad to handle concurrency races. Multiple
|
|
|
|
schedulers are running in parallel without locking or reservations, making
|
|
|
|
Nomad optimistically concurrent. As a result, schedulers might overlap work on
|
|
|
|
the same node and cause resource over-subscription. The plan queue allows the
|
|
|
|
leader node to protect against this and do partial or complete rejections of a
|
|
|
|
plan.
|
2015-09-19 19:02:48 +00:00
|
|
|
|
|
|
|
As the leader processes plans, it creates allocations when there is no conflict
|
2017-08-01 17:13:42 +00:00
|
|
|
and otherwise informs the scheduler of a failure in the plan result. The plan
|
|
|
|
result provides feedback to the scheduler, allowing it to terminate or explore
|
|
|
|
alternate plans if the previous plan was partially or completely rejected.
|
|
|
|
|
|
|
|
Once the scheduler has finished processing an evaluation, it updates the status
|
|
|
|
of the evaluation and acknowledges delivery with the evaluation broker. This
|
|
|
|
completes the lifecycle of an evaluation. Allocations that were created,
|
|
|
|
modified or deleted as a result will be picked up by client nodes and will
|
|
|
|
begin execution.
|
|
|
|
|
2020-02-06 23:45:31 +00:00
|
|
|
[omega]: https://research.google.com/pubs/pub41684.html
|
|
|
|
[borg]: https://research.google.com/pubs/pub43438.html
|
|
|
|
[img-data-model]: /img/nomad-data-model.png
|
|
|
|
[img-eval-flow]: /img/nomad-evaluation-flow.png
|