website: add docs on scheduling
parent 1a18a57368
commit 5d091a0536
BIN
website/source/assets/images/eval-flow.png
(Stored with Git LFS)
Normal file
Binary file not shown.
BIN
website/source/assets/images/nomad-nouns.png
(Stored with Git LFS)
Normal file
Binary file not shown.
94
website/source/docs/internals/scheduling.html.md
Normal file
@@ -0,0 +1,94 @@
---
layout: "docs"
page_title: "Scheduling"
sidebar_current: "docs-internals-scheduling"
description: |-
  Learn about how scheduling works in Nomad.
---

# Scheduling

Scheduling is a core function of Nomad. It is the process of assigning tasks
from jobs to client machines. This process must respect the constraints declared
in the job and optimize for resource utilization by bin packing. This page documents
the details of how scheduling works in Nomad to help both users and developers
build a mental model of it. The design is heavily inspired by Google's
work on [Omega: flexible, scalable schedulers for large compute clusters](http://research.google.com/pubs/pub41684.html).

~> **Advanced Topic!** This page covers technical details
of Nomad. You don't need to understand these details to
effectively use Nomad. The details are documented here for
those who wish to learn about them without having to go
spelunking through the source code.

# Scheduling in Nomad

![Data Model](/assets/images/nomad-nouns.png)

There are four primary "nouns" in Nomad: jobs, nodes, allocations, and evaluations.
Jobs are submitted by users and represent a _desired state_. A job is a declarative description
of tasks to run, which are bounded by constraints and require resources. Nodes are the servers
in the cluster that tasks can be scheduled on. The mapping of tasks in a job to nodes is done
using allocations. An allocation declares that a set of tasks in a job should be run
on a particular node. Scheduling is the process of determining the appropriate allocations and
is done as part of an evaluation.
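
The relationships between the four nouns can be sketched as plain data types. This is an illustrative Python sketch, not Nomad's actual data structures; every field name and unit below is an assumption chosen for readability.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    # Desired state: tasks to run, bounded by constraints and resource needs.
    id: str
    tasks: list
    constraints: dict = field(default_factory=dict)
    cpu: int = 0          # hypothetical resource units
    memory_mb: int = 0    # hypothetical resource units


@dataclass
class Node:
    # A machine in the cluster that tasks can be scheduled on.
    id: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Allocation:
    # Declares that a set of tasks from one job should run on one node.
    job_id: str
    node_id: str
    tasks: list


@dataclass
class Evaluation:
    # The unit of scheduling work; processing it determines allocations.
    id: str
    job_id: str
    status: str = "pending"


job = Job(id="web", tasks=["frontend"], cpu=500, memory_mb=256)
node = Node(id="node-1", attributes={"kernel": "linux"})
alloc = Allocation(job_id=job.id, node_id=node.id, tasks=job.tasks)
```

The allocation carries only identifiers for the job and the node, which mirrors the idea that it is the mapping between the two rather than a copy of either.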

An evaluation is created any time the external state, either desired or emergent, changes. The desired
state is based on jobs, meaning the desired state changes if a new job is submitted, an
existing job is updated, or a job is deregistered. The emergent state is based on the client
nodes, so Nomad must handle the failure of any client in the system. These events trigger
the creation of a new evaluation, as Nomad must _evaluate_ the state of the world and reconcile
it with the desired state.
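
As a rough illustration of the triggers just described, the sketch below turns a state-changing event into pending evaluations. The event labels, the `evaluations_for` helper, and the choice of one evaluation per affected job are all assumptions of this example, not Nomad's API.

```python
import itertools
from dataclasses import dataclass

# Event labels are hypothetical names for the triggers described above.
DESIRED_STATE_EVENTS = {"job-register", "job-update", "job-deregister"}
EMERGENT_STATE_EVENTS = {"node-failure"}

_next_id = itertools.count(1)


@dataclass
class Evaluation:
    id: int
    job_id: str
    triggered_by: str
    status: str = "pending"


def evaluations_for(event, affected_job_ids):
    """Return one pending evaluation per affected job (an assumption of this sketch)."""
    if event not in DESIRED_STATE_EVENTS | EMERGENT_STATE_EVENTS:
        raise ValueError(f"event does not change external state: {event!r}")
    return [Evaluation(next(_next_id), job_id, event) for job_id in affected_job_ids]


# A job submission changes the desired state of that job only.
submitted = evaluations_for("job-register", ["web"])
# A failed client changes the emergent state of every job that ran on it.
failed = evaluations_for("node-failure", ["web", "cache"])
```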

This diagram shows the flow of an evaluation through Nomad:

![Evaluation Flow](/assets/images/eval-flow.png)

The lifecycle of an evaluation begins with an event causing the evaluation to be
created. Evaluations are created in the `pending` state and are enqueued into the
evaluation broker. There is a single evaluation broker, which runs on the leader server.
The evaluation broker manages the queue of pending evaluations, provides priority ordering,
and ensures at-least-once delivery.
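
The broker's two guarantees, priority ordering and at-least-once delivery, can be sketched with a heap plus a table of unacknowledged deliveries. This is a toy model under stated assumptions: the `EvalBroker` class, its method names, and the convention that a larger number means higher priority are hypothetical, and timeouts are left out.

```python
import heapq
import itertools


class EvalBroker:
    """Toy broker: priority ordering plus at-least-once delivery."""

    def __init__(self):
        self._heap = []                # entries: (-priority, seq, eval_id)
        self._seq = itertools.count()  # tie-breaker: FIFO within one priority
        self._unacked = {}             # eval_id -> priority, delivered but not acked

    def enqueue(self, eval_id, priority):
        heapq.heappush(self._heap, (-priority, next(self._seq), eval_id))

    def dequeue(self):
        """Hand out the highest-priority evaluation, or None if the queue is empty."""
        if not self._heap:
            return None
        neg_priority, _, eval_id = heapq.heappop(self._heap)
        self._unacked[eval_id] = -neg_priority
        return eval_id

    def ack(self, eval_id):
        """The worker finished: forget the evaluation."""
        del self._unacked[eval_id]

    def nack(self, eval_id):
        """The worker failed: put the evaluation back for redelivery."""
        self.enqueue(eval_id, self._unacked.pop(eval_id))


broker = EvalBroker()
broker.enqueue("eval-low", priority=10)
broker.enqueue("eval-high", priority=90)

first = broker.dequeue()   # highest priority comes out first
broker.nack(first)         # simulate a failed worker
again = broker.dequeue()   # the same evaluation is delivered again
broker.ack(again)
```

Because an evaluation stays in the unacknowledged table until `ack` is called, a worker that fails mid-way never loses it; the cost is that the same evaluation may be processed more than once.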

Nomad servers run scheduling workers, defaulting to one per CPU core, which are used to
process evaluations. The workers dequeue evaluations from the broker and then invoke
the appropriate scheduler as specified by the job. Nomad ships with a `service` scheduler
that optimizes for long-lived services, a `batch` scheduler that is used for fast placement
of batch jobs, and a `core` scheduler that is used for internal maintenance. Nomad can
be extended to support custom schedulers as well.
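
One way to picture how a worker picks a scheduler is a registry keyed by scheduler type. The sketch below assumes an evaluation is a dictionary with `type` and `job_id` keys; the registry, the `register_scheduler` helper, and the placeholder return values are hypothetical and only illustrate the dispatch step.

```python
def service_scheduler(evaluation):
    return f"service plan for {evaluation['job_id']}"


def batch_scheduler(evaluation):
    return f"batch plan for {evaluation['job_id']}"


def core_scheduler(evaluation):
    return f"core plan for {evaluation['job_id']}"


# The three scheduler types named in the text.
SCHEDULERS = {
    "service": service_scheduler,
    "batch": batch_scheduler,
    "core": core_scheduler,
}


def register_scheduler(name, fn):
    """Extension point for a custom scheduler (hypothetical helper)."""
    SCHEDULERS[name] = fn


def process(evaluation):
    """What a worker does after dequeuing: look up the scheduler and invoke it."""
    try:
        scheduler = SCHEDULERS[evaluation["type"]]
    except KeyError:
        raise ValueError(f"unknown scheduler type: {evaluation['type']!r}") from None
    return scheduler(evaluation)


result = process({"type": "batch", "job_id": "etl"})

register_scheduler("custom", lambda ev: f"custom plan for {ev['job_id']}")
custom_result = process({"type": "custom", "job_id": "gpu"})
```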

Schedulers are responsible for processing an evaluation and generating an allocation _plan_.
The plan is the set of allocations to evict, update, or create. The specific logic used to
generate a plan may vary by scheduler, but generally the scheduler needs to first reconcile
the desired state with the real state to determine what must be done. New allocations need
to be placed and existing allocations may need to be updated, migrated, or stopped.
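
The reconcile step can be sketched as a comparison between a desired count and the allocations that currently exist. This is a deliberately reduced model: it assumes a job is described only by a desired number of allocations, it covers stopping and placing but not in-place updates, and the `reconcile` function and its input shape are hypothetical.

```python
def reconcile(desired_count, existing_allocs):
    """Compare desired state with real state and build a minimal plan.

    existing_allocs is a list of (alloc_id, node_healthy) tuples.
    Returns {"stop": [alloc ids], "place": number of new allocations}.
    """
    plan = {"stop": [], "place": 0}
    healthy = []
    for alloc_id, node_healthy in existing_allocs:
        if node_healthy:
            healthy.append(alloc_id)
        else:
            # The node is gone or unhealthy: stop here and replace elsewhere.
            plan["stop"].append(alloc_id)

    surplus = len(healthy) - desired_count
    if surplus > 0:
        # More running than desired: stop the extras.
        plan["stop"].extend(healthy[desired_count:])
    else:
        # Fewer running than desired: ask for new placements.
        plan["place"] = -surplus
    return plan


scale_up = reconcile(3, [("a1", True), ("a2", False)])
scale_down = reconcile(1, [("a1", True), ("a2", True)])
```

In the first call one allocation sits on an unhealthy node, so the plan stops it and asks for two new placements to reach three. In the second call the job shrank, so the plan stops the surplus allocation and places nothing.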

Placing allocations is split into two distinct phases: feasibility
checking and ranking. In the first phase, the scheduler finds nodes that are
feasible by filtering out unhealthy nodes, those missing necessary drivers, and those
failing the specified constraints.
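
The three filters in the feasibility phase can be sketched as a single pass over the candidate nodes. The node dictionaries, their keys, and the exact-match constraint semantics are assumptions made for this example.

```python
def feasible_nodes(nodes, required_driver, constraints):
    """Return the ids of nodes that pass all three feasibility filters."""
    result = []
    for node in nodes:
        if not node["healthy"]:
            continue  # filter 1: unhealthy node
        if required_driver not in node["drivers"]:
            continue  # filter 2: missing the necessary driver
        if any(node["attributes"].get(k) != v for k, v in constraints.items()):
            continue  # filter 3: fails a specified constraint
        result.append(node["id"])
    return result


nodes = [
    {"id": "n1", "healthy": True, "drivers": {"docker"}, "attributes": {"kernel": "linux"}},
    {"id": "n2", "healthy": False, "drivers": {"docker"}, "attributes": {"kernel": "linux"}},
    {"id": "n3", "healthy": True, "drivers": {"exec"}, "attributes": {"kernel": "linux"}},
    {"id": "n4", "healthy": True, "drivers": {"docker"}, "attributes": {"kernel": "windows"}},
]

candidates = feasible_nodes(nodes, "docker", {"kernel": "linux"})
```

Only `n1` survives here: `n2` is unhealthy, `n3` lacks the driver, and `n4` fails the constraint.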

The second phase is ranking, where the scheduler scores feasible nodes to find the best fit.
Scoring is primarily based on bin packing, which is used to optimize the resource utilization
and density of applications, but is also augmented by affinity and anti-affinity rules.
Once the scheduler has ranked enough nodes, the highest ranking node is selected and
added to the allocation plan.
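
A minimal ranking sketch follows. The scoring formula is an assumption for illustration and is not Nomad's actual scoring function: it uses the average CPU and memory utilization after placement as the bin packing score and subtracts a fixed, hypothetical penalty for each copy of the same job already on the node. It also scores every node, whereas the text says the scheduler stops once it has ranked enough nodes.

```python
def score(node, ask_cpu, ask_mem, job_id, anti_affinity_weight=0.25):
    """Score one feasible node; higher is better, None means it does not fit."""
    cpu_after = node["used_cpu"] + ask_cpu
    mem_after = node["used_mem"] + ask_mem
    if cpu_after > node["cpu"] or mem_after > node["mem"]:
        return None
    # Bin packing: prefer the node that ends up most full.
    fit = (cpu_after / node["cpu"] + mem_after / node["mem"]) / 2
    # Anti-affinity: penalize nodes already running this job.
    penalty = anti_affinity_weight * node["jobs"].count(job_id)
    return fit - penalty


def pick_node(nodes, ask_cpu, ask_mem, job_id):
    """Return the id of the highest-scoring node, or None if nothing fits."""
    best_id, best_score = None, None
    for node in nodes:
        s = score(node, ask_cpu, ask_mem, job_id)
        if s is not None and (best_score is None or s > best_score):
            best_id, best_score = node["id"], s
    return best_id


nodes = [
    {"id": "n1", "cpu": 1000, "mem": 1000, "used_cpu": 600, "used_mem": 600, "jobs": ["web"]},
    {"id": "n2", "cpu": 1000, "mem": 1000, "used_cpu": 500, "used_mem": 500, "jobs": []},
    {"id": "n3", "cpu": 1000, "mem": 1000, "used_cpu": 950, "used_mem": 950, "jobs": []},
]

for_api = pick_node(nodes, 100, 100, "api")  # bin packing alone decides
for_web = pick_node(nodes, 100, 100, "web")  # anti-affinity shifts the choice
```

For the `api` job the fullest node that still fits wins. For the `web` job the same node is penalized because it already runs `web`, so the next-best node is chosen instead.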

When planning is complete, the scheduler submits the plan to the leader, where it
is added to the plan queue. The plan queue manages pending plans, provides priority
ordering, and allows Nomad to handle concurrency races. Multiple schedulers run
in parallel without locking or reservations, making Nomad optimistically concurrent.
As a result, schedulers might overlap work on the same node and cause resource
over-subscription. The plan queue allows the leader node to protect against this and
reject a plan partially or completely.
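
The leader-side check behind optimistic concurrency can be sketched as follows. The model is simplified to a single resource (CPU) and one placement per plan entry; the `apply_plan` function and its data shapes are hypothetical.

```python
def apply_plan(node_free_cpu, plan):
    """Apply a plan serially on the leader.

    plan is a list of (node_id, cpu) placements. Placements that still fit are
    accepted and deducted from node_free_cpu; the rest are rejected.
    Returns (accepted, rejected).
    """
    accepted, rejected = [], []
    for node_id, cpu in plan:
        if node_free_cpu.get(node_id, 0) >= cpu:
            node_free_cpu[node_id] -= cpu
            accepted.append((node_id, cpu))
        else:
            rejected.append((node_id, cpu))
    return accepted, rejected


free = {"n1": 500, "n2": 300}

# Two schedulers planned in parallel from the same snapshot of the cluster.
plan_a = [("n1", 400)]
plan_b = [("n1", 400), ("n2", 200)]

result_a = apply_plan(free, plan_a)  # fully accepted
result_b = apply_plan(free, plan_b)  # partially rejected: n1 is now too full
```

Neither scheduler took a lock while planning. The conflict is only detected when the leader applies the plans one at a time, and the second plan comes back with its `n1` placement rejected so that the scheduler can try again with fresher state.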

As the leader processes plans, it creates allocations when there is no conflict
and otherwise informs the scheduler of a failure in the plan result. The plan result
provides feedback to the scheduler, allowing it to terminate or explore alternate plans
if the previous plan was partially or completely rejected.

Once the scheduler has finished processing an evaluation, it updates the status of
the evaluation and acknowledges delivery with the evaluation broker. This completes
the lifecycle of an evaluation. Allocations that were created, modified, or deleted
as a result will be picked up by client nodes and will begin execution.

@@ -19,6 +19,10 @@
<li<%= sidebar_current("docs-internals-gossip") %>>
  <a href="/docs/internals/gossip.html">Gossip Protocol</a>
</li>

<li<%= sidebar_current("docs-internals-scheduling") %>>
  <a href="/docs/internals/scheduling.html">Scheduling</a>
</li>

<li<%= sidebar_current("docs-internals-telemetry") %>>