diff --git a/website/source/assets/images/eval-flow.png b/website/source/assets/images/eval-flow.png new file mode 100644 index 000000000..437f94eb7 --- /dev/null +++ b/website/source/assets/images/eval-flow.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:f93ea06f18814e27501cf5459444d7aa5a22e7d1f8ed339556363a8ad3b7576c +size 69066 diff --git a/website/source/assets/images/nomad-nouns.png b/website/source/assets/images/nomad-nouns.png new file mode 100644 index 000000000..be8d3defc --- /dev/null +++ b/website/source/assets/images/nomad-nouns.png @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:1e28b5172e9486b3e29f949847c086282c6b73cc706f686b6fe903b89f87c70d +size 60698 diff --git a/website/source/docs/internals/scheduling.html.md b/website/source/docs/internals/scheduling.html.md new file mode 100644 index 000000000..e3a19ea27 --- /dev/null +++ b/website/source/docs/internals/scheduling.html.md @@ -0,0 +1,94 @@ +--- +layout: "docs" +page_title: "Scheduling" +sidebar_current: "docs-internals-scheduling" +description: |- + Learn about how scheduling works in Nomad. +--- + +# Scheduling + +Scheduling is a core function of Nomad. It is the process of assigning tasks +from jobs to client machines. This process must respect the constraints as declared +in the job, and optimize for resource utilization by bin packing. This page documents +the details of how scheduling works in Nomad to help both users and developers +build a mental model of how it works. The design is heavily inspired by Google's +work on [Omega: flexible, scalable schedulers for large compute clusters](http://research.google.com/pubs/pub41684.html). + +~> **Advanced Topic!** This page covers technical details +of Nomad. You don't need to understand these details to +effectively use Nomad. The details are documented here for +those who wish to learn about them without having to go +spelunking through the source code. 
+ +# Scheduling in Nomad + +![Data Model](/assets/images/nomad-nouns.png) + +There are four primary "nouns" in Nomad: jobs, nodes, allocations, and evaluations. +Jobs are submitted by users and represent a _desired state_. A job is a declarative description +of tasks to run which are bounded by constraints and require resources. Nodes are the servers +in the cluster that tasks can be scheduled on. The mapping of tasks in a job to nodes is done +using allocations. An allocation is used to declare that a set of tasks in a job should be run +on a particular node. Scheduling is the process of determining the appropriate allocations and +is done as part of an evaluation. + +An evaluation is created any time the external state, either desired or emergent, changes. The desired +state is based on jobs, meaning the desired state changes if a new job is submitted, an +existing job is updated, or a job is deregistered. The emergent state is based on the client +nodes, and so we must handle the failure of any clients in the system. These events trigger +the creation of a new evaluation, as Nomad must _evaluate_ the state of the world and reconcile +it with the desired state. + +This diagram shows the flow of an evaluation through Nomad: + +![Evaluation Flow](/assets/images/eval-flow.png) + +The lifecycle of an evaluation begins with an event causing the evaluation to be +created. Evaluations are created in the `pending` state and are enqueued into the +evaluation broker. There is a single evaluation broker which runs on the leader server. +The evaluation broker is used to manage the queue of pending evaluations, provide priority ordering, +and ensure at-least-once delivery. + +Nomad servers run scheduling workers, defaulting to one per CPU core, which are used to +process evaluations. The workers dequeue evaluations from the broker, and then invoke +the appropriate scheduler as specified by the job. 
Nomad ships with a `service` scheduler +that optimizes for long-lived services, a `batch` scheduler that is used for fast placement +of batch jobs, and a `core` scheduler which is used for internal maintenance. Nomad can +be extended to support custom schedulers as well. + +Schedulers are responsible for processing an evaluation and generating an allocation _plan_. +The plan is the set of allocations to evict, update, or create. The specific logic used to +generate a plan may vary by scheduler, but generally the scheduler needs to first reconcile +the desired state with the real state to determine what must be done. New allocations need +to be placed and existing allocations may need to be updated, migrated, or stopped. + +Placing allocations is split into two distinct phases: feasibility +checking and ranking. In the first phase the scheduler finds nodes that are +feasible by filtering out unhealthy nodes, those missing necessary drivers, and those +failing the specified constraints. + +The second phase is ranking, where the scheduler scores feasible nodes to find the best fit. +Scoring is primarily based on bin packing, which is used to optimize the resource utilization +and density of applications, but is also augmented by affinity and anti-affinity rules. +Once the scheduler has ranked enough nodes, the highest ranking node is selected and +added to the allocation plan. + +When planning is complete, the scheduler submits the plan to the leader, where it is +added to the plan queue. The plan queue manages pending plans, provides priority +ordering, and allows Nomad to handle concurrency races. Multiple schedulers run +in parallel without locking or reservations, making Nomad optimistically concurrent. +As a result, schedulers might overlap work on the same node and cause resource +over-subscription. The plan queue allows the leader node to protect against this and +do partial or complete rejections of a plan. 
+ +As the leader processes plans, it creates allocations when there is no conflict +and otherwise informs the scheduler of a failure in the plan result. The plan result +provides feedback to the scheduler, allowing it to terminate or explore alternate plans +if the previous plan was partially or completely rejected. + +Once the scheduler has finished processing an evaluation, it updates the status of +the evaluation and acknowledges delivery with the evaluation broker. This completes +the lifecycle of an evaluation. Allocations that were created, modified or deleted +as a result will be picked up by client nodes and will begin execution. + diff --git a/website/source/layouts/docs.erb b/website/source/layouts/docs.erb index 78d5f585e..a53d044d5 100644 --- a/website/source/layouts/docs.erb +++ b/website/source/layouts/docs.erb @@ -19,6 +19,10 @@ > Gossip Protocol + + + > + Scheduling >