internals documentation with diagrams (#14750)

This changeset adds new architecture internals documents to the contributing
guide. These are intentionally kept here rather than on the public-facing
website: the material is not required for operators, and it includes many
diagrams that we can maintain cheaply in mermaid syntax. Publishing them on the
main site would require art assets that would quickly go stale as the code
changes and would be expensive to maintain. However, these documents should be
suitable to use as points of conversation with expert end users.

Included:
* A description of Evaluation triggers and expected counts, with examples.
* A description of Evaluation states and implicit states. This is taken from an
  internal document in our team wiki.
* A description of how writes to the State Store work. This is taken from a
  diagram I put together a few months ago for internal education purposes.
* A description of Evaluation lifecycle, from registration to running
  Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but
  broken into digestible chunks and without multi-region deployments, which I'd
  like to cover in a future doc.

Also adds Deployments to our public-facing glossary.

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
Tim Gross 2022-10-03 14:06:41 -04:00 committed by GitHub
parent bd8d023ee5
commit 2a6e8be6ba
5 changed files with 965 additions and 0 deletions


@@ -0,0 +1,342 @@
# Architecture: Evaluations into Allocations
The [Scheduling Concepts][] docs provide an overview of how the scheduling
process works. This document is intended to go a bit deeper than those docs and
walk through the lifecycle of an Evaluation from job registration to running
Allocations on the clients. The process can be broken into 4 parts:
* [Job Registration](#job-registration)
* [Scheduling](#scheduling)
* [Deployment Watcher](#deployment-watcher)
* [Client Allocs](#client-allocs)
Note that in all the diagrams below, writing to the State Store is considered
atomic. This means the Nomad leader has replicated to all the followers, and all
the servers have applied the raft log entry to their local FSM. So long as
`raftApply` returns without an error, we have a guarantee that all servers will
be able to retrieve the entry from their state store at some point in the
future.
## Job Registration
Creating or updating a Job is a _synchronous_ API operation. By the time the
response has returned to the API consumer, the Job and an Evaluation for that Job
have been written to the state store of the server nodes.
* Note: parameterized or periodic batch jobs don't create an Evaluation at
registration, only once dispatched.
* Note: The scheduler is very fast! This means that once the CLI gets a
response, it can immediately start making queries to get information about the
next steps (see the sketch below these notes).
* Note: This workflow is different for Multi-Region Deployments (found in Nomad
Enterprise). That will be documented separately.
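Because registration is synchronous, the `EvalID` returned in the registration
response can be queried immediately. The sketch below uses the official Go API
client (`github.com/hashicorp/nomad/api`) to do exactly that; the job body is a
throwaway `raw_exec` example and error handling is minimal, so treat it as an
illustration rather than a recipe.
```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Connect to the local agent; DefaultConfig honors NOMAD_ADDR and friends.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// A tiny service job (assumes the raw_exec driver is enabled on the client).
	job := api.NewServiceJob("example", "example", "global", 50).
		AddDatacenter("dc1").
		AddTaskGroup(
			api.NewTaskGroup("web", 1).
				AddTask(api.NewTask("sleep", "raw_exec").
					SetConfig("command", "/bin/sleep").
					SetConfig("args", []string{"3600"})))

	// Registration is synchronous: by the time Register returns, the Job and
	// its Evaluation are in the state store, so the EvalID is queryable.
	resp, _, err := client.Jobs().Register(job, nil)
	if err != nil {
		log.Fatal(err)
	}

	eval, _, err := client.Evaluations().Info(resp.EvalID, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("eval %s status=%s triggered-by=%s\n", eval.ID, eval.Status, eval.TriggeredBy)
}
```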
The diagram below shows this initial synchronous phase. Note here that there's
no scheduling happening yet, so no Allocation (or Deployment) has been created.
```mermaid
sequenceDiagram
participant user as User
participant cli as CLI
participant httpAPI as HTTP API
participant leaderRpc as Leader RPC
participant stateStore as State Store
user ->> cli: nomad job run
activate cli
cli ->> httpAPI: Create Job API
httpAPI ->> leaderRpc: Job.Register
activate leaderRpc
leaderRpc ->> stateStore: write Job
activate stateStore
stateStore -->> leaderRpc: ok
deactivate stateStore
leaderRpc ->> stateStore: write Evaluation
activate stateStore
stateStore -->> leaderRpc: ok
Note right of stateStore: EvalBroker.Enqueue
Note right of stateStore: (see Scheduling below)
deactivate stateStore
leaderRpc -->> httpAPI: Job.Register response
deactivate leaderRpc
httpAPI -->> cli: Create Job response
deactivate cli
```
## Scheduling
A long-lived goroutine on the Nomad leader called the Eval Broker maintains a
queue of Evaluations previously written to the state store and enqueued via the
`EvalBroker.Enqueue` method. (When a leader transition occurs, the leader
queries all the Evaluations in the state store and enqueues them in its new Eval
Broker.)
Scheduler workers are long-lived goroutines running on all server
nodes. Typically this will be one per core on followers and 1/4 that number on
the leader. The workers poll for Evaluations from the Eval Broker with the
`Eval.Dequeue` RPC. Once a worker has an evaluation, it instantiates a scheduler
for that evaluation (of type `service`, `system`, `sysbatch`, `batch`, or
`core`).
Because a worker is running one scheduler at a time, Nomad's documentation often
refers to "workers" and "schedulers" interchangeably, but the worker is the
long-lived goroutine and the scheduler is the struct that contains the code and
state around processing a single evaluation. The scheduler mutates itself and is
thrown away once the evaluation is processed.
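The sketch below is a conceptual model of that worker loop. The types and names
(`Broker`, `newScheduler`, and so on) are invented for illustration and do not
match `nomad/worker.go`; the point is that a fresh scheduler is built per
Evaluation and the result is always an Ack or a Nack back to the broker.
```go
package scheduling

// Evaluation and the interfaces below are stand-ins for the real structs in
// nomad/structs and nomad/worker.go; this is a conceptual sketch only.
type Evaluation struct {
	ID   string
	Type string // "service", "system", "sysbatch", "batch", or "core"
}

// Scheduler processes exactly one Evaluation and is then discarded.
type Scheduler interface {
	Process(eval *Evaluation) error
}

// Broker is the worker's view of the Eval Broker on the leader.
type Broker interface {
	Dequeue() (*Evaluation, error) // blocks until an Evaluation is available
	Ack(evalID string) error       // plan submitted successfully
	Nack(evalID string) error      // failed; let another worker retry it
}

// schedulerFactory picks a scheduler constructor based on the eval type.
type schedulerFactory func(evalType string) Scheduler

// runWorker is the long-lived goroutine: one scheduler per Evaluation, thrown
// away as soon as the Evaluation has been Acked or Nacked.
func runWorker(broker Broker, newScheduler schedulerFactory, shutdownCh <-chan struct{}) {
	for {
		select {
		case <-shutdownCh:
			return
		default:
		}

		eval, err := broker.Dequeue()
		if err != nil || eval == nil {
			continue
		}

		sched := newScheduler(eval.Type)
		if err := sched.Process(eval); err != nil {
			broker.Nack(eval.ID) // another worker can pick the eval up again
			continue
		}
		broker.Ack(eval.ID)
	}
}
```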
The scheduler takes a snapshot of that server node's state store so that it has
a consistent view of the cluster state while it works. The scheduler executes 3
main steps (sketched in code after this list):
* Reconcile: compare the cluster state and job specification to determine what
changes need to be made -- starting or stopping Allocations. The scheduler
creates new Allocations in this step. But note that _these Allocations are not
yet persisted to the state store_.
* Feasibility Check: For each Allocation it needs, the scheduler iterates over
Nodes until it finds up to 2 Nodes that match the Allocation's resource requirements
and constraints.
  * Note: for system jobs or jobs with `spread` blocks, the scheduler has
to check all Nodes.
* Scoring: The scheduler ranks the feasible Nodes and picks the one with the
  highest score.
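The sketch below condenses the feasibility-check and scoring steps into one
function over invented types; the real scheduler uses snapshot-backed iterator
stacks rather than slices, but the shape is the same: filter Nodes, score the
survivors, and keep the best.
```go
package scheduling

// Toy types; the real scheduler works from a state-store snapshot and uses
// stacked feasibility and ranking iterators instead of plain slices.
type Node struct {
	ID string
}

type Allocation struct {
	Name   string
	NodeID string
}

type placementRequest struct {
	name     string
	feasible func(*Node) bool    // resource, driver, and constraint checks
	score    func(*Node) float64 // binpacking, affinities, spread penalties, ...
}

// place performs the feasibility-check and scoring steps for one needed
// Allocation: filter Nodes, then pick the highest-scoring feasible one.
func place(req placementRequest, nodes []*Node) *Allocation {
	var best *Node
	var bestScore float64
	feasibleSeen := 0

	for _, node := range nodes {
		if !req.feasible(node) {
			continue
		}
		if s := req.score(node); best == nil || s > bestScore {
			best, bestScore = node, s
		}
		// The real scheduler stops after a small number of feasible Nodes,
		// except for system jobs and jobs with spread blocks, which must
		// consider every Node.
		feasibleSeen++
		if feasibleSeen >= 2 {
			break
		}
	}

	if best == nil {
		return nil // no feasible Node: this placement ends up in a blocked eval
	}
	// This Allocation exists only in the plan; nothing is persisted until the
	// plan applier accepts the plan.
	return &Allocation{Name: req.name, NodeID: best.ID}
}
```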
If the job is a `service` job, the scheduler will also create (or update) a
Deployment. When the scheduler determines the number of Allocations to create,
it examines the job's [`update`][] block. Only the number of Allocations needed
to complete the next phase of the update will be created for a given
Evaluation. The Deployment is used by the deployment watcher (see below) to
monitor the health of allocations and create new Evaluations to continue the
update.
If the scheduler cannot place all Allocations, it will create a new Evaluation
in the `blocked` state and submit it to the leader. The Eval Broker will
re-enqueue that Evaluation once cluster state has changed. (This process is the
first green box in the sequence diagram below.)
Once the scheduler has completed processing the Evaluation, if there are
Allocations (and possibly a Deployment) to update, it will submit this work as a
Plan to the leader. The leader needs to validate this plan and serialize it:
* The scheduler took a snapshot of cluster state at the start of its work, so
that state may have changed in the meantime.
* Schedulers run concurrently across the cluster, so they may generate
conflicting plans (particularly on heavily-packed clusters).
The leader processes the plan in the plan applier. If the plan is valid, the
plan applier will write the Allocations (and Deployment) update to the state
store. If not, it will reject the plan and the scheduler will try to create a
new plan with a refreshed state. If the scheduler fails to submit a valid plan
too many times, it submits a `blocked` Evaluation with the `max-plan-attempts`
trigger. (The plan submit process is the second green box in
the sequence diagram below.)
Once the scheduler has a response from the leader, it will tell the Eval Broker
to Ack the Evaluation (if it successfully submitted the plan) or Nack the
Evaluation (if it failed to do so) so that another scheduler can try processing it.
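A simplified sketch of that submit-and-retry behaviour, with invented types
standing in for the worker's real plumbing: a rejected plan triggers a snapshot
refresh and a recomputed plan, and only after too many rejections does the
scheduler give up (at which point it emits a `blocked` Evaluation with the
`max-plan-attempts` trigger and Nacks the original).
```go
package scheduling

import "errors"

const maxPlanAttempts = 5 // illustrative; the real limit is a scheduler config value

// Plan, Planner, and StateSnapshot are stand-ins for the real types.
type Plan struct{}

type Planner interface {
	// SubmitPlan asks the leader's plan applier to validate and apply the plan.
	SubmitPlan(p *Plan) error
}

type StateSnapshot interface {
	Refresh() error // re-snapshot the local state store
}

var errPlanRejected = errors.New("plan rejected too many times")

// submitWithRetries tries to get a plan accepted by the leader. On rejection
// it refreshes the scheduler's snapshot and recomputes the plan. If it gives
// up, the caller creates a blocked eval (trigger: max-plan-attempts) and
// Nacks the original Evaluation.
func submitWithRetries(planner Planner, snap StateSnapshot, makePlan func() *Plan) error {
	for attempt := 0; attempt < maxPlanAttempts; attempt++ {
		if err := planner.SubmitPlan(makePlan()); err == nil {
			// Accepted: the Allocations (and Deployment) are now in the state store.
			return nil
		}
		// The snapshot may be stale relative to the leader; refresh and retry.
		if err := snap.Refresh(); err != nil {
			return err
		}
	}
	return errPlanRejected
}
```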
The diagram below shows the scheduling phase, including submitting plans to the
planner. Note that at the end of this phase, Allocations (and possibly a Deployment) have
been persisted to the state store.
```mermaid
sequenceDiagram
participant leaderRpc as Leader RPC
participant stateStore as State Store
participant planner as Planner
participant broker as Eval Broker
participant sched as Scheduler
leaderRpc ->> stateStore: write Evaluation (see above)
stateStore ->> broker: EvalBroker.Enqueue
activate broker
broker -->> broker: enqueue eval
broker -->> stateStore: ok
deactivate broker
sched ->> broker: Eval.Dequeue (blocks until work available)
activate sched
broker -->> sched: Evaluation
sched ->> stateStore: query to get job and cluster state
stateStore -->> sched: results
sched -->> sched: Reconcile (how many allocs are needed?)
deactivate sched
alt
Note right of sched: Needs new allocs
activate sched
sched -->> sched: create Allocations
sched -->> sched: create Deployment (service jobs)
Note left of sched: for Deployments, only 1 "batch" of Allocs will get created for each Eval
sched -->> sched: Feasibility Check
Note left of sched: iterate over Nodes to find a placement for each Alloc
sched -->> sched: Ranking
Note left of sched: pick best option for each Alloc
%% start rect highlight for blocked eval
rect rgb(213, 246, 234)
Note left of sched: Not enough room! (But we can submit a partial plan)
sched ->> leaderRpc: Eval.Upsert (blocked)
activate leaderRpc
leaderRpc ->> stateStore: write Evaluation
activate stateStore
stateStore -->> leaderRpc: ok
deactivate stateStore
leaderRpc --> sched: ok
deactivate leaderRpc
end
%% end rect highlight for blocked eval
%% start rect highlight for planner
rect rgb(213, 246, 234)
Note over sched, planner: Note: scheduler snapshot may be stale state relative to leader so it's serialized by plan applier
sched ->> planner: Plan.Submit (Allocations + Deployment)
planner -->> planner: is plan still valid?
Note right of planner: Plan is valid
activate planner
planner ->> leaderRpc: Allocations.Upsert + Deployment.Upsert
activate leaderRpc
leaderRpc ->> stateStore: write Allocations and Deployment
activate stateStore
stateStore -->> leaderRpc: ok
deactivate stateStore
leaderRpc -->> planner: ok
deactivate leaderRpc
planner -->> sched: ok
deactivate planner
end
%% end rect highlight for planner
sched -->> sched: retry on failure; if max attempts exceeded, Eval.Nack
else
end
sched ->> broker: Eval.Ack (Eval.Nack if failed)
activate broker
broker ->> stateStore: complete Evaluation
activate stateStore
stateStore -->> broker: ok
deactivate stateStore
broker -->> sched: ok
deactivate broker
deactivate sched
```
## Deployment Watcher
As noted under Scheduling above, a Deployment is created for service jobs. A
deployment watcher runs on the leader. Its job is to watch the state of
Allocations being placed for a given job version and to emit new Evaluations so
that more Allocations for that job can be created.
The "deployments watcher" (plural) makes a blocking query for Deployments and
spins up a new "deployment watcher" (singular) for each one. That goroutine will
live until its Deployment is complete or failed.
The deployment watcher makes blocking queries on Allocation health and its own
Deployment (which can be canceled or paused by a user). When there's a change in
any of those states, it compares the current state against the [`update`][]
block and the timers it maintains for `min_healthy_time`, `healthy_deadline`,
and `progress_deadline`. It then updates the Deployment state and creates a new
Evaluation if the current step of the update is complete.
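The per-deployment goroutine can be pictured roughly as the loop below. The
channel-based interface is invented for the sketch (the real watcher in
`nomad/deploymentwatcher` uses blocking queries), and the health bookkeeping is
heavily simplified, but it shows the essential inputs (allocation health, user
actions on the Deployment, and deadlines) and outputs (advance via a new
Evaluation, or fail the Deployment).
```go
package deploymentwatcher

import "time"

// Toy types; the real watcher reacts to blocking queries on the state store
// rather than channels, and tracks min_healthy_time per allocation.
type allocHealth struct {
	healthy, unhealthy int
}

type deploymentState struct {
	paused, canceled bool
}

// watchDeployment is a sketch of the per-deployment goroutine: wait for
// changes to allocation health or the Deployment itself, compare against the
// update block's deadlines, then either advance, keep waiting, or fail.
func watchDeployment(
	healthCh <-chan allocHealth, // updates from queries on Allocation health
	stateCh <-chan deploymentState, // user actions: pause, cancel, ...
	progressDeadline time.Duration,
	advance func() error, // write Deployment update + new Evaluation via the leader
	fail func() error, // mark the Deployment failed
) error {
	deadline := time.NewTimer(progressDeadline)
	defer deadline.Stop()

	for {
		select {
		case st := <-stateCh:
			if st.canceled {
				return nil
			}
			if st.paused {
				continue // keep watching, but don't advance
			}
		case h := <-healthCh:
			if h.unhealthy > 0 {
				return fail()
			}
			if h.healthy > 0 {
				// Current step is healthy: emit a new Evaluation so the
				// scheduler can place the next batch of Allocations.
				if err := advance(); err != nil {
					return err
				}
				deadline.Reset(progressDeadline)
			}
		case <-deadline.C:
			return fail() // progress_deadline exceeded
		}
	}
}
```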
The diagram below shows deployments from a high level. Note that Deployments do
not themselves create Allocations -- they create Evaluations and then the
schedulers process those as they do normally.
```mermaid
sequenceDiagram
participant leaderRpc as Leader RPC
participant stateStore as State Store
participant dw as Deployment Watcher
dw ->> stateStore: blocking query for new Deployments
activate dw
stateStore -->> dw: new Deployment
dw -->> dw: start watcher
dw ->> stateStore: blocking query for Allocation health
stateStore -->> dw: Allocation health updates
dw ->> dw: next step?
Note right of dw: Update state and create evaluations for next batch...
Note right of dw: Or fail the Deployment and update state
dw ->> leaderRpc: Deployment.Upsert + Evaluation.Upsert
activate leaderRpc
leaderRpc ->> stateStore: write Deployment and Evaluations
activate stateStore
stateStore -->> leaderRpc: ok
deactivate stateStore
leaderRpc -->> dw: ok
deactivate leaderRpc
deactivate dw
```
## Client Allocs
Once the plan applier has persisted Allocations to the state store (with an
associated Node ID), they are ready to be placed on the Client. Clients
_pull_ new allocations (and changes to allocations), so a new allocation will be
in the `pending` state until it's been pulled down by a Client and the
allocation has been instantiated.
Once the Allocation is running and healthy, the Client will send a
`Node.UpdateAlloc` RPC back to the server so that info can be persisted in the
state store. This is the allocation health data the Deployment Watcher is
querying for above.
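A rough sketch of that pull loop, with an invented RPC interface in place of
the client's real msgpack RPC plumbing: the read is a long-poll that any server
can answer (possibly with slightly stale data), while the status write gets
forwarded to the leader.
```go
package client

// Toy RPC surface; the real client uses msgpack RPC with blocking-query
// options (MinQueryIndex, AllowStale) rather than this interface.
type Allocation struct {
	ID           string
	ClientStatus string
}

type serverRPC interface {
	// GetAllocs blocks until the node's allocations change past index.
	GetAllocs(nodeID string, index uint64) (allocs []*Allocation, newIndex uint64, err error)
	// UpdateAlloc reports alloc health/status; forwarded to the leader.
	UpdateAlloc(updates []*Allocation) error
}

// runAllocSync is a sketch of the pull loop: long-poll for changes, hand new
// or updated allocations to alloc runners, and push status back to the servers.
func runAllocSync(rpc serverRPC, nodeID string, runAlloc func(*Allocation)) error {
	var index uint64
	for {
		allocs, newIndex, err := rpc.GetAllocs(nodeID, index)
		if err != nil {
			return err
		}
		index = newIndex

		for _, alloc := range allocs {
			// Create or update the allocation runner for this alloc; it moves
			// from "pending" to "running" once its tasks are started.
			runAlloc(alloc)
		}

		// Report current client status so the Deployment Watcher (and users)
		// can see allocation health in the state store.
		if err := rpc.UpdateAlloc(allocs); err != nil {
			return err
		}
	}
}
```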
```mermaid
sequenceDiagram
participant followerRpc as Follower RPC
participant stateStore as State Store
participant client as Client
participant allocrunner as Allocation Runner
client ->> followerRpc: Alloc.GetAllocs RPC
activate client
Note right of client: this query can be stale
followerRpc ->> stateStore: query for Allocations
activate followerRpc
activate stateStore
stateStore -->> followerRpc: Allocations
deactivate stateStore
followerRpc -->> client: Allocations
deactivate followerRpc
client ->> allocrunner: Create or update allocation runners
client ->> followerRpc: Node.UpdateAlloc
Note right of followerRpc: will be forwarded to leader
deactivate client
```
[Scheduling Concepts]: https://nomadproject.io/docs/concepts/scheduling/scheduling
[`update`]: https://www.nomadproject.io/docs/job-specification/update


@@ -0,0 +1,201 @@
# Architecture: Evaluation Status
The [Scheduling in Nomad][] internals documentation covers the path that an
evaluation takes through the leader, worker, and plan applier. But it doesn't
cover in any detail the various `Evaluation.Status` values, or where the
`PreviousEval`, `NextEval`, or `BlockedEval` ID pointers are set.
The state diagram below describes the transitions between `Status` values as
solid arrows. The dashed arrows represent when a new evaluation is created. The
parenthetical labels on those arrows are the `TriggeredBy` field for the new
evaluation.
The status values are:
* `pending` evaluations either are queued to be scheduled, are still being
processed in the scheduler, or are being applied by the plan applier and not
yet acknowledged.
* `failed` evaluations have failed to be applied by the plan applier (or are
somehow invalid in the scheduler; this is always a bug).
* `blocked` evaluations are created when an eval has failed too many attempts to
have its plan applied by the leader, or when a plan can only be partially
applied and there are still more allocations to create.
* `complete` means the plan was applied successfully (at least partially).
* `canceled` means the evaluation was superseded by state changes like a new
version of the job.
```mermaid
flowchart LR
event((Cluster\nEvent))
pending([pending])
blocked([blocked])
complete([complete])
failed([failed])
canceled([canceled])
%% style classes
classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
classDef other fill:#d5f6ea,stroke:#1d9467
class event other;
class pending,blocked,complete,failed,canceled status;
event -. "job-register
job-deregister
periodic-job
node-update
node-drain
alloc-stop
scheduled
alloc-failure
job-scaling" .-> pending
pending -. "new eval\n(rolling-update)" .-> pending
pending -. "new eval\n(preemption)" .-> pending
pending -. "new eval\n(max-plan-attempts)" .-> blocked
pending -- if plan submitted --> complete
pending -- if invalid --> failed
pending -- if no-op --> canceled
failed -- if retried --> blocked
failed -- if retried --> complete
blocked -- if no-op --> canceled
blocked -- if plan submitted --> complete
complete -. "new eval\n(deployment-watcher)" .-> pending
complete -. "new eval\n(queued-allocs)" .-> blocked
failed -. "new eval\n(failed-follow-up)" .-> pending
```
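On a live cluster you can read the `Status` and eval-chain pointer fields
directly with the Go API client; a minimal example (eval ID hard-coded for
illustration, error handling reduced to `log.Fatal`) looks roughly like this:
```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Replace with a real evaluation ID, e.g. from `nomad eval list`.
	eval, _, err := client.Evaluations().Info("9ecffbba-73be-d909-5d7e-ac2694c10e0c", nil)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println("status:       ", eval.Status)
	fmt.Println("triggered by: ", eval.TriggeredBy)
	fmt.Println("previous eval:", eval.PreviousEval)
	fmt.Println("next eval:    ", eval.NextEval)
	fmt.Println("blocked eval: ", eval.BlockedEval)
}
```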
But it's hard to get a full picture of the evaluation lifecycle purely from the
`Status` fields, because evaluations have several "quasi-statuses" which aren't
represented as explicit statuses in the state store:
* `scheduling` is the status where an eval is being processed by the scheduler
worker.
* `applying` is the status where the resulting plan for the eval is being
applied in the plan applier on the leader.
* `delayed` is an enqueued eval that will be dequeued some time in the future.
* `deleted` is when an eval is removed from the state store entirely.
By adding these statuses to the diagram (the dashed nodes), you can see where
the same `Status` transition might result in different `PreviousEval`,
`NextEval`, or `BlockedEval` values being set. You can also see where the "chain" of
evaluations is broken when new evals are created for preemptions or by the
deployment watcher.
```mermaid
flowchart LR
event((Cluster\nEvent))
%% statuses
pending([pending])
blocked([blocked])
complete([complete])
failed([failed])
canceled([canceled])
%% quasi-statuses
deleted([deleted])
delayed([delayed])
scheduling([scheduling])
applying([applying])
%% style classes
classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
classDef quasistatus fill:#d5f6ea,stroke-dasharray: 5 5,stroke:#1d9467
classDef other fill:#d5f6ea,stroke:#1d9467
class event other;
class pending,blocked,complete,failed,canceled status;
class deleted,delayed,scheduling,applying quasistatus;
event -- "job-register
job-deregister
periodic-job
node-update
node-drain
alloc-stop
scheduled
alloc-failure
job-scaling" --> pending
pending -- dequeued --> scheduling
pending -- if delayed --> delayed
delayed -- dequeued --> scheduling
scheduling -. "not all allocs placed
new eval created by scheduler
trigger: queued-allocs
new has .PreviousEval = old.ID
old has .BlockedEval = new.ID" .-> blocked
scheduling -. "failed to plan
new eval created by scheduler
trigger: max-plan-attempts
new has .PreviousEval = old.ID
old has .BlockedEval = new.ID" .-> blocked
scheduling -- "not all allocs placed
reuse already-blocked eval" --> blocked
blocked -- "unblocked by
external state changes" --> scheduling
scheduling -- allocs placed --> complete
scheduling -- "wrong eval type or
max retries exceeded
on plan submit" --> failed
scheduling -- "canceled by
job update/stop" --> canceled
failed -- retry --> scheduling
scheduling -. "new eval from rolling update (system jobs)
created by scheduler
trigger: rolling-update
new has .PreviousEval = old.ID
old has .NextEval = new.ID" .-> pending
scheduling -- submit --> applying
applying -- failed --> scheduling
applying -. "new eval for preempted allocs
created by plan applier
trigger: preemption
new has .PreviousEval = unset!
old has .BlockedEval = unset!" .-> pending
complete -. "new eval from deployments (service jobs)
created by deploymentwatcher
trigger: deployment-watcher
new has .PreviousEval = unset!
old has .NextEval = unset!" .-> pending
failed -- "new eval
trigger: failed-follow-up
new has .PreviousEval = old.ID
old has .NextEval = new.ID" --> pending
pending -- "undeliverable evals
reaped by leader" --> failed
blocked -- "duplicate blocked evals
reaped by leader" --> canceled
canceled -- garbage\ncollection --> deleted
failed -- garbage\ncollection --> deleted
complete -- garbage\ncollection --> deleted
```
[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling


@@ -0,0 +1,259 @@
# Architecture: Evaluation Triggers
The [Scheduling in Nomad][] internals documentation covers the path that an
evaluation takes through the leader, worker, and plan applier. This document
describes what events within the cluster cause Evaluations to be created.
Evaluations have a `TriggeredBy` field which is always one of the values defined
in [`structs.go`][]:
```go
const (
EvalTriggerJobRegister = "job-register"
EvalTriggerJobDeregister = "job-deregister"
EvalTriggerPeriodicJob = "periodic-job"
EvalTriggerNodeDrain = "node-drain"
EvalTriggerNodeUpdate = "node-update"
EvalTriggerAllocStop = "alloc-stop"
EvalTriggerScheduled = "scheduled"
EvalTriggerRollingUpdate = "rolling-update"
EvalTriggerDeploymentWatcher = "deployment-watcher"
EvalTriggerFailedFollowUp = "failed-follow-up"
EvalTriggerMaxPlans = "max-plan-attempts"
EvalTriggerRetryFailedAlloc = "alloc-failure"
EvalTriggerQueuedAllocs = "queued-allocs"
EvalTriggerPreemption = "preemption"
EvalTriggerScaling = "job-scaling"
EvalTriggerMaxDisconnectTimeout = "max-disconnect-timeout"
EvalTriggerReconnect = "reconnect"
)
```
The list below covers each trigger and the events that cause it. (A short
example of tallying Evaluations by trigger via the API follows the list.)
* **job-register**: Creating or updating a Job will result in 1 Evaluation
created, plus any follow-up Evaluations associated with scheduling, planning,
or deployments.
* **job-deregister**: Stopping a Job will result in 1 Evaluation created, plus
any follow-up Evaluations associated with scheduling, planning, or
deployments.
* **periodic-job**: A periodic job that hits its timer and dispatches a child
job will result in 1 Evaluation created, plus any additional Evaluations
associated with scheduling or planning.
* **node-drain**: Draining a node will create 1 Evaluation for each Job on the
node that's draining, plus any additional Evaluations associated with
scheduling or planning.
* **node-update**: When the fingerprint of a client node has changed or the node
has changed state (from up to down), Nomad creates 1 Evaluation for each Job
running on the Node, plus 1 Evaluation for each system job that has
`datacenters` that include the datacenter for that Node.
* **alloc-stop**: When the API that serves the `nomad alloc stop` command is
hit, Nomad creates 1 Evaluation.
* **scheduled**: Nomad's internal housekeeping will periodically create
Evaluations for garbage collection.
* **rolling-update**: When a `system` job is updated, the [`update`][] block's
`stagger` field controls how many Allocations will be scheduled at a time. The
scheduler will create 1 Evaluation to follow up for the next set.
* **deployment-watcher**: When a `service` job is updated, the [`update`][]
block controls how many Allocations will be scheduled at a time. The
deployment watcher runs on the leader and monitors Allocation health. It will
create 1 Evaluation when the Deployment has reached the next step.
* **failed-follow-up**: Evaluations that hit a delivery limit and will not be
retried by the eval broker are marked as failed. The leader periodically
reaps failed Evaluations and creates 1 new Evaluation for these, with a delay.
* **max-plan-attempts**: The scheduler will retry Evaluations that are rejected
by the plan applier with a new cluster state snapshot. If the scheduler
exceeds the maximum number of retries, it will create 1 new Evaluation in the
`blocked` state.
* **alloc-failure**: If an Allocation fails and exceeds its maximum
[`restart` attempts][], Nomad creates 1 new Evaluation.
* **queued-allocs**: When a scheduler processes an Evaluation, it may not be
able to place all Allocations. It will create 1 new Evaluation in the
`blocked` state to be processed later when node updates arrive.
* **preemption**: When Allocations are preempted, the plan applier creates 1
Evaluation for each Job that has been preempted.
* **job-scaling**: Scaling a Job will result in 1 Evaluation created, plus any
follow-up Evaluations associated with scheduling, planning, or deployments.
* **max-disconnect-timeout**: When an Allocation is in the `unknown` state for
longer than the [`max_client_disconnect`][] window, the scheduler will create
1 Evaluation.
* **reconnect**: When a Node in the `disconnected` state reconnects, Nomad will
create 1 Evaluation per job with an allocation on the reconnected Node.
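A quick way to see which of these triggers are generating Evaluations on a
cluster is to tally the `TriggeredBy` field over the evals still in the state
store (older ones may already have been garbage collected). A rough sketch with
the Go API client:
```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List all evaluations currently in the state store.
	evals, _, err := client.Evaluations().List(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Count evaluations per trigger to see what is generating load.
	counts := map[string]int{}
	for _, eval := range evals {
		counts[eval.TriggeredBy]++
	}
	for trigger, n := range counts {
		fmt.Printf("%-25s %d\n", trigger, n)
	}
}
```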
## Follow-up Evaluations
Almost any Evaluation processed by the scheduler can result in additional
Evaluations being created, whether because the scheduler needs to follow up on
failed scheduling or because the resulting plan changes the state of the
cluster. This can result in a large number of Evaluations when the cluster is in
an unstable state with frequent changes.
Consider the following example where a node running 1 system job and 2 service
jobs misses its heartbeat and is marked lost. The Evaluation for the system job
is successfully planned. One of the service jobs no longer meets constraints. The
other service job is successfully scheduled but the resulting plan is rejected
because the scheduler has fallen behind in raft replication. A total of 6
Evaluations are created.
```mermaid
flowchart TD
event((Node\nmisses\nheartbeat))
system([system\nnode-update])
service1([service 1\nnode-update])
service2([service 2\nnode-update])
blocked([service 1\nblocked\nqueued-allocs])
failed([service 2\nfailed\nmax-plan-attempts])
followup([service 2\nfailed-follow-up])
%% style classes
classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
classDef other fill:#d5f6ea,stroke:#1d9467
class event other;
class system,service1,service2,blocked,failed,followup eval;
event --> system
event --> service1
event --> service2
service1 --> blocked
service2 --> failed
failed --> followup
```
Next, consider this example where a `service` job has been updated. The task
group has `count = 3` and the following `update` block:
```hcl
update {
max_parallel = 1
canary = 1
}
```
After each Evaluation is processed, the Deployment Watcher will be waiting to
receive information on updated Allocation health. Then it will emit a new
Evaluation for the next step. A total of 4 Evaluations are created.
```mermaid
flowchart TD
registerEvent((Job\nRegister))
alloc1health((Canary\nHealthy))
alloc2health((Alloc 2\nHealthy))
alloc3health((Alloc 3\nHealthy))
register([job-register])
dwPostCanary([deployment-watcher])
dwPostAlloc2([deployment-watcher])
dwPostAlloc3([deployment-watcher])
%% style classes
classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
classDef other fill:#d5f6ea,stroke:#1d9467
class registerEvent,alloc1health,alloc2health,alloc3health other
class register,dwPostCanary,dwPostAlloc2,dwPostAlloc3 eval
registerEvent --> register
register --> wait1
alloc1health --> wait1
wait1 --> dwPostCanary
dwPostCanary --> wait2
alloc2health --> wait2
wait2 --> dwPostAlloc2
dwPostAlloc2 --> wait3
alloc3health --> wait3
wait3 --> dwPostAlloc3
```
Lastly, consider this example where 2 nodes, each running 5 Allocations for
system jobs, are "flapping": missing heartbeats and then re-registering, or
frequently changing fingerprints. The diagram shows the results of each node
going down once and then coming back up.
```mermaid
flowchart TD
%% style classes
classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
classDef other fill:#d5f6ea,stroke:#1d9467
eventAdown((Node A\nmisses\nheartbeat))
eventAup((Node A\nheartbeats))
eventBdown((Node B\nmisses\nheartbeat))
eventBup((Node B\nheartbeats))
eventAdown --> eventAup
eventBdown --> eventBup
A01down([job 1 node A\nnode-update])
A02down([job 2 node A\nnode-update])
A03down([job 3 node A\nnode-update])
A04down([job 4 node A\nnode-update])
A05down([job 5 node A\nnode-update])
B01down([job 1 node B\nnode-update])
B02down([job 2 node B\nnode-update])
B03down([job 3 node B\nnode-update])
B04down([job 4 node B\nnode-update])
B05down([job 5 node B\nnode-update])
A01up([job 1 node A\nnode-update])
A02up([job 2 node A\nnode-update])
A03up([job 3 node A\nnode-update])
A04up([job 4 node A\nnode-update])
A05up([job 5 node A\nnode-update])
B01up([job 1 node B\nnode-update])
B02up([job 2 node B\nnode-update])
B03up([job 3 node B\nnode-update])
B04up([job 4 node B\nnode-update])
B05up([job 5 node B\nnode-update])
eventAdown:::other --> A01down:::eval
eventAdown:::other --> A02down:::eval
eventAdown:::other --> A03down:::eval
eventAdown:::other --> A04down:::eval
eventAdown:::other --> A05down:::eval
eventAup:::other --> A01up:::eval
eventAup:::other --> A02up:::eval
eventAup:::other --> A03up:::eval
eventAup:::other --> A04up:::eval
eventAup:::other --> A05up:::eval
eventBdown:::other --> B01down:::eval
eventBdown:::other --> B02down:::eval
eventBdown:::other --> B03down:::eval
eventBdown:::other --> B04down:::eval
eventBdown:::other --> B05down:::eval
eventBup:::other --> B01up:::eval
eventBup:::other --> B02up:::eval
eventBup:::other --> B03up:::eval
eventBup:::other --> B04up:::eval
eventBup:::other --> B05up:::eval
```
You can extrapolate this example to large clusters: if 100 nodes, each running
10 system jobs and 40 service jobs, go down once and come back up, then
100 * 40 * 2 == 8000 Evaluations are created for the service jobs, which will
result in rescheduling service allocations onto new nodes. For the system jobs,
100 * 10 * 2 == 2000 Evaluations are created, all of them no-op Evaluations
that still need to be replicated to all raft peers, canceled by the scheduler,
and eventually garbage collected.
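Spelling out that arithmetic (the counts come from the example above, and
assume each node cycles down and up exactly once):
```go
package main

import "fmt"

func main() {
	const (
		nodes       = 100
		serviceJobs = 40 // service jobs running on each node
		systemJobs  = 10 // system jobs running on each node
		events      = 2  // each node misses heartbeats once and re-registers once
	)

	// One node-update Evaluation per affected job per node event.
	fmt.Println("service-job evals:", nodes*serviceJobs*events) // 8000
	fmt.Println("system-job evals: ", nodes*systemJobs*events)  // 2000
}
```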
[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling
[`structs.go`]: https://github.com/hashicorp/nomad/blob/v1.4.0-beta.1/nomad/structs/structs.go#L10857-L10875
[`update`]: https://www.nomadproject.io/docs/job-specification/update
[`restart` attempts]: https://www.nomadproject.io/docs/job-specification/restart
[`max_client_disconnect`]: https://www.nomadproject.io/docs/job-specification/group#max-client-disconnect


@@ -0,0 +1,151 @@
# Architecture: Nomad State Store
Nomad server state is kept in an in-memory state store backed by raft. All
writes to state are serialized into MessagePack and written as raft log
entries. The raft logs are replicated from the leader to the followers. Once a
quorum of followers has persisted the log entry and applied it to the in-memory
state ("FSM"), the leader considers the write committed.
This architecture has a few implications:
* The `fsm.Apply` functions must be deterministic over their inputs for a given
state. You can never generate random IDs or assign wall-clock timestamps in
the state store. These values must be provided as parameters from the RPC
handler.
```go
// Incorrect: generating a timestamp in the state store is not deterministic.
func (s *StateStore) UpsertObject(...) {
	// ...
	obj.CreateTime = time.Now()
	// ...
}

// Correct: non-deterministic values should be passed as inputs.
func (s *StateStore) UpsertObject(..., timestamp time.Time) {
	// ...
	obj.CreateTime = timestamp
	// ...
}
```
* Every object you read from the state store must be copied before it can be
mutated, because mutating the object modifies it outside the raft
workflow. The result can be servers having inconsistent state, transactions
breaking, or even server panics.
```go
// Incorrect: job is mutated without copying.
job, err := state.JobByID(ws, namespace, id)
job.Status = structs.JobStatusRunning

// Correct: only the job copy is mutated.
job, err := state.JobByID(ws, namespace, id)
updateJob := job.Copy()
updateJob.Status = structs.JobStatusRunning
```
Adding new objects to the state store should be done as part of adding new RPC
endpoints. See the [RPC Endpoint Checklist][].
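The write path in the diagram below ultimately funnels into `fsm.Apply`, which
dispatches on a message type byte and applies the decoded request to the
in-memory state. The sketch below is loosely modeled on that dispatch, with
invented constants and with the FSM and decoding heavily condensed; it is not
the real `nomad/fsm.go`.
```go
package state

import "fmt"

// MessageType mirrors structs.MessageType: the first byte of every raft log
// entry identifies which apply function handles it. The constants here are an
// illustrative subset, not the real values.
type MessageType uint8

const (
	JobRegisterRequestType MessageType = iota
	EvalUpdateRequestType
)

// fsm is a stand-in for nomadFSM; the real one wraps the state store plus
// msgpack decoding of each request type.
type fsm struct{}

// Apply is a sketch of the dispatch in nomad/fsm.go: read the message type
// byte, decode the rest of the entry, and apply it to the in-memory state.
// Anything not derivable from the entry itself (timestamps, generated IDs)
// must already be inside the entry so every server applies identical state.
func (f *fsm) Apply(index uint64, data []byte) (interface{}, error) {
	if len(data) == 0 {
		return nil, fmt.Errorf("empty raft log entry")
	}
	msgType := MessageType(data[0])
	payload := data[1:] // msgpack-encoded request struct in the real FSM

	switch msgType {
	case JobRegisterRequestType:
		return nil, f.applyJobRegister(index, payload)
	case EvalUpdateRequestType:
		return nil, f.applyEvalUpdate(index, payload)
	default:
		return nil, fmt.Errorf("unknown message type %d", msgType)
	}
}

func (f *fsm) applyJobRegister(index uint64, payload []byte) error {
	// Decode payload into a register request, then upsert the Job (and its
	// Evaluation) into the in-memory tables at this raft index.
	return nil
}

func (f *fsm) applyEvalUpdate(index uint64, payload []byte) error { return nil }
```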
```mermaid
flowchart TD
%% entities
ext(("API\nclient"))
any("Any node
(client or server)")
follower(Follower)
rpcLeader("RPC handler (on leader)")
writes("writes go thru raft
raftApply(MessageType, entry) in nomad/rpc.go
structs.MessageType in nomad/structs/structs.go
go generate ./... for nomad/msgtypes.go")
click writes href "https://github.com/hashicorp/nomad/tree/main/nomad" _blank
reads("reads go directly to state store
Typical state_store.go funcs to implement:
state.GetMyThingByID
state.GetMyThingByPrefix
state.ListMyThing
state.UpsertMyThing
state.DeleteMyThing")
click writes href "https://github.com/hashicorp/nomad/tree/main/nomad/state" _blank
raft("hashicorp/raft")
bolt("boltdb")
fsm("Application-specific
Finite State Machine (FSM)
(aka State Store)")
click writes href "https://github.com/hashicorp/nomad/tree/main/nomad/fsm.go" _blank
memdb("hashicorp/go-memdb")
%% style classes
classDef leader fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
classDef other fill:#d5f6ea,stroke:#1d9467
class any,follower other;
class rpcLeader,raft,bolt,fsm,memdb leader;
%% flows
ext -- HTTP request --> any
any -- "RPC request
to connected server
(follower or leader)" --> follower
follower -- "(1) srv.Forward (to leader)" --> rpcLeader
raft -- "(3) replicate to a
quorum of followers
wait on their fsm.Apply" --> follower
rpcLeader --> reads
reads --> memdb
rpcLeader --> writes
writes -- "(2)" --> raft
raft -- "(4) write log to disk" --> bolt
raft -- "(5) fsm.Apply
nomad/fsm.go" --> fsm
fsm -- "(6) txn.Insert" --> memdb
bolt <-- "Snapshot Persist: nomad/fsm.go
Snapshot Restore: nomad/fsm.go" --> memdb
%% notes
note1("Typical structs to implement
for RPC handlers:
structs.MyThing
.Diff()
.Copy()
.Merge()
structs.MyThingUpsertRequest
structs.MyThingUpsertResponse
structs.MyThingGetRequest
structs.MyThingGetResponse
structs.MyThingListRequest
structs.MyThingListResponse
structs.MyThingDeleteRequest
structs.MyThingDeleteResponse
Don't forget to register your new RPC
in nomad/server.go!")
note1 -.- rpcLeader
```
[RPC Endpoint Checklist]: https://github.com/hashicorp/nomad/blob/main/contributing/checklist-rpc-endpoint.md


@@ -69,6 +69,15 @@ either the _desired state_ (jobs) or _actual state_ (clients) changes, Nomad
creates a new evaluation to determine if any actions must be taken. An
evaluation may result in changes to allocations if necessary.
#### Deployment
Deployments are the mechanism by which Nomad rolls out changes to cluster state
in a step-by-step fashion. Deployments are only available for Jobs with the type
`service`. When an Evaluation is processed, the scheduler creates only the
number of Allocations permitted by the [`update`][] block and the current state
of the cluster. The Deployment is used to monitor the health of those
Allocations and emit a new Evaluation for the next step of the update.
#### Server
Nomad servers are the brains of the cluster. There is a cluster of servers per
@@ -161,3 +170,6 @@ are more details available for each of the sub-systems. The [consensus protocol]
are all documented in more detail.
For other details, either consult the code, ask in IRC or reach out to the mailing list.
[`update`]: /docs/job-specification/update