From 2a6e8be6ba55874841f261c6d949c14253e53d62 Mon Sep 17 00:00:00 2001
From: Tim Gross
Date: Mon, 3 Oct 2022 14:06:41 -0400
Subject: [PATCH] internals documentation with diagrams (#14750)

This changeset adds new architecture internals documents to the contributing
guide. These are intentionally here and not on the public-facing website
because the material is not required for operators and includes a lot of
diagrams that we can cheaply maintain with mermaid syntax but would involve art
assets to have up on the main site that would become quickly out of date as
code changes happen and be extremely expensive to maintain. However, these
should be suitable to use as points of conversation with expert end users.

Included:

* A description of Evaluation triggers and expected counts, with examples.
* A description of Evaluation states and implicit states. This is taken from an
  internal document in our team wiki.
* A description of how writing the State Store works. This is taken from a
  diagram I put together a few months ago for internal education purposes.
* A description of Evaluation lifecycle, from registration to running
  Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but
  broken into digestible chunks and without multi-region deployments, which I'd
  like to cover in a future doc.

Also includes adding Deployments to our public-facing glossary.

Co-authored-by: Luiz Aoqui
Co-authored-by: Michael Schurter
Co-authored-by: Seth Hoenig
---
 contributing/architecture-eval-lifecycle.md | 342 ++++++++++++++++++
 contributing/architecture-eval-states.md    | 201 ++++++++++
 contributing/architecture-eval-triggers.md  | 259 +++++++++++++
 contributing/architecture-state-store.md    | 151 ++++++++
 .../content/docs/concepts/architecture.mdx  |  12 +
 5 files changed, 965 insertions(+)
 create mode 100644 contributing/architecture-eval-lifecycle.md
 create mode 100644 contributing/architecture-eval-states.md
 create mode 100644 contributing/architecture-eval-triggers.md
 create mode 100644 contributing/architecture-state-store.md

diff --git a/contributing/architecture-eval-lifecycle.md b/contributing/architecture-eval-lifecycle.md
new file mode 100644
index 000000000..9156f12e8
--- /dev/null
+++ b/contributing/architecture-eval-lifecycle.md
@@ -0,0 +1,342 @@
+# Architecture: Evaluations into Allocations

The [Scheduling Concepts][] docs provide an overview of how the scheduling
process works. This document is intended to go a bit deeper than those docs and
walk through the lifecycle of an Evaluation from job registration to running
Allocations on the clients. The process can be broken into 4 parts:

* [Job Registration](#job-registration)
* [Scheduling](#scheduling)
* [Deployment Watcher](#deployment-watcher)
* [Client Allocs](#client-allocs)

Note that in all the diagrams below, writing to the State Store is considered
atomic. This means the Nomad leader has committed the raft log entry to a
quorum of followers and applied it to its own FSM. So long as `raftApply`
returns without an error, we have a guarantee that all servers will be able to
retrieve the entry from their state store at some point in the future.

## Job Registration

Creating or updating a Job is a _synchronous_ API operation. By the time the
response has returned to the API consumer, the Job and an Evaluation for that
Job have been written to the state store of the server nodes.

* Note: parameterized or periodic batch jobs don't create an Evaluation at
  registration, only once dispatched.
* Note: The scheduler is very fast! This means that once the CLI gets a
  response it can immediately start making queries to get information about the
  next steps (see the sketch after these notes).
* Note: This workflow is different for Multi-Region Deployments (found in Nomad
  Enterprise). That will be documented separately.
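
To see the synchronous behavior from a consumer's point of view, here is a
minimal sketch using the official Go API client
(`github.com/hashicorp/nomad/api`). The job definition is illustrative (it
assumes a reachable local agent and the Docker driver); the point is that
`Register` returns an Evaluation ID that can be queried immediately:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	// Connect to the local agent (DefaultConfig honors NOMAD_ADDR, etc.).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// A minimal, illustrative service job: one group, one Docker task.
	job := api.NewServiceJob("example", "example", "global", 50)
	task := api.NewTask("server", "docker")
	task.SetConfig("image", "nginx:alpine")
	tg := api.NewTaskGroup("web", 1)
	tg.AddTask(task)
	job.AddTaskGroup(tg)

	// Registration is synchronous: by the time Register returns, the Job and
	// its Evaluation are in the state store, and the response carries the
	// Evaluation ID.
	resp, _, err := client.Jobs().Register(job, nil)
	if err != nil {
		log.Fatal(err)
	}

	// The Evaluation can be queried right away, even though the scheduler may
	// not have processed it yet (or may already be done with it).
	eval, _, err := client.Evaluations().Info(resp.EvalID, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("eval %s status=%s triggered-by=%s\n", eval.ID, eval.Status, eval.TriggeredBy)
}
```

Whether that first query sees a `pending` or an already-`complete` Evaluation
depends only on how quickly the scheduler got to it.
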
The diagram below shows this initial synchronous phase. Note here that there's
no scheduling happening yet, so no Allocation (or Deployment) has been created.

```mermaid
sequenceDiagram

    participant user as User
    participant cli as CLI
    participant httpAPI as HTTP API
    participant leaderRpc as Leader RPC
    participant stateStore as State Store

    user ->> cli: nomad job run
    activate cli
    cli ->> httpAPI: Create Job API
    httpAPI ->> leaderRpc: Job.Register

    activate leaderRpc

    leaderRpc ->> stateStore: write Job
    activate stateStore
    stateStore -->> leaderRpc: ok
    deactivate stateStore

    leaderRpc ->> stateStore: write Evaluation

    activate stateStore
    stateStore -->> leaderRpc: ok

    Note right of stateStore: EvalBroker.Enqueue
    Note right of stateStore: (see Scheduling below)

    deactivate stateStore

    leaderRpc -->> httpAPI: Job.Register response
    deactivate leaderRpc

    httpAPI -->> cli: Create Job response
    deactivate cli

```

## Scheduling

A long-lived goroutine on the Nomad leader called the Eval Broker maintains a
queue of Evaluations previously written to the state store and enqueued via the
`EvalBroker.Enqueue` method. (When a leader transition occurs, the new leader
queries all the Evaluations in the state store and enqueues them in its new
Eval Broker.)

Scheduler workers are long-lived goroutines running on all server
nodes. Typically this will be one per core on followers and 1/4 that number on
the leader. The workers poll for Evaluations from the Eval Broker with the
`Eval.Dequeue` RPC. Once a worker has an evaluation, it instantiates a scheduler
for that evaluation (of type `service`, `system`, `sysbatch`, `batch`, or
`core`).

Because a worker is running one scheduler at a time, Nomad's documentation often
refers to "workers" and "schedulers" interchangeably, but the worker is the
long-lived goroutine and the scheduler is the struct that contains the code and
state around processing a single evaluation. The scheduler mutates itself and is
thrown away once the evaluation is processed.

The scheduler takes a snapshot of that server node's state store so that it has
a consistent view of the cluster state. The scheduler executes 3 main steps
(sketched in simplified form after this list):

* Reconcile: compare the cluster state and job specification to determine what
  changes need to be made -- starting or stopping Allocations. The scheduler
  creates new Allocations in this step. But note that _these Allocations are not
  yet persisted to the state store_.
* Feasibility Check: For each Allocation it needs, the scheduler iterates over
  Nodes until it finds up to 2 Nodes that match the Allocation's resource
  requirements and constraints.
  * Note: for system jobs or jobs with `spread` blocks, the scheduler has to
    check all Nodes.
* Scoring: The scheduler ranks the feasible Nodes and picks the one with the
  highest score.
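
The sketch below is a deliberately simplified, self-contained illustration of
those three steps. It is not Nomad's actual scheduler code (which lives in the
`scheduler/` package); the `Node` and `Alloc` types, the resource fields, and
the limit of 2 candidate Nodes are stand-ins chosen to mirror the description
above.

```go
package main

import "fmt"

// Illustrative stand-ins for the scheduler's view of cluster state.
type Node struct {
	ID        string
	FreeMHz   int
	FreeMemMB int
}

type Alloc struct {
	Name     string
	CPUMHz   int
	MemoryMB int
}

// reconcile (step 1): compare the desired count against existing Allocations
// and return the placements that are still needed.
func reconcile(desired int, existing []Alloc, name string, cpu, mem int) []Alloc {
	var needed []Alloc
	for i := len(existing); i < desired; i++ {
		needed = append(needed, Alloc{Name: fmt.Sprintf("%s[%d]", name, i), CPUMHz: cpu, MemoryMB: mem})
	}
	return needed
}

// feasible (step 2): iterate over Nodes until up to `limit` candidates satisfy
// the resource ask. System jobs and spread jobs would check every Node instead.
func feasible(nodes []Node, a Alloc, limit int) []Node {
	var out []Node
	for _, n := range nodes {
		if n.FreeMHz >= a.CPUMHz && n.FreeMemMB >= a.MemoryMB {
			out = append(out, n)
			if len(out) == limit {
				break
			}
		}
	}
	return out
}

// score (step 3): rank the feasible candidates and pick the best. Real scoring
// uses bin-packing, affinities, and anti-affinity; free memory stands in here.
func score(candidates []Node) (Node, bool) {
	if len(candidates) == 0 {
		return Node{}, false
	}
	best := candidates[0]
	for _, n := range candidates[1:] {
		if n.FreeMemMB > best.FreeMemMB {
			best = n
		}
	}
	return best, true
}

func main() {
	nodes := []Node{{"node-a", 4000, 2048}, {"node-b", 2000, 8192}, {"node-c", 8000, 512}}
	for _, alloc := range reconcile(2, nil, "web", 500, 256) {
		if node, ok := score(feasible(nodes, alloc, 2)); ok {
			fmt.Printf("place %s on %s\n", alloc.Name, node.ID)
		} else {
			fmt.Printf("%s: no feasible node (the eval would become blocked)\n", alloc.Name)
		}
	}
}
```

A real scheduler also has to account for the placements it has already made in
the same plan; this sketch places each Alloc independently.
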
If the job is a `service` job, the scheduler will also create (or update) a
Deployment. When the scheduler determines the number of Allocations to create,
it examines the job's [`update`][] block. Only the number of Allocations needed
to complete the next phase of the update will be created in a given
Evaluation. The Deployment is used by the deployment watcher (see below) to
monitor the health of allocations and create new Evaluations to continue the
update.

If the scheduler cannot place all Allocations, it will create a new Evaluation
in the `blocked` state and submit it to the leader. The Eval Broker will
re-enqueue that Evaluation once cluster state has changed. (This process is the
first green box in the sequence diagram below.)

Once the scheduler has completed processing the Evaluation, if there are
Allocations (and possibly a Deployment) to update, it will submit this work as a
Plan to the leader. The leader needs to validate this plan and serialize it:

* The scheduler took a snapshot of cluster state at the start of its work, so
  that state may have changed in the meantime.
* Schedulers run concurrently across the cluster, so they may generate
  conflicting plans (particularly on heavily-packed clusters).

The leader processes the plan in the plan applier. If the plan is valid, the
plan applier will write the Allocations (and Deployment) update to the state
store. If not, it will reject the plan and the scheduler will try to create a
new plan with a refreshed state. If the scheduler fails to submit a valid plan
too many times, it submits a new Evaluation in the `blocked` state with the
`max-plan-attempts` trigger. (The plan submit process is the second green box in
the sequence diagram below.)

Once the scheduler has a response from the leader, it will tell the Eval Broker
to Ack the Evaluation (if it successfully submitted the plan) or Nack the
Evaluation (if it failed to do so) so that another scheduler can try processing
it.

The diagram below shows the scheduling phase, including submitting plans to the
planner. Note that at the end of this phase, Allocations (and Deployment) have
been persisted to the state store.

```mermaid
sequenceDiagram

    participant leaderRpc as Leader RPC
    participant stateStore as State Store
    participant planner as Planner
    participant broker as Eval Broker
    participant sched as Scheduler

    leaderRpc ->> stateStore: write Evaluation (see above)
    stateStore ->> broker: EvalBroker.Enqueue
    activate broker
    broker -->> broker: enqueue eval
    broker -->> stateStore: ok
    deactivate broker

    sched ->> broker: Eval.Dequeue (blocks until work available)
    activate sched
    broker -->> sched: Evaluation

    sched ->> stateStore: query to get job and cluster state
    stateStore -->> sched: results

    sched -->> sched: Reconcile (how many allocs are needed?)
    deactivate sched

    alt
    Note right of sched: Needs new allocs
    activate sched
    sched -->> sched: create Allocations
    sched -->> sched: create Deployment (service jobs)
    Note left of sched: for Deployments, only 1 "batch" of Allocs will get created for each Eval
    sched -->> sched: Feasibility Check
    Note left of sched: iterate over Nodes to find a placement for each Alloc
    sched -->> sched: Ranking
    Note left of sched: pick best option for each Alloc

    %% start rect highlight for blocked eval
    rect rgb(213, 246, 234)
    Note left of sched: Not enough room!
(But we can submit a partial plan) + sched ->> leaderRpc: Eval.Upsert (blocked) + activate leaderRpc + leaderRpc ->> stateStore: write Evaluation + activate stateStore + stateStore -->> leaderRpc: ok + deactivate stateStore + leaderRpc --> sched: ok + deactivate leaderRpc + end + %% end rect highlight for blocked eval + + + %% start rect highlight for planner + rect rgb(213, 246, 234) + Note over sched, planner: Note: scheduler snapshot may be stale state relative to leader so it's serialized by plan applier + + sched ->> planner: Plan.Submit (Allocations + Deployment) + planner -->> planner: is plan still valid? + + Note right of planner: Plan is valid + activate planner + + planner ->> leaderRpc: Allocations.Upsert + Deployment.Upsert + activate leaderRpc + leaderRpc ->> stateStore: write Allocations and Deployment + activate stateStore + stateStore -->> leaderRpc: ok + deactivate stateStore + leaderRpc -->> planner: ok + deactivate leaderRpc + planner -->> sched: ok + deactivate planner + + end + %% end rect highlight for planner + + sched -->> sched: retry on failure, if exceed max attempts will Eval.Nack + + else + + end + + sched ->> broker: Eval.Ack (Eval.Nack if failed) + activate broker + broker ->> stateStore: complete Evaluation + activate stateStore + stateStore -->> broker: ok + deactivate stateStore + broker -->> sched: ok + deactivate broker + deactivate sched +``` + +## Deployment Watcher + +As noted under Scheduling above, a Deployment is created for service jobs. A +deployment watcher runs on the leader. Its job is to watch the state of +Allocations being placed for a given job version and to emit new Evaluations so +that more Allocations for that job can be created. + +The "deployments watcher" (plural) makes a blocking query for Deployments and +spins up a new "deployment watcher" (singular) for each one. That goroutine will +live until its Deployment is complete or failed. + +The deployment watcher makes blocking queries on Allocation health and its own +Deployment (which can be canceled or paused by a user). When there's a change in +any of those states, it compares the current state against the [`update`][] +block and the timers it maintains for `min_healthy_time`, `healthy_deadline`, +and `progress_deadline`. It then updates the Deployment state and creates a new +Evaluation if the current step of the update is complete. + +The diagram below shows deployments from a high level. Note that Deployments do +not themselves create Allocations -- they create Evaluations and then the +schedulers process those as they do normally. + +```mermaid +sequenceDiagram + + participant leaderRpc as Leader RPC + participant stateStore as State Store + participant dw as Deployment Watcher + + dw ->> stateStore: blocking query for new Deployments + activate dw + stateStore -->> dw: new Deployment + dw -->> dw: start watcher + + dw ->> stateStore: blocking query for Allocation health + stateStore -->> dw: Allocation health updates + dw ->> dw: next step? + + Note right of dw: Update state and create evaluations for next batch... 
+ Note right of dw: Or fail the Deployment and update state + + dw ->> leaderRpc: Deployment.Upsert + Evaluation.Upsert + activate leaderRpc + leaderRpc ->> stateStore: write Deployment and Evaluations + activate stateStore + stateStore -->> leaderRpc: ok + deactivate stateStore + leaderRpc -->> dw: ok + deactivate leaderRpc + deactivate dw + +``` + +## Client Allocs + +Once the plan applier has persisted Allocations to the state store (with an +associated Node ID), they become available to get placed on the client. Clients +_pull_ new allocations (and changes to allocations), so a new allocation will be +in the `pending` state until it's been pulled down by a Client and the +allocation has been instantiated. + +Once the Allocation is running and healthy, the Client will send a +`Node.UpdateAlloc` RPC back to the server so that info can be persisted in the +state store. This is the allocation health data the Deployment Watcher is +querying for above. + +```mermaid +sequenceDiagram + + participant followerRpc as Follower RPC + participant stateStore as State Store + + participant client as Client + participant allocrunner as Allocation Runner + + client ->> followerRpc: Alloc.GetAllocs RPC + activate client + + Note right of client: this query can be stale + + followerRpc ->> stateStore: query for Allocations + activate followerRpc + activate stateStore + stateStore -->> followerRpc: Allocations + deactivate stateStore + followerRpc -->> client: Allocations + deactivate followerRpc + + client ->> allocrunner: Create or update allocation runners + + client ->> followerRpc: Node.UpdateAlloc + Note right of followerRpc: will be forwarded to leader + + deactivate client +``` + + +[Scheduling Concepts]: https://nomadproject.io/docs/concepts/scheduling/scheduling +[`update`]: https://www.nomadproject.io/docs/job-specification/update diff --git a/contributing/architecture-eval-states.md b/contributing/architecture-eval-states.md new file mode 100644 index 000000000..fdaac339d --- /dev/null +++ b/contributing/architecture-eval-states.md @@ -0,0 +1,201 @@ +# Architecture: Evaluation Status + +The [Scheduling in Nomad][] internals documentation covers the path that an +evaluation takes through the leader, worker, and plan applier. But it doesn't +cover in any detail the various `Evaluation.Status` values, or where the +`PreviousEval`, `NextEval`, or `BlockedEval` ID pointers are set. + +The state diagram below describes the transitions between `Status` values as +solid arrows. The dashed arrows represent when a new evaluation is created. The +parenthetical labels on those arrows are the `TriggeredBy` field for the new +evaluation. + +The status values are: + +* `pending` evaluations either are queued to be scheduled, are still being + processed in the scheduler, or are being applied by the plan applier and not + yet acknowledged. +* `failed` evaluations have failed to be applied by the plan applier (or are + somehow invalid in the scheduler; this is always a bug) +* `blocked` evaluations are created when an eval has failed too many attempts to + have its plan applied by the leader, or when a plan can only be partially + applied and there are still more allocations to create. +* `complete` means the plan was applied successfully (at least partially). +* `canceled` means the evaluation was superseded by state changes like a new + version of the job. 
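
For reference, these status values correspond to the `EvalStatus*` constants in
`nomad/structs/structs.go`; in recent Nomad versions they look roughly like
this:

```go
const (
	EvalStatusBlocked   = "blocked"
	EvalStatusPending   = "pending"
	EvalStatusComplete  = "complete"
	EvalStatusFailed    = "failed"
	EvalStatusCancelled = "canceled"
)
```
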
+ + +```mermaid +flowchart LR + + event((Cluster\nEvent)) + + pending([pending]) + blocked([blocked]) + complete([complete]) + failed([failed]) + canceled([canceled]) + + %% style classes + classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class event other; + class pending,blocked,complete,failed,canceled status; + + event -. "job-register + job-deregister + periodic-job + node-update + node-drain + alloc-stop + scheduled + alloc-failure + job-scaling" .-> pending + + pending -. "new eval\n(rolling-update)" .-> pending + pending -. "new eval\n(preemption)" .-> pending + + pending -. "new eval\n(max-plan-attempts)" .-> blocked + pending -- if plan submitted --> complete + pending -- if invalid --> failed + pending -- if no-op --> canceled + + failed -- if retried --> blocked + failed -- if retried --> complete + + blocked -- if no-op --> canceled + blocked -- if plan submitted --> complete + + complete -. "new eval\n(deployment-watcher)" .-> pending + complete -. "new eval\n(queued-allocs)" .-> blocked + + failed -. "new eval\n(failed-follow-up)" .-> pending +``` + +But it's hard to get a full picture of the evaluation lifecycle purely from the +`Status` fields, because evaluations have several "quasi-statuses" which aren't +represented as explicit statuses in the state store: + +* `scheduling` is the status where an eval is being processed by the scheduler + worker. +* `applying` is the status where the resulting plan for the eval is being + applied in the plan applier on the leader. +* `delayed` is an enqueued eval that will be dequeued some time in the future. +* `deleted` is when an eval is removed from the state store entirely. + +By adding these statuses to the diagram (the dashed nodes), you can see where +the same `Status` transition might result in different `PreviousEval`, +`NextEval`, or `BlockedEval` set. You can also see where the "chain" of +evaluations is broken when new evals are created for preemptions or by the +deployment watcher. + + +```mermaid +flowchart LR + + event((Cluster\nEvent)) + + %% statuss + pending([pending]) + blocked([blocked]) + complete([complete]) + failed([failed]) + canceled([canceled]) + + %% quasi-statuss + deleted([deleted]) + delayed([delayed]) + scheduling([scheduling]) + applying([applying]) + + %% style classes + classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef quasistatus fill:#d5f6ea,stroke-dasharray: 5 5,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + + class event other; + class pending,blocked,complete,failed,canceled status; + class deleted,delayed,scheduling,applying quasistatus; + + event -- "job-register + job-deregister + periodic-job + node-update + node-drain + alloc-stop + scheduled + alloc-failure + job-scaling" --> pending + + pending -- dequeued --> scheduling + pending -- if delayed --> delayed + delayed -- dequeued --> scheduling + + scheduling -. "not all allocs placed + new eval created by scheduler + trigger queued-allocs + new has .PreviousEval = old.ID + old has .BlockedEval = new.ID" .-> blocked + + scheduling -. 
"failed to plan + new eval created by scheduler + trigger: max-plan-attempts + new has .PreviousEval = old.ID + old has .BlockedEval = new.ID" .-> blocked + + scheduling -- "not all allocs placed + reuse already-blocked eval" --> blocked + + blocked -- "unblocked by + external state changes" --> scheduling + + scheduling -- allocs placed --> complete + + scheduling -- "wrong eval type or + max retries exceeded + on plan submit" --> failed + + scheduling -- "canceled by + job update/stop" --> canceled + + failed -- retry --> scheduling + + scheduling -. "new eval from rolling update (system jobs) + created by scheduler + trigger: rolling-update + new has .PreviousEval = old.ID + old has .NextEval = new.ID" .-> pending + + scheduling -- submit --> applying + applying -- failed --> scheduling + + applying -. "new eval for preempted allocs + created by plan applier + trigger: preemption + new has .PreviousEval = unset! + old has .BlockedEval = unset!" .-> pending + + complete -. "new eval from deployments (service jobs) + created by deploymentwatcher + trigger: deployment-watcher + new has .PreviousEval = unset! + old has .NextEval = unset!" .-> pending + + failed -- "new eval + trigger: failed-follow-up + new has .PreviousEval = old.ID + old has .NextEval = new.ID" --> pending + + pending -- "undeliverable evals + reaped by leader" --> failed + + blocked -- "duplicate blocked evals + reaped by leader" --> canceled + + canceled -- garbage\ncollection --> deleted + failed -- garbage\ncollection --> deleted + complete -- garbage\ncollection --> deleted +``` + + +[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling diff --git a/contributing/architecture-eval-triggers.md b/contributing/architecture-eval-triggers.md new file mode 100644 index 000000000..e8d7d7e69 --- /dev/null +++ b/contributing/architecture-eval-triggers.md @@ -0,0 +1,259 @@ +# Architecture: Evaluation Triggers + +The [Scheduling in Nomad][] internals documentation covers the path that an +evaluation takes through the leader, worker, and plan applier. This document +describes what events within the cluster cause Evaluations to be created. + +Evaluations have a `TriggeredBy` field which is always one of the values defined +in [`structs.go`][]: + +``` +const ( + EvalTriggerJobRegister = "job-register" + EvalTriggerJobDeregister = "job-deregister" + EvalTriggerPeriodicJob = "periodic-job" + EvalTriggerNodeDrain = "node-drain" + EvalTriggerNodeUpdate = "node-update" + EvalTriggerAllocStop = "alloc-stop" + EvalTriggerScheduled = "scheduled" + EvalTriggerRollingUpdate = "rolling-update" + EvalTriggerDeploymentWatcher = "deployment-watcher" + EvalTriggerFailedFollowUp = "failed-follow-up" + EvalTriggerMaxPlans = "max-plan-attempts" + EvalTriggerRetryFailedAlloc = "alloc-failure" + EvalTriggerQueuedAllocs = "queued-allocs" + EvalTriggerPreemption = "preemption" + EvalTriggerScaling = "job-scaling" + EvalTriggerMaxDisconnectTimeout = "max-disconnect-timeout" + EvalTriggerReconnect = "reconnect" +) +``` + +The list below covers each trigger and what can trigger it. + +* **job-register**: Creating or updating a Job will result in 1 Evaluation + created, plus any follow-up Evaluations associated with scheduling, planning, + or deployments. +* **job-deregister**: Stopping a Job will result in 1 Evaluation created, plus + any follow-up Evaluations associated with scheduling, planning, or + deployments. 
* **periodic-job**: A periodic job that hits its timer and dispatches a child
  job will result in 1 Evaluation created, plus any additional Evaluations
  associated with scheduling or planning.
* **node-drain**: Draining a node will create 1 Evaluation for each Job on the
  node that's draining, plus any additional Evaluations associated with
  scheduling or planning.
* **node-update**: When the fingerprint of a client node has changed or the node
  has changed state (from up to down), Nomad creates 1 Evaluation for each Job
  running on the Node, plus 1 Evaluation for each system job that has
  `datacenters` that include the datacenter for that Node.
* **alloc-stop**: When the API that serves the `nomad alloc stop` command is
  hit, Nomad creates 1 Evaluation.
* **scheduled**: Nomad's internal housekeeping will periodically create
  Evaluations for garbage collection.
* **rolling-update**: When a `system` job is updated, the [`update`][] block's
  `stagger` field controls how many Allocations will be scheduled at a time. The
  scheduler will create 1 Evaluation to follow up for the next set.
* **deployment-watcher**: When a `service` job is updated, the [`update`][]
  block controls how many Allocations will be scheduled at a time. The
  deployment watcher runs on the leader and monitors Allocation health. It will
  create 1 Evaluation when the Deployment has reached the next step.
* **failed-follow-up**: Evaluations that hit a delivery limit and will not be
  retried by the eval broker are marked as failed. The leader periodically
  reaps failed Evaluations and creates 1 new Evaluation for these, with a delay.
* **max-plan-attempts**: The scheduler will retry Evaluations that are rejected
  by the plan applier with a new cluster state snapshot. If the scheduler
  exceeds the maximum number of retries, it will create 1 new Evaluation in the
  `blocked` state.
* **alloc-failure**: If an Allocation fails and exceeds its maximum
  [`restart` attempts][], Nomad creates 1 new Evaluation.
* **queued-allocs**: When a scheduler processes an Evaluation, it may not be
  able to place all Allocations. It will create 1 new Evaluation in the
  `blocked` state to be processed later when node updates arrive.
* **preemption**: When Allocations are preempted, the plan applier creates 1
  Evaluation for each Job that has been preempted.
* **job-scaling**: Scaling a Job will result in 1 Evaluation created, plus any
  follow-up Evaluations associated with scheduling, planning, or deployments.
* **max-disconnect-timeout**: When an Allocation is in the `unknown` state for
  longer than the [`max_client_disconnect`][] window, the scheduler will create
  1 Evaluation.
* **reconnect**: When a Node in the `disconnected` state reconnects, Nomad will
  create 1 Evaluation per job with an allocation on the reconnected Node.

## Follow-up Evaluations

Almost any Evaluation processed by the scheduler can result in additional
Evaluations being created, whether because the scheduler needs to follow up
after failed scheduling or because the resulting plan changes the state of the
cluster. This can result in a large number of Evaluations when the cluster is in
an unstable state with frequent changes.

Consider the following example where a node running 1 system job and 2 service
jobs misses its heartbeat and is marked lost. The Evaluation for the system job
is successfully planned. One of the service jobs no longer meets constraints.
The +other service job is successfully scheduled but the resulting plan is rejected +because the scheduler has fallen behind in raft replication. A total of 6 +Evaluations are created. + +```mermaid +flowchart TD + + event((Node\nmisses\nheartbeat)) + + system([system\nnode-update]) + service1([service 1\nnode-update]) + service2([service 2\nnode-update]) + + blocked([service 1\nblocked\nqueued-allocs]) + failed([service 2\nfailed\nmax-plan-attempts]) + followup([service 2\nfailed-follow-up]) + + %% style classes + classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class event other; + class system,service1,service2,blocked,failed,followup eval; + + event --> system + event --> service1 + event --> service2 + + service1 --> blocked + + service2 --> failed + failed --> followup +``` + +Next, consider this example where a `service` job has been updated. The task +group has `count = 3` and the following `update` block: + +```hcl +update { + max_parallel = 1 + canary = 1 +} +``` + +After each Evaluation is processed, the Deployment Watcher will be waiting to +receive information on updated Allocation health. Then it will emit a new +Evaluation for the next step. A total of 4 Evaluations are created. + +```mermaid +flowchart TD + + registerEvent((Job\nRegister)) + alloc1health((Canary\nHealthy)) + alloc2health((Alloc 2\nHealthy)) + alloc3health((Alloc 3\nHealthy)) + + register([job-register]) + dwPostCanary([deployment-watcher]) + dwPostAlloc2([deployment-watcher]) + dwPostAlloc3([deployment-watcher]) + + %% style classes + classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class registerEvent,alloc1health,alloc2health,alloc3health other + class register,dwPostCanary,dwPostAlloc2,dwPostAlloc3 eval + + registerEvent --> register + register --> wait1 + alloc1health --> wait1 + wait1 --> dwPostCanary + + dwPostCanary --> wait2 + alloc2health --> wait2 + wait2 --> dwPostAlloc2 + + dwPostAlloc2 --> wait3 + alloc3health --> wait3 + wait3 --> dwPostAlloc3 + +``` + +Lastly, consider this example where 2 nodes each running 5 Allocations that are +all for system jobs are "flapping" by missing heartbeats and then +re-registering, or frequently changing fingerprints. This diagram will show the +results from each node going down once and then coming back up. 
```mermaid
flowchart TD

    %% style classes
    classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
    classDef other fill:#d5f6ea,stroke:#1d9467

    eventAdown((Node A\nmisses\nheartbeat))
    eventAup((Node A\nheartbeats))
    eventBdown((Node B\nmisses\nheartbeat))
    eventBup((Node B\nheartbeats))

    eventAdown --> eventAup
    eventBdown --> eventBup

    A01down([job 1 node A\nnode-update])
    A02down([job 2 node A\nnode-update])
    A03down([job 3 node A\nnode-update])
    A04down([job 4 node A\nnode-update])
    A05down([job 5 node A\nnode-update])

    B01down([job 1 node B\nnode-update])
    B02down([job 2 node B\nnode-update])
    B03down([job 3 node B\nnode-update])
    B04down([job 4 node B\nnode-update])
    B05down([job 5 node B\nnode-update])

    A01up([job 1 node A\nnode-update])
    A02up([job 2 node A\nnode-update])
    A03up([job 3 node A\nnode-update])
    A04up([job 4 node A\nnode-update])
    A05up([job 5 node A\nnode-update])

    B01up([job 1 node B\nnode-update])
    B02up([job 2 node B\nnode-update])
    B03up([job 3 node B\nnode-update])
    B04up([job 4 node B\nnode-update])
    B05up([job 5 node B\nnode-update])

    eventAdown:::other --> A01down:::eval
    eventAdown:::other --> A02down:::eval
    eventAdown:::other --> A03down:::eval
    eventAdown:::other --> A04down:::eval
    eventAdown:::other --> A05down:::eval

    eventAup:::other --> A01up:::eval
    eventAup:::other --> A02up:::eval
    eventAup:::other --> A03up:::eval
    eventAup:::other --> A04up:::eval
    eventAup:::other --> A05up:::eval

    eventBdown:::other --> B01down:::eval
    eventBdown:::other --> B02down:::eval
    eventBdown:::other --> B03down:::eval
    eventBdown:::other --> B04down:::eval
    eventBdown:::other --> B05down:::eval

    eventBup:::other --> B01up:::eval
    eventBup:::other --> B02up:::eval
    eventBup:::other --> B03up:::eval
    eventBup:::other --> B04up:::eval
    eventBup:::other --> B05up:::eval

```

You can extrapolate this example to large clusters: 100 nodes each running 10
system jobs and 40 service jobs that go down once and come back up will result
in 100 * 40 * 2 == 8000 Evaluations created for the service jobs, which will
result in rescheduling of service allocations to new nodes. For the system jobs,
2000 Evaluations will be created and all of these will be no-op Evaluations that
will still need to be replicated to all raft peers, canceled by the scheduler,
and eventually need to be garbage collected.



[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling
[`structs.go`]: https://github.com/hashicorp/nomad/blob/v1.4.0-beta.1/nomad/structs/structs.go#L10857-L10875
[`update`]: https://www.nomadproject.io/docs/job-specification/update
[`restart` attempts]: https://www.nomadproject.io/docs/job-specification/restart
[`max_client_disconnect`]: https://www.nomadproject.io/docs/job-specification/group#max-client-disconnect
diff --git a/contributing/architecture-state-store.md b/contributing/architecture-state-store.md
new file mode 100644
index 000000000..96a8b39e4
--- /dev/null
+++ b/contributing/architecture-state-store.md
@@ -0,0 +1,151 @@
+# Architecture: Nomad State Store

Nomad server state is an in-memory state store backed by raft. All writes to
state are serialized with MessagePack and written as raft logs. The raft logs
are replicated from the leader to the followers. Once a quorum of servers has
persisted the log entry, the leader considers the write committed, and each
server then applies the entry to its in-memory state ("FSM").
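
To make that write path concrete, here is a simplified sketch of a server-side
RPC handler performing a write. It is not a verbatim copy of Nomad's code; the
`MyThing` names are placeholders (matching the naming used in the diagram
below), but the `forward`/`raftApply` pattern mirrors the real handlers and
`nomad/rpc.go`:

```go
// Simplified sketch of an RPC handler write path; MyThing is a placeholder.
func (m *MyThing) Upsert(args *structs.MyThingUpsertRequest, reply *structs.MyThingUpsertResponse) error {
	// Writes always run on the leader, so forward if this server is a follower.
	if done, err := m.srv.forward("MyThing.Upsert", args, args, reply); done {
		return err
	}

	// raftApply serializes the request as a raft log entry tagged with a
	// structs.MessageType, waits for it to be committed, and applies it to the
	// leader's FSM (nomad/fsm.go dispatches to the state store).
	_, index, err := m.srv.raftApply(structs.MyThingUpsertRequestType, args)
	if err != nil {
		return err
	}

	// Report the raft index at which the write became visible.
	reply.Index = index
	return nil
}
```
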
This architecture has a few implications:

* The `fsm.Apply` functions must be deterministic over their inputs for a given
  state. You can never generate random IDs or assign wall-clock timestamps in
  the state store. These values must be provided as parameters from the RPC
  handler.

  ```go
  // Incorrect: generating a timestamp in the state store is not deterministic.
  func (s *StateStore) UpsertObject(...) {
      // ...
      obj.CreateTime = time.Now()
      // ...
  }

  // Correct: non-deterministic values should be passed as inputs:
  func (s *StateStore) UpsertObject(..., timestamp time.Time) {
      // ...
      obj.CreateTime = timestamp
      // ...
  }
  ```

* Every object you read from the state store must be copied before it can be
  mutated, because mutating the object modifies it outside the raft
  workflow. The result can be servers having inconsistent state, transactions
  breaking, or even server panics.

  ```go
  // Incorrect: job is mutated without copying.
  job, err := state.JobByID(ws, namespace, id)
  job.Status = structs.JobStatusRunning

  // Correct: only the job copy is mutated.
  job, err := state.JobByID(ws, namespace, id)
  updateJob := job.Copy()
  updateJob.Status = structs.JobStatusRunning
  ```

Adding new objects to the state store should be done as part of adding new RPC
endpoints. See the [RPC Endpoint Checklist][].

```mermaid
flowchart TD

    %% entities

    ext(("API\nclient"))
    any("Any node
    (client or server)")
    follower(Follower)

    rpcLeader("RPC handler (on leader)")

    writes("writes go through raft
    raftApply(MessageType, entry) in nomad/rpc.go
    structs.MessageType in nomad/structs/structs.go
    go generate ./... for nomad/msgtypes.go")
    click writes href "https://github.com/hashicorp/nomad/tree/main/nomad" _blank

    reads("reads go directly to state store
    Typical state_store.go funcs to implement:

    state.GetMyThingByID
    state.GetMyThingByPrefix
    state.ListMyThing
    state.UpsertMyThing
    state.DeleteMyThing")
    click reads href "https://github.com/hashicorp/nomad/tree/main/nomad/state" _blank

    raft("hashicorp/raft")

    bolt("boltdb")

    fsm("Application-specific
    Finite State Machine (FSM)
    (aka State Store)")
    click fsm href "https://github.com/hashicorp/nomad/tree/main/nomad/fsm.go" _blank

    memdb("hashicorp/go-memdb")

    %% style classes
    classDef leader fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
    classDef other fill:#d5f6ea,stroke:#1d9467
    class any,follower other;
    class rpcLeader,raft,bolt,fsm,memdb leader;

    %% flows

    ext -- HTTP request --> any

    any -- "RPC request
    to connected server
    (follower or leader)" --> follower

    follower -- "(1) srv.Forward (to leader)" --> rpcLeader

    raft -- "(3) replicate to a
    quorum of followers
    wait on their fsm.Apply" --> follower

    rpcLeader --> reads
    reads --> memdb

    rpcLeader --> writes
    writes -- "(2)" --> raft

    raft -- "(4) write log to disk" --> bolt
    raft -- "(5) fsm.Apply
    nomad/fsm.go" --> fsm

    fsm -- "(6) txn.Insert" --> memdb

    bolt <-- "Snapshot Persist: nomad/fsm.go
    Snapshot Restore: nomad/fsm.go" --> memdb


    %% notes

    note1("Typical structs to implement
    for RPC handlers:

    structs.MyThing
    .Diff()
    .Copy()
    .Merge()
    structs.MyThingUpsertRequest
    structs.MyThingUpsertResponse
    structs.MyThingGetRequest
    structs.MyThingGetResponse
    structs.MyThingListRequest
    structs.MyThingListResponse
    structs.MyThingDeleteRequest
    structs.MyThingDeleteResponse

    Don't forget to register your new RPC
    in
nomad/server.go!") + + note1 -.- rpcLeader +``` + + +[RPC Endpoint Checklist]: https://github.com/hashicorp/nomad/blob/main/contributing/checklist-rpc-endpoint.md diff --git a/website/content/docs/concepts/architecture.mdx b/website/content/docs/concepts/architecture.mdx index 96c3babef..bfc5d7216 100644 --- a/website/content/docs/concepts/architecture.mdx +++ b/website/content/docs/concepts/architecture.mdx @@ -69,6 +69,15 @@ either the _desired state_ (jobs) or _actual state_ (clients) changes, Nomad creates a new evaluation to determine if any actions must be taken. An evaluation may result in changes to allocations if necessary. +#### Deployment + +Deployments are the mechanism by which Nomad rolls out changes to cluster state +in a step-by-step fashion. Deployments are only available for Jobs with the type +`service`. When an Evaluation is processed, the scheduler creates only the +number of Allocations permitted by the [`update`][] block and the current state +of the cluster. The Deployment is used to monitor the health of those +Allocations and emit a new Evaluation for the next step of the update. + #### Server Nomad servers are the brains of the cluster. There is a cluster of servers per @@ -161,3 +170,6 @@ are more details available for each of the sub-systems. The [consensus protocol] are all documented in more detail. For other details, either consult the code, ask in IRC or reach out to the mailing list. + + +[`update`]: /docs/job-specification/update