diff --git a/contributing/architecture-eval-lifecycle.md b/contributing/architecture-eval-lifecycle.md
new file mode 100644
index 000000000..9156f12e8
--- /dev/null
+++ b/contributing/architecture-eval-lifecycle.md
@@ -0,0 +1,342 @@
+# Architecture: Evaluations into Allocations
+
+The [Scheduling Concepts][] docs provide an overview of how the scheduling
+process works. This document is intended to go a bit deeper than those docs and
+walk through the lifecycle of an Evaluation from job registration to running
+Allocations on the clients. The process can be broken into 4 parts:
+
+* [Job Registration](#job-registration)
+* [Scheduling](#scheduling)
+* [Deployment Watcher](#deployment-watcher)
+* [Client Allocs](#client-allocs)
+
+Note that in all the diagrams below, writing to the State Store is considered
+atomic. This means the Nomad leader has replicated the entry to a quorum of
+followers and applied it to its own FSM; the remaining followers apply the raft
+log entry to their local FSM shortly afterward. So long as `raftApply` returns
+without an error, we have a guarantee that all servers will be able to retrieve
+the entry from their state store at some point in the future.
+
+## Job Registration
+
+Creating or updating a Job is a _synchronous_ API operation. By the time the
+response has returned to the API consumer, the Job and an Evaluation for that
+Job have been written to the state store of the server nodes.
+
+* Note: parameterized or periodic batch jobs don't create an Evaluation at
+  registration, only when dispatched.
+* Note: The scheduler is very fast! This means that once the CLI gets a
+  response it can immediately start making queries to get information
+  about the next steps.
+* Note: This workflow is different for Multi-Region Deployments (found in Nomad
+  Enterprise). That will be documented separately.
+
+The diagram below shows this initial synchronous phase. Note here that there's
+no scheduling happening yet, so no Allocation (or Deployment) has been created.
+
+```mermaid
+sequenceDiagram
+
+    participant user as User
+    participant cli as CLI
+    participant httpAPI as HTTP API
+    participant leaderRpc as Leader RPC
+    participant stateStore as State Store
+
+    user ->> cli: nomad job run
+    activate cli
+    cli ->> httpAPI: Create Job API
+    httpAPI ->> leaderRpc: Job.Register
+
+    activate leaderRpc
+
+    leaderRpc ->> stateStore: write Job
+    activate stateStore
+    stateStore -->> leaderRpc: ok
+    deactivate stateStore
+
+    leaderRpc ->> stateStore: write Evaluation
+
+    activate stateStore
+    stateStore -->> leaderRpc: ok
+
+    Note right of stateStore: EvalBroker.Enqueue
+    Note right of stateStore: (see Scheduling below)
+
+    deactivate stateStore
+
+    leaderRpc -->> httpAPI: Job.Register response
+    deactivate leaderRpc
+
+    httpAPI -->> cli: Create Job response
+    deactivate cli
+
+```
+
+## Scheduling
+
+A long-lived goroutine on the Nomad leader called the Eval Broker maintains a
+queue of Evaluations previously written to the state store and enqueued via the
+`EvalBroker.Enqueue` method. (When a leader transition occurs, the new leader
+queries all the Evaluations in the state store and enqueues them in its new
+Eval Broker.)
+
+Scheduler workers are long-lived goroutines running on all server
+nodes. Typically this will be one per core on followers and 1/4 that number on
+the leader. The workers poll for Evaluations from the Eval Broker with the
+`Eval.Dequeue` RPC. Once a worker has an evaluation, it instantiates a
+scheduler for that evaluation (of type `service`, `system`, `sysbatch`,
+`batch`, or `core`).
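+
+As a rough sketch (this is illustrative pseudocode, not the actual
+implementation in `nomad/worker.go`, and the type and helper names below are
+invented for clarity), the worker's main loop looks something like this:
+
+```go
+package worker
+
+// Evaluation and Scheduler stand in for the real definitions in the nomad
+// and scheduler packages; they are illustrative only.
+type Evaluation struct{ ID, Type string }
+
+type Scheduler interface{ Process(*Evaluation) error }
+
+type Worker struct {
+	// dequeue wraps the blocking Eval.Dequeue RPC.
+	dequeue func() (eval *Evaluation, token string, err error)
+	// newSched builds a scheduler of the matching type: service, system,
+	// sysbatch, batch, or core.
+	newSched func(evalType string) Scheduler
+	// ack and nack wrap the Eval.Ack and Eval.Nack RPCs.
+	ack  func(evalID, token string)
+	nack func(evalID, token string)
+}
+
+// run sketches the loop: dequeue an Evaluation, hand it to a scheduler of the
+// matching type, then Ack on success or Nack so another worker can retry it.
+func (w *Worker) run() {
+	for {
+		eval, token, err := w.dequeue()
+		if err != nil {
+			continue
+		}
+		if err := w.newSched(eval.Type).Process(eval); err != nil {
+			w.nack(eval.ID, token)
+			continue
+		}
+		w.ack(eval.ID, token)
+	}
+}
+```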
+
+Because a worker is running one scheduler at a time, Nomad's documentation
+often refers to "workers" and "schedulers" interchangeably, but the worker is
+the long-lived goroutine and the scheduler is the struct that contains the code
+and state around processing a single evaluation. The scheduler mutates itself
+and is thrown away once the evaluation is processed.
+
+The scheduler takes a snapshot of that server node's state store so that it has
+a consistent point-in-time view of the cluster state. The scheduler executes 3
+main steps:
+
+* Reconcile: compare the cluster state and job specification to determine what
+  changes need to be made -- starting or stopping Allocations. The scheduler
+  creates new Allocations in this step. But note that _these Allocations are
+  not yet persisted to the state store_.
+* Feasibility Check: For each Allocation it needs, the scheduler iterates over
+  Nodes until it finds up to 2 Nodes that match the Allocation's resource
+  requirements and constraints.
+  * Note: for system jobs or jobs with `spread` blocks, the scheduler has to
+    check all Nodes.
+* Scoring: The scheduler ranks the feasible Nodes and picks the one with the
+  highest score.
+
+If the job is a `service` job, the scheduler will also create (or update) a
+Deployment. When the scheduler determines the number of Allocations to create,
+it examines the job's [`update`][] block. Only the number of Allocations needed
+to complete the next phase of the update will be created in a given
+Evaluation. The Deployment is used by the deployment watcher (see below) to
+monitor the health of allocations and create new Evaluations to continue the
+update.
+
+If the scheduler cannot place all Allocations, it will create a new Evaluation
+in the `blocked` state and submit it to the leader. The Eval Broker will
+re-enqueue that Evaluation once cluster state has changed. (This process is the
+first green box in the sequence diagram below.)
+
+Once the scheduler has completed processing the Evaluation, if there are
+Allocations (and possibly a Deployment) to update, it will submit this work as
+a Plan to the leader. The leader needs to validate the plan and serialize it
+with other plans, because:
+
+* The scheduler took a snapshot of cluster state at the start of its work, so
+  that state may have changed in the meantime.
+* Schedulers run concurrently across the cluster, so they may generate
+  conflicting plans (particularly on heavily-packed clusters).
+
+The leader processes the plan in the plan applier. If the plan is valid, the
+plan applier will write the Allocations (and Deployment) update to the state
+store. If not, it will reject the plan and the scheduler will try to create a
+new plan with a refreshed state snapshot. If the scheduler fails to submit a
+valid plan too many times, it submits a new Evaluation in the `blocked` state
+with the `max-plan-attempts` trigger. (The plan submit process is the second
+green box in the sequence diagram below.)
+
+Once the scheduler has a response from the leader, it will tell the Eval Broker
+to Ack the Evaluation (if it successfully submitted the plan) or Nack the
+Evaluation (if it failed to do so) so that another scheduler can try processing
+it.
+
+The diagram below shows the scheduling phase, including submitting plans to the
+planner. Note that at the end of this phase, Allocations (and Deployment) have
+been persisted to the state store.
+ +```mermaid +sequenceDiagram + + participant leaderRpc as Leader RPC + participant stateStore as State Store + participant planner as Planner + participant broker as Eval Broker + participant sched as Scheduler + + leaderRpc ->> stateStore: write Evaluation (see above) + stateStore ->> broker: EvalBroker.Enqueue + activate broker + broker -->> broker: enqueue eval + broker -->> stateStore: ok + deactivate broker + + sched ->> broker: Eval.Dequeue (blocks until work available) + activate sched + broker -->> sched: Evaluation + + sched ->> stateStore: query to get job and cluster state + stateStore -->> sched: results + + sched -->> sched: Reconcile (how many allocs are needed?) + deactivate sched + + alt + Note right of sched: Needs new allocs + activate sched + sched -->> sched: create Allocations + sched -->> sched: create Deployment (service jobs) + Note left of sched: for Deployments, only 1 "batch" of Allocs will get created for each Eval + sched -->> sched: Feasibility Check + Note left of sched: iterate over Nodes to find a placement for each Alloc + sched -->> sched: Ranking + Note left of sched: pick best option for each Alloc + + %% start rect highlight for blocked eval + rect rgb(213, 246, 234) + Note left of sched: Not enough room! (But we can submit a partial plan) + sched ->> leaderRpc: Eval.Upsert (blocked) + activate leaderRpc + leaderRpc ->> stateStore: write Evaluation + activate stateStore + stateStore -->> leaderRpc: ok + deactivate stateStore + leaderRpc --> sched: ok + deactivate leaderRpc + end + %% end rect highlight for blocked eval + + + %% start rect highlight for planner + rect rgb(213, 246, 234) + Note over sched, planner: Note: scheduler snapshot may be stale state relative to leader so it's serialized by plan applier + + sched ->> planner: Plan.Submit (Allocations + Deployment) + planner -->> planner: is plan still valid? + + Note right of planner: Plan is valid + activate planner + + planner ->> leaderRpc: Allocations.Upsert + Deployment.Upsert + activate leaderRpc + leaderRpc ->> stateStore: write Allocations and Deployment + activate stateStore + stateStore -->> leaderRpc: ok + deactivate stateStore + leaderRpc -->> planner: ok + deactivate leaderRpc + planner -->> sched: ok + deactivate planner + + end + %% end rect highlight for planner + + sched -->> sched: retry on failure, if exceed max attempts will Eval.Nack + + else + + end + + sched ->> broker: Eval.Ack (Eval.Nack if failed) + activate broker + broker ->> stateStore: complete Evaluation + activate stateStore + stateStore -->> broker: ok + deactivate stateStore + broker -->> sched: ok + deactivate broker + deactivate sched +``` + +## Deployment Watcher + +As noted under Scheduling above, a Deployment is created for service jobs. A +deployment watcher runs on the leader. Its job is to watch the state of +Allocations being placed for a given job version and to emit new Evaluations so +that more Allocations for that job can be created. + +The "deployments watcher" (plural) makes a blocking query for Deployments and +spins up a new "deployment watcher" (singular) for each one. That goroutine will +live until its Deployment is complete or failed. + +The deployment watcher makes blocking queries on Allocation health and its own +Deployment (which can be canceled or paused by a user). When there's a change in +any of those states, it compares the current state against the [`update`][] +block and the timers it maintains for `min_healthy_time`, `healthy_deadline`, +and `progress_deadline`. 
It then updates the Deployment state and creates a new +Evaluation if the current step of the update is complete. + +The diagram below shows deployments from a high level. Note that Deployments do +not themselves create Allocations -- they create Evaluations and then the +schedulers process those as they do normally. + +```mermaid +sequenceDiagram + + participant leaderRpc as Leader RPC + participant stateStore as State Store + participant dw as Deployment Watcher + + dw ->> stateStore: blocking query for new Deployments + activate dw + stateStore -->> dw: new Deployment + dw -->> dw: start watcher + + dw ->> stateStore: blocking query for Allocation health + stateStore -->> dw: Allocation health updates + dw ->> dw: next step? + + Note right of dw: Update state and create evaluations for next batch... + Note right of dw: Or fail the Deployment and update state + + dw ->> leaderRpc: Deployment.Upsert + Evaluation.Upsert + activate leaderRpc + leaderRpc ->> stateStore: write Deployment and Evaluations + activate stateStore + stateStore -->> leaderRpc: ok + deactivate stateStore + leaderRpc -->> dw: ok + deactivate leaderRpc + deactivate dw + +``` + +## Client Allocs + +Once the plan applier has persisted Allocations to the state store (with an +associated Node ID), they become available to get placed on the client. Clients +_pull_ new allocations (and changes to allocations), so a new allocation will be +in the `pending` state until it's been pulled down by a Client and the +allocation has been instantiated. + +Once the Allocation is running and healthy, the Client will send a +`Node.UpdateAlloc` RPC back to the server so that info can be persisted in the +state store. This is the allocation health data the Deployment Watcher is +querying for above. + +```mermaid +sequenceDiagram + + participant followerRpc as Follower RPC + participant stateStore as State Store + + participant client as Client + participant allocrunner as Allocation Runner + + client ->> followerRpc: Alloc.GetAllocs RPC + activate client + + Note right of client: this query can be stale + + followerRpc ->> stateStore: query for Allocations + activate followerRpc + activate stateStore + stateStore -->> followerRpc: Allocations + deactivate stateStore + followerRpc -->> client: Allocations + deactivate followerRpc + + client ->> allocrunner: Create or update allocation runners + + client ->> followerRpc: Node.UpdateAlloc + Note right of followerRpc: will be forwarded to leader + + deactivate client +``` + + +[Scheduling Concepts]: https://nomadproject.io/docs/concepts/scheduling/scheduling +[`update`]: https://www.nomadproject.io/docs/job-specification/update diff --git a/contributing/architecture-eval-states.md b/contributing/architecture-eval-states.md new file mode 100644 index 000000000..fdaac339d --- /dev/null +++ b/contributing/architecture-eval-states.md @@ -0,0 +1,201 @@ +# Architecture: Evaluation Status + +The [Scheduling in Nomad][] internals documentation covers the path that an +evaluation takes through the leader, worker, and plan applier. But it doesn't +cover in any detail the various `Evaluation.Status` values, or where the +`PreviousEval`, `NextEval`, or `BlockedEval` ID pointers are set. + +The state diagram below describes the transitions between `Status` values as +solid arrows. The dashed arrows represent when a new evaluation is created. The +parenthetical labels on those arrows are the `TriggeredBy` field for the new +evaluation. 
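+
+For reference, the fields discussed in this document live on the Evaluation
+struct. The sketch below is abridged and annotated for this document; see
+[`structs.go`][] for the full definition.
+
+```go
+// Abridged sketch of structs.Evaluation -- only the fields discussed in this
+// document are shown here.
+type Evaluation struct {
+	ID          string
+	Status      string // pending, blocked, complete, failed, canceled
+	TriggeredBy string // see architecture-eval-triggers.md
+
+	// PreviousEval points back at the evaluation that created this one,
+	// NextEval points forward at a follow-up evaluation, and BlockedEval
+	// points at a blocked evaluation created for unplaced allocations.
+	PreviousEval string
+	NextEval     string
+	BlockedEval  string
+}
+```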
+ +The status values are: + +* `pending` evaluations either are queued to be scheduled, are still being + processed in the scheduler, or are being applied by the plan applier and not + yet acknowledged. +* `failed` evaluations have failed to be applied by the plan applier (or are + somehow invalid in the scheduler; this is always a bug) +* `blocked` evaluations are created when an eval has failed too many attempts to + have its plan applied by the leader, or when a plan can only be partially + applied and there are still more allocations to create. +* `complete` means the plan was applied successfully (at least partially). +* `canceled` means the evaluation was superseded by state changes like a new + version of the job. + + +```mermaid +flowchart LR + + event((Cluster\nEvent)) + + pending([pending]) + blocked([blocked]) + complete([complete]) + failed([failed]) + canceled([canceled]) + + %% style classes + classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class event other; + class pending,blocked,complete,failed,canceled status; + + event -. "job-register + job-deregister + periodic-job + node-update + node-drain + alloc-stop + scheduled + alloc-failure + job-scaling" .-> pending + + pending -. "new eval\n(rolling-update)" .-> pending + pending -. "new eval\n(preemption)" .-> pending + + pending -. "new eval\n(max-plan-attempts)" .-> blocked + pending -- if plan submitted --> complete + pending -- if invalid --> failed + pending -- if no-op --> canceled + + failed -- if retried --> blocked + failed -- if retried --> complete + + blocked -- if no-op --> canceled + blocked -- if plan submitted --> complete + + complete -. "new eval\n(deployment-watcher)" .-> pending + complete -. "new eval\n(queued-allocs)" .-> blocked + + failed -. "new eval\n(failed-follow-up)" .-> pending +``` + +But it's hard to get a full picture of the evaluation lifecycle purely from the +`Status` fields, because evaluations have several "quasi-statuses" which aren't +represented as explicit statuses in the state store: + +* `scheduling` is the status where an eval is being processed by the scheduler + worker. +* `applying` is the status where the resulting plan for the eval is being + applied in the plan applier on the leader. +* `delayed` is an enqueued eval that will be dequeued some time in the future. +* `deleted` is when an eval is removed from the state store entirely. + +By adding these statuses to the diagram (the dashed nodes), you can see where +the same `Status` transition might result in different `PreviousEval`, +`NextEval`, or `BlockedEval` set. You can also see where the "chain" of +evaluations is broken when new evals are created for preemptions or by the +deployment watcher. 
+ + +```mermaid +flowchart LR + + event((Cluster\nEvent)) + + %% statuss + pending([pending]) + blocked([blocked]) + complete([complete]) + failed([failed]) + canceled([canceled]) + + %% quasi-statuss + deleted([deleted]) + delayed([delayed]) + scheduling([scheduling]) + applying([applying]) + + %% style classes + classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef quasistatus fill:#d5f6ea,stroke-dasharray: 5 5,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + + class event other; + class pending,blocked,complete,failed,canceled status; + class deleted,delayed,scheduling,applying quasistatus; + + event -- "job-register + job-deregister + periodic-job + node-update + node-drain + alloc-stop + scheduled + alloc-failure + job-scaling" --> pending + + pending -- dequeued --> scheduling + pending -- if delayed --> delayed + delayed -- dequeued --> scheduling + + scheduling -. "not all allocs placed + new eval created by scheduler + trigger queued-allocs + new has .PreviousEval = old.ID + old has .BlockedEval = new.ID" .-> blocked + + scheduling -. "failed to plan + new eval created by scheduler + trigger: max-plan-attempts + new has .PreviousEval = old.ID + old has .BlockedEval = new.ID" .-> blocked + + scheduling -- "not all allocs placed + reuse already-blocked eval" --> blocked + + blocked -- "unblocked by + external state changes" --> scheduling + + scheduling -- allocs placed --> complete + + scheduling -- "wrong eval type or + max retries exceeded + on plan submit" --> failed + + scheduling -- "canceled by + job update/stop" --> canceled + + failed -- retry --> scheduling + + scheduling -. "new eval from rolling update (system jobs) + created by scheduler + trigger: rolling-update + new has .PreviousEval = old.ID + old has .NextEval = new.ID" .-> pending + + scheduling -- submit --> applying + applying -- failed --> scheduling + + applying -. "new eval for preempted allocs + created by plan applier + trigger: preemption + new has .PreviousEval = unset! + old has .BlockedEval = unset!" .-> pending + + complete -. "new eval from deployments (service jobs) + created by deploymentwatcher + trigger: deployment-watcher + new has .PreviousEval = unset! + old has .NextEval = unset!" .-> pending + + failed -- "new eval + trigger: failed-follow-up + new has .PreviousEval = old.ID + old has .NextEval = new.ID" --> pending + + pending -- "undeliverable evals + reaped by leader" --> failed + + blocked -- "duplicate blocked evals + reaped by leader" --> canceled + + canceled -- garbage\ncollection --> deleted + failed -- garbage\ncollection --> deleted + complete -- garbage\ncollection --> deleted +``` + + +[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling diff --git a/contributing/architecture-eval-triggers.md b/contributing/architecture-eval-triggers.md new file mode 100644 index 000000000..e8d7d7e69 --- /dev/null +++ b/contributing/architecture-eval-triggers.md @@ -0,0 +1,259 @@ +# Architecture: Evaluation Triggers + +The [Scheduling in Nomad][] internals documentation covers the path that an +evaluation takes through the leader, worker, and plan applier. This document +describes what events within the cluster cause Evaluations to be created. 
+
+Evaluations have a `TriggeredBy` field which is always one of the values
+defined in [`structs.go`][]:
+
+```
+const (
+    EvalTriggerJobRegister          = "job-register"
+    EvalTriggerJobDeregister        = "job-deregister"
+    EvalTriggerPeriodicJob          = "periodic-job"
+    EvalTriggerNodeDrain            = "node-drain"
+    EvalTriggerNodeUpdate           = "node-update"
+    EvalTriggerAllocStop            = "alloc-stop"
+    EvalTriggerScheduled            = "scheduled"
+    EvalTriggerRollingUpdate        = "rolling-update"
+    EvalTriggerDeploymentWatcher    = "deployment-watcher"
+    EvalTriggerFailedFollowUp       = "failed-follow-up"
+    EvalTriggerMaxPlans             = "max-plan-attempts"
+    EvalTriggerRetryFailedAlloc     = "alloc-failure"
+    EvalTriggerQueuedAllocs         = "queued-allocs"
+    EvalTriggerPreemption           = "preemption"
+    EvalTriggerScaling              = "job-scaling"
+    EvalTriggerMaxDisconnectTimeout = "max-disconnect-timeout"
+    EvalTriggerReconnect            = "reconnect"
+)
+```
+
+The list below covers each trigger value and the cluster events that cause it.
+
+* **job-register**: Creating or updating a Job will result in 1 Evaluation
+  created, plus any follow-up Evaluations associated with scheduling, planning,
+  or deployments.
+* **job-deregister**: Stopping a Job will result in 1 Evaluation created, plus
+  any follow-up Evaluations associated with scheduling, planning, or
+  deployments.
+* **periodic-job**: A periodic job that hits its timer and dispatches a child
+  job will result in 1 Evaluation created, plus any additional Evaluations
+  associated with scheduling or planning.
+* **node-drain**: Draining a node will create 1 Evaluation for each Job on the
+  node that's draining, plus any additional Evaluations associated with
+  scheduling or planning.
+* **node-update**: When the fingerprint of a client node has changed or the
+  node has changed state (from up to down), Nomad creates 1 Evaluation for each
+  Job running on the Node, plus 1 Evaluation for each system job that has
+  `datacenters` that include the datacenter for that Node.
+* **alloc-stop**: When the API that serves the `nomad alloc stop` command is
+  hit, Nomad creates 1 Evaluation.
+* **scheduled**: Nomad's internal housekeeping will periodically create
+  Evaluations for garbage collection.
+* **rolling-update**: When a `system` job is updated, the [`update`][] block's
+  `stagger` field controls how many Allocations will be scheduled at a time.
+  The scheduler will create 1 follow-up Evaluation for the next set.
+* **deployment-watcher**: When a `service` job is updated, the [`update`][]
+  block controls how many Allocations will be scheduled at a time. The
+  deployment watcher runs on the leader and monitors Allocation health. It will
+  create 1 Evaluation when the Deployment has reached the next step.
+* **failed-follow-up**: Evaluations that hit a delivery limit and will not be
+  retried by the eval broker are marked as failed. The leader periodically
+  reaps failed Evaluations and creates 1 new Evaluation for each, with a delay.
+* **max-plan-attempts**: The scheduler will retry Evaluations that are rejected
+  by the plan applier with a new cluster state snapshot. If the scheduler
+  exceeds the maximum number of retries, it will create 1 new Evaluation in the
+  `blocked` state.
+* **alloc-failure**: If an Allocation fails and exceeds its maximum
+  [`restart` attempts][], Nomad creates 1 new Evaluation.
+* **queued-allocs**: When a scheduler processes an Evaluation, it may not be
+  able to place all Allocations. It will create 1 new Evaluation in the
+  `blocked` state to be processed later once cluster state changes (for
+  example, when node updates arrive).
+* **preemption**: When Allocations are preempted, the plan applier creates 1 + Evaluation for each Job that has been preempted. +* **job-scaling**: Scaling a Job will result in 1 Evaluation created, plus any + follow-up Evaluations associated with scheduling, planning, or deployments. +* **max-disconnect-timeout**: When an Allocation is in the `unknown` state for + longer than the [`max_client_disconnect`][] window, the scheduler will create + 1 Evaluation. +* **reconnect**: When a Node in the `disconnected` state reconnects, Nomad will + create 1 Evaluation per job with an allocation on the reconnected Node. + +## Follow-up Evaluations + +Almost any Evaluation processed by the scheduler can result in additional +Evaluations being created, whether because the scheduler needs to follow-up from +failed scheduling or because the resulting plan changes the state of the +cluster. This can result in a large number of Evaluations when the cluster is in +an unstable state with frequent changes. + +Consider the following example where a node running 1 system job and 2 service +jobs misses its heartbeat and is marked lost. The Evaluation for the system job +is successfully planned. One of the service jobs no longer meets constraints. The +other service job is successfully scheduled but the resulting plan is rejected +because the scheduler has fallen behind in raft replication. A total of 6 +Evaluations are created. + +```mermaid +flowchart TD + + event((Node\nmisses\nheartbeat)) + + system([system\nnode-update]) + service1([service 1\nnode-update]) + service2([service 2\nnode-update]) + + blocked([service 1\nblocked\nqueued-allocs]) + failed([service 2\nfailed\nmax-plan-attempts]) + followup([service 2\nfailed-follow-up]) + + %% style classes + classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class event other; + class system,service1,service2,blocked,failed,followup eval; + + event --> system + event --> service1 + event --> service2 + + service1 --> blocked + + service2 --> failed + failed --> followup +``` + +Next, consider this example where a `service` job has been updated. The task +group has `count = 3` and the following `update` block: + +```hcl +update { + max_parallel = 1 + canary = 1 +} +``` + +After each Evaluation is processed, the Deployment Watcher will be waiting to +receive information on updated Allocation health. Then it will emit a new +Evaluation for the next step. A total of 4 Evaluations are created. + +```mermaid +flowchart TD + + registerEvent((Job\nRegister)) + alloc1health((Canary\nHealthy)) + alloc2health((Alloc 2\nHealthy)) + alloc3health((Alloc 3\nHealthy)) + + register([job-register]) + dwPostCanary([deployment-watcher]) + dwPostAlloc2([deployment-watcher]) + dwPostAlloc3([deployment-watcher]) + + %% style classes + classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class registerEvent,alloc1health,alloc2health,alloc3health other + class register,dwPostCanary,dwPostAlloc2,dwPostAlloc3 eval + + registerEvent --> register + register --> wait1 + alloc1health --> wait1 + wait1 --> dwPostCanary + + dwPostCanary --> wait2 + alloc2health --> wait2 + wait2 --> dwPostAlloc2 + + dwPostAlloc2 --> wait3 + alloc3health --> wait3 + wait3 --> dwPostAlloc3 + +``` + +Lastly, consider this example where 2 nodes each running 5 Allocations that are +all for system jobs are "flapping" by missing heartbeats and then +re-registering, or frequently changing fingerprints. 
This diagram will show the +results from each node going down once and then coming back up. + +```mermaid +flowchart TD + + %% style classes + classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + + eventAdown((Node A\nmisses\nheartbeat)) + eventAup((Node A\nheartbeats)) + eventBdown((Node B\nmisses\nheartbeat)) + eventBup((Node B\nheartbeats)) + + eventAdown --> eventAup + eventBdown --> eventBup + + A01down([job 1 node A\nnode-update]) + A02down([job 2 node A\nnode-update]) + A03down([job 3 node A\nnode-update]) + A04down([job 4 node A\nnode-update]) + A05down([job 5 node A\nnode-update]) + + B01down([job 1 node B\nnode-update]) + B02down([job 2 node B\nnode-update]) + B03down([job 3 node B\nnode-update]) + B04down([job 4 node B\nnode-update]) + B05down([job 5 node B\nnode-update]) + + A01up([job 1 node A\nnode-update]) + A02up([job 2 node A\nnode-update]) + A03up([job 3 node A\nnode-update]) + A04up([job 4 node A\nnode-update]) + A05up([job 5 node A\nnode-update]) + + B01up([job 1 node B\nnode-update]) + B02up([job 2 node B\nnode-update]) + B03up([job 3 node B\nnode-update]) + B04up([job 4 node B\nnode-update]) + B05up([job 5 node B\nnode-update]) + + eventAdown:::other --> A01down:::eval + eventAdown:::other --> A02down:::eval + eventAdown:::other --> A03down:::eval + eventAdown:::other --> A04down:::eval + eventAdown:::other --> A05down:::eval + + eventAup:::other --> A01up:::eval + eventAup:::other --> A02up:::eval + eventAup:::other --> A03up:::eval + eventAup:::other --> A04up:::eval + eventAup:::other --> A05up:::eval + + eventBdown:::other --> B01down:::eval + eventBdown:::other --> B02down:::eval + eventBdown:::other --> B03down:::eval + eventBdown:::other --> B04down:::eval + eventBdown:::other --> B05down:::eval + + eventBup:::other --> B01up:::eval + eventBup:::other --> B02up:::eval + eventBup:::other --> B03up:::eval + eventBup:::other --> B04up:::eval + eventBup:::other --> B05up:::eval + +``` + +You can extrapolate this example to large clusters: 100 nodes each running 10 +system jobs and 40 service jobs that go down once and come back up will result +in 100 * 40 * 2 == 8000 Evaluations created for the service jobs, which will +result in rescheduling of service allocations to new nodes. For the system jobs, +2000 Evaluations will be created and all of these will be no-op Evaluations that +will still need to be replicated to all raft peers, canceled by the scheduler, +and eventually need to be garbage collected. + + + +[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling +[`structs.go`]: https://github.com/hashicorp/nomad/blob/v1.4.0-beta.1/nomad/structs/structs.go#L10857-L10875 +[`update`]: https://www.nomadproject.io/docs/job-specification/update +[`restart` attempts]: https://www.nomadproject.io/docs/job-specification/restart +[`max_client_disconnect`]: https://www.nomadproject.io/docs/job-specification/group#max-client-disconnect diff --git a/contributing/architecture-state-store.md b/contributing/architecture-state-store.md new file mode 100644 index 000000000..96a8b39e4 --- /dev/null +++ b/contributing/architecture-state-store.md @@ -0,0 +1,151 @@ +# Architecture: Nomad State Store + +Nomad server state is an in-memory state store backed by raft. All writes to +state are serialized into message pack and written as raft logs. The raft logs +are replicated from the leader to the followers. 
Once each follower has
+persisted the log entry and applied the entry to its in-memory state ("FSM"),
+the leader considers the write committed.
+
+This architecture has a few implications:
+
+* The `fsm.Apply` functions must be deterministic over their inputs for a given
+  state. You can never generate random IDs or assign wall-clock timestamps in
+  the state store. These values must be provided as parameters from the RPC
+  handler.
+
+  ```go
+  // Incorrect: generating a timestamp in the state store is not deterministic.
+  func (s *StateStore) UpsertObject(...) {
+      // ...
+      obj.CreateTime = time.Now()
+      // ...
+  }
+
+  // Correct: non-deterministic values should be passed as inputs:
+  func (s *StateStore) UpsertObject(..., timestamp time.Time) {
+      // ...
+      obj.CreateTime = timestamp
+      // ...
+  }
+  ```
+
+* Every object you read from the state store must be copied before it can be
+  mutated, because mutating the object modifies it outside the raft
+  workflow. The result can be servers having inconsistent state, transactions
+  breaking, or even server panics.
+
+  ```go
+  // Incorrect: job is mutated without copying.
+  job, err := state.JobByID(ws, namespace, id)
+  job.Status = structs.JobStatusRunning
+
+  // Correct: only the job copy is mutated.
+  job, err := state.JobByID(ws, namespace, id)
+  updateJob := job.Copy()
+  updateJob.Status = structs.JobStatusRunning
+  ```
+
+Adding new objects to the state store should be done as part of adding new RPC
+endpoints. See the [RPC Endpoint Checklist][].
+
+```mermaid
+flowchart TD
+
+    %% entities
+
+    ext(("API\nclient"))
+    any("Any node
+        (client or server)")
+    follower(Follower)
+
+    rpcLeader("RPC handler (on leader)")
+
+    writes("writes go through raft
+        raftApply(MessageType, entry) in nomad/rpc.go
+        structs.MessageType in nomad/structs/structs.go
+        go generate ./...
for nomad/msgtypes.go") + click writes href "https://github.com/hashicorp/nomad/tree/main/nomad" _blank + + reads("reads go directly to state store + Typical state_store.go funcs to implement: + + state.GetMyThingByID + state.GetMyThingByPrefix + state.ListMyThing + state.UpsertMyThing + state.DeleteMyThing") + click writes href "https://github.com/hashicorp/nomad/tree/main/nomad/state" _blank + + raft("hashicorp/raft") + + bolt("boltdb") + + fsm("Application-specific + Finite State Machine (FSM) + (aka State Store)") + click writes href "https://github.com/hashicorp/nomad/tree/main/nomad/fsm.go" _blank + + memdb("hashicorp/go-memdb") + + %% style classes + classDef leader fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class any,follower other; + class rpcLeader,raft,bolt,fsm,memdb leader; + + %% flows + + ext -- HTTP request --> any + + any -- "RPC request + to connected server + (follower or leader)" --> follower + + follower -- "(1) srv.Forward (to leader)" --> rpcLeader + + raft -- "(3) replicate to a + quorum of followers + wait on their fsm.Apply" --> follower + + rpcLeader --> reads + reads --> memdb + + rpcLeader --> writes + writes -- "(2)" --> raft + + raft -- "(4) write log to disk" --> bolt + raft -- "(5) fsm.Apply + nomad/fsm.go" --> fsm + + fsm -- "(6) txn.Insert" --> memdb + + bolt <-- "Snapshot Persist: nomad/fsm.go + Snapshot Restore: nomad/fsm.go" --> memdb + + + %% notes + + note1("Typical structs to implement + for RPC handlers: + + structs.MyThing + .Diff() + .Copy() + .Merge() + structs.MyThingUpsertRequest + structs.MyThingUpsertResponse + structs.MyThingGetRequest + structs.MyThingGetResponse + structs.MyThingListRequest + structs.MyThingListResponse + structs.MyThingDeleteRequest + structs.MyThingDeleteResponse + + Don't forget to register your new RPC + in nomad/server.go!") + + note1 -.- rpcLeader +``` + + +[RPC Endpoint Checklist]: https://github.com/hashicorp/nomad/blob/main/contributing/checklist-rpc-endpoint.md diff --git a/website/content/docs/concepts/architecture.mdx b/website/content/docs/concepts/architecture.mdx index 96c3babef..bfc5d7216 100644 --- a/website/content/docs/concepts/architecture.mdx +++ b/website/content/docs/concepts/architecture.mdx @@ -69,6 +69,15 @@ either the _desired state_ (jobs) or _actual state_ (clients) changes, Nomad creates a new evaluation to determine if any actions must be taken. An evaluation may result in changes to allocations if necessary. +#### Deployment + +Deployments are the mechanism by which Nomad rolls out changes to cluster state +in a step-by-step fashion. Deployments are only available for Jobs with the type +`service`. When an Evaluation is processed, the scheduler creates only the +number of Allocations permitted by the [`update`][] block and the current state +of the cluster. The Deployment is used to monitor the health of those +Allocations and emit a new Evaluation for the next step of the update. + #### Server Nomad servers are the brains of the cluster. There is a cluster of servers per @@ -161,3 +170,6 @@ are more details available for each of the sub-systems. The [consensus protocol] are all documented in more detail. For other details, either consult the code, ask in IRC or reach out to the mailing list. + + +[`update`]: /docs/job-specification/update