diff --git a/contributing/architecture-eval-lifecycle.md b/contributing/architecture-eval-lifecycle.md
new file mode 100644
index 000000000..9156f12e8
--- /dev/null
+++ b/contributing/architecture-eval-lifecycle.md
@@ -0,0 +1,342 @@
+# Architecture: Evaluations into Allocations
+
+The [Scheduling Concepts][] docs provide an overview of how the scheduling
+process works. This document is intended to go a bit deeper than those docs and
+walk through the lifecycle of an Evaluation from job registration to running
+Allocations on the clients. The process can be broken into 4 parts:
+
+* [Job Registration](#job-registration)
+* [Scheduling](#scheduling)
+* [Deployment Watcher](#deployment-watcher)
+* [Client Allocs](#client-allocs)
+
+Note that in all the diagrams below, writing to the State Store is considered
+atomic. This means the Nomad leader has replicated the entry to a quorum of
+followers and applied it to its own FSM; the remaining followers apply the raft
+log entry to their local FSM shortly afterward. So long as `raftApply` returns
+without an error, we have a guarantee that all servers will be able to retrieve
+the entry from their state store at some point in the future.
+
+## Job Registration
+
+Creating or updating a Job is a _synchronous_ API operation. By the time the
+response has returned to the API consumer, the Job and an Evaluation for that
+Job have been written to the state store of the server nodes.
+
+* Note: parameterized or periodic batch jobs don't create an Evaluation at
+  registration, only when dispatched.
+* Note: The scheduler is very fast! This means that once the CLI gets a
+  response it can immediately start making queries to get information
+  about the next steps.
+* Note: This workflow is different for Multi-Region Deployments (found in Nomad
+  Enterprise). That will be documented separately.
+
+The diagram below shows this initial synchronous phase. Note here that there's
+no scheduling happening yet, so no Allocation (or Deployment) has been created.
+
+```mermaid
+sequenceDiagram
+
+    participant user as User
+    participant cli as CLI
+    participant httpAPI as HTTP API
+    participant leaderRpc as Leader RPC
+    participant stateStore as State Store
+
+    user ->> cli: nomad job run
+    activate cli
+    cli ->> httpAPI: Create Job API
+    httpAPI ->> leaderRpc: Job.Register
+
+    activate leaderRpc
+
+    leaderRpc ->> stateStore: write Job
+    activate stateStore
+    stateStore -->> leaderRpc: ok
+    deactivate stateStore
+
+    leaderRpc ->> stateStore: write Evaluation
+
+    activate stateStore
+    stateStore -->> leaderRpc: ok
+
+    Note right of stateStore: EvalBroker.Enqueue
+    Note right of stateStore: (see Scheduling below)
+
+    deactivate stateStore
+
+    leaderRpc -->> httpAPI: Job.Register response
+    deactivate leaderRpc
+
+    httpAPI -->> cli: Create Job response
+    deactivate cli
+
+```
+
+## Scheduling
+
+A long-lived goroutine on the Nomad leader called the Eval Broker maintains a
+queue of Evaluations previously written to the state store and enqueued via the
+`EvalBroker.Enqueue` method. (When a leader transition occurs, the new leader
+queries all the Evaluations in the state store and enqueues them in its new
+Eval Broker.)
+
+Scheduler workers are long-lived goroutines running on all server
+nodes. Typically this will be one per core on followers and 1/4 that number on
+the leader. The workers poll for Evaluations from the Eval Broker with the
+`Eval.Dequeue` RPC. Once a worker has an evaluation, it instantiates a
+scheduler for that evaluation (of type `service`, `system`, `sysbatch`,
+`batch`, or `core`).
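+
+As a rough sketch (this is illustrative pseudocode, not the actual
+implementation in `nomad/worker.go`, and the type and helper names below are
+invented for clarity), the worker's main loop looks something like this:
+
+```go
+package worker
+
+// Evaluation and Scheduler stand in for the real definitions in the nomad
+// and scheduler packages; they are illustrative only.
+type Evaluation struct{ ID, Type string }
+
+type Scheduler interface{ Process(*Evaluation) error }
+
+type Worker struct {
+	// dequeue wraps the blocking Eval.Dequeue RPC.
+	dequeue func() (eval *Evaluation, token string, err error)
+	// newSched builds a scheduler of the matching type: service, system,
+	// sysbatch, batch, or core.
+	newSched func(evalType string) Scheduler
+	// ack and nack wrap the Eval.Ack and Eval.Nack RPCs.
+	ack  func(evalID, token string)
+	nack func(evalID, token string)
+}
+
+// run sketches the loop: dequeue an Evaluation, hand it to a scheduler of the
+// matching type, then Ack on success or Nack so another worker can retry it.
+func (w *Worker) run() {
+	for {
+		eval, token, err := w.dequeue()
+		if err != nil {
+			continue
+		}
+		if err := w.newSched(eval.Type).Process(eval); err != nil {
+			w.nack(eval.ID, token)
+			continue
+		}
+		w.ack(eval.ID, token)
+	}
+}
+```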
+
+Because a worker is running one scheduler at a time, Nomad's documentation
+often refers to "workers" and "schedulers" interchangeably, but the worker is
+the long-lived goroutine and the scheduler is the struct that contains the code
+and state around processing a single evaluation. The scheduler mutates itself
+and is thrown away once the evaluation is processed.
+
+The scheduler takes a snapshot of that server node's state store so that it has
+a consistent point-in-time view of the cluster state. The scheduler executes 3
+main steps:
+
+* Reconcile: compare the cluster state and job specification to determine what
+  changes need to be made -- starting or stopping Allocations. The scheduler
+  creates new Allocations in this step. But note that _these Allocations are
+  not yet persisted to the state store_.
+* Feasibility Check: For each Allocation it needs, the scheduler iterates over
+  Nodes until it finds up to 2 Nodes that match the Allocation's resource
+  requirements and constraints.
+  * Note: for system jobs or jobs with `spread` blocks, the scheduler has to
+    check all Nodes.
+* Scoring: The scheduler ranks the feasible Nodes and picks the one with the
+  highest score.
+
+If the job is a `service` job, the scheduler will also create (or update) a
+Deployment. When the scheduler determines the number of Allocations to create,
+it examines the job's [`update`][] block. Only the number of Allocations needed
+to complete the next phase of the update will be created in a given
+Evaluation. The Deployment is used by the deployment watcher (see below) to
+monitor the health of allocations and create new Evaluations to continue the
+update.
+
+If the scheduler cannot place all Allocations, it will create a new Evaluation
+in the `blocked` state and submit it to the leader. The Eval Broker will
+re-enqueue that Evaluation once cluster state has changed. (This process is the
+first green box in the sequence diagram below.)
+
+Once the scheduler has completed processing the Evaluation, if there are
+Allocations (and possibly a Deployment) to update, it will submit this work as
+a Plan to the leader. The leader needs to validate the plan and serialize it
+with other plans, because:
+
+* The scheduler took a snapshot of cluster state at the start of its work, so
+  that state may have changed in the meantime.
+* Schedulers run concurrently across the cluster, so they may generate
+  conflicting plans (particularly on heavily-packed clusters).
+
+The leader processes the plan in the plan applier. If the plan is valid, the
+plan applier will write the Allocations (and Deployment) update to the state
+store. If not, it will reject the plan and the scheduler will try to create a
+new plan with a refreshed state snapshot. If the scheduler fails to submit a
+valid plan too many times, it submits a new Evaluation in the `blocked` state
+with the `max-plan-attempts` trigger. (The plan submit process is the second
+green box in the sequence diagram below.)
+
+Once the scheduler has a response from the leader, it will tell the Eval Broker
+to Ack the Evaluation (if it successfully submitted the plan) or Nack the
+Evaluation (if it failed to do so) so that another scheduler can try processing
+it.
+
+The diagram below shows the scheduling phase, including submitting plans to the
+planner. Note that at the end of this phase, Allocations (and Deployment) have
+been persisted to the state store.
+ +```mermaid +sequenceDiagram + + participant leaderRpc as Leader RPC + participant stateStore as State Store + participant planner as Planner + participant broker as Eval Broker + participant sched as Scheduler + + leaderRpc ->> stateStore: write Evaluation (see above) + stateStore ->> broker: EvalBroker.Enqueue + activate broker + broker -->> broker: enqueue eval + broker -->> stateStore: ok + deactivate broker + + sched ->> broker: Eval.Dequeue (blocks until work available) + activate sched + broker -->> sched: Evaluation + + sched ->> stateStore: query to get job and cluster state + stateStore -->> sched: results + + sched -->> sched: Reconcile (how many allocs are needed?) + deactivate sched + + alt + Note right of sched: Needs new allocs + activate sched + sched -->> sched: create Allocations + sched -->> sched: create Deployment (service jobs) + Note left of sched: for Deployments, only 1 "batch" of Allocs will get created for each Eval + sched -->> sched: Feasibility Check + Note left of sched: iterate over Nodes to find a placement for each Alloc + sched -->> sched: Ranking + Note left of sched: pick best option for each Alloc + + %% start rect highlight for blocked eval + rect rgb(213, 246, 234) + Note left of sched: Not enough room! (But we can submit a partial plan) + sched ->> leaderRpc: Eval.Upsert (blocked) + activate leaderRpc + leaderRpc ->> stateStore: write Evaluation + activate stateStore + stateStore -->> leaderRpc: ok + deactivate stateStore + leaderRpc --> sched: ok + deactivate leaderRpc + end + %% end rect highlight for blocked eval + + + %% start rect highlight for planner + rect rgb(213, 246, 234) + Note over sched, planner: Note: scheduler snapshot may be stale state relative to leader so it's serialized by plan applier + + sched ->> planner: Plan.Submit (Allocations + Deployment) + planner -->> planner: is plan still valid? + + Note right of planner: Plan is valid + activate planner + + planner ->> leaderRpc: Allocations.Upsert + Deployment.Upsert + activate leaderRpc + leaderRpc ->> stateStore: write Allocations and Deployment + activate stateStore + stateStore -->> leaderRpc: ok + deactivate stateStore + leaderRpc -->> planner: ok + deactivate leaderRpc + planner -->> sched: ok + deactivate planner + + end + %% end rect highlight for planner + + sched -->> sched: retry on failure, if exceed max attempts will Eval.Nack + + else + + end + + sched ->> broker: Eval.Ack (Eval.Nack if failed) + activate broker + broker ->> stateStore: complete Evaluation + activate stateStore + stateStore -->> broker: ok + deactivate stateStore + broker -->> sched: ok + deactivate broker + deactivate sched +``` + +## Deployment Watcher + +As noted under Scheduling above, a Deployment is created for service jobs. A +deployment watcher runs on the leader. Its job is to watch the state of +Allocations being placed for a given job version and to emit new Evaluations so +that more Allocations for that job can be created. + +The "deployments watcher" (plural) makes a blocking query for Deployments and +spins up a new "deployment watcher" (singular) for each one. That goroutine will +live until its Deployment is complete or failed. + +The deployment watcher makes blocking queries on Allocation health and its own +Deployment (which can be canceled or paused by a user). When there's a change in +any of those states, it compares the current state against the [`update`][] +block and the timers it maintains for `min_healthy_time`, `healthy_deadline`, +and `progress_deadline`. 
It then updates the Deployment state and creates a new +Evaluation if the current step of the update is complete. + +The diagram below shows deployments from a high level. Note that Deployments do +not themselves create Allocations -- they create Evaluations and then the +schedulers process those as they do normally. + +```mermaid +sequenceDiagram + + participant leaderRpc as Leader RPC + participant stateStore as State Store + participant dw as Deployment Watcher + + dw ->> stateStore: blocking query for new Deployments + activate dw + stateStore -->> dw: new Deployment + dw -->> dw: start watcher + + dw ->> stateStore: blocking query for Allocation health + stateStore -->> dw: Allocation health updates + dw ->> dw: next step? + + Note right of dw: Update state and create evaluations for next batch... + Note right of dw: Or fail the Deployment and update state + + dw ->> leaderRpc: Deployment.Upsert + Evaluation.Upsert + activate leaderRpc + leaderRpc ->> stateStore: write Deployment and Evaluations + activate stateStore + stateStore -->> leaderRpc: ok + deactivate stateStore + leaderRpc -->> dw: ok + deactivate leaderRpc + deactivate dw + +``` + +## Client Allocs + +Once the plan applier has persisted Allocations to the state store (with an +associated Node ID), they become available to get placed on the client. Clients +_pull_ new allocations (and changes to allocations), so a new allocation will be +in the `pending` state until it's been pulled down by a Client and the +allocation has been instantiated. + +Once the Allocation is running and healthy, the Client will send a +`Node.UpdateAlloc` RPC back to the server so that info can be persisted in the +state store. This is the allocation health data the Deployment Watcher is +querying for above. + +```mermaid +sequenceDiagram + + participant followerRpc as Follower RPC + participant stateStore as State Store + + participant client as Client + participant allocrunner as Allocation Runner + + client ->> followerRpc: Alloc.GetAllocs RPC + activate client + + Note right of client: this query can be stale + + followerRpc ->> stateStore: query for Allocations + activate followerRpc + activate stateStore + stateStore -->> followerRpc: Allocations + deactivate stateStore + followerRpc -->> client: Allocations + deactivate followerRpc + + client ->> allocrunner: Create or update allocation runners + + client ->> followerRpc: Node.UpdateAlloc + Note right of followerRpc: will be forwarded to leader + + deactivate client +``` + + +[Scheduling Concepts]: https://nomadproject.io/docs/concepts/scheduling/scheduling +[`update`]: https://www.nomadproject.io/docs/job-specification/update diff --git a/contributing/architecture-eval-states.md b/contributing/architecture-eval-states.md new file mode 100644 index 000000000..fdaac339d --- /dev/null +++ b/contributing/architecture-eval-states.md @@ -0,0 +1,201 @@ +# Architecture: Evaluation Status + +The [Scheduling in Nomad][] internals documentation covers the path that an +evaluation takes through the leader, worker, and plan applier. But it doesn't +cover in any detail the various `Evaluation.Status` values, or where the +`PreviousEval`, `NextEval`, or `BlockedEval` ID pointers are set. + +The state diagram below describes the transitions between `Status` values as +solid arrows. The dashed arrows represent when a new evaluation is created. The +parenthetical labels on those arrows are the `TriggeredBy` field for the new +evaluation. 
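+
+For reference, the fields discussed in this document live on the Evaluation
+struct. The sketch below is abridged and annotated for this document; see
+[`structs.go`][] for the full definition.
+
+```go
+// Abridged sketch of structs.Evaluation -- only the fields discussed in this
+// document are shown here.
+type Evaluation struct {
+	ID          string
+	Status      string // pending, blocked, complete, failed, canceled
+	TriggeredBy string // see architecture-eval-triggers.md
+
+	// PreviousEval points back at the evaluation that created this one,
+	// NextEval points forward at a follow-up evaluation, and BlockedEval
+	// points at a blocked evaluation created for unplaced allocations.
+	PreviousEval string
+	NextEval     string
+	BlockedEval  string
+}
+```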
+ +The status values are: + +* `pending` evaluations either are queued to be scheduled, are still being + processed in the scheduler, or are being applied by the plan applier and not + yet acknowledged. +* `failed` evaluations have failed to be applied by the plan applier (or are + somehow invalid in the scheduler; this is always a bug) +* `blocked` evaluations are created when an eval has failed too many attempts to + have its plan applied by the leader, or when a plan can only be partially + applied and there are still more allocations to create. +* `complete` means the plan was applied successfully (at least partially). +* `canceled` means the evaluation was superseded by state changes like a new + version of the job. + + +```mermaid +flowchart LR + + event((Cluster\nEvent)) + + pending([pending]) + blocked([blocked]) + complete([complete]) + failed([failed]) + canceled([canceled]) + + %% style classes + classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class event other; + class pending,blocked,complete,failed,canceled status; + + event -. "job-register + job-deregister + periodic-job + node-update + node-drain + alloc-stop + scheduled + alloc-failure + job-scaling" .-> pending + + pending -. "new eval\n(rolling-update)" .-> pending + pending -. "new eval\n(preemption)" .-> pending + + pending -. "new eval\n(max-plan-attempts)" .-> blocked + pending -- if plan submitted --> complete + pending -- if invalid --> failed + pending -- if no-op --> canceled + + failed -- if retried --> blocked + failed -- if retried --> complete + + blocked -- if no-op --> canceled + blocked -- if plan submitted --> complete + + complete -. "new eval\n(deployment-watcher)" .-> pending + complete -. "new eval\n(queued-allocs)" .-> blocked + + failed -. "new eval\n(failed-follow-up)" .-> pending +``` + +But it's hard to get a full picture of the evaluation lifecycle purely from the +`Status` fields, because evaluations have several "quasi-statuses" which aren't +represented as explicit statuses in the state store: + +* `scheduling` is the status where an eval is being processed by the scheduler + worker. +* `applying` is the status where the resulting plan for the eval is being + applied in the plan applier on the leader. +* `delayed` is an enqueued eval that will be dequeued some time in the future. +* `deleted` is when an eval is removed from the state store entirely. + +By adding these statuses to the diagram (the dashed nodes), you can see where +the same `Status` transition might result in different `PreviousEval`, +`NextEval`, or `BlockedEval` set. You can also see where the "chain" of +evaluations is broken when new evals are created for preemptions or by the +deployment watcher. 
+ + +```mermaid +flowchart LR + + event((Cluster\nEvent)) + + %% statuss + pending([pending]) + blocked([blocked]) + complete([complete]) + failed([failed]) + canceled([canceled]) + + %% quasi-statuss + deleted([deleted]) + delayed([delayed]) + scheduling([scheduling]) + applying([applying]) + + %% style classes + classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef quasistatus fill:#d5f6ea,stroke-dasharray: 5 5,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + + class event other; + class pending,blocked,complete,failed,canceled status; + class deleted,delayed,scheduling,applying quasistatus; + + event -- "job-register + job-deregister + periodic-job + node-update + node-drain + alloc-stop + scheduled + alloc-failure + job-scaling" --> pending + + pending -- dequeued --> scheduling + pending -- if delayed --> delayed + delayed -- dequeued --> scheduling + + scheduling -. "not all allocs placed + new eval created by scheduler + trigger queued-allocs + new has .PreviousEval = old.ID + old has .BlockedEval = new.ID" .-> blocked + + scheduling -. "failed to plan + new eval created by scheduler + trigger: max-plan-attempts + new has .PreviousEval = old.ID + old has .BlockedEval = new.ID" .-> blocked + + scheduling -- "not all allocs placed + reuse already-blocked eval" --> blocked + + blocked -- "unblocked by + external state changes" --> scheduling + + scheduling -- allocs placed --> complete + + scheduling -- "wrong eval type or + max retries exceeded + on plan submit" --> failed + + scheduling -- "canceled by + job update/stop" --> canceled + + failed -- retry --> scheduling + + scheduling -. "new eval from rolling update (system jobs) + created by scheduler + trigger: rolling-update + new has .PreviousEval = old.ID + old has .NextEval = new.ID" .-> pending + + scheduling -- submit --> applying + applying -- failed --> scheduling + + applying -. "new eval for preempted allocs + created by plan applier + trigger: preemption + new has .PreviousEval = unset! + old has .BlockedEval = unset!" .-> pending + + complete -. "new eval from deployments (service jobs) + created by deploymentwatcher + trigger: deployment-watcher + new has .PreviousEval = unset! + old has .NextEval = unset!" .-> pending + + failed -- "new eval + trigger: failed-follow-up + new has .PreviousEval = old.ID + old has .NextEval = new.ID" --> pending + + pending -- "undeliverable evals + reaped by leader" --> failed + + blocked -- "duplicate blocked evals + reaped by leader" --> canceled + + canceled -- garbage\ncollection --> deleted + failed -- garbage\ncollection --> deleted + complete -- garbage\ncollection --> deleted +``` + + +[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling diff --git a/contributing/architecture-eval-triggers.md b/contributing/architecture-eval-triggers.md new file mode 100644 index 000000000..e8d7d7e69 --- /dev/null +++ b/contributing/architecture-eval-triggers.md @@ -0,0 +1,259 @@ +# Architecture: Evaluation Triggers + +The [Scheduling in Nomad][] internals documentation covers the path that an +evaluation takes through the leader, worker, and plan applier. This document +describes what events within the cluster cause Evaluations to be created. 
+
+Evaluations have a `TriggeredBy` field which is always one of the values
+defined in [`structs.go`][]:
+
+```
+const (
+    EvalTriggerJobRegister          = "job-register"
+    EvalTriggerJobDeregister        = "job-deregister"
+    EvalTriggerPeriodicJob          = "periodic-job"
+    EvalTriggerNodeDrain            = "node-drain"
+    EvalTriggerNodeUpdate           = "node-update"
+    EvalTriggerAllocStop            = "alloc-stop"
+    EvalTriggerScheduled            = "scheduled"
+    EvalTriggerRollingUpdate        = "rolling-update"
+    EvalTriggerDeploymentWatcher    = "deployment-watcher"
+    EvalTriggerFailedFollowUp       = "failed-follow-up"
+    EvalTriggerMaxPlans             = "max-plan-attempts"
+    EvalTriggerRetryFailedAlloc     = "alloc-failure"
+    EvalTriggerQueuedAllocs         = "queued-allocs"
+    EvalTriggerPreemption           = "preemption"
+    EvalTriggerScaling              = "job-scaling"
+    EvalTriggerMaxDisconnectTimeout = "max-disconnect-timeout"
+    EvalTriggerReconnect            = "reconnect"
+)
+```
+
+The list below covers each trigger value and the cluster events that cause it.
+
+* **job-register**: Creating or updating a Job will result in 1 Evaluation
+  created, plus any follow-up Evaluations associated with scheduling, planning,
+  or deployments.
+* **job-deregister**: Stopping a Job will result in 1 Evaluation created, plus
+  any follow-up Evaluations associated with scheduling, planning, or
+  deployments.
+* **periodic-job**: A periodic job that hits its timer and dispatches a child
+  job will result in 1 Evaluation created, plus any additional Evaluations
+  associated with scheduling or planning.
+* **node-drain**: Draining a node will create 1 Evaluation for each Job on the
+  node that's draining, plus any additional Evaluations associated with
+  scheduling or planning.
+* **node-update**: When the fingerprint of a client node has changed or the
+  node has changed state (from up to down), Nomad creates 1 Evaluation for each
+  Job running on the Node, plus 1 Evaluation for each system job that has
+  `datacenters` that include the datacenter for that Node.
+* **alloc-stop**: When the API that serves the `nomad alloc stop` command is
+  hit, Nomad creates 1 Evaluation.
+* **scheduled**: Nomad's internal housekeeping will periodically create
+  Evaluations for garbage collection.
+* **rolling-update**: When a `system` job is updated, the [`update`][] block's
+  `stagger` field controls how many Allocations will be scheduled at a time.
+  The scheduler will create 1 follow-up Evaluation for the next set.
+* **deployment-watcher**: When a `service` job is updated, the [`update`][]
+  block controls how many Allocations will be scheduled at a time. The
+  deployment watcher runs on the leader and monitors Allocation health. It will
+  create 1 Evaluation when the Deployment has reached the next step.
+* **failed-follow-up**: Evaluations that hit a delivery limit and will not be
+  retried by the eval broker are marked as failed. The leader periodically
+  reaps failed Evaluations and creates 1 new Evaluation for each, with a delay.
+* **max-plan-attempts**: The scheduler will retry Evaluations that are rejected
+  by the plan applier with a new cluster state snapshot. If the scheduler
+  exceeds the maximum number of retries, it will create 1 new Evaluation in the
+  `blocked` state.
+* **alloc-failure**: If an Allocation fails and exceeds its maximum
+  [`restart` attempts][], Nomad creates 1 new Evaluation.
+* **queued-allocs**: When a scheduler processes an Evaluation, it may not be
+  able to place all Allocations. It will create 1 new Evaluation in the
+  `blocked` state to be processed later once cluster state changes (for
+  example, when node updates arrive).
+* **preemption**: When Allocations are preempted, the plan applier creates 1 + Evaluation for each Job that has been preempted. +* **job-scaling**: Scaling a Job will result in 1 Evaluation created, plus any + follow-up Evaluations associated with scheduling, planning, or deployments. +* **max-disconnect-timeout**: When an Allocation is in the `unknown` state for + longer than the [`max_client_disconnect`][] window, the scheduler will create + 1 Evaluation. +* **reconnect**: When a Node in the `disconnected` state reconnects, Nomad will + create 1 Evaluation per job with an allocation on the reconnected Node. + +## Follow-up Evaluations + +Almost any Evaluation processed by the scheduler can result in additional +Evaluations being created, whether because the scheduler needs to follow-up from +failed scheduling or because the resulting plan changes the state of the +cluster. This can result in a large number of Evaluations when the cluster is in +an unstable state with frequent changes. + +Consider the following example where a node running 1 system job and 2 service +jobs misses its heartbeat and is marked lost. The Evaluation for the system job +is successfully planned. One of the service jobs no longer meets constraints. The +other service job is successfully scheduled but the resulting plan is rejected +because the scheduler has fallen behind in raft replication. A total of 6 +Evaluations are created. + +```mermaid +flowchart TD + + event((Node\nmisses\nheartbeat)) + + system([system\nnode-update]) + service1([service 1\nnode-update]) + service2([service 2\nnode-update]) + + blocked([service 1\nblocked\nqueued-allocs]) + failed([service 2\nfailed\nmax-plan-attempts]) + followup([service 2\nfailed-follow-up]) + + %% style classes + classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class event other; + class system,service1,service2,blocked,failed,followup eval; + + event --> system + event --> service1 + event --> service2 + + service1 --> blocked + + service2 --> failed + failed --> followup +``` + +Next, consider this example where a `service` job has been updated. The task +group has `count = 3` and the following `update` block: + +```hcl +update { + max_parallel = 1 + canary = 1 +} +``` + +After each Evaluation is processed, the Deployment Watcher will be waiting to +receive information on updated Allocation health. Then it will emit a new +Evaluation for the next step. A total of 4 Evaluations are created. + +```mermaid +flowchart TD + + registerEvent((Job\nRegister)) + alloc1health((Canary\nHealthy)) + alloc2health((Alloc 2\nHealthy)) + alloc3health((Alloc 3\nHealthy)) + + register([job-register]) + dwPostCanary([deployment-watcher]) + dwPostAlloc2([deployment-watcher]) + dwPostAlloc3([deployment-watcher]) + + %% style classes + classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class registerEvent,alloc1health,alloc2health,alloc3health other + class register,dwPostCanary,dwPostAlloc2,dwPostAlloc3 eval + + registerEvent --> register + register --> wait1 + alloc1health --> wait1 + wait1 --> dwPostCanary + + dwPostCanary --> wait2 + alloc2health --> wait2 + wait2 --> dwPostAlloc2 + + dwPostAlloc2 --> wait3 + alloc3health --> wait3 + wait3 --> dwPostAlloc3 + +``` + +Lastly, consider this example where 2 nodes each running 5 Allocations that are +all for system jobs are "flapping" by missing heartbeats and then +re-registering, or frequently changing fingerprints. 
This diagram will show the +results from each node going down once and then coming back up. + +```mermaid +flowchart TD + + %% style classes + classDef eval fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + + eventAdown((Node A\nmisses\nheartbeat)) + eventAup((Node A\nheartbeats)) + eventBdown((Node B\nmisses\nheartbeat)) + eventBup((Node B\nheartbeats)) + + eventAdown --> eventAup + eventBdown --> eventBup + + A01down([job 1 node A\nnode-update]) + A02down([job 2 node A\nnode-update]) + A03down([job 3 node A\nnode-update]) + A04down([job 4 node A\nnode-update]) + A05down([job 5 node A\nnode-update]) + + B01down([job 1 node B\nnode-update]) + B02down([job 2 node B\nnode-update]) + B03down([job 3 node B\nnode-update]) + B04down([job 4 node B\nnode-update]) + B05down([job 5 node B\nnode-update]) + + A01up([job 1 node A\nnode-update]) + A02up([job 2 node A\nnode-update]) + A03up([job 3 node A\nnode-update]) + A04up([job 4 node A\nnode-update]) + A05up([job 5 node A\nnode-update]) + + B01up([job 1 node B\nnode-update]) + B02up([job 2 node B\nnode-update]) + B03up([job 3 node B\nnode-update]) + B04up([job 4 node B\nnode-update]) + B05up([job 5 node B\nnode-update]) + + eventAdown:::other --> A01down:::eval + eventAdown:::other --> A02down:::eval + eventAdown:::other --> A03down:::eval + eventAdown:::other --> A04down:::eval + eventAdown:::other --> A05down:::eval + + eventAup:::other --> A01up:::eval + eventAup:::other --> A02up:::eval + eventAup:::other --> A03up:::eval + eventAup:::other --> A04up:::eval + eventAup:::other --> A05up:::eval + + eventBdown:::other --> B01down:::eval + eventBdown:::other --> B02down:::eval + eventBdown:::other --> B03down:::eval + eventBdown:::other --> B04down:::eval + eventBdown:::other --> B05down:::eval + + eventBup:::other --> B01up:::eval + eventBup:::other --> B02up:::eval + eventBup:::other --> B03up:::eval + eventBup:::other --> B04up:::eval + eventBup:::other --> B05up:::eval + +``` + +You can extrapolate this example to large clusters: 100 nodes each running 10 +system jobs and 40 service jobs that go down once and come back up will result +in 100 * 40 * 2 == 8000 Evaluations created for the service jobs, which will +result in rescheduling of service allocations to new nodes. For the system jobs, +2000 Evaluations will be created and all of these will be no-op Evaluations that +will still need to be replicated to all raft peers, canceled by the scheduler, +and eventually need to be garbage collected. + + + +[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling +[`structs.go`]: https://github.com/hashicorp/nomad/blob/v1.4.0-beta.1/nomad/structs/structs.go#L10857-L10875 +[`update`]: https://www.nomadproject.io/docs/job-specification/update +[`restart` attempts]: https://www.nomadproject.io/docs/job-specification/restart +[`max_client_disconnect`]: https://www.nomadproject.io/docs/job-specification/group#max-client-disconnect diff --git a/contributing/architecture-state-store.md b/contributing/architecture-state-store.md new file mode 100644 index 000000000..96a8b39e4 --- /dev/null +++ b/contributing/architecture-state-store.md @@ -0,0 +1,151 @@ +# Architecture: Nomad State Store + +Nomad server state is an in-memory state store backed by raft. All writes to +state are serialized into message pack and written as raft logs. The raft logs +are replicated from the leader to the followers. 
Once each follower has
+persisted the log entry and applied the entry to its in-memory state ("FSM"),
+the leader considers the write committed.
+
+This architecture has a few implications:
+
+* The `fsm.Apply` functions must be deterministic over their inputs for a given
+  state. You can never generate random IDs or assign wall-clock timestamps in
+  the state store. These values must be provided as parameters from the RPC
+  handler.
+
+  ```go
+  // Incorrect: generating a timestamp in the state store is not deterministic.
+  func (s *StateStore) UpsertObject(...) {
+      // ...
+      obj.CreateTime = time.Now()
+      // ...
+  }
+
+  // Correct: non-deterministic values should be passed as inputs:
+  func (s *StateStore) UpsertObject(..., timestamp time.Time) {
+      // ...
+      obj.CreateTime = timestamp
+      // ...
+  }
+  ```
+
+* Every object you read from the state store must be copied before it can be
+  mutated, because mutating the object modifies it outside the raft
+  workflow. The result can be servers having inconsistent state, transactions
+  breaking, or even server panics.
+
+  ```go
+  // Incorrect: job is mutated without copying.
+  job, err := state.JobByID(ws, namespace, id)
+  job.Status = structs.JobStatusRunning
+
+  // Correct: only the job copy is mutated.
+  job, err := state.JobByID(ws, namespace, id)
+  updateJob := job.Copy()
+  updateJob.Status = structs.JobStatusRunning
+  ```
+
+Adding new objects to the state store should be done as part of adding new RPC
+endpoints. See the [RPC Endpoint Checklist][].
+
+```mermaid
+flowchart TD
+
+    %% entities
+
+    ext(("API\nclient"))
+    any("Any node
+        (client or server)")
+    follower(Follower)
+
+    rpcLeader("RPC handler (on leader)")
+
+    writes("writes go through raft
+        raftApply(MessageType, entry) in nomad/rpc.go
+        structs.MessageType in nomad/structs/structs.go
+        go generate ./...
for nomad/msgtypes.go") + click writes href "https://github.com/hashicorp/nomad/tree/main/nomad" _blank + + reads("reads go directly to state store + Typical state_store.go funcs to implement: + + state.GetMyThingByID + state.GetMyThingByPrefix + state.ListMyThing + state.UpsertMyThing + state.DeleteMyThing") + click writes href "https://github.com/hashicorp/nomad/tree/main/nomad/state" _blank + + raft("hashicorp/raft") + + bolt("boltdb") + + fsm("Application-specific + Finite State Machine (FSM) + (aka State Store)") + click writes href "https://github.com/hashicorp/nomad/tree/main/nomad/fsm.go" _blank + + memdb("hashicorp/go-memdb") + + %% style classes + classDef leader fill:#d5f6ea,stroke-width:4px,stroke:#1d9467 + classDef other fill:#d5f6ea,stroke:#1d9467 + class any,follower other; + class rpcLeader,raft,bolt,fsm,memdb leader; + + %% flows + + ext -- HTTP request --> any + + any -- "RPC request + to connected server + (follower or leader)" --> follower + + follower -- "(1) srv.Forward (to leader)" --> rpcLeader + + raft -- "(3) replicate to a + quorum of followers + wait on their fsm.Apply" --> follower + + rpcLeader --> reads + reads --> memdb + + rpcLeader --> writes + writes -- "(2)" --> raft + + raft -- "(4) write log to disk" --> bolt + raft -- "(5) fsm.Apply + nomad/fsm.go" --> fsm + + fsm -- "(6) txn.Insert" --> memdb + + bolt <-- "Snapshot Persist: nomad/fsm.go + Snapshot Restore: nomad/fsm.go" --> memdb + + + %% notes + + note1("Typical structs to implement + for RPC handlers: + + structs.MyThing + .Diff() + .Copy() + .Merge() + structs.MyThingUpsertRequest + structs.MyThingUpsertResponse + structs.MyThingGetRequest + structs.MyThingGetResponse + structs.MyThingListRequest + structs.MyThingListResponse + structs.MyThingDeleteRequest + structs.MyThingDeleteResponse + + Don't forget to register your new RPC + in nomad/server.go!") + + note1 -.- rpcLeader +``` + + +[RPC Endpoint Checklist]: https://github.com/hashicorp/nomad/blob/main/contributing/checklist-rpc-endpoint.md diff --git a/website/content/docs/concepts/architecture.mdx b/website/content/docs/concepts/architecture.mdx index 96c3babef..bfc5d7216 100644 --- a/website/content/docs/concepts/architecture.mdx +++ b/website/content/docs/concepts/architecture.mdx @@ -69,6 +69,15 @@ either the _desired state_ (jobs) or _actual state_ (clients) changes, Nomad creates a new evaluation to determine if any actions must be taken. An evaluation may result in changes to allocations if necessary. +#### Deployment + +Deployments are the mechanism by which Nomad rolls out changes to cluster state +in a step-by-step fashion. Deployments are only available for Jobs with the type +`service`. When an Evaluation is processed, the scheduler creates only the +number of Allocations permitted by the [`update`][] block and the current state +of the cluster. The Deployment is used to monitor the health of those +Allocations and emit a new Evaluation for the next step of the update. + #### Server Nomad servers are the brains of the cluster. There is a cluster of servers per @@ -161,3 +170,6 @@ are more details available for each of the sub-systems. The [consensus protocol] are all documented in more detail. For other details, either consult the code, ask in IRC or reach out to the mailing list. + + +[`update`]: /docs/job-specification/update