contrib: architecture guide to the drainer (#16569)

The drainer component is fairly complex. As part of upcoming work to fix some of the drainer's rough edges, document the drainer's architecture from a Nomad developer perspective.
2023-03-21 09:17:24 -04:00 · 1763622dfd
parent 518fd610b3
commit 1763622dfd
1 changed file with 185 additions and 0 deletions
--- a/contributing/architecture-drainer.md
+++ b/contributing/architecture-drainer.md
@@ -0,0 +1,185 @@
+# Architecture: Drainer
+
+The drainer is a component that runs on the leader and services requests from
+the [`nomad node drain`][] command and related workflows in the web UI. For a
+play-by-play from the user's perspective, see the [node drain tutorial][]. This
+document describes the internals of the drainer for Nomad developers.
+
+The high-level workflow is:
+* The user sets the drain state of the Client ("Node") in the state store.
+* Allocations are migrated according to their `migrate` block.
+* The drainer creates a watcher for the Node to fire an event when the work is
+  done or the drain's deadline is reached.
+* The drainer creates watchers for each job's allocs on the Node, to fire
+  progress events.
+
+Effectively the drainer marks allocations for migration and emits an eval, and
+then lets the scheduler take it from there.
+
+## Components
+
+There are four major components of the drainer:
+
+- **`NodeDrainer`**: The entrypoint struct for the [`nomad/drainer`][]
+  package. This struct runs a top-level event loop that's enabled only on the
+  leader. It's configured with the three "watcher" interfaces described below.
+
+- **`DrainingNodeWatcher`**: A watcher interface implemented by the
+  `nodeDrainWatcher` struct in [`watch_nodes.go`][]. Runs a loop that watches
+  for changes to Nodes. If a Node change transitions a Node to draining, the
+  `DrainingNodeWatcher` adds the Node to its tracker. It queries the state store
+  to get the jobs that have allocations running on that Node, and registers
+  those jobs with the `DrainingJobWatcher`.
+
+- **`DrainingJobWatcher`**: A watcher interface implemented by the
+  `drainingJobWatcher` struct in [`watch_jobs.go`][]. Runs a loop that watches
+  jobs registered by the `DrainingNodeWatcher` and allocations in the state
+  store. The job watcher is where the job's [`migrate`][] block is handled so
+  that only the correct number of allocations is drained at a time. The
+  job watcher exposes two methods that return channels:
+
+  - `Drain()` returns a channel that produces a `DrainRequest` for all the
+    allocs on these jobs that need to be drained. The `NodeDrainer` turns this
+    request into Raft writes via `AllocUpdateDesiredTransition`.
+
+  - `Migrated()` returns a channel that produces slices of allocations that have
+    completed migration. "Completed" should mean exactly what the end user would
+    expect: the replacement allocations have been placed, `ephemeral_disk` has
+    been migrated (if possible), and the old allocation is fully stopped.
+
+- **`DrainDeadlineNotifier`**: A watcher interface implemented by the
+  `deadlineHeap` struct in [`drain_heap.go`][]. Runs a loop that tracks the
+  Nodes being drained against their deadline timers. The `NodeDrainer` can watch
+  the channel returned by the `NextBatch` method to get slices of Nodes that
+  have failed to complete their migrations by the deadline.
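
The watcher interfaces above can be sketched roughly as follows. This is a minimal illustration and not the code in `nomad/drainer`: the method names `RegisterJobs`, `Drain`, `Migrated`, `Watch`, `Remove`, and `NextBatch` come from this document, while the parameter types, the `Alloc` struct, and the `fakeJobWatcher` implementation are assumptions made up so the example is self-contained.

```go
package main

import (
	"fmt"
	"time"
)

// Alloc and DrainRequest are simplified stand-ins (assumptions) for the real
// structs used by the drainer.
type Alloc struct {
	ID     string
	NodeID string
}

type DrainRequest struct {
	Allocs []*Alloc
}

// DrainingJobWatcher: jobs go in, drain requests and migrated allocs come out.
type DrainingJobWatcher interface {
	RegisterJobs(jobIDs []string)
	Drain() <-chan *DrainRequest
	Migrated() <-chan []*Alloc
}

// DrainDeadlineNotifier: tracks Nodes against their deadlines and reports
// batches of Node IDs whose deadline has passed.
type DrainDeadlineNotifier interface {
	Watch(nodeID string, deadline time.Time)
	Remove(nodeID string)
	NextBatch() <-chan []string
}

// fakeJobWatcher is a hypothetical in-memory implementation used only to
// show how a consumer reads from the channels.
type fakeJobWatcher struct {
	drainCh    chan *DrainRequest
	migratedCh chan []*Alloc
}

func newFakeJobWatcher() *fakeJobWatcher {
	return &fakeJobWatcher{
		drainCh:    make(chan *DrainRequest, 1),
		migratedCh: make(chan []*Alloc, 1),
	}
}

// RegisterJobs pretends every registered job has exactly one alloc to drain.
func (w *fakeJobWatcher) RegisterJobs(jobIDs []string) {
	req := &DrainRequest{}
	for _, id := range jobIDs {
		req.Allocs = append(req.Allocs, &Alloc{ID: id + "-alloc", NodeID: "node1"})
	}
	w.drainCh <- req
}

func (w *fakeJobWatcher) Drain() <-chan *DrainRequest { return w.drainCh }
func (w *fakeJobWatcher) Migrated() <-chan []*Alloc   { return w.migratedCh }

func main() {
	var w DrainingJobWatcher = newFakeJobWatcher()
	w.RegisterJobs([]string{"web", "cache"})
	req := <-w.Drain()
	fmt.Println("allocs to drain:", len(req.Allocs))
}
```

Returning receive-only channels keeps each watcher's internal loop private while letting the `NodeDrainer` consume its output from a single `select`.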
+
+There is also a collection of other minor components important to understanding
+the workflow:
+
+- **Raft shims**: Because the `nomad/drainer` package is not in the same package
+  as the server code, the server configures the `NodeDrainer` with shim
+  functions that close over the small set of Raft apply functions the drainer
+  needs. For this reason they are located in [`nomad/drainer_shims.go`][] rather
+  than the `nomad/drainer` package.
+
+  - `AllocUpdateDesiredTransition` includes allocation desired status changes
+    and the evaluations that will need to be processed.
+
+  - `NodesDrainComplete` includes updates for the drained Node.
+
+- **`drainingNode`**: This struct represents the state of a single Node whose
+  drain is being tracked. Created by the `DrainingNodeWatcher` whenever a
+  Node is marked for draining in the state store.
+
+- **`DrainRequest`**: This struct represents a set of allocations that should be
+  marked for drain. Created by `DrainingJobWatcher` whenever it receives a job
+  to drain.
+
+_A note on code style:_ the drainer is implemented with an unusual amount of
+dependency injection via factory functions that return interfaces because it has
+to handle state and Raft writes without being in the top-level `nomad` package
+itself. It also can't import the top-level `nomad` package because the drainer
+is instantiated by the server, and that would create a circular
+import. Generally speaking we don't want to emulate this style elsewhere in
+Nomad because it makes the implementation harder to follow, but it makes sense
+in this limited case.
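
To make the injection pattern concrete, here is a minimal sketch of a shim that closes over a server's Raft apply function. Everything except the names `NodeDrainer` and `AllocUpdateDesiredTransition` is hypothetical (`Config`, `Server`, `raftApply`, the function signature), and the real shims in `nomad/drainer_shims.go` may be shaped differently. Both "packages" are collapsed into one file so the example runs on its own.

```go
package main

import "fmt"

// --- what would live in the drainer package (no import of the server) ---

// AllocTransitionFunc is an assumed signature for the Raft shim: it takes the
// IDs of allocs to mark for migration and returns the resulting Raft index.
type AllocTransitionFunc func(allocIDs []string) (index uint64, err error)

// Config carries the injected dependencies.
type Config struct {
	AllocUpdateDesiredTransition AllocTransitionFunc
}

type NodeDrainer struct {
	applyTransitions AllocTransitionFunc
}

func NewNodeDrainer(c *Config) *NodeDrainer {
	return &NodeDrainer{applyTransitions: c.AllocUpdateDesiredTransition}
}

// drainAllocs writes through the injected shim without knowing about Raft.
func (n *NodeDrainer) drainAllocs(allocIDs []string) (uint64, error) {
	return n.applyTransitions(allocIDs)
}

// --- what would live in the server package ---

// Server is a toy stand-in; raftApply just records the message.
type Server struct {
	raftIndex uint64
	log       []string
}

func (s *Server) raftApply(msg string) uint64 {
	s.raftIndex++
	s.log = append(s.log, msg)
	return s.raftIndex
}

// allocTransitionShim returns a closure over the server's Raft apply.
func (s *Server) allocTransitionShim() AllocTransitionFunc {
	return func(allocIDs []string) (uint64, error) {
		return s.raftApply(fmt.Sprintf("transition %v", allocIDs)), nil
	}
}

func main() {
	srv := &Server{}
	d := NewNodeDrainer(&Config{AllocUpdateDesiredTransition: srv.allocTransitionShim()})
	idx, err := d.drainAllocs([]string{"a1", "a2"})
	fmt.Println(idx, err, srv.log)
}
```

The drainer side depends only on a function type, so the import edge points from the server to the drainer and never back.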
+
+## Events
+
+The components combine into three high-level flows of events. The first is the
+flow of a newly draining Node. The `NodeWatcher` gets the Node from a blocking
+query. It registers the Node's jobs with the `JobWatcher`. The `JobWatcher`
+determines which allocations need draining. These are polled from the `Drain()`
+channel by `NodeDrainer` and written to Raft via the
+`AllocUpdateDesiredTransition` shim. Then the scheduler and clients pick up the
+changes.
+
+```mermaid
+flowchart TD
+    %% entities
+    clients
+    scheduler
+    user(user)
+    NodeDrainer([NodeDrainer])
+    NodeWatcher([DrainingNodeWatcher])
+    JobWatcher([DrainingJobWatcher])
+    StateStore([state store])
+
+    %% style classes
+    classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
+    classDef other fill:#d5f6ea,stroke:#1d9467
+    class user,clients,scheduler,StateStore other;
+    class NodeDrainer,NodeWatcher,JobWatcher component;
+
+    user -. "1. enable drain\nfor node" .-> StateStore
+    StateStore -- "2. blocking query for\nnewly draining nodes" --> NodeWatcher
+    NodeWatcher -- "3. RegisterJobs(jobs)" --> JobWatcher
+    JobWatcher -- "4. Drain(): allocs for job that need draining" --> NodeDrainer
+    NodeDrainer -- "5. AllocUpdateDesiredTransition\n(raft shim)" --> StateStore
+
+    StateStore -. 6. EvalDequeue .-> scheduler
+    StateStore -. 7. GetAllocs .-> clients
+```
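
The step where the `JobWatcher` determines which allocations need draining is where the `migrate` block limits how many allocs move at once. The function below is a simplified model of that limit and not the logic in `watch_jobs.go`: `numToDrain` and its parameters are hypothetical names, and it assumes the only input from the `migrate` block is `max_parallel`.

```go
package main

import "fmt"

// numToDrain models how many allocs of one task group may be marked for
// migration right now (a sketch under assumptions, not Nomad's code).
//
//	count       desired number of allocs in the task group
//	healthy     allocs of the group that are currently healthy
//	maxParallel the migrate block's max_parallel value
//	drainable   allocs on draining Nodes that are not yet marked
func numToDrain(count, healthy, maxParallel, drainable int) int {
	// Keep at least (count - maxParallel) healthy allocs running.
	threshold := count - maxParallel
	n := healthy - threshold
	if n > drainable {
		n = drainable
	}
	if n < 0 {
		n = 0
	}
	return n
}

func main() {
	// 5 desired, all healthy, max_parallel = 1, 3 allocs on draining Nodes.
	fmt.Println(numToDrain(5, 5, 1, 3))
	// One alloc is mid-migration, so only 4 are healthy: wait.
	fmt.Println(numToDrain(5, 4, 1, 2))
}
```

In this model the watcher releases another alloc only after a replacement becomes healthy, which is why it also watches allocations in the state store.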
+
+The second is when allocation migrations are complete. The clients update the
+state of the migrated allocs. The `JobWatcher` has a blocking query that detects
+these changes. Allocs that are done migrating get sent on the `Migrated()`
+channel that's polled by the `NodeDrainer`. The `NodeDrainer` determines whether
+the Node is done being drained, and writes an update via the
+`NodesDrainComplete` Raft shim.
+
+```mermaid
+flowchart TD
+    %% entities
+    clients
+    StateStore([state store])
+    NodeDrainer([NodeDrainer])
+    JobWatcher([DrainingJobWatcher])
+
+    %% style classes
+    classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
+    classDef other fill:#d5f6ea,stroke:#1d9467
+    class clients,StateStore other;
+    class NodeDrainer,JobWatcher component;
+
+    clients -. "1. UpdateAlloc" .-> StateStore
+    StateStore -- "2. blocking query\nfor allocs" --> JobWatcher
+    JobWatcher -- "3. Migrated(): allocs that are done" --> NodeDrainer
+    NodeDrainer -- "4. NodesDrainComplete\n(raft shim)" --> StateStore
+```
+
+And the third is when Nodes pass their deadline. The `NodeWatcher` is
+responsible for adding and removing the watch in the `DeadlineNotifier`. The
+`DeadlineNotifier` is responsible for watching the timer. If the Node isn't
+removed before the deadline, the `DeadlineNotifier` tells the `NodeDrainer` and
+the `NodeDrainer` updates the state via the `NodesDrainComplete` shim. At this
+point the remaining allocs will be forced to shut down immediately.
+
+```mermaid
+flowchart TD
+    %% entities
+    NodeDrainer([NodeDrainer])
+    NodeWatcher([DrainingNodeWatcher])
+    DeadlineNotifier([DrainDeadlineNotifier])
+    StateStore([state store])
+
+    %% style classes
+    classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
+    classDef other fill:#d5f6ea,stroke:#1d9467
+    class StateStore other;
+    class NodeDrainer,NodeWatcher,DeadlineNotifier component;
+
+    NodeWatcher -- "1. Watch()" --> DeadlineNotifier
+    NodeWatcher -- "2a. Remove()" --> DeadlineNotifier
+    DeadlineNotifier -- "2b. watch the clock" --> DeadlineNotifier
+    DeadlineNotifier -- "3. node has passed deadline" --> NodeDrainer
+    NodeDrainer -- "4. NodesDrainComplete\n(raft shim)" --> StateStore
+```
+
+[`nomad node drain`]: https://developer.hashicorp.com/nomad/docs/commands/node/drain
+[node drain tutorial]: https://developer.hashicorp.com/nomad/tutorials/manage-clusters/node-drain
+[`nomad/drainer`]: https://github.com/hashicorp/nomad/tree/main/nomad/drainer
+[`watch_nodes.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/watch_nodes.go
+[`watch_jobs.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/watch_jobs.go
+[`drain_heap.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/drain_heap.go
+[`nomad/drainer_shims.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer_shims.go
+[`migrate`]: https://developer.hashicorp.com/nomad/docs/job-specification/migrate