From 1763622dfd74c3ee511bda8e58e41381da668d88 Mon Sep 17 00:00:00 2001
From: Tim Gross
Date: Tue, 21 Mar 2023 09:17:24 -0400
Subject: [PATCH] contrib: architecture guide to the drainer (#16569)

The drainer component is fairly complex. As part of upcoming work to fix some of
the drainer's rough edges, document the drainer's architecture from a Nomad
developer perspective.
---
 contributing/architecture-drainer.md | 185 +++++++++++++++++++++++++++
 1 file changed, 185 insertions(+)
 create mode 100644 contributing/architecture-drainer.md

diff --git a/contributing/architecture-drainer.md b/contributing/architecture-drainer.md
new file mode 100644
index 000000000..3c33d1c23
--- /dev/null
+++ b/contributing/architecture-drainer.md
@@ -0,0 +1,185 @@
+# Architecture: Drainer
+
+The drainer is a component that runs on the leader and services requests from
+the [`nomad node drain`][] command and related workflows in the web UI. For a
+play-by-play from the user's perspective, see the [node drain tutorial][]. This
+document describes the internals of the drainer for Nomad developers.
+
+The high-level workflow is:
+* The user sets the drain state of the Client ("Node") in the state store.
+* Allocations are migrated according to their `migrate` block.
+* The drainer creates a watcher for the Node to fire an event when the work is
+  done or the drain's deadline is reached.
+* The drainer creates watchers for each job's allocs on the Node, to fire
+  progress events.
+
+Effectively the drainer marks allocations for migration and emits an eval, and
+then lets the scheduler take it from there.
+
+## Components
+
+There are four major components of the drainer:
+
+- **`NodeDrainer`**: The entrypoint struct for the [`nomad/drainer`][]
+  package. This struct runs a top-level event loop that's enabled only on the
+  leader. It's configured with the three "watcher" interfaces described below.
+
+- **`DrainingNodeWatcher`**: A watcher interface implemented by the
+  `nodeDrainWatcher` struct in [`watch_nodes.go`][]. Runs a loop that watches
+  for changes to Nodes. If a Node change transitions a Node to draining, the
+  `DrainingNodeWatcher` adds the Node to its tracker. It queries the state store
+  to get the jobs that have allocations running on that Node, and registers
+  those jobs with the `DrainingJobWatcher`.
+
+- **`DrainingJobWatcher`**: A watcher interface implemented by the
+  `drainingJobWatcher` struct in [`watch_jobs.go`][]. Runs a loop that watches
+  jobs registered by the `DrainingNodeWatcher` and allocations in the state
+  store. The job watcher is where the job's [`migrate`][] block is handled so
+  that only the correct number of allocations are being drained at a time. The
+  job watcher exposes two methods that return channels:
+
+  - `Drain()` returns a channel that produces a `DrainRequest` for all the
+    allocs on these jobs that need to be drained. The `NodeDrainer` turns this
+    request into Raft writes via `AllocUpdateDesiredTransition`.
+
+  - `Migrated()` returns a channel that produces slices of allocations that have
+    completed migration. "Completed" should mean exactly what the end user would
+    expect: the replacement allocations have been placed, `ephemeral_disk` has
+    been migrated (if possible), and the old allocation is fully stopped.
+
+- **`DrainDeadlineNotifier`**: A watcher interface implemented by the
+  `deadlineHeap` struct in [`drain_heap.go`][]. Runs a loop that tracks the
+  Nodes being drained against their deadline timers. The `NodeDrainer` can watch
+  the channel returned by the `NextBatch` method to get slices of Nodes that
+  have failed to complete their migrations by the deadline.
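The way these four components fit together can be sketched as a single `select` loop over the watcher channels. The following Go program is a minimal, hypothetical model and not the actual `nomad/drainer` code: the `watchers` and `shims` structs, the `run` and `simulate` functions, and the simplified `Alloc` and `DrainRequest` types are assumptions made for illustration, and the per-Node `remaining` counter stands in for the state store queries the real drainer would make.

```go
package main

import "fmt"

// Alloc is a simplified allocation: just an ID and the Node it runs on.
type Alloc struct {
	ID     string
	NodeID string
}

// DrainRequest is a simplified stand-in for the drainer's DrainRequest.
type DrainRequest struct {
	Allocs []Alloc
}

// watchers bundles the channels that the watcher interfaces expose. The
// comments name the methods from the text; the field types are assumed.
type watchers struct {
	drain     chan DrainRequest // DrainingJobWatcher.Drain()
	migrated  chan []Alloc      // DrainingJobWatcher.Migrated()
	nextBatch chan []string     // DrainDeadlineNotifier.NextBatch(), Node IDs
}

// shims records what would be written through the Raft shims.
type shims struct {
	transitions []string // alloc IDs for AllocUpdateDesiredTransition
	completed   []string // Node IDs for NodesDrainComplete
}

// run is the event loop. It multiplexes over the watcher channels until stop
// is closed. remaining maps a Node ID to its count of allocs still migrating.
func run(w watchers, s *shims, remaining map[string]int, stop <-chan struct{}) {
	for {
		select {
		case req := <-w.drain:
			// Mark the requested allocs for migration.
			for _, a := range req.Allocs {
				s.transitions = append(s.transitions, a.ID)
			}
		case allocs := <-w.migrated:
			// A Node is done once its last alloc has migrated.
			for _, a := range allocs {
				remaining[a.NodeID]--
				if remaining[a.NodeID] == 0 {
					s.completed = append(s.completed, a.NodeID)
				}
			}
		case nodes := <-w.nextBatch:
			// Deadline passed: the drain is marked complete regardless.
			s.completed = append(s.completed, nodes...)
		case <-stop:
			return
		}
	}
}

// simulate drives the loop with one drain request, one migration, and one
// deadline batch, then returns the recorded writes.
func simulate() *shims {
	w := watchers{
		drain:     make(chan DrainRequest),
		migrated:  make(chan []Alloc),
		nextBatch: make(chan []string),
	}
	s := &shims{}
	stop, done := make(chan struct{}), make(chan struct{})
	go func() {
		defer close(done)
		run(w, s, map[string]int{"node-1": 1, "node-2": 1}, stop)
	}()

	a := Alloc{ID: "alloc-1", NodeID: "node-1"}
	w.drain <- DrainRequest{Allocs: []Alloc{a}}
	w.migrated <- []Alloc{a}
	w.nextBatch <- []string{"node-2"}
	close(stop)
	<-done // wait for the loop to exit before reading the records
	return s
}

func main() {
	s := simulate()
	fmt.Println("transitions:", s.transitions)
	fmt.Println("completed:", s.completed)
}
```

In this model, `node-1` is marked complete because its only alloc finished migrating, while `node-2` is marked complete because its deadline passed. Both paths end in the same `NodesDrainComplete` write.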
+
+There is also a collection of other minor components important to understanding
+the workflow:
+
+- **Raft shims**: Because the `nomad/drainer` package is not in the same package
+  as the server code, the server configures the `NodeDrainer` with shim
+  functions that close over the small set of Raft apply functions the drainer
+  needs. For this reason they are located in [`nomad/drainer_shims.go`][] rather
+  than the `nomad/drainer` package.
+
+  - `AllocUpdateDesiredTransition` includes allocation desired status changes
+    and the evaluations that will need to be processed.
+
+  - `NodesDrainComplete` includes updates for the drained Node.
+
+- **`drainingNode`**: This struct represents the state of a single Node whose
+  drain is being tracked. Created by the `DrainingNodeWatcher` whenever a
+  Node is marked for draining in the state store.
+
+- **`DrainRequest`**: This struct represents a set of allocations that should be
+  marked for drain. Created by `DrainingJobWatcher` whenever it receives a job
+  to drain.
+
+_A note on code style:_ the drainer is implemented with an unusual amount of
+dependency injection via factory functions that return interfaces because it has
+to handle state and Raft writes without being in the top-level `nomad` package
+itself. It also can't import the top-level `nomad` package because the drainer
+is instantiated by the server, and that would create a circular
+import. Generally speaking we don't want to emulate this style elsewhere in
+Nomad because it makes implementation harder to follow, but it makes sense in
+this limited case.
+
+## Events
+
+The components combine into three high-level flows of events. The first is the
+flow of a newly draining Node. The `NodeWatcher` gets the Node from a blocking
+query. It registers the Node's jobs with the `JobWatcher`. The `JobWatcher` determines
+which allocations need draining.
These are polled from the `Drain()` channel by
+`NodeDrainer` and written to Raft via the `AllocUpdateDesiredTransition`
+shim. Then the scheduler and clients pick up the changes.
+
+```mermaid
+flowchart TD
+    %% entities
+    clients
+    scheduler
+    user(user)
+    NodeDrainer([NodeDrainer])
+    NodeWatcher([DrainingNodeWatcher])
+    JobWatcher([DrainingJobWatcher])
+    StateStore([state store])
+
+    %% style classes
+    classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
+    classDef other fill:#d5f6ea,stroke:#1d9467
+    class user,clients,scheduler,StateStore other;
+    class NodeDrainer,NodeWatcher,JobWatcher component;
+
+    user -. "1. enable drain\nfor node" .-> StateStore
+    StateStore -- "2. blocking query for\nnewly draining nodes" --> NodeWatcher
+    NodeWatcher -- "3. RegisterJobs(jobs)" --> JobWatcher
+    JobWatcher -- "4. Drain(): allocs for job that need draining" --> NodeDrainer
+    NodeDrainer -- "5. AllocUpdateDesiredTransition\n(raft shim)" --> StateStore
+
+    StateStore -. 6. EvalDequeue .-> scheduler
+    StateStore -. 7. GetAllocs .-> clients
+```
+
+The second is when allocation migrations are complete. The clients update the
+state of the migrated allocs. The `JobWatcher` has a blocking query that detects
+these changes. Allocs that are done migrating get sent on the `Migrated()`
+channel that's polled by the `NodeDrainer`. The `NodeDrainer` determines whether
+the Node is done being drained, and writes an update via the
+`NodesDrainComplete` Raft shim.
+
+```mermaid
+flowchart TD
+    %% entities
+    clients
+    StateStore([state store])
+    NodeDrainer([NodeDrainer])
+    JobWatcher([DrainingJobWatcher])
+
+    %% style classes
+    classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
+    classDef other fill:#d5f6ea,stroke:#1d9467
+    class clients,StateStore other;
+    class NodeDrainer,JobWatcher component;
+
+    clients -. "1. UpdateAlloc" .-> StateStore
+    StateStore -- "2. blocking query\nfor allocs" --> JobWatcher
+    JobWatcher -- "3. 
Migrated(): allocs that are done" --> NodeDrainer
+    NodeDrainer -- "4. NodesDrainComplete\n(raft shim)" --> StateStore
+```
+
+The third is when Nodes pass their deadline. The `NodeWatcher` is
+responsible for adding and removing the watch in the `DeadlineNotifier`. The
+`DeadlineNotifier` is responsible for watching the timer. If the Node isn't
+removed before the deadline, the `DeadlineNotifier` tells the `NodeDrainer` and
+the `NodeDrainer` updates the state via the `NodesDrainComplete` shim. At this
+point the remaining allocs will be forced to shut down immediately.
+
+```mermaid
+flowchart TD
+    %% entities
+    NodeDrainer([NodeDrainer])
+    NodeWatcher([DrainingNodeWatcher])
+    DeadlineNotifier([DrainDeadlineNotifier])
+    StateStore([state store])
+
+    %% style classes
+    classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
+    classDef other fill:#d5f6ea,stroke:#1d9467
+    class StateStore other;
+    class NodeDrainer,NodeWatcher,DeadlineNotifier component;
+
+    NodeWatcher -- "1. Watch()" --> DeadlineNotifier
+    NodeWatcher -- "2a. Remove()" --> DeadlineNotifier
+    DeadlineNotifier -- "2b. watch the clock" --> DeadlineNotifier
+    DeadlineNotifier -- "3. node has passed deadline" --> NodeDrainer
+    NodeDrainer -- "4. 
NodesDrainComplete\n(raft shim)" --> StateStore
+```
+
+[`nomad node drain`]: https://developer.hashicorp.com/nomad/docs/commands/node/drain
+[node drain tutorial]: https://developer.hashicorp.com/nomad/tutorials/manage-clusters/node-drain
+[`nomad/drainer`]: https://github.com/hashicorp/nomad/tree/main/nomad/drainer
+[`watch_nodes.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/watch_nodes.go
+[`watch_jobs.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/watch_jobs.go
+[`drain_heap.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/drain_heap.go
+[`nomad/drainer_shims.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer_shims.go
+[`migrate`]: https://developer.hashicorp.com/nomad/docs/job-specification/migrate
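The deadline flow in the last diagram can be modeled without real timers. The sketch below is a hypothetical, simplified stand-in for the `DrainDeadlineNotifier`, not the `deadlineHeap` implementation: the `deadlineTracker` type, its `Expired` method, and the `demo` function are assumptions for illustration. `Watch` and `Remove` correspond to steps 1 and 2a in the diagram, and `Expired` takes the current time as an argument in place of a timer firing and a batch arriving on the `NextBatch` channel.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// deadlineTracker is a hypothetical, timer-free stand-in for the
// DrainDeadlineNotifier: it maps Node IDs to drain deadlines.
type deadlineTracker struct {
	deadlines map[string]time.Time
}

func newDeadlineTracker() *deadlineTracker {
	return &deadlineTracker{deadlines: make(map[string]time.Time)}
}

// Watch starts tracking a Node's deadline (step 1 in the diagram).
func (d *deadlineTracker) Watch(nodeID string, deadline time.Time) {
	d.deadlines[nodeID] = deadline
}

// Remove stops tracking a Node, for example because its drain finished
// before the deadline (step 2a in the diagram).
func (d *deadlineTracker) Remove(nodeID string) {
	delete(d.deadlines, nodeID)
}

// Expired returns, in sorted order, the Nodes whose deadline is at or before
// now, and stops tracking them. Passing now in keeps the sketch deterministic.
func (d *deadlineTracker) Expired(now time.Time) []string {
	var batch []string
	for id, deadline := range d.deadlines {
		if !deadline.After(now) {
			batch = append(batch, id)
			delete(d.deadlines, id) // deleting during range is safe in Go
		}
	}
	sort.Strings(batch)
	return batch
}

// demo watches three Nodes, removes one that drained in time, and asks which
// Nodes have passed their deadline five minutes after the start.
func demo() []string {
	start := time.Date(2023, 3, 21, 9, 0, 0, 0, time.UTC)
	d := newDeadlineTracker()
	d.Watch("node-1", start.Add(1*time.Minute))
	d.Watch("node-2", start.Add(2*time.Minute))
	d.Watch("node-3", start.Add(10*time.Minute))
	d.Remove("node-1") // drained before its deadline
	return d.Expired(start.Add(5 * time.Minute))
}

func main() {
	fmt.Println("past deadline:", demo())
}
```

Only `node-2` is reported: `node-1` was removed before its deadline and `node-3` still has time left. In the flow described above, the `NodeDrainer` would then pass such a batch to the `NodesDrainComplete` shim.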