contrib: architecture guide to the drainer (#16569)

The drainer component is fairly complex. As part of upcoming work to fix some of
the drainer's rough edges, document the drainer's architecture from a Nomad
developer perspective.
This commit is contained in:
Tim Gross 2023-03-21 09:17:24 -04:00 committed by GitHub
parent 518fd610b3
commit 1763622dfd
No known key found for this signature in database
GPG key ID: 4AEE18F83AFDEB23

View file

@ -0,0 +1,185 @@
# Architecture: Drainer
The drainer is a component that runs on the leader that services requests from
the [`nomad node drain`][] command and related workflows in the web UI. For a
play-by-play from the user's perspective, see [node drain tutorial][]. This
document describes the internals of the drainer for Nomad developers.
The high-level workflow is that:
* The user sets the drain state of the Client ("Node") in the state store.
* Allocations are migrated according to their `migrate` block.
* The drainer creates a watcher for the Node to fire an event when the work is
done or the drain's deadline is reached.
* The drainer creates watchers for each job's allocs on the Node, to fire
progress events.
Effectively the drainer marks allocations for migration and emits an eval, and
then lets the scheduler take it from there.
## Components
There are four major components of the drainer:
- **`NodeDrainer`**: The entrypoint struct for the [`nomad/drainer`][]
package. This struct runs a top-level event loop that's enabled only on the
leader. It's configured with the three "watcher" interfaces described below.
- **`DrainingNodeWatcher`**: A watcher interface implemented by the
`nodeDrainWatcher` struct in [`watch_nodes.go`][]. Runs a loop that watches
for changes to Nodes. If a Node change transitions a Node to draining, the
`DrainingNodeWatcher` adds the Node to its tracker. It queries the state store
to get the jobs that have allocations running on that Node, and registers
those jobs with the `DrainingJobWatcher`.
- **`DrainingJobWatcher`**: A watcher interface implemented by the
`drainingJobWatcher` struct in [`watch_jobs.go`][]. Runs a loop that watches
jobs registered by the `DrainingNodeWatcher` and allocations in the state
store. The job watcher is where the job's [`migrate`][] block is handled so
that only the correct number of allocations are being drained at a time. The
job watcher exposes two methods that return channels:
- `Drain()` returns a channel that produces a `DrainRequest` for all the
allocs on these jobs that need to be drained. The `NodeDrainer` turns this
request into Raft writes via `AllocUpdateDesiredTransition`.
- `Migrated()` returns a channel that produces slices of allocations that have
completed migration. "Completed" should mean exactly what the end user would
expect; the replacement allocations have been placed, `ephemeral_disk` has
been migrated (if possible), and the old allocation is fully stopped.
- **`DrainDeadlineNotifier`**: A watcher interface implemented by the
`deadlineHeap` struct in [`drain_heap.go`][]. Runs a loop that tracks the
Nodes being drained against their deadline timers. The `NodeDrainer` can watch
the channel returned by the `NextBatch` method to get slices of Nodes that
have failed to complete their migrations by the deadline.
There is also a collection of other minor components important to understanding
the workflow:
- **Raft shims**: Because the `nomad/drainer` package is not in the same package
as the server code, the server configures the `NodeDrainer` with shim
functions that close over the small set of Raft apply functions the drainer
needs. For this reason they are located in [`nomad/drainer_shims.go`][] rather
than the `nomad/drainer` package.
- `AllocUpdateDesiredTransition` includes allocation desired status changes
and the evaluations that will need to be processed.
- `NodesDrainComplete` includes updates for the drained Node.
- **`drainingNode`**: This struct represents the state of a single Node whose
drain is being tracked. Created by the `DrainingNodeWatcher` whenever a
Node is marked for draining in the state store.
- **`DrainRequest`**: This struct represents a set of allocations that should be
marked for drain. Created by `DrainingJobWatcher` whenever it receives a job
to drain.
_A note on code style:_ the drainer is implemented with an unusual amount of
dependency injection via factory functions that return interfaces because it has
to handle state and raft writes without being in the top-level `nomad` package
itself. It also can't import the top-level `nomad` package because the drainer
is instantiated by the server, and that would create a circular
import. Generally speaking we don't want to emulate this style elsewhere in
Nomad because it makes implementation harder to follow, but it makes sense in
this limited case.
## Events
The components combine into three high-level flow of events. The first is the
flow of a newly draining Node. The `NodeWatcher` gets the Node from a blocking
query. It registers the job with the `JobWatcher`. The `JobWatcher` determines
which allocations need draining. These are polled from the `Drain()` channel by
`NodeDrainer` and written to raft via the `AllocUpdateDesiredTransition`
shim. Then the scheduler and clients picks up the changes.
```mermaid
flowchart TD
%% entities
clients
scheduler
user(user)
NodeDrainer([NodeDrainer])
NodeWatcher([DrainingNodeWatcher])
JobWatcher([DrainingJobWatcher])
StateStore([state store])
%% style classes
classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
classDef other fill:#d5f6ea,stroke:#1d9467
class user,clients,scheduler,StateStore other;
class NodeDrainer,NodeWatcher,JobWatcher component;
user -. "1. enable drain\nfor node" .-> StateStore
StateStore -- "2. blocking query for\nnewly draining nodes" --> NodeWatcher
NodeWatcher -- "3. RegisterJobs(jobs)" --> JobWatcher
JobWatcher -- "4. Drain(): allocs for job that need draining" --> NodeDrainer
NodeDrainer -- "5. AllocUpdateDesiredTransition\n(raft shim)" --> StateStore
StateStore -. 6. EvalDequeue .-> scheduler
StateStore -. 7. GetAllocs .-> clients
```
The second is when allocation migrations are complete. The clients update the
state of the migrated allocs. The `JobWatcher` has a blocking query that detects
these changes. Allocs that are done migrating get sent on the `Migrated()`
channel that's polled by the `NodeDrainer`. The `NodeDrainer` determines whether
the Node is done being drained, and writes an update via the
`NodesDrainComplete` raft shim.
```mermaid
flowchart TD
%% entities
clients
StateStore([state store])
NodeDrainer([NodeDrainer])
JobWatcher([DrainingJobWatcher])
%% style classes
classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
classDef other fill:#d5f6ea,stroke:#1d9467
class clients,StateStore other;
class NodeDrainer,JobWatcher component;
clients -. "1. UpdateAlloc" .-> StateStore
StateStore -- "2. blocking query\nfor allocs" --> JobWatcher
JobWatcher -- "3. Migrated(): allocs that are done" --> NodeDrainer
NodeDrainer -- "4. NodesDrainComplete\n(raft shim)" --> StateStore
```
And the third is when Nodes pass their deadline. The `NodeWatcher` is
responsible for adding and removing the watch in the `DeadlineNotifier`. The
`DeadlineNotifier` is responsible for watching the timer. If the Node isn't
removed before the deadline, the `DeadlineNotifier` tells the `NodeDrainer` and
the `NodeDrainer` updates the state via the `NodesDrainComplete` shim. At this
point the remaining allocs will be forced to shutdown immediately.
```mermaid
flowchart TD
%% entities
NodeDrainer([NodeDrainer])
NodeWatcher([DrainingNodeWatcher])
DeadlineNotifier([DrainDeadlineNotifier])
StateStore([state store])
%% style classes
classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
classDef other fill:#d5f6ea,stroke:#1d9467
class StateStore other;
class NodeDrainer,NodeWatcher,DeadlineNotifier component;
NodeWatcher -- "1. Watch()" --> DeadlineNotifier
NodeWatcher -- "2a. Remove()" --> DeadlineNotifier
DeadlineNotifier -- "2b. watch the clock" --> DeadlineNotifier
DeadlineNotifier -- "3. node has passed deadline" --> NodeDrainer
NodeDrainer -- "4. NodesDrainComplete\n(raft shim)" --> StateStore
```
[`nomad node drain`]: https://developer.hashicorp.com/nomad/docs/commands/node/drain
[node drain tutorial]: https://developer.hashicorp.com/nomad/tutorials/manage-clusters/node-drain
[`nomad/drainer`]: https://github.com/hashicorp/nomad/tree/main/nomad/drainer
[`watch_nodes.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/watch_nodes.go
[`watch_jobs.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/watch_jobs.go
[`drain_heap.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/drain_heap.go
[`nomad/drainer_shims.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer_shims.go
[`migrate`]: https://developer.hashicorp.com/nomad/docs/job-specification/migrate