contrib: architecture guide to the drainer (#16569)
The drainer component is fairly complex. As part of upcoming work to fix some of the drainer's rough edges, document the drainer's architecture from a Nomad developer perspective.
contributing/architecture-drainer.md
# Architecture: Drainer

The drainer is a component that runs on the leader and services requests from
the [`nomad node drain`][] command and related workflows in the web UI. For a
play-by-play from the user's perspective, see the [node drain tutorial][]. This
document describes the internals of the drainer for Nomad developers.

The high-level workflow is:
* The user sets the drain state of the Client ("Node") in the state store.
* Allocations are migrated according to their `migrate` block.
* The drainer creates a watcher for the Node to fire an event when the work is
  done or the drain's deadline is reached.
* The drainer creates watchers for each job's allocs on the Node, to fire
  progress events.

Effectively the drainer marks allocations for migration and emits an eval, and
then lets the scheduler take it from there.

## Components

There are four major components of the drainer:

- **`NodeDrainer`**: The entrypoint struct for the [`nomad/drainer`][]
  package. This struct runs a top-level event loop that's enabled only on the
  leader. It's configured with the three "watcher" interfaces described below.

- **`DrainingNodeWatcher`**: A watcher interface implemented by the
  `nodeDrainWatcher` struct in [`watch_nodes.go`][]. Runs a loop that watches
  for changes to Nodes. If a change transitions a Node to draining, the
  `DrainingNodeWatcher` adds the Node to its tracker. It queries the state store
  to get the jobs that have allocations running on that Node, and registers
  those jobs with the `DrainingJobWatcher`.

- **`DrainingJobWatcher`**: A watcher interface implemented by the
  `drainingJobWatcher` struct in [`watch_jobs.go`][]. Runs a loop that watches
  jobs registered by the `DrainingNodeWatcher` and allocations in the state
  store. The job watcher is where the job's [`migrate`][] block is handled so
  that only the correct number of allocations are being drained at a time. The
  job watcher exposes two methods that return channels:

  - `Drain()` returns a channel that produces a `DrainRequest` for all the
    allocs on these jobs that need to be drained. The `NodeDrainer` turns this
    request into Raft writes via `AllocUpdateDesiredTransition`.

  - `Migrated()` returns a channel that produces slices of allocations that have
    completed migration. "Completed" should mean exactly what the end user would
    expect: the replacement allocations have been placed, `ephemeral_disk` has
    been migrated (if possible), and the old allocation is fully stopped.

- **`DrainDeadlineNotifier`**: A watcher interface implemented by the
  `deadlineHeap` struct in [`drain_heap.go`][]. Runs a loop that tracks the
  Nodes being drained against their deadline timers. The `NodeDrainer` can watch
  the channel returned by the `NextBatch` method to get slices of Nodes that
  have failed to complete their migrations by the deadline.
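
To make the `migrate` throttling handled by the job watcher more concrete, here is a minimal sketch of the arithmetic involved. It assumes a task group with a desired count and a `max_parallel` setting; the function name `numToDrain` and its parameters are hypothetical and chosen for illustration, not taken from `watch_jobs.go`.

```go
package main

import "fmt"

// numToDrain is a hypothetical helper that decides how many allocations of a
// task group may be marked for migration right now.
//
//   groupCount  - desired number of allocations for the task group
//   maxParallel - the migrate block's max_parallel value
//   healthy     - allocations of the group that are currently healthy
//   drainable   - allocations on draining Nodes that still need to move
//
// The idea is to keep at least (groupCount - maxParallel) allocations healthy
// at all times, so only the surplus above that threshold may be drained.
func numToDrain(groupCount, maxParallel, healthy, drainable int) int {
	threshold := groupCount - maxParallel
	n := healthy - threshold
	if n > drainable {
		n = drainable
	}
	if n < 0 {
		n = 0
	}
	return n
}

func main() {
	// All 5 allocs healthy, max_parallel = 2, 3 allocs on the draining Node.
	fmt.Println(numToDrain(5, 2, 5, 3))
	// One migration still in flight, so only one more slot is available.
	fmt.Println(numToDrain(5, 2, 4, 3))
	// Two migrations in flight: nothing more may be drained yet.
	fmt.Println(numToDrain(5, 2, 3, 3))
}
```

Under these assumptions, each time a replacement allocation becomes healthy another slot opens up, which is why the job watcher needs to keep watching allocations rather than emitting a single request per job.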

There is also a collection of other minor components important to understanding
the workflow:

- **Raft shims**: Because the `nomad/drainer` package is separate from the
  package that holds the server code, the server configures the `NodeDrainer`
  with shim functions that close over the small set of Raft apply functions the
  drainer needs. For this reason they are located in
  [`nomad/drainer_shims.go`][] rather than the `nomad/drainer` package.

  - `AllocUpdateDesiredTransition` includes allocation desired status changes
    and the evaluations that will need to be processed.

  - `NodesDrainComplete` includes updates for the drained Node.

- **`drainingNode`**: This struct represents the state of a single Node whose
  drain is being tracked. Created by the `DrainingNodeWatcher` whenever a
  Node is marked for draining in the state store.

- **`DrainRequest`**: This struct represents a set of allocations that should be
  marked for drain. Created by `DrainingJobWatcher` whenever it receives a job
  to drain.

_A note on code style:_ the drainer is implemented with an unusual amount of
dependency injection via factory functions that return interfaces because it has
to handle state and Raft writes without being in the top-level `nomad` package
itself. It also can't import the top-level `nomad` package because the drainer
is instantiated by the server, and that would create a circular
import. Generally speaking we don't want to emulate this style elsewhere in
Nomad because it makes the implementation harder to follow, but it makes sense
in this limited case.
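
The shape of that dependency injection can be illustrated with a small, self-contained sketch. Everything here is hypothetical and simplified: `raftApplier`, `fakeServer`, `shim`, and `newDrainer` are illustrative names, and the method signature is reduced to a list of allocation IDs rather than the real types. In real code the server and the drainer would live in separate packages; they share one file here only so the example runs on its own.

```go
package main

import "fmt"

// raftApplier is the (hypothetical) interface the drainer depends on. The
// drainer side only knows about this interface, never about the server.
type raftApplier interface {
	AllocUpdateDesiredTransition(allocIDs []string) (uint64, error)
}

// fakeServer stands in for the server, which owns the Raft apply logic.
type fakeServer struct {
	index   uint64
	applied [][]string
}

// raftApply pretends to write to Raft and returns the new index.
func (s *fakeServer) raftApply(allocIDs []string) (uint64, error) {
	s.index++
	s.applied = append(s.applied, allocIDs)
	return s.index, nil
}

// shim adapts a function closed over the server to the raftApplier interface.
type shim struct {
	apply func(allocIDs []string) (uint64, error)
}

func (s shim) AllocUpdateDesiredTransition(allocIDs []string) (uint64, error) {
	return s.apply(allocIDs)
}

// drainer is configured with the interface at construction time.
type drainer struct {
	raft raftApplier
}

func newDrainer(raft raftApplier) *drainer {
	return &drainer{raft: raft}
}

// drain marks the given allocations for migration via the injected shim.
func (d *drainer) drain(allocIDs []string) (uint64, error) {
	return d.raft.AllocUpdateDesiredTransition(allocIDs)
}

func main() {
	srv := &fakeServer{}
	d := newDrainer(shim{apply: srv.raftApply})
	index, err := d.drain([]string{"alloc-1", "alloc-2"})
	fmt.Println(index, err, len(srv.applied))
}
```

Because the dependency points only from the server to the drainer, no circular import arises, at the cost of an extra layer of indirection when reading the code.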

## Events

The components combine into three high-level flows of events. The first is the
flow of a newly draining Node. The `NodeWatcher` gets the Node from a blocking
query. It registers the Node's jobs with the `JobWatcher`. The `JobWatcher`
determines which allocations need draining. These are polled from the `Drain()`
channel by `NodeDrainer` and written to Raft via the
`AllocUpdateDesiredTransition` shim. Then the scheduler and clients pick up the
changes.

```mermaid
flowchart TD
  %% entities
  clients
  scheduler
  user(user)
  NodeDrainer([NodeDrainer])
  NodeWatcher([DrainingNodeWatcher])
  JobWatcher([DrainingJobWatcher])
  StateStore([state store])

  %% style classes
  classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
  classDef other fill:#d5f6ea,stroke:#1d9467
  class user,clients,scheduler,StateStore other;
  class NodeDrainer,NodeWatcher,JobWatcher component;

  user -. "1. enable drain\nfor node" .-> StateStore
  StateStore -- "2. blocking query for\nnewly draining nodes" --> NodeWatcher
  NodeWatcher -- "3. RegisterJobs(jobs)" --> JobWatcher
  JobWatcher -- "4. Drain(): allocs for job that need draining" --> NodeDrainer
  NodeDrainer -- "5. AllocUpdateDesiredTransition\n(raft shim)" --> StateStore

  StateStore -. 6. EvalDequeue .-> scheduler
  StateStore -. 7. GetAllocs .-> clients
```

The second is when allocation migrations are complete. The clients update the
state of the migrated allocs. The `JobWatcher` has a blocking query that detects
these changes. Allocs that are done migrating get sent on the `Migrated()`
channel that's polled by the `NodeDrainer`. The `NodeDrainer` determines whether
the Node is done being drained, and writes an update via the
`NodesDrainComplete` raft shim.

```mermaid
flowchart TD
  %% entities
  clients
  StateStore([state store])
  NodeDrainer([NodeDrainer])
  JobWatcher([DrainingJobWatcher])

  %% style classes
  classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
  classDef other fill:#d5f6ea,stroke:#1d9467
  class clients,StateStore other;
  class NodeDrainer,JobWatcher component;

  clients -. "1. UpdateAlloc" .-> StateStore
  StateStore -- "2. blocking query\nfor allocs" --> JobWatcher
  JobWatcher -- "3. Migrated(): allocs that are done" --> NodeDrainer
  NodeDrainer -- "4. NodesDrainComplete\n(raft shim)" --> StateStore
```
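
The "is this Node done?" decision in step 4 can be sketched as bookkeeping over batches of migrated allocations. This sketch assumes an in-memory map of outstanding allocations per Node, which is an illustrative simplification rather than how the drainer tracks the remaining work; `migratedAlloc` and `nodesDone` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
)

// migratedAlloc is a hypothetical, minimal view of an allocation that has
// finished migrating.
type migratedAlloc struct {
	ID     string
	NodeID string
}

// nodesDone removes the migrated allocations from the set of outstanding
// allocations per Node and returns the IDs of Nodes with nothing left to
// migrate. Finished Nodes are dropped from the remaining map.
func nodesDone(remaining map[string]map[string]struct{}, migrated []migratedAlloc) []string {
	touched := map[string]struct{}{}
	for _, alloc := range migrated {
		allocs, ok := remaining[alloc.NodeID]
		if !ok {
			continue // Node is not being tracked
		}
		delete(allocs, alloc.ID)
		touched[alloc.NodeID] = struct{}{}
	}

	var done []string
	for nodeID := range touched {
		if len(remaining[nodeID]) == 0 {
			done = append(done, nodeID)
			delete(remaining, nodeID)
		}
	}
	sort.Strings(done) // deterministic order for callers
	return done
}

func main() {
	remaining := map[string]map[string]struct{}{
		"node-1": {"alloc-a": {}, "alloc-b": {}},
		"node-2": {"alloc-c": {}},
	}
	fmt.Println(nodesDone(remaining, []migratedAlloc{
		{ID: "alloc-a", NodeID: "node-1"},
		{ID: "alloc-c", NodeID: "node-2"},
	}))
	fmt.Println(nodesDone(remaining, []migratedAlloc{
		{ID: "alloc-b", NodeID: "node-1"},
	}))
}
```

In this sketch the returned Node IDs are the ones that would be passed on to the `NodesDrainComplete` shim.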

And the third is when Nodes pass their deadline. The `NodeWatcher` is
responsible for adding and removing the watch in the `DeadlineNotifier`. The
`DeadlineNotifier` is responsible for watching the timer. If the Node isn't
removed before the deadline, the `DeadlineNotifier` tells the `NodeDrainer` and
the `NodeDrainer` updates the state via the `NodesDrainComplete` shim. At this
point the remaining allocs will be forced to shut down immediately.

```mermaid
flowchart TD
  %% entities
  NodeDrainer([NodeDrainer])
  NodeWatcher([DrainingNodeWatcher])
  DeadlineNotifier([DrainDeadlineNotifier])
  StateStore([state store])

  %% style classes
  classDef component fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
  classDef other fill:#d5f6ea,stroke:#1d9467
  class StateStore other;
  class NodeDrainer,NodeWatcher,DeadlineNotifier component;

  NodeWatcher -- "1. Watch()" --> DeadlineNotifier
  NodeWatcher -- "2a. Remove()" --> DeadlineNotifier
  DeadlineNotifier -- "2b. watch the clock" --> DeadlineNotifier
  DeadlineNotifier -- "3. node has passed deadline" --> NodeDrainer
  NodeDrainer -- "4. NodesDrainComplete\n(raft shim)" --> StateStore
```

[`nomad node drain`]: https://developer.hashicorp.com/nomad/docs/commands/node/drain
[node drain tutorial]: https://developer.hashicorp.com/nomad/tutorials/manage-clusters/node-drain
[`nomad/drainer`]: https://github.com/hashicorp/nomad/tree/main/nomad/drainer
[`watch_nodes.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/watch_nodes.go
[`watch_jobs.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/watch_jobs.go
[`drain_heap.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer/drain_heap.go
[`nomad/drainer_shims.go`]: https://github.com/hashicorp/nomad/blob/main/nomad/drainer_shims.go
[`migrate`]: https://developer.hashicorp.com/nomad/docs/job-specification/migrate