open-nomad/contributing/architecture-eval-states.md

# Architecture: Evaluation Status

The [Scheduling in Nomad][] internals documentation covers the path that an
evaluation takes through the leader, worker, and plan applier. But it doesn't
cover in any detail the various `Evaluation.Status` values, or where the
`PreviousEval`, `NextEval`, or `BlockedEval` ID pointers are set.

The state diagram below describes the transitions between `Status` values as
solid arrows. The dashed arrows represent when a new evaluation is created. The
parenthetical labels on those arrows are the `TriggeredBy` field for the new
evaluation.

The status values are:

* `pending` evaluations either are queued to be scheduled, are still being
  processed in the scheduler, or are being applied by the plan applier and not
  yet acknowledged.
* `failed` evaluations have failed to be applied by the plan applier (or are
  somehow invalid in the scheduler; this is always a bug)
* `blocked` evaluations are created when an eval has failed too many attempts to
  have its plan applied by the leader, or when a plan can only be partially
  applied and there are still more allocations to create.
* `complete` means the plan was applied successfully (at least partially).
* `canceled` means the evaluation was superseded by state changes like a new
  version of the job.


```mermaid
flowchart LR

    event((Cluster\nEvent))

    pending([pending])
    blocked([blocked])
    complete([complete])
    failed([failed])
    canceled([canceled])

    %% style classes
    classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
    classDef other fill:#d5f6ea,stroke:#1d9467
    class event other;
    class pending,blocked,complete,failed,canceled status;

    event -. "job-register
      job-deregister
      periodic-job
      node-update
      node-drain
      alloc-stop
      scheduled
      alloc-failure
      job-scaling" .-> pending

    pending -. "new eval\n(rolling-update)" .-> pending
    pending -. "new eval\n(preemption)" .-> pending

    pending -. "new eval\n(max-plan-attempts)" .-> blocked
    pending -- if plan submitted --> complete
    pending -- if invalid --> failed
    pending -- if no-op --> canceled

    failed -- if retried --> blocked
    failed -- if retried --> complete

    blocked -- if no-op --> canceled
    blocked -- if plan submitted --> complete

    complete -. "new eval\n(deployment-watcher)" .-> pending
    complete -. "new eval\n(queued-allocs)" .-> blocked

    failed -. "new eval\n(failed-follow-up)" .-> pending
```

But it's hard to get a full picture of the evaluation lifecycle purely from the
`Status` fields, because evaluations have several "quasi-statuses" which aren't
represented as explicit statuses in the state store:

* `scheduling` is the status where an eval is being processed by the scheduler
  worker.
* `applying` is the status where the resulting plan for the eval is being
  applied in the plan applier on the leader.
* `delayed` is an enqueued eval that will be dequeued some time in the future.
* `deleted` is when an eval is removed from the state store entirely.

By adding these statuses to the diagram (the dashed nodes), you can see where
the same `Status` transition might result in different `PreviousEval`,
`NextEval`, or `BlockedEval` set. You can also see where the "chain" of
evaluations is broken when new evals are created for preemptions or by the
deployment watcher.


```mermaid
flowchart LR

    event((Cluster\nEvent))

    %% statuss
    pending([pending])
    blocked([blocked])
    complete([complete])
    failed([failed])
    canceled([canceled])

    %% quasi-statuss
    deleted([deleted])
    delayed([delayed])
    scheduling([scheduling])
    applying([applying])

    %% style classes
    classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467
    classDef quasistatus fill:#d5f6ea,stroke-dasharray: 5 5,stroke:#1d9467
    classDef other fill:#d5f6ea,stroke:#1d9467

    class event other;
    class pending,blocked,complete,failed,canceled status;
    class deleted,delayed,scheduling,applying quasistatus;

    event -- "job-register
      job-deregister
      periodic-job
      node-update
      node-drain
      alloc-stop
      scheduled
      alloc-failure
      job-scaling" --> pending

    pending -- dequeued --> scheduling
    pending -- if delayed --> delayed
    delayed -- dequeued --> scheduling

    scheduling -. "not all allocs placed
      new eval created by scheduler
      trigger queued-allocs
      new has .PreviousEval = old.ID
      old has .BlockedEval = new.ID" .-> blocked

    scheduling -. "failed to plan
      new eval created by scheduler
      trigger: max-plan-attempts
      new has .PreviousEval = old.ID
      old has .BlockedEval = new.ID" .-> blocked

    scheduling -- "not all allocs placed
      reuse already-blocked eval" --> blocked

    blocked -- "unblocked by
      external state changes" --> scheduling

    scheduling -- allocs placed --> complete

    scheduling -- "wrong eval type or
      max retries exceeded
      on plan submit" --> failed

    scheduling -- "canceled by
      job update/stop" --> canceled

    failed -- retry --> scheduling

    scheduling -. "new eval from rolling update (system jobs)
      created by scheduler
      trigger: rolling-update
      new has .PreviousEval = old.ID
      old has .NextEval = new.ID" .-> pending

    scheduling -- submit --> applying
    applying -- failed --> scheduling

    applying -. "new eval for preempted allocs
      created by plan applier
      trigger: preemption
      new has .PreviousEval = unset!
      old has .BlockedEval = unset!" .-> pending

    complete -. "new eval from deployments (service jobs)
      created by deploymentwatcher
      trigger: deployment-watcher
      new has .PreviousEval = unset!
      old has .NextEval = unset!" .-> pending

    failed -- "new eval
      trigger: failed-follow-up
      new has .PreviousEval = old.ID
      old has .NextEval = new.ID" --> pending

    pending -- "undeliverable evals
      reaped by leader" --> failed

    blocked -- "duplicate blocked evals
      reaped by leader" --> canceled

    canceled -- garbage\ncollection --> deleted
    failed -- garbage\ncollection --> deleted
    complete -- garbage\ncollection --> deleted
```


[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling
internals documentation with diagrams (#14750) This changeset adds new architecture internals documents to the contributing guide. These are intentionally here and not on the public-facing website because the material is not required for operators and includes a lot of diagrams that we can cheaply maintain with mermaid syntax but would involve art assets to have up on the main site that would become quickly out of date as code changes happen and be extremely expensive to maintain. However, these should be suitable to use as points of conversation with expert end users. Included: * A description of Evaluation triggers and expected counts, with examples. * A description of Evaluation states and implicit states. This is taken from an internal document in our team wiki. * A description of how writing the State Store works. This is taken from a diagram I put together a few months ago for internal education purposes. * A description of Evaluation lifecycle, from registration to running Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but broken into digestible chunks and without multi-region deployments, which I'd like to cover in a future doc. Also includes adding Deployments to our public-facing glossary. Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Seth Hoenig <shoenig@duck.com> 2022-10-03 18:06:41 +00:00			`# Architecture: Evaluation Status`

			`The [Scheduling in Nomad][] internals documentation covers the path that an`
			`evaluation takes through the leader, worker, and plan applier. But it doesn't`
			cover in any detail the various `Evaluation.Status` values, or where the
			`PreviousEval`, `NextEval`, or `BlockedEval` ID pointers are set.

			The state diagram below describes the transitions between `Status` values as
			`solid arrows. The dashed arrows represent when a new evaluation is created. The`
			parenthetical labels on those arrows are the `TriggeredBy` field for the new
			`evaluation.`

			`The status values are:`

			* `pending` evaluations either are queued to be scheduled, are still being
			`processed in the scheduler, or are being applied by the plan applier and not`
			`yet acknowledged.`
			* `failed` evaluations have failed to be applied by the plan applier (or are
			`somehow invalid in the scheduler; this is always a bug)`
			* `blocked` evaluations are created when an eval has failed too many attempts to
			`have its plan applied by the leader, or when a plan can only be partially`
			`applied and there are still more allocations to create.`
			* `complete` means the plan was applied successfully (at least partially).
			* `canceled` means the evaluation was superseded by state changes like a new
			`version of the job.`


			```mermaid
			`flowchart LR`

			`event((Cluster\nEvent))`

			`pending([pending])`
			`blocked([blocked])`
			`complete([complete])`
			`failed([failed])`
			`canceled([canceled])`

			`%% style classes`
			`classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467`
			`classDef other fill:#d5f6ea,stroke:#1d9467`
			`class event other;`
			`class pending,blocked,complete,failed,canceled status;`

			`event -. "job-register`
			`job-deregister`
			`periodic-job`
			`node-update`
			`node-drain`
			`alloc-stop`
			`scheduled`
			`alloc-failure`
			`job-scaling" .-> pending`

			`pending -. "new eval\n(rolling-update)" .-> pending`
			`pending -. "new eval\n(preemption)" .-> pending`

			`pending -. "new eval\n(max-plan-attempts)" .-> blocked`
			`pending -- if plan submitted --> complete`
			`pending -- if invalid --> failed`
			`pending -- if no-op --> canceled`

			`failed -- if retried --> blocked`
			`failed -- if retried --> complete`

			`blocked -- if no-op --> canceled`
			`blocked -- if plan submitted --> complete`

			`complete -. "new eval\n(deployment-watcher)" .-> pending`
			`complete -. "new eval\n(queued-allocs)" .-> blocked`

			`failed -. "new eval\n(failed-follow-up)" .-> pending`
			```

			`But it's hard to get a full picture of the evaluation lifecycle purely from the`
			`Status` fields, because evaluations have several "quasi-statuses" which aren't
			`represented as explicit statuses in the state store:`

			* `scheduling` is the status where an eval is being processed by the scheduler
			`worker.`
			* `applying` is the status where the resulting plan for the eval is being
			`applied in the plan applier on the leader.`
			* `delayed` is an enqueued eval that will be dequeued some time in the future.
			* `deleted` is when an eval is removed from the state store entirely.

			`By adding these statuses to the diagram (the dashed nodes), you can see where`
			the same `Status` transition might result in different `PreviousEval`,
			`NextEval`, or `BlockedEval` set. You can also see where the "chain" of
			`evaluations is broken when new evals are created for preemptions or by the`
			`deployment watcher.`


			```mermaid
			`flowchart LR`

			`event((Cluster\nEvent))`

			`%% statuss`
			`pending([pending])`
			`blocked([blocked])`
			`complete([complete])`
			`failed([failed])`
			`canceled([canceled])`

			`%% quasi-statuss`
			`deleted([deleted])`
			`delayed([delayed])`
			`scheduling([scheduling])`
			`applying([applying])`

			`%% style classes`
			`classDef status fill:#d5f6ea,stroke-width:4px,stroke:#1d9467`
			`classDef quasistatus fill:#d5f6ea,stroke-dasharray: 5 5,stroke:#1d9467`
			`classDef other fill:#d5f6ea,stroke:#1d9467`

			`class event other;`
			`class pending,blocked,complete,failed,canceled status;`
			`class deleted,delayed,scheduling,applying quasistatus;`

			`event -- "job-register`
			`job-deregister`
			`periodic-job`
			`node-update`
			`node-drain`
			`alloc-stop`
			`scheduled`
			`alloc-failure`
			`job-scaling" --> pending`

			`pending -- dequeued --> scheduling`
			`pending -- if delayed --> delayed`
			`delayed -- dequeued --> scheduling`

			`scheduling -. "not all allocs placed`
			`new eval created by scheduler`
			`trigger queued-allocs`
			`new has .PreviousEval = old.ID`
			`old has .BlockedEval = new.ID" .-> blocked`

			`scheduling -. "failed to plan`
			`new eval created by scheduler`
			`trigger: max-plan-attempts`
			`new has .PreviousEval = old.ID`
			`old has .BlockedEval = new.ID" .-> blocked`

			`scheduling -- "not all allocs placed`
			`reuse already-blocked eval" --> blocked`

			`blocked -- "unblocked by`
			`external state changes" --> scheduling`

			`scheduling -- allocs placed --> complete`

			`scheduling -- "wrong eval type or`
			`max retries exceeded`
			`on plan submit" --> failed`

			`scheduling -- "canceled by`
			`job update/stop" --> canceled`

			`failed -- retry --> scheduling`

			`scheduling -. "new eval from rolling update (system jobs)`
			`created by scheduler`
			`trigger: rolling-update`
			`new has .PreviousEval = old.ID`
			`old has .NextEval = new.ID" .-> pending`

			`scheduling -- submit --> applying`
			`applying -- failed --> scheduling`

			`applying -. "new eval for preempted allocs`
			`created by plan applier`
			`trigger: preemption`
			`new has .PreviousEval = unset!`
			`old has .BlockedEval = unset!" .-> pending`

			`complete -. "new eval from deployments (service jobs)`
			`created by deploymentwatcher`
			`trigger: deployment-watcher`
			`new has .PreviousEval = unset!`
			`old has .NextEval = unset!" .-> pending`

			`failed -- "new eval`
			`trigger: failed-follow-up`
			`new has .PreviousEval = old.ID`
			`old has .NextEval = new.ID" --> pending`

			`pending -- "undeliverable evals`
			`reaped by leader" --> failed`

			`blocked -- "duplicate blocked evals`
			`reaped by leader" --> canceled`

			`canceled -- garbage\ncollection --> deleted`
			`failed -- garbage\ncollection --> deleted`
			`complete -- garbage\ncollection --> deleted`
			```


			`[Scheduling in Nomad]: https://www.nomadproject.io/docs/internals/scheduling/scheduling`