The `paused` state is used as an operator safety mechanism, so that they can
debug a deployment or halt one that's causing a wider failure. By using the
`paused` state as the first state of a multiregion deployment, we risked
resuming an intentionally operator-paused deployment because of activity in a
peer region.
This changeset replaces the use of the `paused` state with a `pending` state,
and provides a `Deployment.Run` internal RPC to replace the use of the
`Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.
* `nextRegion` should take status parameter
* thread Deployment/Job RPCs thru `nextRegion`
* add `nextRegion` calls to `deploymentwatcher`
* use a better description for paused for peer
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect
* scheduler/reconcile: thread follupEvalIDs through to results.stop
* scheduler/reconcile: comment typo
* nomad/_test: correct arguments for plan.AppendStoppedAlloc
* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
* client/heartbeatstop: reversed time condition for startup grace
* scheduler/generic_sched: use `delayInstead` to avoid a loop
Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.
* scheduler/reconcile: lostLater are different than delayedReschedules
Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
* jobspec, api: add stop_after_client_disconnect
* nomad/state/state_store: error message typo
* structs: alloc methods to support stop_after_client_disconnect
1. a global AllocStates to track status changes with timestamps. We
need this to track the time at which the alloc became lost
originally.
2. ShouldClientStop() and WaitClientStop() to actually do the math
* scheduler/reconcile_util: delayByStopAfterClientDisconnect
* scheduler/reconcile: use delayByStopAfterClientDisconnect
* scheduler/util: updateNonTerminalAllocsToLost comments
This was setup to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, and not all lost allocs
have been marked stop or evict by now
* scheduler/testing: AssertEvalStatus just use require
* scheduler/generic_sched: don't create a blocked eval if delayed
* scheduler/generic_sched_test: several scheduling cases
The BinPackIter accounted for node reservations twice when scoring nodes
which could bias scores toward nodes with reservations.
Pseudo-code for previous algorithm:
```
proposed = reservedResources + sum(allocsResources)
available = nodeResources - reservedResources
score = 1 - (proposed / available)
```
The node's reserved resources are added to the total resources used by
allocations, and then the node's reserved resources are later
substracted from the node's overall resources.
The new algorithm is:
```
proposed = sum(allocResources)
available = nodeResources - reservedResources
score = 1 - (proposed / available)
```
The node's reserved resources are no longer added to the total resources
used by allocations.
My guess as to how this bug happened is that the resource utilization
variable (`util`) is calculated and returned by the `AllocsFit` function
which needs to take reserved resources into account as a basic
feasibility check.
To avoid re-calculating alloc resource usage (because there may be a
large number of allocs), we reused `util` in the `ScoreFit` function.
`ScoreFit` properly accounts for reserved resources by subtracting them
from the node's overall resources. However since `util` _also_ took
reserved resources into account the score would be incorrect.
Prior to the fix the added test output:
```
Node: reserved Score: 1.0000
Node: reserved2 Score: 1.0000
Node: no-reserved Score: 0.9741
```
The scores being 1.0 for *both* nodes with reserved resources is a good
hint something is wrong as they should receive different scores. Upon
further inspection the double accounting of reserved resources caused
their scores to be >1.0 and clamped.
After the fix the added test outputs:
```
Node: no-reserved Score: 0.9741
Node: reserved Score: 0.9480
Node: reserved2 Score: 0.8717
```
* nomad/structs/csi: split CanWrite into health, in use
* scheduler/scheduler: expose AllocByID in the state interface
* nomad/state/state_store_test
* scheduler/stack: SetJobID on the matcher
* scheduler/feasible: when a volume writer is in use, check if it's us
* scheduler/feasible: remove SetJob
* nomad/state/state_store: denormalize allocs before Claim
* nomad/structs/csi: return errors on claim, with context
* nomad/csi_endpoint_test: new alloc doesn't look like an update
* nomad/state/state_store_test: change test reference to CanWrite
* nomad/state/schema: use the namespace compound index
* scheduler/scheduler: CSIVolumeByID interface signature namespace
* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace
* scheduler/feasible: pass the captured namespace to CSIVolumeByID
* nomad/state/state_store: use namespace in csi_volume index
* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim
* nomad/core_sched: pass the namespace in volumeClaimReap
* nomad/node_endpoint_test: namespaces in Claim testing
* nomad/csi_endpoint: pass RequestNamespace to state.*
* nomad/csi_endpoint_test: appropriately failed test
* command/alloc_status_test: appropriately failed test
* node_endpoint_test: avoid notTheNamespace for the job
* scheduler/feasible_test: call SetJob to capture the namespace
* nomad/csi_endpoint: ACL check the req namespace, query by namespace
* nomad/state/state_store: remove deregister namespace check
* nomad/state/state_store: remove unused CSIVolumes
* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace
* nomad/csi_endpoint: ACL check
* nomad/state/state_store_test: remove call to state.CSIVolumes
* nomad/core_sched_test: job namespace match so claim gc works
Previously we were looking up plugins based on the Alias Name for a CSI
Volume within the context of its task group.
Here we first look up a volume based on its identifier and then validate
the existence of the plugin based on its `PluginID`.
This commit filters the jobs volumes when setting them on the
feasibility checker. This ensures that the rest of the checker does not
have to worry about non-csi volumes.
* state_store: csi volumes/plugins store the index in the txn
* nomad: csi_endpoint_test require index checks need uint64()
* nomad: other tests using int 0 not uint64(0)
* structs: pass index into New, but not other struct methods
* state_store: csi plugin indexes, use new struct interface
* nomad: csi_endpoint_test check index/query meta (on explicit 0)
* structs: NewCSIVolume takes an index arg now
* scheduler/test: NewCSIVolume takes an index arg now
diffSystemAllocs -> diffSystemAllocsForNode, this function is only used
for diffing system allocations, but lacked awareness of eligible
nodes and the node ID that the allocation was going to be placed.
This change now ignores a change if its existing allocation is on an
ineligible node. For a new allocation, it also checks tainted and
ineligible nodes in the same function instead of nil-ing out the diff
after computation in diffSystemAllocs
If an existing system allocation is running and the node its running on
is marked as ineligible, subsequent plan/applys return an RPC error
instead of a more helpful plan result.
This change logs the error, and appends a failedTGAlloc for the
placement.
If an alloc is being preempted and marked as evict, but the underlying
node is lost before the migration takes place, the allocation currently
stays as desired evict, status running forever, or until the node comes
back online.
This commit updates updateNonTerminalAllocsToLost to check for a
destired status of Evict as well as Stop when updating allocations on
tainted nodes.
switch to table test for lost node cases
We currently log an error if preemption is unable to find a suitable set of
allocations to preempt. This commit changes that to debug level since not finding
preemptable allocations is not an error condition.
Fixes#5856
When the scheduler looks for a placement for an allocation that's
replacing another allocation, it's supposed to penalize the previous
node if the allocation had been rescheduled or failed. But we're
currently always penalizing the node, which leads to unnecessary
migrations on job update.
This commit leaves in place the existing behavior where if the
previous alloc was itself rescheduled, its previous nodes are also
penalized. This is conservative but the right behavior especially on
larger clusters where a group of hosts might be having correlated
trouble (like an AZ failure).
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>