Non-CSI garbage collection tasks on the server only log the cutoff
index in the case where it's not a forced GC from `nomad system gc`.
Do the same for CSI for consistency.
Update the logic in the Nomad client's alloc health tracker which
erroneously marks existing healthy allocations with dead poststart ephemeral
tasks as unhealthy even if they were already successful during a previous
deployment.
This PR replaces use of time.After with a safe helper function
that creates a time.Timer to use instead. The new function returns
both a time.Timer and a Stop function that the caller must handle.
Unlike time.NewTimer, the helper function does not panic if the duration
set is <= 0.
PR #11550 changed the job stop exit behaviour when monitoring the
deployment. When stopping a job, the deployment becomes cancelled
and therefore the CLI now exits with status code 1 as it see this
as an error.
This change adds a new utility e2e function that accounts for this
behaviour.
Previously we copied this library by hand to avoid vendor-ing a bunch of
files related to minimock. Now that we no longer vendor, just import the
library normally.
Also we might use more of the library for handling `time.After` uses,
for which this library provides a Context-based solution.
The Plan.Submit endpoint assumed PlanRequest.Plan was never nil. While
there is no evidence it ever has been nil, we should not panic if a nil
plan is ever submitted because that would crash the leader.
When an allocation stops, the `csi_hook` makes an unpublish RPC to the
servers to unpublish via the CSI RPCs: first to the node plugins and
then the controller plugins. The controller RPCs must happen after the
node RPCs so that the node has had a chance to unmount the volume
before the controller tries to detach the associated device.
But the client has local access to the node plugins and can
independently determine if it's safe to send unpublish RPC to those
plugins. This will allow the server to treat the node plugin as
abandoned if a client is disconnected and `stop_on_client_disconnect`
is set. This will let the server try to send unpublish RPCs to the
controller plugins, under the assumption that the client will be
trying to unmount the volume on its end first.
Note that the CSI `NodeUnpublishVolume`/`NodeUnstageVolume` RPCs can
return ignorable errors in the case where the volume has already been
unmounted from the node. Handle all other errors by retrying until we
get success so as to give operators the opportunity to reschedule a
failed node plugin (ex. in the case where they accidentally drained a
node without `-ignore-system`). Fan-out the work for each volume into
its own goroutine so that we can release a subset of volumes if only
one is stuck.