The plugin supervisor lazily connects to plugins, but this means we
only get "Unavailable" back from the gRPC call in cases where the
plugin can never be reached (for example, if the Nomad client has the
wrong permissions for the socket).
This changeset improves the operator experience by switching to a
blocking `DialWithContext`. It eagerly connects so that we can
validate the connection is real and get a "failed to open" error in
case where Nomad can't establish the initial connection.
This fixes a bug in #7815 where you can’t open an exec window from
the allocation overview because accessing `allocation.job.plainId`
fails across the proxied relationship.
This uses a fork of the Ember Asset Size Action to report on changes to
the filesizes of the production UI build as a comment to the PR. The
fork includes fixes to make it work with a subdirectory (ui/ in our
case) and to make it update the PR comment if it exists vs posting
repeatedly to the thread.
The CSI plugins RPCs require the use of the storage provider's volume
ID, rather than the user-defined volume ID. Although changing the RPCs
to use the field name `ExternalID` risks breaking backwards
compatibility, we can use the `ExternalID` name internally for the
client and only use `VolumeID` at the RPC boundaries.
This ensures that token revocation is idempotent and can handle when
tokens are revoked out of band.
Idempotency is important to handle some transient failures and retries.
Consider when a single token of a batch fails to be revoked, nomad would
retry revoking the entire batch; tokens already revoked should be
gracefully handled, otherwise, nomad may retry revoking the same
tokens forever.
* jobspec, api: add stop_after_client_disconnect
* nomad/state/state_store: error message typo
* structs: alloc methods to support stop_after_client_disconnect
1. a global AllocStates to track status changes with timestamps. We
need this to track the time at which the alloc became lost
originally.
2. ShouldClientStop() and WaitClientStop() to actually do the math
* scheduler/reconcile_util: delayByStopAfterClientDisconnect
* scheduler/reconcile: use delayByStopAfterClientDisconnect
* scheduler/util: updateNonTerminalAllocsToLost comments
This was setup to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, and not all lost allocs
have been marked stop or evict by now
* scheduler/testing: AssertEvalStatus just use require
* scheduler/generic_sched: don't create a blocked eval if delayed
* scheduler/generic_sched_test: several scheduling cases
This fixes a bug where docker images may not be GCed. The cause of the
bug is that we track the task using `task.ID+task.Name` on task start
but remove on plain `task.ID`.
This haromize the two paths by using `task.ID`, as it's unique enough
and it's also used in the `loadImage` path (path when loading an image
from a local tarball instead of dockerhub).