This changeset implements a periodic garbage collection of unused CSI
plugins. Plugins are self-cleaning when the last allocation for a
plugin is stopped, but this feature will cover any missing edge cases
and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins
cleaned up.
This is part of #7834’s jQuery removal goal. It addresses a couple of jQuery-related deprecation warnings and also uses “native events mode” for ember-cli-page-object, which is needed so it doesn’t have to use jQuery via the Ember global.
In the client view list, only show running allocations count for each
client, rather than include already completed tasks.
This is done for two reasons:
First, consitency with the CLI: `nomad node status --allocs` only
shows running allocs.
Second, and more importantly, the count is useful to estimate how loaded
the clients are. Allocs that have completed (but not GCed yet) have
very little value to operators.
In this changeset:
* If a Nomad client node is running both a controller and a node
plugin (which is a common case), then if only the controller or the
node is removed, the plugin was not being updated with the correct
counts.
* The existing test for plugin cleanup didn't go back to the state
store, which normally is ok but is complicated in this case by
denormalization which changes the behavior. This commit makes the
test more comprehensive.
* Set "controller required" when plugin has `PUBLISH_READONLY`. All
known controllers that support `PUBLISH_READONLY` also support
`PUBLISH_UNPUBLISH_VOLUME` but we shouldn't assume this.
* Only create plugins when the allocs for those plugins are
healthy. If we allow a plugin to be created for the first time when
the alloc is not healthy, then we'll recreate deleted plugins when
the job's allocs all get marked terminal.
* Terminal plugin alloc updates should cleanup the plugin. The client
fingerprint can't tell if the plugin is unhealthy intentionally (for
the case of updates or job stop). Allocations that are
server-terminal should delete themselves from the plugin and trigger
a plugin self-GC, the same as an unused node.
Several of the CSI `VolumeCapability` methods return pointers, which
we were then comparing to pointers in the request rather than
dereferencing them and comparing their contents.
This changeset does a more fine-grained comparison of the request vs
the capabilities, and adds better error messaging.
Makes it possible to run Linux Containers On Windows with Nomad alongside Windows Containers. Fingerprint prevents only to run Nomad in Windows 10 with Linux Containers
The nil-check here is left-over from an earlier approach that didn't
get merged. It doesn't do anything for us now as we can't ever pass it
`nil` and if we leave it in the `getVolume` call it guards will panic
anyways.
In order to minimize this change while keeping a simple version of the
behavior, we set `lastOk` to the current time less the intial server
connection timeout. If the client starts and never contacts the
server, it will stop all configured tasks after the initial server
connection grace period, on the assumption that we've been out of
touch longer than any configured `stop_after_client_disconnect`.
The more complex state behavior might be justified later, but we
should learn about failure modes first.
- track lastHeartbeat, the client local time of the last successful
heartbeat round trip
- track allocations with `stop_after_client_disconnect` configured
- trigger allocation destroy (which handles cleanup)
- restore heartbeat/killable allocs tracking when allocs are recovered from disk
- on client restart, stop those allocs after a grace period if the
servers are still partioned
This changeset corrects handling of the `ValidationVolumeCapabilities`
response:
* The CSI spec for the `ValidationVolumeCapabilities` requires that
plugins only set the `Confirmed` field if they've validated all
capabilities. The Nomad client improperly assumes that the lack of a
`Confirmed` field should be treated as a failure. This breaks the
Azure and Linode block storage plugins, which don't set this
optional field.
* The CSI spec also requires that the orchestrator check the validation
responses to guard against older versions of a plugin reporting
"valid" for newer fields it doesn't understand.
During MVP development, we reduced the timeout for controller plugins
to avoid long hangs in GC workers. But now that this work has been
moved to the volume watcher, we can restore the original timeout which
is better suited for the characteristic timescales of some cloud
provider APIs and better matches the behavior of k8s.
We should only remove the `ReadAllocs`/`WriteAllocs` values for a
volume after the claim has entered the "ready to free"
state. The volume will eventually be released as expected. But
querying the volume API will show the volume is released before the
controller unpublish has finished and this can cause a race with
starting new jobs.
Test updates are to cover cases where we're dropping claims but not
running through the whole reaping process.