When a CSI client RPC is given a specific node for a controller but
the lookup fails (because the node is gone or is an older version), we
fall through to selecting a node from all those available. This adds
logging to this case to aid in diagnostics.
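As a rough sketch of that fallback (the names `findControllerNode` and `pluginNodes` are hypothetical, not Nomad's actual CSI plumbing), the flow looks something like this:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Node is a stand-in for a client node; hypothetical for this sketch.
type Node struct {
	ID string
}

// findControllerNode mimics the fallback described above: prefer the node the
// claim points at, but if that node can't be found (gone, or an older version
// that isn't tracked for this plugin), log why and pick any node running the
// plugin.
func findControllerNode(nodeID string, pluginNodes map[string]*Node) (*Node, error) {
	if node, ok := pluginNodes[nodeID]; ok {
		return node, nil
	}

	// The requested node lookup failed; log it so operators can see why the
	// RPC wasn't sent to the expected node.
	log.Printf("[WARN] csi: requested node %q not found for plugin, falling back to any available node", nodeID)

	for _, node := range pluginNodes {
		return node, nil
	}
	return nil, errors.New("no nodes available for plugin")
}

func main() {
	nodes := map[string]*Node{"node-2": {ID: "node-2"}}
	node, err := findControllerNode("node-1", nodes)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("selected node:", node.ID)
}
```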
The watcher goroutines are started automatically when a volume has
updates, but when the volume is idle we shouldn't keep a goroutine
running and taking up memory.
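A minimal sketch of the start-on-demand, stop-when-idle pattern (simplified types, not the actual `volumewatcher` code) might look like:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// volumeWatcher is a toy version of the pattern: it only exists to show the
// goroutine lifecycle, not Nomad's real claim handling.
type volumeWatcher struct {
	mu      sync.Mutex
	running bool
	pending []string // queued volume updates
}

// Notify queues an update and starts the worker goroutine only if one is not
// already running, so an idle volume never pins a goroutine.
func (w *volumeWatcher) Notify(update string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending = append(w.pending, update)
	if !w.running {
		w.running = true
		go w.run()
	}
}

// run processes queued updates and exits as soon as the queue is empty.
func (w *volumeWatcher) run() {
	for {
		w.mu.Lock()
		if len(w.pending) == 0 {
			// Nothing left to do: mark stopped and let the goroutine exit.
			w.running = false
			w.mu.Unlock()
			return
		}
		update := w.pending[0]
		w.pending = w.pending[1:]
		w.mu.Unlock()

		fmt.Println("processing volume update:", update)
	}
}

func main() {
	w := &volumeWatcher{}
	w.Notify("claim released")
	w.Notify("allocation stopped")
	time.Sleep(100 * time.Millisecond) // give the worker time to drain in this demo
}
```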
This changeset implements periodic garbage collection of CSI volumes
with missing allocations. This can happen when a node update partially
fails: the allocation updates are written to raft, but the evaluations
to GC the volumes are dropped. This feature covers that edge case and
ensures that upgrades from 0.11.0 and 0.11.1 get any stray claims
cleaned up.
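The shape of such a periodic GC pass, sketched with simplified, hypothetical types (the real server enqueues GC evaluations rather than mutating state directly, and the interval below is illustrative only):

```go
package main

import (
	"fmt"
	"time"
)

// Volume is a trimmed-down stand-in for a CSI volume with claims.
type Volume struct {
	ID     string
	Claims map[string]string // claim ID -> allocation ID
}

// gcCSIVolumeClaims releases claims whose allocation no longer exists.
func gcCSIVolumeClaims(volumes []*Volume, liveAllocs map[string]bool) {
	for _, vol := range volumes {
		for claimID, allocID := range vol.Claims {
			if !liveAllocs[allocID] {
				fmt.Printf("volume %s: releasing stray claim %s (alloc %s missing)\n",
					vol.ID, claimID, allocID)
				delete(vol.Claims, claimID)
			}
		}
	}
}

func main() {
	volumes := []*Volume{{ID: "vol-1", Claims: map[string]string{"claim-1": "alloc-gone"}}}
	liveAllocs := map[string]bool{"alloc-live": true}

	// Periodic GC driven by a ticker; shortened here so the demo terminates.
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	stop := time.After(120 * time.Millisecond)

	for {
		select {
		case <-ticker.C:
			gcCSIVolumeClaims(volumes, liveAllocs)
		case <-stop:
			return
		}
	}
}
```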
TestClientEndpoint_CreateNodeEvals flakes occasionally, but its output
is very confusing because `structs.Evaluations` overrides GoString.
Here, we emit the entire struct of each evaluation, and hopefully we'll
figure out the problem the next time it happens.
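To illustrate why a GoString override hides the useful fields, here is a toy example (not the real `structs.Evaluations` type): `%#v` consults the `GoStringer` implementation, so printing the whole struct requires going through the elements directly.

```go
package main

import "fmt"

// Eval is a stand-in for an evaluation struct in this sketch.
type Eval struct {
	ID     string
	Status string
}

// Evals mimics a slice type that overrides GoString, which hides the
// interesting fields when a test prints it with %#v.
type Evals []*Eval

func (e Evals) GoString() string {
	return fmt.Sprintf("(%d evals)", len(e))
}

func main() {
	evals := Evals{{ID: "eval-1", Status: "blocked"}}

	// %#v calls the GoStringer implementation, so the output is terse:
	fmt.Printf("%#v\n", evals) // (1 evals)

	// Printing each element shows the whole struct, which is what makes
	// failures diagnosable:
	for _, ev := range evals {
		fmt.Printf("%+v\n", *ev) // {ID:eval-1 Status:blocked}
	}
}
```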
This change deflakes the TestTaskTemplateManager_BlockedEvents test,
because it expects a number of events without accounting for
transitional state.
The test TestTaskTemplateManager_BlockedEvents attempts to ensure that
template rendering emits blocked events for missing template keys.
It works by setting a template that requires keys 0, 1, 2, 3, and 4,
then eventually sets keys 0, 1, and 2, and ensures that we get a final
event indicating that keys 3 and 4 are still missing.
The test waits to get a blocked event for the final state, but it can
fail if it receives a blocked event for a transitional state (e.g. one
reporting that keys 2, 3, and 4 are missing).
This fixes the test by ensuring that it waits until the final message
before assertion.
Also, it clarifies the intent of the test with stricter assertions and
additional comments.
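The deflaking pattern boils down to consuming events until the one describing the final state arrives (or a timeout fires), and only then asserting. A standalone sketch with made-up event strings:

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Keep consuming events until the final-state event shows up, instead of
// asserting on whatever transitional event happens to arrive first.
func main() {
	events := make(chan string, 3)
	events <- "Missing: key 2, key 3, key 4" // transitional state
	events <- "Missing: key 3, key 4"        // final state

	final := "Missing: key 3, key 4"
	timeout := time.After(5 * time.Second)

	for {
		select {
		case e := <-events:
			if !strings.Contains(e, final) {
				// Transitional event; keep waiting rather than failing.
				continue
			}
			fmt.Println("got final blocked event:", e)
			return
		case <-timeout:
			fmt.Println("timed out waiting for final blocked event")
			return
		}
	}
}
```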
Ensure that `""` Scheduler Algorithm gets explicitly set to binpack on
upgrades or on API handling when user misses the value.
The scheduler already treats `""` value as binpack. This PR merely
ensures that the operator API returns the effective value.
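A sketch of that canonicalization, using a trimmed-down stand-in for the operator API's scheduler configuration (field and constant names are illustrative):

```go
package main

import "fmt"

// SchedulerConfiguration is a simplified stand-in for the operator API's
// scheduler configuration.
type SchedulerConfiguration struct {
	SchedulerAlgorithm string
}

const defaultAlgorithm = "binpack"

// canonicalize makes the implicit default explicit so API consumers see the
// effective value rather than "".
func (c *SchedulerConfiguration) canonicalize() {
	if c.SchedulerAlgorithm == "" {
		c.SchedulerAlgorithm = defaultAlgorithm
	}
}

func main() {
	cfg := &SchedulerConfiguration{} // value omitted by the user or left unset by an upgrade
	cfg.canonicalize()
	fmt.Println(cfg.SchedulerAlgorithm) // binpack
}
```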
Changing namespaces can be done anywhere in the app even though many
Nomad resources aren't namespace-sensitive (e.g., clients, plugins).
A user changing namespaces signals an intent to reset context: "now I
want to begin a task that relates to Namespace X". Where that task
begins used to always be the Jobs list, since it was the only
namespace-sensitive resource. Now with CSI Volumes, "square 1" is
Volumes if the namespace is changed from a storage page.
The `volumewatcher` has a 250ms batch window, so claim updates typically
won't be large enough to risk exceeding the maximum raft message
size. But large jobs might have enough volume claims that this could
be a danger. Set a maximum batch size of 100 claim updates per
batch (roughly 33KB) as a very conservative safety/robustness guard.
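Chunking the pending updates looks roughly like this; the cap and the apply callback stand in for building one raft message per batch:

```go
package main

import "fmt"

// maxBatchSize mirrors the conservative cap described above: at most 100
// claim updates per apply, well under the maximum raft message size.
const maxBatchSize = 100

// applyInBatches splits the pending claim updates into fixed-size chunks and
// hands each chunk to the apply function.
func applyInBatches(claims []string, apply func(batch []string)) {
	for start := 0; start < len(claims); start += maxBatchSize {
		end := start + maxBatchSize
		if end > len(claims) {
			end = len(claims)
		}
		apply(claims[start:end])
	}
}

func main() {
	claims := make([]string, 250)
	for i := range claims {
		claims[i] = fmt.Sprintf("claim-%d", i)
	}
	applyInBatches(claims, func(batch []string) {
		fmt.Printf("applying batch of %d claim updates\n", len(batch))
	})
	// Prints three batches: 100, 100, and 50 updates.
}
```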
Co-authored-by: Chris Baker <1675087+cgbaker@users.noreply.github.com>
An operator could submit a scale request including a negative count
value. This negative value caused the Nomad server to panic. The
fix adds validation of the submitted count, returning an error to
the caller if it is negative.
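A minimal sketch of that guard, using a simplified request type rather than the actual API struct:

```go
package main

import (
	"errors"
	"fmt"
)

// ScalingRequest is a simplified stand-in for the scaling API payload.
type ScalingRequest struct {
	Count *int64
}

// validate rejects a negative count up front so it never reaches code that
// would otherwise panic on it.
func (r *ScalingRequest) validate() error {
	if r.Count != nil && *r.Count < 0 {
		return errors.New("scaling count can't be negative")
	}
	return nil
}

func main() {
	count := int64(-3)
	req := &ScalingRequest{Count: &count}
	if err := req.validate(); err != nil {
		fmt.Println("request rejected:", err) // returned to the caller instead of panicking
	}
}
```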
A few connect examples reference a fake 'test/test' image.
By replacing those with `hashicorpnomad/counter-api` we can
turn them into runnable examples.
* volumewatcher: remove redundant log fields
The constructor for `volumeWatcher` already builds its logger with
`logger.With`, including the volume ID and namespace fields, so remove
them from the various trace logs (see the sketch after these notes).
* volumewatcher: advance state for controller already released
One way of bypassing client RPCs in testing is to set a claim status
to controller-detached, but this results in an incorrect claim state
when we checkpoint.
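For illustration, the `logger.With` pattern the first note refers to, using go-hclog (the logging library Nomad uses); running this sketch requires the `github.com/hashicorp/go-hclog` module, and the field values are just examples:

```go
package main

import hclog "github.com/hashicorp/go-hclog"

func main() {
	base := hclog.New(&hclog.LoggerOptions{
		Name:  "volumewatcher",
		Level: hclog.Trace,
	})

	// The constructor attaches the contextual fields once...
	logger := base.With("volume_id", "vol-1", "namespace", "default")

	// ...so individual trace logs don't need to repeat them.
	logger.Trace("releasing claim")
	logger.Trace("checkpointing claim state")
}
```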
Promote the Connect ACLs guide on the jobspec connect stanza docs
page. This was suggested in a ticket after someone got stuck not
realizing they needed to enable Consul Intentions for their
connect-enabled services, which is covered in the guide.
Replace the existing top example with something that is directly
runnable on a `-dev-connect` nomad setup.
Add the _complete_ `countdash` example at the bottom in the
examples section, so that people do not need to go guide-hunting
to find a complete example. The hope is people will see more
runnable examples and be less afraid of connect.
This change replaces the top example for expose path configuration with
two new runnable examples. Users should be able to copy and paste those
jobs into a job file and run them against a basic connect-enabled nomad
setup.
The example presented first demonstrates use of the service check expose
parameter with no dynamic port explicitly defined (new to 0.11.2).
This is expected to be the "90%" use case of users, and so we should
try to emphasise this pattern as best practice.
The example presented second demonstrates achieving the same goal as the
first example, but utilizing the full plumbing available through the
`connect.proxy.expose` stanza. This should help readers comprehend what
is happening "under the hood".
There was a missing edge case where a job is pending. I took the
opportunity to also refactor the code to use async/await, which cleaned
up the promise chaining.