When claiming a CSI volume, we need to ensure the CSI node plugin is running
before we send any CSI RPCs. This extends even to the controller publish RPC
because it requires the storage provider's "external node ID" for the
client. This primarily impacts client restarts but is also a problem if the node
plugin exits (and fingerprints) while the allocation that needs a CSI volume
claim is being placed.
Unfortunately there's no mapping of volume to plugin ID available in the
jobspec, so we don't have enough information to wait on plugins until we either
get the volume from the server or retrieve the plugin ID from data we've
persisted on the client.
If we always require getting the volume from the server before making the claim,
a client restart for disconnected clients will cause all the allocations that
need CSI volumes to fail. Even while connected, checking in with the server to
verify the volume's plugin before trying to make a claim RPC is inherently racy,
so we'll leave that case as-is: the claim will fail if the node plugin needed to
support a newly-placed allocation is flapping such that the node fingerprint is
changing.
This changeset persists a minimum subset of data about the volume and its plugin
in the client state DB, and retrieves that data during the CSI hook's prerun to
avoid re-claiming and remounting the volume unnecessarily.
This changeset also updates the RPC handler to use the external node ID from the
claim whenever it is available.
Fixes: #13028
In Consul 1.15.0, the Delete Token API was changed so as to return an error when
deleting a non-existent ACL token. This means that if Nomad successfully deletes
the token but fails to persist that fact, it will get stuck trying to delete a
non-existent token forever.
Update the token deletion function to ignore "not found" errors and treat them
as successful deletions.
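A minimal sketch of the idea, not the actual Nomad change: `deleteFunc`, `isNotFound`, `deleteToken`, and the sample errors are hypothetical names, and matching on the error text is an assumption made only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// deleteFunc is a hypothetical stand-in for the call that deletes an ACL
// token in Consul. It is not the real client API.
type deleteFunc func(accessorID string) error

// Sample errors used only by this sketch.
var (
	errSampleNotFound = errors.New("token not found")
	errSampleOther    = errors.New("connection refused")
)

// isNotFound reports whether err looks like a "not found" response.
// Matching on the message text is an assumption of this sketch.
func isNotFound(err error) bool {
	return err != nil && strings.Contains(strings.ToLower(err.Error()), "not found")
}

// deleteToken treats a "not found" error as a successful deletion, so a
// retry after a lost state update does not loop forever.
func deleteToken(del deleteFunc, accessorID string) error {
	err := del(accessorID)
	if err == nil || isNotFound(err) {
		return nil
	}
	return fmt.Errorf("failed to delete token %q: %w", accessorID, err)
}

func main() {
	alreadyGone := func(string) error { return errSampleNotFound }
	unreachable := func(string) error { return errSampleOther }

	fmt.Println("already deleted:", deleteToken(alreadyGone, "abc123"))
	fmt.Println("other failure:", deleteToken(unreachable, "abc123"))
}
```

Any other error is still returned, so genuine failures keep being retried.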
Fixes: #17833
* cni: ensure CNI addresses are set up in deterministic order
Currently, as commented in the code, the go-cni library returns an unordered map
of interfaces. When multiple CNI interfaces are created, this causes a problem
with service registration and health checking because the first address in the
map is used.
The use case we have where this is an issue is that we run CNI with the macvlan
plugin to isolate workloads, but they still need to be able to access the host on
a static address to be able to perform local resolving and hit host services like
the Consul agent API. To make this work there are two options: either add a
macvlan interface on the host with an assigned address for each VLAN you have, or
create an additional veth bridged interface in the container namespace.
We chose the latter option through a custom CNI plugin but the ordering issue
leaves us with incorrect service registration.
* Updates after feedback
* First check the length of the CNIResult interfaces; if it's zero we don't need to proceed
at all.
* Use the sorted interfaces list for the address fallback scenario as well.
* Remove the "found" log message logic. When an address isn't found, an error is returned stating
that the allocation could not be configured because an address was missing from the CNIResult. If we
still need a Warn message, we can add it to the condition that returns the error when no
address could be found instead of using the "found" bool logic.
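The ordering fix described above can be sketched roughly as follows. The `iface` type is a simplified, hypothetical stand-in for the go-cni result entries, and preferring interfaces with a sandbox before falling back to any interface is an assumption of this sketch, not a statement about the actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// iface is a simplified, hypothetical stand-in for a CNI result interface
// entry. The real go-cni types carry more fields.
type iface struct {
	Sandbox string   // non-empty when the interface lives in the container namespace
	IPs     []string // addresses assigned to the interface
}

// sortedNames returns the interface names in lexical order so that iteration
// no longer depends on Go's randomized map ordering.
func sortedNames(ifaces map[string]iface) []string {
	names := make([]string, 0, len(ifaces))
	for name := range ifaces {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

// firstAddress returns a deterministic choice of interface and address.
// It bails out early on an empty result, prefers sandbox interfaces (an
// assumption of this sketch), and uses the same sorted list for the fallback.
func firstAddress(ifaces map[string]iface) (string, string, error) {
	if len(ifaces) == 0 {
		return "", "", fmt.Errorf("no interfaces in CNI result")
	}
	names := sortedNames(ifaces)
	for _, n := range names {
		if i := ifaces[n]; i.Sandbox != "" && len(i.IPs) > 0 {
			return n, i.IPs[0], nil
		}
	}
	for _, n := range names {
		if i := ifaces[n]; len(i.IPs) > 0 {
			return n, i.IPs[0], nil
		}
	}
	return "", "", fmt.Errorf("no address found in CNI result")
}

func main() {
	result := map[string]iface{
		"eth1": {Sandbox: "/var/run/netns/example", IPs: []string{"10.0.0.2"}},
		"eth0": {Sandbox: "/var/run/netns/example", IPs: []string{"10.0.0.1"}},
	}
	name, ip, err := firstAddress(result)
	fmt.Println(name, ip, err)
}
```

Because the names are sorted before either pass, repeated calls on the same result pick the same interface.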
this is basically to avoid Fear/Uncertainty/Doubt
the github action actions/setup-go
(and, with a different cache key, hashicorp/setup-golang)
caches both GOMODCACHE (go source files), which is good,
and GOCACHE (build outputs), which *might* be bad,
if the cache was built on an OS with an older glibc
than we want to support. from `go help cache`:
> [...] the build cache does not detect changes to
> C libraries imported with cgo.
so in enterprise we can use Vault for secrets,
without merge conflicts from oss->ent.
also:
* use hashicorp/setup-golang
* setup-js for self-hosted runners
they don't come with yarn, nor chrome,
and might not always match node version.
Service discovery or mesh network systems consuming the Nomad event stream or API need to know the CNI assigned IP for the allocation. This data is returned by the underlying Nomad API but isn't mapped in the response struct.
* Text and code wrapping as a localStorage var
* task-log uses wrapping and kb shortcut
* Word wrap keyboard labels
* Wrapper as a toggle not a button
* Changelog and fixed an extra space trailing log lines
* Moves toggle to inside
* Acceptance tests for ww and toggle click
The requirements for client-to-server and client-to-client topologies are not
well-documented in the production install requirements docs. Document that
clients make connections to servers (and not the other way around), and that
clients don't need to communicate with each other (with some exceptions).
Fixes: #17631
Update the revision used by the docker action. This should always reflect the commit that's being built as this may differ from the default <github.sha> that the workflow was invoked at.
Goes with https://github.com/hashicorp/actions-docker-build/pull/59 and should not be merged until that PR is merged and a new version of the action is cut.
* drivers/docker: refactor use of clients in docker driver
This PR refactors how we manage the two underlying clients used by the
docker driver for communicating with the docker daemon. We keep two clients:
one with a hard-coded timeout that applies to all operations no matter
what, intended for use with short lived / async calls to docker. The other
has no timeout, and it is the responsibility of the caller to set a context
that will ensure the call eventually terminates.
The use of these two clients has been confusing and mistakes were made
in a number of places where calls were making use of the wrong client.
This PR makes it so that a user must explicitly call a function to get
the client that makes sense for that use case.
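One way to picture the explicit-accessor approach is the sketch below. The `client` and `driver` types and the method names `shortCallClient` and `longCallClient` are hypothetical and are not the names used in the driver.

```go
package main

import (
	"fmt"
	"time"
)

// client is a hypothetical stand-in for a docker API client.
type client struct {
	timeout time.Duration // zero means no client-side timeout
}

// driver keeps both clients unexported so callers cannot pick one up by
// accident; they must go through one of the accessors below.
type driver struct {
	withTimeout *client
	noTimeout   *client
}

// newDriver builds both clients; shortTimeout applies only to the first.
func newDriver(shortTimeout time.Duration) *driver {
	return &driver{
		withTimeout: &client{timeout: shortTimeout},
		noTimeout:   &client{},
	}
}

// shortCallClient returns the client with the hard-coded timeout, meant for
// short-lived or async calls.
func (d *driver) shortCallClient() *client { return d.withTimeout }

// longCallClient returns the client without a timeout. The caller is
// expected to pass a context that guarantees the call eventually ends.
func (d *driver) longCallClient() *client { return d.noTimeout }

func main() {
	d := newDriver(5 * time.Minute)
	fmt.Println("short-call timeout:", d.shortCallClient().timeout)
	fmt.Println("long-call timeout:", d.longCallClient().timeout)
}
```

Naming the accessor after the use case makes the choice of client visible at each call site.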
Fixes #17023
* cr: followup items
Given a deployment that has a `progress_deadline`, if a task group runs
out of reschedule attempts, allow it to fail at this time instead of
waiting until the `progress_deadline` is reached.
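As a rough illustration of the rule, assuming a simplified view of a task group's state (the `groupState` type, its fields, and `shouldFailDeployment` are hypothetical and do not mirror the scheduler's real data structures):

```go
package main

import "fmt"

// groupState is a hypothetical, simplified view of a task group within a
// deployment that has a progress_deadline.
type groupState struct {
	FailedAllocs    int  // allocations currently in a failed state
	AttemptsLeft    int  // remaining reschedule attempts
	DeadlineReached bool // whether the progress_deadline has passed
}

// shouldFailDeployment reports whether the deployment can be failed now.
// Besides the deadline, it fails as soon as a failed allocation has no
// reschedule attempts left, rather than waiting for the deadline.
func shouldFailDeployment(s groupState) bool {
	if s.DeadlineReached {
		return true
	}
	return s.FailedAllocs > 0 && s.AttemptsLeft == 0
}

func main() {
	exhausted := groupState{FailedAllocs: 1, AttemptsLeft: 0}
	retrying := groupState{FailedAllocs: 1, AttemptsLeft: 2}
	fmt.Println("attempts exhausted:", shouldFailDeployment(exhausted))
	fmt.Println("attempts remaining:", shouldFailDeployment(retrying))
}
```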
Fixes: #17260
* CSS alignment and spacing for job status panel
* Only fade the count, not the legend icon, when count is 0
* Unrounded version corners
* changelog
* css has to only remove border radius when count is present
* Seed stabilization for services test
* Try consolidating the testfixes from before
* Total test isolation and bonus logs
* Drop the isolation but keep the logs
* Remove bonus logging