When we unmount a volume we need to be able to recover from cases
where the plugin has been shut down before the allocation that needs
it, so in #11892 we blocked shutting down the alloc runner hook. But
this blocks client shutdown if we're in the middle of unmounting. The
client won't be able to communicate with the plugin or send the
unpublish RPC anyway, so we should cancel the context and assume that
we'll resume the unmounting process when the client restarts.
In `-dev` mode we don't call the graceful `Shutdown()` method and
instead destroy all the allocations. In this case, we'll never be able
to communicate with the plugin, but we also never cancel the context we
need to unblock the hook. To fix this, move the retries
into their own goroutine that doesn't block the main `Postrun`.
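A minimal sketch of that shape, with illustrative names (the real hook wires in the node RPCs, logging, and per-volume state):

```go
import (
	"context"
	"time"
)

// unmountWithRetry stands in for the node unpublish/unstage RPCs: it
// retries until success or until the shared shutdown context is
// canceled (client shutdown, or alloc destroy in -dev mode).
func unmountWithRetry(ctx context.Context, unmount func() error) error {
	for {
		if err := unmount(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // unmounting resumes when the client restarts
		case <-time.After(time.Second):
		}
	}
}

// postrun starts the retries in their own goroutines and returns
// immediately, so the hook can never block client shutdown.
func postrun(ctx context.Context, unmounts []func() error) error {
	for _, u := range unmounts {
		go unmountWithRetry(ctx, u)
	}
	return nil
}
```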
This is a test of upgrading from Nomad 0.8, which has long been
unsupported. The test is slow, flaky, and imports consul/sdk. Remove
it as it is no longer relevant.
In #12112 and #12113 we solved the problem of races in releasing
volume claims, but there was a case that we missed. During a node
drain with a controller attach/detach, we can hit a race where we call
controller publish before the unpublish has completed. This is
discouraged in the spec but plugins are supposed to handle it
safely. But if the storage provider's API is slow enough and the
plugin doesn't handle the case safely, the volume can get "locked"
into a state where the provider's API won't detach it cleanly.
Check the claim before making any external controller publish RPC
calls so that Nomad is responsible for the canonical information about
whether a volume is currently claimed.
This has a couple of side effects that also had to be fixed here:
* Changing the order means that the volume will have a past claim
without a valid external node ID because it came from the client, and
this uncovered a separate bug where we didn't assert the external node
ID was valid before returning it. Fall through to getting the ID from
the plugins in the state store in this case. We avoided this
originally because of concerns around plugins getting lost during node
drain but now that we've fixed that we may want to revisit it in
future work.
* We should make sure we're handling `FailedPrecondition` cases from
the controller plugin the same way we handle other retryable cases.
* Several tests had to be updated because they assumed failures
happened in a particular order that we no longer follow.
Resolves #12095 by WONTFIXing it.
This approach disables `writeToFile` as it allows arbitrary host
filesystem writes and is only a small quality of life improvement over
multiple `template` stanzas.
This approach has the significant downside of leaving people who have
altered their `template.function_denylist` *still vulnerable!* I added
an upgrade note, but we should have implemented the denylist as a
`map[string]bool` so that new funcs could be denied without overriding
custom configurations.
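A sketch of the map-based shape described above; the entries and names are illustrative, not what shipped:

```go
// Defaults are merged under user overrides, so a newly denied function
// takes effect even for operators who customized the denylist.
var defaultDenylist = map[string]bool{
	"plugin":      true,
	"writeToFile": true, // new denials land here for everyone
}

func mergedDenylist(userOverrides map[string]bool) map[string]bool {
	merged := make(map[string]bool, len(defaultDenylist)+len(userOverrides))
	for name, denied := range defaultDenylist {
		merged[name] = denied
	}
	// Users can re-enable (false) or deny (true) individual funcs
	// without replacing the whole default list.
	for name, denied := range userOverrides {
		merged[name] = denied
	}
	return merged
}
```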
This PR also fixes a bug that broke enabling all consul-template
funcs. We repeatedly failed to differentiate between a nil (unset)
denylist and an empty (allow all) one.
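The heart of that bug, in miniature (the helper and defaults are illustrative):

```go
// In Go a nil slice and an empty slice both have len 0, so a naive
// length check conflates "unset, apply the defaults" with "explicitly
// empty, allow every function".
func effectiveDenylist(configured []string) []string {
	if configured == nil {
		return []string{"plugin", "writeToFile"} // unset: use defaults
	}
	return configured // set but empty: the user allowed all funcs
}
```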
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroup. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.
Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Previously, Nomad made use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup hierarchy inheritance via shared/reserved
parents.
Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroup. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroup/unified (as is typical), Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.
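One common way to make that detection is to check the filesystem magic of the mount point; a sketch of the idea, not necessarily Nomad's exact check:

```go
import "golang.org/x/sys/unix"

// usesCgroupsV2 reports whether cgroup2 is the sole hierarchy by
// checking the filesystem type of the standard mount point.
func usesCgroupsV2() bool {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return false
	}
	// In hybrid mode /sys/fs/cgroup is a tmpfs and cgroup2 lives at
	// /sys/fs/cgroup/unified, so this correctly stays false there.
	return st.Type == unix.CGROUP2_SUPER_MAGIC
}
```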
When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client config), and create cgroups for tasks using
the naming convention <allocID>-<task>.scope. This follows the convention
set by systemd and is also used by Docker when cgroups v2 is detected.
Client nodes now export a new fingerprint attribute, unique.cgroups.version,
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.
The new cpuset management strategy fixes #11705, where Docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
* Use unix:// prefix for CSI_ENDPOINT variable by default
* Some plugins have strict validation over the format of the
`CSI_ENDPOINT` variable, and unfortunately not all plugins
agree. Allow the user to override `CSI_ENDPOINT` to work around
those cases (see the sketch below).
* Update all demos and tests with CSI_ENDPOINT
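A sketch of that default-plus-override behavior (helper name is illustrative):

```go
// csiEndpoint builds the value for the CSI_ENDPOINT environment
// variable: the unix:// scheme is prepended by default, but an explicit
// user-provided value wins for plugins that insist on another format.
func csiEndpoint(userOverride, socketPath string) string {
	if userOverride != "" {
		return userOverride // e.g. a bare path, for stricter plugins
	}
	return "unix://" + socketPath
}
```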
The service registration wrapper handles sending requests to
backend providers without the caller needing to know this
information. This will be used within the task and alloc runner
service hooks when performing service registration activities.
This commit performs refactoring to pull out common service
registration objects into a new `client/serviceregistration`
package. This new package will form the base point for all
client-specific service registration functionality.
The Consul-specific implementation is not moved, as it also
includes non-service registration implementations; this reduces
the blast radius of the changes as well.
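An illustrative sketch of the package's role; the interface and type names are assumptions rather than the package's exact API:

```go
package serviceregistration

// Workload is a simplified stand-in for the services a task or alloc
// runner hook wants registered.
type Workload struct {
	AllocID  string
	TaskName string
	Services []*Service
}

type Service struct {
	Name string
	Port int
}

// Handler is what the hooks program against: the wrapper routes each
// call to the configured backend provider (Consul today), so callers
// never need to know which provider is in play.
type Handler interface {
	RegisterWorkload(w *Workload) error
	RemoveWorkload(w *Workload) error
}
```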
The CSI `CreateVolume` RPC is idempotent given that the topology,
capabilities, and parameters are unchanged. CSI volumes have many
user-defined fields that are immutable once set, and many fields that
are not user-settable.
Update the `Register` RPC so that updating a volume via the API merges
onto any existing volume without touching Nomad-controlled fields,
while validating it with the same strict requirements expected for
idempotent `CreateVolume` RPCs.
Also, clarify that this state store method is used for everything, not just
for the `Register` RPC.
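A sketch of those merge semantics with simplified, made-up fields:

```go
import "fmt"

type CSIVolume struct {
	ID           string
	PluginID     string   // immutable once set
	Capabilities []string // user-settable (simplified)
	// Nomad-controlled claim/allocation state omitted for brevity
}

// mergeVolumeUpdate applies a Register update onto an existing volume:
// user-settable fields come from the update, Nomad-controlled fields
// are preserved, and immutable fields are validated as strictly as the
// idempotent CreateVolume path requires.
func mergeVolumeUpdate(existing, update *CSIVolume) (*CSIVolume, error) {
	if existing == nil {
		return update, nil // no existing volume: a plain create
	}
	if existing.PluginID != update.PluginID {
		return nil, fmt.Errorf("volume %s: plugin ID is immutable", existing.ID)
	}
	merged := *existing // keeps the Nomad-controlled fields intact
	merged.Capabilities = append([]string(nil), update.Capabilities...)
	return &merged, nil
}
```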
The Prestart hook for task runner hooks doesn't get called when we
restore a task, because the task is already running. The Postrun hook
for CSI plugin supervisors needs the socket path to have been
populated so that the client has a valid path.
When the alloc runner claims a volume, an allocation for a previous
version of the job may still have the volume claimed because it's
still shutting down. In this case we'll receive an error from the
server. Retry this error until we succeed or until a very long timeout
expires, to give operators a chance to recover broken plugins.
Make the alloc runner hook tolerant of temporary RPC failures.
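Sketched with illustrative durations and names, the retry loop looks roughly like this:

```go
import (
	"context"
	"fmt"
	"time"
)

// claimWithRetry retries transient "volume still claimed" errors from
// the server with capped exponential backoff, up to a deadline generous
// enough for an operator to repair a broken plugin.
func claimWithRetry(ctx context.Context, claim func() error) error {
	ctx, cancel := context.WithTimeout(ctx, 24*time.Hour) // "very long"
	defer cancel()

	backoff := time.Second
	for {
		err := claim()
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("abandoning volume claim: %w", err)
		case <-time.After(backoff):
			backoff *= 2
			if backoff > time.Minute {
				backoff = time.Minute // cap the backoff
			}
		}
	}
}
```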
The dynamic plugin registry assumes that plugins are singletons, which
matches the behavior of other Nomad plugins. But because dynamic
plugins like CSI are implemented by allocations, we need to handle the
possibility of multiple allocations for a given plugin type + ID, as
well as behaviors around interleaved allocation starts and stops.
Update the data structure for the dynamic registry so that more recent
allocations take over as the instance manager singleton, but we still
preserve the previous running allocations so that restores work
without racing.
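An illustrative shape for that structure (not Nomad's exact types):

```go
import "sync"

type pluginInstance struct {
	AllocID    string
	SocketPath string
}

// registry keys instances by plugin type and plugin ID. The slice is
// ordered oldest to newest; the last element is the active instance
// manager, while earlier ones are retained so restored allocations can
// still be found and torn down without racing.
type registry struct {
	mu      sync.Mutex
	plugins map[string]map[string][]*pluginInstance
}

func (r *registry) register(ptype, id string, inst *pluginInstance) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.plugins == nil {
		r.plugins = make(map[string]map[string][]*pluginInstance)
	}
	if r.plugins[ptype] == nil {
		r.plugins[ptype] = make(map[string][]*pluginInstance)
	}
	// The most recent registration takes over as the singleton the rest
	// of the client sees.
	r.plugins[ptype][id] = append(r.plugins[ptype][id], inst)
}
```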
Multiple allocations can run on a client for the same plugin, even if
only during updates. Provide each plugin task a unique path for the
control socket so that the tasks don't interfere with each other.
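For illustration, deriving the path from the allocation ID is enough to make it unique (helper name is hypothetical):

```go
import (
	"fmt"
	"path/filepath"
)

// pluginSocketPath gives each allocation of a plugin its own control
// socket, so concurrent allocations never clobber one another.
func pluginSocketPath(pluginDir, pluginID, allocID string) string {
	return filepath.Join(pluginDir, fmt.Sprintf("%s-%s", pluginID, allocID), "csi.sock")
}
```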
In PR #11892 we updated the `csi_hook` to unmount the volume locally
via the CSI node RPCs before releasing the claim from the server. The
timer for this hook was initialized with the retry time, forcing us to
wait 1s before making the first unmount RPC calls.
Use the new helper for timers to ensure we clean up the timer nicely.
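The pattern, sketched with illustrative names: start the timer at zero so the first attempt fires immediately, reset it to the retry interval afterward, and always stop it on the way out.

```go
import (
	"context"
	"time"
)

func unmountLoop(ctx context.Context, retry time.Duration, unmount func() error) error {
	t := time.NewTimer(0) // first attempt fires immediately, not after 1s
	defer t.Stop()        // clean up the timer on every exit path
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-t.C:
			if err := unmount(); err == nil {
				return nil
			}
			t.Reset(retry) // later attempts wait the retry interval
		}
	}
}
```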
Nomad inherited protocol version numbering configuration from Consul and
Serf, but unlike those projects Nomad has never used it. Nomad's
`protocol_version` has always been `1`.
While the code is effectively unused and therefore poses no runtime
risks to leave, I felt like removing it was best because:
1. Nomad's RPC subsystem has been able to evolve extensively without
needing to increment the version number.
2. Nomad's HTTP API has evolved extensively without incrementing
`API{Major,Minor}Version`. If we want to version the HTTP API in the
future, I doubt this is the mechanism we would choose.
3. The presence of the `server.protocol_version` configuration
parameter is confusing since `server.raft_protocol` *is* an important
parameter for operators to consider. Even more confusing is that
there is a distinct Serf protocol version which is included in `nomad
server members` output under the heading `Protocol`. `raft_protocol`
is the *only* protocol version relevant to Nomad developers and
operators. The other protocol versions are either dead code or have
never changed (Serf).
4. If we were to need to version the RPC, HTTP API, or Serf protocols, I
don't think these configuration parameters and variables are the best
choice. If we come to that point we should choose a versioning scheme
based on the use case and modern best practices -- not this 6+ year
old dead code.
This PR modifies the Consul CLI arguments used to bootstrap Envoy for
Connect sidecars to make use of '-proxy-id' instead of '-sidecar-for'.
Nomad registers the sidecar service, so we know what ID it has. The
'-sidecar-for' was intended for use when you only know the name of the
service for which the sidecar is being created.
The improvement here is that using '-proxy-id' does not require an underlying
request for listing Consul services. This will make the interaction
between Nomad and Consul more efficient.
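Sketched as an invocation (variable names are illustrative; the consul CLI flags are real):

```go
import "os/exec"

// bootstrapEnvoyCmd builds the new bootstrap invocation: Nomad passes
// the sidecar service ID it registered itself, so Consul no longer has
// to list services to resolve a '-sidecar-for' name.
func bootstrapEnvoyCmd(grpcAddr, sidecarServiceID string) *exec.Cmd {
	return exec.Command("consul", "connect", "envoy",
		"-grpc-addr", grpcAddr,
		"-proxy-id", sidecarServiceID, // was: "-sidecar-for", serviceName
		"-bootstrap",
	)
}
```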
Closes #10452
When Consul Connect just works, it's wonderful. When it doesn't work it
can be exceedingly difficult to debug: operators have to check task
events, Nomad logs, Consul logs, Consul APIs, and even then critical
information is missing.
Using Consul to generate a bootstrap config for Envoy is notoriously
difficult. Nomad doesn't even log stderr, so operators are left trying
to piece together what went wrong.
This patch attempts to provide *maximal* context which unfortunately
includes secrets. **Secrets are always restricted to the secrets/
directory.** This makes debugging a little harder, but allows operators
to know exactly what operation Nomad was trying to perform.
What's added:
- stderr is sent to alloc/logs/envoy_bootstrap.stderr.0
- the CLI is written to secrets/.envoy_bootstrap.cmd
- the environment is written to secrets/.envoy_bootstrap.env as JSON
Accessing this information is unfortunately awkward:
```
nomad alloc exec -task connect-proxy-count-countdash b36a cat secrets/.envoy_bootstrap.env
nomad alloc exec -task connect-proxy-count-countdash b36a cat secrets/.envoy_bootstrap.cmd
nomad alloc fs b36a alloc/logs/envoy_bootstrap.stderr.0
```
The above assumes an alloc id that starts with `b36a` and a Connect
sidecar proxy for a service named `count-countdash`.
If the alloc is unable to start successfully, the debugging files are
only accessible from the host filesystem.
Nomad communicates with CSI plugin tasks via gRPC. The plugin
supervisor hook uses this to ping the plugin for health checks which
it emits as task events. After the first successful health check the
plugin supervisor registers the plugin in the client's dynamic plugin
registry, which in turn creates a CSI plugin manager instance that has
its own gRPC client for fingerprinting the plugin and sending mount
requests.
If the plugin manager instance fails to connect to the plugin on its
first attempt, it exits. The plugin supervisor hook is unaware that
the connection failed so long as its own pings continue to work. A
transient failure during plugin startup may mislead the plugin
supervisor hook into thinking the plugin is up (so there's no need to
restart the allocation) but no fingerprinter is started.
* Refactor the gRPC client to connect on first use. This gives the
plugin manager instance the ability to retry the gRPC client
connection until it succeeds (see the sketch below).
* Add a 30s timeout to the plugin supervisor so that we don't poll
forever waiting for a plugin that will never come back up.
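A sketch of the connect-on-first-use idea, relying on gRPC's lazy dialing (details are illustrative):

```go
import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// newPluginConn returns a connection without blocking: grpc.Dial
// establishes (and re-establishes) the underlying connection in the
// background, so the plugin manager's RPCs can retry until the plugin
// socket comes up instead of exiting after one failed dial.
func newPluginConn(socketPath string) (*grpc.ClientConn, error) {
	return grpc.Dial(
		"unix://"+socketPath,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		// no grpc.WithBlock(): connect on first use, not at construction
	)
}
```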
Minor improvements:
* The plugin supervisor hook creates a new gRPC client for every probe
and then throws it away. Instead, reuse the client as we do for the
plugin manager.
* The gRPC client constructor has a 1 second timeout. Clarify that this
timeout applies to the connection and not the rest of the client
lifetime.
The CSI specification says:
> The CO SHALL provide the listen-address for the Plugin by way of the
`CSI_ENDPOINT` environment variable.
Note that plugins without filesystem isolation won't have the plugin
dir bind-mounted to their alloc dir, but we can provide a path to the
socket anyway.
Refactor to use an opts struct for the plugin supervisor hook config.
The parameter list for configuring the plugin supervisor hook has
grown enough that it makes sense to use an options struct similar to
many of the other task runner hooks (e.g. template).
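The pattern, with made-up field names:

```go
import (
	"log"
	"time"
)

// Bundling the parameters in a struct keeps call sites readable and
// lets new fields be added without touching every caller.
type supervisorHookOptions struct {
	socketPath    string
	probeInterval time.Duration
	probeTimeout  time.Duration
	logger        *log.Logger
}

type supervisorHook struct {
	opts supervisorHookOptions
}

func newSupervisorHook(opts supervisorHookOptions) *supervisorHook {
	return &supervisorHook{opts: opts}
}
```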
In certain task lifecycles the taskrunner service deregister call
could be made three times for a task that is exiting. Whilst
each hook caller of deregister has its own purpose, we should try
to ensure it is only called once during the shutdown lifecycle of
a task.
This change therefore tracks when deregister has been called, so
that subsequent calls are no-ops. In the event the task is
restarting, the deregister value is reset to ensure proper
operation.
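A minimal sketch of that guard (names are illustrative):

```go
import "sync"

type serviceHook struct {
	mu           sync.Mutex
	deregistered bool
}

// deregister is safe to call from multiple hooks: only the first call
// during a shutdown does the work, the rest are no-ops.
func (h *serviceHook) deregister(remove func()) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.deregistered {
		return
	}
	remove() // the actual provider deregistration call
	h.deregistered = true
}

// restart re-arms the guard so the next shutdown deregisters again.
func (h *serviceHook) restart() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.deregistered = false
}
```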
go-getter creates a circular dependency between a Client and Getter,
which means each is inherently thread-unsafe if you try to re-use
one or the other.
This PR fixes Nomad to no longer make use of the default Getter objects
provided by the go-getter package. Nomad must create a new Client object
on every artifact download, as the Client object controls the Src and Dst
among other things. When calling Client.Get, the Getter modifies its own
Client reference, creating the circular reference and race condition.
We can still achieve most of the desired connection caching behavior by
re-using a shared HTTP client with transport pooling enabled.
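A sketch of the resulting shape: one shared pooled HTTP client for connection reuse, and a fresh go-getter Client per download (getter wiring simplified):

```go
import (
	"context"
	"net/http"

	cleanhttp "github.com/hashicorp/go-cleanhttp"
	getter "github.com/hashicorp/go-getter"
)

// One pooled transport shared across downloads preserves connection
// caching even though the Client objects are throwaway.
var sharedHTTP = &http.Client{Transport: cleanhttp.DefaultPooledTransport()}

func download(ctx context.Context, src, dst string) error {
	httpGetter := &getter.HttpGetter{Client: sharedHTTP}
	// A new Client per artifact: Src and Dst are per-request state, and
	// Get causes the Getter to point back at its Client, so reusing
	// either across goroutines races.
	client := &getter.Client{
		Ctx:  ctx,
		Src:  src,
		Dst:  dst,
		Mode: getter.ClientModeAny,
		Getters: map[string]getter.Getter{
			"http":  httpGetter,
			"https": httpGetter,
		},
	}
	return client.Get()
}
```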