This PR adds support for the raw_exec driver on systems with only cgroups v2.
The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.
There is a bit of refactoring in this PR, but the fundamental design remains
the same.
Closes#12351#12348
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.
Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.
Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.
When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.
Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.
The new cpuset management strategy fixes#11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.htmlCloses#11289Fixes#11705#11773#11933
The default behavior for `docker.volumes.enabled` is intended to be `false`,
but the HCL schema defaults to `true` if the value is unset. Set the default
literal value to `true`.
Additionally, Docker driver mounts of type "volume" (but not "bind") are not
being properly sandboxed with that setting. Disable Docker mounts with type
"volume" entirely whenever the `docker.volumes.enabled` flag is set to
false. Note this is unrelated to the `volume_mount` feature, which is
constrained to preconfigured host volumes or whatever is mounted by a CSI
plugin.
This changeset includes updates to unit tests that should have been failing
under the documented behavior but were not.
Pulling large docker containers can take longer than the default
context timeout. Without a way to change this it is very hard for
users to utilise Nomad properly without hacky work arounds.
This change adds an optional pull_timeout config parameter which
gives operators the possibility to account for increase pull times
where needed. The infra docker image also has the option to set a
custom timeout to keep consistency.
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.
Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.
This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
This commit introduces support for configuring mount propagation when
mounting volumes with the `volume_mount` stanza on Linux targets.
Similar to Kubernetes, we expose 3 options for configuring mount
propagation:
- private, which is equivalent to `rprivate` on Linux, which does not allow the
container to see any new nested mounts after the chroot was created.
- host-to-task, which is equivalent to `rslave` on Linux, which allows new mounts
that have been created _outside of the container_ to be visible
inside the container after the chroot is created.
- bidirectional, which is equivalent to `rshared` on Linux, which allows both
the container to see new mounts created on the host, but
importantly _allows the container to create mounts that are
visible in other containers an don the host_
private and host-to-task are safe, but bidirectional mounts can be
dangerous, as if the code inside a container creates a mount, and does
not clean it up before tearing down the container, it can cause bad
things to happen inside the kernel.
To add a layer of safety here, we require that the user has ReadWrite
permissions on the volume before allowing bidirectional mounts, as a
defense in depth / validation case, although creating mounts should also require
a priviliged execution environment inside the container.
In Nomad 0.9, we made volume driver handling the same for `""`, and
`"local"` volumes. Prior to Nomad 0.9 however these had slightly different
behaviour for relative paths and named volumes.
Prior to 0.9 the empty string would expand relative paths within the task
dir, and `"local"` volumes that are not absolute paths would be treated
as docker named volumes.
This commit reverts to the previous behaviour as follows:
| Nomad Version | Driver | Volume Spec | Behaviour |
|-------------------------------------------------------------------------
| all | "" | testing:/testing | allocdir/testing |
| 0.8.7 | "local" | testing:/testing | "testing" as named volume |
| 0.9.0 | "local" | testing:/testing | allocdir/testing |
| 0.9.1 | "local" | testing:/testing | "testing" as named volume |
Fix AppVeyor failing builds, by moving docker image url test to run on unix
systems only. The used paused image is a linux image only, not
available on Windows.
Due to https://github.com/tsenart/deadcode/issues/3 we can't specify
these consts on their own. This moves them into the _platform_test.go
files to avoid creating a package that only exposes a couple of values.
* Docker for Windows does not support ulimits
* Use filepath.ToSlash to test workdir
* Convert expected mount paths to system style
* Skip security-opt test on windows
- Windows does not support seccomp, and it's unclear which options are
available.
* Skip StartN due to lack of sigint
* docker: Use api to get image info on windows
* No bridge on windows
* Stop hardcoding /bin/