Stop joining libcontainer executor process into the newly created task
container cgroup, to ensure that the cgroups are fully destroyed on
shutdown, and to make it consistent with other plugin processes.
Previously, executor process is added to the container cgroup so the
executor process resources get aggregated along with user processes in
our metric aggregation.
However, adding executor process to container cgroup adds some
complications with much benefits:
First, it complicates cleanup. We must ensure that the executor is
removed from container cgroup on shutdown. Though, we had a bug where
we missed removing it from the systemd cgroup. Because executor uses
`containerState.CgroupPaths` on launch, which includes systemd, but
`cgroups.GetAllSubsystems` which doesn't.
Second, it may have advese side-effects. When a user process is cpu
bound or uses too much memory, executor should remain functioning
without risk of being killed (by OOM killer) or throttled.
Third, it is inconsistent with other drivers and plugins. Logmon and
DockerLogger processes aren't in the task cgroups. Neither are
containerd processes, though it is equivalent to executor in
responsibility.
Fourth, in my experience when executor process moves cgroup while it's
running, the cgroup aggregation is odd. The cgroup
`memory.usage_in_bytes` doesn't seem to capture the full memory usage of
the executor process and becomes a red-harring when investigating memory
issues.
For all the reasons above, I opted to have executor remain in nomad
agent cgroup and we can revisit this when we have a better story for
plugin process cgroup management.
This commit introduces support for configuring mount propagation when
mounting volumes with the `volume_mount` stanza on Linux targets.
Similar to Kubernetes, we expose 3 options for configuring mount
propagation:
- private, which is equivalent to `rprivate` on Linux, which does not allow the
container to see any new nested mounts after the chroot was created.
- host-to-task, which is equivalent to `rslave` on Linux, which allows new mounts
that have been created _outside of the container_ to be visible
inside the container after the chroot is created.
- bidirectional, which is equivalent to `rshared` on Linux, which allows both
the container to see new mounts created on the host, but
importantly _allows the container to create mounts that are
visible in other containers an don the host_
private and host-to-task are safe, but bidirectional mounts can be
dangerous, as if the code inside a container creates a mount, and does
not clean it up before tearing down the container, it can cause bad
things to happen inside the kernel.
To add a layer of safety here, we require that the user has ReadWrite
permissions on the volume before allowing bidirectional mounts, as a
defense in depth / validation case, although creating mounts should also require
a priviliged execution environment inside the container.
Nomad 0.9 incidentally set effective capabilities that is higher than
what's expected of a `nobody` process, and what's set in 0.8.
This change restores the capabilities to ones used in Nomad 0.9.
Implements streamign exec handling in both executors (i.e. universal and
libcontainer).
For creation of TTY, some incidental complexity leaked in. The universal
executor uses github.com/kr/pty for creation of TTYs.
On the other hand, libcontainer expects a console socket and for libcontainer to
create the underlying console object on process start. The caller can then use
`libcontainer.utils.RecvFd()` to get tty master end.
I chose github.com/kr/pty for managing TTYs here. I tried
`github.com/containerd/console` package (which is already imported), but the
package did not work as expected on macOS.
Reverts hashicorp/nomad#5433
Apparently, channel communications can constitute Happens-Before even for proximate variables, so this syncing isn't necessary.
> _The closing of a channel happens before a receive that returns a zero value because the channel is closed._
https://golang.org/ref/mem#tmp_7
strings.Replace call with n=0 argument makes no sense
as it will do nothing. Probably -1 is intended.
Signed-off-by: Iskander Sharipov <quasilyte@gmail.com>
* CVE-2019-5736: Update libcontainer depedencies
Libcontainer is vulnerable to a runc container breakout, that was
reported as CVE-2019-5736[1]. Upgrading vendored libcontainer with the fix.
The runc changes are captured in 369b920277 .
[1] https://seclists.org/oss-sec/2019/q1/119
Track current memory usage, `memory.usage_in_bytes`, in addition to
`memory.max_memory_usage_in_bytes` and friends. This number is closer
what Docker reports.
Related to https://github.com/hashicorp/nomad/issues/5165 .
plugins/driver: update driver interface to support streaming stats
client/tr: use streaming stats api
TODO:
* how to handle errors and closed channel during stats streaming
* prevent tight loop if Stats(ctx) returns an error
drivers: update drivers TaskStats RPC to handle streaming results
executor: better error handling in stats rpc
docker: better control and error handling of stats rpc
driver: allow stats to return a recoverable error
This PR fixes various instances of plugins being launched without using
the parent loggers. This meant that logs would not all go to the same
output, break formatting etc.
Re-export the ResourceUsage structs in drivers package to avoid drivers
directly depending on the internal client/structs package directly.
I attempted moving the structs to drivers, but that caused some import
cycles that was a bit hard to disentagle. Alternatively, I added an
alias here that's sufficient for our purposes of avoiding external
drivers depend on internal packages, while allowing us to restructure
packages in future without breaking source compatibility.
We ultimately decided to provide a limited set of devices in exec/java
drivers instead of all of host ones. Pre-0.9, we made all host devices
available to exec tasks accidentally, yet most applications only use a
small subset, and this choice limits our ability to restrict/isolate GPU
and other devices.
Starting with 0.9, by default, we only provide the same subset of
devices Docker provides, and allow users to provide more devices as
needed on case-by-case basis.
This reverts commit 5805c64a9f1c3b409693493dfa30e7136b9f547b.
This reverts commit ff9a4a17e59388dcab067949e0664f645b2f5bcf.
Use a dedicated /dev mount so we can inject more devices if necessary,
and avoid allowing a container to contaminate host /dev.
Follow up to https://github.com/hashicorp/nomad/pull/5143 - and fixes master.
Restores pre-0.9 behavior, where Nomad makes /dev available to exec
task. Switching to libcontainer, we accidentally made only a small
subset available.
Here, we err on the side of preserving behavior of 0.8, instead of going
for the sensible route, where only a reasonable subset of devices is
mounted by default and user can opt to request more.
* master: (71 commits)
Fix output of 'nomad deployment fail' with no arg
Always create a running allocation when testing task state
tests: ensure exec tests pass valid task resources (#4992)
some changes for more idiomatic code
fix iops related tests
fixed bug in loop delay
gofmt
improved code for readability
client: updateAlloc release lock after read
fixup! device attributes in `nomad node status -verbose`
drivers/exec: support device binds and mounts
fix iops bug and increase test matrix coverage
tests: tag image explicitly
changelog
ci: install lxc-templates explicitly
tests: skip checking rdma cgroup
ci: use Ubuntu 16.04 (Xenial) in TravisCI
client: update driver info on new fingerprint
drivers/docker: enforce volumes.enabled (#4983)
client: Style: use fluent style for building loggers
...