Stop joining libcontainer executor process into the newly created task
container cgroup, to ensure that the cgroups are fully destroyed on
shutdown, and to make it consistent with other plugin processes.
Previously, executor process is added to the container cgroup so the
executor process resources get aggregated along with user processes in
our metric aggregation.
However, adding executor process to container cgroup adds some
complications with much benefits:
First, it complicates cleanup. We must ensure that the executor is
removed from container cgroup on shutdown. Though, we had a bug where
we missed removing it from the systemd cgroup. Because executor uses
`containerState.CgroupPaths` on launch, which includes systemd, but
`cgroups.GetAllSubsystems` which doesn't.
Second, it may have advese side-effects. When a user process is cpu
bound or uses too much memory, executor should remain functioning
without risk of being killed (by OOM killer) or throttled.
Third, it is inconsistent with other drivers and plugins. Logmon and
DockerLogger processes aren't in the task cgroups. Neither are
containerd processes, though it is equivalent to executor in
responsibility.
Fourth, in my experience when executor process moves cgroup while it's
running, the cgroup aggregation is odd. The cgroup
`memory.usage_in_bytes` doesn't seem to capture the full memory usage of
the executor process and becomes a red-harring when investigating memory
issues.
For all the reasons above, I opted to have executor remain in nomad
agent cgroup and we can revisit this when we have a better story for
plugin process cgroup management.
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.
Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.
This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
Operators commonly have docker logs aggregated using various tools and
don't need nomad to manage their docker logs. Worse, Nomad uses a
somewhat heavy docker api call to collect them and it seems to cause
problems when a client runs hundreds of log collections.
Here we add a knob to disable log aggregation completely for nomad.
When log collection is disabled, we avoid running logmon and
docker_logger for the docker tasks in this implementation.
The downside here is once disabled, `nomad logs ...` commands and API
no longer return logs and operators must corrolate alloc-ids with their
aggregated log info.
This is meant as a stop gap measure. Ideally, we'd follow up with at
least two changes:
First, we should optimize behavior when we can such that operators don't
need to disable docker log collection. Potentially by reverting to
using pre-0.9 syslog aggregation in linux environments, though with
different trade-offs.
Second, when/if logs are disabled, nomad logs endpoints should lookup
docker logs api on demand. This ensures that the cost of log collection
is paid sparingly.
Looks like the RecoverTask doesn't set taskHandle.logger field causing
a panic when the handle attempts to log (e.g. when Shutdown or Signaling
fails).
When a job has a task group network, this log line ends up being
misleading if you're trying to debug networking issues. We really only
care about this when there's no port map set, in which case we get the
error returned anyways.
driver.SetConfig is not appropriate for starting up reconciler
goroutine. Some ephemeral driver instances are created for validating
config and we ought not to side-effecting goroutines for those.
We currently lack a lifecycle hook to inject these, so I picked the
`Fingerprinter` function for now, and reconciler should only run after
fingerprinter started.
Use `sync.Once` to ensure that we only start reconciler loop once.
When running at scale, it's possible that Docker Engine starts
containers successfully but gets wedged in a way where API call fails.
The Docker Engine may remain unavailable for arbitrary long time.
Here, we introduce a periodic reconcilation process that ensures that any
container started by nomad is tracked, and killed if is running
unexpectedly.
Basically, the periodic job inspects any container that isn't tracked in
its handlers. A creation grace period is used to prevent killing newly
created containers that aren't registered yet.
Also, we aim to avoid killing unrelated containters started by host or
through raw_exec drivers. The logic is to pattern against containers
environment variables and mounts to infer if they are an alloc docker
container.
Lastly, the periodic job can be disabled to avoid any interference if
need be.
This commit introduces support for configuring mount propagation when
mounting volumes with the `volume_mount` stanza on Linux targets.
Similar to Kubernetes, we expose 3 options for configuring mount
propagation:
- private, which is equivalent to `rprivate` on Linux, which does not allow the
container to see any new nested mounts after the chroot was created.
- host-to-task, which is equivalent to `rslave` on Linux, which allows new mounts
that have been created _outside of the container_ to be visible
inside the container after the chroot is created.
- bidirectional, which is equivalent to `rshared` on Linux, which allows both
the container to see new mounts created on the host, but
importantly _allows the container to create mounts that are
visible in other containers an don the host_
private and host-to-task are safe, but bidirectional mounts can be
dangerous, as if the code inside a container creates a mount, and does
not clean it up before tearing down the container, it can cause bad
things to happen inside the kernel.
To add a layer of safety here, we require that the user has ReadWrite
permissions on the volume before allowing bidirectional mounts, as a
defense in depth / validation case, although creating mounts should also require
a priviliged execution environment inside the container.
Without passing the network isolation configuration to the executor,
java tasks are not placed in the same network namespace as the other
processes in their task group, which breaks Consul Connect.
The docker creation API calls may fail with http errors (e.g. timeout)
even if container was successfully created.
Here, we force remove container if we got unexpected failure. We
already do this in some error handlers, and this commit updates all
paths.
I stopped short from a more aggressive refactoring, as the code is ripe
for refactoring and would rather do that in another PR.
This handles a bug where we may start a container successfully, yet we
fail due to retries and startContainer not being idempotent call.
Here, we ensure that when starting a container fails with 500 error,
the retry succeeds if container was started successfully.
hclspec.NewLiteral does not quote its values, which caused `3m` to be
parsed as a nonsensical literal which broke the plugin loader during
initialization. By quoting the value here, it starts correctly.
Using the API as provided from the `mounts` package imposes validation
on the `src:dest` which shouldn't be performed at this time. To workaround
that lift the internal code from that library required to only perform
the split.
Currently, nomad "plugin" processes (e.g. executor, logmon, docker_logger) are started as CLI
commands to be handled by command CLI framework. Plugin launchers use
`discover.NomadBinary()` to identify the binary and start it.
This has few downsides: The trivial one is that when running tests, one
must re-compile the nomad binary as the tests need to invoke the nomad
executable to start plugin. This is frequently overlooked, resulting in
puzzlement.
The more significant issue with `executor` in particular is in relation
to external driver:
* Plugin must identify the path of invoking nomad binary, which is not
trivial; `discvoer.NomadBinary()` now returns the path to the plugin
rather than to nomad, preventing external drivers from launching
executors.
* The external driver may get a different version of executor than it
expects (specially if we make a binary incompatible change in future).
This commit addresses both downside by having the plugin invocation
handling through an `init()` call, similar to how libcontainer init
handler is done in [1] and recommened by libcontainer [2]. `init()`
will be invoked and handled properly in tests and external drivers.
For external drivers, this change will cause external drivers to launch
the executor that's compiled against.
There a are a couple of downsides to this approach:
* These specific packages (i.e executor, logmon, and dockerlog) need to
be careful in use of `init()`, package initializers. Must avoid having
command execution rely on any other init in the package. I prefixed
files with `z_` (golang processes files in lexical order), but ensured
we don't depend on order.
* The command handling is spread in multiple packages making it a bit
less obvious how plugin starts are handled.
[1] drivers/shared/executor/libcontainer_nsenter_linux.go
[2] eb4aeed24f/libcontainer (using-libcontainer)
Nomad 0.9 incidentally set effective capabilities that is higher than
what's expected of a `nobody` process, and what's set in 0.8.
This change restores the capabilities to ones used in Nomad 0.9.
Implements streamign exec handling in both executors (i.e. universal and
libcontainer).
For creation of TTY, some incidental complexity leaked in. The universal
executor uses github.com/kr/pty for creation of TTYs.
On the other hand, libcontainer expects a console socket and for libcontainer to
create the underlying console object on process start. The caller can then use
`libcontainer.utils.RecvFd()` to get tty master end.
I chose github.com/kr/pty for managing TTYs here. I tried
`github.com/containerd/console` package (which is already imported), but the
package did not work as expected on macOS.
Support Docker `volumes` field in Windows. Previously, volumes parser
assumed some Unix-ism (e.g. didn't expect `:` in mount paths).
Here, we use the Docker parser to identify host and container paths.
Docker parsers use different validation logic from our previous unix
implementation: Docker parser accepts single path as a volume entry
(parsing it as a container path with auto-created volume) and enforces
additional checks (e.g. validity of mode). Thereforce, I opted to use
Docker parser only for Windows, and keep Nomad's linux parser to
preserve current behavior.
In Nomad 0.9, we made volume driver handling the same for `""`, and
`"local"` volumes. Prior to Nomad 0.9 however these had slightly different
behaviour for relative paths and named volumes.
Prior to 0.9 the empty string would expand relative paths within the task
dir, and `"local"` volumes that are not absolute paths would be treated
as docker named volumes.
This commit reverts to the previous behaviour as follows:
| Nomad Version | Driver | Volume Spec | Behaviour |
|-------------------------------------------------------------------------
| all | "" | testing:/testing | allocdir/testing |
| 0.8.7 | "local" | testing:/testing | "testing" as named volume |
| 0.9.0 | "local" | testing:/testing | allocdir/testing |
| 0.9.1 | "local" | testing:/testing | "testing" as named volume |
nvidia library use of dynamic library seems to conflict with alpine and
musl based OSes. This adds a `nonvidia` tag to allow compiling nomad
for alpine images.
The nomad releases currently only support glibc based OS environments,
so we default to compiling with nvidia.
Fix AppVeyor failing builds, by moving docker image url test to run on unix
systems only. The used paused image is a linux image only, not
available on Windows.
destCh was being written to by one goroutine and closed by another
goroutine. This panic occurred in Travis:
```
=== FAIL: drivers/docker TestDockerCoordinator_ConcurrentPulls (117.66s)
=== PAUSE TestDockerCoordinator_ConcurrentPulls
=== CONT TestDockerCoordinator_ConcurrentPulls
panic: send on closed channel
goroutine 5358 [running]:
github.com/hashicorp/nomad/drivers/docker.dockerStatsCollector(0xc0003a4a20, 0xc0003a49c0, 0x3b9aca00)
/home/travis/gopath/src/github.com/hashicorp/nomad/drivers/docker/stats.go:108 +0x167
created by
github.com/hashicorp/nomad/drivers/docker.TestDriver_DockerStatsCollector
/home/travis/gopath/src/github.com/hashicorp/nomad/drivers/docker/stats_test.go:33 +0x1ab
```
The 2 ways to fix this kind of error are to either (1) add extra
coordination around multiple goroutines writing to a chan or (2) make it
so only one goroutines writes to a chan.
I implemented (2) first as it's simpler, but @notnoop pointed out since
the same destCh in reused in the stats loop there's now a double close
panic possible!
So this implements (1) by adding a *usageSender struct for handling
concurrent senders and closing.