On Windows the executor returns an error when trying to open the `NUL` device
when we pass it `os.DevNull` for the stdout/stderr paths. Instead of opening the
device, use the discard pipe so that we have platform-specific behavior from the
executor itself.
Fixes: #17148
The `DisableLogCollection` capability was introduced as an experimental
interface for the Docker driver in 0.10.4. The interface has been stable and
allowing third-party task drivers the same capability would be useful for those
drivers that don't need the additional overhead of logmon.
This PR only makes the capability public. It doesn't yet add it to the
configuration options for the other internal drivers.
Fixes: #14636#15686
Tools like `nomad-nodesim` are unable to implement a minimal implementation of
an allocrunner so that we can test the client communication without having to
lug around the entire allocrunner/taskrunner code base. The allocrunner was
implemented with an interface specifically for this purpose, but there were
circular imports that made it challenging to use in practice.
Move the AllocRunner interface into an inner package and provide a factory
function type. Provide a minimal test that exercises the new function so that
consumers have some idea of what the minimum implementation required is.
While working on client status update improvements, I encountered problems
getting tests with the mock driver to correctly restore.
Unlike typical drivers the mock driver doesn't have an external source of truth
for whether the task is running (ex. making API calls to `dockerd` or looking
for a running PID), and so in order to make up that information, it re-parses
the original task config. But the taskrunner doesn't call the encoding step for
`RecoverTask`, only `StartTask`, so the task config the mock driver gets is
missing data.
Update the mock driver to stash the "external" state in the task state that
we'll get from the task runner, so that we don't have to try to recover from the
original `TaskConfig` anymore. This should bring the mock driver closer to the
behavior of the other drivers.
Some Nomad users ship application logs out-of-band via syslog. For these users
having `logmon` (and `docker_logger`) running is unnecessary overhead. Allow
disabling the logmon and pointing the task's stdout/stderr to /dev/null.
This changeset is the first of several incremental improvements to log
collection short of full-on logging plugins. The next step will likely be to
extend the internal-only task driver configuration so that cluster
administrators can turn off log collection for the entire driver.
---
Fixes: #11175
Co-authored-by: Thomas Weber <towe75@googlemail.com>
* [no ci] deps: update docker to 23.0.3
This PR brings our docker/docker dependency (which is hosted at github.com/moby/moby)
up to 23.0.3 (forward about 2 years). Refactored our use of docker/libnetwork to
reference the package in its new home, which is docker/docker/libnetwork (it is
no longer an independent repository). Some minor nearby test case cleanup as well.
* add cl
When we added recovery of pause containers in #16352 we called the recovery
function from the plugin factory function. But in our plugin setup protocol, a
plugin isn't ready for use until we call `SetConfig`. This meant that
recovering pause containers was always done with the default
config. Setting up the Docker client only happens once, so setting the wrong
config in the recovery function also means that all other Docker API calls will
use the default config.
Move the `recoveryPauseContainers` call into the `SetConfig`. Fix the error
handling so that we return any error but also don't log when the context is
canceled, which happens twice during normal startup as we fingerprint the
driver.
Currently, the `exec` driver is only setting the Bounding set, which is
not sufficient to actually enable the requisite capabilities for the
task process. In order for the capabilities to survive `execve`
performed by libcontainer, the `Permitted`, `Inheritable`, and `Ambient`
sets must also be set.
Per CAPABILITIES (7):
> Ambient: This is a set of capabilities that are preserved across an
> execve(2) of a program that is not privileged. The ambient capability
> set obeys the invariant that no capability can ever be ambient if it
> is not both permitted and inheritable.
* client/fingerprint: correctly fingerprint E/P cores of Apple Silicon chips
This PR adds detection of asymetric core types (Power & Efficiency) (P/E)
when running on M1/M2 Apple Silicon CPUs. This functionality is provided
by shoenig/go-m1cpu which makes use of the Apple IOKit framework to read
undocumented registers containing CPU performance data. Currently working
on getting that functionality merged upstream into gopsutil, but gopsutil
would still not support detecting P vs E cores like this PR does.
Also refactors the CPUFingerprinter code to handle the mixed core
types, now setting power vs efficiency cpu attributes.
For now the scheduler is still unaware of mixed core types - on Apple
platforms tasks cannot reserve cores anyway so it doesn't matter, but
at least now the total CPU shares available will be correct.
Future work should include adding support for detecting P/E cores on
the latest and upcoming Intel chips, where computation of total cpu shares
is currently incorrect. For that, we should also include updating the
scheduler to be core-type aware, so that tasks of resources.cores on Linux
platforms can be assigned the correct number of CPU shares for the core
type(s) they have been assigned.
node attributes before
cpu.arch = arm64
cpu.modelname = Apple M2 Pro
cpu.numcores = 12
cpu.reservablecores = 0
cpu.totalcompute = 1000
node attributes after
cpu.arch = arm64
cpu.frequency.efficiency = 2424
cpu.frequency.power = 3504
cpu.modelname = Apple M2 Pro
cpu.numcores.efficiency = 4
cpu.numcores.power = 8
cpu.reservablecores = 0
cpu.totalcompute = 37728
* fingerprint/cpu: follow up cr items
* Update ioutil library references to os and io respectively for drivers package
No user facing changes so I assume no change log is required
* Fix failing tests
* docker: disable driver when running as non-root on cgroups v2 hosts
This PR modifies the docker driver to behave like exec when being run
as a non-root user on a host machine with cgroups v2 enabled. Because
of how cpu resources are managed by the Nomad client, the nomad agent
must be run as root to manage docker-created cgroups.
* cl: update cl
* docker: set force=true on remove image to handle images referenced by multiple tags
This PR changes our call of docker client RemoveImage() to RemoveImageExtended with
the Force=true option set. This fixes a bug where an image referenced by more than
one tag could never be garbage collected by Nomad. The Force option only applies to
stopped containers; it does not affect running workloads.
* docker: add note about image_delay and multiple tags
* Ensure infra_image gets proper label used for reconciliation
Currently infra containers are not cleaned up as part of the dangling container
cleanup routine. The reason is that Nomad checks if a container is a Nomad owned
container by verifying the existence of the: `com.hashicorp.nomad.alloc_id` label.
Ensure we set this label on the infra container as well.
* fix unit test
* changelog: add entry
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
This PR modifies the configuration of the networking pause contaier to include
the "unless-stopped" restart policy. The pause container should always be
restored into a running state until Nomad itself issues a stop command for the
container.
This is not a _perfect_ fix for #12216 but it should cover the 99% use case -
where a pause container gets accidently stopped / killed for some reason. There
is still a possibility where the pause container and main task container are
stopped and started in the order where the bad behavior persists, but this is
fundamentally unavoidable due to how docker itself abstracts and manages the
underlying network namespace referenced by the containers.
Closes#12216
The exec driver and other drivers derived from the shared executor check the
path of the command before handing off to libcontainer to ensure that the
command doesn't escape the sandbox. But we don't check any host volume mounts,
which should be safe to use as a source for executables if we're letting the
user mount them to the container in the first place.
Check the mount config to verify the executable lives in the mount's host path,
but then return an absolute path within the mount's task path so that we can hand
that off to libcontainer to run.
Includes a good bit of refactoring here because the anchoring of the final task
path has different code paths for inside the task dir vs inside a mount. But
I've fleshed out the test coverage of this a good bit to ensure we haven't
created any regressions in the process.
* client: protect user lookups with global lock
This PR updates Nomad client to always do user lookups while holding
a global process lock. This is to prevent concurrency unsafe implementations
of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo).
* cl: add cl
* test: don't use loop vars in goroutines
fixes a data race in the test
* test: copy objects in statestore before mutating
fixes data race in test
* test: @lgfa29's segmgrep rule for loops/goroutines
Found 2 places where we were improperly using loop variables inside
goroutines.
Log lines which include an error should use the full term "error"
as the context key. This provides consistency across the codebase
and avoids a Go style which operators might not be aware of.
Neither the `os.Setenv` nor `t.Setenv` helper are safe to use in parallel tests
because environment variables are process-global. The stdlib panics if you try
to do this. Remove the `ci.Parallel()` call from all tests where we're setting
environment variables.
The QEMU driver can take an optional `graceful_shutdown` configuration
which will create a Unix socket to send ACPI shutdown signal to the VM.
Unix sockets have a hard length limit and the driver implementation
assumed that QEMU versions 2.10.1 were able to handle longer paths. This
is not correct, the linked QEMU fix only changed the behaviour from
silently truncating longer socket paths to throwing an error.
By validating the socket path before starting the QEMU machine we can
provide users a more actionable and meaningful error message, and by
using a shorter socket file name we leave a bit more room for
user-defined values in the path, such as the task name.
The maximum length allowed is also platform-dependant, so validation
needs to be different for each OS.
When a QEMU task is recovered the monitor socket path was not being
restored into the task handler, so the `graceful_shutdown` configuration
was effectively ignored if the client restarted.
OOM detection under cgroups v2 is flaky under versions of `containerd` before
v1.6.3, but our `containerd` dependency is transitive on `moby/moby`, who have
not yet updated. Disable this test for cgroups v2 environments until we can
update the dependency chain.
The `golang.org/x/net/context` package was merged into the stdlib as of go
1.7. Update the imports to use the identical stdlib version. Clean up import
blocks for the impacted files to remove unnecessary package aliasing.
Closes#12927Closes#12958
This PR updates the version of redis used in our examples from 3.2 to 7.
The old version is very not supported anymore, and we should be setting
a good example by using a supported version.
The long-form example job is now fixed so that the service stanza uses
nomad as the service discovery provider, and so now the job runs without
a requirement of having Consul running and configured.
* test: use `T.TempDir` to create temporary test directory
This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix TestLogmon_Start_restart on Windows
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestConsul_Integration
t.TempDir fails to perform the cleanup properly because the folder is
still in use
testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This PR modifies raw_exec and exec to ensure the cgroup for a task
they are driving still exists during a task restart. These drivers
have the same bug but with different root cause.
For raw_exec, we were removing the cgroup in 2 places - the cpuset
manager, and in the unix containment implementation (the thing that
uses freezer cgroup to clean house). During a task restart, the
containment would remove the cgroup, and when the task runner hooks
went to start again would block on waiting for the cgroup to exist,
which will never happen, because it gets created by the cpuset manager
which only runs as an alloc pre-start hook. The fix here is to simply
not delete the cgroup in the containment implementation; killing the
PIDs is enough. The removal happens in the cpuset manager later anyway.
For exec, it's the same idea, except DestroyTask is called on task
failure, which in turn calls into libcontainer, which in turn deletes
the cgroup. In this case we do not have control over the deletion of
the cgroup, so instead we hack the cgroup back into life after the
call to DestroyTask.
All of this only applies to cgroups v2.
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
When shutting down an allocation that ends up needing to be
force-killed, we're getting a spurious "OOM Killed (137)" message on
the task termination event. We introduced this as part of cgroups v2
support because the Docker daemon isn't detecting the container status
correctly. Although exit code 137 is the exit code we get for
OOM-killed processes, that's because OOM kill is a `SIGKILL`. So any
sigkilled process will get that exit code.
This test checks for behavior when asking for logs of a docker task
configured with a log driver that does not support streaming logs.
Previously this was using the 'gelf' log driver, but it seems that no
longer returns an error as expected. Instead we can just use the 'none'
log driver, which has the desired effect
2022-04-19T10:23:19.129-0500 [ERROR] docklog/docker_logger.go:133: log streaming ended with terminal error: error="API error (501): configured logging driver does not support reading"