Currently the docker driver does not remove tasks from its state map
when destroying the task, which leads to issues when restarting tasks in
place, and leaks expired handles over time.
The environment variables needed for envoking `rkt` command line
should include host PATH (to access `iptables`).
Given that the command runs outside the VM, untrusted task environment
variables should NOT be honored here.
We do this already with `rkt`, but the change is quite subtle to miss
when refactoring.
We already have two other Kill tests (e.g.
TestDockerDriver_Start_Kill_Wait and
TestDockerDriver_Start_KillTimeout), so don't need yet another flaky
test.
Noticed an issue in Docker daemon failing to handle the OOM test case
failure in build https://travis-ci.org/hashicorp/nomad/jobs/468027848 ,
and I suspect it's related to the process dying so quickly, and
potentially the way we are starting the task, so added a start up delay
and made it more consistent with other tests that don't seem as flaky.
The following is the log line showing Docker returning 500 error condition; while we can probably handle it gracefully without retrying, the retry is very cheap in this case and it's more of an optimization that we can handle in follow up PR.
```
testlog.go:32: 2018-12-14T14:57:52.626Z [DEBUG] docker/driver.go:852: docker: setting container startup command: task_name=nc-demo command="/bin/nc -l 127.0.0.1 -p 0"
testlog.go:32: 2018-12-14T14:57:52.626Z [DEBUG] docker/driver.go:866: docker: setting container name: task_name=nc-demo container_name=724a3e77-8b15-e657-f6aa-84c2d3243b18
testlog.go:32: 2018-12-14T14:57:52.694Z [INFO ] docker/driver.go:196: docker: created container: container_id=362b6ea183f3c4ce472d7d7571ca47023cea1df0f5eb920827921716f17718be
testlog.go:32: 2018-12-14T14:57:53.523Z [DEBUG] docker/driver.go:416: docker: failed to start container: container_id=362b6ea183f3c4ce472d7d7571ca47023cea1df0f5eb920827921716f17718be attempt=1 error="API error (500): {"message":"cannot start a stopped process: unknown"}
"
testlog.go:32: 2018-12-14T14:57:55.394Z [DEBUG] docker/driver.go:416: docker: failed to start container: container_id=362b6ea183f3c4ce472d7d7571ca47023cea1df0f5eb920827921716f17718be attempt=2 error="API error (500): {"message":"cannot start a stopped process: unknown"}
"
testlog.go:32: 2018-12-14T14:57:57.243Z [DEBUG] docker/driver.go:416: docker: failed to start container: container_id=362b6ea183f3c4ce472d7d7571ca47023cea1df0f5eb920827921716f17718be attempt=3 error="API error (500): {"message":"cannot start a stopped process: unknown"}
"
```
Using `:latest` tag is typically a cause of pain, as underlying image
changes behavior. Here, I'm switching to using a point release, and
re-updating the stored tarballs with it.
Sadly, when saving/loading images, the repo digeset is not supported:
https://github.com/moby/moby/issues/22011 ; but using point releases
should mitigate the problem.
The motivation here is that docker tests have some flakiness due to
accidental importing of `busybox:latest` which has `/bin/nc` that no
longer supports `-p 0`:
```
$ docker run -it --rm busybox /bin/nc -l 127.0.0.1 -p 0
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
Digest: sha256:2a03a6059f21e150ae84b0973863609494aad70f0a80eaeb64bddd8d92465812
Status: Downloaded newer image for busybox:latest
nc: bad local port '0'
```
Looks like older busybox versions (e.g. `busybox:1.24` do honor `-p 0`
as the test expect, but I would rather update busybox to fix.
* master: (71 commits)
Fix output of 'nomad deployment fail' with no arg
Always create a running allocation when testing task state
tests: ensure exec tests pass valid task resources (#4992)
some changes for more idiomatic code
fix iops related tests
fixed bug in loop delay
gofmt
improved code for readability
client: updateAlloc release lock after read
fixup! device attributes in `nomad node status -verbose`
drivers/exec: support device binds and mounts
fix iops bug and increase test matrix coverage
tests: tag image explicitly
changelog
ci: install lxc-templates explicitly
tests: skip checking rdma cgroup
ci: use Ubuntu 16.04 (Xenial) in TravisCI
client: update driver info on new fingerprint
drivers/docker: enforce volumes.enabled (#4983)
client: Style: use fluent style for building loggers
...
Prior to 97f33bb1537d04905cb84199672bcdf46ebb4e65, executor cgroup validation errors were
silently ignored. Enforcing them reveals test cases that missed them.
This doesn't change customer facing contract, as resource struct is
is either configured or we default to 100 (much higher than 2).
Noticed few places where tests seem to block indefinitely and panic
after the test run reaches the test package timeout.
I intend to follow up with the proper fix later, but timing out is much
better than indefinitely blocking.
Update rawexec and rkt stop/kill tests with the patterns introduced in
7a49e9b68e519050a0c2ef0b67c33503bfbc51be. This implementation should be
more resilient to discrepancy between task stopping and task being marked as exited.
Using statically linked busybox binary to setup a basic rootfs for
testing, by symlinking it to provide the basic commands used in tests.
I considered using a proper rootfs tarball, but the overhead of managing
tarfile and expanding it seems significant enough that I went with this
implementation.
IOPS have been modelled as a resource since Nomad 0.1 but has never
actually been detected and there is no plan in the short term to add
detection. This is because IOPS is a bit simplistic of a unit to define
the performance requirements from the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.
Also, LXC requires target paths to be relative. Container paths in LXC
binds should never be absolute paths, so we strip any preceeding `/`,
even if a user sets one.
WaitForResult expects body to fail and retries few times before giving
up. Assertions inside the testfn body causes it to terminate abruptly
without retrying.
Some tests have containers that die almost immediately, and may die
and cleaned up before `driver.WaitUntilStarted` runs.
The causes for container dying seems special for each test:
* TestDockerDriver_Cleanup: `hello-world` image just emits a message and exits immediately
* TestDockerDriver_ForcePull_RepoDigest: the busybox image in `TestDockerDriver_ForcePull_RepoDigest` test didn't support `-p 0` argument
* TestDockerDriver_Entrypoint: with the entrypoint being `/bin/sh -c`, the command needs to be the entire string; otherwise, it ignores the comments
Currently, libcontainer-based executor, upon shutdown, kills the
container initial process. The children of the killed process remain
running, and the executor is never marked as terminated until they do.
Also, fix a case where we treat processes as successful, when
`proc.Wait()` fails. In some attempts, I was getting "waitid no child
processes" errors and such error shouldn't get process to be considered
successful.
this allows us to drop a cyclical import, but is subobptimal as it
requires BaseDriver tests to move. This falls firmly into the realm of
being a hack. Alternatives welcome.
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically, drivers. Due to
client/config also depending on the plugin_loader.
It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
As part of deprecating legacy drivers, we're moving the env package to a
new drivers/shared tree, as it is used by the modern docker and rkt
driver packages, and is useful for 3rd party plugins.
This allows the container to be tagged with a user friendly image name
(e.g. `redis:3.2`) rather than the image ID (e.g.
`sha256:87856cc39862cec77541d68382e4867d7ccb29a85a17221446c857ddaebca916`).
Useful for human debugging, as well as some debugging and image scanning
tools.
This risks two bad changes:
1. Discrepancy in image resolution between docker and Nomad's image
loader.
* I checked the image creation paths in Nomad, and noticed that we
either pulled the image or inspect the image with the user provided
name.
2. A race in image tagging where the tag is modified between image
loading and container creation.
* I, personally, don't think this case is cause for concern, as it is
analogous to the task running a bit later. As long as the image is
still present, creating the container should be good.
Tests expect that as soon as eventer shuts down immediately on context
cancellations; but golang does not guarantee priority when multiple
pending channels are ready in a select statement.
Mock driver config uses `time.Duration` fields but we initialize them
inconsistently, as time.Duration sometimes and as duration strings other
times. Previously, `mapstructure` handles it and does the right thing.
This is no longer the case with MsgPack. I could not find a good way to
bring back old behavior without too much complexity. `MsgPack` extended
types weren't ideal here as we lose type information (e.g. int64 vs
string), and the input is a generic map and not a MsgPack serialization
of duration.
As such, I went with the simple solution of declaring the config field
as duration string, and panicing if the test doesn't pass a valid
string.
I found this to cause the smallest change in tests, but we can
alternatively force all to be int64 instead.
This PR plumbs the plugins done ctx through the base and driver plugin
clients (device already had it). Further, it adds generic handling of
gRPC stream errors.
Introduce a device manager that manages the lifecycle of device plugins
on the client. It fingerprints, collects stats, and forwards Reserve
requests to the correct plugin. The manager, also handles device plugins
failing and validates their output.