Noticed an issue in Docker daemon failing to handle the OOM test case
failure in build https://travis-ci.org/hashicorp/nomad/jobs/468027848 ,
and I suspect it's related to the process dying so quickly, and
potentially the way we are starting the task, so added a start up delay
and made it more consistent with other tests that don't seem as flaky.
The following is the log line showing Docker returning 500 error condition; while we can probably handle it gracefully without retrying, the retry is very cheap in this case and it's more of an optimization that we can handle in follow up PR.
```
testlog.go:32: 2018-12-14T14:57:52.626Z [DEBUG] docker/driver.go:852: docker: setting container startup command: task_name=nc-demo command="/bin/nc -l 127.0.0.1 -p 0"
testlog.go:32: 2018-12-14T14:57:52.626Z [DEBUG] docker/driver.go:866: docker: setting container name: task_name=nc-demo container_name=724a3e77-8b15-e657-f6aa-84c2d3243b18
testlog.go:32: 2018-12-14T14:57:52.694Z [INFO ] docker/driver.go:196: docker: created container: container_id=362b6ea183f3c4ce472d7d7571ca47023cea1df0f5eb920827921716f17718be
testlog.go:32: 2018-12-14T14:57:53.523Z [DEBUG] docker/driver.go:416: docker: failed to start container: container_id=362b6ea183f3c4ce472d7d7571ca47023cea1df0f5eb920827921716f17718be attempt=1 error="API error (500): {"message":"cannot start a stopped process: unknown"}
"
testlog.go:32: 2018-12-14T14:57:55.394Z [DEBUG] docker/driver.go:416: docker: failed to start container: container_id=362b6ea183f3c4ce472d7d7571ca47023cea1df0f5eb920827921716f17718be attempt=2 error="API error (500): {"message":"cannot start a stopped process: unknown"}
"
testlog.go:32: 2018-12-14T14:57:57.243Z [DEBUG] docker/driver.go:416: docker: failed to start container: container_id=362b6ea183f3c4ce472d7d7571ca47023cea1df0f5eb920827921716f17718be attempt=3 error="API error (500): {"message":"cannot start a stopped process: unknown"}
"
```
Using `:latest` tag is typically a cause of pain, as underlying image
changes behavior. Here, I'm switching to using a point release, and
re-updating the stored tarballs with it.
Sadly, when saving/loading images, the repo digeset is not supported:
https://github.com/moby/moby/issues/22011 ; but using point releases
should mitigate the problem.
The motivation here is that docker tests have some flakiness due to
accidental importing of `busybox:latest` which has `/bin/nc` that no
longer supports `-p 0`:
```
$ docker run -it --rm busybox /bin/nc -l 127.0.0.1 -p 0
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
Digest: sha256:2a03a6059f21e150ae84b0973863609494aad70f0a80eaeb64bddd8d92465812
Status: Downloaded newer image for busybox:latest
nc: bad local port '0'
```
Looks like older busybox versions (e.g. `busybox:1.24` do honor `-p 0`
as the test expect, but I would rather update busybox to fix.
workaround a regression in 1.11.3
> We are aware of a functionality regression in "go get" when executed in GOPATH mode on an import path pattern containing "..." (e.g., "go get github.com/golang/pkg/..."), when downloading packages not already present in the GOPATH workspace. This is issue golang.org/issue/29241. It will be resolved in the next minor patch releases, Go 1.11.4 and Go 1.10.7, which we plan to release soon. We apologize for any disruption.
Prior to 97f33bb1537d04905cb84199672bcdf46ebb4e65, executor cgroup validation errors were
silently ignored. Enforcing them reveals test cases that missed them.
This doesn't change customer facing contract, as resource struct is
is either configured or we default to 100 (much higher than 2).
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid hanging on to an exclusive lock
of the map while applying changes to allocrunners themselves, as they
should be internally consistent.
This fixes a bug where any client allocation api will block during the
shutdown or updating of an allocrunner and its child taskrunners.
Fixes a bug where a driver health and attributes are never updated from
their initial status. If a driver started unhealthy, it may never go
into a healthy status.
Noticed few places where tests seem to block indefinitely and panic
after the test run reaches the test package timeout.
I intend to follow up with the proper fix later, but timing out is much
better than indefinitely blocking.
Update rawexec and rkt stop/kill tests with the patterns introduced in
7a49e9b68e519050a0c2ef0b67c33503bfbc51be. This implementation should be
more resilient to discrepancy between task stopping and task being marked as exited.
Using statically linked busybox binary to setup a basic rootfs for
testing, by symlinking it to provide the basic commands used in tests.
I considered using a proper rootfs tarball, but the overhead of managing
tarfile and expanding it seems significant enough that I went with this
implementation.
Lowering the runtime here to pre 7ca535aa90748caff1522468cc0c4ab672a74abb expectations.
The longest package at the time `client/driver` shrunk significantly,
and now the longest packages take less than 5 minutes.
We do have some long running timed out projects due to a stuck shutdown,
but in completed jobs (though they failed), the longest packages took
less than 5 minutes. The longest running packages in
https://travis-ci.org/hashicorp/nomad/jobs/464640776 were:
```
FAIL github.com/hashicorp/nomad/nomad 268.089s
ok github.com/hashicorp/nomad/drivers/docker 203.903s coverage: 68.8% of statements
ok github.com/hashicorp/nomad/drivers/rkt 132.104s coverage: 65.0% of statements
ok github.com/hashicorp/nomad/api 123.193s coverage: 62.9% of statements
ok github.com/hashicorp/nomad/command/agent 74.657s coverage: 72.3% of statements
ok github.com/hashicorp/nomad/command 63.592s coverage: 42.7% of statements
```