Fixes a bug where a driver health and attributes are never updated from
their initial status. If a driver started unhealthy, it may never go
into a healthy status.
Noticed few places where tests seem to block indefinitely and panic
after the test run reaches the test package timeout.
I intend to follow up with the proper fix later, but timing out is much
better than indefinitely blocking.
Update rawexec and rkt stop/kill tests with the patterns introduced in
7a49e9b68e519050a0c2ef0b67c33503bfbc51be. This implementation should be
more resilient to discrepancy between task stopping and task being marked as exited.
Using statically linked busybox binary to setup a basic rootfs for
testing, by symlinking it to provide the basic commands used in tests.
I considered using a proper rootfs tarball, but the overhead of managing
tarfile and expanding it seems significant enough that I went with this
implementation.
Lowering the runtime here to pre 7ca535aa90748caff1522468cc0c4ab672a74abb expectations.
The longest package at the time `client/driver` shrunk significantly,
and now the longest packages take less than 5 minutes.
We do have some long running timed out projects due to a stuck shutdown,
but in completed jobs (though they failed), the longest packages took
less than 5 minutes. The longest running packages in
https://travis-ci.org/hashicorp/nomad/jobs/464640776 were:
```
FAIL github.com/hashicorp/nomad/nomad 268.089s
ok github.com/hashicorp/nomad/drivers/docker 203.903s coverage: 68.8% of statements
ok github.com/hashicorp/nomad/drivers/rkt 132.104s coverage: 65.0% of statements
ok github.com/hashicorp/nomad/api 123.193s coverage: 62.9% of statements
ok github.com/hashicorp/nomad/command/agent 74.657s coverage: 72.3% of statements
ok github.com/hashicorp/nomad/command 63.592s coverage: 42.7% of statements
```
When starting an allocation that is preempting other allocs, we create a
new group allocation watcher, and then wait for the allocations to
terminate in the allocation PreRun hooks.
If there's no preempted allocations, then we simply provide a
NoopAllocWatcher.
The Group Alloc watcher is an implementation of a PrevAllocWatcher that
can wait for multiple previous allocs before terminating.
This is to be used when running an allocation that is preempting upstream
allocations, and thus only supports being ran with a local alloc watcher.
It also currently requires all of its child watchers to correctly handle
context cancellation. Should this be a problem, it should be fairly easy
to implement a replacement using channels rather than a waitgroup.
It obeys the PrevAllocWatcher interface for convenience, but it may be
better to extract Migration capabilities into a seperate interface for
greater clarity.
As of now, FileRotator uses bufio.Write under the hood to write data to
configured output file. Due to the way how bufio handles any occurred io
error - saves it into `err` variable never resetting it automatically -
any operation like `Write`, `Flush` etc will become a no-op, returning the very same,
saved error (eg. Out of disk space) even when the problem is fixed (eg. disk
space is available again).
That automatically means that FileRotator will stop writing any logs,
reporting the same error over and over again, even if it's no longer
valid.
This PR fixes it by resetting the bufio Writer, which resets any errors
and tries to write requested data.
IOPS have been modelled as a resource since Nomad 0.1 but has never
actually been detected and there is no plan in the short term to add
detection. This is because IOPS is a bit simplistic of a unit to define
the performance requirements from the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.
Also, LXC requires target paths to be relative. Container paths in LXC
binds should never be absolute paths, so we strip any preceeding `/`,
even if a user sets one.
The previous integration test was broken during the client refactor, and
it seems to be some sort of race with state updating.
I'm going to try and construct a replacement test as part of work on
performance, but for now, the underlying behaviour is still being
tested.