Currently, there is a race condition between creating a taskrunner and
updating node attributes via fingerprinting.
This is because the taskenv builder will try to iterate over the
clientconfig.Node.Attributes map, which can be concurrently updated by
the fingerprinting process, thus causing a panic.
This fixes that by providing a copy of the clientconfig to the
allocrunner, created inside the read lock during config creation.
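A minimal sketch of the idea, using hypothetical `Client`/`Node` types rather than the real client structs; the point is that the attribute map is copied while the read lock is held, so the allocrunner only ever iterates over its own snapshot:

```go
package client

import "sync"

// Hypothetical stand-ins for the real config/node types.
type Node struct {
	Attributes map[string]string
}

type Client struct {
	configLock sync.RWMutex
	node       *Node
}

// nodeCopy snapshots the node inside the read lock so the allocrunner
// never iterates over a map that fingerprinting may mutate concurrently.
func (c *Client) nodeCopy() *Node {
	c.configLock.RLock()
	defer c.configLock.RUnlock()

	attrs := make(map[string]string, len(c.node.Attributes))
	for k, v := range c.node.Attributes {
		attrs[k] = v
	}
	return &Node{Attributes: attrs}
}
```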
Prior to 97f33bb1537d04905cb84199672bcdf46ebb4e65, executor cgroup validation errors were
silently ignored. Enforcing them reveals test cases that missed them.
This doesn't change the customer-facing contract, as the resource struct is
either configured or we default to 100 (much higher than 2).
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid holding an exclusive lock on the
map while applying changes to the allocrunners themselves, as they
should be internally consistent.
This fixes a bug where any client allocation API call would block while
an allocrunner and its child taskrunners were shutting down or being updated.
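A rough sketch of the narrowed locking, again with illustrative types rather than the real client API; the lock is held only for the map lookup, and the potentially slow Update happens outside it:

```go
package client

import "sync"

// Illustrative types; not the real allocrunner interfaces.
type Allocation struct{ ID string }

type AllocRunner interface {
	Update(*Allocation)
}

type Client struct {
	allocLock sync.RWMutex
	allocs    map[string]AllocRunner
}

// updateAlloc holds allocLock only long enough to look up the runner; the
// update itself runs outside the lock, since each runner keeps itself
// internally consistent.
func (c *Client) updateAlloc(update *Allocation) {
	c.allocLock.RLock()
	ar, ok := c.allocs[update.ID]
	c.allocLock.RUnlock()
	if !ok {
		return
	}
	ar.Update(update)
}
```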
Fixes a bug where a driver's health and attributes were never updated from
their initial status. If a driver started unhealthy, it might never
transition to a healthy status.
This commit adds basic support for globbing namespaces in ACL
definitions.
For concrete definitions, we merge all of the defined policies at load time, and
perform a simple lookup later on. If an exact match of a concrete
definition is found, we do not attempt to resolve globs.
For glob definitions, we merge the definitions of identical glob patterns.
When loading a policy for a glob definition, we choose the glob that has
the closest match to the namespace we are resolving for. We define the
closest match as the one with the _smallest character difference_
between the glob and the namespace we are matching.
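As an illustration of the closest-match rule (using stdlib `path.Match` and made-up helper names, not the actual resolver code), the glob with the smallest character difference wins:

```go
package acl

import "path"

// closestGlob returns the matching glob with the smallest difference
// between the namespace length and the glob's literal (non-"*") length.
// These names are illustrative, not the real resolver's API.
func closestGlob(namespace string, globs []string) (string, bool) {
	best, bestDiff := "", -1
	for _, g := range globs {
		if ok, err := path.Match(g, namespace); err != nil || !ok {
			continue
		}
		literal := 0
		for i := 0; i < len(g); i++ {
			if g[i] != '*' {
				literal++
			}
		}
		// Smallest character difference between glob and namespace wins.
		if diff := len(namespace) - literal; bestDiff == -1 || diff < bestDiff {
			best, bestDiff = g, diff
		}
	}
	return best, bestDiff != -1
}
```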
Noticed a few places where tests seem to block indefinitely and panic
once the test run reaches the test package timeout.
I intend to follow up with a proper fix later, but timing out is much
better than blocking indefinitely.
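The kind of bounded wait being added looks roughly like this (a hypothetical helper, not an existing one):

```go
package client

import (
	"testing"
	"time"
)

// waitOrTimeout fails the individual test after d instead of letting the
// whole package hit its timeout and panic.
func waitOrTimeout(t *testing.T, done <-chan struct{}, d time.Duration) {
	t.Helper()
	select {
	case <-done:
	case <-time.After(d):
		t.Fatalf("timed out after %s waiting for condition", d)
	}
}
```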
Update rawexec and rkt stop/kill tests with the patterns introduced in
7a49e9b68e519050a0c2ef0b67c33503bfbc51be. This implementation should be
more resilient to discrepancies between a task stopping and the task being marked as exited.
Using a statically linked busybox binary to set up a basic rootfs for
testing, symlinking it to provide the basic commands used in tests.
I considered using a proper rootfs tarball, but the overhead of managing
the tarball and expanding it seemed significant enough that I went with this
approach.
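Roughly what the rootfs setup looks like (hypothetical helper name and package):

```go
package testutil

import (
	"os"
	"path/filepath"
)

// linkBusybox populates rootfs/bin with symlinks to a statically linked
// busybox binary, one per command the tests rely on. This helper is a
// sketch, not an existing function in the repo.
func linkBusybox(rootfs, busyboxPath string, commands []string) error {
	binDir := filepath.Join(rootfs, "bin")
	if err := os.MkdirAll(binDir, 0o755); err != nil {
		return err
	}
	for _, cmd := range commands {
		link := filepath.Join(binDir, cmd)
		if err := os.Symlink(busyboxPath, link); err != nil && !os.IsExist(err) {
			return err
		}
	}
	return nil
}
```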
Lowering the runtime here to pre-7ca535aa90748caff1522468cc0c4ab672a74abb expectations.
The longest package at the time, `client/driver`, shrank significantly,
and now the longest packages take less than 5 minutes.
We do have some long-running jobs that timed out due to a stuck shutdown,
but in completed jobs (even though they failed), the longest packages took
less than 5 minutes. The longest-running packages in
https://travis-ci.org/hashicorp/nomad/jobs/464640776 were:
```
FAIL github.com/hashicorp/nomad/nomad 268.089s
ok github.com/hashicorp/nomad/drivers/docker 203.903s coverage: 68.8% of statements
ok github.com/hashicorp/nomad/drivers/rkt 132.104s coverage: 65.0% of statements
ok github.com/hashicorp/nomad/api 123.193s coverage: 62.9% of statements
ok github.com/hashicorp/nomad/command/agent 74.657s coverage: 72.3% of statements
ok github.com/hashicorp/nomad/command 63.592s coverage: 42.7% of statements
```
When starting an allocation that is preempting other allocs, we create a
new group allocation watcher and then, in the allocation's PreRun hooks,
wait for the preempted allocations to terminate.
If there are no preempted allocations, we simply provide a
NoopAllocWatcher.
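A hedged sketch of the selection, with illustrative interfaces rather than the real allocwatcher package:

```go
package client

import "context"

// AllocWatcher and the types below are illustrative stand-ins.
type AllocWatcher interface {
	// Wait blocks until the watched allocations have terminated.
	Wait(ctx context.Context) error
}

type noopWatcher struct{}

func (noopWatcher) Wait(context.Context) error { return nil }

type groupWatcher struct {
	// terminated is closed once every preempted allocation is terminal.
	terminated <-chan struct{}
}

func (g *groupWatcher) Wait(ctx context.Context) error {
	select {
	case <-g.terminated:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// newPreemptionWatcher picks the watcher used by the PreRun hooks:
// preempting allocations get a watcher over the preempted group,
// everything else gets a no-op watcher.
func newPreemptionWatcher(preempted []string, terminated <-chan struct{}) AllocWatcher {
	if len(preempted) == 0 {
		return noopWatcher{}
	}
	return &groupWatcher{terminated: terminated}
}
```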