I could not reproduce the failure locally even with `stress -cpu ...`
eating all the CPU it could on my machine.
But I think the race was in one of two places:
* The task could restart, which could create new events
* I think there could be a race between the updater's version of events
and the alloc runners' since updates are async
I fixed both. Here's hoping that fixes this flaky test.
We were incorrectly relying on the count of alloc updates in a number of
tests. Since alloc updates are async, their number is non-deterministic
and largely meaningless.
This should fix quite a few flaky tests in Travis and prevent future
mistaken assumptions in tests.
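For illustration, a minimal sketch of the pattern such tests should use instead: poll for the state you care about rather than count updates. The helper below is a stand-in for Nomad's own testutil polling helpers, not the actual API.

```go
package client

import (
	"testing"
	"time"
)

// waitForCondition polls fn until it reports success or the timeout elapses,
// avoiding any assumption about how many async alloc updates occurred.
func waitForCondition(t *testing.T, timeout time.Duration, fn func() (bool, error)) {
	t.Helper()
	deadline := time.Now().Add(timeout)
	var lastErr error
	for time.Now().Before(deadline) {
		ok, err := fn()
		if ok {
			return
		}
		lastErr = err
		time.Sleep(100 * time.Millisecond)
	}
	t.Fatalf("condition not met within %s: %v", timeout, lastErr)
}
```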
We performed the DeploymentStatus nil checks a couple of different ways, so
hopefully this helper will consolidate them and make it clearer what
the code is doing.
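A minimal sketch of what such a helper can look like, assuming the deployment status carries a `Healthy *bool` field; the names here are illustrative.

```go
package structs

// AllocDeploymentStatus captures deployment health for an allocation.
// Healthy stays nil until health has actually been determined.
type AllocDeploymentStatus struct {
	Healthy *bool
}

// IsHealthy folds the repeated nil checks into one place: it reports whether
// health has been set and is true.
func (a *AllocDeploymentStatus) IsHealthy() bool {
	if a == nil {
		return false
	}
	return a.Healthy != nil && *a.Healthy
}
```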
Fixes the panic mentioned in
https://github.com/hashicorp/nomad/issues/3420#issuecomment-341666932
While a leader task dying serially stops all follower tasks, the
synchronizing of state is asynchronous. Nomad can shut down before all
follower tasks have updated their state to dead, thus saving the state
necessary to hit this panic: *have a non-terminal alloc with a dead
leader.*
The actual fix is a simple nil check: don't assume a non-terminal
alloc's leader has a TaskRunner.
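The shape of the guard, sketched with stand-in types and names; the real check sits wherever the alloc runner looks up the leader's TaskRunner.

```go
package client

// TaskRunner stands in for Nomad's real task runner type.
type TaskRunner struct{}

// leaderRunner looks up the leader's runner without assuming it exists. A
// non-terminal alloc restored with a dead leader may have no runner for it,
// which previously caused a nil dereference.
func leaderRunner(taskRunners map[string]*TaskRunner, leader string) (*TaskRunner, bool) {
	tr, ok := taskRunners[leader]
	if !ok || tr == nil {
		return nil, false
	}
	return tr, true
}
```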
Fixes an issue in which the allocation health watcher was checking
allocation health based on un-interpolated services and checks. Changes
the interface for retrieving check information from Consul to retrieving
all registered services and checks by allocation. In the future this
will allow us to output nicer messages.
Fixes https://github.com/hashicorp/nomad/issues/2969
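Roughly, the interface change amounts to retrieving a per-allocation registration instead of raw check results. A sketch with illustrative names; the Consul `api` types are real, the rest is a stand-in.

```go
package consul

import "github.com/hashicorp/consul/api"

// AllocRegistration groups everything registered in Consul on behalf of a
// single allocation, so health can be judged against the interpolated
// services and checks that were actually registered.
type AllocRegistration struct {
	Services map[string]*api.AgentService
	Checks   map[string]*api.AgentCheck
}

// ConsulRegistrations is the narrowed interface the health watcher consumes:
// everything registered for an alloc, keyed by allocation ID.
type ConsulRegistrations interface {
	AllocRegistrations(allocID string) (*AllocRegistration, error)
}
```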
* alloc_runner
* Random tests
* parallel task_runner and no exec compatible check
* Parallel client
* Fail fast and use random ports
* Fix docker port mapping
* Make concurrent pull less timing dependent
* up parallel
* Fixes
* don't build chroots in parallel on travis
* Reduce parallelism on travis with lxc/rkt
* make java test app not run forever
* drop parallelism a little
* use docker ports that are outside the OS's ephemeral port range
* Limit even more on travis
* rkt deadline
This PR adds watching of allocation health at the client. The client can
watch for health based on the tasks running for the required time and
also based on the Consul checks passing.
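A simplified sketch of how the two signals combine; the parameter names and the minimum-run-time threshold are illustrative stand-ins for the client's real task-state and Consul lookups.

```go
package allochealth

import "time"

// healthy is a simplified combination of the two signals the watcher uses:
// every task has been running for at least minHealthy, and all Consul checks
// for the alloc are passing.
func healthy(taskStartTimes map[string]time.Time, checksPassing bool, minHealthy time.Duration, now time.Time) bool {
	for _, started := range taskStartTimes {
		if started.IsZero() || now.Sub(started) < minHealthy {
			return false
		}
	}
	return checksPassing
}
```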
Before this commit all tasks in a task group were destroyed
concurrently. This meant logging sidecars might be stopped before the
leader task whose logs still need to be shipped.
This commit blocks on the leader shutting down before signalling
followers to shut down.
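A minimal sketch of the ordering, with stand-in types; the alloc runner's actual shutdown path is more involved.

```go
package client

// taskHandle is a stand-in for the alloc runner's real per-task handle.
type taskHandle interface {
	Stop()
	Wait() // blocks until the task has fully exited
}

// destroyTaskGroup stops the leader first and blocks until it exits, then
// signals follower tasks (for example, logging sidecars) to shut down.
func destroyTaskGroup(leader taskHandle, followers []taskHandle) {
	if leader != nil {
		leader.Stop()
		leader.Wait() // block on the leader before touching followers
	}
	for _, f := range followers {
		f.Stop()
	}
}
```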
Fixes #2478 #2474 #1995 #2294
The new client only handles agent and task service advertisement. Server
discovery is mostly unchanged.
The Nomad client agent now handles all Consul operations instead of the
executor handling task-related operations. When upgrading from an
earlier version of Nomad, existing executors will be told to deregister
from Consul so that the Nomad agent can re-register the task's services
and checks.
Drivers other than qemu now support an Exec method for executing
arbitrary commands in a task's environment. This is used to implement
script checks.
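A sketch of what the Exec hook looks like from the script-check side; the exact name and signature here are assumptions for illustration.

```go
package driver

import "time"

// ScriptExecutor is roughly the capability script checks need from a driver:
// run an arbitrary command inside the task's environment and report its
// output and exit code.
type ScriptExecutor interface {
	Exec(timeout time.Duration, cmd string, args []string) (output []byte, exitCode int, err error)
}
```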
Interfaces are used extensively to avoid interacting with Consul in
tests that don't assert any Consul-related behavior.
This PR adds tracking of when a task starts and finishes, and the logs
API takes advantage of this to return better errors when asking for
logs that do not exist.
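A sketch of the tracking, assuming timestamps stored on the task's state; the field and method names are illustrative.

```go
package structs

import "time"

// TaskState sketches the start/finish tracking for a task.
type TaskState struct {
	StartedAt  time.Time // zero until the task first starts
	FinishedAt time.Time // zero until the task exits for the last time
}

// HasStarted lets the logs API distinguish "task never ran, so no logs exist"
// from a transient error reading logs.
func (ts *TaskState) HasStarted() bool {
	return ts != nil && !ts.StartedAt.IsZero()
}
```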
This PR fixes our vet script and fixes all the vet errors that had
previously been missed. It also fixes pointers being printed in `nomad
stop <job>` and `nomad node-status <node>`.
AllocRunner's state dropped the Context struct, which needs to be
converted to the new AllocDir+TaskDir structs in RestoreState.
TaskRunner added a TaskDirBuilt flag, but it's safe to just let that
default to `false` and rebuild all task dirs once on upgrade.
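A rough sketch of the restore-time conversion, with hypothetical struct shapes standing in for the real snapshot types.

```go
package allocrunner

// legacyContext mirrors the shape of the old persisted Context struct, and
// allocDirState the new AllocDir layout; both are illustrative stand-ins.
type legacyContext struct {
	AllocDir string
}

type allocDirState struct {
	AllocDirPath string
}

// upgradeAllocDirState converts a pre-upgrade snapshot into the new layout
// during RestoreState. Leaving the new TaskDirBuilt flag at its zero value
// (false) simply causes task dirs to be rebuilt once after the upgrade.
func upgradeAllocDirState(old *legacyContext, current *allocDirState) *allocDirState {
	if current != nil {
		return current // snapshot already in the new format
	}
	if old == nil {
		return nil
	}
	return &allocDirState{AllocDirPath: old.AllocDir}
}
```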