When replacing an alloc the new alloc is blocked until the old alloc is
destroyed. This could cause a deadlock:
1. Destroying the old alloc includes a final sync of its status
2. Syncing status causes a GC
3. A GC looks for terminal allocs to cleanup
4. The GC waits for an alloc to stop completely before GC'ing
If the GC chooses the currently-being-destroyed-alloc to GC, the GC
deadlocks. If `client.max_parallel` deadlocks happen the GC is wedged
until the Nomad process is restarted.
Performing the final sync asynchronously is an ugly hack but prevents
the deadlock by allowing the final sync to occur after the alloc runner
has shutdown and been destroyed.
Since the AllocRunner.alloc struct can be mutated, most of AllocRunner
needs to acquire a lock to get the alloc's ID. Log lines always need to
include the alloc ID, so we often skipped acquiring a lock just to grab
the ID and accepted the race.
Let's make the race detector a little happier by storing the ID in a
single assignment field.
Task dir metadata is created in AllocRunner.Run which may not run before
an alloc is sync'd and Nomad exits. There's no reason not to just create
task dir metadata on restore if it doesn't exist.
Fixes#2835
Yet another bug caused by overwriting container and then trying to
reference container.ID in the err handling block. Did a quick audit of
docker.go and it seems to be the last offender. See #2804 for previous
bug.
Fixes#2802
While it's hard to reproduce the theoretical race is:
1. This goroutine calls ListContainers()
2. Another goroutine removes a container X
3. This goroutine attempts to InspectContainer(X)
However, this bug could be hit in the much simpler case of
InspectContainer() timing out.
In those cases an error is returned and the old code attempted to wrap
the error with the now-nil container.ID. Storing the container ID fixes
that panic.
This PR adds watching of allocation health at the client. The client can
watch for health based on the tasks running on time and also based on
the consul checks passing.