This PR fixes the rolling update limit calculation to avoid a panic when
there are more allocations for a deployment that haven't determined
their health than the max_parallel count of the task group.
Fixes https://github.com/hashicorp/nomad/issues/2820
This PR allows tuning of heartbeat TTLs. An example of very aggressive
settings is as follows:
```
server {
heartbeat_grace = "1s"
min_heartbeat_ttl = "1s"
max_heartbeats_per_second = 200.0
}
```
Fixes#2827
This is a tradeoff. The pro is that you can run separate client and
server agents on the same node and advertise both. The con is that if a
Nomad agent crashes and isn't restarted on that node in the same mode
its entry will not be cleaned up.
That con scenario seems far less likely to occur than the scenario on
the pro side, and even if we do leak an agent entry the checks will be
failing so nothing should attempt to use it.
When replacing an alloc the new alloc is blocked until the old alloc is
destroyed. This could cause a deadlock:
1. Destroying the old alloc includes a final sync of its status
2. Syncing status causes a GC
3. A GC looks for terminal allocs to cleanup
4. The GC waits for an alloc to stop completely before GC'ing
If the GC chooses the currently-being-destroyed-alloc to GC, the GC
deadlocks. If `client.max_parallel` deadlocks happen the GC is wedged
until the Nomad process is restarted.
Performing the final sync asynchronously is an ugly hack but prevents
the deadlock by allowing the final sync to occur after the alloc runner
has shutdown and been destroyed.
Since the AllocRunner.alloc struct can be mutated, most of AllocRunner
needs to acquire a lock to get the alloc's ID. Log lines always need to
include the alloc ID, so we often skipped acquiring a lock just to grab
the ID and accepted the race.
Let's make the race detector a little happier by storing the ID in a
single assignment field.
Task dir metadata is created in AllocRunner.Run which may not run before
an alloc is sync'd and Nomad exits. There's no reason not to just create
task dir metadata on restore if it doesn't exist.
Caught some typos. Made units separate from the numbers 1GHz -> 1 GHz
after talking to Nick about questions of style (this has the side effect of making future spell checking easier).