Fixes a deadlock in leadership handling if leadership flapped.
Raft propagates leadership transition to Nomad through a NotifyCh channel.
Raft blocks when writing to this channel, so channel must be buffered or
aggressively consumed[1]. Otherwise, Raft blocks indefinitely in `raft.runLeader`
until the channel is consumed[1] and does not move on to executing follower
related logic (in `raft.runFollower`).
While Raft `runLeader` defer function blocks, raft cannot process any other
raft operations. For example, `run{Leader|Follower}` methods consume
`raft.applyCh`, and while runLeader defer is blocked, all raft log applications
or config lookup will block indefinitely.
Sadly, `leaderLoop` and `establishLeader` makes few Raft calls!
`establishLeader` attempts to auto-create autopilot/scheduler config [3]; and
`leaderLoop` attempts to check raft configuration [4]. All of these calls occur
without a timeout.
Thus, if leadership flapped quickly while `leaderLoop/establishLeadership` is
invoked and hit any of these Raft calls, Raft handler _deadlock_ forever.
Depending on how many times it flapped and where exactly we get stuck, I suspect
it's possible to get in the following case:
* Agent metrics/stats http and RPC calls hang as they check raft.Configurations
* raft.State remains in Leader state, and server attempts to handle RPC calls
(e.g. node/alloc updates) and these hang as well
As we create goroutines per RPC call, the number of goroutines grow over time
and may trigger a out of memory errors in addition to missed updates.
[1] d90d6d6bda/config.go (L190-L193)
[2] d90d6d6bda/raft.go (L425-L436)
[3] 2a89e47746/nomad/leader.go (L198-L202)
[4] 2a89e47746/nomad/leader.go (L877)
If an existing system allocation is running and the node its running on
is marked as ineligible, subsequent plan/applys return an RPC error
instead of a more helpful plan result.
This change logs the error, and appends a failedTGAlloc for the
placement.
Fixes a bug introduced in 0aa58b9 where we're writing a test file to
a taskdir-interpolated location, which works when we `alloc exec` but
not in the jobspec for a group script check.
This changeset also makes the test safe to run multiple times by
namespacing the file with the alloc ID, which has the added bonus of
exercising our alloc interpolation code for group script checks.