If an allocation is slow to stop because of `kill_timeout` or `shutdown_delay`,
the node drain is marked as complete prematurely, even though drain monitoring
will continue to report allocation migrations. This impacts the UI or API
clients that monitor node draining to shut down nodes.
This changeset updates the behavior to wait until the client status of all
drained allocs are terminal before marking the node as done draining.
This changeset refactors the tests of the draining node watcher so that we don't
mock the node watcher's `Remove` and `Update` methods for its own tests. Instead
we'll mock the node watcher's dependencies (the job watcher and deadline
notifier) and now unit tests can cover the real code. This allows us to remove a
bunch of TODOs in `watch_nodes.go` around testing and clarify some important
behaviors:
* Nodes that are down or disconnected will still be watched until the scheduler
decides what to do with their allocations. This will drive the job watcher but
not the node watcher, and that lets the node watcher gracefully handle cases
where a heartbeat fails but the node heartbeats again before its allocs can be
evicted.
* Stop watching nodes that have been deleted. The blocking query for nodes set
the maximum index to the highest index of a node it found, rather than the
index of the nodes table. This misses updates to the index from deleting
nodes. This was done as an performance optimization to avoid excessive
unblocking, but because the query is over all nodes anyways there's no
optimization to be had here. Remove the optimization so we can detect deleted
nodes without having to wait for an update to an unrelated node.
This PR replaces use of time.After with a safe helper function
that creates a time.Timer to use instead. The new function returns
both a time.Timer and a Stop function that the caller must handle.
Unlike time.NewTimer, the helper function does not panic if the duration
set is <= 0.
In tests where the logger is a test logger, emitting a trace log in a
background thread while it's shutting down may trigger a panic. Thus
avoid logging Trace if err != nil. Note that we already log an error
when err isn't a trace.
This fixes cases where tests panic with a trace like:
```
panic: Log in goroutine after TestAllocGarbageCollector_MakeRoomFor_MaxAllocs has completed
goroutine 30 [running]:
testing.(*common).logDepth(0xc000aa9e60, 0xc000c4a000, 0xab, 0x3)
/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:680 +0x4d3
testing.(*common).log(...)
/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:662
testing.(*common).Logf(0xc000aa9e60, 0x690b941, 0x4, 0xc001366c00, 0x2, 0x2)
/usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:701 +0x7e
github.com/hashicorp/nomad/helper/testlog.(*writer).Write(0xc000a82a60, 0xc0000b48c0, 0xab, 0x13f, 0x0, 0x0, 0x0)
/Users/notnoop/go/src/github.com/hashicorp/nomad/helper/testlog/testlog.go:34 +0x106
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*writer).Flush(0xc000a80900, 0xbf9870f000000001, 0x20a87556e, 0x8b12bc0)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/writer.go:29 +0x14f
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*intLogger).log(0xc000e2c180, 0xc0003b6880, 0x17, 0x1, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:139 +0x15d
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*intLogger).Trace(0xc000e2c180, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:446 +0x7a
github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(*interceptLogger).Trace(0xc0002f1ad0, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6)
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/interceptlogger.go:48 +0x9c
github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc0002f2380)
/Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:147 +0x1125
created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher
/Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:89 +0x1e3
FAIL github.com/hashicorp/nomad/client 10.605s
FAIL
```
We performed the DeploymentStatus nil checks a couple different ways, so
hopefully this helper will consoldiate them and make it more clear what
the code is doing.