If an alloc is being preempted and marked as evict, but the underlying
node is lost before the migration takes place, the allocation currently
stays as desired evict, status running forever, or until the node comes
back online.
This commit updates updateNonTerminalAllocsToLost to check for a
destired status of Evict as well as Stop when updating allocations on
tainted nodes.
switch to table test for lost node cases
Increase the shortened timeout after the first loop so that metrics
that take longer to come in aren't failing the test unnecessarily.
Move the check for empty alloc metrics into the loop so that if the
first values we get are empty we don't fail the test too early.
* Making pull activity timeout configurable in Docker plugin config, first pass
* Fixing broken function call
* Fixing broken tests
* Fixing linter suggestion
* Adding documentation on new parameter in Docker plugin config
* Adding unit test
* Setting min value for pull_activity_timeout, making pull activity duration a private var
copy struct values
ensure groupserviceHook implements RunnerPreKillhook
run deregister first
test that shutdown times are delayed
move magic number into variable
Fixes a bug where if a command flag parsing errors, the resulting error
and help usage messages get interleaved in unexpected and non-user
friendly way.
The reason is that we have flag parsing library effectively writes to
ui.Error in a goroutine. This is problematic: first, we lose the sequencing between help
usage and error message; second, cli.Ui methods are not concurrent safe.
Here, we introduce a custom error writer that buffers result and calls
ui.Error() in the write method and in the same goroutine.
For context, we need to wrap ui.Error because it's line-oriented, while
flags library expects a io.Writer which is bytes oriented.
Adds Windows targets to the client/allocs metrics tests. Removes the
`allocstats` test, which covers less than these tests and is now
redundant.
Adds a firewall rule to our Windows instances so that the prometheus
server can scrape the Nomad HTTP API for metrics.
Avoid logging in the `watch` function as much as possible, since
it is not waited on during a server shutdown. When the logger
logs after a test passes, it may or may not cause the testing
framework to panic.
More info in:
https://github.com/golang/go/issues/29388#issuecomment-453648436
Previously, Nomad used hand rolled HTTP requests to interact with the
EC2 metadata API. Recently however, we switched to using the AWS SDK for
this fingerprinting.
The default behaviour of the AWS SDK is to perform retries with
exponential backoff when a request fails. This is problematic for Nomad,
because interacting with the EC2 API is in our client start path.
Here we revert to our pre-existing behaviour of not performing retries
in the fast path, as if the metadata service is unavailable, it's likely
that nomad is not running in AWS.