4afd7835e3
When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state.

This leads to terrible results, some of which are:

* Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after alloc expected start time!

Here, we treat all tasks as failed if the alloc runner prestart hook fails. This fixes the lockups, and permits the alloc to be rescheduled on another node.

While it's desirable to retry the alloc runner in such failures, I opted to treat it as out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that are better handled in a follow-up PR.

This might be one of the root causes for https://github.com/hashicorp/nomad/issues/5840.
||
---|---|---|
getter | ||
interfaces | ||
restarts | ||
state | ||
template | ||
testdata | ||
artifact_hook.go | ||
artifact_hook_test.go | ||
device_hook.go | ||
device_hook_test.go | ||
dispatch_hook.go | ||
dispatch_hook_test.go | ||
driver_handle.go | ||
errors.go | ||
errors_test.go | ||
lazy_handle.go | ||
lifecycle.go | ||
logmon_hook.go | ||
logmon_hook_test.go | ||
logmon_hook_unix_test.go | ||
service_hook.go | ||
service_hook_test.go | ||
stats_hook.go | ||
stats_hook_test.go | ||
task_dir_hook.go | ||
task_runner.go | ||
task_runner_getters.go | ||
task_runner_hooks.go | ||
task_runner_test.go | ||
template_hook.go | ||
validate_hook.go | ||
validate_hook_test.go | ||
vault_hook.go | ||
vault_hook_test.go |