open-nomad/client/allocrunner/taskrunner
Mahmood Ali 4afd7835e3 Fail alloc if alloc runner prestart hooks fail
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.

This leads to several serious problems, including:
* Lockup in the GC process, as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in the shutdown process, as TR.Shutdown() waits for WaitCh to be
  closed, which never happens because the task runner never ran
* Alloc not being restarted/rescheduled to another node (as it's still in
  pending state)
* Unexpected restart of the alloc on a client restart, potentially days or weeks
  after the alloc's expected start time!

Here, we treat all tasks as having failed if an alloc runner prestart hook fails.
This fixes the lockups and permits the alloc to be rescheduled on another node.

While it's desirable to retry the alloc runner on such failures, I opted to treat
that as out of scope. I'm wary of some subtleties about alloc and task runners
and their idempotency that are better handled in a follow-up PR.

This might be one of the root causes of
https://github.com/hashicorp/nomad/issues/5840.
2019-07-02 18:35:47 +08:00
getter client: Rename drivers/shared/env => client/taskenv 2018-11-30 12:18:39 +01:00
interfaces client: fix tr lifecycle logic and shutdown delay 2018-11-05 12:32:05 -08:00
restarts Fix restart attempts of restart stanza. 2019-05-21 13:27:19 +02:00
state client: test logmon cleanup 2019-03-04 13:15:15 -08:00
template tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal 2019-05-21 13:56:58 -04:00
testdata executor/linux: make chroot binary paths absolute 2019-04-01 15:45:31 -07:00
artifact_hook.go client: ensure task is cleaned up when terminal 2019-03-01 14:00:23 -08:00
artifact_hook_test.go client: ensure task is cleaned up when terminal 2019-03-01 14:00:23 -08:00
device_hook.go Store device envs separately and pass to drivers 2018-12-19 14:23:09 -08:00
device_hook_test.go Device hook and devices affect computed node class 2018-11-27 17:25:33 -08:00
dispatch_hook.go client/state: support upgrading from 0.8->0.9 2018-12-19 10:39:27 -08:00
dispatch_hook_test.go use drivers.FSIsolation 2019-01-08 09:11:47 -05:00
driver_handle.go implement client endpoint of nomad exec 2019-05-09 16:49:08 -04:00
errors.go client: artifact errors are retry-able 2019-02-20 07:21:27 -08:00
errors_test.go client: artifact errors are retry-able 2019-02-20 07:21:27 -08:00
lazy_handle.go executor: implement streaming stats API 2019-01-12 12:18:22 -05:00
lifecycle.go tr: Fetch Wait channel before killTask in restart 2019-06-26 15:20:57 +02:00
logmon_hook.go retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
logmon_hook_test.go logmon: make Start rpc idempotent and simplify hook 2019-03-19 14:02:36 -04:00
logmon_hook_unix_test.go try checking process status 2019-04-25 18:16:13 -04:00
service_hook.go trhooks: Add TaskStopHook interface to services 2019-06-12 16:00:21 +02:00
service_hook_test.go consul: fix task deregistration hook 2019-02-12 15:36:02 -08:00
stats_hook.go tr: use context in as select statement 2019-01-22 20:11:39 -05:00
stats_hook_test.go executor: implement streaming stats API 2019-01-12 12:18:22 -05:00
task_dir_hook.go client: ensure task is cleaned up when terminal 2019-03-01 14:00:23 -08:00
task_runner.go Fail alloc if alloc runner prestart hooks fail 2019-07-02 18:35:47 +08:00
task_runner_getters.go Merge pull request #5518 from hashicorp/f-simplify-kill 2019-04-15 14:11:58 -07:00
task_runner_hooks.go client: ensure task is cleaned up when terminal 2019-03-01 14:00:23 -08:00
task_runner_test.go cleanup test 2019-06-18 14:15:25 +00:00
template_hook.go client: Rename drivers/shared/env => client/taskenv 2018-11-30 12:18:39 +01:00
validate_hook.go client: Rename drivers/shared/env => client/taskenv 2018-11-30 12:18:39 +01:00
validate_hook_test.go client: Rename drivers/shared/env => client/taskenv 2018-11-30 12:18:39 +01:00
vault_hook.go emit TaskRestartSignal event on vault restart 2019-02-22 15:56:14 -05:00
vault_hook_test.go client: support graceful shutdowns 2018-11-19 16:39:30 -08:00