Commit graph

58 commits

Author SHA1 Message Date
Luiz Aoqui e012d9411e
Task lifecycle restart (#14127)
* allocrunner: handle lifecycle when all tasks die

When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.

* allocrunner: implement different alloc restarts

Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.

* allocrunner: allow tasks to run again

Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.

The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, or all
tasks that have already run.

* api/cli: add support for all tasks alloc restart

Implement the new -all-tasks alloc restart CLI flag and its API
counterpar, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.

* test: fix tasklifecycle Coordinator test

* allocrunner: kill taskrunners if all tasks are dead

When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.

* taskrunner: fix tests that waited on WaitCh

Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.

* tests: add tests for all tasks alloc restart

* changelog: add entry for #14127

* taskrunner: fix restore logic.

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also had the incorrect code path. When restoring a dead task the
driver handle always needs to be clear cleanly using `clearDriverHandle`
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner local
client state: if the task runner ever exits this loop cleanly (not with
a shutdown) it will never be able to run again. So if the Run() loops
starts with this local state flag set, it must exit early.

This local state flag is also being checked on task restart requests. If
the task is "dead" and its Run() loop is not active it will never be
able to run again.

* address code review requests

* apply more code review changes

* taskrunner: add different Restart modes

Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.

So instead of relying on the event type, this commit separated the logic
of restarting an taskRunner into two methods:
- `Restart` will retain the current behaviour and only will only restart
  the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers will need to
  restart the allocRunner taskCoordinator to make sure it will allow the
  task to run again.

* minor fixes
2022-08-24 17:43:07 -04:00
Luiz Aoqui 7a8cacc9ec
allocrunner: refactor task coordinator (#14009)
The current implementation for the task coordinator unblocks tasks by
performing destructive operations over its internal state (like closing
channels and deleting maps from keys).

This presents a problem in situations where we would like to revert the
state of a task, such as when restarting an allocation with tasks that
have already exited.

With this new implementation the task coordinator behaves more like a
finite state machine where task may be blocked/unblocked multiple times
by performing a state transition.

This initial part of the work only refactors the task coordinator and
is functionally equivalent to the previous implementation. Future work
will build upon this to provide bug fixes and enhancements.
2022-08-22 18:38:49 -04:00
Seth Hoenig 297d386bdc client: add support for checks in nomad services
This PR adds support for specifying checks in services registered to
the built-in nomad service provider.

Currently only HTTP and TCP checks are supported, though more types
could be added later.
2022-07-12 17:09:50 -05:00
Seth Hoenig 5dd8aa3e27 client: enforce max_kill_timeout client configuration
This PR fixes a bug where client configuration max_kill_timeout was
not being enforced. The feature was introduced in 9f44780 but seems
to have been removed during the major drivers refactoring.

We can make sure the value is enforced by pluming it through the DriverHandler,
which now uses the lesser of the task.killTimeout or client.maxKillTimeout.
Also updates Event.SetKillTimeout to require both the task.killTimeout and
client.maxKillTimeout so that we don't make the mistake of using the wrong
value - as it was being given only the task.killTimeout before.
2022-07-06 15:29:38 -05:00
Derek Strickland 12f3ee46ea
alloc_runner: stop sidecar tasks last (#13055)
alloc_runner: stop sidecar tasks last
2022-06-07 11:35:19 -04:00
Radek Simko 9cc71d6665
client/allochealth: add healthy_deadline as context to error messages (#13214) 2022-06-06 10:11:08 -04:00
Derek Strickland c7a5b17251 Fix client test reconnect test; Remove guard test (#12173)
* Update reconnect test to new algorithm and interface; remove guard test
2022-04-05 17:12:23 -04:00
Derek Strickland 3cbd76ea9d disconnected clients: Add reconnect task event (#12133)
* Add TaskClientReconnectedEvent constant
* Add allocRunner.Reconnect function to manage task state manually
* Removes server-side push
2022-04-05 17:12:23 -04:00
James Rasell a646333263
Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
James Rasell 042bf0fa57
client: hookup service wrapper for use within client hooks. 2022-03-21 10:29:57 +01:00
Seth Hoenig 2631659551 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
James Rasell 7cd28a6fb6
client: refactor common service registration objects from Consul.
This commit performs refactoring to pull out common service
registration objects into a new `client/serviceregistration`
package. This new package will form the base point for all
client specific service registration functionality.

The Consul specific implementation is not moved as it also
includes non-service registration implementations; this reduces
the blast radius of the changes as well.
2022-03-15 09:38:30 +01:00
Jasmine Dahilig d6110cbed4
lifecycle: add poststop hook (#8194) 2020-11-12 08:01:42 -08:00
Jasmine Dahilig fbe0c89ab1 task lifecycle poststart: code review fixes 2020-08-31 13:22:41 -07:00
Michael Schurter de08ae8083 test: add allocrunner test for poststart hooks 2020-08-12 09:54:14 -07:00
Jasmine Dahilig e8ed6851e2 lifecycle: add allocrunner and task hook coordinator unit tests 2020-07-29 12:39:42 -07:00
Mahmood Ali 7f460d2706 allocrunner: terminate sidecars in the end
This fixes a bug where a batch allocation fails to complete if it has
sidecars.

If the only remaining running tasks in an allocations are sidecars - we
must kill them and mark the allocation as complete.
2020-06-29 15:12:15 -04:00
Mahmood Ali 5ed346bf05 tests: update AR task restart policy 2020-03-24 17:00:42 -04:00
Jasmine Dahilig 2a8dac077c remove debugging test code from TestAllocRunner_TaskLeader_StopRestoredTG 2020-03-21 17:52:54 -04:00
Jasmine Dahilig deb26aefab fix bug in lifecycle restore tests after refactor 2020-03-21 17:52:54 -04:00
Jasmine Dahilig 80f0256cb4 refactor task hook coordinator helper method and tests 2020-03-21 17:52:53 -04:00
Jasmine Dahilig a0fe570317 clean up restore test 2020-03-21 17:52:52 -04:00
Jasmine Dahilig 7ed08eb75a partial test for restore functionality 2020-03-21 17:52:52 -04:00
Drew Bailey ae145c9a37
allow only positive shutdown delay
more explicit test case, remove select statement
2019-12-16 11:38:30 -05:00
Drew Bailey 24929776a2
shutdown delay for task groups
copy struct values

ensure groupserviceHook implements RunnerPreKillhook

run deregister first

test that shutdown times are delayed

move magic number into variable
2019-12-16 11:38:16 -05:00
Nick Ethier bd454a4c6f
client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00
Mahmood Ali c132623ffc Don't persist allocs of destroyed alloc runners
This fixes a bug where allocs that have been GCed get re-run again after client
is restarted.  A heavily-used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in
client alloc runner set.  Periodically, they get persisted until alloc is
gced by server.  During that  time, the client db will contain the alloc
but not its individual tasks status nor completed state.  On client restart,
client assumes that alloc is pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

This is a short-term fix, as we should consider revamping client state
management.  Storing alloc and task information in non-transaction non-atomic
concurrently while alloc runner is running and potentially changing state is a
recipe for bugs.

Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
2019-08-25 11:21:28 -04:00
Preetha Appan ef9a71c68b
code review feedback 2019-07-10 10:41:06 -05:00
Preetha Appan 990e468edc
Populate task event struct with kill timeout
This makes for a nicer task event message
2019-07-09 09:37:09 -05:00
Preetha Appan b99a204582
Update deployment health on failed allocations only if health is unset
This fixes a confusing UX where a previously successful deployment's
healthy/unhealthy count would get updated if any allocations failed after
the deployment was already marked as successful.
2019-05-02 22:59:56 -05:00
Michael Schurter c5271d3fa5 client: test logmon cleanup
The test is sadly quite complicated and peeks into things (logmon's
reattach config) AR doesn't normally have access to.

However, I couldn't find another way of asserting logmon got cleaned up
without resorting to smaller unit tests. Smaller unit tests risk
re-implementing dependencies in an unrealistic way, so I opted for an
ugly integration test.
2019-03-04 13:15:15 -08:00
Preetha Appan 43679f4ce1
More alloc runner tests ported from 0.8.7 2019-02-22 17:58:06 -06:00
Mahmood Ali 8cb4bbcc08 address review comments 2019-02-22 15:56:14 -05:00
Mahmood Ali 1b14214a88 tests: port TestAllocRunner_RetryArtifact
Port TestAllocRunner_RetryArtifact from https://github.com/hashicorp/nomad/blob/v0.8.7/client/alloc_runner_test.go#L610-L672

I changed the test name because it doesn't actually test that artifact
hooks is retried
2019-02-22 15:50:39 -05:00
Mahmood Ali c827e6e05a tests: port TestAllocRunner_MoveAllocDir test 2019-02-22 15:50:39 -05:00
Michael Schurter 2553800eb8 tests: port TestAllocRunner_Destroy from 0.8
Also add destroy(ar) helper to fix a bunch of shutdown races in AR
tests.
2019-02-20 12:35:09 -08:00
Michael Schurter 159042a1a3 client: fix setting alloc unhealthy at deadline
During the 0.9 client refactor the code to fail a deployment when the
deadline was reached was broken. This restores and tests that behavior.
2019-02-19 07:44:14 -08:00
Michael Schurter 4e7ea460e8 test: port some pre-0.9 DeploymentHealth tests
Skipping a failing one as I need to move to some other work and don't
want to leave this work orphaned on my machine.
2019-01-14 09:56:53 -08:00
Alex Dadgar 9d1403d617
Merge pull request #5002 from hashicorp/b-task-config-resources
Convert driver resource to AllocatedTaskResource
2018-12-18 16:46:34 -08:00
Alex Dadgar 8efac7ec81 Fix unit tests + upgrade pathing resources 2018-12-18 15:50:44 -08:00
Danielle Tomlinson d6eb084d8a allocrunner: Drop and log updates after closing waitCh 2018-12-18 23:38:34 +01:00
Danielle Tomlinson 986fde0f5a allocrunner: Handle updates asynchronously
This creates a new buffered channel and goroutine on the allocrunner for
serializing updates to allocations. This allows us to take updates off
the routine that is used from processing updates from the server,
without having complicated machinery for tracking update lifetimes, or
other external synchronization.

This results in a nice performance improvement and signficantly better
throughput on batch changes such as preempting a large number of jobs
for a larger placement.
2018-12-18 23:38:33 +01:00
Danielle Tomlinson d1fbac1aad allocrunner: Async shutdown and destroy
This commit reduces the locking required to shutdown or destroy
allocrunners, and allows parallel shutdown and destroy of allocrunners during
shutdown.
2018-12-18 23:38:33 +01:00
Danielle Tomlinson d043532cb0 allocrunner: Basic test alloc runner 2018-12-06 12:28:23 +01:00
Alex Dadgar 4ee603c382 Device hook and devices affect computed node class
This PR introduces a device hook that retrieves the device mount
information for an allocation. It also updates the computed node class
computation to take into account devices.

TODO Fix the task runner unit test. The environment variable is being
lost even though it is being properly set in the prestart hook.
2018-11-27 17:25:33 -08:00
Michael Schurter 5bd744ac3d client: support graceful shutdowns
Client.Shutdown now blocks until all AllocRunners and TaskRunners have
exited their Run loops. Tasks are left running.
2018-11-19 16:39:30 -08:00
Mahmood Ali 865419e756 convert all config durations to strings in tests 2018-11-13 10:21:40 -05:00
Michael Schurter 2d3479147a client: fix ar and tr tests 2018-11-05 12:32:05 -08:00
Michael Schurter 2bbd88888c client: first pass at implementing task restoring
Task restoring works but dead tasks may be restarted
2018-11-05 12:32:05 -08:00
Michael Schurter e060174130 ar: fix leader handling, state restoring, and destroying unrun ARs
* Migrated all of the old leader task tests and got them passing
* Refactor and consolidate task killing code in AR to always kill leader
  tasks first
* Fixed lots of issues with state restoring
* Fixed deadlock in AR.Destroy if AR.Run had never been called
* Added a new in memory statedb for testing
2018-10-19 09:45:45 -07:00