Task lifecycle restart (#14127)

* allocrunner: handle lifecycle when all tasks die

When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.
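
As a standalone illustration of that shape (simplified names, not the real Coordinator types), the next-state decision can check the terminal condition before looking at the current state:

```go
package main

import "fmt"

// taskState is a simplified stand-in for structs.TaskState in this sketch.
type taskState struct {
	State     string // "pending", "running", or "dead"
	Lifecycle string // "", "prestart", "poststart", or "poststop"
}

// allocDone reports whether every non-poststop task is dead.
func allocDone(states map[string]taskState) bool {
	for _, s := range states {
		if s.Lifecycle == "poststop" {
			continue
		}
		if s.State != "dead" {
			return false
		}
	}
	return true
}

// nextState short-circuits to the terminal "poststop" state from anywhere.
func nextState(current string, states map[string]taskState) string {
	if allocDone(states) {
		return "poststop"
	}
	// ...normal prestart -> main -> poststart -> wait_alloc progression...
	return current
}

func main() {
	// A prestart task died and took the whole alloc down with it.
	states := map[string]taskState{
		"init":     {State: "dead", Lifecycle: "prestart"},
		"main":     {State: "dead"},
		"poststop": {State: "pending", Lifecycle: "poststop"},
	}
	fmt.Println(nextState("prestart", states)) // "poststop", even from an early state
}
```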

* allocrunner: implement different alloc restarts

Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unify the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors, since those are expected when restarting the
allocation.
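
A minimal sketch of that unified shape (hypothetical `restarter` interface and sentinel error; the real logic lives in the allocrunner's `restartTasks` further down in this diff):

```go
package restart

import (
	"context"
	"errors"
	"fmt"
	"sync"

	multierror "github.com/hashicorp/go-multierror"
)

// ErrTaskNotRunning stands in for the task runner's sentinel error.
var ErrTaskNotRunning = errors.New("Task not running")

// restarter is a hypothetical subset of the task runner's restart API.
type restarter interface {
	Restart(ctx context.Context) error
	ForceRestart(ctx context.Context) error
}

// restartAll restarts every task concurrently, optionally forcing dead tasks
// to run again, and ignores "not running" errors since they are expected.
func restartAll(ctx context.Context, tasks map[string]restarter, force bool) error {
	var (
		mu   sync.Mutex
		merr *multierror.Error
		wg   sync.WaitGroup
	)
	for name, tr := range tasks {
		wg.Add(1)
		go func(name string, tr restarter) {
			defer wg.Done()
			var err error
			if force {
				err = tr.ForceRestart(ctx)
			} else {
				err = tr.Restart(ctx)
			}
			if err != nil && !errors.Is(err, ErrTaskNotRunning) {
				mu.Lock()
				merr = multierror.Append(merr, fmt.Errorf("failed to restart task %s: %w", name, err))
				mu.Unlock()
			}
		}(name, tr)
	}
	wg.Wait()
	return merr.ErrorOrNil()
}
```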

* allocrunner: allow tasks to run again

Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.

The task runner determines whether a task needs to run again based on two
new task events that differentiate between a request to restart a specific
task, the tasks that are currently running, or all tasks that have already
run.

* api/cli: add support for all tasks alloc restart

Implement the new -all-tasks alloc restart CLI flag and its API
counterpart, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.
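
For example, restarting every task of an allocation from the Go API (the alloc ID below is a placeholder; the CLI equivalent is `nomad alloc restart -all-tasks <alloc_id>`):

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// "your-alloc-id" is a placeholder; look the allocation up first so the
	// request can be routed to the right client node.
	alloc, _, err := client.Allocations().Info("your-alloc-id", nil)
	if err != nil {
		log.Fatal(err)
	}

	// Restart all tasks, regardless of lifecycle type or state.
	if err := client.Allocations().RestartAllTasks(alloc, nil); err != nil {
		log.Fatal(err)
	}
	fmt.Println("all tasks restarted")
}
```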

* test: fix tasklifecycle Coordinator test

* allocrunner: kill taskrunners if all tasks are dead

When all non-poststop tasks are dead, we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.

* taskrunner: fix tests that waited on WaitCh

Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.
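
The tests now follow roughly this pattern (a sketch; the helper and package names here are illustrative, while the real tests use a similar testWaitForTaskToDie helper):

```go
package taskrunnertest

import (
	"fmt"
	"testing"

	"github.com/hashicorp/nomad/client/allocrunner/taskrunner"
	"github.com/hashicorp/nomad/nomad/structs"
	"github.com/hashicorp/nomad/testutil"
)

// waitForTaskDead blocks until the task runner reports the "dead" state,
// replacing the old pattern of waiting on tr.WaitCh().
func waitForTaskDead(t *testing.T, tr *taskrunner.TaskRunner) {
	testutil.WaitForResult(func() (bool, error) {
		ts := tr.TaskState()
		if ts.State != structs.TaskStateDead {
			return false, fmt.Errorf("task state is %q, not dead yet", ts.State)
		}
		return true, nil
	}, func(err error) {
		t.Fatalf("timed out waiting for task to die: %v", err)
	})
}
```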

* tests: add tests for all tasks alloc restart

* changelog: add entry for #14127

* taskrunner: fix restore logic

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also took the incorrect code path. When restoring a dead task, the
driver handle always needs to be cleared cleanly using `clearDriverHandle`;
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner's local
client state: if the task runner ever exits this loop cleanly (not via a
shutdown) it will never be able to run again. So if the Run() loop starts
with this local state flag set, it must exit early.

This local state flag is also checked on task restart requests. If the
task is "dead" and its Run() loop is not active, it will never be able to
run again.
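
A standalone sketch of that decision (simplified types, not the real TaskRunner): the persisted RunComplete flag, rather than server data, decides whether a restored dead task may ever run again.

```go
package main

import "fmt"

// localState mirrors only the field this sketch needs from the task runner's
// persisted local client state.
type localState struct{ RunComplete bool }

// canRunAgain reports whether a restored "dead" task should keep waiting in
// the alloc restart loop (true) or run its stop hooks and exit Run() early
// (false).
func canRunAgain(taskDead bool, ls localState) bool {
	if !taskDead {
		return true // task never finished; the MAIN loop proceeds normally
	}
	return !ls.RunComplete // dead, but Run() never completed: may be restarted
}

func main() {
	fmt.Println(canRunAgain(true, localState{RunComplete: true}))  // false
	fmt.Println(canRunAgain(true, localState{RunComplete: false})) // true
}
```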

* address code review requests

* apply more code review changes

* taskrunner: add different Restart modes

Using the task event to differentiate between the allocrunner restart
methods proved confusing for developers to understand.

So instead of relying on the event type, this commit separates the logic
of restarting a taskRunner into two methods:
- `Restart` retains the current behaviour and only restarts the task if
  it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers need to restart
  the allocRunner taskCoordinator beforehand to make sure it will allow the
  task to run again.

* minor fixes
Luiz Aoqui 2022-08-24 17:43:07 -04:00 committed by GitHub
parent c732b215f0
commit e012d9411e
22 changed files with 1098 additions and 212 deletions

.changelog/14127.txt Normal file

@ -0,0 +1,7 @@
```release-note:improvement
client: add option to restart all tasks of an allocation, regardless of lifecycle type or state.
```
```release-note:improvement
client: only start poststop tasks after poststart tasks are done.
```


@ -141,7 +141,9 @@ func (a *Allocations) GC(alloc *Allocation, q *QueryOptions) error {
return err return err
} }
// Restart restarts an allocation. // Restart restarts the tasks that are currently running or a specific task if
// taskName is provided. An error is returned if the task to be restarted is
// not running.
// //
// Note: for cluster topologies where API consumers don't have network access to // Note: for cluster topologies where API consumers don't have network access to
// Nomad clients, set api.ClientConnTimeout to a small value (ex 1ms) to avoid // Nomad clients, set api.ClientConnTimeout to a small value (ex 1ms) to avoid
@ -156,6 +158,22 @@ func (a *Allocations) Restart(alloc *Allocation, taskName string, q *QueryOption
return err return err
} }
// RestartAllTasks restarts all tasks in the allocation, regardless of
// lifecycle type or state. Tasks will restart following their lifecycle order.
//
// Note: for cluster topologies where API consumers don't have network access to
// Nomad clients, set api.ClientConnTimeout to a small value (ex 1ms) to avoid
// long pauses on this API call.
func (a *Allocations) RestartAllTasks(alloc *Allocation, q *QueryOptions) error {
req := AllocationRestartRequest{
AllTasks: true,
}
var resp struct{}
_, err := a.client.putQuery("/v1/client/allocation/"+alloc.ID+"/restart", &req, &resp, q)
return err
}
// Stop stops an allocation. // Stop stops an allocation.
// //
// Note: for cluster topologies where API consumers don't have network access to // Note: for cluster topologies where API consumers don't have network access to
@ -447,6 +465,7 @@ func (a Allocation) RescheduleInfo(t time.Time) (int, int) {
type AllocationRestartRequest struct { type AllocationRestartRequest struct {
TaskName string TaskName string
AllTasks bool
} }
type AllocSignalRequest struct { type AllocSignalRequest struct {


@ -102,7 +102,7 @@ func (a *Allocations) Restart(args *nstructs.AllocRestartRequest, reply *nstruct
return nstructs.ErrPermissionDenied return nstructs.ErrPermissionDenied
} }
return a.c.RestartAllocation(args.AllocID, args.TaskName) return a.c.RestartAllocation(args.AllocID, args.TaskName, args.AllTasks)
} }
// Stats is used to collect allocation statistics // Stats is used to collect allocation statistics


@ -68,6 +68,45 @@ func TestAllocations_Restart(t *testing.T) {
}) })
} }
func TestAllocations_RestartAllTasks(t *testing.T) {
ci.Parallel(t)
require := require.New(t)
client, cleanup := TestClient(t, nil)
defer cleanup()
alloc := mock.LifecycleAlloc()
require.Nil(client.addAlloc(alloc, ""))
// Can't restart all tasks while specifying a task name.
req := &nstructs.AllocRestartRequest{
AllocID: alloc.ID,
AllTasks: true,
TaskName: "web",
}
var resp nstructs.GenericResponse
err := client.ClientRPC("Allocations.Restart", &req, &resp)
require.Error(err)
// Good request.
req = &nstructs.AllocRestartRequest{
AllocID: alloc.ID,
AllTasks: true,
}
testutil.WaitForResult(func() (bool, error) {
var resp2 nstructs.GenericResponse
err := client.ClientRPC("Allocations.Restart", &req, &resp2)
if err != nil && strings.Contains(err.Error(), "not running") {
return false, err
}
return true, nil
}, func(err error) {
t.Fatalf("err: %v", err)
})
}
func TestAllocations_Restart_ACL(t *testing.T) { func TestAllocations_Restart_ACL(t *testing.T) {
ci.Parallel(t) ci.Parallel(t)
require := require.New(t) require := require.New(t)


@ -28,7 +28,6 @@ import (
cstate "github.com/hashicorp/nomad/client/state" cstate "github.com/hashicorp/nomad/client/state"
cstructs "github.com/hashicorp/nomad/client/structs" cstructs "github.com/hashicorp/nomad/client/structs"
"github.com/hashicorp/nomad/client/vaultclient" "github.com/hashicorp/nomad/client/vaultclient"
agentconsul "github.com/hashicorp/nomad/command/agent/consul"
"github.com/hashicorp/nomad/helper/pointer" "github.com/hashicorp/nomad/helper/pointer"
"github.com/hashicorp/nomad/nomad/structs" "github.com/hashicorp/nomad/nomad/structs"
"github.com/hashicorp/nomad/plugins/device" "github.com/hashicorp/nomad/plugins/device"
@ -547,13 +546,15 @@ func (ar *allocRunner) handleTaskStateUpdates() {
} }
} }
if len(liveRunners) > 0 {
// if all live runners are sidecars - kill alloc // if all live runners are sidecars - kill alloc
if killEvent == nil && hasSidecars && !hasNonSidecarTasks(liveRunners) { onlySidecarsRemaining := hasSidecars && !hasNonSidecarTasks(liveRunners)
if killEvent == nil && onlySidecarsRemaining {
killEvent = structs.NewTaskEvent(structs.TaskMainDead) killEvent = structs.NewTaskEvent(structs.TaskMainDead)
} }
// If there's a kill event set and live runners, kill them // If there's a kill event set and live runners, kill them
if killEvent != nil && len(liveRunners) > 0 { if killEvent != nil {
// Log kill reason // Log kill reason
switch killEvent.Type { switch killEvent.Type {
@ -573,17 +574,39 @@ func (ar *allocRunner) handleTaskStateUpdates() {
// Kill 'em all // Kill 'em all
states = ar.killTasks() states = ar.killTasks()
// Wait for TaskRunners to exit before continuing to // Wait for TaskRunners to exit before continuing. This will
// prevent looping before TaskRunners have transitioned // prevent looping before TaskRunners have transitioned to
// to Dead. // Dead.
for _, tr := range liveRunners { for _, tr := range liveRunners {
ar.logger.Info("killing task", "task", tr.Task().Name) ar.logger.Info("waiting for task to exit", "task", tr.Task().Name)
select { select {
case <-tr.WaitCh(): case <-tr.WaitCh():
case <-ar.waitCh: case <-ar.waitCh:
} }
} }
} }
} else {
// If there are no live runners left kill all non-poststop task
// runners to unblock them from the alloc restart loop.
for _, tr := range ar.tasks {
if tr.IsPoststopTask() {
continue
}
select {
case <-tr.WaitCh():
case <-ar.waitCh:
default:
// Kill task runner without setting an event because the
// task is already dead, it's just waiting in the alloc
// restart loop.
err := tr.Kill(context.TODO(), nil)
if err != nil {
ar.logger.Warn("failed to kill task", "task", tr.Task().Name, "error", err)
}
}
}
}
ar.taskCoordinator.TaskStateUpdated(states) ar.taskCoordinator.TaskStateUpdated(states)
@ -648,7 +671,7 @@ func (ar *allocRunner) killTasks() map[string]*structs.TaskState {
break break
} }
// Kill the rest non-sidecar or poststop tasks concurrently // Kill the rest non-sidecar and non-poststop tasks concurrently
wg := sync.WaitGroup{} wg := sync.WaitGroup{}
for name, tr := range ar.tasks { for name, tr := range ar.tasks {
// Filter out poststop and sidecar tasks so that they stop after all the other tasks are killed // Filter out poststop and sidecar tasks so that they stop after all the other tasks are killed
@ -1205,19 +1228,37 @@ func (ar *allocRunner) GetTaskEventHandler(taskName string) drivermanager.EventH
return nil return nil
} }
// RestartTask signalls the task runner for the provided task to restart. // Restart satisfies the WorkloadRestarter interface and restarts all tasks
func (ar *allocRunner) RestartTask(taskName string, taskEvent *structs.TaskEvent) error { // that are currently running.
func (ar *allocRunner) Restart(ctx context.Context, event *structs.TaskEvent, failure bool) error {
return ar.restartTasks(ctx, event, failure, false)
}
// RestartTask restarts the provided task.
func (ar *allocRunner) RestartTask(taskName string, event *structs.TaskEvent) error {
tr, ok := ar.tasks[taskName] tr, ok := ar.tasks[taskName]
if !ok { if !ok {
return fmt.Errorf("Could not find task runner for task: %s", taskName) return fmt.Errorf("Could not find task runner for task: %s", taskName)
} }
return tr.Restart(context.TODO(), taskEvent, false) return tr.Restart(context.TODO(), event, false)
} }
// Restart satisfies the WorkloadRestarter interface restarts all task runners // RestartRunning restarts all tasks that are currently running.
// concurrently func (ar *allocRunner) RestartRunning(event *structs.TaskEvent) error {
func (ar *allocRunner) Restart(ctx context.Context, event *structs.TaskEvent, failure bool) error { return ar.restartTasks(context.TODO(), event, false, false)
}
// RestartAll restarts all tasks in the allocation, including dead ones. They
// will restart following their lifecycle order.
func (ar *allocRunner) RestartAll(event *structs.TaskEvent) error {
// Restart the taskCoordinator to allow dead tasks to run again.
ar.taskCoordinator.Restart()
return ar.restartTasks(context.TODO(), event, false, true)
}
// restartTasks restarts all task runners concurrently.
func (ar *allocRunner) restartTasks(ctx context.Context, event *structs.TaskEvent, failure bool, force bool) error {
waitCh := make(chan struct{}) waitCh := make(chan struct{})
var err *multierror.Error var err *multierror.Error
var errMutex sync.Mutex var errMutex sync.Mutex
@ -1230,10 +1271,19 @@ func (ar *allocRunner) Restart(ctx context.Context, event *structs.TaskEvent, fa
defer close(waitCh) defer close(waitCh)
for tn, tr := range ar.tasks { for tn, tr := range ar.tasks {
wg.Add(1) wg.Add(1)
go func(taskName string, r agentconsul.WorkloadRestarter) { go func(taskName string, taskRunner *taskrunner.TaskRunner) {
defer wg.Done() defer wg.Done()
e := r.Restart(ctx, event, failure)
if e != nil { var e error
if force {
e = taskRunner.ForceRestart(ctx, event.Copy(), failure)
} else {
e = taskRunner.Restart(ctx, event.Copy(), failure)
}
// Ignore ErrTaskNotRunning errors since tasks that are not
// running are expected to not be restarted.
if e != nil && e != taskrunner.ErrTaskNotRunning {
errMutex.Lock() errMutex.Lock()
defer errMutex.Unlock() defer errMutex.Unlock()
err = multierror.Append(err, fmt.Errorf("failed to restart task %s: %v", taskName, e)) err = multierror.Append(err, fmt.Errorf("failed to restart task %s: %v", taskName, e))
@ -1251,25 +1301,6 @@ func (ar *allocRunner) Restart(ctx context.Context, event *structs.TaskEvent, fa
return err.ErrorOrNil() return err.ErrorOrNil()
} }
// RestartAll signalls all task runners in the allocation to restart and passes
// a copy of the task event to each restart event.
// Returns any errors in a concatenated form.
func (ar *allocRunner) RestartAll(taskEvent *structs.TaskEvent) error {
var err *multierror.Error
// run alloc task restart hooks
ar.taskRestartHooks()
for tn := range ar.tasks {
rerr := ar.RestartTask(tn, taskEvent.Copy())
if rerr != nil {
err = multierror.Append(err, rerr)
}
}
return err.ErrorOrNil()
}
// Signal sends a signal request to task runners inside an allocation. If the // Signal sends a signal request to task runners inside an allocation. If the
// taskName is empty, then it is sent to all tasks. // taskName is empty, then it is sent to all tasks.
func (ar *allocRunner) Signal(taskName, signal string) error { func (ar *allocRunner) Signal(taskName, signal string) error {


@ -10,6 +10,7 @@ import (
"time" "time"
"github.com/hashicorp/consul/api" "github.com/hashicorp/consul/api"
multierror "github.com/hashicorp/go-multierror"
"github.com/hashicorp/nomad/ci" "github.com/hashicorp/nomad/ci"
"github.com/hashicorp/nomad/client/allochealth" "github.com/hashicorp/nomad/client/allochealth"
"github.com/hashicorp/nomad/client/allocrunner/tasklifecycle" "github.com/hashicorp/nomad/client/allocrunner/tasklifecycle"
@ -483,6 +484,464 @@ func TestAllocRunner_Lifecycle_Poststop(t *testing.T) {
} }
func TestAllocRunner_Lifecycle_Restart(t *testing.T) {
ci.Parallel(t)
// test cases can use this default or override w/ taskDefs param
alloc := mock.LifecycleAllocFromTasks([]mock.LifecycleTaskDef{
{Name: "main", RunFor: "100s", ExitCode: 0, Hook: "", IsSidecar: false},
{Name: "prestart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "prestart", IsSidecar: false},
{Name: "prestart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "prestart", IsSidecar: true},
{Name: "poststart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "poststart", IsSidecar: false},
{Name: "poststart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "poststart", IsSidecar: true},
{Name: "poststop", RunFor: "1s", ExitCode: 0, Hook: "poststop", IsSidecar: false},
})
alloc.Job.Type = structs.JobTypeService
rp := &structs.RestartPolicy{
Attempts: 1,
Interval: 10 * time.Minute,
Delay: 1 * time.Nanosecond,
Mode: structs.RestartPolicyModeFail,
}
ev := &structs.TaskEvent{Type: structs.TaskRestartSignal}
testCases := []struct {
name string
taskDefs []mock.LifecycleTaskDef
isBatch bool
hasLeader bool
action func(*allocRunner, *structs.Allocation) error
expectedErr string
expectedAfter map[string]structs.TaskState
}{
{
name: "restart entire allocation",
action: func(ar *allocRunner, alloc *structs.Allocation) error {
return ar.RestartAll(ev)
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "running", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 1},
"prestart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 1},
"poststart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststop": structs.TaskState{State: "pending", Restarts: 0},
},
},
{
name: "restart only running tasks",
action: func(ar *allocRunner, alloc *structs.Allocation) error {
return ar.RestartRunning(ev)
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "running", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststop": structs.TaskState{State: "pending", Restarts: 0},
},
},
{
name: "batch job restart entire allocation",
taskDefs: []mock.LifecycleTaskDef{
{Name: "main", RunFor: "100s", ExitCode: 1, Hook: "", IsSidecar: false},
{Name: "prestart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "prestart", IsSidecar: false},
{Name: "prestart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "prestart", IsSidecar: true},
{Name: "poststart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "poststart", IsSidecar: false},
{Name: "poststart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "poststart", IsSidecar: true},
{Name: "poststop", RunFor: "1s", ExitCode: 0, Hook: "poststop", IsSidecar: false},
},
isBatch: true,
action: func(ar *allocRunner, alloc *structs.Allocation) error {
return ar.RestartAll(ev)
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "running", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 1},
"prestart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 1},
"poststart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststop": structs.TaskState{State: "pending", Restarts: 0},
},
},
{
name: "batch job restart only running tasks ",
taskDefs: []mock.LifecycleTaskDef{
{Name: "main", RunFor: "100s", ExitCode: 1, Hook: "", IsSidecar: false},
{Name: "prestart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "prestart", IsSidecar: false},
{Name: "prestart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "prestart", IsSidecar: true},
{Name: "poststart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "poststart", IsSidecar: false},
{Name: "poststart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "poststart", IsSidecar: true},
{Name: "poststop", RunFor: "1s", ExitCode: 0, Hook: "poststop", IsSidecar: false},
},
isBatch: true,
action: func(ar *allocRunner, alloc *structs.Allocation) error {
return ar.RestartRunning(ev)
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "running", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststop": structs.TaskState{State: "pending", Restarts: 0},
},
},
{
name: "restart entire allocation with leader",
hasLeader: true,
action: func(ar *allocRunner, alloc *structs.Allocation) error {
return ar.RestartAll(ev)
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "running", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 1},
"prestart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 1},
"poststart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststop": structs.TaskState{State: "pending", Restarts: 0},
},
},
{
name: "stop from server",
action: func(ar *allocRunner, alloc *structs.Allocation) error {
stopAlloc := alloc.Copy()
stopAlloc.DesiredStatus = structs.AllocDesiredStatusStop
ar.Update(stopAlloc)
return nil
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "dead", Restarts: 0},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststop": structs.TaskState{State: "dead", Restarts: 0},
},
},
{
name: "restart main task",
action: func(ar *allocRunner, alloc *structs.Allocation) error {
return ar.RestartTask("main", ev)
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "running", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "running", Restarts: 0},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "running", Restarts: 0},
"poststop": structs.TaskState{State: "pending", Restarts: 0},
},
},
{
name: "restart leader main task",
hasLeader: true,
action: func(ar *allocRunner, alloc *structs.Allocation) error {
return ar.RestartTask("main", ev)
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "running", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "running", Restarts: 0},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "running", Restarts: 0},
"poststop": structs.TaskState{State: "pending", Restarts: 0},
},
},
{
name: "main task fails and restarts once",
taskDefs: []mock.LifecycleTaskDef{
{Name: "main", RunFor: "2s", ExitCode: 1, Hook: "", IsSidecar: false},
{Name: "prestart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "prestart", IsSidecar: false},
{Name: "prestart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "prestart", IsSidecar: true},
{Name: "poststart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "poststart", IsSidecar: false},
{Name: "poststart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "poststart", IsSidecar: true},
{Name: "poststop", RunFor: "1s", ExitCode: 0, Hook: "poststop", IsSidecar: false},
},
action: func(ar *allocRunner, alloc *structs.Allocation) error {
time.Sleep(3 * time.Second) // make sure main task has exited
return nil
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "dead", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststop": structs.TaskState{State: "dead", Restarts: 0},
},
},
{
name: "leader main task fails and restarts once",
taskDefs: []mock.LifecycleTaskDef{
{Name: "main", RunFor: "2s", ExitCode: 1, Hook: "", IsSidecar: false},
{Name: "prestart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "prestart", IsSidecar: false},
{Name: "prestart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "prestart", IsSidecar: true},
{Name: "poststart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "poststart", IsSidecar: false},
{Name: "poststart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "poststart", IsSidecar: true},
{Name: "poststop", RunFor: "1s", ExitCode: 0, Hook: "poststop", IsSidecar: false},
},
hasLeader: true,
action: func(ar *allocRunner, alloc *structs.Allocation) error {
time.Sleep(3 * time.Second) // make sure main task has exited
return nil
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "dead", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststop": structs.TaskState{State: "dead", Restarts: 0},
},
},
{
name: "main stopped unexpectedly and restarts once",
taskDefs: []mock.LifecycleTaskDef{
{Name: "main", RunFor: "2s", ExitCode: 0, Hook: "", IsSidecar: false},
{Name: "prestart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "prestart", IsSidecar: false},
{Name: "prestart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "prestart", IsSidecar: true},
{Name: "poststart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "poststart", IsSidecar: false},
{Name: "poststart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "poststart", IsSidecar: true},
{Name: "poststop", RunFor: "1s", ExitCode: 0, Hook: "poststop", IsSidecar: false},
},
action: func(ar *allocRunner, alloc *structs.Allocation) error {
time.Sleep(3 * time.Second) // make sure main task has exited
return nil
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "dead", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststop": structs.TaskState{State: "dead", Restarts: 0},
},
},
{
name: "leader main stopped unexpectedly and restarts once",
taskDefs: []mock.LifecycleTaskDef{
{Name: "main", RunFor: "2s", ExitCode: 0, Hook: "", IsSidecar: false},
{Name: "prestart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "prestart", IsSidecar: false},
{Name: "prestart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "prestart", IsSidecar: true},
{Name: "poststart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "poststart", IsSidecar: false},
{Name: "poststart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "poststart", IsSidecar: true},
{Name: "poststop", RunFor: "1s", ExitCode: 0, Hook: "poststop", IsSidecar: false},
},
action: func(ar *allocRunner, alloc *structs.Allocation) error {
time.Sleep(3 * time.Second) // make sure main task has exited
return nil
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "dead", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststop": structs.TaskState{State: "dead", Restarts: 0},
},
},
{
name: "failed main task cannot be restarted",
taskDefs: []mock.LifecycleTaskDef{
{Name: "main", RunFor: "2s", ExitCode: 1, Hook: "", IsSidecar: false},
{Name: "prestart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "prestart", IsSidecar: false},
{Name: "prestart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "prestart", IsSidecar: true},
{Name: "poststart-oneshot", RunFor: "1s", ExitCode: 0, Hook: "poststart", IsSidecar: false},
{Name: "poststart-sidecar", RunFor: "100s", ExitCode: 0, Hook: "poststart", IsSidecar: true},
{Name: "poststop", RunFor: "1s", ExitCode: 0, Hook: "poststop", IsSidecar: false},
},
action: func(ar *allocRunner, alloc *structs.Allocation) error {
// make sure main task has had a chance to restart once on its
// own and fail again before we try to manually restart it
time.Sleep(5 * time.Second)
return ar.RestartTask("main", ev)
},
expectedErr: "Task not running",
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "dead", Restarts: 1},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "dead", Restarts: 0},
"poststop": structs.TaskState{State: "dead", Restarts: 0},
},
},
{
name: "restart prestart-sidecar task",
action: func(ar *allocRunner, alloc *structs.Allocation) error {
return ar.RestartTask("prestart-sidecar", ev)
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "running", Restarts: 0},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "running", Restarts: 0},
"poststop": structs.TaskState{State: "pending", Restarts: 0},
},
},
{
name: "restart poststart-sidecar task",
action: func(ar *allocRunner, alloc *structs.Allocation) error {
return ar.RestartTask("poststart-sidecar", ev)
},
expectedAfter: map[string]structs.TaskState{
"main": structs.TaskState{State: "running", Restarts: 0},
"prestart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"prestart-sidecar": structs.TaskState{State: "running", Restarts: 0},
"poststart-oneshot": structs.TaskState{State: "dead", Restarts: 0},
"poststart-sidecar": structs.TaskState{State: "running", Restarts: 1},
"poststop": structs.TaskState{State: "pending", Restarts: 0},
},
},
}
for _, tc := range testCases {
tc := tc
t.Run(tc.name, func(t *testing.T) {
ci.Parallel(t)
alloc := alloc.Copy()
alloc.Job.TaskGroups[0].RestartPolicy = rp
if tc.taskDefs != nil {
alloc = mock.LifecycleAllocFromTasks(tc.taskDefs)
alloc.Job.Type = structs.JobTypeService
}
for _, task := range alloc.Job.TaskGroups[0].Tasks {
task.RestartPolicy = rp // tasks inherit the group policy
}
if tc.hasLeader {
for _, task := range alloc.Job.TaskGroups[0].Tasks {
if task.Name == "main" {
task.Leader = true
}
}
}
if tc.isBatch {
alloc.Job.Type = structs.JobTypeBatch
}
conf, cleanup := testAllocRunnerConfig(t, alloc)
defer cleanup()
ar, err := NewAllocRunner(conf)
require.NoError(t, err)
defer destroy(ar)
go ar.Run()
upd := conf.StateUpdater.(*MockStateUpdater)
// assert our "before" states:
// - all one-shot tasks should be dead but not failed
// - all main tasks and sidecars should be running
// - no tasks should have restarted
testutil.WaitForResult(func() (bool, error) {
last := upd.Last()
if last == nil {
return false, fmt.Errorf("no update")
}
if last.ClientStatus != structs.AllocClientStatusRunning {
return false, fmt.Errorf(
"expected alloc to be running not %s", last.ClientStatus)
}
var errs *multierror.Error
expectedBefore := map[string]string{
"main": "running",
"prestart-oneshot": "dead",
"prestart-sidecar": "running",
"poststart-oneshot": "dead",
"poststart-sidecar": "running",
"poststop": "pending",
}
for task, expected := range expectedBefore {
got, ok := last.TaskStates[task]
if !ok {
continue
}
if got.State != expected {
errs = multierror.Append(errs, fmt.Errorf(
"expected initial state of task %q to be %q not %q",
task, expected, got.State))
}
if got.Restarts != 0 {
errs = multierror.Append(errs, fmt.Errorf(
"expected no initial restarts of task %q, not %q",
task, got.Restarts))
}
if expected == "dead" && got.Failed {
errs = multierror.Append(errs, fmt.Errorf(
"expected ephemeral task %q to be dead but not failed",
task))
}
}
if errs.ErrorOrNil() != nil {
return false, errs.ErrorOrNil()
}
return true, nil
}, func(err error) {
require.NoError(t, err, "error waiting for initial state")
})
// perform the action
err = tc.action(ar, alloc.Copy())
if tc.expectedErr != "" {
require.EqualError(t, err, tc.expectedErr)
} else {
require.NoError(t, err)
}
// assert our "after" states
testutil.WaitForResult(func() (bool, error) {
last := upd.Last()
if last == nil {
return false, fmt.Errorf("no update")
}
var errs *multierror.Error
for task, expected := range tc.expectedAfter {
got, ok := last.TaskStates[task]
if !ok {
errs = multierror.Append(errs, fmt.Errorf(
"no final state found for task %q", task,
))
}
if got.State != expected.State {
errs = multierror.Append(errs, fmt.Errorf(
"expected final state of task %q to be %q not %q",
task, expected.State, got.State))
}
if expected.State == "dead" {
if got.FinishedAt.IsZero() || got.StartedAt.IsZero() {
errs = multierror.Append(errs, fmt.Errorf(
"expected final state of task %q to have start and finish time", task))
}
if len(got.Events) < 2 {
errs = multierror.Append(errs, fmt.Errorf(
"expected final state of task %q to include at least 2 tasks", task))
}
}
if got.Restarts != expected.Restarts {
errs = multierror.Append(errs, fmt.Errorf(
"expected final restarts of task %q to be %v not %v",
task, expected.Restarts, got.Restarts))
}
}
if errs.ErrorOrNil() != nil {
return false, errs.ErrorOrNil()
}
return true, nil
}, func(err error) {
require.NoError(t, err, "error waiting for final state")
})
})
}
}
func TestAllocRunner_TaskGroup_ShutdownDelay(t *testing.T) { func TestAllocRunner_TaskGroup_ShutdownDelay(t *testing.T) {
ci.Parallel(t) ci.Parallel(t)
@ -1213,7 +1672,7 @@ func TestAllocRunner_MoveAllocDir(t *testing.T) {
ar.Run() ar.Run()
defer destroy(ar) defer destroy(ar)
require.Equal(t, structs.AllocClientStatusComplete, ar.AllocState().ClientStatus) WaitForClientState(t, ar, structs.AllocClientStatusComplete)
// Step 2. Modify its directory // Step 2. Modify its directory
task := alloc.Job.TaskGroups[0].Tasks[0] task := alloc.Job.TaskGroups[0].Tasks[0]
@ -1241,7 +1700,7 @@ func TestAllocRunner_MoveAllocDir(t *testing.T) {
ar2.Run() ar2.Run()
defer destroy(ar2) defer destroy(ar2)
require.Equal(t, structs.AllocClientStatusComplete, ar2.AllocState().ClientStatus) WaitForClientState(t, ar, structs.AllocClientStatusComplete)
// Ensure that data from ar was moved to ar2 // Ensure that data from ar was moved to ar2
dataFile = filepath.Join(ar2.allocDir.SharedDir, "data", "data_file") dataFile = filepath.Join(ar2.allocDir.SharedDir, "data", "data_file")


@ -207,18 +207,18 @@ func TestAllocRunner_Restore_CompletedBatch(t *testing.T) {
go ar2.Run() go ar2.Run()
defer destroy(ar2) defer destroy(ar2)
// AR waitCh must be closed even when task doesn't run again // AR waitCh must be open as the task waits for a possible alloc restart.
select { select {
case <-ar2.WaitCh(): case <-ar2.WaitCh():
case <-time.After(10 * time.Second): require.Fail(t, "alloc.waitCh was closed")
require.Fail(t, "alloc.waitCh wasn't closed") default:
} }
// TR waitCh must be closed too! // TR waitCh must be open too!
select { select {
case <-ar2.tasks[task.Name].WaitCh(): case <-ar2.tasks[task.Name].WaitCh():
case <-time.After(10 * time.Second): require.Fail(t, "tr.waitCh was closed")
require.Fail(t, "tr.waitCh wasn't closed") default:
} }
// Assert that events are unmodified, which they would if task re-run // Assert that events are unmodified, which they would if task re-run


@ -16,7 +16,7 @@ const (
coordinatorStatePrestart coordinatorStatePrestart
coordinatorStateMain coordinatorStateMain
coordinatorStatePoststart coordinatorStatePoststart
coordinatorStateWaitMain coordinatorStateWaitAlloc
coordinatorStatePoststop coordinatorStatePoststop
) )
@ -30,8 +30,8 @@ func (s coordinatorState) String() string {
return "main" return "main"
case coordinatorStatePoststart: case coordinatorStatePoststart:
return "poststart" return "poststart"
case coordinatorStateWaitMain: case coordinatorStateWaitAlloc:
return "wait_main" return "wait_alloc"
case coordinatorStatePoststop: case coordinatorStatePoststop:
return "poststart" return "poststart"
} }
@ -109,6 +109,15 @@ func NewCoordinator(logger hclog.Logger, tasks []*structs.Task, shutdownCh <-cha
return c return c
} }
// Restart sets the Coordinator state back to "init" and is used to coordinate
// a full alloc restart. Since all tasks will run again they need to be pending
// before they are allowed to proceed.
func (c *Coordinator) Restart() {
c.currentStateLock.Lock()
defer c.currentStateLock.Unlock()
c.enterStateLocked(coordinatorStateInit)
}
// Restore is used to set the Coordinator FSM to the correct state when an // Restore is used to set the Coordinator FSM to the correct state when an
// alloc is restored. Must be called before the allocrunner is running. // alloc is restored. Must be called before the allocrunner is running.
func (c *Coordinator) Restore(states map[string]*structs.TaskState) { func (c *Coordinator) Restore(states map[string]*structs.TaskState) {
@ -151,6 +160,13 @@ func (c *Coordinator) TaskStateUpdated(states map[string]*structs.TaskState) {
// current internal state and the received states of the tasks. // current internal state and the received states of the tasks.
// The currentStateLock must be held before calling this method. // The currentStateLock must be held before calling this method.
func (c *Coordinator) nextStateLocked(states map[string]*structs.TaskState) coordinatorState { func (c *Coordinator) nextStateLocked(states map[string]*structs.TaskState) coordinatorState {
// coordinatorStatePoststop is the terminal state of the FSM, and can be
// reached at any time.
if c.isAllocDone(states) {
return coordinatorStatePoststop
}
switch c.currentState { switch c.currentState {
case coordinatorStateInit: case coordinatorStateInit:
if !c.isInitDone(states) { if !c.isInitDone(states) {
@ -174,11 +190,11 @@ func (c *Coordinator) nextStateLocked(states map[string]*structs.TaskState) coor
if !c.isPoststartDone(states) { if !c.isPoststartDone(states) {
return coordinatorStatePoststart return coordinatorStatePoststart
} }
return coordinatorStateWaitMain return coordinatorStateWaitAlloc
case coordinatorStateWaitMain: case coordinatorStateWaitAlloc:
if !c.isWaitMainDone(states) { if !c.isAllocDone(states) {
return coordinatorStateWaitMain return coordinatorStateWaitAlloc
} }
return coordinatorStatePoststop return coordinatorStatePoststop
@ -233,7 +249,7 @@ func (c *Coordinator) enterStateLocked(state coordinatorState) {
c.allow(lifecycleStagePoststartEphemeral) c.allow(lifecycleStagePoststartEphemeral)
c.allow(lifecycleStagePoststartSidecar) c.allow(lifecycleStagePoststartSidecar)
case coordinatorStateWaitMain: case coordinatorStateWaitAlloc:
c.block(lifecycleStagePrestartEphemeral) c.block(lifecycleStagePrestartEphemeral)
c.block(lifecycleStagePoststartEphemeral) c.block(lifecycleStagePoststartEphemeral)
c.block(lifecycleStagePoststop) c.block(lifecycleStagePoststop)
@ -268,7 +284,7 @@ func (c *Coordinator) isInitDone(states map[string]*structs.TaskState) bool {
// isPrestartDone returns true when the following conditions are met: // isPrestartDone returns true when the following conditions are met:
// - there is at least one prestart task // - there is at least one prestart task
// - all ephemeral prestart tasks are in the "dead" state. // - all ephemeral prestart tasks are successful.
// - no ephemeral prestart task has failed. // - no ephemeral prestart task has failed.
// - all prestart sidecar tasks are running. // - all prestart sidecar tasks are running.
func (c *Coordinator) isPrestartDone(states map[string]*structs.TaskState) bool { func (c *Coordinator) isPrestartDone(states map[string]*structs.TaskState) bool {
@ -277,7 +293,7 @@ func (c *Coordinator) isPrestartDone(states map[string]*structs.TaskState) bool
} }
for _, task := range c.tasksByLifecycle[lifecycleStagePrestartEphemeral] { for _, task := range c.tasksByLifecycle[lifecycleStagePrestartEphemeral] {
if states[task].State != structs.TaskStateDead || states[task].Failed { if !states[task].Successful() {
return false return false
} }
} }
@ -321,9 +337,9 @@ func (c *Coordinator) isPoststartDone(states map[string]*structs.TaskState) bool
return true return true
} }
// isWaitMainDone returns true when the following conditions are met: // isAllocDone returns true when the following conditions are met:
// - all tasks that are not poststop are in the "dead" state. // - all non-poststop tasks are in the "dead" state.
func (c *Coordinator) isWaitMainDone(states map[string]*structs.TaskState) bool { func (c *Coordinator) isAllocDone(states map[string]*structs.TaskState) bool {
for lifecycle, tasks := range c.tasksByLifecycle { for lifecycle, tasks := range c.tasksByLifecycle {
if lifecycle == lifecycleStagePoststop { if lifecycle == lifecycleStagePoststop {
continue continue


@ -58,6 +58,9 @@ func TestCoordinator_PrestartRunsBeforeMain(t *testing.T) {
sideTask := tasks[1] sideTask := tasks[1]
initTask := tasks[2] initTask := tasks[2]
// Only use the tasks that we care about.
tasks = []*structs.Task{mainTask, sideTask, initTask}
shutdownCh := make(chan struct{}) shutdownCh := make(chan struct{})
defer close(shutdownCh) defer close(shutdownCh)
coord := NewCoordinator(logger, tasks, shutdownCh) coord := NewCoordinator(logger, tasks, shutdownCh)
@ -161,6 +164,9 @@ func TestCoordinator_MainRunsAfterManyInitTasks(t *testing.T) {
init1Task := tasks[1] init1Task := tasks[1]
init2Task := tasks[2] init2Task := tasks[2]
// Only use the tasks that we care about.
tasks = []*structs.Task{mainTask, init1Task, init2Task}
shutdownCh := make(chan struct{}) shutdownCh := make(chan struct{})
defer close(shutdownCh) defer close(shutdownCh)
coord := NewCoordinator(logger, tasks, shutdownCh) coord := NewCoordinator(logger, tasks, shutdownCh)
@ -227,6 +233,9 @@ func TestCoordinator_FailedInitTask(t *testing.T) {
init1Task := tasks[1] init1Task := tasks[1]
init2Task := tasks[2] init2Task := tasks[2]
// Only use the tasks that we care about.
tasks = []*structs.Task{mainTask, init1Task, init2Task}
shutdownCh := make(chan struct{}) shutdownCh := make(chan struct{})
defer close(shutdownCh) defer close(shutdownCh)
coord := NewCoordinator(logger, tasks, shutdownCh) coord := NewCoordinator(logger, tasks, shutdownCh)
@ -292,6 +301,9 @@ func TestCoordinator_SidecarNeverStarts(t *testing.T) {
sideTask := tasks[1] sideTask := tasks[1]
initTask := tasks[2] initTask := tasks[2]
// Only use the tasks that we care about.
tasks = []*structs.Task{mainTask, sideTask, initTask}
shutdownCh := make(chan struct{}) shutdownCh := make(chan struct{})
defer close(shutdownCh) defer close(shutdownCh)
coord := NewCoordinator(logger, tasks, shutdownCh) coord := NewCoordinator(logger, tasks, shutdownCh)
@ -356,6 +368,9 @@ func TestCoordinator_PoststartStartsAfterMain(t *testing.T) {
sideTask := tasks[1] sideTask := tasks[1]
postTask := tasks[2] postTask := tasks[2]
// Only use the tasks that we care about.
tasks = []*structs.Task{mainTask, sideTask, postTask}
// Make the the third task is a poststart hook // Make the the third task is a poststart hook
postTask.Lifecycle.Hook = structs.TaskLifecycleHookPoststart postTask.Lifecycle.Hook = structs.TaskLifecycleHookPoststart


@ -6,28 +6,103 @@ import (
"github.com/hashicorp/nomad/nomad/structs" "github.com/hashicorp/nomad/nomad/structs"
) )
// Restart a task. Returns immediately if no task is running. Blocks until // Restart restarts a task that is already running. Returns an error if the
// existing task exits or passed-in context is canceled. // task is not running. Blocks until existing task exits or passed-in context
// is canceled.
func (tr *TaskRunner) Restart(ctx context.Context, event *structs.TaskEvent, failure bool) error { func (tr *TaskRunner) Restart(ctx context.Context, event *structs.TaskEvent, failure bool) error {
tr.logger.Trace("Restart requested", "failure", failure) tr.logger.Trace("Restart requested", "failure", failure, "event", event.GoString())
// Grab the handle taskState := tr.TaskState()
handle := tr.getDriverHandle() if taskState == nil {
return ErrTaskNotRunning
}
// Check it is running switch taskState.State {
if handle == nil { case structs.TaskStatePending, structs.TaskStateDead:
return ErrTaskNotRunning
}
return tr.restartImpl(ctx, event, failure)
}
// ForceRestart restarts a task that is already running or reruns it if dead.
// Returns an error if the task is not able to rerun. Blocks until existing
// task exits or passed-in context is canceled.
//
// Callers must restart the AllocRunner taskCoordinator beforehand to make sure
// the task will be able to run again.
func (tr *TaskRunner) ForceRestart(ctx context.Context, event *structs.TaskEvent, failure bool) error {
tr.logger.Trace("Force restart requested", "failure", failure, "event", event.GoString())
taskState := tr.TaskState()
if taskState == nil {
return ErrTaskNotRunning
}
tr.stateLock.Lock()
localState := tr.localState.Copy()
tr.stateLock.Unlock()
if localState == nil {
return ErrTaskNotRunning
}
switch taskState.State {
case structs.TaskStatePending:
return ErrTaskNotRunning
case structs.TaskStateDead:
// Tasks that are in the "dead" state are only allowed to restart if
// their Run() method is still active.
if localState.RunComplete {
return ErrTaskNotRunning
}
}
return tr.restartImpl(ctx, event, failure)
}
// restartImpl implements the task restart process.
//
// It should never be called directly as it doesn't verify if the task state
// allows for a restart.
func (tr *TaskRunner) restartImpl(ctx context.Context, event *structs.TaskEvent, failure bool) error {
// Check if the task is able to restart based on its state and the type of
// restart event that was triggered.
taskState := tr.TaskState()
if taskState == nil {
return ErrTaskNotRunning return ErrTaskNotRunning
} }
// Emit the event since it may take a long time to kill // Emit the event since it may take a long time to kill
tr.EmitEvent(event) tr.EmitEvent(event)
// Run the pre-kill hooks prior to restarting the task
tr.preKill()
// Tell the restart tracker that a restart triggered the exit // Tell the restart tracker that a restart triggered the exit
tr.restartTracker.SetRestartTriggered(failure) tr.restartTracker.SetRestartTriggered(failure)
// Signal a restart to unblock tasks that are in the "dead" state, but
// don't block since the channel is buffered. Only one signal is enough to
// notify the tr.Run() loop.
// The channel must be signaled after SetRestartTriggered is called so the
// tr.Run() loop runs again.
if taskState.State == structs.TaskStateDead {
select {
case tr.restartCh <- struct{}{}:
default:
}
}
// Grab the handle to see if the task is still running and needs to be
// killed.
handle := tr.getDriverHandle()
if handle == nil {
return nil
}
// Run the pre-kill hooks prior to restarting the task
tr.preKill()
// Grab a handle to the wait channel that will timeout with context cancelation // Grab a handle to the wait channel that will timeout with context cancelation
// _before_ killing the task. // _before_ killing the task.
waitCh, err := handle.WaitCh(ctx) waitCh, err := handle.WaitCh(ctx)
@ -69,14 +144,17 @@ func (tr *TaskRunner) Signal(event *structs.TaskEvent, s string) error {
// Kill a task. Blocks until task exits or context is canceled. State is set to // Kill a task. Blocks until task exits or context is canceled. State is set to
// dead. // dead.
func (tr *TaskRunner) Kill(ctx context.Context, event *structs.TaskEvent) error { func (tr *TaskRunner) Kill(ctx context.Context, event *structs.TaskEvent) error {
tr.logger.Trace("Kill requested", "event_type", event.Type, "event_reason", event.KillReason) tr.logger.Trace("Kill requested")
// Cancel the task runner to break out of restart delay or the main run // Cancel the task runner to break out of restart delay or the main run
// loop. // loop.
tr.killCtxCancel() tr.killCtxCancel()
// Emit kill event // Emit kill event
if event != nil {
tr.logger.Trace("Kill event", "event_type", event.Type, "event_reason", event.KillReason)
tr.EmitEvent(event) tr.EmitEvent(event)
}
select { select {
case <-tr.WaitCh(): case <-tr.WaitCh():


@ -22,7 +22,6 @@ import (
"github.com/hashicorp/nomad/helper/uuid" "github.com/hashicorp/nomad/helper/uuid"
"github.com/hashicorp/nomad/nomad/mock" "github.com/hashicorp/nomad/nomad/mock"
"github.com/hashicorp/nomad/nomad/structs" "github.com/hashicorp/nomad/nomad/structs"
"github.com/hashicorp/nomad/testutil"
"github.com/stretchr/testify/require" "github.com/stretchr/testify/require"
"golang.org/x/sys/unix" "golang.org/x/sys/unix"
) )
@ -297,11 +296,7 @@ func TestTaskRunner_DeriveSIToken_UnWritableTokenFile(t *testing.T) {
go tr.Run() go tr.Run()
// wait for task runner to finish running // wait for task runner to finish running
select { testWaitForTaskToDie(t, tr)
case <-tr.WaitCh():
case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
r.Fail("timed out waiting for task runner")
}
// assert task exited un-successfully // assert task exited un-successfully
finalState := tr.TaskState() finalState := tr.TaskState()


@ -16,6 +16,11 @@ type LocalState struct {
// TaskHandle is the handle used to reattach to the task during recovery // TaskHandle is the handle used to reattach to the task during recovery
TaskHandle *drivers.TaskHandle TaskHandle *drivers.TaskHandle
// RunComplete is set to true when the TaskRunner.Run() method finishes.
// It is used to distinguish between a dead task that could be restarted
// and one that will never run again.
RunComplete bool
} }
func NewLocalState() *LocalState { func NewLocalState() *LocalState {
@ -52,6 +57,7 @@ func (s *LocalState) Copy() *LocalState {
Hooks: make(map[string]*HookState, len(s.Hooks)), Hooks: make(map[string]*HookState, len(s.Hooks)),
DriverNetwork: s.DriverNetwork.Copy(), DriverNetwork: s.DriverNetwork.Copy(),
TaskHandle: s.TaskHandle.Copy(), TaskHandle: s.TaskHandle.Copy(),
RunComplete: s.RunComplete,
} }
// Copy the hook state // Copy the hook state


@ -62,6 +62,11 @@ const (
// updates have come in since the last one was handled, we only need to // updates have come in since the last one was handled, we only need to
// handle the last one. // handle the last one.
triggerUpdateChCap = 1 triggerUpdateChCap = 1
// restartChCap is the capacity for the restartCh used for triggering task
// restarts. It should be exactly 1 as even if multiple restarts have come
// we only need to handle the last one.
restartChCap = 1
) )
type TaskRunner struct { type TaskRunner struct {
@ -95,6 +100,9 @@ type TaskRunner struct {
// stateDB is for persisting localState and taskState // stateDB is for persisting localState and taskState
stateDB cstate.StateDB stateDB cstate.StateDB
// restartCh is used to signal that the task should restart.
restartCh chan struct{}
// shutdownCtx is used to exit the TaskRunner *without* affecting task state. // shutdownCtx is used to exit the TaskRunner *without* affecting task state.
shutdownCtx context.Context shutdownCtx context.Context
@ -367,6 +375,7 @@ func NewTaskRunner(config *Config) (*TaskRunner, error) {
shutdownCtx: trCtx, shutdownCtx: trCtx,
shutdownCtxCancel: trCancel, shutdownCtxCancel: trCancel,
triggerUpdateCh: make(chan struct{}, triggerUpdateChCap), triggerUpdateCh: make(chan struct{}, triggerUpdateChCap),
restartCh: make(chan struct{}, restartChCap),
waitCh: make(chan struct{}), waitCh: make(chan struct{}),
csiManager: config.CSIManager, csiManager: config.CSIManager,
cpusetCgroupPathGetter: config.CpusetCgroupPathGetter, cpusetCgroupPathGetter: config.CpusetCgroupPathGetter,
@ -506,21 +515,26 @@ func (tr *TaskRunner) Run() {
tr.stateLock.RLock() tr.stateLock.RLock()
dead := tr.state.State == structs.TaskStateDead dead := tr.state.State == structs.TaskStateDead
runComplete := tr.localState.RunComplete
tr.stateLock.RUnlock() tr.stateLock.RUnlock()
// if restoring a dead task, ensure that task is cleared and all post hooks // If restoring a dead task, ensure the task is cleared and, if the local
// are called without additional state updates // state indicates that the previous Run() call is complete, execute all
// post stop hooks and exit early, otherwise proceed until the
// ALLOC_RESTART loop skipping MAIN since the task is dead.
if dead { if dead {
// do cleanup functions without emitting any additional events/work // do cleanup functions without emitting any additional events/work
// to handle cases where we restored a dead task where client terminated // to handle cases where we restored a dead task where client terminated
// after task finished before completing post-run actions. // after task finished before completing post-run actions.
tr.clearDriverHandle() tr.clearDriverHandle()
tr.stateUpdater.TaskStateUpdated() tr.stateUpdater.TaskStateUpdated()
if runComplete {
if err := tr.stop(); err != nil { if err := tr.stop(); err != nil {
tr.logger.Error("stop failed on terminal task", "error", err) tr.logger.Error("stop failed on terminal task", "error", err)
} }
return return
} }
}
// Updates are handled asynchronously with the other hooks but each // Updates are handled asynchronously with the other hooks but each
// triggered update - whether due to alloc updates or a new vault token // triggered update - whether due to alloc updates or a new vault token
@ -544,27 +558,24 @@ func (tr *TaskRunner) Run() {
// Set the initial task state. // Set the initial task state.
tr.stateUpdater.TaskStateUpdated() tr.stateUpdater.TaskStateUpdated()
select {
case <-tr.startConditionMetCh:
tr.logger.Debug("lifecycle start condition has been met, proceeding")
// yay proceed
case <-tr.killCtx.Done():
case <-tr.shutdownCtx.Done():
return
}
timer, stop := helper.NewSafeTimer(0) // timer duration calculated JIT timer, stop := helper.NewSafeTimer(0) // timer duration calculated JIT
defer stop() defer stop()
MAIN: MAIN:
for !tr.shouldShutdown() { for !tr.shouldShutdown() {
if dead {
break
}
select { select {
case <-tr.killCtx.Done(): case <-tr.killCtx.Done():
break MAIN break MAIN
case <-tr.shutdownCtx.Done(): case <-tr.shutdownCtx.Done():
// TaskRunner was told to exit immediately // TaskRunner was told to exit immediately
return return
default: case <-tr.startConditionMetCh:
tr.logger.Debug("lifecycle start condition has been met, proceeding")
// yay proceed
} }
// Run the prestart hooks // Run the prestart hooks
@ -674,6 +685,38 @@ MAIN:
// Mark the task as dead // Mark the task as dead
tr.UpdateState(structs.TaskStateDead, nil) tr.UpdateState(structs.TaskStateDead, nil)
// Wait here in case the allocation is restarted. Poststop tasks will never
// run again so skip them to avoid blocking forever.
if !tr.Task().IsPoststop() {
ALLOC_RESTART:
// Run in a loop to handle cases where restartCh is triggered but the
// task runner doesn't need to restart.
for {
select {
case <-tr.killCtx.Done():
break ALLOC_RESTART
case <-tr.shutdownCtx.Done():
return
case <-tr.restartCh:
// Restart without delay since the task is not running anymore.
restart, _ := tr.shouldRestart()
if restart {
// Set runner as not dead to allow the MAIN loop to run.
dead = false
goto MAIN
}
}
}
}
tr.stateLock.Lock()
tr.localState.RunComplete = true
err := tr.stateDB.PutTaskRunnerLocalState(tr.allocID, tr.taskName, tr.localState)
if err != nil {
tr.logger.Warn("error persisting task state on run loop exit", "error", err)
}
tr.stateLock.Unlock()
// Run the stop hooks // Run the stop hooks
if err := tr.stop(); err != nil { if err := tr.stop(); err != nil {
tr.logger.Error("stop failed", "error", err) tr.logger.Error("stop failed", "error", err)
@ -1200,8 +1243,10 @@ func (tr *TaskRunner) UpdateState(state string, event *structs.TaskEvent) {
tr.stateLock.Lock() tr.stateLock.Lock()
defer tr.stateLock.Unlock() defer tr.stateLock.Unlock()
tr.logger.Trace("setting task state", "state", state)
if event != nil { if event != nil {
tr.logger.Trace("setting task state", "state", state, "event", event.Type) tr.logger.Trace("appending task event", "state", state, "event", event.Type)
// Append the event // Append the event
tr.appendEvent(event) tr.appendEvent(event)


@ -335,7 +335,7 @@ func TestTaskRunner_Restore_Running(t *testing.T) {
defer newTR.Kill(context.Background(), structs.NewTaskEvent("cleanup")) defer newTR.Kill(context.Background(), structs.NewTaskEvent("cleanup"))
// Wait for new task runner to exit when the process does // Wait for new task runner to exit when the process does
<-newTR.WaitCh() testWaitForTaskToDie(t, newTR)
// Assert that the process was only started once // Assert that the process was only started once
started := 0 started := 0
@@ -349,6 +349,87 @@ func TestTaskRunner_Restore_Running(t *testing.T) {
 	assert.Equal(t, 1, started)
 }

+// TestTaskRunner_Restore_Dead asserts that restoring a dead task will place it
+// back in the correct state. If the task was waiting for an alloc restart it
+// must be able to be restarted after restore, otherwise a restart must fail.
+func TestTaskRunner_Restore_Dead(t *testing.T) {
+	ci.Parallel(t)
+
+	alloc := mock.BatchAlloc()
+	alloc.Job.TaskGroups[0].Count = 1
+	task := alloc.Job.TaskGroups[0].Tasks[0]
+	task.Driver = "mock_driver"
+	task.Config = map[string]interface{}{
+		"run_for": "2s",
+	}
+	conf, cleanup := testTaskRunnerConfig(t, alloc, task.Name)
+	conf.StateDB = cstate.NewMemDB(conf.Logger) // "persist" state between task runners
+	defer cleanup()
+
+	// Run the first TaskRunner
+	origTR, err := NewTaskRunner(conf)
+	require.NoError(t, err)
+	go origTR.Run()
+	defer origTR.Kill(context.Background(), structs.NewTaskEvent("cleanup"))
+
+	// Wait for it to be dead
+	testWaitForTaskToDie(t, origTR)
+
+	// Cause TR to exit without shutting down task
+	origTR.Shutdown()
+
+	// Start a new TaskRunner and do the Restore
+	newTR, err := NewTaskRunner(conf)
+	require.NoError(t, err)
+	require.NoError(t, newTR.Restore())
+
+	go newTR.Run()
+	defer newTR.Kill(context.Background(), structs.NewTaskEvent("cleanup"))
+
+	// Verify that the TaskRunner is still active since it was recovered after
+	// a forced shutdown.
+	select {
+	case <-newTR.WaitCh():
+		require.Fail(t, "WaitCh is not blocking")
+	default:
+	}
+
+	// Verify that we can restart task.
+	// Retry a few times as the newTR.Run() may not have started yet.
+	testutil.WaitForResult(func() (bool, error) {
+		ev := &structs.TaskEvent{Type: structs.TaskRestartSignal}
+		err = newTR.ForceRestart(context.Background(), ev, false)
+		return err == nil, err
+	}, func(err error) {
+		require.NoError(t, err)
+	})
+	testWaitForTaskToStart(t, newTR)
+
+	// Kill task to verify that it's restored as dead and not able to restart.
+	newTR.Kill(context.Background(), nil)
+	testutil.WaitForResult(func() (bool, error) {
+		select {
+		case <-newTR.WaitCh():
+			return true, nil
+		default:
+			return false, fmt.Errorf("task still running")
+		}
+	}, func(err error) {
+		require.NoError(t, err)
+	})
+
+	newTR2, err := NewTaskRunner(conf)
+	require.NoError(t, err)
+	require.NoError(t, newTR2.Restore())
+	go newTR2.Run()
+	defer newTR2.Kill(context.Background(), structs.NewTaskEvent("cleanup"))
+
+	ev := &structs.TaskEvent{Type: structs.TaskRestartSignal}
+	err = newTR2.ForceRestart(context.Background(), ev, false)
+	require.Equal(t, err, ErrTaskNotRunning)
+}
+
 // setupRestoreFailureTest starts a service, shuts down the task runner, and
 // kills the task before restarting a new TaskRunner. The new TaskRunner is
 // returned once it is running and waiting in pending along with a cleanup
@@ -603,11 +684,7 @@ func TestTaskRunner_TaskEnv_Interpolated(t *testing.T) {
 	defer cleanup()

 	// Wait for task to complete
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(3 * time.Second):
-		require.Fail("timeout waiting for task to exit")
-	}
+	testWaitForTaskToDie(t, tr)

 	// Get the mock driver plugin
 	driverPlugin, err := conf.DriverManager.Dispense(mockdriver.PluginID.Name)
@@ -654,7 +731,9 @@ func TestTaskRunner_TaskEnv_Chroot(t *testing.T) {
 	go tr.Run()
 	defer tr.Kill(context.Background(), structs.NewTaskEvent("cleanup"))

-	// Wait for task to exit
+	// Wait for task to exit and kill the task runner to run the stop hooks.
+	testWaitForTaskToDie(t, tr)
+	tr.Kill(context.Background(), structs.NewTaskEvent("kill"))
 	timeout := 15 * time.Second
 	if testutil.IsCI() {
 		timeout = 120 * time.Second
@@ -703,7 +782,9 @@ func TestTaskRunner_TaskEnv_Image(t *testing.T) {
 	tr, conf, cleanup := runTestTaskRunner(t, alloc, task.Name)
 	defer cleanup()

-	// Wait for task to exit
+	// Wait for task to exit and kill task runner to run the stop hooks.
+	testWaitForTaskToDie(t, tr)
+	tr.Kill(context.Background(), structs.NewTaskEvent("kill"))
 	select {
 	case <-tr.WaitCh():
 	case <-time.After(15 * time.Second):
@@ -750,7 +831,9 @@ func TestTaskRunner_TaskEnv_None(t *testing.T) {
 %s
 `, root, taskDir, taskDir, os.Getenv("PATH"))

-	// Wait for task to exit
+	// Wait for task to exit and kill the task runner to run the stop hooks.
+	testWaitForTaskToDie(t, tr)
+	tr.Kill(context.Background(), structs.NewTaskEvent("kill"))
 	select {
 	case <-tr.WaitCh():
 	case <-time.After(15 * time.Second):
@@ -818,10 +901,7 @@ func TestTaskRunner_DevicePropogation(t *testing.T) {
 	defer tr.Kill(context.Background(), structs.NewTaskEvent("cleanup"))

 	// Wait for task to complete
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(3 * time.Second):
-	}
+	testWaitForTaskToDie(t, tr)

 	// Get the mock driver plugin
 	driverPlugin, err := conf.DriverManager.Dispense(mockdriver.PluginID.Name)
@@ -1328,11 +1408,7 @@ func TestTaskRunner_CheckWatcher_Restart(t *testing.T) {
 	// Wait until the task exits. Don't simply wait for it to run as it may
 	// get restarted and terminated before the test is able to observe it
 	// running.
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
-		require.Fail(t, "timeout")
-	}
+	testWaitForTaskToDie(t, tr)

 	state := tr.TaskState()
 	actualEvents := make([]string, len(state.Events))
@@ -1421,11 +1497,7 @@ func TestTaskRunner_BlockForSIDSToken(t *testing.T) {
 	// task runner should exit now that it has been unblocked and it is a batch
 	// job with a zero sleep time
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(15 * time.Second * time.Duration(testutil.TestMultiplier())):
-		r.Fail("timed out waiting for batch task to exist")
-	}
+	testWaitForTaskToDie(t, tr)

 	// assert task exited successfully
 	finalState := tr.TaskState()
@@ -1478,11 +1550,7 @@ func TestTaskRunner_DeriveSIToken_Retry(t *testing.T) {
 	go tr.Run()

 	// assert task runner blocks on SI token
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
-		r.Fail("timed out waiting for task runner")
-	}
+	testWaitForTaskToDie(t, tr)

 	// assert task exited successfully
 	finalState := tr.TaskState()
@@ -1598,11 +1666,7 @@ func TestTaskRunner_BlockForVaultToken(t *testing.T) {
 	// TR should exit now that it's unblocked by vault as its a batch job
 	// with 0 sleeping.
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(15 * time.Second * time.Duration(testutil.TestMultiplier())):
-		require.Fail(t, "timed out waiting for batch task to exit")
-	}
+	testWaitForTaskToDie(t, tr)

 	// Assert task exited successfully
 	finalState := tr.TaskState()
@@ -1615,6 +1679,14 @@ func TestTaskRunner_BlockForVaultToken(t *testing.T) {
 	require.NoError(t, err)
 	require.Equal(t, token, string(data))

+	// Kill task runner to trigger stop hooks
+	tr.Kill(context.Background(), structs.NewTaskEvent("kill"))
+	select {
+	case <-tr.WaitCh():
+	case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
+		require.Fail(t, "timed out waiting for task runner to exit")
+	}
+
 	// Check the token was revoked
 	testutil.WaitForResult(func() (bool, error) {
 		if len(vaultClient.StoppedTokens()) != 1 {
@@ -1661,17 +1733,21 @@ func TestTaskRunner_DeriveToken_Retry(t *testing.T) {
 	defer tr.Kill(context.Background(), structs.NewTaskEvent("cleanup"))
 	go tr.Run()

-	// Wait for TR to exit and check its state
+	// Wait for TR to die and check its state
+	testWaitForTaskToDie(t, tr)
+
+	state := tr.TaskState()
+	require.Equal(t, structs.TaskStateDead, state.State)
+	require.False(t, state.Failed)
+
+	// Kill task runner to trigger stop hooks
+	tr.Kill(context.Background(), structs.NewTaskEvent("kill"))
 	select {
 	case <-tr.WaitCh():
 	case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
 		require.Fail(t, "timed out waiting for task runner to exit")
 	}

-	state := tr.TaskState()
-	require.Equal(t, structs.TaskStateDead, state.State)
-	require.False(t, state.Failed)
-
 	require.Equal(t, 1, count)

 	// Check that the token is on disk
@@ -1771,11 +1847,7 @@ func TestTaskRunner_Download_ChrootExec(t *testing.T) {
 	defer cleanup()

 	// Wait for task to run and exit
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
-		require.Fail(t, "timed out waiting for task runner to exit")
-	}
+	testWaitForTaskToDie(t, tr)

 	state := tr.TaskState()
 	require.Equal(t, structs.TaskStateDead, state.State)
@@ -1816,11 +1888,7 @@ func TestTaskRunner_Download_RawExec(t *testing.T) {
 	defer cleanup()

 	// Wait for task to run and exit
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
-		require.Fail(t, "timed out waiting for task runner to exit")
-	}
+	testWaitForTaskToDie(t, tr)

 	state := tr.TaskState()
 	require.Equal(t, structs.TaskStateDead, state.State)
@@ -1851,11 +1919,7 @@ func TestTaskRunner_Download_List(t *testing.T) {
 	defer cleanup()

 	// Wait for task to run and exit
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
-		require.Fail(t, "timed out waiting for task runner to exit")
-	}
+	testWaitForTaskToDie(t, tr)

 	state := tr.TaskState()
 	require.Equal(t, structs.TaskStateDead, state.State)
@@ -1902,11 +1966,7 @@ func TestTaskRunner_Download_Retries(t *testing.T) {
 	tr, _, cleanup := runTestTaskRunner(t, alloc, task.Name)
 	defer cleanup()

-	select {
-	case <-tr.WaitCh():
-	case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
-		require.Fail(t, "timed out waiting for task to exit")
-	}
+	testWaitForTaskToDie(t, tr)

 	state := tr.TaskState()
 	require.Equal(t, structs.TaskStateDead, state.State)
@@ -2100,6 +2160,8 @@ func TestTaskRunner_RestartSignalTask_NotRunning(t *testing.T) {
 	case <-time.After(1 * time.Second):
 	}

+	require.Equal(t, structs.TaskStatePending, tr.TaskState().State)
+
 	// Send a signal and restart
 	err = tr.Signal(structs.NewTaskEvent("don't panic"), "QUIT")
 	require.EqualError(t, err, ErrTaskNotRunning.Error())
@@ -2110,12 +2172,7 @@ func TestTaskRunner_RestartSignalTask_NotRunning(t *testing.T) {
 	// Unblock and let it finish
 	waitCh <- struct{}{}

+	testWaitForTaskToDie(t, tr)
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(10 * time.Second):
-		require.Fail(t, "timed out waiting for task to complete")
-	}

 	// Assert the task ran and never restarted
 	state := tr.TaskState()
@@ -2153,11 +2210,7 @@ func TestTaskRunner_Run_RecoverableStartError(t *testing.T) {
 	tr, _, cleanup := runTestTaskRunner(t, alloc, task.Name)
 	defer cleanup()

-	select {
-	case <-tr.WaitCh():
-	case <-time.After(time.Duration(testutil.TestMultiplier()*15) * time.Second):
-		require.Fail(t, "timed out waiting for task to exit")
-	}
+	testWaitForTaskToDie(t, tr)

 	state := tr.TaskState()
 	require.Equal(t, structs.TaskStateDead, state.State)
@@ -2202,11 +2255,7 @@ func TestTaskRunner_Template_Artifact(t *testing.T) {
 	go tr.Run()

 	// Wait for task to run and exit
-	select {
-	case <-tr.WaitCh():
-	case <-time.After(15 * time.Second * time.Duration(testutil.TestMultiplier())):
-		require.Fail(t, "timed out waiting for task runner to exit")
-	}
+	testWaitForTaskToDie(t, tr)

 	state := tr.TaskState()
 	require.Equal(t, structs.TaskStateDead, state.State)
@@ -2536,7 +2585,9 @@ func TestTaskRunner_UnregisterConsul_Retries(t *testing.T) {
 	tr, err := NewTaskRunner(conf)
 	require.NoError(t, err)
 	defer tr.Kill(context.Background(), structs.NewTaskEvent("cleanup"))
-	tr.Run()
+	go tr.Run()
+	testWaitForTaskToDie(t, tr)

 	state := tr.TaskState()
 	require.Equal(t, structs.TaskStateDead, state.State)
@@ -2562,7 +2613,17 @@
 func testWaitForTaskToStart(t *testing.T, tr *TaskRunner) {
 	testutil.WaitForResult(func() (bool, error) {
 		ts := tr.TaskState()
-		return ts.State == structs.TaskStateRunning, fmt.Errorf("%v", ts.State)
+		return ts.State == structs.TaskStateRunning, fmt.Errorf("expected task to be running, got %v", ts.State)
+	}, func(err error) {
+		require.NoError(t, err)
+	})
+}
+
+// testWaitForTaskToDie waits for the task to die or fails the test
+func testWaitForTaskToDie(t *testing.T, tr *TaskRunner) {
+	testutil.WaitForResult(func() (bool, error) {
+		ts := tr.TaskState()
+		return ts.State == structs.TaskStateDead, fmt.Errorf("expected task to be dead, got %v", ts.State)
 	}, func(err error) {
 		require.NoError(t, err)
 	})

View File

@@ -4,6 +4,7 @@
 package allocrunner

 import (
+	"fmt"
 	"sync"
 	"testing"

@@ -20,6 +21,7 @@ import (
 	"github.com/hashicorp/nomad/client/state"
 	"github.com/hashicorp/nomad/client/vaultclient"
 	"github.com/hashicorp/nomad/nomad/structs"
+	"github.com/hashicorp/nomad/testutil"
 	"github.com/stretchr/testify/require"
 )
@@ -104,3 +106,13 @@ func TestAllocRunnerFromAlloc(t *testing.T, alloc *structs.Allocation) (*allocRu
 	return ar, cleanup
 }
+
+func WaitForClientState(t *testing.T, ar *allocRunner, state string) {
+	testutil.WaitForResult(func() (bool, error) {
+		got := ar.AllocState().ClientStatus
+		return got == state,
+			fmt.Errorf("expected alloc runner to be in state %s, got %s", state, got)
+	}, func(err error) {
+		require.NoError(t, err)
+	})
+}

View File

@@ -160,6 +160,7 @@ type AllocRunner interface {
 	PersistState() error
 	RestartTask(taskName string, taskEvent *structs.TaskEvent) error
+	RestartRunning(taskEvent *structs.TaskEvent) error
 	RestartAll(taskEvent *structs.TaskEvent) error

 	Reconnect(update *structs.Allocation) error
@@ -925,22 +926,33 @@ func (c *Client) CollectAllAllocs() {
 	c.garbageCollector.CollectAll()
 }

-func (c *Client) RestartAllocation(allocID, taskName string) error {
+func (c *Client) RestartAllocation(allocID, taskName string, allTasks bool) error {
+	if allTasks && taskName != "" {
+		return fmt.Errorf("task name cannot be set when restarting all tasks")
+	}
+
 	ar, err := c.getAllocRunner(allocID)
 	if err != nil {
 		return err
 	}

-	event := structs.NewTaskEvent(structs.TaskRestartSignal).
-		SetRestartReason("User requested restart")
-
 	if taskName != "" {
+		event := structs.NewTaskEvent(structs.TaskRestartSignal).
+			SetRestartReason("User requested task to restart")
 		return ar.RestartTask(taskName, event)
 	}

+	if allTasks {
+		event := structs.NewTaskEvent(structs.TaskRestartSignal).
+			SetRestartReason("User requested all tasks to restart")
 		return ar.RestartAll(event)
 	}
+
+	event := structs.NewTaskEvent(structs.TaskRestartSignal).
+		SetRestartReason("User requested running tasks to restart")
+	return ar.RestartRunning(event)
+}

 // Node returns the locally registered node
 func (c *Client) Node() *structs.Node {
 	return c.GetConfig().Node

View File

@@ -283,6 +283,7 @@ func (s *HTTPServer) allocRestart(allocID string, resp http.ResponseWriter, req
 	// Explicitly parse the body separately to disallow overriding AllocID in req Body.
 	var reqBody struct {
 		TaskName string
+		AllTasks bool
 	}
 	err := json.NewDecoder(req.Body).Decode(&reqBody)
 	if err != nil && err != io.EOF {
@@ -291,6 +292,9 @@
 	if reqBody.TaskName != "" {
 		args.TaskName = reqBody.TaskName
 	}
+	if reqBody.AllTasks {
+		args.AllTasks = reqBody.AllTasks
+	}

 	// Determine the handler to use
 	useLocalClient, useClientRPC, useServerRPC := s.rpcHandlerForAlloc(allocID)

View File

@@ -18,8 +18,11 @@ func (c *AllocRestartCommand) Help() string {
 Usage: nomad alloc restart [options] <allocation> <task>

   Restart an existing allocation. This command is used to restart a specific alloc
-  and its tasks. If no task is provided then all of the allocation's tasks will
-  be restarted.
+  and its tasks. If no task is provided then all of the allocation's tasks that
+  are currently running will be restarted.
+
+  Use the option '-all-tasks' to restart tasks that have already run, such as
+  non-sidecar prestart and poststart tasks.

   When ACLs are enabled, this command requires a token with the
   'alloc-lifecycle', 'read-job', and 'list-jobs' capabilities for the
@@ -31,9 +34,15 @@ General Options:

 Restart Specific Options:

+  -all-tasks
+    If set, all tasks in the allocation will be restarted, even the ones that
+    already ran. This option cannot be used with '-task' or the '<task>'
+    argument.
+
   -task <task-name>
     Specify the individual task to restart. If task name is given with both an
     argument and the '-task' option, preference is given to the '-task' option.
+    This option cannot be used with '-all-tasks'.

   -verbose
     Show full information.
@@ -44,11 +53,12 @@ Restart Specific Options:

 func (c *AllocRestartCommand) Name() string { return "alloc restart" }

 func (c *AllocRestartCommand) Run(args []string) int {
-	var verbose bool
+	var allTasks, verbose bool
 	var task string

 	flags := c.Meta.FlagSet(c.Name(), FlagSetClient)
 	flags.Usage = func() { c.Ui.Output(c.Help()) }
+	flags.BoolVar(&allTasks, "all-tasks", false, "")
 	flags.BoolVar(&verbose, "verbose", false, "")
 	flags.StringVar(&task, "task", "", "")
@@ -66,6 +76,17 @@ func (c *AllocRestartCommand) Run(args []string) int {
 	allocID := args[0]

+	// If -task isn't provided fallback to reading the task name
+	// from args.
+	if task == "" && len(args) >= 2 {
+		task = args[1]
+	}
+
+	if allTasks && task != "" {
+		c.Ui.Error("The -all-tasks option is not allowed when restarting a specific task.")
+		return 1
+	}
+
 	// Truncate the id unless full length is requested
 	length := shortId
 	if verbose {
@@ -113,12 +134,6 @@ func (c *AllocRestartCommand) Run(args []string) int {
 		return 1
 	}

-	// If -task isn't provided fallback to reading the task name
-	// from args.
-	if task == "" && len(args) >= 2 {
-		task = args[1]
-	}
-
 	if task != "" {
 		err := validateTaskExistsInAllocation(task, alloc)
 		if err != nil {
@@ -127,9 +142,17 @@ func (c *AllocRestartCommand) Run(args []string) int {
 		}
 	}

+	if allTasks {
+		err = client.Allocations().RestartAllTasks(alloc, nil)
+	} else {
 		err = client.Allocations().Restart(alloc, task, nil)
+	}
+
 	if err != nil {
-		c.Ui.Error(fmt.Sprintf("Failed to restart allocation:\n\n%s", err.Error()))
+		target := "allocation"
+		if task != "" {
+			target = "task"
+		}
+		c.Ui.Error(fmt.Sprintf("Failed to restart %s:\n\n%s", target, err.Error()))
 		return 1
 	}

View File

@@ -700,6 +700,50 @@ func LifecycleAlloc() *structs.Allocation {
 	return alloc
 }

+type LifecycleTaskDef struct {
+	Name      string
+	RunFor    string
+	ExitCode  int
+	Hook      string
+	IsSidecar bool
+}
+
+// LifecycleAllocFromTasks generates an Allocation with mock tasks that have
+// the provided lifecycles.
+func LifecycleAllocFromTasks(tasks []LifecycleTaskDef) *structs.Allocation {
+	alloc := LifecycleAlloc()
+	alloc.Job.TaskGroups[0].Tasks = []*structs.Task{}
+	for _, task := range tasks {
+		var lc *structs.TaskLifecycleConfig
+		if task.Hook != "" {
+			// TODO: task coordinator doesn't treat nil and empty structs the same
+			lc = &structs.TaskLifecycleConfig{
+				Hook:    task.Hook,
+				Sidecar: task.IsSidecar,
+			}
+		}
+
+		alloc.Job.TaskGroups[0].Tasks = append(alloc.Job.TaskGroups[0].Tasks,
+			&structs.Task{
+				Name:   task.Name,
+				Driver: "mock_driver",
+				Config: map[string]interface{}{
+					"run_for":   task.RunFor,
+					"exit_code": task.ExitCode},
+				Lifecycle: lc,
+				LogConfig: structs.DefaultLogConfig(),
+				Resources: &structs.Resources{CPU: 100, MemoryMB: 256},
+			},
+		)
+		alloc.TaskResources[task.Name] = &structs.Resources{CPU: 100, MemoryMB: 256}
+		alloc.AllocatedResources.Tasks[task.Name] = &structs.AllocatedTaskResources{
+			Cpu:    structs.AllocatedCpuResources{CpuShares: 100},
+			Memory: structs.AllocatedMemoryResources{MemoryMB: 256},
+		}
+	}
+	return alloc
+}
+
 func LifecycleJobWithPoststopDeploy() *structs.Job {
 	job := &structs.Job{
 		Region: "global",

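To make the intent of the helper concrete, a hypothetical test (not part of this change) could assemble a lifecycle allocation like so; the task names and durations are made up:

```go
package allocrunner_test // hypothetical location for the example

import (
	"testing"

	"github.com/hashicorp/nomad/nomad/mock"
)

func TestLifecycleExample(t *testing.T) {
	// One prestart init task, one prestart sidecar, a main task, and a
	// poststop cleanup task, all backed by the mock driver.
	alloc := mock.LifecycleAllocFromTasks([]mock.LifecycleTaskDef{
		{Name: "init", RunFor: "1s", Hook: "prestart"},
		{Name: "sidecar", RunFor: "100s", Hook: "prestart", IsSidecar: true},
		{Name: "main", RunFor: "10s"},
		{Name: "cleanup", RunFor: "1s", Hook: "poststop"},
	})

	if n := len(alloc.Job.TaskGroups[0].Tasks); n != 4 {
		t.Fatalf("expected 4 tasks, got %d", n)
	}
}
```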
View File

@@ -1028,6 +1028,7 @@ type AllocsGetRequest struct {
 type AllocRestartRequest struct {
 	AllocID  string
 	TaskName string
+	AllTasks bool

 	QueryOptions
 }
@@ -8043,7 +8044,7 @@ const (
 	// restarted because it has exceeded its restart policy.
 	TaskNotRestarting = "Not Restarting"

-	// TaskRestartSignal indicates that the task has been signalled to be
+	// TaskRestartSignal indicates that the task has been signaled to be
 	// restarted
 	TaskRestartSignal = "Restart Signaled"
@@ -8324,6 +8325,9 @@ func (e *TaskEvent) PopulateEventDisplayMessage() {
 }

 func (e *TaskEvent) GoString() string {
+	if e == nil {
+		return ""
+	}
 	return fmt.Sprintf("%v - %v", e.Time, e.Type)
 }

View File

@@ -731,6 +731,13 @@ The table below shows this endpoint's support for
   must be the full UUID, not the short 8-character one. This is specified as
   part of the path.

+- `TaskName` `(string: "")` - Specifies the individual task to restart. Cannot
+  be used with `AllTasks` set to `true`.
+
+- `AllTasks` `(bool: false)` - If set to `true` all tasks in the allocation
+  will be restarted, even the ones that already ran. Cannot be set to `true` if
+  `TaskName` is defined.
+
 ### Sample Payload

 ```json

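As an illustrative aside (not part of the diff): a request body that restarts every task in the allocation, including ones that already ran, would only need the new field, for example:

```json
{
  "AllTasks": true
}
```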
View File

@@ -18,12 +18,16 @@ nomad alloc restart [options] <allocation> <task>

 This command accepts a single allocation ID and a task name. The task name must
 be part of the allocation and the task must be currently running. The task name
-is optional and if omitted every task in the allocation will be restarted.
+is optional and if omitted all tasks that are currently running will be
+restarted.

 Task name may also be specified using the `-task` option rather than a command
 argument. If task name is given with both an argument and the `-task` option,
 preference is given to the `-task` option.

+Use the option `-all-tasks` to restart tasks that have already run, such as
+non-sidecar prestart and poststart tasks.
+
 When ACLs are enabled, this command requires a token with the
 `alloc-lifecycle`, `read-job`, and `list-jobs` capabilities for the
 allocation's namespace.
@@ -34,7 +38,12 @@ allocation's namespace.

 ## Restart Options

-- `-task`: Specify the individual task to restart.
+- `-all-tasks`: If set, all tasks in the allocation will be restarted, even the
+  ones that already ran. This option cannot be used with `-task` or the
+  `<task>` argument.
+
+- `-task`: Specify the individual task to restart. This option cannot be used
+  with `-all-tasks`.

 - `-verbose`: Display verbose output.
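A quick usage sketch of the resulting CLI surface (the allocation ID and task name below are placeholders):

```shell-session
$ nomad alloc restart eb17e557              # restart only the currently running tasks
$ nomad alloc restart -all-tasks eb17e557   # also restart tasks that have already run
$ nomad alloc restart eb17e557 keystore     # restart a single task by name
```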