layout: guides
page_title: Restart Stanza - Operating a Job
sidebar_current: guides-operating-a-job-failure-handling-strategies-local-restarts
description: Nomad can restart a task on the node it is running on to recover from failures. Task restarts can be limited to a number of attempts within a specific interval.

Restart Stanza

To restart a failed task on the node it is running on, add a restart stanza to the task group. Nomad will restart the failed task up to attempts times within the configured interval. The mode parameter controls what happens once that limit is reached: Nomad can either keep attempting restarts on the same node, or fail the task so that it can be rescheduled on another node.

We recommend setting mode to fail in the restart stanza to allow rescheduling the task on another node.
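
As a minimal sketch, a restart stanza following this recommendation might look like the fragment below. The group name and the specific values are assumptions, chosen to be consistent with the CLI output in the Example section (three attempts, a five-minute interval, a delay of roughly ten seconds); they are not taken from the actual job file, which is not shown in this guide.

```hcl
# Hypothetical task group; only the restart stanza is relevant here.
group "demo" {
  restart {
    # Allow at most 3 restarts within the interval (assumed value).
    attempts = 3

    # Window in which the attempts are counted (assumed value).
    interval = "5m"

    # Base wait before each restart; Nomad adds a small random jitter (assumed value).
    delay = "10s"

    # Once the limit is exceeded, fail the task instead of continuing to
    # retry on the same node, so that it can be rescheduled elsewhere.
    mode = "fail"
  }
}
```

With mode set to delay instead, Nomad would wait for the next interval and continue restarting the task on the same node rather than failing it.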

Example

The following CLI example shows the job status, including its allocations, for a failed task that Nomad is restarting. Allocations remain in the pending state while restarts are attempted. The Recent Events section of the allocation status, shown further down, lists the individual restart attempts.

$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T14:37:18-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       3         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
ce5bf1d1  8a184f31  demo        0        run      pending  27s ago  5s ago
d5dee7c8  8a184f31  demo        0        run      pending  27s ago  5s ago
ed815997  8a184f31  demo        0        run      pending  27s ago  5s ago

In the following example, Nomad restarts the allocation ce5bf1d1 approximately every ten seconds, with a small random jitter added to each delay. After three restarts it reaches its limit of three attempts and transitions into the failed state, at which point it becomes eligible for rescheduling.

$ nomad alloc-status ce5bf1d1
ID                     = ce5bf1d1
Eval ID                = 64e45d11
Name                   = demo.demo[1]
Node ID                = a0ccdd8b
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 56s ago
Modified               = 22s ago

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB

Task Events:
Started At     = 2018-04-12T22:29:08Z
Finished At    = 2018-04-12T22:29:08Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:28:57-05:00

Recent Events:
Time                       Type            Description
2018-04-12T17:29:08-05:00  Not Restarting  Exceeded allowed attempts 3 in interval 5m0s and mode is "fail"
2018-04-12T17:29:08-05:00  Terminated      Exit Code: 127
2018-04-12T17:29:08-05:00  Started         Task started by client
2018-04-12T17:28:57-05:00  Restarting      Task restarting in 10.364602876s
2018-04-12T17:28:57-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:57-05:00  Started         Task started by client
2018-04-12T17:28:47-05:00  Restarting      Task restarting in 10.666963769s
2018-04-12T17:28:47-05:00  Terminated      Exit Code: 127
2018-04-12T17:28:47-05:00  Started         Task started by client
2018-04-12T17:28:35-05:00  Restarting      Task restarting in 11.777324721s