open-nomad/website/source/guides/operating-a-job/failure-handling-strategies/reschedule.html.md
Alex Dadgar 1e3c3cb287 Deprecate IOPS
IOPS have been modelled as a resource since Nomad 0.1 but has never
actually been detected and there is no plan in the short term to add
detection. This is because IOPS is a bit simplistic of a unit to define
the performance requirements from the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.
2018-12-06 15:09:26 -08:00

3.3 KiB

layout page_title sidebar_current description
guides Reschedule Stanza - Operating a Job guides-operating-a-job-failure-handling-strategies-reschedule Nomad can reschedule failing tasks after any local restart attempts have been exhausted. This is useful to recover from failures stemming from problems in the node running the task.

Reschedule Stanza

Tasks can sometimes fail due to network, CPU or memory issues on the node running the task. In such situations, Nomad can reschedule the task on another node. The reschedule stanza can be used to configure how Nomad should try placing failed tasks on another node in the cluster. Reschedule attempts have a delay between each attempt, and the delay can be configured to increase between each rescheduling attempt according to a configurable delay_function. See the reschedule stanza for more information.

Service jobs are configured by default to have unlimited reschedule attempts. We recommend using the reschedule stanza to ensure that failed tasks are automatically reattempted on another node without needing operator intervention.

Example

The following CLI example shows job and allocation statuses for a task being rescheduled by Nomad. The CLI shows the number of previous attempts if there is a limit on the number of reschedule attempts. The CLI also shows when the next reschedule will be attempted.

$nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T15:48:37-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       0         0        2       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
demo        ee3de93f  5s from now

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
39d7823d  f2c2eaa6  demo        0        run      failed  5s ago   5s ago
fafb011b  f2c2eaa6  demo        0        run      failed  11s ago  10s ago

$nomad alloc status 3d0b
ID                     = 3d0bbdb1
Eval ID                = 79b846a9
Name                   = demo.demo[0]
Node ID                = 8a184f31
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 15s ago
Modified               = 15s ago
Reschedule Attempts    = 3/5
Reschedule Eligibility = 25s from now

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1: 127.0.0.1:27646

Task Events:
Started At     = 2018-04-12T20:44:25Z
Finished At    = 2018-04-12T20:44:25Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2018-04-12T15:44:25-05:00  Not Restarting  Policy allows no restarts
2018-04-12T15:44:25-05:00  Terminated      Exit Code: 127
2018-04-12T15:44:25-05:00  Started         Task started by client
2018-04-12T15:44:25-05:00  Task Setup      Building Task Directory
2018-04-12T15:44:25-05:00  Received        Task received by client