Added section on failure recovery under operating a job with details and examples of different restarts.
parent 4f9a52e4a4
commit 7d246b56b7
@@ -0,0 +1,23 @@
---
layout: "docs"
page_title: "Check Restart Stanza - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies-check-restart"
description: |-
  Nomad can restart service job tasks with failing health checks based on
  configuration specified in the `check_restart` stanza. Restarts are done locally on the node
  running the task according to its `restart` policy.
---

# Check Restart Stanza

The [`check_restart` stanza][check restart] instructs Nomad when to restart tasks with unhealthy service checks.
When a health check in Consul has been unhealthy for the limit specified in a `check_restart` stanza,
the task is restarted according to the task group's restart policy.

The `limit` field specifies the number of times a failing health check must be seen before local restarts are attempted.
Operators can also specify a `grace` duration to wait after a task restarts before checking its health.

We recommend configuring check restarts on services when a restart is likely to resolve the failure,
for example a temporary memory issue in the service.
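As a sketch, a `check_restart` stanza nested inside a service check might look like the following. The service name, port label, and values here are illustrative assumptions, not recommendations from this guide:

```hcl
service {
  # Hypothetical service and port label, for illustration only.
  name = "cache"
  port = "db"

  check {
    type     = "tcp"
    interval = "10s"
    timeout  = "2s"

    check_restart {
      # Restart the task once the check has been unhealthy 3 times.
      limit = 3

      # Wait 90s after the task restarts before resuming health checks.
      grace = "90s"
    }
  }
}
```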

[check restart]: /docs/job-specification/check_restart.html "Nomad check restart Stanza"
@@ -0,0 +1,25 @@
---
layout: "docs"
page_title: "Handling Failures - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies"
description: |-
  This section describes features in Nomad that automate recovery from failed tasks.
---

# Failure Recovery Strategies

Most applications deployed in Nomad are either long-running services or one-time batch jobs.
They can fail for various reasons, such as:

- A temporary error in the service that resolves when it is restarted.
- An upstream dependency might not be available, leading to a health check failure.
- Disk, memory, or CPU contention on the node the application is running on.
- The application uses Docker and the Docker daemon on that node is no longer running.

Nomad provides configurable options for recovering failed tasks to avoid downtime. Nomad will
try to restart a failed task on the node it is running on, and also try to reschedule it on another node.
Please see one of the guides below or use the navigation on the left for details on each option:

1. [Local Restarts](/docs/operating-a-job/failure-handling-strategies/restart.html)
1. [Check Restarts](/docs/operating-a-job/failure-handling-strategies/check-restart.html)
1. [Rescheduling](/docs/operating-a-job/failure-handling-strategies/rescheduling.html)
@@ -0,0 +1,92 @@
---
layout: "docs"
page_title: "Reschedule Stanza - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies-reschedule"
description: |-
  Nomad can reschedule failing tasks after any local restart attempts have been
  exhausted. This is useful for recovering from failures stemming from problems on the node
  running the task.
---

# Reschedule Stanza

Tasks can sometimes fail due to network, CPU, or memory issues on the node running the task. In such situations,
Nomad can reschedule the task on another node. The [`reschedule` stanza][reschedule] is used to configure how
Nomad should try placing failed tasks on other nodes in the cluster. Reschedule attempts have a delay between
each attempt, and the delay can be configured to increase between rescheduling attempts according to a configurable
`delay_function`. See the [documentation][reschedule] for more information on all the rescheduling options.

Service jobs are configured by default to have unlimited reschedule attempts. We recommend using the reschedule
stanza to ensure that failed tasks are automatically reattempted on another node without needing operator intervention.
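A minimal sketch of a reschedule stanza using an exponential delay function is shown below; the group name and field values are illustrative assumptions, not prescriptions:

```hcl
group "demo" {
  reschedule {
    # Wait 30s before the first reschedule attempt, increasing the
    # delay exponentially on each subsequent attempt up to a 1h cap.
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"

    # Keep rescheduling indefinitely (the default for service jobs).
    unlimited = true
  }
}
```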

## Example

The following CLI example shows job and allocation statuses for a task being rescheduled by Nomad.
The CLI shows the number of previous attempts when there is a limit on the number of reschedule attempts,
as well as when the next reschedule will be attempted.

```text
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T15:48:37-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       0         0        2       0         0

Future Rescheduling Attempts
Task Group  Eval ID   Eval Time
demo        ee3de93f  5s from now

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created  Modified
39d7823d  f2c2eaa6  demo        0        run      failed  5s ago   5s ago
fafb011b  f2c2eaa6  demo        0        run      failed  11s ago  10s ago
```

```text
$ nomad alloc status 3d0b
ID                     = 3d0bbdb1
Eval ID                = 79b846a9
Name                   = demo.demo[0]
Node ID                = 8a184f31
Job ID                 = demo
Job Version            = 0
Client Status          = failed
Client Description     = <none>
Desired Status         = run
Desired Description    = <none>
Created                = 15s ago
Modified               = 15s ago
Reschedule Attempts    = 3/5
Reschedule Eligibility = 25s from now

Task "demo" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0     p1: 127.0.0.1:27646

Task Events:
Started At     = 2018-04-12T20:44:25Z
Finished At    = 2018-04-12T20:44:25Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2018-04-12T15:44:25-05:00  Not Restarting  Policy allows no restarts
2018-04-12T15:44:25-05:00  Terminated      Exit Code: 127
2018-04-12T15:44:25-05:00  Started         Task started by client
2018-04-12T15:44:25-05:00  Task Setup      Building Task Directory
2018-04-12T15:44:25-05:00  Received        Task received by client
```

[reschedule]: /docs/job-specification/reschedule.html "Nomad reschedule Stanza"
@@ -0,0 +1,91 @@
---
layout: "docs"
page_title: "Restart Stanza - Operating a Job"
sidebar_current: "docs-operating-a-job-failure-handling-strategies-local-restarts"
description: |-
  Nomad can restart a task on the node it is running on to recover from
  failures. Task restarts can be limited to a number of attempts within
  a specific interval.
---

# Restart Stanza

To enable restarting a failed task on the node it is running on, the task group can be annotated
with configurable options using the [`restart` stanza][restart]. Nomad will restart the failed task
up to `attempts` times within a provided `interval`. Operators can also choose, via the `mode`
parameter, whether to keep attempting restarts on the same node, or to fail the task so that it
can be rescheduled on another node.

We recommend setting `mode` to `fail` in the restart stanza to allow rescheduling the task on another node.
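The options above can be sketched as a restart stanza on a task group. The group name and values below are illustrative assumptions, not recommendations:

```hcl
group "demo" {
  restart {
    # Allow 3 restarts within a 30m window, waiting 15s between restarts.
    attempts = 3
    interval = "30m"
    delay    = "15s"

    # Fail the task once restarts are exhausted so it can be rescheduled
    # on another node.
    mode = "fail"
  }
}
```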
## Example

The following CLI example shows job status and allocation status for a failed task that is being restarted by Nomad.
Allocations are in the `pending` state while restarts are attempted. The `Recent Events` section in the CLI
shows ongoing restart attempts.

```text
$ nomad job status demo
ID            = demo
Name          = demo
Submit Date   = 2018-04-12T14:37:18-05:00
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
demo        0       3         0        0       0         0

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
ce5bf1d1  8a184f31  demo        0        run      pending  27s ago  5s ago
d5dee7c8  8a184f31  demo        0        run      pending  27s ago  5s ago
ed815997  8a184f31  demo        0        run      pending  27s ago  5s ago
```

```text
$ nomad alloc-status ce5b
ID                  = ce5bf1d1
Eval ID             = 05681b90
Name                = demo.demo[1]
Node ID             = 8a184f31
Job ID              = demo
Job Version         = 0
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 31s ago
Modified            = 9s ago

Task "demo" is "pending"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0

Task Events:
Started At     = 2018-04-12T19:37:40Z
Finished At    = N/A
Total Restarts = 3
Last Restart   = 2018-04-12T14:37:40-05:00

Recent Events:
Time                       Type        Description
2018-04-12T14:37:40-05:00  Restarting  Task restarting in 11.686056069s
2018-04-12T14:37:40-05:00  Terminated  Exit Code: 127
2018-04-12T14:37:40-05:00  Started     Task started by client
2018-04-12T14:37:29-05:00  Restarting  Task restarting in 10.97348449s
2018-04-12T14:37:29-05:00  Terminated  Exit Code: 127
2018-04-12T14:37:29-05:00  Started     Task started by client
2018-04-12T14:37:19-05:00  Restarting  Task restarting in 10.619985509s
2018-04-12T14:37:19-05:00  Terminated  Exit Code: 127
2018-04-12T14:37:19-05:00  Started     Task started by client
2018-04-12T14:37:19-05:00  Task Setup  Building Task Directory
```

[restart]: /docs/job-specification/restart.html "Nomad restart Stanza"
@@ -132,6 +132,20 @@
          </li>
        </ul>
      </li>
      <li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies") %>>
        <a href="/docs/operating-a-job/failure-handling-strategies/index.html">Failure Recovery Strategies</a>
        <ul class="nav">
          <li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies-local-restarts") %>>
            <a href="/docs/operating-a-job/failure-handling-strategies/restart.html">Local Restarts</a>
          </li>
          <li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies-check-restart") %>>
            <a href="/docs/operating-a-job/failure-handling-strategies/check-restart.html">Check Restarts</a>
          </li>
          <li<%= sidebar_current("docs-operating-a-job-failure-handling-strategies-reschedule") %>>
            <a href="/docs/operating-a-job/failure-handling-strategies/reschedule.html">Rescheduling</a>
          </li>
        </ul>
      </li>
    </ul>
  </li>