open-nomad/website/source/guides/operating-a-job/failure-handling-strategies/index.html.md
Michael Schurter 5f1033263b docs: move operating-a-job to guides
Add redirects from /docs/ -> /guides/ and update old redirects to point
to the new location.
2018-04-18 16:21:16 -07:00


layout: guides
page_title: Handling Failures - Operating a Job
sidebar_current: guides-operating-a-job-failure-handling-strategies
description: This section describes features in Nomad that automate recovering from failed tasks.

Failure Recovery Strategies

Most applications deployed in Nomad are either long-running services or one-time batch jobs. They can fail for various reasons, such as:

  • A temporary error in the service that resolves when it is restarted.
  • An upstream dependency might not be available, leading to a health check failure.
  • Disk, memory, or CPU contention on the node that the application is running on.
  • The application uses Docker and the Docker daemon on that node is unresponsive.

Nomad provides configurable options for recovering failed tasks automatically, so that failures do not turn into downtime. Nomad first tries to restart a failed task on the node it is running on; if the task keeps failing, Nomad can also reschedule it onto another node. See one of the guides below, or use the navigation on the left, for details on each option:

  1. Local Restarts
  2. Check Restarts
  3. Rescheduling
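
As a rough orientation, the fragment below sketches where each of the three options sits in a job specification. It is an illustrative fragment, not a complete job file: the group, task, service, and image names are hypothetical, every value is a placeholder rather than a recommendation, and it assumes a port labeled "http" is defined elsewhere in the job. The individual guides remain the reference for the available parameters and their defaults.

```hcl
group "web" {
  # Local restarts: retry the failed task on the same node.
  restart {
    attempts = 2      # restarts allowed within the interval
    interval = "30m"
    delay    = "15s"  # wait between restarts
    mode     = "fail" # once attempts are exhausted, mark the allocation failed
  }

  # Rescheduling: place a failed allocation on another node.
  reschedule {
    attempts = 1
    interval = "1h"
  }

  task "server" {
    driver = "docker"

    config {
      image = "example/web:1.0" # hypothetical image name
    }

    service {
      name = "web"
      port = "http" # assumes a port labeled "http" is defined elsewhere

      check {
        type     = "http"
        path     = "/health" # hypothetical health endpoint
        interval = "10s"
        timeout  = "2s"

        # Check restarts: restart the task when its health check keeps failing.
        check_restart {
          limit = 3     # consecutive failing checks before a restart
          grace = "90s" # time allowed for startup before checks count
        }
      }
    }
  }
}
```

In this sketch the three options build on each other: a failing health check triggers a restart through `check_restart`, restarts are bounded by the `restart` stanza, and once those are used up with `mode = "fail"` the allocation becomes a candidate for rescheduling.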