Michael Schurter 5f1033263b docs: move operating-a-job to guides

Add redirects from /docs/ -> /guides/ and update old redirects to point
to the new location.

2018-04-18 16:21:16 -07:00

1.3 KiB


---
layout: "guides"
page_title: "Handling Failures - Operating a Job"
sidebar_current: "guides-operating-a-job-failure-handling-strategies"
description: |-
  This section describes features in Nomad that automate recovering from failed tasks.
---

Failure Recovery Strategies

Most applications deployed in Nomad are either long-running services or one-time batch jobs. They can fail for various reasons, such as:

- A temporary error in the service that resolves when it is restarted.
- An unavailable upstream dependency, leading to a health check failure.
- Disk, memory, or CPU contention on the node where the application is running.
- An unresponsive Docker daemon on the node, when the application uses Docker.

Nomad provides configurable options for recovering failed tasks and avoiding downtime. Nomad tries to restart a failed task on the node where it is running, and can also reschedule it onto another node. See one of the guides below or use the navigation on the left for details on each option:
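As a rough orientation before the detailed guides, the two behaviors described above (local restarts and rescheduling onto another node) are configured in the job specification. The fragment below is a minimal sketch, not a recommended configuration: the job, group, and task names, the image, and every value are assumptions chosen for illustration, and the parameters available can differ between Nomad versions.

```hcl
# Hypothetical job used only for illustration; all names and values are assumptions.
job "example" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    # Local restarts: retry the failed task on the node it is already running on.
    restart {
      attempts = 2      # restarts allowed within the interval
      interval = "30m"  # window over which attempts are counted
      delay    = "15s"  # wait before each restart
      mode     = "fail" # once attempts are exhausted, mark the allocation as failed
    }

    # Rescheduling: place a failed allocation on another node.
    reschedule {
      attempts  = 3     # reschedule attempts allowed within the interval
      interval  = "1h"
      unlimited = false
    }

    task "server" {
      driver = "docker"

      config {
        image = "example/web:1.0" # placeholder image name
      }
    }
  }
}
```

With `mode = "fail"`, exhausting the local restart attempts fails the allocation, which is what makes it a candidate for rescheduling under the `reschedule` settings.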
