open-nomad/website/source/guides/operating-a-job/failure-handling-strategies/check-restart.html.md
Michael Schurter 5f1033263b docs: move operating-a-job to guides
Add redirects from /docs/ -> /guides/ and update old redirects to point
to the new location.
2018-04-18 16:21:16 -07:00

3.2 KiB

layout page_title sidebar_current description
guides Check Restart Stanza - Operating a Job guides-operating-a-job-failure-handling-strategies-check-restart Nomad can restart tasks if they have a failing health check based on configuration specified in the `check_restart` stanza. Restarts are done locally on the node running the task based on their `restart` policy.

Check Restart Stanza

The check_restart stanza instructs Nomad when to restart tasks with unhealthy service checks. When a health check in Consul has been unhealthy for the limit specified in a check_restart stanza, it is restarted according to the task group's restart policy.

The limit field is used to specify the number of times a failing healthcheck is seen before local restarts are attempted. Operators can also specify a grace duration to wait after a task restarts before checking its health.

We recommend configuring the check restart on services if its likely that a restart would resolve the failure. This is applicable in cases like temporary memory issues on the service.

Example

The following check_restart stanza waits for two consecutive health check failures with a grace period and considers both critical and warning statuses as failures

check_restart {
  limit           = 2
  grace           = "10s"
  ignore_warnings = false
}

The following CLI example output shows healthcheck failures triggering restarts until its restart limit is reached.

$nomad alloc status e1b43128-2a0a-6aa3-c375-c7e8a7c48690
ID                   = e1b43128
Eval ID              = 249cbfe9
Name                 = demo.demo[0]
Node ID              = 221e998e
Job ID               = demo
Job Version          = 0
Client Status        = failed
Client Description   = <none>
Desired Status       = run
Desired Description  = <none>
Created              = 2m59s ago
Modified             = 39s ago

Task "test" is "dead"
Task Resources
CPU      Memory   Disk     IOPS  Addresses
100 MHz  300 MiB  300 MiB  0     p1: 127.0.0.1:28422

Task Events:
Started At     = 2018-04-12T22:50:32Z
Finished At    = 2018-04-12T22:50:54Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:50:15-05:00

Recent Events:
Time                       Type              Description
2018-04-12T17:50:54-05:00  Not Restarting    Exceeded allowed attempts 3 in interval 30m0s and mode is "fail"
2018-04-12T17:50:54-05:00  Killed            Task successfully killed
2018-04-12T17:50:54-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:54-05:00  Restart Signaled  healthcheck: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:50:32-05:00  Started           Task started by client
2018-04-12T17:50:15-05:00  Restarting        Task restarting in 16.887291122s
2018-04-12T17:50:15-05:00  Killed            Task successfully killed
2018-04-12T17:50:15-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:15-05:00  Restart Signaled  healthcheck: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:49:53-05:00  Started           Task started by client