---
layout: "guides"
page_title: "Check Restart Stanza - Operating a Job"
sidebar_current: "guides-operating-a-job-failure-handling-strategies-check-restart"
description: |-
  Nomad can restart tasks if they have a failing health check based on
  configuration specified in the `check_restart` stanza. Restarts are done
  locally on the node running the task based on their `restart` policy.
---
# Check Restart Stanza
The `check_restart` stanza instructs Nomad when to restart tasks with unhealthy service checks. When a health check in Consul has been unhealthy for the `limit` specified in a `check_restart` stanza, the task is restarted locally according to its task group's `restart` policy.
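Because these restarts are governed by the task group's `restart` policy, it can help to see that stanza alongside `check_restart`. Below is a minimal sketch of such a policy; the `attempts`, `interval`, and `mode` values mirror the allocation output later in this guide, while the `delay` value is an assumption:

```hcl
group "demo" {
  # Governs both failure-triggered restarts and restarts signaled by
  # check_restart.
  restart {
    attempts = 3      # give up after 3 restarts...
    interval = "30m"  # ...within a 30 minute window
    delay    = "15s"  # assumed; wait between restarts (Nomad adds jitter)
    mode     = "fail" # mark the allocation failed once attempts are exhausted
  }

  # tasks ...
}
```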
The `limit` field specifies the number of times a failing health check must be seen before a local restart is attempted. Operators can also specify a `grace` duration to wait after a task restarts before its health is checked again.
We recommend configuring check restarts on services when it is likely that a restart would resolve the failure, for example a temporary memory issue in the service.
## Example
The following `check_restart` stanza waits for two consecutive health check failures with a 10 second grace period and considers both `critical` and `warning` statuses as failures:
```hcl
check_restart {
  limit           = 2
  grace           = "10s"
  ignore_warnings = false
}
```
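To show where this stanza lives in a job file, here is a minimal sketch that nests it inside a service health check. The job, group, task, service, and port names are taken from the allocation output below; the driver, image, and check details are assumptions for illustration:

```hcl
job "demo" {
  datacenters = ["dc1"]

  group "demo" {
    task "test" {
      # Driver and image are placeholders.
      driver = "docker"

      config {
        image = "example/demo-service:latest"
      }

      resources {
        cpu    = 100 # MHz
        memory = 300 # MiB

        network {
          port "p1" {} # dynamically allocated port
        }
      }

      service {
        name = "demo-service-test"
        port = "p1"

        # An assumed HTTP check; check_restart may also be placed on the
        # service stanza to apply to all of its checks.
        check {
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "2s"

          check_restart {
            limit           = 2
            grace           = "10s"
            ignore_warnings = false
          }
        }
      }
    }
  }
}
```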
The following CLI example output shows health check failures triggering restarts until the task's restart limit is reached:
```text
$ nomad alloc status e1b43128-2a0a-6aa3-c375-c7e8a7c48690
ID                  = e1b43128
Eval ID             = 249cbfe9
Name                = demo.demo[0]
Node ID             = 221e998e
Job ID              = demo
Job Version         = 0
Client Status       = failed
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created             = 2m59s ago
Modified            = 39s ago

Task "test" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  300 MiB  300 MiB  p1: 127.0.0.1:28422

Task Events:
Started At     = 2018-04-12T22:50:32Z
Finished At    = 2018-04-12T22:50:54Z
Total Restarts = 3
Last Restart   = 2018-04-12T17:50:15-05:00

Recent Events:
Time                       Type              Description
2018-04-12T17:50:54-05:00  Not Restarting    Exceeded allowed attempts 3 in interval 30m0s and mode is "fail"
2018-04-12T17:50:54-05:00  Killed            Task successfully killed
2018-04-12T17:50:54-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:54-05:00  Restart Signaled  healthcheck: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:50:32-05:00  Started           Task started by client
2018-04-12T17:50:15-05:00  Restarting        Task restarting in 16.887291122s
2018-04-12T17:50:15-05:00  Killed            Task successfully killed
2018-04-12T17:50:15-05:00  Killing           Sent interrupt. Waiting 5s before force killing
2018-04-12T17:50:15-05:00  Restart Signaled  healthcheck: check "service: \"demo-service-test\" check" unhealthy
2018-04-12T17:49:53-05:00  Started           Task started by client
```