5.0 KiB
layout | page_title | sidebar_current | description |
---|---|---|---|
docs | Check Definition | docs-agent-checks | One of the primary roles of the agent is the management of system and application level health checks. A health check is considered to be application level if it associated with a service. A check is defined in a configuration file, or added at runtime over the HTTP interface. |
Checks
One of the primary roles of the agent is the management of system and application level health checks. A health check is considered to be application level if it associated with a service. A check is defined in a configuration file, or added at runtime over the HTTP interface.
There are three different kinds of checks:
-
Script + Interval - These checks depend on invoking an external application that does the health check and exits with an appropriate exit code, potentially generating some output. A script is paired with an invocation interval (e.g. every 30 seconds). This is similar to the Nagios plugin system.
-
HTTP + Interval - These checks make an
HTTP GET
request every Interval (e.g. every 30 seconds) to the specified URL. The status of the service depends on the HTTP Response Code. any2xx
code is passing,429 Too Many Requests
is warning and anything else is failing. This type of check should be preferred over a script that for example usescurl
. -
Time to Live (TTL) - These checks retain their last known state for a given TTL. The state of the check must be updated periodically over the HTTP interface. If an external system fails to update the status within a given TTL, the check is set to the failed state. This mechanism is used to allow an application to directly report its health. For example, a web app can periodically curl the endpoint, and if the app fails, then the TTL will expire and the health check enters a critical state. This is conceptually similar to a dead man's switch.
Check Definition
A check definition that is a script looks like:
{
"check": {
"id": "mem-util",
"name": "Memory utilization",
"script": "/usr/local/bin/check_mem.py",
"interval": "10s"
}
}
An HTTP based check looks like:
{
"check": {
"id": "api",
"name": "HTTP API on port 5000",
"http": "http://localhost:5000/health",
"interval": "10s"
}
}
A TTL based check is very similar:
{
"check": {
"id": "web-app",
"name": "Web App Status",
"notes": "Web app does a curl internally every 10 seconds",
"ttl": "30s"
}
}
Each type of definitions must include a name
, and may optionally
provide an id
and notes
field. The id
is set to the name
if not
provided. It is required that all checks have a unique ID per node, so if names
might conflict then unique ID's should be provided.
The notes
field is opaque to Consul, but may be used for human
readable descriptions. The field is set to any output that a script
generates, and similarly the TTL update hooks can update the notes
as well.
To configure a check, either provide it as a -config-file
option to the
agent, or place it inside the -config-dir
of the agent. The file must
end in the ".json" extension to be loaded by Consul. Check definitions can
also be updated by sending a SIGHUP
to the agent. Alternatively, the
check can be registered dynamically using the HTTP API.
Check Scripts
A check script is generally free to do anything to determine the status of the check. The only limitations placed are that the exit codes must convey a specific meaning. Specifically:
- Exit code 0 - Check is passing
- Exit code 1 - Check is warning
- Any other code - Check is failing
This is the only convention that Consul depends on. Any output of the script
will be captured and stored in the notes
field so that it can be viewed
by human operators.
Service-bound checks
Health checks may also be optionally bound to a specific service. This ensures
that the status of the health check will only affect the health status of the
given service instead of the entire node. Service-bound health checks may be
provided by adding a service_id
field to a check configuration:
{
"check": {
"id": "web-app",
"name": "Web App Status",
"service_id": "web-app",
"ttl": "30s"
}
}
In the above configuration, if the web-app health check begins failing, it will only affect the availability of the web-app service and no other services provided by the node.
Multiple Check Definitions
Multiple check definitions can be provided at once using the checks
(plural)
key in your configuration file.
{
"checks": [
{
"id": "chk1",
"name": "mem",
"script": "/bin/check_mem",
"interval": "5s"
},
{
"id": "chk2",
"name": "/health",
"http": "http://localhost:5000/health",
"interval": "15s"
},
{
"id": "chk3",
"name": "cpu",
"script": "/bin/check_cpu",
"interval": "10s"
},
...
]
}