docs: improve health check related docs
Includes:
- Improved scannability and organization of checks overview
- Checks overview includes more guidance on
  - How to register a health check
  - The options available for a health check definition
- Contextual cross-references to maintenance mode
parent f6a163f239
commit 99df4df057
|
@ -6,7 +6,10 @@ description: The /agent/check endpoints interact with checks on the local agent
|
|||
|
||||
# Check - Agent HTTP API
|
||||
|
||||
The `/agent/check` endpoints interact with checks on the local agent in Consul.
|
||||
Consul's health check capabilities are described in the
|
||||
[health checks overview](/docs/discovery/checks).
|
||||
The `/agent/check` endpoints interact with health checks
|
||||
managed by the local agent in Consul.
|
||||
These should not be confused with checks in the catalog.
|
||||
|
||||
## List Checks
|
||||
|
@ -418,6 +421,10 @@ $ curl \
|
|||
This endpoint is used with a TTL type check to set the status of the check to
|
||||
`critical` and to reset the TTL clock.
|
||||
|
||||
If you want to manually mark a service as unhealthy,
|
||||
use [maintenance mode](/api-docs/agent#enable-maintenance-mode)
|
||||
instead of defining a TTL health check and using this endpoint.
|
||||
|
||||
| Method | Path | Produces |
|
||||
| ------ | ----------------------------- | ------------------ |
|
||||
| `PUT` | `/agent/check/fail/:check_id` | `application/json` |
|
||||
|
@ -456,6 +463,10 @@ $ curl \
|
|||
This endpoint is used with a TTL type check to set the status of the check and
|
||||
to reset the TTL clock.
|
||||
|
||||
If you want to manually mark a service as unhealthy,
|
||||
use [maintenance mode](/api-docs/agent#enable-maintenance-mode)
|
||||
instead of defining a TTL health check and using this endpoint.
|
||||
|
||||
| Method | Path | Produces |
|
||||
| ------ | ------------------------------- | ------------------ |
|
||||
| `PUT` | `/agent/check/update/:check_id` | `application/json` |
|
||||
|
|
|
@ -14,6 +14,9 @@ optional health checking mechanisms. Additionally, some of the query results
|
|||
from the health endpoints are filtered while the catalog endpoints provide the
|
||||
raw entries.
|
||||
|
||||
To modify health check registration or information,
|
||||
use the [`/agent/check`](/api-docs/agent/check) endpoints.
|
||||
|
||||
## List Checks for Node
|
||||
|
||||
This endpoint returns the checks specific to the node provided on the path.
|
||||
|
|
|
@ -13,144 +13,72 @@ description: >-
|
|||
One of the primary roles of the agent is management of system-level and application-level health
|
||||
checks. A health check is considered to be application-level if it is associated with a
|
||||
service. If not associated with a service, the check monitors the health of the entire node.
|
||||
Review the [health checks tutorial](https://learn.hashicorp.com/tutorials/consul/service-registration-health-checks) to get a more complete example on how to leverage health check capabilities in Consul.
|
||||
|
||||
A check is defined in a configuration file or added at runtime over the HTTP interface. Checks
|
||||
created via the HTTP interface persist with that node.
|
||||
Review the [service health checks tutorial](https://learn.hashicorp.com/tutorials/consul/service-registration-health-checks)
|
||||
to get a more complete example on how to leverage health check capabilities in Consul.
|
||||
|
||||
There are several different kinds of checks:
|
||||
## Registering a health check
|
||||
|
||||
- Script + Interval - These checks depend on invoking an external application
|
||||
that performs the health check, exits with an appropriate exit code, and potentially
|
||||
generates some output. A script is paired with an invocation interval (e.g.
|
||||
every 30 seconds). This is similar to the Nagios plugin system. The output of
|
||||
a script check is limited to 4KB. Output larger than this will be truncated.
|
||||
By default, Script checks will be configured with a timeout equal to 30 seconds.
|
||||
It is possible to configure a custom Script check timeout value by specifying the
|
||||
`timeout` field in the check definition. When the timeout is reached on Windows,
|
||||
Consul will wait for any child processes spawned by the script to finish. For any
|
||||
other system, Consul will attempt to force-kill the script and any child processes
|
||||
it has spawned once the timeout has passed.
|
||||
In Consul 0.9.0 and later, script checks are not enabled by default. To use them you
|
||||
can use either of the following options:
|
||||
There are three ways to register a service with health checks:
|
||||
|
||||
- [`enable_local_script_checks`](/docs/agent/config/cli-flags#_enable_local_script_checks):
|
||||
enable script checks defined in local config files. Script checks defined via the HTTP
|
||||
API will not be allowed.
|
||||
- [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks): enable
|
||||
script checks regardless of how they are defined.
|
||||
1. Start or reload a Consul agent with a service definition file in the
|
||||
[agent's configuration directory](/docs/agent#configuring-consul-agents).
|
||||
1. Call the
|
||||
[`/agent/service/register`](/api-docs/agent/service#register-service)
|
||||
HTTP API endpoint to register the service.
|
||||
1. Use the
|
||||
[`consul services register`](/commands/services/register)
|
||||
CLI command to register the service.
|
||||
|
||||
~> **Security Warning:** Enabling script checks in some configurations may
|
||||
introduce a remote execution vulnerability which is known to be targeted by
|
||||
malware. We strongly recommend `enable_local_script_checks` instead. See [this
|
||||
blog post](https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations)
|
||||
for more details.
|
||||
When a service is registered using the HTTP API endpoint or CLI command,
|
||||
the checks persist in the Consul data folder across Consul agent restarts.
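
As a rough sketch of the HTTP API route, the following Python builds a registration body with one embedded HTTP check and wraps it in a `PUT` request. The agent address `http://127.0.0.1:8500`, the service name, and the health URL are assumptions for illustration, and the helper names are hypothetical.

```python
import json
import urllib.request

# Assumed local agent address; adjust for your environment.
AGENT_ADDR = "http://127.0.0.1:8500"


def build_registration(name, port, health_url, interval="10s", timeout="1s"):
    """Return a service registration body with one embedded HTTP check."""
    return {
        "Name": name,
        "Port": port,
        "Check": {
            "Name": f"{name} HTTP health",
            "HTTP": health_url,
            "Interval": interval,
            "Timeout": timeout,
        },
    }


def build_register_request(body, agent_addr=AGENT_ADDR):
    """Build (but do not send) the PUT request for the register endpoint."""
    return urllib.request.Request(
        url=f"{agent_addr}/v1/agent/service/register",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
```

To send the request against a running agent, pass the result of `build_register_request(...)` to `urllib.request.urlopen`. Note that the request body uses the HTTP API casing (`Interval`, `Timeout`) rather than the service definition file casing.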
|
||||
|
||||
- `HTTP + Interval` - These checks make an HTTP `GET` request to the specified URL,
|
||||
waiting the specified `interval` amount of time between requests (eg. 30 seconds).
|
||||
The status of the service depends on the HTTP response code: any `2xx` code is
|
||||
considered passing, a `429 Too Many Requests` is a warning, and anything else is
|
||||
a failure. This type of check
|
||||
should be preferred over a script that uses `curl` or another external process
|
||||
to check a simple HTTP operation. By default, HTTP checks are `GET` requests
|
||||
unless the `method` field specifies a different method. Additional header
|
||||
fields can be set through the `header` field which is a map of lists of
|
||||
strings, e.g. `{"x-foo": ["bar", "baz"]}`. By default, HTTP checks will be
|
||||
configured with a request timeout equal to 10 seconds.
|
||||
## Types of checks
|
||||
|
||||
It is possible to configure a custom HTTP check timeout value by
|
||||
specifying the `timeout` field in the check definition. The output of the
|
||||
check is limited to roughly 4KB. Responses larger than this will be truncated.
|
||||
HTTP checks also support TLS. By default, a valid TLS certificate is expected.
|
||||
Certificate verification can be turned off by setting the `tls_skip_verify`
|
||||
field to `true` in the check definition. When using TLS, the SNI will be set
|
||||
automatically from the URL if it uses a hostname (as opposed to an IP address);
|
||||
the value can be overridden by setting `tls_server_name`.
|
||||
This section describes the available types of health checks you can use to
|
||||
automatically monitor the health of a service instance or node.
|
||||
|
||||
Consul follows HTTP redirects by default. Set the `disable_redirects` field to
|
||||
`true` to disable redirects.
|
||||
-> **To manually mark a service unhealthy:** Use the maintenance mode
|
||||
[CLI command](/commands/maint) or
|
||||
[HTTP API endpoint](/api-docs/agent#enable-maintenance-mode)
|
||||
to temporarily remove one or all service instances on a node
|
||||
from service discovery DNS and HTTP API query results.
|
||||
|
||||
- `TCP + Interval` - These checks make a TCP connection attempt to the specified
|
||||
IP/hostname and port, waiting `interval` amount of time between attempts
|
||||
(e.g. 30 seconds). If no hostname
|
||||
is specified, it defaults to "localhost". The status of the service depends on
|
||||
whether the connection attempt is successful (ie - the port is currently
|
||||
accepting connections). If the connection is accepted, the status is
|
||||
`success`, otherwise the status is `critical`. In the case of a hostname that
|
||||
resolves to both IPv4 and IPv6 addresses, an attempt will be made to both
|
||||
addresses, and the first successful connection attempt will result in a
|
||||
successful check. This type of check should be preferred over a script that
|
||||
uses `netcat` or another external process to check a simple socket operation.
|
||||
By default, TCP checks will be configured with a request timeout of 10 seconds.
|
||||
It is possible to configure a custom TCP check timeout value by specifying the
|
||||
`timeout` field in the check definition.
|
||||
### Script check ((#script-interval))
|
||||
|
||||
- `UDP + Interval` - These checks direct the client to periodically send UDP datagrams
|
||||
to the specified IP/hostname and port. The duration specified in the `interval` field sets the amount of time
|
||||
between attempts, such as `30s` to indicate 30 seconds. The check is logged as healthy if any response from the UDP server is received. Any other result sets the status to `critical`.
|
||||
The default timeout for UDP checks is `10s`, but you can configure a custom UDP check timeout value by specifying the
|
||||
`timeout` field in the check definition. If any timeout on read exists, the check is still considered healthy.
|
||||
Script checks periodically invoke an external application that performs the health check,
|
||||
exits with an appropriate exit code, and potentially generates some output.
|
||||
The specified `interval` determines the time between check invocations.
|
||||
The output of a script check is limited to 4KB.
|
||||
Larger outputs are truncated.
|
||||
|
||||
- `Time to Live (TTL)` ((#ttl)) - These checks retain their last known state
|
||||
for a given TTL. The state of the check must be updated periodically over the HTTP
|
||||
interface. If an external system fails to update the status within a given TTL,
|
||||
the check is set to the failed state. This mechanism, conceptually similar to a
|
||||
dead man's switch, relies on the application to directly report its health. For
|
||||
example, a healthy app can periodically `PUT` a status update to the HTTP endpoint;
|
||||
if the app fails, the TTL will expire and the health check enters a critical state.
|
||||
The endpoints used to update health information for a given check are: [pass](/api-docs/agent/check#ttl-check-pass),
|
||||
[warn](/api-docs/agent/check#ttl-check-warn), [fail](/api-docs/agent/check#ttl-check-fail),
|
||||
and [update](/api-docs/agent/check#ttl-check-update). TTL checks also persist their
|
||||
last known status to disk. This allows the Consul agent to restore the last known
|
||||
status of the check across restarts. Persisted check status is valid through the
|
||||
end of the TTL from the time of the last check.
|
||||
By default, script checks are configured with a timeout equal to 30 seconds.
|
||||
To configure a custom script check timeout value,
|
||||
specify the `timeout` field in the check definition.
|
||||
After reaching the timeout on a Windows system,
|
||||
Consul waits for any child processes spawned by the script to finish.
|
||||
After reaching the timeout on other systems,
|
||||
Consul attempts to force-kill the script and any child processes it spawned.
|
||||
|
||||
- `Docker + Interval` - These checks depend on invoking an external application which
|
||||
is packaged within a Docker Container. The application is triggered within the running
|
||||
container via the Docker Exec API. We expect that the Consul agent user has access
|
||||
to either the Docker HTTP API or the unix socket. Consul uses `$DOCKER_HOST` to
|
||||
determine the Docker API endpoint. The application is expected to run, perform a health
|
||||
check of the service running inside the container, and exit with an appropriate exit code.
|
||||
The check should be paired with an invocation interval. The shell on which the check
|
||||
has to be performed is configurable which makes it possible to run containers which
|
||||
have different shells on the same host. Check output for Docker is limited to
|
||||
4KB. Any output larger than this will be truncated. In Consul 0.9.0 and later, the agent
|
||||
must be configured with [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks)
|
||||
set to `true` in order to enable Docker health checks.
|
||||
Script checks are not enabled by default.
|
||||
To enable a Consul agent to perform script checks,
|
||||
use one of the following agent configuration options:
|
||||
|
||||
- `gRPC + Interval` - These checks are intended for applications that support the standard
|
||||
[gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
|
||||
The state of the check will be updated by probing the configured endpoint, waiting `interval`
|
||||
amount of time between probes (eg. 30 seconds). By default, gRPC checks will be configured
|
||||
with a default timeout of 10 seconds.
|
||||
It is possible to configure a custom timeout value by specifying the `timeout` field in
|
||||
the check definition. gRPC checks will default to not using TLS, but TLS can be enabled by
|
||||
setting `grpc_use_tls` in the check definition. If TLS is enabled, then by default, a valid
|
||||
TLS certificate is expected. Certificate verification can be turned off by setting the
|
||||
`tls_skip_verify` field to `true` in the check definition.
|
||||
To check on a specific service instead of the whole gRPC server, add the service identifier after the `gRPC` check's endpoint in the following format `/:service_identifier`.
|
||||
- [`enable_local_script_checks`](/docs/agent/config/cli-flags#_enable_local_script_checks):
|
||||
Enable script checks defined in local config files.
|
||||
Script checks registered using the HTTP API are not allowed.
|
||||
- [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks):
|
||||
Enable script checks no matter how they are registered.
|
||||
|
||||
- `H2ping + Interval` - These checks test an endpoint that uses http2
|
||||
by connecting to the endpoint and sending a ping frame. TLS is assumed to be configured by default.
|
||||
To disable TLS and use h2c, set `h2ping_use_tls` to `false`. If the ping is successful
|
||||
within a specified timeout, then the check is updated as passing.
|
||||
The timeout defaults to 10 seconds, but is configurable using the `timeout` field. If TLS is enabled a valid
|
||||
certificate is required, unless `tls_skip_verify` is set to `true`.
|
||||
The check will be run on the interval specified by the `interval` field.
|
||||
~> **Security Warning:**
|
||||
Enabling non-local script checks in some configurations may introduce
|
||||
a remote execution vulnerability known to be targeted by malware.
|
||||
We strongly recommend `enable_local_script_checks` instead.
|
||||
For more information, refer to
|
||||
[this blog post](https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations).
|
||||
|
||||
- `Alias` - These checks alias the health state of another registered
|
||||
node or service. The state of the check will be updated asynchronously, but is
|
||||
nearly instant. For aliased services on the same agent, the local state is monitored
|
||||
and no additional network resources are consumed. For other services and nodes,
|
||||
the check maintains a blocking query over the agent's connection with a current
|
||||
server and allows stale requests. If there are any errors in watching the aliased
|
||||
node or service, the check state will be critical. For the blocking query, the
|
||||
check will use the ACL token set on the service or check definition or otherwise
|
||||
will fall back to the default ACL token set with the agent (`acl_token`).
|
||||
|
||||
## Check Definition
|
||||
|
||||
A script check:
|
||||
The following service definition file snippet is an example
|
||||
of a script check definition:
|
||||
|
||||
<CodeTabs heading="Script Check">
|
||||
|
||||
|
@ -162,7 +90,6 @@ check = {
|
|||
interval = "10s"
|
||||
timeout = "1s"
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
```json
|
||||
|
@ -179,7 +106,47 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
An HTTP check:
|
||||
#### Check script conventions
|
||||
|
||||
A check script's exit code is used to determine the health check status:
|
||||
|
||||
- Exit code 0 - Check is passing
|
||||
- Exit code 1 - Check is warning
|
||||
- Any other code - Check is failing
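
The exit code convention above can be sketched as a small check script. This is a minimal illustration, not a recommended check: the disk usage metric and the 80% and 95% thresholds are assumptions chosen for the example.

```python
import shutil

# Exit codes as interpreted by the agent for script checks.
PASSING, WARNING, CRITICAL = 0, 1, 2


def disk_exit_code(used_fraction, warn_at=0.80, fail_at=0.95):
    """Map a disk usage fraction (0.0 to 1.0) to a check exit code."""
    if used_fraction >= fail_at:
        return CRITICAL
    if used_fraction >= warn_at:
        return WARNING
    return PASSING


def main(path="/"):
    """Measure disk usage, print a message, and return the exit code."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    # Anything printed here is captured as the check output.
    print(f"{path} is {used_fraction:.0%} full")
    return disk_exit_code(used_fraction)
```

To use this as a script check, end the file with `sys.exit(main())` (after `import sys`) so that the process exit code carries the result back to the agent.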
|
||||
|
||||
Any output of the script is captured and made available in the
|
||||
`Output` field of checks included in HTTP API responses,
|
||||
as in this example from the [local service health endpoint](/api-docs/agent/service#by-name-json).
|
||||
|
||||
### HTTP check ((#http-interval))
|
||||
|
||||
HTTP checks periodically make an HTTP `GET` request to the specified URL,
|
||||
waiting the specified `interval` amount of time between requests.
|
||||
The status of the service depends on the HTTP response code: any `2xx` code is
|
||||
considered passing, a `429 Too Many Requests` is a warning, and anything else is
|
||||
a failure. This type of check
|
||||
should be preferred over a script that uses `curl` or another external process
|
||||
to check a simple HTTP operation. By default, HTTP checks are `GET` requests
|
||||
unless the `method` field specifies a different method. Additional request
|
||||
headers can be set through the `header` field which is a map of lists of
|
||||
strings, such as `{"x-foo": ["bar", "baz"]}`.
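
The response code rule described above can be written as a small function. This is an illustrative restatement of the rule in Python, not Consul's implementation, and the function name is hypothetical.

```python
def http_check_status(status_code):
    """Map an HTTP response code to a check status per the rule above."""
    if 200 <= status_code <= 299:
        return "passing"
    if status_code == 429:
        # Too Many Requests is treated as a warning, not a failure.
        return "warning"
    return "critical"
```

For example, `http_check_status(204)` yields `"passing"`, while `http_check_status(503)` yields `"critical"`.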
|
||||
|
||||
By default, HTTP checks are configured with a request timeout equal to 10 seconds.
|
||||
To configure a custom HTTP check timeout value,
|
||||
specify the `timeout` field in the check definition.
|
||||
The output of an HTTP check is limited to approximately 4KB.
|
||||
Larger outputs are truncated.
|
||||
HTTP checks also support TLS. By default, a valid TLS certificate is expected.
|
||||
Certificate verification can be turned off by setting the `tls_skip_verify`
|
||||
field to `true` in the check definition. When using TLS, the SNI is implicitly
|
||||
determined from the URL if it uses a hostname instead of an IP address.
|
||||
You can explicitly set the SNI value by setting `tls_server_name`.
|
||||
|
||||
Consul follows HTTP redirects by default.
|
||||
To disable redirects, set the `disable_redirects` field to `true`.
|
||||
|
||||
The following service definition file snippet is an example
|
||||
of an HTTP check definition:
|
||||
|
||||
<CodeTabs heading="HTTP Check">
|
||||
|
||||
|
@ -220,7 +187,23 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
A TCP check:
|
||||
### TCP check ((#tcp-interval))
|
||||
|
||||
TCP checks periodically make a TCP connection attempt to the specified IP/hostname and port, waiting `interval` amount of time between attempts.
|
||||
If no hostname is specified, it defaults to "localhost".
|
||||
The health check status is `success` if the target host accepts the connection attempt,
|
||||
otherwise the status is `critical`. In the case of a hostname that
|
||||
resolves to both IPv4 and IPv6 addresses, an attempt is made to both
|
||||
addresses, and the first successful connection attempt results in a
|
||||
successful check. This type of check should be preferred over a script that
|
||||
uses `netcat` or another external process to check a simple socket operation.
|
||||
|
||||
By default, TCP checks are configured with a request timeout equal to 10 seconds.
|
||||
To configure a custom TCP check timeout value,
|
||||
specify the `timeout` field in the check definition.
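
To make the behavior concrete, the following Python sketch performs the same kind of probe a TCP check describes: one connection attempt with a timeout. It is an approximation for illustration only, and the function name is hypothetical. `socket.create_connection` tries each address the hostname resolves to and returns on the first successful connection, which is similar to the IPv4 and IPv6 behavior described above.

```python
import socket


def tcp_check(port, host="localhost", timeout=10.0):
    """Attempt one TCP connection and report the resulting check status."""
    try:
        # The connection is closed again as soon as it is established.
        with socket.create_connection((host, port), timeout=timeout):
            return "success"
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return "critical"
```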
|
||||
|
||||
The following service definition file snippet is an example
|
||||
of a TCP check definition:
|
||||
|
||||
<CodeTabs heading="TCP Check">
|
||||
|
||||
|
@ -232,7 +215,6 @@ check = {
|
|||
interval = "10s"
|
||||
timeout = "1s"
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
```json
|
||||
|
@ -249,7 +231,21 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
A UDP check:
|
||||
### UDP check ((#udp-interval))
|
||||
|
||||
UDP checks periodically direct the Consul agent to send UDP datagrams
|
||||
to the specified IP/hostname and port,
|
||||
waiting `interval` amount of time between attempts.
|
||||
The check status is set to `success` if any response is received from the targeted UDP server.
|
||||
Any other result sets the status to `critical`.
|
||||
|
||||
By default, UDP checks are configured with a request timeout equal to 10 seconds.
|
||||
To configure a custom UDP check timeout value,
|
||||
specify the `timeout` field in the check definition.
|
||||
If a read timeout occurs, the check is still considered healthy.
|
||||
|
||||
The following service definition file snippet is an example
|
||||
of a UDP check definition:
|
||||
|
||||
<CodeTabs heading="UDP Check">
|
||||
|
||||
|
@ -261,7 +257,6 @@ check = {
|
|||
interval = "10s"
|
||||
timeout = "1s"
|
||||
}
|
||||
|
||||
```
|
||||
|
||||
```json
|
||||
|
@ -278,7 +273,32 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
A TTL check:
|
||||
### Time to live (TTL) check ((#ttl))
|
||||
|
||||
TTL checks retain their last known state for the specified `ttl` duration.
|
||||
If the `ttl` duration elapses before a new check update
|
||||
is provided over the HTTP interface,
|
||||
the check is set to `critical` state.
|
||||
|
||||
This mechanism relies on the application to directly report its health.
|
||||
For example, a healthy app can periodically `PUT` a status update to the HTTP endpoint.
|
||||
Then, if the app is disrupted and unable to perform this update
|
||||
before the TTL expires, the health check enters the `critical` state.
|
||||
The endpoints used to update health information for a given check are: [pass](/api-docs/agent/check#ttl-check-pass),
|
||||
[warn](/api-docs/agent/check#ttl-check-warn), [fail](/api-docs/agent/check#ttl-check-fail),
|
||||
and [update](/api-docs/agent/check#ttl-check-update). TTL checks also persist their
|
||||
last known status to disk. This persistence allows the Consul agent to restore the last known
|
||||
status of the check across agent restarts. Persisted check status is valid through the
|
||||
end of the TTL from the time of the last check.
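
As a sketch of how an application might report its own health, the following Python builds the `PUT` request for the pass, warn, and fail endpoints. The agent address `http://127.0.0.1:8500` and the check ID are assumptions, and the helper name is hypothetical.

```python
import urllib.parse
import urllib.request

# Assumed local agent address; adjust for your environment.
AGENT_ADDR = "http://127.0.0.1:8500"


def build_ttl_request(check_id, state="pass", note="", agent_addr=AGENT_ADDR):
    """Build (but do not send) a TTL status update for the given check."""
    if state not in ("pass", "warn", "fail"):
        raise ValueError(f"unsupported state: {state!r}")
    quoted_id = urllib.parse.quote(check_id, safe="")
    url = f"{agent_addr}/v1/agent/check/{state}/{quoted_id}"
    if note:
        # The note is passed through to the check's output.
        url += "?" + urllib.parse.urlencode({"note": note})
    return urllib.request.Request(url, method="PUT")
```

A healthy application would send such a request with `urllib.request.urlopen` more often than the configured `ttl`, for example every half `ttl`, so that a single delayed update does not push the check into the `critical` state.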
|
||||
|
||||
To manually mark a service unhealthy,
|
||||
it is far more convenient to use the maintenance mode
|
||||
[CLI command](/commands/maint) or
|
||||
[HTTP API endpoint](/api-docs/agent#enable-maintenance-mode)
|
||||
rather than a TTL health check with arbitrarily high `ttl`.
|
||||
|
||||
The following service definition file snippet is an example
|
||||
of a TTL check definition:
|
||||
|
||||
<CodeTabs heading="TTL Check">
|
||||
|
||||
|
@ -304,7 +324,24 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
A Docker check:
|
||||
### Docker check ((#docker-interval))
|
||||
|
||||
These checks depend on periodically invoking an external application that
|
||||
is packaged within a Docker Container. The application is triggered within the running
|
||||
container through the Docker Exec API. We expect that the Consul agent user has access
|
||||
to either the Docker HTTP API or the unix socket. Consul uses `$DOCKER_HOST` to
|
||||
determine the Docker API endpoint. The application is expected to run, perform a health
|
||||
check of the service running inside the container, and exit with an appropriate exit code.
|
||||
The check should be paired with an invocation interval. The shell on which the check
|
||||
has to be performed is configurable, making it possible to run containers which
|
||||
have different shells on the same host.
|
||||
The output of a Docker check is limited to 4KB.
|
||||
Larger outputs are truncated.
|
||||
The agent must be configured with [`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks)
|
||||
set to `true` in order to enable Docker health checks.
|
||||
|
||||
The following service definition file snippet is an example
|
||||
of a Docker check definition:
|
||||
|
||||
<CodeTabs heading="Docker Check">
|
||||
|
||||
|
@ -334,7 +371,26 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
A gRPC check for the whole application:
|
||||
### gRPC check ((#grpc-interval))
|
||||
|
||||
gRPC checks are intended for applications that support the standard
|
||||
[gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
|
||||
The state of the check will be updated by periodically probing the configured endpoint,
|
||||
waiting `interval` amount of time between attempts.
|
||||
|
||||
By default, gRPC checks are configured with a timeout equal to 10 seconds.
|
||||
To configure a custom gRPC check timeout value,
|
||||
specify the `timeout` field in the check definition.
|
||||
|
||||
gRPC checks default to not using TLS.
|
||||
To enable TLS, set `grpc_use_tls` in the check definition.
|
||||
If TLS is enabled, then by default, a valid TLS certificate is expected.
|
||||
Certificate verification can be turned off by setting the
|
||||
`tls_skip_verify` field to `true` in the check definition.
|
||||
To check on a specific service instead of the whole gRPC server, add the service identifier after the `gRPC` check's endpoint in the following format `/:service_identifier`.
|
||||
|
||||
The following service definition file snippet is an example
|
||||
of a gRPC check for a whole application:
|
||||
|
||||
<CodeTabs heading="gRPC Check">
|
||||
|
||||
|
@ -362,7 +418,8 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
A gRPC check for the specific `my_service` service:
|
||||
The following service definition file snippet is an example
|
||||
of a gRPC check for the specific `my_service` service:
|
||||
|
||||
<CodeTabs heading="gRPC Specific Service Check">
|
||||
|
||||
|
@ -390,7 +447,23 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
An h2ping check:
|
||||
### H2ping check ((#h2ping-interval))
|
||||
|
||||
H2ping checks test an endpoint that uses HTTP/2 by connecting to the endpoint
|
||||
and sending a ping frame, waiting `interval` amount of time between attempts.
|
||||
If the ping is successful within a specified timeout,
|
||||
then the check status is set to `success`.
|
||||
|
||||
By default, h2ping checks are configured with a request timeout equal to 10 seconds.
|
||||
To configure a custom h2ping check timeout value,
|
||||
specify the `timeout` field in the check definition.
|
||||
|
||||
TLS is enabled by default.
|
||||
To disable TLS and use h2c, set `h2ping_use_tls` to `false`.
|
||||
If TLS is not disabled, a valid certificate is required unless `tls_skip_verify` is set to `true`.
|
||||
|
||||
The following service definition file snippet is an example
|
||||
of an h2ping check definition:
|
||||
|
||||
<CodeTabs heading="H2ping Check">
|
||||
|
||||
|
@ -418,7 +491,29 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
An alias check for a local service:
|
||||
### Alias check
|
||||
|
||||
These checks alias the health state of another registered
|
||||
node or service. The state of the check updates asynchronously, but is
|
||||
nearly instant. For aliased services on the same agent, the local state is monitored
|
||||
and no additional network resources are consumed. For other services and nodes,
|
||||
the check maintains a blocking query over the agent's connection with a current
|
||||
server and allows stale requests. If there are any errors in watching the aliased
|
||||
node or service, the check state is set to `critical`.
|
||||
For the blocking query, the check uses the ACL token set on the service or check definition.
|
||||
If no ACL token is set in the service or check definition,
|
||||
the blocking query uses the agent's default ACL token
|
||||
([`acl.tokens.default`](/docs/agent/config/config-files#acl_tokens_default)).
|
||||
|
||||
~> **Configuration info**: The alias check configuration expects the alias to be
|
||||
registered on the same agent as the one you are aliasing. If the service is
|
||||
not registered with the same agent, `"alias_node": "<node_id>"` must also be
|
||||
specified. When using `alias_node`, if no service is specified, the check will
|
||||
alias the health of the node. If a service is specified, the check will alias
|
||||
the specified service on this particular node.
|
||||
|
||||
The following service definition file snippet is an example
|
||||
of an alias check for a local service:
|
||||
|
||||
<CodeTabs heading="Alias Check">
|
||||
|
||||
|
@ -440,72 +535,137 @@ check = {
|
|||
|
||||
</CodeTabs>
|
||||
|
||||
~> Configuration info: The alias check configuration expects the alias to be
|
||||
registered on the same agent as the one you are aliasing. If the service is
|
||||
not registered with the same agent, `"alias_node": "<node_id>"` must also be
|
||||
specified. When using `alias_node`, if no service is specified, the check will
|
||||
alias the health of the node. If a service is specified, the check will alias
|
||||
the specified service on this particular node.
|
||||
## Check definition
|
||||
|
||||
Each type of definition must include a `name` and may optionally provide an
|
||||
`id` and `notes` field. The `id` must be unique per _agent_ otherwise only the
|
||||
last defined check with that `id` will be registered. If the `id` is not set
|
||||
and the check is embedded within a service definition a unique check id is
|
||||
generated. Otherwise, `id` will be set to `name`. If names might conflict,
|
||||
unique IDs should be provided.
|
||||
This section covers some of the most common options for check definitions.
|
||||
For a complete list of all check options, refer to the
|
||||
[Register Check HTTP API endpoint documentation](/api-docs/agent/check#json-request-body-schema).
|
||||
|
||||
The `notes` field is opaque to Consul but can be used to provide a human-readable
|
||||
description of the current state of the check. Similarly, an external process
|
||||
updating a TTL check via the HTTP interface can set the `notes` value.
|
||||
-> **Casing for check options:**
|
||||
The correct casing for an option depends on whether the check is defined in
|
||||
a service definition file or an HTTP API JSON request body.
|
||||
For example, the option `deregister_critical_service_after` in a service
|
||||
definition file is instead named `DeregisterCriticalServiceAfter` in an
|
||||
HTTP API JSON request body.
|
||||
|
||||
Checks may also contain a `token` field to provide an ACL token. This token is
|
||||
used for any interaction with the catalog for the check, including
|
||||
[anti-entropy syncs](/docs/architecture/anti-entropy) and deregistration.
|
||||
For Alias checks, this token is used if a remote blocking query is necessary
|
||||
to watch the state of the aliased node or service.
|
||||
#### General options
|
||||
|
||||
Script, TCP, UDP, HTTP, Docker, and gRPC checks must include an `interval` field. This
|
||||
field is parsed by Go's `time` package, and has the following
|
||||
[formatting specification](https://golang.org/pkg/time/#ParseDuration):
|
||||
- `name` `(string: <required>)` - Specifies the name of the check.
|
||||
|
||||
> A duration string is a possibly signed sequence of decimal numbers, each with
|
||||
> optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m".
|
||||
> Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
|
||||
- `id` `(string: "")` - Specifies a unique ID for this check on this node.
|
||||
|
||||
If unspecified, Consul determines the check ID as follows:
|
||||
- If the check definition is embedded within a service definition file,
|
||||
a unique check id is auto-generated.
|
||||
- Otherwise, the `id` is set to the value of `name`.
|
||||
If names might conflict, you must provide unique IDs to avoid
|
||||
overwriting existing checks with the same id on this node.
|
||||
|
||||
In Consul 0.7 and later, checks that are associated with a service may also contain
|
||||
an optional `deregister_critical_service_after` field, which is a timeout in the
|
||||
same Go time format as `interval` and `ttl`. If a check is in the critical state
|
||||
for more than this configured value, then its associated service (and all of its
|
||||
associated checks) will automatically be deregistered. The minimum timeout is 1
|
||||
minute, and the process that reaps critical services runs every 30 seconds, so it
|
||||
may take slightly longer than the configured timeout to trigger the deregistration.
|
||||
This should generally be configured with a timeout that's much, much longer than
|
||||
any expected recoverable outage for the given service.
|
||||
- `interval` `(string: <required for interval-based checks>)` - Specifies
|
||||
the frequency at which to run this check.
|
||||
Required for all check types except TTL and alias checks.
|
||||
|
||||
To configure a check, either provide it as a `-config-file` option to the
|
||||
agent or place it inside the `-config-dir` of the agent. The file must
|
||||
end in a ".json" or ".hcl" extension to be loaded by Consul. Check definitions
|
||||
can also be updated by sending a `SIGHUP` to the agent. Alternatively, the
|
||||
check can be registered dynamically using the [HTTP API](/api).
|
||||
The value is parsed by Go's `time` package, and has the following
|
||||
[formatting specification](https://golang.org/pkg/time/#ParseDuration):
|
||||
|
||||
## Check Scripts
|
||||
> A duration string is a possibly signed sequence of decimal numbers, each with
|
||||
> optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m".
|
||||
> Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
|
||||
|
||||
A check script is generally free to do anything to determine the status
|
||||
of the check. The only limitations placed are that the exit codes must obey
|
||||
this convention:
|
||||
- `service_id` `(string: <required for service health checks>)` - Specifies
|
||||
the ID of a service instance to associate this check with.
|
||||
That service instance must be on this node.
|
||||
If not specified, this check is treated as a node-level check.
|
||||
For more information, refer to the
|
||||
[service-bound checks](#service-bound-checks) section.
|
||||
|
||||
- Exit code 0 - Check is passing
|
||||
- Exit code 1 - Check is warning
|
||||
- Any other code - Check is failing
|
||||
- `status` `(string: "")` - Specifies the initial status of the health check as
|
||||
"critical" (default), "warning", or "passing". For more details, refer to
|
||||
the [initial health check status](#initial-health-check-status) section.
|
||||
|
||||
-> **Health defaults to critical:** If the health status is not initially specified,
|
||||
it defaults to "critical" to protect against including a service
|
||||
in discovery results before it is ready.
|
||||
|
||||
This is the only convention that Consul depends on. Any output of the script
|
||||
will be captured and stored in the `output` field.
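
As an illustration of the exit code convention, here is a minimal sketch of a check script written in Python. The disk usage thresholds, the function names, and the choice of `2` as the failing code are assumptions made for this example. The only contract taken from the text is 0 for passing, 1 for warning, and any other code for failing.

```python
#!/usr/bin/env python3
import shutil
import sys

PASSING, WARNING, FAILING = 0, 1, 2  # any code other than 0 or 1 is failing


def exit_code_for(used_fraction, warn_at=0.80, fail_at=0.95):
    """Map a disk usage fraction (0.0 to 1.0) to a check exit code."""
    if used_fraction >= fail_at:
        return FAILING
    if used_fraction >= warn_at:
        return WARNING
    return PASSING


def main(path):
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    # Anything printed here is captured in the check's `output` field.
    print(f"{path}: {used:.1%} used")
    return exit_code_for(used)


# The script only acts as a check when a path argument is supplied,
# for example: check_disk.py /
if __name__ == "__main__" and sys.argv[1:]:
    sys.exit(main(sys.argv[1]))
```

Keeping the threshold logic in `exit_code_for` separates the decision from the measurement, so the mapping to exit codes can be verified without touching a real filesystem.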
|
||||
- `deregister_critical_service_after` `(string: "")` - If specified,
|
||||
the associated service and all its checks are deregistered
|
||||
after this check is in the critical state for more than the specified value.
|
||||
The value has the same formatting specification as the [`interval`](#interval) field.
|
||||
|
||||
In Consul 0.9.0 and later, the agent must be configured with
|
||||
[`enable_script_checks`](/docs/agent/config/cli-flags#_enable_script_checks) set to `true`
|
||||
in order to enable script checks.
|
||||
The minimum timeout is 1 minute,
|
||||
and the process that reaps critical services runs every 30 seconds,
|
||||
so it may take slightly longer than the configured timeout to trigger the deregistration.
|
||||
This field should generally be configured with a timeout that's significantly longer than
|
||||
any expected recoverable outage for the given service.
|
||||
|
||||
## Initial Health Check Status
|
||||
- `notes` `(string: "")` - Provides a human-readable description of the check.
|
||||
This field is opaque to Consul, so you can use it in whatever way is useful.
|
||||
For example, it could be used to describe the current state of the check.
|
||||
|
||||
- `token` `(string: "")` - Specifies an ACL token used for any interaction
|
||||
with the catalog for the check, including
|
||||
[anti-entropy syncs](/docs/architecture/anti-entropy) and deregistration.
|
||||
|
||||
For alias checks, this token is used if a remote blocking query is necessary to watch the state of the aliased node or service.
|
||||
|
||||
#### Success/failures before passing/warning/critical
|
||||
|
||||
To prevent flapping health checks and limit the load they cause on the cluster,
|
||||
a health check may be configured to become passing/warning/critical only after a
|
||||
specified number of consecutive checks return as passing/critical.
|
||||
The status does not transition states until the configured threshold is reached.
|
||||
|
||||
- `success_before_passing` - Number of consecutive successful results required
|
||||
before check status transitions to passing. Defaults to `0`. Added in Consul 1.7.0.
|
||||
|
||||
- `failures_before_warning` - Number of consecutive unsuccessful results required
|
||||
before check status transitions to warning. Defaults to the same value as that of
|
||||
`failures_before_critical` to maintain the expected behavior of not changing the
|
||||
status of service checks to `warning` before `critical` unless configured to do so.
|
||||
Values higher than `failures_before_critical` are invalid. Added in Consul 1.11.0.
|
||||
|
||||
- `failures_before_critical` - Number of consecutive unsuccessful results required
|
||||
before check status transitions to critical. Defaults to `0`. Added in Consul 1.7.0.
|
||||
|
||||
This feature is available for all check types except TTL and alias checks.
|
||||
By default, both passing and critical thresholds are set to 0 so the check
|
||||
status always reflects the last check result.
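
The threshold behavior described above can be modeled with two counters of consecutive results. The sketch below is a simplified model, not Consul's implementation. The class name `ThresholdTracker` and the exact counting rules, including what happens when a counter exactly equals its threshold, are assumptions drawn from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdTracker:
    """Simplified model of flapping prevention (assumed counting rules)."""

    success_before_passing: int = 0
    failures_before_critical: int = 0
    failures_before_warning: Optional[int] = None  # None: same as critical
    status: str = "critical"  # checks start out critical by default
    successes: int = 0
    failures: int = 0

    def __post_init__(self):
        if self.failures_before_warning is None:
            self.failures_before_warning = self.failures_before_critical
        if self.failures_before_warning > self.failures_before_critical:
            raise ValueError("failures_before_warning exceeds failures_before_critical")

    def observe(self, succeeded):
        """Record one check result and return the resulting status."""
        if succeeded:
            self.successes += 1
            self.failures = 0
            if self.successes >= self.success_before_passing:
                self.status = "passing"
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= self.failures_before_critical:
                self.status = "critical"
            elif self.failures >= self.failures_before_warning:
                self.status = "warning"
        return self.status


if __name__ == "__main__":
    tracker = ThresholdTracker(
        success_before_passing=3,
        failures_before_warning=1,
        failures_before_critical=3,
    )
    for result in (True, True, True, False, False, False, True):
        print("pass" if result else "fail", "->", tracker.observe(result))
```

In this model, with all thresholds left at `0`, every result immediately sets the status, which matches the default behavior described above.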
|
||||
|
||||
<CodeTabs heading="Flapping Prevention Example">
|
||||
|
||||
```hcl
|
||||
checks = [
|
||||
{
|
||||
name = "HTTP TCP on port 80"
|
||||
tcp = "localhost:80"
|
||||
interval = "10s"
|
||||
timeout = "1s"
|
||||
success_before_passing = 3
|
||||
failures_before_warning = 1
|
||||
failures_before_critical = 3
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
```json
|
||||
{
|
||||
"checks": [
|
||||
{
|
||||
"name": "HTTP TCP on port 80",
|
||||
"tcp": "localhost:80",
|
||||
"interval": "10s",
|
||||
"timeout": "1s",
|
||||
"success_before_passing": 3,
|
||||
"failures_before_warning": 1,
|
||||
"failures_before_critical": 3
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
</CodeTabs>
|
||||
|
||||
## Initial health check status
|
||||
|
||||
By default, when checks are registered against a Consul agent, the state is set
|
||||
immediately to "critical". This is useful to prevent services from being
|
||||
|
@ -576,13 +736,13 @@ In the above configuration, if the web-app health check begins failing, it will
|
|||
only affect the availability of the web-app service. All other services
|
||||
provided by the node will remain unchanged.
|
||||
|
||||
## Agent Certificates for TLS Checks
|
||||
## Agent certificates for TLS checks
|
||||
|
||||
The [enable_agent_tls_for_checks](/docs/agent/config/config-files#enable_agent_tls_for_checks)
|
||||
agent configuration option can be utilized to have HTTP or gRPC health checks
|
||||
use the agent's credentials when configured for TLS.
|
||||
|
||||
## Multiple Check Definitions
|
||||
## Multiple check definitions
|
||||
|
||||
Multiple check definitions can be defined using the `checks` (plural)
|
||||
key in your configuration file.
|
||||
|
@ -640,58 +800,3 @@ checks = [
|
|||
```
|
||||
|
||||
</CodeTabs>
|
||||
|
||||
## Success/Failures before passing/warning/critical
|
||||
|
||||
To prevent flapping health checks, and limit the load they cause on the cluster,
|
||||
a health check may be configured to become passing/warning/critical only after a
|
||||
specified number of consecutive checks return passing/critical.
|
||||
The status will not transition states until the configured threshold is reached.
|
||||
|
||||
- `success_before_passing` - Number of consecutive successful results required
|
||||
before check status transitions to passing. Defaults to `0`. Added in Consul 1.7.0.
|
||||
- `failures_before_warning` - Number of consecutive unsuccessful results required
|
||||
before check status transitions to warning. Defaults to the same value as that of
|
||||
`failures_before_critical` to maintain the expected behavior of not changing the
|
||||
status of service checks to `warning` before `critical` unless configured to do so.
|
||||
Values higher than `failures_before_critical` are invalid. Added in Consul 1.11.0.
|
||||
- `failures_before_critical` - Number of consecutive unsuccessful results required
|
||||
before check status transitions to critical. Defaults to `0`. Added in Consul 1.7.0.
|
||||
|
||||
This feature is available for HTTP, TCP, gRPC, Docker & Monitor checks.
|
||||
By default, both passing and critical thresholds will be set to 0 so the check
|
||||
status will always reflect the last check result.
|
||||
|
||||
<CodeTabs heading="Flapping Prevention Example">
|
||||
|
||||
```hcl
|
||||
checks = [
|
||||
{
|
||||
name = "HTTP TCP on port 80"
|
||||
tcp = "localhost:80"
|
||||
interval = "10s"
|
||||
timeout = "1s"
|
||||
success_before_passing = 3
|
||||
failures_before_warning = 1
|
||||
failures_before_critical = 3
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
```json
|
||||
{
|
||||
"checks": [
|
||||
{
|
||||
"name": "HTTP TCP on port 80",
|
||||
"tcp": "localhost:80",
|
||||
"interval": "10s",
|
||||
"timeout": "1s",
|
||||
"success_before_passing": 3,
|
||||
"failures_before_warning": 1,
|
||||
"failures_before_critical": 3
|
||||
}
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
</CodeTabs>
|
||||
|
|