This adds two goroutines to perform autopilot tasks on the leader - one
to monitor the health of servers and another to periodically clean up
dead servers with a limit on removal count. Also adds a new http endpoint,
`/v1/operator/autopilot/health`, for querying this information through an
operator RPC endpoint.
Given a list of HealthChecks, this determines the "best" status for the
collective group. This is useful for nodes and services, which may have
multiple checks associated with them.
Log a warning instead of a success message when attempting to deregister a nonexistent service. In Consul 0.8 this can be changed to giving an error outright, but for now we can keep the idempotent delete behavior.
* * adding cli config and config file support for specifying the serf wan and lan bind addresses
* updating documentation for serf wan and lan options
Fixes#2007
* Cleans up some small things from #2380.
* Uses the bind default for the agent test for Serf WAN and LAN.
* Updates Raft library to get new snapshot/restore API.
* Basic backup and restore working, but need some cleanup.
* Breaks out a snapshot module and adds a SHA256 integrity check.
* Adds snapshot ACL and fills in some missing comments.
* Require a consistent read for snapshots.
* Make sure snapshot works if ACLs aren't enabled.
* Adds a bit of package documentation.
* Returns an empty response from restore to avoid EOF errors.
* Adds API client support for snapshots.
* Makes internal file names match on-disk file snapshots.
* Adds DC and token coverage for snapshot API test.
* Adds missing documentation.
* Adds a unit test for the snapshot client endpoint.
* Moves the connection pool out of the client for easier testing.
* Fixes an incidental issue in the prepared query unit test.
I realized I had two servers in bootstrap mode so this wasn't a good setup.
* Adds a half close to the TCP stream and fixes panic on error.
* Adds client and endpoint tests for snapshots.
* Moves the pool back into the snapshot RPC client.
* Adds a TLS test and fixes half-closes for TLS connections.
* Tweaks some comments.
* Adds a low-level snapshot test.
This is independent of Consul so we can pull this out into a library
later if we want to.
* Cleans up snapshot and archive and completes archive tests.
* Sends a clear error for snapshot operations in dev mode.
Snapshots require the Raft snapshots to be readable, which isn't supported
in dev mode. Send a clear error instead of a deep-down Raft one.
* Adds docs for the snapshot endpoint.
* Adds a stale mode and index feedback for snapshot saves.
This gives folks a way to extract data even if the cluster has no
leader.
* Changes the internal format of a snapshot from zip to tgz.
* Pulls in Raft fix to cancel inflight before a restore.
* Pulls in new Raft restore interface.
* Adds metadata to snapshot saves and a verify function.
* Adds basic save and restore snapshot CLI commands.
* Gets rid of tarball extensions and adds restore message.
* Fixes an incidental bad link in the KV docs.
* Adds documentation for the snapshot CLI commands.
* Scuttle any request body when a snapshot is saved.
* Fixes archive unit test error message check.
* Allows for nil output writers in snapshot RPC handlers.
* Renames hash list Decode to DecodeAndVerify.
* Closes the client connection for snapshot ops.
* Lowers timeout for restore ops.
* Updates Raft vendor to get new Restore signature and integrates with Consul.
* Bounces the leader's internal state when we do a restore.
This experiment was brought about because of variable naming
confusion where name and checkIDs were interchanged. Gave CheckID
an Qualified Type Name and chased downstream changes.
`AdvertiseAddrs` has been introduced as a configuration option, which
duplicates a few other options, namely `AdvertiseAddrWan`. We need to
use this value elsewhere, so rather than doing a precedence check every
time we need to access it, rectify the value of `AdvertiseAddrWan` to
match
Consolidate code duplication and tests into a single lib package. Most of these functions were from various **/util.go functions that couldn't be imported due to cyclic imports. The consul/lib package is intended to be a terminal node in an import DAG and a place to stash various consul-only helper functions. Pulled in hashicorp/go-uuid instead of consolidating UUID access.
* A batch of updates is done all in a single transaction.
* We no longer need to get an update to kick things, there's a periodic flush.
* If incoming updates overwhelm the configured flush rate they will be dumped with an error.
Adds the ability to simply check whether a TCP socket accepts
connections to determine if it is healthy. This is a light-weight -
though less comprehensive than scripting - method of checking network
service health.
The check parameter `tcp` should be set to the `address:port`
combination for the service to be tested. Supports both IPv6 and IPv4,
in the case of a hostname that resolves to both, connections will be
attempted via both protocol versions, with the first successful
connection returning a successful check result.
Example check:
```json
{
"check": {
"id": "ssh",
"name": "SSH (TCP)",
"tcp": "example.com:22",
"interval": "10s"
}
}
```
Fixes#550.
This will make it possible to configure the advertised adresses for
SerfLan, SerfWan and RPC. It will enable multiple consul clients on a
single host which is very useful in a container environment.
This option might override advertise_addr and advertise_addr_wan
depending on the configuration.
It will be configureable with advertise_addrs. Example:
{
"advertise_addrs": {
"serf_lan": "10.0.120.91:4424",
"serf_wan": "201.20.10.61:4423",
"rpc": "10.20.10.61:4424"
}
}
This status must be one of the valid check statuses: 'passing', 'warning', 'critical', 'unknown'.
If the status field is not present or the empty string, the default of 'critical' is used.
For long (>10s) interval checks the http timeout is 10s, otherwise thetimeout is the interval. This means that a check *should* return
before the next check begins.
These checks make an `HTTP GET` request every Interval to the specified URL.
The status of the service depends on the HTTP Response Code.
`200` is passing, `503` is warning and anything else is failing.
This change consolidates loading services and checks from both config
and persisted state into methods on the agent. As part of this, we
introduce optional persistence when calling RemoveCheck/RemoveService.
Fixes a bug where config reloads would kill persisted services/checks.
Also fixes an edge case:
1. A service or check is registered via the HTTP API
2. A new service or check definition with the same ID is added to config
3. Config is reloaded
The desired behavior (which this implements) is:
1. All services and checks deregistered in memory
2. All services and checks in config are registered first
3. All persisted checks are restored using the same logic as the agent
start sequence, which prioritizes config over persisted, and removes
any persistence files if new config counterparts are present.