This adds two goroutines to perform autopilot tasks on the leader - one
to monitor the health of servers and another to periodically clean up
dead servers with a limit on removal count. Also adds a new http endpoint,
`/v1/operator/autopilot/health`, for querying this information through an
operator RPC endpoint.
I'm torn on this. It's useful from a UX perspective for an operator to
be able to type in something that's short. At the same time, by
enforcing an `8` character length, we reduced the probability of a user
depending on the behavior and having it suddenly stop working in the
future when a duplicate prefix is injected into the environment.
lookup returned nil.
Add a TODO to note where a future point of logging should occur once a
logger is present and a few additional comments to explain the program
flow.
Assuming the following output from a consul agent:
```
==> Consul agent running!
Version: 'v0.7.3-43-gc5e140c-dev (c5e140c+CHANGES)'
Node ID: '40e4a748-2192-161a-0510-9bf59fe950b5'
Node name: 'myhost'
```
it is now possible to lookup nodes by their Node Name or Node ID, or a
prefix match of the Node ID, with the following caveats re: the prefix
match:
1) first eight digits of the Node ID are a required minimum (eight was
chosen as an arbitrary number)
2) the length of the Node ID must be an even number or no result will be
returned.
```
% dig @127.0.0.1 -p 8600 myhost.node.dc1.consul.
myhost.node.dc1.consul. 0 IN A 127.0.0.1
% dig @127.0.0.1 -p 8600 40e4a748-2192-161a-0510-9bf59fe950b5.node.dc1.consul.
40e4a748-2192-161a-0510-9bf59fe950b5.node.dc1.consul. 0 IN A 127.0.0.1
% dig @127.0.0.1 -p 8600 40e4a748.node.dc1.consul.
40e4a748.node.dc1.consul. 0 IN A 127.0.0.1
% dig @127.0.0.1 -p 8600 40e4a74821.node.dc1.consul.
40e4a74821.node.dc1.consul. 0 IN A 127.0.0.1
% dig @127.0.0.1 -p 8600 40e4a748-21.node.dc1.consul.
40e4a748-21.node.dc1.consul. 0 IN A 127.0.0.1
```
Previously the blocking functions all closed over the state store from
their first query, with would not have worked properly when a restore
occurred. This makes sure they get a frest state store pointer each time,
and that pointer is synchronized with the abandon watch.
We always did an update before which caused excessive watch churn, even
with our new fine-grained queries. This does a diff any only updates the
node and service records if something actually changed.
We can't actually return a fine-grained index from these tables unless
support is added for tombstones. Otherwise, the index could slip backwards
as things are deleted.
This fixes#2663 and fixes#1899. It's not super related to this PR,
but the startup time changes that this PR brings made this a lot worse
so I was able to track it down.
This would return a "permission denied" error, but this changes it to
return the same response as a node that doesn't exist (as was originally
intended and written in the code comments).
Given a list of HealthChecks, this determines the "best" status for the
collective group. This is useful for nodes and services, which may have
multiple checks associated with them.
* Updates Raft library to get new snapshot/restore API.
* Basic backup and restore working, but need some cleanup.
* Breaks out a snapshot module and adds a SHA256 integrity check.
* Adds snapshot ACL and fills in some missing comments.
* Require a consistent read for snapshots.
* Make sure snapshot works if ACLs aren't enabled.
* Adds a bit of package documentation.
* Returns an empty response from restore to avoid EOF errors.
* Adds API client support for snapshots.
* Makes internal file names match on-disk file snapshots.
* Adds DC and token coverage for snapshot API test.
* Adds missing documentation.
* Adds a unit test for the snapshot client endpoint.
* Moves the connection pool out of the client for easier testing.
* Fixes an incidental issue in the prepared query unit test.
I realized I had two servers in bootstrap mode so this wasn't a good setup.
* Adds a half close to the TCP stream and fixes panic on error.
* Adds client and endpoint tests for snapshots.
* Moves the pool back into the snapshot RPC client.
* Adds a TLS test and fixes half-closes for TLS connections.
* Tweaks some comments.
* Adds a low-level snapshot test.
This is independent of Consul so we can pull this out into a library
later if we want to.
* Cleans up snapshot and archive and completes archive tests.
* Sends a clear error for snapshot operations in dev mode.
Snapshots require the Raft snapshots to be readable, which isn't supported
in dev mode. Send a clear error instead of a deep-down Raft one.
* Adds docs for the snapshot endpoint.
* Adds a stale mode and index feedback for snapshot saves.
This gives folks a way to extract data even if the cluster has no
leader.
* Changes the internal format of a snapshot from zip to tgz.
* Pulls in Raft fix to cancel inflight before a restore.
* Pulls in new Raft restore interface.
* Adds metadata to snapshot saves and a verify function.
* Adds basic save and restore snapshot CLI commands.
* Gets rid of tarball extensions and adds restore message.
* Fixes an incidental bad link in the KV docs.
* Adds documentation for the snapshot CLI commands.
* Scuttle any request body when a snapshot is saved.
* Fixes archive unit test error message check.
* Allows for nil output writers in snapshot RPC handlers.
* Renames hash list Decode to DecodeAndVerify.
* Closes the client connection for snapshot ops.
* Lowers timeout for restore ops.
* Updates Raft vendor to get new Restore signature and integrates with Consul.
* Bounces the leader's internal state when we do a restore.