This adds two goroutines to perform autopilot tasks on the leader - one
to monitor the health of servers and another to periodically clean up
dead servers with a limit on removal count. Also adds a new http endpoint,
`/v1/operator/autopilot/health`, for querying this information through an
operator RPC endpoint.
I'm torn on this. It's useful from a UX perspective for an operator to
be able to type in something that's short. At the same time, by
enforcing an `8` character length, we reduced the probability of a user
depending on the behavior and having it suddenly stop working in the
future when a duplicate prefix is injected into the environment.
lookup returned nil.
Add a TODO to note where a future point of logging should occur once a
logger is present and a few additional comments to explain the program
flow.
Assuming the following output from a consul agent:
```
==> Consul agent running!
Version: 'v0.7.3-43-gc5e140c-dev (c5e140c+CHANGES)'
Node ID: '40e4a748-2192-161a-0510-9bf59fe950b5'
Node name: 'myhost'
```
it is now possible to lookup nodes by their Node Name or Node ID, or a
prefix match of the Node ID, with the following caveats re: the prefix
match:
1) first eight digits of the Node ID are a required minimum (eight was
chosen as an arbitrary number)
2) the length of the Node ID must be an even number or no result will be
returned.
```
% dig @127.0.0.1 -p 8600 myhost.node.dc1.consul.
myhost.node.dc1.consul. 0 IN A 127.0.0.1
% dig @127.0.0.1 -p 8600 40e4a748-2192-161a-0510-9bf59fe950b5.node.dc1.consul.
40e4a748-2192-161a-0510-9bf59fe950b5.node.dc1.consul. 0 IN A 127.0.0.1
% dig @127.0.0.1 -p 8600 40e4a748.node.dc1.consul.
40e4a748.node.dc1.consul. 0 IN A 127.0.0.1
% dig @127.0.0.1 -p 8600 40e4a74821.node.dc1.consul.
40e4a74821.node.dc1.consul. 0 IN A 127.0.0.1
% dig @127.0.0.1 -p 8600 40e4a748-21.node.dc1.consul.
40e4a748-21.node.dc1.consul. 0 IN A 127.0.0.1
```
Previously the blocking functions all closed over the state store from
their first query, with would not have worked properly when a restore
occurred. This makes sure they get a frest state store pointer each time,
and that pointer is synchronized with the abandon watch.
We always did an update before which caused excessive watch churn, even
with our new fine-grained queries. This does a diff any only updates the
node and service records if something actually changed.