This fixes#2896 by switching to the `notifyCh` instead of the `leaderCh`,
so we get all up/down events from Raft regarding leadership. We also wait
for the old leader loop to shut down before we ever consider starting a
new one, which keeps that single-threaded and fixes the panic in that issue.
This adds two goroutines to perform autopilot tasks on the leader - one
to monitor the health of servers and another to periodically clean up
dead servers with a limit on removal count. Also adds a new http endpoint,
`/v1/operator/autopilot/health`, for querying this information through an
operator RPC endpoint.
It turns out the indexer can only use strings as arguments when
creating a query. Cast `types.CheckID` to a `string` before calling
into `memdb`.
Ideally the indexer would be smart enough to do this at compile-time,
but I need to look into how to do this without reflection and the
runtime package. For the time being statically cast `types.CheckID`
to a `string` at the call sites.
This may be short-lived, but it also seems like this is going to lead us down a path where ServerDetails is going to evolve into a more powerful package that will encapsulate more behavior behind a coherent API.
The design of the session TTLs is based on the Google Chubby approach
(http://research.google.com/archive/chubby-osdi06.pdf). The Session
struct has an additional TTL field now. This attaches an implicit
heartbeat based failure detector. Tracking of heartbeats is done by
the current leader and not persisted via the Raft log. The implication
of this is during a leader failover, we do not retain the last
heartbeat times.
Similar to Chubby, the TTL represents a lower-bound. Consul promises
not to terminate a session before the TTL has expired, but is allowed
to extend the expiration past it. This enables us to reset the TTL on
a leader failover. The TTL is also extended when the client does a
heartbeat. Like Chubby, this means a TTL is extended on creation,
heartbeat or failover.
Additionally, because we must account for time requests are in transit
and the relative rates of clocks on the clients and servers, Consul
will take the conservative approach of internally multiplying the TTL
by 2x. This helps to compensate for network latency and clock skew
without violating the contract.
Reference: https://docs.google.com/document/d/1Y5-pahLkUaA7Kz4SBU_mehKiyt9yaaUGcBTMZR7lToY/edit?usp=sharing