Rely on Serf for liveliness. In the event of a failure, simply cycle the server to the end of the list. If the server is unhealthy, Serf will reap the dead server.
Additional simplifications:
*) Only rebalance servers based on timers, not when a new server is readded to the cluster.
*) Back out the failure count in server_details.ServerDetails
Instead of blocking the RPC call path and performing a potentially expensive calculation (including a call to `c.LANMembers()`), introduce a channel to request a rebalance. Some events don't force a reshuffle, instead the extend the duration of the current rebalance window because the environment thrashed enough to redistribute a client's load.
Relocated to its own package, server_manager. This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary. More work is needed to be done here
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.
This commit brings in a background task that proactively manages the server list and:
*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed
This is a WIP, more work in testing needs to be completed.
Relocated to its own package, server_manager. This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary. More work is needed to be done here
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.
This commit brings in a background task that proactively manages the server list and:
*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed
This is a WIP, more work in testing needs to be completed.
Relocated to its own package, server_manager. This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary. More work is needed to be done here
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.
This commit brings in a background task that proactively manages the server list and:
*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed
This is a WIP, more work in testing needs to be completed.
Relocated to its own package, server_manager. This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary. More work is needed to be done here
This may be short-lived, but it also seems like this is going to lead us down a path where ServerDetails is going to evolve into a more powerful package that will encapsulate more behavior behind a coherent API.
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.
This commit brings in a background task that proactively manages the server list and:
*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed
This is a WIP, more work in testing needs to be completed.
A server is not normally disabled, but in the event of an RPC error, we want to mark a server as down to allow for fast failover to a different server. This value must be an int in order to support atomic operations.
Additionally, this is the preliminary work required to bring up a server in a disabled state. RPC health checks in the future could mark the server as alive, thereby creating an organic "slow start" feature for Consul.
Expanding the domain of lastServer beyond RPC() changes the meaning of this variable. Rename accordingly to match the intent coming in a subsequent commit: a background thread will be in charge of rotating preferredServer.
It is theoretically possible that the number of queued serf events can back up. If this happens, emit a warning message if there are more than 200 events in queue.
Most notably, this can happen if `c.consulServerLock` is held for an "extended period of time". The probability of anyone ever seeing this log message is hopefully low to nonexistent, but if it happens, the warning message indicating a large number of serf events fired while a lock was held is likely to be helpful (vs serf mysteriously blocking when attempting to add an event to a channel).