Fantastic observation by slackpad. This was left over from when there was a boolean for health in the server struct (vs current strategy where we use server position in the list and rely on serf to cleanup the stale members).
Pointed out by: slackpad
Change the signature so it returns a value so that this can be tested externally with mock data. See the sample table in TestServerManagerInternal_refreshServerRebalanceTimer() for the rate at which it will back off. This function is mostly used to not cripple large clusters in the event of a partition.
Rely on Serf for liveliness. In the event of a failure, simply cycle the server to the end of the list. If the server is unhealthy, Serf will reap the dead server.
Additional simplifications:
*) Only rebalance servers based on timers, not when a new server is readded to the cluster.
*) Back out the failure count in server_details.ServerDetails
Instead of blocking the RPC call path and performing a potentially expensive calculation (including a call to `c.LANMembers()`), introduce a channel to request a rebalance. Some events don't force a reshuffle, instead the extend the duration of the current rebalance window because the environment thrashed enough to redistribute a client's load.
Relocated to its own package, server_manager. This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary. More work is needed to be done here
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.
This commit brings in a background task that proactively manages the server list and:
*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed
This is a WIP, more work in testing needs to be completed.
Relocated to its own package, server_manager. This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary. More work is needed to be done here
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.
This commit brings in a background task that proactively manages the server list and:
*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed
This is a WIP, more work in testing needs to be completed.
Relocated to its own package, server_manager. This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary. More work is needed to be done here