Commit Graph

1055 Commits

Author SHA1 Message Date
James Phillips 6e177a9b44 Merge pull request #1895 from shoenig/fixtypo
doc: fix trivial typo s/NewFSMPath/NewFSM/
2016-04-12 21:53:24 -07:00
James Phillips 3f340716fd Adds a clone method to HealthCheck and uses that in local.go. 2016-04-11 00:05:39 -07:00
Chavez c9602c561c Add description to rpc test client pool member failure message 2016-04-01 19:17:38 -07:00
Seth Hoenig 7f67c123b7 doc: fix trivial typo s/NewFSMPath/NewFSM/ 2016-03-29 20:52:17 -05:00
Sean Chittenden 5ae7835988 Rename server_details package to agent 2016-03-29 17:39:19 -07:00
Sean Chittenden 7f06c71650 Add a quick package doc for the servers package 2016-03-29 16:22:53 -07:00
Sean Chittenden 897282f77d Rename serverConfig to serverList
serverList is a vastly more accurate name.  Chase accordingly.  No functional change other than types and APIs.
2016-03-29 16:17:16 -07:00
Sean Chittenden 4984b6111d Gratuitous rename 1/2
Reduce cognitive load and perform an overdue rename.  No functional change.

Rename the `server_manager` package to `servers`.  Rename the `ServerManager` package to `Manager`.  In `client`, rename `serverMgr` to `servers`.
2016-03-29 16:12:00 -07:00
Sean Chittenden 4734e0113f Remove two unused constants 2016-03-29 11:11:41 -07:00
Sean Chittenden cb9833b134 Remove useless comment residual from decomposing functions 2016-03-29 10:53:00 -07:00
Sean Chittenden 1f049a3c38 EDYSLEXICMOMENT 2016-03-29 10:50:10 -07:00
Sean Chittenden 177f64134e Refactor out reconcileServerList anon function
Add testing to reconcileServerList and test various server sizes.

Test that a percentage of nodes fail their Ping (50% in testing atm)
2016-03-29 02:45:38 -07:00
Sean Chittenden 6609ee5d51 Teach fauxConnPool to fail a pct of the time
50% failure rate seems legit as a starting point w/ 100 servers.
2016-03-28 14:53:29 -07:00
Sean Chittenden 7d26f7bfa7 Call NotifyFailedServers to rotate the server list 2016-03-28 14:12:41 -07:00
Sean Chittenden 6a987062b9 Add log line re: server manager backing off and sleeping
This is useful in situations where the RPC rotate duration is greater than 1µs.  WTB exponential backoff of logging so we don't spam forever.
2016-03-28 14:04:04 -07:00
Sean Chittenden 689b79aef3 Remove old debugging lines of questionable future value 2016-03-28 14:02:53 -07:00
Sean Chittenden 0b0a07a280 Shuffle in place
Don't create a copy and save the copy, not necessary any more.
2016-03-28 14:02:27 -07:00
Sean Chittenden e230b3a3b7 Nuke unnecessary comment
See above function comments for details
2016-03-28 13:57:36 -07:00
Sean Chittenden 34a29a2107 Move FIXME comment to the right call site 2016-03-28 13:49:55 -07:00
Sean Chittenden b38d3d71c8 Rename the ConnPoolPinger interface to Pinger 2016-03-28 13:46:01 -07:00
Sean Chittenden d6b4345375 Return error from PingConsulServer
To report why a Ping failed, change the signature of PingConsulServer to return an error.
2016-03-28 13:38:58 -07:00
Sean Chittenden 6c9fb06511 Change the definition of the ServerDetails struct key
Use only the serf Name for now.  Leaving the plumbing for now.
2016-03-28 12:53:19 -07:00
Sean Chittenden 2bcff6bac4 Correct the comment to match reality 2016-03-28 12:32:30 -07:00
Sean Chittenden fc1edea1ef Rename serverCfg to sc for consistency 2016-03-28 12:06:26 -07:00
Sean Chittenden 988b05700d Add a quick length check
Verify that AddServer behaved as expected
2016-03-28 11:38:12 -07:00
Sean Chittenden 7181e42ba8 Switch the order of ServerDetails.String()
It's more natural to have the network first.  I think I flipped the order accidentally.
2016-03-28 11:37:25 -07:00
Sean Chittenden dca8fd2643 Move rebalance log statement from INFO to DEBUG 2016-03-27 01:32:04 -07:00
Sean Chittenden 180edd8e7b Chase the API bump re: refreshServerRebalanceTimer
If it works in prod, why shouldn't it work in the tests?
2016-03-27 00:04:52 -07:00
Sean Chittenden 9b5dd7a785 Move initialization of the rebalanceTimer to New() 2016-03-27 00:03:48 -07:00
Sean Chittenden 86d1bad541 Add a test for ConnPool.PingConsulServer
Spin up 5x servers, join and ping each server
2016-03-26 23:52:06 -07:00
Sean Chittenden f903005080 Expose ServerManager.ResetRebalanceTimer
Move the rebalance timer from ServerManager.Start's stack to struct ServerManager.  This makes it possible to shuffle during tests without actually waiting >120s.
2016-03-26 23:41:01 -07:00
Sean Chittenden 2ba281bc5a Logging improvements
Comment out noisy loggers for the time being.

Improve the final logging statement to be useful and hint what the next active server for the client is going to be.
2016-03-26 22:41:08 -07:00
Sean Chittenden fab3981b1d Standardize the log message based on the package
This log statement used to belong in the consul package but has since moved to the server manager package.
2016-03-26 22:29:00 -07:00
Sean Chittenden c6d9c42d9f Reduce the error level from Fatal when unit testing 2016-03-26 22:07:09 -07:00
Sean Chittenden 4747cf3cab Start server rebalance task after init'ing Serf
Now that there is no longer an event loop driven directly by Serf, start the ServerManager task after Serf has been set up.  When testing and adjusting timers and timeouts to unreasonably low values, it's possible to tickle a race condition where Serf's NumNodes() would fail because Serf had not been initialized.
2016-03-26 22:04:41 -07:00
Sean Chittenden 2ddf82d9d8 Catch up to a few renames 2016-03-26 19:32:11 -07:00
Sean Chittenden 640ced7c11 Use empty string for addr in ServerDetails.String() 2016-03-26 19:30:04 -07:00
Sean Chittenden e0f29c17cd Guard against a nil ServerDetails.Addr
It's not clear how or why this would ever be nil, but some of the unit tests produce a nil addr.  Be defensive.
2016-03-26 19:29:31 -07:00
Sean Chittenden 2d9982eb27 Proactively ping server before rotation
Before shuffling the server list, proactively ping the next server in the list to establish the connection and verify the remote endpoint is healthy.
2016-03-26 19:28:13 -07:00
Sean Chittenden b3a8e2f115 Factor out the shuffle server 2016-03-26 19:19:04 -07:00
Sean Chittenden 766ddae165 Revise comments re: cycleServer
Improve the comments to discuss what happens presently.  Add a note to consider possibly calling to TestConsulServer proactively.
2016-03-26 18:53:13 -07:00
Sean Chittenden ac1d42e9d8 Comment why the interface is needed: cyclic import 2016-03-26 18:38:35 -07:00
Sean Chittenden a9b3dba05f Add a struct key type for server_details 2016-03-26 17:58:12 -07:00
Sean Chittenden 496f05b561 Add additional checks 2016-03-25 14:40:46 -07:00
Sean Chittenden c18158aac3 Delete the right tag
"role" != "consul"
2016-03-25 14:31:48 -07:00
Sean Chittenden b44554f882 Don't pass in sm, server manager is already in scope
Go closures are implicitly capturing lambdas.
2016-03-25 14:10:09 -07:00
Sean Chittenden 2713899a5b Trim residual complexity from server join notifications
Now that serf node join events are completely decoupled from rebalancing activities, remove the complexity of draining the channel and ensuring only one goroutine was rebalancing the server list.

Now that we're no longer initializing a notification channel, we can remove the config load/save from `Start()`
2016-03-25 14:06:35 -07:00
Sean Chittenden b3298ce4c3 Only log in FindServers
In FindServer this is a useful warning hinting why its call failed. RPC returns error and leaves it to the higher level caller to do whatever it wants. As an operator, I'd have the detail necessary to know why the RPC call(s) failed.
2016-03-25 13:58:50 -07:00
Sean Chittenden f024272ab2 Initialize the rebalance to clientRPCMinReuseDuration
In an earlier version there was a channel to notify when a new server was added, however this has long since been removed.  Just default to the sane value of 2min before the first rebalance calc takes place.

Pointed out by: slackpad
2016-03-25 13:46:18 -07:00
Sean Chittenden 89311a5859 Use range vs for
Returning a new array vs mutating an array in place so we can use range now.
2016-03-25 13:08:08 -07:00
Sean Chittenden 643997623e Comment updates 2016-03-25 13:06:59 -07:00
Sean Chittenden 072f34cf02 Only rotate server list with more than one server
Fantastic observation by slackpad.  This was left over from when there was a boolean for health in the server struct (vs current strategy where we use server position in the list and rely on serf to cleanup the stale members).

Pointed out by: slackpad
2016-03-25 12:54:36 -07:00
Sean Chittenden aadd274a13 Relocate saveServerConfig next to getServerConfig
Requested by: slackpad
2016-03-25 12:41:22 -07:00
Sean Chittenden cf271e7f65 Clarify that ConsulClusterInfo is an interface over serf
An interface was used to break a cyclic import dependency.
2016-03-25 12:38:40 -07:00
Sean Chittenden 973d924ab4 Reword comment after moving code into new packages 2016-03-25 12:34:46 -07:00
Sean Chittenden 78ec9f241d Change initialRebalanceTimeout to a time.Duration
Pointed out by: @slackpad
2016-03-25 12:34:12 -07:00
Sean Chittenden 328728c88a Negative check: test an invalid condition 2016-03-25 12:22:33 -07:00
Sean Chittenden 22e546ff32 Test to make sure bootstrap is missing 2016-03-25 12:20:12 -07:00
Sean Chittenden 5f035da4f1 Be more Go idiomatic w/ variable names: s/valid/ok/g
Cargo culting is bad, m'kay?

Pointy Hat: sean-
2016-03-25 12:14:24 -07:00
Sean Chittenden e041c3905d Fix stale comment
Pointed out by: @slackpad
2016-03-25 12:00:40 -07:00
Sean Chittenden 45fc7c362e Add a comment for Client serverMgr 2016-03-25 11:59:27 -07:00
Sean Chittenden 5873b7e28e Correct a bogus goimport rewrite for tests 2016-03-23 22:35:49 -07:00
Sean Chittenden dcc64d91c6 Test ServerManager.refreshServerRebalanceTimer
Change the signature so it returns a value so that this can be tested externally with mock data.  See the sample table in TestServerManagerInternal_refreshServerRebalanceTimer() for the rate at which it will back off.  This function is mostly used to not cripple large clusters in the event of a partition.
2016-03-23 22:10:50 -07:00
Sean Chittenden 8e3b3d766d Add a handful more unit tests to the public interface 2016-03-23 22:10:50 -07:00
Sean Chittenden d5f72e8c07 Rename GetNumServers to NumServers()
Matches the style of the rest of the repo
2016-03-23 22:10:50 -07:00
Sean Chittenden 9de9cf90f1 Rename NewServerManager to just New
Follow go style recommendations now that this has been refactored out of the consul package and doesn't need the qualifier in the name.
2016-03-23 22:10:50 -07:00
Sean Chittenden 7faea986a0 Rename FindHealthyServer() to FindServer()
There is no guarantee the server coming back is healthy.  It's apt to be healthy by virtue of its place in the server list, but it's not guaranteed.
2016-03-23 22:10:50 -07:00
Sean Chittenden 18885e3214 cycleServer is a pure function, save the result 2016-03-23 22:10:50 -07:00
Sean Chittenden 4ec9ed4de2 Missed unit test cruft 2016-03-23 22:10:50 -07:00
Sean Chittenden b906e40811 Update comments to reflect reality 2016-03-23 22:10:50 -07:00
Sean Chittenden 1a09a5b2cf Remove additional cruft from ServerManager's channels
No longer needed code.
2016-03-23 22:10:50 -07:00
Sean Chittenden c980d492c6 Emulate a TryLock using atomic.CompareAndSwap
Prevent possible queueing behind serverConfigLock in the event that a server fails on a busy host.
2016-03-23 22:10:50 -07:00
Sean Chittenden 102dcafe76 Make use of interfaces
Use an interface instead of serf.Serf as arg to NewServerManager.  Bonus points for improved testability.

Pointed out by: @slackpad
2016-03-23 22:10:50 -07:00
Sean Chittenden 231768faea Simplify error handling
Rely on Serf for liveness.  In the event of a failure, simply cycle the server to the end of the list.  If the server is unhealthy, Serf will reap the dead server.

Additional simplifications:

*) Only rebalance servers based on timers, not when a new server is readded to the cluster.
*) Back out the failure count in server_details.ServerDetails
2016-03-23 22:10:50 -07:00
Sean Chittenden 0c519aa90d Unbreak client tests by reverting to original test
Debugging code crept into the actual test and hung out for much longer than it should have.
2016-03-23 22:10:50 -07:00
Sean Chittenden 26e51376d9 Introduce asynchronous management of consul server lists
Instead of blocking the RPC call path and performing a potentially expensive calculation (including a call to `c.LANMembers()`), introduce a channel to request a rebalance.  Some events don't force a reshuffle; instead they extend the duration of the current rebalance window because the environment thrashed enough to redistribute a client's load.
2016-03-23 22:10:50 -07:00
Sean Chittenden 6ed37d1d8d Comment nits 2016-03-23 22:10:50 -07:00
Sean Chittenden c8ab3ae4cb Use saveServerConfig vs atomic.Value.Store(config) 2016-03-23 22:10:50 -07:00
Sean Chittenden 12377e80e6 Commit a handful of refactoring && copy/paste-o fixes 2016-03-23 22:10:50 -07:00
Sean Chittenden c1c17f158b Mutate copies of serverCfg.servers, not original
Removing any ambiguity re: ownership of the mutated server lists is a win for maintenance and debugging.
2016-03-23 22:10:50 -07:00
Sean Chittenden 753766cc5d rebalanceTimer may be nil during initialization
When first starting the server manager, it's possible that the rebalanceTimer in serverConfig will be nil, test accordingly.
2016-03-23 22:10:50 -07:00
Sean Chittenden d0e2792d5c Properly retain a pointer to the rebalanceTimer 2016-03-23 22:10:50 -07:00
Sean Chittenden 62785de865 Cosmetic and various other wordsmithing cleanups 2016-03-23 22:10:50 -07:00
Sean Chittenden 31de4290cf Document the various functions and their locking 2016-03-23 22:10:50 -07:00
Sean Chittenden ffcd939feb Use config convenience method to get config
'cause ELETTHECOMPILERSDOTHEWORK.  I don't need that cluttering up the subconscious with more complexity.
2016-03-23 22:10:50 -07:00
Sean Chittenden ed7fee7a3c Move consul.serverConfig out of the consul package
Relocated to its own package, server_manager.  This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary.  More work is needed to be done here
2016-03-23 22:10:50 -07:00
Sean Chittenden ab80393198 Rename serverConfigMtx to serverConfigLock
Pointed out by: @slackpad
2016-03-23 22:10:50 -07:00
Sean Chittenden 1866d94285 Refactor out the management of Consul servers
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.

This commit brings in a background task that proactively manages the server list and:

*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed

This is a WIP, more work in testing needs to be completed.
2016-03-23 22:10:50 -07:00
Sean Chittenden 73497f7915 Move consul.serverConfig out of the consul package
Relocated to its own package, server_manager.  This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary.  More work is needed to be done here
2016-03-23 22:10:50 -07:00
Sean Chittenden 2a52d3eb80 Rename serverConfigMtx to serverConfigLock
Pointed out by: @slackpad
2016-03-23 22:10:32 -07:00
Sean Chittenden 49425c5371 Refactor out the management of Consul servers
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.

This commit brings in a background task that proactively manages the server list and:

*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed

This is a WIP, more work in testing needs to be completed.
2016-03-23 22:09:46 -07:00
Sean Chittenden ebdccf0f35 Move consul.serverConfig out of the consul package
Relocated to its own package, server_manager.  This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary.  More work is needed to be done here
2016-03-23 22:05:29 -07:00
Sean Chittenden b7213d9daa Rename serverConfigMtx to serverConfigLock
Pointed out by: @slackpad
2016-03-23 22:05:05 -07:00
Sean Chittenden e29b8de0a6 Refactor out the management of Consul servers
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.

This commit brings in a background task that proactively manages the server list and:

*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed

This is a WIP, more work in testing needs to be completed.
2016-03-23 22:03:20 -07:00
Sean Chittenden 3730eaf6df Commit miss re: consuls variable rename 2016-03-23 16:24:29 -07:00
Sean Chittenden b33648ca5c Move consul.serverConfig out of the consul package
Relocated to its own package, server_manager.  This now greatly simplifies the RPC() call path and appropriately hides the locking behind the package boundary.  More work is needed to be done here
2016-03-23 16:16:22 -07:00
Sean Chittenden f3a69c939d Refactor consul.serverParts into server_details.ServerDetails
This may be short-lived, but it also seems like this is going to lead us down a path where ServerDetails is going to evolve into a more powerful package that will encapsulate more behavior behind a coherent API.
2016-03-23 16:15:47 -07:00
Sean Chittenden b3192ca410 Rename serverConfigMtx to serverConfigLock
Pointed out by: @slackpad
2016-03-23 16:15:47 -07:00
Sean Chittenden 82458fa9e8 Handle the case where there are no healthy servers
Pointed out by: @slackpad
2016-03-23 16:15:47 -07:00
Sean Chittenden 09d4c6439c Refactor out the management of Consul servers
Move the management of c.consulServers (fka c.consuls) into consul/server_manager.go.

This commit brings in a background task that proactively manages the server list and:

*) reshuffles the list
*) manages the timer out of the RPC() path
*) uses atomics to detect a server has failed

This is a WIP, more work in testing needs to be completed.
2016-03-23 16:15:47 -07:00
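
Two ideas recur in the log above: cycling a failed server to the end of the list as a pure function (18885e3214, 231768faea), and emulating a TryLock with atomic.CompareAndSwap so callers never queue behind a lock (c980d492c6). The following Go sketch illustrates both under stated assumptions. The `server` and `manager` types, the `busy` field, and the `notifyFailed` method are hypothetical names chosen for illustration and are not taken from the Consul source; only `cycleServer` borrows its name from the commit messages, and its body here is a guess at the behavior described.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// server is a hypothetical stand-in for the real server details type.
type server struct {
	Name string
}

// cycleServer returns a new slice with the first server moved to the end.
// It is a pure function: the input slice is left untouched.
func cycleServer(servers []server) []server {
	n := len(servers)
	if n < 2 {
		// Nothing to rotate with zero or one server.
		return servers
	}
	rotated := make([]server, 0, n)
	rotated = append(rotated, servers[1:]...)
	rotated = append(rotated, servers[0])
	return rotated
}

// manager is a hypothetical holder for the list plus a try-lock flag.
type manager struct {
	busy    int32 // 0 = free, 1 = a rotation is in progress
	servers []server
}

// notifyFailed rotates the list unless another caller is already doing so.
// It reports whether this call performed the rotation. The flag only
// prevents concurrent rotations; a real implementation would also need to
// publish the new list safely to readers (for example via atomic.Value).
func (m *manager) notifyFailed() bool {
	if !atomic.CompareAndSwapInt32(&m.busy, 0, 1) {
		// Someone else holds the "lock"; give up instead of waiting.
		return false
	}
	defer atomic.StoreInt32(&m.busy, 0)
	m.servers = cycleServer(m.servers)
	return true
}

func main() {
	m := &manager{servers: []server{{"s1"}, {"s2"}, {"s3"}}}
	fmt.Println(m.notifyFailed(), m.servers)
}
```

The compare-and-swap either claims the flag immediately or fails immediately, so a busy host that sees many failures at once performs one rotation and drops the rest rather than stacking goroutines behind a mutex.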