open-nomad

Author	SHA1	Message	Date
Tim Gross	5c57a84e99	autopilot: deflake tests (#14475) Includes: * Remove leader upgrade raft version test, as older versions of raft are now incompatible with our autopilot library. * Remove attempt to assert initial non-voter status on the `PromoteNonVoter` test, as this happens too quickly to reliably detect. * Unskip some previously-skipped tests which we should make stable. * Remove the `consul/sdk` retry helper for these tests; this uses panic recovery in a kind of a clever/gross way to reduce LoC but it seems to introduce some timing issues in the process. * Add more test step logging and reduce logging noise from the scheduler goroutines to make it easier to debug failing tests. * Be more consistent about using the `waitForStableLeadership` helper so that we can assert the cluster is fully stable and not just that we've added peers.	2022-09-07 09:35:01 -04:00
Tim Gross	7921f044e5	migrate autopilot implementation to raft-autopilot (#14441) Nomad's original autopilot was importing from a private package in Consul. It has been moved out to a shared library. Switch Nomad to use this library so that we can eliminate the import of Consul, which is necessary to build Nomad ENT with the current version of the Consul SDK. This also will let us pick up autopilot improvements shared with Consul more easily.	2022-09-01 14:27:10 -04:00
Seth Hoenig	2631659551	ci: swap ci parallelization for unconstrained gomaxprocs	2022-03-15 12:58:52 -05:00
Michael Schurter	c1bd10456c	test: fix flaky TestAutopilot_CleanupDeadServer The fix seems to be related to the pointer comparison and swapping we did around killing a non-leader. I actually can't quite explain it, but when comparing against Consul's version of this test I noticed they used the slice index to track the killed server instead of pointer swapping. As soon as I switched to slice index tracking I could no longer reproduce the failure. In addition: - Tested membership counts on all servers instead of just 1 for added correctness. - Stopped testing raft v1 because it is unsupported.	2021-09-28 16:38:56 -07:00
Mahmood Ali	b4ed8acbff	tests: attempt deflaking TestAutopilot_CleanupDeadServer Attempt to deflake the test by avoiding shutting down the leaders: leadership recovery takes more time, so raft configuration changes take longer to process and may cause the test to fail.	2021-08-18 15:37:25 -04:00
Mahmood Ali	8009d9837c	tests: deflake TestMonitor_Monitor_RemoteServer and cross-region tests Ensure that all servers are joined to each other before the test proceeds, instead of just joining them to the first server and relying on background serf propagation. Relying on background serf propagation is a cause of flakiness, especially for tests with multiple regions. The server receiving the RPC may not be aware of the region and fail to forward the RPC accordingly. For example, consider the `TestMonitor_Monitor_RemoteServer` failure in https://app.circleci.com/pipelines/github/hashicorp/nomad/16402/workflows/7f327235-7d0c-40ba-9757-600522afca51/jobs/158045 where you can observe: * `nomad-117` is joined to `nomad-118` and `nomad-119` * `nomad-119` is the foreign region * `nomad-117` gains leadership in the default region, `nomad-118` is the non-leader * search logs for `nomad: adding server` and notice that `nomad-118` only added `nomad-118` and `nomad-118`, but not `nomad-119`! * so the query to the non-leader in the test fails to be forwarded to the appropriate region.	2021-06-10 21:27:55 -04:00
Mahmood Ali	816a93ed4a	tests: deflake TestAutopilot_RollingUpdate I hypothesize that the flakiness in rolling update is due to shutting down the s3 server before s4 is properly added as a voter. The chain of the flakiness is as follows: 1. Bootstrap with s1, s2, s3 2. Add s4 3. Wait for servers to register with 3 voting peers * But we already have 3 voters (s1, s2, and s3) * s4 is added as a non-voter in Raft v3 and must wait until autopilot promotes it 4. Test proceeds without s4 being a voter 5. s3 shutdown 6. cluster changes stall due to leader election and too many pending configuration changes (e.g. removing s3 from raft, promoting s4). Here, I have the test wait until s4 is marked as a voter before shutting down s3, so we don't have too many configuration changes at once. In https://circleci.com/gh/hashicorp/nomad/57092, I noticed the following events: ``` TestAutopilot_RollingUpdate: autopilot_test.go:204: adding server s4 TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO] nomad/serf.go:60: nomad: adding server: server="nomad-137.global (Addr: 127.0.0.1:9177) (DC: dc1)" TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO] raft/raft.go:1018: nomad.raft: updating configuration: command=AddNonvoter server-id=c54b5bf4-1159-34f6-032d-56aefeb08425 server-addr=127.0.0.1:9177 servers="[{Suffrage:Voter ID:df01ba65-d1b2-17a9-f792-a4459b3a7c09 Address:127.0.0.1:9171} {Suffrage:Voter ID:c3337778-811e-2675-87f5-006309888387 Address:127.0.0.1:9173} {Suffrage:Voter ID:186d5e15-c473-e2b3-b5a4-3259a84e10ef Address:127.0.0.1:9169} {Suffrage:Nonvoter ID:c54b5bf4-1159-34f6-032d-56aefeb08425 Address:127.0.0.1:9177}]" TestAutopilot_RollingUpdate: autopilot_test.go:218: shutting down server s3 TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.797Z [INFO] raft/replication.go:456: nomad.raft: aborting pipeline replication: peer="{Nonvoter c54b5bf4-1159-34f6-032d-56aefeb08425 127.0.0.1:9177}" TestAutopilot_RollingUpdate: autopilot_test.go:235: waiting for s4 to stabalize and be promoted TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.975Z [ERROR] raft/raft.go:1656: nomad.raft: failed to make requestVote RPC: target="{Voter c3337778-811e-2675-87f5-006309888387 127.0.0.1:9173}" error="dial tcp 127.0.0.1:9173: connect: connection refused" TestAutopilot_RollingUpdate: retry.go:121: autopilot_test.go:241: don't want "c3337778-811e-2675-87f5-006309888387" autopilot_test.go:241: didn't find map[c54b5bf4-1159-34f6-032d-56aefeb08425:true] in []raft.ServerID{"df01ba65-d1b2-17a9-f792-a4459b3a7c09", "186d5e15-c473-e2b3-b5a4-3259a84e10ef"} ``` Note how s3, c3337778, is present in the peers list in the final failure, but s4, c54b5bf4, is added as a Nonvoter and isn't present in the final peers list.	2020-04-03 17:15:41 -04:00
Mahmood Ali	36ad8ee2e0	tests: add debugging for TestAutopilot_RollingUpdate	2020-03-30 07:06:53 -04:00
Mahmood Ali	acbfeb5815	Simplify Bootstrap logic in tests This change updates tests to honor `BootstrapExpect` exclusively when forming test clusters and removes test-only knobs, e.g. `config.DevDisableBootstrap`. Background: Test cluster creation is fragile. Test servers don't follow the BootstrapExpect route like production clusters. Instead they start as single-node clusters and then get rejoined, which risks causing split brain or other test flakiness. The test framework exposes a few knobs (e.g. `config.DevDisableBootstrap` and `config.Bootstrap`) that control whether a server should bootstrap the cluster. These flags are confusing and it's unclear when to use them: their usage in multi-node clusters isn't properly documented. Furthermore, they have some bad side effects as they don't control the Raft library: If `config.DevDisableBootstrap` is true, the test server may not immediately attempt to bootstrap a cluster, but after an election timeout (~50ms), Raft may force a leadership election and win it (with only one vote) and cause a split brain. The knobs are also confusing as Bootstrap is an overloaded term. In BootstrapExpect, we refer to bootstrapping the cluster only after N servers are connected. But in tests and the knobs above, it refers to whether the server is a single-node cluster and shouldn't wait for any other server. Changes: This commit makes two changes: First, it relies on `BootstrapExpected` instead of `Bootstrap` and/or `DevMode` flags. This change is relatively trivial. Second, it introduces a `Bootstrapped` flag to track if the cluster is bootstrapped. This allows us to keep `BootstrapExpected` immutable. Previously, the flag was a config value but it was set to 0 after cluster bootstrap completed.	2020-03-02 13:47:43 -05:00
Mahmood Ali	98ad59b1de	update rest of consul packages	2020-02-16 16:25:04 -06:00
Seth Hoenig	f0c3dca49c	tests: swap lib/freeport for tweaked helper/freeport Copy the updated version of freeport (sdk/freeport), and tweak it for use in Nomad tests. This means staying below port 10000 to avoid conflicts with the lib/freeport that is still transitively used by the old version of consul that we vendor. Also provide implementations to find ephemeral ports of macOS and Windows environments. Ports acquired through freeport are supposed to be returned to freeport, which this change now also introduces. Many tests are modified to include calls to a cleanup function for Server objects. This should help quite a bit with some flakey tests, but not all of them. Our port problems will not go away completely until we upgrade our vendor version of consul. With Go modules, we'll probably do a 'replace' to swap out other copies of freeport with the one now in 'nomad/helper/freeport'.	2019-12-09 08:37:32 -06:00
Mahmood Ali	c94a5ef1f8	tests: give up on TestAutopilot_CleanupStaleRaftServer for now	2019-09-04 09:10:53 -04:00
Mahmood Ali	6cefd8f97e	tests: attempt to fix TestAutopilot_CleanupStaleRaftServer Also add a utility function for waiting for stable leadership	2019-09-04 08:49:33 -04:00
Mahmood Ali	9bd56587cd	Fix raft tests Wait until leadership stabilizes and all non-voters get promoted before killing the leader	2019-09-03 14:53:29 -04:00
Alex Dadgar	a6dfffa4fa	Add testing interfaces	2018-02-15 13:59:00 -08:00
Kyle Havlovitz	2ccf565bf6	Refactor redundancy_zone/upgrade_version out of client meta	2018-01-29 20:03:38 -08:00
Kyle Havlovitz	1c07066064	Add autopilot functionality based on Consul's autopilot	2017-12-18 14:29:41 -08:00

17 commits