open-nomad/nomad
Mahmood Ali 816a93ed4a tests: deflake TestAutopilot_RollingUpdate
I hypothesize that the flakiness in rolling update is due to shutting
down s3 server before s4 is properly added as a voter.

The chain of the flakiness is as follows:

1. Bootstrap with s1, s2, s3
2. Add s4
3. Wait for servers to register with 3 voting peers
   * But we already have 3 voters (s1, s2, and s3)
   * s4 is added as a non-voter in Raft v3 and must wait until autopilot promots it
4. Test proceeds without s4 being a voter
5. s3 shutdown
6. cluster changes stall due to leader election and too many pending configuration
changes (e.g. removing s3 from raft, promoting s4).

Here, I have the test wait until s4 is marked as a voter before shutting
down s3, so we don't have too many configuration changes at once.

In https://circleci.com/gh/hashicorp/nomad/57092, I noticed the
following events:

```
TestAutopilot_RollingUpdate: autopilot_test.go:204: adding server s4
    TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO]  nomad/serf.go:60: nomad: adding server: server="nomad-137.global (Addr: 127.0.0.1:9177) (DC: dc1)"
    TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO]  raft/raft.go:1018: nomad.raft: updating configuration: command=AddNonvoter server-id=c54b5bf4-1159-34f6-032d-56aefeb08425 server-addr=127.0.0.1:9177 servers="[{Suffrage:Voter ID:df01ba65-d1b2-17a9-f792-a4459b3a7c09 Address:127.0.0.1:9171} {Suffrage:Voter ID:c3337778-811e-2675-87f5-006309888387 Address:127.0.0.1:9173} {Suffrage:Voter ID:186d5e15-c473-e2b3-b5a4-3259a84e10ef Address:127.0.0.1:9169} {Suffrage:Nonvoter ID:c54b5bf4-1159-34f6-032d-56aefeb08425 Address:127.0.0.1:9177}]"

    TestAutopilot_RollingUpdate: autopilot_test.go:218: shutting down server s3
    TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.797Z [INFO]  raft/replication.go:456: nomad.raft: aborting pipeline replication: peer="{Nonvoter c54b5bf4-1159-34f6-032d-56aefeb08425 127.0.0.1:9177}"
    TestAutopilot_RollingUpdate: autopilot_test.go:235: waiting for s4 to stabalize and be promoted
    TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.975Z [ERROR] raft/raft.go:1656: nomad.raft: failed to make requestVote RPC: target="{Voter c3337778-811e-2675-87f5-006309888387 127.0.0.1:9173}" error="dial tcp 127.0.0.1:9173: connect: connection refused"
    TestAutopilot_RollingUpdate: retry.go:121: autopilot_test.go:241: don't want "c3337778-811e-2675-87f5-006309888387"
        autopilot_test.go:241: didn't find map[c54b5bf4-1159-34f6-032d-56aefeb08425:true] in []raft.ServerID{"df01ba65-d1b2-17a9-f792-a4459b3a7c09", "186d5e15-c473-e2b3-b5a4-3259a84e10ef"}
```

Note how s3, c3337778, is present in the peers list in the final
failure, but s4, c54b5bf4, is added as a Nonvoter and isn't present in
the final peers list.
2020-04-03 17:15:41 -04:00
..
deploymentwatcher tests: deflake deploymentwatcher package 2020-03-12 15:42:01 -04:00
drainer avoid logging in draining job watcher 2020-03-30 07:06:53 -04:00
mock state: support snapshot of CSI plugin and volume tables (#7546) 2020-03-30 11:17:16 -04:00
state csi: volume validate namespace (#7587) 2020-04-02 10:13:41 -04:00
structs Use lowercase for hcl keys 2020-04-03 07:56:00 -04:00
types Change the signature of the PeriodicCallback to return an error 2016-06-10 15:54:39 -04:00
acl.go Audit config, seams for enterprise audit features 2020-03-23 13:47:42 -04:00
acl_endpoint.go address feedback review 2019-11-26 08:39:04 -05:00
acl_endpoint_test.go Simplify Bootstrap logic in tests 2020-03-02 13:47:43 -05:00
acl_test.go Audit config, seams for enterprise audit features 2020-03-23 13:47:42 -04:00
alloc_endpoint.go acl: check ACL against object namespace 2019-10-08 12:59:22 -04:00
alloc_endpoint_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
autopilot.go implement MinQuorum 2020-02-16 16:04:59 -06:00
autopilot_test.go tests: deflake TestAutopilot_RollingUpdate 2020-04-03 17:15:41 -04:00
blocked_evals.go blocked_evals reset system evals on Flush 2019-07-18 10:32:13 -04:00
blocked_evals_system.go blocked_evals system evals indexed by job and node 2019-07-18 10:32:12 -04:00
blocked_evals_test.go blocked_evals_test disable calls Flush 2019-07-18 10:32:13 -04:00
client_agent_endpoint.go vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:45:21 -04:00
client_agent_endpoint_test.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
client_alloc_endpoint.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
client_alloc_endpoint_test.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
client_csi_endpoint.go CSI: move node unmount to server-driven RPCs (#7596) 2020-04-02 16:04:56 -04:00
client_csi_endpoint_test.go CSI: move node unmount to server-driven RPCs (#7596) 2020-04-02 16:04:56 -04:00
client_fs_endpoint.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
client_fs_endpoint_test.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
client_rpc.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
client_rpc_test.go Simplify Bootstrap logic in tests 2020-03-02 13:47:43 -05:00
client_stats_endpoint.go server 2018-09-15 16:23:13 -07:00
client_stats_endpoint_test.go Simplify Bootstrap logic in tests 2020-03-02 13:47:43 -05:00
config.go csi: volume claim garbage collection (#7125) 2020-03-23 13:58:30 -04:00
consul.go consul: annotate Consul interfaces with ACLs 2020-03-30 10:17:28 -06:00
consul_policy.go nomad: fix leftover missed refactoring in consul policy checking 2020-01-31 19:05:06 -06:00
consul_policy_test.go nomad: fix leftover missed refactoring in consul policy checking 2020-01-31 19:05:06 -06:00
consul_test.go nomad: handle SI token revocations concurrently 2020-01-31 19:04:14 -06:00
core_sched.go CSI: move node unmount to server-driven RPCs (#7596) 2020-04-02 16:04:56 -04:00
core_sched_test.go CSI: move node unmount to server-driven RPCs (#7596) 2020-04-02 16:04:56 -04:00
csi_endpoint.go CSI: move node unmount to server-driven RPCs (#7596) 2020-04-02 16:04:56 -04:00
csi_endpoint_test.go CSI: move node unmount to server-driven RPCs (#7596) 2020-04-02 16:04:56 -04:00
deployment_endpoint.go acl: check ACL against object namespace 2019-10-08 12:59:22 -04:00
deployment_endpoint_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
deployment_watcher_shims.go Fix typos 2018-05-07 14:50:01 -05:00
drainer_int_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
drainer_shims.go set node.StatusUpdatedAt in raft 2019-05-21 16:13:32 -04:00
endpoints_oss.go include pro tag in serveral oss.go files 2020-02-10 15:56:14 -05:00
eval_broker.go nomad: refactor waitForIndex into SnapshotAfter 2019-05-17 13:30:23 -07:00
eval_broker_test.go nomad: TestEvalBroker_Dequeue_Empty_Timeout() proper goroutine error handling (#6657) 2019-11-08 14:35:06 -05:00
eval_endpoint.go acl: check ACL against object namespace 2019-10-08 12:59:22 -04:00
eval_endpoint_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
fsm.go scaling api: more testing around the scaling events api 2020-04-01 16:39:23 +00:00
fsm_not_ent.go sync 2017-10-13 14:36:02 -07:00
fsm_registry_oss.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
fsm_test.go implement MinQuorum 2020-02-16 16:04:59 -06:00
heartbeat.go goimports 2019-01-22 15:44:31 -08:00
heartbeat_test.go Simplify Bootstrap logic in tests 2020-03-02 13:47:43 -05:00
interfaces.go CSI: move node unmount to server-driven RPCs (#7596) 2020-04-02 16:04:56 -04:00
job_endpoint.go Merge branch 'f-7422-scaling-events' of github.com:hashicorp/nomad into f-7422-scaling-events 2020-04-01 17:28:50 +00:00
job_endpoint_hook_connect.go lint: gofmt 2020-04-01 21:23:47 -04:00
job_endpoint_hook_connect_test.go connect: canonicalize before adding sidecar 2019-12-12 20:55:56 -08:00
job_endpoint_hook_expose_check.go connect: enable automatic expose paths for individual group service checks 2020-03-31 17:15:50 -06:00
job_endpoint_hook_expose_check_test.go connect: enable automatic expose paths for individual group service checks 2020-03-31 17:15:50 -06:00
job_endpoint_hooks.go core: add semver constraint 2019-11-19 08:40:19 -08:00
job_endpoint_oss.go sync 2017-09-19 10:08:23 -05:00
job_endpoint_test.go job_endpoint: fixed bad test 2020-04-01 18:11:58 +00:00
leader.go csi: volume claim garbage collection (#7125) 2020-03-23 13:58:30 -04:00
leader_oss.go include pro tag in serveral oss.go files 2020-02-10 15:56:14 -05:00
leader_test.go tests: deflake TestServer_ReconcileMember 2020-03-06 14:14:41 -05:00
merge.go
node_endpoint.go csi: volume claim garbage collection (#7125) 2020-03-23 13:58:30 -04:00
node_endpoint_test.go csi: volume validate namespace (#7587) 2020-04-02 10:13:41 -04:00
operator_endpoint.go Add code for plan normalization 2019-04-23 09:18:01 -07:00
operator_endpoint_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
periodic.go add create and modify timestamps to evaluations (#5881) 2019-08-07 09:50:35 -07:00
periodic_endpoint.go goimports 2019-01-22 15:44:31 -08:00
periodic_endpoint_test.go csi: fix index maintenance for CSIVolume and CSIPlugin tables (#7049) 2020-03-23 13:58:29 -04:00
periodic_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
plan_apply.go add create and modify timestamps to evaluations (#5881) 2019-08-07 09:50:35 -07:00
plan_apply_not_ent.go sync 2017-10-13 14:36:02 -07:00
plan_apply_pool.go Log reason a plan gets rejected per node. 2017-07-13 17:14:02 -07:00
plan_apply_pool_test.go Enable more linters 2017-09-26 15:26:33 -07:00
plan_apply_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
plan_endpoint.go goimports 2019-01-22 15:44:31 -08:00
plan_endpoint_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
plan_normalization_test.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
plan_queue.go nomad: refactor waitForIndex into SnapshotAfter 2019-05-17 13:30:23 -07:00
plan_queue_test.go nomad: fix test goroutine (#6593) 2019-10-31 08:23:32 -04:00
raft_rpc.go Refactor 2018-02-15 13:59:00 -08:00
regions_endpoint.go server 2018-09-15 16:23:13 -07:00
regions_endpoint_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
rpc.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
rpc_test.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
scaling_endpoint.go add acl validation to Scaling.ListPolicies and Scaling.GetPolicy 2020-03-24 14:39:05 +00:00
scaling_endpoint_test.go update RPC scaling endpoint tests to use renamed 'scale' policy disposition 2020-03-24 20:18:12 +00:00
search_endpoint.go csi: CLI for volume status, registration/deregistration and plugin status (#7193) 2020-03-23 13:58:30 -04:00
search_endpoint_oss.go csi: implement volume ACLs (#7339) 2020-03-23 13:59:25 -04:00
search_endpoint_test.go csi: plugin deregistration on plugin job GC (#7502) 2020-03-26 17:07:18 -04:00
serf.go Simplify Bootstrap logic in tests 2020-03-02 13:47:43 -05:00
serf_test.go Simplify Bootstrap logic in tests 2020-03-02 13:47:43 -05:00
server.go CSI: move node unmount to server-driven RPCs (#7596) 2020-04-02 16:04:56 -04:00
server_setup_oss.go Update consul autopilot dependency 2020-02-16 15:41:43 -06:00
server_test.go Simplify Bootstrap logic in tests 2020-03-02 13:47:43 -05:00
stats_fetcher.go server 2018-09-15 16:23:13 -07:00
stats_fetcher_test.go Simplify Bootstrap logic in tests 2020-03-02 13:47:43 -05:00
status_endpoint.go server 2018-09-15 16:23:13 -07:00
status_endpoint_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
system_endpoint.go server 2018-09-15 16:23:13 -07:00
system_endpoint_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00
testing.go csi: plugin deregistration on plugin job GC (#7502) 2020-03-26 17:07:18 -04:00
timetable.go vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:45:21 -04:00
timetable_test.go fixup! vendor: explicit use of hashicorp/go-msgpack 2020-03-31 09:48:07 -04:00
util.go remove unused dropButLastChannel 2020-02-13 18:56:53 -05:00
util_test.go remove unused dropButLastChannel 2020-02-13 18:56:53 -05:00
vault.go nomad: handle SI token revocations concurrently 2020-01-31 19:04:14 -06:00
vault_test.go vendor: vault api and sdk 2020-03-21 17:57:48 +01:00
vault_testing.go nomad: refactor waitForIndex into SnapshotAfter 2019-05-17 13:30:23 -07:00
worker.go add create and modify timestamps to evaluations (#5881) 2019-08-07 09:50:35 -07:00
worker_test.go tests: swap lib/freeport for tweaked helper/freeport 2019-12-09 08:37:32 -06:00