Some tests assert on numbers on numbers of servers, e.g.
TestHTTP_AgentSetServers and TestHTTP_AgentListServers_ACL . Though, in dev and
test modes, the agent starts with servers having duplicate entries for
advertised and normalized RPC values, then settles with one unique value after
Raft/Serf re-sets servers with one single unique value.
This leads to flakiness, as the test will fail if assertion runs before Serf
update takes effect.
Here, we update the inital dev handling so it only adds a unique value if the
advertised and normalized values are the same.
Sample log lines illustrating the problem:
```
=== CONT TestHTTP_AgentSetServers
TestHTTP_AgentSetServers: testlog.go:34: 2020-04-06T21:47:51.016Z [INFO] nomad.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:127.0.0.1:9008 Address:127.0.0.1:9008}]"
TestHTTP_AgentSetServers: testlog.go:34: 2020-04-06T21:47:51.016Z [INFO] nomad: serf: EventMemberJoin: TestHTTP_AgentSetServers.global 127.0.0.1
TestHTTP_AgentSetServers: testlog.go:34: 2020-04-06T21:47:51.035Z [DEBUG] client.server_mgr: new server list: new_servers=[127.0.0.1:9008, 127.0.0.1:9008] old_servers=[]
...
TestHTTP_AgentSetServers: agent_endpoint_test.go:759:
Error Trace: agent_endpoint_test.go:759
http_test.go:1089
agent_endpoint_test.go:705
Error: "[127.0.0.1:9008 127.0.0.1:9008]" should have 1 item(s), but has 2
Test: TestHTTP_AgentSetServers
```
This closes#7456. It hides the terminal when the job is dead and
displays an error when trying to open an exec session for a task
that isn’t running. There’s a skipped test for the latter behaviour
that I’ll have to come back for.
This closes#7454. It makes use of the existing watchable tools to
allow the exec popup sidebar to be live-updating. It also adds
alphabetic sorting of task groups and tasks.
The `Job.Deregister` call will block on the client CSI controller RPCs
while the alloc still exists on the Nomad client node. So we need to
make the volume claim reaping async from the `Job.Deregister`. This
allows `nomad job stop` to return immediately. In order to make this
work, this changeset changes the volume GC so that the GC jobs are on a
by-volume basis rather than a by-job basis; we won't have to query
the (possibly deleted) job at the time of volume GC. We smuggle the
volume ID and whether it's a purge into the GC eval ID the same way we
smuggled the job ID previously.
The CSI plugins uses the external volume ID for all operations, but
the Client CSI RPCs uses the Nomad volume ID (human-friendly) for the
mount paths. Pass the External ID as an arg in the RPC call so that
the unpublish workflows have it without calling back to the server to
find the external ID.
The controller CSI plugins need the CSI node ID (or in other words,
the storage provider's view of node ID like the EC2 instance ID), not
the Nomad node ID, to determine how to detach the external volume.
* nomad/state/state_store: error message copy/paste error
* nomad/structs/structs: add a VolumeEval to the JobDeregisterResponse
* nomad/job_endpoint: synchronously, volumeClaimReap on job Deregister
* nomad/core_sched: make volumeClaimReap available without a CoreSched
* nomad/job_endpoint: Deregister return early if the job is missing
* nomad/job_endpoint_test: job Deregistion is idempotent
* nomad/core_sched: conditionally ignore alloc status in volumeClaimReap
* nomad/job_endpoint: volumeClaimReap all allocations, even running
* nomad/core_sched_test: extra argument to collectClaimsToGCImpl
* nomad/job_endpoint: job deregistration is not idempotent
I hypothesize that the flakiness in rolling update is due to shutting
down s3 server before s4 is properly added as a voter.
The chain of the flakiness is as follows:
1. Bootstrap with s1, s2, s3
2. Add s4
3. Wait for servers to register with 3 voting peers
* But we already have 3 voters (s1, s2, and s3)
* s4 is added as a non-voter in Raft v3 and must wait until autopilot promots it
4. Test proceeds without s4 being a voter
5. s3 shutdown
6. cluster changes stall due to leader election and too many pending configuration
changes (e.g. removing s3 from raft, promoting s4).
Here, I have the test wait until s4 is marked as a voter before shutting
down s3, so we don't have too many configuration changes at once.
In https://circleci.com/gh/hashicorp/nomad/57092, I noticed the
following events:
```
TestAutopilot_RollingUpdate: autopilot_test.go:204: adding server s4
TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO] nomad/serf.go:60: nomad: adding server: server="nomad-137.global (Addr: 127.0.0.1:9177) (DC: dc1)"
TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO] raft/raft.go:1018: nomad.raft: updating configuration: command=AddNonvoter server-id=c54b5bf4-1159-34f6-032d-56aefeb08425 server-addr=127.0.0.1:9177 servers="[{Suffrage:Voter ID:df01ba65-d1b2-17a9-f792-a4459b3a7c09 Address:127.0.0.1:9171} {Suffrage:Voter ID:c3337778-811e-2675-87f5-006309888387 Address:127.0.0.1:9173} {Suffrage:Voter ID:186d5e15-c473-e2b3-b5a4-3259a84e10ef Address:127.0.0.1:9169} {Suffrage:Nonvoter ID:c54b5bf4-1159-34f6-032d-56aefeb08425 Address:127.0.0.1:9177}]"
TestAutopilot_RollingUpdate: autopilot_test.go:218: shutting down server s3
TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.797Z [INFO] raft/replication.go:456: nomad.raft: aborting pipeline replication: peer="{Nonvoter c54b5bf4-1159-34f6-032d-56aefeb08425 127.0.0.1:9177}"
TestAutopilot_RollingUpdate: autopilot_test.go:235: waiting for s4 to stabalize and be promoted
TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.975Z [ERROR] raft/raft.go:1656: nomad.raft: failed to make requestVote RPC: target="{Voter c3337778-811e-2675-87f5-006309888387 127.0.0.1:9173}" error="dial tcp 127.0.0.1:9173: connect: connection refused"
TestAutopilot_RollingUpdate: retry.go:121: autopilot_test.go:241: don't want "c3337778-811e-2675-87f5-006309888387"
autopilot_test.go:241: didn't find map[c54b5bf4-1159-34f6-032d-56aefeb08425:true] in []raft.ServerID{"df01ba65-d1b2-17a9-f792-a4459b3a7c09", "186d5e15-c473-e2b3-b5a4-3259a84e10ef"}
```
Note how s3, c3337778, is present in the peers list in the final
failure, but s4, c54b5bf4, is added as a Nonvoter and isn't present in
the final peers list.
When `nomad job inspect` encodes the response, if the decoded JSON
from the API doesn't exactly match the API struct, the field value
will be omitted even if it has a value. We only want the JSON struct
tag to `omitempty`.
This changeset:
* adds eval status to the error messages emitted when we have
placement failure in tests. The implementation here isn't quite
perfect but it's a lot better than "condition not met".
* enforces the ordering of teardown of the CSI test
* doesn't pass the purge flag to one of the two CSI tests, so that we
exercise both code paths.