Commit Graph

14 Commits

Author SHA1 Message Date
Tim Gross 4368dcc02f
fix deadlock in plan_apply (#13407)
The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the `indexCh` forever, and
all schedulers will block in `Plan.Submit`.

Closing the `indexCh` when the `asyncPlanWait` is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.

This changeset includes the a PoC failing test that works by injecting
a large timeout into the state store. We need to turn this into a test
we can run normally without breaking the state store before we can
merge this PR.

Increase `snapshotMinIndex` timeout to 10s.
This timeout creates backpressure where any concurrent `Plan.Submit`
RPCs will block waiting for results. This sheds load across all
servers and gives raft some CPU to catch up, because schedulers won't
dequeue more work while waiting. Increase it to 10s based on
observations of large production clusters.
2022-06-23 12:06:27 -04:00
Seth Hoenig 2631659551 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Michael Schurter d87ed3fcd7 core: prevent malformed plans from crashing leader
The Plan.Submit endpoint assumed PlanRequest.Plan was never nil. While
there is no evidence it ever has been nil, we should not panic if a nil
plan is ever submitted because that would crash the leader.
2022-01-31 12:15:15 -08:00
Seth Hoenig f0c3dca49c tests: swap lib/freeport for tweaked helper/freeport
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.

Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.

This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.
2019-12-09 08:37:32 -06:00
Alex Dadgar 4bdccab550 goimports 2019-01-22 15:44:31 -08:00
Alex Dadgar a6dfffa4fa Add testing interfaces 2018-02-15 13:59:00 -08:00
Alex Dadgar c1cc51dbee sync 2017-10-13 14:36:02 -07:00
Alex Dadgar 06eddf243c parallel nomad tests 2017-07-25 17:39:36 -07:00
Alex Dadgar 045f7807e0 eval_broker.Enqueue no longer returns an error 2016-05-18 11:35:15 -07:00
Armon Dadgar 2ff133c0e6 nomad: rename region1 to global. Fixes #41 2015-09-13 18:18:40 -07:00
Armon Dadgar f7007bfeb5 nomad: avoid split-brain in plan processing due to leader transition or eval retry 2015-08-12 15:44:36 -07:00
Armon Dadgar 343b1b9c89 nomad: move state and mocks into shared packages 2015-08-11 14:27:14 -07:00
Armon Dadgar df6ab9898b nomad: fixing tests 2015-08-04 17:11:20 -07:00
Armon Dadgar 4be2c1b9db nomad: adding plan endpoint 2015-07-27 15:31:49 -07:00