Commit graph

42 commits

Author SHA1 Message Date
Tim Gross 65c7e149d3
eval broker: use write lock when reaping cancelable evals (#16112)
The eval broker's `Cancelable` method used by the cancelable eval reaper mutates
the slice of cancelable evals by removing a batch at a time from the slice. But
this method unsafely uses a read lock despite this mutation. Under normal
workloads this is likely to be safe but when the eval broker is under the heavy
load this feature is intended to fix, we're likely to have a race
condition. Switch this to a write lock, like the other locks that mutate the
eval broker state.

This changeset also adjusts the timeout to allow poorly-sized Actions runners
more time to schedule the appropriate goroutines. The test has also been updated
to use `shoenig/test/wait` so we can have sensible reporting of the results
rather than just a timeout error when things go wrong.
2023-02-10 10:40:41 -05:00
Tim Gross a51149736d
Rename nomad.broker.total_blocked metric (#15835)
This changeset fixes a long-standing point of confusion in metrics emitted by
the eval broker. The eval broker has a queue of "blocked" evals that are waiting
for an in-flight ("unacked") eval of the same job to be completed. But this
"blocked" state is not the same as the `blocked` status that we write to raft
and expose in the Nomad API to end users. There's a second metric
`nomad.blocked_eval.total_blocked` that refers to evaluations in that
state. This has caused ongoing confusion in major customer incidents and even in
our own documentation! (Fixed in this PR.)

There's little functional change in this PR aside from the name of the metric
emitted, but there's a bit refactoring to clean up the names in `eval_broker.go`
so that there aren't name collisions and multiple names for the same
state. Changes included are:
* Everything that was previously called "pending" referred to entities that were
  associated witht he "ready" metric. These are all now called "ready" to match
  the metric.
* Everything named "blocked" in `eval_broker.go` is now named "pending", except
  for a couple of comments that actually refer to blocked RPCs.
* Added a note to the upgrade guide docs for 1.5.0.
* Fixed the scheduling performance metrics docs because the description for
  `nomad.broker.total_blocked` was actually the description for
  `nomad.blocked_eval.total_blocked`.
2023-01-20 14:23:56 -05:00
Tim Gross 6415fb4284
eval broker: shed all but one blocked eval per job after ack (#14621)
When an evaluation is acknowledged by a scheduler, the resulting plan is
guaranteed to cover up to the `waitIndex` set by the worker based on the most
recent evaluation for that job in the state store. At that point, we no longer
need to retain blocked evaluations in the broker that are older than that index.

Move all but the highest priority / highest `ModifyIndex` blocked eval into a
canceled set. When the `Eval.Ack` RPC returns from the eval broker it will
signal a reap of a batch of cancelable evals to write to raft. This paces the
cancelations limited by how frequently the schedulers are acknowledging evals;
this should reduce the risk of cancelations from overwhelming raft relative to
scheduler progress. In order to avoid straggling batches when the cluster is
quiet, we also include a periodic sweep through the cancelable list.
2022-11-16 16:10:11 -05:00
Seth Hoenig 2631659551 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Michael Schurter d7e123d7cd test: fix fake by increasing time window
Test originally only had a 10ms time window tolerance. Increased to
100ms and also improved assertions and docstrings.
2021-09-28 12:22:59 -07:00
Lars Lehtonen adbab29228 nomad: TestEvalBroker_Dequeue_Empty_Timeout() proper goroutine error handling (#6657) 2019-11-08 14:35:06 -05:00
Lars Lehtonen 39b68e0b88 TestEvalBroker_Dequeue_Blocked() proper goroutine error handling (#6651)
TestEvalBroker_Dequeue_Blocked() improve test readability
2019-11-08 08:52:23 -05:00
Lars Lehtonen 6deae70e35 TestEvalBroker_PauseResumeNackTimeout() proper goroutine error handling (#6649)
TestEvalBroker_PauseResumeNackTimeout() improve test readability
2019-11-07 16:04:59 -05:00
Lars Lehtonen 2638cbb31d nomad: TestEvalBroker_EnqueueAll_Dequeue_Fair() proper goroutine error handling (#6636)
nomad: TestEvalBroker_EnqueueAll_Dequeue_Fair() improve test readability
2019-11-07 10:39:29 -05:00
Danielle Lancashire 2fb93a6229 evalbroker: test for no enqueue on disabled 2019-05-15 11:02:21 +02:00
Preetha Appan 1ab8f2b57a
Address some code review comments 2018-03-14 16:10:32 -05:00
Preetha Appan 7887f39ff4
Added a delay heap to track evals with WaitUntil set, and use in eval broker 2018-03-14 16:10:32 -05:00
Preetha Appan 9a4fcaaf34
Fix bug with not including namespace in indexing blocked evals 2018-03-13 13:23:11 -05:00
Josh Soref f1d21bfdfe spelling: enqueuing 2018-03-11 18:00:07 +00:00
Alex Dadgar 84d06f6abe Sync namespace changes 2017-09-07 17:04:21 -07:00
Alex Dadgar 06eddf243c parallel nomad tests 2017-07-25 17:39:36 -07:00
Alex Dadgar a9c8b09da8 Push to configs 2017-04-14 15:24:55 -07:00
Alex Dadgar ef875f6dda Delay Nack re-enqueue
Add a delay when an evaluation is nacked that starts off small but
compounds to a larger delay for subsequent Nacks. This creates some
back pressure.
2017-04-12 13:41:40 -07:00
Alex Dadgar fd3e469d5e Remove requeue because it is a subset of EnqueueAll now 2016-06-24 10:14:34 -07:00
Alex Dadgar b7e3a45fef fix channel being nil on restore 2016-06-07 15:03:08 -07:00
Alex Dadgar 1f9f015c1b Fix race condition in which a reblocked evaluation could be dropped 2016-05-27 16:53:10 -07:00
Alex Dadgar 1c6d3e129a EnqueueAll inserts all evaluations before unblocking dequeue calls 2016-05-18 12:13:59 -07:00
Alex Dadgar 045f7807e0 eval_broker.Enqueue no longer returns an error 2016-05-18 11:35:15 -07:00
Alex Dadgar 74726278b9 core: Pause NackTimeout while in the plan_queue as progress is being made 2016-03-04 12:59:35 -08:00
Alex Dadgar d2e88f0116 Fix panic when Ack occurs at delivery limit 2016-02-11 11:07:18 -08:00
Alex Dadgar cbb1992929 Make a zero timeout for eval_broker.Dequeue() block 2015-11-23 11:59:49 -08:00
Alex Dadgar af7d896c2a nil-pointer dereference with empty timeout Dequeue 2015-11-20 16:49:55 -08:00
Armon Dadgar b9bb7bdaaa nomad: OutstandingReset returns specific errors 2015-10-23 10:22:17 -07:00
Armon Dadgar 16fd84f25a nomad: Adding OutstandingReset to EvalBroker 2015-10-23 10:14:16 -07:00
Armon Dadgar fe2e046481 nomad: support time wait for evaluations 2015-09-07 13:00:45 -07:00
Armon Dadgar 79a1471b85 nomad: add delivery limit to eval broker 2015-08-16 10:55:55 -07:00
Armon Dadgar 183a238481 nomad: avoid split-brain eval handling after leader transition 2015-08-12 15:25:31 -07:00
Armon Dadgar 343b1b9c89 nomad: move state and mocks into shared packages 2015-08-11 14:27:14 -07:00
Armon Dadgar 011484ea74 nomad: eval broker serializes by JobID 2015-08-05 17:55:15 -07:00
Armon Dadgar 078d80432a nomad: deduplicate enqueue of evaluations 2015-08-05 17:06:02 -07:00
Armon Dadgar 2f05a260bb nomad: add checking if eval broker enabled 2015-08-05 16:41:39 -07:00
Armon Dadgar df6ab9898b nomad: fixing tests 2015-08-04 17:11:20 -07:00
Armon Dadgar 7f49f25de1 nomad: ensure FIFO dequeue with same priority 2015-07-23 22:58:12 -07:00
Armon Dadgar 7aad418345 nomad: method to test if outstanding evaluation 2015-07-23 21:58:13 -07:00
Armon Dadgar fc11808fe9 nomad: add eval broker, configurable nack timeout 2015-07-23 21:44:17 -07:00
Armon Dadgar 5fcc39c399 nomad: testing the eval broker 2015-07-23 21:37:28 -07:00
Armon Dadgar 2a76fd34f0 nomad: first pass at eval broker 2015-07-23 17:31:08 -07:00