open-nomad

Commit Graph

Author	SHA1	Message	Date
hashicorp-copywrite[bot]	005636afa0	[COMPLIANCE] Add Copyright and License Headers	2023-04-10 15:36:59 +00:00
Tim Gross	65c7e149d3	eval broker: use write lock when reaping cancelable evals (#16112 ) The eval broker's `Cancelable` method used by the cancelable eval reaper mutates the slice of cancelable evals by removing a batch at a time from the slice. But this method unsafely uses a read lock despite this mutation. Under normal workloads this is likely to be safe but when the eval broker is under the heavy load this feature is intended to fix, we're likely to have a race condition. Switch this to a write lock, like the other locks that mutate the eval broker state. This changeset also adjusts the timeout to allow poorly-sized Actions runners more time to schedule the appropriate goroutines. The test has also been updated to use `shoenig/test/wait` so we can have sensible reporting of the results rather than just a timeout error when things go wrong.	2023-02-10 10:40:41 -05:00
Tim Gross	a51149736d	Rename `nomad.broker.total_blocked` metric (#15835 ) This changeset fixes a long-standing point of confusion in metrics emitted by the eval broker. The eval broker has a queue of "blocked" evals that are waiting for an in-flight ("unacked") eval of the same job to be completed. But this "blocked" state is not the same as the `blocked` status that we write to raft and expose in the Nomad API to end users. There's a second metric `nomad.blocked_eval.total_blocked` that refers to evaluations in that state. This has caused ongoing confusion in major customer incidents and even in our own documentation! (Fixed in this PR.) There's little functional change in this PR aside from the name of the metric emitted, but there's a bit refactoring to clean up the names in `eval_broker.go` so that there aren't name collisions and multiple names for the same state. Changes included are: * Everything that was previously called "pending" referred to entities that were associated witht he "ready" metric. These are all now called "ready" to match the metric. * Everything named "blocked" in `eval_broker.go` is now named "pending", except for a couple of comments that actually refer to blocked RPCs. * Added a note to the upgrade guide docs for 1.5.0. * Fixed the scheduling performance metrics docs because the description for `nomad.broker.total_blocked` was actually the description for `nomad.blocked_eval.total_blocked`.	2023-01-20 14:23:56 -05:00
Tim Gross	6415fb4284	eval broker: shed all but one blocked eval per job after ack (#14621 ) When an evaluation is acknowledged by a scheduler, the resulting plan is guaranteed to cover up to the `waitIndex` set by the worker based on the most recent evaluation for that job in the state store. At that point, we no longer need to retain blocked evaluations in the broker that are older than that index. Move all but the highest priority / highest `ModifyIndex` blocked eval into a canceled set. When the `Eval.Ack` RPC returns from the eval broker it will signal a reap of a batch of cancelable evals to write to raft. This paces the cancelations limited by how frequently the schedulers are acknowledging evals; this should reduce the risk of cancelations from overwhelming raft relative to scheduler progress. In order to avoid straggling batches when the cluster is quiet, we also include a periodic sweep through the cancelable list.	2022-11-16 16:10:11 -05:00
Seth Hoenig	b3ea68948b	build: run gofmt on all go source files Go 1.19 will forecefully format all your doc strings. To get this out of the way, here is one big commit with all the changes gofmt wants to make.	2022-08-16 11:14:11 -05:00
James Rasell	181b247384	core: allow pausing and un-pausing of leader broker routine (#13045 ) * core: allow pause/un-pause of eval broker on region leader. * agent: add ability to pause eval broker via scheduler config. * cli: add operator scheduler commands to interact with config. * api: add ability to pause eval broker via scheduler config * e2e: add operator scheduler test for eval broker pause. * docs: include new opertor scheduler CLI and pause eval API info.	2022-07-06 16:13:48 +02:00
Seth Hoenig	db2347a86c	cleanup: prevent leaks from time.After This PR replaces use of time.After with a safe helper function that creates a time.Timer to use instead. The new function returns both a time.Timer and a Stop function that the caller must handle. Unlike time.NewTimer, the helper function does not panic if the duration set is <= 0.	2022-02-02 14:32:26 -06:00
Joel May	a8fc048151	Emit metrics for eval waitUntil as nomad.nomad.broker.eval_waiting (#10236 )	2022-01-06 15:57:40 -05:00
Michael Schurter	d7e123d7cd	test: fix fake by increasing time window Test originally only had a 10ms time window tolerance. Increased to 100ms and also improved assertions and docstrings.	2021-09-28 12:22:59 -07:00
Michael Schurter	9732bc37ff	nomad: refactor waitForIndex into SnapshotAfter Generalize wait for index logic in the state store for reuse elsewhere. Also begin plumbing in a context to combine handling of timeouts and shutdown.	2019-05-17 13:30:23 -07:00
Danielle Lancashire	d9815888ed	evalbroker: Simplify nextDelayedEval locking	2019-05-14 14:06:27 +02:00
Danielle Lancashire	38562afbc1	evalbroker: No new enqueues when disabled Currently when an evalbroker is disabled, it still recieves delayed enqueues via log application in the fsm. This causes an ever growing heap of evaluations that will never be drained, and can cause memory issues in larger clusters, or when left running for an extended period of time without a leader election. This commit prevents the enqueuing of evaluations while we are disabled, and relies on the leader restoreEvals routine to handle reconciling state during a leadership transition. Existing dequeues during an Enabled->Disabled broker state transition are handled by the enqueueLocked function dropping evals.	2019-05-14 13:59:10 +02:00
Danielle Lancashire	c91ae21a6c	evalbroker: Flush within update lock Primarily a cleanup commit, however, currently there is a potential race condition (that I'm not sure we've ever actually hit) during a flapping SetEnabled/Disabled state where we may never correctly restart the eval broker, if it was being called from multiple routines.	2019-05-14 13:26:56 +02:00
Michael Schurter	004fa574cb	test: fix race in eval broker update chan Similar to previous commits the delayed eval update chan was set and access from different goroutines causing a race. Passing the chan on the stack resolves the race. Race output from `go test -race -run 'Server_RPC$'` in nomad/ ``` ================== WARNING: DATA RACE Write at 0x00c000339150 by goroutine 63: github.com/hashicorp/nomad/nomad.(EvalBroker).flush() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/eval_broker.go:708 +0x3dc github.com/hashicorp/nomad/nomad.(EvalBroker).SetEnabled() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/eval_broker.go:174 +0xc4 github.com/hashicorp/nomad/nomad.(Server).revokeLeadership() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/leader.go:718 +0x1fd github.com/hashicorp/nomad/nomad.(Server).leaderLoop() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/leader.go:122 +0x95d github.com/hashicorp/nomad/nomad.(Server).monitorLeadership.func1() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/leader.go:72 +0x6c Previous read at 0x00c000339150 by goroutine 73: github.com/hashicorp/nomad/nomad.(EvalBroker).runDelayedEvalsWatcher() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/eval_broker.go:771 +0x176 Goroutine 63 (running) created at: github.com/hashicorp/nomad/nomad.(Server).monitorLeadership() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/leader.go:70 +0x269 Goroutine 73 (running) created at: github.com/hashicorp/nomad/nomad.(EvalBroker).SetEnabled() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/eval_broker.go:170 +0x173 github.com/hashicorp/nomad/nomad.(Server).establishLeadership() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/leader.go:207 +0x355 github.com/hashicorp/nomad/nomad.(Server).leaderLoop() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/leader.go:117 +0x82e github.com/hashicorp/nomad/nomad.(*Server).monitorLeadership.func1() /home/schmichael/go/src/github.com/hashicorp/nomad/nomad/leader.go:72 +0x6c ================== ```	2018-12-19 15:48:02 -08:00
Preetha Appan	4f8e925b54	Move topk and delay heap to separate packages under lib	2018-09-04 16:10:11 -05:00
Preetha Appan	9e29cfee76	Use readlock	2018-09-04 11:45:05 -05:00
Preetha Appan	062e5f1898	Use eval broker lock when reading/modifying delay heap	2018-08-31 10:59:48 -05:00
Charlie Voiselle	ba88f00ccb	Changed "til" to "until" Should be "till" or "until"; chose "until" because it is unambiguous as to meaning.	2018-04-11 12:36:28 -05:00
Preetha Appan	56e60e5840	Fix linting warning	2018-03-14 16:12:22 -05:00
Preetha Appan	4193015e48	Update comment	2018-03-14 16:10:32 -05:00
Preetha Appan	3e96c6c4e0	Address more code review feedback	2018-03-14 16:10:32 -05:00
Preetha Appan	9749b7a05c	Avoids unnecessary timer object creation	2018-03-14 16:10:32 -05:00
Preetha Appan	c6f333c90f	Move delayheap to lib package	2018-03-14 16:10:32 -05:00
Preetha Appan	1ab8f2b57a	Address some code review comments	2018-03-14 16:10:32 -05:00
Preetha Appan	7887f39ff4	Added a delay heap to track evals with WaitUntil set, and use in eval broker	2018-03-14 16:10:32 -05:00
Preetha Appan	9a4fcaaf34	Fix bug with not including namespace in indexing blocked evals	2018-03-13 13:23:11 -05:00
Josh Soref	6f06f0e6f2	spelling: requeuing	2018-03-11 18:43:05 +00:00
Josh Soref	d300623abe	spelling: evaluation	2018-03-11 18:01:35 +00:00
Josh Soref	f1d21bfdfe	spelling: enqueuing	2018-03-11 18:00:07 +00:00
Michael Schurter	a66c53d45a	Remove `structs` import from `api` Goes a step further and removes structs import from api's tests as well by moving GenerateUUID to its own package.	2017-09-29 10:36:08 -07:00
Alex Dadgar	84d06f6abe	Sync namespace changes	2017-09-07 17:04:21 -07:00
Alex Dadgar	a331a234d4	NewEvalBroker comment	2017-04-14 15:26:54 -07:00
Alex Dadgar	a9c8b09da8	Push to configs	2017-04-14 15:24:55 -07:00
Alex Dadgar	ef875f6dda	Delay Nack re-enqueue Add a delay when an evaluation is nacked that starts off small but compounds to a larger delay for subsequent Nacks. This creates some back pressure.	2017-04-12 13:41:40 -07:00
Alex Dadgar	fd3e469d5e	Remove requeue because it is a subset of EnqueueAll now	2016-06-24 10:14:34 -07:00
Alex Dadgar	2f8bb4b235	When enqueuing into eval broker always pass blocked eval's token	2016-06-23 22:40:22 -07:00
Sean Chittenden	1aefdb1e15	Use the correctly typed `rand.Int*` variant	2016-06-10 15:50:11 -04:00
Sean Chittenden	3a1dc9a194	Use `rand.Int*n()` where appropriate	2016-06-10 15:50:11 -04:00
Sean Chittenden	4e2835d5ff	Use the correctly typed `rand.Int*` variant	2016-06-10 15:48:36 -04:00
Sean Chittenden	66b4b2a99f	Use `rand.Int*n()` where appropriate	2016-06-10 15:48:36 -04:00
Alex Dadgar	060318845f	Comments addressed	2016-05-31 11:39:03 -07:00
Alex Dadgar	1f9f015c1b	Fix race condition in which a reblocked evaluation could be dropped	2016-05-27 16:53:10 -07:00
Alex Dadgar	1c6d3e129a	EnqueueAll inserts all evaluations before unblocking dequeue calls	2016-05-18 12:13:59 -07:00
Alex Dadgar	045f7807e0	eval_broker.Enqueue no longer returns an error	2016-05-18 11:35:15 -07:00
Alex Dadgar	74726278b9	core: Pause NackTimeout while in the plan_queue as progress is being made	2016-03-04 12:59:35 -08:00
Alex Dadgar	d2e88f0116	Fix panic when Ack occurs at delivery limit	2016-02-11 11:07:18 -08:00
Alex Dadgar	74135f02a4	Blocked Eval tracker	2016-01-31 18:04:45 -08:00
Alex Dadgar	cbb1992929	Make a zero timeout for eval_broker.Dequeue() block	2015-11-23 11:59:49 -08:00
Alex Dadgar	af7d896c2a	nil-pointer dereference with empty timeout Dequeue	2015-11-20 16:49:55 -08:00
Armon Dadgar	b9bb7bdaaa	nomad: OutstandingReset returns specific errors	2015-10-23 10:22:17 -07:00

1 2

65 Commits