Commit Graph

20 Commits

Author SHA1 Message Date
Tim Gross 74486d86fb
scheduler: prevent panic in spread iterator during alloc stop
The spread iterator can panic when processing an evaluation, resulting
in an unrecoverable state in the cluster. Whenever a panicked server
restarts and quorum is restored, the next server to dequeue the
evaluation will panic.

To trigger this state:
* The job must have `max_parallel = 0` and a `canary >= 1`.
* The job must not have a `spread` block.
* The job must have a previous version.
* The previous version must have a `spread` block and at least one
  failed allocation.

In this scenario, the desired changes include `(place 1+) (stop
1+), (ignore n) (canary 1)`. Before the scheduler can place the canary
allocation, it tries to find out which allocations can be
stopped. This passes back through the stack so that we can determine
previous-node penalties, etc. We call `SetJob` on the stack with the
previous version of the job, which will include assessing the `spread`
block (even though the results are unused). The task group spread info
state from that pass through the spread iterator is not reset when we
call `SetJob` again. When the new job version iterates over the
`groupPropertySets`, it will get an empty `spreadAttributeMap`,
resulting in an unexpected nil pointer dereference.

This changeset resets the spread iterator internal state when setting
the job, logging with a bypass around the bug in case we hit similar
cases, and a test that panics the scheduler without the patch.
2022-02-09 19:53:06 -05:00
Preetha Appan afff27b69b More error->debug for logging in the bin packing iterator 2019-12-12 15:50:16 -06:00
Preetha Appan 374eee421f
Fix comment and assert score in test case 2019-05-15 12:35:57 -05:00
Nick Ethier f0b9f8e37a
fix missing brace 2019-05-15 13:02:04 -04:00
Nick Ethier 0d851b5d11
scheduler: add check to prohibit returning inf during spread boost calculation 2019-05-15 13:00:24 -04:00
Alex Dadgar 41265d4d61 Change types of weights on spread/affinity 2019-01-30 12:20:38 -08:00
Alex Dadgar 3c19d01d7a server 2018-09-15 16:23:13 -07:00
Preetha Appan 9bc0962527
Track top k nodes by norm score rather than top k nodes per scorer 2018-09-04 16:10:11 -05:00
Preetha Appan 65cf4373b3
fix linting error 2018-09-04 16:10:11 -05:00
Preetha Appan dd5fe6373f
Fix scoring logic for uneven spread to incorporate current alloc count
Also addressed other small code review comments
2018-09-04 16:10:11 -05:00
Preetha Appan e72c0fe527
more cleanup 2018-09-04 16:10:11 -05:00
Preetha Appan 4c624424e6
added some unit tests for -1 spread score 2018-09-04 16:10:11 -05:00
Preetha Appan 92d37acc2a
comment and formatting cleanup 2018-09-04 16:10:11 -05:00
Preetha Appan 7b0a27cad6
fix scoring algorithm when min count == current count 2018-09-04 16:10:11 -05:00
Preetha Appan bad075f640
Remove hardcoded boosts for even spread.
instead, calculate them based on delta between current and minimum value
2018-09-04 16:10:11 -05:00
Preetha Appan c56873ff37
Implement support for even spread across datacenters, with unit test 2018-09-04 16:10:11 -05:00
Preetha Appan d091c00dd3
Support implicit spread target to account for remaining desired counts 2018-09-04 16:10:11 -05:00
Preetha Appan 33779abe5f
fix comments 2018-09-04 16:10:11 -05:00
Preetha Appan 55f276c189
Include spreads configured at job level when precomputing weights/desired counts. 2018-09-04 16:10:11 -05:00
Preetha Appan db0d95b09c
Implement spread iterator that scores according to percentage of desired count in each target.
Added this as a new step in the stack and some unit tests
2018-09-04 16:10:11 -05:00