When calculating the score in the `SpreadIterator`, the score boost is
proportional to the difference between the current and desired count. But when
there are implicit spread targets, the current count is the sum of the possible
implicit targets, which results in incorrect scoring unless there's only one
implicit target.
This changeset updates the `propertySet` struct to accept a set of explicit
target values so it can detect when a property value falls into the implicit set
and should be combined with other implicit values.
Fixes: #11823
When spread targets have a percent value of zero it's possible for them to
return -Inf scoring because of a float divide by zero. This is very hard for
operators to debug because the string "-Inf" is returned in the API and that
breaks the presentation of debugging data.
Most scoring iterators are bracketed to -1/+1, but spread iterators do not so
that they can handle greatly unbalanced scoring so we can't simply return a -1
score without generating a score that might be greater than the negative scores
set by other spread targets. Instead, track the lowest-seen spread boost and use
that as the spread boost for any cases where we'd divide by zero.
Fixes: #8863
* api: enable support for setting original source alongside job
This PR adds support for setting job source material along with
the registration of a job.
This includes a new HTTP endpoint and a new RPC endpoint for
making queries for the original source of a job. The
HTTP endpoint is /v1/job/<id>/submission?version=<version> and
the RPC method is Job.GetJobSubmission.
The job source (if submitted, and doing so is always optional), is
stored in the job_submission memdb table, separately from the
actual job. This way we do not incur overhead of reading the large
string field throughout normal job operations.
The server config now includes job_max_source_size for configuring
the maximum size the job source may be, before the server simply
drops the source material. This should help prevent Bad Things from
happening when huge jobs are submitted. If the value is set to 0,
all job source material will be dropped.
* api: avoid writing var content to disk for parsing
* api: move submission validation into RPC layer
* api: return an error if updating a job submission without namespace or job id
* api: be exact about the job index we associate a submission with (modify)
* api: reword api docs scheduling
* api: prune all but the last 6 job submissions
* api: protect against nil job submission in job validation
* api: set max job source size in test server
* api: fixups from pr
* main: remove deprecated uses of rand.Seed
go1.20 deprecates rand.Seed, and seeds the rand package
automatically. Remove cases where we seed the random package,
and cleanup the one case where we intentionally create a
known random source.
* cl: update cl
* mod: update go mod
The spread iterator can panic when processing an evaluation, resulting
in an unrecoverable state in the cluster. Whenever a panicked server
restarts and quorum is restored, the next server to dequeue the
evaluation will panic.
To trigger this state:
* The job must have `max_parallel = 0` and a `canary >= 1`.
* The job must not have a `spread` block.
* The job must have a previous version.
* The previous version must have a `spread` block and at least one
failed allocation.
In this scenario, the desired changes include `(place 1+) (stop
1+), (ignore n) (canary 1)`. Before the scheduler can place the canary
allocation, it tries to find out which allocations can be
stopped. This passes back through the stack so that we can determine
previous-node penalties, etc. We call `SetJob` on the stack with the
previous version of the job, which will include assessing the `spread`
block (even though the results are unused). The task group spread info
state from that pass through the spread iterator is not reset when we
call `SetJob` again. When the new job version iterates over the
`groupPropertySets`, it will get an empty `spreadAttributeMap`,
resulting in an unexpected nil pointer dereference.
This changeset resets the spread iterator internal state when setting
the job, logging with a bypass around the bug in case we hit similar
cases, and a test that panics the scheduler without the patch.
When the scheduler picks a node for each evaluation, the
`LimitIterator` provides at most 2 eligible nodes for the
`MaxScoreIterator` to choose from. This keeps scheduling fast while
producing acceptable results because the results are binpacked.
Jobs with a `spread` block (or node affinity) remove this limit in
order to produce correct spread scoring. This means that every
allocation within a job with a `spread` block is evaluated against
_all_ eligible nodes. Operators of large clusters have reported that
jobs with `spread` blocks that are eligible on a large number of nodes
can take longer than the nack timeout to evaluate (60s). Typical
evaluations are processed in milliseconds.
In practice, it's not necessary to evaluate every eligible node for
every allocation on large clusters, because the `RandomIterator` at
the base of the scheduler stack produces enough variation in each pass
that the likelihood of an uneven spread is negligible. Note that
feasibility is checked before the limit, so this only impacts the
number of _eligible_ nodes available for scoring, not the total number
of nodes.
This changeset sets the iterator limit for "large" `spread` block and
node affinity jobs to be equal to the number of desired
allocations. This brings an example problematic job evaluation down
from ~3min to ~10s. The included tests ensure that we have acceptable
spread results across a variety of large cluster topologies.
* use msgtype in upsert node
adds message type to signature for upsert node, update tests, remove placeholder method
* UpsertAllocs msg type test setup
* use upsertallocs with msg type in signature
update test usage of delete node
delete placeholder msgtype method
* add msgtype to upsert evals signature, update test call sites with test setup msg type
handle snapshot upsert eval outside of FSM and ignore eval event
remove placeholder upsertevalsmsgtype
handle job plan rpc and prevent event creation for plan
msgtype cleanup upsertnodeevents
updatenodedrain msgtype
msg type 0 is a node registration event, so set the default to the ignore type
* fix named import
* fix signature ordering on upsertnode to match