open-nomad/website/content/docs/job-specification/spread.mdx
Tim Gross b0c3b99b03
scheduler: fix quadratic performance with spread blocks (#11712)
When the scheduler picks a node for each evaluation, the
`LimitIterator` provides at most 2 eligible nodes for the
`MaxScoreIterator` to choose from. This keeps scheduling fast while
producing acceptable results because the results are binpacked.

Jobs with a `spread` block (or node affinity) remove this limit in
order to produce correct spread scoring. This means that every
allocation within a job with a `spread` block is evaluated against
_all_ eligible nodes. Operators of large clusters have reported that
jobs with `spread` blocks that are eligible on a large number of nodes
can take longer than the nack timeout to evaluate (60s). Typical
evaluations are processed in milliseconds.

In practice, it's not necessary to evaluate every eligible node for
every allocation on large clusters, because the `RandomIterator` at
the base of the scheduler stack produces enough variation in each pass
that the likelihood of an uneven spread is negligible. Note that
feasibility is checked before the limit, so this only impacts the
number of _eligible_ nodes available for scoring, not the total number
of nodes.

This changeset sets the iterator limit for "large" `spread` block and
node affinity jobs to be equal to the number of desired
allocations. This brings an example problematic job evaluation down
from ~3min to ~10s. The included tests ensure that we have acceptable
spread results across a variety of large cluster topologies.
2021-12-21 10:10:01 -05:00

168 lines
5.5 KiB
Plaintext

---
layout: docs
page_title: spread Stanza - Job Specification
description: >-
The "spread" stanza is used to spread placements across a certain node
attributes such as datacenter.
Spread may be specified at the job or group levels for ultimate flexibility.
More than one spread stanza may be specified with relative weights between
each.
---
# `spread` Stanza
<Placement
groups={[
['job', 'spread'],
['job', 'group', 'spread'],
]}
/>
The `spread` stanza allows operators to increase the failure tolerance of their
applications by specifying a node attribute that allocations should be spread
over. This allows operators to spread allocations over attributes such as
datacenter, availability zone, or even rack in a physical datacenter. By
default, when using spread the scheduler will attempt to place allocations
equally among the available values of the given target.
```hcl
job "docs" {
# Spread allocations over all datacenter
spread {
attribute = "${node.datacenter}"
}
group "example" {
# Spread allocations over each rack based on desired percentage
spread {
attribute = "${meta.rack}"
target "r1" {
percent = 60
}
target "r2" {
percent = 40
}
}
}
}
```
Nodes are scored according to how closely they match the desired target percentage defined in the
spread stanza. Spread scores are combined with other scoring factors such as bin packing.
A job or task group can have more than one spread criteria, with weights to express relative preference.
Spread criteria are treated as a soft preference by the Nomad
scheduler. If no nodes match a given spread criteria, placement is
still successful. To avoid scoring every node for every placement,
allocations may not be perfectly spread. Spread works best on
attributes with similar number of nodes: identically configured racks
or similarly configured datacenters.
Spread may be expressed on [attributes][interpolation] or [client metadata][client-meta].
Additionally, spread may be specified at the [job][job] and [group][group] levels for ultimate flexibility. Job level spread criteria are inherited by all task groups in the job.
## `spread` Parameters
- `attribute` `(string: "")` - Specifies the name or reference of the attribute
to use. This can be any of the [Nomad interpolated
values](/docs/runtime/interpolation#interpreted_node_vars).
- `target` <code>([target](#target-parameters): &lt;required&gt;)</code> - Specifies one or more target
percentages for each value of the `attribute` in the spread stanza. If this is omitted,
Nomad will spread allocations evenly across all values of the attribute.
- `weight` `(integer:0)` - Specifies a weight for the spread stanza. The weight is used
during scoring and must be an integer between 0 to 100. Weights can be used
when there is more than one spread or affinity stanza to express relative preference across them.
## `target` Parameters
- `value` `(string:"")` - Specifies a target value of the attribute from a `spread` stanza.
- `percent` `(integer:0)` - Specifies the percentage associated with the target value.
## `spread` Examples
The following examples show different ways to use the `spread` stanza.
### Even Spread Across Data Center
This example shows a spread stanza across the node's `datacenter` attribute. If we have
two datacenters `us-east1` and `us-west1`, and a task group of `count = 10`,
Nomad will attempt to place 5 allocations in each datacenter.
```hcl
spread {
attribute = "${node.datacenter}"
weight = 100
}
```
### Spread With Target Percentages
This example shows a spread stanza that specifies one target percentage. If we
have three datacenters `us-east1`, `us-east2`, and `us-west1`, and a task group
of `count = 10`, Nomad will attempt to place place 5 of the allocations in "us-east1",
and will spread the remaining among the other two datacenters.
```hcl
spread {
attribute = "${node.datacenter}"
weight = 100
target "us-east1" {
percent = 50
}
}
```
This example shows a spread stanza that specifies target percentages for two
different datacenters. If we have two datacenters `us-east1` and `us-west1`,
and a task group of `count = 10`, Nomad will attempt to place 6 allocations
in `us-east1` and 4 in `us-west1`.
```hcl
spread {
attribute = "${node.datacenter}"
weight = 100
target "us-east1" {
percent = 60
}
target "us-west1" {
percent = 40
}
}
```
### Spread Across Multiple Attributes
This example shows spread stanzas with multiple attributes. Consider a Nomad cluster
where there are two datacenters `us-east1` and `us-west1`, and each datacenter has nodes
with `${meta.rack}` being `r1` or `r2`. With the following spread stanza used on a job with `count=12`, Nomad
will attempt to place 6 allocations in each datacenter. Within a datacenter, Nomad will
attempt to place 3 allocations in nodes on rack `r1`, and 3 allocations in nodes on rack `r2`.
```hcl
spread {
attribute = "${node.datacenter}"
weight = 50
}
spread {
attribute = "${meta.rack}"
weight = 50
}
```
[job]: /docs/job-specification/job 'Nomad job Job Specification'
[group]: /docs/job-specification/group 'Nomad group Job Specification'
[client-meta]: /docs/configuration/client#meta 'Nomad meta Job Specification'
[task]: /docs/job-specification/task 'Nomad task Job Specification'
[interpolation]: /docs/runtime/interpolation 'Nomad interpolation'
[node-variables]: /docs/runtime/interpolation#node-variables- 'Nomad interpolation-Node variables'
[constraint]: /docs/job-specification/constraint 'Nomad Constraint job Specification'