If an alloc is being preempted and marked as evict, but the underlying
node is lost before the migration takes place, the allocation currently
stays as desired evict, status running forever, or until the node comes
back online.
This commit updates updateNonTerminalAllocsToLost to check for a
destired status of Evict as well as Stop when updating allocations on
tainted nodes.
switch to table test for lost node cases
We currently log an error if preemption is unable to find a suitable set of
allocations to preempt. This commit changes that to debug level since not finding
preemptable allocations is not an error condition.
Fixes#5856
When the scheduler looks for a placement for an allocation that's
replacing another allocation, it's supposed to penalize the previous
node if the allocation had been rescheduled or failed. But we're
currently always penalizing the node, which leads to unnecessary
migrations on job update.
This commit leaves in place the existing behavior where if the
previous alloc was itself rescheduled, its previous nodes are also
penalized. This is conservative but the right behavior especially on
larger clusters where a group of hosts might be having correlated
trouble (like an AZ failure).
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
Fixes#6787
In ProposedAllocs the proposed alloc slice was being copied while its
contents were not. Since RemoveAllocs nils elements of the proposed
alloc slice and is called twice, it could panic on the second call when
erroneously accessing a nil'd alloc.
The fix is to not copy the proposed alloc slice and pass the slice
returned by the 1st RemoveAllocs call to the 2nd call, thus maintaining
the trimmed length.
The existing version constraint uses logic optimized for package
managers, not schedulers, when checking prereleases:
- 1.3.0-beta1 will *not* satisfy ">= 0.6.1"
- 1.7.0-rc1 will *not* satisfy ">= 1.6.0-beta1"
This is due to package managers wishing to favor final releases over
prereleases.
In a scheduler versions more often represent the earliest release all
required features/APIs are available in a system. Whether the constraint
or the version being evaluated are prereleases has no impact on
ordering.
This commit adds a new constraint - `semver` - which will use Semver
v2.0 ordering when evaluating constraints. Given the above examples:
- 1.3.0-beta1 satisfies ">= 0.6.1" using `semver`
- 1.7.0-rc1 satisfies ">= 1.6.0-beta1" using `semver`
Since existing jobspecs may rely on the old behavior, a new constraint
was added and the implicit Consul Connect and Vault constraints were
updated to use it.
Fixes documentation inaccuracy for spread stanza placement. Spreads can
only exist on the top level job struct or within a group.
comment about nil assumption
Adds checks for affinity and constraint changes when determining if we
should update inplace.
refactor to check all levels at once
check for spread changes when checking inplace update
Currently, using a Volume in a job uses the following configuration:
```
volume "alias-name" {
type = "volume-type"
read_only = true
config {
source = "host_volume_name"
}
}
```
This commit migrates to the following:
```
volume "alias-name" {
type = "volume-type"
source = "host_volume_name"
read_only = true
}
```
The original design was based due to being uncertain about the future of storage
plugins, and to allow maxium flexibility.
However, this causes a few issues, namely:
- We frequently need to parse this configuration during submission,
scheduling, and mounting
- It complicates the configuration from and end users perspective
- It complicates the ability to do validation
As we understand the problem space of CSI a little more, it has become
clear that we won't need the `source` to be in config, as it will be
used in the majority of cases:
- Host Volumes: Always need a source
- Preallocated CSI Volumes: Always needs a source from a volume or claim name
- Dynamic Persistent CSI Volumes*: Always needs a source to attach the volumes
to for managing upgrades and to avoid dangling.
- Dynamic Ephemeral CSI Volumes*: Less thought out, but `source` will probably point
to the plugin name, and a `config` block will
allow you to pass meta to the plugin. Or will
point to a pre-configured ephemeral config.
*If implemented
The new design simplifies this by merging the source into the volume
stanza to solve the above issues with usability, performance, and error
handling.
When checking driver feasability for an alloc with multiple drivers, we
must check that all drivers are detected and healthy.
Nomad 0.9 and 0.8 have a bug where we may check a single driver only,
but which driver is dependent on map traversal order, which is
unspecified in golang spec.
When a Client declares a volume is ReadOnly, we should only schedule it
for requests for ReadOnly volumes. This change means that if a host
exposes a readonly volume, we then validate that the group level
requests for the volume are all read only for that host.
Also includes unit tests for binpacker and preemption.
The tests verify that network resources specified at the
task group level are properly accounted for
Adds a new Prerun and Postrun hooks to manage set up of network namespaces
on linux. Work still needs to be done to make the code platform agnostic and
support Docker style network initalization.
When an alloc is due to be rescheduleLater, it goes through the
reconciler twice: once to be ignored with a follow up evals, and once
again when processing the follow up eval where they appear as
rescheduleNow.
Here, we ignore them in the first run and mark them as stopped in second
iteration; rather than stop them twice.
Currently, when an alloc fails and is rescheduled, the alloc desired
state remains as "run" and the nomad client may not free the resources.
Here, we ensure that an alloc is marked as stopped when it's
rescheduled.
Notice the Desired Status and Description before and after this change:
Before:
```
mars-2:nomad notnoop$ nomad alloc status 02aba49e
ID = 02aba49e
Eval ID = bb9ed1d2
Name = example-reschedule.nodes[0]
Node ID = 5853d547
Node Name = mars-2.local
Job ID = example-reschedule
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 10s ago
Modified = 5s ago
Replacement Alloc ID = d6bf872b
Task "payload" is "dead"
Task Resources
CPU Memory Disk Addresses
0/100 MHz 24 MiB/300 MiB 300 MiB
Task Events:
Started At = 2019-06-06T21:12:45Z
Finished At = 2019-06-06T21:12:50Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2019-06-06T17:12:50-04:00 Not Restarting Policy allows no restarts
2019-06-06T17:12:50-04:00 Terminated Exit Code: 1
2019-06-06T17:12:45-04:00 Started Task started by client
2019-06-06T17:12:45-04:00 Task Setup Building Task Directory
2019-06-06T17:12:45-04:00 Received Task received by client
```
After:
```
ID = 5001ccd1
Eval ID = 53507a02
Name = example-reschedule.nodes[0]
Node ID = a3b04364
Node Name = mars-2.local
Job ID = example-reschedule
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = stop
Desired Description = alloc was rescheduled because it failed
Created = 13s ago
Modified = 3s ago
Replacement Alloc ID = 7ba7ac20
Task "payload" is "dead"
Task Resources
CPU Memory Disk Addresses
21/100 MHz 24 MiB/300 MiB 300 MiB
Task Events:
Started At = 2019-06-06T21:22:50Z
Finished At = 2019-06-06T21:22:55Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2019-06-06T17:22:55-04:00 Not Restarting Policy allows no restarts
2019-06-06T17:22:55-04:00 Terminated Exit Code: 1
2019-06-06T17:22:50-04:00 Started Task started by client
2019-06-06T17:22:50-04:00 Task Setup Building Task Directory
2019-06-06T17:22:50-04:00 Received Task received by client
```
Fix `TestServiceSched_NodeDown` for checking that the migrated allocs
are actually marked to be stopped.
The boolean logic in test made it skip actually checking client status
as long as desired status was stop.
Here, we mark some jobs for migration while leaving others as running,
and we check that lost flag is only set for non-migrated allocs.