If processing a specific evaluation causes the scheduler (and
therefore the entire server) to panic, that evaluation will never
get a chance to be nack'd and cleared from the state store. It will
get dequeued by another scheduler, causing that server to panic, and
so forth until all servers are in a panic loop. This prevents the
operator from intervening to remove the evaluation or update the
state.
Recover the goroutine from the top-level `Process` methods for each
scheduler so that this condition can be detected without panicking the
server process. This will lead to a loop of recovering the scheduler
goroutine until the eval can be removed or nack'd, but that's much
better than taking a downtime.
In the system scheduler, if a subset of clients are filtered by class,
we hit a code path where the `AllocMetric` has been copied, but the
`Copy` method does not instantiate the various maps. This leads to an
assignment to a nil map. This changeset ensures that the maps are
non-nil before continuing.
The `Copy` method relies on functions in the `helper` package that all
return nil slices or maps when passed zero-length inputs. This
changeset to fix the panic bug intentionally defers updating those
functions because it'll have potential impact on memory usage. See
https://github.com/hashicorp/nomad/issues/11564 for more details.
The system scheduler should leave allocs on draining nodes as-is, but
stop node stop allocs on nodes that are no longer part of the job
datacenters.
Previously, the scheduler did not make the distinction and left system
job allocs intact if they are already running.
I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661 .
Fixes https://github.com/hashicorp/nomad/issues/11373
When a system or sysbatch job specify constraints that none of the
current nodes meet, report a warning to the user.
Also, for sysbatch job, mark the job as dead as a result.
A sample run would look like:
```
$ nomad job run ./example.nomad
==> 2021-08-31T16:57:35-04:00: Monitoring evaluation "b48e8882"
2021-08-31T16:57:35-04:00: Evaluation triggered by job "example"
==> 2021-08-31T16:57:36-04:00: Monitoring evaluation "b48e8882"
2021-08-31T16:57:36-04:00: Evaluation status changed: "pending" -> "complete"
==> 2021-08-31T16:57:36-04:00: Evaluation "b48e8882" finished with status "complete" but failed to place all allocations:
2021-08-31T16:57:36-04:00: Task Group "cache" (failed to place 1 allocation):
* Constraint "${meta.tag} = bar": 2 nodes excluded by filter
* Constraint "${attr.kernel.name} = linux": 1 nodes excluded by filter
$ nomad job status example
ID = example
Name = example
Submit Date = 2021-08-31T16:57:35-04:00
Type = sysbatch
Priority = 50
Datacenters = dc1
Namespace = default
Status = dead
Periodic = false
Parameterized = false
Summary
Task Group Queued Starting Running Failed Complete Lost
cache 0 0 0 0 0 0
Allocations
No allocations placed
```
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes#2527