The allocReconciler's computeGroup function contained a significant amount of inline logic that was difficult to understand the intent of. This commit extracts inline logic into the following intention revealing subroutines. It also includes updates to the function internals also aimed at improving maintainability and renames some existing functions for the same purpose. New or renamed functions include.
Renamed functions
- handleGroupCanaries -> cancelUnneededCanaries
- handleDelayedLost -> createLostLaterEvals
- handeDelayedReschedules -> createRescheduleLaterEvals
New functions
- filterAndStopAll
- initializeDeploymentState
- requiresCanaries
- computeCanaries
- computeUnderProvisionedBy
- computeReplacements
- computeDestructiveUpdates
- computeMigrations
- createDeployment
- isDeploymentComplete
If processing a specific evaluation causes the scheduler (and
therefore the entire server) to panic, that evaluation will never
get a chance to be nack'd and cleared from the state store. It will
get dequeued by another scheduler, causing that server to panic, and
so forth until all servers are in a panic loop. This prevents the
operator from intervening to remove the evaluation or update the
state.
Recover the goroutine from the top-level `Process` methods for each
scheduler so that this condition can be detected without panicking the
server process. This will lead to a loop of recovering the scheduler
goroutine until the eval can be removed or nack'd, but that's much
better than taking a downtime.
This change modifies the Nomad job register and deregister RPCs to
accept an updated option set which includes eval priority. This
param is optional and override the use of the job priority to set
the eval priority.
In order to ensure all evaluations as a result of the request use
the same eval priority, the priority is shared to the
allocReconciler and deploymentWatcher. This creates a new
distinction between eval priority and job priority.
The Nomad agent HTTP API has been modified to allow setting the
eval priority on job update and delete. To keep consistency with
the current v1 API, job update accepts this as a payload param;
job delete accepts this as a query param.
Any user supplied value is validated within the agent HTTP handler
removing the need to pass invalid requests to the server.
The register and deregister opts functions now all for setting
the eval priority on requests.
The change includes a small change to the DeregisterOpts function
which handles nil opts. This brings the function inline with the
RegisterOpts.
The system scheduler should leave allocs on draining nodes as-is, but
stop node stop allocs on nodes that are no longer part of the job
datacenters.
Previously, the scheduler did not make the distinction and left system
job allocs intact if they are already running.
I've added a failing test first, which you can see in https://app.circleci.com/jobs/github/hashicorp/nomad/179661 .
Fixes https://github.com/hashicorp/nomad/issues/11373
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.
Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.
As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).
Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.
Closes#2527
Add a new driver capability: RemoteTasks.
When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.
This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.
See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.
Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
Fixes#9017
The ?resources=true query parameter includes resources in the object
stub listings. Specifically:
- For `/v1/nodes?resources=true` both the `NodeResources` and
`ReservedResources` field are included.
- For `/v1/allocations?resources=true` the `AllocatedResources` field is
included.
The ?task_states=false query parameter removes TaskStates from
/v1/allocations responses. (By default TaskStates are included.)
This handles the case where a job when from no-deployment to deployment
with canaries.
Consider a case where a `max_parallel=0` job is submitted as version 0,
then an update is submitted with `max_parallel=1, canary=1` as verion 1.
In this case, we will have 1 canary alloc, and all remaining allocs will
be version 0. Until the deployment is promoted, we ought to replace the
canaries with version 0 job (which isn't associated with a deployment).
This change fixes a bug where lost/failed allocations are replaced by
allocations with the latest versions, even if the version hasn't been
promoted yet.
Now, when generating a plan for lost/failed allocations, the scheduler
first checks if the current deployment is in Canary stage, and if so, it
ensures that any lost/failed allocations is replaced one with the latest
promoted version instead.
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect
* scheduler/reconcile: thread follupEvalIDs through to results.stop
* scheduler/reconcile: comment typo
* nomad/_test: correct arguments for plan.AppendStoppedAlloc
* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
* client/heartbeatstop: reversed time condition for startup grace
* scheduler/generic_sched: use `delayInstead` to avoid a loop
Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.
* scheduler/reconcile: lostLater are different than delayedReschedules
Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
* jobspec, api: add stop_after_client_disconnect
* nomad/state/state_store: error message typo
* structs: alloc methods to support stop_after_client_disconnect
1. a global AllocStates to track status changes with timestamps. We
need this to track the time at which the alloc became lost
originally.
2. ShouldClientStop() and WaitClientStop() to actually do the math
* scheduler/reconcile_util: delayByStopAfterClientDisconnect
* scheduler/reconcile: use delayByStopAfterClientDisconnect
* scheduler/util: updateNonTerminalAllocsToLost comments
This was setup to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, and not all lost allocs
have been marked stop or evict by now
* scheduler/testing: AssertEvalStatus just use require
* scheduler/generic_sched: don't create a blocked eval if delayed
* scheduler/generic_sched_test: several scheduling cases
Fixes#5856
When the scheduler looks for a placement for an allocation that's
replacing another allocation, it's supposed to penalize the previous
node if the allocation had been rescheduled or failed. But we're
currently always penalizing the node, which leads to unnecessary
migrations on job update.
This commit leaves in place the existing behavior where if the
previous alloc was itself rescheduled, its previous nodes are also
penalized. This is conservative but the right behavior especially on
larger clusters where a group of hosts might be having correlated
trouble (like an AZ failure).
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
Currently, when an alloc fails and is rescheduled, the alloc desired
state remains as "run" and the nomad client may not free the resources.
Here, we ensure that an alloc is marked as stopped when it's
rescheduled.
Notice the Desired Status and Description before and after this change:
Before:
```
mars-2:nomad notnoop$ nomad alloc status 02aba49e
ID = 02aba49e
Eval ID = bb9ed1d2
Name = example-reschedule.nodes[0]
Node ID = 5853d547
Node Name = mars-2.local
Job ID = example-reschedule
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = run
Desired Description = <none>
Created = 10s ago
Modified = 5s ago
Replacement Alloc ID = d6bf872b
Task "payload" is "dead"
Task Resources
CPU Memory Disk Addresses
0/100 MHz 24 MiB/300 MiB 300 MiB
Task Events:
Started At = 2019-06-06T21:12:45Z
Finished At = 2019-06-06T21:12:50Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2019-06-06T17:12:50-04:00 Not Restarting Policy allows no restarts
2019-06-06T17:12:50-04:00 Terminated Exit Code: 1
2019-06-06T17:12:45-04:00 Started Task started by client
2019-06-06T17:12:45-04:00 Task Setup Building Task Directory
2019-06-06T17:12:45-04:00 Received Task received by client
```
After:
```
ID = 5001ccd1
Eval ID = 53507a02
Name = example-reschedule.nodes[0]
Node ID = a3b04364
Node Name = mars-2.local
Job ID = example-reschedule
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = stop
Desired Description = alloc was rescheduled because it failed
Created = 13s ago
Modified = 3s ago
Replacement Alloc ID = 7ba7ac20
Task "payload" is "dead"
Task Resources
CPU Memory Disk Addresses
21/100 MHz 24 MiB/300 MiB 300 MiB
Task Events:
Started At = 2019-06-06T21:22:50Z
Finished At = 2019-06-06T21:22:55Z
Total Restarts = 0
Last Restart = N/A
Recent Events:
Time Type Description
2019-06-06T17:22:55-04:00 Not Restarting Policy allows no restarts
2019-06-06T17:22:55-04:00 Terminated Exit Code: 1
2019-06-06T17:22:50-04:00 Started Task started by client
2019-06-06T17:22:50-04:00 Task Setup Building Task Directory
2019-06-06T17:22:50-04:00 Received Task received by client
```
This adds a `nomad alloc stop` command that can be used to stop and
force migrate an allocation to a different node.
This is built on top of the AllocUpdateDesiredTransitionRequest and
explicitly limits the scope of access to that transition to expose it
under the alloc-lifecycle ACL.
The API returns the follow up eval that can be used as part of
monitoring in the CLI or parsed and used in an external tool.
Currently when operators need to log onto a machine where an alloc
is running they will need to perform both an alloc/job status
call and then a call to discover the node name from the node list.
This updates both the job status and alloc status output to include
the node name within the information to make operator use easier.
Closes#2359
Cloess #1180
This commit implements an allocation selection algorithm for finding
allocations to preempt. It currently special cases network resource asks
from others (cpu/memory/disk/iops).