Cluster operators want to have better control over memory
oversubscription and may want to enable/disable it based on their
experience.
This PR adds a scheduler configuration field to control memory
oversubscription. It's an additional field that can be set in the [API via Scheduler Config](https://www.nomadproject.io/api-docs/operator/scheduler), or [the agent server config](https://www.nomadproject.io/docs/configuration/server#configuring-scheduler-config).
I opted to make memory oversubscription opt-in, but I'm happy to change it. To enable it, operators should call the API with:
```json
{
"MemoryOversubscriptionEnabled": true
}
```
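For programmatic control, the same field can be flipped through the Go `api` client. A minimal sketch, assuming a Nomad 1.1+ `api` package in which `SchedulerConfiguration` carries the new field:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Read the current scheduler configuration first so unrelated
	// fields (preemption, scheduler algorithm) are not clobbered.
	resp, _, err := client.Operator().SchedulerGetConfiguration(nil)
	if err != nil {
		log.Fatal(err)
	}

	conf := resp.SchedulerConfig
	conf.MemoryOversubscriptionEnabled = true

	if _, _, err := client.Operator().SchedulerSetConfiguration(conf, nil); err != nil {
		log.Fatal(err)
	}
	fmt.Println("memory oversubscription enabled")
}
```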
If memory oversubscription is disabled, jobs that specify `memory_max` will be
accepted with a "Memory oversubscription is not enabled" warning, but their
tasks will not be allowed to use the additional memory.
The warning looks like this:
```
$ nomad job run /tmp/j
Job Warnings:
1 warning(s):
* Memory oversubscription is not enabled; Task cache.redis memory_max value will be ignored
==> Monitoring evaluation "7c444157"
Evaluation triggered by job "example"
==> Monitoring evaluation "7c444157"
Evaluation within deployment: "9d826f13"
Allocation "aa5c3cad" created: node "9272088e", group "cache"
Evaluation status changed: "pending" -> "complete"
==> Evaluation "7c444157" finished with status "complete"
# then you can examine the Alloc AllocatedResources to validate whether the task is allowed to exceed memory:
$ nomad alloc status -json aa5c3cad | jq '.AllocatedResources.Tasks["redis"].Memory'
{
"MemoryMB": 256,
"MemoryMaxMB": 0
}
```
Add a new driver capability: RemoteTasks.
When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.
This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.
See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
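For driver authors, advertising the capability is a matter of setting the new flag. A minimal sketch modeled on the `plugins/drivers` package (the `Driver` type here is a hypothetical skeleton, not the ECS driver's actual code):

```go
package ecsdriver

import "github.com/hashicorp/nomad/plugins/drivers"

// Driver is a hypothetical skeleton for a remote task driver.
type Driver struct{}

// Capabilities advertises RemoteTasks so that the TaskHandle is
// propagated to the servers via TaskState and carried over to any
// replacement allocation if the node goes down or is drained.
func (d *Driver) Capabilities() (*drivers.Capabilities, error) {
	return &drivers.Capabilities{
		SendSignals: false,
		Exec:        false,
		FSIsolation: drivers.FSIsolationNone,
		RemoteTasks: true,
	}, nil
}
```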
This PR adds the common OSS changes for adding support for Consul Namespaces,
which is going to be a Nomad Enterprise feature. There is no new functionality
provided by this changeset and hopefully no new bugs.
Start tracking a new MemoryMaxMB field that represents the maximum memory a task
may use on the client. This allows tasks to specify a memory reservation (used
by the scheduler when placing the task) while using excess memory available on
the client, if there is any.
This commit adds the server-side tracking for the value, and ensures that
allocation AllocatedResources fields include the value.
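For illustration, the shape of the new field, mirroring `structs.AllocatedMemoryResources` (simplified stand-in type, not the real package):

```go
package main

import "fmt"

// AllocatedMemoryResources mirrors the shape of the field being
// tracked (simplified stand-in, not the real structs package).
type AllocatedMemoryResources struct {
	MemoryMB    int64 // reservation the scheduler bin-packs against
	MemoryMaxMB int64 // ceiling the client may let the task grow to
}

func main() {
	mem := AllocatedMemoryResources{MemoryMB: 256, MemoryMaxMB: 1024}
	// Placement only accounts for the 256 MB reservation; the client
	// may let the task use up to 1024 MB if it has memory to spare.
	fmt.Printf("reserve=%dMB max=%dMB\n", mem.MemoryMB, mem.MemoryMaxMB)
}
```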
* node drain: use msgtype on txn so that events are emitted
* wip: encoding extension to add Node.Drain field back to API responses
* new approach for hiding Node.SecretID in the API, using the `json` tag (see the sketch below); documented this approach in the contributing guide
* refactored the JSON handlers with extensions
* modified event stream encoding to use the go-msgpack encoders with the extensions
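The `json` tag approach is plain Go struct tagging. An illustrative sketch (field set reduced; not the full `Node` struct):

```go
package structs

// Node sketches the tagging approach (field set reduced for
// illustration). The `json:"-"` tag keeps SecretID out of encoded
// API responses once the JSON handlers and the event stream route
// through the go-msgpack encoders that honor these tags.
type Node struct {
	ID string

	// SecretID authenticates the client and must never leak through
	// the API, so it is excluded from JSON encoding.
	SecretID string `json:"-"`
}
```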
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.
Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
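The naming scheme is simple enough to sketch; `perAllocSource` below is a hypothetical helper, not the actual client code:

```go
package main

import "fmt"

// perAllocSource is a hypothetical helper showing the naming scheme:
// with PerAlloc set, the allocation index is appended in brackets,
// so alloc index 0 requesting source "csi-vol" claims "csi-vol[0]".
func perAllocSource(source string, perAlloc bool, allocIndex uint) string {
	if !perAlloc {
		return source
	}
	return fmt.Sprintf("%s[%d]", source, allocIndex)
}

func main() {
	fmt.Println(perAllocSource("csi-vol", true, 0))  // csi-vol[0]
	fmt.Println(perAllocSource("csi-vol", false, 0)) // csi-vol
}
```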
Callers of `CSIVolumeByID` generally assume they should receive a single
volume. This potentially results in feasibility checking being performed
against the wrong volume if a volume's ID is a prefix substring of another
volume's ID (for example: "test" and "testing").
Removing the incorrect prefix matching from `CSIVolumeByID` breaks prefix
matching in the command line client. Add the required elements for prefix
matching to the commands and API.
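A sketch of the corrected lookup, simplified from the state store (function name hypothetical; the real method hangs off the state store and its transaction wrapper):

```go
package state

import (
	memdb "github.com/hashicorp/go-memdb"

	"github.com/hashicorp/nomad/nomad/structs"
)

// csiVolumeByIDExact sketches the fix: memdb's "id_prefix" index can
// return a volume whose ID merely starts with the queried ID, so the
// result must be checked for an exact match before it is used for
// feasibility checking ("test" must not match "testing").
func csiVolumeByIDExact(txn *memdb.Txn, namespace, id string) (*structs.CSIVolume, error) {
	raw, err := txn.First("csi_volumes", "id_prefix", namespace, id)
	if err != nil {
		return nil, err
	}
	if raw == nil {
		return nil, nil
	}
	vol := raw.(*structs.CSIVolume)
	if vol.ID != id {
		return nil, nil // prefix-only match; not the requested volume
	}
	return vol, nil
}
```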
When multiple CSI volumes are requested, the feasibility check could return
early for read/write volumes with free claims, even if a later volume in the
request was not feasible for any other reason (including not existing at
all). This could cause feasibility checking to pass when it should have
failed, depending on how the map of volumes happened to be ordered at runtime.
Remove the early return from the feasibility check. Add a test to verify that
missing volumes in the map will cause a failure; because of the random map
ordering the test will not catch a regression on every run, but any regression
will be caught over the course of several CI runs.
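A sketch of the corrected loop shape (types and names hypothetical, not the scheduler's actual feasibility checker):

```go
package feasibility

// VolumeRequest is a stand-in for the scheduler's volume request type.
type VolumeRequest struct {
	Source   string
	ReadOnly bool
}

// volumesFeasible sketches the corrected loop shape (names are
// hypothetical). The bug was an early `return true` once a read/write
// volume with free claims was found, which skipped the remaining
// volumes entirely; every requested volume must pass its check.
func volumesFeasible(requested map[string]*VolumeRequest, check func(*VolumeRequest) bool) bool {
	for _, req := range requested {
		if req == nil || !check(req) {
			return false // one infeasible (or missing) volume fails the node
		}
		// no early success return; keep checking the rest of the map
	}
	return true
}
```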
This PR fixes a bug where tasks with Connect services could be
triggered to destructively update (i.e. placed in a new alloc)
when no update should be necessary.
Fixes #10077
* Persist shared allocated ports for inplace update
Ports were not copied over when performing inplace updates in the
generic scheduler
* changelog
* drop spew
AllocatedSharedResources were not being copied over to the new
allocation struct the scheduler makes during inplace updates. This
caused downstream issues after the plan was applied, namely that the
shared ports were dropped, causing issues with service
registration/deregistration (see the sketch below).
* test that shared ports are preserved
* changelog; also carry over shared network
* copy networks
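A sketch of the fix's essence (helper name hypothetical; the real change lives in the generic scheduler's inplace-update path):

```go
package scheduler

import "github.com/hashicorp/nomad/nomad/structs"

// carryOverShared sketches the fix (helper name hypothetical): when
// the generic scheduler builds the allocation for an inplace update,
// the group-level shared resources (ports, networks) must be copied
// from the existing allocation, or the plan applier downstream loses
// the allocated ports.
func carryOverShared(existing, updated *structs.Allocation) {
	if existing.AllocatedResources == nil || updated.AllocatedResources == nil {
		return
	}
	updated.AllocatedResources.Shared = existing.AllocatedResources.Shared
}
```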
This PR enables users of Nomad < 0.12 to upgrade to Nomad 0.12
and beyond. Nomad 0.12 introduced a network fingerprinter for
bridge networks, which is checked as a constraint when a bridge
network is used. If users upgrade servers first, as is
recommended, suddenly no clients running older versions of Nomad
will satisfy the bridge network resource constraint. Instead,
this change only enforces the constraint if the Nomad client
version is also >= 0.12.
Closes #8423
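A sketch of the version gate, assuming `hashicorp/go-version` for the comparison (helper name hypothetical):

```go
package scheduler

import version "github.com/hashicorp/go-version"

var minBridgeVersion = version.Must(version.NewVersion("0.12.0"))

// bridgeConstraintApplies sketches the gate (helper name hypothetical):
// the bridge-network fingerprint constraint is only enforced against
// clients new enough to have the fingerprinter at all.
func bridgeConstraintApplies(clientVersion string) bool {
	v, err := version.NewVersion(clientVersion)
	if err != nil {
		return true // unparseable version: keep the constraint
	}
	return v.GreaterThanOrEqual(minBridgeVersion)
}
```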
* use msgtype in upsert node
adds message type to signature for upsert node, update tests, remove placeholder method
* UpsertAllocs msg type test setup
* use upsertallocs with msg type in signature
update test usage of delete node
delete placeholder msgtype method
* add msgtype to upsert evals signature, update test call sites with test setup msg type
handle snapshot upsert eval outside of FSM and ignore eval event
remove placeholder upsertevalsmsgtype
handle job plan rpc and prevent event creation for plan
msgtype cleanup upsertnodeevents
updatenodedrain msgtype
msg type 0 is a node registration event, so set the default to the ignore type
* fix named import
* fix signature ordering on upsertnode to match
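The shape of the signature change described in this commit list, with stub types standing in for Nomad's real ones (illustrative only):

```go
package state

// Stub types standing in for Nomad's real ones (illustrative only).
type MessageType uint8
type Node struct{ ID string }
type Event struct {
	Type  MessageType
	Index uint64
}
type StateStore struct{ events chan Event }

// UpsertNode sketches the new signature: the raft message type leads
// the parameter list, so committing the write can emit an event of
// the matching type (or skip emission for ignored types).
func (s *StateStore) UpsertNode(msgType MessageType, index uint64, node *Node) error {
	// ... persist the node at this raft index ...
	s.events <- Event{Type: msgType, Index: index}
	return nil
}
```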
Fixes #9017
The ?resources=true query parameter includes resources in the object
stub listings. Specifically:
- For `/v1/nodes?resources=true` both the `NodeResources` and
`ReservedResources` field are included.
- For `/v1/allocations?resources=true` the `AllocatedResources` field is
included.
The ?task_states=false query parameter removes TaskStates from
/v1/allocations responses. (By default TaskStates are included.)
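From the Go `api` client, the parameter can be passed via `QueryOptions.Params`. A minimal sketch, assuming the `api` package's node list stubs expose the new resource fields:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Ask the node list endpoint to include resources in the stubs.
	q := &api.QueryOptions{Params: map[string]string{"resources": "true"}}
	nodes, _, err := client.Nodes().List(q)
	if err != nil {
		log.Fatal(err)
	}
	for _, n := range nodes {
		fmt.Println(n.ID, n.NodeResources)
	}
}
```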
* Node Drain events and Node Events (#8980)
Deployment status updates
handle deployment status updates (paused, failed, resume)
deployment alloc health
generate events from apply plan result
txn err check, slim down deployment event
one ndjson line per index
* consolidate down to node event + type
* fix UpdateDeploymentAllocHealth test invocations
* fix test
Fixes a bug where CSI volumes with the `MULTI_NODE_MULTI_WRITER` access mode
were using the same logic as `MULTI_NODE_SINGLE_WRITER` to determine whether
the volume had writer claims available for scheduling.
Extends the CSI claim endpoint test to exercise multi-reader access, and makes
sure `WriteFreeClaims` is exercised for multi-writer in the feasibility test.
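A sketch of the corrected claim check (simplified standalone version; the real logic lives on the CSI volume struct with typed access-mode constants):

```go
package main

import "fmt"

// writeFreeClaims sketches the corrected check (simplified; the real
// logic lives on the CSI volume struct with typed access-mode
// constants): multi-node-multi-writer volumes accept any number of
// writers, so they always have a free write claim. The bug routed
// them through the single-writer branch.
func writeFreeClaims(accessMode string, writeAllocs int) bool {
	switch accessMode {
	case "single-node-writer", "multi-node-single-writer":
		return writeAllocs == 0
	case "multi-node-multi-writer":
		return true // writers are unlimited for this mode
	default:
		return false
	}
}

func main() {
	fmt.Println(writeFreeClaims("multi-node-multi-writer", 3))  // true
	fmt.Println(writeFreeClaims("multi-node-single-writer", 1)) // false
}
```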
This PR fixes a long-standing bug where submitting jobs with changes
to connect services would not trigger updates as expected. Previously,
service blocks were not considered as sources of destructive updates
since they could be synced with consul non-destructively. With Connect,
task group services that have changes to their connect block or to
the service port should be destructive, since the network plumbing of
the alloc is going to need updating.
Fixes #8596 #7991
Non-destructive half in #7192
The change was intended to fix a case where a canary alloc may fail to
be rescheduled if all the other allocs fail as well (e.g. if all allocs
happen to be placed on a node that died). However, it introduced some
unintended side-effects.
Reverting the change for now and will investigate further.
This handles the case where a job went from no-deployment to a deployment
with canaries.
Consider a case where a `max_parallel=0` job is submitted as version 0,
then an update is submitted with `max_parallel=1, canary=1` as version 1.
In this case, we will have 1 canary alloc, and all remaining allocs will
be version 0. Until the deployment is promoted, we ought to replace the
canaries with the version 0 job (which isn't associated with a deployment).
This change fixes a bug where lost/failed allocations are replaced by
allocations with the latest versions, even if the version hasn't been
promoted yet.
Now, when generating a plan for lost/failed allocations, the scheduler
first checks if the current deployment is in the canary stage, and if so, it
ensures that any lost/failed allocation is replaced with one of the latest
promoted version instead.
The reconcile loop sets `DeploymentState.DesiredCanaries` only on the first
pass through the loop, and only if the job is not paused/pending. In MRD,
deployments will make one pass through the loop while "pending", and were
never getting `DesiredCanaries` set. We can't set it in the initial
`DeploymentState` constructor because the first pass through setting up
canaries expects it to not be set yet. However, this value is static for a
given version of a job because it comes from the update stanza, so it's safe
to re-assign the value on subsequent passes.