Commit Graph

842 Commits

Author SHA1 Message Date
Tim Gross 417ec91317 scheduler: datacenter updates should be destructive
Updates to the datacenter field should be destructive for any allocation that
is on a node no longer in the list of datacenters, but inplace for any
allocation on a node that is still in the list. Add a check for this change to
the system and generic schedulers after we've checked the task definition for
updates and obtained the node for each current allocation.
2021-07-07 11:18:30 -04:00
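A minimal sketch, not the actual Nomad code, of the check described above: if the allocation's node has fallen out of the job's datacenter list, the update must be destructive; otherwise it can be done in place. The helper name is hypothetical.
```go
package sketch

type Node struct {
	ID         string
	Datacenter string
}

// requiresDestructiveUpdate returns true when the node an existing allocation
// is running on is no longer in the updated list of datacenters.
func requiresDestructiveUpdate(node *Node, datacenters []string) bool {
	for _, dc := range datacenters {
		if node.Datacenter == dc {
			return false // node is still in the list: in-place update is fine
		}
	}
	return true // node dropped from the list: replace the allocation
}
```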
Tim Gross 38a0057715
quotas: evaluate quota feasibility last in scheduler (#10753)
The `QuotaIterator` is used as the source of nodes passed into feasibility
checking for constraints. Every node that passes the quota check counts the
allocation resources against the quota, so we also count nodes that will
later be filtered out by constraints. As a result, for jobs with constraints,
nodes that are feasibility checked but fail have already been counted against
quotas. This failure mode is order dependent; if all the unfiltered nodes
happen to be quota checked first, everything works as expected.

This changeset moves the `QuotaIterator` to happen last among all feasibility
checkers (but before ranking). The `QuotaIterator` will never receive filtered
nodes so it will calculate quotas correctly.
2021-06-14 10:11:40 -04:00
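A sketch with hypothetical types of the reordering described above: the quota check runs after every other feasibility checker, so it only ever sees (and counts) nodes that have already passed constraint filtering.
```go
package sketch

type checker interface {
	Feasible(nodeID string) bool
}

func feasibleNodes(nodes []string, constraintCheckers []checker, quota checker) []string {
	var out []string
	for _, n := range nodes {
		ok := true
		for _, c := range constraintCheckers {
			if !c.Feasible(n) {
				ok = false
				break
			}
		}
		// Quota is evaluated last, just before ranking, so nodes filtered out
		// by constraints never count against the quota.
		if ok && quota.Feasible(n) {
			out = append(out, n)
		}
	}
	return out
}
```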
Mahmood Ali aa77c2731b tests: use standard library testing.TB
Glint pulled in an updated version of mitchellh/go-testing-interface
which broke some existing tests because the update added a Parallel()
method to testing.T. This switches to the standard library testing.TB
which doesn't have a Parallel() method.
2021-06-09 16:18:45 -07:00
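For illustration: helpers written against the standard library's testing.TB work with both *testing.T and *testing.B and carry no Parallel() method, which avoids the interface mismatch described above.
```go
package sketch

import "testing"

// requireNoErr is a hypothetical helper showing the testing.TB style.
func requireNoErr(t testing.TB, err error) {
	t.Helper()
	if err != nil {
		t.Fatalf("unexpected error: %v", err)
	}
}
```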
Tim Gross 37fa6850d2 scheduler: test for reconciler's in-place rollback behavior
The reconciler has some complicated behavior when there are already running
allocations from a previous version of the job that we want to keep, as
happens during a rollback. Document this behavior with a test.
2021-06-03 10:02:19 -04:00
Michael Schurter 547a718ef6
Merge pull request #10248 from hashicorp/f-remotetask-2021
core: propagate remote task handles
2021-04-30 08:57:26 -07:00
Michael Schurter 641eb1dc1a clarify docs from pr comments 2021-04-30 08:31:31 -07:00
Mahmood Ali 52d881f567
Allow configuring memory oversubscription (#10466)
Cluster operators want to have better control over memory
oversubscription and may want to enable/disable it based on their
experience.

This PR adds a scheduler configuration field to control memory
oversubscription. It's an additional field that can be set in the [API via Scheduler Config](https://www.nomadproject.io/api-docs/operator/scheduler), or [the agent server config](https://www.nomadproject.io/docs/configuration/server#configuring-scheduler-config).

I opted to have the memory oversubscription be an opt-in, but happy to change it.  To enable it, operators should call the API with:
```json
{
  "MemoryOversubscriptionEnabled": true
}
```

If memory oversubscription is disabled, submitting a job that specifies `memory_max` will produce a "Memory oversubscription is not
enabled" warning, but the job will be accepted without its tasks getting
access to the additional memory.

The warning message is like:
```
$ nomad job run /tmp/j
Job Warnings:
1 warning(s):

* Memory oversubscription is not enabled; Task cache.redis memory_max value will be ignored

==> Monitoring evaluation "7c444157"
    Evaluation triggered by job "example"
    Evaluation within deployment: "9d826f13"
    Allocation "aa5c3cad" created: node "9272088e", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7c444157" finished with status "complete"

# then you can examine the Alloc AllocatedResources to validate whether the task is allowed to exceed memory:
$ nomad alloc status -json aa5c3cad | jq '.AllocatedResources.Tasks["redis"].Memory'
{
  "MemoryMB": 256,
  "MemoryMaxMB": 0
}
```
2021-04-29 22:09:56 -04:00
Luiz Aoqui f1b9055d21
Add metrics for blocked eval resources (#10454)
* add metrics for blocked eval resources

* docs: add new blocked_evals metrics

* fix to call `pruneStats` instead of `stats.prune` directly
2021-04-29 15:03:45 -04:00
Michael Schurter e62795798d core: propagate remote task handles
Add a new driver capability: RemoteTasks.

When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.

This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.

See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
2021-04-27 15:07:03 -07:00
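A hedged sketch of how a driver might advertise the new capability; the struct here is illustrative and written from the commit description, not copied from Nomad's plugin API.
```go
package sketch

type Capabilities struct {
	// ...other driver capability fields elided...

	// RemoteTasks indicates the driver runs tasks on a remote system (for
	// example ECS) whose lifecycle is decoupled from the Nomad client node,
	// so the TaskHandle must be propagated to the servers and to any
	// replacement allocation.
	RemoteTasks bool
}
```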
Andrii Chubatiuk 712bd5f5a6
add support for host network interpolation 2021-04-13 09:53:05 -04:00
Seth Hoenig f17ba33f61 consul: plumbing for specifying consul namespace in job/group
This PR adds the common OSS changes for adding support for Consul Namespaces,
which is going to be a Nomad Enterprise feature. There is no new functionality
provided by this changeset and hopefully no new bugs.
2021-04-05 10:03:19 -06:00
Chris Baker 436d46bd19
Merge branch 'main' into f-node-drain-api 2021-04-01 15:22:57 -05:00
Mahmood Ali 0c2551270a oversubscription: Add MemoryMaxMB to internal structs
Start tracking a new MemoryMaxMB field that represents the maximum memory a task
may use on the client. This allows tasks to specify a memory reservation (to be
used by the scheduler when placing the task) but use excess memory on the
client if any is available.

This commit adds the server-side tracking for the value, and ensures that
allocations' AllocatedResources fields include it.
2021-03-30 16:55:58 -04:00
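A sketch of the relationship between the two fields described above; the real definitions live in nomad/structs, and this is only an illustration.
```go
package sketch

type AllocatedMemoryResources struct {
	// MemoryMB is the reservation the scheduler uses for placement.
	MemoryMB int64
	// MemoryMaxMB is the most the task may use on the client when excess
	// memory is available; zero means no oversubscription.
	MemoryMaxMB int64
}
```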
Nick Ethier daecfa61e6
Merge pull request #10203 from hashicorp/f-cpu-cores
Reserved Cores [1/4]: Structs and scheduler implementation
2021-03-29 14:05:54 -04:00
Chris Baker 770c9cecb5 restored Node.Sanitize() for RPC endpoints
multiple other updates from code review
2021-03-26 17:03:15 +00:00
Chris Baker dd291e69f4 removed deprecated fields from Drain structs and API
node drain: use msgtype on txn so that events are emitted
wip: encoding extension to add Node.Drain field back to API responses

new approach for hiding Node.SecretID in the API, using `json` tag
documented this approach in the contributing guide
refactored the JSON handlers with extensions
modified event stream encoding to use the go-msgpack encoders with the extensions
2021-03-21 15:30:11 +00:00
Nick Ethier b8a48bc325 scheduler: detect job change in cores resource 2021-03-19 22:25:50 -04:00
Nick Ethier 648ade63ad scheduler: implement scheduling of reserved cores 2021-03-19 00:29:07 -04:00
Tim Gross fa25e048b2
CSI: unique volume per allocation
Add a `PerAlloc` field to volume requests that directs the scheduler to test
feasibility for volumes with a source ID that includes the allocation index
suffix (ex. `[0]`), rather than the exact source ID.

Read the `PerAlloc` field when making the volume claim at the client to
determine if the allocation index suffix (ex. `[0]`) should be added to the
volume source ID.
2021-03-18 15:35:11 -04:00
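A minimal sketch, with a hypothetical helper name, of the per-allocation source naming described above: when PerAlloc is set, the allocation index suffix (e.g. `[0]`) is appended to the volume source ID before feasibility checking and claiming.
```go
package sketch

import "fmt"

func volumeSource(source string, perAlloc bool, allocIndex uint) string {
	if !perAlloc {
		return source
	}
	// e.g. "data" + index 0 -> "data[0]"
	return fmt.Sprintf("%s[%d]", source, allocIndex)
}
```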
Tim Gross 9b2b580d1a
CSI: remove prefix matching from CSIVolumeByID and fix CLI prefix matching (#10158)
Callers of `CSIVolumeByID` are generally assuming they should receive a single
volume. This potentially results in feasibility checking being performed
against the wrong volume if a volume's ID is a prefix substring of another
volume's ID (for example: "test" and "testing").

Removing the incorrect prefix matching from `CSIVolumeByID` breaks prefix
matching in the command line client. Add the required elements for prefix
matching to the commands and API.
2021-03-18 14:32:40 -04:00
Tim Gross 0e3264aa4f scheduler/csi: fix early return when multiple volumes are requested
When multiple CSI volumes are requested, the feasibility check could return
early for read/write volumes with free claims, even if a later volume in the
request was not feasible for any other reason (including not existing at
all). This could result in feasibility checking randomly failing to reject an
infeasible request, depending on how the map of volumes was ordered at runtime.

Remove the early return from the feasibility check. Add a test to verify that
missing volumes in the map will cause a failure; this test will not catch a
regression every test run because of the random map ordering, but any failure
will be caught over the course of several CI runs.
2021-03-10 15:18:36 -05:00
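A sketch (hypothetical names) of the shape of the fix: evaluate every requested volume and only report the node feasible if all of them check out, instead of returning early on the first read/write volume with free claims.
```go
package sketch

func allVolumesFeasible(requests map[string]string, check func(volID string) (bool, error)) bool {
	for _, volID := range requests {
		ok, err := check(volID)
		if err != nil || !ok {
			return false // a missing or infeasible volume fails the whole node
		}
		// Note: no early "return true" here, even if this volume is claimable.
	}
	return true
}
```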
Seth Hoenig 4f759f1cc8 consul/connect: correctly detect when connect tasks not updated
This PR fixes a bug where tasks with Connect services could be
triggered to destructively update (i.e. placed in a new alloc)
when no update should be necessary.

Fixes #10077
2021-02-23 15:12:49 -06:00
Nick Ethier dc29b679b4
Merge pull request #9937 from hashicorp/b-9728
scheduler: add tests and fix for detecting host_network and port `to` field changes
2021-02-02 13:54:41 -05:00
Nick Ethier 93095917dc scheduler: add tests and fix for detecting host_network and port `to` field changes 2021-02-01 15:56:43 -05:00
Drew Bailey 009b8d5363
Persist shared allocated ports for inplace update (#9830)
* Persist shared allocated ports for inplace update

Ports were not copied over when performing inplace updates in the
generic scheduler

* changelog

* drop spew
2021-01-15 12:45:12 -05:00
Drew Bailey c87adfac62
persist shared ports during inplace updates (#9736)
AllocatedSharedResources were not being copied over to the new
allocation struct the scheduler makes during inplace updates. This
caused downstream issues after the plan was applied; namely, the shared
ports were dropped, causing issues with service
registration/deregistration.

test that shared ports are preserved

change log, also carry over shared network

copy networks
2021-01-08 09:00:41 -05:00
Kris Hicks 0cf9cae656
Apply some suggested fixes from staticcheck (#9598) 2020-12-10 07:29:18 -08:00
Kris Hicks 0a3a748053
Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
Kris Hicks 62972cc839
scheduler: Fix always-false sort func (#9547)
Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
2020-12-08 09:57:47 -08:00
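For illustration only (not the code from the PR): a sort.Slice less function must be able to distinguish its two arguments, e.g. ordering ranked nodes by score descending; a comparison that can never return true leaves the slice unsorted.
```go
package sketch

import "sort"

type rankedNode struct {
	ID    string
	Score float64
}

// sortByScoreDesc orders nodes from highest to lowest score.
func sortByScoreDesc(nodes []rankedNode) {
	sort.Slice(nodes, func(i, j int) bool {
		return nodes[i].Score > nodes[j].Score
	})
}
```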
Nick Ethier d21cbeb30f command: remove task network usage from init examples 2020-11-23 10:25:11 -06:00
Seth Hoenig 6b89527505 scheduler: enable upgrade path for bridge network finger print
This PR enables users of Nomad < 0.12 to upgrade to Nomad 0.12
and beyond. Nomad 0.12 introduced a network fingerprinter for
bridge networks, which is checked as a constraint whenever bridge
networking is being used. If users upgrade servers first, as is
recommended, suddenly no clients running older versions of Nomad
will satisfy the bridge network resource constraint. Instead,
this change only enforces the constraint if the Nomad client
version is also >= 0.12.

Closes #8423
2020-11-13 14:17:01 -06:00
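A hedged sketch of the upgrade-path gate described above, using hashicorp/go-version: the bridge fingerprint constraint is only enforced against clients running Nomad 0.12 or later. The helper name is hypothetical.
```go
package sketch

import version "github.com/hashicorp/go-version"

var minBridgeVersion = version.Must(version.NewVersion("0.12.0"))

// enforceBridgeConstraint reports whether the bridge fingerprint constraint
// should be applied to a client advertising the given version.
func enforceBridgeConstraint(clientVersion string) bool {
	v, err := version.NewVersion(clientVersion)
	if err != nil {
		return false // unparseable version: skip enforcement
	}
	return !v.LessThan(minBridgeVersion)
}
```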
Drew Bailey 6c788fdccd
Events/msgtype cleanup (#9117)
* use msgtype in upsert node

adds message type to signature for upsert node, update tests, remove placeholder method

* UpsertAllocs msg type test setup

* use upsertallocs with msg type in signature

update test usage of delete node

delete placeholder msgtype method

* add msgtype to upsert evals signature, update test call sites with test setup msg type

handle snapshot upsert eval outside of FSM and ignore eval event

remove placeholder upsertevalsmsgtype

handle job plan rpc and prevent event creation for plan

msgtype cleanup upsertnodeevents

updatenodedrain msgtype

msg type 0 is a node registration event, so set the default to the ignore type

* fix named import

* fix signature ordering on upsertnode to match
2020-10-19 09:30:15 -04:00
Michael Schurter dd09fa1a4a
Merge pull request #9055 from hashicorp/f-9017-resources
api: add field filters to /v1/{allocations,nodes}
2020-10-14 14:49:39 -07:00
Michael Schurter 8ccbd92cb6 api: add field filters to /v1/{allocations,nodes}
Fixes #9017

The ?resources=true query parameter includes resources in the object
stub listings. Specifically:

- For `/v1/nodes?resources=true` both the `NodeResources` and
  `ReservedResources` field are included.
- For `/v1/allocations?resources=true` the `AllocatedResources` field is
  included.

The ?task_states=false query parameter removes TaskStates from
/v1/allocations responses. (By default TaskStates are included.)
2020-10-14 10:35:22 -07:00
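Not part of the PR: a small client-side example of the query parameter described above, listing node stubs with resources included. Assumes a local Nomad agent on the default address.
```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	resp, err := http.Get("http://127.0.0.1:4646/v1/nodes?resources=true")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var stubs []map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&stubs); err != nil {
		panic(err)
	}
	for _, s := range stubs {
		// NodeResources is present only because resources=true was requested.
		fmt.Println(s["ID"], s["NodeResources"] != nil)
	}
}
```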
Drew Bailey b4c135358d
use Events to wrap index and events, store in events table 2020-10-14 12:44:39 -04:00
Drew Bailey 9d48818eb8
writetxn can return error, add alloc and job generic events. Add events
table for durability
2020-10-14 12:44:39 -04:00
Drew Bailey 400455d302
Events/eval alloc events (#9012)
* generic eval update event

first pass at alloc client update events

* api/event client
2020-10-14 12:44:37 -04:00
Drew Bailey 4793bb4e01
Events/deployment events (#9004)
* Node Drain events and Node Events (#8980)

Deployment status updates

handle deployment status updates (paused, failed, resume)

deployment alloc health

generate events from apply plan result

txn err check, slim down deployment event

one ndjson line per index

* consolidate down to node event + type

* fix UpdateDeploymentAllocHealth test invocations

* fix test
2020-10-14 12:44:37 -04:00
Tim Gross 3ceb5b36b1
csi: allow more than 1 writer claim for multi-writer mode (#9040)
Fixes a bug where CSI volumes with the `MULTI_NODE_MULTI_WRITER` access mode
were using the same logic as `MULTI_NODE_SINGLE_WRITER` to determine whether
the volume had writer claims available for scheduling.

Extends CSI claim endpoint test to exercise multi-reader and make sure `WriteFreeClaims`
is exercised for multi-writer in feasibility test.
2020-10-07 10:43:23 -04:00
Seth Hoenig f44a4f68ee consul/connect: trigger update as necessary on connect changes
This PR fixes a long standing bug where submitting jobs with changes
to connect services would not trigger updates as expected. Previously,
service blocks were not considered as sources of destructive updates
since they could be synced with consul non-destructively. With Connect,
task group services that have changes to their connect block or to
the service port should be destructive, since the network plumbing of
the alloc is going to need updating.

Fixes #8596 #7991

Non-destructive half in #7192
2020-10-05 14:53:00 -05:00
Neil Mock f749de8543
Fix multi-interface networking in the system scheduler (#8822) 2020-09-22 12:54:34 -04:00
Mahmood Ali 6a0dd8bc87
Merge pull request #8867 from hashicorp/b-canary-substitution
scheduler: Revert requireCanary logic
2020-09-15 12:58:55 -05:00
Mahmood Ali 339617a836 Only ignore rescheduled allocations if they got stopped 2020-09-14 21:11:52 -04:00
Mahmood Ali 98de2d2278 add a test when .NextAllocation is set but alloc is still running 2020-09-14 17:12:53 -04:00
Mahmood Ali fd54cfce6e Revert the `requireCanary` check introduced in https://github.com/hashicorp/nomad/pull/8691/files#diff-1801138ac4d10f2064ba6f2e434ac9b4L430-R431 .
The change was intended to fix a case where a canary alloc may fail to
be rescheduled if all the other allocs fail as well (e.g. if all allocs
happen to be placed on a node that died).  However, it introduced some
unintended side-effects.

Reverting the change for now and will investigate further.
2020-09-10 14:59:02 -04:00
Mahmood Ali c6e1d22697 test for rescheduling non-canaries 2020-09-10 14:59:02 -04:00
Mahmood Ali 8837c9a45d Handle migration of non-deployment jobs
This handles the case where a job went from no-deployment to deployment
with canaries.

Consider a case where a `max_parallel=0` job is submitted as version 0,
then an update is submitted with `max_parallel=1, canary=1` as version 1.
In this case, we will have 1 canary alloc, and all remaining allocs will
be version 0.  Until the deployment is promoted, we ought to replace the
canaries with version 0 job (which isn't associated with a deployment).
2020-08-26 10:36:34 -04:00
Mahmood Ali 2438b90334 Update scheduler/reconcile.go
Co-authored-by: Chris Baker <1675087+cgbaker@users.noreply.github.com>
2020-08-25 17:37:19 -04:00
Mahmood Ali 38b61b97d8 simplify canary check
`(alloc.DeploymentStatus == nil || !alloc.DeploymentStatus.IsCanary())`
and `!alloc.DeploymentStatus.IsCanary()` are equivalent.
2020-08-25 17:37:19 -04:00
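Why the two expressions are equivalent: a sketch assuming IsCanary is a nil-safe method on the deployment status pointer, which is what the simplification relies on.
```go
package sketch

type AllocDeploymentStatus struct {
	Canary bool
}

// IsCanary handles a nil receiver, so an explicit nil check before calling it
// is redundant.
func (a *AllocDeploymentStatus) IsCanary() bool {
	if a == nil {
		return false
	}
	return a.Canary
}
```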
Mahmood Ali e4bb88dfcf tweak stack job manipulation
To address review comments
2020-08-25 17:37:19 -04:00
Mahmood Ali def768728e Have Plan.AppendAlloc accept the job 2020-08-25 17:22:09 -04:00
Mahmood Ali 8a342926b7 Respect alloc job version for lost/failed allocs
This change fixes a bug where lost/failed allocations are replaced by
allocations with the latest versions, even if the version hasn't been
promoted yet.

Now, when generating a plan for lost/failed allocations, the scheduler
first checks if the current deployment is in the canary stage, and if so, it
ensures that any lost/failed allocation is replaced with one at the latest
promoted version instead.
2020-08-19 09:52:48 -04:00
Lars Lehtonen fb7b2282b1
scheduler: label loops with nested switch statements for effective break (#8528) 2020-07-24 08:50:41 -04:00
Tim Gross 1ca2c4ec2c scheduler: DesiredCanaries can be set on every pass safely
The reconcile loop sets `DeploymentState.DesiredCanaries` only on the first
pass through the loop and if the job is not paused/pending. In MRD,
deployments will make one pass though the loop while "pending", and were not
ever getting `DesiredCanaries` set. We can't set it in the initial
`DeploymentState` constructor because the first pass through setting up
canaries expects it's not there yet. However, this value is static for a given
version of a job because it's coming from the update stanza, so it's safe to
re-assign the value on subsequent passes.
2020-07-20 11:25:53 -04:00
Tim Gross d3341a2019 refactor: make it clear where we're accessing dstate
The field name `Deployment.TaskGroups` contains a map of `DeploymentState`,
which makes it a little harder to follow state updates when combined with
inconsistent naming conventions, particularly when we also have the state
store or actual `TaskGroup`s in scope. This changeset changes all uses to
`dstate` so as not to be confused with actual TaskGroups.
2020-07-20 11:25:53 -04:00
Tim Gross fe5f5e35aa
mrd: reconcile should treat pending deployments as paused (#8446)
If a job update includes a task group that has no changes, those allocations
have their version bumped in-place. This ends up triggering an eval from
`deploymentwatcher` when it verifies their health. Although this eval is a
no-op, we were only treating pending deployments the same as paused when
the deployment was a new MRD. This means that any eval after the initial one
will kick off the deployment, and that caused pending deployments to "jump
the queue" and run ahead of schedule, breaking MRD invariants and resulting in
a state with all regions blocked.

This behavior can be replicated even in the case of job updates with no
in-place updates by patching `deploymentwatcher` to inject a spurious no-op
eval. This changeset fixes the behavior by treating pending deployments the
same as paused in all cases in the reconciler.
2020-07-16 13:00:08 -04:00
Tim Gross bd457343de
MRD: all regions should start pending (#8433)
Deployments should wait until kicked off by `Job.Register` so that we can
assert that all regions have a scheduled deployment before starting any
region. This changeset includes the OSS fixes to support the ENT work.

`IsMultiregionStarter` has no more callers in OSS, so remove it here.
2020-07-14 10:57:37 -04:00
Nick Ethier e0fb634309
ar: support opting into binding host ports to default network IP (#8321)
* ar: support opting into binding host ports to default network IP

* fix config plumbing

* plumb node address into network resource

* struct: only handle network resource upgrade path once
2020-07-06 18:51:46 -04:00
Tim Gross 31185325c9
reconcile should not overwrite unblocking state (#8349)
Pre-0.12.0 beta, a deployment was considered "complete" if it was
successful. But with MRD we have "blocked" and "unblocking" states as well. We
did not consider the case where a concurrent alloc health status update
triggers a `Compute` call on a deployment that's moved from "blocked" to
"unblocking" (it's a small window), which caused an extra pass thru the
`nextRegion` logic in `deploymentwatcher` and triggered an error when later
transitioning to "successful".

This changeset makes sure we don't overwrite that status.
2020-07-06 11:31:33 -04:00
Nick Ethier 89118016fc
command: correctly show host IP in ports output /w multi-host networks (#8289) 2020-06-25 15:16:01 -04:00
Nick Ethier 416efd83ee
scheduler: do network feasibility checking for system jobs (#8256) 2020-06-24 16:01:00 -04:00
Mahmood Ali 1c1fb5da0a this is OSS 2020-06-22 10:28:45 -04:00
Michael Schurter 562704124d
Merge pull request #8208 from hashicorp/f-multi-network
multi-interface network support
2020-06-19 15:46:48 -07:00
Tim Gross d3ecb87984
multiregion: initial deploymentPaused must match start condition (#8215)
In #8209 we fixed the `max_parallel` stanza for multiregion by introducing the
`IsMultiregionStarter` check, but didn't apply it to the earlier place its
required. The result is that deployments start but don't place allocations.
2020-06-19 13:42:38 -04:00
Tim Gross b654e1b8a4
multiregion: all regions start in running if no max_parallel (#8209)
If `max_parallel` is not set, all regions should begin in a `running` state
rather than a `pending` state. Otherwise the first region is set to `running`
and then all the remaining regions follow once it enters `blocked`. That behavior is
technically correct in that we have at most `max_parallel` regions running,
but definitely not what a user expects.
2020-06-19 11:17:09 -04:00
Nick Ethier f0559a8162
multi-interface network support 2020-06-19 09:42:10 -04:00
Nick Ethier 1e4ea699ad fix test failures from rebase 2020-06-18 11:05:32 -07:00
Nick Ethier 4a44deaa5c CNI Implementation (#7518) 2020-06-18 11:05:29 -07:00
Nick Ethier 0bc0403cc3 Task DNS Options (#7661)
Co-Authored-By: Tim Gross <tgross@hashicorp.com>
Co-Authored-By: Seth Hoenig <shoenig@hashicorp.com>
2020-06-18 11:01:31 -07:00
Tim Gross c14a75bfab multiregion: use pending instead of paused
The `paused` state is used as an operator safety mechanism, so that they can
debug a deployment or halt one that's causing a wider failure. By using the
`paused` state as the first state of a multiregion deployment, we risked
resuming an intentionally operator-paused deployment because of activity in a
peer region.

This changeset replaces the use of the `paused` state with a `pending` state,
and provides a `Deployment.Run` internal RPC to replace the use of the
`Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.
2020-06-17 11:06:14 -04:00
Tim Gross fd50b12ee2 multiregion: integrate with deploymentwatcher
* `nextRegion` should take status parameter
* thread Deployment/Job RPCs thru `nextRegion`
* add `nextRegion` calls to `deploymentwatcher`
* use a better description for paused for peer
2020-06-17 11:06:00 -04:00
Tim Gross 5c4d0a73f4 start all but first region deployment in paused state 2020-06-17 11:05:34 -04:00
Tim Gross 473a0f1d44 multiregion: unblock and cancel RPCs 2020-06-17 11:02:26 -04:00
Lang Martin 069840bef8
scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect (#8105) (#8138)
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect

* scheduler/reconcile: thread followupEvalIDs through to results.stop

* scheduler/reconcile: comment typo

* nomad/_test: correct arguments for plan.AppendStoppedAlloc

* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
2020-06-09 17:13:53 -04:00
Lang Martin ac7c39d3d3
Delayed evaluations for `stop_after_client_disconnect` can cause unwanted extra followup evaluations around job garbage collection (#8099)
* client/heartbeatstop: reversed time condition for startup grace

* scheduler/generic_sched: use `delayInstead` to avoid a loop

Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.

* scheduler/reconcile: lostLater are different than delayedReschedules

Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
2020-06-03 09:48:38 -04:00
Mahmood Ali 21c948f3d3 keep promotion score constants next to use 2020-05-27 15:13:19 -04:00
Mahmood Ali d9792777d9 Open source Preemption code
Nomad 0.12 OSS is to include preemption feature.

This commit moves the private code for managing preemption to OSS
repository.
2020-05-27 15:02:01 -04:00
Lang Martin d3c4700cd3
server: stop after client disconnect (#7939)
* jobspec, api: add stop_after_client_disconnect

* nomad/state/state_store: error message typo

* structs: alloc methods to support stop_after_client_disconnect

1. a global AllocStates to track status changes with timestamps. We
   need this to track the time at which the alloc became lost
   originally.

2. ShouldClientStop() and WaitClientStop() to actually do the math

* scheduler/reconcile_util: delayByStopAfterClientDisconnect

* scheduler/reconcile: use delayByStopAfterClientDisconnect

* scheduler/util: updateNonTerminalAllocsToLost comments

This was set up to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, as not all lost allocs
have been marked stop or evict by that point.

* scheduler/testing: AssertEvalStatus just use require

* scheduler/generic_sched: don't create a blocked eval if delayed

* scheduler/generic_sched_test: several scheduling cases
2020-05-13 16:39:04 -04:00
Mahmood Ali 759eade78b missed fixing one invocation 2020-05-01 13:38:46 -04:00
Mahmood Ali b9e3cde865 tests and some clean up 2020-05-01 13:13:30 -04:00
Charlie Voiselle d8e5e02398 Wiring algorithm to scheduler calls 2020-05-01 13:13:29 -04:00
Michael Schurter c901d0e7dd
Merge branch 'master' into b-reserved-scoring 2020-04-30 14:48:14 -07:00
Mahmood Ali 9f005201e2 Ensure that alloc updates preserve device offers
When an alloc is updated in-place, ensure that the allocated devices are
preserved and carried over to the new alloc.
2020-04-21 08:57:15 -04:00
Mahmood Ali 2ff2745374 test for allocated devices on job in-place update
When an alloc is updated in-place, test that the allocated devices are
preserved in the new alloc struct.
2020-04-21 08:56:05 -04:00
Michael Schurter 4c5a0cae35 core: fix node reservation scoring
The BinPackIter accounted for node reservations twice when scoring nodes
which could bias scores toward nodes with reservations.

Pseudo-code for previous algorithm:
```
	proposed  = reservedResources + sum(allocsResources)
	available = nodeResources - reservedResources
	score     = 1 - (proposed / available)
```

The node's reserved resources are added to the total resources used by
allocations, and then the node's reserved resources are later
subtracted from the node's overall resources.

The new algorithm is:
```
	proposed  = sum(allocResources)
	available = nodeResources - reservedResources
	score     = 1 - (proposed / available)
```

The node's reserved resources are no longer added to the total resources
used by allocations.

My guess as to how this bug happened is that the resource utilization
variable (`util`) is calculated and returned by the `AllocsFit` function
which needs to take reserved resources into account as a basic
feasibility check.

To avoid re-calculating alloc resource usage (because there may be a
large number of allocs), we reused `util` in the `ScoreFit` function.
`ScoreFit` properly accounts for reserved resources by subtracting them
from the node's overall resources. However since `util` _also_ took
reserved resources into account the score would be incorrect.

Prior to the fix the added test output:
```
Node: reserved     Score: 1.0000
Node: reserved2    Score: 1.0000
Node: no-reserved  Score: 0.9741
```

The scores being 1.0 for *both* nodes with reserved resources is a good
hint something is wrong as they should receive different scores. Upon
further inspection the double accounting of reserved resources caused
their scores to be >1.0 and clamped.

After the fix the added test outputs:
```
Node: no-reserved  Score: 0.9741
Node: reserved     Score: 0.9480
Node: reserved2    Score: 0.8717
```
2020-04-15 15:13:30 -07:00
Michael Schurter 4b475db408 core: fix comment on system stack
This makes me do a double take every time I run into it, so what if we
just changed it?
2020-04-09 15:19:11 -07:00
Tim Gross 161f9aedc3
scheduler: prevent a reported NPE for CSI (#7633) 2020-04-06 09:42:27 -04:00
Lang Martin e03c328792
csi: use node MaxVolumes during scheduling (#7565)
* nomad/state/state_store: CSIVolumesByNodeID ignores namespace

* scheduler/scheduler: add CSIVolumesByNodeID to the state interface

* scheduler/feasible: check node MaxVolumes

* nomad/csi_endpoint: no namespace in CSIVolumesByNodeID anymore

* nomad/state/state_store: avoid DenormalizeAllocationSlice

* nomad/state/iterator: clean up SliceIterator Next

* scheduler/feasible_test: block with MaxVolumes

* nomad/state/state_store_test: fix args to CSIVolumesByNodeID
2020-03-31 17:16:47 -04:00
Chris Baker 179ab68258 wip: added job.scale rpc endpoint, needs explicit test (tested via http now) 2020-03-24 13:57:09 +00:00
Mahmood Ali 6ddf3d1742
Merge pull request #7414 from hashicorp/b-network-mode-change
Detect network mode change
2020-03-24 09:46:40 -04:00
Lang Martin d994990ef0
csi: the scheduler allows a job with a volume write claim to be updated (#7438)
* nomad/structs/csi: split CanWrite into health, in use

* scheduler/scheduler: expose AllocByID in the state interface

* nomad/state/state_store_test

* scheduler/stack: SetJobID on the matcher

* scheduler/feasible: when a volume writer is in use, check if it's us

* scheduler/feasible: remove SetJob

* nomad/state/state_store: denormalize allocs before Claim

* nomad/structs/csi: return errors on claim, with context

* nomad/csi_endpoint_test: new alloc doesn't look like an update

* nomad/state/state_store_test: change test reference to CanWrite
2020-03-23 21:21:04 -04:00
Tim Gross d1f43a5fea csi: improve error messages from scheduler (#7426) 2020-03-23 13:59:25 -04:00
Lang Martin 3621df1dbf csi: volume ids are only unique per namespace (#7358)
* nomad/state/schema: use the namespace compound index

* scheduler/scheduler: CSIVolumeByID interface signature namespace

* scheduler/stack: SetJob on CSIVolumeChecker to capture namespace

* scheduler/feasible: pass the captured namespace to CSIVolumeByID

* nomad/state/state_store: use namespace in csi_volume index

* nomad/fsm: pass namespace to CSIVolumeDeregister & Claim

* nomad/core_sched: pass the namespace in volumeClaimReap

* nomad/node_endpoint_test: namespaces in Claim testing

* nomad/csi_endpoint: pass RequestNamespace to state.*

* nomad/csi_endpoint_test: appropriately failed test

* command/alloc_status_test: appropriately failed test

* node_endpoint_test: avoid notTheNamespace for the job

* scheduler/feasible_test: call SetJob to capture the namespace

* nomad/csi_endpoint: ACL check the req namespace, query by namespace

* nomad/state/state_store: remove deregister namespace check

* nomad/state/state_store: remove unused CSIVolumes

* scheduler/feasible: CSIVolumeChecker SetJob -> SetNamespace

* nomad/csi_endpoint: ACL check

* nomad/state/state_store_test: remove call to state.CSIVolumes

* nomad/core_sched_test: job namespace match so claim gc works
2020-03-23 13:59:25 -04:00
Danielle Lancashire e227f31584 sched/feasible: Return more detailed CSI Failure messages 2020-03-23 13:58:30 -04:00
Danielle Lancashire a2e01c4369 sched/feasible: Validate CSIVolumes correctly
Previously we were looking up plugins based on the Alias Name for a CSI
Volume within the context of its task group.

Here we first look up a volume based on its identifier and then validate
the existence of the plugin based on its `PluginID`.
2020-03-23 13:58:30 -04:00
Danielle Lancashire e56c677221 sched/feasible: CSI - Filter applicable volumes
This commit filters the job's volumes when setting them on the
feasibility checker. This ensures that the rest of the checker does not
have to worry about non-csi volumes.
2020-03-23 13:58:30 -04:00
Lang Martin 7b675f89ac csi: fix index maintenance for CSIVolume and CSIPlugin tables (#7049)
* state_store: csi volumes/plugins store the index in the txn

* nomad: csi_endpoint_test require index checks need uint64()

* nomad: other tests using int 0 not uint64(0)

* structs: pass index into New, but not other struct methods

* state_store: csi plugin indexes, use new struct interface

* nomad: csi_endpoint_test check index/query meta (on explicit 0)

* structs: NewCSIVolume takes an index arg now

* scheduler/test: NewCSIVolume takes an index arg now
2020-03-23 13:58:29 -04:00
Lang Martin a0a6766740 CSI: Scheduler knows about CSI constraints and availability (#6995)
* structs: piggyback csi volumes on host volumes for job specs

* state_store: CSIVolumeByID always includes plugins, matches usecase

* scheduler/feasible: csi volume checker

* scheduler/stack: add csi volumes

* contributing: update rpc checklist

* scheduler: add volumes to State interface

* scheduler/feasible: introduce new checker collection tgAvailable

* scheduler/stack: taskGroupCSIVolumes checker is transient

* state_store CSIVolumeDenormalizePlugins comment clarity

* structs: remove TODO comment in TaskGroup Validate

* scheduler/feasible: CSIVolumeChecker hasPlugins improve comment

* scheduler/feasible_test: set t.Parallel

* Update nomad/state/state_store.go

Co-Authored-By: Danielle <dani@hashicorp.com>

* Update scheduler/feasible.go

Co-Authored-By: Danielle <dani@hashicorp.com>

* structs: lift ControllerRequired to each volume

* state_store: store plug.ControllerRequired, use it for volume health

* feasible: csi match fast path remove stale host volume copied logic

* scheduler/feasible: improve comments

Co-authored-by: Danielle <dani@builds.terrible.systems>
2020-03-23 13:58:29 -04:00
Jasmine Dahilig 81d051d7e8 fix bug in lifecycle scheduler test mocks 2020-03-21 17:52:51 -04:00
Jasmine Dahilig 0cc9212a54 add test cases for scheduler alloc placement with lifecycle resources 2020-03-21 17:52:47 -04:00
Jasmine Dahilig 3e4e8f2b02 add allocfit test for lifecycles 2020-03-21 17:52:46 -04:00
Mahmood Ali b880607bad update scheduler to account for hooks 2020-03-21 17:52:45 -04:00
Mahmood Ali 9568553d7e Detect network mode change
Mark job as updated if network mode changed.
2020-03-21 16:51:10 -04:00
Drew Bailey 6bd6c6638c
include pro tag in several oss.go files 2020-02-10 15:56:14 -05:00
Drew Bailey 9a65556211
add state store test to ensure PlacedCanaries is updated 2020-02-03 13:58:01 -05:00
Drew Bailey f51a3d1f37
nomad state store must be modified through raft, rm local state change 2020-02-03 13:57:34 -05:00
Drew Bailey 1c046a74d8
comment for filtering reason 2020-02-03 09:02:09 -05:00
Drew Bailey e71f132455
add test for node eligibility 2020-02-03 09:02:09 -05:00
Drew Bailey 6b492630dd
make diffSystemAllocsForNode aware of eligibility
diffSystemAllocs -> diffSystemAllocsForNode: this function is only used
for diffing system allocations, but lacked awareness of eligible
nodes and the node ID the allocation was going to be placed on.

This change now ignores a change if its existing allocation is on an
ineligible node. For a new allocation, it also checks tainted and
ineligible nodes in the same function instead of nil-ing out the diff
after computation in diffSystemAllocs
2020-02-03 09:02:08 -05:00
Drew Bailey e613a258da
ignore computed diffs if node is ineligible
test flaky, add temp sleeps for debugging

fix computed class
2020-02-03 09:02:08 -05:00
Drew Bailey 63ddda71e1
Return FailedTGAlloc metric instead of no node err
If an existing system allocation is running and the node it's running on
is marked as ineligible, subsequent plan/applys return an RPC error
instead of a more helpful plan result.

This change logs the error, and appends a failedTGAlloc for the
placement.
2020-01-22 10:07:15 -05:00
Drew Bailey ef175c0b31
Update Evicted allocations to lost when lost
If an alloc is being preempted and marked as evict, but the underlying
node is lost before the migration takes place, the allocation currently
stays with a desired status of evict and client status running forever, or
until the node comes back online.

This commit updates updateNonTerminalAllocsToLost to check for a
desired status of Evict as well as Stop when updating allocations on
tainted nodes.

switch to table test for lost node cases
2020-01-07 13:34:18 -05:00
Preetha Appan afff27b69b More error->debug for logging in the bin packing iterator 2019-12-12 15:50:16 -06:00
Preetha Appan 3458b41290 Use debug logging for scheduler internals
We currently log an error if preemption is unable to find a suitable set of
allocations to preempt. This commit changes that to debug level since not finding
preemptable allocations is not an error condition.
2019-12-12 12:05:29 -06:00
Michael Schurter 7655e0cee4
Merge pull request #6792 from hashicorp/b-propose-panic
scheduler: fix panic when preempting and evicting allocs
2019-12-03 10:40:19 -08:00
Tim Gross c50057bf1f
scheduler: fix job update placement on prev node penalized (#6781)
Fixes #5856

When the scheduler looks for a placement for an allocation that's
replacing another allocation, it's supposed to penalize the previous
node if the allocation had been rescheduled or failed. But we're
currently always penalizing the node, which leads to unnecessary
migrations on job update.

This commit leaves in place the existing behavior where if the
previous alloc was itself rescheduled, its previous nodes are also
penalized. This is conservative but the right behavior especially on
larger clusters where a group of hosts might be having correlated
trouble (like an AZ failure).

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-12-03 06:14:49 -08:00
Michael Schurter 0374069f82 scheduler: update tests with modern error helper 2019-12-02 20:25:52 -08:00
Michael Schurter 19a2ee71d3 scheduler: fix panic when preempting and evicting
Fixes #6787

In ProposedAllocs the proposed alloc slice was being copied while its
contents were not. Since RemoveAllocs nils elements of the proposed
alloc slice and is called twice, it could panic on the second call when
erroneously accessing a nil'd alloc.

The fix is to not copy the proposed alloc slice and pass the slice
returned by the 1st RemoveAllocs call to the 2nd call, thus maintaining
the trimmed length.
2019-12-02 20:22:22 -08:00
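A sketch of the pattern behind the fix, with stand-in names: the second removal must operate on the trimmed slice returned by the first call, not on a stale copy whose tail still holds removed entries.
```go
package sketch

type Allocation struct{ ID string }

// removeAllocs filters out the given allocs and returns the shortened slice.
func removeAllocs(allocs []*Allocation, remove []*Allocation) []*Allocation {
	drop := make(map[string]bool, len(remove))
	for _, a := range remove {
		drop[a.ID] = true
	}
	out := make([]*Allocation, 0, len(allocs))
	for _, a := range allocs {
		if !drop[a.ID] {
			out = append(out, a)
		}
	}
	return out
}

func proposedAllocs(existing, evicting, preempting []*Allocation) []*Allocation {
	trimmed := removeAllocs(existing, evicting)
	// Passing `existing` here instead of `trimmed` is the kind of stale-slice
	// access that led to the panic.
	return removeAllocs(trimmed, preempting)
}
```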
Michael Schurter 6f64e52d61
Merge pull request #6699 from hashicorp/f-semver-constraints
Add new "semver" constraint
2019-11-19 12:18:43 -08:00
Drew Bailey 876618b5d2
Removes checking constraints for inplace update 2019-11-19 13:34:41 -05:00
Michael Schurter 796758b8a5 core: add semver constraint
The existing version constraint uses logic optimized for package
managers, not schedulers, when checking prereleases:

- 1.3.0-beta1 will *not* satisfy ">= 0.6.1"
- 1.7.0-rc1 will *not* satisfy ">= 1.6.0-beta1"

This is due to package managers wishing to favor final releases over
prereleases.

In a scheduler, versions more often represent the earliest release in which all
required features/APIs are available in a system. Whether the constraint
or the version being evaluated is a prerelease has no impact on
ordering.

This commit adds a new constraint - `semver` - which will use Semver
v2.0 ordering when evaluating constraints. Given the above examples:

- 1.3.0-beta1 satisfies ">= 0.6.1" using `semver`
- 1.7.0-rc1 satisfies ">= 1.6.0-beta1" using `semver`

Since existing jobspecs may rely on the old behavior, a new constraint
was added and the implicit Consul Connect and Vault constraints were
updated to use it.
2019-11-19 08:40:19 -08:00
Drew Bailey e44a66d7fc
DOCS: Spread stanza does not exist on task
Fixes documentation inaccuracy for spread stanza placement. Spreads can
only exist on the top level job struct or within a group.

comment about nil assumption
2019-11-19 08:26:36 -05:00
Drew Bailey 07e3164bf9
Check for changes to affinity and constraints
Adds checks for affinity and constraint changes when determining if we
should update inplace.

refactor to check all levels at once

check for spread changes when checking inplace update
2019-11-19 08:26:34 -05:00
Chris Baker e0105f817a changed all tests to require from t.Fatalf 2019-11-07 22:39:47 +00:00
Chris Baker 95ae01a9f4 the scheduler checks whether task changes require a restart; this needed
to be updated to consider devices
2019-11-07 17:51:15 +00:00
Michael Schurter c6bbe85f42 core: fix panic when AllocatedResources is nil
Fix for #6540
2019-10-28 14:38:21 -07:00
Danielle Lancashire 78b61de45f
config: Hoist volume.config.source into volume
Currently, using a Volume in a job uses the following configuration:

```
volume "alias-name" {
  type = "volume-type"
  read_only = true

  config {
    source = "host_volume_name"
  }
}
```

This commit migrates to the following:

```
volume "alias-name" {
  type = "volume-type"
  source = "host_volume_name"
  read_only = true
}
```

The original design was based on uncertainty about the future of storage
plugins, and was intended to allow maximum flexibility.

However, this causes a few issues, namely:
- We frequently need to parse this configuration during submission,
scheduling, and mounting
- It complicates the configuration from an end user's perspective
- It complicates the ability to do validation

As we understand the problem space of CSI a little more, it has become
clear that we won't need the `source` to be in config, as it will be
used in the majority of cases:

- Host Volumes: Always need a source
- Preallocated CSI Volumes: Always needs a source from a volume or claim name
- Dynamic Persistent CSI Volumes*: Always needs a source to attach the volumes
                                   to for managing upgrades and to avoid dangling.
- Dynamic Ephemeral CSI Volumes*: Less thought out, but `source` will probably point
                                  to the plugin name, and a `config` block will
                                  allow you to pass meta to the plugin. Or will
                                  point to a pre-configured ephemeral config.
*If implemented

The new design simplifies this by merging the source into the volume
stanza to solve the above issues with usability, performance, and error
handling.
2019-09-13 04:37:59 +02:00
Preetha Appan 9accf60805
update comment 2019-09-05 18:43:30 -05:00
Preetha Appan d21c708c4a
Fix inplace updates bug with group level networks
During inplace updates, we should be using network information
from the previous allocation being updated.
2019-09-05 18:37:24 -05:00
Jasmine Dahilig 4edebe389a
add default update stanza and max_parallel=0 disables deployments (#6191) 2019-09-02 10:30:09 -07:00
Mahmood Ali 3a1cb51539 schedulers: check all drivers on node
When checking driver feasibility for an alloc with multiple drivers, we
must check that all drivers are detected and healthy.

Nomad 0.9 and 0.8 have a bug where we may check a single driver only,
and which driver gets checked depends on map traversal order, which is
unspecified in the Go spec.
2019-08-29 09:03:31 -04:00
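A minimal sketch of the fix described above: every driver required by the task group must be detected and healthy on the node, not just whichever driver happens to be visited first during map iteration. Types and names here are illustrative.
```go
package sketch

type driverInfo struct {
	Detected bool
	Healthy  bool
}

// allDriversReady returns true only if every required driver is present,
// detected, and healthy on the node.
func allDriversReady(required []string, nodeDrivers map[string]*driverInfo) bool {
	for _, name := range required {
		info, ok := nodeDrivers[name]
		if !ok || !info.Detected || !info.Healthy {
			return false
		}
	}
	return true
}
```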
Mahmood Ali 3da10b5cb3 scheduler: tests for multiple drivers in TG 2019-08-29 09:03:31 -04:00
Danielle Lancashire 3a5e48ad18
scheduler: Implicit constraint on readonly hostvol
When a Client declares a volume is ReadOnly, we should only schedule it
for requests for ReadOnly volumes. This change means that if a host
exposes a readonly volume, we then validate that the group level
requests for the volume are all read only for that host.
2019-08-21 20:57:05 +02:00
Danielle Lancashire e132a30899
structs: Unify Volume and VolumeRequest 2019-08-12 15:39:08 +02:00
Danielle fc53283489
Update scheduler/feasible.go
Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>
2019-08-12 15:39:08 +02:00
Danielle Lancashire 073836ec67
scheduler: Add a feasibility checker for Host Vols 2019-08-12 15:39:08 +02:00
Preetha Appan e6a496bac0
Code review feedback 2019-07-31 01:04:08 -04:00
Preetha Appan 99eca85206
Scheduler changes to support network at task group level
Also includes unit tests for binpacker and preemption.
The tests verify that network resources specified at the
task group level are properly accounted for
2019-07-31 01:04:08 -04:00
Nick Ethier 7c9520b404
scheduler: fix disk constraints 2019-07-31 01:04:08 -04:00
Nick Ethier 09a4cfd8d7
fix failing tests 2019-07-31 01:04:07 -04:00
Nick Ethier af66a35924
networking: Add new bridge networking mode implementation 2019-07-31 01:04:06 -04:00
Nick Ethier 15989bba8e
ar: cleanup lint errors 2019-07-31 01:03:18 -04:00
Nick Ethier 66c514a388
Add network lifecycle management
Adds new Prerun and Postrun hooks to manage the setup of network namespaces
on Linux. Work still needs to be done to make the code platform agnostic and
support Docker-style network initialization.
2019-07-31 01:03:17 -04:00
Lang Martin 8157a7b6f8 system_sched submits failed evals as blocked 2019-07-18 10:32:12 -04:00
Preetha Appan 3484f18984
Fix more tests 2019-06-26 16:30:53 -05:00
Preetha Appan 10e7d6df6d
Remove compat code associated with many previous versions of nomad
This removes compat code for namespaces (0.7), Drain (0.8), and other
older features from releases older than Nomad 0.7.
2019-06-25 19:05:25 -05:00
Mahmood Ali 8d4f914be9
Merge pull request #5790 from hashicorp/b-reschedule-desired-state
Mark rescheduled allocs as stopped.
2019-06-13 17:28:59 -04:00
Mahmood Ali 5e6327b6a1 Test behavior no reschedule for service/batch jobs 2019-06-13 16:41:19 -04:00
Mahmood Ali faf643a375 Don't stop rescheduleLater allocations
When an alloc is due to be rescheduleLater, it goes through the
reconciler twice: once to be ignored with a follow-up eval, and once
again when processing the follow-up eval, where it appears as
rescheduleNow.

Here, we ignore the alloc in the first run and mark it as stopped in the second
iteration, rather than stopping it twice.
2019-06-13 09:44:41 -04:00
Mahmood Ali 5dc404ecab Only preempt for network when there is a network
When examining preemption for networks, only consider allocs that have
networks.

Fixes https://github.com/hashicorp/nomad/issues/5793
2019-06-07 18:55:55 -04:00