Commit Graph

599 Commits

Author SHA1 Message Date
Alex Dadgar bac5cb1e8b Scheduler uses allocated resources 2018-10-02 17:08:25 -07:00
Preetha Appan a10118c461 Add failed follow up to the list of allowed eval trigger reasons
needs unit test
2018-09-25 10:49:55 -07:00
Alex Dadgar 6a21f9fe96 Unique TriggerBy for blocked evals
Give blocked evals a unique triggerby reason to make debugging a chain
of evaluations easier.
2018-09-24 14:47:49 -07:00
Alex Dadgar 3c19d01d7a server 2018-09-15 16:23:13 -07:00
Alex Dadgar 3ba62efd5e Failed/paused deployments do not block migrations
This PR changes behavior of the scheduler such that a task group with a
deployment that is failed or paused will not cause the scheduler to skip
migrations.

The reason for this change is that it causes a bad UX when draining
nodes with allocations that are part of a failed/paused deployment.
These operations should not be coupled in any way and this remedies
that.

Prior behavior was still correct, but required either jobs to
transistion to a healthy state or for the node to hit its drain
deadline.
2018-09-10 15:28:45 -07:00
Alex Dadgar cc92cd92cd
Merge pull request #4642 from hashicorp/b-vet
Fix vet errors and use newer go version in travis
2018-09-04 17:04:02 -07:00
Alex Dadgar c6576ddac1 Fix make check errors 2018-09-04 16:03:52 -07:00
Preetha Appan 751c0eb5a5
code review feedback 2018-09-04 16:10:11 -05:00
Preetha Appan 9bc0962527
Track top k nodes by norm score rather than top k nodes per scorer 2018-09-04 16:10:11 -05:00
Preetha Appan 6ed527c636
Use heap to store top K scoring nodes.
Scoring metadata is now aggregated by scorer type to make it easier
to parse when reading it in the CLI.
2018-09-04 16:10:11 -05:00
Preetha Appan 65cf4373b3
fix linting error 2018-09-04 16:10:11 -05:00
Preetha Appan dd5fe6373f
Fix scoring logic for uneven spread to incorporate current alloc count
Also addressed other small code review comments
2018-09-04 16:10:11 -05:00
Preetha Appan e72c0fe527
more cleanup 2018-09-04 16:10:11 -05:00
Preetha Appan 4c624424e6
added some unit tests for -1 spread score 2018-09-04 16:10:11 -05:00
Preetha Appan 92d37acc2a
comment and formatting cleanup 2018-09-04 16:10:11 -05:00
Preetha Appan 7b0a27cad6
fix scoring algorithm when min count == current count 2018-09-04 16:10:11 -05:00
Preetha Appan bad075f640
Remove hardcoded boosts for even spread.
instead, calculate them based on delta between current and minimum value
2018-09-04 16:10:11 -05:00
Preetha Appan c56873ff37
Implement support for even spread across datacenters, with unit test 2018-09-04 16:10:11 -05:00
Preetha Appan d091c00dd3
Support implicit spread target to account for remaining desired counts 2018-09-04 16:10:11 -05:00
Preetha Appan 33779abe5f
fix comments 2018-09-04 16:10:11 -05:00
Preetha Appan 5812f906c8
Allow empty spread targets, and validate target percentages. 2018-09-04 16:10:11 -05:00
Preetha Appan 55f276c189
Include spreads configured at job level when precomputing weights/desired counts. 2018-09-04 16:10:11 -05:00
Preetha Appan fbd0004707
Fix warnings 2018-09-04 16:10:11 -05:00
Preetha Appan db0d95b09c
Implement spread iterator that scores according to percentage of desired count in each target.
Added this as a new step in the stack and some unit tests
2018-09-04 16:10:11 -05:00
Preetha Appan eccf128c5c
Some minor changes from code review 2018-09-04 16:10:11 -05:00
Preetha Appan 038ed52877
Fix after rename to ConstraintSetContainsAny 2018-09-04 16:10:11 -05:00
Preetha Appan 3a39db3902
Fix linting 2018-09-04 16:10:11 -05:00
Preetha Appan d5cd2bbddb
Remove unnecessary reset 2018-09-04 16:10:11 -05:00
Preetha Appan dccb693221
test for setcontainsany, and treat set_contains same as set_contains_all 2018-09-04 16:10:11 -05:00
Preetha Appan 70bfd0c0cb
Address some review feedback 2018-09-04 16:10:11 -05:00
Preetha Appan 8685593ec0
Back out changes to propertyset that were not necessary for affinities 2018-09-04 16:10:11 -05:00
Preetha Appan 5eacd6ada4
Implement affinity support in generic scheduler 2018-09-04 16:10:11 -05:00
Alex Dadgar e1c239daae
Merge pull request #4414 from hashicorp/b-stop-summary
Reset Queued allocs to zero when job stopped
2018-07-16 14:32:55 -07:00
Nick Ethier 6b6777359b
scheduler: fix missing err assignment 2018-07-11 14:27:10 -04:00
Nick Ethier 5f6def5b04
scheduler: better error handling 2018-07-05 11:00:03 -04:00
Nick Ethier 030e650e78
scheduler: fix nil pointer exception 2018-07-02 16:05:38 -04:00
Alex Dadgar 300b1a7a15 Tests only use testlog package logger 2018-06-13 15:40:56 -07:00
Alex Dadgar c3c79c408e Reset Queued allocs to zero when job stopped
When a job is stopped but not purged, we should set the Queued count to
be zero.
2018-06-13 10:46:39 -07:00
Preetha Appan b64788043e
make test create index clearer 2018-06-05 17:29:59 -05:00
Preetha Appan 3e264dcb79
Fix reconciler bug with deployment not being created if job create index is different
This fixes an issue where if a job is purged and resubmitted Nomad does not create
a new deployment. Adds unit test that failed before this fix
2018-06-05 13:58:53 -05:00
Preetha Appan f8a23bc54a
fix test comment 2018-05-09 16:01:34 -05:00
Preetha Appan ef531b0f34
Add unit tests for forced rescheduling 2018-05-09 11:30:42 -05:00
Preetha Appan c1b92c284e
Work in progress - force rescheduling of failed allocs 2018-05-08 17:26:57 -05:00
Alex Dadgar 555d14fd92
Add test 2018-05-07 14:55:01 -05:00
Preetha Appan cf44670d56
Make sure that task group has a deployment state before using it 2018-05-07 14:55:01 -05:00
Alex Dadgar c6478d9469
clarify comment 2018-05-07 14:55:01 -05:00
Alex Dadgar 768fec8505
Allow healthy canary deployment to skip progress deadline 2018-05-07 14:55:01 -05:00
Alex Dadgar 8626c1b94a
Reschedule when we have canaries properly 2018-05-07 14:55:01 -05:00
Alex Dadgar 8dee3ab068
canary reschedule test 2018-05-07 14:50:01 -05:00
Alex Dadgar deb93dc7b7
Test for rescheduling when there are canaries 2018-05-07 14:50:01 -05:00
Alex Dadgar 550f5e31f8
Allow canary count greater than desired 2018-05-07 14:50:01 -05:00
Alex Dadgar f95ab4ade8
Mark canaries on creation, and unmark on promotion 2018-05-07 14:50:01 -05:00
Preetha Appan 5329900f6d
Only use DesiredTransition.Reschedule in reconciler when its an active deployment 2018-05-07 14:50:01 -05:00
Alex Dadgar e7444c3873
Add test where deployment is marked as complete when done even with failed allocs 2018-05-07 14:50:01 -05:00
Alex Dadgar 57969b4ee0
fix reconcile tests 2018-05-07 14:50:01 -05:00
Alex Dadgar 5547974f35
Only reschedule allowed deployment allocs 2018-05-07 14:50:01 -05:00
Alex Dadgar fcf4f582d0
small review feedback fixes 2018-05-07 14:50:01 -05:00
Alex Dadgar 1336002255
Progress deadline in deployment state 2018-05-07 14:50:01 -05:00
Alex Dadgar ee50789c22
Initial implementation 2018-05-07 14:50:01 -05:00
Preetha Appan a569d34f25
Add custom status description for rescheduling follow up evals, and make unit test robust 2018-04-10 15:30:15 -05:00
Alex Dadgar e5b5803265 Only mark allocs as part of deployment if deployment is active 2018-04-05 15:40:49 -07:00
Preetha Appan 7e17bc231f
remove unnecessary check and other fixes from code review 2018-04-04 07:35:20 -05:00
Preetha Appan 00537c739b
Fixes edge cases around timing and task finish time being set more than once 2018-04-03 16:34:59 -05:00
Alex Dadgar 3aa4ee9d75 Fix lost handling of not actually down nodes 2018-03-30 14:17:41 -07:00
Preetha Appan d87e528059
rename skip->ignore and improve comment formatting 2018-03-29 15:11:10 -05:00
Preetha Appan 38a7614776
Refactored for readability, pair programmed with @dadgar 2018-03-29 13:28:37 -05:00
Preetha Appan 5090fefe96
Filter out allocs with DesiredState = stop, and unit tests 2018-03-29 09:28:52 -05:00
Alex Dadgar b18f789020 Unmark drain when nodes hit their deadline and only batch/system left and add all job type integration test 2018-03-28 17:25:58 -07:00
Preetha Appan d2899728fd
Fix linting 2018-03-28 12:26:28 -05:00
Alex Dadgar 9d60e2cebf Correct status desc on draining system allocs 2018-03-26 17:54:46 -07:00
Preetha Appan 33e170c15d
s/linear/constant/g 2018-03-26 14:45:09 -05:00
Preetha 5668c3c38e
Merge pull request #4037 from hashicorp/b-fix-terminal-filtering-service-allocs
Fix edge case in reconciler
2018-03-26 13:14:51 -05:00
Preetha Appan 1b9e413a1a
one field per line in struct definition 2018-03-26 13:13:21 -05:00
Alex Dadgar e106da84de name and test 2018-03-26 11:06:21 -07:00
Alex Dadgar e2a6e64fca Don't create unnecessary deployments 2018-03-23 16:55:21 -07:00
Preetha Appan cbfd69ce7a
Fix edge case in reconciler where service jobs with ClientstatusComplete were not replaced 2018-03-23 18:41:00 -05:00
Alex Dadgar 3b72dd94ba Do not mark an allocation as an inplace update if specification hasn't changed 2018-03-23 14:36:05 -07:00
Michael Schurter cb61a4bdc7 Fix linting errors 2018-03-21 16:51:45 -07:00
Alex Dadgar 92b636dd32 Fix deadline handling 2018-03-21 16:51:44 -07:00
Michael Schurter 9263cc2ed7 scheduler: migrate non-terminal migrating allocs
filterByTainted node should always migrate non-terminal migrating allocs
2018-03-21 16:49:48 -07:00
Michael Schurter d1ec65d765 switch to new raft DesiredTransition message 2018-03-21 16:49:48 -07:00
Alex Dadgar db4a634072 RPC, FSM, State Store for marking DesiredTransistion
fix build tag
2018-03-21 16:49:48 -07:00
Michael Schurter c0542474db drain: initial drainv2 structs and impl 2018-03-21 16:49:48 -07:00
Chelsea Holland Komlo 329605b7cc fix up scheduling test 2018-03-21 15:54:03 -04:00
Chelsea Holland Komlo 60f12d206f improve comments; update watchDriver 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo d92703617c simplify logic
bump log level
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo d8f68e5ef8 fix up codereview feedback 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo c7fd0bd8a1 fix up scheduler mocks 2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo 3aa726baab fix scheduler driver name; create node structs file 2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo 3cba95e8a7 allow nomad to schedule based on the status of a client driver health check
Slight updates for go style
2018-03-21 15:15:25 -04:00
Preetha Appan 56e60e5840
Fix linting warning 2018-03-14 16:12:22 -05:00
Preetha Appan 9a5e6edf1f
Rename DelayCeiling to MaxDelay 2018-03-14 16:10:32 -05:00
Preetha Appan 3e96c6c4e0
Address more code review feedback 2018-03-14 16:10:32 -05:00
Preetha Appan 9fed0d2103
Get reschedule policy from the alloc directly 2018-03-14 16:10:32 -05:00
Preetha Appan e89bbf7289
Update comment about WaitTime 2018-03-14 16:10:32 -05:00
Preetha Appan e2656ef546
Cleaner handling of batched evals 2018-03-14 16:10:32 -05:00
Preetha Appan 47e0280d96
More small review feedback 2018-03-14 16:10:32 -05:00
Preetha Appan 2ba976dec8
Remove unnecessary check against 5 second window for determining immediate scheduling eligibility 2018-03-14 16:10:32 -05:00
Preetha Appan 5373ade731
Scheduler and Reconciler changes to support delayed rescheduling 2018-03-14 16:10:32 -05:00
Josh Soref e0f6a33fe5 spelling: system 2018-03-11 19:01:19 +00:00
Josh Soref a89e1b8395 spelling: strategy 2018-03-11 18:58:19 +00:00
Josh Soref f8eb766fb5 spelling: reschedulable 2018-03-11 18:48:12 +00:00
Josh Soref ed8db9992e spelling: feasibility 2018-03-11 18:07:09 +00:00
Josh Soref bf9283c606 spelling: corresponding 2018-03-11 17:51:41 +00:00
Josh Soref ca4ceb0e5c spelling: commits 2018-03-11 17:47:45 +00:00
Preetha Appan 7b6ba7a1f4
Fixes bug in reconciler where previously rescheduled allocs are rescheduled again. Simplified logic and added test case to catch this. 2018-02-20 12:07:56 -06:00
Preetha Appan 7c57303dd2
Clarify comment 2018-02-05 16:37:07 -06:00
Preetha Appan d48c411692
Reconciler should consider failed allocs when marking deployment as failed. 2018-02-02 19:40:25 -06:00
Preetha Appan a1237d627a
code review feedback 2018-01-31 09:58:05 -06:00
Preetha Appan 5ad892026a
Add a field to track the next allocation during a replacement 2018-01-31 09:58:05 -06:00
Preetha Appan 2ed4de7e7b
Track previous node id correctly, plus unit test 2018-01-31 09:58:05 -06:00
Preetha Appan dd4917c2f0
Add more clarification in comment 2018-01-31 09:58:05 -06:00
Preetha Appan 09bef7d1ce
Preallocate slice for skipped nodes 2018-01-31 09:58:05 -06:00
Preetha Appan 237beb49ae
Better score threshold 2018-01-31 09:58:05 -06:00
Preetha Appan fa18c0def4
Add one more unit test 2018-01-31 09:58:05 -06:00
Preetha Appan a75540cec6
Limit iterator uses a score threshold and a maxSkip value to be able to skip lower scoring nodes 2018-01-31 09:58:05 -06:00
Preetha Appan b6268a5fab
Beef up unit test for rescheduling batch jobs 2018-01-31 09:56:53 -06:00
Preetha Appan ea4a889e28
Address more code review feedback 2018-01-31 09:56:53 -06:00
Preetha Appan bd89d2b39e
Make sure that reschedule trackers are not added for node drain replacements 2018-01-31 09:56:53 -06:00
Preetha Appan a662b38801
Improve reconciler unit tests 2018-01-31 09:56:53 -06:00
Preetha Appan fee4ccf154
Prevent side effect modification of select options when preferred nodes are set 2018-01-31 09:56:53 -06:00
Preetha Appan 21b7b79d5d
Add helper methods, use require and other code review feedback 2018-01-31 09:56:53 -06:00
Preetha Appan d0f9d59abb
Reconile with changes to structs for reschedule tracking 2018-01-31 09:56:53 -06:00
Preetha Appan fbb1936dee
Fix some comments and lint warnings, remove unused method 2018-01-31 09:56:53 -06:00
Preetha Appan 031c566ada
Reschedule previous allocs and track their reschedule attempts 2018-01-31 09:56:53 -06:00
Preetha Appan fd2fbefa4c
Add a field to track the next allocation during a replacement 2018-01-24 17:55:05 -06:00
Alex Dadgar 6dda0ebaed gofmt 2018-01-04 14:45:15 -08:00
Alex Dadgar 2f561609b7 Fix detection of successful batch allocations
This PR restores older behavior of detecting successful batch
allocations (04d86ffd1006fde9dfb2ca8c1237fe60b995b0e3). This has the
side effect that we correctly filter desired status stop but not
successful batch allocations and create their replacements.
2018-01-04 14:20:32 -08:00
Preetha 1712b03705
Merge branch 'master' into 0.8 2018-01-03 16:06:38 -06:00
Preetha Appan 51bd0b59c7
Return an error if evaluation doesn't exist in state store at plan apply time. 2017-12-18 14:55:36 -06:00
Preetha Appan 3c36abfe14
Update eval modify index as part of plan apply. 2017-12-18 10:03:55 -06:00
Preetha Appan 3b4d7ac2a3
Fix some typos 2017-12-14 13:29:27 -06:00
Michael Schurter 45494f7304 Fix port labels on mock Alloc/Job/Node 2017-12-08 14:50:06 -08:00
Alex Dadgar 44240ce440 Merge pull request #3375 from hashicorp/b-batch
Allow batch jobs to be rerun if purged
2017-10-13 17:11:45 -07:00
Alex Dadgar c1cc51dbee sync 2017-10-13 14:36:02 -07:00
Alex Dadgar 746cd7403f Allow batch jobs to be rerun if purged
This PR allows batch jobs to be rerun if they have been purged.
2017-10-13 12:40:37 -07:00
Michael Schurter a66c53d45a Remove `structs` import from `api`
Goes a step further and removes structs import from api's tests as well
by moving GenerateUUID to its own package.
2017-09-29 10:36:08 -07:00
Alex Dadgar 4173834231 Enable more linters 2017-09-26 15:26:33 -07:00
Alex Dadgar 3904bde9a3 Fix batch handling of complete allocs/node drains
This PR fixes:
* An issue in which a node-drain that contains a complete batch alloc
would cause a replacement
* An issue in which allocations with the same name during a scale
down/stop event wouldn't be properly stopped.
* An issue in which batch allocations from previous job versions may not
have been stopped properly.

Fixes https://github.com/hashicorp/nomad/issues/3210
2017-09-14 15:08:57 -07:00
Alex Dadgar 84d06f6abe Sync namespace changes 2017-09-07 17:04:21 -07:00
Alex Dadgar 0aef02a4f9 fix test 2017-08-21 14:07:54 -07:00
Alex Dadgar 27256ebcc6 Placing allocs counts towards placement limit
This PR makes placing new allocations count towards the limit. We do not
restrict how many new placements are made by the limit but we still
count towards the limit. This has the nice affect that if you have a
group with count = 5 and max_parallel = 1 but only 3 allocs exist for it
and a change is made, you will create 2 more at the new version but not
destroy one, taking you down to two running as you would have
previously.

Fixes https://github.com/hashicorp/nomad/issues/3053
2017-08-21 12:41:19 -07:00
Alex Dadgar 2453f13fc5 fixes 2017-08-15 12:27:05 -07:00
Alex Dadgar 0570e09feb Fix panic occuring from improper bitmap size
This PR fixes an allignment calculation when determining the bitmap
size.

Fixes https://github.com/hashicorp/nomad/issues/3008
2017-08-12 15:37:02 -07:00
Luke Farnell f0ced87b95 fixed all spelling mistakes for goreport 2017-08-07 17:13:05 -04:00
Alex Dadgar 7b13c0d702 Lost allocs replaced even if deployment failed
This PR allows the scheduler to replace lost allocations even if the job
has a failed or paused deployment. The prior behavior was confusing to
users.

Fixes https://github.com/hashicorp/nomad/issues/2958
2017-08-03 17:42:14 -07:00
Alex Dadgar 7d2b84ab01 Review fixes 2017-08-01 14:18:52 -07:00
Alex Dadgar 2650bb1d12 Distinct Property supports arbitrary limit
This PR enhances the distinct_property constraint such that a limit can
be specified in the RTarget/value parameter. This allows constraints
such as:

```
constraint {
  distinct_property = "${meta.rack}"
  value = "2"
}
```

This restricts any given rack from running more than 2 allocations from
the task group.

Fixes https://github.com/hashicorp/nomad/issues/1146
2017-07-31 16:52:13 -07:00
Alex Dadgar 4f69355a66 Fix incorrect destructive update with distinct_property constraint
This PR fixes an issue in which an update to a task group with a
distinct property constraint would result in an incorrect destructive
update.
2017-07-31 11:17:35 -07:00
Michael Schurter 5f1f91a46c Use go-testing-interface instead of testing
This drops the testings stdlib pkg from our dependencies. Saves a
whopping 46kb on our binary (was really hoping for more of a win there),
but also avoids potential ugliness with how testing sets flags.
2017-07-25 15:35:19 -07:00