Commit graph

84 commits

Author SHA1 Message Date
Tim Gross c14a75bfab multiregion: use pending instead of paused
The `paused` state is used as an operator safety mechanism, so that they can
debug a deployment or halt one that's causing a wider failure. By using the
`paused` state as the first state of a multiregion deployment, we risked
resuming an intentionally operator-paused deployment because of activity in a
peer region.

This changeset replaces the use of the `paused` state with a `pending` state,
and provides a `Deployment.Run` internal RPC to replace the use of the
`Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.
2020-06-17 11:06:14 -04:00
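A minimal sketch of why a separate `pending` state helps, assuming a simplified status type. The names `Status`, `StatusPending`, and `runFromPeer` are hypothetical and are not taken from the Nomad codebase; the point is only that a peer-triggered "run" transition can refuse anything that is not `pending`, so an operator pause is left alone.

```go
package main

import "fmt"

// Status is a hypothetical deployment status type used only for this sketch.
type Status string

const (
	StatusPending Status = "pending" // waiting on activity in a peer region
	StatusPaused  Status = "paused"  // intentionally halted by an operator
	StatusRunning Status = "running"
)

// runFromPeer models a peer-triggered "run" transition (an assumption, not
// the real RPC). It only advances pending deployments, so a deployment that
// an operator paused is never resumed as a side effect.
func runFromPeer(s Status) (Status, error) {
	if s != StatusPending {
		return s, fmt.Errorf("cannot run deployment in state %q", s)
	}
	return StatusRunning, nil
}

func main() {
	for _, s := range []Status{StatusPending, StatusPaused} {
		next, err := runFromPeer(s)
		fmt.Println(s, "->", next, err)
	}
}
```

With a single overloaded `paused` state, the same function could not tell the two cases apart.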
Tim Gross fd50b12ee2 multiregion: integrate with deploymentwatcher
* `nextRegion` should take status parameter
* thread Deployment/Job RPCs thru `nextRegion`
* add `nextRegion` calls to `deploymentwatcher`
* use a better description for paused for peer
2020-06-17 11:06:00 -04:00
Tim Gross 5c4d0a73f4 start all but first region deployment in paused state 2020-06-17 11:05:34 -04:00
Tim Gross 473a0f1d44 multiregion: unblock and cancel RPCs 2020-06-17 11:02:26 -04:00
Lang Martin 069840bef8
scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect (#8105) (#8138)
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect

* scheduler/reconcile: thread followupEvalIDs through to results.stop

* scheduler/reconcile: comment typo

* nomad/_test: correct arguments for plan.AppendStoppedAlloc

* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
2020-06-09 17:13:53 -04:00
Lang Martin ac7c39d3d3
Delayed evaluations for stop_after_client_disconnect can cause unwanted extra followup evaluations around job garbage collection (#8099)
* client/heartbeatstop: reversed time condition for startup grace

* scheduler/generic_sched: use `delayInstead` to avoid a loop

Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.

* scheduler/reconcile: lostLater are different than delayedReschedules

Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
2020-06-03 09:48:38 -04:00
Lang Martin d3c4700cd3
server: stop after client disconnect (#7939)
* jobspec, api: add stop_after_client_disconnect

* nomad/state/state_store: error message typo

* structs: alloc methods to support stop_after_client_disconnect

1. a global AllocStates to track status changes with timestamps. We
   need this to track the time at which the alloc became lost
   originally.

2. ShouldClientStop() and WaitClientStop() to actually do the math

* scheduler/reconcile_util: delayByStopAfterClientDisconnect

* scheduler/reconcile: use delayByStopAfterClientDisconnect

* scheduler/util: updateNonTerminalAllocsToLost comments

This was set up to only update allocs to lost if the DesiredStatus had
already been set by the scheduler. It seems like the intention was to
update the status from any non-terminal state, and not all lost allocs
have been marked stop or evict by now.

* scheduler/testing: AssertEvalStatus just use require

* scheduler/generic_sched: don't create a blocked eval if delayed

* scheduler/generic_sched_test: several scheduling cases
2020-05-13 16:39:04 -04:00
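The "math" mentioned above can be sketched roughly as follows, assuming the stop time is the moment the alloc first became lost plus the configured `stop_after_client_disconnect` duration. The type `allocState` and the function `stopDeadline` are hypothetical illustrations, not the actual `ShouldClientStop()` or `WaitClientStop()` implementations.

```go
package main

import (
	"fmt"
	"time"
)

// allocState is a hypothetical record of one status change and its timestamp.
type allocState struct {
	Status string
	Time   time.Time
}

// Sample values used by main; all of them are made up for illustration.
var (
	sampleStart     = time.Date(2020, 5, 13, 12, 0, 0, 0, time.UTC)
	sampleLostAt    = sampleStart.Add(time.Minute)
	sampleStopAfter = 5 * time.Minute
)

// sampleStates returns a short history: running first, then lost.
func sampleStates() []allocState {
	return []allocState{
		{Status: "running", Time: sampleStart},
		{Status: "lost", Time: sampleLostAt},
	}
}

// stopDeadline returns the time at which the alloc should be stopped: the
// first time it was recorded as lost plus stopAfter. The boolean is false
// when the alloc was never lost.
func stopDeadline(states []allocState, stopAfter time.Duration) (time.Time, bool) {
	for _, s := range states {
		if s.Status == "lost" {
			return s.Time.Add(stopAfter), true
		}
	}
	return time.Time{}, false
}

func main() {
	deadline, ok := stopDeadline(sampleStates(), sampleStopAfter)
	fmt.Println(deadline.Format(time.RFC3339), ok)
}
```

Keeping the original lost timestamp matters here: recomputing from the latest status change would push the deadline back on every update.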
Jasmine Dahilig 4edebe389a
add default update stanza and max_parallel=0 disables deployments (#6191) 2019-09-02 10:30:09 -07:00
Mahmood Ali faf643a375 Don't stop rescheduleLater allocations
When an alloc is due to be rescheduleLater, it goes through the
reconciler twice: once to be ignored with a follow-up eval, and once
again when processing the follow-up eval, where it appears as
rescheduleNow.

Here, we ignore it in the first run and mark it as stopped in the second
iteration, rather than stopping it twice.
2019-06-13 09:44:41 -04:00
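The two passes described above can be sketched as a single decision on the reschedule time. This is an illustrative model only; `decision`, `decide`, and `sampleRescheduleAt` are hypothetical names, and the real reconciler tracks considerably more state.

```go
package main

import (
	"fmt"
	"time"
)

// decision is a hypothetical label for what the reconciler does with a
// failed alloc on a given pass.
type decision string

const (
	ignoreWithFollowup decision = "ignore, create follow-up eval"
	stopAndReplace     decision = "stop, place replacement"
)

// sampleRescheduleAt is a made-up reschedule time used by main.
var sampleRescheduleAt = time.Date(2019, 6, 13, 9, 45, 0, 0, time.UTC)

// decide leaves the alloc alone while its reschedule time is still in the
// future (the rescheduleLater pass) and stops it exactly once when that time
// has arrived (the rescheduleNow pass).
func decide(now, rescheduleAt time.Time) decision {
	if now.Before(rescheduleAt) {
		return ignoreWithFollowup
	}
	return stopAndReplace
}

func main() {
	firstPass := sampleRescheduleAt.Add(-30 * time.Second)
	fmt.Println(decide(firstPass, sampleRescheduleAt))
	fmt.Println(decide(sampleRescheduleAt, sampleRescheduleAt))
}
```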
Mahmood Ali fd8fb8c22b Stop allocs to be rescheduled
Currently, when an alloc fails and is rescheduled, the alloc desired
state remains as "run" and the nomad client may not free the resources.

Here, we ensure that an alloc is marked as stopped when it's
rescheduled.

Notice the Desired Status and Description before and after this change:

Before:
```
mars-2:nomad notnoop$ nomad alloc status 02aba49e
ID                   = 02aba49e
Eval ID              = bb9ed1d2
Name                 = example-reschedule.nodes[0]
Node ID              = 5853d547
Node Name            = mars-2.local
Job ID               = example-reschedule
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = run
Desired Description  = <none>
Created              = 10s ago
Modified             = 5s ago
Replacement Alloc ID = d6bf872b

Task "payload" is "dead"
Task Resources
CPU        Memory          Disk     Addresses
0/100 MHz  24 MiB/300 MiB  300 MiB

Task Events:
Started At     = 2019-06-06T21:12:45Z
Finished At    = 2019-06-06T21:12:50Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2019-06-06T17:12:50-04:00  Not Restarting  Policy allows no restarts
2019-06-06T17:12:50-04:00  Terminated      Exit Code: 1
2019-06-06T17:12:45-04:00  Started         Task started by client
2019-06-06T17:12:45-04:00  Task Setup      Building Task Directory
2019-06-06T17:12:45-04:00  Received        Task received by client

```

After:

```
ID                   = 5001ccd1
Eval ID              = 53507a02
Name                 = example-reschedule.nodes[0]
Node ID              = a3b04364
Node Name            = mars-2.local
Job ID               = example-reschedule
Job Version          = 0
Client Status        = failed
Client Description   = Failed tasks
Desired Status       = stop
Desired Description  = alloc was rescheduled because it failed
Created              = 13s ago
Modified             = 3s ago
Replacement Alloc ID = 7ba7ac20

Task "payload" is "dead"
Task Resources
CPU         Memory          Disk     Addresses
21/100 MHz  24 MiB/300 MiB  300 MiB

Task Events:
Started At     = 2019-06-06T21:22:50Z
Finished At    = 2019-06-06T21:22:55Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type            Description
2019-06-06T17:22:55-04:00  Not Restarting  Policy allows no restarts
2019-06-06T17:22:55-04:00  Terminated      Exit Code: 1
2019-06-06T17:22:50-04:00  Started         Task started by client
2019-06-06T17:22:50-04:00  Task Setup      Building Task Directory
2019-06-06T17:22:50-04:00  Received        Task received by client
```
2019-06-06 17:27:12 -04:00
Lang Martin 34230577df describe a pending deployment with auto_promote accurately 2019-05-22 12:32:08 -04:00
Lang Martin d462639cc9 sched reconcile copy AutoPromote to DeploymentState 2019-05-22 12:32:08 -04:00
Preetha Appan 1574e898af
Fix bug in reconciler where terminal allocs on a job already stopped were unnecessarily updated 2018-10-08 21:03:49 -05:00
Alex Dadgar 3c19d01d7a server 2018-09-15 16:23:13 -07:00
Alex Dadgar 3ba62efd5e Failed/paused deployments do not block migrations
This PR changes behavior of the scheduler such that a task group with a
deployment that is failed or paused will not cause the scheduler to skip
migrations.

The reason for this change is that it causes a bad UX when draining
nodes with allocations that are part of a failed/paused deployment.
These operations should not be coupled in any way and this remedies
that.

Prior behavior was still correct, but required either the jobs to
transition to a healthy state or the node to hit its drain
deadline.
2018-09-10 15:28:45 -07:00
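A rough sketch of the decoupling described above, assuming a simplified model in which a failed or paused deployment holds back update placements but never migrations. The names `deployment`, `blocksUpdates`, and `plan` are hypothetical and do not mirror the scheduler's actual types.

```go
package main

import "fmt"

// deployment is a hypothetical, stripped-down deployment record.
type deployment struct {
	Status string
}

// blocksUpdates reports whether update placements should be held back.
func blocksUpdates(d *deployment) bool {
	return d != nil && (d.Status == "paused" || d.Status == "failed")
}

// plan returns how many migrations and updates to process in this pass.
// Migrations are never gated on deployment status; updates are.
func plan(d *deployment, migrating, updating int) (migrate, update int) {
	migrate = migrating
	if !blocksUpdates(d) {
		update = updating
	}
	return migrate, update
}

func main() {
	m, u := plan(&deployment{Status: "paused"}, 3, 2)
	fmt.Println("migrate:", m, "update:", u)
}
```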
Preetha Appan 3e264dcb79
Fix reconciler bug with deployment not being created if job create index is different
This fixes an issue where, if a job is purged and resubmitted, Nomad does not create
a new deployment. Adds a unit test that failed before this fix.
2018-06-05 13:58:53 -05:00
Preetha Appan cf44670d56
Make sure that task group has a deployment state before using it 2018-05-07 14:55:01 -05:00
Alex Dadgar 768fec8505
Allow healthy canary deployment to skip progress deadline 2018-05-07 14:55:01 -05:00
Alex Dadgar 8626c1b94a
Reschedule when we have canaries properly 2018-05-07 14:55:01 -05:00
Alex Dadgar 550f5e31f8
Allow canary count greater than desired 2018-05-07 14:50:01 -05:00
Preetha Appan 5329900f6d
Only use DesiredTransition.Reschedule in reconciler when it's an active deployment 2018-05-07 14:50:01 -05:00
Alex Dadgar 57969b4ee0
fix reconcile tests 2018-05-07 14:50:01 -05:00
Alex Dadgar fcf4f582d0
small review feedback fixes 2018-05-07 14:50:01 -05:00
Alex Dadgar 1336002255
Progress deadline in deployment state 2018-05-07 14:50:01 -05:00
Alex Dadgar ee50789c22
Initial implementation 2018-05-07 14:50:01 -05:00
Preetha Appan a569d34f25
Add custom status description for rescheduling follow up evals, and make unit test robust 2018-04-10 15:30:15 -05:00
Preetha Appan 7e17bc231f
remove unnecessary check and other fixes from code review 2018-04-04 07:35:20 -05:00
Preetha Appan 00537c739b
Fixes edge cases around timing and task finish time being set more than once 2018-04-03 16:34:59 -05:00
Alex Dadgar e106da84de name and test 2018-03-26 11:06:21 -07:00
Alex Dadgar e2a6e64fca Don't create unnecessary deployments 2018-03-23 16:55:21 -07:00
Alex Dadgar 3b72dd94ba Do not mark an allocation as an inplace update if specification hasn't changed 2018-03-23 14:36:05 -07:00
Michael Schurter cb61a4bdc7 Fix linting errors 2018-03-21 16:51:45 -07:00
Alex Dadgar 92b636dd32 Fix deadline handling 2018-03-21 16:51:44 -07:00
Alex Dadgar db4a634072 RPC, FSM, State Store for marking DesiredTransition
fix build tag
2018-03-21 16:49:48 -07:00
Preetha Appan 56e60e5840
Fix linting warning 2018-03-14 16:12:22 -05:00
Preetha Appan 9fed0d2103
Get reschedule policy from the alloc directly 2018-03-14 16:10:32 -05:00
Preetha Appan e2656ef546
Cleaner handling of batched evals 2018-03-14 16:10:32 -05:00
Preetha Appan 47e0280d96
More small review feedback 2018-03-14 16:10:32 -05:00
Preetha Appan 5373ade731
Scheduler and Reconciler changes to support delayed rescheduling 2018-03-14 16:10:32 -05:00
Josh Soref a89e1b8395 spelling: strategy 2018-03-11 18:58:19 +00:00
Josh Soref f8eb766fb5 spelling: reschedulable 2018-03-11 18:48:12 +00:00
Preetha Appan 7c57303dd2
Clarify comment 2018-02-05 16:37:07 -06:00
Preetha Appan d48c411692
Reconciler should consider failed allocs when marking deployment as failed. 2018-02-02 19:40:25 -06:00
Preetha Appan ea4a889e28
Address more code review feedback 2018-01-31 09:56:53 -06:00
Preetha Appan bd89d2b39e
Make sure that reschedule trackers are not added for node drain replacements 2018-01-31 09:56:53 -06:00
Preetha Appan 21b7b79d5d
Add helper methods, use require and other code review feedback 2018-01-31 09:56:53 -06:00
Preetha Appan fbb1936dee
Fix some comments and lint warnings, remove unused method 2018-01-31 09:56:53 -06:00
Preetha Appan 031c566ada
Reschedule previous allocs and track their reschedule attempts 2018-01-31 09:56:53 -06:00
Alex Dadgar 746cd7403f Allow batch jobs to be rerun if purged
This PR allows batch jobs to be rerun if they have been purged.
2017-10-13 12:40:37 -07:00
Alex Dadgar 3904bde9a3 Fix batch handling of complete allocs/node drains
This PR fixes:
* An issue in which a node-drain that contains a complete batch alloc
would cause a replacement
* An issue in which allocations with the same name during a scale
down/stop event wouldn't be properly stopped.
* An issue in which batch allocations from previous job versions may not
have been stopped properly.

Fixes https://github.com/hashicorp/nomad/issues/3210
2017-09-14 15:08:57 -07:00