107 Commits

Author SHA1 Message Date
Tim Gross b764f52ab9
deploymentwatcher: reset progress deadline on promotion (#10042)
In a deployment with two groups (ex. A and B), if group A's canary becomes
healthy before group B's, the deadline for the overall deployment will be set
to that of group A. When the deployment is promoted, if group A is done it
will not contribute to the next deadline cutoff. Group B's old deadline will
be used instead, which will be in the past and immediately trigger a
deployment progress failure. Reset the progress deadline when the job is
promoted to avoid this bug, and to better conform with implicit user
expectations around how the progress deadline should interact with promotions.
2021-02-22 16:44:03 -05:00
Kris Hicks 0a3a748053
Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
Drew Bailey 6c788fdccd
Events/msgtype cleanup (#9117)
* use msgtype in upsert node

adds message type to signature for upsert node, update tests, remove placeholder method

* UpsertAllocs msg type test setup

* use upsertallocs with msg type in signature

update test usage of delete node

delete placeholder msgtype method

* add msgtype to upsert evals signature, update test call sites with test setup msg type

handle snapshot upsert eval outside of FSM and ignore eval event

remove placeholder upsertevalsmsgtype

handle job plan rpc and prevent event creation for plan

msgtype cleanup upsertnodeevents

updatenodedrain msgtype

msg type 0 is a node registration event, so set the default to the ignore type

* fix named import

* fix signature ordering on upsertnode to match
2020-10-19 09:30:15 -04:00
Michael Schurter dd09fa1a4a
Merge pull request #9055 from hashicorp/f-9017-resources
api: add field filters to /v1/{allocations,nodes}
2020-10-14 14:49:39 -07:00
Michael Schurter 8ccbd92cb6 api: add field filters to /v1/{allocations,nodes}
Fixes #9017

The ?resources=true query parameter includes resources in the object
stub listings. Specifically:

- For `/v1/nodes?resources=true` both the `NodeResources` and
  `ReservedResources` fields are included.
- For `/v1/allocations?resources=true` the `AllocatedResources` field is
  included.

The ?task_states=false query parameter removes TaskStates from
/v1/allocations responses. (By default TaskStates are included.)
2020-10-14 10:35:22 -07:00
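As a rough illustration of the query parameters described in 8ccbd92cb6, the sketch below only builds the request URLs; it makes no network calls. The base address `http://127.0.0.1:4646` and the `listURL` helper are assumptions made for this example, not part of the commit.

```go
package main

import (
	"fmt"
	"net/url"
)

// listURL is a hypothetical helper that builds a list-endpoint URL with the
// given query parameters. It only constructs a string.
func listURL(base, path string, params map[string]string) string {
	u, err := url.Parse(base)
	if err != nil {
		panic(err)
	}
	u.Path = path
	q := url.Values{}
	for k, v := range params {
		q.Set(k, v)
	}
	u.RawQuery = q.Encode() // Encode sorts parameters by key
	return u.String()
}

func main() {
	base := "http://127.0.0.1:4646" // assumed agent address, for illustration only

	// Include NodeResources and ReservedResources in node stubs.
	fmt.Println(listURL(base, "/v1/nodes", map[string]string{"resources": "true"}))

	// Include AllocatedResources and drop TaskStates from allocation stubs.
	fmt.Println(listURL(base, "/v1/allocations", map[string]string{
		"resources":   "true",
		"task_states": "false",
	}))
}
```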
Drew Bailey df96b89958
Add EvictCallbackFn to handle removing entries from go-memdb when they
are removed from the event buffer.

Wire up event buffer size config, use pointers for structs.Events
instead of copying.
2020-10-14 12:44:42 -04:00
Drew Bailey 9d48818eb8
writetxn can return error, add alloc and job generic events. Add events
table for durability
2020-10-14 12:44:39 -04:00
Drew Bailey 400455d302
Events/eval alloc events (#9012)
* generic eval update event

first pass at alloc client update events

* api/event client
2020-10-14 12:44:37 -04:00
Drew Bailey 4793bb4e01
Events/deployment events (#9004)
* Node Drain events and Node Events (#8980)

Deployment status updates

handle deployment status updates (paused, failed, resume)

deployment alloc health

generate events from apply plan result

txn err check, slim down deployment event

one ndjson line per index

* consolidate down to node event + type

* fix UpdateDeploymentAllocHealth test invocations

* fix test
2020-10-14 12:44:37 -04:00
Tim Gross d3341a2019 refactor: make it clear where we're accessing dstate
The field name `Deployment.TaskGroups` contains a map of `DeploymentState`,
which makes it a little harder to follow state updates when combined with
inconsistent naming conventions, particularly when we also have the state
store or actual `TaskGroup`s in scope. This changeset changes all uses to
`dstate` so as not to be confused with actual TaskGroups.
2020-07-20 11:25:53 -04:00
Tim Gross c14a75bfab multiregion: use pending instead of paused
The `paused` state is used as an operator safety mechanism, so that they can
debug a deployment or halt one that's causing a wider failure. By using the
`paused` state as the first state of a multiregion deployment, we risked
resuming an intentionally operator-paused deployment because of activity in a
peer region.

This changeset replaces the use of the `paused` state with a `pending` state,
and provides a `Deployment.Run` internal RPC to replace the use of the
`Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.
2020-06-17 11:06:14 -04:00
Tim Gross fd50b12ee2 multiregion: integrate with deploymentwatcher
* `nextRegion` should take status parameter
* thread Deployment/Job RPCs thru `nextRegion`
* add `nextRegion` calls to `deploymentwatcher`
* use a better description for paused for peer
2020-06-17 11:06:00 -04:00
Tim Gross 48e9f75c1e multiregion: deploymentwatcher hooks
This changeset establishes hooks in deploymentwatcher for multiregion
deployments (for the enterprise version of Nomad).
2020-06-17 11:05:18 -04:00
Tim Gross 473a0f1d44 multiregion: unblock and cancel RPCs 2020-06-17 11:02:26 -04:00
Mahmood Ali 19141f8103 {volume|deployment}watcher: check for nil batcher 2020-05-26 14:54:27 -04:00
Mahmood Ali 81ac098a22 deploymentwatcher: no batcher when disabling
When disabling deploymentwatcher (at the end of a test), avoid starting a
new update batcher with its new goroutine.
2020-05-26 14:44:47 -04:00
Mahmood Ali 9d88f1d568 tests: deflake deploymentwatcher package
This deflakes the tests in the deploymentwatcher package.  The package
uses a mock deployment watcher backend, where the watcher in a
background goroutine calls UpdateDeploymentStatus.  If the mock isn't
configured to expect the call, the background goroutine will fail.  One
UpdateDeploymentStatus call is made at the end of the background
goroutine, which may occur after the test completes, thus explaining the
flakiness.
2020-03-12 15:42:01 -04:00
Jasmine Dahilig 8d980edd2e
add create and modify timestamps to evaluations (#5881) 2019-08-07 09:50:35 -07:00
Mahmood Ali d30c3d10b0
Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1
More test fixes
2019-06-05 19:09:18 -04:00
Lang Martin 0f6f543a5f deployment_watcher auto promote iff every task group is auto promotable 2019-05-22 12:34:57 -04:00
Lang Martin 0c668ecc7a log error on autoPromoteDeployment failure 2019-05-22 12:32:08 -04:00
Lang Martin b5fd735960 add update AutoPromote bool 2019-05-22 12:32:08 -04:00
Lang Martin 3c5a9fed22 deployments_watcher_test new TestWatcher_AutoPromoteDeployment 2019-05-22 12:32:08 -04:00
Lang Martin 0bebf5d7f8 deployment_watcher when it's ok to autopromote, do so 2019-05-22 12:32:08 -04:00
Lang Martin 0cf4168ed9 deployments_watcher comments 2019-05-22 12:32:08 -04:00
Lang Martin e1e28307be new deploymentwatcher/doc.go for package level documentation 2019-05-22 12:32:08 -04:00
Mahmood Ali 3b0152d778 tests: fix deploymentwatcher tests data races 2019-05-21 14:29:45 -04:00
Alex Dadgar be54e56570 review fixes 2018-11-08 09:48:36 -08:00
Alex Dadgar 1c31970464 Fix multiple tgs with progress deadline handling
Fix an issue in which the deployment watcher would fail the deployment
based on the earliest progress deadline of the deployment regardless of
if the task group has finished.

Further fix an issue where the blocked eval optimization would make it
so no evals were created to progress the deployment. To reproduce this
issue, prior to this commit, you can create a job with two task groups.
The first group has count 1 and resources such that it can not be
placed. The second group has count 3, max_parallel=1, and can be placed.
Run this first and then update the second group to do a deployment. It
will place the first of three, but never progress since there exists a
blocked eval. However, that doesn't capture the fact that there are two
groups being deployed.
2018-11-05 16:06:17 -08:00
Alex Dadgar de442226ae Fix other instances of blocking queries 2018-09-24 13:52:39 -07:00
Alex Dadgar 7f0d241ef4 always handle failed allocation 2018-09-21 15:13:54 -07:00
Alex Dadgar b2449ae1ce Fix deployment watcher index usage
Fixes three issues:
1. Retrieving the latest evaluation index was not properly selecting the
greatest index. This would undermine checks we had to reduce the number
of evaluations created when the latest eval index was greater than any
alloc change.
2. Fix an issue where the blocking query code was using the incorrect
index such that the index was higher than necessary.
3. Special case handling of blocked evaluations since the create/snapshot
index is not particularly useful since they can be reblocked.
2018-09-21 13:59:11 -07:00
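Items 1 and 3 of b2449ae1ce can be sketched as a scan that keeps the greatest index seen. The `evalStub` type and `latestEvalIndex` function are hypothetical and written only for this example. In particular, using the modify index for blocked evaluations is an assumption about how the special case might look, not something the commit message states.

```go
package main

import "fmt"

// evalStub is a hypothetical, trimmed-down evaluation record.
type evalStub struct {
	Status        string
	CreateIndex   uint64
	SnapshotIndex uint64
	ModifyIndex   uint64
}

// latestEvalIndex returns the greatest relevant index across all evaluations.
func latestEvalIndex(evals []evalStub) uint64 {
	var latest uint64
	for _, e := range evals {
		idx := e.CreateIndex
		if e.SnapshotIndex > idx {
			idx = e.SnapshotIndex
		}
		// ASSUMPTION: a blocked evaluation can be reblocked, so its create and
		// snapshot indexes may be stale; this sketch uses ModifyIndex instead.
		if e.Status == "blocked" {
			idx = e.ModifyIndex
		}
		if idx > latest {
			latest = idx // keep the greatest, not merely the last one seen
		}
	}
	return latest
}

func main() {
	evals := []evalStub{
		{Status: "complete", CreateIndex: 40, SnapshotIndex: 42},
		{Status: "blocked", CreateIndex: 10, SnapshotIndex: 12, ModifyIndex: 50},
		{Status: "complete", CreateIndex: 30, SnapshotIndex: 31},
	}
	fmt.Println(latestEvalIndex(evals))
}
```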
Alex Dadgar 3c19d01d7a server 2018-09-15 16:23:13 -07:00
Alex Dadgar c6576ddac1 Fix make check errors 2018-09-04 16:03:52 -07:00
Alex Dadgar 300b1a7a15 Tests only use testlog package logger 2018-06-13 15:40:56 -07:00
Preetha Appan c7edbd5f41
newlines in test 2018-05-07 14:55:01 -05:00
Preetha Appan 4e75456beb
Fix deadlock in deadline timer logic when progress deadline is passed and the deployment is updated. 2018-05-07 14:55:01 -05:00
Preetha Appan 19b096d203
Set modify time for allocs in unit test, and define current time in one spot 2018-05-07 14:55:01 -05:00
Preetha Appan 4c377b112e
Fix panic in deployment watcher when deployment is not in the state store due to a gc 2018-05-07 14:55:01 -05:00
Alex Dadgar 768fec8505
Allow healthy canary deployment to skip progress deadline 2018-05-07 14:55:01 -05:00
Michael Schurter a3038cefb4
typo: transistion -> transition 2018-05-07 14:50:01 -05:00
Alex Dadgar bd38675365
Fix tests 2018-05-07 14:50:01 -05:00
Preetha Appan b2b773e696
better comments and remove commented code 2018-05-07 14:50:01 -05:00
Preetha Appan 90a2311cef
Fix deadlock in deployment watcher when deployment starts with no allocations and eventually has failed allocations 2018-05-07 14:50:01 -05:00
Alex Dadgar 8d50955054
Fix typos 2018-05-07 14:50:01 -05:00
Alex Dadgar 641ef81cbf
Test fixes 2018-05-07 14:50:01 -05:00
Alex Dadgar 8a81038cdb
Set Reschedule from deployment watcher 2018-05-07 14:50:01 -05:00
Alex Dadgar a510774451
Use UpdateAllocDesiredTransistion instead of UpsertEval but no transitions yet 2018-05-07 14:50:01 -05:00
Alex Dadgar fcf4f582d0
small review feedback fixes 2018-05-07 14:50:01 -05:00
Alex Dadgar 9bff9024b3
add latest eval back 2018-05-07 14:50:01 -05:00