Commit graph

3353 commits

Author SHA1 Message Date
Mahmood Ali 38a01c050e
Merge pull request #8192 from hashicorp/f-status-allnamespaces-2
CLI Allow querying all namespaces for jobs and allocations - Try 2
2020-06-18 20:16:52 -04:00
Nick Ethier 4a44deaa5c CNI Implementation (#7518) 2020-06-18 11:05:29 -07:00
Nick Ethier 0bc0403cc3 Task DNS Options (#7661)
Co-Authored-By: Tim Gross <tgross@hashicorp.com>
Co-Authored-By: Seth Hoenig <shoenig@hashicorp.com>
2020-06-18 11:01:31 -07:00
Mahmood Ali c0aa06d9c7 rpc: allow querying allocs across namespaces
This implements the backend handling for querying across namespaces for
allocation list endpoints.
2020-06-17 16:31:06 -04:00
Mahmood Ali e784fe331a use '*' to indicate all namespaces
This reverts the introduction of AllNamespaces parameter that was merged
earlier but never got released.
2020-06-17 16:27:43 -04:00
Tim Gross 81ae581da6
test: remove flaky test from volumewatcher (#8189)
The volumewatcher restores itself on notification, but detecting this is racy
because it may reap any claim (or find there are no claims to reap) and
shutdown before we can test whether it's running. This appears to have become
flaky with a new version of golang. The other cases in this test case
sufficiently exercise the start/stop behavior of the volumewatcher, so remove
the flaky section.
2020-06-17 15:41:51 -04:00
Chris Baker fe9d654640
Merge pull request #8187 from hashicorp/f-8143-block-scaling-during-deployment
modify Job.Scale RPC to return an error if there is an active deployment
2020-06-17 14:38:55 -05:00
Chris Baker cd903218f7 added changelog entry and satisfied make check 2020-06-17 17:43:45 +00:00
Chris Baker ab2b15d8cb modify Job.Scale RPC to return an error if there is an active deployment
resolves #8143
2020-06-17 17:03:35 +00:00
Tim Gross 6b1cb61888 remove test for ent-only behavior 2020-06-17 11:27:29 -04:00
Tim Gross c14a75bfab multiregion: use pending instead of paused
The `paused` state is used as an operator safety mechanism, so that they can
debug a deployment or halt one that's causing a wider failure. By using the
`paused` state as the first state of a multiregion deployment, we risked
resuming an intentionally operator-paused deployment because of activity in a
peer region.

This changeset replaces the use of the `paused` state with a `pending` state,
and provides a `Deployment.Run` internal RPC to replace the use of the
`Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.
2020-06-17 11:06:14 -04:00
Tim Gross fd50b12ee2 multiregion: integrate with deploymentwatcher
* `nextRegion` should take status parameter
* thread Deployment/Job RPCs thru `nextRegion`
* add `nextRegion` calls to `deploymentwatcher`
* use a better description for paused for peer
2020-06-17 11:06:00 -04:00
Tim Gross 7b12445f29 multiregion: change AutoRevert to OnFailure 2020-06-17 11:05:45 -04:00
Tim Gross 5c4d0a73f4 start all but first region deployment in paused state 2020-06-17 11:05:34 -04:00
Tim Gross 48e9f75c1e multiregion: deploymentwatcher hooks
This changeset establishes hooks in deploymentwatcher for multiregion
deployments (for the enterprise version of Nomad).
2020-06-17 11:05:18 -04:00
Tim Gross b09b7a2475 Multiregion job registration
Integration points for multiregion jobs to be registered in the enterprise
version of Nomad:
* hook in `Job.Register` for enterprise to send job to peer regions
* remove monitoring from `nomad job run` and `nomad job stop` for multiregion jobs
2020-06-17 11:04:58 -04:00
Drew Bailey 9263fcb0d3 Multiregion deploy status and job status CLI 2020-06-17 11:03:34 -04:00
Tim Gross 473a0f1d44 multiregion: unblock and cancel RPCs 2020-06-17 11:02:26 -04:00
Tim Gross ede3a4f1c4 multiregion: request structs 2020-06-17 11:00:34 -04:00
Tim Gross 6851024925 Multiregion structs
Initial struct definitions, jobspec parsing, validation, and conversion
between Nomad structs and API structs for multi-region deployments.
2020-06-17 11:00:14 -04:00
Chris Baker 9fc66bc1aa support in API client and Job.Register RPC for PreserveCounts 2020-06-16 18:45:28 +00:00
Chris Baker 1e3563e08c wip: added PreserveCounts to struct.JobRegisterRequest, development test for Job.Register 2020-06-16 18:45:17 +00:00
Chris Baker 7ed06cced0 core: update Job.Scale to save the previous job count in the ScalingEvent 2020-06-15 19:49:22 +00:00
Chris Baker aeb3ed449e wip: added .PreviousCount to api.ScalingEvent and structs.ScalingEvent, with developmental tests 2020-06-15 19:40:21 +00:00
Mahmood Ali c17ffb2d35
Merge pull request #8131 from hashicorp/f-snapshot-restore
Implement snapshot restore
2020-06-15 08:32:34 -04:00
Mahmood Ali 9bfc3e28d9
Apply suggestions from code review
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2020-06-15 08:32:16 -04:00
Lang Martin 069840bef8
scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect (#8105) (#8138)
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect

* scheduler/reconcile: thread follupEvalIDs through to results.stop

* scheduler/reconcile: comment typo

* nomad/_test: correct arguments for plan.AppendStoppedAlloc

* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
2020-06-09 17:13:53 -04:00
Mahmood Ali 63e048e972 clarify ccomments, esp related to leadership code 2020-06-09 12:01:31 -04:00
Mahmood Ali b543460e0a loosen raft timeout 2020-06-07 16:38:11 -04:00
Mahmood Ali 69bb42acf8 tests: prefix agent logs to identify agent sources 2020-06-07 16:38:11 -04:00
Mahmood Ali 47a163b63f reassert leadership 2020-06-07 15:47:06 -04:00
Mahmood Ali 9eb13ae144 basic snapshot restore 2020-06-07 15:46:23 -04:00
Mahmood Ali bf7a3583e5
Merge pull request #8089 from hashicorp/b-leader-worker-count
leadership: pause and unpause workers consistently
2020-06-04 12:01:01 -04:00
Mahmood Ali cd8e1b4d62
stop periodic dispatch at end of tests (#8111) 2020-06-04 09:15:00 -04:00
Lang Martin ac7c39d3d3
Delayed evaluations for stop_after_client_disconnect can cause unwanted extra followup evaluations around job garbage collection (#8099)
* client/heartbeatstop: reversed time condition for startup grace

* scheduler/generic_sched: use `delayInstead` to avoid a loop

Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.

* scheduler/reconcile: lostLater are different than delayedReschedules

Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
2020-06-03 09:48:38 -04:00
Mahmood Ali 70fbcb99c2 leadership: pause and unpause workers consistently
This fixes a bug where leadership establishment pauses 3/4 of workers
but stepping down unpause only 1/2!
2020-06-01 10:57:53 -04:00
Mahmood Ali 891fb3f8a9 test for paused workers upon leadership revocation 2020-06-01 10:48:42 -04:00
Mahmood Ali de44d9641b
Merge pull request #8047 from hashicorp/f-snapshot-save
API for atomic snapshot backups
2020-06-01 07:55:16 -04:00
Mahmood Ali e37a3312d5 If leadership fails, consider it handled
The callers for `forward` and old implementation expect failures to be
accompanied with a true value!  This fixes the issue and have tests
passing!
2020-05-31 22:06:17 -04:00
Mahmood Ali 30ab9c84e5 more review feedback 2020-05-31 21:39:09 -04:00
Mahmood Ali a73cd01a00
Merge pull request #8001 from hashicorp/f-jobs-list-across-nses
endpoint to expose all jobs across all namespaces
2020-05-31 21:28:03 -04:00
Mahmood Ali 082c085068
Merge pull request #8036 from hashicorp/f-background-vault-revoke-on-restore
Speed up leadership establishment
2020-05-31 21:27:16 -04:00
Mahmood Ali 1af32e65bc clarify rpc consistency readiness comment 2020-05-31 21:26:41 -04:00
Mahmood Ali 0819ea60ea
Apply suggestions from code review
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2020-05-31 21:04:39 -04:00
Mahmood Ali 37c6160b96 Handle nil/empty cluster metadata
Handle case where a snapshot is made before cluster metadata is created.

This fixes a bug where a server may have empty cluster metadata if it
created and installed a Raft snapshot before a new cluster metadata ID is
generated.

This case is very unlikely to arise.  Most likely reason is when
upgrading from an old version slowly where servers may use snapshots
before all servers upgrade.  This happened for a user with a log line
like:

```
2020-05-21T15:21:56.996Z [ERROR] nomad.fsm: ClusterSetMetadata failed: error=""set cluster metadata failed: refusing to set new cluster id, previous: , new: <<redacted>
```
2020-05-29 13:34:21 -04:00
Drew Bailey 23d24c7a7f
removes pro tags (#8014) 2020-05-28 15:40:17 -04:00
Mahmood Ali 475b3b77ad
Merge pull request #8060 from hashicorp/tests-deflake-20200526
Deflake some tests - 2020-05-27 edition
2020-05-27 15:24:31 -04:00
Drew Bailey 34871f89be
Oss license support for ent builds (#8054)
* changes necessary to support oss licesning shims

revert nomad fmt changes

update test to work with enterprise changes

update tests to work with new ent enforcements

make check

update cas test to use scheduler algorithm

back out preemption changes

add comments

* remove unused method
2020-05-27 13:46:52 -04:00
Mahmood Ali 61e4f5aaf9 tests: use GreaterOrEqual and apply change to other tests 2020-05-27 11:22:48 -04:00
Mahmood Ali 6dfe0f5d3b tests: use t.Fatalf when it's clearer 2020-05-27 10:09:56 -04:00