Commit Graph

3441 Commits

Author SHA1 Message Date
Seth Hoenig 6fc63ede76
Merge pull request #7733 from jorgemarey/b-vault-policies
Fix get all vault token policies
2020-07-09 10:05:59 -05:00
Seth Hoenig f023df7b68
Merge pull request #8392 from hashicorp/f-infer-cn-taskname
consul/connect: infer task name for native service if possible
2020-07-08 14:17:25 -05:00
Seth Hoenig 5be1679b86
Merge pull request #8338 from jorgemarey/b-fix-sidecar-task
Change connectDriverConfig to be a func
2020-07-08 14:00:27 -05:00
Seth Hoenig 1a75da0ce0 consul/connect: infer task name in service if possible
Before, the service definition for a Connect Native service would always
require setting the `service.task` parameter. Now, that parameter is
automatically inferred when there is only one task in the task group.

Fixes #8274
2020-07-08 13:31:44 -05:00
Tim Gross ec96ddf648
fix swapped old/new multiregion plan diffs (#8378)
The multiregion plan diffs swap the old and new versions for each region when
they're edited (rather than added/removed). The `multiregionRegionDiff`
function call incorrectly reversed its arguments for existing regions.
2020-07-08 10:10:50 -04:00
Jorge Marey a3740cba9b Change connectDriverConfig to be a func 2020-07-07 08:59:59 +02:00
Nick Ethier e0fb634309
ar: support opting into binding host ports to default network IP (#8321)
* ar: support opting into binding host ports to default network IP

* fix config plumbing

* plumb node address into network resource

* struct: only handle network resource upgrade path once
2020-07-06 18:51:46 -04:00
Chris Baker 5b96c3d50e
Merge pull request #8360 from hashicorp/b-8355-better-scaling-validation
better error handling around Scaling->Max
2020-07-06 11:32:02 -05:00
Chris Baker 5aa46e9a8f modified state store to allow version skipping, to support multiregion version syncing also, passing existing version into multiregionRegister to support this 2020-07-06 14:16:55 +00:00
Lars Lehtonen f32e80175d
nomad: fix dropped test error (#8356) 2020-07-06 08:46:54 -04:00
Chris Baker a77e012220 better testing of scaling parsing, fixed some broken tests by api
changes
2020-07-04 19:32:37 +00:00
Chris Baker 9100b6b7c0 changes to make sure that Max is present and valid, to improve error messages
* made api.Scaling.Max a pointer, so we can detect (and complain) when it is neglected
* added checks to HCL parsing that it is present
* when Scaling.Max is absent/invalid, don't return extraneous error messages during validation
* tweak to multiregion handling to ensure that the count is valid on the interpolated regional jobs

resolves #8355
2020-07-04 19:05:50 +00:00
Lang Martin 6c22cd587d
api: `nomad debug` new /agent/host (#8325)
* command/agent/host: collect host data, multi platform

* nomad/structs/structs: new HostDataRequest/Response

* client/agent_endpoint: add RPC endpoint

* command/agent/agent_endpoint: add Host

* api/agent: add the Host endpoint

* nomad/client_agent_endpoint: add Agent Host with forwarding

* nomad/client_agent_endpoint: use findClientConn

This changes forwardMonitorClient and forwardProfileClient to use
findClientConn, which was cribbed from the common parts of those
funcs.

* command/debug: call agent hosts

* command/agent/host: eliminate calling external programs
2020-07-02 09:51:25 -04:00
Tim Gross 23be116da0
csi: add -force flag to volume deregister (#8295)
The `nomad volume deregister` command currently returns an error if the volume
has any claims, but in cases where the claims can't be dropped because of
plugin errors, providing a `-force` flag gives the operator an escape hatch.

If the volume has no allocations or if they are all terminal, this flag
deletes the volume from the state store, immediately and implicitly dropping
all claims without further CSI RPCs. Note that this will not also
unmount/detach the volume, which we'll make the responsibility of a separate
`nomad volume detach` command.
2020-07-01 12:17:51 -04:00
Mahmood Ali 7f460d2706 allocrunner: terminate sidecars in the end
This fixes a bug where a batch allocation fails to complete if it has
sidecars.

If the only remaining running tasks in an allocations are sidecars - we
must kill them and mark the allocation as complete.
2020-06-29 15:12:15 -04:00
Drew Bailey 01e2cc5054
allow ClusterMetadata to accept a watchset (#8299)
* allow ClusterMetadata to accept a watchset

* use nil instead of empty watchset
2020-06-26 13:23:32 -04:00
Mahmood Ali 49a177ce28
Merge pull request #8017 from hashicorp/f-change-sched-updated
Set Updated to true for all non-CAS requests on v1/operator/scheduler/configuration
2020-06-26 08:39:37 -04:00
Mahmood Ali 6605ebd314
Merge pull request #8223 from hashicorp/f-multi-network-validate-ports
core: validate port numbers are < 65535
2020-06-26 08:31:01 -04:00
Nick Ethier 89118016fc
command: correctly show host IP in ports output /w multi-host networks (#8289) 2020-06-25 15:16:01 -04:00
Tim Gross 67ffcb35e9
multiregion: add support for 'job plan' (#8266)
Add a scatter-gather for multiregion job plans. Each region's servers
interpolate the plan locally in `Job.Plan` but don't distribute the plan as
done in `Job.Run`.

Note that it's not possible to return a usable modify index from a multiregion
plan for use with `-check-index`. Even if we were to force the modify index to
be the same at the start of `Job.Run` the index immediately drifts during each
region's deployments, depending on events local to each region. So we omit
this section of a multiregion plan.
2020-06-24 13:24:55 -04:00
Tim Gross a449009e9f
multiregion validation fixes (#8265)
Multi-region jobs need to bypass validating counts otherwise we get spurious
warnings in Job.Plan.
2020-06-24 12:18:51 -04:00
Seth Hoenig 3872b493e5
Merge pull request #8011 from hashicorp/f-cnative-host
consul/connect: implement initial support for connect native
2020-06-24 10:33:12 -05:00
Seth Hoenig 011c6b027f connect/native: doc and comment tweaks from PR 2020-06-24 10:13:22 -05:00
Michael Schurter 7869ebc587 docs: add comments to structs.Port struct 2020-06-23 11:38:01 -07:00
Michael Schurter 13ed710a04 core: validate port numbers are <= 65535
The scheduler returns a very strange error if it detects a port number
out of range. If these would somehow make it to the client they would
overflow when converted to an int32 and could cause conflicts.
2020-06-23 11:31:49 -07:00
Seth Hoenig 6c5ab7f45e consul/connect: split connect native flag and task in service 2020-06-23 10:22:22 -05:00
Seth Hoenig 4d71f22a11 consul/connect: add support for running connect native tasks
This PR adds the capability of running Connect Native Tasks on Nomad,
particularly when TLS and ACLs are enabled on Consul.

The `connect` stanza now includes a `native` parameter, which can be
set to the name of task that backs the Connect Native Consul service.

There is a new Client configuration parameter for the `consul` stanza
called `share_ssl`. Like `allow_unauthenticated` the default value is
true, but recommended to be disabled in production environments. When
enabled, the Nomad Client's Consul TLS information is shared with
Connect Native tasks through the normal Consul environment variables.
This does NOT include auth or token information.

If Consul ACLs are enabled, Service Identity Tokens are automatically
and injected into the Connect Native task through the CONSUL_HTTP_TOKEN
environment variable.

Any of the automatically set environment variables can be overridden by
the Connect Native task using the `env` stanza.

Fixes #6083
2020-06-22 14:07:44 -05:00
Mahmood Ali 862834a792 testS: add all namespaces test for allocations 2020-06-22 10:26:08 -04:00
Michael Schurter 562704124d
Merge pull request #8208 from hashicorp/f-multi-network
multi-interface network support
2020-06-19 15:46:48 -07:00
Nick Ethier fb9c458df1
nomad/mock: add NodeNetworkResources to mock Node 2020-06-19 14:22:24 -04:00
Nick Ethier a87e91e971
test: fix up testing around host networks 2020-06-19 13:53:31 -04:00
Nick Ethier f0ac1f027a
lint: spelling 2020-06-19 11:29:41 -04:00
Tim Gross b654e1b8a4
multiregion: all regions start in running if no max_parallel (#8209)
If `max_parallel` is not set, all regions should begin in a `running` state
rather than a `pending` state. Otherwise the first region is set to `running`
and then all the remaining regions once it enters `blocked. That behavior is
technically correct in that we have at most `max_parallel` regions running,
but definitely not what a user expects.
2020-06-19 11:17:09 -04:00
Nick Ethier f0559a8162
multi-interface network support 2020-06-19 09:42:10 -04:00
Tim Gross 8a354f828f
store ACL Accessor ID from Job.Register with Job (#8204)
In multiregion deployments when ACLs are enabled, the deploymentwatcher needs
an appropriately scoped ACL token with the same `submit-job` rights as the
user who submitted it. The token will already be replicated, so store the
accessor ID so that it can be retrieved by the leader.
2020-06-19 07:53:29 -04:00
Mahmood Ali 38a01c050e
Merge pull request #8192 from hashicorp/f-status-allnamespaces-2
CLI Allow querying all namespaces for jobs and allocations - Try 2
2020-06-18 20:16:52 -04:00
Nick Ethier 4a44deaa5c CNI Implementation (#7518) 2020-06-18 11:05:29 -07:00
Nick Ethier 0bc0403cc3 Task DNS Options (#7661)
Co-Authored-By: Tim Gross <tgross@hashicorp.com>
Co-Authored-By: Seth Hoenig <shoenig@hashicorp.com>
2020-06-18 11:01:31 -07:00
Mahmood Ali c0aa06d9c7 rpc: allow querying allocs across namespaces
This implements the backend handling for querying across namespaces for
allocation list endpoints.
2020-06-17 16:31:06 -04:00
Mahmood Ali e784fe331a use '*' to indicate all namespaces
This reverts the introduction of AllNamespaces parameter that was merged
earlier but never got released.
2020-06-17 16:27:43 -04:00
Tim Gross 81ae581da6
test: remove flaky test from volumewatcher (#8189)
The volumewatcher restores itself on notification, but detecting this is racy
because it may reap any claim (or find there are no claims to reap) and
shutdown before we can test whether it's running. This appears to have become
flaky with a new version of golang. The other cases in this test case
sufficiently exercise the start/stop behavior of the volumewatcher, so remove
the flaky section.
2020-06-17 15:41:51 -04:00
Chris Baker fe9d654640
Merge pull request #8187 from hashicorp/f-8143-block-scaling-during-deployment
modify Job.Scale RPC to return an error if there is an active deployment
2020-06-17 14:38:55 -05:00
Chris Baker cd903218f7 added changelog entry and satisfied `make check` 2020-06-17 17:43:45 +00:00
Chris Baker ab2b15d8cb modify Job.Scale RPC to return an error if there is an active deployment
resolves #8143
2020-06-17 17:03:35 +00:00
Tim Gross 6b1cb61888 remove test for ent-only behavior 2020-06-17 11:27:29 -04:00
Tim Gross c14a75bfab multiregion: use pending instead of paused
The `paused` state is used as an operator safety mechanism, so that they can
debug a deployment or halt one that's causing a wider failure. By using the
`paused` state as the first state of a multiregion deployment, we risked
resuming an intentionally operator-paused deployment because of activity in a
peer region.

This changeset replaces the use of the `paused` state with a `pending` state,
and provides a `Deployment.Run` internal RPC to replace the use of the
`Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.
2020-06-17 11:06:14 -04:00
Tim Gross fd50b12ee2 multiregion: integrate with deploymentwatcher
* `nextRegion` should take status parameter
* thread Deployment/Job RPCs thru `nextRegion`
* add `nextRegion` calls to `deploymentwatcher`
* use a better description for paused for peer
2020-06-17 11:06:00 -04:00
Tim Gross 7b12445f29 multiregion: change AutoRevert to OnFailure 2020-06-17 11:05:45 -04:00
Tim Gross 5c4d0a73f4 start all but first region deployment in paused state 2020-06-17 11:05:34 -04:00
Tim Gross 48e9f75c1e multiregion: deploymentwatcher hooks
This changeset establishes hooks in deploymentwatcher for multiregion
deployments (for the enterprise version of Nomad).
2020-06-17 11:05:18 -04:00
Tim Gross b09b7a2475 Multiregion job registration
Integration points for multiregion jobs to be registered in the enterprise
version of Nomad:
* hook in `Job.Register` for enterprise to send job to peer regions
* remove monitoring from `nomad job run` and `nomad job stop` for multiregion jobs
2020-06-17 11:04:58 -04:00
Drew Bailey 9263fcb0d3 Multiregion deploy status and job status CLI 2020-06-17 11:03:34 -04:00
Tim Gross 473a0f1d44 multiregion: unblock and cancel RPCs 2020-06-17 11:02:26 -04:00
Tim Gross ede3a4f1c4 multiregion: request structs 2020-06-17 11:00:34 -04:00
Tim Gross 6851024925 Multiregion structs
Initial struct definitions, jobspec parsing, validation, and conversion
between Nomad structs and API structs for multi-region deployments.
2020-06-17 11:00:14 -04:00
Chris Baker 9fc66bc1aa support in API client and Job.Register RPC for PreserveCounts 2020-06-16 18:45:28 +00:00
Chris Baker 1e3563e08c wip: added PreserveCounts to struct.JobRegisterRequest, development test for Job.Register 2020-06-16 18:45:17 +00:00
Chris Baker 7ed06cced0 core: update Job.Scale to save the previous job count in the ScalingEvent 2020-06-15 19:49:22 +00:00
Chris Baker aeb3ed449e wip: added .PreviousCount to api.ScalingEvent and structs.ScalingEvent, with developmental tests 2020-06-15 19:40:21 +00:00
Mahmood Ali c17ffb2d35
Merge pull request #8131 from hashicorp/f-snapshot-restore
Implement snapshot restore
2020-06-15 08:32:34 -04:00
Mahmood Ali 9bfc3e28d9
Apply suggestions from code review
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2020-06-15 08:32:16 -04:00
Lang Martin 069840bef8
scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect (#8105) (#8138)
* scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect

* scheduler/reconcile: thread follupEvalIDs through to results.stop

* scheduler/reconcile: comment typo

* nomad/_test: correct arguments for plan.AppendStoppedAlloc

* scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)
2020-06-09 17:13:53 -04:00
Mahmood Ali 63e048e972 clarify ccomments, esp related to leadership code 2020-06-09 12:01:31 -04:00
Mahmood Ali b543460e0a loosen raft timeout 2020-06-07 16:38:11 -04:00
Mahmood Ali 69bb42acf8 tests: prefix agent logs to identify agent sources 2020-06-07 16:38:11 -04:00
Mahmood Ali 47a163b63f reassert leadership 2020-06-07 15:47:06 -04:00
Mahmood Ali 9eb13ae144 basic snapshot restore 2020-06-07 15:46:23 -04:00
Mahmood Ali bf7a3583e5
Merge pull request #8089 from hashicorp/b-leader-worker-count
leadership: pause and unpause workers consistently
2020-06-04 12:01:01 -04:00
Mahmood Ali cd8e1b4d62
stop periodic dispatch at end of tests (#8111) 2020-06-04 09:15:00 -04:00
Lang Martin ac7c39d3d3
Delayed evaluations for `stop_after_client_disconnect` can cause unwanted extra followup evaluations around job garbage collection (#8099)
* client/heartbeatstop: reversed time condition for startup grace

* scheduler/generic_sched: use `delayInstead` to avoid a loop

Without protecting the loop that creates followUpEvals, a delayed eval
is allowed to create an immediate subsequent delayed eval. For both
`stop_after_client_disconnect` and the `reschedule` block, a delayed
eval should always produce some immediate result (running or blocked)
and then only after the outcome of that eval produce a second delayed
eval.

* scheduler/reconcile: lostLater are different than delayedReschedules

Just slightly. `lostLater` allocs should be used to create batched
evaluations, but `handleDelayedReschedules` assumes that the
allocations are in the untainted set. When it creates the in-place
updates to those allocations at the end, it causes the allocation to
be treated as running over in the planner, which causes the initial
`stop_after_client_disconnect` evaluation to be retried by the worker.
2020-06-03 09:48:38 -04:00
Mahmood Ali 70fbcb99c2 leadership: pause and unpause workers consistently
This fixes a bug where leadership establishment pauses 3/4 of workers
but stepping down unpause only 1/2!
2020-06-01 10:57:53 -04:00
Mahmood Ali 891fb3f8a9 test for paused workers upon leadership revocation 2020-06-01 10:48:42 -04:00
Mahmood Ali de44d9641b
Merge pull request #8047 from hashicorp/f-snapshot-save
API for atomic snapshot backups
2020-06-01 07:55:16 -04:00
Mahmood Ali e37a3312d5 If leadership fails, consider it handled
The callers for `forward` and old implementation expect failures to be
accompanied with a true value!  This fixes the issue and have tests
passing!
2020-05-31 22:06:17 -04:00
Mahmood Ali 30ab9c84e5 more review feedback 2020-05-31 21:39:09 -04:00
Mahmood Ali a73cd01a00
Merge pull request #8001 from hashicorp/f-jobs-list-across-nses
endpoint to expose all jobs across all namespaces
2020-05-31 21:28:03 -04:00
Mahmood Ali 082c085068
Merge pull request #8036 from hashicorp/f-background-vault-revoke-on-restore
Speed up leadership establishment
2020-05-31 21:27:16 -04:00
Mahmood Ali 1af32e65bc clarify rpc consistency readiness comment 2020-05-31 21:26:41 -04:00
Mahmood Ali 0819ea60ea
Apply suggestions from code review
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2020-05-31 21:04:39 -04:00
Mahmood Ali 37c6160b96 Handle nil/empty cluster metadata
Handle case where a snapshot is made before cluster metadata is created.

This fixes a bug where a server may have empty cluster metadata if it
created and installed a Raft snapshot before a new cluster metadata ID is
generated.

This case is very unlikely to arise.  Most likely reason is when
upgrading from an old version slowly where servers may use snapshots
before all servers upgrade.  This happened for a user with a log line
like:

```
2020-05-21T15:21:56.996Z [ERROR] nomad.fsm: ClusterSetMetadata failed: error=""set cluster metadata failed: refusing to set new cluster id, previous: , new: <<redacted>
```
2020-05-29 13:34:21 -04:00
Drew Bailey 23d24c7a7f
removes pro tags (#8014) 2020-05-28 15:40:17 -04:00
Mahmood Ali 475b3b77ad
Merge pull request #8060 from hashicorp/tests-deflake-20200526
Deflake some tests - 2020-05-27 edition
2020-05-27 15:24:31 -04:00
Drew Bailey 34871f89be
Oss license support for ent builds (#8054)
* changes necessary to support oss licesning shims

revert nomad fmt changes

update test to work with enterprise changes

update tests to work with new ent enforcements

make check

update cas test to use scheduler algorithm

back out preemption changes

add comments

* remove unused method
2020-05-27 13:46:52 -04:00
Mahmood Ali 61e4f5aaf9 tests: use GreaterOrEqual and apply change to other tests 2020-05-27 11:22:48 -04:00
Mahmood Ali 6dfe0f5d3b tests: use t.Fatalf when it's clearer 2020-05-27 10:09:56 -04:00
Mahmood Ali ec1fcedb93 tests: node drain events may be duplicated 2020-05-27 08:59:06 -04:00
Mahmood Ali c3c2a85314 tests: wait until clients are in the state store 2020-05-26 18:53:24 -04:00
Mahmood Ali 5d80d2a511 tests: eval may be processed quickly 2020-05-26 18:53:24 -04:00
Mahmood Ali 19141f8103 {volume|deployment}watcher: check for nil batcher 2020-05-26 14:54:27 -04:00
Mahmood Ali 81ac098a22 deploymentwatcher: no batcher when disabling
When disabling deploymentwatcher (at the end of a test), avoid starting a
new update batcher with its new goroutine.
2020-05-26 14:44:47 -04:00
Mahmood Ali ccc89f940a terminate leader goroutines on shutdown
Ensure that nomad steps down (and terminate leader goroutines) on
shutdown, when the server is the leader.

Without this change, `monitorLeadership` may handle `shutdownCh` event
and exit early before handling the raft `leaderCh` event and end up
leaking leadership goroutines.
2020-05-26 10:18:10 -04:00
Mahmood Ali e671913e56 fix a trace logline 2020-05-26 10:18:09 -04:00
Mahmood Ali 1c79c3b93d refactor: context is first parameter
By convention, go functions take `context.Context` as the first
argument.
2020-05-26 10:18:09 -04:00
Mahmood Ali 1eff8b0ed8 volumewatcher: no batcher when disabling
When disabling volumewatcher (at the end of a test), avoid starting a
new update batcher with its new goroutine.
2020-05-26 10:18:09 -04:00
Mahmood Ali b895cef622 always set purgeFunc
purgeFunc cannot be nil, so ensure it's set to a no-op function in
tests.
2020-05-21 21:05:53 -04:00
Mahmood Ali 2108681c1d Endpoint for snapshotting server state 2020-05-21 20:04:38 -04:00
Mahmood Ali fbe140b26c vault: ensure ttl expired tokens are purge
If a token is scheduled for revocation expires before we revoke it,
ensure that it is marked as purged in raft and is only removed from
local vault state if the purge operation succeeds.

Prior to this change, we may remove the accessor from local state but
not purge it from Raft.  This causes unnecessary and churn in the next
leadership elections (and until 0.11.2 result in indefinite retries).
2020-05-21 19:54:50 -04:00
Mahmood Ali aa8e79e55b Reorder leadership handling
Start serving RPC immediately after leader components are enabled, and
move clean up to the bottom as they don't block leadership
responsibilities.
2020-05-21 08:30:31 -04:00
Mahmood Ali 1cf1114627 apply the same change to consul revocation 2020-05-21 08:30:31 -04:00
Mahmood Ali 1399d02f45 rate limit revokeDaemon 2020-05-21 08:30:31 -04:00