open-nomad

Author	SHA1	Message	Date
Michael Dwan	ba70c54340	fix panic while deleting CSI plugins for missing job (#7758 )	2020-04-20 17:13:33 -04:00
Seth Hoenig	40e0f8a346	Merge pull request #7690 from hashicorp/b-inspect-proxy-output two fixes for inspect on connect proxy	2020-04-20 10:17:54 -06:00
Anthony Scalisi	9664c6b270	fix spelling errors (#6985 )	2020-04-20 09:28:19 -04:00
Seth Hoenig	d5ad580d5c	structs: fix compatibility between api and nomad/structs proxy definitions The field names within the structs representing the Connect proxy definition were not the same (nomad/structs/ vs api/), causing the values to be lost in translation for the 'nomad job inspect' command. Since the field names already shipped in v0.11.0 we cannot simply fix the names. Instead, use the json struct tag on the structs/ structs to remap the name to match the publicly expose api/ package on json encoding. This means existing jobs from v0.11.0 will continue to work, and the JSON API for job submission will remain backwards compatible.	2020-04-13 15:59:45 -06:00
Tim Gross	4e9bd1e1d1	refactor: consolidate private methods for CSI RPC (#7702 ) Follow-up for a method missed in the refactor for #7688. The `volAndPluginLookup` method is only ever called from the server's `CSI` RPC and never the `ClientCSI` RPC, so move it into that scope.	2020-04-13 10:46:43 -04:00
Tim Gross	f37e986b1b	refactor: make nodeForControllerPlugin private to ClientCSI (#7688 ) The current design of `ClientCSI` RPC requires that callers in the server know about the free-standing `nodeForControllerPlugin` function. This makes it difficult to send `ClientCSI` RPC messages from subpackages of `nomad` and adds a bunch of boilerplate to every server-side caller of a controller RPC. This changeset makes it so that the `ClientCSI` RPCs will populate and validate the controller's client node ID if it hasn't been passed by the caller, centralizing the logic of picking and validating controller targets into the `nomad.ClientCSI` struct.	2020-04-10 16:47:21 -04:00
Seth Hoenig	20802da8fd	connect: correctly deal with nil sidecar_service task stanza Before, if the sidecar_service stanza of a connect enabled service was missing, the job submission would cause a panic in the nomad agent. Since the panic was happening in the API handler the agent itself continued running, but this change will handle the condition more gracefully. By fixing the `Copy` method, the API handler now returns the proper error. $ nomad job run foo.nomad Error submitting job: Unexpected response code: 500 (1 error occurred: * Task group api validation failed: 2 errors occurred: * Missing tasks for task group * Task group service validation failed: 1 error occurred: * Service[0] count-api validation failed: 1 error occurred: * Consul Connect must be native or use a sidecar service	2020-04-09 20:28:17 -06:00
Drew Bailey	4ab7c03641	Merge pull request #7618 from hashicorp/b-shutdown-delay-updates Fixes bug that prevented group shutdown_delay updates	2020-04-06 13:05:20 -04:00
Drew Bailey	0d4bb6bf92	guard against nil maps	2020-04-06 12:25:50 -04:00
Drew Bailey	3b8afce9e6	test added and removed	2020-04-06 11:53:46 -04:00
Drew Bailey	9874e7b21d	Group shutdown delay fixes Group shutdown delay updates were not properly handled in Update hook. This commit also ensures that plan output is displayed.	2020-04-06 11:29:12 -04:00
Tim Gross	73dc2ad443	e2e/csi: add waiting for alloc stop	2020-04-06 10:15:55 -04:00
Tim Gross	027277a0d9	csi: make volume GC in job deregister safely async The `Job.Deregister` call will block on the client CSI controller RPCs while the alloc still exists on the Nomad client node. So we need to make the volume claim reaping async from the `Job.Deregister`. This allows `nomad job stop` to return immediately. In order to make this work, this changeset changes the volume GC so that the GC jobs are on a by-volume basis rather than a by-job basis; we won't have to query the (possibly deleted) job at the time of volume GC. We smuggle the volume ID and whether it's a purge into the GC eval ID the same way we smuggled the job ID previously.	2020-04-06 10:15:55 -04:00
Tim Gross	5a3b45864d	csi: fix unpublish workflow ID mismatches The CSI plugins use the external volume ID for all operations, but the Client CSI RPCs use the Nomad volume ID (human-friendly) for the mount paths. Pass the External ID as an arg in the RPC call so that the unpublish workflows have it without calling back to the server to find the external ID. The controller CSI plugins need the CSI node ID (or in other words, the storage provider's view of node ID like the EC2 instance ID), not the Nomad node ID, to determine how to detach the external volume.	2020-04-06 10:15:55 -04:00
Lang Martin	1750426d04	csi: run volume claim GC on `job stop -purge` (#7615 ) * nomad/state/state_store: error message copy/paste error * nomad/structs/structs: add a VolumeEval to the JobDeregisterResponse * nomad/job_endpoint: synchronously, volumeClaimReap on job Deregister * nomad/core_sched: make volumeClaimReap available without a CoreSched * nomad/job_endpoint: Deregister return early if the job is missing * nomad/job_endpoint_test: job Deregistration is idempotent * nomad/core_sched: conditionally ignore alloc status in volumeClaimReap * nomad/job_endpoint: volumeClaimReap all allocations, even running * nomad/core_sched_test: extra argument to collectClaimsToGCImpl * nomad/job_endpoint: job deregistration is not idempotent	2020-04-03 17:37:26 -04:00
Mahmood Ali	816a93ed4a	tests: deflake TestAutopilot_RollingUpdate I hypothesize that the flakiness in rolling update is due to shutting down s3 server before s4 is properly added as a voter. The chain of the flakiness is as follows: 1. Bootstrap with s1, s2, s3 2. Add s4 3. Wait for servers to register with 3 voting peers * But we already have 3 voters (s1, s2, and s3) * s4 is added as a non-voter in Raft v3 and must wait until autopilot promotes it 4. Test proceeds without s4 being a voter 5. s3 shutdown 6. cluster changes stall due to leader election and too many pending configuration changes (e.g. removing s3 from raft, promoting s4). Here, I have the test wait until s4 is marked as a voter before shutting down s3, so we don't have too many configuration changes at once. In https://circleci.com/gh/hashicorp/nomad/57092, I noticed the following events: ``` TestAutopilot_RollingUpdate: autopilot_test.go:204: adding server s4 TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO] nomad/serf.go:60: nomad: adding server: server="nomad-137.global (Addr: 127.0.0.1:9177) (DC: dc1)" TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.789Z [INFO] raft/raft.go:1018: nomad.raft: updating configuration: command=AddNonvoter server-id=c54b5bf4-1159-34f6-032d-56aefeb08425 server-addr=127.0.0.1:9177 servers="[{Suffrage:Voter ID:df01ba65-d1b2-17a9-f792-a4459b3a7c09 Address:127.0.0.1:9171} {Suffrage:Voter ID:c3337778-811e-2675-87f5-006309888387 Address:127.0.0.1:9173} {Suffrage:Voter ID:186d5e15-c473-e2b3-b5a4-3259a84e10ef Address:127.0.0.1:9169} {Suffrage:Nonvoter ID:c54b5bf4-1159-34f6-032d-56aefeb08425 Address:127.0.0.1:9177}]" TestAutopilot_RollingUpdate: autopilot_test.go:218: shutting down server s3 TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.797Z [INFO] raft/replication.go:456: nomad.raft: aborting pipeline replication: peer="{Nonvoter c54b5bf4-1159-34f6-032d-56aefeb08425 127.0.0.1:9177}" TestAutopilot_RollingUpdate: 
autopilot_test.go:235: waiting for s4 to stabalize and be promoted TestAutopilot_RollingUpdate: testlog.go:34: 2020-04-03T20:08:19.975Z [ERROR] raft/raft.go:1656: nomad.raft: failed to make requestVote RPC: target="{Voter c3337778-811e-2675-87f5-006309888387 127.0.0.1:9173}" error="dial tcp 127.0.0.1:9173: connect: connection refused" TestAutopilot_RollingUpdate: retry.go:121: autopilot_test.go:241: don't want "c3337778-811e-2675-87f5-006309888387" autopilot_test.go:241: didn't find map[c54b5bf4-1159-34f6-032d-56aefeb08425:true] in []raft.ServerID{"df01ba65-d1b2-17a9-f792-a4459b3a7c09", "186d5e15-c473-e2b3-b5a4-3259a84e10ef"} ``` Note how s3, c3337778, is present in the peers list in the final failure, but s4, c54b5bf4, is added as a Nonvoter and isn't present in the final peers list.	2020-04-03 17:15:41 -04:00
Mahmood Ali	5587dc58c0	Use lowercase for hcl keys This is not a change in behavior; hcl key matching is case insensitive, as demonstrated in `command.agent/TestConfig_Parse`	2020-04-03 07:56:00 -04:00
Tim Gross	f6b3d38eb8	CSI: move node unmount to server-driven RPCs (#7596 ) If a volume-claiming alloc stops and the CSI Node plugin that serves that alloc's volumes is missing, there's no way for the allocrunner hook to send the `NodeUnpublish` and `NodeUnstage` RPCs. This changeset addresses this issue with a redesign of the client-side for CSI. Rather than unmounting in the alloc runner hook, the alloc runner hook will simply exit. When the server gets the `Node.UpdateAlloc` for the terminal allocation that had a volume claim, it creates a volume claim GC job. This job will make client RPCs to a new node plugin RPC endpoint, and only once that succeeds, move on to making the client RPCs to the controller plugin. If the node plugin is unavailable, the GC job will fail and be requeued.	2020-04-02 16:04:56 -04:00
Nick Ethier	3557099f4c	Merge pull request #7594 from hashicorp/f-connect-lifecycle connect: set task lifecycle config for injected sidecar task	2020-04-02 12:51:01 -04:00
Lang Martin	24449e23af	csi: volume validate namespace (#7587 ) * nomad/state/state_store: enforce that the volume namespace exists * nomad/csi_endpoint_test: a couple of broken namespaces now * nomad/csi_endpoint_test: one more test * nomad/node_endpoint_test: use structs.DefaultNamespace * nomad/state/state_store_test: use DefaultNamespace	2020-04-02 10:13:41 -04:00
Nick Ethier	90b5d2b13f	lint: gofmt	2020-04-01 21:23:47 -04:00
Nick Ethier	92f8bfc729	connect: set task lifecycle config for injected sidecar task fixes #7593	2020-04-01 21:19:41 -04:00
Chris Baker	c3ab837d9e	job_endpoint: fixed bad test	2020-04-01 18:11:58 +00:00
Chris Baker	285728f3fa	Merge branch 'f-7422-scaling-events' of github.com:hashicorp/nomad into f-7422-scaling-events	2020-04-01 17:28:50 +00:00
Chris Baker	8ec252e627	added indices to the job scaling events, so we could properly do blocking queries on the job scaling status	2020-04-01 17:28:19 +00:00
Chris Baker	4ac36b7c89	Update nomad/state/state_store.go Co-Authored-By: Drew Bailey <2614075+drewbailey@users.noreply.github.com>	2020-04-01 11:56:12 -05:00
Chris Baker	eb19fe16d2	Update nomad/state/state_store.go Co-Authored-By: Drew Bailey <2614075+drewbailey@users.noreply.github.com>	2020-04-01 11:56:01 -05:00
Chris Baker	6dbfb36e14	Update nomad/job_endpoint.go Co-Authored-By: Drew Bailey <2614075+drewbailey@users.noreply.github.com>	2020-04-01 11:55:55 -05:00
Chris Baker	b2ab42afbb	scaling api: more testing around the scaling events api	2020-04-01 16:39:23 +00:00
Chris Baker	40d6b3bbd1	adding raft and state_store support to track job scaling events updated ScalingEvent API to record "message string,error bool" instead of confusing "reason,error *string"	2020-04-01 16:15:14 +00:00
Mahmood Ali	37c0dbcfe6	fix codegen for ugorji/go When generating ugorji/go package, we should use github.com/hashicorp/go-msgpack/codec instead. Also fix the reference for codegen_generated	2020-03-31 21:30:21 -04:00
Seth Hoenig	9880e798bf	docs: note why check.Expose is not part of check.Hash	2020-03-31 17:15:50 -06:00
Seth Hoenig	14c7cebdea	connect: enable automatic expose paths for individual group service checks Part of #6120 Building on the support for enabling connect proxy paths in #7323, this change adds the ability to configure the 'service.check.expose' flag on group-level service check definitions for services that are connect-enabled. This is a slight deviation from the "magic" that Consul provides. With Consul, the 'expose' flag exists on the connect.proxy stanza, which will then auto-generate expose paths for every HTTP and gRPC service check associated with that connect-enabled service. A first attempt at providing similar magic for Nomad's Consul Connect integration followed that pattern exactly, as seen in #7396. However, on reviewing the PR we realized having the `expose` flag on the proxy stanza inseparably ties together the automatic path generation with every HTTP/gRPC check defined on the service. This makes sense in Consul's context, because a service definition is reasonably associated with a single "task". With Nomad's group level service definitions however, there is a reasonable expectation that a service definition is more abstractly representative of multiple services within the task group. In this case, one would want to define checks of that service which concretely make HTTP or gRPC requests to different underlying tasks. Such a model is not possible with the coarse `proxy.expose` flag. Instead, we now have the flag made available within the check definitions themselves. By making the expose feature specific to each check, it is possible to have some HTTP/gRPC checks which make use of the envoy exposed paths, as well as some HTTP/gRPC checks which make use of some orthogonal port-mapping to do checks on some other task (or even some other bound port of the same task) within the task group. 
Given this example, group "server-group" { network { mode = "bridge" port "forchecks" { to = -1 } } service { name = "myserver" port = 2000 connect { sidecar_service { } } check { name = "mycheck-myserver" type = "http" port = "forchecks" interval = "3s" timeout = "2s" method = "GET" path = "/classic/responder/health" expose = true } } } Nomad will automatically inject (via job endpoint mutator) the extrapolated expose path configuration, i.e. expose { path { path = "/classic/responder/health" protocol = "http" local_path_port = 2000 listener_port = "forchecks" } } Documentation is coming in #7440 (needs updating, doing next) Modifications to the `countdash` examples in https://github.com/hashicorp/demo-consul-101/pull/6 which will make the examples in the documentation actually runnable. Will add some e2e tests based on the above when it becomes available.	2020-03-31 17:15:50 -06:00
Seth Hoenig	0266f056b8	connect: enable proxy.passthrough configuration Enable configuration of HTTP and gRPC endpoints which should be exposed by the Connect sidecar proxy. This changeset is the first "non-magical" pass that lays the groundwork for enabling Consul service checks for tasks running in a network namespace because they are Connect-enabled. The changes here provide for full configuration of the connect { sidecar_service { proxy { expose { paths = [{ path = <exposed endpoint> protocol = <http or grpc> local_path_port = <local endpoint port> listener_port = <inbound mesh port> }, ... ] } } } stanza. Everything from `expose` and below is new, and partially implements the precedent set by Consul: https://www.consul.io/docs/connect/registration/service-registration.html#expose-paths-configuration-reference Combined with a task-group level network port-mapping in the form: port "exposeExample" { to = -1 } it is now possible to "punch a hole" through the network namespace to a specific HTTP or gRPC path, with the anticipated use case of creating Consul checks on Connect enabled services. A future PR may introduce more automagic behavior, where we can do things like 1) auto-fill the 'expose.path.local_path_port' with the default value of the 'service.port' value for task-group level connect-enabled services. 2) automatically generate a port-mapping 3) enable an 'expose.checks' flag which automatically creates exposed endpoints for every compatible consul service check (http/grpc checks on connect enabled services).	2020-03-31 17:15:27 -06:00
Lang Martin	e03c328792	csi: use node MaxVolumes during scheduling (#7565 ) * nomad/state/state_store: CSIVolumesByNodeID ignores namespace * scheduler/scheduler: add CSIVolumesByNodeID to the state interface * scheduler/feasible: check node MaxVolumes * nomad/csi_endpoint: no namespace in CSIVolumesByNodeID anymore * nomad/state/state_store: avoid DenormalizeAllocationSlice * nomad/state/iterator: clean up SliceIterator Next * scheduler/feasible_test: block with MaxVolumes * nomad/state/state_store_test: fix args to CSIVolumesByNodeID	2020-03-31 17:16:47 -04:00
Lang Martin	8d4f39fba1	csi: add node events to report progress mounting and unmounting volumes (#7547 ) * nomad/structs/structs: new NodeEventSubsystemCSI * client/client: pass triggerNodeEvent in the CSIConfig * client/pluginmanager/csimanager/instance: add eventer to instanceManager * client/pluginmanager/csimanager/manager: pass triggerNodeEvent * client/pluginmanager/csimanager/volume: node event on [un]mount * nomad/structs/structs: use storage, not CSI * client/pluginmanager/csimanager/volume: use storage, not CSI * client/pluginmanager/csimanager/volume_test: eventer * client/pluginmanager/csimanager/volume: event on error * client/pluginmanager/csimanager/volume_test: check event on error * command/node_status: remove an extra space in event detail format * client/pluginmanager/csimanager/volume: use snake_case for details * client/pluginmanager/csimanager/volume_test: snake_case details	2020-03-31 17:13:52 -04:00
Yoan Blanc	225c9c1215	fixup! vendor: explicit use of hashicorp/go-msgpack Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-31 09:48:07 -04:00
Yoan Blanc	761d014071	vendor: explicit use of hashicorp/go-msgpack Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-03-31 09:45:21 -04:00
Michael Schurter	464dae514c	test: assert HostVolumes included in ListNodes	2020-03-30 17:34:44 -07:00
Michael Lange	4707a625d6	Add HostVolumes to the NodeListStub	2020-03-30 17:33:43 -07:00
Seth Hoenig	b3664c628c	Merge pull request #7524 from hashicorp/docs-consul-acl-minimums consul: annotate Consul interfaces with ACLs	2020-03-30 13:27:27 -06:00
Seth Hoenig	0a812ab689	consul: annotate Consul interfaces with ACLs	2020-03-30 10:17:28 -06:00
Tim Gross	54b3573fc9	state: support snapshot of CSI plugin and volume tables (#7546 ) The `csi_plugins` and `csi_volumes` tables were missing support for snapshot persist and restore. This means restoring a snapshot would result in missing information for CSI.	2020-03-30 11:17:16 -04:00
Drew Bailey	a98dc8c768	update audit examples to an endpoint that is audited	2020-03-30 10:03:11 -04:00
Mahmood Ali	e76ff9f679	Merge pull request #7543 from hashicorp/test-flakiness-20200330_1 Test flakiness fixes - 2020-03-30 Edition	2020-03-30 09:26:26 -04:00
Mahmood Ali	57bebfdb5c	tests: avoid logging after test completion	2020-03-30 09:08:34 -04:00
Mahmood Ali	13381448e0	avoid logging in draining job watcher In tests where the logger is a test logger, emitting a trace log in a background thread while it's shutting down may trigger a panic. Thus avoid logging Trace if err != nil. Note that we already log an error when err isn't a trace. This fixes cases where tests panic with a trace like: ``` panic: Log in goroutine after TestAllocGarbageCollector_MakeRoomFor_MaxAllocs has completed goroutine 30 [running]: testing.(common).logDepth(0xc000aa9e60, 0xc000c4a000, 0xab, 0x3) /usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:680 +0x4d3 testing.(common).log(...) /usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:662 testing.(common).Logf(0xc000aa9e60, 0x690b941, 0x4, 0xc001366c00, 0x2, 0x2) /usr/local/Cellar/go/1.14/libexec/src/testing/testing.go:701 +0x7e github.com/hashicorp/nomad/helper/testlog.(writer).Write(0xc000a82a60, 0xc0000b48c0, 0xab, 0x13f, 0x0, 0x0, 0x0) /Users/notnoop/go/src/github.com/hashicorp/nomad/helper/testlog/testlog.go:34 +0x106 github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(writer).Flush(0xc000a80900, 0xbf9870f000000001, 0x20a87556e, 0x8b12bc0) /Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/writer.go:29 +0x14f github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(intLogger).log(0xc000e2c180, 0xc0003b6880, 0x17, 0x1, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6) /Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:139 +0x15d github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(intLogger).Trace(0xc000e2c180, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6) /Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/intlogger.go:446 +0x7a github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog.(interceptLogger).Trace(0xc0002f1ad0, 0x6974edc, 0x22, 0xc000db57a0, 0x6, 0x6) 
/Users/notnoop/go/src/github.com/hashicorp/nomad/vendor/github.com/hashicorp/go-hclog/interceptlogger.go:48 +0x9c github.com/hashicorp/nomad/nomad/drainer.(*drainingJobWatcher).watch(0xc0002f2380) /Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:147 +0x1125 created by github.com/hashicorp/nomad/nomad/drainer.NewDrainingJobWatcher /Users/notnoop/go/src/github.com/hashicorp/nomad/nomad/drainer/watch_jobs.go:89 +0x1e3 FAIL github.com/hashicorp/nomad/client 10.605s FAIL ```	2020-03-30 07:06:53 -04:00
Mahmood Ali	36ad8ee2e0	tests: add debugging for TestAutopilot_RollingUpdate	2020-03-30 07:06:53 -04:00
Chris Baker	d6287c43b9	clean up some tests	2020-03-29 23:38:36 +00:00
Chris Baker	5e3c38be2f	state_store: * added method to retrieve all scaling policies for use in snapshotting, plus test * better testing for ScalingPoliciesByNamespace * added scaling policy snapshot persist and restore (and test of restore) manually tested snapshot restore. resolves #7539	2020-03-29 13:32:44 +00:00


3232 commits