open-nomad

Commit Graph

Author	SHA1	Message	Date
Drew Bailey	86d9e1ff90	Merge pull request #9955 from hashicorp/on-update-services Service and Check on_update configuration option (readiness checks)	2021-02-24 10:11:05 -05:00
Tim Gross	b764f52ab9	deploymentwatcher: reset progress deadline on promotion (#10042 ) In a deployment with two groups (ex. A and B), if group A's canary becomes healthy before group B's, the deadline for the overall deployment will be set to that of group A. When the deployment is promoted, if group A is done it will not contribute to the next deadline cutoff. Group B's old deadline will be used instead, which will be in the past and immediately trigger a deployment progress failure. Reset the progress deadline when the job is promoted to avoid this bug, and to better conform with implicit user expectations around how the progress deadline should interact with promotions.	2021-02-22 16:44:03 -05:00
James Rasell	6553cc3da7	drainer: fix error message when handling drain deadlined nodes.	2021-02-18 11:45:44 +01:00
AndrewChubatiuk	cd152643fb	fixed connect port label	2021-02-13 02:42:14 +02:00
AndrewChubatiuk	3d0aa2ef56	allocate sidecar task port on host_network interface	2021-02-13 02:42:13 +02:00
Nick Ethier	fcc1f4c805	Merge pull request #9946 from hashicorp/b-9477 structs: namespace port validation by host_network	2021-02-11 12:53:28 -05:00
Seth Hoenig	45e0e70a50	consul/connect: enable custom sidecars to use expose checks This PR enables jobs configured with a custom sidecar_task to make use of the `service.expose` feature for creating checks on services in the service mesh. Before we would check that sidecar_task had not been set (indicating that something other than envoy may be in use, which would not support envoy's expose feature). However Consul has not added support for anything other than envoy and probably never will, so having the restriction in place seems like an unnecessary hindrance. If Consul ever does support something other than Envoy, they will likely find a way to provide the expose feature anyway. Fixes #9854	2021-02-09 10:49:37 -06:00
Drew Bailey	8507d54e3b	e2e test for on_update service checks check_restart not compatible with on_update=ignore reword caveat	2021-02-08 08:32:40 -05:00
Drew Bailey	82f971f289	OnUpdate configuration for services and checks Allow for readiness type checks by configuring nomad to ignore warnings or errors reported by a service check. This allows the deployment to progress while Consul handles introducing the service into a resource pool once the check passes.	2021-02-08 08:32:40 -05:00
Nick Ethier	eacc4da499	Merge branch 'master' into b-9477	2021-02-05 11:58:13 -05:00
Tim Gross	eb3dd17fb2	volumes: implement plan diff for volume requests The details of host volume and CSI volume requests do not show up in `nomad plan` outputs, although the updates are detected by the scheduler and result in an update as expected.	2021-02-04 16:55:17 -05:00
Chris Baker	ebbb760ec4	support for scaling_policy in global prefix search	2021-02-03 19:26:57 +00:00
Nick Ethier	43a4d72fda	structs: namespace port validation by host_network	2021-02-02 14:56:52 -05:00
Seth Hoenig	720780992c	consul/connect: copy bind address map if empty This parameter is now supposed to be non-nil even if empty, and the Copy method should also maintain that invariant.	2021-01-25 10:36:04 -06:00
Seth Hoenig	1ad219c441	consul/connect: remove debug line	2021-01-25 10:36:04 -06:00
Seth Hoenig	8b05efcf88	consul/connect: Add support for Connect terminating gateways This PR implements Nomad built-in support for running Consul Connect terminating gateways. Such a gateway can be used by services running inside the service mesh to access "legacy" services running outside the service mesh while still making use of Consul's service identity based networking and ACL policies. https://www.consul.io/docs/connect/gateways/terminating-gateway These gateways are declared as part of a task group level service definition within the connect stanza. service { connect { gateway { proxy { // envoy proxy configuration } terminating { // terminating-gateway configuration entry } } } } Currently Envoy is the only supported gateway implementation in Consul. The gateway task can be customized by configuring the connect.sidecar_task block. When the gateway.terminating field is set, Nomad will write/update the Configuration Entry into Consul on job submission. Because CEs are global in scope and there may be more than one Nomad cluster communicating with Consul, there is an assumption that any terminating gateway defined in Nomad for a particular service will be the same among Nomad clusters. Gateways require Consul 1.8.0+, checked by a node constraint. Closes #9445	2021-01-25 10:36:04 -06:00
Drew Bailey	007158ee75	ignore setting job summary when oldstatus == newstatus (#9884 )	2021-01-25 10:34:27 -05:00
Drew Bailey	630babb886	prevent double job status update (#9768 ) * Prevent Job Statuses from being calculated twice https://github.com/hashicorp/nomad/pull/8435 introduced atomic eval insertion with job (de-)registration. This change removes a now obsolete guard which checked if the index was equal to the job.CreateIndex, which would empty the status. Now that the job registration eval insertion is atomic with the registration this check is no longer necessary to set the job statuses correctly. * test to ensure only single job event for job register * periodic e2e * separate job update summary step * fix updatejobstability to use copy instead of modified reference of job * update envoygatewaybindaddresses copy to prevent job diff on null vs empty * set ConsulGatewayBindAddress to empty map instead of nil fix nil assertions for empty map rm unnecessary guard	2021-01-22 09:18:17 -05:00
Kris Hicks	8f9e47a8e7	Clean up Task Validation tests (#9833 ) Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>	2021-01-21 11:53:02 -08:00
Dennis Schön	3eaf1432aa	validate connect block allowed only within group.service	2021-01-20 14:34:23 -05:00
Seth Hoenig	f213b8c51b	consul/connect: always set gateway proxy default timeout If the connect.proxy stanza is left unset, the connection timeout value is not set but is assumed to be, and may cause a non-fatal NPE on job submission.	2021-01-19 11:23:41 -06:00
Kris Hicks	d71a90c8a4	Fix some errcheck errors (#9811 ) * Throw away result of multierror.Append When given a multierror.Error, it is mutated, therefore the return value is not needed. Simplify MergeMultierrorWarnings, use StringBuilder * Hash.Write() never returns an error * Remove error that was always nil * Remove error from Resources.Add signature When this was originally written it could return an error, but that was refactored away, and callers of it as of today never handle the error. * Throw away results of io.Copy during Bridge * Handle errors when computing node class in test	2021-01-14 12:46:35 -08:00
Kris Hicks	39e369c3bb	csi: Return error when deleting node (#9803 ) In this change we'll properly return the error in the CSIPluginTypeMonolith case (which is the type given in DeleteNode()), and also return the error when the given ID is not found. This was found via errcheck.	2021-01-14 12:44:50 -08:00
Kris Hicks	abb8f2ebc0	Refactor Job.Scale() (#9771 )	2021-01-14 12:40:42 -08:00
Kris Hicks	f77ffb3b5b	Add missing sink.Cancel() in fsm (#9818 )	2021-01-14 12:39:20 -08:00
Drew Bailey	03a9541822	ignore poststop task in alloc health tracker (#9548 ), fixes #9361 * investigating where to ignore poststop task in alloc health tracker * ignore poststop when setting latest start time for allocation * clean up logic * lifecycle: isolate mocks for poststop deployment test * lifecycle: update comments in tracker Co-authored-by: Jasmine Dahilig <jasmine@dahilig.com>	2021-01-12 10:03:48 -08:00
Chris Baker	3546469205	nicer error message	2021-01-08 21:13:29 +00:00
Chris Baker	d43e0d10c0	appease the linter and fix an incorrect test	2021-01-08 19:38:25 +00:00
Chris Baker	49effd5840	in Job.Scale, ensure that new count is within [min,max] configured in scaling policy resolves #9758	2021-01-08 19:24:36 +00:00
Seth Hoenig	6c9366986b	consul/connect: avoid NPE from unset connect gateway proxy Submitting a job with an ingress gateway in host networking mode with an absent gateway.proxy block would cause the Nomad client to panic on NPE. The consul registration bits would assume the proxy stanza was not nil, but it could be if the user does not supply any manually configured envoy proxy settings. Check the proxy field is not nil before using it. Fixes #9669	2021-01-05 09:27:01 -06:00
Chris Baker	fd6beefe11	simple test to ensure that scaling endpoint methods support IsRead for stale read support	2021-01-05 13:42:18 +00:00
Mahmood Ali	ae0be24abb	tweak bootstrap testing	2021-01-04 09:00:40 -05:00
Mahmood Ali	2ea8ae7584	Only bootstrap when `bootstrap_expect`	2020-12-17 20:06:14 -05:00
Mahmood Ali	421380a300	add a failing test for unexpected bootstrapping	2020-12-17 20:06:14 -05:00
Seth Hoenig	3a3a175e1a	consul/connect: enable configuring custom gateway task Add the ability to configure the Task used for Connect gateways, similar to how sidecar Task can be configured. The implementation here simply re-uses the sidecar_task stanza, and now gets applied whether connect.sidecar_service or connect.gateway is the thing being defined. In retrospect, connect.sidecar_task could have been more generically named like connect.task to make it a little more re-usable. Closes #9474	2020-12-17 08:51:52 -06:00
Seth Hoenig	beaa6359d5	consul/connect: fix regression where client connect images ignored Nomad v1.0.0 introduced a regression where the client configurations for `connect.sidecar_image` and `connect.gateway_image` would be ignored despite being set. This PR restores that functionality. There was a missing layer of interpolation that needs to occur for these parameters. Since Nomad 1.0 now supports dynamic envoy versioning through the ${NOMAD_envoy_version} pseudo variable, we basically need to first interpolate ${connect.sidecar_image} => envoyproxy/envoy:v${NOMAD_envoy_version} then use Consul at runtime to resolve to a real image, e.g. envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.16.0 Of course, if the version of Consul is too old to provide an envoy version preference, we then need to know to fallback to the old version of envoy that we used before. envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09 Beyond that, we also need to continue to support jobs that set the sidecar task themselves, e.g. sidecar_task { config { image: "custom/envoy" } } which itself could include the pseudo envoy version variable.	2020-12-14 09:47:55 -06:00
Drew Bailey	54becaab7d	Events/acl events (#9595 ) * fix acl event creation * allow way to access secretID without exposing it to stream test that values are omitted test event creation test acl events payloads are pointers fix failing tests, do all security steps inside constructor * increase time * ignore empty tokens * uncomment line * changelog	2020-12-11 10:40:50 -05:00
Seth Hoenig	52c9dbbb91	consul/connect: set default Envoy worker threads for gateways Applying the default --concurrency for gateways was missed before. Set the default Envoy concurrency to 1 for connect gateways. The same override value meta.connect.proxy_concurrency applies.	2020-12-10 10:36:29 -06:00
Kris Hicks	0cf9cae656	Apply some suggested fixes from staticcheck (#9598 )	2020-12-10 07:29:18 -08:00
Seth Hoenig	b3d744fea3	Merge pull request #9586 from hashicorp/f-connect-interp consul/connect: interpolate connect block	2020-12-09 13:21:50 -06:00
Kris Hicks	0a3a748053	Add gosimple linter (#9590 )	2020-12-09 11:05:18 -08:00
Seth Hoenig	b51459a879	consul/connect: interpolate connect block This PR enables job submitters to use interpolation in the connect block of jobs making use of consul connect. Before, only the name of the connect service would be interpolated, and only for a few select identifiers related to the job itself (#6853). Now, all connect fields can be interpolated using the full spectrum of runtime parameters. Note that the service name is interpolated at job-submission time, and cannot make use of values known only at runtime. Fixes #7221	2020-12-09 09:10:00 -06:00
Kris Hicks	93155ba3da	Add gocritic to golangci-lint config (#9556 )	2020-12-08 12:47:04 -08:00
Dennis Schön	a9c97d9257	use os.ErrDeadlineExceeded in tests	2020-12-07 10:40:28 -05:00
Drew Bailey	c0bd238eb2	fix allocation spelling error, update docs (#9527 ) * fix allocation spelling error, update docs * assign TopicACLPolicy and TopicACLToken properly	2020-12-04 12:04:58 -05:00
James Rasell	fd53963afb	core: fix typo msg used when job ID/name contains a null char.	2020-12-04 09:49:31 +01:00
Drew Bailey	ce85288f2f	ensure node secret ID is not included in event stream (#9510 )	2020-12-03 12:27:14 -05:00
Drew Bailey	17de8ebcb1	API: Event stream use full name instead of Eval/Alloc (#9509 ) * use full name for events use evaluation and allocation instead of short name * update api event stream package and shortnames * update docs * make sync; fix typo * backwards compat not from 1.0.0-beta event stream api changes * use api types instead of string * rm backwards compat note that only changed between prereleases * remove backwards incompat that only existed in prereleases	2020-12-03 11:48:18 -05:00
Seth Hoenig	3b2b083cbf	Merge pull request #9487 from hashicorp/f-connect-sidecar-concurrency consul/connect: default envoy concurrency to 1	2020-12-01 15:51:41 -06:00
Drew Bailey	f9f5fe8236	Events switch on memdb change table instead of type to prevent duplicates (#9486 ) * prevent duplicate job events when a job is updated, the job_version table is updated with a structs.Job, this caused there to be multiple job events since we are switching off the change type and not the table * test length * add table value to tests	2020-12-01 15:14:05 -05:00
Seth Hoenig	bf857684d1	consul/connect: default envoy concurrency to 1 Previously, every Envoy Connect sidecar would spawn as many worker threads as logical CPU cores. That is Envoy's default behavior when `--concurrency` is not explicitly set. Nomad now sets the concurrency flag to 1, which is sensible for the default cpu = 250 MHz resources allocated for sidecar proxies. The concurrency value can be configured in Client configuration by setting `meta.connect.proxy_concurrency`. Closes #9341	2020-12-01 13:12:45 -06:00
Drew Bailey	1f8e1aa631	pass in msgType for UpsertJob (#9475 )	2020-12-01 14:00:52 -05:00
Michael Schurter	ea0e1789f4	Merge pull request #9435 from hashicorp/f-allocupdate-timer client: always wait 200ms before sending updates	2020-12-01 08:45:17 -08:00
Drew Bailey	9adca240f8	Event Stream: Track ACL changes, unsubscribe on invalidating changes (#9447 ) * upsertaclpolicies * delete acl policies msgtype * upsert acl policies msgtype * delete acl tokens msgtype * acl bootstrap msgtype wip unsubscribe on token delete test that subscriptions are closed after an ACL token has been deleted Start writing policyupdated test * update test to use before/after policy * add SubscribeWithACLCheck to run acl checks on subscribe * update rpc endpoint to use broker acl check * Add and use subscriptions.closeSubscriptionFunc This fixes the issue of not being able to defer unlocking the mutex on the event broker in the for loop. handle acl policy updates * rpc endpoint test for terminating acl change * add comments Co-authored-by: Kris Hicks <khicks@hashicorp.com>	2020-12-01 11:11:34 -05:00
Drew Bailey	70ae7ec621	return potential errors from txn.Commit (#9483 )	2020-12-01 10:05:37 -05:00
Benjamin Buzbee	e0acbbfcc6	Fix RPC retry logic in nomad client's rpc.go for blocking queries (#9266 )	2020-11-30 15:11:10 -05:00
Drew Bailey	a0b7f05a7b	Remove Managed Sinks from Nomad (#9470 ) * Remove Managed Sinks from Nomad Managed Sinks were a beta feature in Nomad 1.0-beta2. During the beta period it was determined that this was not a scalable approach to support community and third party sinks. * update comment * changelog	2020-11-30 14:00:31 -05:00
Seth Hoenig	e81e9223ef	consul/connect: enable setting datacenter in connect upstream Before, upstreams could only be defined using the default datacenter. Now, the `datacenter` field can be set in a connect upstream definition, informing consul of the desire for an instance of the upstream service in the specified datacenter. The field is optional and continues to default to the local datacenter. Closes #8964	2020-11-30 10:38:30 -06:00
Tim Gross	b2cd0da0a2	CSI: fix transaction handling in state store (#9438 ) When making updates to CSI plugins, the state store methods that have open write transactions were querying the state store using the same methods used by the CSI RPC endpoint, but these methods create their own top-level read transactions. During concurrent plugin updates (as happens when a plugin job is stopped), this can cause write skew in the plugin counts. * Refactor the CSIPlugin query methods to have an implementation method that accepts a transaction, which can be called with either a read txn or a write txn. * Refactor the CSIVolume query methods to have an implementation method that accepts a transaction, which can be called with either a read txn or a write txn. * CSI volumes need to be "denormalized" with their plugins and (optionally) allocations. Read-only RPC endpoints should take a snapshot so that we can make multiple state store method calls with a consistent view.	2020-11-25 11:15:57 -05:00
Michael Schurter	9bd1f267d2	nomad: try to avoid slice resizing when batching	2020-11-24 09:14:00 -08:00
Seth Hoenig	74a34704c5	Merge pull request #8743 from hashicorp/f-task_network_warning Validate and document 0.12 mbits/network deprecations	2020-11-23 15:36:18 -06:00
Drew Bailey	c8b1a84d1e	Events/mv structs (#9430 ) * move structs to structs/event.go to avoid import cycle	2020-11-23 14:01:10 -05:00
Seth Hoenig	a35c0db6c7	nomad/structs: validate deprecated task.resource.network port labels Enable users to submit jobs that still make use of the deprecated task.resources.network stanza. Such jobs can be submitted, but will emit a warning.	2020-11-23 12:40:40 -06:00
Nick Ethier	f1ea79f5a8	remove references to default mbits	2020-11-23 10:32:13 -06:00
Nick Ethier	d21cbeb30f	command: remove task network usage from init examples	2020-11-23 10:25:11 -06:00
Nick Ethier	9471892df4	mock: add default host network	2020-11-23 10:11:00 -06:00
Nick Ethier	7266376ae6	nomad: update validate to check group networks for task port usage	2020-11-23 10:11:00 -06:00
Nick Ethier	c4ddb0a43a	website: add mbits and network deprecation notice	2020-11-23 10:09:36 -06:00
Tim Gross	c320c1ba57	CSI: fix struct copying errors (#9239 ) The CSIVolume struct "denormalizes" allocations when it's first queried from the state store. The CSIVolumeByID method on the state store copies the volume before denormalizing so that we don't end up with unexpected changes. The copying has some subtle bugs that meant that Allocations (as well as Topologies and MountOptions) were not getting copied when expected. Also, ensure we never write allocations attached to volumes to the state store during claims.	2020-11-18 10:59:25 -05:00
Seth Hoenig	4cc3c01d5b	Merge pull request #9352 from hashicorp/f-artifact-headers jobspec: add support for headers in artifact stanza	2020-11-13 14:04:27 -06:00
Seth Hoenig	bb8a5816a0	jobspec: add support for headers in artifact stanza This PR adds the ability to set HTTP headers when downloading an artifact from an `http` or `https` resource. The implementation in `go-getter` is such that a new `HTTPGetter` must be created for each artifact that sets headers (as opposed to conveniently setting headers per-request). This PR maintains the memoization of the default Getter objects, creating new ones only for artifacts where headers are set. Closes #9306	2020-11-13 12:03:54 -06:00
Lars Lehtonen	60936f554c	nomad/structs: fix noop breaks (#9348 )	2020-11-13 08:28:11 -05:00
Jasmine Dahilig	d6110cbed4	lifecycle: add poststop hook (#8194 )	2020-11-12 08:01:42 -08:00
Nick Ethier	5e1634eda1	structs: canonicalize allocatedtaskresources to populate shared ports (#9309 )	2020-11-11 16:21:47 -05:00
Tim Gross	60874ebe25	csi: Postrun hook should not change mode (#9323 ) The unpublish workflow requires that we know the mode (RW vs RO) if we want to unpublish the node. Update the hook and the Unpublish RPC so that we mark the claim for release in a new state but leave the mode alone. This fixes a bug where RO claims were failing node unpublish. The core job GC doesn't know the mode, but we don't need it for that workflow, so add a mode specifically for GC; the volumewatcher uses this as a sentinel to check whether claims (with their specific RW vs RO modes) need to be claimed.	2020-11-11 13:06:30 -05:00
Chris Baker	fbe3670b74	Merge pull request #9317 from hashicorp/f-recommendations-cli-autocomplete recommendations CLI: autocomplete support	2020-11-11 12:04:23 -06:00
Chris Baker	e3c0ea654d	auto-complete for recommendations CLI, plus OSS components of recommendations prefix search	2020-11-11 11:13:43 +00:00
Chris Baker	53aa5e75c9	fix #9227 : use both job and type query on scaling policy list endpoint	2020-11-10 23:26:35 +00:00
Luiz Aoqui	ea81ac5d3d	Merge pull request #9296 from hashicorp/b-remove-namespace-from-scale-request Remove Namespace field from JobScaleRequest	2020-11-09 15:13:33 -05:00
Luiz Aoqui	c536286c7a	remove Namespace field from JobScaleRequest	2020-11-09 13:02:05 -05:00
Kris Hicks	0b590a5040	events: Use single eventsFromChanges func (#9281 )	2020-11-05 13:06:52 -08:00
Chris Baker	b2a4f64b65	Merge pull request #9278 from hashicorp/b-9268-all-namespace-allocs-acl fix ACL bugs in listing allocs across all namespaces	2020-11-05 14:59:47 -06:00
Kris Hicks	bcb460c36e	Fix handling of deleted change (#9280 )	2020-11-05 11:06:41 -08:00
Chris Baker	be32fb7d3c	updated Allocation.List to properly handle ACL checking for namespace=*	2020-11-05 17:26:33 +00:00
Kris Hicks	20f5fa7f99	Refactor GenericEventsFromChanges (#9279 )	2020-11-05 09:06:08 -08:00
Chris Baker	6743803e5c	documenting test for #9268	2020-11-05 16:19:55 +00:00
Mahmood Ali	be11f735c2	Merge pull request #9205 from hashicorp/b-gh-7703 Repurpose dispatch-job capability to dispatch periodic jobs	2020-11-02 13:11:58 -05:00
Drew Bailey	d62d8a8587	Event sink manager improvements (#9206 ) * Improve managed sink run loop and reloading resetCh no longer needed length of buffer equal to count of items, not count of events in each item update equality fn name, pr feedback clean up sink manager sink creation * update test to reflect changes * bad editor find and replace * pr feedback	2020-11-02 09:21:32 -05:00
Kris Hicks	a98a8253d8	Update subscription filter func (#9232 ) This adds support for specifying a global topic match for a specific key.	2020-10-30 10:07:38 -07:00
Chris Baker	719077a26d	added new policy capabilities for recommendations API state store: call-out to generic update of job recommendations from job update method recommendations API work, and http endpoint errors for OSS support for scaling policies in task block of job spec add query filters for ScalingPolicy list endpoint command: nomad scaling policy list: added -job and -type	2020-10-28 14:32:16 +00:00
Mahmood Ali	320239264f	dispatch-job capability to dispatch periodic jobs	2020-10-27 16:33:01 -04:00
Drew Bailey	86080e25a9	Send events to EventSinks (#9171 ) * Process to send events to configured sinks This PR adds a SinkManager to a server which is responsible for managing managed sinks. Managed sinks subscribe to the event broker and send events to a sink writer (webhook). When changes to the eventstore are made the sinkmanager and managed sink are responsible for reloading or starting a new managed sink. * periodically check in sink progress to raft Save progress on the last successfully sent index to raft. This allows a managed sink to resume close to where it left off in the event of a lost server or leadership change dereference eventsink so we can accurately use the watchch When using a pointer to eventsink struct it was updated immediately and our reload logic would not trigger	2020-10-26 17:27:54 -04:00
Drew Bailey	1ae39a9ed9	event sink crud operation api (#9155 ) * network sink rpc/api plumbing state store methods and restore upsert sink test get sink delete sink event sink list and tests go generate new msg types validate sink on upsert * go generate	2020-10-23 14:23:00 -04:00
Michael Schurter	c2dd9bc996	core: open source namespaces	2020-10-22 15:26:32 -07:00
Drew Bailey	f3dcefe5a9	remove event durability (#9147 ) * remove event durability temporarily removing go-memdb event durability until a new strategy is developed on how to best handle increased durability needs * drop events table schema and state store methods * fix neweventbuffer invocations	2020-10-22 12:21:03 -04:00
Tim Gross	8459f1ead5	csi: prevent in-use plugin GC from blocking volume GC (#9141 ) During CSI plugin GC, we don't return an error if the volume is in use, because this is not an error condition. If we were to return an error during a `nomad system gc`, we would not continue on to GC volumes. But the check for the specific error message fails if the GC is performed on a worker rather than on the leader, due to RPC forwarding wrapping the error message. Use a less specific test so that we don't return an error.	2020-10-21 16:54:28 -04:00
Alexander Shtuchkin	90fd8bb85f	Implement 'batch mode' for persisting allocations on the client. (#9093 ) Fixes #9047, see problem details there. As a solution, we use BoltDB's 'Batch' mode that combines multiple parallel writes into small number of transactions. See https://github.com/boltdb/bolt#batch-read-write-transactions for more information.	2020-10-20 16:15:37 -04:00
Drew Bailey	8451de99b2	adds two base event stream e2e tests (#9126 ) * adds two base event stream e2e tests test evaluation filter keys are included * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * gc aftereach Co-authored-by: Tim Gross <tgross@hashicorp.com>	2020-10-20 08:26:21 -04:00
Drew Bailey	6c788fdccd	Events/msgtype cleanup (#9117 ) * use msgtype in upsert node adds message type to signature for upsert node, update tests, remove placeholder method * UpsertAllocs msg type test setup * use upsertallocs with msg type in signature update test usage of delete node delete placeholder msgtype method * add msgtype to upsert evals signature, update test call sites with test setup msg type handle snapshot upsert eval outside of FSM and ignore eval event remove placeholder upsertevalsmsgtype handle job plan rpc and prevent event creation for plan msgtype cleanup upsertnodeevents updatenodedrain msgtype msg type 0 is a node registration event, so set the default to the ignore type * fix named import * fix signature ordering on upsertnode to match	2020-10-19 09:30:15 -04:00
Drew Bailey	c57e760933	remove special node drain event type rely on standardized events instead of special node drain event	2020-10-15 16:44:36 -04:00
Nick Ethier	4903e5b114	Consul with CNI and host_network addresses (#9095 ) * consul: advertise cni and multi host interface addresses * structs: add service/check address_mode validation * ar/groupservices: fetch networkstatus at hook runtime * ar/groupservice: nil check network status getter before calling * consul: comment network status can be nil	2020-10-15 15:32:21 -04:00
Pierre Cauchois	13218dc345	Enforce bounds on MaxQueryTime (#9064 ) The MaxQueryTime value used in QueryOptions.HasTimedOut() can be set to an invalid value that would throw off how RPC requests are retried. This fix uses the same logic that enforces the MaxQueryTime bounds in the blockingRPC() call.	2020-10-15 08:43:06 -04:00
Michael Schurter	dd09fa1a4a	Merge pull request #9055 from hashicorp/f-9017-resources api: add field filters to /v1/{allocations,nodes}	2020-10-14 14:49:39 -07:00
Drew Bailey	c463479848	filter on additional filter keys, remove switch statement duplication properly wire up durable event count move newline responsibility moves newline creation from NDJson to the http handler, json stream only encodes and sends now ignore snapshot restore if broker is disabled enable dev mode to access event stream without acl use mapping instead of switch use pointers for config sizes, remove unused ttl, simplify closed conn logic	2020-10-14 14:14:33 -04:00
Michael Schurter	8ccbd92cb6	api: add field filters to /v1/{allocations,nodes} Fixes #9017 The ?resources=true query parameter includes resources in the object stub listings. Specifically: - For `/v1/nodes?resources=true` both the `NodeResources` and `ReservedResources` field are included. - For `/v1/allocations?resources=true` the `AllocatedResources` field is included. The ?task_states=false query parameter removes TaskStates from /v1/allocations responses. (By default TaskStates are included.)	2020-10-14 10:35:22 -07:00
Drew Bailey	684807bddb	namespace filtering	2020-10-14 12:44:43 -04:00
Drew Bailey	fdc576af09	handle txn returning error	2020-10-14 12:44:42 -04:00
Drew Bailey	df96b89958	Add EvictCallbackFn to handle removing entries from go-memdb when they are removed from the event buffer. Wire up event buffer size config, use pointers for structs.Events instead of copying.	2020-10-14 12:44:42 -04:00
Drew Bailey	315f77a301	rehydrate event publisher on snapshot restore address pr feedback	2020-10-14 12:44:41 -04:00
Drew Bailey	d793529d61	event durability count and cfg	2020-10-14 12:44:40 -04:00
Drew Bailey	b4c135358d	use Events to wrap index and events, store in events table	2020-10-14 12:44:39 -04:00
Drew Bailey	9d48818eb8	writetxn can return error, add alloc and job generic events. Add events table for durability	2020-10-14 12:44:39 -04:00
Drew Bailey	400455d302	Events/eval alloc events (#9012 ) * generic eval update event first pass at alloc client update events * api/event client	2020-10-14 12:44:37 -04:00
Drew Bailey	4793bb4e01	Events/deployment events (#9004 ) * Node Drain events and Node Events (#8980) Deployment status updates handle deployment status updates (paused, failed, resume) deployment alloc health generate events from apply plan result txn err check, slim down deployment event one ndjson line per index * consolidate down to node event + type * fix UpdateDeploymentAllocHealth test invocations * fix test	2020-10-14 12:44:37 -04:00
Drew Bailey	a4a2975edf	Event Stream API/RPC (#8947 ) This Commit adds an /v1/events/stream endpoint to stream events from. The stream framer has been updated to include a SendFull method which does not fragment the data between multiple frames. This essentially treats the stream framer as an envelope to adhere to the stream framer interface in the UI. If the `encode` query parameter is omitted events will be streamed as newline delimited JSON.	2020-10-14 12:44:36 -04:00
Drew Bailey	207068ca28	Events/event source node (#8918 ) * Node Register/Deregister event sourcing example upsert node with context fill in writetxnwithctx ctx passing to handle event type creation, wip test node deregistration event drop Node from registration event * node batch deregistration	2020-10-14 12:44:35 -04:00
Drew Bailey	4753904b90	Events/cfg enable publisher (#8916 ) * only enable publisher based on config * add default prune tick * back out state abandon changes on fsm close	2020-10-14 12:44:35 -04:00
Drew Bailey	f820744746	abandon current state on server shutdown	2020-10-14 12:44:34 -04:00
Drew Bailey	fddac3af00	Event Buffer Implementation adds an event buffer to hold events from raft changes. update events to use event buffer fix append call provide way to prune buffer items after TTL event publisher tests basic publish test wire up max item ttl rename package to stream, cleanup exploratory work subscription filtering subscription plumbing allow subscribers to consume events, handle closing subscriptions back out old exploratory ctx work fix lint remove unused ctx bits add a few comments fix test stop publisher on abandon	2020-10-14 12:44:34 -04:00
Chris Baker	1d35578bed	removed backwards-compatible/untagged metrics deprecated in 0.7	2020-10-13 20:18:39 +00:00
Seth Hoenig	ed13e5723f	consul/connect: dynamically select envoy sidecar at runtime As newer versions of Consul are released, the minimum version of Envoy it supports as a sidecar proxy also gets bumped. Starting with the upcoming Consul v1.9.X series, Envoy v1.11.X will no longer be supported. Current versions of Nomad hardcode a version of Envoy v1.11.2 to be used as the default implementation of Connect sidecar proxy. This PR introduces a change such that each Nomad Client will query its local Consul for a list of Envoy proxies that it supports (https://github.com/hashicorp/consul/pull/8545) and then launch the Connect sidecar proxy task using the latest supported version of Envoy. If the `SupportedProxies` API component is not available from Consul, Nomad will fall back to the old version of Envoy supported by old versions of Consul. Setting the meta configuration option `meta.connect.sidecar_image` or setting the `connect.sidecar_task` stanza will take precedence as is the current behavior for sidecar proxies. Setting the meta configuration option `meta.connect.gateway_image` will take precedence as is the current behavior for connect gateways. `meta.connect.sidecar_image` and `meta.connect.gateway_image` may make use of the special `${NOMAD_envoy_version}` variable interpolation, which resolves to the newest version of Envoy supported by the Consul agent. Addresses #8585 #7665	2020-10-13 09:14:12 -05:00
Tim Gross	4335d847a4	Allow job Version to start at non-zero value (#9071 ) Stop coercing version of new job to 0 in the state_store, so that we can add regions to a multi-region deployment. Send new version, rather than existing version, to MRD to accommodate version-choosing logic changes in ENT. Co-authored-by: Chris Baker <1675087+cgbaker@users.noreply.github.com>	2020-10-12 13:59:48 -04:00
Nick Ethier	d45be0b5a6	client: add NetworkStatus to Allocation (#8657 )	2020-10-12 13:43:04 -04:00
Yoan Blanc	891accb89a	use allow/deny instead of the colored alternatives (#9019 ) Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-10-12 08:47:05 -04:00
Tim Gross	9b4917ae5f	csi: volumewatcher only needs one pass to collect past claims If a volume GC and a `nomad volume detach` command land concurrently, we can end up with multiple claims without an allocation, which results in extra no-op work when finding claims to collect as past claims.	2020-10-09 11:03:51 -04:00
Tim Gross	ec1e75d9f4	csi: remove stray TODO comment This item was completed in #8626	2020-10-09 11:03:51 -04:00
Tim Gross	e8c13a2307	csi: validate mount options during volume registration (#9044 ) Volumes using attachment mode `file-system` use the CSI filesystem API when they're mounted, and can be passed mount options. But `block-device` mode volumes don't have this option. When RPCs are made to plugins, we are silently dropping the mount options we don't expect to see, but this results in a poor operator experience when the mount options aren't honored. This changeset makes passing mount options to a `block-device` volume a validation error.	2020-10-08 09:23:21 -04:00
Tim Gross	3ceb5b36b1	csi: allow more than 1 writer claim for multi-writer mode (#9040 ) Fixes a bug where CSI volumes with the `MULTI_NODE_MULTI_WRITER` access mode were using the same logic as `MULTI_NODE_SINGLE_WRITER` to determine whether the volume had writer claims available for scheduling. Extends CSI claim endpoint test to exercise multi-reader and make sure `WriteFreeClaims` is exercised for multi-writer in feasibility test.	2020-10-07 10:43:23 -04:00
Seth Hoenig	0c5ae5769f	Merge pull request #9029 from hashicorp/b-tgs-updates consul/connect: trigger update as necessary on connect changes	2020-10-05 16:48:04 -05:00
Seth Hoenig	f44a4f68ee	consul/connect: trigger update as necessary on connect changes This PR fixes a long standing bug where submitting jobs with changes to connect services would not trigger updates as expected. Previously, service blocks were not considered as sources of destructive updates since they could be synced with consul non-destructively. With Connect, task group services that have changes to their connect block or to the service port should be destructive, since the network plumbing of the alloc is going to need updating. Fixes #8596 #7991 Non-destructive half in #7192	2020-10-05 14:53:00 -05:00
Chris Baker	7f701fddd0	updated docs and validation to further prohibit null chars in region, datacenter, and job name	2020-10-05 18:01:50 +00:00
Chris Baker	23ea7cd27c	updated job validate to refute job/group/task IDs containing null characters updated CHANGELOG and upgrade guide	2020-10-05 18:01:49 +00:00
Chris Baker	c8fd9428d4	documenting tests around null characters in job id, task group name, and task name	2020-10-05 18:01:49 +00:00
Fredrik Hoem Grelland	a015c52846	configure nomad cluster to use a Consul Namespace [Consul Enterprise] (#8849 )	2020-10-02 14:46:36 -04:00
Michael Schurter	765473e8b0	jobspec: lower min cpu resources from 10->1 Since CPU resources are usually a soft limit it is desirable to allow setting it as low as possible to allow tasks to run only in "idle" time. Setting it to 0 is still not allowed to avoid potential unintentional side effects with allowing a zero value. While there may not be any side effects this commit attempts to minimize risk by avoiding the issue. This does not change the defaults.	2020-09-30 12:15:13 -07:00
Luiz Aoqui	88d4eecfd0	add scaling policy type	2020-09-29 17:57:46 -04:00
Seth Hoenig	af9543c997	consul: fix validation of task in group-level script-checks When defining a script-check in a group-level service, Nomad needs to know which task is associated with the check so that it can use the correct task driver to execute the check. This PR fixes two bugs: 1) validate service.task or service.check.task is configured 2) make service.check.task inherit service.task if it is itself unset Fixes #8952	2020-09-28 15:02:59 -05:00
Michael Schurter	9dd59ceaa7	core: improve job deregister error logging Noticed this error in some production logs, and they were far from helpful. Changes: 1. Include job ID in logs 2. Wrap errors and log once instead of double log lines 3. Test fsm error handling behavior	2020-09-21 08:59:03 -07:00
Pierre Cauchois	e4b739cafd	RPC Timeout/Retries account for blocking requests (#8921 ) The current implementation measures RPC request timeout only against config.RPCHoldTimeout, which is fine for non-blocking requests but will almost surely be exceeded by long-poll requests that block for minutes at a time. This adds a HasTimedOut method on the RPCInfo interface that takes into account whether the request is blocking, its maximum wait time, and the RPCHoldTimeout.	2020-09-18 08:58:41 -04:00
Seth Hoenig	57fc593363	consul/connect: validate group network on expose port injection In #7800, Nomad would automatically generate a port label for service checks making use of the expose feature, if the port was not already set. This change assumed the group network would be correctly defined (as is checked in a validation hook later). If the group network was not defined, a panic would occur on job submission. This change re-uses the group network validation helper to make sure the network is correctly defined before adding ports to it. Fixes #8875	2020-09-14 10:25:03 -05:00
Chris Baker	d0cc0a768b	Update nomad/job_endpoint.go	2020-09-10 17:18:23 -05:00
Chris Baker	eff726609d	move variable out of oss-only build into shared file, fixes ent compile error introduced by #8834	2020-09-10 22:08:25 +00:00
Yoan Blanc	48d07c4d12	fix: panic in test introduced by #8453 (#8834 ) Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-09-09 09:38:15 -04:00
Chris Baker	bfa366ea72	Update nomad/deployment_endpoint.go	2020-09-08 16:39:51 -05:00
Chris Baker	0e509bc11e	check ACLs against deployment namespace on Deployment.GetDeployment, filtering the deployment if the ACL isn't appropriate	2020-09-08 19:57:28 +00:00
Drew Bailey	28aa0387e9	remove node events for state track changing pr remove Txn and update calls with ReadTxn() constructor for changetrackerdb	2020-09-04 10:23:35 -04:00
Drew Bailey	d5f6d3b3c5	fix a few missed txn changes	2020-09-01 10:27:21 -04:00
Drew Bailey	9253146bf4	fix bad merge from scalingpoliciesbynamespace	2020-09-01 10:27:20 -04:00
Drew Bailey	45762d8df8	noop changetracker for snapshots	2020-09-01 10:27:20 -04:00
Drew Bailey	0af749c92e	Transaction change tracking This commit wraps memdb.DB with a changeTrackerDB, which is a thin wrapper around memdb.DB which enables go-memdb's TrackChanges on all write transactions. When the transaction is committed the changes are sent to an eventPublisher which will be used to create and emit change events. debugging TestFSM_ReconcileSummaries wip revert back rebase revert back rebase fix snapshot to actually use a snapshot	2020-09-01 10:27:20 -04:00
Jasmine Dahilig	71a694f39c	Merge pull request #8390 from hashicorp/lifecycle-poststart-hook task lifecycle poststart hook	2020-08-31 13:53:24 -07:00
Jasmine Dahilig	fbe0c89ab1	task lifecycle poststart: code review fixes	2020-08-31 13:22:41 -07:00
Mahmood Ali	117aec0036	Fix accidental broken clones Fix CSIMountOptions.Copy() and VolumeRequest.Copy() where they accidentally returned a reference to self rather than a deep copy. `&(*ref)` in Golang is apparently equivalent to plain `ref`.	2020-08-28 15:29:22 -04:00
Tim Gross	b77fe023b5	MRD: move 'job stop -global' handling into RPC (#8776 ) The initial implementation of global job stop for MRD looped over all the regions in the CLI for expedience. This changeset includes the OSS parts of moving this into the RPC layer so that API consumers don't have to implement this logic themselves.	2020-08-28 14:28:13 -04:00
Tim Gross	35b1b3bed7	structs: filter NomadTokenID from job diff (#8773 ) Multiregion deployments use the `NomadTokenID` to allow the deploymentwatcher to send RPCs between regions with the original submitter's ACL token. This ID should be filtered from diffs so that it doesn't cause a difference for purposes of job plans.	2020-08-28 13:40:51 -04:00
Lang Martin	7d483f93c0	csi: plugins track jobs in addition to allocations, and use job information to set expected counts (#8699 ) * nomad/structs/csi: add explicit job support * nomad/state/state_store: capture job updates directly * api/nodes: CSIInfo needs the AllocID * command/agent/csi_endpoint: AllocID was missing Co-authored-by: Tim Gross <tgross@hashicorp.com>	2020-08-27 17:20:00 -04:00
Seth Hoenig	c4fd1c97aa	Merge pull request #8761 from hashicorp/b-consul-op-token-check consul/connect: make use of task kind to determine service name in consul token checks	2020-08-27 14:08:33 -05:00
Tim Gross	606df14e78	MRD: deregister regions that are dropped on update (#8763 ) This changeset is the OSS hooks for what will be implemented in ENT.	2020-08-27 14:54:45 -04:00
Seth Hoenig	84176c9a41	consul/connect: make use of task kind to determine service name in consul token checks When consul.allow_unauthenticated is set to false, the job_endpoint hook validates that a `-consul-token` is provided and validates the token against the privileges inherent to a Consul Service Identity policy for all the Connect enabled services defined in the job. Before, the check was assuming the service was of type sidecar-proxy. This fixes the check to use the type of the task so we can distinguish between the different connect types.	2020-08-27 12:14:40 -05:00
Chris Baker	8b9145fabd	state_store/fix the prefix bugs for scaling policies documented in 1a9318	2020-08-27 04:25:37 +00:00
Chris Baker	655cbb4d3c	documenting tests for prefix bugs around job scaling policies	2020-08-27 03:22:13 +00:00
Seth Hoenig	9f1f2a5673	Merge branch 'master' into f-cc-ingress	2020-08-26 15:31:05 -05:00
Seth Hoenig	5d670c6d01	consul/connect: use context cancel more safely	2020-08-26 14:23:31 -05:00
Seth Hoenig	dfe179abc5	consul/connect: fixup some comments and context timeout	2020-08-26 13:17:16 -05:00
Mahmood Ali	45f549e29e	Merge pull request #8691 from hashicorp/b-reschedule-job-versions Respect alloc job version for lost/failed allocs	2020-08-25 18:02:45 -04:00
Mahmood Ali	def768728e	Have Plan.AppendAlloc accept the job	2020-08-25 17:22:09 -04:00
Mahmood Ali	18632955f2	clarify PathEscapesAllocDir specification Clarify how to handle prefix value and path traversal within the alloc dir but outside the prefix directory.	2020-08-24 20:44:26 -04:00
Mahmood Ali	9794760933	validate parameterized job request meta Fixes a bug where `keys` metadata wasn't populated, as we iterated over the empty newly-created `keys` map rather than the request Meta field.	2020-08-24 20:39:01 -04:00
Seth Hoenig	26e77623e5	consul/connect: fixup tests to use new consul sdk	2020-08-24 12:02:41 -05:00
Seth Hoenig	c4fa644315	consul/connect: remove envoy dns option from gateway proxy config	2020-08-24 09:11:55 -05:00
Yoan Blanc	327d17e0dc	fixup! vendor: consul/api, consul/sdk v1.6.0 Signed-off-by: Yoan Blanc <yoan@dosimple.ch>	2020-08-24 08:59:03 +02:00
Seth Hoenig	5b072029f2	consul/connect: add initial support for ingress gateways This PR adds initial support for running Consul Connect Ingress Gateways (CIGs) in Nomad. These gateways are declared as part of a task group level service definition within the connect stanza. ```hcl service { connect { gateway { proxy { // envoy proxy configuration } ingress { // ingress-gateway configuration entry } } } } ``` A gateway can be run in `bridge` or `host` networking mode, with the caveat that host networking necessitates manually specifying the Envoy admin listener (which cannot be disabled) via the service port value. Currently Envoy is the only supported gateway implementation in Consul, and Nomad only supports running Envoy as a gateway using the docker driver. Aims to address #8294 and tangentially #8647	2020-08-21 16:21:54 -05:00
Nick Ethier	3cd5f46613	Update UI to use new allocated ports fields (#8631 ) * nomad: canonicalize alloc shared resources to populate ports * ui: network ports * ui: remove unused task network references and update tests with new shared ports model * ui: lint * ui: revert auto formatting * ui: remove unused page objects * structs: remove unrelated test from bad conflict resolution * ui: formatting	2020-08-20 11:07:13 -04:00
Mahmood Ali	8a342926b7	Respect alloc job version for lost/failed allocs This change fixes a bug where lost/failed allocations are replaced by allocations with the latest versions, even if the version hasn't been promoted yet. Now, when generating a plan for lost/failed allocations, the scheduler first checks if the current deployment is in Canary stage, and if so, it ensures that any lost/failed allocation is replaced by one with the latest promoted version instead.	2020-08-19 09:52:48 -04:00
Tim Gross	1aa242c15a	failed core jobs should not have follow-ups (#8682 ) If a core job fails more than the delivery limit, the leader will create a new eval with the TriggeredBy field set to `failed-follow-up`. Evaluations for core jobs have the leader's ACL, which is not valid on another leader after an election. The `failed-follow-up` evals do not have ACLs, so core job evals that fail more than the delivery limit or core job evals that span leader elections will never succeed and will be re-enqueued forever. So we should not retry with a `failed-follow-up`.	2020-08-18 16:48:43 -04:00
Tim Gross	38ec70eb8d	multiregion: validation should always return error for OSS (#8687 )	2020-08-18 15:35:38 -04:00
Lang Martin	e8a5565c1a	nomad/state/state_store: handle type conversion failure explicitly (#8660 )	2020-08-12 17:53:12 -04:00
Michael Schurter	de08ae8083	test: add allocrunner test for poststart hooks	2020-08-12 09:54:14 -07:00
Mahmood Ali	c462f8d0d5	Merge pull request #8524 from hashicorp/b-vault-health-checks Skip checking Vault health	2020-08-11 16:01:07 -04:00
Lang Martin	a27913e699	CSI RPC Token (#8626 ) * client/allocrunner/csi_hook: use the Node SecretID * client/allocrunner/csi_hook: include the namespace for Claim	2020-08-11 13:08:39 -04:00
Lang Martin	c82b2a2454	CSI: volume and plugin allocations in the API (#8590 ) * command/agent/csi_endpoint: explicitly convert to API structs, and convert allocs for single object get endpoints	2020-08-11 12:24:41 -04:00
Tim Gross	def7084be7	msgpack-rpc errors cannot be wrapped (#8633 ) Our RPC calls mangle the errors we get, which prevents us from using wrapped errors and `errors.Is`. Also fixes log message fields.	2020-08-11 10:25:43 -04:00
Tim Gross	443fdaa86b	csi: nomad volume detach command (#8584 ) The soundness guarantees of the CSI specification leave a little to be desired in our ability to provide a 100% reliable automated solution for managing volumes. This changeset provides a new command to bridge this gap by providing the operator the ability to intervene. The command doesn't take an allocation ID so that the operator doesn't have to keep track of alloc IDs that may have been GC'd. Handle this case in the unpublish RPC by sending the client RPC for all the terminal/nil allocs on the selected node.	2020-08-11 10:18:54 -04:00
Tim Gross	fb27082e5c	RPC errors must be wrapped in order to wrap internal errors (#8632 ) The CSI client RPC uses error wrapping to detect the type of error bubbling up from plugins, but if the errors we get aren't wrapped at each layer, we can't unwrap the inner error. Also eliminates some unused args.	2020-08-11 09:13:52 -04:00
Lang Martin	f245ba91c4	nomad/state/state_store: two cases of incorrect CSIPlugin in-place (#8630 )	2020-08-10 18:15:29 -04:00
Mahmood Ali	dce1dc44eb	distinguish between transient and persistent errors	2020-08-10 16:46:06 -04:00
Seth Hoenig	6ab3d21d2c	consul: validate script type when using check thresholds	2020-08-10 14:08:09 -05:00
Seth Hoenig	fd4804bf26	consul: able to set pass/fail thresholds on consul service checks This change adds the ability to set the fields `success_before_passing` and `failures_before_critical` on Consul service check definitions. This is a feature added to Consul v1.7.0 and later. https://www.consul.io/docs/agent/checks#success-failures-before-passing-critical Nomad doesn't do much besides pass the fields through to Consul. Fixes #6913	2020-08-10 14:08:09 -05:00
Tim Gross	e5496c7994	csi: missing plugins during node delete are not an error (#8619 ) When deregistering a client, CSI plugins running on that client may not get a chance to fingerprint before being stopped. Account for the case where a plugin allocation is the last instance of the plugin and has been deleted from the state store to avoid errors during node deregistration.	2020-08-10 11:02:01 -04:00
Mahmood Ali	628985c51e	Merge pull request #8613 from alrs/state-test-errs nomad/state: fix dropped scaling_policy test errors	2020-08-10 08:14:19 -04:00
Lars Lehtonen	f8a42f587f	nomad/state: fix dropped scaling_policy test errors	2020-08-07 23:05:33 -07:00
Tim Gross	69f4f171e5	CSI: fix missing ACL tokens for leader-driven RPCs (#8607 ) The volumewatcher and GC job in the leader can't make CSI RPCs when ACLs are enabled without the leader ACL token being passed thru.	2020-08-07 15:37:27 -04:00
Tim Gross	7d53ed88d6	csi: client RPCs should return wrapped errors for checking (#8605 ) When the client-side actions of a CSI client RPC succeed but we get disconnected during the RPC or we fail to checkpoint the claim state, we want to be able to retry the client RPC without getting blocked by the client-side state (ex. mount points) already having been cleaned up in previous calls.	2020-08-07 11:01:36 -04:00
Tim Gross	81b604fa13	csi: controller unpublish should check current alloc count (#8604 ) Using the count of node claims from earlier in the `CSIVolume.Unpublish` RPC doesn't correctly account for cases where the RPC was interrupted but checkpointed. Instead, we'll check the current allocation count and status to determine whether we need to send a controller unpublish.	2020-08-07 10:43:45 -04:00
Tim Gross	2854298089	csi: release claims via csi_hook postrun unpublish RPC (#8580 ) Add a Postrun hook to send the `CSIVolume.Unpublish` RPC to the server. This may forward client RPCs to the node plugins or to the controller plugins, depending on whether other allocations on this node have claims on this volume. By making clients responsible for running the `CSIVolume.Unpublish` RPC (and making the RPC available to a `nomad volume detach` command), the volumewatcher becomes only used by the core GC job and we no longer need async volume GC from job deregister and node update.	2020-08-06 14:51:46 -04:00
Michael Schurter	057e1c021f	Merge pull request #8597 from hashicorp/b-vault-revoke-log-line vault: log once per interval if batching revocation	2020-08-06 11:32:47 -07:00
Tim Gross	314458ebdb	csi: update volumewatcher to use unpublish RPC (#8579 ) This changeset updates `nomad/volumewatcher` to take advantage of the `CSIVolume.Unpublish` RPC. This lets us eliminate a bunch of code and associated tests. The raft batching code can be safely dropped, as the characteristic times of the CSI RPCs are on the order of seconds or even minutes, so batching up raft RPCs added complexity without any real world performance wins. Includes refactor w/ test cleanup and dead code elimination in volumewatcher	2020-08-06 14:31:18 -04:00
Tim Gross	eaa14ab64c	csi: add unpublish RPC (#8572 ) This changeset is plumbing for a `nomad volume detach` command that will be reused by the volumewatcher claim GC as well.	2020-08-06 13:51:29 -04:00
Tim Gross	4bbf18703f	csi: retry controller client RPCs on next controller (#8561 ) The documentation encourages operators to run multiple controller plugin instances for HA, but the client RPCs don't take advantage of this by retrying when the RPC fails in cases when the plugin is unavailable (because the node has drained or the alloc has failed but we haven't received an updated fingerprint yet). This changeset tries all known controllers on ready nodes before giving up, and adds tests that exercise the client RPC routing and retries.	2020-08-06 13:24:24 -04:00
Michael Schurter	2385fee0d2	vault: log once per interval if batching revocation This log line should be rare since: 1. Most tokens should be revoked synchronously, not via this async batched method. Async revocation only takes place when Vault connectivity is lost and after leader election so no revocations are missed. 2. There should rarely be >1 batch (1,000) tokens to revoke since the above conditions should be brief and infrequent. 3. Interval is 5 minutes, so this log line will be emitted at most once every 5 minutes. What makes this log line rare is also what makes it interesting: due to a bug prior to Nomad 0.11.2 some tokens may never get revoked. Therefore Nomad tries to re-revoke them on every leader election. This caused a massive buildup of old tokens that would never be properly revoked and purged. Nomad 0.11.3 mostly fixed this but still had a bug in purging revoked tokens via Raft (fixed in #8553). The nomad.vault.distributed_tokens_revoked metric is only ticked upon successful revocation and purging, making any bugs or slowness in the process difficult to detect. Logging before a potentially slow revocation+purge operation is performed will give users much better indications of what activity is going on should the process fail to make it to the metric.	2020-08-05 15:39:21 -07:00
Mahmood Ali	490b9ce3a0	Handle Scaling Policies in Job Plan endpoint (#8567 ) Fixes https://github.com/hashicorp/nomad/issues/8544 This PR fixes a bug where using `nomad job plan ...` always reports no change if the submitted job contains scaling. The issue has three contributing factors: 1. The plan endpoint doesn't populate the required scaling policy ID; unlike the job register endpoint 2. The plan endpoint suppresses errors on job insertion - the job insertion fails here, because the scaling policy is missing the required ID 3. The scheduler reports no update necessary when the relevant job isn't in store (because the insertion failed) This PR fixes the first two factors. Changing the scheduler to be more strict might make sense, but may violate some idempotency invariant or make the scheduler more brittle.	2020-07-30 12:27:36 -04:00
Seth Hoenig	2511f48351	consul/connect: add support for bridge networks with connect native tasks Before, Connect Native Tasks needed one of these to work: - To be run in host networking mode - To have the Consul agent configured to listen to a unix socket - To have the Consul agent configured to listen to a public interface None of these are a great experience, though running in host networking is still the best solution for non-Linux hosts. This PR establishes a connection proxy between the Consul HTTP listener and a unix socket inside the alloc fs, bypassing the network namespace for any Connect Native task. Similar to and re-uses a bunch of code from the gRPC listener version for envoy sidecar proxies. Proxy is established only if the alloc is configured for bridge networking and there is at least one Connect Native task in the Task Group. Fixes #8290	2020-07-29 09:26:01 -05:00
Michael Schurter	80f521cce5	vault: expired tokens count toward batch limit As of 0.11.3 Vault token revocation and purging was done in batches. However the batch size was only limited by the number of non-expired tokens being revoked. Due to bugs prior to 0.11.3, expired tokens were not properly purged. Long-lived clusters could have thousands to millions of very old expired tokens that never got purged from the state store. Since these expired tokens did not count against the batch limit, very large batches could be created and overwhelm servers. This commit ensures expired tokens count toward the batch limit with this one line change: ``` - if len(revoking) >= toRevoke { + if len(revoking)+len(ttlExpired) >= toRevoke { ``` However, this code was difficult to test due to being in a periodically executing loop. Most of the changes are to make this one line change testable and test it.	2020-07-28 15:42:47 -07:00
Drew Bailey	bd421b6197	Merge pull request #8453 from hashicorp/oss-multi-vault-ns oss components for multi-vault namespaces	2020-07-27 08:45:22 -04:00
Drew Bailey	b296558b8e	oss components for multi-vault namespaces adds in oss components to support enterprise multi-vault namespace feature upgrade specific doc on vault multi-namespaces vault docs update test to reflect new error	2020-07-24 10:14:59 -04:00
James Rasell	da91e1d0fc	api: add namespace to scaling status GET response object.	2020-07-24 11:19:25 +02:00
Mahmood Ali	5d86f84c5a	test tweaks	2020-07-23 13:25:25 -04:00
Mahmood Ali	5f6162ba46	run revoke daemon if connection is successful	2020-07-23 13:08:16 -04:00
Mahmood Ali	48ebedb738	vault: simply make the API call Avoid checking if API is accessible, just make the API call and handle when it fails.	2020-07-23 11:33:08 -04:00
Tim Gross	d3341a2019	refactor: make it clear where we're accessing dstate The field name `Deployment.TaskGroups` contains a map of `DeploymentState`, which makes it a little harder to follow state updates when combined with inconsistent naming conventions, particularly when we also have the state store or actual `TaskGroup`s in scope. This changeset changes all uses to `dstate` so as not to be confused with actual TaskGroups.	2020-07-20 11:25:53 -04:00
Lang Martin	a3bfd8c209	structs: Job.Validate only allows stop_after_client_disconnected on batch and service jobs (#8444 ) * nomad/structs/structs: add to Job.Validate * Update nomad/structs/structs.go Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> * nomad/structs/structs: match error strings to the config file * nomad/structs/structs_test: clarify the test a bit * nomad/structs/structs_test: typo in the test error comparison Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>	2020-07-20 10:27:25 -04:00
Mahmood Ali	78568b8e63	Remove unused state.TestInitState	2020-07-20 09:55:55 -04:00
Mahmood Ali	a483dde8b9	minor tweaks from Ent	2020-07-20 09:25:09 -04:00
Mahmood Ali	5adbd9f666	enterprise specific state store objects	2020-07-20 09:22:26 -04:00
Mahmood Ali	ad2d484974	Set AgentShutdown	2020-07-17 11:04:57 -04:00
Mahmood Ali	647c5e4c03	Merge pull request #8435 from hashicorp/b-atomic-job-register Atomic eval insertion with job (de-)registration	2020-07-15 13:48:07 -04:00
Mahmood Ali	aa500f7ba3	comment compat concern in fsm.go	2020-07-15 11:23:49 -04:00
Mahmood Ali	f4a921f2be	no need to handle duplicate evals anymore	2020-07-15 11:14:49 -04:00
Mahmood Ali	a314744210	only set args.Eval after all servers upgrade We set the Eval field on job (de-)registration only after all servers get upgraded, to avoid dealing with duplicate evals.	2020-07-15 11:10:57 -04:00
Mahmood Ali	910776caf0	time.Now().UTC().UnixNano() -> time.Now().UnixNano()	2020-07-15 08:49:17 -04:00
Kurt Neufeld	62851f6ccb	fixed typo in output (#1 )	2020-07-14 10:33:17 -06:00
Mahmood Ali	fbfe4ab1bd	Atomic eval insertion with job (de-)registration This fixes a bug where jobs may get "stuck" unprocessed that disproportionately affects periodic jobs around leadership transitions. When registering a job, the job registration and the eval to process it get applied to raft as two separate transactions; if the job registration succeeds but eval application fails, the job may remain unprocessed. Operators may detect such failure, when submitting a job update and get a 500 error code, and they could retry; periodic job failures are more likely to go unnoticed, and no further periodic invocations will be processed until an operator forces an evaluation. This fixes the issue by ensuring that the job registration and eval application get persisted and processed atomically in the same raft log entry. Also, applies the same change to ensure atomicity in job deregistration. Backward Compatibility We must maintain compatibility in two scenarios: mixed clusters where a leader can handle atomic updates but followers cannot, and a recent cluster processes old log entries from legacy or mixed cluster mode. To handle these constraints: ensure that the leader continues to emit the Evaluation log entry until all servers have upgraded; also, when processing raft logs, the servers honor evaluations found in both spots, the Eval in job (de-)registration and the eval update entries. When an updated server sees mix-mode behavior where an eval is inserted into the raft log twice, it ignores the second instance. I made one compromise in consistency in the mixed-mode scenario: servers may disagree on the eval.CreateIndex value: the leader and updated servers will report the job registration index while old servers will report the index of the eval update log entry. This discrepancy doesn't seem to be material - it's the eval.JobModifyIndex that matters.	2020-07-14 11:59:29 -04:00
Tim Gross	bd457343de	MRD: all regions should start pending (#8433 ) Deployments should wait until kicked off by `Job.Register` so that we can assert that all regions have a scheduled deployment before starting any region. This changeset includes the OSS fixes to support the ENT work. `IsMultiregionStarter` has no more callers in OSS, so remove it here.	2020-07-14 10:57:37 -04:00
Tim Gross	0ce3c1e942	multiregion: allow empty region DCs (#8426 ) It's supposed to be possible for a region not to have `datacenters` set so that it can use the job's `datacenters` field. This requires that operators use the same DC name across multiple regions, but that's the default client configuration.	2020-07-13 13:34:19 -04:00
Nick Ethier	d171189afc	nomad: recanonicalize network after connect hook (#8407 ) * nomad: recanonicalize network after connect hook	2020-07-10 10:59:51 -04:00
Seth Hoenig	6fc63ede76	Merge pull request #7733 from jorgemarey/b-vault-policies Fix get all vault token policies	2020-07-09 10:05:59 -05:00
Seth Hoenig	f023df7b68	Merge pull request #8392 from hashicorp/f-infer-cn-taskname consul/connect: infer task name for native service if possible	2020-07-08 14:17:25 -05:00
Seth Hoenig	5be1679b86	Merge pull request #8338 from jorgemarey/b-fix-sidecar-task Change connectDriverConfig to be a func	2020-07-08 14:00:27 -05:00
Seth Hoenig	1a75da0ce0	consul/connect: infer task name in service if possible Before, the service definition for a Connect Native service would always require setting the `service.task` parameter. Now, that parameter is automatically inferred when there is only one task in the task group. Fixes #8274	2020-07-08 13:31:44 -05:00
Jasmine Dahilig	9e27231953	add poststart hook to task hook coordinator & structs	2020-07-08 11:01:35 -07:00
Tim Gross	ec96ddf648	fix swapped old/new multiregion plan diffs (#8378 ) The multiregion plan diffs swap the old and new versions for each region when they're edited (rather than added/removed). The `multiregionRegionDiff` function call incorrectly reversed its arguments for existing regions.	2020-07-08 10:10:50 -04:00
Jorge Marey	a3740cba9b	Change connectDriverConfig to be a func	2020-07-07 08:59:59 +02:00
Nick Ethier	e0fb634309	ar: support opting into binding host ports to default network IP (#8321 ) * ar: support opting into binding host ports to default network IP * fix config plumbing * plumb node address into network resource * struct: only handle network resource upgrade path once	2020-07-06 18:51:46 -04:00
Chris Baker	5b96c3d50e	Merge pull request #8360 from hashicorp/b-8355-better-scaling-validation better error handling around Scaling->Max	2020-07-06 11:32:02 -05:00
Chris Baker	5aa46e9a8f	modified state store to allow version skipping, to support multiregion version syncing also, passing existing version into multiregionRegister to support this	2020-07-06 14:16:55 +00:00
Lars Lehtonen	f32e80175d	nomad: fix dropped test error (#8356 )	2020-07-06 08:46:54 -04:00
Chris Baker	a77e012220	better testing of scaling parsing, fixed some broken tests by api changes	2020-07-04 19:32:37 +00:00
Chris Baker	9100b6b7c0	changes to make sure that Max is present and valid, to improve error messages * made api.Scaling.Max a pointer, so we can detect (and complain) when it is neglected * added checks to HCL parsing that it is present * when Scaling.Max is absent/invalid, don't return extraneous error messages during validation * tweak to multiregion handling to ensure that the count is valid on the interpolated regional jobs resolves #8355	2020-07-04 19:05:50 +00:00
Lang Martin	6c22cd587d	api: `nomad debug` new /agent/host (#8325 ) * command/agent/host: collect host data, multi platform * nomad/structs/structs: new HostDataRequest/Response * client/agent_endpoint: add RPC endpoint * command/agent/agent_endpoint: add Host * api/agent: add the Host endpoint * nomad/client_agent_endpoint: add Agent Host with forwarding * nomad/client_agent_endpoint: use findClientConn This changes forwardMonitorClient and forwardProfileClient to use findClientConn, which was cribbed from the common parts of those funcs. * command/debug: call agent hosts * command/agent/host: eliminate calling external programs	2020-07-02 09:51:25 -04:00
Tim Gross	23be116da0	csi: add -force flag to volume deregister (#8295 ) The `nomad volume deregister` command currently returns an error if the volume has any claims, but in cases where the claims can't be dropped because of plugin errors, providing a `-force` flag gives the operator an escape hatch. If the volume has no allocations or if they are all terminal, this flag deletes the volume from the state store, immediately and implicitly dropping all claims without further CSI RPCs. Note that this will not also unmount/detach the volume, which we'll make the responsibility of a separate `nomad volume detach` command.	2020-07-01 12:17:51 -04:00
Mahmood Ali	7f460d2706	allocrunner: terminate sidecars in the end This fixes a bug where a batch allocation fails to complete if it has sidecars. If the only remaining running tasks in an allocation are sidecars, we must kill them and mark the allocation as complete.	2020-06-29 15:12:15 -04:00
Drew Bailey	01e2cc5054	allow ClusterMetadata to accept a watchset (#8299 ) * allow ClusterMetadata to accept a watchset * use nil instead of empty watchset	2020-06-26 13:23:32 -04:00
Mahmood Ali	49a177ce28	Merge pull request #8017 from hashicorp/f-change-sched-updated Set Updated to true for all non-CAS requests on v1/operator/scheduler/configuration	2020-06-26 08:39:37 -04:00
Mahmood Ali	6605ebd314	Merge pull request #8223 from hashicorp/f-multi-network-validate-ports core: validate port numbers are < 65535	2020-06-26 08:31:01 -04:00
Nick Ethier	89118016fc	command: correctly show host IP in ports output /w multi-host networks (#8289 )	2020-06-25 15:16:01 -04:00
Tim Gross	67ffcb35e9	multiregion: add support for 'job plan' (#8266 ) Add a scatter-gather for multiregion job plans. Each region's servers interpolate the plan locally in `Job.Plan` but don't distribute the plan as done in `Job.Run`. Note that it's not possible to return a usable modify index from a multiregion plan for use with `-check-index`. Even if we were to force the modify index to be the same at the start of `Job.Run` the index immediately drifts during each region's deployments, depending on events local to each region. So we omit this section of a multiregion plan.	2020-06-24 13:24:55 -04:00
Tim Gross	a449009e9f	multiregion validation fixes (#8265 ) Multi-region jobs need to bypass validating counts otherwise we get spurious warnings in Job.Plan.	2020-06-24 12:18:51 -04:00
Seth Hoenig	3872b493e5	Merge pull request #8011 from hashicorp/f-cnative-host consul/connect: implement initial support for connect native	2020-06-24 10:33:12 -05:00
Seth Hoenig	011c6b027f	connect/native: doc and comment tweaks from PR	2020-06-24 10:13:22 -05:00
Michael Schurter	7869ebc587	docs: add comments to structs.Port struct	2020-06-23 11:38:01 -07:00


3817 Commits