open-nomad

Author	SHA1	Message	Date
Seth Hoenig	d026ff1f66	consul/connect: add support for connect mesh gateways This PR implements first-class support for Nomad running Consul Connect Mesh Gateways. Mesh gateways enable services in the Connect mesh to make cross-DC connections via gateways, where each datacenter may not have full node interconnectivity. Consul docs with more information: https://www.consul.io/docs/connect/gateways/mesh-gateway The following group level service block can be used to establish a Connect mesh gateway. service { connect { gateway { mesh { // no configuration } } } } Services can make use of a mesh gateway by configuring so in their upstream blocks, e.g. service { connect { sidecar_service { proxy { upstreams { destination_name = "<service>" local_bind_port = <port> datacenter = "<datacenter>" mesh_gateway { mode = "<mode>" } } } } } } Typical use of a mesh gateway is to create a bridge between datacenters. A mesh gateway should then be configured with a service port that is mapped from a host_network configured on a WAN interface in Nomad agent config, e.g. client { host_network "public" { interface = "eth1" } } Create a port mapping in the group.network block for use by the mesh gateway service from the public host_network, e.g. network { mode = "bridge" port "mesh_wan" { host_network = "public" } } Use this port label for the service.port of the mesh gateway, e.g. service { name = "mesh-gateway" port = "mesh_wan" connect { gateway { mesh {} } } } Currently Envoy is the only supported gateway implementation in Consul. By default Nomad client will run the latest official Envoy docker image supported by the local Consul agent. The Envoy task can be customized by setting `meta.connect.gateway_image` in agent config or by setting the `connect.sidecar_task` block. Gateways require Consul 1.8.0+, enforced by the Nomad scheduler. Closes #9446	2021-06-04 08:24:49 -05:00
Seth Hoenig	d359eb6f3a	consul/connect: use additional constraints in scheduling connect tasks This PR adds two additional constraints on Connect sidecar and gateway tasks, making sure Nomad schedules them only onto nodes where Connect is actually enabled on the Consul agent. Consul requires `connect.enabled = true` and `ports.grpc = <number>` to be explicitly set on agent configuration before Connect APIs will work. Until now, Nomad would only validate a minimum version of Consul, which would cause confusion for users who try to run Connect tasks on nodes where Consul is not yet sufficiently configured. These constraints prevent job scheduling on nodes where Connect is not actually usable. Closes #10700	2021-06-03 15:43:34 -05:00
Tim Gross	e9777a88ce	plan applier: add trace-level log of plan The plans generated by the scheduler produce high-level output of counts on each evaluation, but when debugging scheduler issues it'd be nice to have a more detailed view of the resulting plan. Emitting this log at trace minimizes the overhead, and producing it in the plan applier makes it easier to find as it will always be on the leader.	2021-06-02 10:25:23 -04:00
Chris Baker	263ddd567c	Node Drain Metadata (#10250)	2021-05-07 13:58:40 -04:00
Michael Schurter	547a718ef6	Merge pull request #10248 from hashicorp/f-remotetask-2021 core: propagate remote task handles	2021-04-30 08:57:26 -07:00
Luiz Aoqui	f1b9055d21	Add metrics for blocked eval resources (#10454) * add metrics for blocked eval resources * docs: add new blocked_evals metrics * fix to call `pruneStats` instead of `stats.prune` directly	2021-04-29 15:03:45 -04:00
Michael Schurter	e62795798d	core: propagate remote task handles Add a new driver capability: RemoteTasks. When a task is run by a driver with RemoteTasks set, its TaskHandle will be propagated to the server in its allocation's TaskState. If the task is replaced due to a down node or draining, its TaskHandle will be propagated to its replacement allocation. This allows tasks to be scheduled in remote systems whose lifecycles are disconnected from the Nomad node's lifecycle. See https://github.com/hashicorp/nomad-driver-ecs for an example ECS remote task driver.	2021-04-27 15:07:03 -07:00
Mahmood Ali	cf24a9eaaf	api: /v1/jobs always include namespaces (#10434) Add Namespace as a top-level field in `/v1/jobs` stub. The `/v1/jobs` endpoint already includes the namespace under `JobSummary`, though the API is odd, as typically the job ID and Namespace are at the same level, and the oddity complicates the UI frontend development. The downside of adding it is a redundant field that makes the response body a bit bigger, especially for clusters with large jobs. Though, it should compress nicely and I expect the overhead to be small relative to overall response size. The benefit of a cleaner and more consistent API seems worth it. Fixes #10431	2021-04-23 16:36:54 -04:00
Mahmood Ali	d2fcce21f8	Migrate all allocs when draining a node (#10411) This fixes a bug affecting drain nodes, where allocs may fail to be migrated if they belong to different namespaces but share the same job name. The reason is that the helper function that creates the migration evals indexed the allocs by job ID without accounting for the namespaces. When job IDs clash, an eval is created for only one and the rest of the allocs remain intact. Fixes #10172	2021-04-21 12:11:14 -04:00
Seth Hoenig	1ee8d5ffc5	api: implement fuzzy search API This PR introduces the /v1/search/fuzzy API endpoint, used for fuzzy searching objects in Nomad. The fuzzy search endpoint routes requests to the Nomad Server leader, which implements the Search.FuzzySearch RPC method. Requests to the fuzzy search API are based on the api.FuzzySearchRequest object, e.g. { "Text": "ed", "Context": "all" } Responses from the fuzzy search API are based on the api.FuzzySearchResponse object, e.g. { "Index": 27, "KnownLeader": true, "LastContact": 0, "Matches": { "tasks": [ { "ID": "redis", "Scope": [ "default", "example", "cache" ] } ], "evals": [], "deployment": [], "volumes": [], "scaling_policy": [], "images": [ { "ID": "redis:3.2", "Scope": [ "default", "example", "cache", "redis" ] } ] }, "Truncations": { "volumes": false, "scaling_policy": false, "evals": false, "deployment": false } } The API is tunable using the new server.search stanza, e.g. server { search { fuzzy_enabled = true limit_query = 200 limit_results = 1000 min_term_length = 5 } } These values can be increased or decreased, so as to provide more search results or to reduce load on the Nomad Server. The fuzzy search API can be disabled entirely by setting `fuzzy_enabled` to `false`.	2021-04-16 16:36:07 -06:00
Tim Gross	dbcc2694b0	refactor: move VolumeRequest validation to Validate method	2021-04-07 11:24:09 -04:00
Seth Hoenig	f17ba33f61	consul: plumbing for specifying consul namespace in job/group This PR adds the common OSS changes for adding support for Consul Namespaces, which is going to be a Nomad Enterprise feature. There is no new functionality provided by this changeset and hopefully no new bugs.	2021-04-05 10:03:19 -06:00
Chris Baker	436d46bd19	Merge branch 'main' into f-node-drain-api	2021-04-01 15:22:57 -05:00
Tim Gross	aec5337862	CSI: HTTP handlers for create/delete/list	2021-03-31 16:37:09 -04:00
Mahmood Ali	0c2551270a	oversubscription: Add MemoryMaxMB to internal structs Start tracking a new MemoryMaxMB field that represents the maximum memory a task may use in the client. This allows tasks to specify a memory reservation (to be used by scheduler when placing the task) but use excess memory used on the client if the client has any. This commit adds the server tracking for the value, and ensures that allocations AllocatedResource fields include the value.	2021-03-30 16:55:58 -04:00
Nick Ethier	daecfa61e6	Merge pull request #10203 from hashicorp/f-cpu-cores Reserved Cores [1/4]: Structs and scheduler implementation	2021-03-29 14:05:54 -04:00
Chris Baker	16e37b986a	reworked Node.Canonicalize() to enforce invariants, fixed a broken test	2021-03-26 18:58:38 +00:00
Chris Baker	770c9cecb5	restored Node.Sanitize() for RPC endpoints multiple other updates from code review	2021-03-26 17:03:15 +00:00
Chris Baker	cb540ed691	added tests that the API doesn't leak Node.SecretID added more documentation on JSON encoding to the contributing guide	2021-03-23 18:09:20 +00:00
Chris Baker	a186badf35	moved JSON handlers and extension code around a bit for proper order of initialization	2021-03-22 14:12:42 +00:00
Chris Baker	9f7bc5a575	refactor?	2021-03-22 01:49:21 +00:00
Chris Baker	dd291e69f4	removed deprecated fields from Drain structs and API node drain: use msgtype on txn so that events are emitted wip: encoding extension to add Node.Drain field back to API responses new approach for hiding Node.SecretID in the API, using `json` tag documented this approach in the contributing guide refactored the JSON handlers with extensions modified event stream encoding to use the go-msgpack encoders with the extensions	2021-03-21 15:30:11 +00:00
Nick Ethier	648ade63ad	scheduler: implement scheduling of reserved cores	2021-03-19 00:29:07 -04:00
Nick Ethier	26b200e8bd	api: add new 'cores' field to task resources	2021-03-18 23:13:30 -04:00
Nick Ethier	4b2912d343	structs: add struct fields and funcs for reservable cpu cores	2021-03-18 22:49:06 -04:00
Tim Gross	fa25e048b2	CSI: unique volume per allocation Add a `PerAlloc` field to volume requests that directs the scheduler to test feasibility for volumes with a source ID that includes the allocation index suffix (ex. `[0]`), rather than the exact source ID. Read the `PerAlloc` field when making the volume claim at the client to determine if the allocation index suffix (ex. `[0]`) should be added to the volume source ID.	2021-03-18 15:35:11 -04:00
Charlie Voiselle	0473f35003	Fixup uses of `sanity` (#10187) * Fixup uses of `sanity` * Remove unnecessary comments. These checks are better explained by earlier comments about the context of the test. Per @tgross, moved the tests together to better reinforce the overall shared context. * Update nomad/fsm_test.go	2021-03-16 18:05:08 -04:00
Tim Gross	97b0e26d1f	RPC endpoints to support 'nomad ui -login' RPC endpoints for the user-driven APIs (`UpsertOneTimeToken` and `ExchangeOneTimeToken`) and token expiration (`ExpireOneTimeTokens`). Includes adding expiration to the periodic core GC job.	2021-03-10 08:17:56 -05:00
Tim Gross	6b2c4b56d0	state store updates for one-time tokens The `OneTimeToken` struct is to support the `nomad ui -login` command. This changeset adds the struct to the Nomad state store.	2021-03-10 08:17:56 -05:00
Tim Gross	b764f52ab9	deploymentwatcher: reset progress deadline on promotion (#10042) In a deployment with two groups (ex. A and B), if group A's canary becomes healthy before group B's, the deadline for the overall deployment will be set to that of group A. When the deployment is promoted, if group A is done it will not contribute to the next deadline cutoff. Group B's old deadline will be used instead, which will be in the past and immediately trigger a deployment progress failure. Reset the progress deadline when the job is promoted to avoid this bug, and to better conform with implicit user expectations around how the progress deadline should interact with promotions.	2021-02-22 16:44:03 -05:00
AndrewChubatiuk	3d0aa2ef56	allocate sidecar task port on host_network interface	2021-02-13 02:42:13 +02:00
Nick Ethier	eacc4da499	Merge branch 'master' into b-9477	2021-02-05 11:58:13 -05:00
Chris Baker	ebbb760ec4	support for scaling_policy in global prefix search	2021-02-03 19:26:57 +00:00
Nick Ethier	43a4d72fda	structs: namespace port validation by host_network	2021-02-02 14:56:52 -05:00
Seth Hoenig	8b05efcf88	consul/connect: Add support for Connect terminating gateways This PR implements Nomad built-in support for running Consul Connect terminating gateways. Such a gateway can be used by services running inside the service mesh to access "legacy" services running outside the service mesh while still making use of Consul's service identity based networking and ACL policies. https://www.consul.io/docs/connect/gateways/terminating-gateway These gateways are declared as part of a task group level service definition within the connect stanza. service { connect { gateway { proxy { // envoy proxy configuration } terminating { // terminating-gateway configuration entry } } } } Currently Envoy is the only supported gateway implementation in Consul. The gateway task can be customized by configuring the connect.sidecar_task block. When the gateway.terminating field is set, Nomad will write/update the Configuration Entry into Consul on job submission. Because CEs are global in scope and there may be more than one Nomad cluster communicating with Consul, there is an assumption that any terminating gateway defined in Nomad for a particular service will be the same among Nomad clusters. Gateways require Consul 1.8.0+, checked by a node constraint. Closes #9445	2021-01-25 10:36:04 -06:00
Dennis Schön	3eaf1432aa	validate connect block allowed only within group.service	2021-01-20 14:34:23 -05:00
Kris Hicks	d71a90c8a4	Fix some errcheck errors (#9811) * Throw away result of multierror.Append When given a multierror.Error, it is mutated, therefore the return value is not needed. Simplify MergeMultierrorWarnings, use StringBuilder * Hash.Write() never returns an error * Remove error that was always nil * Remove error from Resources.Add signature When this was originally written it could return an error, but that was refactored away, and callers of it as of today never handle the error. * Throw away results of io.Copy during Bridge * Handle errors when computing node class in test	2021-01-14 12:46:35 -08:00
Kris Hicks	abb8f2ebc0	Refactor Job.Scale() (#9771)	2021-01-14 12:40:42 -08:00
Drew Bailey	54becaab7d	Events/acl events (#9595) * fix acl event creation * allow way to access secretID without exposing it to stream test that values are omitted test event creation test acl events payloads are pointers fix failing tests, do all security steps inside constructor * increase time * ignore empty tokens * uncomment line * changelog	2020-12-11 10:40:50 -05:00
Kris Hicks	0cf9cae656	Apply some suggested fixes from staticcheck (#9598)	2020-12-10 07:29:18 -08:00
Kris Hicks	0a3a748053	Add gosimple linter (#9590)	2020-12-09 11:05:18 -08:00
Kris Hicks	93155ba3da	Add gocritic to golangci-lint config (#9556)	2020-12-08 12:47:04 -08:00
James Rasell	fd53963afb	core: fix typo msg used when job ID/name contains a null char.	2020-12-04 09:49:31 +01:00
Benjamin Buzbee	e0acbbfcc6	Fix RPC retry logic in nomad client's rpc.go for blocking queries (#9266)	2020-11-30 15:11:10 -05:00
Seth Hoenig	a35c0db6c7	nomad/structs: validate deprecated task.resource.network port labels Enable users to submit jobs that still make use of the deprecated task.resources.network stanza. Such jobs can be submitted, but will emit a warning.	2020-11-23 12:40:40 -06:00
Nick Ethier	f1ea79f5a8	remove references to default mbits	2020-11-23 10:32:13 -06:00
Nick Ethier	7266376ae6	nomad: update validate to check group networks for task port usage	2020-11-23 10:11:00 -06:00
Nick Ethier	c4ddb0a43a	website: add mbits and network deprecation notice	2020-11-23 10:09:36 -06:00
Seth Hoenig	4cc3c01d5b	Merge pull request #9352 from hashicorp/f-artifact-headers jobspec: add support for headers in artifact stanza	2020-11-13 14:04:27 -06:00
Seth Hoenig	bb8a5816a0	jobspec: add support for headers in artifact stanza This PR adds the ability to set HTTP headers when downloading an artifact from an `http` or `https` resource. The implementation in `go-getter` is such that a new `HTTPGetter` must be created for each artifact that sets headers (as opposed to conveniently setting headers per-request). This PR maintains the memoization of the default Getter objects, creating new ones only for artifacts where headers are set. Closes #9306	2020-11-13 12:03:54 -06:00
Jasmine Dahilig	d6110cbed4	lifecycle: add poststop hook (#8194)	2020-11-12 08:01:42 -08:00
Nick Ethier	5e1634eda1	structs: canonicalize allocatedtaskresources to populate shared ports (#9309)	2020-11-11 16:21:47 -05:00
Chris Baker	e3c0ea654d	auto-complete for recommendations CLI, plus OSS components of recommendations prefix search	2020-11-11 11:13:43 +00:00
Luiz Aoqui	c536286c7a	remove Namespace field from JobScaleRequest	2020-11-09 13:02:05 -05:00
Chris Baker	719077a26d	added new policy capabilities for recommendations API state store: call-out to generic update of job recommendations from job update method recommendations API work, and http endpoint errors for OSS support for scaling policies in task block of job spec add query filters for ScalingPolicy list endpoint command: nomad scaling policy list: added -job and -type	2020-10-28 14:32:16 +00:00
Drew Bailey	86080e25a9	Send events to EventSinks (#9171) * Process to send events to configured sinks This PR adds a SinkManager to a server which is responsible for managing managed sinks. Managed sinks subscribe to the event broker and send events to a sink writer (webhook). When changes to the eventstore are made the sinkmanager and managed sink are responsible for reloading or starting a new managed sink. * periodically check in sink progress to raft Save progress on the last successfully sent index to raft. This allows a managed sink to resume close to where it left off in the event of a lost server or leadership change dereference eventsink so we can accurately use the watchch When using a pointer to eventsink struct it was updated immediately and our reload logic would not trigger	2020-10-26 17:27:54 -04:00
Drew Bailey	1ae39a9ed9	event sink crud operation api (#9155) * network sink rpc/api plumbing state store methods and restore upsert sink test get sink delete sink event sink list and tests go generate new msg types validate sink on upsert * go generate	2020-10-23 14:23:00 -04:00
Michael Schurter	c2dd9bc996	core: open source namespaces	2020-10-22 15:26:32 -07:00
Nick Ethier	4903e5b114	Consul with CNI and host_network addresses (#9095) * consul: advertise cni and multi host interface addresses * structs: add service/check address_mode validation * ar/groupservices: fetch networkstatus at hook runtime * ar/groupservice: nil check network status getter before calling * consul: comment network status can be nil	2020-10-15 15:32:21 -04:00
Pierre Cauchois	13218dc345	Enforce bounds on MaxQueryTime (#9064) The MaxQueryTime value used in QueryOptions.HasTimedOut() can be set to an invalid value that would throw off how RPC requests are retried. This fix uses the same logic that enforces the MaxQueryTime bounds in the blockingRPC() call.	2020-10-15 08:43:06 -04:00
Michael Schurter	dd09fa1a4a	Merge pull request #9055 from hashicorp/f-9017-resources api: add field filters to /v1/{allocations,nodes}	2020-10-14 14:49:39 -07:00
Drew Bailey	c463479848	filter on additional filter keys, remove switch statement duplication properly wire up durable event count move newline responsibility moves newline creation from NDJson to the http handler, json stream only encodes and sends now ignore snapshot restore if broker is disabled enable dev mode to access event stream without acl use mapping instead of switch use pointers for config sizes, remove unused ttl, simplify closed conn logic	2020-10-14 14:14:33 -04:00
Michael Schurter	8ccbd92cb6	api: add field filters to /v1/{allocations,nodes} Fixes #9017 The ?resources=true query parameter includes resources in the object stub listings. Specifically: - For `/v1/nodes?resources=true` both the `NodeResources` and `ReservedResources` field are included. - For `/v1/allocations?resources=true` the `AllocatedResources` field is included. The ?task_states=false query parameter removes TaskStates from /v1/allocations responses. (By default TaskStates are included.)	2020-10-14 10:35:22 -07:00
Drew Bailey	684807bddb	namespace filtering	2020-10-14 12:44:43 -04:00
Drew Bailey	b4c135358d	use Events to wrap index and events, store in events table	2020-10-14 12:44:39 -04:00
Drew Bailey	9d48818eb8	writetxn can return error, add alloc and job generic events. Add events table for durability	2020-10-14 12:44:39 -04:00
Drew Bailey	4793bb4e01	Events/deployment events (#9004) * Node Drain events and Node Events (#8980) Deployment status updates handle deployment status updates (paused, failed, resume) deployment alloc health generate events from apply plan result txn err check, slim down deployment event one ndjson line per index * consolidate down to node event + type * fix UpdateDeploymentAllocHealth test invocations * fix test	2020-10-14 12:44:37 -04:00
Drew Bailey	a4a2975edf	Event Stream API/RPC (#8947) This commit adds a /v1/events/stream endpoint to stream events from. The stream framer has been updated to include a SendFull method which does not fragment the data between multiple frames. This essentially treats the stream framer as an envelope to adhere to the stream framer interface in the UI. If the `encode` query parameter is omitted events will be streamed as newline delimited JSON.	2020-10-14 12:44:36 -04:00
Seth Hoenig	ed13e5723f	consul/connect: dynamically select envoy sidecar at runtime As newer versions of Consul are released, the minimum version of Envoy it supports as a sidecar proxy also gets bumped. Starting with the upcoming Consul v1.9.X series, Envoy v1.11.X will no longer be supported. Current versions of Nomad hardcode a version of Envoy v1.11.2 to be used as the default implementation of Connect sidecar proxy. This PR introduces a change such that each Nomad Client will query its local Consul for a list of Envoy proxies that it supports (https://github.com/hashicorp/consul/pull/8545) and then launch the Connect sidecar proxy task using the latest supported version of Envoy. If the `SupportedProxies` API component is not available from Consul, Nomad will fall back to the old version of Envoy supported by old versions of Consul. Setting the meta configuration option `meta.connect.sidecar_image` or setting the `connect.sidecar_task` stanza will take precedence as is the current behavior for sidecar proxies. Setting the meta configuration option `meta.connect.gateway_image` will take precedence as is the current behavior for connect gateways. `meta.connect.sidecar_image` and `meta.connect.gateway_image` may make use of the special `${NOMAD_envoy_version}` variable interpolation, which resolves to the newest version of Envoy supported by the Consul agent. Addresses #8585 #7665	2020-10-13 09:14:12 -05:00
Nick Ethier	d45be0b5a6	client: add NetworkStatus to Allocation (#8657)	2020-10-12 13:43:04 -04:00
Seth Hoenig	0c5ae5769f	Merge pull request #9029 from hashicorp/b-tgs-updates consul/connect: trigger update as necessary on connect changes	2020-10-05 16:48:04 -05:00
Seth Hoenig	f44a4f68ee	consul/connect: trigger update as necessary on connect changes This PR fixes a long standing bug where submitting jobs with changes to connect services would not trigger updates as expected. Previously, service blocks were not considered as sources of destructive updates since they could be synced with consul non-destructively. With Connect, task group services that have changes to their connect block or to the service port should be destructive, since the network plumbing of the alloc is going to need updating. Fixes #8596 #7991 Non-destructive half in #7192	2020-10-05 14:53:00 -05:00
Chris Baker	7f701fddd0	updated docs and validation to further prohibit null chars in region, datacenter, and job name	2020-10-05 18:01:50 +00:00
Chris Baker	23ea7cd27c	updated job validate to refute job/group/task IDs containing null characters updated CHANGELOG and upgrade guide	2020-10-05 18:01:49 +00:00
Michael Schurter	765473e8b0	jobspec: lower min cpu resources from 10->1 Since CPU resources are usually a soft limit it is desirable to allow setting it as low as possible to allow tasks to run only in "idle" time. Setting it to 0 is still not allowed to avoid potential unintentional side effects with allowing a zero value. While there may not be any side effects this commit attempts to minimize risk by avoiding the issue. This does not change the defaults.	2020-09-30 12:15:13 -07:00
Luiz Aoqui	88d4eecfd0	add scaling policy type	2020-09-29 17:57:46 -04:00
Seth Hoenig	af9543c997	consul: fix validation of task in group-level script-checks When defining a script-check in a group-level service, Nomad needs to know which task is associated with the check so that it can use the correct task driver to execute the check. This PR fixes two bugs: 1) validate service.task or service.check.task is configured 2) make service.check.task inherit service.task if it is itself unset Fixes #8952	2020-09-28 15:02:59 -05:00
Pierre Cauchois	e4b739cafd	RPC Timeout/Retries account for blocking requests (#8921) The current implementation measures RPC request timeout only against config.RPCHoldTimeout, which is fine for non-blocking requests but will almost surely be exceeded by long-poll requests that block for minutes at a time. This adds a HasTimedOut method on the RPCInfo interface that takes into account whether the request is blocking, its maximum wait time, and the RPCHoldTimeout.	2020-09-18 08:58:41 -04:00
Jasmine Dahilig	71a694f39c	Merge pull request #8390 from hashicorp/lifecycle-poststart-hook task lifecycle poststart hook	2020-08-31 13:53:24 -07:00
Jasmine Dahilig	fbe0c89ab1	task lifecycle poststart: code review fixes	2020-08-31 13:22:41 -07:00
Tim Gross	b77fe023b5	MRD: move 'job stop -global' handling into RPC (#8776) The initial implementation of global job stop for MRD looped over all the regions in the CLI for expedience. This changeset includes the OSS parts of moving this into the RPC layer so that API consumers don't have to implement this logic themselves.	2020-08-28 14:28:13 -04:00
Seth Hoenig	84176c9a41	consul/connect: make use of task kind to determine service name in consul token checks When consul.allow_unauthenticated is set to false, the job_endpoint hook validates that a `-consul-token` is provided and validates the token against the privileges inherent to a Consul Service Identity policy for all the Connect enabled services defined in the job. Before, the check was assuming the service was of type sidecar-proxy. This fixes the check to use the type of the task so we can distinguish between the different connect types.	2020-08-27 12:14:40 -05:00
Seth Hoenig	9f1f2a5673	Merge branch 'master' into f-cc-ingress	2020-08-26 15:31:05 -05:00
Mahmood Ali	45f549e29e	Merge pull request #8691 from hashicorp/b-reschedule-job-versions Respect alloc job version for lost/failed allocs	2020-08-25 18:02:45 -04:00
Mahmood Ali	def768728e	Have Plan.AppendAlloc accept the job	2020-08-25 17:22:09 -04:00
Mahmood Ali	18632955f2	clarify PathEscapesAllocDir specification Clarify how to handle prefix value and path traversal within the alloc dir but outside the prefix directory.	2020-08-24 20:44:26 -04:00
Seth Hoenig	26e77623e5	consul/connect: fixup tests to use new consul sdk	2020-08-24 12:02:41 -05:00
Seth Hoenig	5b072029f2	consul/connect: add initial support for ingress gateways This PR adds initial support for running Consul Connect Ingress Gateways (CIGs) in Nomad. These gateways are declared as part of a task group level service definition within the connect stanza. ```hcl service { connect { gateway { proxy { // envoy proxy configuration } ingress { // ingress-gateway configuration entry } } } } ``` A gateway can be run in `bridge` or `host` networking mode, with the caveat that host networking necessitates manually specifying the Envoy admin listener (which cannot be disabled) via the service port value. Currently Envoy is the only supported gateway implementation in Consul, and Nomad only supports running Envoy as a gateway using the docker driver. Aims to address #8294 and tangentially #8647	2020-08-21 16:21:54 -05:00
Nick Ethier	3cd5f46613	Update UI to use new allocated ports fields (#8631) * nomad: canonicalize alloc shared resources to populate ports * ui: network ports * ui: remove unused task network references and update tests with new shared ports model * ui: lint * ui: revert auto formatting * ui: remove unused page objects * structs: remove unrelated test from bad conflict resolution * ui: formatting	2020-08-20 11:07:13 -04:00
Mahmood Ali	8a342926b7	Respect alloc job version for lost/failed allocs This change fixes a bug where lost/failed allocations are replaced by allocations with the latest versions, even if the version hasn't been promoted yet. Now, when generating a plan for lost/failed allocations, the scheduler first checks if the current deployment is in Canary stage, and if so, it ensures that any lost/failed allocation is replaced by one with the latest promoted version instead.	2020-08-19 09:52:48 -04:00
Tim Gross	38ec70eb8d	multiregion: validation should always return error for OSS (#8687)	2020-08-18 15:35:38 -04:00
Drew Bailey	bd421b6197	Merge pull request #8453 from hashicorp/oss-multi-vault-ns oss components for multi-vault namespaces	2020-07-27 08:45:22 -04:00
Drew Bailey	b296558b8e	oss components for multi-vault namespaces adds in oss components to support enterprise multi-vault namespace feature upgrade specific doc on vault multi-namespaces vault docs update test to reflect new error	2020-07-24 10:14:59 -04:00
James Rasell	da91e1d0fc	api: add namespace to scaling status GET response object.	2020-07-24 11:19:25 +02:00
Lang Martin	a3bfd8c209	structs: Job.Validate only allows stop_after_client_disconnected on batch and service jobs (#8444) * nomad/structs/structs: add to Job.Validate * Update nomad/structs/structs.go Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> * nomad/structs/structs: match error strings to the config file * nomad/structs/structs_test: clarify the test a bit * nomad/structs/structs_test: typo in the test error comparison Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>	2020-07-20 10:27:25 -04:00
Mahmood Ali	fbfe4ab1bd	Atomic eval insertion with job (de-)registration This fixes a bug where jobs may get "stuck" unprocessed, which disproportionately affects periodic jobs around leadership transitions. When registering a job, the job registration and the eval to process it get applied to raft as two separate transactions; if the job registration succeeds but eval application fails, the job may remain unprocessed. Operators may detect such a failure when submitting a job update and getting a 500 error code, and they could retry; periodic job failures are more likely to go unnoticed, and no further periodic invocations will be processed until an operator forces an evaluation. This fixes the issue by ensuring that the job registration and eval application get persisted and processed atomically in the same raft log entry. Also, applies the same change to ensure atomicity in job deregistration. Backward Compatibility We must maintain compatibility in two scenarios: mixed clusters where a leader can handle atomic updates but followers cannot, and a recent cluster that processes old log entries from legacy or mixed cluster mode. To handle these constraints: ensure that the leader continues to emit the Evaluation log entry until all servers have upgraded; also, when processing raft logs, the servers honor evaluations found in both spots, the Eval in job (de-)registration and the eval update entries. When an updated server sees mixed-mode behavior where an eval is inserted into the raft log twice, it ignores the second instance. I made one compromise in consistency in the mixed-mode scenario: servers may disagree on the eval.CreateIndex value: the leader and updated servers will report the job registration index while old servers will report the index of the eval update log entry. This discrepancy doesn't seem to be material - it's the eval.JobModifyIndex that matters.	2020-07-14 11:59:29 -04:00
Tim Gross	bd457343de	MRD: all regions should start pending (#8433) Deployments should wait until kicked off by `Job.Register` so that we can assert that all regions have a scheduled deployment before starting any region. This changeset includes the OSS fixes to support the ENT work. `IsMultiregionStarter` has no more callers in OSS, so remove it here.	2020-07-14 10:57:37 -04:00
Tim Gross	0ce3c1e942	multiregion: allow empty region DCs (#8426) It's supposed to be possible for a region not to have `datacenters` set so that it can use the job's `datacenters` field. This requires that operators use the same DC name across multiple regions, but that's the default client configuration.	2020-07-13 13:34:19 -04:00
Jasmine Dahilig	9e27231953	add poststart hook to task hook coordinator & structs	2020-07-08 11:01:35 -07:00
Chris Baker	a77e012220	better testing of scaling parsing, fixed some broken tests by api changes	2020-07-04 19:32:37 +00:00
Chris Baker	9100b6b7c0	changes to make sure that Max is present and valid, to improve error messages * made api.Scaling.Max a pointer, so we can detect (and complain) when it is neglected * added checks to HCL parsing that it is present * when Scaling.Max is absent/invalid, don't return extraneous error messages during validation * tweak to multiregion handling to ensure that the count is valid on the interpolated regional jobs resolves #8355	2020-07-04 19:05:50 +00:00
Lang Martin	6c22cd587d	api: `nomad debug` new /agent/host (#8325) * command/agent/host: collect host data, multi platform * nomad/structs/structs: new HostDataRequest/Response * client/agent_endpoint: add RPC endpoint * command/agent/agent_endpoint: add Host * api/agent: add the Host endpoint * nomad/client_agent_endpoint: add Agent Host with forwarding * nomad/client_agent_endpoint: use findClientConn This changes forwardMonitorClient and forwardProfileClient to use findClientConn, which was cribbed from the common parts of those funcs. * command/debug: call agent hosts * command/agent/host: eliminate calling external programs	2020-07-02 09:51:25 -04:00
Mahmood Ali	7f460d2706	allocrunner: terminate sidecars in the end This fixes a bug where a batch allocation fails to complete if it has sidecars. If the only remaining running tasks in an allocation are sidecars - we must kill them and mark the allocation as complete.	2020-06-29 15:12:15 -04:00
Mahmood Ali	6605ebd314	Merge pull request #8223 from hashicorp/f-multi-network-validate-ports core: validate port numbers are < 65535	2020-06-26 08:31:01 -04:00
Nick Ethier	89118016fc	command: correctly show host IP in ports output /w multi-host networks (#8289)	2020-06-25 15:16:01 -04:00
Tim Gross	a449009e9f	multiregion validation fixes (#8265) Multi-region jobs need to bypass validating counts, otherwise we get spurious warnings in Job.Plan.	2020-06-24 12:18:51 -04:00
Seth Hoenig	3872b493e5	Merge pull request #8011 from hashicorp/f-cnative-host consul/connect: implement initial support for connect native	2020-06-24 10:33:12 -05:00
Michael Schurter	7869ebc587	docs: add comments to structs.Port struct	2020-06-23 11:38:01 -07:00
Michael Schurter	13ed710a04	core: validate port numbers are <= 65535 The scheduler returns a very strange error if it detects a port number out of range. If these would somehow make it to the client they would overflow when converted to an int32 and could cause conflicts.	2020-06-23 11:31:49 -07:00
Seth Hoenig	4d71f22a11	consul/connect: add support for running connect native tasks This PR adds the capability of running Connect Native Tasks on Nomad, particularly when TLS and ACLs are enabled on Consul. The `connect` stanza now includes a `native` parameter, which can be set to the name of the task that backs the Connect Native Consul service. There is a new Client configuration parameter for the `consul` stanza called `share_ssl`. Like `allow_unauthenticated` the default value is true, but recommended to be disabled in production environments. When enabled, the Nomad Client's Consul TLS information is shared with Connect Native tasks through the normal Consul environment variables. This does NOT include auth or token information. If Consul ACLs are enabled, Service Identity Tokens are automatically injected into the Connect Native task through the CONSUL_HTTP_TOKEN environment variable. Any of the automatically set environment variables can be overridden by the Connect Native task using the `env` stanza. Fixes #6083	2020-06-22 14:07:44 -05:00
Michael Schurter	562704124d	Merge pull request #8208 from hashicorp/f-multi-network multi-interface network support	2020-06-19 15:46:48 -07:00
Nick Ethier	a87e91e971	test: fix up testing around host networks	2020-06-19 13:53:31 -04:00
Nick Ethier	f0ac1f027a	lint: spelling	2020-06-19 11:29:41 -04:00
Tim Gross	b654e1b8a4	multiregion: all regions start in running if no max_parallel (#8209) If `max_parallel` is not set, all regions should begin in a `running` state rather than a `pending` state. Otherwise the first region is set to `running` and then all the remaining regions once it enters `blocked`. That behavior is technically correct in that we have at most `max_parallel` regions running, but definitely not what a user expects.	2020-06-19 11:17:09 -04:00
Nick Ethier	f0559a8162	multi-interface network support	2020-06-19 09:42:10 -04:00
Tim Gross	8a354f828f	store ACL Accessor ID from Job.Register with Job (#8204) In multiregion deployments when ACLs are enabled, the deploymentwatcher needs an appropriately scoped ACL token with the same `submit-job` rights as the user who submitted it. The token will already be replicated, so store the accessor ID so that it can be retrieved by the leader.	2020-06-19 07:53:29 -04:00
Mahmood Ali	38a01c050e	Merge pull request #8192 from hashicorp/f-status-allnamespaces-2 CLI Allow querying all namespaces for jobs and allocations - Try 2	2020-06-18 20:16:52 -04:00
Nick Ethier	4a44deaa5c	CNI Implementation (#7518)	2020-06-18 11:05:29 -07:00
Nick Ethier	0bc0403cc3	Task DNS Options (#7661) Co-Authored-By: Tim Gross <tgross@hashicorp.com> Co-Authored-By: Seth Hoenig <shoenig@hashicorp.com>	2020-06-18 11:01:31 -07:00
Mahmood Ali	e784fe331a	use '*' to indicate all namespaces This reverts the introduction of AllNamespaces parameter that was merged earlier but never got released.	2020-06-17 16:27:43 -04:00
Tim Gross	c14a75bfab	multiregion: use pending instead of paused The `paused` state is used as an operator safety mechanism, so that they can debug a deployment or halt one that's causing a wider failure. By using the `paused` state as the first state of a multiregion deployment, we risked resuming an intentionally operator-paused deployment because of activity in a peer region. This changeset replaces the use of the `paused` state with a `pending` state, and provides a `Deployment.Run` internal RPC to replace the use of the `Deployment.Pause` (resume) RPC we were using in `deploymentwatcher`.	2020-06-17 11:06:14 -04:00
Tim Gross	fd50b12ee2	multiregion: integrate with deploymentwatcher * `nextRegion` should take status parameter * thread Deployment/Job RPCs thru `nextRegion` * add `nextRegion` calls to `deploymentwatcher` * use a better description for paused for peer	2020-06-17 11:06:00 -04:00
Tim Gross	7b12445f29	multiregion: change AutoRevert to OnFailure	2020-06-17 11:05:45 -04:00
Tim Gross	5c4d0a73f4	start all but first region deployment in paused state	2020-06-17 11:05:34 -04:00
Tim Gross	b09b7a2475	Multiregion job registration Integration points for multiregion jobs to be registered in the enterprise version of Nomad: * hook in `Job.Register` for enterprise to send job to peer regions * remove monitoring from `nomad job run` and `nomad job stop` for multiregion jobs	2020-06-17 11:04:58 -04:00
Drew Bailey	9263fcb0d3	Multiregion deploy status and job status CLI	2020-06-17 11:03:34 -04:00
Tim Gross	473a0f1d44	multiregion: unblock and cancel RPCs	2020-06-17 11:02:26 -04:00
Tim Gross	ede3a4f1c4	multiregion: request structs	2020-06-17 11:00:34 -04:00
Tim Gross	6851024925	Multiregion structs Initial struct definitions, jobspec parsing, validation, and conversion between Nomad structs and API structs for multi-region deployments.	2020-06-17 11:00:14 -04:00
Chris Baker	1e3563e08c	wip: added PreserveCounts to struct.JobRegisterRequest, development test for Job.Register	2020-06-16 18:45:17 +00:00
Chris Baker	aeb3ed449e	wip: added .PreviousCount to api.ScalingEvent and structs.ScalingEvent, with developmental tests	2020-06-15 19:40:21 +00:00
Lang Martin	069840bef8	scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect (#8105) (#8138) * scheduler/reconcile: set FollowupEvalID on lost stop_after_client_disconnect * scheduler/reconcile: thread follupEvalIDs through to results.stop * scheduler/reconcile: comment typo * nomad/_test: correct arguments for plan.AppendStoppedAlloc * scheduler/reconcile: avoid nil, cleanup handleDelayed(Lost|Reschedules)	2020-06-09 17:13:53 -04:00
Mahmood Ali	a73cd01a00	Merge pull request #8001 from hashicorp/f-jobs-list-across-nses endpoint to expose all jobs across all namespaces	2020-05-31 21:28:03 -04:00
Drew Bailey	34871f89be	Oss license support for ent builds (#8054) * changes necessary to support oss licensing shims revert nomad fmt changes update test to work with enterprise changes update tests to work with new ent enforcements make check update cas test to use scheduler algorithm back out preemption changes add comments * remove unused method	2020-05-27 13:46:52 -04:00
Seth Hoenig	f6c8db8a8a	consul/connect: use task kind to get service name Fixes #8000 When requesting a Service Identity token from Consul, use the TaskKind of the Task to get at the service name associated with the task. In the past using the TaskName worked because it was generated as a sidecar task with a name that included the service. In the Native context, we need to get at the service name in a more correct way, i.e. using the TaskKind which is defined to include the service name.	2020-05-18 13:46:00 -06:00
Mahmood Ali	5ab2d52e27	endpoint to expose all jobs across all namespaces Allow a `/v1/jobs?all_namespaces=true` to list all jobs across all namespaces. The returned list is to contain a `Namespace` field indicating the job namespace. If ACL is enabled, the request token needs to be a management token or have `namespace:list-jobs` capability on all existing namespaces.	2020-05-18 13:50:46 -04:00
Lang Martin	d3c4700cd3	server: stop after client disconnect (#7939) * jobspec, api: add stop_after_client_disconnect * nomad/state/state_store: error message typo * structs: alloc methods to support stop_after_client_disconnect 1. a global AllocStates to track status changes with timestamps. We need this to track the time at which the alloc became lost originally. 2. ShouldClientStop() and WaitClientStop() to actually do the math * scheduler/reconcile_util: delayByStopAfterClientDisconnect * scheduler/reconcile: use delayByStopAfterClientDisconnect * scheduler/util: updateNonTerminalAllocsToLost comments This was set up to only update allocs to lost if the DesiredStatus had already been set by the scheduler. It seems like the intention was to update the status from any non-terminal state, and not all lost allocs have been marked stop or evict by now * scheduler/testing: AssertEvalStatus just use require * scheduler/generic_sched: don't create a blocked eval if delayed * scheduler/generic_sched_test: several scheduling cases	2020-05-13 16:39:04 -04:00
Mahmood Ali	3b4116e0db	Merge pull request #7894 from hashicorp/b-cronexpr-dst-fix Fix Daylight saving transition handling	2020-05-12 16:36:11 -04:00
Mahmood Ali	938e916d9c	When serializing msgpack, only consider codec tag When serializing structs with msgpack, only consider type tags of `codec`. Hashicorp/go-msgpack (based on ugorji/go) defaults to interpreting the `codec` tag if it's available, but falls back to using `json` if `codec` isn't present. This behavior is surprising in cases where we want to serialize json differently from msgpack, e.g. serializing `ConsulExposeConfig`.	2020-05-11 14:14:10 -04:00
Mahmood Ali	b4fa8e9588	codec: we use hashicorp/go-msgpack exclusively No need to maintain two msgpack handles!	2020-05-11 14:05:29 -04:00
Mahmood Ali	57435950d7	Update current DST and some code style issues	2020-05-07 19:27:05 -04:00
Mahmood Ali	c8fb132956	Update cronexpr to point to hashicorp/cronexpr	2020-05-07 17:50:45 -04:00
Tim Gross	801ebcfe8d	periodic GC for CSI plugins (#7878) This changeset implements a periodic garbage collection of unused CSI plugins. Plugins are self-cleaning when the last allocation for a plugin is stopped, but this feature will cover any missing edge cases and ensure that upgrades from 0.11.0 and 0.11.1 get any stray plugins cleaned up.	2020-05-06 16:49:12 -04:00
Tim Gross	00c9bd7ff0	reorder volume claim batch request raft message (#7871) For backwards compatibility during upgrades, new raft message types need to come at the end of the enum.	2020-05-06 08:57:51 -04:00
Lang Martin	28bac139cb	client/heartbeatstop: destroy allocs when disconnected from servers - track lastHeartbeat, the client local time of the last successful heartbeat round trip - track allocations with `stop_after_client_disconnect` configured - trigger allocation destroy (which handles cleanup) - restore heartbeat/killable allocs tracking when allocs are recovered from disk - on client restart, stop those allocs after a grace period if the servers are still partitioned	2020-05-01 12:35:49 -04:00
Tim Gross	a7a64443e1	csi: move volume claim release into volumewatcher (#7794) This changeset adds a subsystem to run on the leader, similar to the deployment watcher or node drainer. The `Watcher` performs a blocking query on updates to the `CSIVolumes` table and triggers reaping of volume claims. This will avoid tying up scheduling workers by immediately sending volume claim workloads into their own loop, rather than blocking the scheduling workers in the core GC job doing things like talking to CSI controllers. The volume watcher is enabled on leader step-up and disabled on leader step-down. The volume claim GC mechanism now makes an empty claim RPC for the volume to trigger an index bump. That in turn unblocks the blocking query in the volume watcher so it can assess which claims can be released for a volume.	2020-04-30 09:13:00 -04:00
Anthony Scalisi	9664c6b270	fix spelling errors (#6985)	2020-04-20 09:28:19 -04:00
Lang Martin	1750426d04	csi: run volume claim GC on `job stop -purge` (#7615) * nomad/state/state_store: error message copy/paste error * nomad/structs/structs: add a VolumeEval to the JobDeregisterResponse * nomad/job_endpoint: synchronously, volumeClaimReap on job Deregister * nomad/core_sched: make volumeClaimReap available without a CoreSched * nomad/job_endpoint: Deregister return early if the job is missing * nomad/job_endpoint_test: job Deregistration is idempotent * nomad/core_sched: conditionally ignore alloc status in volumeClaimReap * nomad/job_endpoint: volumeClaimReap all allocations, even running * nomad/core_sched_test: extra argument to collectClaimsToGCImpl * nomad/job_endpoint: job deregistration is not idempotent	2020-04-03 17:37:26 -04:00
Chris Baker	8ec252e627	added indices to the job scaling events, so we could properly do blocking queries on the job scaling status	2020-04-01 17:28:19 +00:00
Chris Baker	40d6b3bbd1	adding raft and state_store support to track job scaling events updated ScalingEvent API to record "message string,error bool" instead of confusing "reason,error *string"	2020-04-01 16:15:14 +00:00

1365 commits