open-nomad

Author	SHA1	Message	Date
James Rasell	45f4689f9c	chore: fixup inconsistent method receiver names. (#11704 )	2021-12-20 11:44:21 +01:00
Tim Gross	a0cf5db797	provide `-no-shutdown-delay` flag for job/alloc stop (#11596 ) Some operators use very long group/task `shutdown_delay` settings to safely drain network connections to their workloads after service deregistration. But during incident response, they may want to cause that drain to be skipped so they can quickly shed load. Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and `nomad job stop` commands that bypasses the delay. This sets a new desired transition state on the affected allocations that the allocation/task runner will identify during pre-kill on the client. Note (as documented here) that using this flag will almost always result in failed inbound network connections for workloads as the tasks will exit before clients receive updated service discovery information and won't be gracefully drained.	2021-12-13 14:54:53 -05:00
Tim Gross	624ecab901	evaluations list pagination and filtering (#11648 ) API queries can request pagination using the `NextToken` and `PerPage` fields of `QueryOptions`, when supported by the underlying API. Add a `NextToken` field to the `structs.QueryMeta` so that we have a common field across RPCs to tell the caller where to resume paging from on their next API call. Include this field on the `api.QueryMeta` as well so that it's available for future versions of List HTTP APIs that wrap the response with `QueryMeta` rather than returning a simple list of structs. In the meantime callers can get the `X-Nomad-NextToken`. Add pagination to the `Eval.List` RPC by checking for pagination token and page size in `QueryOptions`. This will allow resuming from the last ID seen so long as the query parameters and the state store itself are unchanged between requests. Add filtering by job ID or evaluation status over the results we get out of the state store. Parse the query parameters of the `Eval.List` API into the arguments expected for filtering in the RPC call.	2021-12-10 13:43:03 -05:00
Tim Gross	03e697a69d	scheduler: config option to reject job registration (#11610 ) During incident response, operators may find that automated processes elsewhere in the organization can be generating new workloads on Nomad clusters that are unable to handle the workload. This changeset adds a field to the `SchedulerConfiguration` API that causes all job registration calls to be rejected unless the request has a management ACL token.	2021-12-06 15:20:34 -05:00
Michael Schurter	3d248153f4	Merge pull request #11579 from hashicorp/b-getscalingpolicy-rpc-index-response rpc: fix scaling policy get index response when policy is found.	2021-11-30 10:43:20 -08:00
Tim Gross	39acac33a0	ui: change Consul/Vault base URL field name (#11589 ) Give ourselves some room for extension in the UI configuration block by naming the field `ui_url`, which will let us have an `api_url`. Fix the template path to ensure we're getting the right value from the API.	2021-11-30 13:20:29 -05:00
James Rasell	2412e9916d	rpc: fix scaling policy get index response when policy is found. When GetPolicy is called within the scaling handler, the index table was being used to populate the reply index regardless of whether the policy was found or not. This change fixes that behaviour so that the policy modify index is used when the policy lookup is successful.	2021-11-26 10:40:27 +01:00
Luiz Aoqui	0cf1964651	Merge remote-tracking branch 'origin/release-1.2.2' into merge-release-1.2.2-branch	2021-11-24 14:40:45 -05:00
Nomad Release Bot	2e4ef67c2d	remove generated files	2021-11-24 18:54:50 +00:00
Tim Gross	fcb96de9a7	config: UI configuration block with Vault/Consul links (#11555 ) Add `ui` block to agent configuration to enable/disable the web UI and provide the web UI with links to Vault/Consul.	2021-11-24 11:20:02 -05:00
Luiz Aoqui	9d6842dd4d	Don't emit scaling event error when a deployment is underway (#11556 )	2021-11-23 10:20:18 -05:00
James Rasell	751c8217d1	core: allow setting and propagation of eval priority on job de/registration (#11532 ) This change modifies the Nomad job register and deregister RPCs to accept an updated option set which includes eval priority. This param is optional and overrides the use of the job priority to set the eval priority. In order to ensure all evaluations as a result of the request use the same eval priority, the priority is shared to the allocReconciler and deploymentWatcher. This creates a new distinction between eval priority and job priority. The Nomad agent HTTP API has been modified to allow setting the eval priority on job update and delete. To keep consistency with the current v1 API, job update accepts this as a payload param; job delete accepts this as a query param. Any user supplied value is validated within the agent HTTP handler removing the need to pass invalid requests to the server. The register and deregister opts functions now allow for setting the eval priority on requests. The change includes a small change to the DeregisterOpts function which handles nil opts. This brings the function in line with the RegisterOpts.	2021-11-23 09:23:31 +01:00
Luiz Aoqui	d3c1a03edd	Merge tag 'v1.2.1' into merge-release-1.2.1-branch Version 1.2.1	2021-11-22 10:47:04 -05:00
Tim Gross	e729133134	api: return 404 for alloc FS list/stat endpoints (#11482 ) * api: return 404 for alloc FS list/stat endpoints If the alloc filesystem doesn't have a file requested by the List Files or Stat File API, we currently return an HTTP 500 error with the expected "file not found" error message. Return an HTTP 404 error instead. * update FS Handler Previously the FS handler would interpret a 500 status as a 404 in the adapter layer by checking if the response body contained the text or if the response status was 500 and then throw an error code for 404. Co-authored-by: Jai Bhagat <jaybhagat841@gmail.com>	2021-11-17 11:15:07 -05:00
Charlie Voiselle	176de1bfe6	Refactor sendAck(3) into sendAck(2),sendNack(2),sendAcknowledgement(3) (#11506 )	2021-11-17 10:49:55 -05:00
Danish Prakash	1e2c9b3aa0	client: emit max_memory metric (#11490 )	2021-11-17 08:34:22 -05:00
Nomad Release bot	c4463682e7	Generate files for 1.2.0 release	2021-11-15 23:00:30 +00:00
James Rasell	99955eb80f	Merge pull request #11426 from hashicorp/b-set-dereg-eval-priority-correctly rpc: set the deregistration eval priority to the job priority.	2021-11-05 15:53:10 +01:00
James Rasell	2cc661c523	Merge pull request #11429 from hashicorp/b-set-scale-eval-priority-correctly rpc: set the job scale eval priority to the job priority.	2021-11-05 15:52:31 +01:00
Alessandro De Blasis	07c670fdc0	cli: show `host_network` in `nomad status` (#11432 ) Enhance the CLI in order to return the host network in two flavors (default, verbose) of the `node status` command. Fixes: #11223. Signed-off-by: Alessandro De Blasis <alex@deblasis.net>	2021-11-05 09:02:46 -04:00
Michael Schurter	3718557041	Merge pull request #11416 from hashicorp/f-rejected-info core: bump rejected plans from debug -> info	2021-11-03 16:49:28 -07:00
Luiz Aoqui	5be6710216	add `/s/port-plan-failure` redirect and link to it in plan reject log message	2021-11-02 20:43:54 -04:00
James Rasell	ac9268a429	rpc: set the job scale eval priority to the job priority.	2021-11-02 12:57:53 +01:00
James Rasell	afb6913428	rpc: set the deregistration eval priority to the job priority. Previously when creating an eval for job deregistration, the eval priority was set to the default value regardless of the job priority. In situations where an operator would want to deregister a high priority job so they could re-register; the evaluation may get blocked for some time on a busy cluster because of the deregister priority. If a job had a lower than default priority and was deregistered, the deregister eval would get a priority higher than that of the job. If we attempted to register another job with a higher priority than this, but still below the default, the deregister would be actioned before the register. Both situations described above seem incorrect and unexpected from a user perspective. This fix modifies the behaviour to set the deregister eval priority to that of the job, if available. Otherwise the default value is still used.	2021-11-02 09:11:44 +01:00
Luiz Aoqui	655ac2719f	Allow using specific object ID on diff (#11400 )	2021-11-01 15:16:31 -04:00
Michael Schurter	efe5714840	core: bump rejected plans from debug -> info As we have continued to see reports of #9506 we need to elevate this log line as it is the only way to detect when plans are being erroneously rejected. Users who see this log line repeatedly should drain and restart the node in the log line. This seems to work around the issue. Please post any details on #9506!	2021-10-31 12:51:42 -07:00
Mahmood Ali	daf20f9788	vault: set JobID in Vault metadata (#11397 ) Closes: #11395 .	2021-10-27 07:20:29 -07:00
Mahmood Ali	1de395b42c	Fix preemption panic (#11346 ) Fix a bug where the scheduler may panic when preemption is enabled. The conditions are a bit complicated: A job with higher priority that schedules multiple allocations that preempt other multiple allocations on the same node, due to port/network/device assignments. The cause of the bug is incidental mutation of internal cached data. `RankedNode` computes and caches proposed allocations in https://github.com/hashicorp/nomad/blob/v1.1.6/scheduler/rank.go#L42-L53 . But scheduler then mutates the list to remove pre-emptable allocs in https://github.com/hashicorp/nomad/blob/v1.1.6/scheduler/rank.go#L293-L294, and `RemoveAllocs` mutates and sets the tail of cached slice with `nil`s triggering a nil-pointer dereferencing case. I fixed the issue by avoiding the mutation in `RemoveAllocs` - the micro-optimization there doesn't seem necessary. Fixes https://github.com/hashicorp/nomad/issues/11342	2021-10-19 20:22:03 -04:00
Michael Schurter	59fda1894e	Merge pull request #11167 from a-zagaevskiy/master Support configurable dynamic port range	2021-10-13 16:47:38 -07:00
Michael Schurter	e14cd34392	client: improve errors & tests for dynamic ports	2021-10-13 16:25:25 -07:00
Dave May	2d14c54fa0	debug: Improve namespace and region support (#11269 ) * Include region and namespace in CLI output * Add region and prefix matching for server members * Add namespace and region API outputs to cluster metadata folder * Add region awareness to WaitForClient helper function * Add helper functions for SliceStringHasPrefix and StringHasPrefixInSlice * Refactor test client agent generation * Add tests for region * Add changelog	2021-10-12 16:58:41 -04:00
Florian Apolloner	511cae92b4	Fixed plan diffing to handle non-unique service names. (#10965 )	2021-10-12 16:42:39 -04:00
Mahmood Ali	583b9f2506	Merge pull request #11089 from hashicorp/b-cve-2021-37218 Apply authZ for nomad Raft RPC layer	2021-10-05 08:49:21 -04:00
Michael Schurter	7071425af3	client: defensively log reserved ports - Fix test broken due to being improperly setup. - Include min/max ports in default client config.	2021-10-04 15:43:35 -07:00
Luiz Aoqui	0a62bdc3c5	fix panic when Connect mesh gateway doesn't have a proxy block (#11257 ) Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2021-10-04 15:52:07 -04:00
Michael Schurter	20cf993b29	Merge pull request #11235 from hashicorp/test-evalbroker test: fix fake by increasing time window	2021-10-04 11:16:41 -07:00
Mahmood Ali	4d90afb425	gofmt all the files mostly to handle build directives in 1.17.	2021-10-01 10:14:28 -04:00
Michael Schurter	c6e72b6818	client: output reserved ports with min/max ports Also add a little more min/max port testing and add the consts back that had been removed: but unexported and as defaults.	2021-09-30 17:05:46 -07:00
Michael Schurter	c1bd10456c	test: fix flaky TestAutopilot_CleanupDeadServer The fix seems to be related to the pointer comparison and swapping we did around killing a non-leader. I actually can't quite explain it, but when comparing against Consul's version of this test I noticed they used the slice index to track the killed server instead of pointer swapping. As soon as I switched to slice index tracking I could no longer reproduce the failure. In addition: - Tested membership counts on all servers instead of just 1 for added correctness. - Stopped testing raft v1 because it is unsupported.	2021-09-28 16:38:56 -07:00
Michael Schurter	d7e123d7cd	test: fix fake by increasing time window Test originally only had a 10ms time window tolerance. Increased to 100ms and also improved assertions and docstrings.	2021-09-28 12:22:59 -07:00
Luiz Aoqui	1035805a42	connect: update allowed protocols in ingress gateway config (#11187 )	2021-09-16 10:47:53 -04:00
James Rasell	0e926ef3fd	allow configuration of Docker hostnames in bridge mode (#11173 ) Add a new hostname string parameter to the network block which allows operators to specify the hostname of the network namespace. Changing this causes a destructive update to the allocation and it is omitted if empty from API responses. This parameter also supports interpolation. In order to have a hostname passed as a configuration param when creating an allocation network, the CreateNetwork func of the DriverNetworkManager interface needs to be updated. In order to minimize the disruption of future changes, rather than add another string func arg, the function now accepts a request struct along with the allocID param. The struct has the hostname as a field. The in-tree implementations of DriverNetworkManager.CreateNetwork have been modified to account for the function signature change. In updating for the change, the enhancement of adding hostnames to network namespaces has also been added to the Docker driver, whilst the default Linux manager does not currently implement it.	2021-09-16 08:13:09 +02:00
Aleksandr Zagaevskiy	ebb87e65fe	Support configurable dynamic port range	2021-09-10 11:52:47 +03:00
Isabel Suchanek	ab51050ce8	events: fix wildcard namespace handling (#10935 ) This fixes a bug in the event stream API where it currently interprets namespace=* as an actual namespace, not a wildcard. When Nomad parses incoming requests, it sets namespace to default if not specified, which means the request namespace will never be an empty string, which is what the event subscription was checking for. This changes the conditional logic to check for a wildcard namespace instead of an empty one. It also updates some event tests to include the default namespace in the subscription to match current behavior. Fixes #10903	2021-09-02 09:36:55 -07:00
Andy Assareh	ed580726d6	corrected peersInfoContent - was copied from Consul and not updated for Nomad (#11109 ) updated with Nomad ports and web link (learn guide: https://learn.hashicorp.com/tutorials/nomad/outage-recovery)	2021-09-01 08:30:49 +02:00
James Rasell	b6813f1221	chore: fix incorrect docstring formatting.	2021-08-30 11:08:12 +02:00
Mahmood Ali	ce43a7a852	update tests to make an actual RaftRPC	2021-08-27 10:37:30 -04:00
Mahmood Ali	ff7c1ca79b	Apply authZ for nomad Raft RPC layer When mTLS is enabled, only nomad servers of the region should access the Raft RPC layer. Clients and servers in other regions should only use the Nomad RPC endpoints. Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Seth Hoenig <shoenig@hashicorp.com>	2021-08-26 15:10:07 -04:00
Kush	1d6da9b55e	docs: fix typo in structs/event.go	2021-08-21 17:02:07 +05:30
Mahmood Ali	b4ed8acbff	tests: attempt deflaking TestAutopilot_CleanupDeadServer Attempt to deflake the test by avoiding shutting down the leaders, as leadership recovery takes more time, and consequently longer to process raft configuration changes and potentially failing the test.	2021-08-18 15:37:25 -04:00
Mahmood Ali	bcac5268df	tests: deflake TestLeader_LeftLeader Wait for leadership to be established before killing leader.	2021-08-18 14:19:00 -04:00
Mahmood Ali	84a3522133	Consider all system jobs for a new node (#11054 ) When a node becomes ready, create an eval for all system jobs across namespaces. The previous code uses `job.ID` to deduplicate evals, but that ignores the job namespace. Thus if there are multiple jobs in different namespaces sharing the same ID/Name, only one will be considered for running in the new node. Thus, Nomad may skip running some system jobs in that node.	2021-08-18 09:50:37 -04:00
Mahmood Ali	c37339a8c8	Merge pull request #9160 from hashicorp/f-sysbatch core: implement system batch scheduler	2021-08-16 09:30:24 -04:00
Michael Schurter	a7aae6fa0c	Merge pull request #10848 from ggriffiths/listsnapshot_secrets CSI Listsnapshot secrets support	2021-08-10 15:59:33 -07:00
Mahmood Ali	ea003188fa	system: re-evaluate node on feasibility changes (#11007 ) Fix a bug where system jobs may fail to be placed on a node that initially was not eligible for system job placement. This change causes the reschedule to re-evaluate the node if any attribute used in feasibility checks changes. Fixes https://github.com/hashicorp/nomad/issues/8448	2021-08-10 17:17:44 -04:00
Mahmood Ali	bfc766357e	deployments: canary=0 is implicitly autopromote (#11013 ) In a multi-task-group job, treat 0 canary groups as auto-promote. This change fixes an edge case where Nomad requires a manual promotion, if the job had any group with canary=0 and rest of groups having auto_promote set. Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2021-08-10 17:06:40 -04:00
Seth Hoenig	3371214431	core: implement system batch scheduler This PR implements a new "System Batch" scheduler type. Jobs can make use of this new scheduler by setting their type to 'sysbatch'. Like the name implies, sysbatch can be thought of as a hybrid between system and batch jobs - it is for running short lived jobs intended to run on every compatible node in the cluster. As with batch jobs, sysbatch jobs can also be periodic and/or parameterized dispatch jobs. A sysbatch job is considered complete when it has been run on all compatible nodes until reaching a terminal state (success or failed on retries). Feasibility and preemption are governed the same as with system jobs. In this PR, the update stanza is not yet supported. The update stanza is still limited in functionality for the underlying system scheduler, and is not useful yet for sysbatch jobs. Further work in #4740 will improve support for the update stanza and deployments. Closes #2527	2021-08-03 10:30:47 -04:00
Grant Griffiths	fecbbaee22	CSI ListSnapshots secrets implementation Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>	2021-07-28 11:30:29 -07:00
Michael Schurter	ea996c321d	Merge pull request #10916 from hashicorp/f-audit-log-mode Add audit log file mode config parameter	2021-07-27 12:16:37 -07:00
Mahmood Ali	ac3cf10849	nomad: only activate one-time auth tokens with 1.1.0 (#10952 ) Fix a panic in handling one-time auth tokens, used to support `nomad ui --authenticate`. If the nomad leader is a 1.1.x with some servers running as 1.0.x, the pre-1.1.0 servers risk crashing and the cluster may lose quorum. That can happen when `nomad authenticate -ui` command is issued, or when the leader scans for expired tokens every 10 minutes. Fixed #10943 .	2021-07-27 13:17:55 -04:00
Seth Hoenig	54d9bad657	Merge pull request #10904 from hashicorp/b-no-affinity-intern core: remove internalization of affinity strings	2021-07-22 09:09:07 -05:00
Michael Schurter	c06ea132d3	audit: add file mode configuration parameter Rest of implementation is in nomad-enterprise	2021-07-20 10:54:53 -07:00
Alan Guo Xiang Tan	e2d1372ac9	Fix typo.	2021-07-16 13:49:15 +08:00
Seth Hoenig	ac5c83cafd	core: remove internalization of affinity strings Basically the same as #10896 but with the Affinity struct. Since we use reflect.DeepEquals for job comparison, there is risk of false positives for changes due to a job struct with memoized vs non-memoized strings. Closes #10897	2021-07-15 15:15:39 -05:00
Seth Hoenig	bea8066187	core: add spec changed test with constraints	2021-07-14 10:44:09 -05:00
Seth Hoenig	52cf03df4a	core: fix constraint tests	2021-07-14 10:39:38 -05:00
Seth Hoenig	1aec25f1df	core: do not memoize constraint strings This PR causes Nomad to no longer memoize the String value of a Constraint. The private memoized variable may or may not be initialized at any given time, which means a reflect.DeepEqual comparison between two jobs (e.g. during Plan) may return incorrect results. Fixes #10836	2021-07-14 10:04:35 -05:00
Mahmood Ali	1f34f2197b	Merge pull request #10806 from hashicorp/munda/idempotent-job-dispatch Enforce idempotency of dispatched jobs using token on dispatch request	2021-07-08 10:23:31 -04:00
Tim Gross	9f128a28ae	service: remove duplicate name check during validation (#10868 ) When a task group with `service` block(s) is validated, we validate that there are no duplicates, but this validation doesn't have access to the task environment because it hasn't been created yet. Services and checks with interpolation can be flagged incorrectly as conflicting. Name conflicts in services are not actually an error in Consul and users have reported wanting to use the same service name for task groups differentiated by tags.	2021-07-08 09:43:38 -04:00
Alex Munda	9e5061ef87	Update idempotency comment to reflect all jobs Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>	2021-07-07 15:54:56 -05:00
Alex Munda	557a227de1	Match idempotency key on all child jobs and return existing job when idempotency keys match.	2021-07-02 14:08:46 -05:00
Alex Munda	34c63b086b	Move idempotency check closer to validate. Log error.	2021-07-02 10:58:42 -05:00
Grant Griffiths	7f8e285559	CSI: Snapshot volume create should use vol.Secrets (#10840 ) Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>	2021-07-02 08:28:22 -04:00
Alex Munda	baba8fe7df	Update tests after moving idempotency token to WriteOptions	2021-07-01 08:48:57 -05:00
Alex Munda	848918018c	Move idempotency token to write options. Remove DispatchIdempotent	2021-06-30 15:10:48 -05:00
Alex Munda	baae6d5546	Update comment about idempotency check	2021-06-30 12:30:44 -05:00
Alex Munda	01bcd9c41c	Make idempotency error user friendly Co-authored-by: Tim Gross <tgross@hashicorp.com>	2021-06-30 12:26:33 -05:00
Alex Munda	ca86c7ba0c	Add idempotency token to dispatch request instead of special meta key	2021-06-29 15:59:23 -05:00
Alex Munda	122136b657	Always allow idempotency key meta. Tests for idempotent dispatch	2021-06-29 10:30:04 -05:00
Alex Munda	561cd9fc7f	Enforce idempotency of dispatched jobs using special meta key	2021-06-23 17:10:31 -05:00
Seth Hoenig	ebaaaae88e	consul/connect: Validate uniqueness of Connect upstreams within task group This PR adds validation during job submission that Connect proxy upstreams within a task group are using different listener addresses. Otherwise, a duplicate envoy listener will be created and not be able to bind. Closes #7833	2021-06-18 16:50:53 -05:00
Mahmood Ali	33dfe98770	deployment watcher: Reuse allocsCh if allocIndex remains the same (#10756 ) Fix deployment watchers to avoid creating unnecessary deployment watcher goroutines and blocking queries. `deploymentWatcher.getAllocsCh` creates a new goroutine that makes a blocking query to fetch updates of deployment allocs. ## Background When operators submit a new or updated service job, Nomad creates a new deployment by default. The deployment object controls how fast to place the allocations through [`max_parallel`](https://www.nomadproject.io/docs/job-specification/update#max_parallel) and health checks configurations. The `scheduler` and `deploymentwatcher` package collaborate to achieve deployment logic: The scheduler only places the canaries and `max_parallel` allocations for a new deployment; the `deploymentwatcher` monitors for alloc progress and then enqueues a new evaluation whenever the scheduler should reprocess a job and places the next `max_parallel` round of allocations. The `deploymentwatcher` package makes blocking queries against the state store, to fetch all deployments and the relevant allocs for each running deployment. If `deploymentwatcher` fails or is hindered from fetching the state, the deployments fail to make progress. `Deploymentwatcher` logic only runs on the leader. ## Why unnecessary deployment watchers can halt cluster progress Previously, `getAllocsCh` was called on every for-loop iteration in the `deploymentWatcher.watch()` function. However, the for-loop may iterate many times before the allocs get updated. In fact, whenever a new deployment is created/updated/deleted, all `deploymentWatcher`s get notified through `w.deploymentUpdateCh`. The `getAllocsCh` goroutines and blocking queries spike significantly and grow quadratically with respect to the number of running deployments. The growth leads to two adverse outcomes: 1. it spikes the CPU/Memory usage, potentially leading to OOM or very slow processing 2. it activates the [query rate limiter](`abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L896-L898)`), so later the watcher fails to get updates and consequently fails to make progress towards placing new allocations for the deployment! So the cluster fails to catch up and fails to make progress in almost all deployments. The cluster recovers after a leader transition: the deposed leader stops all watchers and frees up goroutines and blocking queries; the new leader recreates the watchers without the quadratic growth and remaining under the rate limiter. Well, until a spike of deployments is created, triggering the condition again. ### Relevant Code References Path for deployment monitoring: * [`Watcher.watchDeployments`](`abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L164-L192)`) loops waiting for deployment updates. * On every deployment update, [`w.getDeploys`](`abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L194-L229)`) returns all deployments in the system * `watchDeployments` calls `w.add(d)` on every active deployment * which in turn [updates existing watcher if one is found](`abaa9c5c5b/nomad/deploymentwatcher/deployments_watcher.go (L251-L255)`). * The deployment watcher [updates the local deployment field and triggers the `deploymentUpdateCh` channel]( `abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L136-L147)`) * The [deployment watcher `deploymentUpdateCh` selector is activated](`abaa9c5c5b/nomad/deploymentwatcher/deployment_watcher.go (L455-L489)`). Most of the time the selector clause is a no-op, because the flow was triggered due to another deployment update * The `watch` for-loop iterates again and in the previous code we create yet another goroutine and blocking call that risks being rate limited. Co-authored-by: Tim Gross <tgross@hashicorp.com>	2021-06-14 16:01:01 -04:00
Seth Hoenig	6eeaefa59f	Merge pull request #10754 from hashicorp/b-client-connect-constraint consul/connect: remove unnecessary connect constraint on clients	2021-06-14 09:41:25 -05:00
Seth Hoenig	7b8e15159b	consul/connect: remove unnecessary connect constraint on clients PR https://github.com/hashicorp/nomad/pull/10702 added 2 new constraints for connect jobs - one for Consul gRPC listener, and one for Connect being enabled on Clients. Connect does not need to be enabled on clients, only on Consul servers. Remove the extra constraint. Discuss: https://discuss.hashicorp.com/t/nomad-1-1-1-and-consul-connect-enabled-on-consul-clients/25295	2021-06-14 08:01:45 -05:00
James Rasell	0cccf7c2b8	volumewatcher: fix test data race.	2021-06-14 12:11:35 +02:00
James Rasell	a99fcfb4c8	Merge pull request #10745 from hashicorp/b-fix-test-datarace-deploymentwatcher deploymentwatcher: fix test data race.	2021-06-11 17:23:03 +02:00
James Rasell	939b23936a	Merge pull request #10744 from hashicorp/b-remove-duplicate-imports chore: remove duplicate import statements	2021-06-11 16:42:34 +02:00
Mahmood Ali	74efd3626e	Merge pull request #10742 from hashicorp/deflake-tests-20210608 Deflaking Test 2021 June edition	2021-06-11 09:14:40 -04:00
James Rasell	c168108bb7	Merge pull request #10739 from hashicorp/f-remove-unused-types-pkg core: remove unused types pkg and PeriodicCallback type.	2021-06-11 13:27:22 +02:00
James Rasell	ff75b4da09	deploymentwatcher: fix test data race.	2021-06-11 11:55:21 +02:00
James Rasell	492e308846	tests: remove duplicate import statements.	2021-06-11 09:39:22 +02:00
Mahmood Ali	9b35bf1858	deflake TestNomad_BootstrapExpect and other leader tests The test fails reliably locally on my machine. The test uses non-dev mode where Raft actions get committed to disk, causing operations to exceed the 50ms tight Raft deadlines. So, here we ensure that non-dev servers use default Raft config files with longer timeouts. Also, noticed that the test queries a server that may be a follower with a stale state. I've updated the test to ensure we query the leader for its state. The Barrier call ensures that the leader is a "stable" leader with committed entries. Protects against a window where a new leader reports the previous term before it commits a raft log entry.	2021-06-10 22:04:10 -04:00
Mahmood Ali	ff73cc279e	tests: deflake TestAgentProfile_RemoteClient TestAgentProfile_RemoteClient test must wait for the client node to be registered in raft state store, and not merely that the server has a network connection from the client. In https://app.circleci.com/pipelines/github/hashicorp/nomad/15539/workflows/8dcbc3f3-946b-4da0-b089-9093788bc0c9/jobs/147919, notice how the `node registration complete` log line occurred after the test had already failed. This is another case of flakiness due to not waiting for client registration.	2021-06-10 22:00:15 -04:00
Mahmood Ali	8009d9837c	tests: deflake TestMonitor_Monitor_RemoteServer and cross-region tests Ensure that all servers are joined to each other before the test proceeds, instead of just joining them to the first server and relying on background serf propagation. Relying on background serf propagation is a cause of flakiness, especially for tests with multiple regions. The server receiving the RPC may not be aware of the region and fail to forward RPC accordingly. For example, consider `TestMonitor_Monitor_RemoteServer` failure in https://app.circleci.com/pipelines/github/hashicorp/nomad/16402/workflows/7f327235-7d0c-40ba-9757-600522afca51/jobs/158045 you can observe: * `nomad-117` is joined to `nomad-118` and `nomad-119` * `nomad-119` is the foreign region * `nomad-117` gains leadership in the default region, `nomad-118` is the non-leader * search logs for `nomad: adding server` and notice that `nomad-118` only added `nomad-118` and `nomad-118`, but not `nomad-119`! * so the query to the non-leader in the test fails to be forwarded to the appropriate region.	2021-06-10 21:27:55 -04:00
James Rasell	25883eca43	core: remove unused types pkg and PeriodicCallback type.	2021-06-10 15:57:13 +02:00
Nomad Release Bot	4fe52bc753	remove generated files	2021-06-10 08:04:25 -04:00
Nomad Release bot	7cc7389afd	Generate files for 1.1.1 release	2021-06-10 08:04:25 -04:00
Mahmood Ali	aa77c2731b	tests: use standard library testing.TB Glint pulled in an updated version of mitchellh/go-testing-interface which broke some existing tests because the update added a Parallel() method to testing.T. This switches to the standard library testing.TB which doesn't have a Parallel() method.	2021-06-09 16:18:45 -07:00
Seth Hoenig	dbdc479970	consul: move consul acl tests into ent files (cherry-pick ent back to oss) This PR moves a lot of Consul ACL token validation tests into ent files, so that we can verify correct behavior difference between OSS and ENT Nomad versions.	2021-06-09 08:38:42 -05:00
Seth Hoenig	87be8c4c4b	consul: correctly check consul acl token namespace when using consul oss This PR fixes the Nomad Object Namespace <-> Consul ACL Token relationship check when using Consul OSS (or Consul ENT without namespace support). Nomad v1.1.0 introduced a regression where Nomad would fail the validation when submitting Connect jobs and allow_unauthenticated set to true, with Consul OSS - because it would do the namespace check against the Consul ACL token assuming the "default" namespace, which does not work because Consul OSS does not have namespaces. Instead of making the bad assumption, expand the namespace check to handle each special case explicitly. Fixes #10718	2021-06-08 13:55:57 -05:00
Jasmine Dahilig	ca4be6857e	deployment query rate limit (#10706 )	2021-06-04 12:38:46 -07:00
Seth Hoenig	d026ff1f66	consul/connect: add support for connect mesh gateways This PR implements first-class support for Nomad running Consul Connect Mesh Gateways. Mesh gateways enable services in the Connect mesh to make cross-DC connections via gateways, where each datacenter may not have full node interconnectivity. Consul docs with more information: https://www.consul.io/docs/connect/gateways/mesh-gateway The following group level service block can be used to establish a Connect mesh gateway. service { connect { gateway { mesh { // no configuration } } } } Services can make use of a mesh gateway by configuring so in their upstream blocks, e.g. service { connect { sidecar_service { proxy { upstreams { destination_name = "<service>" local_bind_port = <port> datacenter = "<datacenter>" mesh_gateway { mode = "<mode>" } } } } } } Typical use of a mesh gateway is to create a bridge between datacenters. A mesh gateway should then be configured with a service port that is mapped from a host_network configured on a WAN interface in Nomad agent config, e.g. client { host_network "public" { interface = "eth1" } } Create a port mapping in the group.network block for use by the mesh gateway service from the public host_network, e.g. network { mode = "bridge" port "mesh_wan" { host_network = "public" } } Use this port label for the service.port of the mesh gateway, e.g. service { name = "mesh-gateway" port = "mesh_wan" connect { gateway { mesh {} } } } Currently Envoy is the only supported gateway implementation in Consul. By default Nomad client will run the latest official Envoy docker image supported by the local Consul agent. The Envoy task can be customized by setting `meta.connect.gateway_image` in agent config or by setting the `connect.sidecar_task` block. Gateways require Consul 1.8.0+, enforced by the Nomad scheduler. Closes #9446	2021-06-04 08:24:49 -05:00
Seth Hoenig	4c087efd59	Merge pull request #10702 from hashicorp/f-cc-constraints consul/connect: use additional constraints in scheduling connect tasks	2021-06-04 08:11:21 -05:00
Tim Gross	8b2ecde5b4	csi: accept list of caps during validation in volume register When `nomad volume create` was introduced in Nomad 1.1.0, we changed the volume spec to take a list of capabilities rather than a single capability, to meet the requirements of the CSI spec. When a volume is registered via `nomad volume register`, we should be using the same fields to validate the volume with the controller plugin.	2021-06-04 07:57:26 -04:00
Seth Hoenig	d359eb6f3a	consul/connect: use additional constraints in scheduling connect tasks This PR adds two additional constraints on Connect sidecar and gateway tasks, making sure Nomad schedules them only onto nodes where Connect is actually enabled on the Consul agent. Consul requires `connect.enabled = true` and `ports.grpc = <number>` to be explicitly set on agent configuration before Connect APIs will work. Until now, Nomad would only validate a minimum version of Consul, which would cause confusion for users who try to run Connect tasks on nodes where Consul is not yet sufficiently configured. These constraints prevent job scheduling on nodes where Connect is not actually usable. Closes #10700	2021-06-03 15:43:34 -05:00
Tim Gross	c01d661c98	csi: validate `volume` block has `attachment_mode` and `access_mode` The `attachment_mode` and `access_mode` fields are required for CSI volumes. The `mount_options` block is only allowed for CSI volumes.	2021-06-03 16:07:19 -04:00
Tim Gross	e9777a88ce	plan applier: add trace-level log of plan The plans generated by the scheduler produce high-level output of counts on each evaluation, but when debugging scheduler issues it'd be nice to have a more detailed view of the resulting plan. Emitting this log at trace minimizes the overhead, and producing it in the plan applier makes it easier to find as it will always be on the leader.	2021-06-02 10:25:23 -04:00
Tim Gross	7a55a6af16	leader: call eval log formatting lazily Arguments to our logger's various write methods are evaluated eagerly, so method calls in log parameters will always be called, regardless of log level. Move some logger messages to the logger's `Fmt` method so that `GoString` is evaluated lazily instead.	2021-06-02 09:59:55 -04:00
Ryan Sundberg	d43c5f98a5	CSI: Include MountOptions in capabilities sent to CSI for all RPCs Include the VolumeCapability.MountVolume data in ControllerPublishVolume, CreateVolume, and ValidateVolumeCapabilities RPCs sent to the CSI controller. The previous behavior was to only include the MountVolume capability in the NodeStageVolume request, which on some CSI implementations would be rejected since the Volume was not originally provisioned with the specific mount capabilities requested.	2021-05-24 10:59:54 -04:00
James Rasell	db4e2541bd	events: fix event endpoint tests to ignore heartbeats. It seems that when this PR was raised, the Nomad CI provider was having availability issues, meaning the test suite was not correctly run, thus allowing broken tests into main. The PR itself exercised test code which had not been hit before. The particular problem is when identifying whether the event received is a heartbeat; this was performed using standard Golang conditionals. Unfortunately the operator == is not defined on byte arrays, resulting in the check always returning false. To overcome this issue the code now uses the bytes.Equal function to correctly compare the data.	2021-05-24 10:28:19 +02:00
Szabolcs Gelencsér	fc97bd6acf	events: fix slow client connection to empty event stream (#10637 ) * events: fix slow client connection to empty event stream * doc: fix changelog of event stream connection init	2021-05-21 13:17:07 -04:00
Chris Baker	263ddd567c	Node Drain Metadata (#10250 )	2021-05-07 13:58:40 -04:00
Mahmood Ali	102763c979	Support disabling TCP checks for connect sidecar services	2021-05-07 12:10:26 -04:00
Seth Hoenig	b024d85f48	connect: use deterministic injected dynamic exposed port This PR uses the checksum of the check for which a dynamic exposed port is being generated (instead of a UUID prefix) so that the generated port label is deterministic. This fixes 2 bugs: - 'job plan' output is now idempotent for jobs making use of injected ports - tasks will no longer be destructively updated when jobs making use of injected ports are re-run without changing any user specified part of job config. Closes: https://github.com/hashicorp/nomad/issues/10099	2021-04-30 15:18:22 -06:00
Michael Schurter	547a718ef6	Merge pull request #10248 from hashicorp/f-remotetask-2021 core: propagate remote task handles	2021-04-30 08:57:26 -07:00
Mahmood Ali	52d881f567	Allow configuring memory oversubscription (#10466 ) Cluster operators want to have better control over memory oversubscription and may want to enable/disable it based on their experience. This PR adds a scheduler configuration field to control memory oversubscription. It's an additional field that can be set in the [API via Scheduler Config](https://www.nomadproject.io/api-docs/operator/scheduler), or [the agent server config](https://www.nomadproject.io/docs/configuration/server#configuring-scheduler-config). I opted to have the memory oversubscription be an opt-in, but happy to change it. To enable it, operators should call the API with: ```json { "MemoryOversubscriptionEnabled": true } ``` If memory oversubscription is disabled, submitting jobs specifying `memory_max` will produce a "Memory oversubscription is not enabled" warning, but the jobs will be accepted without them accessing the additional memory. The warning message is like: ``` $ nomad job run /tmp/j Job Warnings: 1 warning(s): * Memory oversubscription is not enabled; Task cache.redis memory_max value will be ignored ==> Monitoring evaluation "7c444157" Evaluation triggered by job "example" ==> Monitoring evaluation "7c444157" Evaluation within deployment: "9d826f13" Allocation "aa5c3cad" created: node "9272088e", group "cache" Evaluation status changed: "pending" -> "complete" ==> Evaluation "7c444157" finished with status "complete" # then you can examine the Alloc AllocatedResources to validate whether the task is allowed to exceed memory: $ nomad alloc status -json aa5c3cad | jq '.AllocatedResources.Tasks["redis"].Memory' { "MemoryMB": 256, "MemoryMaxMB": 0 } ```	2021-04-29 22:09:56 -04:00
Luiz Aoqui	f1b9055d21	Add metrics for blocked eval resources (#10454 ) * add metrics for blocked eval resources * docs: add new blocked_evals metrics * fix to call `pruneStats` instead of `stats.prune` directly	2021-04-29 15:03:45 -04:00
Seth Hoenig	d54a606819	Merge pull request #10439 from hashicorp/pick-ent-acls-changes e2e: add e2e tests for consul namespaces on ent with acls	2021-04-28 08:30:08 -06:00
Tim Gross	79f81d617e	licensing: remove raft storage and sync This changeset is the OSS portion of the work to remove the raft storage and sync for Nomad Enterprise.	2021-04-28 10:28:23 -04:00
Michael Schurter	e62795798d	core: propagate remote task handles Add a new driver capability: RemoteTasks. When a task is run by a driver with RemoteTasks set, its TaskHandle will be propagated to the server in its allocation's TaskState. If the task is replaced due to a down node or draining, its TaskHandle will be propagated to its replacement allocation. This allows tasks to be scheduled in remote systems whose lifecycles are disconnected from the Nomad node's lifecycle. See https://github.com/hashicorp/nomad-driver-ecs for an example ECS remote task driver.	2021-04-27 15:07:03 -07:00
Seth Hoenig	09cd01a5f3	e2e: add e2e tests for consul namespaces on ent with acls This PR adds e2e tests for Consul Namespaces for Nomad Enterprise with Consul ACLs enabled. Needed to add support for Consul ACL tokens with `namespace` and `namespace_prefix` blocks, which Nomad parses and validates before tossing the token. These bits will need to be picked back to OSS.	2021-04-27 14:45:54 -06:00
Seth Hoenig	d76bcf0e12	Merge pull request #10457 from hashicorp/b-igce-wildcard consul/connect: fix bug where ingress gateways could not use wildcard services	2021-04-27 14:41:47 -06:00
Seth Hoenig	865c7a5841	consul/connect: fix bug where ingress gateways could not use wildcard services This PR fixes a bug where Nomad was more restrictive on Ingress Gateway Configuration Entry definitions than Consul. Before, Nomad would not allow for declaring IGCEs with http listeners with service name "*", which is a special feature allowable by Consul. Note: to make http protocol work, a service-default must be defined setting the protocol to http for each service. Fixes: #9729	2021-04-27 13:42:26 -06:00
Seth Hoenig	f47c6d34f7	consul/connect: check connect group and service names for uppercase characters This PR adds job-submission validation that checks for the use of uppercase characters in group and service names for services that make use of Consul Connect. This prevents attempting to launch services that Consul will not validate correctly, which in turn causes tasks to fail to launch in Nomad. Underlying Consul issue: https://github.com/hashicorp/consul/issues/6765 Closes #7581 #10450	2021-04-27 11:26:37 -06:00
Mahmood Ali	cf24a9eaaf	api: /v1/jobs always include namespaces (#10434 ) Add Namespace as a top-level field in `/v1/jobs` stub. The `/v1/jobs` endpoint already includes the namespace under `JobSummary`, though the API is odd, as typically the job ID and Namespace are at the same level, and the oddity complicates the UI frontend development. The downside of adding it is a redundant field that makes the response body a bit bigger, especially for clusters with large jobs. Though, it should compress nicely and I expect the overhead to be small relative to the overall response size. The benefit of a cleaner and more consistent API seems worth it. Fixes #10431	2021-04-23 16:36:54 -04:00
Mahmood Ali	d2fcce21f8	Migrate all allocs when draining a node (#10411 ) This fixes a bug affecting draining nodes, where allocs may fail to be migrated if they belong to different namespaces but share the same job name. The reason is that the helper function that creates the migration evals indexed the allocs by job ID without accounting for the namespaces. When job IDs clash, an eval is created for only one of them and the rest of the allocs remain intact. Fixes #10172	2021-04-21 12:11:14 -04:00
Seth Hoenig	f71dd3857e	api: include ent fuzzy struct types in oss Small change to pull in ent struct types in a switch statement used by ent. They are benign in oss, this is just to make sure OSS->ENT merges don't create a diff.	2021-04-20 11:19:38 -06:00
Seth Hoenig	4e6dbaaec1	Merge pull request #10184 from hashicorp/f-fuzzy-search api: implement fuzzy search API	2021-04-20 09:06:40 -06:00
Seth Hoenig	509490e5d2	e2e: consul namespace tests from nomad ent (cherry-picked from ent without _ent things) This is part 2/4 of e2e tests for Consul Namespaces. Took a first pass at what the parameterized tests can look like, but only on the ENT side for this PR. Will continue to refactor in the next PRs. Also fixes 2 bugs: - Config Entries registered by Nomad Server on job registration were not getting Namespace set - Group level script checks were not getting Namespace set Those changes will need to be copied back to Nomad OSS. Nomad OSS + no ACLs (previously, needs refactor) Nomad ENT + no ACLs (this) Nomad OSS + ACLs (todo) Nomad ENT + ACLs (todo)	2021-04-19 15:35:31 -06:00
Seth Hoenig	c34ef9eb78	api: fuzzy search results include job name with id in scope	2021-04-16 17:03:36 -06:00
Seth Hoenig	0b2114a7a5	api: make fuzzy searching case-agnostic	2021-04-16 16:56:10 -06:00
Seth Hoenig	1ee8d5ffc5	api: implement fuzzy search API This PR introduces the /v1/search/fuzzy API endpoint, used for fuzzy searching objects in Nomad. The fuzzy search endpoint routes requests to the Nomad Server leader, which implements the Search.FuzzySearch RPC method. Requests to the fuzzy search API are based on the api.FuzzySearchRequest object, e.g. { "Text": "ed", "Context": "all" } Responses from the fuzzy search API are based on the api.FuzzySearchResponse object, e.g. { "Index": 27, "KnownLeader": true, "LastContact": 0, "Matches": { "tasks": [ { "ID": "redis", "Scope": [ "default", "example", "cache" ] } ], "evals": [], "deployment": [], "volumes": [], "scaling_policy": [], "images": [ { "ID": "redis:3.2", "Scope": [ "default", "example", "cache", "redis" ] } ] }, "Truncations": { "volumes": false, "scaling_policy": false, "evals": false, "deployment": false } } The API is tunable using the new server.search stanza, e.g. server { search { fuzzy_enabled = true limit_query = 200 limit_results = 1000 min_term_length = 5 } } These values can be increased or decreased, so as to provide more search results or to reduce load on the Nomad Server. The fuzzy search API can be disabled entirely by setting `fuzzy_enabled` to `false`.	2021-04-16 16:36:07 -06:00
Nick Spain	085b54bd0b	Update TaskGroup services edited diff test to actually check Body	2021-04-13 09:15:35 -04:00
Nick Spain	bfd4980e3f	Hash Body field as part of ServiceCheck	2021-04-13 09:15:35 -04:00
Nick Spain	653d84ef68	Add a 'body' field to the check stanza Consul allows specifying the HTTP body to send in a health check. Nomad uses Consul for health checking so this just plumbs the value through to where the Consul API is called. There is no validation that `body` is not used with an incompatible check method like GET.	2021-04-13 09:15:35 -04:00
Charlie Voiselle	8afb9eb05d	Fix parameterized <-> non-parameterized job error (#10357 ) The error messages are reversed from tests performed above them. The test uses the `validateJobUpdate()` function, but ignores the text of the error message itself.	2021-04-12 09:27:04 -04:00
Tim Gross	3113cced7b	CSI: ensure page slices are within bounds Plugins could potentially ignore the `max_entries` field and return a list of entries that is greater, so we slice the return value in the server RPC to enforce this value. But page sizes less than the number of entries for the external CSI ListVolumes and ListSnapshots RPCs could cause a panic, so fix the boundary checking.	2021-04-09 14:12:38 -04:00
Lars Lehtonen	61d3c3b480	nomad/structs: fix diff	2021-04-09 08:21:46 -04:00
Tim Gross	0892d34ff9	CSI: capability block is required for volume registration	2021-04-08 13:02:24 -04:00
Tim Gross	d2e479505c	CSI: capability check ListVolumes at RPC for nicer error messages The plugin stub object does not include fine-grained capability checks, which means `nomad volume status -verbose` will return ugly and verbose error "Unimplemented" messages from the plugin if it does not support the CSI `ListVolumes` RPC. Return a nicer error message from our RPC handler instead.	2021-04-07 12:00:22 -04:00
Tim Gross	276633673d	CSI: use AccessMode/AttachmentMode from CSIVolumeClaim Registration of Nomad volumes previously allowed for a single volume capability (access mode + attachment mode pair). The recent `volume create` command requires that we pass a list of requested capabilities, but the existing workflow for claiming volumes and attaching them on the client assumed that the volume's single capability was correct and unchanging. Add `AccessMode` and `AttachmentMode` to `CSIVolumeClaim`, use these fields to set the initial claim value, and add backwards compatibility logic to handle the existing volumes that already have claims without these fields.	2021-04-07 11:24:09 -04:00
Tim Gross	dbcc2694b0	refactor: move VolumeRequest validation to Validate method	2021-04-07 11:24:09 -04:00
Tim Gross	72c07f15fb	refactor: internal claim methods should be private	2021-04-07 11:24:09 -04:00
Seth Hoenig	fe8fce00d9	consul: minor CR cleanup	2021-04-05 10:10:16 -06:00
Seth Hoenig	f17ba33f61	consul: plumbing for specifying consul namespace in job/group This PR adds the common OSS changes for adding support for Consul Namespaces, which is going to be a Nomad Enterprise feature. There is no new functionality provided by this changeset and hopefully no new bugs.	2021-04-05 10:03:19 -06:00
Chris Baker	21bc48ca29	json handles were moved to a new package in #10202 this was unnecessary after refactoring, so this moves them back to their original location in package structs	2021-04-02 13:31:10 +00:00
Chris Baker	436d46bd19	Merge branch 'main' into f-node-drain-api	2021-04-01 15:22:57 -05:00
Tim Gross	0856483115	CSI: fingerprint detailed node capabilities In order to support new node RPCs, we need to fingerprint plugin capabilities in more detail. This changeset mirrors recent work to fingerprint controller capabilities, but is not yet in use by any Nomad RPC.	2021-04-01 16:00:58 -04:00
Tim Gross	466b620fa4	CSI: volume snapshot	2021-04-01 11:16:52 -04:00
Tim Gross	71b9daffb9	CSI: fix misleading HTTP test The HTTP test to create CSI volumes depends on having a controller plugin to talk to, but the test was using a node-only plugin, which allows it to silently ignore the missing controller.	2021-03-31 16:37:09 -04:00
Tim Gross	9fc4cf1419	CSI: fingerprint detailed controller capabilities In order to support new controller RPCs, we need to fingerprint volume capabilities in more detail and perform controller RPCs only when the specific capability is present. This fixes a bug in Ceph support where the plugin can only support create/delete but we assume that it also supports attach/detach.	2021-03-31 16:37:09 -04:00
Tim Gross	f149abfa41	CSI: volume creation/registration should not validate attachment The CSI specification requires that we validate a list of `Capability` (access mode + accessibility) when we create a volume, but the existing volume registration workflow incorrectly validates a single capability. The specific capability required by a volume claim is checked at the time we make the claim, so remove the check for `AttachmentMode`/`AccessMode`.	2021-03-31 16:37:09 -04:00
Tim Gross	aec5337862	CSI: HTTP handlers for create/delete/list	2021-03-31 16:37:09 -04:00
Tim Gross	d38008176e	CSI: create/delete/list volume RPCs This commit implements the RPC handlers on the client that talk to the CSI plugins on that client for the Create/Delete/List RPC.	2021-03-31 16:37:09 -04:00
Tim Gross	43622680fa	test infrastructure for mock client RPCs (#10193 ) This commit includes a new test client that allows overriding the RPC protocols. Only the RPCs that are passed in are registered, which lets you implement a mock RPC in the server tests. This commit includes an example of this for the ClientCSI RPC server.	2021-03-31 16:37:09 -04:00
Mahmood Ali	e24dac2f39	fixup! oversubscription: Add MemoryMaxMB to internal structs	2021-03-31 09:26:26 -04:00
Mahmood Ali	4ec2d8f5e4	fixup! oversubscription: Add MemoryMaxMB to internal structs	2021-03-31 08:52:20 -04:00
Mahmood Ali	0c2551270a	oversubscription: Add MemoryMaxMB to internal structs Start tracking a new MemoryMaxMB field that represents the maximum memory a task may use in the client. This allows tasks to specify a memory reservation (to be used by the scheduler when placing the task) but use excess memory on the client if the client has any. This commit adds the server tracking for the value, and ensures that allocations AllocatedResource fields include the value.	2021-03-30 16:55:58 -04:00
Nick Ethier	daecfa61e6	Merge pull request #10203 from hashicorp/f-cpu-cores Reserved Cores [1/4]: Structs and scheduler implementation	2021-03-29 14:05:54 -04:00
Chris Baker	16e37b986a	reworked Node.Canonicalize() to enforce invariants, fixed a broken test	2021-03-26 18:58:38 +00:00
Chris Baker	99ef00d5da	reinserted/expanded fsm node.canonicalize test that was still needed	2021-03-26 17:10:39 +00:00
Chris Baker	770c9cecb5	restored Node.Sanitize() for RPC endpoints multiple other updates from code review	2021-03-26 17:03:15 +00:00
Mahmood Ali	dbc3850358	Merge pull request #10145 from hashicorp/b-periodic-init-status periodic: always reset periodic children status	2021-03-26 09:19:08 -04:00
Mahmood Ali	5d75705edd	dispatched parameterized job should clear status too	2021-03-25 15:14:21 -04:00
Mahmood Ali	e643742a38	Add a test for parameterized summary counts	2021-03-25 11:27:09 -04:00
Mahmood Ali	b0e048bfa4	periodic: always reset periodic children status Fixes a bug where Nomad reports negative or incorrect running children counts for periodic jobs. The periodic dispatcher derives a child job without resetting the status. If the periodic job has a `running` status, the derived job will start with `running` status and transition to `pending`. Since this is an unexpected transition, the counting in StateStore.setJobSummary gets out of sync and results in negative/incorrect values. Note that this only affects periodic jobs after a leader transition. During the first job registration, the job is added with `pending` or `""` status. However, after a leader transition, the new leader repopulates the dispatcher heap with `"running"` status and triggers the bug.	2021-03-25 11:27:09 -04:00
Chris Baker	07646316e5	some comments on the new json extensions/encoding	2021-03-23 18:18:51 +00:00
Chris Baker	a43b3a8736	change to fail-safe in json encoding	2021-03-23 18:13:10 +00:00
Chris Baker	cb540ed691	added tests that the API doesn't leak Node.SecretID added more documentation on JSON encoding to the contributing guide	2021-03-23 18:09:20 +00:00
Drew Bailey	74836b95b2	configuration and oss components for licensing (#10216 ) * configuration and oss components for licensing * vendor sync	2021-03-23 09:08:14 -04:00
Mahmood Ali	73b7b98018	testing: default nomad test nodes to 1.0.0 (#10213 )	2021-03-23 08:24:26 -04:00
Chris Baker	a186badf35	moved JSON handlers and extension code around a bit for proper order of initialization	2021-03-22 14:12:42 +00:00
Chris Baker	9f7bc5a575	refactor?	2021-03-22 01:49:21 +00:00
Chris Baker	dd291e69f4	removed deprecated fields from Drain structs and API node drain: use msgtype on txn so that events are emitted wip: encoding extension to add Node.Drain field back to API responses new approach for hiding Node.SecretID in the API, using `json` tag documented this approach in the contributing guide refactored the JSON handlers with extensions modified event stream encoding to use the go-msgpack encoders with the extensions	2021-03-21 15:30:11 +00:00
Nick Ethier	648ade63ad	scheduler: implement scheduling of reserved cores	2021-03-19 00:29:07 -04:00
Nick Ethier	26b200e8bd	api: add new 'cores' field to task resources	2021-03-18 23:13:30 -04:00
Nick Ethier	4b2912d343	structs: add struct fields and funcs for reservable cpu cores	2021-03-18 22:49:06 -04:00
Tim Gross	fa25e048b2	CSI: unique volume per allocation Add a `PerAlloc` field to volume requests that directs the scheduler to test feasibility for volumes with a source ID that includes the allocation index suffix (ex. `[0]`), rather than the exact source ID. Read the `PerAlloc` field when making the volume claim at the client to determine if the allocation index suffix (ex. `[0]`) should be added to the volume source ID.	2021-03-18 15:35:11 -04:00
Tim Gross	9b2b580d1a	CSI: remove prefix matching from CSIVolumeByID and fix CLI prefix matching (#10158 ) Callers of `CSIVolumeByID` are generally assuming they should receive a single volume. This potentially results in feasibility checking being performed against the wrong volume if a volume's ID is a prefix substring of another volume's ID (for example: "test" and "testing"). Removing the incorrect prefix matching from `CSIVolumeByID` breaks prefix matching in the command line client. Add the required elements for prefix matching to the commands and API.	2021-03-18 14:32:40 -04:00
Charlie Voiselle	0473f35003	Fixup uses of `sanity` (#10187 ) * Fixup uses of `sanity` * Remove unnecessary comments. These checks are better explained by earlier comments about the context of the test. Per @tgross, moved the tests together to better reinforce the overall shared context. * Update nomad/fsm_test.go	2021-03-16 18:05:08 -04:00
Mahmood Ali	1d48433356	server: handle invalid jobs in expose handler hook (#10154 ) The expose handler hook must handle if the submitted job is invalid. Without this validation, the rpc handler panics on invalid input. Co-authored-by: Tim Gross <tgross@hashicorp.com>	2021-03-10 09:12:46 -05:00
Tim Gross	7010a344d6	one-time token: never return expired tokens	2021-03-10 08:17:56 -05:00
Tim Gross	97b0e26d1f	RPC endpoints to support 'nomad ui -login' RPC endpoints for the user-driven APIs (`UpsertOneTimeToken` and `ExchangeOneTimeToken`) and token expiration (`ExpireOneTimeTokens`). Includes adding expiration to the periodic core GC job.	2021-03-10 08:17:56 -05:00
Tim Gross	6b2c4b56d0	state store updates for one-time tokens The `OneTimeToken` struct is to support the `nomad ui -login` command. This changeset adds the struct to the Nomad state store.	2021-03-10 08:17:56 -05:00
Andre Ilhicas	f45fc6c899	consul/connect: enable setting local_bind_address in upstream	2021-02-26 11:37:31 +00:00
Drew Bailey	86d9e1ff90	Merge pull request #9955 from hashicorp/on-update-services Service and Check on_update configuration option (readiness checks)	2021-02-24 10:11:05 -05:00
Tim Gross	b764f52ab9	deploymentwatcher: reset progress deadline on promotion (#10042 ) In a deployment with two groups (ex. A and B), if group A's canary becomes healthy before group B's, the deadline for the overall deployment will be set to that of group A. When the deployment is promoted, if group A is done it will not contribute to the next deadline cutoff. Group B's old deadline will be used instead, which will be in the past and immediately trigger a deployment progress failure. Reset the progress deadline when the job is promoted to avoid this bug, and to better conform with implicit user expectations around how the progress deadline should interact with promotions.	2021-02-22 16:44:03 -05:00
James Rasell	6553cc3da7	drainer: fix error message when handling drain deadlined nodes.	2021-02-18 11:45:44 +01:00
AndrewChubatiuk	cd152643fb	fixed connect port label	2021-02-13 02:42:14 +02:00
AndrewChubatiuk	3d0aa2ef56	allocate sidecar task port on host_network interface	2021-02-13 02:42:13 +02:00
Nick Ethier	fcc1f4c805	Merge pull request #9946 from hashicorp/b-9477 structs: namespace port validation by host_network	2021-02-11 12:53:28 -05:00
Seth Hoenig	45e0e70a50	consul/connect: enable custom sidecars to use expose checks This PR enables jobs configured with a custom sidecar_task to make use of the `service.expose` feature for creating checks on services in the service mesh. Before we would check that sidecar_task had not been set (indicating that something other than envoy may be in use, which would not support envoy's expose feature). However Consul has not added support for anything other than envoy and probably never will, so having the restriction in place seems like an unnecessary hindrance. If Consul ever does support something other than Envoy, they will likely find a way to provide the expose feature anyway. Fixes #9854	2021-02-09 10:49:37 -06:00
Drew Bailey	8507d54e3b	e2e test for on_update service checks check_restart not compatible with on_update=ignore reword caveat	2021-02-08 08:32:40 -05:00
Drew Bailey	82f971f289	OnUpdate configuration for services and checks Allow for readiness type checks by configuring nomad to ignore warnings or errors reported by a service check. This allows the deployment to progress while Consul handles introducing the service into a resource pool once the check passes.	2021-02-08 08:32:40 -05:00
Nick Ethier	eacc4da499	Merge branch 'master' into b-9477	2021-02-05 11:58:13 -05:00
Tim Gross	eb3dd17fb2	volumes: implement plan diff for volume requests The details of host volume and CSI volume requests do not show up in `nomad plan` outputs, although the updates are detected by the scheduler and result in an update as expected.	2021-02-04 16:55:17 -05:00
Chris Baker	ebbb760ec4	support for scaling_policy in global prefix search	2021-02-03 19:26:57 +00:00
Nick Ethier	43a4d72fda	structs: namespace port validation by host_network	2021-02-02 14:56:52 -05:00
Seth Hoenig	720780992c	consul/connect: copy bind address map if empty This parameter is now supposed to be non-nil even if empty, and the Copy method should also maintain that invariant.	2021-01-25 10:36:04 -06:00
Seth Hoenig	1ad219c441	consul/connect: remove debug line	2021-01-25 10:36:04 -06:00


3952 commits