open-nomad

Author	SHA1	Message	Date
Piotr Kazmierczak	f4d6efe69f	acl: make auth method default across all types (#15869 )	2023-01-26 14:17:11 +01:00
James Rasell	5d33891910	sso: allow binding rules to create management ACL tokens. (#15860 ) * sso: allow binding rules to create management ACL tokens. * docs: update binding rule docs to detail management type addition.	2023-01-26 09:57:44 +01:00
scottduszy	851a3a8e6c	docs: correct "User" attribute in Podman Task Driver Docs (#15421 )	2023-01-25 18:52:16 -05:00
Luiz Aoqui	f2dd46d1db	docs: add caveat on dynamic blocks (#15857 )	2023-01-25 15:54:45 -05:00
Ashlee M Boyer	57f8ebfa26	docs: Migrate link formats (#15779 ) * Adding check-legacy-links-format workflow * Adding test-link-rewrites workflow * chore: updates link checker workflow hash * Migrating links to new format Co-authored-by: Kendall Strautman <kendallstrautman@gmail.com>	2023-01-25 09:31:14 -08:00
Nick Wales	825af1f62a	docker: add option for Windows isolation modes (#15819 )	2023-01-24 16:31:48 -05:00
Karl Johann Schubert	b773a1b77f	client: add disk_total_mb and disk_free_mb config options (#15852 )	2023-01-24 09:14:22 -05:00
Tim Gross	a51149736d	Rename `nomad.broker.total_blocked` metric (#15835 ) This changeset fixes a long-standing point of confusion in metrics emitted by the eval broker. The eval broker has a queue of "blocked" evals that are waiting for an in-flight ("unacked") eval of the same job to be completed. But this "blocked" state is not the same as the `blocked` status that we write to raft and expose in the Nomad API to end users. There's a second metric `nomad.blocked_eval.total_blocked` that refers to evaluations in that state. This has caused ongoing confusion in major customer incidents and even in our own documentation! (Fixed in this PR.) There's little functional change in this PR aside from the name of the metric emitted, but there's a bit refactoring to clean up the names in `eval_broker.go` so that there aren't name collisions and multiple names for the same state. Changes included are: * Everything that was previously called "pending" referred to entities that were associated witht he "ready" metric. These are all now called "ready" to match the metric. * Everything named "blocked" in `eval_broker.go` is now named "pending", except for a couple of comments that actually refer to blocked RPCs. * Added a note to the upgrade guide docs for 1.5.0. * Fixed the scheduling performance metrics docs because the description for `nomad.broker.total_blocked` was actually the description for `nomad.blocked_eval.total_blocked`.	2023-01-20 14:23:56 -05:00
Charlie Voiselle	5ea1d8a970	Add raft snapshot configuration options (#15522 ) * Add config elements * Wire in snapshot configuration to raft * Add hot reload of raft config * Add documentation for new raft settings * Add changelog	2023-01-20 14:21:51 -05:00
Karel	ad56b4dbd2	docs: fix conflict metric documentation, fix typo (#15805 ) The description for the `nomad.nomad.blocked_evals.total_blocked` states that this could include evals blocked due to reached quota limits, but the `total_quota_limit` mentions being exclusive to its own metric. I personally interpret `total_blocked` as encompassing any blocked evals for any reason, as written in the docs. Though someone will have to verify the validity of that statement and possibly rectify the other metric description. Fixed a typo: `limtis` vs `limits`.	2023-01-20 13:54:11 -05:00
James Rasell	4cf40f5606	docs: clarify installing from source requirement on PATH. (#15833 )	2023-01-20 16:10:02 +01:00
James Rasell	c55efdd928	docs: add OIDC login API and CLI docs. (#15818 )	2023-01-20 10:07:26 +01:00
Ashlee M Boyer	4e82c96d36	[docs] Adjusting links for rewrite project (#15810 ) * Adjusting link to page about features * Fixing typo * Replacing old learn links with devdot paths * Removing extra space	2023-01-17 10:55:47 -05:00
Luiz Aoqui	a0652af5dd	docs: add missing parameter `propagation_mode` to `volume_mount` (#15785 )	2023-01-16 10:18:50 -05:00
Ashlee M Boyer	c75ea79f25	Fixing yaml syntax in frontmatter (#15781 )	2023-01-13 14:06:46 -05:00
Seth Hoenig	fe7795ce16	consul/connect: support for proxy upstreams opaque config (#15761 ) This PR adds support for configuring `proxy.upstreams[].config` for Consul Connect upstreams. This is an opaque config value to Nomad - the data is passed directly to Consul and is unknown to Nomad.	2023-01-12 08:20:54 -06:00
Anthony Davis	1c32471805	Fix rejoin_after_leave behavior (#15552 )	2023-01-11 16:39:24 -05:00
Seth Hoenig	719eee8112	consul: add client configuration for grpc_ca_file (#15701 ) * [no ci] first pass at plumbing grpc_ca_file * consul: add support for grpc_ca_file for tls grpc connections in consul 1.14+ This PR adds client config to Nomad for specifying consul.grpc_ca_file These changes combined with https://github.com/hashicorp/consul/pull/15913 should finally enable Nomad users to upgrade to Consul 1.14+ and use tls grpc connections. * consul: add cl entgry for grpc_ca_file * docs: mention grpc_tls changes due to Consul 1.14	2023-01-11 09:34:28 -06:00
Dao Thanh Tung	09b25d71b8	cli: Add a nomad operator client state command (#15469 ) Signed-off-by: dttung2905 <ttdao.2015@accountancy.smu.edu.sg>	2023-01-11 10:03:31 -05:00
Luiz Aoqui	ed5fccc183	scheduler: allow using device ID as attribute (#15455 ) Devices are fingerprinted as groups of similar devices. This prevented specifying specific device by their ID in constraint and affinity rules. This commit introduces the `${device.ids}` attribute that returns a comma separated list of IDs that are part of the device group. Users can then use the set operators to write rules.	2023-01-10 14:28:23 -05:00
Cyrille Colin	d9bf6ec6f7	Update template.mdx (#15737 ) fix typo issue in variable url : remove unwanted "r"	2023-01-10 10:42:33 +01:00
Luiz Aoqui	f4bf4528a1	docs: networking (#15358 ) Co-authored-by: Charlie Voiselle <464492+angrycub@users.noreply.github.com>	2023-01-06 11:47:10 -05:00
James Rasell	fc08eb9e12	docs: clarify shutdown_delay jobspec param and service behaviour. (#15695 )	2023-01-05 16:57:13 +01:00
Dao Thanh Tung	ca2f509e82	agent: Make agent syslog log level inherit from Nomad agent log (#15625 )	2023-01-04 09:38:06 -05:00
dgotlieb	a991342f8d	docs: nomad `eval delete` typo fix (#15667 ) Status instead of Stauts	2023-01-03 14:18:03 -05:00
huazhihao	9771281ecd	docs: fix system sample request (#15650 )	2023-01-03 10:58:21 -05:00
James Rasell	11744de527	docs: fix service name interpolation key details. (#15643 )	2023-01-03 10:58:00 +01:00
Piotr Kazmierczak	f1450d25d2	ACL Binding Rules CLI documentation (#15584 )	2022-12-22 16:36:25 +01:00
Piotr Kazmierczak	3af32c78b7	acl: binding rules API documentation (#15581 )	2022-12-20 11:22:51 +01:00
Danish Prakash	dc81568f93	command/job_stop: accept multiple jobs, stop concurrently (#12582 ) * command/job_stop: accept multiple jobs, stop concurrently Signed-off-by: danishprakash <grafitykoncept@gmail.com> * command/job_stop_test: add test for multiple job stops Signed-off-by: danishprakash <grafitykoncept@gmail.com> * improve output, add changelog and docs Signed-off-by: danishprakash <grafitykoncept@gmail.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-12-16 15:46:58 -08:00
Piotr Kazmierczak	f91ab03920	acl: SSO auth methods CLI documentation (#15538 ) This PR provides documentation for the ACL Auth Methods CLI commands. Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2022-12-14 13:35:26 +01:00
Seth Hoenig	be3f89b5f9	artifact: enable inheriting environment variables from client (#15514 ) * artifact: enable inheriting environment variables from client This PR adds client configuration for specifying environment variables that should be inherited by the artifact sandbox process from the Nomad Client agent. Most users should not need to set these values but the configuration is provided to ensure backwards compatability. Configuration of go-getter should ideally be done through the artifact block in a jobspec task. e.g. ```hcl client { artifact { set_environment_variables = "TMPDIR,GIT_SSH_OPTS" } } ``` Closes #15498 * website: update set_environment_variables text to mention PATH	2022-12-09 15:46:07 -06:00
Piotr Kazmierczak	9562662774	acl: SSO auth methods API documentation (#15475 ) This PR provides documentation for the ACL Auth Methods API endpoints. Co-authored-by: James Rasell <jrasell@users.noreply.github.com>	2022-12-09 09:47:31 +01:00
Michael Schurter	c28c5ad2e8	docs: clarify rescheduling happens when tasks fail (#15485 )	2022-12-08 12:58:26 -08:00
Seth Hoenig	825c5cc65e	artifact: add client toggle to disable filesystem isolation (#15503 ) This PR adds the client config option for turning off filesystem isolation, applicable on Linux systems where filesystem isolation is possible and enabled by default. ```hcl client{ artifact { disable_filesystem_isolation = <bool:false> } } ``` Closes #15496	2022-12-08 12:29:23 -06:00
Seth Hoenig	51a2212d3d	client: sandbox go-getter subprocess with landlock (#15328 ) * client: sandbox go-getter subprocess with landlock This PR re-implements the getter package for artifact downloads as a subprocess. Key changes include On all platforms, run getter as a child process of the Nomad agent. On Linux platforms running as root, run the child process as the nobody user. On supporting Linux kernels, uses landlock for filesystem isolation (via go-landlock). On all platforms, restrict environment variables of the child process to a static set. notably TMP/TEMP now points within the allocation's task directory kernel.landlock attribute is fingerprinted (version number or unavailable) These changes make Nomad client more resilient against a faulty go-getter implementation that may panic, and more secure against bad actors attempting to use artifact downloads as a privilege escalation vector. Adds new e2e/artifact suite for ensuring artifact downloading works. TODO: Windows git test (need to modify the image, etc... followup PR) * landlock: fixup items from cr * cr: fixup tests and go.mod file	2022-12-07 16:02:25 -06:00
Tim Gross	7404ef46e9	docs: update `plugin status` docs with capabilities and topology (#15448 ) The `plugin status` command supports displaying CSI capabilities and topology accessibility, but this was missing from the documentation. Extend the `-verbose` example to show that info.	2022-12-01 12:18:56 -05:00
Matus Goljer	2283c2d583	Update affinity.mdx (#15168 ) Fix the comment to correspond to the code	2022-11-30 19:01:56 -05:00
Luiz Aoqui	c6ae5d95ac	docs: clarify autoscaling factor and threshold for target-value plugin (#15418 )	2022-11-30 10:56:16 -05:00
Luiz Aoqui	5995ea9981	docs: improve job parse API documentation (#15387 )	2022-11-25 12:46:53 -05:00
Jack	62f7de7ed5	cli: `wait` flag for use with `deployment status -monitor` (#15262 )	2022-11-23 16:36:13 -05:00
Lance Haig	0263e7af34	Add command "nomad tls" (#14296 )	2022-11-22 14:12:07 -05:00
James Rasell	e2a2ea68fc	client: accommodate Consul 1.14.0 gRPC and agent self changes. (#15309 ) * client: accommodate Consul 1.14.0 gRPC and agent self changes. Consul 1.14.0 changed the way in which gRPC listeners are configured, particularly when using TLS. Prior to the change, a single listener was responsible for handling plain-text and encrypted gRPC requests. In 1.14.0 and beyond, separate listeners will be used for each, defaulting to 8502 and 8503 for plain-text and TLS respectively. The change means that Nomad’s Consul Connect integration would not work when integrated with Consul clusters using TLS and running 1.14.0 or greater. The Nomad Consul fingerprinter identifies the gRPC port Consul has exposed using the "DebugConfig.GRPCPort" value from Consul’s “/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only represents the plain-text gRPC port which is likely to be disbaled in clusters running TLS. In order to fix this issue, Nomad now takes into account the Consul version and configured scheme to optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent self return. The “consul_grcp_socket” allocrunner hook has also been updated so that the fingerprinted gRPC port attribute is passed in. This provides a better fallback method, when the operator does not configure the “consul.grpc_address” option. * docs: modify Consul Connect entries to detail 1.14.0 changes. * changelog: add entry for #15309 * fixup: tidy tests and clean version match from review feedback. * fixup: use strings tolower func.	2022-11-21 09:19:09 -06:00
Luiz Aoqui	b28494ec9a	docs: add cpu-allocated and memory-allocated (#15299 ) Document the Autoscaler Nomad APM paramemeters `cpu-allocated` and `memory-allocated` that were implemented in https://github.com/hashicorp/nomad-autoscaler/pull/324 and https://github.com/hashicorp/nomad-autoscaler/pull/334	2022-11-18 10:55:17 -05:00
Tim Gross	510eb435dc	remove deprecated `AllocUpdateRequestType` raft entry (#15285 ) After Deployments were added in Nomad 0.6.0, the `AllocUpdateRequestType` raft log entry was no longer in use. Mark this as deprecated, remove the associated dead code, and remove references to the metrics it emits from the docs. We'll leave the entry itself just in case we encounter old raft logs that we need to be able to safely load.	2022-11-17 12:08:04 -05:00
Ayrat Badykov	c94c231c08	fix create snapshot request docs (#15242 )	2022-11-17 08:43:40 +01:00
Nikita Beletskii	550f715ecd	Fix variable create API example in docs (#15248 )	2022-11-15 16:04:11 +01:00
Tim Gross	37134a4a37	eval delete: move batching of deletes into RPC handler and state (#15117 ) During unusual outage recovery scenarios on large clusters, a backlog of millions of evaluations can appear. In these cases, the `eval delete` command can put excessive load on the cluster by listing large sets of evals to extract the IDs and then sending larges batches of IDs. Although the command's batch size was carefully tuned, we still need to be JSON deserialize, re-serialize to MessagePack, send the log entries through raft, and get the FSM applied. To improve performance of this recovery case, move the batching process into the RPC handler and the state store. The design here is a little weird, so let's look a the failed options first: * A naive solution here would be to just send the filter as the raft request and let the FSM apply delete the whole set in a single operation. Benchmarking with 1M evals on a 3 node cluster demonstrated this can block the FSM apply for several minutes, which puts the cluster at risk if there's a leadership failover (the barrier write can't be made while this apply is in-flight). * A less naive but still bad solution would be to have the RPC handler filter and paginate, and then hand a list of IDs to the existing raft log entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and took roughly an hour to complete. Instead, we're filtering and paginating in the RPC handler to find a page token, and then passing both the filter and page token in the raft log. The FSM apply recreates the paginator using the filter and page token to get roughly the same page of evaluations, which it then deletes. The pagination process is fairly cheap (only abut 5% of the total FSM apply time), so counter-intuitively this rework ends up being much faster. A benchmark of 1M evaluations showed this blocked the FSM apply for 20-30ms at a time (typical for normal operations) and completes in less than 4 minutes. Note that, as with the existing design, this delete is not consistent: a new evaluation inserted "behind" the cursor of the pagination will fail to be deleted.	2022-11-14 14:08:13 -05:00
Douglas Jose	345ef0bbec	Fix wrong reference to `vault` (#15228 )	2022-11-14 10:49:09 +01:00
Kyle Root	99d5e7efb3	Fix broken URL to nvidia device plugin (#15234 )	2022-11-14 10:37:06 +01:00
Tim Gross	eabbcebdd4	exec: allow running commands from host volume (#14851 ) The exec driver and other drivers derived from the shared executor check the path of the command before handing off to libcontainer to ensure that the command doesn't escape the sandbox. But we don't check any host volume mounts, which should be safe to use as a source for executables if we're letting the user mount them to the container in the first place. Check the mount config to verify the executable lives in the mount's host path, but then return an absolute path within the mount's task path so that we can hand that off to libcontainer to run. Includes a good bit of refactoring here because the anchoring of the final task path has different code paths for inside the task dir vs inside a mount. But I've fleshed out the test coverage of this a good bit to ensure we haven't created any regressions in the process.	2022-11-11 09:51:15 -05:00
Seth Hoenig	01a3a29e51	docs: clarify how to access task meta values in templates (#15212 ) This PR updates template and meta docs pages to give examples of accessing meta values in templates. To do so one must use the environment variable form of the meta key name, which isn't obvious and wasn't yet documented.	2022-11-10 16:11:53 -06:00
twunderlich-grapl	1859559134	Fix s3 example URLs in the artifacts docs (#15123 ) * Fix s3 URLs so that they work Unfortunately, s3 urls prefixed with https:// do NOT work with the underlying go-getter library. As such, this fixes the examples so that they are working examples that won't cause problems for people reading the docs. See discussion in https://github.com/hashicorp/nomad/issues/1113 circa 2016. * Use s3:// protocol schema for artifact examples Per the discussion in https://github.com/hashicorp/nomad/pull/15123, we're going to use the explicit s3 protocol in the examples since that is the likeliest to work in all scenarios	2022-11-07 14:14:57 -05:00
Tim Gross	9e1c0b46d8	API for `Eval.Count` (#15147 ) Add a new `Eval.Count` RPC and associated HTTP API endpoints. This API is designed to support interactive use in the `nomad eval delete` command to get a count of evals expected to be deleted before doing so. The state store operations to do this sort of thing are somewhat expensive, but it's cheaper than serializing a big list of evals to JSON. Note that although it seems like this could be done as an extra parameter and response field on `Eval.List`, having it as its own endpoint avoids having to change the response body shape and lets us avoid handling the legacy filter params supported by `Eval.List`.	2022-11-07 08:53:19 -05:00
Charlie Voiselle	79c4478f5b	template: error on missing key (#15141 ) * Support error_on_missing_value for templates * Update docs for template stanza	2022-11-04 13:23:01 -04:00
Phil Renaud	ab5bfa8149	Accidentally trailed off on a docs paragraph (#15118 )	2022-11-02 23:33:41 -04:00
Phil Renaud	ffb4c63af7	[ui] Adds meta to job list stub and displays a pack logo on the jobs index (#14833 ) * Adds meta to job list stub and displays a pack logo on the jobs index * Changelog * Modifying struct for optional meta param * Explicitly ask for meta anytime I look up a job from index or job page * Test case for the endpoint * adding meta field to API struct and ommitting from response if empty * passthru method added to api/jobs.list * Meta param listed in docs for jobs list * Update api/jobs.go Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-11-02 16:58:24 -04:00
Tim Gross	903b5baaa4	keyring: safely handle missing keys and restore GC (#15092 ) When replication of a single key fails, the replication loop breaks early and therefore keys that fall later in the sorting order will never get replicated. This is particularly a problem for clusters impacted by the bug that caused #14981 and that were later upgraded; the keys that were never replicated can now never be replicated, and so we need to handle them safely. Included in the replication fix: * Refactor the replication loop so that each key replicated in a function call that returns an error, to make the workflow more clear and reduce nesting. Log the error and continue. * Improve stability of keyring replication tests. We no longer block leadership on initializing the keyring, so there's a race condition in the keyring tests where we can test for the existence of the root key before the keyring has been initialize. Change this to an "eventually" test. But these fixes aren't enough to fix #14981 because they'll end up seeing an error once a second complaining about the missing key, so we also need to fix keyring GC so the keys can be removed from the state store. Now we'll store the key ID used to sign a workload identity in the Allocation, and we'll index the Allocation table on that so we can track whether any live Allocation was signed with a particular key ID.	2022-11-01 15:00:50 -04:00
Tim Gross	f29c781fa7	docs: improved documentation on hardening and required capabilities (#15036 ) The existing docs on required capabilities are a little sparse and have been the subject of a lots of questions. Expand on this information and provide a pointer to the ongoing design discussion around rootless Nomad.	2022-10-26 09:46:13 -04:00
Tim Gross	aca95c0bc6	keyring: remove root key GC (#15034 )	2022-10-25 17:06:18 -04:00
Luiz Aoqui	8b8d85bce7	docs: use of `node_class` when autoscaling (#14950 ) Document how the value of `node_class` is used during cluster scaling. https://github.com/hashicorp/nomad-autoscaler/issues/255	2022-10-21 10:35:45 -04:00
James Rasell	215b4e7e36	acl: add ACL roles to event stream topic and resolve policies. (#14923 ) This changes adds ACL role creation and deletion to the event stream. It is exposed as a single topic with two types; the filter is primarily the role ID but also includes the role name. While conducting this work it was also discovered that the events stream has its own ACL resolution logic. This did not account for ACL tokens which included role links, or tokens with expiry times. ACL role links are now resolved to their policies and tokens are checked for expiry correctly.	2022-10-20 09:43:35 +02:00
James Rasell	d7b311ce55	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
Luiz Aoqui	75830a7161	docs: expand Autoscaling documentation (#14937 ) Rename `Internals` section to `Concepts` to match core docs structure and expand on how policies are evaluated. Also include missing documentation for check grouping and fix examples to use the new feature.	2022-10-19 17:57:08 -04:00
Luiz Aoqui	bb00f3d713	docs: add autoscaling debug (#14941 )	2022-10-19 14:17:41 -04:00
Luiz Aoqui	9f51e7ee40	docs: move autoscaling `source` agent config (#14947 ) Move the Autoscaler agent configuration `source` to the `policy` page since they are very closely related. Also update all headers in this section so they follow the proper `h1 > h2 > h3 > ...` hierarchy.	2022-10-19 14:17:09 -04:00
Luiz Aoqui	150b69daaf	docs: explain autoscaler target-value strategy (#14951 ) Provide more technical details about how the `target-value` strategy calculates new scaling actions.	2022-10-19 14:16:17 -04:00
Zach Shilton	fedeb84500	website: fix broken links (#14946 ) * fix: nomad license put link * fix: redirected URL * fix: avoid auto-formatting changes	2022-10-19 14:07:48 -04:00
Anthony	eb3515c8f5	Updated datacenter block description (#14953 ) * Updated datacenter block description * Replacing accidentally removed title * docs: add closing period Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-19 08:44:52 -05:00
Bryce Kalow	94ff129167	website: fixes redirected links (#14918 )	2022-10-18 10:31:52 -05:00
Kevin Wang	d66b2eba43	fix: website broken links (#14904 ) * fix: website broken links * fix up keyring-rotate link Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-17 11:32:10 -04:00
Seth Hoenig	69ced2a2bd	services: remove assertion on 'task' field being set (#14864 ) This PR removes the assertion around when the 'task' field of a check may be set. Starting in Nomad 1.4 we automatically set the task field on all checks in support of the NSD checks feature. This is causing validation problems elsewhere, e.g. when a group service using the Consul provider sets 'task' it will fail validation that worked previously. The assertion of leaving 'task' unset was only about making sure job submitters weren't expecting some behavior, but in practice is causing bugs now that we need the task field for more than it was originally added for. We can simply update the docs, noting when the task field set by job submitters actually has value.	2022-10-10 13:02:33 -05:00
Damian Czaja	95f969c4bf	cli: add `nomad fmt` (#14779 )	2022-10-06 17:00:29 -04:00
Giovani Avelar	a625de2062	Allow specification of a custom job name/prefix for parameterized jobs (#14631 )	2022-10-06 16:21:40 -04:00
Michael Schurter	7bbbef9951	docs: clarify nomad vars vs vault (#14831 ) * docs: clarify nomad vars vs vault I think we should make the difference in root key management between Nomad and Vault clear in the concept docs. I didn't see anywhere else in the docs we compared it. I also s/secrets/variables everywhere except the first sentence since the feature is intended to be more generic than secrets. Right now it's more of a compliment to Consul's kv than Vault due to root key handling and featureset. * Update website/content/docs/concepts/variables.mdx Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-06 13:17:26 -07:00
Tim Gross	0cc64da404	docs: 1.4.0 upgrade warning for keyring initialization (#14825 )	2022-10-06 11:32:35 -04:00
Elijah Voigt	0a80a58394	Docs(job-specification/periodic): Add enabled toggle (#14767 ) This is probably undocumented for a reason, but the `enabled` toggle in the `periodic` stanza is very useful so I figured I try adding it to the docs. The feature has been secretly avaliable since #9142 and was called out in that PR as being a dubious addition, only added to avoid regressions. The use case for disabling a periodic job in this way is to prevent it from running without modifying the schedule. Ideally Nomad would make it more clear that this was the case, and allow you to force a run of the job, but even with those rough edges I think users would benefit from knowing about this toggle.	2022-10-03 15:08:24 -04:00
Tim Gross	2a6e8be6ba	internals documentation with diagrams (#14750 ) This changeset adds new architecture internals documents to the contributing guide. These are intentionally here and not on the public-facing website because the material is not required for operators and includes a lot of diagrams that we can cheaply maintain with mermaid syntax but would involve art assets to have up on the main site that would become quickly out of date as code changes happen and be extremely expensive to maintain. However, these should be suitable to use as points of conversation with expert end users. Included: * A description of Evaluation triggers and expected counts, with examples. * A description of Evaluation states and implicit states. This is taken from an internal document in our team wiki. * A description of how writing the State Store works. This is taken from a diagram I put together a few months ago for internal education purposes. * A description of Evaluation lifecycle, from registration to running Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but broken into digestible chunks and without multi-region deployments, which I'd like to cover in a future doc. Also includes adding Deployments to our public-facing glossary. Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-03 14:06:41 -04:00
Tim Gross	e13ac471fc	Revert removing deprecated client options docs (#14753 ) This reverts PR #12416 and commit 6668ce022ac561f75ad113cc838b1fb786f11f79. While the driver options are well and truly deprecated, this documentation also covers features like `fingerprint.denylist` that are not available any other way. Let's revert this until #12420 is ready.	2022-09-30 08:38:03 -04:00
Derek Strickland	2c4df95e92	Merge pull request #14664 from hashicorp/docs-multiregion-dispatch multiregion: Added a section for multiregion parameterized job dispatch	2022-09-28 15:40:11 -04:00
Derek Strickland	c3d4496287	link from dispatch command	2022-09-28 08:30:22 -04:00
Derek Strickland	8b37e558fb	Apply suggestions from code review	2022-09-28 08:18:56 -04:00
Derek Strickland	fe7d1e08ac	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:20:11 -04:00
Derek Strickland	e1dba23ccf	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:19:54 -04:00
Seth Hoenig	5df5e70542	core: numeric operands comparisons in constraints (#14722 ) * cleanup: fixup linter warnings in schedular/feasible.go * core: numeric operands comparisons in constraints This PR changes constraint comparisons to be numeric rather than lexical if both operands are integers or floats. Inspiration #4856 Closes #4729 Closes #14719 * fix: always parse as int64	2022-09-27 11:07:07 -05:00
Michael Schurter	fb8739d926	docs: write a lot of words about heartbeats (#14679 ) * docs: write a lot of words about heartbeats Alternative to #14670 * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * use descriptive title for link * rework example of high failover ttl Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-26 14:43:34 -07:00
Michael Schurter	e6af1c0a14	fingerprint: add node attr for reserverable cores (#14694 ) * fingerprint: add node attr for reserverable cores Add an attribute for the number of reservable CPU cores as they may differ from the existing `cpu.numcores` due to client configuration or OS support. Hopefully clarifies some confusion in #14676 * add changelog * num_reservable_cores -> reservablecores	2022-09-26 13:03:03 -07:00
Michael Schurter	b554f9344a	fingerprint: lengthen Vault check after seen (#14693 ) Extension of #14673 Once Vault is initially fingerprinted, extend the period since changes should be infrequent and the fingerprint is relatively expensive since it is contacting a central Vault server. Also move the period timer reset after the fingerprint. This is similar to #9435 where the idea is to ensure the retry period starts after the operation is attempted. 15s will be the minimum time between fingerprints now instead of the maximum time between fingerprints. In the case of Vault fingerprinting, the original behavior might cause the following: 1. Timer is reset to 15s 2. Fingerprint takes 16s 3. Timer has already elapsed so we immediately Fingerprint again Even if fingerprinting Vault only takes a few seconds, that may very well be due to excessive load and backing off our fingerprints is desirable. The new bevahior ensures we always wait at least 15s between fingerprint attempts and should allow some natural jittering based on server load and network latency.	2022-09-26 12:14:19 -07:00
Karan Sharma	cdb3ec25d3	docs: add new tools (#14596 )	2022-09-26 11:42:06 -04:00
Tim Gross	62b1e2ef97	variables: document restrictions on path and size (#14687 )	2022-09-26 11:40:53 -04:00
Tim Gross	17aee4d69c	fingerprint: don't clear Consul/Vault attributes on failure (#14673 ) Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes: (1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint and result in evaluations for all its jobs and all the system jobs in the cluster. (2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evalution for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters of 1000s of nodes, we've seen this create a large backlog of evaluations. This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.	2022-09-23 14:45:12 -04:00
Derek Strickland	a30fb3b58e	Update multiregion.mdx	2022-09-22 14:56:21 -04:00
Derek Strickland	78caaa2c38	multiregion: Added a section for multiregion parameterized job dispatch	2022-09-22 14:50:15 -04:00
Tim Gross	c29c4bd66c	cli: remove deprecated `eval status -json` list behavior (#14651 ) In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and deprecated the usage of `eval status` without an evaluation ID with an upgrade note that it would be removed in Nomad 1.4.0. This changeset completes that work.	2022-09-22 10:56:32 -04:00
Bryce Kalow	a84d2de9be	website: content updates for developer (#14473 ) Co-authored-by: Geoffrey Grosenbach <26+topfunky@users.noreply.github.com> Co-authored-by: Anthony <russo555@gmail.com> Co-authored-by: Ashlee Boyer <ashlee.boyer@hashicorp.com> Co-authored-by: Ashlee M Boyer <43934258+ashleemboyer@users.noreply.github.com> Co-authored-by: HashiBot <62622282+hashibot-web@users.noreply.github.com> Co-authored-by: Kevin Wang <kwangsan@gmail.com>	2022-09-16 10:38:39 -05:00
Kyle Rarey	dd361d9581	docs: Correct driver name for 'Nomad Task Group' autoscaler target (#14576 )	2022-09-14 09:40:00 +02:00
Mahmood Ali	a9d5e4c510	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Tim Gross	9636b0f837	docs: tweak some copy in the concept docs (#14566 )	2022-09-13 13:21:09 -04:00
Seth Hoenig	afc815c0c7	Merge pull request #14559 from hashicorp/docs-nsd-check-watcher docs: add documentation for nomad service check restarts	2022-09-13 10:52:01 -05:00
Ashlee M Boyer	fc973ebe0e	docs: Fixing heading order, adding text for links in /docs/ecosystem (#14549 ) * Fixing heading order, adding text for links * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * Applying more suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-13 10:59:02 -04:00
Seth Hoenig	5b661ec84d	docs: update docs for NSD check restart	2022-09-13 09:59:02 -05:00
Tim Gross	357e7f4521	docs: include path in ACL requirements for variables (#14561 ) Also add links to the ACL policy reference and variables concepts docs near the top of the page.	2022-09-13 10:21:29 -04:00
Tim Gross	6dd79ca995	docs: variables HTTP API documentation (#14516 )	2022-09-13 10:18:26 -04:00
Tim Gross	cab787c44d	docs: keyring HTTP API documentation (#14513 )	2022-09-13 09:46:54 -04:00
Charlie Voiselle	8eb1689fca	Variables CLI documentation (#14249 )	2022-09-12 16:44:31 -04:00
Tim Gross	14b536ee86	docs: update `template` for Nomad Variables (#14527 )	2022-09-12 16:36:18 -04:00
Tim Gross	9259a373cd	remove root keyring install API (#14514 ) * keyring rotate API should require put/post method * remove keyring install API	2022-09-09 08:50:35 -04:00
Tim Gross	3fc7482ecd	CSI: failed allocation should not block its own controller unpublish (#14484 ) A Nomad user reported problems with CSI volumes associated with failed allocations, where the Nomad server did not send a controller unpublish RPC. The controller unpublish is skipped if other non-terminal allocations on the same node claim the volume. The check has a bug where the allocation belonging to the claim being freed was included in the check incorrectly. During a normal allocation stop for job stop or a new version of the job, the allocation is terminal. But allocations that fail are not yet marked terminal at the point in time when the client sends the unpublish RPC to the server. For CSI plugins that support controller attach/detach, this means that the controller will not be able to detach the volume from the allocation's host and the replacement claim will fail until a GC is run. This changeset fixes the conditional so that the claim's own allocation is not included, and makes the logic easier to read. Include a test case covering this path. Also includes two minor extra bugfixes: * Entities we get from the state store should always be copied before altering. Ensure that we copy the volume in the top-level unpublish workflow before handing off to the steps. * The list stub object for volumes in `nomad/structs` did not match the stub object in `api`. The `api` package also did not include the current readers/writers fields that are expected by the UI. True up the two objects and add the previously undocumented fields to the docs.	2022-09-08 13:30:05 -04:00
James Rasell	813c5daa96	hcl2: add strlen function and update docs. (#14463 )	2022-09-06 18:42:40 +02:00
Luiz Aoqui	1ae26981a0	connect: interpolate task env in config values (#14445 ) When configuring Consul Service Mesh, it's sometimes necessary to provide dynamic value that are only known to Nomad at runtime. By interpolating configuration values (in addition to configuration keys), user are able to pass these dynamic values to Consul from their Nomad jobs.	2022-09-02 15:00:28 -04:00
Luiz Aoqui	99bddfe04d	docs: add warning about changing region config (#14443 )	2022-09-01 16:47:06 -04:00
Luiz Aoqui	94d7dddccd	cli: set -hcl2-strict to false if -hcl1 is defined (#14426 ) These options are mutually exclusive but, since `-hcl2-strict` defaults to `true` users had to explicitily set it to `false` when using `-hcl1`. Also return `255` when job plan fails validation as this is the expected code in this situation.	2022-09-01 10:42:08 -04:00
Tim Gross	0ef073a669	docs: clarify CSI plugin compatibility (#14434 ) Nomad is generally compliant with the CSI specification for Container Orchestrators (CO), except for unimplemented features. However, some storage vendors have built CSI plugins that are not compliant with the specification or which expect that they're only deployed on Kubernetes. Nomad cannot vouch for the compatibility of any particular plugin, so clarify this in the docs. Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>	2022-09-01 10:06:44 -04:00
Brett Larson	9912dfd1e6	Update ephemeral_disk.mdx (#14356 ) It is really unclear on how to use this feature. it took me a while to find this, so I thought I would purpose how to use this.	2022-08-31 20:17:41 -04:00
James Rasell	986355bcd9	docs: add documentation for ACL token expiration and ACL roles. (#14332 ) The ACL command docs are now found within a sub-dir like the operator command docs. Updates to the ACL token commands to accommodate token expiry have also been added. The ACL API docs are now found within a sub-dir like the operator API docs. The ACL docs now include the ACL roles endpoint as well as updated ACL token endpoints for token expiration. The configuration section is also updated to accommodate the new ACL and server parameters for the new ACL features.	2022-08-31 16:13:47 +02:00
Tim Gross	c9d678a91a	keyring: wrap root key in key encryption key (#14388 ) Update the on-disk format for the root key so that it's wrapped with a unique per-key/per-server key encryption key. This is a bit of security theatre for the current implementation, but it uses `go-kms-wrapping` as the interface for wrapping the key. This provides a shim for future support of external KMS such as cloud provider APIs or Vault transit encryption. * Removes the JSON serialization extension we had on the `RootKey` struct; this struct is now only used for key replication and not for disk serialization, so we don't need this helper. * Creates a helper for generating cryptographically random slices of bytes that properly accounts for short reads from the source. * No observable functional changes outside of the on-disk format, so there are no test updates.	2022-08-30 10:59:25 -04:00
Tim Gross	37905d94b7	docs: fixing a few more places we missed "secure" during rename (#14395 )	2022-08-30 10:08:50 -04:00
quoing	ce7a3745d5	docs: template change script example correction (#14368 ) "path" parameter doesn't work, should be command	2022-08-30 12:09:55 +02:00
Tim Gross	d7652fdd3a	docs: rename Secure Variables to Variables (#14352 )	2022-08-29 11:37:08 -04:00
Luiz Aoqui	e012d9411e	Task lifecycle restart (#14127 ) * allocrunner: handle lifecycle when all tasks die When all tasks die the Coordinator must transition to its terminal state, coordinatorStatePoststop, to unblock poststop tasks. Since this could happen at any time (for example, a prestart task dies), all states must be able to transition to this terminal state. * allocrunner: implement different alloc restarts Add a new alloc restart mode where all tasks are restarted, even if they have already exited. Also unifies the alloc restart logic to use the implementation that restarts tasks concurrently and ignores ErrTaskNotRunning errors since those are expected when restarting the allocation. * allocrunner: allow tasks to run again Prevent the task runner Run() method from exiting to allow a dead task to run again. When the task runner is signaled to restart, the function will jump back to the MAIN loop and run it again. The task runner determines if a task needs to run again based on two new task events that were added to differentiate between a request to restart a specific task, the tasks that are currently running, or all tasks that have already run. * api/cli: add support for all tasks alloc restart Implement the new -all-tasks alloc restart CLI flag and its API counterpar, AllTasks. The client endpoint calls the appropriate restart method from the allocrunner depending on the restart parameters used. * test: fix tasklifecycle Coordinator test * allocrunner: kill taskrunners if all tasks are dead When all non-poststop tasks are dead we need to kill the taskrunners so we don't leak their goroutines, which are blocked in the alloc restart loop. This also ensures the allocrunner exits on its own. * taskrunner: fix tests that waited on WaitCh Now that "dead" tasks may run again, the taskrunner Run() method will not return when the task finishes running, so tests must wait for the task state to be "dead" instead of using the WaitCh, since it won't be closed until the taskrunner is killed. * tests: add tests for all tasks alloc restart * changelog: add entry for #14127 * taskrunner: fix restore logic. The first implementation of the task runner restore process relied on server data (`tr.Alloc().TerminalStatus()`) which may not be available to the client at the time of restore. It also had the incorrect code path. When restoring a dead task the driver handle always needs to be clear cleanly using `clearDriverHandle` otherwise, after exiting the MAIN loop, the task may be killed by `tr.handleKill`. The fix is to store the state of the Run() loop in the task runner local client state: if the task runner ever exits this loop cleanly (not with a shutdown) it will never be able to run again. So if the Run() loops starts with this local state flag set, it must exit early. This local state flag is also being checked on task restart requests. If the task is "dead" and its Run() loop is not active it will never be able to run again. * address code review requests * apply more code review changes * taskrunner: add different Restart modes Using the task event to differentiate between the allocrunner restart methods proved to be confusing for developers to understand how it all worked. So instead of relying on the event type, this commit separated the logic of restarting an taskRunner into two methods: - `Restart` will retain the current behaviour and only will only restart the task if it's currently running. - `ForceRestart` is the new method where a `dead` task is allowed to restart if its `Run()` method is still active. Callers will need to restart the allocRunner taskCoordinator to make sure it will allow the task to run again. * minor fixes	2022-08-24 17:43:07 -04:00
Piotr Kazmierczak	7077d1f9aa	template: custom change_mode scripts (#13972 ) This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707	2022-08-24 17:43:01 +02:00
Piotr Kazmierczak	077b6e7098	docs: Update upgrade guide to reflect enterprise changes introduced in nomad-enterprise (#14212 ) This PR documents a change made in the enterprise version of nomad that addresses the following issue: When a user tries to filter audit logs, they do so with a stanza that looks like the following: audit { enabled = true filter "remove deletes" { type = "HTTPEvent" endpoints = ["*"] stages = ["OperationComplete"] operations = ["DELETE"] } } When specifying both an "endpoint" and a "stage", the events with both matching a "endpoint" AND a matching "stage" will be filtered. When specifying both an "endpoint" and an "operation" the events with both matching a "endpoint" AND a matching "operation" will be filtered. When specifying both a "stage" and an "operation" the events with a matching a "stage" OR a matching "operation" will be filtered. The "OR" logic with stages and operations is unexpected and doesn't allow customers to get specific on which events they want to filter. For instance the following use-case is impossible to achieve: "I want to filter out all OperationReceived events that have the DELETE verb".	2022-08-24 16:31:49 +02:00
Tim Gross	afb9fe6a4e	docs: fix an anchor link in secure vars docs (#14231 )	2022-08-23 10:46:24 -04:00
Seth Hoenig	b5427a9f3b	Merge pull request #14215 from hashicorp/docs-update-checks-for-nsd docs: update check documentation with NSD specifics	2022-08-23 09:23:53 -05:00
Seth Hoenig	fb82f11e70	docs: fix checks doc typo Co-authored-by: Piotr Kazmierczak <phk@mm.st>	2022-08-23 09:23:36 -05:00
Tim Gross	bf57d76ec7	allow ACL policies to be associated with workload identity (#14140 ) The original design for workload identities and ACLs allows for operators to extend the automatic capabilities of a workload by using a specially-named policy. This has shown to be potentially unsafe because of naming collisions, so instead we'll allow operators to explicitly attach a policy to a workload identity. This changeset adds workload identity fields to ACL policy objects and threads that all the way down to the command line. It also a new secondary index to the ACL policy table on namespace and job so that claim resolution can efficiently query for related policies.	2022-08-22 16:41:21 -04:00
Luiz Aoqui	dbffdca92e	template: use pointer values for gid and uid (#14203 ) When a Nomad agent starts and loads jobs that already existed in the cluster, the default template uid and gid was being set to 0, since this is the zero value for int. This caused these jobs to fail in environments where it was not possible to use 0, such as in Windows clients. In order to differentiate between an explicit 0 and a template where these properties were not set we need to use a pointer.	2022-08-22 16:25:49 -04:00
Seth Hoenig	ea6d010790	docs: update check documentation with NSD specifics This PR updates the checks documentation to mention support for checks when using the Nomad service provider. There are limitations of NSD compared to Consul, and those configuration options are now noted as being Consul-only.	2022-08-22 10:50:26 -05:00
Phil Renaud	cbd4deedf8	[ui] general keyboard navigation: 1.3.4 release (#14138 ) * Initialized keyboard service Neat but funky: dynamic subnav traversal 👻 generalized traverseSubnav method Shift as special modifier key Nice little demo panel Keyboard shortcuts keycard Some animation styles on keyboard shortcuts Handle situations where a link is deeply nested from its parent menu item Keyboard service cleanup helper-based initializer and teardown for new contextual commands Keyboard shortcuts modal component added and demo-ghost removed Removed j and k from subnav traversal Register and unregister methods for subnav plus new subnavs for volumes and volume register main nav method Generalizing the register nav method 12762 table keynav (#12975) * Experimental feature: shortcut visual hints * Long way around to a custom modifier for keyboard shortcuts * dynamic table and list iterative shortcuts * Progress with regular old tether * Delogging * Table Keynav tether fix, server and client navs, and fix to shiftless on modified arrow keys Go to Optimize keyboard link and storage key changed to g r parameterized jobs keyboard nav Dynamic numeric keynav for multiple tables (#13482) * Multiple tables init * URL-bind enumerable keyboard commands and add to more taskRow and allocationRows * Type safety and lint fixes * Consolidated push to keyCommands * Default value when removing keyCommands * Remove the URL-based removal method and perform a recompute on any add Get tests passing in Keynav: remove math helpers and a few other defensive moves (#13761) * Remove ember math helpers * Test fixes for jobparts/body * Kill an unneeded integration helper test * delog * Trying if disabling percy lets this finish * Okay so its not percy; try parallelism in circle * Percyless yet again * Trying a different angle to not have percy * Upgrade percy to 1.6.1 [ui] Keyboard nav: "u" key to go up a level (#13754) * U to go up a level * Mislabelled my conditional * Custom lint ignore rule * Custom lint ignore rule, this time with commas * Since we're getting rid of ember math helpers elsewhere, do the math ourselves here Replace ArrowLeft etc. with an ascii arrow (#13776) * Replace ArrowLeft etc. with an ascii arrow * non-mutative helper cleanup Keyboard Nav: let users rebind their shortcuts (#13781) * click-outside and shortcuts enabled/disabled toggle * Trap focus when modal open * Enabled/disabled saved to localStorage * Autofocus edit button on variable index * Modal overflow styles * Functional rebind * Saving rebinds to localStorage for all majors * Started on defaultCommandBindings * Modal header style and cancel rebind on escape * keyboardable keybindings w buttons instead of spans * recording and defaultvalues * Enter short-circuits rebind * Only some commands are rebindable, and dont show dupes * No unused get import * More visually distinct header on modal * Disallowed keys for rebind, showing buffer as you type, and moving dedupe to modal logic willDestroy hook to prevent tests from doubling/tripling up addEventListener on kb events remove unused tests Keyboard Navigation acceptance tests (#13893) * Acceptance tests for keyboard modal * a11y audit fix and localStorage clear * Bind/rebind/localStorage tests * Keyboard tests for dynamic nav and tables * Rebinder and assert expectation * Second percy snapshot showing hints no longer relevant Weird issue where linktos with query props specifically from the task-groups page would fail to route / hit undefined.shouldSuperCede errors Adds the concept of exclusivity to a keycommand, removing peers that also share its label Lintfix Changelog and PR feedback Changelog and PR feedback Fix to rebinding in firefox by blurring the now-disabled button on rebind (#14053) * Secure Variables shortcuts removed * Variable index route autofocus removed * Updated changelog entry * Updated changelog entry * Keynav docs (#14148) * Section added to the API Docs UI page * Added a note about disabling * Prev and Next order * Remove dev log and unneeded comments	2022-08-17 12:59:33 -04:00
Kerim Satirli	614171610f	adds link for Nomad-Pack GitHub action (#14118 )	2022-08-16 08:34:26 +02:00
Tim Gross	a4e89d72a8	secure vars: filter by path in List RPCs (#14036 ) The List RPCs only checked the ACL for the Prefix argument of the request. Add an ACL filter to the paginator for the List RPC. Extend test coverage of ACLs in the List RPC and in the `acl` package, and add a "deny" capability so that operators can deny specific paths or prefixes below an allowed path.	2022-08-15 11:38:20 -04:00
Mike Nomitch	ce310b350d	Add notes about DAS being prometheus only (#14040 )	2022-08-15 10:17:31 +02:00
Seth Hoenig	eb966a4ce8	Merge pull request #14086 from Morantron/patch-1 Fix typo in /tools/autoscaling	2022-08-12 09:07:12 -05:00
Seth Hoenig	394aebfbd9	Merge pull request #14088 from hashicorp/b-plan-vault-token cli: support vault token in plan command	2022-08-12 09:05:34 -05:00
Seth Hoenig	1224fdf60d	Merge pull request #14089 from hashicorp/f-docker-disable-healthchecks docker: configuration for disable docker healthcheck	2022-08-12 09:00:31 -05:00
James Rasell	f6a5961a20	docs: correctly state RPC port is used by servers and clients. (#14091 )	2022-08-12 10:14:14 +02:00
Seth Hoenig	dc761aa7ec	docker: create a docker task config setting for disable built-in healthcheck This PR adds a docker driver task configuration setting for turning off built-in HEALTHCHECK of a container. References) https://docs.docker.com/engine/reference/builder/#healthcheck https://github.com/docker/engine-api/blob/master/types/container/config.go#L16 Closes #5310 Closes #14068	2022-08-11 10:33:48 -05:00
Seth Hoenig	ba5c45ab93	cli: respect vault token in plan command This PR fixes a regression where the 'job plan' command would not respect a Vault token if set via --vault-token or $VAULT_TOKEN. Basically the same bug/fix as for the validate command in https://github.com/hashicorp/nomad/issues/13062 Fixes https://github.com/hashicorp/nomad/issues/13939	2022-08-11 08:54:08 -05:00
Morantron	741170160f	Update index.mdx	2022-08-11 09:03:52 +02:00
Seth Hoenig	3d925a78e5	Merge pull request #14065 from hashicorp/b-fwd-vtoken-validation cli: forward request for job validation to nomad leader	2022-08-10 15:14:01 -05:00
Seth Hoenig	3aaaedf52e	cli: forward request for job validation to nomad leader This PR changes the behavior of 'nomad job validate' to forward the request to the nomad leader, rather than responding from any server. This is because we need the leader when validating Vault tokens, since the leader is the only server with an active vault client.	2022-08-10 14:34:04 -05:00
dgotlieb	7fbc8baaeb	doc typo fix docker and podman don't suck 🤣	2022-08-10 15:04:07 +03:00
Charlie Voiselle	9a19279f59	Sweep of docs for repeated words; minor edits (#14032 )	2022-08-05 16:45:30 -04:00
Luiz Aoqui	9affe31a0f	qemu: reduce monitor socket path (#13971 ) The QEMU driver can take an optional `graceful_shutdown` configuration which will create a Unix socket to send ACPI shutdown signal to the VM. Unix sockets have a hard length limit and the driver implementation assumed that QEMU versions 2.10.1 were able to handle longer paths. This is not correct, the linked QEMU fix only changed the behaviour from silently truncating longer socket paths to throwing an error. By validating the socket path before starting the QEMU machine we can provide users a more actionable and meaningful error message, and by using a shorter socket file name we leave a bit more room for user-defined values in the path, such as the task name. The maximum length allowed is also platform-dependant, so validation needs to be different for each OS.	2022-08-04 12:10:35 -04:00
Luiz Aoqui	e3d78c343c	template: set default UID/GID to -1 (#13998 ) UID/GID 0 is usually reserved for the root user/group. While Nomad clients are expected to run as root it may not always be the case. Setting these values as -1 if not defined will fallback to the pervious behaviour of not attempting to set file ownership and use whatever UID/GID the Nomad agent is running as. It will also keep backwards compatibility, which is specially important for platforms where this feature is not supported, like Windows.	2022-08-04 11:26:08 -04:00
Luiz Aoqui	8f05a55def	docs: remove link to HCL2 `timestamp` function (#13999 ) The `timestamp` HCL2 function was never part of the set of supported functions.	2022-08-04 10:07:51 -04:00
Derek Strickland	77df9c133b	Add Nomad RetryConfig to agent template config (#13907 ) * add Nomad RetryConfig to agent template config	2022-08-03 16:56:30 -04:00
Piotr Kazmierczak	530280505f	client: enable specifying user/group permissions in the template stanza (#13755 ) * Adds Uid/Gid parameters to template. * Updated diff_test * fixed order * update jobspec and api * removed obsolete code * helper functions for jobspec parse test * updated documentation * adjusted API jobs test. * propagate uid/gid setting to job_endpoint * adjusted job_endpoint tests * making uid/gid into pointers * refactor * updated documentation * updated documentation * Update client/allocrunner/taskrunner/template/template_test.go Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * Update website/content/api-docs/json-jobs.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * propagating documentation change from Luiz * formatting * changelog entry * changed changelog entry Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-08-02 22:15:38 +02:00
Tim Gross	e025afdf87	docs: concepts for secure variables and workload identity (#13764 ) Includes concept docs for secure variables, concept docs for workload identity, and an operations docs for keyring management.	2022-08-02 10:06:26 -04:00
Eric Weber	cbce13c1ac	Add stage_publish_base_dir field to csi_plugin stanza of a job (#13919 ) * Allow specification of CSI staging and publishing directory path * Add website documentation for stage_publish_dir * Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir * Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir	2022-08-02 09:42:44 -04:00
Tim Gross	e5ac6464f6	secure vars: enforce ENT quotas (OSS work) (#13951 ) Move the secure variables quota enforcement calls into the state store to ensure quota checks are atomic with quota updates (in the same transaction). Switch to a machine-size int instead of a uint64 for quota tracking. The ENT-side quota spec is described as int, and negative values have a meaning as "not permitted at all". Using the same type for tracking will make it easier to the math around checks, and uint64 is infeasibly large anyways. Add secure vars to quota HTTP API and CLI outputs and API docs.	2022-08-02 09:32:09 -04:00
Tim Gross	f14fafe914	docs: fix path for quota/usage API (#13952 )	2022-08-02 08:46:45 -04:00
asymmetric	b89718d70e	Update filesystem.mdx (#13738 ) fix alloc working directory path	2022-07-25 10:25:48 -04:00
Scott Holodak	12ef89a61a	docs: fix placement for `scaling` and `csi_plugin` (#13892 )	2022-07-25 10:06:59 -04:00
Charlie Voiselle	456ad33b7c	Fix link (#13881 )	2022-07-22 12:27:45 -04:00
Michael Schurter	0d1c9a53a4	docs: clarify submit-job allows stopping (#13871 )	2022-07-21 10:18:57 -07:00
Tim Gross	96aea74b4b	docs: keyring commands (#13690 ) Document the secure variables keyring commands, document the aliased gossip keyring commands, and note that the old gossip keyring commands are deprecated.	2022-07-20 14:14:10 -04:00
Tim Gross	49ad3dc3ba	docs: document secure variables server config options (#13695 )	2022-07-20 14:13:39 -04:00
Will Jordan	5354409b1a	Return 429 response on HTTP max connection limit (#13621 ) Return 429 response on HTTP max connection limit. Instead of silently closing the connection, return a `429 Too Many Requests` HTTP response with a helpful error message to aid debugging when the connection limit is unintentionally reached. Set a 10-millisecond write timeout and rate limiter for connection-limit 429 response to prevent writing the HTTP response from consuming too many server resources. Add `nomad.agent.http.exceeded metric` counting the number of HTTP connections exceeding concurrency limit.	2022-07-20 14:12:21 -04:00
Luiz Aoqui	3dc701a8d0	docs: update Autoscaler AWS plugin with new ws_credential_provider config (#13779 )	2022-07-19 10:27:55 -04:00
Niklas Hambüchen	422c83e97a	docs: job-specification: Explain that priority has no effect on run order (#13835 ) Makes the issues from #9845 and #12792 less surprising to the user.	2022-07-19 08:55:29 -04:00
Andy Assareh	e49c021792	word typo digestible (#13772 )	2022-07-19 09:00:52 +02:00
Seth Hoenig	4dea14267d	Merge pull request #13813 from hashicorp/docs-move-checks docs: move checks into own page	2022-07-18 12:27:43 -05:00
Seth Hoenig	4459312541	docs: move checks into own page This PR creates a top-level 'check' page for job-specification docs. The content for checks is about half the content of the service page, and is about to increase in size when we add docs about Nomad service checks. Seemed like a good idea to just split the checks section out into its own thing (e.g. check_restart is already a topic). Doing the move first lets us backport this change without adding Nomad service check stuff yet. Mostly just a lift-and-shift but with some tweaked examples to de-emphasize the use of script checks.	2022-07-18 09:34:55 -05:00
Tim Gross	1e8978ca04	docs: ACL policy spec reference (#13787 ) The "Secure Nomad with Access Control" guide provides a tutorial for bootstrapping Nomad ACLs, writing policies, and creating tokens. Add a reference guide just for the ACL policy specification.	2022-07-18 09:35:28 -04:00
Luiz Aoqui	730f869b6b	docs: update Podman docs to v0.4.0 (#13783 )	2022-07-15 18:01:35 -04:00
Michael Schurter	e97548b5f8	Improve metrics reference documentation (#13769 ) * docs: tighten up parameterized job metrics docs * docs: improve alloc status descriptions Remove `nomad.client.allocations.start` as it doesn't exist.	2022-07-15 14:22:39 -07:00
Michael Schurter	5414f49821	docs: clarify blocked_evals metrics (#13751 ) Related to #13740 - blocked_evals.total_blocked is the number of evals blocked for any reason - blocked_evals.total_quota_limit is the number of evals blocked by quota limits, but critically: their resources are not counted in the cpu/memory	2022-07-14 11:32:33 -07:00
Seth Hoenig	3a32220b3b	Merge pull request #13716 from hashicorp/docs-update-consul-warning docs: remove consul 1.12.0 warning	2022-07-14 08:45:56 -05:00
Luiz Aoqui	b656981cf0	Track plan rejection history and automatically mark clients as ineligible (#13421 ) Plan rejections occur when the scheduler work and the leader plan applier disagree on the feasibility of a plan. This may happen for valid reasons: since Nomad does parallel scheduling, it is expected that different workers will have a different state when computing placements. As the final plan reaches the leader plan applier, it may no longer be valid due to a concurrent scheduling taking up intended resources. In these situations the plan applier will notify the worker that the plan was rejected and that they should refresh their state before trying again. In some rare and unexpected circumstances it has been observed that workers will repeatedly submit the same plan, even if they are always rejected. While the root cause is still unknown this mitigation has been put in place. The plan applier will now track the history of plan rejections per client and include in the plan result a list of node IDs that should be set as ineligible if the number of rejections in a given time window crosses a certain threshold. The window size and threshold value can be adjusted in the server configuration. To avoid marking several nodes as ineligible at one, the operation is rate limited to 5 nodes every 30min, with an initial burst of 10 operations.	2022-07-12 18:40:20 -04:00
Michael Schurter	3e50f72fad	core: merge reserved_ports into host_networks (#13651 ) Fixes #13505 This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports). As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility: Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly. Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserverd_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me. So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.	2022-07-12 14:40:25 -07:00
Seth Hoenig	a9fa48f3db	docs: remove consul 1.12.0 warning	2022-07-12 09:53:17 -05:00
Tim Gross	fc4cd53cfb	docs: rename Internals to Concepts (#13696 )	2022-07-11 16:55:33 -04:00
Tim Gross	d49ff0175c	docs: move operator subcommands under their own trees (#13677 ) The sidebar navigation tree for the `operator` sub-sub commands is getting cluttered and we have a new set of commands coming to support secure variables keyring as well. Move these all under their own subtrees.	2022-07-11 14:00:24 -04:00
Seth Hoenig	ed2f2b1a75	docs: move upgrade docs for max_client_timeout Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-07-07 16:46:26 -05:00
Seth Hoenig	905e673553	docs: upgrade guide for client max_kill_timeout	2022-07-07 15:27:40 -05:00
Luiz Aoqui	03433dd8af	cli: improve output of eval commands (#13581 ) Use the same output format when listing multiple evals in the `eval list` command and when `eval status <prefix>` matches more than one eval. Include the eval namespace in all output formats and always include the job ID in `eval status` since, even `node-update` evals are related to a job. Add Node ID to the evals table output to help differentiate `node-update` evals. Co-authored-by: James Rasell <jrasell@hashicorp.com>	2022-07-07 13:13:34 -04:00
Ted Behling	6a032a54d2	driver/docker: Don't pull InfraImage if it exists (#13265 ) Co-authored-by: James Rasell <jrasell@hashicorp.com>	2022-07-07 17:44:06 +02:00
Seth Hoenig	b9fe6c8d2c	docs: fixup from cr comments	2022-07-07 08:37:10 -05:00
Seth Hoenig	1c31ef285e	docs: add docs for simple load balancing nomad services This PR adds a section to template docs for simple load balancing with nomad servicse.	2022-07-06 17:34:30 -05:00
James Rasell	0c0b028a59	core: allow deleting of evaluations (#13492 ) * core: add eval delete RPC and core functionality. * agent: add eval delete HTTP endpoint. * api: add eval delete API functionality. * cli: add eval delete command. * docs: add eval delete website documentation.	2022-07-06 16:30:11 +02:00
James Rasell	181b247384	core: allow pausing and un-pausing of leader broker routine (#13045 ) * core: allow pause/un-pause of eval broker on region leader. * agent: add ability to pause eval broker via scheduler config. * cli: add operator scheduler commands to interact with config. * api: add ability to pause eval broker via scheduler config * e2e: add operator scheduler test for eval broker pause. * docs: include new opertor scheduler CLI and pause eval API info.	2022-07-06 16:13:48 +02:00
Michelle Noorali	f227855de1	doc: explain permissions for Vault sys/capabilties-self	2022-07-06 10:01:30 -04:00
Yann Coleu	fe64f8cdd7	docs: typo on command word (#13582 )	2022-07-05 16:24:25 -04:00
Steven Collins	ab97650098	docs: Add 'serial' attribute to usb driver (#13547 )	2022-07-05 16:23:04 -04:00
Seth Hoenig	97726c2fd8	Merge pull request #12862 from hashicorp/f-choose-services api: enable selecting subset of services using rendezvous hashing	2022-06-30 15:17:40 -05:00
Derek Strickland	47e3b28dba	docs: update task leader to explain shutdown sequence. (#13498 ) * docs: update task leader to explain shutdown sequence.	2022-06-29 05:13:45 -04:00
James Rasell	d21e4abe3f	docs: fixup HCL2 index collection function documentation. (#13511 )	2022-06-28 18:27:38 +02:00
Andrew	3a87406f2f	Fix typo in Docker docs (#13497 )	2022-06-28 11:05:50 +02:00
Seth Hoenig	9467bc9eb3	api: enable selecting subset of services using rendezvous hashing This PR adds the 'choose' query parameter to the '/v1/service/<service>' endpoint. The value of 'choose' is in the form '<number>\|<key>', number is the number of desired services and key is a value unique but consistent to the requester (e.g. allocID). Folks aren't really expected to use this API directly, but rather through consul-template which will soon be getting a new helper function making use of this query parameter. Example, curl 'localhost:4646/v1/service/redis?choose=2\|abc123' Note: consul-templte v0.29.1 includes the necessary nomadServices functionality.	2022-06-25 10:37:37 -05:00
Seth Hoenig	91e08d5e23	core: remove support for raft protocol version 2 This PR checks server config for raft_protocol, which must now be set to 3 or unset (0). When unset, version 3 is used as the default.	2022-06-23 14:37:50 +00:00
Michael Schurter	7b7c72b21d	docs: clarify total_escaped is just an optimization (#13460 )	2022-06-22 11:39:56 -07:00
Elijah Voigt	665b198968	Lob.com uses Nomad too! (#13295 ) Lob.com has been ramping up our use of Nomad for ~6 months. Now that we've started blogging about it we'd love to be on the _official_ list.	2022-06-21 09:10:08 -04:00
Derek Strickland	a15cef689d	Improve Autoscaler overview (#13396 ) Improve Autoscaler overview documentation.	2022-06-17 05:15:22 -04:00
Nick Wales	3a8c8250f4	Merge pull request #13401 from nickwales/tls_typo Updates TLS documentation	2022-06-16 12:34:59 -05:00
Arthur Leclerc	d98a9b1d72	docs: Fix typo (#13389 )	2022-06-16 13:24:18 -04:00
Nick Wales	c964ae0135	Updates TLS documentation	2022-06-16 12:15:40 -05:00
James Hu	7e3d21646d	Fix spelling error (#13397 )	2022-06-16 12:41:49 -04:00
Luiz Aoqui	6598567725	docs: create volume spec page (#13353 ) In addition to jobs, there are other objects in Nomad that have a specific format and can be provided to commands and API endpoints. This commit creates a new menu section to hold the specification for volumes and update the command pages to point to the new centralized definition. Redirecting the previous entries is not possible with `redirect.js` because they are done server-side and URL fragments are not accessible to detect a match. So we provide hidden anchors with a link to the new page to guide users towards the new documentation. Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-06-14 14:08:25 -04:00
Grant Griffiths	99896da443	CSI: make plugin health_timeout configurable in csi_plugin stanza (#13340 ) Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>	2022-06-14 10:04:16 -04:00

... 2 3 4 5 6 ...

872 commits