open-nomad

Author	SHA1	Message	Date
Tim Gross	f29c781fa7	docs: improved documentation on hardening and required capabilities (#15036 ) The existing docs on required capabilities are a little sparse and have been the subject of a lot of questions. Expand on this information and provide a pointer to the ongoing design discussion around rootless Nomad.	2022-10-26 09:46:13 -04:00
Tim Gross	aca95c0bc6	keyring: remove root key GC (#15034 )	2022-10-25 17:06:18 -04:00
Luiz Aoqui	8b8d85bce7	docs: use of `node_class` when autoscaling (#14950 ) Document how the value of `node_class` is used during cluster scaling. https://github.com/hashicorp/nomad-autoscaler/issues/255	2022-10-21 10:35:45 -04:00
James Rasell	215b4e7e36	acl: add ACL roles to event stream topic and resolve policies. (#14923 ) This change adds ACL role creation and deletion to the event stream. It is exposed as a single topic with two types; the filter is primarily the role ID but also includes the role name. While conducting this work it was also discovered that the events stream has its own ACL resolution logic. This did not account for ACL tokens which included role links, or tokens with expiry times. ACL role links are now resolved to their policies and tokens are checked for expiry correctly.	2022-10-20 09:43:35 +02:00
James Rasell	d7b311ce55	acl: correctly resolve ACL roles within client cache. (#14922 ) The client ACL cache was not accounting for tokens which included ACL role links. This change modifies the behaviour to resolve role links to policies. It will also now store ACL roles within the cache for quick lookup. The cache TTL is configurable in the same manner as policies or tokens. Another small fix is included that takes into account the ACL token expiry time. This was not included, which meant tokens with expiry could be used past the expiry time, until they were GC'd.	2022-10-20 09:37:32 +02:00
Luiz Aoqui	75830a7161	docs: expand Autoscaling documentation (#14937 ) Rename `Internals` section to `Concepts` to match core docs structure and expand on how policies are evaluated. Also include missing documentation for check grouping and fix examples to use the new feature.	2022-10-19 17:57:08 -04:00
Luiz Aoqui	bb00f3d713	docs: add autoscaling debug (#14941 )	2022-10-19 14:17:41 -04:00
Luiz Aoqui	9f51e7ee40	docs: move autoscaling `source` agent config (#14947 ) Move the Autoscaler agent configuration `source` to the `policy` page since they are very closely related. Also update all headers in this section so they follow the proper `h1 > h2 > h3 > ...` hierarchy.	2022-10-19 14:17:09 -04:00
Luiz Aoqui	150b69daaf	docs: explain autoscaler target-value strategy (#14951 ) Provide more technical details about how the `target-value` strategy calculates new scaling actions.	2022-10-19 14:16:17 -04:00
Zach Shilton	fedeb84500	website: fix broken links (#14946 ) * fix: nomad license put link * fix: redirected URL * fix: avoid auto-formatting changes	2022-10-19 14:07:48 -04:00
Anthony	eb3515c8f5	Updated datacenter block description (#14953 ) * Updated datacenter block description * Replacing accidentally removed title * docs: add closing period Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-19 08:44:52 -05:00
Bryce Kalow	94ff129167	website: fixes redirected links (#14918 )	2022-10-18 10:31:52 -05:00
Kevin Wang	d66b2eba43	fix: website broken links (#14904 ) * fix: website broken links * fix up keyring-rotate link Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-17 11:32:10 -04:00
Seth Hoenig	69ced2a2bd	services: remove assertion on 'task' field being set (#14864 ) This PR removes the assertion around when the 'task' field of a check may be set. Starting in Nomad 1.4 we automatically set the task field on all checks in support of the NSD checks feature. This is causing validation problems elsewhere, e.g. when a group service using the Consul provider sets 'task' it will fail validation that worked previously. The assertion of leaving 'task' unset was only about making sure job submitters weren't expecting some behavior, but in practice is causing bugs now that we need the task field for more than it was originally added for. We can simply update the docs, noting when the task field set by job submitters actually has value.	2022-10-10 13:02:33 -05:00
Damian Czaja	95f969c4bf	cli: add `nomad fmt` (#14779 )	2022-10-06 17:00:29 -04:00
Giovani Avelar	a625de2062	Allow specification of a custom job name/prefix for parameterized jobs (#14631 )	2022-10-06 16:21:40 -04:00
Michael Schurter	7bbbef9951	docs: clarify nomad vars vs vault (#14831 ) * docs: clarify nomad vars vs vault I think we should make the difference in root key management between Nomad and Vault clear in the concept docs. I didn't see anywhere else in the docs we compared it. I also s/secrets/variables everywhere except the first sentence since the feature is intended to be more generic than secrets. Right now it's more of a complement to Consul's kv than Vault due to root key handling and featureset. * Update website/content/docs/concepts/variables.mdx Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-10-06 13:17:26 -07:00
Tim Gross	0cc64da404	docs: 1.4.0 upgrade warning for keyring initialization (#14825 )	2022-10-06 11:32:35 -04:00
Elijah Voigt	0a80a58394	Docs(job-specification/periodic): Add enabled toggle (#14767 ) This is probably undocumented for a reason, but the `enabled` toggle in the `periodic` stanza is very useful so I figured I'd try adding it to the docs. The feature has been secretly available since #9142 and was called out in that PR as being a dubious addition, only added to avoid regressions. The use case for disabling a periodic job in this way is to prevent it from running without modifying the schedule. Ideally Nomad would make it more clear that this was the case, and allow you to force a run of the job, but even with those rough edges I think users would benefit from knowing about this toggle.	2022-10-03 15:08:24 -04:00
Tim Gross	2a6e8be6ba	internals documentation with diagrams (#14750 ) This changeset adds new architecture internals documents to the contributing guide. These are intentionally here and not on the public-facing website because the material is not required for operators and includes a lot of diagrams that we can cheaply maintain with mermaid syntax but would involve art assets to have up on the main site that would become quickly out of date as code changes happen and be extremely expensive to maintain. However, these should be suitable to use as points of conversation with expert end users. Included: * A description of Evaluation triggers and expected counts, with examples. * A description of Evaluation states and implicit states. This is taken from an internal document in our team wiki. * A description of how writing the State Store works. This is taken from a diagram I put together a few months ago for internal education purposes. * A description of Evaluation lifecycle, from registration to running Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but broken into digestible chunks and without multi-region deployments, which I'd like to cover in a future doc. Also includes adding Deployments to our public-facing glossary. Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com> Co-authored-by: Seth Hoenig <shoenig@duck.com>	2022-10-03 14:06:41 -04:00
Tim Gross	e13ac471fc	Revert removing deprecated client options docs (#14753 ) This reverts PR #12416 and commit 6668ce022ac561f75ad113cc838b1fb786f11f79. While the driver options are well and truly deprecated, this documentation also covers features like `fingerprint.denylist` that are not available any other way. Let's revert this until #12420 is ready.	2022-09-30 08:38:03 -04:00
Derek Strickland	2c4df95e92	Merge pull request #14664 from hashicorp/docs-multiregion-dispatch multiregion: Added a section for multiregion parameterized job dispatch	2022-09-28 15:40:11 -04:00
Derek Strickland	c3d4496287	link from dispatch command	2022-09-28 08:30:22 -04:00
Derek Strickland	8b37e558fb	Apply suggestions from code review	2022-09-28 08:18:56 -04:00
Derek Strickland	fe7d1e08ac	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:20:11 -04:00
Derek Strickland	e1dba23ccf	Update website/content/docs/job-specification/multiregion.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2022-09-28 07:19:54 -04:00
Seth Hoenig	5df5e70542	core: numeric operands comparisons in constraints (#14722 ) * cleanup: fixup linter warnings in schedular/feasible.go * core: numeric operands comparisons in constraints This PR changes constraint comparisons to be numeric rather than lexical if both operands are integers or floats. Inspiration #4856 Closes #4729 Closes #14719 * fix: always parse as int64	2022-09-27 11:07:07 -05:00
Michael Schurter	fb8739d926	docs: write a lot of words about heartbeats (#14679 ) * docs: write a lot of words about heartbeats Alternative to #14670 * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * use descriptive title for link * rework example of high failover ttl Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-26 14:43:34 -07:00
Michael Schurter	e6af1c0a14	fingerprint: add node attr for reservable cores (#14694 ) * fingerprint: add node attr for reservable cores Add an attribute for the number of reservable CPU cores as they may differ from the existing `cpu.numcores` due to client configuration or OS support. Hopefully clarifies some confusion in #14676 * add changelog * num_reservable_cores -> reservablecores	2022-09-26 13:03:03 -07:00
Michael Schurter	b554f9344a	fingerprint: lengthen Vault check after seen (#14693 ) Extension of #14673 Once Vault is initially fingerprinted, extend the period since changes should be infrequent and the fingerprint is relatively expensive since it is contacting a central Vault server. Also move the period timer reset after the fingerprint. This is similar to #9435 where the idea is to ensure the retry period starts after the operation is attempted. 15s will be the minimum time between fingerprints now instead of the maximum time between fingerprints. In the case of Vault fingerprinting, the original behavior might cause the following: 1. Timer is reset to 15s 2. Fingerprint takes 16s 3. Timer has already elapsed so we immediately Fingerprint again Even if fingerprinting Vault only takes a few seconds, that may very well be due to excessive load and backing off our fingerprints is desirable. The new behavior ensures we always wait at least 15s between fingerprint attempts and should allow some natural jittering based on server load and network latency.	2022-09-26 12:14:19 -07:00
Karan Sharma	cdb3ec25d3	docs: add new tools (#14596 )	2022-09-26 11:42:06 -04:00
Tim Gross	62b1e2ef97	variables: document restrictions on path and size (#14687 )	2022-09-26 11:40:53 -04:00
Tim Gross	17aee4d69c	fingerprint: don't clear Consul/Vault attributes on failure (#14673 ) Clients periodically fingerprint Vault and Consul to ensure the server has updated attributes in the client's fingerprint. If the client can't reach Vault/Consul, the fingerprinter clears the attributes and requires a node update. Although this seems like correct behavior so that we can detect intentional removal of Vault/Consul access, it has two serious failure modes: (1) If a local Consul agent is restarted to pick up configuration changes and the client happens to fingerprint at that moment, the client will update its fingerprint and result in evaluations for all its jobs and all the system jobs in the cluster. (2) If a client loses Vault connectivity, the same thing happens. But the consequences are much worse in the Vault case because Vault is not run as a local agent, so Vault connectivity failures are highly correlated across the entire cluster. A 15 second Vault outage will cause a new `node-update` evaluation for every system job on the cluster times the number of nodes, plus one `node-update` evaluation for every non-system job on each node. On large clusters of 1000s of nodes, we've seen this create a large backlog of evaluations. This changeset updates the fingerprinting behavior to keep the last fingerprint if Consul or Vault queries fail. This prevents a storm of evaluations at the cost of requiring a client restart if Consul or Vault is intentionally removed from the client.	2022-09-23 14:45:12 -04:00
Derek Strickland	a30fb3b58e	Update multiregion.mdx	2022-09-22 14:56:21 -04:00
Derek Strickland	78caaa2c38	multiregion: Added a section for multiregion parameterized job dispatch	2022-09-22 14:50:15 -04:00
Tim Gross	c29c4bd66c	cli: remove deprecated `eval status -json` list behavior (#14651 ) In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and deprecated the usage of `eval status` without an evaluation ID with an upgrade note that it would be removed in Nomad 1.4.0. This changeset completes that work.	2022-09-22 10:56:32 -04:00
Bryce Kalow	a84d2de9be	website: content updates for developer (#14473 ) Co-authored-by: Geoffrey Grosenbach <26+topfunky@users.noreply.github.com> Co-authored-by: Anthony <russo555@gmail.com> Co-authored-by: Ashlee Boyer <ashlee.boyer@hashicorp.com> Co-authored-by: Ashlee M Boyer <43934258+ashleemboyer@users.noreply.github.com> Co-authored-by: HashiBot <62622282+hashibot-web@users.noreply.github.com> Co-authored-by: Kevin Wang <kwangsan@gmail.com>	2022-09-16 10:38:39 -05:00
Kyle Rarey	dd361d9581	docs: Correct driver name for 'Nomad Task Group' autoscaler target (#14576 )	2022-09-14 09:40:00 +02:00
Mahmood Ali	a9d5e4c510	scheduler: stopped-yet-running allocs are still running (#10446 ) * scheduler: stopped-yet-running allocs are still running * scheduler: test new stopped-but-running logic * test: assert nonoverlapping alloc behavior Also add a simpler Wait test helper to improve line numbers and save few lines of code. * docs: tried my best to describe #10446 it's not concise... feedback welcome * scheduler: fix test that allowed overlapping allocs * devices: only free devices when ClientStatus is terminal * test: output nicer failure message if err==nil Co-authored-by: Mahmood Ali <mahmood@hashicorp.com> Co-authored-by: Michael Schurter <mschurter@hashicorp.com>	2022-09-13 12:52:47 -07:00
Tim Gross	9636b0f837	docs: tweak some copy in the concept docs (#14566 )	2022-09-13 13:21:09 -04:00
Seth Hoenig	afc815c0c7	Merge pull request #14559 from hashicorp/docs-nsd-check-watcher docs: add documentation for nomad service check restarts	2022-09-13 10:52:01 -05:00
Ashlee M Boyer	fc973ebe0e	docs: Fixing heading order, adding text for links in /docs/ecosystem (#14549 ) * Fixing heading order, adding text for links * Apply suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com> * Applying more suggestions from code review Co-authored-by: Tim Gross <tgross@hashicorp.com>	2022-09-13 10:59:02 -04:00
Seth Hoenig	5b661ec84d	docs: update docs for NSD check restart	2022-09-13 09:59:02 -05:00
Tim Gross	357e7f4521	docs: include path in ACL requirements for variables (#14561 ) Also add links to the ACL policy reference and variables concepts docs near the top of the page.	2022-09-13 10:21:29 -04:00
Tim Gross	6dd79ca995	docs: variables HTTP API documentation (#14516 )	2022-09-13 10:18:26 -04:00
Tim Gross	cab787c44d	docs: keyring HTTP API documentation (#14513 )	2022-09-13 09:46:54 -04:00
Charlie Voiselle	8eb1689fca	Variables CLI documentation (#14249 )	2022-09-12 16:44:31 -04:00
Tim Gross	14b536ee86	docs: update `template` for Nomad Variables (#14527 )	2022-09-12 16:36:18 -04:00
Tim Gross	9259a373cd	remove root keyring install API (#14514 ) * keyring rotate API should require put/post method * remove keyring install API	2022-09-09 08:50:35 -04:00
Tim Gross	3fc7482ecd	CSI: failed allocation should not block its own controller unpublish (#14484 ) A Nomad user reported problems with CSI volumes associated with failed allocations, where the Nomad server did not send a controller unpublish RPC. The controller unpublish is skipped if other non-terminal allocations on the same node claim the volume. The check has a bug where the allocation belonging to the claim being freed was included in the check incorrectly. During a normal allocation stop for job stop or a new version of the job, the allocation is terminal. But allocations that fail are not yet marked terminal at the point in time when the client sends the unpublish RPC to the server. For CSI plugins that support controller attach/detach, this means that the controller will not be able to detach the volume from the allocation's host and the replacement claim will fail until a GC is run. This changeset fixes the conditional so that the claim's own allocation is not included, and makes the logic easier to read. Include a test case covering this path. Also includes two minor extra bugfixes: * Entities we get from the state store should always be copied before altering. Ensure that we copy the volume in the top-level unpublish workflow before handing off to the steps. * The list stub object for volumes in `nomad/structs` did not match the stub object in `api`. The `api` package also did not include the current readers/writers fields that are expected by the UI. True up the two objects and add the previously undocumented fields to the docs.	2022-09-08 13:30:05 -04:00
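
Commit 5df5e70542 in the log above says constraint comparisons become numeric rather than lexical when both operands are integers or floats. The idea can be illustrated roughly as follows. This is a minimal sketch, not Nomad's actual code: the function name `compareLess` and the exact fallback order (int64 first, then float64, then plain string comparison) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// compareLess is a hypothetical helper (not taken from Nomad) that reports
// whether lhs < rhs. It compares numerically when both operands parse as
// numbers and falls back to lexical string comparison otherwise.
func compareLess(lhs, rhs string) bool {
	// Try integers first: both operands must parse as int64.
	if l, err := strconv.ParseInt(lhs, 10, 64); err == nil {
		if r, err := strconv.ParseInt(rhs, 10, 64); err == nil {
			return l < r
		}
	}
	// Then try floats: both operands must parse as float64.
	if l, err := strconv.ParseFloat(lhs, 64); err == nil {
		if r, err := strconv.ParseFloat(rhs, 64); err == nil {
			return l < r
		}
	}
	// Otherwise compare the raw strings byte by byte.
	return lhs < rhs
}

func main() {
	fmt.Println(compareLess("9", "10"))     // numeric: 9 < 10
	fmt.Println(compareLess("2.5", "10.0")) // numeric: 2.5 < 10.0
	fmt.Println(compareLess("abc", "abd"))  // lexical fallback
	fmt.Println("9" < "10")                 // purely lexical, for contrast
}
```

The last line in `main` shows why the distinction matters: under a purely lexical comparison the string "9" sorts after "10", so a constraint such as "version less than 10" would give a surprising answer for single-digit values.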

664 commits