Commit Graph

4234 Commits

Author SHA1 Message Date
Giovani Avelar a625de2062
Allow specification of a custom job name/prefix for parameterized jobs (#14631) 2022-10-06 16:21:40 -04:00
Michael Schurter 7bbbef9951
docs: clarify nomad vars vs vault (#14831)
* docs: clarify nomad vars vs vault

I think we should make the difference in root key management between
Nomad and Vault clear in the concept docs. I didn't see anywhere else in
the docs we compared it.

I also s/secrets/variables everywhere except the first sentence since
the feature is intended to be more generic than secrets. Right now it's
more of a compliment to Consul's kv than Vault due to root key handling
and featureset.

* Update website/content/docs/concepts/variables.mdx

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-10-06 13:17:26 -07:00
HashiBot eab6bb5e35
website: upgrade next version (#14830)
Co-authored-by: Bryce Kalow <bkalow@hashicorp.com>
2022-10-06 13:48:11 -05:00
Tim Gross 0cc64da404
docs: 1.4.0 upgrade warning for keyring initialization (#14825) 2022-10-06 11:32:35 -04:00
Elijah Voigt 0a80a58394
Docs(job-specification/periodic): Add enabled toggle (#14767)
This is probably undocumented for a reason, but the `enabled` toggle in the
`periodic` stanza is very useful so I figured I try adding it to the docs.

The feature has been secretly avaliable since #9142 and was called out in that
PR as being a dubious addition, only added to avoid regressions.

The use case for disabling a periodic job in this way is to prevent it from
running without modifying the schedule. Ideally Nomad would make it more clear
that this was the case, and allow you to force a run of the job, but even with
those rough edges I think users would benefit from knowing about this toggle.
2022-10-03 15:08:24 -04:00
Tim Gross 2a6e8be6ba
internals documentation with diagrams (#14750)
This changeset adds new architecture internals documents to the contributing
guide. These are intentionally here and not on the public-facing website because
the material is not required for operators and includes a lot of diagrams that
we can cheaply maintain with mermaid syntax but would involve art assets to have
up on the main site that would become quickly out of date as code changes happen
and be extremely expensive to maintain. However, these should be suitable to use
as points of conversation with expert end users.

Included:
* A description of Evaluation triggers and expected counts, with examples.
* A description of Evaluation states and implicit states. This is taken from an
  internal document in our team wiki.
* A description of how writing the State Store works. This is taken from a
  diagram I put together a few months ago for internal education purposes.
* A description of Evaluation lifecycle, from registration to running
  Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but
  broken into digestible chunks and without multi-region deployments, which I'd
  like to cover in a future doc.

Also includes adding Deployments to our public-facing glossary.

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-03 14:06:41 -04:00
dependabot[bot] 9ce74c83e6
build(deps-dev): bump @hashicorp/platform-cli in /website (#14541)
Bumps [@hashicorp/platform-cli](https://github.com/hashicorp/web-platform-packages/tree/HEAD/packages/cli) from 2.1.0 to 2.3.0.
- [Release notes](https://github.com/hashicorp/web-platform-packages/releases)
- [Changelog](https://github.com/hashicorp/web-platform-packages/blob/main/packages/cli/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/web-platform-packages/commits/@hashicorp/platform-cli@2.3.0/packages/cli)

---
updated-dependencies:
- dependency-name: "@hashicorp/platform-cli"
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-09-30 14:59:55 -04:00
Tim Gross e13ac471fc
Revert removing deprecated client options docs (#14753)
This reverts PR #12416 and commit 6668ce022ac561f75ad113cc838b1fb786f11f79.

While the driver options are well and truly deprecated, this documentation also
covers features like `fingerprint.denylist` that are not available any other
way. Let's revert this until #12420 is ready.
2022-09-30 08:38:03 -04:00
Derek Strickland 2c4df95e92
Merge pull request #14664 from hashicorp/docs-multiregion-dispatch
multiregion: Added a section for multiregion parameterized job dispatch
2022-09-28 15:40:11 -04:00
Derek Strickland c3d4496287 link from dispatch command 2022-09-28 08:30:22 -04:00
Derek Strickland 8b37e558fb Apply suggestions from code review 2022-09-28 08:18:56 -04:00
Derek Strickland fe7d1e08ac
Update website/content/docs/job-specification/multiregion.mdx
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-09-28 07:20:11 -04:00
Derek Strickland e1dba23ccf
Update website/content/docs/job-specification/multiregion.mdx
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-09-28 07:19:54 -04:00
Seth Hoenig 5df5e70542
core: numeric operands comparisons in constraints (#14722)
* cleanup: fixup linter warnings in schedular/feasible.go

* core: numeric operands comparisons in constraints

This PR changes constraint comparisons to be numeric rather than
lexical if both operands are integers or floats.

Inspiration #4856
Closes #4729
Closes #14719

* fix: always parse as int64
2022-09-27 11:07:07 -05:00
Michael Schurter fb8739d926
docs: write a lot of words about heartbeats (#14679)
* docs: write a lot of words about heartbeats

Alternative to #14670

* Apply suggestions from code review

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* use descriptive title for link

* rework example of high failover ttl

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-09-26 14:43:34 -07:00
Michael Schurter e6af1c0a14
fingerprint: add node attr for reserverable cores (#14694)
* fingerprint: add node attr for reserverable cores

Add an attribute for the number of reservable CPU cores as they may
differ from the existing `cpu.numcores` due to client configuration or
OS support.

Hopefully clarifies some confusion in #14676

* add changelog

* num_reservable_cores -> reservablecores
2022-09-26 13:03:03 -07:00
Michael Schurter b554f9344a
fingerprint: lengthen Vault check after seen (#14693)
Extension of #14673

Once Vault is initially fingerprinted, extend the period since changes
should be infrequent and the fingerprint is relatively expensive since
it is contacting a central Vault server.

Also move the period timer reset *after* the fingerprint. This is
similar to #9435 where the idea is to ensure the retry period starts
*after* the operation is attempted. 15s will be the *minimum* time
between fingerprints now instead of the *maximum* time between
fingerprints.

In the case of Vault fingerprinting, the original behavior might cause
the following:

1. Timer is reset to 15s
2. Fingerprint takes 16s
3. Timer has already elapsed so we immediately Fingerprint again

Even if fingerprinting Vault only takes a few seconds, that may very
well be due to excessive load and backing off our fingerprints is
desirable. The new bevahior ensures we always wait at least 15s between
fingerprint attempts and should allow some natural jittering based on
server load and network latency.
2022-09-26 12:14:19 -07:00
Karan Sharma cdb3ec25d3
docs: add new tools (#14596) 2022-09-26 11:42:06 -04:00
Tim Gross 62b1e2ef97
variables: document restrictions on path and size (#14687) 2022-09-26 11:40:53 -04:00
Tim Gross 17aee4d69c
fingerprint: don't clear Consul/Vault attributes on failure (#14673)
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint and result in evaluations for all its jobs and all the system jobs
in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evalution for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.

This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
2022-09-23 14:45:12 -04:00
Derek Strickland a30fb3b58e
Update multiregion.mdx 2022-09-22 14:56:21 -04:00
Derek Strickland 78caaa2c38 multiregion: Added a section for multiregion parameterized job dispatch 2022-09-22 14:50:15 -04:00
Tim Gross c29c4bd66c
cli: remove deprecated `eval status -json` list behavior (#14651)
In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and
deprecated the usage of `eval status` without an evaluation ID with an upgrade
note that it would be removed in Nomad 1.4.0. This changeset completes that
work.
2022-09-22 10:56:32 -04:00
Bryce Kalow a84d2de9be
website: content updates for developer (#14473)
Co-authored-by: Geoffrey Grosenbach <26+topfunky@users.noreply.github.com>
Co-authored-by: Anthony <russo555@gmail.com>
Co-authored-by: Ashlee Boyer <ashlee.boyer@hashicorp.com>
Co-authored-by: Ashlee M Boyer <43934258+ashleemboyer@users.noreply.github.com>
Co-authored-by: HashiBot <62622282+hashibot-web@users.noreply.github.com>
Co-authored-by: Kevin Wang <kwangsan@gmail.com>
2022-09-16 10:38:39 -05:00
Kyle Rarey dd361d9581
docs: Correct driver name for 'Nomad Task Group' autoscaler target (#14576) 2022-09-14 09:40:00 +02:00
Mahmood Ali a9d5e4c510
scheduler: stopped-yet-running allocs are still running (#10446)
* scheduler: stopped-yet-running allocs are still running

* scheduler: test new stopped-but-running logic

* test: assert nonoverlapping alloc behavior

Also add a simpler Wait test helper to improve line numbers and save few
lines of code.

* docs: tried my best to describe #10446

it's not concise... feedback welcome

* scheduler: fix test that allowed overlapping allocs

* devices: only free devices when ClientStatus is terminal

* test: output nicer failure message if err==nil

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-09-13 12:52:47 -07:00
Tim Gross 9636b0f837
docs: tweak some copy in the concept docs (#14566) 2022-09-13 13:21:09 -04:00
Seth Hoenig afc815c0c7
Merge pull request #14559 from hashicorp/docs-nsd-check-watcher
docs: add documentation for nomad service check restarts
2022-09-13 10:52:01 -05:00
Ashlee M Boyer fc973ebe0e
docs: Fixing heading order, adding text for links in /docs/ecosystem (#14549)
* Fixing heading order, adding text for links

* Apply suggestions from code review

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* Applying more suggestions from code review

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-09-13 10:59:02 -04:00
Seth Hoenig 5b661ec84d docs: update docs for NSD check restart 2022-09-13 09:59:02 -05:00
Tim Gross 357e7f4521
docs: include path in ACL requirements for variables (#14561)
Also add links to the ACL policy reference and variables concepts docs near the
top of the page.
2022-09-13 10:21:29 -04:00
Tim Gross 6dd79ca995
docs: variables HTTP API documentation (#14516) 2022-09-13 10:18:26 -04:00
Tim Gross cab787c44d
docs: keyring HTTP API documentation (#14513) 2022-09-13 09:46:54 -04:00
Charlie Voiselle 8eb1689fca
Variables CLI documentation (#14249) 2022-09-12 16:44:31 -04:00
Tim Gross 14b536ee86
docs: update `template` for Nomad Variables (#14527) 2022-09-12 16:36:18 -04:00
Tim Gross 9259a373cd
remove root keyring install API (#14514)
* keyring rotate API should require put/post method
* remove keyring install API
2022-09-09 08:50:35 -04:00
Tim Gross 3fc7482ecd
CSI: failed allocation should not block its own controller unpublish (#14484)
A Nomad user reported problems with CSI volumes associated with failed
allocations, where the Nomad server did not send a controller unpublish RPC.

The controller unpublish is skipped if other non-terminal allocations on the
same node claim the volume. The check has a bug where the allocation belonging
to the claim being freed was included in the check incorrectly. During a normal
allocation stop for job stop or a new version of the job, the allocation is
terminal. But allocations that fail are not yet marked terminal at the point in
time when the client sends the unpublish RPC to the server.

For CSI plugins that support controller attach/detach, this means that the
controller will not be able to detach the volume from the allocation's host and
the replacement claim will fail until a GC is run. This changeset fixes the
conditional so that the claim's own allocation is not included, and makes the
logic easier to read. Include a test case covering this path.

Also includes two minor extra bugfixes:

* Entities we get from the state store should always be copied before
altering. Ensure that we copy the volume in the top-level unpublish workflow
before handing off to the steps.

* The list stub object for volumes in `nomad/structs` did not match the stub
object in `api`. The `api` package also did not include the current
readers/writers fields that are expected by the UI. True up the two objects and
add the previously undocumented fields to the docs.
2022-09-08 13:30:05 -04:00
James Rasell 813c5daa96
hcl2: add strlen function and update docs. (#14463) 2022-09-06 18:42:40 +02:00
Luiz Aoqui 1ae26981a0
connect: interpolate task env in config values (#14445)
When configuring Consul Service Mesh, it's sometimes necessary to
provide dynamic value that are only known to Nomad at runtime. By
interpolating configuration values (in addition to configuration keys),
user are able to pass these dynamic values to Consul from their Nomad
jobs.
2022-09-02 15:00:28 -04:00
Luiz Aoqui 99bddfe04d
docs: add warning about changing region config (#14443) 2022-09-01 16:47:06 -04:00
Luiz Aoqui 94d7dddccd
cli: set -hcl2-strict to false if -hcl1 is defined (#14426)
These options are mutually exclusive but, since `-hcl2-strict` defaults
to `true` users had to explicitily set it to `false` when using `-hcl1`.

Also return `255` when job plan fails validation as this is the expected 
code in this situation.
2022-09-01 10:42:08 -04:00
Tim Gross 0ef073a669
docs: clarify CSI plugin compatibility (#14434)
Nomad is generally compliant with the CSI specification for Container
Orchestrators (CO), except for unimplemented features. However, some storage
vendors have built CSI plugins that are not compliant with the specification or
which expect that they're only deployed on Kubernetes. Nomad cannot vouch for
the compatibility of any particular plugin, so clarify this in the docs.

Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>
2022-09-01 10:06:44 -04:00
Brett Larson 9912dfd1e6
Update ephemeral_disk.mdx (#14356)
It is really unclear on how to use this feature. it took me a while to find this, so I thought I would purpose how to use this.
2022-08-31 20:17:41 -04:00
James Rasell 986355bcd9
docs: add documentation for ACL token expiration and ACL roles. (#14332)
The ACL command docs are now found within a sub-dir like the
operator command docs. Updates to the ACL token commands to
accommodate token expiry have also been added.

The ACL API docs are now found within a sub-dir like the operator
API docs. The ACL docs now include the ACL roles endpoint as well
as updated ACL token endpoints for token expiration.

The configuration section is also updated to accommodate the new
ACL and server parameters for the new ACL features.
2022-08-31 16:13:47 +02:00
Tim Gross c9d678a91a
keyring: wrap root key in key encryption key (#14388)
Update the on-disk format for the root key so that it's wrapped with a unique
per-key/per-server key encryption key. This is a bit of security theatre for the
current implementation, but it uses `go-kms-wrapping` as the interface for
wrapping the key. This provides a shim for future support of external KMS such
as cloud provider APIs or Vault transit encryption.

* Removes the JSON serialization extension we had on the `RootKey` struct; this
  struct is now only used for key replication and not for disk serialization, so
  we don't need this helper.

* Creates a helper for generating cryptographically random slices of bytes that
  properly accounts for short reads from the source.

* No observable functional changes outside of the on-disk format, so there are
  no test updates.
2022-08-30 10:59:25 -04:00
Tim Gross 37905d94b7
docs: fixing a few more places we missed "secure" during rename (#14395) 2022-08-30 10:08:50 -04:00
quoing ce7a3745d5
docs: template change script example correction (#14368)
"path" parameter doesn't work, should be command
2022-08-30 12:09:55 +02:00
Tim Gross d7652fdd3a
docs: rename Secure Variables to Variables (#14352) 2022-08-29 11:37:08 -04:00
Luiz Aoqui e012d9411e
Task lifecycle restart (#14127)
* allocrunner: handle lifecycle when all tasks die

When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.

* allocrunner: implement different alloc restarts

Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.

* allocrunner: allow tasks to run again

Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.

The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, or all
tasks that have already run.

* api/cli: add support for all tasks alloc restart

Implement the new -all-tasks alloc restart CLI flag and its API
counterpar, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.

* test: fix tasklifecycle Coordinator test

* allocrunner: kill taskrunners if all tasks are dead

When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.

* taskrunner: fix tests that waited on WaitCh

Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.

* tests: add tests for all tasks alloc restart

* changelog: add entry for #14127

* taskrunner: fix restore logic.

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also had the incorrect code path. When restoring a dead task the
driver handle always needs to be clear cleanly using `clearDriverHandle`
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner local
client state: if the task runner ever exits this loop cleanly (not with
a shutdown) it will never be able to run again. So if the Run() loops
starts with this local state flag set, it must exit early.

This local state flag is also being checked on task restart requests. If
the task is "dead" and its Run() loop is not active it will never be
able to run again.

* address code review requests

* apply more code review changes

* taskrunner: add different Restart modes

Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.

So instead of relying on the event type, this commit separated the logic
of restarting an taskRunner into two methods:
- `Restart` will retain the current behaviour and only will only restart
  the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers will need to
  restart the allocRunner taskCoordinator to make sure it will allow the
  task to run again.

* minor fixes
2022-08-24 17:43:07 -04:00
Piotr Kazmierczak 7077d1f9aa
template: custom change_mode scripts (#13972)
This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707
2022-08-24 17:43:01 +02:00