Commit Graph

24117 Commits

Author SHA1 Message Date
Tim Gross 322ecc858f
client: defer `nobody` user lookup so Windows doesn't panic (#14790)
In #14742 we introduced a cached lookup of the `nobody` user, which is only ever
called on Unixish machines. But the initial caching was being done in an `init`
block, which meant it was being run on Windows as well. This prevents the Nomad
agent from starting on Windows.

An alternative fix here would be to have a separate `init` block for Windows and
Unix, but this potentially masks incorrect behavior if we accidentally added a
call to the `Nobody()` method on Windows later. This way we're forced to handle
the error in the caller.
2022-10-04 11:52:12 -04:00
Tim Gross 341dc84a77
variables: use correct URL in ref to docs (#14792) 2022-10-04 11:30:49 -04:00
Tim Gross 2f3d4f51e6
deps: remove gophers.dev dependency (#14789) 2022-10-04 09:49:50 -04:00
Tim Gross a3ff23608c
deps: use install from current HEAD for `hc-install` (#14786)
The `hc-install` tool we're using needed a patch for a specific bug, but that's
since been merged. We definitely want to switch to using a standard release from
that project once one is shipped with the CLI, but pinning to HEAD should keep
us for now.
2022-10-04 08:22:30 -04:00
Michael Schurter ed3218c3dd
Fixing flaky TestOverlap test (#14780)
* test: ensure feasible node selected in overlap test

* test: warn when getting close to retry limit
2022-10-03 14:35:02 -07:00
Elijah Voigt 0a80a58394
Docs(job-specification/periodic): Add enabled toggle (#14767)
This is probably undocumented for a reason, but the `enabled` toggle in the
`periodic` stanza is very useful so I figured I try adding it to the docs.

The feature has been secretly avaliable since #9142 and was called out in that
PR as being a dubious addition, only added to avoid regressions.

The use case for disabling a periodic job in this way is to prevent it from
running without modifying the schedule. Ideally Nomad would make it more clear
that this was the case, and allow you to force a run of the job, but even with
those rough edges I think users would benefit from knowing about this toggle.
2022-10-03 15:08:24 -04:00
Tim Gross 2a6e8be6ba
internals documentation with diagrams (#14750)
This changeset adds new architecture internals documents to the contributing
guide. These are intentionally here and not on the public-facing website because
the material is not required for operators and includes a lot of diagrams that
we can cheaply maintain with mermaid syntax but would involve art assets to have
up on the main site that would become quickly out of date as code changes happen
and be extremely expensive to maintain. However, these should be suitable to use
as points of conversation with expert end users.

Included:
* A description of Evaluation triggers and expected counts, with examples.
* A description of Evaluation states and implicit states. This is taken from an
  internal document in our team wiki.
* A description of how writing the State Store works. This is taken from a
  diagram I put together a few months ago for internal education purposes.
* A description of Evaluation lifecycle, from registration to running
  Allocations. This is mostly lifted from @lgfa29's amazing mega-diagram, but
  broken into digestible chunks and without multi-region deployments, which I'd
  like to cover in a future doc.

Also includes adding Deployments to our public-facing glossary.

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-03 14:06:41 -04:00
Jai bd8d023ee5
refact: only reload job if job has no taskGroups (#14760) 2022-09-30 16:14:40 -04:00
dependabot[bot] 9ce74c83e6
build(deps-dev): bump @hashicorp/platform-cli in /website (#14541)
Bumps [@hashicorp/platform-cli](https://github.com/hashicorp/web-platform-packages/tree/HEAD/packages/cli) from 2.1.0 to 2.3.0.
- [Release notes](https://github.com/hashicorp/web-platform-packages/releases)
- [Changelog](https://github.com/hashicorp/web-platform-packages/blob/main/packages/cli/CHANGELOG.md)
- [Commits](https://github.com/hashicorp/web-platform-packages/commits/@hashicorp/platform-cli@2.3.0/packages/cli)

---
updated-dependencies:
- dependency-name: "@hashicorp/platform-cli"
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-09-30 14:59:55 -04:00
Luiz Aoqui b924802958
template: apply splay value on change_mode script (#14749)
Previously, the splay timeout was only applied if a template re-render
caused a restart or a signal action. The `change_mode = "script"` was
running after the `if restart || len(signals) != 0` check, so it was
invoked at all times.

This change refactors the logic so it's easier to notice that new
`change_mode` options should start only after `splay` is applied.
2022-09-30 12:04:22 -04:00
Tim Gross e13ac471fc
Revert removing deprecated client options docs (#14753)
This reverts PR #12416 and commit 6668ce022ac561f75ad113cc838b1fb786f11f79.

While the driver options are well and truly deprecated, this documentation also
covers features like `fingerprint.denylist` that are not available any other
way. Let's revert this until #12420 is ready.
2022-09-30 08:38:03 -04:00
Phil Renaud 8ac604841a
[ui] Bugfix: reinstate the "this variable will be accessible by $job/$group/$task" notification (#14741)
* When we isolated the variable form path to within its component for isolation reasons, we lost the model-level checks for related entites at type-time

* Be a little more functionally pure

* Use Ember.set to appease mirage
2022-09-29 10:40:00 -04:00
Phil Renaud a200c2f2f2
Fix a bug where we only checked the first task within a given alloc for services (#14740) 2022-09-29 10:39:42 -04:00
Seth Hoenig c68ed3b4c8
client: protect user lookups with global lock (#14742)
* client: protect user lookups with global lock

This PR updates Nomad client to always do user lookups while holding
a global process lock. This is to prevent concurrency unsafe implementations
of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo).

* cl: add cl
2022-09-29 09:30:13 -05:00
Michael Schurter 0e95fb03c0
test: skip chown test if nonroot (#14738)
CI always runs this as root, so it worked there and always scared me
when I ran it locally.
2022-09-28 14:45:38 -07:00
Derek Strickland 2c4df95e92
Merge pull request #14664 from hashicorp/docs-multiregion-dispatch
multiregion: Added a section for multiregion parameterized job dispatch
2022-09-28 15:40:11 -04:00
Derek Strickland c3d4496287 link from dispatch command 2022-09-28 08:30:22 -04:00
Derek Strickland 8b37e558fb Apply suggestions from code review 2022-09-28 08:18:56 -04:00
Derek Strickland fe7d1e08ac
Update website/content/docs/job-specification/multiregion.mdx
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-09-28 07:20:11 -04:00
Derek Strickland e1dba23ccf
Update website/content/docs/job-specification/multiregion.mdx
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-09-28 07:19:54 -04:00
Derek Strickland 7a26c4fb20
Merge pull request #14730 from hashicorp/remove-testing-changelog-entry
Remove changelog entry for test update PR
2022-09-27 18:26:02 -04:00
Derek Strickland 4c73a3b1dc
Remove changelog entry for test update PR 2022-09-27 18:17:49 -04:00
Derek Strickland eeeadfd24a
Merge pull request #14729 from hashicorp/remove-bug-fix-changelog-files
Fix and remove changelog files
2022-09-27 18:11:52 -04:00
Derek Strickland 06db9d5a4d
Merge pull request #14728 from hashicorp/post-1.4.0-rc.1-release
Post 1.4.0 rc.1 release
2022-09-27 17:52:47 -04:00
Derek Strickland 52e4997ace
Add enterprise tag 2022-09-27 17:50:25 -04:00
Derek Strickland ef0f8c5b81
Add enterprise tag 2022-09-27 17:49:27 -04:00
Derek Strickland 6738684167
Delete 14665.txt 2022-09-27 17:47:35 -04:00
Derek Strickland 87bdb74221
Remove bug fix changelog files 2022-09-27 17:46:32 -04:00
hc-github-team-nomad-core 9232da0914 Prepare for next release 2022-09-27 17:33:32 -04:00
hc-github-team-nomad-core 2fe5a962f3 Generate files for 1.4.0-rc.1 release 2022-09-27 17:33:32 -04:00
Derek Strickland a3cda3ede0 Apply changes from code review 2022-09-27 17:33:32 -04:00
Derek Strickland 1eb07f6202 Prepare release 1.4.0-rc.1 2022-09-27 17:33:32 -04:00
Phil Renaud 5a6084fff7
Visual diff tests: error states (#14707)
* 3 error states captured

* Assertion expecters

* Attempt to stabilize datacenters
2022-09-27 15:46:33 -04:00
Derek Strickland d26d2874ae
Merge pull request #14727 from hashicorp/changelog-14651-breaking-change
Fix changelog entry type
2022-09-27 15:03:02 -04:00
Derek Strickland cacf4bb8e1
Fix changelog entry type 2022-09-27 14:33:39 -04:00
Michael Schurter 0df5c7d5ae
test: fix flaky test (#14713)
Need to wait for Stop evals to be processed before you can expect
subsequent RPCs to see the alloc's DesiredStatus=stop.
2022-09-27 10:36:16 -07:00
Jim Razmus II 7da3fd050b
jobspec: allow artifact headers in HCLv1 (#14637)
* jobspec: allow artifact headers in HCLv1

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-09-27 12:18:49 -04:00
Seth Hoenig 5df5e70542
core: numeric operands comparisons in constraints (#14722)
* cleanup: fixup linter warnings in schedular/feasible.go

* core: numeric operands comparisons in constraints

This PR changes constraint comparisons to be numeric rather than
lexical if both operands are integers or floats.

Inspiration #4856
Closes #4729
Closes #14719

* fix: always parse as int64
2022-09-27 11:07:07 -05:00
Phil Renaud 24eea2d4d4
task logs page snapshot (#14709) 2022-09-27 10:25:04 -04:00
Phil Renaud 29feb48835
Empty and filled task exec screenshots for test (#14702)
* Empty and filled task exec screenshots for test

* Attempting to stabilize datacenter prop on servers
2022-09-27 10:24:53 -04:00
Tim Gross 87681fca68
CSI: ensure initial unpublish state is checkpointed (#14675)
A test flake revealed a bug in the CSI unpublish workflow, where an unpublish
that comes from a client that's successfully done the node-unpublish step will
not have the claim checkpointed if the controller-unpublish step fails. This
will result in a delay in releasing the volume claim until the next GC.

This changeset also ensures we're using a new snapshot after each write to raft,
and fixes two timing issues in test where either the volume watcher can
unpublish before the unpublish RPC is sent or we don't wait long enough in
resource-restricted environements like GHA.
2022-09-27 08:43:45 -04:00
Michael Schurter fb8739d926
docs: write a lot of words about heartbeats (#14679)
* docs: write a lot of words about heartbeats

Alternative to #14670

* Apply suggestions from code review

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* use descriptive title for link

* rework example of high failover ttl

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-09-26 14:43:34 -07:00
Seth Hoenig 7235d9988b
e2e: convert chroot env unit tests into e2e tests (#14710)
This PR translates two of our most flakey unit tests into
e2e tests where they are fit much more naturally.
2022-09-26 15:40:29 -05:00
Michael Schurter e6af1c0a14
fingerprint: add node attr for reserverable cores (#14694)
* fingerprint: add node attr for reserverable cores

Add an attribute for the number of reservable CPU cores as they may
differ from the existing `cpu.numcores` due to client configuration or
OS support.

Hopefully clarifies some confusion in #14676

* add changelog

* num_reservable_cores -> reservablecores
2022-09-26 13:03:03 -07:00
Luiz Aoqui 5c100c0d3d
client: recover from getter panics (#14696)
The artifact getter uses the go-getter library to fetch files from
different sources. Any bug in this library that results in a panic can
cause the entire Nomad client to crash due to a single file download
attempt.

This change aims to guard against this types of crashes by recovering
from panics when the getter attempts to download an artifact. The
resulting panic is converted to an error that is stored as a task event
for operator visibility and the panic stack trace is logged to the
client's log.
2022-09-26 15:16:26 -04:00
Michael Schurter b554f9344a
fingerprint: lengthen Vault check after seen (#14693)
Extension of #14673

Once Vault is initially fingerprinted, extend the period since changes
should be infrequent and the fingerprint is relatively expensive since
it is contacting a central Vault server.

Also move the period timer reset *after* the fingerprint. This is
similar to #9435 where the idea is to ensure the retry period starts
*after* the operation is attempted. 15s will be the *minimum* time
between fingerprints now instead of the *maximum* time between
fingerprints.

In the case of Vault fingerprinting, the original behavior might cause
the following:

1. Timer is reset to 15s
2. Fingerprint takes 16s
3. Timer has already elapsed so we immediately Fingerprint again

Even if fingerprinting Vault only takes a few seconds, that may very
well be due to excessive load and backing off our fingerprints is
desirable. The new bevahior ensures we always wait at least 15s between
fingerprint attempts and should allow some natural jittering based on
server load and network latency.
2022-09-26 12:14:19 -07:00
Tim Gross a661399b41
cli: fix doc strings for `var get` command (#14697) 2022-09-26 15:05:22 -04:00
Luiz Aoqui f7c6534a79
cli: set content length on `operator api` requests (#14634)
http.NewRequestWithContext will only set the right value for
Content-Length if the input is *bytes.Buffer, *bytes.Reader, or
*strings.Reader [0].

Since os.Stdin is an os.File, POST requests made with the `nomad
operator api` command would always have Content-Length set to -1, which
is interpreted as an unknown length by web servers.

[0]: https://pkg.go.dev/net/http#NewRequestWithContext
2022-09-26 14:21:40 -04:00
Karan Sharma cdb3ec25d3
docs: add new tools (#14596) 2022-09-26 11:42:06 -04:00
Tim Gross 62b1e2ef97
variables: document restrictions on path and size (#14687) 2022-09-26 11:40:53 -04:00