Commit graph

728 commits

Author SHA1 Message Date
Charlie Voiselle 83e43e01c1
Add missing timer reset (#15134) 2022-11-03 18:57:57 -04:00
Ethan 654ae1d591
fix: batchFirstFingerprints does not update device on node after v1.3.5 (#15125)
* fix: update device in batch first fingerprint

* cl: add cl note

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-11-03 16:31:39 -05:00
Tim Gross 672fb46d12
WI: set identity to client secret if missing (#15121)
Allocations created before 1.4.0 will not have a workload identity token. When
the client running these allocs is upgraded to 1.4.x, the identity hook will run
and replace the node secret ID token used previously with an empty string. This
causes service discovery queries to fail.

Fallback to the node's secret ID when the allocation doesn't have a signed
identity. Note that pre-1.4.0 allocations won't have templates that read
Variables, so there's no threat that this new node ID secret will be able to
read data that the allocation shouldn't have access to.
2022-11-03 11:10:11 -04:00
Phil Renaud ffb4c63af7
[ui] Adds meta to job list stub and displays a pack logo on the jobs index (#14833)
* Adds meta to job list stub and displays a pack logo on the jobs index

* Changelog

* Modifying struct for optional meta param

* Explicitly ask for meta anytime I look up a job from index or job page

* Test case for the endpoint

* adding meta field to API struct and omitting from response if empty

* passthru method added to api/jobs.list

* Meta param listed in docs for jobs list

* Update api/jobs.go

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-11-02 16:58:24 -04:00
Phil Renaud 6d5fe56fa1
Job spec upload (#14747)
* Job spec upload by click or drag

* pseudo-restrict formats

* Changelog

* Tweak to job spec upload to be above editor layer

* Within the job-editor again tho

* Beginning testcase cleanup

* Test progression

* refact: update codemirror fillin logic

Co-authored-by: Jai Bhagat <jaybhagat841@gmail.com>
2022-11-02 10:34:10 -04:00
Tim Gross 4d7a4171cd
volumewatcher: prevent panic on nil volume (#15101)
If a GC claim is written and then volume is deleted before the `volumewatcher`
enters its run loop, we panic on the nil-pointer access. Simply doing a
nil-check at the top of the loop reveals a race condition around shutting down
the loop just as a new update is coming in.

Have the parent `volumeswatcher` send an initial update on the channel before
returning, so that we're still holding the lock. Update the watcher's `Stop`
method to set the running state, which lets us avoid having a second context and
makes stopping synchronous. This reduces the cases we have to handle in the run
loop.

Updated the tests now that we'll safely return from the goroutine and stop the
runner in a larger set of cases. Ran the tests with the `-race` detection flag
and fixed up any problems found here as well.
2022-11-01 16:53:10 -04:00
Tim Gross 38542f256e
variables: limit rekey eval to half the nack timeout (#15102)
In order to limit how much the rekey job can monopolize a scheduler worker, we
limit how long it can run to 1min before stopping work and emitting a new
eval. But this exactly matches the default nack timeout, so it'll fail the eval
rather than getting a chance to emit a new one.

Set the timeout for the rekey eval to half the configured nack timeout.
2022-11-01 16:50:50 -04:00
Tim Gross 903b5baaa4
keyring: safely handle missing keys and restore GC (#15092)
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.

Included in the replication fix:
* Refactor the replication loop so that each key is replicated in a function call
  that returns an error, to make the workflow clearer and reduce nesting. Log
  the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
  on initializing the keyring, so there's a race condition in the keyring tests
  where we can test for the existence of the root key before the keyring has
  been initialized. Change this to an "eventually" test.

But these fixes aren't enough to fix #14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
2022-11-01 15:00:50 -04:00
dependabot[bot] acc94d523f
build(deps): bump github.com/docker/cli from 20.10.18+incompatible to 20.10.21+incompatible (#15078)
* build(deps): bump github.com/docker/cli

Bumps [github.com/docker/cli](https://github.com/docker/cli) from 20.10.18+incompatible to 20.10.21+incompatible.
- [Release notes](https://github.com/docker/cli/releases)
- [Commits](https://github.com/docker/cli/compare/v20.10.18...v20.10.21)

---
updated-dependencies:
- dependency-name: github.com/docker/cli
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps: updated github.com/docker/cli from 20.10.18+incompatible to 20.10.21+incompatible

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-31 08:50:33 -05:00
dependabot[bot] 369e4da4ad
build(deps): bump github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126 (#15081)
* build(deps): bump github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126

Bumps [github.com/aws/aws-sdk-go](https://github.com/aws/aws-sdk-go) from 1.44.84 to 1.44.126.
- [Release notes](https://github.com/aws/aws-sdk-go/releases)
- [Commits](https://github.com/aws/aws-sdk-go/compare/v1.44.84...v1.44.126)

---
updated-dependencies:
- dependency-name: github.com/aws/aws-sdk-go
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

* deps: update github.com/aws/aws-sdk-go from 1.44.84 to 1.44.126

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-31 08:47:48 -05:00
Tim Gross 2ce1728fa6 Merge release 1.4.2 files
Changelog updates for 1.4.2 and backports.
2022-10-27 13:31:29 -04:00
Tim Gross 9d906d4632 variables: fix filter on List RPC
The List RPC correctly authorized against the prefix argument. But when
filtering results underneath the prefix, it only checked authorization for
standard ACL tokens and not Workload Identity. This results in WI tokens being
able to read List results (metadata only: variable paths and timestamps) for
variables under the `nomad/` prefix that belong to other jobs in the same
namespace.

Fix the filtering and split the `handleMixedAuthEndpoint` function into
separate authentication and authorization steps so that we don't need to
re-verify the claim token on each filtered object.

Also includes:
* update semgrep rule for mixed auth endpoints
* variables: List returns empty set when all results are filtered
2022-10-27 13:08:05 -04:00
James Rasell da5069bded event stream: ensure token expiry is correctly checked for subs.
This change ensures that a token's expiry is checked before every
event is sent to the caller. Previously, a token could still be
used to listen for events after it had expired, as long as the
subscription was made while it was unexpired. This would last until
the token was garbage collected from state.

The check occurs within the RPC as there is currently no state
update when a token expires.
2022-10-27 13:08:05 -04:00
dependabot[bot] 07796965b1
build(deps): bump google.golang.org/grpc from 1.48.0 to 1.50.1 (#14897)
* build(deps): bump google.golang.org/grpc from 1.48.0 to 1.50.1

Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.48.0 to 1.50.1.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.48.0...v1.50.1)

---
updated-dependencies:
- dependency-name: google.golang.org/grpc
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* cl: add changelog entry for grpc

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-27 11:32:48 -05:00
dependabot[bot] eb210f2af7
build(deps): bump github.com/fsouza/go-dockerclient from 1.8.2 to 1.9.0 (#14898)
* build(deps): bump github.com/fsouza/go-dockerclient from 1.8.2 to 1.9.0

Bumps [github.com/fsouza/go-dockerclient](https://github.com/fsouza/go-dockerclient) from 1.8.2 to 1.9.0.
- [Release notes](https://github.com/fsouza/go-dockerclient/releases)
- [Changelog](https://github.com/fsouza/go-dockerclient/blob/main/container_changes_test.go)
- [Commits](https://github.com/fsouza/go-dockerclient/compare/v1.8.2...v1.9.0)

---
updated-dependencies:
- dependency-name: github.com/fsouza/go-dockerclient
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* cl: add changelog entry

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2022-10-27 11:05:45 -05:00
Charlie Voiselle 28cd831085
Update consul-template dep (#15045) 2022-10-26 11:51:45 -04:00
Tim Gross aca95c0bc6
keyring: remove root key GC (#15034) 2022-10-25 17:06:18 -04:00
Seth Hoenig d69556fb35
client: ensure minimal cgroup controllers enabled (#15027)
* client: ensure minimal cgroup controllers enabled

This PR fixes a bug where Nomad could not operate properly on operating
systems that set the root cgroup.subtree_control to a set of controllers that
do not include the minimal set of controllers needed by Nomad.

Nomad needs these controllers enabled to operate:
- cpuset
- cpu
- io
- memory
- pids

Now, Nomad will ensure these controllers are enabled during Client initialization,
adding them to cgroup.subtree_control as necessary. This should be particularly
helpful on the RHEL/CentOS/Fedora family of systems. Ubuntu systems should be
unaffected as they enable all controllers by default.

Fixes: https://github.com/hashicorp/nomad/issues/14494

* docs: cleanup doc string

* client: cleanup controller writes, enhance log messages
2022-10-24 16:08:54 -05:00
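A rough sketch of the controller check described in d69556fb35, assuming the current contents of `cgroup.subtree_control` have already been read as a string. The helper `missingControllers` is hypothetical; it only computes the `+name` entries that would need to be written and does not touch the filesystem.

```go
package main

import (
	"fmt"
	"strings"
)

// required lists the controllers the commit message says Nomad needs.
var required = []string{"cpuset", "cpu", "io", "memory", "pids"}

// missingControllers returns the "+name" entries for required controllers
// that are absent from the current cgroup.subtree_control contents.
func missingControllers(current string) string {
	enabled := make(map[string]bool)
	for _, c := range strings.Fields(current) {
		enabled[c] = true
	}
	var add []string
	for _, c := range required {
		if !enabled[c] {
			add = append(add, "+"+c)
		}
	}
	return strings.Join(add, " ")
}

func main() {
	// A host that enables only three of the five controllers.
	fmt.Println(missingControllers("cpu memory pids")) // +cpuset +io
}
```

An empty result means nothing needs to be written, which matches the case the message describes for hosts that enable all controllers by default.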
Seth Hoenig 32744a3548
deps: update hashicorp/raft to v1.3.11 (#15021)
* deps: update hashicorp/raft to v1.3.11

Includes part of the fix for https://github.com/hashicorp/raft/issues/524

* cl: add changelog entry
2022-10-24 12:10:24 -05:00
Tim Gross b9922631bd
keyring: fix missing GC config, don't rotate on manual GC (#15009)
The configuration knobs for root keyring garbage collection are present in the
consumer and present in the user-facing config, but we missed the spot where we
copy from one to the other. Fix this so that users can set their own thresholds.

The root key is automatically rotated every ~30d, but the function that does
both rotation and key GC was wired up such that `nomad system gc` caused an
unexpected key rotation. Split this into two functions so that `nomad system gc`
cleans up old keys without forcing a rotation, which will be done periodically
or by the `nomad operator root keyring rotate` command.
2022-10-24 08:43:42 -04:00
James Rasell 206fb04dc1
acl: allow tokens to read policies linked via roles to the token. (#14982)
ACL tokens are granted permissions either by direct policy links
or via ACL role links. Callers should therefore be able to read
policies directly assigned to the caller token or indirectly by
ACL role links.
2022-10-21 09:05:17 +02:00
Luiz Aoqui 593e48e826
cli: prevent panic on operator debug (#14992)
If the API returns an error during debug bundle collection the CLI was
expanding the wrong error object, resulting in a panic since `err` is
`nil`.
2022-10-20 15:53:58 -04:00
Jai 08fde3a4ff
refact: upgrade Promise.then to async/await (#14798)
* refact: upgrade Promise.then to async/await

* naive solution (#14800)

* refact: use id instead of model

* chore:  add changelog entry

* refact: add conditional safety around alloc
2022-10-20 14:25:41 -04:00
Seth Hoenig 6e9c8a9955
deps: update go-memdb for goroutine leak fix (#14983)
* deps: update go-memdb for goroutine leak fix

* cl: update for goroutine leak go-memdb
2022-10-20 10:34:52 -05:00
James Rasell 215b4e7e36
acl: add ACL roles to event stream topic and resolve policies. (#14923)
This change adds ACL role creation and deletion to the event
stream. It is exposed as a single topic with two types; the filter
is primarily the role ID but also includes the role name.

While conducting this work it was also discovered that the events
stream has its own ACL resolution logic. This did not account for
ACL tokens which included role links, or tokens with expiry times.
ACL role links are now resolved to their policies and tokens are
checked for expiry correctly.
2022-10-20 09:43:35 +02:00
James Rasell d7b311ce55
acl: correctly resolve ACL roles within client cache. (#14922)
The client ACL cache was not accounting for tokens which included
ACL role links. This change modifies the behaviour to resolve role
links to policies. It will also now store ACL roles within the
cache for quick lookup. The cache TTL is configurable in the same
manner as policies or tokens.

Another small fix is included that takes into account the ACL
token expiry time. This was not included, which meant tokens with
expiry could be used past the expiry time, until they were GC'd.
2022-10-20 09:37:32 +02:00
Phil Renaud 54eeb6ebe8
Adds searching and filtering for nodes on topology view (#14913)
* Adds searching and filtering for nodes on topology view

* Lintfix and changelog

* Acceptance tests for topology search and filter

* Search terms also apply to class and dc on topo page

* Initialize queryparam values so as to not break history state
2022-10-19 15:00:35 -04:00
Seth Hoenig 57375566d4
consul: register checks along with service on initial registration (#14944)
* consul: register checks along with service on initial registration

This PR updates Nomad's Consul service client to include checks in
an initial service registration, so that the checks associated with
the service are registered "atomically" with the service. Before, we
would only register the checks after the service registration, which
causes problems where the service is deemed healthy, even if one or
more checks are unhealthy - especially problematic in the case where
SuccessBeforePassing is configured.

Fixes #3935

* cr: followup to fix cause of extra consul logging

* cr: fix another bug

* cr: fixup changelog
2022-10-19 12:40:56 -05:00
James Rasell 8e25048f3d
acl: gate ACL role write and delete RPC usage on v1.4.0 or greater. (#14908) 2022-10-18 16:46:11 +02:00
James Rasell 9923f9e6f3
nnsd: gate registration write & delete RPC use on v1.3.0 or greater. (#14924) 2022-10-18 15:30:28 +02:00
Seth Hoenig f1b902beac
consul: do not re-register already registered services (#14917)
This PR updates Nomad's Consul service client to do map comparisons
using maps.Equal instead of reflect.DeepEqual. The bug fix is in how
DeepEqual treats nil slices differently from empty slices, when actually
they should be treated the same.
2022-10-18 08:10:59 -05:00
Tim Gross 3c78980b78
make version checks specific to region (1.4.x) (#14912)
* One-time tokens are not replicated between regions, so we don't want to enforce
  the version check across all of serf, just across members in the same region.
* Scheduler: Disconnected clients handling is specific to a single region, so we
  don't want to enforce the version check across all of serf, just across
  members in the same region.
* Variables: enforce version check in Apply RPC
* Cleans up a bunch of legacy checks.

This changeset is specific to 1.4.x and the changes for previous versions of
Nomad will be manually backported in a separate PR.
2022-10-17 16:23:51 -04:00
Tim Gross c721ce618e
keyring: filter by region before checking version (#14901)
In #14821 we fixed a panic that can happen if a leadership election happens in
the middle of an upgrade. That fix checks that all servers are at the minimum
version before initializing the keyring (which blocks evaluation processing
during the upgrade). But the check we implemented is over the serf membership,
which includes servers in any federated regions, which don't necessarily have
the same upgrade cycle.

Filter the version check by the leader's region.

Also bump up log levels of major keyring operations
2022-10-17 13:21:16 -04:00
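A simplified sketch of the region-scoped version check described in c721ce618e. The `member` type, the three-part version array, and both helper functions are assumptions made for illustration; the real check works on serf members and parsed version strings.

```go
package main

import "fmt"

// member is a hypothetical, simplified view of a serf member.
type member struct {
	Name    string
	Region  string
	Version [3]int // major, minor, patch
}

// atLeast reports whether v is greater than or equal to min.
func atLeast(v, min [3]int) bool {
	for i := range v {
		if v[i] != min[i] {
			return v[i] > min[i]
		}
	}
	return true
}

// regionMeetsMinimum checks only the members of the given region, so servers
// in federated regions on a different upgrade cycle do not block the result.
func regionMeetsMinimum(members []member, region string, min [3]int) bool {
	for _, m := range members {
		if m.Region != region {
			continue
		}
		if !atLeast(m.Version, min) {
			return false
		}
	}
	return true
}

func main() {
	members := []member{
		{Name: "s1", Region: "east", Version: [3]int{1, 4, 0}},
		{Name: "s2", Region: "east", Version: [3]int{1, 4, 1}},
		{Name: "s3", Region: "west", Version: [3]int{1, 3, 5}},
	}
	min := [3]int{1, 4, 0}
	fmt.Println(regionMeetsMinimum(members, "east", min)) // true
	fmt.Println(regionMeetsMinimum(members, "west", min)) // false
}
```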
Tim Gross bcd26f8815
docker_logger: reorder imports to save memory (#14875)
Nomad runs one logmon process and also one docker_logger process for each
running allocation. A naive look at memory usage shows 10-30 MB of RSS, but a
closer look shows that most of this memory (ex. all but ~2MB for logmon) is
shared (`Shared_Clean` in Linux pmap).

But a heap dump of docker_logger shows that it currently has an extra ~2500 KiB
of heap (anonymously-mapped unshared memory) used for init blocks coming from
the agent code (ex. mostly regexes from go-version, structs, and the Consul
SDK). The packages for running logmon, docker_logger, and executor have an init
block that parses `os.Args` to drop into their own logic, which prevents them
from loading all the rest of the agent code and saves on memory, so this was
unexpected.

It looks like we accidentally reordered the imports in main to undo some of the
work originally done in 404d2d4c98f1df930be1ae9852fe6e6ae8c1517e. This changeset
restores the ordering. A follow-up heap dump shows this saves ~2MB of unshared
RSS per docker_logger process.
2022-10-11 13:23:03 -04:00
Seth Hoenig 1593963cd1
servicedisco: implicit constraint for nomad v1.4 when using nsd checks (#14868)
This PR adds a jobspec mutator to constrain jobs making use of checks
in the nomad service provider to nomad clients of at least v1.4.0.

Before, in a mixed client version cluster it was possible to submit
an NSD job making use of checks and for that job to land on an older,
incompatible client node.

Closes #14862
2022-10-11 08:21:42 -05:00
Seth Hoenig 69ced2a2bd
services: remove assertion on 'task' field being set (#14864)
This PR removes the assertion around when the 'task' field of
a check may be set. Starting in Nomad 1.4 we automatically set
the task field on all checks in support of the NSD checks feature.

This is causing validation problems elsewhere, e.g. when a group
service using the Consul provider sets 'task' it will fail
validation that worked previously.

The assertion of leaving 'task' unset was only about making sure
job submitters weren't expecting some behavior, but in practice
is causing bugs now that we need the task field for more than it
was originally added for.

We can simply update the docs, noting when the task field set by
job submitters actually has value.
2022-10-10 13:02:33 -05:00
Phil Renaud e771b94164
[ui] Makes service tags wrap and look like tag items (#14834)
* Makes service tags wrap and look like tag items

* Add a little vertical spacing and changelog

* Put client before tags

* Force tags list to new line
2022-10-07 09:23:52 -04:00
Damian Czaja 95f969c4bf
cli: add nomad fmt (#14779) 2022-10-06 17:00:29 -04:00
Phil Renaud 4b93a30225
[ui] Line charts: explicitly update X-axis whenever xScale changes (#14814)
* Explicitly update X-axis whenever xScale changes

* Changelog
2022-10-06 16:59:16 -04:00
Hemanth Krishna e516fc266f
enhancement: UpdateTask when Task is waiting for ShutdownDelay (#14775)
Signed-off-by: Hemanth Krishna <hkpdev008@gmail.com>
2022-10-06 16:33:28 -04:00
Will Jordan 8ae13208c9
Allow jobs not requiring any network resources (#14300)
Jobs not requiring any network resources should be allowed
even when the network fingerprinter is disabled.
2022-10-06 16:25:41 -04:00
Gabriel Villalonga Simon b974c32ba6
Check that JobPlanResponse Diff Type is None before checking for changes on getExitCode (#14492) 2022-10-06 16:23:22 -04:00
Pablo Ruiz García 40416be7b1
Invoke FingerprintManager's Reload() func during agent's SIGHUP (#14615)
Fixes #14614
2022-10-06 16:22:59 -04:00
Giovani Avelar a625de2062
Allow specification of a custom job name/prefix for parameterized jobs (#14631) 2022-10-06 16:21:40 -04:00
Tim Gross 80ec5e1346
fix panic from keyring raft entries being written during upgrade (#14821)
During an upgrade to Nomad 1.4.0, if a server running 1.4.0 becomes the leader
before one of the 1.3.x servers, the old server will crash because the keyring
is initialized and writes a raft entry.

Wait until all members are on a version that supports the keyring before
initializing it.
2022-10-06 12:47:02 -04:00
Luiz Aoqui b924802958
template: apply splay value on change_mode script (#14749)
Previously, the splay timeout was only applied if a template re-render
caused a restart or a signal action. The `change_mode = "script"` was
running after the `if restart || len(signals) != 0` check, so it was
invoked at all times.

This change refactors the logic so it's easier to notice that new
`change_mode` options should start only after `splay` is applied.
2022-09-30 12:04:22 -04:00
Seth Hoenig c68ed3b4c8
client: protect user lookups with global lock (#14742)
* client: protect user lookups with global lock

This PR updates Nomad client to always do user lookups while holding
a global process lock. This guards against concurrency-unsafe implementations
of NSS while still enabling NSS lookups of users (i.e. we cannot use osusergo).

* cl: add cl
2022-09-29 09:30:13 -05:00
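The locking approach in c68ed3b4c8 can be sketched as a thin wrapper around `os/user`. The names `lookupLock` and `lockedLookup` are hypothetical; the only claim here is that every lookup goes through one process-wide mutex.

```go
package main

import (
	"fmt"
	"os/user"
	"sync"
)

// lookupLock serializes all user lookups in the process.
var lookupLock sync.Mutex

// lockedLookup resolves a username while holding the global lock, so a
// concurrency-unsafe NSS backend never sees two lookups at once.
func lockedLookup(username string) (*user.User, error) {
	lookupLock.Lock()
	defer lookupLock.Unlock()
	return user.Lookup(username)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The result depends on the host; only the locking matters here.
			_, _ = lockedLookup("root")
		}()
	}
	wg.Wait()
	fmt.Println("all lookups finished")
}
```

The cost is that lookups no longer run in parallel, which is usually acceptable because they are infrequent relative to other client work.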
Derek Strickland 4c73a3b1dc
Remove changelog entry for test update PR 2022-09-27 18:17:49 -04:00
Derek Strickland 52e4997ace
Add enterprise tag 2022-09-27 17:50:25 -04:00
Derek Strickland ef0f8c5b81
Add enterprise tag 2022-09-27 17:49:27 -04:00
Derek Strickland 6738684167
Delete 14665.txt 2022-09-27 17:47:35 -04:00
Derek Strickland 87bdb74221
Remove bug fix changelog files 2022-09-27 17:46:32 -04:00
Derek Strickland cacf4bb8e1
Fix changelog entry type 2022-09-27 14:33:39 -04:00
Jim Razmus II 7da3fd050b
jobspec: allow artifact headers in HCLv1 (#14637)
* jobspec: allow artifact headers in HCLv1

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-09-27 12:18:49 -04:00
Seth Hoenig 5df5e70542
core: numeric operands comparisons in constraints (#14722)
* cleanup: fixup linter warnings in scheduler/feasible.go

* core: numeric operands comparisons in constraints

This PR changes constraint comparisons to be numeric rather than
lexical if both operands are integers or floats.

Inspiration #4856
Closes #4729
Closes #14719

* fix: always parse as int64
2022-09-27 11:07:07 -05:00
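To illustrate why 5df5e70542 matters, here is a small sketch contrasting lexical and numeric ordering. `lessThan` is a hypothetical helper, not the scheduler's actual function: it tries an int64 parse first, then a float parse, and falls back to string comparison only when either operand is non-numeric.

```go
package main

import (
	"fmt"
	"strconv"
)

// lessThan compares numerically when both operands parse as numbers and
// lexically otherwise.
func lessThan(l, r string) bool {
	if li, err := strconv.ParseInt(l, 10, 64); err == nil {
		if ri, err := strconv.ParseInt(r, 10, 64); err == nil {
			return li < ri
		}
	}
	if lf, err := strconv.ParseFloat(l, 64); err == nil {
		if rf, err := strconv.ParseFloat(r, 64); err == nil {
			return lf < rf
		}
	}
	return l < r
}

func main() {
	fmt.Println("9" < "10")             // false: plain string comparison
	fmt.Println(lessThan("9", "10"))    // true: both parse as integers
	fmt.Println(lessThan("1.5", "10"))  // true: compared as floats
	fmt.Println(lessThan("abc", "abd")) // true: lexical fallback
}
```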
Tim Gross 87681fca68
CSI: ensure initial unpublish state is checkpointed (#14675)
A test flake revealed a bug in the CSI unpublish workflow, where an unpublish
that comes from a client that's successfully done the node-unpublish step will
not have the claim checkpointed if the controller-unpublish step fails. This
will result in a delay in releasing the volume claim until the next GC.

This changeset also ensures we're using a new snapshot after each write to raft,
and fixes two timing issues in tests where either the volume watcher can
unpublish before the unpublish RPC is sent or we don't wait long enough in
resource-restricted environments like GHA.
2022-09-27 08:43:45 -04:00
Michael Schurter e6af1c0a14
fingerprint: add node attr for reservable cores (#14694)
* fingerprint: add node attr for reservable cores

Add an attribute for the number of reservable CPU cores as they may
differ from the existing `cpu.numcores` due to client configuration or
OS support.

Hopefully clarifies some confusion in #14676

* add changelog

* num_reservable_cores -> reservablecores
2022-09-26 13:03:03 -07:00
Luiz Aoqui 5c100c0d3d
client: recover from getter panics (#14696)
The artifact getter uses the go-getter library to fetch files from
different sources. Any bug in this library that results in a panic can
cause the entire Nomad client to crash due to a single file download
attempt.

This change aims to guard against these types of crashes by recovering
from panics when the getter attempts to download an artifact. The
resulting panic is converted to an error that is stored as a task event
for operator visibility and the panic stack trace is logged to the
client's log.
2022-09-26 15:16:26 -04:00
Luiz Aoqui f7c6534a79
cli: set content length on operator api requests (#14634)
http.NewRequestWithContext will only set the right value for
Content-Length if the input is *bytes.Buffer, *bytes.Reader, or
*strings.Reader [0].

Since os.Stdin is an os.File, POST requests made with the `nomad
operator api` command would always have Content-Length set to -1, which
is interpreted as an unknown length by web servers.

[0]: https://pkg.go.dev/net/http#NewRequestWithContext
2022-09-26 14:21:40 -04:00
Phil Renaud 497bd02169
[ui] Warn users when they leave an edited but unsaved variable page (#14665)
* Warning on attempt to leave

* Lintfix

* Only router.off once

* Dont warn on transition when only updating queryparams

* Remove double-push and queryparam-only issues, thanks @lgfa29

* Acceptance tests

* Changelog
2022-09-23 16:53:40 -04:00
Phil Renaud a28e1bcc1e
[ui] Service Healthchecks: styles for pseudo-timestamp axis (#14677)
* Styles for pseudo-timestamp axis

* Changelog
2022-09-23 16:53:28 -04:00
Tim Gross 17aee4d69c
fingerprint: don't clear Consul/Vault attributes on failure (#14673)
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint and result in evaluations for all its jobs and all the system jobs
in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evaluation for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.

This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
2022-09-23 14:45:12 -04:00
Derek Strickland 6874997f91
scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments (#14659)
* scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-09-22 14:31:27 -04:00
Jorge Marey 92158a1c62
connect: add nomad env to envoy bootstrap (#12959)
* Add nomad env to envoy bootstrap

* Add changelog file
2022-09-22 13:18:18 -05:00
Phil Renaud eca0e7bf56
[ui] task logs in sidebar (#14612)
* button styles

* Further styles including global toggle adjustment

* sidebar funcs and header

* Functioning task logs in high-level sidebars

* same-lineify the show tasks toggle

* Changelog

* Full-height sidebar calc in css, plz drop soon container queries

* Active status and query params for allocations page

* Reactive shouldShowLogs getter and added to client and task group pages

* Higher order func passing, thanks @DingoEatingFuzz

* Non-service job types get allocation params passed

* Keyframe animation for task log sidebar

* Acceptance test

* A few more sub-row tests

* Lintfix
2022-09-22 10:58:52 -04:00
Tim Gross c29c4bd66c
cli: remove deprecated eval status -json list behavior (#14651)
In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and
deprecated the usage of `eval status` without an evaluation ID with an upgrade
note that it would be removed in Nomad 1.4.0. This changeset completes that
work.
2022-09-22 10:56:32 -04:00
Jorge Marey 584ddfe859
Add Namespace, Job and Group to envoy stats (#14311) 2022-09-22 10:38:21 -04:00
Tim Gross d327a68696
operator debug: write NDJSON for large collections (#14610)
The `operator debug` command writes JSON files from API responses as a single
line containing an array of JSON objects. But some of these files can be
extremely large (GB's) for large production clusters, which makes it difficult
to parse them using typical line-oriented Unix command line tools that can
stream their inputs without consuming a lot of memory.

For collections that are typically large, instead emit newline-delimited JSON.

This changeset includes some first-pass refactoring of this command. It breaks
up monolithic methods that validate a path, create a file, serialize objects,
and write them to disk into smaller functions, some of which can now be
standalone to take advantage of generics.
2022-09-22 10:02:00 -04:00
James Rasell a25028c412
cli: fix a bug in operator API when setting HTTPS via address. (#14635)
Operators may have a setup whereby the TLS config comes from a
source other than setting Nomad specific env vars. In this case,
we should attempt to identify the scheme using the config setting
as a fallback.
2022-09-22 15:43:58 +02:00
Luiz Aoqui ad48401219
chore: move changelog file to the right folder (#14639) 2022-09-21 13:50:22 -04:00
Tim Gross 38a6e7e343
remove 1.4.0 changelog entry that refers to bugfix on new code (#14611)
Bug fixes on new features in Nomad 1.4.0 don't need or want changelog entries in
the same changelog the feature appeared, so remove this one.
2022-09-16 16:14:02 -04:00
Phil Renaud d6c9676252
Added task links to various alloc tables (#14592)
* Added task links to various alloc tables

* Lintfix

* Border collapse and added to task group page

* Logs icon temporarily removed and localStorage added

* Mock task added to test

* Delog

* Two asserts in new test

* Remove commented-out code

* Changelog

* Removing args.allocation deps
2022-09-16 15:58:22 -04:00
Phil Renaud cebfbb0c28
Stabilizing percy snapshots with faker (#14551)
* First attempt at stabilizing percy snapshots with faker

* Tokens seed moved to before management token generation

* Faker seed only in token test

* moving seed after storage clear

* And again, but back to no faker seeding

* Isolated seed and temporary log

* Setting seed(1) wherever we're snapshotting, or before establishing cluster scenarios

* Deliberate noop to see if percy is stable

* Changelog entry
2022-09-14 11:27:48 -04:00
Mahmood Ali a9d5e4c510
scheduler: stopped-yet-running allocs are still running (#10446)
* scheduler: stopped-yet-running allocs are still running

* scheduler: test new stopped-but-running logic

* test: assert nonoverlapping alloc behavior

Also add a simpler Wait test helper to improve line numbers and save a few
lines of code.

* docs: tried my best to describe #10446

it's not concise... feedback welcome

* scheduler: fix test that allowed overlapping allocs

* devices: only free devices when ClientStatus is terminal

* test: output nicer failure message if err==nil

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-09-13 12:52:47 -07:00
Tim Gross eb757606f3
changelog entry for variables (#14509) 2022-09-13 10:25:26 -04:00
Derek Strickland 5ca934015b
job_endpoint: check spec for all regions (#14519)
* job_endpoint: check spec for all regions
2022-09-12 09:24:26 -04:00
James Rasell 009948186b
changelog: add entry for #14320 (#14518) 2022-09-09 17:25:50 +02:00
James Rasell f51a8c73e6
deps: update armon/go-metrics to v0.4.1 (#14493) 2022-09-09 09:20:55 +02:00
Charlie Voiselle e58998e218
Add client scheduling eligibility to heartbeat (#14483) 2022-09-08 14:31:36 -04:00
Tim Gross 3fc7482ecd
CSI: failed allocation should not block its own controller unpublish (#14484)
A Nomad user reported problems with CSI volumes associated with failed
allocations, where the Nomad server did not send a controller unpublish RPC.

The controller unpublish is skipped if other non-terminal allocations on the
same node claim the volume. The check has a bug where the allocation belonging
to the claim being freed was included in the check incorrectly. During a normal
allocation stop for job stop or a new version of the job, the allocation is
terminal. But allocations that fail are not yet marked terminal at the point in
time when the client sends the unpublish RPC to the server.

For CSI plugins that support controller attach/detach, this means that the
controller will not be able to detach the volume from the allocation's host and
the replacement claim will fail until a GC is run. This changeset fixes the
conditional so that the claim's own allocation is not included, and makes the
logic easier to read. Include a test case covering this path.

Also includes two minor extra bugfixes:

* Entities we get from the state store should always be copied before
altering. Ensure that we copy the volume in the top-level unpublish workflow
before handing off to the steps.

* The list stub object for volumes in `nomad/structs` did not match the stub
object in `api`. The `api` package also did not include the current
readers/writers fields that are expected by the UI. True up the two objects and
add the previously undocumented fields to the docs.
2022-09-08 13:30:05 -04:00
Seth Hoenig a608e7950e helper: guard against negative inputs into random stagger
This PR modifies RandomStagger to protect against negative input
values. If the given interval is negative, the value returned will
be somewhere in the stratosphere. Instead, treat negative inputs
like zero, returning zero.
2022-09-08 09:17:48 -05:00
Michael Schurter 7ff0290f8b
docs: add quota panic fix changelog entry (#14485)
See https://github.com/hashicorp/nomad-enterprise/pull/839 for original
(Enterprise only)
2022-09-07 17:04:46 -07:00
Phil Renaud 52bb5de25a Changelog added and unused tests removed 2022-09-07 10:31:39 -04:00
Luiz Aoqui 358ba279d0
ui: remove extra space in menu footer (#14457) 2022-09-06 16:53:17 -04:00
James Rasell 813c5daa96
hcl2: add strlen function and update docs. (#14463) 2022-09-06 18:42:40 +02:00
Tim Gross 6ff59e71a5
cli: remove network from quota status output (#14468)
Network quotas were removed in Nomad 1.0.4. Remove the fields no longer in use
from the `quota status` output.
2022-09-06 09:37:16 -04:00
Kellen Fox 5086368a1e
Add a log line to help track node eligibility (#14125)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-09-06 14:03:33 +02:00
Yan 6e927fa125
warn destructive update only when count > 1 (#13103) 2022-09-02 15:30:06 -04:00
Giovani Avelar b5cf358212
[ui] Show a different message when there are no tasks in a job (#14071)
Different message when there are no tasks in a job
2022-09-02 15:20:45 -04:00
Tiernan 98022376be
Fix error handling in Client consulDiscoveryImpl (#14431)
Added a missing `continue` on non-nil error to avoid accidentally using a bad peer.
2022-09-02 15:13:03 -04:00
Luiz Aoqui 1ae26981a0
connect: interpolate task env in config values (#14445)
When configuring Consul Service Mesh, it's sometimes necessary to
provide dynamic values that are only known to Nomad at runtime. By
interpolating configuration values (in addition to configuration keys),
users are able to pass these dynamic values to Consul from their Nomad
jobs.
2022-09-02 15:00:28 -04:00
Tim Gross 7921f044e5
migrate autopilot implementation to raft-autopilot (#14441)
Nomad's original autopilot was importing from a private package in Consul. It
has been moved out to a shared library. Switch Nomad to use this library so that
we can eliminate the import of Consul, which is necessary to build Nomad ENT
with the current version of the Consul SDK. This also will let us pick up
autopilot improvements shared with Consul more easily.
2022-09-01 14:27:10 -04:00
Luiz Aoqui 94d7dddccd
cli: set -hcl2-strict to false if -hcl1 is defined (#14426)
These options are mutually exclusive but, since `-hcl2-strict` defaults
to `true`, users had to explicitly set it to `false` when using `-hcl1`.

Also return `255` when job plan fails validation as this is the expected 
code in this situation.
2022-09-01 10:42:08 -04:00
Luiz Aoqui 19de803503
cli: ignore VaultToken when generating job diff (#14424) 2022-09-01 10:01:53 -04:00
dependabot[bot] 9f8a3824c4
build(deps): bump github.com/hashicorp/go-version from 1.4.0 to 1.6.0 (#14364)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-09-01 11:55:42 +02:00
Luiz Aoqui 6f5d3e724f
changelog: add entry for #14374 (#14419) 2022-08-31 10:59:19 -04:00
Luiz Aoqui 27b253bc6e
changelog: add entry for #14381 (#14416) 2022-08-31 10:41:48 -04:00
Seth Hoenig 5d5c8af930 cgroups: refactor v2 kill path to use cgroups.kill interface file
This PR refactors the cgroups v2 group kill code path to use the
cgroups.kill interface file for destroying the cgroup. Previously
we copied the freeze + sigkill + unfreeze pattern from the v1 code,
but v2 provides a more efficient and more race-free way to handle
this.

Closes #14371
2022-08-29 14:55:13 -05:00
Michael Schurter dbffe22465
consul: allow stale namespace results (#12953)
Nomad reconciles services it expects to be registered in Consul with
what is actually registered in the local Consul agent. This is necessary
to prevent leaking service registrations if Nomad crashes at certain
points (or if there are bugs).

When Consul has namespaces enabled, we must iterate over each available
namespace to be sure no services were leaked into non-default
namespaces.

Since this reconciliation happens often, there's no need to require
results from the Consul leader server. In large clusters this creates
far more load than the "freshness" of the response is worth.

Therefore this patch switches the request to AllowStale=true
2022-08-26 16:05:12 -07:00
Vladimir Sokolov b646810401
cli: force periodic job if its id equals search prefix 2022-08-26 10:54:37 -04:00
dependabot[bot] 6d3389653b
build(deps): bump github.com/shirou/gopsutil/v3 from 3.21.12 to 3.22.7 (#14209)
* build(deps): bump github.com/shirou/gopsutil/v3 from 3.21.12 to 3.22.7

Bumps [github.com/shirou/gopsutil/v3](https://github.com/shirou/gopsutil) from 3.21.12 to 3.22.7.
- [Release notes](https://github.com/shirou/gopsutil/releases)
- [Commits](https://github.com/shirou/gopsutil/compare/v3.21.12...v3.22.7)

---
updated-dependencies:
- dependency-name: github.com/shirou/gopsutil/v3
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* changelog entry

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-08-25 14:15:41 -04:00
Seth Hoenig 51384dd63f client: refactor cpuset manager initialization
This PR refactors the code path in Client startup for setting up the cpuset
cgroup manager (non-linux systems not affected).

Before, there was a logic bug where we would try to read the cpuset.cpus.effective
cgroup interface file before ensuring nomad's parent cgroup existed. Therefore that
file would not exist, and the list of useable cpus would be empty. Tasks started
thereafter would not have a value set for their cpuset.cpus.

The refactoring fixes some less than ideal coding style. Instead we now bootstrap
each cpuset manager type (v1/v2) within its own constructor. If something goes
awry during bootstrap (e.g. cgroups not enabled), the constructor returns the
noop implementation and logs a warning.

Fixes #14229
2022-08-25 11:18:43 -05:00
Luiz Aoqui 31ab7964bd
ui: task lifecycle restart all tasks (#14223)
Now that tasks that have finished running can be restarted, the UI needs
to use the actual task state to determine which CSS class to use when
rendering the task lifecycle chart element.
2022-08-24 18:43:44 -04:00
Luiz Aoqui e012d9411e
Task lifecycle restart (#14127)
* allocrunner: handle lifecycle when all tasks die

When all tasks die the Coordinator must transition to its terminal
state, coordinatorStatePoststop, to unblock poststop tasks. Since this
could happen at any time (for example, a prestart task dies), all states
must be able to transition to this terminal state.

* allocrunner: implement different alloc restarts

Add a new alloc restart mode where all tasks are restarted, even if they
have already exited. Also unifies the alloc restart logic to use the
implementation that restarts tasks concurrently and ignores
ErrTaskNotRunning errors since those are expected when restarting the
allocation.

* allocrunner: allow tasks to run again

Prevent the task runner Run() method from exiting to allow a dead task
to run again. When the task runner is signaled to restart, the function
will jump back to the MAIN loop and run it again.

The task runner determines if a task needs to run again based on two new
task events that were added to differentiate between a request to
restart a specific task, the tasks that are currently running, or all
tasks that have already run.

* api/cli: add support for all tasks alloc restart

Implement the new -all-tasks alloc restart CLI flag and its API
counterpart, AllTasks. The client endpoint calls the appropriate restart
method from the allocrunner depending on the restart parameters used.

* test: fix tasklifecycle Coordinator test

* allocrunner: kill taskrunners if all tasks are dead

When all non-poststop tasks are dead we need to kill the taskrunners so
we don't leak their goroutines, which are blocked in the alloc restart
loop. This also ensures the allocrunner exits on its own.

* taskrunner: fix tests that waited on WaitCh

Now that "dead" tasks may run again, the taskrunner Run() method will
not return when the task finishes running, so tests must wait for the
task state to be "dead" instead of using the WaitCh, since it won't be
closed until the taskrunner is killed.

* tests: add tests for all tasks alloc restart

* changelog: add entry for #14127

* taskrunner: fix restore logic.

The first implementation of the task runner restore process relied on
server data (`tr.Alloc().TerminalStatus()`) which may not be available
to the client at the time of restore.

It also had the incorrect code path. When restoring a dead task the
driver handle always needs to be cleared cleanly using `clearDriverHandle`;
otherwise, after exiting the MAIN loop, the task may be killed by
`tr.handleKill`.

The fix is to store the state of the Run() loop in the task runner local
client state: if the task runner ever exits this loop cleanly (not with
a shutdown) it will never be able to run again. So if the Run() loop
starts with this local state flag set, it must exit early.

This local state flag is also being checked on task restart requests. If
the task is "dead" and its Run() loop is not active it will never be
able to run again.

* address code review requests

* apply more code review changes

* taskrunner: add different Restart modes

Using the task event to differentiate between the allocrunner restart
methods proved to be confusing for developers to understand how it all
worked.

So instead of relying on the event type, this commit separated the logic
of restarting a taskRunner into two methods:
- `Restart` will retain the current behaviour and will only restart
  the task if it's currently running.
- `ForceRestart` is the new method where a `dead` task is allowed to
  restart if its `Run()` method is still active. Callers will need to
  restart the allocRunner taskCoordinator to make sure it will allow the
  task to run again.

* minor fixes
2022-08-24 17:43:07 -04:00
Tim Gross c732b215f0
vault: detect namespace change in config reload (#14298)
The `namespace` field was not included in the equality check between old and new
Vault configurations, which meant that a Vault config change that only changed
the namespace would not be detected as a change and the clients would not be
reloaded.

Also, the comparison for boolean fields such as `enabled` and
`allow_unauthenticated` was on the pointer and not the value of that pointer,
which results in spurious reloads in real config reload that is easily missed in
typical test scenarios.

Includes a minor refactor of the order of fields for `Copy` and `Merge` to match
the struct fields in hopes it makes it harder to make this mistake in the
future, as well as additional test coverage.
2022-08-24 17:03:29 -04:00
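The pointer-versus-value comparison problem described in commit c732b215f0 can be shown with a small sketch. The `vaultConfig` struct here is a hypothetical two-field stand-in, not the real configuration type, and `boolPtrEqual` is a helper name chosen for the example.

```go
package main

import "fmt"

// vaultConfig is a trimmed-down, hypothetical stand-in for a config
// struct that stores optional booleans as pointers.
type vaultConfig struct {
	Enabled   *bool
	Namespace string
}

// boolPtrEqual compares the values behind two *bool fields.
// Two nil pointers are equal; a nil and a non-nil pointer are not.
func boolPtrEqual(a, b *bool) bool {
	if a == nil || b == nil {
		return a == b
	}
	return *a == *b
}

// equal reports whether two configs are equivalent, including the
// namespace, so that a namespace-only change is detected.
func (c vaultConfig) equal(o vaultConfig) bool {
	return boolPtrEqual(c.Enabled, o.Enabled) && c.Namespace == o.Namespace
}

func main() {
	t1, t2 := true, true
	oldCfg := vaultConfig{Enabled: &t1, Namespace: "ns1"}
	sameCfg := vaultConfig{Enabled: &t2, Namespace: "ns1"}
	movedCfg := vaultConfig{Enabled: &t2, Namespace: "ns2"}

	fmt.Println(oldCfg.Enabled == sameCfg.Enabled) // pointer identity
	fmt.Println(oldCfg.equal(sameCfg))             // value equality
	fmt.Println(oldCfg.equal(movedCfg))            // namespace differs
}
```

Comparing the pointers directly reports a difference whenever the two configs were parsed separately, even when both hold `true`, which is how spurious reloads can arise.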
Seth Hoenig 423ea1a5c4 client/logmon: acquire executable in init block
This PR causes the logmon task runner to acquire the binary of the
Nomad executable in an 'init' block, so as to almost certainly get
the name while the nomad file still exists.

This is an attempt at fixing the case where a deleted Nomad file
(e.g. during upgrade) may be getting renamed with a mysterious
suffix first.

If this doesn't work, as a last resort we can literally just trim
the mystery string.

Fixes: #14079
2022-08-24 13:17:20 -05:00
Piotr Kazmierczak 7077d1f9aa
template: custom change_mode scripts (#13972)
This PR adds the functionality of allowing custom scripts to be executed on template change. Resolves #2707
2022-08-24 17:43:01 +02:00
Luiz Aoqui 848f2dcc22
changelog: update #14212 to breaking-change (#14292) 2022-08-24 11:36:53 -04:00
Piotr Kazmierczak 077b6e7098
docs: Update upgrade guide to reflect enterprise changes introduced in nomad-enterprise (#14212)
This PR documents a change made in the enterprise version of nomad that addresses the following issue:

When a user tries to filter audit logs, they do so with a stanza that looks like the following:

audit {
  enabled = true

  filter "remove deletes" {
    type = "HTTPEvent"
    endpoints  = ["*"]
    stages = ["OperationComplete"]
    operations = ["DELETE"]
  }
}

When specifying both an "endpoint" and a "stage", the events with both a matching "endpoint" AND a matching "stage" will be filtered.

When specifying both an "endpoint" and an "operation", the events with both a matching "endpoint" AND a matching "operation" will be filtered.

When specifying both a "stage" and an "operation", the events with a matching "stage" OR a matching "operation" will be filtered.

The "OR" logic with stages and operations is unexpected and doesn't allow customers to get specific on which events they want to filter. For instance the following use-case is impossible to achieve: "I want to filter out all OperationReceived events that have the DELETE verb".
2022-08-24 16:31:49 +02:00
Seth Hoenig cfe9db0f66
build: set osusergo build tag by default (#14248)
This PR activates the osusergo build tag in GNUMakefile. This forces the os/user
package to be compiled without CGO. Doing so seems to resolve a race condition
in getpwnam_r that causes alloc creation to hang or panic on `user.Lookup("nobody")`.
2022-08-24 08:11:56 -05:00
Luiz Aoqui af5c01a070
ui: use task state to determine if task is active (#14224)
The current implementation uses the task's finishedAt field to determine
if a task is active or not, but this check is not accurate. A task in
the "pending" state will not have a finishedAt value but it's also not
active.

This discrepancy results in some components, like the inline stats chart
of the task row component, being displayed even when they shouldn't.
2022-08-23 15:50:40 -04:00
Tim Gross bf57d76ec7
allow ACL policies to be associated with workload identity (#14140)
The original design for workload identities and ACLs allows for operators to
extend the automatic capabilities of a workload by using a specially-named
policy. This has shown to be potentially unsafe because of naming collisions, so
instead we'll allow operators to explicitly attach a policy to a workload
identity.

This changeset adds workload identity fields to ACL policy objects and threads
that all the way down to the command line. It also adds a new secondary index to the
ACL policy table on namespace and job so that claim resolution can efficiently
query for related policies.
2022-08-22 16:41:21 -04:00
Luiz Aoqui dbffdca92e
template: use pointer values for gid and uid (#14203)
When a Nomad agent starts and loads jobs that already existed in the
cluster, the default template uid and gid was being set to 0, since this
is the zero value for int. This caused these jobs to fail in
environments where it was not possible to use 0, such as in Windows
clients.

In order to differentiate between an explicit 0 and a template where
these properties were not set we need to use a pointer.
2022-08-22 16:25:49 -04:00
Phil Renaud fcf2c40c60
[ui] Allocation route services table: show task-level services (#14199)
Adds service fragments to allocations and union taskGroup and task services
2022-08-22 11:45:12 -04:00
Derek Strickland 8dba52cee2
sentinel: add support for Nomad ACL Token and Namespace (#14171)
* sentinel: add ability to reference Nomad ACL Token and Namespace in Sentinel policies
2022-08-18 16:33:00 -04:00
Phil Renaud cbd4deedf8
[ui] general keyboard navigation: 1.3.4 release (#14138)
* Initialized keyboard service

Neat but funky: dynamic subnav traversal

👻

generalized traverseSubnav method

Shift as special modifier key

Nice little demo panel

Keyboard shortcuts keycard

Some animation styles on keyboard shortcuts

Handle situations where a link is deeply nested from its parent menu item

Keyboard service cleanup

helper-based initializer and teardown for new contextual commands

Keyboard shortcuts modal component added and demo-ghost removed

Removed j and k from subnav traversal

Register and unregister methods for subnav plus new subnavs for volumes and volume

register main nav method

Generalizing the register nav method

12762 table keynav (#12975)

* Experimental feature: shortcut visual hints

* Long way around to a custom modifier for keyboard shortcuts

* dynamic table and list iterative shortcuts

* Progress with regular old tether

* Delogging

* Table Keynav tether fix, server and client navs, and fix to shiftless on modified arrow keys

Go to Optimize keyboard link and storage key changed to g r

parameterized jobs keyboard nav

Dynamic numeric keynav for multiple tables (#13482)

* Multiple tables init

* URL-bind enumerable keyboard commands and add to more taskRow and allocationRows

* Type safety and lint fixes

* Consolidated push to keyCommands

* Default value when removing keyCommands

* Remove the URL-based removal method and perform a recompute on any add

Get tests passing in Keynav: remove math helpers and a few other defensive moves (#13761)

* Remove ember math helpers

* Test fixes for jobparts/body

* Kill an unneeded integration helper test

* delog

* Trying if disabling percy lets this finish

* Okay so its not percy; try parallelism in circle

* Percyless yet again

* Trying a different angle to not have percy

* Upgrade percy to 1.6.1

[ui] Keyboard nav: "u" key to go up a level (#13754)

* U to go up a level

* Mislabelled my conditional

* Custom lint ignore rule

* Custom lint ignore rule, this time with commas

* Since we're getting rid of ember math helpers elsewhere, do the math ourselves here

Replace ArrowLeft etc. with an ascii arrow (#13776)

* Replace ArrowLeft etc. with an ascii arrow

* non-mutative helper cleanup

Keyboard Nav: let users rebind their shortcuts (#13781)

* click-outside and shortcuts enabled/disabled toggle

* Trap focus when modal open

* Enabled/disabled saved to localStorage

* Autofocus edit button on variable index

* Modal overflow styles

* Functional rebind

* Saving rebinds to localStorage for all majors

* Started on defaultCommandBindings

* Modal header style and cancel rebind on escape

* keyboardable keybindings w buttons instead of spans

* recording and defaultvalues

* Enter short-circuits rebind

* Only some commands are rebindable, and dont show dupes

* No unused get import

* More visually distinct header on modal

* Disallowed keys for rebind, showing buffer as you type, and moving dedupe to modal logic

willDestroy hook to prevent tests from doubling/tripling up addEventListener on kb events

remove unused tests

Keyboard Navigation acceptance tests (#13893)

* Acceptance tests for keyboard modal

* a11y audit fix and localStorage clear

* Bind/rebind/localStorage tests

* Keyboard tests for dynamic nav and tables

* Rebinder and assert expectation

* Second percy snapshot showing hints no longer relevant

Weird issue where linktos with query props specifically from the task-groups page would fail to route / hit undefined.shouldSuperCede errors

Adds the concept of exclusivity to a keycommand, removing peers that also share its label

Lintfix

Changelog and PR feedback

Changelog and PR feedback

Fix to rebinding in firefox by blurring the now-disabled button on rebind (#14053)

* Secure Variables shortcuts removed

* Variable index route autofocus removed

* Updated changelog entry

* Updated changelog entry

* Keynav docs (#14148)

* Section added to the API Docs UI page

* Added a note about disabling

* Prev and Next order

* Remove dev log and unneeded comments
2022-08-17 12:59:33 -04:00
James Rasell fbc9f8b66c
changelog: add missing entry for #13539 (#14129) 2022-08-17 09:26:45 +02:00
Seth Hoenig 0a6497ee1f api: trim space of error response output 2022-08-16 15:00:38 -05:00
Seth Hoenig 91e32eec9b build: update to go1.19 2022-08-16 08:40:57 -05:00
Jai 81cac313c5
refact: add parent check to boolean (#14115)
* refact: add parent check to boolean

* chore:  add changelog entry
2022-08-15 13:42:08 -04:00
Seth Hoenig 4f72b0ed72 deps: add cl for fsouza/go-dockerclient 2022-08-15 09:59:49 -05:00
Seth Hoenig 077f46c74a
Merge pull request #14025 from hashicorp/dependabot/go_modules/go.etcd.io/bbolt-1.3.6
build(deps): bump go.etcd.io/bbolt from 1.3.5 to 1.3.6
2022-08-15 09:13:48 -05:00
Seth Hoenig 8a377ece7e deps: update cl for go.etcd.io/bbolt 2022-08-15 09:13:16 -05:00
Seth Hoenig 30d0e55ebb deps: update cl for grpc 2022-08-15 09:10:13 -05:00
Seth Hoenig 394aebfbd9
Merge pull request #14088 from hashicorp/b-plan-vault-token
cli: support vault token in plan command
2022-08-12 09:05:34 -05:00
Seth Hoenig a939245a27
docs: tweak changelog
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2022-08-12 08:59:58 -05:00
Seth Hoenig dc761aa7ec docker: create a docker task config setting for disable built-in healthcheck
This PR adds a docker driver task configuration setting for turning off
built-in HEALTHCHECK of a container.

References:
https://docs.docker.com/engine/reference/builder/#healthcheck
https://github.com/docker/engine-api/blob/master/types/container/config.go#L16

Closes #5310
Closes #14068
2022-08-11 10:33:48 -05:00
Seth Hoenig ba5c45ab93 cli: respect vault token in plan command
This PR fixes a regression where the 'job plan' command would not respect
a Vault token if set via --vault-token or $VAULT_TOKEN.

Basically the same bug/fix as for the validate command in https://github.com/hashicorp/nomad/issues/13062

Fixes https://github.com/hashicorp/nomad/issues/13939
2022-08-11 08:54:08 -05:00
Seth Hoenig 1901cfaba8
Merge pull request #14069 from brian-athinkingape/cli-fix-memstats-cgroupsv2
cli: for systems with cgroups v2, fix allocation resource utilization showing 0 memory used
2022-08-11 07:27:48 -05:00
Seth Hoenig 3aaaedf52e cli: forward request for job validation to nomad leader
This PR changes the behavior of 'nomad job validate' to forward the
request to the nomad leader, rather than responding from any server.

This is because we need the leader when validating Vault tokens, since
the leader is the only server with an active vault client.
2022-08-10 14:34:04 -05:00
Brian Chau 63b60ced2a Add changelog 14069 2022-08-09 14:16:34 -07:00
Luiz Aoqui 9affe31a0f
qemu: reduce monitor socket path (#13971)
The QEMU driver can take an optional `graceful_shutdown` configuration
which will create a Unix socket to send ACPI shutdown signal to the VM.

Unix sockets have a hard length limit and the driver implementation
assumed that QEMU versions 2.10.1 were able to handle longer paths. This
is not correct, the linked QEMU fix only changed the behaviour from
silently truncating longer socket paths to throwing an error.

By validating the socket path before starting the QEMU machine we can
provide users a more actionable and meaningful error message, and by
using a shorter socket file name we leave a bit more room for
user-defined values in the path, such as the task name.

The maximum length allowed is also platform-dependent, so validation
needs to be different for each OS.
2022-08-04 12:10:35 -04:00
Charles Z 7a8ec90fbe
allow unhealthy canaries without blocking autopromote (#14001) 2022-08-04 11:53:50 -04:00
Luiz Aoqui 2c0fea64e9
qemu: restore monitor socket path (#14000)
When a QEMU task is recovered the monitor socket path was not being
restored into the task handler, so the `graceful_shutdown` configuration
was effectively ignored if the client restarted.
2022-08-04 10:44:08 -04:00
Derek Strickland 77df9c133b
Add Nomad RetryConfig to agent template config (#13907)
* add Nomad RetryConfig to agent template config
2022-08-03 16:56:30 -04:00
Phil Renaud e58a95ed2f
New variable creation adds the first namespace in your available list at variable creation time (#13991)
* New variable creation adds the first namespace in your available list at variable creation time

* Changelog
2022-08-03 15:09:25 -04:00
Seth Hoenig e2309754de cl: update cl for 13670 2022-08-03 13:18:09 -05:00
Piotr Kazmierczak 530280505f
client: enable specifying user/group permissions in the template stanza (#13755)
* Adds Uid/Gid parameters to template.

* Updated diff_test

* fixed order

* update jobspec and api

* removed obsolete code

* helper functions for jobspec parse test

* updated documentation

* adjusted API jobs test.

* propagate uid/gid setting to job_endpoint

* adjusted job_endpoint tests

* making uid/gid into pointers

* refactor

* updated documentation

* updated documentation

* Update client/allocrunner/taskrunner/template/template_test.go

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* Update website/content/api-docs/json-jobs.mdx

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

* propagating documentation change from Luiz

* formatting

* changelog entry

* changed changelog entry

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-08-02 22:15:38 +02:00
Eric Weber cbce13c1ac
Add stage_publish_base_dir field to csi_plugin stanza of a job (#13919)
* Allow specification of CSI staging and publishing directory path
* Add website documentation for stage_publish_dir
* Replace erroneous reference to csi_plugin.mount_config with csi_plugin.mount_dir
* Avoid requiring CSI plugins to be redeployed after introducing StagePublishDir
2022-08-02 09:42:44 -04:00
Luiz Aoqui 6c31a51919
changelog: add entry for #13865 and #13866 (#13901) 2022-07-22 15:19:33 -04:00
Seth Hoenig 2f20a75d38 cl: add cl about removing lib/darwin library 2022-07-22 14:02:58 -05:00
Tim Gross c7a11a86c6
block deleting namespaces if the namespace contains a volume (#13880)
When we delete a namespace, we check to ensure that there are no non-terminal
jobs, which effectively covers evals, allocs, etc. CSI volumes are also
namespaced, so extend this check to cover CSI volumes.
2022-07-21 16:13:52 -04:00
Seth Hoenig c61e779b48
Merge pull request #13715 from hashicorp/dev-nsd-checks
client: add support for checks in nomad services
2022-07-21 10:22:57 -05:00
Seth Hoenig 606e3ebdd4 client: updates from pr feedback 2022-07-21 09:54:27 -05:00
Seth Hoenig 8e6eeaa37e
Merge pull request #13869 from hashicorp/b-uniq-services-2
servicedisco: ensure service uniqueness in job validation
2022-07-21 08:24:24 -05:00
Will Jordan 5354409b1a
Return 429 response on HTTP max connection limit (#13621)
Return 429 response on HTTP max connection limit. Instead of silently closing
the connection, return a `429 Too Many Requests` HTTP response with a helpful
error message to aid debugging when the connection limit is unintentionally
reached.

Set a 10-millisecond write timeout and rate limiter for connection-limit 429
response to prevent writing the HTTP response from consuming too many server
resources.

Add `nomad.agent.http.exceeded` metric counting the number of HTTP connections
exceeding concurrency limit.
2022-07-20 14:12:21 -04:00
Seth Hoenig e5978a9cbf jobspec: ensure service uniqueness in job validation 2022-07-20 12:38:08 -05:00
Tim Gross cfa2cb140e
fsm: one-time token expiration should be deterministic (#13737)
When applying a raft log to expire ACL tokens, we need to use a
timestamp provided by the leader so that the result is deterministic
across servers. Use leader's timestamp from RPC call
2022-07-18 14:19:29 -04:00
Seth Hoenig c23da281a1 metrics: even classless blocked evals get metrics
This PR fixes a bug where blocked evaluations with no class set would
not have metrics exported at the dc:class scope.

Fixes #13759
2022-07-15 14:12:44 -05:00
Luiz Aoqui b656981cf0
Track plan rejection history and automatically mark clients as ineligible (#13421)
Plan rejections occur when the scheduler worker and the leader plan
applier disagree on the feasibility of a plan. This may happen for valid
reasons: since Nomad does parallel scheduling, it is expected that
different workers will have a different state when computing placements.

As the final plan reaches the leader plan applier, it may no longer be
valid due to a concurrent scheduling taking up intended resources. In
these situations the plan applier will notify the worker that the plan
was rejected and that they should refresh their state before trying
again.

In some rare and unexpected circumstances it has been observed that
workers will repeatedly submit the same plan, even if they are always
rejected.

While the root cause is still unknown this mitigation has been put in
place. The plan applier will now track the history of plan rejections
per client and include in the plan result a list of node IDs that should
be set as ineligible if the number of rejections in a given time window
crosses a certain threshold. The window size and threshold value can be
adjusted in the server configuration.

To avoid marking several nodes as ineligible at once, the operation is rate
limited to 5 nodes every 30min, with an initial burst of 10 operations.
2022-07-12 18:40:20 -04:00
Michael Schurter 3e50f72fad
core: merge reserved_ports into host_networks (#13651)
Fixes #13505

This fixes #13505 by treating reserved_ports like we treat a lot of jobspec settings: merging settings from more global stanzas (client.reserved.reserved_ports) "down" into more specific stanzas (client.host_networks[].reserved_ports).

As discussed in #13505 there are other options, and since it's totally broken right now we have some flexibility:

Treat overlapping reserved_ports on addresses as invalid and refuse to start agents. However, I'm not sure there's a cohesive model we want to publish right now since so much 0.9-0.12 compat code still exists! We would have to explain to folks that if their -network-interface and host_network addresses overlapped, they could only specify reserved_ports in one place or the other?! It gets ugly.
Use the global client.reserved.reserved_ports value as the default and treat host_network[].reserved_ports as overrides. My first suggestion in the issue, but @groggemans made me realize the addresses on the agent's interface (as configured by -network-interface) may overlap with host_networks, so you'd need to remove the global reserved_ports from addresses shared with a shared network?! This seemed really confusing and subtle for users to me.
So I think "merging down" creates the most expressive yet understandable approach. I've played around with it a bit, and it doesn't seem too surprising. The only frustrating part is how difficult it is to observe the available addresses and ports on a node! However that's a job for another PR.
2022-07-12 14:40:25 -07:00
Phil Renaud 59c12fc758
Remove namespace cache (#13679) 2022-07-11 18:06:18 -04:00
Phil Renaud e9219a1ae0
Allow wildcard for Evaluations API (#13530)
* Failing test and TODO for wildcard

* Alias the namespace query parameter for Evals

* eval: fix list when using ACLs and * namespace

Apply the same verification process as in job, allocs and scaling
policy list endpoints to handle the eval list when using an ACL token
with limited namespace support but querying using the `*` wildcard
namespace.

* changelog: add entry for #13530

* ui: set namespace when querying eval

Evals have a unique UUID as ID, but when querying them the Nomad API
still expects a namespace query param, otherwise it assumes `default`.

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-07-11 16:42:17 -04:00
Luiz Aoqui 674c0ae08b
changelog: add entry for #13659 (#13691) 2022-07-11 16:07:33 -04:00
Tim Gross b6dd1191b2
snapshot restore-from-archive streaming and filtering (#13658)
Stream snapshot to FSM when restoring from archive
The `RestoreFromArchive` helper decompresses the snapshot archive to a
temporary file before reading it into the FSM. For large snapshots
this performs a lot of disk IO. Stream decompress the snapshot as we
read it, without first writing to a temporary file.

Add bexpr filters to the `RestoreFromArchive` helper.
The operator can pass these as `-filter` arguments to `nomad operator
snapshot state` (and other commands in the future) to include only
desired data when reading the snapshot.
2022-07-11 10:48:00 -04:00
James Rasell 9eb63c9e03
cli: ensure node status and drain use correct cmd name. (#13656) 2022-07-11 09:50:42 +02:00
Seth Hoenig 239eaf9a29
Merge pull request #13626 from hashicorp/b-client-max-kill-timeout
client: enforce max_kill_timeout client configuration
2022-07-07 13:44:39 -05:00
Luiz Aoqui 85908415f9
state: fix eval list by prefix with * namespace (#13551) 2022-07-07 14:21:51 -04:00
Luiz Aoqui 03433dd8af
cli: improve output of eval commands (#13581)
Use the same output format when listing multiple evals in the `eval
list` command and when `eval status <prefix>` matches more than one
eval.

Include the eval namespace in all output formats and always include the
job ID in `eval status` since even `node-update` evals are related to a
job.

Add Node ID to the evals table output to help differentiate
`node-update` evals.

Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-07-07 13:13:34 -04:00
Ted Behling 6a032a54d2
driver/docker: Don't pull InfraImage if it exists (#13265)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2022-07-07 17:44:06 +02:00
Michael Schurter f21272065d
core: emit node evals only for sys jobs in dc (#12955)
Whenever a node joins the cluster, either for the first time or after
being `down`, we emit an evaluation for every system job to ensure all
applicable system jobs are running on the node.

This patch adds an optimization to skip creating evaluations for system
jobs not in the current node's DC. While the scheduler performs the same
feasibility check, skipping the creation of the evaluation altogether
saves disk, network, and memory.
2022-07-06 14:35:18 -07:00
Seth Hoenig 5dd8aa3e27 client: enforce max_kill_timeout client configuration
This PR fixes a bug where client configuration max_kill_timeout was
not being enforced. The feature was introduced in 9f44780 but seems
to have been removed during the major drivers refactoring.

We can make sure the value is enforced by plumbing it through the DriverHandler,
which now uses the lesser of the task.killTimeout or client.maxKillTimeout.
Also updates Event.SetKillTimeout to require both the task.killTimeout and
client.maxKillTimeout so that we don't make the mistake of using the wrong
value - as it was being given only the task.killTimeout before.
2022-07-06 15:29:38 -05:00
Luiz Aoqui a9a66ad018
api: apply new ACL check for wildcard namespace (#13608)
api: apply new ACL check for wildcard namespace

In #13606 the ACL check was refactored to better support the all
namespaces wildcard (`*`). This commit applies the changes to the jobs
and alloc list endpoints.
2022-07-06 16:17:16 -04:00
Tim Gross 1fc8995590
query for leader in operator debug command (#13472)
The `operator debug` command doesn't output the leader anywhere in the
output, which adds extra burden to offline debugging (away from an
ongoing incident where you can simply check manually). Query the
`/v1/status/leader` API but degrade gracefully.
2022-07-06 10:57:44 -04:00
James Rasell 0c0b028a59
core: allow deleting of evaluations (#13492)
* core: add eval delete RPC and core functionality.

* agent: add eval delete HTTP endpoint.

* api: add eval delete API functionality.

* cli: add eval delete command.

* docs: add eval delete website documentation.
2022-07-06 16:30:11 +02:00
James Rasell 181b247384
core: allow pausing and un-pausing of leader broker routine (#13045)
* core: allow pause/un-pause of eval broker on region leader.

* agent: add ability to pause eval broker via scheduler config.

* cli: add operator scheduler commands to interact with config.

* api: add ability to pause eval broker via scheduler config

* e2e: add operator scheduler test for eval broker pause.

* docs: include new operator scheduler CLI and pause eval API info.
2022-07-06 16:13:48 +02:00
Phil Renaud 84a59ff059
[ui] Fix a bug where redirects after planning/editing a job didn't include namespace (#13588)
* Job editing and planning handles namespace as part of ID instead of queryParam

* Changelog added

* Tests updated to reflect new namespace redirects
2022-07-05 15:58:56 -04:00
Seth Hoenig 97726c2fd8
Merge pull request #12862 from hashicorp/f-choose-services
api: enable selecting subset of services using rendezvous hashing
2022-06-30 15:17:40 -05:00
Seth Hoenig 0048c59f1a
cl: fixup changelog comment
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2022-06-30 15:10:47 -05:00
James Rasell 3ecffaf36b
deps: update github.com/hashicorp/go-discover to latest. (#13491) 2022-06-28 10:28:32 +02:00
James Rasell d080eed9ae
client: fixed a problem calculating a service namespace. (#13493)
When calculating a service's namespace for registration, the code
assumed the first task within the task array would include a
service block. This is incorrect as it is possible only a later
task within the array contains a service definition.

This change fixes the logic, so we correctly search for a service
definition before identifying the namespace.
2022-06-28 09:47:28 +02:00
Seth Hoenig 9467bc9eb3 api: enable selecting subset of services using rendezvous hashing
This PR adds the 'choose' query parameter to the '/v1/service/<service>' endpoint.

The value of 'choose' is in the form '<number>|<key>', number is the number
of desired services and key is a value unique but consistent to the requester
(e.g. allocID).

Folks aren't really expected to use this API directly, but rather through consul-template
which will soon be getting a new helper function making use of this query parameter.

Example,

curl 'localhost:4646/v1/service/redis?choose=2|abc123'

Note: consul-template v0.29.1 includes the necessary nomadServices functionality.
2022-06-25 10:37:37 -05:00
Phil Renaud 2e6e95e78c
[ui] Reinstate Meta and Payload sections to Parameterized Child Jobs (#13473)
* Shift meta off job.definition and decodedPayload alias to passed arg

* Changelog
2022-06-24 15:03:08 -04:00
Seth Hoenig b7a8318eac
Merge pull request #13467 from hashicorp/f-purge-raft-v2
core: remove support for raft protocol version 2
2022-06-24 10:10:26 -05:00
Tim Gross 4368dcc02f
fix deadlock in plan_apply (#13407)
The plan applier has to get a snapshot with a minimum index for the
plan it's working on in order to ensure consistency. Under heavy raft
loads, we can exceed the timeout. When this happens, we hit a bug
where the plan applier blocks waiting on the `indexCh` forever, and
all schedulers will block in `Plan.Submit`.

Closing the `indexCh` when the `asyncPlanWait` is done with it will
prevent the deadlock without impacting correctness of the previous
snapshot index.

This changeset includes a PoC failing test that works by injecting
a large timeout into the state store. We need to turn this into a test
we can run normally without breaking the state store before we can
merge this PR.

Increase `snapshotMinIndex` timeout to 10s.
This timeout creates backpressure where any concurrent `Plan.Submit`
RPCs will block waiting for results. This sheds load across all
servers and gives raft some CPU to catch up, because schedulers won't
dequeue more work while waiting. Increase it to 10s based on
observations of large production clusters.
2022-06-23 12:06:27 -04:00
Seth Hoenig 91e08d5e23 core: remove support for raft protocol version 2
This PR checks server config for raft_protocol, which must now
be set to 3 or unset (0). When unset, version 3 is used as the
default.
2022-06-23 14:37:50 +00:00
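The raft_protocol check described above reduces to a small validation step. A minimal sketch, where `resolveRaftProtocol` and its error text are hypothetical and not copied from the Nomad source:

```go
package main

import "fmt"

// resolveRaftProtocol accepts 3 or unset (0) and always resolves to 3.
// Any other value is rejected.
func resolveRaftProtocol(configured int) (int, error) {
	switch configured {
	case 0, 3:
		return 3, nil
	default:
		return 0, fmt.Errorf("raft_protocol must be 3 or unset, got %d", configured)
	}
}

func main() {
	for _, v := range []int{0, 3, 2} {
		version, err := resolveRaftProtocol(v)
		fmt.Println(v, version, err)
	}
}
```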
Derek Strickland 7d6a3df197
csi_hook: valid if any driver supports csi (#13446)
* csi_hook: valid if any driver supports csi volumes
2022-06-22 10:43:43 -04:00
Derek Strickland 9de4d7367c
cli: fix detach handling (#13405)
Fix detach handling for:

- `deployment fail`
- `deployment promote`
- `deployment resume`
- `deployment unblock`
- `job promote`
2022-06-21 06:01:23 -04:00
Jeffrey Clark a97699221c
cni: add loopback to linux bridge (#13428)
CNI changed how to bring up the interface in v0.2.0.
Support was moved to a new loopback plugin.

https://github.com/containernetworking/cni/pull/121

Fixes #10014
2022-06-20 11:22:53 -04:00
James Rasell f1f7c5040b
api: added sysbatch job type constant to match other schedulers. (#13359) 2022-06-16 11:53:04 +02:00
Joseph Martin 4aa96d5bfc
Return evalID if -detach flag is passed to job revert (#13364)
* Return evalID if `-detach` flag is passed to job revert
2022-06-15 14:20:29 -04:00
Tim Gross 12d87c040c
fixup changelog entry for backported regression fix (#13370)
The changelog entry for #13340 indicated it was an improvement. But on
discussion, it was determined that this was a workaround for a
regression. Update the changelog to make this clear.
2022-06-14 14:33:39 -04:00
Grant Griffiths 99896da443
CSI: make plugin health_timeout configurable in csi_plugin stanza (#13340)
Signed-off-by: Grant Griffiths <ggriffiths@purestorage.com>
2022-06-14 10:04:16 -04:00
Daniel Rossbach 8c52c03c8c
qemu driver: Add option to configure drive_interface (#11864) 2022-06-10 10:03:51 -04:00
Luiz Aoqui e8b788b372
changelog: add entry for #12961 (#13318) 2022-06-10 09:04:00 -04:00
Tim Gross 9d5523a72d
CSI: skip node unpublish on GC'd or down nodes (#13301)
If the node has been GC'd or is down, we can't send it a node
unpublish. The CSI spec requires that we don't send the controller
unpublish before the node unpublish, but in the case where a node is
gone we can't know the final fate of the node unpublish step.

The `csi_hook` on the client will unpublish if the allocation has
stopped and if the host is terminated there's no mount for the volume
anyways. So we'll now assume that the node has unpublished at its
end. If it hasn't, any controller unpublish will potentially hang or
error and need to be retried.
2022-06-09 11:33:22 -04:00
phreakocious 94a78597d2
Add guest_agent config option for QEMU driver (#12800)
Add boolean 'guest_agent' config option for QEMU driver, which will
create the socket file for the QEMU Guest Agent in the task dir when
enabled.
2022-06-09 09:21:38 -04:00
Derek Strickland 13ea5ae87a
consul-template: Add fault tolerant defaults (#13041)
consul-template: Add fault tolerant defaults

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-06-08 14:08:25 -04:00
Luiz Aoqui 2e0bffba90
changelog: add entry for #12925 (#13250) 2022-06-08 10:14:33 -04:00
Tim Gross 8ff5ea1bee
CSI: no early return when feasibility check fails on eligible nodes (#13274)
As a performance optimization in the scheduler, feasibility checks
that apply to an entire class are only checked once for all nodes of
that class. Other feasibility checks are "available" checks because
they rely on more ephemeral characteristics and don't contribute to
the hash for the node class. This currently includes only CSI.

We have a separate fast path for "available" checks when the node has
already been marked eligible on the basis of class. This fast path has
a bug where it returns early rather than continuing the loop. This
causes the entire task group to be rejected.

Fix the bug by not returning early in the fast path and instead jump
to the top of the loop like all the other code paths in this method.
Includes a new test exercising topology at whole-scheduler level and a
fix for an existing test that should've caught this previously.
2022-06-07 13:31:10 -04:00
Derek Strickland 12f3ee46ea
alloc_runner: stop sidecar tasks last (#13055)
alloc_runner: stop sidecar tasks last
2022-06-07 11:35:19 -04:00
Tim Gross 81c70f4973
changelog entry for #12534 (#13260) 2022-06-06 16:19:17 -04:00
Conor Evans 86116a7607
add filebase64 function (#11791)
Signed-off-by: Conor Evans <coevans@tcd.ie>
2022-06-06 11:58:17 -04:00
Lance Haig 4bf27d743d
Allow Operator Generated bootstrap token (#12520) 2022-06-03 07:37:24 -04:00
Huan Wang 7d15157635
adding support for customized ingress tls (#13184) 2022-06-02 18:43:58 -04:00
Seth Hoenig 45e8748658
Merge pull request #13205 from hashicorp/b-batch-preempt2
core: reschedule evicted batch job when resources become available
2022-06-02 16:32:01 -05:00
Shantanu Gadgil 6cb8c95534
fingerprint kernel architecture name (#13182) 2022-06-02 15:51:00 -04:00
Seth Hoenig 0692190e12 core: reschedule evicted batch job when resources become available
This PR fixes a bug where an evicted batch job would not be rescheduled
once resources become available.

Closes #9890
2022-06-02 14:04:13 -05:00
Seth Hoenig 54efec5dfe docs: add docs and tests for tagged_addresses 2022-05-31 13:02:48 -05:00
Seth Hoenig 4631045d83 connect: enable setting connect upstream destination namespace 2022-05-26 09:39:36 -05:00