Commit Graph

548 Commits

Author SHA1 Message Date
Seth Hoenig f1b902beac
consul: do not re-register already registered services (#14917)
This PR updates Nomad's Consul service client to do map comparisons
using maps.Equal instead of reflect.DeepEqual. The bug fix is in how
DeepEqual treats nil slices different from empty slices, when actually
they should be treated the same.
2022-10-18 08:10:59 -05:00
Tim Gross 3c78980b78
make version checks specific to region (1.4.x) (#14912)
* One-time tokens are not replicated between regions, so we don't want to enforce
  that the version check across all of serf, just members in the same region.
* Scheduler: Disconnected clients handling is specific to a single region, so we
  don't want to enforce that the version check across all of serf, just members in
  the same region.
* Variables: enforce version check in Apply RPC
* Cleans up a bunch of legacy checks.

This changeset is specific to 1.4.x and the changes for previous versions of
Nomad will be manually backported in a separate PR.
2022-10-17 16:23:51 -04:00
Tim Gross c721ce618e
keyring: filter by region before checking version (#14901)
In #14821 we fixed a panic that can happen if a leadership election happens in
the middle of an upgrade. That fix checks that all servers are at the minimum
version before initializing the keyring (which blocks evaluation processing
during trhe upgrade). But the check we implemented is over the serf membership,
which includes servers in any federated regions, which don't necessarily have
the same upgrade cycle.

Filter the version check by the leader's region.

Also bump up log levels of major keyring operations
2022-10-17 13:21:16 -04:00
Tim Gross bcd26f8815
docker_logger: reorder imports to save memory (#14875)
Nomad runs one logmon process and also one docker_logger process for each
running allocation. A naive look at memory usage shows 10-30 MB of RSS, but a
closer look shows that most of this memory (ex. all but ~2MB for logmon) is
shared (`Shared_Clean` in Linux pmap).

But a heap dump of docker_logger shows that it currently has an extra ~2500 KiB
of heap (anonymously-mapped unshared memory) used for init blocks coming from
the agent code (ex. mostly regexes from go-version, structs, and the Consul
SDK). The packages for running logmon, docker_logger, and executor have an init
block that parses `os.Args` to drop into their own logic, which prevents them
from loading all the rest of the agent code and saves on memory, so this was
unexpected.

It looks like we accidentally reordered the imports in main to undo some of the
work originally done in 404d2d4c98f1df930be1ae9852fe6e6ae8c1517e. This changeset
restores the ordering. A follow-up heap dump shows this saves ~2MB of unshared
RSS per docker_logger process.
2022-10-11 13:23:03 -04:00
Seth Hoenig 1593963cd1
servicedisco: implicit constraint for nomad v1.4 when using nsd checks (#14868)
This PR adds a jobspec mutator to constrain jobs making use of checks
in the nomad service provider to nomad clients of at least v1.4.0.

Before, in a mixed client version cluster it was possible to submit
an NSD job making use of checks and for that job to land on an older,
incompatible client node.

Closes #14862
2022-10-11 08:21:42 -05:00
Seth Hoenig 69ced2a2bd
services: remove assertion on 'task' field being set (#14864)
This PR removes the assertion around when the 'task' field of
a check may be set. Starting in Nomad 1.4 we automatically set
the task field on all checks in support of the NSD checks feature.

This is causing validation problems elsewhere, e.g. when a group
service using the Consul provider sets 'task' it will fail
validation that worked previously.

The assertion of leaving 'task' unset was only about making sure
job submitters weren't expecting some behavior, but in practice
is causing bugs now that we need the task field for more than it
was originally added for.

We can simply update the docs, noting when the task field set by
job submitters actually has value.
2022-10-10 13:02:33 -05:00
Phil Renaud e771b94164
[ui] Makes service tags wrap and look like tag items (#14834)
* Makes service tags wrap and look like tag items

* Add a little vertical spacing and changelog

* Put client before tags

* Force tags list to new line
2022-10-07 09:23:52 -04:00
Damian Czaja 95f969c4bf
cli: add `nomad fmt` (#14779) 2022-10-06 17:00:29 -04:00
Phil Renaud 4b93a30225
[ui] Line charts: explicitly update X-axis whenever xScale changes (#14814)
* Explicitly update X-axis whenever xScale changes

* Changelog
2022-10-06 16:59:16 -04:00
Hemanth Krishna e516fc266f
enhancement: UpdateTask when Task is waiting for ShutdownDelay (#14775)
Signed-off-by: Hemanth Krishna <hkpdev008@gmail.com>
2022-10-06 16:33:28 -04:00
Will Jordan 8ae13208c9
Allow jobs not requiring any network resources (#14300)
Jobs not requiring any network resources should be allowed
even when the network fingerprinter is disabled.
2022-10-06 16:25:41 -04:00
Gabriel Villalonga Simon b974c32ba6
Check that JobPlanResponse Diff Type is None before checking for changes on getExitCode (#14492) 2022-10-06 16:23:22 -04:00
Pablo Ruiz García 40416be7b1
Invoke FingerprintManager's Reload() func during agent's SIGHUP (#14615)
Fixes #14614
2022-10-06 16:22:59 -04:00
Giovani Avelar a625de2062
Allow specification of a custom job name/prefix for parameterized jobs (#14631) 2022-10-06 16:21:40 -04:00
Tim Gross 80ec5e1346
fix panic from keyring raft entries being written during upgrade (#14821)
During an upgrade to Nomad 1.4.0, if a server running 1.4.0 becomes the leader
before one of the 1.3.x servers, the old server will crash because the keyring
is initialized and writes a raft entry.

Wait until all members are on a version that supports the keyring before
initializing it.
2022-10-06 12:47:02 -04:00
Luiz Aoqui b924802958
template: apply splay value on change_mode script (#14749)
Previously, the splay timeout was only applied if a template re-render
caused a restart or a signal action. The `change_mode = "script"` was
running after the `if restart || len(signals) != 0` check, so it was
invoked at all times.

This change refactors the logic so it's easier to notice that new
`change_mode` options should start only after `splay` is applied.
2022-09-30 12:04:22 -04:00
Seth Hoenig c68ed3b4c8
client: protect user lookups with global lock (#14742)
* client: protect user lookups with global lock

This PR updates Nomad client to always do user lookups while holding
a global process lock. This is to prevent concurrency unsafe implementations
of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo).

* cl: add cl
2022-09-29 09:30:13 -05:00
Derek Strickland 4c73a3b1dc
Remove changelog entry for test update PR 2022-09-27 18:17:49 -04:00
Derek Strickland 52e4997ace
Add enterprise tag 2022-09-27 17:50:25 -04:00
Derek Strickland ef0f8c5b81
Add enterprise tag 2022-09-27 17:49:27 -04:00
Derek Strickland 6738684167
Delete 14665.txt 2022-09-27 17:47:35 -04:00
Derek Strickland 87bdb74221
Remove bug fix changelog files 2022-09-27 17:46:32 -04:00
Derek Strickland cacf4bb8e1
Fix changelog entry type 2022-09-27 14:33:39 -04:00
Jim Razmus II 7da3fd050b
jobspec: allow artifact headers in HCLv1 (#14637)
* jobspec: allow artifact headers in HCLv1

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-09-27 12:18:49 -04:00
Seth Hoenig 5df5e70542
core: numeric operands comparisons in constraints (#14722)
* cleanup: fixup linter warnings in schedular/feasible.go

* core: numeric operands comparisons in constraints

This PR changes constraint comparisons to be numeric rather than
lexical if both operands are integers or floats.

Inspiration #4856
Closes #4729
Closes #14719

* fix: always parse as int64
2022-09-27 11:07:07 -05:00
Tim Gross 87681fca68
CSI: ensure initial unpublish state is checkpointed (#14675)
A test flake revealed a bug in the CSI unpublish workflow, where an unpublish
that comes from a client that's successfully done the node-unpublish step will
not have the claim checkpointed if the controller-unpublish step fails. This
will result in a delay in releasing the volume claim until the next GC.

This changeset also ensures we're using a new snapshot after each write to raft,
and fixes two timing issues in test where either the volume watcher can
unpublish before the unpublish RPC is sent or we don't wait long enough in
resource-restricted environements like GHA.
2022-09-27 08:43:45 -04:00
Michael Schurter e6af1c0a14
fingerprint: add node attr for reserverable cores (#14694)
* fingerprint: add node attr for reserverable cores

Add an attribute for the number of reservable CPU cores as they may
differ from the existing `cpu.numcores` due to client configuration or
OS support.

Hopefully clarifies some confusion in #14676

* add changelog

* num_reservable_cores -> reservablecores
2022-09-26 13:03:03 -07:00
Luiz Aoqui 5c100c0d3d
client: recover from getter panics (#14696)
The artifact getter uses the go-getter library to fetch files from
different sources. Any bug in this library that results in a panic can
cause the entire Nomad client to crash due to a single file download
attempt.

This change aims to guard against this types of crashes by recovering
from panics when the getter attempts to download an artifact. The
resulting panic is converted to an error that is stored as a task event
for operator visibility and the panic stack trace is logged to the
client's log.
2022-09-26 15:16:26 -04:00
Luiz Aoqui f7c6534a79
cli: set content length on `operator api` requests (#14634)
http.NewRequestWithContext will only set the right value for
Content-Length if the input is *bytes.Buffer, *bytes.Reader, or
*strings.Reader [0].

Since os.Stdin is an os.File, POST requests made with the `nomad
operator api` command would always have Content-Length set to -1, which
is interpreted as an unknown length by web servers.

[0]: https://pkg.go.dev/net/http#NewRequestWithContext
2022-09-26 14:21:40 -04:00
Phil Renaud 497bd02169
[ui] Warn users when they leave an edited but unsaved variable page (#14665)
* Warning on attempt to leave

* Lintfix

* Only router.off once

* Dont warn on transition when only updating queryparams

* Remove double-push and queryparam-only issues, thanks @lgfa29

* Acceptance tests

* Changelog
2022-09-23 16:53:40 -04:00
Phil Renaud a28e1bcc1e
[ui] Service Healthchecks: styles for pseudo-timestamp axis (#14677)
* Styles for pseudo-timestamp axis

* Changelog
2022-09-23 16:53:28 -04:00
Tim Gross 17aee4d69c
fingerprint: don't clear Consul/Vault attributes on failure (#14673)
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint and result in evaluations for all its jobs and all the system jobs
in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evalution for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.

This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
2022-09-23 14:45:12 -04:00
Derek Strickland 6874997f91
scheduler: Fix bug where the would treat multiregion jobs as paused for job types that don't use deployments (#14659)
* scheduler: Fix bug where the scheduler would treat multiregion jobs as paused for job types that don't use deployments

Co-authored-by: Tim Gross <tgross@hashicorp.com>

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2022-09-22 14:31:27 -04:00
Jorge Marey 92158a1c62
connect: add nomad env to envoy bootstrap (#12959)
* Add nomad env to envoy bootstrap

* Add changelog file
2022-09-22 13:18:18 -05:00
Phil Renaud eca0e7bf56
[ui] task logs in sidebar (#14612)
* button styles

* Further styles including global toggle adjustment

* sidebar funcs and header

* Functioning task logs in high-level sidebars

* same-lineify the show tasks toggle

* Changelog

* Full-height sidebar calc in css, plz drop soon container queries

* Active status and query params for allocations page

* Reactive shouldShowLogs getter and added to client and task group pages

* Higher order func passing, thanks @DingoEatingFuzz

* Non-service job types get allocation params passed

* Keyframe animation for task log sidebar

* Acceptance test

* A few more sub-row tests

* Lintfix
2022-09-22 10:58:52 -04:00
Tim Gross c29c4bd66c
cli: remove deprecated `eval status -json` list behavior (#14651)
In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and
deprecated the usage of `eval status` without an evaluation ID with an upgrade
note that it would be removed in Nomad 1.4.0. This changeset completes that
work.
2022-09-22 10:56:32 -04:00
Jorge Marey 584ddfe859
Add Namespace, Job and Group to envoy stats (#14311) 2022-09-22 10:38:21 -04:00
Tim Gross d327a68696
operator debug: write NDJSON for large collections (#14610)
The `operator debug` command writes JSON files from API responses as a single
line containing an array of JSON objects. But some of these files can be
extremely large (GB's) for large production clusters, which makes it difficult
to parse them using typical line-oriented Unix command line tools that can
stream their inputs without consuming a lot of memory.

For collections that are typically large, instead emit newline-delimited JSON.

This changeset includes some first-pass refactoring of this command. It breaks
up monolithic methods that validate a path, create a file, serialize objects,
and write them to disk into smaller functions, some of which can now be
standalone to take advantage of generics.
2022-09-22 10:02:00 -04:00
James Rasell a25028c412
cli: fix a bug in operator API when setting HTTPS via address. (#14635)
Operators may have a setup whereby the TLS config comes from a
source other than setting Nomad specific env vars. In this case,
we should attempt to identify the scheme using the config setting
as a fallback.
2022-09-22 15:43:58 +02:00
Luiz Aoqui ad48401219
chore: move changelog file to the right folder (#14639) 2022-09-21 13:50:22 -04:00
Tim Gross 38a6e7e343
remove 1.4.0 changelog entry that refers to bugfix on new code (#14611)
Bug fixes on new features in Nomad 1.4.0 don't need or want changelog entries in
the same changelog the feature appeared, so remove this one.
2022-09-16 16:14:02 -04:00
Phil Renaud d6c9676252
Added task links to various alloc tables (#14592)
* Added task links to various alloc tables

* Lintfix

* Border collapse and added to task group page

* Logs icon temporarily removed and localStorage added

* Mock task added to test

* Delog

* Two asserts in new test

* Remove commented-out code

* Changelog

* Removing args.allocation deps
2022-09-16 15:58:22 -04:00
Phil Renaud cebfbb0c28
Stabilizing percy snapshots with faker (#14551)
* First attempt at stabilizing percy snapshots with faker

* Tokens seed moved to before management token generation

* Faker seed only in token test

* moving seed after storage clear

* And again, but back to no faker seeding

* Isolated seed and temporary log

* Setting seed(1) wherever we're snapshotting, or before establishing cluster scenarios

* Deliberate noop to see if percy is stable

* Changelog entry
2022-09-14 11:27:48 -04:00
Mahmood Ali a9d5e4c510
scheduler: stopped-yet-running allocs are still running (#10446)
* scheduler: stopped-yet-running allocs are still running

* scheduler: test new stopped-but-running logic

* test: assert nonoverlapping alloc behavior

Also add a simpler Wait test helper to improve line numbers and save few
lines of code.

* docs: tried my best to describe #10446

it's not concise... feedback welcome

* scheduler: fix test that allowed overlapping allocs

* devices: only free devices when ClientStatus is terminal

* test: output nicer failure message if err==nil

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-09-13 12:52:47 -07:00
Tim Gross eb757606f3
changelog entry for variables (#14509) 2022-09-13 10:25:26 -04:00
Derek Strickland 5ca934015b
job_endpoint: check spec for all regions (#14519)
* job_endpoint: check spec for all regions
2022-09-12 09:24:26 -04:00
James Rasell 009948186b
changelog: add entry for #14320 (#14518) 2022-09-09 17:25:50 +02:00
James Rasell f51a8c73e6
deps: update armon/go-metrics to v0.4.1 (#14493) 2022-09-09 09:20:55 +02:00
Charlie Voiselle e58998e218
Add client scheduling eligibility to heartbeat (#14483) 2022-09-08 14:31:36 -04:00
Tim Gross 3fc7482ecd
CSI: failed allocation should not block its own controller unpublish (#14484)
A Nomad user reported problems with CSI volumes associated with failed
allocations, where the Nomad server did not send a controller unpublish RPC.

The controller unpublish is skipped if other non-terminal allocations on the
same node claim the volume. The check has a bug where the allocation belonging
to the claim being freed was included in the check incorrectly. During a normal
allocation stop for job stop or a new version of the job, the allocation is
terminal. But allocations that fail are not yet marked terminal at the point in
time when the client sends the unpublish RPC to the server.

For CSI plugins that support controller attach/detach, this means that the
controller will not be able to detach the volume from the allocation's host and
the replacement claim will fail until a GC is run. This changeset fixes the
conditional so that the claim's own allocation is not included, and makes the
logic easier to read. Include a test case covering this path.

Also includes two minor extra bugfixes:

* Entities we get from the state store should always be copied before
altering. Ensure that we copy the volume in the top-level unpublish workflow
before handing off to the steps.

* The list stub object for volumes in `nomad/structs` did not match the stub
object in `api`. The `api` package also did not include the current
readers/writers fields that are expected by the UI. True up the two objects and
add the previously undocumented fields to the docs.
2022-09-08 13:30:05 -04:00