Commit graph

82 commits

Author SHA1 Message Date
Luiz Aoqui adf147cb36
acl: update job eval requirement to submit-job (#16463)
The job evaluate endpoint creates a new evaluation for the job which is
a write operation. This change modifies the necessary capability from
`read-job` to `submit-job` to better reflect this.
2023-03-13 17:13:54 -04:00
Seth Hoenig ff4503aac6
client: disable running artifact downloader as nobody (#16375)
* client: disable running artifact downloader as nobody

This PR reverts a change from Nomad 1.5 where artifact downloads were
executed as the nobody user on Linux systems. This was done as an attempt
to improve the security model of artifact downloading where third party
tools such as git or mercurial would be run as the root user with all
the security implications thereof.

However, doing so conflicts with Nomad's own advice for securing the
Client data directory - which when setup with the recommended directory
permissions structure prevents artifact downloads from working as intended.

Artifact downloads are at least still now executed as a child process of
the Nomad agent, and on modern Linux systems make use of the kernel Landlock
feature for limiting filesystem access of the child process.

* docs: update upgrade guide for 1.5.1 sandboxing

* docs: add cl

* docs: add title to upgrade guide fix
2023-03-08 15:58:43 -06:00
James Rasell 05fff34fc8
docs: add 1.5.0, 1.4.5, and 1.3.10 pause regression upgrade note. (#16358) 2023-03-07 18:29:03 +01:00
Tim Gross 87b88fd83d
docs: clarify upgrade note on 1.4.0 panics (#16171)
The panic bug for upgrades with older servers that shipped in 1.4.0 was fixed in
1.4.1, which makes the versions described in the warning in the upgrade guide
misleading. Clarify the upgrade guide.
2023-02-14 11:26:33 -05:00
Seth Hoenig 165791dd89
artifact: protect against unbounded artifact decompression (1.5.0) (#16151)
* artifact: protect against unbounded artifact decompression

Starting with 1.5.0, set defaut values for artifact decompression limits.

artifact.decompression_size_limit (default "100GB") - the maximum amount of
data that will be decompressed before triggering an error and cancelling
the operation

artifact.decompression_file_count_limit (default 4096) - the maximum number
of files that will be decompressed before triggering an error and
cancelling the operation.

* artifact: assert limits cannot be nil in validation
2023-02-14 09:28:39 -06:00
Tim Gross be3cd7236f
docs: fix links in 1.5.0 upgrade guide (#16106) 2023-02-09 09:39:49 -05:00
Tim Gross 8a7d6b0cde
cli: remove deprecated keyring and keygen commands (#16068)
These command were marked as deprecated in 1.4.0 with intent to remove in
1.5.0. Remove them and clean up the docs.
2023-02-07 09:49:52 -05:00
jmwilkinson 37834dffda
Allow wildcard datacenters to be specified in job file (#11170)
Also allows for default value of `datacenters = ["*"]`
2023-02-02 09:57:45 -05:00
Luiz Aoqui d22984f659
docs: add upgrade notice for batch GC changes (#15985) 2023-01-31 15:04:12 -05:00
Piotr Kazmierczak 14b53df3b6
renamed stanza to block for consistency with other projects (#15941) 2023-01-30 15:48:43 +01:00
Ashlee M Boyer 57f8ebfa26
docs: Migrate link formats (#15779)
* Adding check-legacy-links-format workflow

* Adding test-link-rewrites workflow

* chore: updates link checker workflow hash

* Migrating links to new format

Co-authored-by: Kendall Strautman <kendallstrautman@gmail.com>
2023-01-25 09:31:14 -08:00
Tim Gross a51149736d
Rename nomad.broker.total_blocked metric (#15835)
This changeset fixes a long-standing point of confusion in metrics emitted by
the eval broker. The eval broker has a queue of "blocked" evals that are waiting
for an in-flight ("unacked") eval of the same job to be completed. But this
"blocked" state is not the same as the `blocked` status that we write to raft
and expose in the Nomad API to end users. There's a second metric
`nomad.blocked_eval.total_blocked` that refers to evaluations in that
state. This has caused ongoing confusion in major customer incidents and even in
our own documentation! (Fixed in this PR.)

There's little functional change in this PR aside from the name of the metric
emitted, but there's a bit refactoring to clean up the names in `eval_broker.go`
so that there aren't name collisions and multiple names for the same
state. Changes included are:
* Everything that was previously called "pending" referred to entities that were
  associated witht he "ready" metric. These are all now called "ready" to match
  the metric.
* Everything named "blocked" in `eval_broker.go` is now named "pending", except
  for a couple of comments that actually refer to blocked RPCs.
* Added a note to the upgrade guide docs for 1.5.0.
* Fixed the scheduling performance metrics docs because the description for
  `nomad.broker.total_blocked` was actually the description for
  `nomad.blocked_eval.total_blocked`.
2023-01-20 14:23:56 -05:00
Anthony Davis 1c32471805
Fix rejoin_after_leave behavior (#15552) 2023-01-11 16:39:24 -05:00
Seth Hoenig be3f89b5f9
artifact: enable inheriting environment variables from client (#15514)
* artifact: enable inheriting environment variables from client

This PR adds client configuration for specifying environment variables that
should be inherited by the artifact sandbox process from the Nomad Client agent.

Most users should not need to set these values but the configuration is provided
to ensure backwards compatability. Configuration of go-getter should ideally be
done through the artifact block in a jobspec task.

e.g.

```hcl
client {
  artifact {
    set_environment_variables = "TMPDIR,GIT_SSH_OPTS"
  }
}
```

Closes #15498

* website: update set_environment_variables text to mention PATH
2022-12-09 15:46:07 -06:00
Seth Hoenig 825c5cc65e
artifact: add client toggle to disable filesystem isolation (#15503)
This PR adds the client config option for turning off filesystem isolation,
applicable on Linux systems where filesystem isolation is possible and
enabled by default.

```hcl
client{
  artifact {
    disable_filesystem_isolation = <bool:false>
  }
}
```

Closes #15496
2022-12-08 12:29:23 -06:00
Seth Hoenig 51a2212d3d
client: sandbox go-getter subprocess with landlock (#15328)
* client: sandbox go-getter subprocess with landlock

This PR re-implements the getter package for artifact downloads as a subprocess.

Key changes include

On all platforms, run getter as a child process of the Nomad agent.
On Linux platforms running as root, run the child process as the nobody user.
On supporting Linux kernels, uses landlock for filesystem isolation (via go-landlock).
On all platforms, restrict environment variables of the child process to a static set.
notably TMP/TEMP now points within the allocation's task directory
kernel.landlock attribute is fingerprinted (version number or unavailable)
These changes make Nomad client more resilient against a faulty go-getter implementation that may panic, and more secure against bad actors attempting to use artifact downloads as a privilege escalation vector.

Adds new e2e/artifact suite for ensuring artifact downloading works.

TODO: Windows git test (need to modify the image, etc... followup PR)

* landlock: fixup items from cr

* cr: fixup tests and go.mod file
2022-12-07 16:02:25 -06:00
Zach Shilton fedeb84500
website: fix broken links (#14946)
* fix: nomad license put link

* fix: redirected URL

* fix: avoid auto-formatting changes
2022-10-19 14:07:48 -04:00
Bryce Kalow 94ff129167
website: fixes redirected links (#14918) 2022-10-18 10:31:52 -05:00
Tim Gross 0cc64da404
docs: 1.4.0 upgrade warning for keyring initialization (#14825) 2022-10-06 11:32:35 -04:00
Seth Hoenig 5df5e70542
core: numeric operands comparisons in constraints (#14722)
* cleanup: fixup linter warnings in schedular/feasible.go

* core: numeric operands comparisons in constraints

This PR changes constraint comparisons to be numeric rather than
lexical if both operands are integers or floats.

Inspiration #4856
Closes #4729
Closes #14719

* fix: always parse as int64
2022-09-27 11:07:07 -05:00
Michael Schurter b554f9344a
fingerprint: lengthen Vault check after seen (#14693)
Extension of #14673

Once Vault is initially fingerprinted, extend the period since changes
should be infrequent and the fingerprint is relatively expensive since
it is contacting a central Vault server.

Also move the period timer reset *after* the fingerprint. This is
similar to #9435 where the idea is to ensure the retry period starts
*after* the operation is attempted. 15s will be the *minimum* time
between fingerprints now instead of the *maximum* time between
fingerprints.

In the case of Vault fingerprinting, the original behavior might cause
the following:

1. Timer is reset to 15s
2. Fingerprint takes 16s
3. Timer has already elapsed so we immediately Fingerprint again

Even if fingerprinting Vault only takes a few seconds, that may very
well be due to excessive load and backing off our fingerprints is
desirable. The new bevahior ensures we always wait at least 15s between
fingerprint attempts and should allow some natural jittering based on
server load and network latency.
2022-09-26 12:14:19 -07:00
Tim Gross 17aee4d69c
fingerprint: don't clear Consul/Vault attributes on failure (#14673)
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint and result in evaluations for all its jobs and all the system jobs
in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evalution for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.

This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
2022-09-23 14:45:12 -04:00
Tim Gross c29c4bd66c
cli: remove deprecated eval status -json list behavior (#14651)
In Nomad 1.2.6 we shipped `eval list`, which accepts a `-json` flag, and
deprecated the usage of `eval status` without an evaluation ID with an upgrade
note that it would be removed in Nomad 1.4.0. This changeset completes that
work.
2022-09-22 10:56:32 -04:00
Mahmood Ali a9d5e4c510
scheduler: stopped-yet-running allocs are still running (#10446)
* scheduler: stopped-yet-running allocs are still running

* scheduler: test new stopped-but-running logic

* test: assert nonoverlapping alloc behavior

Also add a simpler Wait test helper to improve line numbers and save few
lines of code.

* docs: tried my best to describe #10446

it's not concise... feedback welcome

* scheduler: fix test that allowed overlapping allocs

* devices: only free devices when ClientStatus is terminal

* test: output nicer failure message if err==nil

Co-authored-by: Mahmood Ali <mahmood@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-09-13 12:52:47 -07:00
Piotr Kazmierczak 077b6e7098
docs: Update upgrade guide to reflect enterprise changes introduced in nomad-enterprise (#14212)
This PR documents a change made in the enterprise version of nomad that addresses the following issue:

When a user tries to filter audit logs, they do so with a stanza that looks like the following:

audit {
  enabled = true

  filter "remove deletes" {
    type = "HTTPEvent"
    endpoints  = ["*"]
    stages = ["OperationComplete"]
    operations = ["DELETE"]
  }
}

When specifying both an "endpoint" and a "stage", the events with both matching a "endpoint" AND a matching "stage" will be filtered.

When specifying both an "endpoint" and an "operation" the events with both matching a "endpoint" AND a matching "operation" will be filtered.

When specifying both a "stage" and an "operation" the events with a matching a "stage" OR a matching "operation" will be filtered.

The "OR" logic with stages and operations is unexpected and doesn't allow customers to get specific on which events they want to filter. For instance the following use-case is impossible to achieve: "I want to filter out all OperationReceived events that have the DELETE verb".
2022-08-24 16:31:49 +02:00
Luiz Aoqui dbffdca92e
template: use pointer values for gid and uid (#14203)
When a Nomad agent starts and loads jobs that already existed in the
cluster, the default template uid and gid was being set to 0, since this
is the zero value for int. This caused these jobs to fail in
environments where it was not possible to use 0, such as in Windows
clients.

In order to differentiate between an explicit 0 and a template where
these properties were not set we need to use a pointer.
2022-08-22 16:25:49 -04:00
Seth Hoenig a9fa48f3db docs: remove consul 1.12.0 warning 2022-07-12 09:53:17 -05:00
Tim Gross fc4cd53cfb
docs: rename Internals to Concepts (#13696) 2022-07-11 16:55:33 -04:00
Seth Hoenig ed2f2b1a75
docs: move upgrade docs for max_client_timeout
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2022-07-07 16:46:26 -05:00
Seth Hoenig 905e673553 docs: upgrade guide for client max_kill_timeout 2022-07-07 15:27:40 -05:00
Seth Hoenig 91e08d5e23 core: remove support for raft protocol version 2
This PR checks server config for raft_protocol, which must now
be set to 3 or unset (0). When unset, version 3 is used as the
default.
2022-06-23 14:37:50 +00:00
Arthur Leclerc d98a9b1d72
docs: Fix typo (#13389) 2022-06-16 13:24:18 -04:00
Michael Schurter 2965dc6a1a
artifact: fix numerous go-getter security issues
Fix numerous go-getter security issues:

- Add timeouts to http, git, and hg operations to prevent DoS
- Add size limit to http to prevent resource exhaustion
- Disable following symlinks in both artifacts and `job run`
- Stop performing initial HEAD request to avoid file corruption on
  retries and DoS opportunities.

**Approach**

Since Nomad has no ability to differentiate a DoS-via-large-artifact vs
a legitimate workload, all of the new limits are configurable at the
client agent level.

The max size of HTTP downloads is also exposed as a node attribute so
that if some workloads have large artifacts they can specify a high
limit in their jobspecs.

In the future all of this plumbing could be extended to enable/disable
specific getters or artifact downloading entirely on a per-node basis.
2022-05-24 16:29:39 -04:00
Luiz Aoqui fea13f39b3
docs: add Consul 1.12.0 upgrade notice 2022-05-16 18:44:26 -04:00
Tim Gross c763c4cb96
remove pre-0.9 driver code and related E2E test (#12791)
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
2022-04-27 09:53:37 -04:00
James Rasell b5d10bcece
docs: add upgrade note for Consul implicit constraint. (#12749) 2022-04-22 15:53:27 +02:00
Seth Hoenig df587d8263 docs: update documentation with connect acls changes
This PR updates the changelog, adds notes the 1.3 upgrade guide, and
updates the connect integration docs with documentation about the new
requirement on Consul ACL policies of Consul agent default anonymous ACL
tokens.
2022-04-18 08:22:33 -05:00
Seth Hoenig a1c4f16cf1 connect: prefix tag with nomad.; merge into envoy_stats_tags; update docs
This PR expands on the work done in #12543 to
- prefix the tag, so it is now "nomad.alloc_id" to be more consistent with Consul tags
- merge into pre-existing envoy_stats_tags fields
- update the upgrade guide docs
- update changelog
2022-04-14 12:52:52 -05:00
Seth Hoenig a75bc27601 docs: fixup title formatting in upgrade guide 2022-04-08 11:50:54 -05:00
Luiz Aoqui 0190f378a7
docs: fix upgrade specific broken link and conflict tag (#12521) 2022-04-08 12:36:47 -04:00
Michael Schurter 7a28fcb8af template: disallow writeToFile by default
Resolves #12095 by WONTFIXing it.

This approach disables `writeToFile` as it allows arbitrary host
filesystem writes and is only a small quality of life improvement over
multiple `template` stanzas.

This approach has the significant downside of leaving people who have
altered their `template.function_denylist` *still vulnerable!* I added
an upgrade note, but we should have implemented the denylist as a
`map[string]bool` so that new funcs could be denied without overriding
custom configurations.

This PR also includes a bug fix that broke enabling all consul-template
funcs. We repeatedly failed to differentiate between a nil (unset)
denylist and an empty (allow all) one.
2022-03-28 17:05:42 -07:00
Luiz Aoqui 848a3b271f
docs: fix link and add note about Nomad v1.3.0 on raft v3 upgrade (#12378) 2022-03-25 10:11:46 -04:00
Seth Hoenig 113b7eb727 client: cgroups v2 code review followup 2022-03-24 13:40:42 -05:00
Seth Hoenig 2e5c6de820 client: enable support for cgroups v2
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
2022-03-23 11:35:27 -05:00
Luiz Aoqui 68e5b58007
cli: display Raft version in server members (#12317)
The previous output of the `nomad server members` command would output a
column named `Protocol` that displayed the Serf protocol being currently
used by servers.

This is not a configurable option, so it holds very little value to
operators. It is also easy to confuse it with the Raft Protocol version,
which is configurable and highly relevant to operators.

This commit replaces the previous `Protocol` column with the new `Raft
Version`. It also updates the `-detailed` flag to be called `-verbose`
so it matches other commands. The detailed output now also outputs the
same information as the standard output with the addition of the
previous `Protocol` column and `Tags`.
2022-03-17 14:15:10 -04:00
Luiz Aoqui 8db12c2a17
server: transfer leadership in case of error (#12293)
When a Nomad server becomes the Raft leader, it must perform several
actions defined in the establishLeadership function. If any of these
actions fail, Raft will think the node is the leader, but it will not
actually be able to act as a Nomad leader.

In this scenario, leadership must be revoked and transferred to another
server if possible, or the node should retry the establishLeadership
steps.
2022-03-17 11:10:57 -04:00
Seth Hoenig 8e6d97744b docs: emphasize snapshot before upgrading 2022-02-24 08:22:41 -06:00
Seth Hoenig de95998faa core: switch to go.etc.io/bbolt
This PR swaps the underlying BoltDB implementation from boltdb/bolt
to go.etc.io/bbolt.

In addition, the Server has a new configuration option for disabling
NoFreelistSync on the underlying database.

Freelist option: https://github.com/etcd-io/bbolt/blob/master/db.go#L81
Consul equivelent PR: https://github.com/hashicorp/consul/pull/11720
2022-02-23 14:26:41 -06:00
Tim Gross 246db87a74
CSI: allow for concurrent plugin allocations (#12078)
The dynamic plugin registry assumes that plugins are singletons, which
matches the behavior of other Nomad plugins. But because dynamic
plugins like CSI are implemented by allocations, we need to handle the
possibility of multiple allocations for a given plugin type + ID, as
well as behaviors around interleaved allocation starts and stops.

Update the data structure for the dynamic registry so that more recent
allocations take over as the instance manager singleton, but we still
preserve the previous running allocations so that restores work
without racing.

Multiple allocations can run on a client for the same plugin, even if
only during updates. Provide each plugin task a unique path for the
control socket so that the tasks don't interfere with each other.
2022-02-23 15:23:07 -05:00
Luiz Aoqui d976e4a19b
docs: add upgrade note and ACL requirements for the job submit endpoint (#12046) 2022-02-10 15:35:16 -05:00