Commit Graph

22973 Commits

Author SHA1 Message Date
James Rasell 16b1f19ffe
api: move serviceregistration client to servics to match CLI.
The service registration client name was used to provide a
distinction between the service block and the service client. This
however creates new wording to understand and does not match the
CLI, therefore this change fixes that so we have a Services
client.

Consul specific objects within the service file have been moved to
the consul location to create a clearer separation.
2022-03-24 09:08:45 +01:00
James Rasell 96d8512c85
test: move remaining tests to use ci.Parallel. 2022-03-24 08:45:13 +01:00
dependabot[bot] 92021045b6
build(deps): bump github.com/stretchr/testify from 1.7.0 to 1.7.1 (#12306) 2022-03-23 19:12:51 -04:00
Seth Hoenig 2e5c6de820 client: enable support for cgroups v2
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
2022-03-23 11:35:27 -05:00
Tim Gross 5c91bc877c
csi: set gRPC authority header for unix domain socket (#12359)
The go-grpc library used by most CSI plugins doesn't require the
authority header to be set, which violates the HTTP2 spec but doesn't
impact Nomad because both sides of the connection are using the same
library. But plugins written in other languages (`democratic-csi` for
example) may have more strictly conforming gRPC server libraries and
we need to set the authority header manually.
2022-03-23 12:01:08 -04:00
James Rasell cd42572f0d
Merge pull request #12330 from hashicorp/f-gh-263
cli: add service commands for list, info, and delete.
2022-03-23 15:49:29 +01:00
Tim Gross 1743648901
CSI: fix timestamp from volume snapshot responses (#12352)
Listing snapshots was incorrectly returning nanoseconds instead of
seconds, and formatting of timestamps both list and create snapshot
was treating the timestamp as though it were nanoseconds instead of
seconds. This resulted in create timestamps always being displayed as
zero values.

Fix the unit conversion error in the command line and the incorrect
extraction in the CSI plugin client code. Beef up the unit tests to
make sure this code is actually exercised.
2022-03-23 10:39:28 -04:00
Tim Gross b7075f04fd
CSI: enforce single access mode at validation time (#12337)
A volume that has single-use access mode is feasibility checked during
scheduling to ensure that only a single reader or writer claim
exists. However, because feasibility checking is done one alloc at a
time before the plan is written, a job that's misconfigured to have
count > 1 that mounts one of these volumes will pass feasibility
checking.

Enforce the check at validation time instead to prevent us from even
trying to evaluation a job that's misconfigured this way.
2022-03-23 09:21:26 -04:00
James Rasell 0a9f54c525
cli: update namespace wildcard help to be non-specific.
A number of commands support namespace wildcard querying, so it
should be up to the sub-command to detail support, rather than
keeping this list up to date.
2022-03-23 12:56:48 +01:00
James Rasell bb8514fc75
core: remove node service registrations when node is down.
When a node fails its heart beating a number of actions are taken
to ensure state is cleaned. Service registrations a loosely tied
to nodes, therefore we should remove these from state when a node
is considered terminally down.
2022-03-23 09:42:46 +01:00
James Rasell a646333263
Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
Michele Degges 44ebafd4b1 [RelAPI Onboarding] Add release API metadata file 2022-03-22 17:06:33 -07:00
Tim Gross 33558cb51e
csi: fix handling of garbage collected node in node unpublish (#12350)
When a node is garbage collected, we assume that the volume is no
longer attached to it and ignore the `ErrUnknownNode` error. But we
used `errors.Is` to check for a wrapped error, and RPC flattens the
errors during serialization. This results in an error check that works
in automated testing but not in real clusters. Use a string contains
check instead.
2022-03-22 15:40:24 -04:00
Luiz Aoqui f8973d364e
core: use the new Raft API when removing peers (#12340)
Raft v3 introduced a new API for adding and removing peers that takes
the peer ID instead of the address.

Prior to this change, Nomad would use the remote peer Raft version for
deciding which API to use, but this would not work in the scenario where
a Raft v3 server tries to remove a Raft v2 server; the code running uses
v3 so it's unable to call the v2 API.

This change uses the Raft version of the server running the code to
decide which API to use. If the remote peer is a Raft v2, it uses the
server address as the ID.
2022-03-22 15:07:31 -04:00
Luiz Aoqui b5a42cd55d
set raft v3 as the default config (#12341) 2022-03-22 15:06:25 -04:00
Tim Gross 60cfeacd76
drainer: defer CSI plugins until last (#12324)
When a node is drained, system jobs are left until last so that
operators can rely on things like log shippers running even as their
applications are getting drained off. Include CSI plugins in this set
so that Controller plugins deployed as services can be handled as
gracefully as Node plugins that are running as system jobs.
2022-03-22 10:26:56 -04:00
Jonathan Tey 4635be07ab
demo: add missing file for Kadalu CSI demo (#12336) 2022-03-22 09:50:50 -04:00
Tim Gross 2a2ebd0537
CSI: presentation improvements (#12325)
* Fix plugin capability sorting.
  The `sort.StringSlice` method in the stdlib doesn't actually sort, but
  instead constructs a sorting type which you call `Sort()` on.
* Sort allocations for plugins by modify index.
  Present allocations in modify index order so that newest allocations
  show up at the top of the list. This results in sorted allocs in
  `nomad plugin status :id`, just like `nomad job status :id`.
* Sort allocations for volumes in HTTP response.
  Present allocations in modify index order so that newest allocations
  show up at the top of the list. This results in sorted allocs in
  `nomad volume status :id`, just like `nomad job status :id`.
  This is implemented in the HTTP response and not in the state store
  because the state store maintains two separate lists of allocs that
  are merged before sending over the API.
* Fix length of alloc IDs in `nomad volume status` output
2022-03-22 09:48:38 -04:00
James Rasell f1e4db70d2
Merge pull request #12332 from hashicorp/b-node-fixup-drainupdate-msg-spelling
core: fixup node drain update message spelling.
2022-03-22 10:44:03 +01:00
James Rasell 93f4f6f09c
Merge pull request #12329 from hashicorp/f-gh-266-touchdown
client: hookup service wrapper for use in clients
2022-03-22 10:42:05 +01:00
Tim Gross e687a21da9
CSI: set plugin `CSI_ENDPOINT` env var only if unset by user (#12257)
* Use unix:// prefix for CSI_ENDPOINT variable by default
* Some plugins have strict validation over the format of the
  `CSI_ENDPOINT` variable, and unfortunately not all plugins
  agree. Allow the user to override the `CSI_ENDPOINT` to workaround
  those cases.
* Update all demos and tests with CSI_ENDPOINT
2022-03-21 11:48:47 -04:00
Tim Gross bd403f2f88
E2E: ensure `ConnectACLsE2ETest` has clean state before starting (#12334)
The `ConnectACLsE2ETest` checks that the SI tokens have been properly
cleaned up between tests, but following the change to use HCP the
previous `Connect` test suite will often have SI tokens that haven't
been cleaned up by the time this test suite runs. Wait for the SI
tokens to be cleaned up at the start of the test to ensure we have a
clean state.
2022-03-21 11:05:02 -04:00
James Rasell 68cd3d89fe
core: fixup node drain update message spelling. 2022-03-21 13:37:08 +01:00
James Rasell c0a2e493c3
cli: add service commands for list, info, and delete. 2022-03-21 11:59:03 +01:00
James Rasell 042bf0fa57
client: hookup service wrapper for use within client hooks. 2022-03-21 10:29:57 +01:00
James Rasell fa67b0d53a
client: modify service wrapper to accomodate restore behaviour. 2022-03-21 09:49:39 +01:00
James Rasell 7718b9bc3f
Merge pull request #12298 from hashicorp/f-gh-266-nomad
client: add Nomad service registration implementation.
2022-03-21 08:58:45 +01:00
Seth Hoenig 3303a4534a
Merge pull request #12322 from hashicorp/ci-gha
ci: turn on testing in github actions
2022-03-18 12:58:09 -05:00
Seth Hoenig 8eea6e3aa3 ci: scope to push, ignore more dirs, update go update script 2022-03-18 12:47:38 -05:00
Seth Hoenig 57bd480062 ci: turn on testing in github actions 2022-03-18 11:12:24 -05:00
Seth Hoenig f2914ea36c
Merge pull request #12321 from hashicorp/ci-less-logging
ci: limit gotestsum to circle ci
2022-03-18 10:02:13 -05:00
Seth Hoenig 4d86f5d94d ci: limit gotestsum to circle ci
Part 2 of breaking up https://github.com/hashicorp/nomad/pull/12255

This PR makes it so gotestsum is invoked only in CircleCI. Also the
HCLogger(t) is plumbed more correctly in TestServer and TestAgent so
that they respect NOMAD_TEST_LOG_LEVEL.

The reason for these is we'll want to disable logging in GHA,
where spamming the disk with logs really drags performance.
2022-03-18 09:15:01 -05:00
Tim Gross 1561f66d99
api: fix ENT-only test imports for moved testutil package (#12320)
The `api/testutil` package was moved to `api/internal/testutil` but
this wasn't caught in the ENT tests because they're not run here in
the OSS repo.
2022-03-18 10:12:28 -04:00
Tim Gross 9f05d62338
E2E with HCP Consul/Vault (#12267)
Use HCP Consul and HCP Vault for the Consul and Vault clusters used in E2E testing. This has the following benefits:

* Without the need to support mTLS bootstrapping for Consul and Vault, we can simplify the mTLS configuration by leaning on Terraform instead of janky bash shell scripting.
* Vault bootstrapping is no longer required, so we can eliminate even more janky shell scripting
* Our E2E exercises HCP, which is important to us as an organization
* With the reduction in configurability, we can simplify the Terraform configuration and drop the complicated `provision.sh`/`provision.ps1` scripts we were using previously. We can template Nomad configuration files and upload them with the `file` provisioner.
* Packer builds for Linux and Windows become much simpler.

tl;dr way less janky shell scripting!
2022-03-18 09:27:28 -04:00
Seth Hoenig ab9a639a0a
Merge pull request #12313 from hashicorp/purge-parallel-2
ci: more parallel removal
2022-03-17 13:48:37 -05:00
Seth Hoenig b73d911f05 ci: do not exclude Parallel semgrep rule 2022-03-17 13:45:56 -05:00
Luiz Aoqui 68e5b58007
cli: display Raft version in `server members` (#12317)
The previous output of the `nomad server members` command would output a
column named `Protocol` that displayed the Serf protocol being currently
used by servers.

This is not a configurable option, so it holds very little value to
operators. It is also easy to confuse it with the Raft Protocol version,
which is configurable and highly relevant to operators.

This commit replaces the previous `Protocol` column with the new `Raft
Version`. It also updates the `-detailed` flag to be called `-verbose`
so it matches other commands. The detailed output now also outputs the
same information as the standard output with the addition of the
previous `Protocol` column and `Tags`.
2022-03-17 14:15:10 -04:00
Luiz Aoqui 15089f055f
api: add related evals to eval details (#12305)
The `related` query param is used to indicate that the request should
return a list of related (next, previous, and blocked) evaluations.

Co-authored-by: Jasmine Dahilig <jasmine@hashicorp.com>
2022-03-17 13:56:14 -04:00
Luiz Aoqui 8db12c2a17
server: transfer leadership in case of error (#12293)
When a Nomad server becomes the Raft leader, it must perform several
actions defined in the establishLeadership function. If any of these
actions fail, Raft will think the node is the leader, but it will not
actually be able to act as a Nomad leader.

In this scenario, leadership must be revoked and transferred to another
server if possible, or the node should retry the establishLeadership
steps.
2022-03-17 11:10:57 -04:00
Seth Hoenig 373d8f7241 ci: missing import for nomad09upgrade 2022-03-17 08:49:15 -05:00
Seth Hoenig 58b3d1711b ci: semgrep rule for parallel tests
Adds a semgrep rule warning about using ci.Parallel instead of t.Parallel
2022-03-17 08:43:37 -05:00
Seth Hoenig f87eb666c7 e2e: have e2e use ci.Parallel
This is a followup to having tests run in serial in CI.

The e2e package isn't in CI, but lets use the helper anyway
so we can setup semgrep rules covering the entire repository.
2022-03-17 08:37:34 -05:00
Seth Hoenig 3943dd1e16 ci: use serial testing for api in CI
This is a followup to running tests in serial in CI.
Since the API package cannot import anything outside of api/,
copy the ci.Parallel function into api/internal/testutil, and
have api tests use that.
2022-03-17 08:35:01 -05:00
James Rasell 1a4db3523d
Merge branch 'main' into tlefebvre/fix-wrong-drivernetworkmanager-interface 2022-03-17 09:38:13 +01:00
James Rasell ae26f79357
client: add Nomad service registration implementation. 2022-03-17 09:17:13 +01:00
James Rasell 54eed7b8ee
Merge pull request #12295 from hashicorp/f-gh-266-wrapper
client: add service registration wrapper to handle providers.
2022-03-17 08:52:45 +01:00
James Rasell 91e4d20b1d
Merge pull request #12307 from hashicorp/b-groupservices-avoid-double-tg-lookup
client: avoid double group lookup within groupservice hook setup.
2022-03-16 17:00:01 +01:00
Luiz Aoqui 83d834d84c
tests: move state store namespace tests from ENT (#12308) 2022-03-16 11:56:11 -04:00
Seth Hoenig aca50349f4
Merge pull request #12299 from hashicorp/ci-parallel
ci: trade test parallelization for unconstrained gomaxprocs
2022-03-16 08:55:39 -05:00
Seth Hoenig 2b83614a26 ci: explain why ci runs tests in serial now 2022-03-16 08:38:42 -05:00