Commit graph

331 commits

Author SHA1 Message Date
Charlie Voiselle 4caac1a92f
client: Add option to enable hairpinMode on Nomad bridge (#15961)
* Add `bridge_network_hairpin_mode` client config setting
* Add node attribute: `nomad.bridge.hairpin_mode`
* Changed format string to use `%q` to escape user provided data
* Add test to validate template JSON for developer safety

Co-authored-by: Daniel Bennett <dbennett@hashicorp.com>
2023-02-02 10:12:15 -05:00
Karl Johann Schubert b773a1b77f
client: add disk_total_mb and disk_free_mb config options (#15852) 2023-01-24 09:14:22 -05:00
Seth Hoenig 8cd77c14a2
env/aws: update ec2 cpu info data (#15770) 2023-01-13 09:58:23 -06:00
Seth Hoenig 83450c8762
vault: configure user agent on Nomad vault clients (#15745)
* vault: configure user agent on Nomad vault clients

This PR attempts to set the User-Agent header on each Vault API client
created by Nomad. Still need to figure a way to set User-Agent on the
Vault client created internally by consul-template.

* vault: fixup find-and-replace gone awry
2023-01-10 10:39:45 -06:00
Seth Hoenig 51a2212d3d
client: sandbox go-getter subprocess with landlock (#15328)
* client: sandbox go-getter subprocess with landlock

This PR re-implements the getter package for artifact downloads as a subprocess.

Key changes include

On all platforms, run getter as a child process of the Nomad agent.
On Linux platforms running as root, run the child process as the nobody user.
On supporting Linux kernels, uses landlock for filesystem isolation (via go-landlock).
On all platforms, restrict environment variables of the child process to a static set.
notably TMP/TEMP now points within the allocation's task directory
kernel.landlock attribute is fingerprinted (version number or unavailable)
These changes make Nomad client more resilient against a faulty go-getter implementation that may panic, and more secure against bad actors attempting to use artifact downloads as a privilege escalation vector.

Adds new e2e/artifact suite for ensuring artifact downloading works.

TODO: Windows git test (need to modify the image, etc... followup PR)

* landlock: fixup items from cr

* cr: fixup tests and go.mod file
2022-12-07 16:02:25 -06:00
Seth Hoenig 3ed37b0b1d
fingerprint: add fingerprinting for CNI plugins presense and version (#15452)
This PR adds a fingerprinter to set the attribute
"plugins.cni.version.<name>" => "<version>"

for each CNI plugin in <client>.cni_path (/opt/cni/bin by default).
2022-12-05 14:22:47 -06:00
James Rasell e2a2ea68fc
client: accommodate Consul 1.14.0 gRPC and agent self changes. (#15309)
* client: accommodate Consul 1.14.0 gRPC and agent self changes.

Consul 1.14.0 changed the way in which gRPC listeners are
configured, particularly when using TLS. Prior to the change, a
single listener was responsible for handling plain-text and
encrypted gRPC requests. In 1.14.0 and beyond, separate listeners
will be used for each, defaulting to 8502 and 8503 for plain-text
and TLS respectively.

The change means that Nomad’s Consul Connect integration would not
work when integrated with Consul clusters using TLS and running
1.14.0 or greater.

The Nomad Consul fingerprinter identifies the gRPC port Consul has
exposed using the "DebugConfig.GRPCPort" value from Consul’s
“/v1/agent/self” endpoint. In Consul 1.14.0 and greater, this only
represents the plain-text gRPC port which is likely to be disbaled
in clusters running TLS. In order to fix this issue, Nomad now
takes into account the Consul version and configured scheme to
optionally use “DebugConfig.GRPCTLSPort” value from Consul’s agent
self return.

The “consul_grcp_socket” allocrunner hook has also been updated so
that the fingerprinted gRPC port attribute is passed in. This
provides a better fallback method, when the operator does not
configure the “consul.grpc_address” option.

* docs: modify Consul Connect entries to detail 1.14.0 changes.

* changelog: add entry for #15309

* fixup: tidy tests and clean version match from review feedback.

* fixup: use strings tolower func.
2022-11-21 09:19:09 -06:00
Michael Schurter e6af1c0a14
fingerprint: add node attr for reserverable cores (#14694)
* fingerprint: add node attr for reserverable cores

Add an attribute for the number of reservable CPU cores as they may
differ from the existing `cpu.numcores` due to client configuration or
OS support.

Hopefully clarifies some confusion in #14676

* add changelog

* num_reservable_cores -> reservablecores
2022-09-26 13:03:03 -07:00
Michael Schurter b554f9344a
fingerprint: lengthen Vault check after seen (#14693)
Extension of #14673

Once Vault is initially fingerprinted, extend the period since changes
should be infrequent and the fingerprint is relatively expensive since
it is contacting a central Vault server.

Also move the period timer reset *after* the fingerprint. This is
similar to #9435 where the idea is to ensure the retry period starts
*after* the operation is attempted. 15s will be the *minimum* time
between fingerprints now instead of the *maximum* time between
fingerprints.

In the case of Vault fingerprinting, the original behavior might cause
the following:

1. Timer is reset to 15s
2. Fingerprint takes 16s
3. Timer has already elapsed so we immediately Fingerprint again

Even if fingerprinting Vault only takes a few seconds, that may very
well be due to excessive load and backing off our fingerprints is
desirable. The new bevahior ensures we always wait at least 15s between
fingerprint attempts and should allow some natural jittering based on
server load and network latency.
2022-09-26 12:14:19 -07:00
Tim Gross 17aee4d69c
fingerprint: don't clear Consul/Vault attributes on failure (#14673)
Clients periodically fingerprint Vault and Consul to ensure the server has
updated attributes in the client's fingerprint. If the client can't reach
Vault/Consul, the fingerprinter clears the attributes and requires a node
update. Although this seems like correct behavior so that we can detect
intentional removal of Vault/Consul access, it has two serious failure modes:

(1) If a local Consul agent is restarted to pick up configuration changes and the
client happens to fingerprint at that moment, the client will update its
fingerprint and result in evaluations for all its jobs and all the system jobs
in the cluster.

(2) If a client loses Vault connectivity, the same thing happens. But the
consequences are much worse in the Vault case because Vault is not run as a
local agent, so Vault connectivity failures are highly correlated across the
entire cluster. A 15 second Vault outage will cause a new `node-update`
evalution for every system job on the cluster times the number of nodes, plus
one `node-update` evaluation for every non-system job on each node. On large
clusters of 1000s of nodes, we've seen this create a large backlog of evaluations.

This changeset updates the fingerprinting behavior to keep the last fingerprint
if Consul or Vault queries fail. This prevents a storm of evaluations at the
cost of requiring a client restart if Consul or Vault is intentionally removed
from the client.
2022-09-23 14:45:12 -04:00
Tim Gross e5454362dc
CI: make make check clean on macOS (#14528)
Running `make check` on macOS identifies some dead code because the code is used
only with the Linux build tag. Move this code into appropriately-tagged code
files.
2022-09-09 12:26:34 -04:00
James Rasell 4b9bcf94da
chore: remove use of "err" a log line context key for errors. (#14433)
Log lines which include an error should use the full term "error"
as the context key. This provides consistency across the codebase
and avoids a Go style which operators might not be aware of.
2022-09-01 15:06:10 +02:00
Tim Gross cc9b480996
testing: setting env var incompatible with parallel tests (#14405)
Neither the `os.Setenv` nor `t.Setenv` helper are safe to use in parallel tests
because environment variables are process-global. The stdlib panics if you try
to do this. Remove the `ci.Parallel()` call from all tests where we're setting
environment variables.
2022-08-30 14:49:03 -04:00
Seth Hoenig 90972707f9 build: update aws env cpu info 2022-08-02 07:59:58 -05:00
Tim Gross 20a01cab9e
update AWS cpu info for fingerprinter (#13280) 2022-06-08 09:45:52 -04:00
Shantanu Gadgil 6cb8c95534
fingerprint kernel architecture name (#13182) 2022-06-02 15:51:00 -04:00
Seth Hoenig c87bfe398f build: update ec2 instance profiles
using tools/ec2info
2022-04-21 11:47:40 -05:00
James Rasell 67b467983e
Merge pull request #12368 from hashicorp/f-1.3-boogie-nights
service discovery: add initial MVP implementation
2022-03-25 18:04:47 +01:00
Hunter Morris dcaf99dcc1
client: Add AWS EC2 instance-life-cycle from metadata to client fingerprint (#12371) 2022-03-25 11:50:52 -04:00
James Rasell 9449e1c3e2
Merge branch 'main' into f-1.3-boogie-nights 2022-03-25 16:40:32 +01:00
Seth Hoenig 2e5c6de820 client: enable support for cgroups v2
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
2022-03-23 11:35:27 -05:00
James Rasell a646333263
Merge branch 'main' into f-1.3-boogie-nights 2022-03-23 09:41:25 +01:00
Seth Hoenig 2631659551 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
James Rasell 6d3589e8a7
client: add service discovery feature enabled attribute. 2022-03-14 12:42:01 +01:00
Kevin Schoonover 1dcfff2f70
fingerprint: remove metadata from digitalocean (#12032) 2022-02-09 07:31:45 -05:00
Tim Gross 21bd4835bd
fingerprint: digitalocean fingerprint test requires metadata header (#12028) 2022-02-08 16:35:13 -05:00
Seth Hoenig 5cb365b36b env: update aws cpu configs
By running the tools/ec2info tool
2022-02-08 12:44:00 -06:00
Kevin Schoonover b13573d4ab address comments
Co-authored-by: Seth Hoenig <seth.a.hoenig@gmail.com>
2022-02-07 09:03:48 -08:00
Kevin Schoonover 68eeaa7a18 small fixes 2022-02-05 22:23:43 -08:00
Kevin Schoonover 5523275e95 add digitalocean fingerprinter 2022-02-05 22:17:36 -08:00
Luiz Aoqui 4bdd2c84e3
fix host network reserved port fingerprint (#11728) 2021-12-22 15:29:54 -05:00
Mahmood Ali 4d90afb425 gofmt all the files
mostly to handle build directives in 1.17.
2021-10-01 10:14:28 -04:00
Luiz Aoqui a7698dedba
Disable PowerShell profile and simplify fingerprinting link speed on Windows (#11183) 2021-09-22 11:17:47 -04:00
Luiz Aoqui edd32ba571
Log network device name during fingerprinting (#11184) 2021-09-16 10:48:31 -04:00
James Rasell b6813f1221
chore: fix incorrect docstring formatting. 2021-08-30 11:08:12 +02:00
Seth Hoenig f71d1755a6 env/aws: update ec2 cpu data
using tools/ec2info

```
$ go run .
```
2021-07-22 09:32:46 -05:00
Seth Hoenig 209e2d6d81 consul: pr cleanup namespace probe function signatures 2021-06-07 15:41:01 -05:00
Seth Hoenig 519429a2de consul: probe consul namespace feature before using namespace api
This PR changes Nomad's wrapper around the Consul NamespaceAPI so that
it will detect if the Consul Namespaces feature is enabled before making
a request to the Namespaces API. Namespaces are not enabled in Consul OSS,
and require a suitable license to be used with Consul ENT.

Previously Nomad would check for a 404 status code when makeing a request
to the Namespaces API to "detect" if Consul OSS was being used. This does
not work for Consul ENT with Namespaces disabled, which returns a 500.

Now we avoid requesting the namespace API altogether if Consul is detected
to be the OSS sku, or if the Namespaces feature is not licensed. Since
Consul can be upgraded from OSS to ENT, or a new license applied, we cache
the value for 1 minute, refreshing on demand if expired.

Fixes https://github.com/hashicorp/nomad-enterprise/issues/575

Note that the ticket originally describes using attributes from https://github.com/hashicorp/nomad/issues/10688.
This turns out not to be possible due to a chicken-egg situation between
bootstrapping the agent and setting up the consul client. Also fun: the
Consul fingerprinter creates its own Consul client, because there is no
[currently] no way to pass the agent's client through the fingerprint factory.
2021-06-07 12:19:25 -05:00
Seth Hoenig 3346432d58 client/fingerprint/consul: add new attributes to consul fingerprinter
This PR adds new probes for detecting these new Consul related attributes:

Consul namespaces are a Consul enterprise feature that may be disabled depending
on the enterprise license associated with the Consul servers. Having this attribute
available will enable Nomad to properly decide whether to query the Consul Namespace
API.

Consul connect must be explicitly enabled before Connect APIs will work. Currently
Nomad only checks for a minimum Consul version. Having this attribute available will
enable Nomad to properly schedule Connect tasks only on nodes with a Consul agent that
has Connect enabled.

Consul connect requires the grpc port to be explicitly set before Connect APIs will work.
Currently Nomad only checks for a minimal Consul version. Having this attribute available
will enable Nomad to schedule Connect tasks only on nodes with a Consul agent that has
the grpc listener enabled.
2021-06-03 12:49:22 -05:00
Seth Hoenig b548cf6816 client/fingerprint/consul: refactor the consul fingerprinter to test individual attributes
This PR refactors the ConsulFingerprint implementation, breaking individual attributes
into individual functions to make testing them easier. This is in preparation for
additional extractors about to be added. Behavior should be otherwise unchanged.

It adds the attribute consul.sku, which can be used to differentiate between Consul
OSS vs Consul ENT.
2021-06-03 12:48:39 -05:00
Seth Hoenig f53c30c684 aws_env: update ec2 instances
Generate updated list using tools/ec2info
2021-04-22 11:33:51 -06:00
Nick Ethier 155a2ca5fb client/ar: thread through cpuset manager 2021-04-13 13:28:36 -04:00
Nick Ethier b6b74a98a9 client/fingerprint: move existing cgroup concerns to cgutil 2021-04-13 13:28:36 -04:00
Nick Ethier edc0da9040 client: only fingerprint reservable cores via cgroups, allowing manual override for other platforms 2021-04-13 13:28:15 -04:00
Nick Ethier bed4e92b61 fingerprint: implement client fingerprinting of reservable cores
on Linux systems this is derived from the configure cpuset cgroup parent (defaults to /nomad)
for non Linux systems and Linux systems where cgroups are not enabled, the client defaults to using all cores
2021-04-13 13:28:15 -04:00
Andrii Chubatiuk d8df568f10
support multiple host network aliases for the same interface 2021-04-13 09:33:33 -04:00
Yoan Blanc ac0d5d8bd3
chore: bump golangci-lint from v1.24 to v1.39
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2021-04-03 09:50:23 +02:00
Tim Gross f820021f9e deps: bump gopsutil to v3.21.2 2021-03-30 16:02:51 -04:00
Florian Apolloner df7e22362d Properly detect unloaded dynamic modules on RHEL derivates. Fixes #9776
The modules.dep file on RHEL includes .xz for compressed kernel modules.
2021-01-12 18:28:00 +01:00
Joel May 13faf0d79e Allow client.cpu_total_compute to override attr.cpu.totalcompute 2021-01-07 15:31:11 -05:00