Commit Graph

84 Commits

Author SHA1 Message Date
Tim Gross c763c4cb96
remove pre-0.9 driver code and related E2E test (#12791)
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
2022-04-27 09:53:37 -04:00
Seth Hoenig 52aaf86f52 raw_exec: make raw exec driver work with cgroups v2
This PR adds support for the raw_exec driver on systems with only cgroups v2.

The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.

There is a bit of refactoring in this PR, but the fundamental design remains
the same.

Closes #12351 #12348
2022-04-04 16:11:38 -05:00
Seth Hoenig 2e5c6de820 client: enable support for cgroups v2
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
2022-03-23 11:35:27 -05:00
Seth Hoenig 2631659551 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Mahmood Ali 4d90afb425 gofmt all the files
mostly to handle build directives in 1.17.
2021-10-01 10:14:28 -04:00
Mahmood Ali b2a8752c5f
honor task user when execing into raw_exec task (#9439)
Fix #9210 .

This update the executor so it honors the User when using nomad alloc exec. The bug was that the exec task didn't honor the init command when execing.
2020-11-25 09:34:10 -05:00
Mahmood Ali a89da9982d
raw_exec: don't use cgroups when no_cgroup is set (#9328)
When raw_exec is configured with [`no_cgroups`](https://www.nomadproject.io/docs/drivers/raw_exec#no_cgroups), raw_exec shouldn't attempt to create a cgroup.

Prior to this change, we accidentally always required freezer cgroup to do stats PID tracking. We already have the proper fallback in place for metrics, so only need to ensure that we don't create a cgroup for the task.

Fixes https://github.com/hashicorp/nomad/issues/8565
2020-11-11 16:20:34 -05:00
Mahmood Ali 2588b3bc98 cleanup driver eventor goroutines
This fixes few cases where driver eventor goroutines are leaked during
normal operations, but especially so in tests.

This change makes few modifications:

First, it switches drivers to use `Context`s to manage shutdown events.
Previously, it relied on callers invoking `.Shutdown()` function that is
specific to internal drivers only and require casting.  Using `Contexts`
provide a consistent idiomatic way to manage lifecycle for both internal
and external drivers.

Also, I discovered few places where we don't clean up a temporary driver
instance in the plugin catalog code, where we dispense a driver to
inspect and validate the schema config without properly cleaning it up.
2020-05-26 11:04:04 -04:00
Tim Gross aa8927abb4
volumes: return better error messages for unsupported task drivers (#8030)
When an allocation runs for a task driver that can't support volume mounts,
the mounting will fail in a way that can be hard to understand. With host
volumes this usually means failing silently, whereas with CSI the operator
gets inscrutable internals exposed in the `nomad alloc status`.

This changeset adds a MountConfig field to the task driver Capabilities
response. We validate this when the `csi_hook` or `volume_hook` fires and
return a user-friendly error.

Note that we don't currently have a way to get driver capabilities up to the
server, except through attributes. Validating this when the user initially
submits the jobspec would be even better than what we're doing here (and could
be useful for all our other capabilities), but that's out of scope for this
changeset.

Also note that the MountConfig enum starts with "supports all" in order to
support community plugins in a backwards compatible way, rather than cutting
them off from volume mounting unexpectedly.
2020-05-21 09:18:02 -04:00
Mahmood Ali bc893829bc changelog and comment 2019-11-19 15:51:08 -05:00
Mahmood Ali ea221cfe87 always destroy 2019-11-18 21:31:29 -05:00
Mahmood Ali abd700bf8f Add tests for orphaned processes 2019-11-18 21:31:29 -05:00
Mahmood Ali f98d4ee3f1 tests: enable raw_exec driver 2019-08-29 20:26:50 -04:00
Mahmood Ali 28e473aaff raw_exec: be defensive when disabled
Ensure that no raw_exec task can run on a client where it's disabled,
even if a flaw lead to client being assigned a raw_exec task
unexpectedly.
2019-08-29 09:09:40 -04:00
Nick Ethier 1fc5f86a7c
docker: support shared network namespaces 2019-07-31 01:03:20 -04:00
Nick Ethier 971c8c9c2b
Driver networking support
Adds support for passing network isolation config into drivers and
implements support in the rawexec driver as a proof of concept
2019-07-31 01:03:20 -04:00
Nick Ethier 2d60ef64d9
plugins/driver: make DriverNetworkManager interface optional 2019-07-31 01:03:19 -04:00
Nick Ethier 548f78ef15
ar: initial driver based network management 2019-07-31 01:03:17 -04:00
Mahmood Ali a4640db7a6 drivers: implement streaming exec for executor based drivers
These simply delegate call to backend executor.
2019-05-10 19:17:14 -04:00
Mahmood Ali b2b7618a1c
clarify unknown signal log line (#5466) 2019-03-25 17:19:43 -04:00
Mahmood Ali 8c82c19831 tests: IsTravis() -> IsCI()
Replace IsTravis() references that is intended for more CI environments
rather than for Travis environment specifically.
2019-02-20 08:21:03 -05:00
Mahmood Ali f7102cd01d
tests: add hcl task driver config parsing tests (#5314)
* drivers: add config parsing tests

Add basic tests for parsing and encoding task config.

* drivers/docker: fix some config declarations

* refactor and document config parse helpers
2019-02-12 14:46:37 -05:00
Nick Ethier e7ea26449e
client: fix bug during 0.8 state up grade that causes external drivers to fail 2019-01-30 14:22:29 -05:00
Nick Ethier 1f4c26e19e
raw_exec: ensure executor is killed after task is stopped 2019-01-25 23:06:31 -05:00
Nick Ethier be976d9c9a
Merge branch 'master' into f-driver-upgradepath-test
* master: (23 commits)
  tests: avoid assertion in goroutine
  spell check
  ci: run checkscripts
  tests: deflake TestRktDriver_StartWaitRecoverWaitStop
  drivers/rkt: Remove unused github.com/rkt/rkt
  drivers/rkt: allow development on non-linux
  cli: Hide `nomad docker_logger` from help output
  api: test api and structs are in sync
  goimports until make check is happy
  nil check node resources to prevent panic
  tr: use context in as select statement
  move pluginutils -> helper/pluginutils
  vet
  goimports
  gofmt
  Split hclspec
  move hclutils
  Driver tests do not use hcl2/hcl, hclspec, or hclutils
  move reattach config
  loader and singleton
  ...
2019-01-23 21:01:24 -05:00
Nick Ethier 5b9013528e
drivers: add docker upgrade path and e2e test 2019-01-23 14:44:42 -05:00
Michael Schurter 32daa7b47b goimports until make check is happy 2019-01-23 06:27:14 -08:00
Michael Schurter be0bab7c3f move pluginutils -> helper/pluginutils
I wanted a different color bikeshed, so I get to paint it
2019-01-22 15:50:08 -08:00
Alex Dadgar 72a5691897 Driver tests do not use hcl2/hcl, hclspec, or hclutils 2019-01-22 15:43:34 -08:00
Alex Dadgar b2c7268843 move reattach config 2019-01-22 15:11:58 -08:00
Alex Dadgar cdcd3c929c loader and singleton 2019-01-22 15:11:57 -08:00
Nick Ethier b840a2eb7b
drivers: fix func naming 2019-01-18 18:31:02 -05:00
Nick Ethier e3c6f89b9a
drivers: use consts for task handle version 2019-01-18 18:31:01 -05:00
Nick Ethier 9dd4eb3581
drivers: add upgrade path for rawexec, java, rkt and qemu 2019-01-18 18:31:01 -05:00
Nick Ethier 05bd369d1f
driver: add pre09 migration logic 2019-01-18 18:31:01 -05:00
Danielle Tomlinson d78120f890 rawexec: Fix Exec test on windows 2019-01-17 18:43:14 +01:00
Danielle Tomlinson 65457dd2f2 rawexec: SIGINT is not available on Windows 2019-01-17 18:43:14 +01:00
Alex Dadgar 471fdb3ccf
Merge pull request #5173 from hashicorp/b-log-levels
Plugins use parent loggers
2019-01-14 16:14:30 -08:00
Nick Ethier 9fea54e0dc
executor: implement streaming stats API
plugins/driver: update driver interface to support streaming stats

client/tr: use streaming stats api

TODO:
 * how to handle errors and closed channel during stats streaming
 * prevent tight loop if Stats(ctx) returns an error

drivers: update drivers TaskStats RPC to handle streaming results

executor: better error handling in stats rpc

docker: better control and error handling of stats rpc

driver: allow stats to return a recoverable error
2019-01-12 12:18:22 -05:00
Alex Dadgar 14ed757a56 Plugins use parent loggers
This PR fixes various instances of plugins being launched without using
the parent loggers. This meant that logs would not all go to the same
output, break formatting etc.
2019-01-11 11:36:37 -08:00
Mahmood Ali 64f80343fc drivers: re-export ResourceUsage structs
Re-export the ResourceUsage structs in drivers package to avoid drivers
directly depending on the internal client/structs package directly.

I attempted moving the structs to drivers, but that caused some import
cycles that was a bit hard to disentagle.  Alternatively, I added an
alias here that's sufficient for our purposes of avoiding external
drivers depend on internal packages, while allowing us to restructure
packages in future without breaking source compatibility.
2019-01-08 09:11:47 -05:00
Mahmood Ali 916a40bb9e move cstructs.DeviceNetwork to drivers pkg 2019-01-08 09:11:47 -05:00
Mahmood Ali 9369b123de use drivers.FSIsolation 2019-01-08 09:11:47 -05:00
Danielle Tomlinson a9b9ad34dc drivers: Implement InternalPluginDriver interface
This implements the InternalPluginDriver interface in each driver, and
calls the cancellation fn for their respective eventers.

This fixes a per task goroutine leak during test suite execution.
2019-01-08 13:49:31 +01:00
Alex Dadgar c9825a9c36 recover 2019-01-07 14:49:40 -08:00
Preetha Appan 2fb2de3cef
Standardize driver health description messages for all drivers 2019-01-06 22:06:38 -06:00
Alex Dadgar 4c57d2ec4d Add plugin API versioning to plugin loader and plugins 2018-12-18 16:48:00 -08:00
Nick Ethier a771ee59aa
rawexec: fix misleading log 2018-12-14 23:40:37 -05:00
Nick Ethier 09dadf0a23
Merge branch 'master' into f-grpc-executor
* master: (71 commits)
  Fix output of 'nomad deployment fail' with no arg
  Always create a running allocation when testing task state
  tests: ensure exec tests pass valid task resources (#4992)
  some changes for more idiomatic code
  fix iops related tests
  fixed bug in loop delay
  gofmt
  improved code for readability
  client: updateAlloc release lock after read
  fixup! device attributes in `nomad node status -verbose`
  drivers/exec: support device binds and mounts
  fix iops bug and increase test matrix coverage
  tests: tag image explicitly
  changelog
  ci: install lxc-templates explicitly
  tests: skip checking rdma cgroup
  ci: use Ubuntu 16.04 (Xenial) in TravisCI
  client: update driver info on new fingerprint
  drivers/docker: enforce volumes.enabled (#4983)
  client: Style: use fluent style for building loggers
  ...
2018-12-13 14:41:09 -05:00
Mahmood Ali 23c07b9afe tests: update stop/kill tests with new pattern
Update rawexec and rkt stop/kill tests with the patterns introduced in
7a49e9b68e519050a0c2ef0b67c33503bfbc51be.  This implementation should be
more resilient to discrepancy between task stopping and task being marked as exited.
2018-12-11 09:35:26 -05:00