Commit graph

3467 commits

Author SHA1 Message Date
Danielle Tomlinson f3fa9d1406 gc: Wait for allocrunners to be destroyed 2018-12-18 23:38:33 +01:00
Danielle Tomlinson cb78a90f40 client: Async API for shutdown/destroy allocrunners 2018-12-18 23:38:33 +01:00
Danielle Tomlinson d1fbac1aad allocrunner: Async shutdown and destroy
This commit reduces the locking required to shutdown or destroy
allocrunners, and allows parallel shutdown and destroy of allocrunners during
shutdown.
2018-12-18 23:38:33 +01:00
Danielle Tomlinson d9174d8dcf
Merge pull request #4989 from hashicorp/dani/b-client-update-race-condition
client: Give a copy of clientconfig to allocrunner
2018-12-17 10:49:46 +01:00
Danielle Tomlinson 53aa1bc198
Merge pull request #5004 from hashicorp/dani/f-hook-errors
client: Emit TaskEvents when task hooks fail
2018-12-17 10:42:57 +01:00
Danielle Tomlinson a50ea29da4 taskrunner: Use hook errors for artifacts 2018-12-17 10:39:38 +01:00
Mahmood Ali 2d2c562e18 Remove implicit check
I intended to remove this line in 29ef7ecf2372f980d12a9900e1b2a351568dd415 - see my notes there for details.
2018-12-16 09:14:26 -05:00
Mahmood Ali d58e38e912 tests: avoid implicitly asserting clean shutdown
The assertion here is causing many spurious failures that aren't
actually relevant to the test itself.

We are tracking the cause for this failure independently, and it would
make more sense to have a dedicated test for clean shutdown.
2018-12-15 15:30:09 -05:00
Danielle Tomlinson 3647b701a6 taskrunner: Emit task events when a hook fails 2018-12-13 18:20:18 +01:00
Danielle Tomlinson 8b06e8d297
Merge pull request #4990 from hashicorp/dani/b-alloc-lock
client: updateAlloc release lock after read
2018-12-13 12:43:59 +01:00
Danielle Tomlinson 3823599da9 client: Give a copy of clientconfig to allocrunner
Currently, there is a race condition between creating a taskrunner, and
updating node attributes via fingerprinting.

This is because the taskenv builder will try to iterate over the
clientconfig.Node.Attributes map, which can be concurrently updated by
the fingerprinting process, thus causing a panic.

This fixes that by providing a copy of the clientconfg to the
allocrunner inside the Read lock during config creation.
2018-12-13 12:42:15 +01:00
Alex Dadgar 20c59df8b9
Merge pull request #4969 from hashicorp/f-alloc-hooks
Make alloc health watcher a postrun hook rather than shutdown hook
2018-12-12 14:34:36 -08:00
Danielle Tomlinson 4184eadaf4 client: updateAlloc release lock after read
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid hanging on to an exclusive lock
of the map while applying changes to allocrunners themselves, as they
should be internally consistent.

This fixes a bug where any client allocation api will block during the
shutdown or updating of an allocrunner and its child taskrunners.
2018-12-12 16:30:01 +01:00
Mahmood Ali 3d166e6e9c
Merge pull request #4984 from hashicorp/b-client-update-driver
client: update driver info on new driver fingerprint
2018-12-11 18:01:03 -05:00
Mahmood Ali 69b2355274
Merge pull request #4975 from hashicorp/fix-master-20181209
Some test fixes and remedies
2018-12-11 18:00:21 -05:00
Alex Dadgar 1531b6d534
Merge pull request #4970 from hashicorp/f-no-iops
Deprecate IOPS
2018-12-11 12:51:22 -08:00
Mahmood Ali ba515947c2 client: update driver info on new fingerprint
Fixes a bug where a driver health and attributes are never updated from
their initial status.  If a driver started unhealthy, it may never go
into a healthy status.
2018-12-11 14:25:10 -05:00
Danielle Tomlinson ed1791f4bf client: Style: use fluent style for building loggers 2018-12-11 18:03:45 +01:00
Danielle Tomlinson 805669ead4 client: Correctly pass a noop PrevAllocMigrator when restoring 2018-12-11 15:46:58 +01:00
Mahmood Ali 3babda5d45 tests: no need for buffer channel 2018-12-11 09:35:26 -05:00
Mahmood Ali 5a487ac884 tests: prevent indefinite blocking in some tests
Noticed few places where tests seem to block indefinitely and panic
after the test run reaches the test package timeout.

I intend to follow up with the proper fix later, but timing out is much
better than indefinitely blocking.
2018-12-11 09:35:26 -05:00
Mahmood Ali 4635168f20 test: fix TestFingerprintManager_Run_Combination
Let's use a fingerprinter that doesn't have values prepopulated in test
fixtures.
2018-12-11 09:35:26 -05:00
Danielle Tomlinson 6fb5ca6ad5 allocrunner: Test alloc runners should include a noop migrator 2018-12-11 13:12:35 +01:00
Danielle Tomlinson 4b4b85e3f4 allocwatcher: Cleanup new migrator/watcher interface 2018-12-11 13:12:35 +01:00
Danielle Tomlinson 83720575de client: Unify handling of previous and preempted allocs 2018-12-11 13:12:35 +01:00
Danielle Tomlinson dff7093243 client: Wait for preempted allocs to terminate
When starting an allocation that is preempting other allocs, we create a
new group allocation watcher, and then wait for the allocations to
terminate in the allocation PreRun hooks.

If there's no preempted allocations, then we simply provide a
NoopAllocWatcher.
2018-12-11 00:59:18 +01:00
Danielle Tomlinson 2cdef6a7b4 allocwatcher: Add Group AllocWatcher
The Group Alloc watcher is an implementation of a PrevAllocWatcher that
can wait for multiple previous allocs before terminating.

This is to be used when running an allocation that is preempting upstream
allocations, and thus only supports being ran with a local alloc watcher.

It also currently requires all of its child watchers to correctly handle
context cancellation. Should this be a problem, it should be fairly easy
to implement a replacement using channels rather than a waitgroup.

It obeys the PrevAllocWatcher interface for convenience, but it may be
better to extract Migration capabilities into a seperate interface for
greater clarity.
2018-12-11 00:58:27 +01:00
Marcin Matlaszek 39eec70f31
Recover from any possible io error when invoking Write on FileRotator
As of now, FileRotator uses bufio.Write under the hood to write data to
configured output file. Due to the way how bufio handles any occurred io
error - saves it into `err` variable never resetting it automatically -
any operation like `Write`, `Flush` etc will become a no-op, returning the very same,
saved error (eg. Out of disk space) even when the problem is fixed (eg. disk
space is available again).

That automatically means that FileRotator will stop writing any logs,
reporting the same error over and over again, even if it's no longer
valid.

This PR fixes it by resetting the bufio Writer, which resets any errors
and tries to write requested data.
2018-12-07 18:22:29 +01:00
Alex Dadgar 1e3c3cb287 Deprecate IOPS
IOPS have been modelled as a resource since Nomad 0.1 but has never
actually been detected and there is no plan in the short term to add
detection. This is because IOPS is a bit simplistic of a unit to define
the performance requirements from the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.
2018-12-06 15:09:26 -08:00
Danielle Tomlinson e3621c55fa gc: Fix maxallocs integration test 2018-12-06 21:50:50 +01:00
Alex Dadgar c4b5f80918 Make alloc health watcher a postrun hook rather than shutdown hook 2018-12-06 12:30:31 -08:00
Danielle Tomlinson 62b98e64ca client/gc: Replace GC integration test with unit
The previous integration test was broken during the client refactor, and
it seems to be some sort of race with state updating.

I'm going to try and construct a replacement test as part of work on
performance, but for now, the underlying behaviour is still being
tested.
2018-12-06 12:28:23 +01:00
Danielle Tomlinson f6e474fd55 client: Re-enable GC tests 2018-12-06 12:28:23 +01:00
Danielle Tomlinson d043532cb0 allocrunner: Basic test alloc runner 2018-12-06 12:28:23 +01:00
Alex Dadgar b39c21d49c Fix various bugs with task events
Fixes the following:
* Emitting events when the task fails to start
* Don't double emit events on task shutdown (nomad stop)
* Don't emit a OOM kill metric unless actually OOM'd
2018-12-05 14:27:07 -08:00
Danielle Tomlinson 10b3e68a6d
Merge pull request #4925 from hashicorp/f-driver-plugins-dani
Third Party Driver Plugins Support
2018-12-03 20:48:19 +01:00
Mahmood Ali 88622b97bd
libcontainer to manage /dev and /proc (#4945)
libcontainer already manages `/dev`, overriding task_dir - so let's use it for `/proc` as well and remove deadcode.
2018-12-03 10:41:01 -05:00
Danielle Tomlinson 9bd77e9295 testfix: Fix import cycle in allocdir tests 2018-12-01 17:25:30 +01:00
Danielle Tomlinson 66c521ca17 client: Move fingerprint structs to pkg
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically, drivers. Due to
client/config also depending on the plugin_loader.

It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
2018-12-01 17:10:39 +01:00
Danielle Tomlinson 2db5ae38d8 client: Rename drivers/shared/env => client/taskenv 2018-11-30 12:18:39 +01:00
Danielle Tomlinson f3a77b8084 client: Merge driver/shared/structs and client/structs 2018-11-30 10:56:45 +01:00
Danielle Tomlinson b9295f0d56 client/driver: Remove package 2018-11-30 10:47:08 +01:00
Danielle Tomlinson fdfe93aa25 fixup: executorplugin: fix rkt build 2018-11-30 10:47:08 +01:00
Danielle Tomlinson d72ecd95ec client/driver: Vendor setEnvvars into docker_test 2018-11-30 10:46:13 +01:00
Danielle Tomlinson d26a310db0 client: Move executor plugins into own package 2018-11-30 10:46:13 +01:00
Danielle Tomlinson d259c36844 driver: Flatten SetEnvvars into taskdirhook 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 6b72e96eba client: Move driver/logging to logmon/logging
The logging package is used by logmon and the legacy mock_driver. Because the
legacy drivers are going away, I'm moving it here to signify its actual
ownership.
2018-11-30 10:46:13 +01:00
Danielle Tomlinson 04c8851b4c client: Migrate DriverStats optout to drivers/shared/structs 2018-11-30 10:46:13 +01:00
Danielle Tomlinson dbd82e1af4 client: Remove test dependency on client/driver 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 0544a57abe drivers: Move client/drivers/executor to drivers/shared/executor 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 1a29811169 drivers: Move client/drivers/env to drivers/shared/env
As part of deprecating legacy drivers, we're moving the env package to a
new drivers/shared tree, as it is used by the modern docker and rkt
driver packages, and is useful for 3rd party plugins.
2018-11-30 10:46:13 +01:00
Nick Ethier bbe420718a
Merge pull request #4922 from hashicorp/f-drivermananger
add generic plugin manager interface and orchestration
2018-11-28 22:17:04 -05:00
Preetha 1f526db414
Merge pull request #4919 from hashicorp/f-fingerprint-attribute-type
Modify fingerprint interface to use typed attribute struct
2018-11-28 14:18:28 -06:00
Michael Schurter 1bd9a9f9dd
Merge pull request #4894 from hashicorp/f-device-hook
Device hook and devices affect computed node class
2018-11-28 12:10:43 -06:00
Preetha Appan f89dbcd9cc
modify fingerprint interface to use typed attribute struct 2018-11-28 10:01:03 -06:00
Nick Ethier 60c6907ea5
client/plugin: remove println from plugin group func 2018-11-27 22:45:09 -05:00
Nick Ethier 600738e991
client/plugin: lint/spelling errors 2018-11-27 22:45:09 -05:00
Nick Ethier 45a6bf7acd
client/plugin: add generic plugin mananger interface and orchestration 2018-11-27 22:45:03 -05:00
Mahmood Ali ad1f8d8c20 Fixes in old lxc driver 2018-11-27 21:40:43 -05:00
Michael Schurter 3e56ee005a add nil check around task resources in device hook
Looking at NewTaskRunner I'm unsure whether TaskRunner.TaskResources
(from which req.TaskResources is set) is intended to be nil at times or
if the TODO in NewTaskRunner is intended to ensure it is always non-nil.
2018-11-27 17:25:33 -08:00
Michael Schurter b75e9fce37 assume that slices contain only non-nil items 2018-11-27 17:25:33 -08:00
Michael Schurter 85073f9d29 client: properly support hook env vars
The old approach was incomplete. Hook env vars are now:

 * persisted and restored between agent restarts
 * deterministic (LWW if 2 hooks set the same key)
2018-11-27 17:25:33 -08:00
Alex Dadgar 4ee603c382 Device hook and devices affect computed node class
This PR introduces a device hook that retrieves the device mount
information for an allocation. It also updates the computed node class
computation to take into account devices.

TODO Fix the task runner unit test. The environment variable is being
lost even though it is being properly set in the prestart hook.
2018-11-27 17:25:33 -08:00
Michael Schurter 27e07f657e
Merge pull request #4896 from hashicorp/b-prevalloc-deadlock
Fix deadlock in previous alloc watcher by emitting last alloc update
2018-11-27 19:07:16 -06:00
Michael Schurter b75f79a793 fix test breakage caused by rebase 2018-11-27 16:34:01 -08:00
Michael Schurter 91da566935 fix mispelings 2018-11-27 16:33:55 -08:00
Chris Baker a1fb1f3830
Merge pull request #4891 from hashicorp/b-1150-rkt-volume-names
drivers/rkt: fix invalid volumes
2018-11-27 18:55:00 -05:00
Danielle Tomlinson 3651dbdc25
Merge pull request #4909 from hashicorp/b-restart-delay
taskrunner: Return the restart delay correctly
2018-11-27 23:55:54 +01:00
Michael Schurter 22149a661e client: comment on importance of chan ops ordering 2018-11-27 14:11:32 -08:00
Mahmood Ali 05a958dc21 Update client/structs/broadcaster.go
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2018-11-27 14:06:08 -08:00
Michael Schurter 81b6a24a84 client: fix send-after-close in broadcaster 2018-11-27 14:06:08 -08:00
Michael Schurter c429e6b0ab client: check if prev alloc is already terminated
This is a defensive fast-path as 7c6aa0be already fixed the deadlock.
2018-11-27 14:06:08 -08:00
Michael Schurter 944ea6d38b client: emit last sent alloc to new listeners
Fixes a deadlock where the allocwatcher would block forever waiting for
an update from a terminal alloc.

Made the broadcaster easier to debug as well.
2018-11-27 14:06:08 -08:00
Michael Schurter 1e4ef139dd
Merge pull request #4883 from hashicorp/f-graceful-shutdown
Support graceful shutdowns in agent
2018-11-27 15:55:15 -06:00
Michael Schurter 4f7e6f9464 client: fix races in use of goroutine group
The group utility struct does not support asynchronously launched
goroutines (goroutines-inside-of-goroutines), so switch those uses to a
normal go call.

This means watchNodeUpdates and watchNodeEvents may not be shutdown when
Shutdown() exits. During nomad agent shutdown this does not matter.

During tests this means a test may leak those goroutines or be unable to
know when those goroutines have exited.

Since there's no runtime impact and these goroutines do not affect alloc
state syncing it seems ok to risk leaking them.
2018-11-26 12:52:55 -08:00
Michael Schurter 9f43fb6d29 client: reuse group instead of diy'ing it 2018-11-26 12:52:31 -08:00
Michael Schurter 22771aa19e client/ar: remove useless wait ch from runTasks
Arguably this makes task.WaitCh() useless, but I think exposing a wait
chan from TaskRunners is a generically useful API.
2018-11-26 12:51:18 -08:00
Michael Schurter 2fdd013956 client: document how AR/TR Run methods behave 2018-11-26 12:50:35 -08:00
Chris Baker 9bd4317139 modified TaskConfig to include AllocID
use this for volume names in drivers/rkt to address #1150
2018-11-26 18:54:26 +00:00
Nick Ethier 95362eaa02
Merge pull request #4844 from hashicorp/f-docker-plugin
Docker driver plugin
2018-11-20 20:43:03 -05:00
Mahmood Ali e1994e59bd address review comments 2018-11-20 17:10:54 -05:00
Mahmood Ali 171b73fde7 Emit metric counters for Vault token and renewal failures 2018-11-20 17:10:54 -05:00
Mahmood Ali 5b10da5de6 Set User-Agent header when hitting Vault API 2018-11-20 17:10:54 -05:00
Danielle Tomlinson 093f029d5b taskrunner: Return the restart delay correctly
We were incorrectly returning a 0 duration to the taskrunner when
determining when a task should restart. This would cause tasks to be
restarted immediately, ignoring the restart {} stanza in a users
configuration.

This commit causes us to return the restart duration to the task runner
so it may correctly delay further execution.
2018-11-20 21:52:23 +01:00
Nick Ethier 3e42d6914e
task_runner: use NodeResources instead of deprecated struct 2018-11-20 13:46:39 -05:00
Nick Ethier 93c0200566
task_runner: use task and alloc copies instead of referencing the original pointer 2018-11-20 13:34:46 -05:00
Nick Ethier 29591a7c2e
task_runner: emit event on task exit with exit result details 2018-11-19 22:59:17 -05:00
Nick Ethier 4be8a86ef9
plugins/driver: remove NodeResources from task Resources and use PercentTicks field for docker driver 2018-11-19 22:59:17 -05:00
Nick Ethier 69049d37f5
drivers: added NodeResources to drivers.TaskConfig 2018-11-19 22:59:16 -05:00
Nick Ethier 8f8698b3e1
docker: started work on porting docker driver to new plugin framework 2018-11-19 22:59:15 -05:00
Michael Schurter 88577fe083 client.rpc: don't log errors on shutdown 2018-11-19 16:39:30 -08:00
Michael Schurter 5bd744ac3d client: support graceful shutdowns
Client.Shutdown now blocks until all AllocRunners and TaskRunners have
exited their Run loops. Tasks are left running.
2018-11-19 16:39:30 -08:00
Mahmood Ali 9479015f51
Merge pull request #4884 from hashicorp/f-alloc-devices-cli
Report alloc device statistics in API and CLI
2018-11-16 18:04:54 -05:00
Mahmood Ali f139234372 address review comments 2018-11-16 17:13:01 -05:00
Mahmood Ali f72e599ee7 Populate alloc stats API with device stats
This change makes few compromises:

* Looks up the devices associated with tasks at look up time.  Given
that `nomad alloc status` is called rarely generally (compared to stats
telemetry and general job reporting), it seems fine.  However, the
lookup overhead grows bounded by number of `tasks x total-host-devices`,
which can be significant.

* `client.Client` performs the task devices->statistics lookup.  It
passes self to alloc/task runners so they can look up the device statistics
allocated to them.
  * Currently alloc/task runners are responsible for constructing the
entire RPC response for stats
  * The alternatives for making task runners device statistics aware
don't seem appealing (e.g. having task runners contain reference to hostStats)

* On the alloc aggregation resource usage, I did a naive merging of task device statistics.
  * Personally, I question the value of such aggregation, compared to
costs of struct duplication and bloating the response - but opted to be
consistent in the API.
  * With naive concatination, device instances from a single device group used by separate tasks in the alloc, would be aggregated in two separate device group statistics.
2018-11-16 10:26:32 -05:00
Michael Schurter 0cdb188ae4 tests: fix tests post-rebase 2018-11-15 17:40:56 -08:00
Michael Schurter 59f106ecee client/tr: add a bit of context to envbuilder errors 2018-11-15 16:26:25 -08:00
Michael Schurter 742f8775ba client: remove old proxy references from comments 2018-11-15 16:26:25 -08:00
Michael Schurter 2d0b44c3b4 client: test more env key variations 2018-11-15 16:26:25 -08:00
Michael Schurter 8bcd90d78d client: add new nested variables to task's hcl ctx
The error messages are really bad, but it's extremely difficult to
produce good error messages without the original HCL.
2018-11-15 16:26:25 -08:00