open-nomad

Author	SHA1	Message	Date
Michael Schurter	875e231511	Merge pull request #5038 from hashicorp/b-drivermanager-tests WIP: fix failing tests caused by async driver manager	2019-01-03 12:32:18 -08:00
Danielle Tomlinson	35a4790740	Merge pull request #5142 from hashicorp/dani/cleanup-allocrunner-logs allocrunner: Standardised discard logs	2019-01-03 18:40:48 +01:00
Preetha	8078cb79f0	Merge pull request #5140 from hashicorp/dani/b-taskrunner taskrunner: Persist environment from hooks	2019-01-03 09:30:52 -06:00
Danielle Tomlinson	29196ca70e	allocrunner: Standardised discard logs Follow up from https://github.com/hashicorp/nomad/pull/5007#pullrequestreview-186739124	2019-01-03 14:04:31 +01:00
Danielle Tomlinson	1c8baf7db7	chore: Fix environement->environment typo	2019-01-03 13:31:30 +01:00
Danielle Tomlinson	28aa34ea78	taskrunner: Persist environment from hooks https://github.com/hashicorp/nomad/pull/5032 introduced a regression where the origHookState was used in place of the response from the hook.	2019-01-03 13:13:57 +01:00
Alex Dadgar	d7d32c2f61	Merge pull request #5032 from hashicorp/f-driver-env Store device envs separately and pass to drivers	2018-12-20 13:38:27 -08:00
Michael Schurter	e47a3ceed6	taskenv: have maps take precedence over primitives The Bug: You may have seen log lines like this when running 0.9.0-dev: ``` ... client.alloc_runner.task_runner: some environment variables not available for rendering: ... keys="attr.driver.docker.volumes.enabled, attr.driver.docker.version, attr.driver.docker.bridge_ip, attr.driver.qemu.version" ``` Not only should we not be erroring on builtin driver attributes, but the results were nondeterministic due to map iteration order! The root cause is that we have an old root attribute for all drivers like: ``` attr.driver.docker = "1" ``` When attributes were opaque variable names it was fine to also have "nested" attributes like: ``` attr.driver.docker.version = "1.2.3" ``` However in the HCLv2 world the variable names are no longer opaque: they form an object tree. The `docker` object can no longer both hold a value (`"1"`) and nested attributes (`version = "1.2.3"`). The Fix: Since the old `attr.driver.<name> = "1"` attributes are useless for task config interpolation, create a new precedence rule for creating the task config evaluation context: Maps take precedence over primitives. This means `attr.driver.docker.version` will always take precedence over `attr.driver.docker`. The results are deterministic and give users access to the more useful metadata. I made this a general precedence rule instead of special-casing driver attrs because it seemed like better default behavior than spamming WARNings to logs that were likely unactionable by users.	2018-12-20 11:37:46 -08:00
Nick Ethier	a96afb6c91	fix tests that fail as a result of async client startup	2018-12-20 00:53:44 -05:00
Nick Ethier	6c43ccf628	client: add proper build flag to allocrunner testing.go	2018-12-19 20:22:07 -05:00
Michael Schurter	0a0fb6f86d	test: re-enable periodic fingerprint test	2018-12-19 17:08:24 -08:00
Michael Schurter	add2dd8c2d	test: copy AR's Alloc before mutating Fixes a race in client tests	2018-12-19 15:48:02 -08:00
Michael Schurter	17ed3f27ae	drivermgr: fix race in building driver list	2018-12-19 15:48:02 -08:00
Michael Schurter	4448f19413	Merge pull request #5030 from hashicorp/test-client-statusupdate client: assert alloc status updates work	2018-12-19 14:55:34 -08:00
Alex Dadgar	9d34802f7a	Store device envs separately and pass to drivers	2018-12-19 14:23:09 -08:00
Michael Schurter	951100af16	client: assert alloc status updates work Re-enabling and updating an old test. Able to cut out a ton of extra work by using WaitForRunning which does almost everything this test needs.	2018-12-19 11:41:53 -08:00
Michael Schurter	ee23bdafbc	client/state: missing deploy status isn't an error Fixes TestClient_SaveRestoreState	2018-12-19 10:39:27 -08:00
Michael Schurter	c84998e996	tests: implement HasHealth for mock health	2018-12-19 10:39:27 -08:00
Michael Schurter	ba1ddd2238	gofmt -s -w upgrade_int_test.go	2018-12-19 10:39:27 -08:00
Michael Schurter	337d07fdd8	client/state: improve upgradeTaskBucket error handling And add a test	2018-12-19 10:39:27 -08:00
Michael Schurter	c5ddcb6a15	client/state: add context to errors Unfortunately I don't know how to test these errors. As far as I can tell they should only happen if there was a programming error in the upgrade code or the underlying boltdb was corrupted somehow.	2018-12-19 10:39:27 -08:00
Michael Schurter	99bd5b3422	client/state: use 2 as version; test error path	2018-12-19 10:39:27 -08:00
Michael Schurter	d9ea8252a7	client/state: support upgrading from 0.8->0.9 Also persist and load DeploymentStatus to avoid rechecking health after client restarts.	2018-12-19 10:39:27 -08:00
Michael Schurter	0018b2f659	client/state: reorg state buckets to ease transition * Prefix task bucket with task- to prevent name conflicts * Shorten device manager bucket name * Remove commented out outdated var * Update layout comment	2018-12-19 10:22:28 -08:00
Michael Schurter	461599ff20	tr: fix HookState Copy() and Equal() methods They did not take into account the Env field.	2018-12-19 09:58:06 -08:00
Danielle Tomlinson	c580512d32	allocrunner: Close updates routine correctly	2018-12-19 18:32:51 +01:00
Nick Ethier	969ec51730	devicemanager: fix devicemanager tests	2018-12-19 00:35:12 -05:00
Nick Ethier	6f1777284d	drivermanager: use correct plugin config types	2018-12-18 23:07:01 -05:00
Nick Ethier	a02308ee6a	drivermanager: attempt to reattach and shutdown driver plugin if blocked by allow/block lists	2018-12-18 23:01:57 -05:00
Nick Ethier	ce1a5cba0e	drivermanager: use allocID and task name to route task events	2018-12-18 23:01:51 -05:00
Nick Ethier	bda32f9c79	client/pluginmanager: add plugin manager interface to device/driver managers	2018-12-18 22:56:23 -05:00
Nick Ethier	d8a0265e68	client: batch initial fingerprinting in plugin managers drivermanager: fix pr comments/feedback	2018-12-18 22:56:19 -05:00
Nick Ethier	7d23cbf448	client/drivermanager: fixup issues from rebase and address PR comments	2018-12-18 22:55:38 -05:00
Nick Ethier	1543335710	tr: deregister task handler on cleanup	2018-12-18 22:55:38 -05:00
Nick Ethier	82175d1328	client/drivermanager: add driver manager The driver manager is modeled after the device manager and is started by the client. It's responsible for handling driver lifecycle and reattachment state, as well as processing the incoming fingerprint and task events from each driver. The manager exposes a method for registering event handlers for task events that is used by the task runner to update the server when a task has been updated with an event. Since driver fingerprinting has been implemented by the driver manager, it is no longer needed in the fingerprint manager and has been removed.	2018-12-18 22:55:18 -05:00
Alex Dadgar	730a6f5b9a	lint	2018-12-18 16:48:00 -08:00
Alex Dadgar	4c57d2ec4d	Add plugin API versioning to plugin loader and plugins	2018-12-18 16:48:00 -08:00
Alex Dadgar	9d1403d617	Merge pull request #5002 from hashicorp/b-task-config-resources Convert driver resource to AllocatedTaskResource	2018-12-18 16:46:34 -08:00
Danielle Tomlinson	0edc65631a	Merge pull request #5007 from hashicorp/dani/f-allocrunner-async allocrunner: Async api for shutdown/destroy/update	2018-12-19 01:26:41 +01:00
Alex Dadgar	8efac7ec81	Fix unit tests + upgrade pathing resources	2018-12-18 15:50:44 -08:00
Alex Dadgar	b8268d9a46	Lint	2018-12-18 15:50:44 -08:00
Alex Dadgar	66cf3156b2	LinuxResources doesn't use task.Resources	2018-12-18 15:50:44 -08:00
Alex Dadgar	327b551b39	Drivers	2018-12-18 15:50:11 -08:00
Alex Dadgar	b653ae2af7	utilities	2018-12-18 15:48:52 -08:00
Danielle Tomlinson	95a0c4fb29	taskrunner: Use a random suffix for Task Config The RestartCount is not really suitable for use as a source of uniqueness within task invocations as it is not monotonic, and interacts with the restart stanza in a user's config, so conflates restarts due to task failures, with restarts due to environmental changes, such as consul template or vault secrets changing. Here we instead use a substring from a uuid, which is more random than we strictly need, but is nicer than rolling our own random string generator here.	2018-12-19 00:38:54 +01:00
Danielle Tomlinson	1be0170ebe	client: Update tests for async destroy	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	d6eb084d8a	allocrunner: Drop and log updates after closing waitCh	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	0d91285cd6	allocrunner: Documentation for ShutdownCh/DestroyCh	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	f2bb13818e	fixup: Log when we detect out of order updates	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	986fde0f5a	allocrunner: Handle updates asynchronously This creates a new buffered channel and goroutine on the allocrunner for serializing updates to allocations. This allows us to take updates off the routine that is used for processing updates from the server, without having complicated machinery for tracking update lifetimes, or other external synchronization. This results in a nice performance improvement and significantly better throughput on batch changes such as preempting a large number of jobs for a larger placement.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	f3fa9d1406	gc: Wait for allocrunners to be destroyed	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	cb78a90f40	client: Async API for shutdown/destroy allocrunners	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	d1fbac1aad	allocrunner: Async shutdown and destroy This commit reduces the locking required to shutdown or destroy allocrunners, and allows parallel shutdown and destroy of allocrunners during shutdown.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	d9174d8dcf	Merge pull request #4989 from hashicorp/dani/b-client-update-race-condition client: Give a copy of clientconfig to allocrunner	2018-12-17 10:49:46 +01:00
Danielle Tomlinson	53aa1bc198	Merge pull request #5004 from hashicorp/dani/f-hook-errors client: Emit TaskEvents when task hooks fail	2018-12-17 10:42:57 +01:00
Danielle Tomlinson	a50ea29da4	taskrunner: Use hook errors for artifacts	2018-12-17 10:39:38 +01:00
Mahmood Ali	2d2c562e18	Remove implicit check I intended to remove this line in 29ef7ecf2372f980d12a9900e1b2a351568dd415 - see my notes there for details.	2018-12-16 09:14:26 -05:00
Mahmood Ali	d58e38e912	tests: avoid implicitly asserting clean shutdown The assertion here is causing many spurious failures that aren't actually relevant to the test itself. We are tracking the cause for this failure independently, and it would make more sense to have a dedicated test for clean shutdown.	2018-12-15 15:30:09 -05:00
Danielle Tomlinson	3647b701a6	taskrunner: Emit task events when a hook fails	2018-12-13 18:20:18 +01:00
Danielle Tomlinson	8b06e8d297	Merge pull request #4990 from hashicorp/dani/b-alloc-lock client: updateAlloc release lock after read	2018-12-13 12:43:59 +01:00
Danielle Tomlinson	3823599da9	client: Give a copy of clientconfig to allocrunner Currently, there is a race condition between creating a taskrunner, and updating node attributes via fingerprinting. This is because the taskenv builder will try to iterate over the clientconfig.Node.Attributes map, which can be concurrently updated by the fingerprinting process, thus causing a panic. This fixes that by providing a copy of the clientconfig to the allocrunner inside the Read lock during config creation.	2018-12-13 12:42:15 +01:00
Alex Dadgar	20c59df8b9	Merge pull request #4969 from hashicorp/f-alloc-hooks Make alloc health watcher a postrun hook rather than shutdown hook	2018-12-12 14:34:36 -08:00
Danielle Tomlinson	4184eadaf4	client: updateAlloc release lock after read The allocLock is used to synchronize access to the alloc runner map, not to ensure internal consistency of the alloc runners themselves. This updates the updateAlloc process to avoid hanging on to an exclusive lock of the map while applying changes to allocrunners themselves, as they should be internally consistent. This fixes a bug where any client allocation api will block during the shutdown or updating of an allocrunner and its child taskrunners.	2018-12-12 16:30:01 +01:00
Mahmood Ali	3d166e6e9c	Merge pull request #4984 from hashicorp/b-client-update-driver client: update driver info on new driver fingerprint	2018-12-11 18:01:03 -05:00
Mahmood Ali	69b2355274	Merge pull request #4975 from hashicorp/fix-master-20181209 Some test fixes and remedies	2018-12-11 18:00:21 -05:00
Alex Dadgar	1531b6d534	Merge pull request #4970 from hashicorp/f-no-iops Deprecate IOPS	2018-12-11 12:51:22 -08:00
Mahmood Ali	ba515947c2	client: update driver info on new fingerprint Fixes a bug where a driver health and attributes are never updated from their initial status. If a driver started unhealthy, it may never go into a healthy status.	2018-12-11 14:25:10 -05:00
Danielle Tomlinson	ed1791f4bf	client: Style: use fluent style for building loggers	2018-12-11 18:03:45 +01:00
Danielle Tomlinson	805669ead4	client: Correctly pass a noop PrevAllocMigrator when restoring	2018-12-11 15:46:58 +01:00
Mahmood Ali	3babda5d45	tests: no need for buffer channel	2018-12-11 09:35:26 -05:00
Mahmood Ali	5a487ac884	tests: prevent indefinite blocking in some tests Noticed a few places where tests seem to block indefinitely and panic after the test run reaches the test package timeout. I intend to follow up with the proper fix later, but timing out is much better than indefinitely blocking.	2018-12-11 09:35:26 -05:00
Mahmood Ali	4635168f20	test: fix TestFingerprintManager_Run_Combination Let's use a fingerprinter that doesn't have values prepopulated in test fixtures.	2018-12-11 09:35:26 -05:00
Danielle Tomlinson	6fb5ca6ad5	allocrunner: Test alloc runners should include a noop migrator	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	4b4b85e3f4	allocwatcher: Cleanup new migrator/watcher interface	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	83720575de	client: Unify handling of previous and preempted allocs	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	dff7093243	client: Wait for preempted allocs to terminate When starting an allocation that is preempting other allocs, we create a new group allocation watcher, and then wait for the allocations to terminate in the allocation PreRun hooks. If there's no preempted allocations, then we simply provide a NoopAllocWatcher.	2018-12-11 00:59:18 +01:00
Danielle Tomlinson	2cdef6a7b4	allocwatcher: Add Group AllocWatcher The Group Alloc watcher is an implementation of a PrevAllocWatcher that can wait for multiple previous allocs before terminating. This is to be used when running an allocation that is preempting upstream allocations, and thus only supports being run with a local alloc watcher. It also currently requires all of its child watchers to correctly handle context cancellation. Should this be a problem, it should be fairly easy to implement a replacement using channels rather than a waitgroup. It obeys the PrevAllocWatcher interface for convenience, but it may be better to extract Migration capabilities into a separate interface for greater clarity.	2018-12-11 00:58:27 +01:00
Marcin Matlaszek	39eec70f31	Recover from any possible io error when invoking Write on FileRotator As of now, FileRotator uses bufio.Write under the hood to write data to the configured output file. Because of the way bufio handles an io error (it saves the error into its `err` variable and never resets it automatically), any operation like `Write`, `Flush` etc will become a no-op, returning the very same saved error (e.g. out of disk space) even when the problem is fixed (e.g. disk space is available again). That automatically means that FileRotator will stop writing any logs, reporting the same error over and over again, even if it's no longer valid. This PR fixes it by resetting the bufio Writer, which resets any errors and tries to write the requested data.	2018-12-07 18:22:29 +01:00
Alex Dadgar	1e3c3cb287	Deprecate IOPS IOPS have been modelled as a resource since Nomad 0.1 but has never actually been detected and there is no plan in the short term to add detection. This is because IOPS is a bit simplistic of a unit to define the performance requirements from the underlying storage system. In its current state it adds unnecessary confusion and can be removed without impacting any users. This PR leaves IOPS defined at the jobspec parsing level and in the api/ resources since these are the two public uses of the field. These should be considered deprecated and only exist to allow users to stop using them during the Nomad 0.9.x release. In the future, there should be no expectation that the field will exist.	2018-12-06 15:09:26 -08:00
Danielle Tomlinson	e3621c55fa	gc: Fix maxallocs integration test	2018-12-06 21:50:50 +01:00
Alex Dadgar	c4b5f80918	Make alloc health watcher a postrun hook rather than shutdown hook	2018-12-06 12:30:31 -08:00
Danielle Tomlinson	62b98e64ca	client/gc: Replace GC integration test with unit The previous integration test was broken during the client refactor, and it seems to be some sort of race with state updating. I'm going to try and construct a replacement test as part of work on performance, but for now, the underlying behaviour is still being tested.	2018-12-06 12:28:23 +01:00
Danielle Tomlinson	f6e474fd55	client: Re-enable GC tests	2018-12-06 12:28:23 +01:00
Danielle Tomlinson	d043532cb0	allocrunner: Basic test alloc runner	2018-12-06 12:28:23 +01:00
Alex Dadgar	b39c21d49c	Fix various bugs with task events Fixes the following: * Emitting events when the task fails to start * Don't double emit events on task shutdown (nomad stop) * Don't emit a OOM kill metric unless actually OOM'd	2018-12-05 14:27:07 -08:00
Danielle Tomlinson	10b3e68a6d	Merge pull request #4925 from hashicorp/f-driver-plugins-dani Third Party Driver Plugins Support	2018-12-03 20:48:19 +01:00
Mahmood Ali	88622b97bd	libcontainer to manage /dev and /proc (#4945 ) libcontainer already manages `/dev`, overriding task_dir - so let's use it for `/proc` as well and remove deadcode.	2018-12-03 10:41:01 -05:00
Danielle Tomlinson	9bd77e9295	testfix: Fix import cycle in allocdir tests	2018-12-01 17:25:30 +01:00
Danielle Tomlinson	66c521ca17	client: Move fingerprint structs to pkg This removes a cyclical dependency when importing client/structs from dependencies of the plugin_loader, specifically, drivers. Due to client/config also depending on the plugin_loader. It also better reflects the ownership of fingerprint structs, as they are fairly internal to the fingerprint manager.	2018-12-01 17:10:39 +01:00
Danielle Tomlinson	2db5ae38d8	client: Rename drivers/shared/env => client/taskenv	2018-11-30 12:18:39 +01:00
Danielle Tomlinson	f3a77b8084	client: Merge driver/shared/structs and client/structs	2018-11-30 10:56:45 +01:00
Danielle Tomlinson	b9295f0d56	client/driver: Remove package	2018-11-30 10:47:08 +01:00
Danielle Tomlinson	fdfe93aa25	fixup: executorplugin: fix rkt build	2018-11-30 10:47:08 +01:00
Danielle Tomlinson	d72ecd95ec	client/driver: Vendor setEnvvars into docker_test	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	d26a310db0	client: Move executor plugins into own package	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	d259c36844	driver: Flatten SetEnvvars into taskdirhook	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	6b72e96eba	client: Move driver/logging to logmon/logging The logging package is used by logmon and the legacy mock_driver. Because the legacy drivers are going away, I'm moving it here to signify its actual ownership.	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	04c8851b4c	client: Migrate DriverStats optout to drivers/shared/structs	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	dbd82e1af4	client: Remove test dependency on client/driver	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	0544a57abe	drivers: Move client/drivers/executor to drivers/shared/executor	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	1a29811169	drivers: Move client/drivers/env to drivers/shared/env As part of deprecating legacy drivers, we're moving the env package to a new drivers/shared tree, as it is used by the modern docker and rkt driver packages, and is useful for 3rd party plugins.	2018-11-30 10:46:13 +01:00
Nick Ethier	bbe420718a	Merge pull request #4922 from hashicorp/f-drivermananger add generic plugin manager interface and orchestration	2018-11-28 22:17:04 -05:00
Preetha	1f526db414	Merge pull request #4919 from hashicorp/f-fingerprint-attribute-type Modify fingerprint interface to use typed attribute struct	2018-11-28 14:18:28 -06:00
Michael Schurter	1bd9a9f9dd	Merge pull request #4894 from hashicorp/f-device-hook Device hook and devices affect computed node class	2018-11-28 12:10:43 -06:00
Preetha Appan	f89dbcd9cc	modify fingerprint interface to use typed attribute struct	2018-11-28 10:01:03 -06:00
Nick Ethier	60c6907ea5	client/plugin: remove println from plugin group func	2018-11-27 22:45:09 -05:00
Nick Ethier	600738e991	client/plugin: lint/spelling errors	2018-11-27 22:45:09 -05:00
Nick Ethier	45a6bf7acd	client/plugin: add generic plugin manager interface and orchestration	2018-11-27 22:45:03 -05:00
Mahmood Ali	ad1f8d8c20	Fixes in old lxc driver	2018-11-27 21:40:43 -05:00
Michael Schurter	3e56ee005a	add nil check around task resources in device hook Looking at NewTaskRunner I'm unsure whether TaskRunner.TaskResources (from which req.TaskResources is set) is intended to be nil at times or if the TODO in NewTaskRunner is intended to ensure it is always non-nil.	2018-11-27 17:25:33 -08:00
Michael Schurter	b75e9fce37	assume that slices contain only non-nil items	2018-11-27 17:25:33 -08:00
Michael Schurter	85073f9d29	client: properly support hook env vars The old approach was incomplete. Hook env vars are now: * persisted and restored between agent restarts * deterministic (LWW if 2 hooks set the same key)	2018-11-27 17:25:33 -08:00
Alex Dadgar	4ee603c382	Device hook and devices affect computed node class This PR introduces a device hook that retrieves the device mount information for an allocation. It also updates the computed node class computation to take into account devices. TODO Fix the task runner unit test. The environment variable is being lost even though it is being properly set in the prestart hook.	2018-11-27 17:25:33 -08:00
Michael Schurter	27e07f657e	Merge pull request #4896 from hashicorp/b-prevalloc-deadlock Fix deadlock in previous alloc watcher by emitting last alloc update	2018-11-27 19:07:16 -06:00
Michael Schurter	b75f79a793	fix test breakage caused by rebase	2018-11-27 16:34:01 -08:00
Michael Schurter	91da566935	fix mispelings	2018-11-27 16:33:55 -08:00
Chris Baker	a1fb1f3830	Merge pull request #4891 from hashicorp/b-1150-rkt-volume-names drivers/rkt: fix invalid volumes	2018-11-27 18:55:00 -05:00
Danielle Tomlinson	3651dbdc25	Merge pull request #4909 from hashicorp/b-restart-delay taskrunner: Return the restart delay correctly	2018-11-27 23:55:54 +01:00
Michael Schurter	22149a661e	client: comment on importance of chan ops ordering	2018-11-27 14:11:32 -08:00
Mahmood Ali	05a958dc21	Update client/structs/broadcaster.go Co-Authored-By: schmichael <michael.schurter@gmail.com>	2018-11-27 14:06:08 -08:00
Michael Schurter	81b6a24a84	client: fix send-after-close in broadcaster	2018-11-27 14:06:08 -08:00
Michael Schurter	c429e6b0ab	client: check if prev alloc is already terminated This is a defensive fast-path as 7c6aa0be already fixed the deadlock.	2018-11-27 14:06:08 -08:00
Michael Schurter	944ea6d38b	client: emit last sent alloc to new listeners Fixes a deadlock where the allocwatcher would block forever waiting for an update from a terminal alloc. Made the broadcaster easier to debug as well.	2018-11-27 14:06:08 -08:00
Michael Schurter	1e4ef139dd	Merge pull request #4883 from hashicorp/f-graceful-shutdown Support graceful shutdowns in agent	2018-11-27 15:55:15 -06:00
Michael Schurter	4f7e6f9464	client: fix races in use of goroutine group The group utility struct does not support asynchronously launched goroutines (goroutines-inside-of-goroutines), so switch those uses to a normal go call. This means watchNodeUpdates and watchNodeEvents may not be shutdown when Shutdown() exits. During nomad agent shutdown this does not matter. During tests this means a test may leak those goroutines or be unable to know when those goroutines have exited. Since there's no runtime impact and these goroutines do not affect alloc state syncing it seems ok to risk leaking them.	2018-11-26 12:52:55 -08:00
Michael Schurter	9f43fb6d29	client: reuse group instead of diy'ing it	2018-11-26 12:52:31 -08:00
Michael Schurter	22771aa19e	client/ar: remove useless wait ch from runTasks Arguably this makes task.WaitCh() useless, but I think exposing a wait chan from TaskRunners is a generically useful API.	2018-11-26 12:51:18 -08:00
Michael Schurter	2fdd013956	client: document how AR/TR Run methods behave	2018-11-26 12:50:35 -08:00
Chris Baker	9bd4317139	modified TaskConfig to include AllocID use this for volume names in drivers/rkt to address #1150	2018-11-26 18:54:26 +00:00
Nick Ethier	95362eaa02	Merge pull request #4844 from hashicorp/f-docker-plugin Docker driver plugin	2018-11-20 20:43:03 -05:00
Mahmood Ali	e1994e59bd	address review comments	2018-11-20 17:10:54 -05:00
Mahmood Ali	171b73fde7	Emit metric counters for Vault token and renewal failures	2018-11-20 17:10:54 -05:00
Mahmood Ali	5b10da5de6	Set User-Agent header when hitting Vault API	2018-11-20 17:10:54 -05:00
Danielle Tomlinson	093f029d5b	taskrunner: Return the restart delay correctly We were incorrectly returning a 0 duration to the taskrunner when determining when a task should restart. This would cause tasks to be restarted immediately, ignoring the restart {} stanza in a user's configuration. This commit causes us to return the restart duration to the task runner so it may correctly delay further execution.	2018-11-20 21:52:23 +01:00
Nick Ethier	3e42d6914e	task_runner: use NodeResources instead of deprecated struct	2018-11-20 13:46:39 -05:00
Nick Ethier	93c0200566	task_runner: use task and alloc copies instead of referencing the original pointer	2018-11-20 13:34:46 -05:00
Nick Ethier	29591a7c2e	task_runner: emit event on task exit with exit result details	2018-11-19 22:59:17 -05:00
Nick Ethier	4be8a86ef9	plugins/driver: remove NodeResources from task Resources and use PercentTicks field for docker driver	2018-11-19 22:59:17 -05:00
Nick Ethier	69049d37f5	drivers: added NodeResources to drivers.TaskConfig	2018-11-19 22:59:16 -05:00
Nick Ethier	8f8698b3e1	docker: started work on porting docker driver to new plugin framework	2018-11-19 22:59:15 -05:00
Michael Schurter	88577fe083	client.rpc: don't log errors on shutdown	2018-11-19 16:39:30 -08:00
Michael Schurter	5bd744ac3d	client: support graceful shutdowns Client.Shutdown now blocks until all AllocRunners and TaskRunners have exited their Run loops. Tasks are left running.	2018-11-19 16:39:30 -08:00
Mahmood Ali	9479015f51	Merge pull request #4884 from hashicorp/f-alloc-devices-cli Report alloc device statistics in API and CLI	2018-11-16 18:04:54 -05:00
Mahmood Ali	f139234372	address review comments	2018-11-16 17:13:01 -05:00
Mahmood Ali	f72e599ee7	Populate alloc stats API with device stats This change makes a few compromises: * Looks up the devices associated with tasks at look up time. Given that `nomad alloc status` is called rarely generally (compared to stats telemetry and general job reporting), it seems fine. However, the lookup overhead grows bounded by number of `tasks x total-host-devices`, which can be significant. * `client.Client` performs the task devices->statistics lookup. It passes self to alloc/task runners so they can look up the device statistics allocated to them. * Currently alloc/task runners are responsible for constructing the entire RPC response for stats * The alternatives for making task runners device statistics aware don't seem appealing (e.g. having task runners contain reference to hostStats) * On the alloc aggregation resource usage, I did a naive merging of task device statistics. * Personally, I question the value of such aggregation, compared to costs of struct duplication and bloating the response - but opted to be consistent in the API. * With naive concatenation, device instances from a single device group used by separate tasks in the alloc, would be aggregated in two separate device group statistics.	2018-11-16 10:26:32 -05:00
Michael Schurter	0cdb188ae4	tests: fix tests post-rebase	2018-11-15 17:40:56 -08:00
Michael Schurter	59f106ecee	client/tr: add a bit of context to envbuilder errors	2018-11-15 16:26:25 -08:00
Michael Schurter	742f8775ba	client: remove old proxy references from comments	2018-11-15 16:26:25 -08:00
Michael Schurter	2d0b44c3b4	client: test more env key variations	2018-11-15 16:26:25 -08:00
Michael Schurter	8bcd90d78d	client: add new nested variables to task's hcl ctx The error messages are really bad, but it's extremely difficult to produce good error messages without the original HCL.	2018-11-15 16:26:25 -08:00
Michael Schurter	5e51e2c2d5	client: turn env into nested objects for task configs	2018-11-15 16:25:57 -08:00
Michael Schurter	f8cdd561f0	client: interpolate driver configurations Also add missing SetDriverNetwork calls.	2018-11-15 16:25:57 -08:00
Mahmood Ali	046f098bac	Track Node Device attributes and serve them in API	2018-11-14 14:42:29 -05:00
Mahmood Ali	63acda956c	Add Client Device Stats structs in `api` package	2018-11-14 14:41:19 -05:00
Mahmood Ali	b74ccc742c	Expose Device Stats in /client/stats API endpoint	2018-11-14 14:41:19 -05:00
Mahmood Ali	c5de71a424	Allow nullable fields in StatValues In state values, we need to be able to distinguish between zero values (e.g. `false`) and unset values (e.g. `nil`). We can alternatively use protobuf `oneOf` and nested map to ensure consistency of fields that are set together, but the golang representation does not represent that well and introduces a mismatch between representations. Thus, I opted not to use it.	2018-11-14 14:41:19 -05:00
Mahmood Ali	713c9fe683	Move Stat{Object\|Value} to plugins/shared/structs Moving them as they may be useful for other packages/plugins besides devices.	2018-11-14 09:01:26 -05:00
Mahmood Ali	1f4db08f42	Regenerate proto files with protoc-gen-go@v1.2.0	2018-11-14 09:01:26 -05:00
Danielle Tomlinson	0917e93537	Merge pull request #4869 from hashicorp/b-executor-stdout executor: Fix stdout stderr copy/paste	2018-11-13 19:22:37 -08:00
Mahmood Ali	865419e756	convert all config durations to strings in tests	2018-11-13 10:21:40 -05:00
Mahmood Ali	ac3b4571eb	Address review comments	2018-11-13 10:21:40 -05:00
Mahmood Ali	69f26783e4	avoid setting resource limit on rkt command Was accidentally modified in 5b14d24bf4626bab420d00783d92bcf25e0b641e .	2018-11-13 10:21:40 -05:00
Mahmood Ali	8fa26f5521	Fix docker log fetching in tests We no longer use syslog for tracking logs so tracking them explicitly here	2018-11-13 10:21:40 -05:00
Mahmood Ali	88fa968623	killing should be done with wait client Incidentally changed in 5b14d24bf4626bab420d00783d92bcf25e0b641e	2018-11-13 10:21:40 -05:00
Mahmood Ali	7690f389a0	Prioritize checking consumer context cancellation Tests expect the eventer to shut down immediately on context cancellation, but golang does not guarantee priority when multiple pending channels are ready in a select statement.	2018-11-13 10:21:40 -05:00
Mahmood Ali	c62ec124c0	Set clean config for mock driver The default job here contains some exec task config (for setting command and args) that aren't used for mock driver. Now, the alloc runner seems stricter about validating fields and errors on unexpected fields. Updating configs in tests so we can have an explicit task config whenever driver is set explicitly.	2018-11-13 10:21:40 -05:00
Mahmood Ali	e5e6f9a785	Update Docker name parsing lookup `ParseNamed` function changed in e9f3f2cfee9d729a8642344c4fa4ea70b2d49468 where it became `ParsedNormalizedName` with extra checks.	2018-11-13 10:21:40 -05:00
Danielle Tomlinson	bfeded1f30	executor: Fix stdout stderr copy/paste	2018-11-12 22:08:04 -08:00
Alex Dadgar	c4f9e22aeb	fix race	2018-11-07 12:22:07 -08:00
Alex Dadgar	a7ca737fb6	review comments	2018-11-07 11:31:52 -08:00
Alex Dadgar	f0c7a8159b	tests	2018-11-07 10:43:15 -08:00
Alex Dadgar	204ca8230c	Device manager Introduce a device manager that manages the lifecycle of device plugins on the client. It fingerprints, collects stats, and forwards Reserve requests to the correct plugin. The manager also handles device plugins failing and validates their output.	2018-11-07 10:43:15 -08:00
Michael Schurter	a4e6a92d18	client: update alloc status when terminating Defensively update alloc status whenever killing all tasks.	2018-11-05 15:11:10 -08:00
Michael Schurter	66bf3db455	client: block on context as well as waitCh For lifecycle operations such as Restart and Kill, the client should not expect driver plugins to be well behaved and close their waitCh on context cancelation. Always wait on the passed in context as well as the waitCh.	2018-11-05 12:32:05 -08:00
Michael Schurter	b994f51990	client: fix tr lifecycle logic and shutdown delay ShutdownDelay must be honored whenever the task is killed or restarted. Services were not being deregistered prior to restarting.	2018-11-05 12:32:05 -08:00
Michael Schurter	2d3479147a	client: fix ar and tr tests	2018-11-05 12:32:05 -08:00
Michael Schurter	d29d09023e	client: do not run terminal allocs	2018-11-05 12:32:05 -08:00
Michael Schurter	2bbd88888c	client: first pass at implementing task restoring Task restoring works but dead tasks may be restarted	2018-11-05 12:32:05 -08:00
Nick Ethier	b0ddc03409	Merge pull request #4765 from jippi/increase-line-scan-limit fix: increase log rotator line scan limit	2018-10-29 18:46:30 -07:00
Nick Ethier	3fcf8ba7e6	Merge pull request #4795 from hashicorp/f-plugin-config Pass client configuration to plugins through loader	2018-10-29 18:42:27 -07:00
Nick Ethier	bda3b1d3b3	rename NomadConfig to ClientAgentConfig	2018-10-29 21:34:34 -04:00
Michael Schurter	6f2cffb196	Merge pull request #4803 from hashicorp/b-leader-fixes AR Fixes: task leader handling, restoring, state updating, AR.Destroy deadlocks	2018-10-29 17:38:59 -05:00
Michael Schurter	d71a1b4547	tests: more fixes due to api changes	2018-10-29 15:25:22 -07:00
Preetha Appan	b85cc38f3d	Stat path to binary to handle raw exec driver interpolated binary path	2018-10-26 17:24:05 -05:00
Preetha Appan	55ac8d3d12	Fix test linting	2018-10-26 10:30:12 -05:00
Michael Schurter	b7a9d61a38	ar: initialize allocwatcher on restore Fixes a panic. Left a comment on how the behavior could be improved, but this is what releases <0.9.0 did.	2018-10-19 09:45:45 -07:00
Michael Schurter	e060174130	ar: fix leader handling, state restoring, and destroying unrun ARs * Migrated all of the old leader task tests and got them passing * Refactor and consolidate task killing code in AR to always kill leader tasks first * Fixed lots of issues with state restoring * Fixed deadlock in AR.Destroy if AR.Run had never been called * Added a new in memory statedb for testing	2018-10-19 09:45:45 -07:00
Nick Ethier	58b430edae	added driver specific client config struct to plugin configuration	2018-10-18 23:31:01 -04:00
Michael Schurter	cefbf00bf0	ar: refactor task killing into 1 method Update comments and address some PR comments from #4775	2018-10-17 10:06:59 -07:00
Michael Schurter	21d78be961	tests: explicitly cleanup after clients	2018-10-17 10:06:59 -07:00
Michael Schurter	222f6b5741	ar: fix task leader, update, and stop handling	2018-10-17 10:06:59 -07:00
Michael Schurter	1badbb2fc4	tr: cleanup hook logs	2018-10-17 09:42:32 -07:00
Nick Ethier	65adb80ebf	plumb NomadConfig into plugins	2018-10-16 22:47:22 -04:00
Nick Ethier	d94b631b6b	drivers/exec: add exec implementation	2018-10-16 22:45:28 -04:00
Michael Schurter	0baaba8b09	templates: fix tests	2018-10-16 16:56:57 -07:00
Michael Schurter	838ddf4d4a	fix linter errors	2018-10-16 16:56:57 -07:00
Michael Schurter	e27c82ea4d	client: remove unused handleproxy	2018-10-16 16:56:56 -07:00
Michael Schurter	4ea5217d72	tr: remove unused DriverHandle interface was causing typed nil interface panics and served no purpose	2018-10-16 16:56:56 -07:00
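For readers unfamiliar with the "typed nil interface" problem mentioned in 4ea5217d72, here is a generic Go illustration. The `handle` interface and `taskHandle` type are hypothetical and are not the removed DriverHandle code.

```go
package main

import "fmt"

// handle is a hypothetical interface wrapping a concrete handle type.
type handle interface {
	ID() string
}

type taskHandle struct{ id string }

// ID dereferences the receiver, so it panics if called on a nil *taskHandle.
func (h *taskHandle) ID() string { return h.id }

// lookup returns a nil *taskHandle wrapped in the interface when nothing
// is found. The interface value is then non-nil even though the pointer
// inside it is nil.
func lookup(found bool) handle {
	var h *taskHandle
	if found {
		h = &taskHandle{id: "task-1"}
	}
	return h
}

func main() {
	h := lookup(false)
	fmt.Println(h == nil) // false: the nil check passes, but h.ID() would panic
}
```

Returning the concrete pointer type, or an explicit untyped `nil`, avoids this class of bug.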
Michael Schurter	528c426c53	Port client portion of #4392 to new taskrunner PR #4392 was merged to master after allocrunnerv2 was branched, so the client-specific portions must be ported from master to arv2.	2018-10-16 16:56:56 -07:00
Michael Schurter	f12501d4c3	tr: implement dispatch payload hook Now passing the TaskDir struct to prestart hooks instead of just the root task dir itself as dispatch needs local/.	2018-10-16 16:56:56 -07:00
Nick Ethier	d9f0cbf4a9	client: log retry during driver fingerprint redispense	2018-10-16 16:56:56 -07:00
Nick Ethier	c7ac1186c9	client: add test for driverfailure during fingerprinting	2018-10-16 16:56:56 -07:00
Nick Ethier	8cf669b5aa	taskrunner: return error on waitCh	2018-10-16 16:56:56 -07:00
Nick Ethier	047fad2953	client: simplify driver plugin logic from review comments	2018-10-16 16:56:56 -07:00
Nick Ethier	9686e1b258	client: fix broken tests from refactoring	2018-10-16 16:56:56 -07:00
Nick Ethier	3183b33d24	client: review comments and fixup/skip tests	2018-10-16 16:56:56 -07:00
Nick Ethier	f192c3752a	client: refactor post allocrunnerv2 finalization	2018-10-16 16:56:56 -07:00
Nick Ethier	4a4c7dbbfc	client: begin driver plugin integration client: fingerprint driver plugins	2018-10-16 16:56:56 -07:00
Alex Dadgar	7946a14aa8	Fix lints	2018-10-16 16:56:56 -07:00
Alex Dadgar	89dafaaea9	compile on windows	2018-10-16 16:56:56 -07:00
Alex Dadgar	ad4fac526c	more test fixes	2018-10-16 16:56:56 -07:00
Alex Dadgar	45e41cca03	allocrunnerv2 -> allocrunner	2018-10-16 16:56:56 -07:00
Alex Dadgar	9baa7402ef	fix test compiling	2018-10-16 16:56:55 -07:00
Alex Dadgar	7d9c069f09	skip building deprecated files	2018-10-16 16:56:55 -07:00
Alex Dadgar	6c9d9d5173	move files around	2018-10-16 16:56:55 -07:00
Michael Schurter	5f696608a6	tests: fix missing logger caused by bad merge	2018-10-16 16:56:55 -07:00
Michael Schurter	048510b13e	tr: properly comment handle fields	2018-10-16 16:56:55 -07:00
Michael Schurter	9e49ed3464	ar: AllocState should not mutate ar.state If ar.state.TaskStates has not been set, set it on the copy of ar.state. That keeps ar.state manipulations in one location and allows AllocState to only acquire read-locks.	2018-10-16 16:56:55 -07:00
Michael Schurter	f279b1d1b1	tests: test logs endpoint against pending task Although the really exciting change is making WaitForRunning return the allocations that it started. This should cut down test boilerplate significantly.	2018-10-16 16:56:55 -07:00
Michael Schurter	dd4227f84a	tests: make a test client/config easier to generate Sadly can't move the fingerprint timeout tweak into the helper due to circular imports.	2018-10-16 16:56:55 -07:00
Michael Schurter	1d747048ea	tests: ensure task state is initialized in NewAR Also expose NoopDB for use in tests.	2018-10-16 16:56:55 -07:00
Michael Schurter	960f3be76c	client: expose task state to client The interesting decision in this commit was to expose AR's state and not a fully materialized Allocation struct. AR.clientAlloc builds an Alloc that contains the task state, so I considered simply memoizing and exposing that method. However, that would lead to AR having two awkwardly similar methods: - Alloc() - which returns the server-sent alloc - ClientAlloc() - which returns the fully materialized client alloc Since ClientAlloc() could be memoized it would be just as cheap to call as Alloc(), so why not replace Alloc() entirely? Replacing Alloc() entirely would require Update() to immediately materialize the task states on server-sent Allocs as there may have been local task state changes since the server received an Alloc update. This quickly becomes difficult to reason about: should Update hooks use the TaskStates? Are state changes caused by TR Update hooks immediately reflected in the Alloc? Should AR persist its copy of the Alloc? If so, are its TaskStates canonical or the TaskStates on TR? So! Forget that. Let's separate the static Allocation from the dynamic AR & TR state! - AR.Alloc() is for static Allocation access (often for the Job) - AR.AllocState() is for the dynamic AR & TR runtime state (deployment status, task states, etc). If code needs to know the status of a task: AllocState() If code needs to know the names of tasks: Alloc() It should be very easy for a developer to reason about which method they should call and what they can do with the return values.	2018-10-16 16:56:55 -07:00
Michael Schurter	fb4aa74153	client: add comment	2018-10-16 16:56:55 -07:00
Michael Schurter	9a7e6be2b6	client: fix potentially dropped streaming errors	2018-10-16 16:56:55 -07:00
Michael Schurter	4b44b9039b	tr: remove unneeded lock; chan synchronizes access	2018-10-16 16:56:55 -07:00
Michael Schurter	211b96bb5c	tr: fix shutdown/destroy/WaitResult handling Multiple receivers raced for the WaitResult when killing tasks which could lead to a deadlock if the "wrong" receiver won. Wrap handlers in an ugly little proxy to avoid this. At first I wanted to push this into drivers, but the result is tied to the TR's handle lifecycle -- not the lifecycle of an alloc or task.	2018-10-16 16:56:55 -07:00
Michael Schurter	951ed17436	client: do not inspect task state to follow logs "Ask forgiveness, not permission." Instead of peeking at TaskStates (which are no longer updated on the AR.Alloc() view of the world) to only read logs for running tasks, just try to read the logs and improve the error handling if they don't exist. This should make log streaming less dependent on AR/TR behavior. Also fixed a race where the log streamer could exit before reading an error. This caused no logs or errors to be displayed sometimes when an error occurred.	2018-10-16 16:56:55 -07:00
Michael Schurter	2325348053	mock_driver: close waitCh after exiting mock_driver wasn't behaving like other driver handles.	2018-10-16 16:56:55 -07:00
Michael Schurter	8d1419c62b	client: fix accessing alloc runners * GetClientAlloc() gains nothing from using allAllocs() * getAllocatedResources was calling getAllocRunners() twice	2018-10-16 16:56:55 -07:00
Michael Schurter	55ab491801	tr: remove wip comments	2018-10-16 16:56:55 -07:00
Michael Schurter	3ccc091a72	ar: lock around accessing tasks Specify that Alloc() does not return updated task states.	2018-10-16 16:56:55 -07:00
Alex Dadgar	6f0ed6184b	Fix client reloading and pass the plugin loaders to server and client	2018-10-16 16:56:55 -07:00
Nick Ethier	352c05cdf4	plugin/drivers: plumb in stdout/stderr paths	2018-10-16 16:53:31 -07:00
Nick Ethier	0e3f85222a	driver/raw_exec: port existing raw_exec tests and add some testing utilities	2018-10-16 16:53:31 -07:00
Nick Ethier	d9628ff394	driver/raw_exec: more tests and bug fixes added wrapper struct for plugin.ReattachConfig to better handle serialization	2018-10-16 16:53:31 -07:00
Nick Ethier	bcc5c4a8bd	clientv2: base driver plugin (#4671) Driver plugin framework to facilitate development of driver plugins. Implementing plugins only need to implement the DriverPlugin interface. The framework proxies this interface to the go-plugin GRPC interface generated from the driver.proto spec. A testing harness is provided to allow implementing drivers to test the full lifecycle of the driver plugin. An example use: func TestMyDriver(t *testing.T) { harness := NewDriverHarness(t, &MyDriverPlugin{}) // The harness implements the DriverPlugin interface and can be used as such taskHandle, err := harness.StartTask(...) }	2018-10-16 16:53:31 -07:00
Michael Schurter	62c1285afc	tr: add comments and cleanup call signature From review comments on #4649 left post-merge.	2018-10-16 16:53:31 -07:00
Nick Ethier	5dee1141d1	executor v2 (#4656) * client/executor: refactor client to remove interpolation * executor: POC libcontainer based executor * vendor: use hashicorp libcontainer fork * vendor: add libcontainer/nsenter dep * executor: updated executor interface to simplify operations * executor: implement logging pipe * logmon: new logmon plugin to manage task logs * driver/executor: use logmon for log management * executor: fix tests and windows build * executor: fix logging key names * executor: fix test failures * executor: add config field to toggle between using libcontainer and standard executors * logmon: use discover utility to discover nomad executable * executor: only call libcontainer-shim on main in linux * logmon: use separate path configs for stdout/stderr fifos * executor: windows fixes * executor: created reusable pid stats collection utility that can be used in an executor * executor: update fifo.Open calls * executor: fix build * remove executor from docker driver * executor: Shutdown func to kill and cleanup executor and its children * executor: move linux specific universal executor funcs to separate file * move logmon initialization to a task runner hook * client: doc fixes and renaming from code review * taskrunner: use shared config struct for logmon fifo fields * taskrunner: logmon only needs to be started once per task	2018-10-16 16:53:31 -07:00
Michael Schurter	e6e2930a00	tr: implement stats collection hook Tested except for the net/rpc specific error case which may need changing in the gRPC world.	2018-10-16 16:53:31 -07:00
Michael Schurter	86bd329539	fix build errors post merges	2018-10-16 16:53:31 -07:00
Michael Schurter	a977e22028	test: cleanup mock consul service client Updated to hclog. It exposed fields that required an unexported lock to access. Created a getter method instead. Only old allocrunner currently used this feature.	2018-10-16 16:53:31 -07:00
Michael Schurter	6f92b04226	health_hook: simplify locking; test thoroughly Use doneCh like @dadgar suggested in the original PR. Thoroughly test hook as concurrent Update calls make for a tricky concurrency problem.	2018-10-16 16:53:30 -07:00
Alex Dadgar	cebfead6bc	add logger back	2018-10-16 16:53:30 -07:00
Nick Ethier	03422aa529	fifo: add new fifo package for named pipes (#4665 ) * fifo: add new fifo package for named pipes	2018-10-16 16:53:30 -07:00
Alex Dadgar	8504505c0d	client uses passed logger and fix fingerprinters	2018-10-16 16:53:30 -07:00
Nick Ethier	66ff12e5f7	Update runc/libcontainer and friends (#4655 ) * vendor: bump libcontainer and docker to remove Sirupsen imports * vendor: fix bad vendoring of archive package * vendor: fix api changes to cgroups in executor * vendor: fix docker api changes * vendor: update github.com/Azure/go-ansiterm to use non capitalized logrus import	2018-10-16 16:53:30 -07:00
Michael Schurter	195b8127fb	health_hook: fix panic and add tests Still more testing to do, but I want to get this panic fixed ASAP. All new tests pass with -race	2018-10-16 16:53:30 -07:00
Michael Schurter	64efc3d301	Emit events before long operations Append when there's nothing blocking between appending and sending an update to the server.	2018-10-16 16:53:30 -07:00
Michael Schurter	a2b696c4cf	Use a semaphore to block until watcher exits	2018-10-16 16:53:30 -07:00
Michael Schurter	a73162c977	ar: use multierror in update hook loop Make it match TaskRunner update hook behavior	2018-10-16 16:53:30 -07:00

3667 commits