open-nomad

Commit Graph

Author	SHA1	Message	Date
Alex Dadgar	b653ae2af7	utilities	2018-12-18 15:48:52 -08:00
Danielle Tomlinson	95a0c4fb29	taskrunner: Use a random suffix for Task Config The RestartCount is not really suitable for use as a source of uniqueness within task invocations as it is not monotonic, and interacts with the restart stanza in a user's config, so conflates restarts due to task failures, with restarts due to environmental changes, such as consul template or vault secrets changing. Here we instead use a substring from a uuid, which is more random than we strictly need, but is nicer than rolling our own random string generator here.	2018-12-19 00:38:54 +01:00
Danielle Tomlinson	d6eb084d8a	allocrunner: Drop and log updates after closing waitCh	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	0d91285cd6	allocrunner: Documentation for ShutdownCh/DestroyCh	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	f2bb13818e	fixup: Log when we detect out of order updates	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	986fde0f5a	allocrunner: Handle updates asynchronously This creates a new buffered channel and goroutine on the allocrunner for serializing updates to allocations. This allows us to take updates off the routine that is used for processing updates from the server, without having complicated machinery for tracking update lifetimes, or other external synchronization. This results in a nice performance improvement and significantly better throughput on batch changes such as preempting a large number of jobs for a larger placement.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	d1fbac1aad	allocrunner: Async shutdown and destroy This commit reduces the locking required to shutdown or destroy allocrunners, and allows parallel shutdown and destroy of allocrunners during shutdown.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	a50ea29da4	taskrunner: Use hook errors for artifacts	2018-12-17 10:39:38 +01:00
Danielle Tomlinson	3647b701a6	taskrunner: Emit task events when a hook fails	2018-12-13 18:20:18 +01:00
Alex Dadgar	20c59df8b9	Merge pull request #4969 from hashicorp/f-alloc-hooks Make alloc health watcher a postrun hook rather than shutdown hook	2018-12-12 14:34:36 -08:00
Danielle Tomlinson	6fb5ca6ad5	allocrunner: Test alloc runners should include a noop migrator	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	83720575de	client: Unify handling of previous and preempted allocs	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	dff7093243	client: Wait for preempted allocs to terminate When starting an allocation that is preempting other allocs, we create a new group allocation watcher, and then wait for the allocations to terminate in the allocation PreRun hooks. If there are no preempted allocations, then we simply provide a NoopAllocWatcher.	2018-12-11 00:59:18 +01:00
Alex Dadgar	c4b5f80918	Make alloc health watcher a postrun hook rather than shutdown hook	2018-12-06 12:30:31 -08:00
Danielle Tomlinson	d043532cb0	allocrunner: Basic test alloc runner	2018-12-06 12:28:23 +01:00
Alex Dadgar	b39c21d49c	Fix various bugs with task events Fixes the following: * Emitting events when the task fails to start * Don't double emit events on task shutdown (nomad stop) * Don't emit a OOM kill metric unless actually OOM'd	2018-12-05 14:27:07 -08:00
Danielle Tomlinson	2db5ae38d8	client: Rename drivers/shared/env => client/taskenv	2018-11-30 12:18:39 +01:00
Danielle Tomlinson	f3a77b8084	client: Merge driver/shared/structs and client/structs	2018-11-30 10:56:45 +01:00
Danielle Tomlinson	d259c36844	driver: Flatten SetEnvvars into taskdirhook	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	04c8851b4c	client: Migrate DriverStats optout to drivers/shared/structs	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	1a29811169	drivers: Move client/drivers/env to drivers/shared/env As part of deprecating legacy drivers, we're moving the env package to a new drivers/shared tree, as it is used by the modern docker and rkt driver packages, and is useful for 3rd party plugins.	2018-11-30 10:46:13 +01:00
Michael Schurter	3e56ee005a	add nil check around task resources in device hook Looking at NewTaskRunner I'm unsure whether TaskRunner.TaskResources (from which req.TaskResources is set) is intended to be nil at times or if the TODO in NewTaskRunner is intended to ensure it is always non-nil.	2018-11-27 17:25:33 -08:00
Michael Schurter	b75e9fce37	assume that slices contain only non-nil items	2018-11-27 17:25:33 -08:00
Michael Schurter	85073f9d29	client: properly support hook env vars The old approach was incomplete. Hook env vars are now: * persisted and restored between agent restarts * deterministic (LWW if 2 hooks set the same key)	2018-11-27 17:25:33 -08:00
Alex Dadgar	4ee603c382	Device hook and devices affect computed node class This PR introduces a device hook that retrieves the device mount information for an allocation. It also updates the computed node class computation to take into account devices. TODO Fix the task runner unit test. The environment variable is being lost even though it is being properly set in the prestart hook.	2018-11-27 17:25:33 -08:00
Michael Schurter	27e07f657e	Merge pull request #4896 from hashicorp/b-prevalloc-deadlock Fix deadlock in previous alloc watcher by emitting last alloc update	2018-11-27 19:07:16 -06:00
Michael Schurter	b75f79a793	fix test breakage caused by rebase	2018-11-27 16:34:01 -08:00
Chris Baker	a1fb1f3830	Merge pull request #4891 from hashicorp/b-1150-rkt-volume-names drivers/rkt: fix invalid volumes	2018-11-27 18:55:00 -05:00
Danielle Tomlinson	3651dbdc25	Merge pull request #4909 from hashicorp/b-restart-delay taskrunner: Return the restart delay correctly	2018-11-27 23:55:54 +01:00
Michael Schurter	944ea6d38b	client: emit last sent alloc to new listeners Fixes a deadlock where the allocwatcher would block forever waiting for an update from a terminal alloc. Made the broadcaster easier to debug as well.	2018-11-27 14:06:08 -08:00
Michael Schurter	1e4ef139dd	Merge pull request #4883 from hashicorp/f-graceful-shutdown Support graceful shutdowns in agent	2018-11-27 15:55:15 -06:00
Michael Schurter	22771aa19e	client/ar: remove useless wait ch from runTasks Arguably this makes task.WaitCh() useless, but I think exposing a wait chan from TaskRunners is a generically useful API.	2018-11-26 12:51:18 -08:00
Michael Schurter	2fdd013956	client: document how AR/TR Run methods behave	2018-11-26 12:50:35 -08:00
Chris Baker	9bd4317139	modified TaskConfig to include AllocID use this for volume names in drivers/rkt to address #1150	2018-11-26 18:54:26 +00:00
Danielle Tomlinson	093f029d5b	taskrunner: Return the restart delay correctly We were incorrectly returning a 0 duration to the taskrunner when determining when a task should restart. This would cause tasks to be restarted immediately, ignoring the restart {} stanza in a user's configuration. This commit causes us to return the restart duration to the task runner so it may correctly delay further execution.	2018-11-20 21:52:23 +01:00
Nick Ethier	3e42d6914e	task_runner: use NodeResources instead of deprecated struct	2018-11-20 13:46:39 -05:00
Nick Ethier	93c0200566	task_runner: use task and alloc copies instead of referencing the original pointer	2018-11-20 13:34:46 -05:00
Nick Ethier	29591a7c2e	task_runner: emit event on task exit with exit result details	2018-11-19 22:59:17 -05:00
Nick Ethier	4be8a86ef9	plugins/driver: remove NodeResources from task Resources and use PercentTicks field for docker driver	2018-11-19 22:59:17 -05:00
Nick Ethier	69049d37f5	drivers: added NodeResources to drivers.TaskConfig	2018-11-19 22:59:16 -05:00
Nick Ethier	8f8698b3e1	docker: started work on porting docker driver to new plugin framework	2018-11-19 22:59:15 -05:00
Michael Schurter	5bd744ac3d	client: support graceful shutdowns Client.Shutdown now blocks until all AllocRunners and TaskRunners have exited their Run loops. Tasks are left running.	2018-11-19 16:39:30 -08:00
Mahmood Ali	9479015f51	Merge pull request #4884 from hashicorp/f-alloc-devices-cli Report alloc device statistics in API and CLI	2018-11-16 18:04:54 -05:00
Mahmood Ali	f139234372	address review comments	2018-11-16 17:13:01 -05:00
Mahmood Ali	f72e599ee7	Populate alloc stats API with device stats This change makes a few compromises: * Looks up the devices associated with tasks at look up time. Given that `nomad alloc status` is called rarely generally (compared to stats telemetry and general job reporting), it seems fine. However, the lookup overhead grows bounded by number of `tasks x total-host-devices`, which can be significant. * `client.Client` performs the task devices->statistics lookup. It passes self to alloc/task runners so they can look up the device statistics allocated to them. * Currently alloc/task runners are responsible for constructing the entire RPC response for stats * The alternatives for making task runners device statistics aware don't seem appealing (e.g. having task runners contain reference to hostStats) * On the alloc aggregation resource usage, I did a naive merging of task device statistics. * Personally, I question the value of such aggregation, compared to costs of struct duplication and bloating the response - but opted to be consistent in the API. * With naive concatenation, device instances from a single device group used by separate tasks in the alloc, would be aggregated in two separate device group statistics.	2018-11-16 10:26:32 -05:00
Michael Schurter	0cdb188ae4	tests: fix tests post-rebase	2018-11-15 17:40:56 -08:00
Michael Schurter	59f106ecee	client/tr: add a bit of context to envbuilder errors	2018-11-15 16:26:25 -08:00
Michael Schurter	742f8775ba	client: remove old proxy references from comments	2018-11-15 16:26:25 -08:00
Michael Schurter	8bcd90d78d	client: add new nested variables to task's hcl ctx The error messages are really bad, but it's extremely difficult to produce good error messages without the original HCL.	2018-11-15 16:26:25 -08:00
Michael Schurter	f8cdd561f0	client: interpolate driver configurations Also add missing SetDriverNetwork calls.	2018-11-15 16:25:57 -08:00
Mahmood Ali	865419e756	convert all config durations to strings in tests	2018-11-13 10:21:40 -05:00
Michael Schurter	a4e6a92d18	client: update alloc status when terminating Defensively update alloc status whenever killing all tasks.	2018-11-05 15:11:10 -08:00
Michael Schurter	66bf3db455	client: block on context as well as waitCh For lifecycle operations such as Restart and Kill, the client should not expect driver plugins to be well behaved and close their waitCh on context cancelation. Always wait on the passed in context as well as the waitCh.	2018-11-05 12:32:05 -08:00
Michael Schurter	b994f51990	client: fix tr lifecycle logic and shutdown delay ShutdownDelay must be honored whenever the task is killed or restarted. Services were not being deregistered prior to restarting.	2018-11-05 12:32:05 -08:00
Michael Schurter	2d3479147a	client: fix ar and tr tests	2018-11-05 12:32:05 -08:00
Michael Schurter	d29d09023e	client: do not run terminal allocs	2018-11-05 12:32:05 -08:00
Michael Schurter	2bbd88888c	client: first pass at implementing task restoring Task restoring works but dead tasks may be restarted	2018-11-05 12:32:05 -08:00
Nick Ethier	3fcf8ba7e6	Merge pull request #4795 from hashicorp/f-plugin-config Pass client configuration to plugins through loader	2018-10-29 18:42:27 -07:00
Michael Schurter	e060174130	ar: fix leader handling, state restoring, and destroying unrun ARs * Migrated all of the old leader task tests and got them passing * Refactor and consolidate task killing code in AR to always kill leader tasks first * Fixed lots of issues with state restoring * Fixed deadlock in AR.Destroy if AR.Run had never been called * Added a new in memory statedb for testing	2018-10-19 09:45:45 -07:00
Michael Schurter	cefbf00bf0	ar: refactor task killing into 1 method Update comments and address some PR comments from #4775	2018-10-17 10:06:59 -07:00
Michael Schurter	21d78be961	tests: explicitly cleanup after clients	2018-10-17 10:06:59 -07:00
Michael Schurter	222f6b5741	ar: fix task leader, update, and stop handling	2018-10-17 10:06:59 -07:00
Michael Schurter	1badbb2fc4	tr: cleanup hook logs	2018-10-17 09:42:32 -07:00
Nick Ethier	65adb80ebf	plumb NomadConfig into plugins	2018-10-16 22:47:22 -04:00
Michael Schurter	0baaba8b09	templates: fix tests	2018-10-16 16:56:57 -07:00
Michael Schurter	838ddf4d4a	fix linter errors	2018-10-16 16:56:57 -07:00
Michael Schurter	e27c82ea4d	client: remove unused handleproxy	2018-10-16 16:56:56 -07:00
Michael Schurter	4ea5217d72	tr: remove unused DriverHandle interface was causing typed nil interface panics and served no purpose	2018-10-16 16:56:56 -07:00
Michael Schurter	528c426c53	Port client portion of #4392 to new taskrunner PR #4392 was merged to master after allocrunnerv2 was branched, so the client-specific portions must be ported from master to arv2.	2018-10-16 16:56:56 -07:00
Michael Schurter	f12501d4c3	tr: implement dispatch payload hook Now passing the TaskDir struct to prestart hooks instead of just the root task dir itself as dispatch needs local/.	2018-10-16 16:56:56 -07:00
Nick Ethier	8cf669b5aa	taskrunner: return error on waitCh	2018-10-16 16:56:56 -07:00
Nick Ethier	047fad2953	client: simplify driver plugin logic from review comments	2018-10-16 16:56:56 -07:00
Nick Ethier	9686e1b258	client: fix broken tests from refactoring	2018-10-16 16:56:56 -07:00
Nick Ethier	3183b33d24	client: review comments and fixup/skip tests	2018-10-16 16:56:56 -07:00
Nick Ethier	f192c3752a	client: refactor post allocrunnerv2 finalization	2018-10-16 16:56:56 -07:00
Nick Ethier	4a4c7dbbfc	client: begin driver plugin integration client: fingerprint driver plugins	2018-10-16 16:56:56 -07:00
Alex Dadgar	7946a14aa8	Fix lints	2018-10-16 16:56:56 -07:00
Alex Dadgar	45e41cca03	allocrunnerv2 -> allocrunner	2018-10-16 16:56:56 -07:00
Alex Dadgar	6c9d9d5173	move files around	2018-10-16 16:56:55 -07:00
Michael Schurter	9d1ea3b228	client: hclog-ify most of the client Leaving fingerprinters in case that interface changes with plugins.	2018-10-16 16:53:30 -07:00
Michael Schurter	e42154fc46	implement stopping, destroying, and disk migration * Stopping an alloc is implemented via Updates but update hooks are not run. * Destroying an alloc is a best effort cleanup. * AllocRunner destroy hooks implemented. * Disk migration and blocking on a previous allocation exiting moved to its own package to avoid cycles. Now only depends on alloc broadcaster instead of also using a waitch. * AllocBroadcaster now only drops stale allocations and always keeps the latest version. * Made AllocDir safe for concurrent use Lots of internal contexts that are currently unused. Unsure if they should be used or removed.	2018-10-16 16:53:30 -07:00
Michael Schurter	820af27171	wrap boltdb in a write deduplicator Saves a tiny bit of cpu and some IO. Sadly doesn't prevent all IO on duplicate writes as the transactions are still created and committed. $ go test -bench=. -benchmem goos: linux goarch: amd64 pkg: github.com/hashicorp/nomad/helper/boltdd BenchmarkWriteDeduplication_On-4 500 4059591 ns/op 23736 B/op 56 allocs/op BenchmarkWriteDeduplication_Off-4 300 4115319 ns/op 25942 B/op 55 allocs/op	2018-10-16 16:53:30 -07:00
Michael Schurter	5383d20505	removing old restoration path before api change	2018-10-16 16:53:30 -07:00
Michael Schurter	39b3f3a85b	call handle.Network() instead of storing it	2018-10-16 16:53:30 -07:00
Michael Schurter	a4b4d7b266	consul service hook Deregistration works but difficult to test due to terminal updates not being fully implemented in the new client/ar/tr.	2018-10-16 16:53:29 -07:00
Michael Schurter	9a63d6103d	tr: add validate task hook	2018-10-16 16:53:29 -07:00
Alex Dadgar	e401c660e7	Implement lifecycle hooks on the task runner	2018-10-16 16:53:29 -07:00
Michael Schurter	eae54e2954	artifact task hook	2018-10-16 16:53:29 -07:00
Alex Dadgar	52f9cd7637	fixing tests	2018-10-04 14:26:19 -07:00
Alex Dadgar	ca28afa3b2	small fixes	2018-09-15 16:42:38 -07:00
Alex Dadgar	7739ef51ce	agent + consul	2018-09-13 10:43:40 -07:00
Michael Schurter	6def5bc4f9	client: set host name when migrating over tls Not setting the host name led the Go HTTP client to expect a certificate with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad` names ephemeral dir migrations were broken when TLS was enabled. Added an e2e test to ensure this doesn't break again as it's very difficult to test and the TLS configuration is very easy to get wrong.	2018-09-05 17:24:17 -07:00
Andrei Burd	444ee45aff	Parametrized/periodic jobs per child tagged metric emission	2018-06-21 10:40:56 +03:00
Alex Dadgar	300b1a7a15	Tests only use testlog package logger	2018-06-13 15:40:56 -07:00
Alex Dadgar	9bab9edf27	test fixes	2018-06-12 17:45:39 -07:00
Alex Dadgar	90c2108bfb	Fix gc tests + parallel destroy + small test fixes	2018-06-12 10:23:45 -07:00
Alex Dadgar	f5ff509fa5	Refactor - wip	2018-06-12 10:23:45 -07:00


147 Commits