open-nomad

Author	SHA1	Message	Date
Danielle Tomlinson	9bd77e9295	testfix: Fix import cycle in allocdir tests	2018-12-01 17:25:30 +01:00
Danielle Tomlinson	66c521ca17	client: Move fingerprint structs to pkg This removes a cyclical dependency when importing client/structs from dependencies of the plugin_loader, specifically drivers, because client/config also depends on the plugin_loader. It also better reflects the ownership of fingerprint structs, as they are fairly internal to the fingerprint manager.	2018-12-01 17:10:39 +01:00
Danielle Tomlinson	2db5ae38d8	client: Rename drivers/shared/env => client/taskenv	2018-11-30 12:18:39 +01:00
Danielle Tomlinson	f3a77b8084	client: Merge driver/shared/structs and client/structs	2018-11-30 10:56:45 +01:00
Danielle Tomlinson	b9295f0d56	client/driver: Remove package	2018-11-30 10:47:08 +01:00
Danielle Tomlinson	fdfe93aa25	fixup: executorplugin: fix rkt build	2018-11-30 10:47:08 +01:00
Danielle Tomlinson	d72ecd95ec	client/driver: Vendor setEnvvars into docker_test	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	d26a310db0	client: Move executor plugins into own package	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	d259c36844	driver: Flatten SetEnvvars into taskdirhook	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	6b72e96eba	client: Move driver/logging to logmon/logging The logging package is used by logmon and the legacy mock_driver. Because the legacy drivers are going away, I'm moving it here to signify its actual ownership.	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	04c8851b4c	client: Migrate DriverStats optout to drivers/shared/structs	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	dbd82e1af4	client: Remove test dependency on client/driver	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	0544a57abe	drivers: Move client/drivers/executor to drivers/shared/executor	2018-11-30 10:46:13 +01:00
Danielle Tomlinson	1a29811169	drivers: Move client/drivers/env to drivers/shared/env As part of deprecating legacy drivers, we're moving the env package to a new drivers/shared tree, as it is used by the modern docker and rkt driver packages, and is useful for 3rd party plugins.	2018-11-30 10:46:13 +01:00
Nick Ethier	bbe420718a	Merge pull request #4922 from hashicorp/f-drivermananger add generic plugin manager interface and orchestration	2018-11-28 22:17:04 -05:00
Preetha	1f526db414	Merge pull request #4919 from hashicorp/f-fingerprint-attribute-type Modify fingerprint interface to use typed attribute struct	2018-11-28 14:18:28 -06:00
Michael Schurter	1bd9a9f9dd	Merge pull request #4894 from hashicorp/f-device-hook Device hook and devices affect computed node class	2018-11-28 12:10:43 -06:00
Preetha Appan	f89dbcd9cc	modify fingerprint interface to use typed attribute struct	2018-11-28 10:01:03 -06:00
Nick Ethier	60c6907ea5	client/plugin: remove println from plugin group func	2018-11-27 22:45:09 -05:00
Nick Ethier	600738e991	client/plugin: lint/spelling errors	2018-11-27 22:45:09 -05:00
Nick Ethier	45a6bf7acd	client/plugin: add generic plugin manager interface and orchestration	2018-11-27 22:45:03 -05:00
Mahmood Ali	ad1f8d8c20	Fixes in old lxc driver	2018-11-27 21:40:43 -05:00
Michael Schurter	3e56ee005a	add nil check around task resources in device hook Looking at NewTaskRunner I'm unsure whether TaskRunner.TaskResources (from which req.TaskResources is set) is intended to be nil at times or if the TODO in NewTaskRunner is intended to ensure it is always non-nil.	2018-11-27 17:25:33 -08:00
Michael Schurter	b75e9fce37	assume that slices contain only non-nil items	2018-11-27 17:25:33 -08:00
Michael Schurter	85073f9d29	client: properly support hook env vars The old approach was incomplete. Hook env vars are now: * persisted and restored between agent restarts * deterministic (LWW if 2 hooks set the same key)	2018-11-27 17:25:33 -08:00
Alex Dadgar	4ee603c382	Device hook and devices affect computed node class This PR introduces a device hook that retrieves the device mount information for an allocation. It also updates the computed node class computation to take into account devices. TODO Fix the task runner unit test. The environment variable is being lost even though it is being properly set in the prestart hook.	2018-11-27 17:25:33 -08:00
Michael Schurter	27e07f657e	Merge pull request #4896 from hashicorp/b-prevalloc-deadlock Fix deadlock in previous alloc watcher by emitting last alloc update	2018-11-27 19:07:16 -06:00
Michael Schurter	b75f79a793	fix test breakage caused by rebase	2018-11-27 16:34:01 -08:00
Michael Schurter	91da566935	fix mispelings	2018-11-27 16:33:55 -08:00
Chris Baker	a1fb1f3830	Merge pull request #4891 from hashicorp/b-1150-rkt-volume-names drivers/rkt: fix invalid volumes	2018-11-27 18:55:00 -05:00
Danielle Tomlinson	3651dbdc25	Merge pull request #4909 from hashicorp/b-restart-delay taskrunner: Return the restart delay correctly	2018-11-27 23:55:54 +01:00
Michael Schurter	22149a661e	client: comment on importance of chan ops ordering	2018-11-27 14:11:32 -08:00
Mahmood Ali	05a958dc21	Update client/structs/broadcaster.go Co-Authored-By: schmichael <michael.schurter@gmail.com>	2018-11-27 14:06:08 -08:00
Michael Schurter	81b6a24a84	client: fix send-after-close in broadcaster	2018-11-27 14:06:08 -08:00
Michael Schurter	c429e6b0ab	client: check if prev alloc is already terminated This is a defensive fast-path as 7c6aa0be already fixed the deadlock.	2018-11-27 14:06:08 -08:00
Michael Schurter	944ea6d38b	client: emit last sent alloc to new listeners Fixes a deadlock where the allocwatcher would block forever waiting for an update from a terminal alloc. Made the broadcaster easier to debug as well.	2018-11-27 14:06:08 -08:00
Michael Schurter	1e4ef139dd	Merge pull request #4883 from hashicorp/f-graceful-shutdown Support graceful shutdowns in agent	2018-11-27 15:55:15 -06:00
Michael Schurter	4f7e6f9464	client: fix races in use of goroutine group The group utility struct does not support asynchronously launched goroutines (goroutines-inside-of-goroutines), so switch those uses to a normal go call. This means watchNodeUpdates and watchNodeEvents may not be shutdown when Shutdown() exits. During nomad agent shutdown this does not matter. During tests this means a test may leak those goroutines or be unable to know when those goroutines have exited. Since there's no runtime impact and these goroutines do not affect alloc state syncing it seems ok to risk leaking them.	2018-11-26 12:52:55 -08:00
Michael Schurter	9f43fb6d29	client: reuse group instead of diy'ing it	2018-11-26 12:52:31 -08:00
Michael Schurter	22771aa19e	client/ar: remove useless wait ch from runTasks Arguably this makes task.WaitCh() useless, but I think exposing a wait chan from TaskRunners is a generically useful API.	2018-11-26 12:51:18 -08:00
Michael Schurter	2fdd013956	client: document how AR/TR Run methods behave	2018-11-26 12:50:35 -08:00
Chris Baker	9bd4317139	modified TaskConfig to include AllocID use this for volume names in drivers/rkt to address #1150	2018-11-26 18:54:26 +00:00
Nick Ethier	95362eaa02	Merge pull request #4844 from hashicorp/f-docker-plugin Docker driver plugin	2018-11-20 20:43:03 -05:00
Mahmood Ali	e1994e59bd	address review comments	2018-11-20 17:10:54 -05:00
Mahmood Ali	171b73fde7	Emit metric counters for Vault token and renewal failures	2018-11-20 17:10:54 -05:00
Mahmood Ali	5b10da5de6	Set User-Agent header when hitting Vault API	2018-11-20 17:10:54 -05:00
Danielle Tomlinson	093f029d5b	taskrunner: Return the restart delay correctly We were incorrectly returning a 0 duration to the taskrunner when determining when a task should restart. This would cause tasks to be restarted immediately, ignoring the restart {} stanza in a user's configuration. This commit causes us to return the restart duration to the task runner so it may correctly delay further execution.	2018-11-20 21:52:23 +01:00
Nick Ethier	3e42d6914e	task_runner: use NodeResources instead of deprecated struct	2018-11-20 13:46:39 -05:00
Nick Ethier	93c0200566	task_runner: use task and alloc copies instead of referencing the original pointer	2018-11-20 13:34:46 -05:00
Nick Ethier	29591a7c2e	task_runner: emit event on task exit with exit result details	2018-11-19 22:59:17 -05:00
Nick Ethier	4be8a86ef9	plugins/driver: remove NodeResources from task Resources and use PercentTicks field for docker driver	2018-11-19 22:59:17 -05:00
Nick Ethier	69049d37f5	drivers: added NodeResources to drivers.TaskConfig	2018-11-19 22:59:16 -05:00
Nick Ethier	8f8698b3e1	docker: started work on porting docker driver to new plugin framework	2018-11-19 22:59:15 -05:00
Michael Schurter	88577fe083	client.rpc: don't log errors on shutdown	2018-11-19 16:39:30 -08:00
Michael Schurter	5bd744ac3d	client: support graceful shutdowns Client.Shutdown now blocks until all AllocRunners and TaskRunners have exited their Run loops. Tasks are left running.	2018-11-19 16:39:30 -08:00
Mahmood Ali	9479015f51	Merge pull request #4884 from hashicorp/f-alloc-devices-cli Report alloc device statistics in API and CLI	2018-11-16 18:04:54 -05:00
Mahmood Ali	f139234372	address review comments	2018-11-16 17:13:01 -05:00
Mahmood Ali	f72e599ee7	Populate alloc stats API with device stats This change makes a few compromises: * Looks up the devices associated with tasks at look up time. Given that `nomad alloc status` is generally called rarely (compared to stats telemetry and general job reporting), it seems fine. However, the lookup overhead grows bounded by the number of `tasks x total-host-devices`, which can be significant. * `client.Client` performs the task devices->statistics lookup. It passes self to alloc/task runners so they can look up the device statistics allocated to them. * Currently alloc/task runners are responsible for constructing the entire RPC response for stats * The alternatives for making task runners device statistics aware don't seem appealing (e.g. having task runners contain reference to hostStats) * On the alloc aggregation resource usage, I did a naive merging of task device statistics. * Personally, I question the value of such aggregation, compared to costs of struct duplication and bloating the response - but opted to be consistent in the API. * With naive concatenation, device instances from a single device group used by separate tasks in the alloc would be aggregated in two separate device group statistics.	2018-11-16 10:26:32 -05:00
Michael Schurter	0cdb188ae4	tests: fix tests post-rebase	2018-11-15 17:40:56 -08:00
Michael Schurter	59f106ecee	client/tr: add a bit of context to envbuilder errors	2018-11-15 16:26:25 -08:00
Michael Schurter	742f8775ba	client: remove old proxy references from comments	2018-11-15 16:26:25 -08:00
Michael Schurter	2d0b44c3b4	client: test more env key variations	2018-11-15 16:26:25 -08:00
Michael Schurter	8bcd90d78d	client: add new nested variables to task's hcl ctx The error messages are really bad, but it's extremely difficult to produce good error messages without the original HCL.	2018-11-15 16:26:25 -08:00
Michael Schurter	5e51e2c2d5	client: turn env into nested objects for task configs	2018-11-15 16:25:57 -08:00
Michael Schurter	f8cdd561f0	client: interpolate driver configurations Also add missing SetDriverNetwork calls.	2018-11-15 16:25:57 -08:00
Mahmood Ali	046f098bac	Track Node Device attributes and serve them in API	2018-11-14 14:42:29 -05:00
Mahmood Ali	63acda956c	Add Client Device Stats structs in `api` package	2018-11-14 14:41:19 -05:00
Mahmood Ali	b74ccc742c	Expose Device Stats in /client/stats API endpoint	2018-11-14 14:41:19 -05:00
Mahmood Ali	c5de71a424	Allow nullable fields in StatValues In state values, we need to be able to distinguish between zero values (e.g. `false`) and unset values (e.g. `nil`). We can alternatively use protobuf `oneOf` and nested map to ensure consistency of fields that are set together, but the golang representation does not represent that well and introduces a mismatch between representations. Thus, I opted not to use it.	2018-11-14 14:41:19 -05:00
Mahmood Ali	713c9fe683	Move Stat{Object|Value} to plugins/shared/structs Moving them as they may be useful for other packages/plugins besides devices.	2018-11-14 09:01:26 -05:00
Mahmood Ali	1f4db08f42	Regenerate proto files with protoc-gen-go@v1.2.0	2018-11-14 09:01:26 -05:00
Danielle Tomlinson	0917e93537	Merge pull request #4869 from hashicorp/b-executor-stdout executor: Fix stdout stderr copy/paste	2018-11-13 19:22:37 -08:00
Mahmood Ali	865419e756	convert all config durations to strings in tests	2018-11-13 10:21:40 -05:00
Mahmood Ali	ac3b4571eb	Address review comments	2018-11-13 10:21:40 -05:00
Mahmood Ali	69f26783e4	avoid setting resource limit on rkt command Was accidentally modified in 5b14d24bf4626bab420d00783d92bcf25e0b641e.	2018-11-13 10:21:40 -05:00
Mahmood Ali	8fa26f5521	Fix docker log fetching in tests We no longer use syslog for tracking logs so tracking them explicitly here	2018-11-13 10:21:40 -05:00
Mahmood Ali	88fa968623	killing should be done with wait client Incidentally changed in 5b14d24bf4626bab420d00783d92bcf25e0b641e	2018-11-13 10:21:40 -05:00
Mahmood Ali	7690f389a0	Prioritize checking consumer context cancellation Tests expect the eventer to shut down immediately on context cancellation, but golang does not guarantee priority when multiple pending channels are ready in a select statement.	2018-11-13 10:21:40 -05:00
Mahmood Ali	c62ec124c0	Set clean config for mock driver The default job here contains some exec task config (for setting command and args) that aren't used for mock driver. Now, the alloc runner seems stricter about validating fields and errors on unexpected fields. Updating configs in tests so we can have an explicit task config whenever driver is set explicitly.	2018-11-13 10:21:40 -05:00
Mahmood Ali	e5e6f9a785	Update Docker name parsing lookup `ParseNamed` function changed in e9f3f2cfee9d729a8642344c4fa4ea70b2d49468 where it became `ParseNormalizedNamed` with extra checks.	2018-11-13 10:21:40 -05:00
Danielle Tomlinson	bfeded1f30	executor: Fix stdout stderr copy/paste	2018-11-12 22:08:04 -08:00
Alex Dadgar	c4f9e22aeb	fix race	2018-11-07 12:22:07 -08:00
Alex Dadgar	a7ca737fb6	review comments	2018-11-07 11:31:52 -08:00
Alex Dadgar	f0c7a8159b	tests	2018-11-07 10:43:15 -08:00
Alex Dadgar	204ca8230c	Device manager Introduce a device manager that manages the lifecycle of device plugins on the client. It fingerprints, collects stats, and forwards Reserve requests to the correct plugin. The manager also handles device plugins failing and validates their output.	2018-11-07 10:43:15 -08:00
Michael Schurter	a4e6a92d18	client: update alloc status when terminating Defensively update alloc status whenever killing all tasks.	2018-11-05 15:11:10 -08:00
Michael Schurter	66bf3db455	client: block on context as well as waitCh For lifecycle operations such as Restart and Kill, the client should not expect driver plugins to be well behaved and close their waitCh on context cancelation. Always wait on the passed in context as well as the waitCh.	2018-11-05 12:32:05 -08:00
Michael Schurter	b994f51990	client: fix tr lifecycle logic and shutdown delay ShutdownDelay must be honored whenever the task is killed or restarted. Services were not being deregistered prior to restarting.	2018-11-05 12:32:05 -08:00
Michael Schurter	2d3479147a	client: fix ar and tr tests	2018-11-05 12:32:05 -08:00
Michael Schurter	d29d09023e	client: do not run terminal allocs	2018-11-05 12:32:05 -08:00
Michael Schurter	2bbd88888c	client: first pass at implementing task restoring Task restoring works but dead tasks may be restarted	2018-11-05 12:32:05 -08:00
Nick Ethier	b0ddc03409	Merge pull request #4765 from jippi/increase-line-scan-limit fix: increase log rotator line scan limit	2018-10-29 18:46:30 -07:00
Nick Ethier	3fcf8ba7e6	Merge pull request #4795 from hashicorp/f-plugin-config Pass client configuration to plugins through loader	2018-10-29 18:42:27 -07:00
Nick Ethier	bda3b1d3b3	rename NomadConfig to ClientAgentConfig	2018-10-29 21:34:34 -04:00
Michael Schurter	6f2cffb196	Merge pull request #4803 from hashicorp/b-leader-fixes AR Fixes: task leader handling, restoring, state updating, AR.Destroy deadlocks	2018-10-29 17:38:59 -05:00
Michael Schurter	d71a1b4547	tests: more fixes due to api changes	2018-10-29 15:25:22 -07:00
Preetha Appan	b85cc38f3d	Stat path to binary to handle raw exec driver interpolated binary path	2018-10-26 17:24:05 -05:00
Preetha Appan	55ac8d3d12	Fix test linting	2018-10-26 10:30:12 -05:00
Michael Schurter	b7a9d61a38	ar: initialize allocwatcher on restore Fixes a panic. Left a comment on how the behavior could be improved, but this is what releases <0.9.0 did.	2018-10-19 09:45:45 -07:00
Michael Schurter	e060174130	ar: fix leader handling, state restoring, and destroying unrun ARs * Migrated all of the old leader task tests and got them passing * Refactor and consolidate task killing code in AR to always kill leader tasks first * Fixed lots of issues with state restoring * Fixed deadlock in AR.Destroy if AR.Run had never been called * Added a new in memory statedb for testing	2018-10-19 09:45:45 -07:00
Nick Ethier	58b430edae	added driver specific client config struct to plugin configuration	2018-10-18 23:31:01 -04:00
Michael Schurter	cefbf00bf0	ar: refactor task killing into 1 method Update comments and address some PR comments from #4775	2018-10-17 10:06:59 -07:00
Michael Schurter	21d78be961	tests: explicitly cleanup after clients	2018-10-17 10:06:59 -07:00
Michael Schurter	222f6b5741	ar: fix task leader, update, and stop handling	2018-10-17 10:06:59 -07:00
Michael Schurter	1badbb2fc4	tr: cleanup hook logs	2018-10-17 09:42:32 -07:00
Nick Ethier	65adb80ebf	plumb NomadConfig into plugins	2018-10-16 22:47:22 -04:00
Nick Ethier	d94b631b6b	drivers/exec: add exec implementation	2018-10-16 22:45:28 -04:00
Michael Schurter	0baaba8b09	templates: fix tests	2018-10-16 16:56:57 -07:00
Michael Schurter	838ddf4d4a	fix linter errors	2018-10-16 16:56:57 -07:00
Michael Schurter	e27c82ea4d	client: remove unused handleproxy	2018-10-16 16:56:56 -07:00
Michael Schurter	4ea5217d72	tr: remove unused DriverHandle interface was causing typed nil interface panics and served no purpose	2018-10-16 16:56:56 -07:00
Michael Schurter	528c426c53	Port client portion of #4392 to new taskrunner PR #4392 was merged to master after allocrunnerv2 was branched, so the client-specific portions must be ported from master to arv2.	2018-10-16 16:56:56 -07:00
Michael Schurter	f12501d4c3	tr: implement dispatch payload hook Now passing the TaskDir struct to prestart hooks instead of just the root task dir itself as dispatch needs local/.	2018-10-16 16:56:56 -07:00
Nick Ethier	d9f0cbf4a9	client: log retry during driver fingerprint redispense	2018-10-16 16:56:56 -07:00
Nick Ethier	c7ac1186c9	client: add test for driverfailure during fingerprinting	2018-10-16 16:56:56 -07:00
Nick Ethier	8cf669b5aa	taskrunner: return error on waitCh	2018-10-16 16:56:56 -07:00
Nick Ethier	047fad2953	client: simplify driver plugin logic from review comments	2018-10-16 16:56:56 -07:00
Nick Ethier	9686e1b258	client: fix broken tests from refactoring	2018-10-16 16:56:56 -07:00
Nick Ethier	3183b33d24	client: review comments and fixup/skip tests	2018-10-16 16:56:56 -07:00
Nick Ethier	f192c3752a	client: refactor post allocrunnerv2 finalization	2018-10-16 16:56:56 -07:00
Nick Ethier	4a4c7dbbfc	client: begin driver plugin integration client: fingerprint driver plugins	2018-10-16 16:56:56 -07:00
Alex Dadgar	7946a14aa8	Fix lints	2018-10-16 16:56:56 -07:00
Alex Dadgar	89dafaaea9	compile on windows	2018-10-16 16:56:56 -07:00
Alex Dadgar	ad4fac526c	more test fixes	2018-10-16 16:56:56 -07:00
Alex Dadgar	45e41cca03	allocrunnerv2 -> allocrunner	2018-10-16 16:56:56 -07:00
Alex Dadgar	9baa7402ef	fix test compiling	2018-10-16 16:56:55 -07:00
Alex Dadgar	7d9c069f09	skip building deprecated files	2018-10-16 16:56:55 -07:00
Alex Dadgar	6c9d9d5173	move files around	2018-10-16 16:56:55 -07:00
Michael Schurter	5f696608a6	tests: fix missing logger caused by bad merge	2018-10-16 16:56:55 -07:00
Michael Schurter	048510b13e	tr: properly comment handle fields	2018-10-16 16:56:55 -07:00
Michael Schurter	9e49ed3464	ar: AllocState should not mutate ar.state If ar.state.TaskStates has not been set, set it on the copy of ar.state. That keeps ar.state manipulations in one location and allows AllocState to only acquire read-locks.	2018-10-16 16:56:55 -07:00
Michael Schurter	f279b1d1b1	tests: test logs endpoint against pending task Although the really exciting change is making WaitForRunning return the allocations that it started. This should cut down test boilerplate significantly.	2018-10-16 16:56:55 -07:00
Michael Schurter	dd4227f84a	tests: make a test client/config easier to generate Sadly can't move the fingerprint timeout tweak into the helper due to circular imports.	2018-10-16 16:56:55 -07:00
Michael Schurter	1d747048ea	tests: ensure task state is initialized in NewAR Also expose NoopDB for use in tests.	2018-10-16 16:56:55 -07:00
Michael Schurter	960f3be76c	client: expose task state to client The interesting decision in this commit was to expose AR's state and not a fully materialized Allocation struct. AR.clientAlloc builds an Alloc that contains the task state, so I considered simply memoizing and exposing that method. However, that would lead to AR having two awkwardly similar methods: - Alloc() - which returns the server-sent alloc - ClientAlloc() - which returns the fully materialized client alloc Since ClientAlloc() could be memoized it would be just as cheap to call as Alloc(), so why not replace Alloc() entirely? Replacing Alloc() entirely would require Update() to immediately materialize the task states on server-sent Allocs as there may have been local task state changes since the server received an Alloc update. This quickly becomes difficult to reason about: should Update hooks use the TaskStates? Are state changes caused by TR Update hooks immediately reflected in the Alloc? Should AR persist its copy of the Alloc? If so, are its TaskStates canonical or the TaskStates on TR? So! Forget that. Let's separate the static Allocation from the dynamic AR & TR state! - AR.Alloc() is for static Allocation access (often for the Job) - AR.AllocState() is for the dynamic AR & TR runtime state (deployment status, task states, etc). If code needs to know the status of a task: AllocState() If code needs to know the names of tasks: Alloc() It should be very easy for a developer to reason about which method they should call and what they can do with the return values.	2018-10-16 16:56:55 -07:00
Michael Schurter	fb4aa74153	client: add comment	2018-10-16 16:56:55 -07:00
Michael Schurter	9a7e6be2b6	client: fix potentially dropped streaming errors	2018-10-16 16:56:55 -07:00
Michael Schurter	4b44b9039b	tr: remove unneeded lock; chan synchronizes access	2018-10-16 16:56:55 -07:00
Michael Schurter	211b96bb5c	tr: fix shutdown/destroy/WaitResult handling Multiple receivers raced for the WaitResult when killing tasks which could lead to a deadlock if the "wrong" receiver won. Wrap handlers in an ugly little proxy to avoid this. At first I wanted to push this into drivers, but the result is tied to the TR's handle lifecycle -- not the lifecycle of an alloc or task.	2018-10-16 16:56:55 -07:00
Michael Schurter	951ed17436	client: do not inspect task state to follow logs "Ask forgiveness, not permission." Instead of peeking at TaskStates (which are no longer updated on the AR.Alloc() view of the world) to only read logs for running tasks, just try to read the logs and improve the error handling if they don't exist. This should make log streaming less dependent on AR/TR behavior. Also fixed a race where the log streamer could exit before reading an error. This caused no logs or errors to be displayed sometimes when an error occurred.	2018-10-16 16:56:55 -07:00
Michael Schurter	2325348053	mock_driver: close waitCh after exiting mock_driver wasn't behaving like other driver handles.	2018-10-16 16:56:55 -07:00
Michael Schurter	8d1419c62b	client: fix accessing alloc runners * GetClientAlloc() gains nothing from using allAllocs() * getAllocatedResources was calling getAllocRunners() twice	2018-10-16 16:56:55 -07:00
Michael Schurter	55ab491801	tr: remove wip comments	2018-10-16 16:56:55 -07:00
Michael Schurter	3ccc091a72	ar: lock around accessing tasks Specify that Alloc() does not return updated task states.	2018-10-16 16:56:55 -07:00
Alex Dadgar	6f0ed6184b	Fix client reloading and pass the plugin loaders to server and client	2018-10-16 16:56:55 -07:00
Nick Ethier	352c05cdf4	plugin/drivers: plumb in stdout/stderr paths	2018-10-16 16:53:31 -07:00
Nick Ethier	0e3f85222a	driver/raw_exec: port existing raw_exec tests and add some testing utilities	2018-10-16 16:53:31 -07:00
Nick Ethier	d9628ff394	driver/raw_exec: more tests and bug fixes added wrapper struct for plugin.ReattachConfig to better handle serialization	2018-10-16 16:53:31 -07:00
Nick Ethier	bcc5c4a8bd	clientv2: base driver plugin (#4671) Driver plugin framework to facilitate development of driver plugins. Implementing plugins only need to implement the DriverPlugin interface. The framework proxies this interface to the go-plugin GRPC interface generated from the driver.proto spec. A testing harness is provided to allow implementing drivers to test the full lifecycle of the driver plugin. An example use: func TestMyDriver(t *testing.T) { harness := NewDriverHarness(t, &MyDriverPlugin{}) /* The harness implements the DriverPlugin interface and can be used as such */ taskHandle, err := harness.StartTask(...) }	2018-10-16 16:53:31 -07:00
Michael Schurter	62c1285afc	tr: add comments and cleanup call signature From review comments on #4649 left post-merge.	2018-10-16 16:53:31 -07:00
Nick Ethier	5dee1141d1	executor v2 (#4656) * client/executor: refactor client to remove interpolation * executor: POC libcontainer based executor * vendor: use hashicorp libcontainer fork * vendor: add libcontainer/nsenter dep * executor: updated executor interface to simplify operations * executor: implement logging pipe * logmon: new logmon plugin to manage task logs * driver/executor: use logmon for log management * executor: fix tests and windows build * executor: fix logging key names * executor: fix test failures * executor: add config field to toggle between using libcontainer and standard executors * logmon: use discover utility to discover nomad executable * executor: only call libcontainer-shim on main in linux * logmon: use separate path configs for stdout/stderr fifos * executor: windows fixes * executor: created reusable pid stats collection utility that can be used in an executor * executor: update fifo.Open calls * executor: fix build * remove executor from docker driver * executor: Shutdown func to kill and cleanup executor and its children * executor: move linux specific universal executor funcs to separate file * move logmon initialization to a task runner hook * client: doc fixes and renaming from code review * taskrunner: use shared config struct for logmon fifo fields * taskrunner: logmon only needs to be started once per task	2018-10-16 16:53:31 -07:00
Michael Schurter	e6e2930a00	tr: implement stats collection hook Tested except for the net/rpc specific error case which may need changing in the gRPC world.	2018-10-16 16:53:31 -07:00
Michael Schurter	86bd329539	fix build errors post merges	2018-10-16 16:53:31 -07:00
Michael Schurter	a977e22028	test: cleanup mock consul service client Updated to hclog. It exposed fields that required an unexported lock to access. Created a getter method instead. Only old allocrunner currently used this feature.	2018-10-16 16:53:31 -07:00
Michael Schurter	6f92b04226	health_hook: simplify locking; test thoroughly Use doneCh like @dadgar suggested in the original PR. Thoroughly test hook as concurrent Update calls make for a tricky concurrency problem.	2018-10-16 16:53:30 -07:00
Alex Dadgar	cebfead6bc	add logger back	2018-10-16 16:53:30 -07:00
Nick Ethier	03422aa529	fifo: add new fifo package for named pipes (#4665) * fifo: add new fifo package for named pipes	2018-10-16 16:53:30 -07:00
Alex Dadgar	8504505c0d	client uses passed logger and fix fingerprinters	2018-10-16 16:53:30 -07:00
Nick Ethier	66ff12e5f7	Update runc/libcontainer and friends (#4655) * vendor: bump libcontainer and docker to remove Sirupsen imports * vendor: fix bad vendoring of archive package * vendor: fix api changes to cgroups in executor * vendor: fix docker api changes * vendor: update github.com/Azure/go-ansiterm to use non capitalized logrus import	2018-10-16 16:53:30 -07:00
Michael Schurter	195b8127fb	health_hook: fix panic and add tests Still more testing to do, but I want to get this panic fixed ASAP. All new tests pass with -race	2018-10-16 16:53:30 -07:00
Michael Schurter	64efc3d301	Emit events before long operations Append when there's nothing blocking between appending and sending an update to the server.	2018-10-16 16:53:30 -07:00
Michael Schurter	a2b696c4cf	Use a semaphore to block until watcher exits	2018-10-16 16:53:30 -07:00
Michael Schurter	a73162c977	ar: use multierror in update hook loop Make it match TaskRunner update hook behavior	2018-10-16 16:53:30 -07:00
Michael Schurter	a7b427718c	tr: refactor EmitEvents into Emit+Append * UpdateState: set state, append event, persist, update servers * EmitEvent: append event, persist, update servers * AppendEvent: append event, persist AppendEvent may not even have to persist, but for the sake of correctness I'm going with that for now.	2018-10-16 16:53:30 -07:00
Michael Schurter	93f3ac9ed6	ar: create health setting shim for health watcher	2018-10-16 16:53:30 -07:00
Michael Schurter	4d5aaac6d2	fix detection of task transitioning to running	2018-10-16 16:53:30 -07:00
Michael Schurter	4136e59f79	arv2: implement alloc health watching Also remove initial alloc from broadcaster as it just caused useless extra processing.	2018-10-16 16:53:30 -07:00
Michael Schurter	5c5c6dc41b	refactor ar hooks into their own files minimize passed dependencies to ease testing	2018-10-16 16:53:30 -07:00
Michael Schurter	0bbf3a93ee	make AllocBroadcaster easier to use And test thoroughly.	2018-10-16 16:53:30 -07:00
Michael Schurter	9d1ea3b228	client: hclog-ify most of the client Leaving fingerprinters in case that interface changes with plugins.	2018-10-16 16:53:30 -07:00
Michael Schurter	e42154fc46	implement stopping, destroying, and disk migration * Stopping an alloc is implemented via Updates but update hooks are not run. * Destroying an alloc is a best effort cleanup. * AllocRunner destroy hooks implemented. * Disk migration and blocking on a previous allocation exiting moved to its own package to avoid cycles. Now only depends on alloc broadcaster instead of also using a waitch. * AllocBroadcaster now only drops stale allocations and always keeps the latest version. * Made AllocDir safe for concurrent use Lots of internal contexts that are currently unused. Unsure if they should be used or removed.	2018-10-16 16:53:30 -07:00
Michael Schurter	4236255686	lots of comment/log fixes	2018-10-16 16:53:30 -07:00
Michael Schurter	5749ede04e	keep forgetting lxc	2018-10-16 16:53:30 -07:00
Michael Schurter	357641c364	persist alloc state on changes, not periodically Allow alloc and task runners to persist their own state when something changes instead of periodically syncing all state.	2018-10-16 16:53:30 -07:00
Michael Schurter	820af27171	wrap boltdb in a write deduplicator Saves a tiny bit of cpu and some IO. Sadly doesn't prevent all IO on duplicate writes as the transactions are still created and committed. $ go test -bench=. -benchmem goos: linux goarch: amd64 pkg: github.com/hashicorp/nomad/helper/boltdd BenchmarkWriteDeduplication_On-4 500 4059591 ns/op 23736 B/op 56 allocs/op BenchmarkWriteDeduplication_Off-4 300 4115319 ns/op 25942 B/op 55 allocs/op	2018-10-16 16:53:30 -07:00
Michael Schurter	990228a6e2	wip wrap boltdb to get path information finished but doesn't handle deleting deeply nested buckets	2018-10-16 16:53:30 -07:00
Michael Schurter	a3fe0510d1	Move all encoding and put deduping into state db Still WIP as it does not handle deletions.	2018-10-16 16:53:30 -07:00
Michael Schurter	533bc93b3a	implement all boltdb interactions behind StateDB	2018-10-16 16:53:30 -07:00
Michael Schurter	d890de036a	tr: persist hook state whenever it changes	2018-10-16 16:53:30 -07:00
Michael Schurter	fae5e89a0e	artifacts: don't emit event when there's no artifacts	2018-10-16 16:53:30 -07:00
Michael Schurter	5383d20505	removing old restoration path before api change	2018-10-16 16:53:30 -07:00
Michael Schurter	a5d3e3fb0a	Implement alloc updates in arv2 Updates are applied asynchronously but sequentially	2018-10-16 16:53:30 -07:00
Michael Schurter	39b3f3a85b	call handle.Network() instead of storing it	2018-10-16 16:53:30 -07:00
Michael Schurter	7132b67c1e	Add Network method to Handle interface Should probably be moved to an Inspect method in the Driver Plugin world	2018-10-16 16:53:30 -07:00
Michael Schurter	a4b4d7b266	consul service hook Deregistration works but difficult to test due to terminal updates not being fully implemented in the new client/ar/tr.	2018-10-16 16:53:29 -07:00
Michael Schurter	5be982e674	restore vault client	2018-10-16 16:53:29 -07:00
Michael Schurter	ce04915c9f	log before killing tasks	2018-10-16 16:53:29 -07:00
Michael Schurter	a2bf851805	no need for TaskStateUpdated to return an error also updated comments	2018-10-16 16:53:29 -07:00
Alex Dadgar	fd3bc1bd39	Update state with server	2018-10-16 16:53:29 -07:00
Alex Dadgar	bc905cc61d	Define and thread through state updating interface	2018-10-16 16:53:29 -07:00
Michael Schurter	9a63d6103d	tr: add validate task hook	2018-10-16 16:53:29 -07:00
Michael Schurter	7f4ec50906	missed locking around c.allocs access	2018-10-16 16:53:29 -07:00
Alex Dadgar	c93cfc89c0	wip	2018-10-16 16:53:29 -07:00
Alex Dadgar	7ddc0eb65c	Fix deadlock	2018-10-16 16:53:29 -07:00
Alex Dadgar	3779077052	Remove SetState from interface	2018-10-16 16:53:29 -07:00
Alex Dadgar	e1ba73b515	compile	2018-10-16 16:53:29 -07:00
Michael Schurter	6ebdf532ea	wip split event emitting and state transitions	2018-10-16 16:53:29 -07:00
Michael Schurter	516d641db0	client: implement all-or-nothing alloc restoration Restoring calls NewAR -> Restore -> Run NewAR now calls NewTR AR.Restore calls TR.Restore AR.Run calls TR.Run	2018-10-16 16:53:29 -07:00
Alex Dadgar	e401c660e7	Implement lifecycle hooks on the task runner	2018-10-16 16:53:29 -07:00
Alex Dadgar	89b4ba9cc8	comments	2018-10-16 16:53:29 -07:00
Alex Dadgar	86e81947b4	Hook renames	2018-10-16 16:53:29 -07:00
Alex Dadgar	2599cf9d74	remove comment	2018-10-16 16:53:29 -07:00
Alex Dadgar	88aa0299a9	Template hook	2018-10-16 16:53:29 -07:00
Alex Dadgar	c9765deff1	address comments	2018-10-16 16:53:29 -07:00
Alex Dadgar	80f6ce50c0	vault hook	2018-10-16 16:53:29 -07:00
Michael Schurter	30d377eba4	tr: improve skip log line	2018-10-16 16:53:29 -07:00
Michael Schurter	ef213b864b	tr: pass context to hooks	2018-10-16 16:53:29 -07:00
Michael Schurter	3a4f387fd3	tr: fix setting done in existing hooks	2018-10-16 16:53:29 -07:00
Michael Schurter	b360f6f96e	fix hclog level	2018-10-16 16:53:29 -07:00
Michael Schurter	ae89b7da95	reimplement success state for tr hooks and state persistence splits apart local and remote persistence removes some locking for now	2018-10-16 16:53:29 -07:00
Michael Schurter	4f43ff5c51	pass statedb into allocrunnerv2	2018-10-16 16:53:29 -07:00
Michael Schurter	582c76a420	remove unused allocrunner shim	2018-10-16 16:53:29 -07:00
Michael Schurter	c5504bd939	tr: cleanup main loop and shutdown hook impl	2018-10-16 16:53:29 -07:00
Michael Schurter	561260d6fe	tr: skip error/success saving All hooks only need to be run once. Since only one hook can fail per run there's no need to track errors on a per hook basis.	2018-10-16 16:53:29 -07:00
Michael Schurter	67874e761f	tr: don't lock for immutable fields	2018-10-16 16:53:29 -07:00
Michael Schurter	f473cd03d6	tr: start update/shutdown logic	2018-10-16 16:53:29 -07:00
Michael Schurter	637ef264ae	Copy TR.Config vals to TR I think I like this pattern better as some Config vals are mutable (Alloc) and some aren't and some are used to derive other values and never used directly. Promoting them onto the TR struct is a little more work but is hopefully more clear as to how each value is used.	2018-10-16 16:53:29 -07:00
Michael Schurter	0f7dcfdc9a	example redis job "runs" on arv2! see below Tons left to do and lots of churn: 1. No state saving 2. No shutdown or gc 3. Removed AR factory for now 4. Made all "Config" structs local to the package they configure 5. Added allocID to GC to avoid a lookup Really hating how many things use *structs.Allocation. It's not bad without state saving, but if AllocRunner starts updating its copy things get racy fast.	2018-10-16 16:53:29 -07:00
Michael Schurter	9a6aa38b0f	begin adding AllocRunner.Update	2018-10-16 16:53:29 -07:00
Michael Schurter	eae54e2954	artifact task hook	2018-10-16 16:53:29 -07:00
Alex Dadgar	b9bed81e6e	Initial V2 alloc runner	2018-10-16 16:53:28 -07:00
Alex Dadgar	a78cefec18	use int64	2018-10-16 15:34:32 -07:00
Preetha Appan	7c0d8c646c	Change CPU/Disk/MemoryMB to int everywhere in new resource structs	2018-10-16 16:21:42 -05:00
Christian Winther	0c5154100c	fix: increase log rotator line scan limit When gelf/json logging is used, it's fairly easy to exceed the 16k limit, resulting in json output being cut up into multiple strings. The result is invalid json lines, which can create all kinds of badness in the logging server. This fixes https://github.com/hashicorp/nomad/issues/4699 Signed-off-by: Christian Winther <jippignu@gmail.com>	2018-10-09 18:57:18 +02:00
Alex Dadgar	01f8e5b95f	renames	2018-10-04 14:57:25 -07:00
Alex Dadgar	52f9cd7637	fixing tests	2018-10-04 14:26:19 -07:00
Alex Dadgar	bac5cb1e8b	Scheduler uses allocated resources	2018-10-02 17:08:25 -07:00
Alex Dadgar	5c8697667e	Node reserved resources	2018-09-29 18:44:55 -07:00
Alex Dadgar	3183153315	Node resources on client	2018-09-29 17:23:41 -07:00
Alex Dadgar	9971b3393f	yamux	2018-09-17 14:22:40 -07:00
Alex Dadgar	ca28afa3b2	small fixes	2018-09-15 16:42:38 -07:00
Alex Dadgar	7739ef51ce	agent + consul	2018-09-13 10:43:40 -07:00
Michael Schurter	08862fc177	fix race around error handling	2018-09-05 17:34:17 -07:00
Michael Schurter	6def5bc4f9	client: set host name when migrating over tls Not setting the host name led the Go HTTP client to expect a certificate with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad` names ephemeral dir migrations were broken when TLS was enabled. Added an e2e test to ensure this doesn't break again as it's very difficult to test and the TLS configuration is very easy to get wrong.	2018-09-05 17:24:17 -07:00
Alex Dadgar	c6576ddac1	Fix make check errors	2018-09-04 16:03:52 -07:00
Alex Dadgar	089b533047	Fix kill timeout exceeding 5m on Docker driver Fixes an issue where the Docker API client would timeout before the kill timeout was hit.	2018-08-17 16:01:09 -07:00
Alex Dadgar	49a1ba9297	Merge pull request #4535 from hashicorp/f-keep-docker-container-0.8.4 Option to prevent removal of container on exit	2018-07-26 11:11:22 -07:00
Charlie Voiselle	f319a149cd	Option to prevent removal of container on exit	2018-07-26 11:10:48 -07:00
Michael Schurter	ddf948001e	Merge pull request #4462 from omame/omame/cpu_cfs_period Add support for specifying cpu_cfs_period in the Docker driver	2018-07-25 09:34:38 -07:00
Daniele Valeriani	b0a14caca2	Add test for cpu_cfs_period	2018-07-16 22:43:34 +02:00
Michael Schurter	91588cb861	rkt: revert to redis 3.2 to favor stability	2018-07-09 16:15:32 -07:00
Michael Schurter	c56f899ee9	rkt: speed up tests Disable networking when it's not needed and improve failure message for UserGroup test by including the full ps output on failure.	2018-07-09 14:02:27 -07:00
Michael Schurter	a1d4f77ce0	rkt: skip retrieving network information when net=none Even when net=none we would attempt to retrieve network information from rkt which would spew useless log lines such as: ``` testlog.go:30: 20:37:31.409209 [DEBUG] driver.rkt: failed getting network info for pod UUID 8303cfe6-0c10-4288-84f5-cb79ad6dbf1c attempt 2: no networks found. Sleeping for 970ms ``` It would also delay tests for ~60s during the network information retry period. So skip this when net=none. It's unlikely anyone actually uses net=none outside of tests, so I doubt anyone will notice this change. Official docs: https://coreos.com/rkt/docs/latest/networking/overview.html#no-loopback-only-networking	2018-07-09 13:44:43 -07:00
Michael Schurter	0fbc84b81d	tests: make alloc id consistent in helper It worked, but the old code used a different alloc id for the path than the actual alloc! Use the same alloc id everywhere to prevent confusing test output.	2018-07-09 13:37:35 -07:00
Michael Schurter	f3b8815c96	rkt: fix failing TestRktDriver_UserGroup test Started failing due to the docker redis image switching from Debian jessie to stretch: `53f8680550 (diff-acff46b161a3b7d6ed01ba79a032acc9)` Switched from Debian based image to Alpine to get a working `ps` command again (albeit busybox's stripped down implementation)	2018-07-09 12:19:02 -07:00
Daniele Valeriani	748f6afd89	Validate the value of cpu_cfs_period	2018-07-02 22:30:22 +02:00
Daniele Valeriani	9364446a03	Remove an unnecessary conversion	2018-07-02 17:47:23 +02:00
Daniele Valeriani	906952a2c8	Add support for specifying cpu_cfs_period in the Docker driver	2018-07-02 16:37:04 +02:00
Preetha	b567750824	Merge pull request #4392 from burdandrei/telemetry-parametrized-jobs Parametrized/periodic jobs per child tagged metric emission	2018-06-21 17:13:36 -05:00
Preetha	043f4c208b	Merge pull request #3882 from burdandrei/telemetry-add-node-class-tag Added node class to tagged metrics	2018-06-21 17:04:35 -05:00
Andrei Burd	444ee45aff	Parametrized/periodic jobs per child tagged metric emission	2018-06-21 10:40:56 +03:00
James Rasell	75f95ccf09	Merge branch 'master' into f_gh_4381	2018-06-19 17:51:57 +02:00
Alex Dadgar	b61051b3cd	Merge pull request #4409 from hashicorp/r-client-packages Refactor client packages	2018-06-13 17:32:25 -07:00
Alex Dadgar	22757d964e	lint	2018-06-13 16:06:39 -07:00
Alex Dadgar	af558df94c	Fix test using a lot of memory	2018-06-13 15:52:25 -07:00
Alex Dadgar	300b1a7a15	Tests only use testlog package logger	2018-06-13 15:40:56 -07:00
Chelsea Komlo	03075b603a	Merge pull request #4399 from hashicorp/r-reload-refactor Refactor logic for dynamic reloading	2018-06-13 13:35:12 -04:00
Alex Dadgar	9bab9edf27	test fixes	2018-06-12 17:45:39 -07:00
Alex Dadgar	90c2108bfb	Fix gc tests + parallel destroy + small test fixes	2018-06-12 10:23:45 -07:00
Alex Dadgar	f5ff509fa5	Refactor - wip	2018-06-12 10:23:45 -07:00
Alex Dadgar	ff2ab8f58e	Fix vault template test	2018-06-12 09:57:28 -07:00
Alex Dadgar	d0043691fb	remove structs + bump version	2018-06-11 13:52:19 -07:00
Alex Dadgar	af5753d2cd	bump version + generated files	2018-06-11 13:39:42 -07:00
Nick Ethier	f36eb14360	Merge pull request #4403 from hashicorp/b-fix-dispatched-optional-meta Fix dispatched optional meta correctly	2018-06-11 16:17:14 -04:00
Nick Ethier	e75e3ae665	nomad: use require pkg for tests	2018-06-11 13:50:50 -04:00
Nick Ethier	3aa6241b5c	client/driver/env: fix optional meta test	2018-06-11 12:29:13 -04:00
Nick Ethier	c65882cafd	client/driver/env: use 'job.Dispatch' to trigger optional meta logic	2018-06-11 12:15:19 -04:00
Nick Ethier	ccb5372813	Revert "Revert "client/driver/env: interpolate empty optional meta params as empty strings"" This reverts commit c17e0fc9dc5fd288935ab2b68fb441b4d25ac189.	2018-06-11 11:59:23 -04:00
Michael Schurter	c198cfd8ea	executor: fix log line formatting	2018-06-08 14:55:39 -07:00
Michael Schurter	d1a60e700e	executor: fix Windows blocking on pipe close Sending the Ctrl-Break signal to PowerShell <6 causes it to drop into debug mode. Closing its output pipe at that point will block indefinitely and prevent the process from being killed by Nomad. See the upstream powershell issue for details: https://github.com/PowerShell/PowerShell/issues/4254	2018-06-08 14:48:05 -07:00
Chelsea Holland Komlo	f74e74b22d	add client logic to determine whether TLS RPC connections should reload	2018-06-08 14:38:58 -04:00
James Rasell	b9009c419c	Add 'nomad.advertise.address' to client meta via NomadFingerPrint This change removes the addition of the advertise address to the exported task env vars and instead moves this work into the NomadFingerprint.Fingerprint which adds this value to the client attrs. This can then be used within a Nomad job like ${attr.nomad.advertise.address}.	2018-06-08 09:44:10 +02:00
Alex Dadgar	d9b35fab52	Revert "client/driver/env: interpolate empty optional meta params as empty strings" This reverts commit 84926f759a63a90be7bbcf0fad78deb3f02af23d.	2018-06-07 16:27:47 -07:00
Nick Ethier	b3c767fae0	client/driver: drop docker pull progress estimate if it's < 0	2018-06-07 15:23:31 -04:00
James Rasell	367a8b5152	Add the local clients advertise address to interpolation env vars This commit adds the Nomad local client advertise address in the form host:port to the environment variables passed to each task.	2018-06-07 09:45:15 +02:00
Alex Dadgar	98705824ed	Merge pull request #4185 from jesusvazquez/add-counter-metric-for-oom-killer-events Add driver.docker counter metric for OOM Killer events	2018-06-04 15:12:51 -07:00
Alex Dadgar	23cd56dc78	remove generated structs	2018-06-01 16:11:28 -07:00
Alex Dadgar	bf5b5747ab	fix test message	2018-06-01 15:51:54 -07:00
Alex Dadgar	3e3d3c7445	Disable Exec on non-linux platforms This PR disables exec on non-linux platforms	2018-06-01 15:48:14 -07:00
Alex Dadgar	c0386819b3	bump version/lint/generated files	2018-06-01 15:23:10 -07:00
Preetha Appan	ce6d4a8d7a	Fix tests and move isClient to constructor	2018-06-01 15:59:53 -05:00
Alex Dadgar	a62dd2aadb	Merge pull request #4350 from hashicorp/b-raw-exec-cgroups Raw exec can use cgroups to manage PIDs	2018-06-01 17:37:49 +00:00
Alex Dadgar	8da42940c9	wait for result	2018-06-01 10:14:53 -07:00
Alex Dadgar	40fec81315	Merge pull request #4277 from hashicorp/f-retry-join-clients Add go-discover support to Nomad clients	2018-06-01 16:57:40 +00:00
Alex Dadgar	460ecb8705	Comments	2018-05-31 18:05:03 -07:00
Alex Dadgar	de98774f2c	Add test and docs	2018-05-31 18:05:03 -07:00
Alex Dadgar	ff28b04c46	Use more appropriate name than cgroup	2018-05-31 18:05:03 -07:00
Alex Dadgar	37e900b1d3	Only use freezer/devices when in the basic cgroup only	2018-05-31 18:05:03 -07:00
Alex Dadgar	ffd9270f2f	Use cgroup when possible	2018-05-31 18:05:03 -07:00
Alex Dadgar	0ff0ed290d	Fix TestDockerDriver_StartNVersions	2018-05-31 17:14:59 -07:00
Alex Dadgar	7e6dd498c9	Remove debug logging	2018-05-31 15:52:42 -07:00
Alex Dadgar	b1b908527f	spelling	2018-05-31 15:29:55 -07:00
Alex Dadgar	a3b29553a5	Force close stdout/stderr after grace This commit changes the force closing of the stdout/stderr file descriptor from closing immediately to being closed after a grace period. This allows the created process to close its own file and allows copying of the data.	2018-05-31 15:21:36 -07:00
Alex Dadgar	5e787e2d72	test build	2018-05-31 12:22:31 -07:00
Alex Dadgar	ead1b7f423	Log more info for TestExecutor_IsolationAndConstraints	2018-05-31 11:57:44 -07:00
Alex Dadgar	b05740ad13	Merge pull request #4341 from hashicorp/f-docker-pids Support Docker Pids Limit	2018-05-31 17:59:29 +00:00
Chelsea Holland Komlo	064b5481e0	add server join info to server and client	2018-05-31 10:50:03 -07:00
Alex Dadgar	f4d4bbdc97	test pid limit	2018-05-30 12:55:24 -07:00
Chelsea Holland Komlo	94d510e969	Support Docker Pids Limit	2018-05-25 19:54:14 -04:00
Alex Dadgar	1685c8ebe4	cleanup	2018-05-24 16:25:20 -07:00
Alex Dadgar	2eacdb6bd6	Force closing of pipe to child process	2018-05-24 16:03:48 -07:00
Chelsea Holland Komlo	38f611a7f2	refactor NewTLSConfiguration to pass in verifyIncoming/verifyOutgoing add missing fields to TLS merge method	2018-05-23 18:35:30 -04:00
Preetha	9084bb025e	Merge pull request #4303 from hashicorp/b-docker-client-nil-panic Add nil check before setting timeout on docker client	2018-05-21 19:34:44 -07:00
Jesus Vazquez	23d959e42c	Add job, task, taskgroup to open method	2018-05-21 20:37:18 +02:00
Jesus Vazquez	0a062a04c7	Remove allocID from dockerhandle struct	2018-05-21 20:33:01 +02:00
Jesus Vazquez	e5a81815bb	Rename labels job, task_group and task	2018-05-21 20:32:50 +02:00
Jesus Vazquez	ffe1b1a1b6	Remove allocid label from driver.docker.oom counter metric	2018-05-21 20:30:56 +02:00
Alex Dadgar	38762d9bde	Merge pull request #4282 from hashicorp/f-rotator Avoid splitting log line across two files	2018-05-21 17:52:13 +00:00
Alex Dadgar	d95698e2c5	Merge pull request #4298 from justenwalker/docker-driver-digest-tags driver/docker: pull image with digest	2018-05-21 17:46:14 +00:00
Nick Ethier	6392009dd6	client/driver: use correct repo address when using docker-credential helper (#4266)	2018-05-15 17:39:48 -04:00
Justen Walker	a8989f33bb	driver/docker: add test for dockerImageRef	2018-05-14 14:24:03 -04:00
Justen Walker	194b2231d6	driver/docker: fix up TestParseDockerImage	2018-05-14 14:23:48 -04:00
Justen Walker	25b2807ce3	driver/docker: fix TestDockerDriver_ForcePull_RepoDigest	2018-05-14 14:23:02 -04:00
Nick Ethier	c4d07a2200	client/driver: guard authHelper test from running on windows	2018-05-14 13:46:57 -04:00
Justen Walker	b23ca7574c	driver/docker: cleanup parseDockerImage	2018-05-14 11:11:51 -04:00
Justen Walker	60f7f1aa08	driver/docker: pull image with digest GH #4290 Add digest support to the docker driver image config. This commit factors out some common code to print the repo:tag (dockerImageRef) for events/logs as well as parsing the image to retrieve the repo,tag (parseDockerImage) so that the results are consistent/sane for both repo:tag and repo@sha256:... references. When pulling an image with a digest, the tag is blank and the repo contains the digest. See: https://github.com/fsouza/go-dockerclient/blob/master/image_test.go#L471	2018-05-14 10:42:58 -04:00
Preetha Appan	de66ec7394	Add nil check before setting timeout on docker client	2018-05-11 17:09:26 -05:00
Alex Dadgar	7ad5c76734	Add new line test	2018-05-11 10:52:09 -07:00
Alex Dadgar	3671ed139d	Avoid splitting log line across two files We attempt to avoid splitting a log line between two files by detecting if we are near the file size limit and scanning for new lines and only flushing those. BenchmarkRotator/1KB-8 300000 5613 ns/op BenchmarkRotator/2KB-8 200000 8384 ns/op BenchmarkRotator/4KB-8 100000 14604 ns/op BenchmarkRotator/8KB-8 50000 25002 ns/op BenchmarkRotator/16KB-8 30000 47572 ns/op BenchmarkRotator/32KB-8 20000 92080 ns/op BenchmarkRotator/64KB-8 10000 165883 ns/op BenchmarkRotator/128KB-8 5000 294405 ns/op BenchmarkRotator/256KB-8 2000 572374 ns/op	2018-05-10 15:11:01 -07:00
Alex Dadgar	f5d91b5338	Benchmark for rotator BenchmarkRotator/1KB-8 200000 5572 ns/op BenchmarkRotator/2KB-8 200000 8338 ns/op BenchmarkRotator/4KB-8 100000 14246 ns/op BenchmarkRotator/8KB-8 50000 25279 ns/op BenchmarkRotator/16KB-8 30000 48602 ns/op BenchmarkRotator/32KB-8 20000 92159 ns/op BenchmarkRotator/64KB-8 10000 154766 ns/op BenchmarkRotator/128KB-8 5000 296872 ns/op BenchmarkRotator/256KB-8 3000 551793 ns/op	2018-05-10 14:15:15 -07:00
Nick Ethier	91603a377e	client/driver: parse repo instead of attempting to pull repo info	2018-05-09 22:34:25 -04:00
Nick Ethier	38a33f9c75	client/driver: add test for docker auth helper	2018-05-09 22:33:56 -04:00
Alex Dadgar	e067a9ae06	naming of constants	2018-05-09 16:46:52 -07:00
Chelsea Holland Komlo	796bae6f1b	allow configurable cipher suites disallow 3DES and RC4 ciphers add documentation for tls_cipher_suites	2018-05-09 17:15:31 -04:00
Alex Dadgar	0e79e1a46e	Keep stream and logs in sync for detecting closed pipe	2018-05-09 11:22:52 -07:00
Preetha	e7ae6e98d9	Merge pull request #4259 from hashicorp/f-deployment-improvements	2018-05-08 16:37:10 -05:00
Nick Ethier	3598925ca4	client/driver: use correct repo address when using docker-credential helper	2018-05-08 15:17:28 -04:00
Nick Ethier	54c86a0292	client/driver/env: interpolate empty optional meta params as empty strings	2018-05-07 20:19:51 -04:00
Nick Ethier	016ab7a105	client/driver: remove unused const 'dockerPullProgressEmitInterval'	2018-05-07 16:24:48 -04:00
Michael Schurter	f1d13683e6	consul: remove services with/without canary tags Guard against Canary being set to false at the same time as an allocation is being stopped: this could cause RemoveTask to be called with the wrong Canary value and leaking a service. Deleting both Canary values is the safest route.	2018-05-07 14:55:01 -05:00
Michael Schurter	50e04c976e	consul: support canary tags for services Also refactor Consul ServiceClient to take a struct instead of a massive set of arguments. Meant updating a lot of code but it should be far easier to extend in the future as you will only need to update a single struct instead of every single call site. Adds an e2e test for canary tags.	2018-05-07 14:55:01 -05:00
Alex Dadgar	df8fce4347	Ensure canaries tags are interpolated	2018-05-07 14:50:01 -05:00
Alex Dadgar	552604451c	rework where time gets set	2018-05-07 14:50:01 -05:00
Alex Dadgar	ee50789c22	Initial implementation	2018-05-07 14:50:01 -05:00
Nick Ethier	d8de354dbf	client/driver: add waiting layer status count to pull progress status msg	2018-05-07 12:18:20 -04:00
Nick Ethier	77af17efbc	client/driver: add separate handler for emitting pull progress	2018-05-07 12:17:34 -04:00
Nick Ethier	0bdd976b7d	client/driver: remove pull timeout due to race condition that can lead to unexpected timeouts If two jobs are pulling the same image simultaneously, whichever starts the pull first will set the pull timeout. This can lead to a poor UX where the first job requested a short timeout while the second job requested a longer timeout, causing the pull to potentially time out much sooner than expected by the second job.	2018-05-07 12:18:11 -04:00
Nick Ethier	7c5821d7c6	client/driver: do accounting on layer pull progress	2018-05-07 12:17:53 -04:00
Nick Ethier	8efda7dc6c	client/driver: emit progress to all allocs pulling same image	2018-05-07 12:17:34 -04:00
Nick Ethier	e35948ab91	client/driver: add image pull progress monitoring	2018-05-07 12:17:38 -04:00
Michael Schurter	0d534d30d6	Merge pull request #4251 from hashicorp/f-grpc-checks Support Consul gRPC Health Checks	2018-05-04 14:55:16 -07:00
Michael Schurter	f6a4713141	consul: make grpc checks more like http checks	2018-05-04 11:08:11 -07:00
Michael Schurter	382caec1e1	consul: initial grpc implementation Needs to be more like http.	2018-05-04 11:08:11 -07:00
Jesus Vazquez	08a390448b	Update counter driver.docker.oom labels	2018-05-04 14:02:34 +08:00
Jesus Vazquez	4f6db56283	Initialize dockerhandle with jobname, taskgroupname, taskname and allocid	2018-05-04 14:02:19 +08:00
Jesus Vazquez	127b764dfb	Add Job, taskgroupname, taskname, and allocid to the DockerHandle struct	2018-05-04 14:01:26 +08:00
Jesus Vazquez	fd1ff1a0cf	Run goimports	2018-05-04 13:46:36 +08:00
Jesus Vazquez	5dd4059527	Add driver.docker counter metric for OOM Killer events	2018-05-04 13:46:36 +08:00
Michael Schurter	526af6a246	framer: fix early exit/truncation in framer	2018-05-02 10:46:16 -07:00
Michael Schurter	f1a6aa103a	framer: fix race and remove unused error var In the old code `sending` in the `send()` method shared the Data slice's underlying backing array with its caller. Clearing StreamFrame.Data didn't break the reference from the sent frame to the StreamFramer's data slice.	2018-05-02 10:46:16 -07:00
Michael Schurter	7360fe3a6d	client: squelch errors on cleanly closed pipes	2018-05-02 10:46:16 -07:00
Michael Schurter	ffff97e25f	client: don't spin on read errors	2018-05-02 10:46:16 -07:00
Michael Schurter	5ef0a82e6e	client: reset encoders between uses According to go/codec's docs, Reset(...) should be called on Decoders/Encoders before reuse: https://godoc.org/github.com/ugorji/go/codec I could find no evidence that not calling Reset() caused bugs, but might as well do what the docs say?	2018-05-02 10:46:16 -07:00
Alex Dadgar	de4af37249	version bump and remove generated	2018-04-27 11:10:00 -07:00
Alex Dadgar	845a43864a	generated files	2018-04-27 10:45:40 -07:00
Alex Dadgar	35e06ddb31	Remove generated and version bump	2018-04-26 16:49:19 -07:00
Alex Dadgar	43192cefae	generated files	2018-04-26 16:28:58 -07:00
Michael Schurter	0e602d4779	Merge pull request #4188 from hashicorp/f-rkt-stats rkt: create parent cgroup to enable stats	2018-04-24 14:54:36 -07:00
Michael Schurter	d687761ebf	rkt: test Stats() and always run tests Remove the NOMAD_TEST_RKT flag as a guard for rkt tests. Still require Linux, root, and rkt to be installed. Only check for rkt installation once in hopes of speeding up rkt tests a bit.	2018-04-24 11:05:42 -07:00
Javier Palomo Almena	3e6c01ffa1	docker tests: Fix usage of NewDriverContext	2018-04-23 22:51:06 +02:00
Javier Palomo Almena	74d3c5df07	DriverContext: Add the TaskGroup and the Job name Adding these fields to the DriverContext object will allow us to pass them to the drivers. A use case for this is to emit tagged metrics in the drivers, which contain all relevant information: - Job - TaskGroup - Task - ... Ref: https://github.com/hashicorp/nomad/pull/4185	2018-04-23 00:15:29 +02:00
Michael Schurter	4cee6cca6c	rkt: create parent cgroup to enable stats Having the Nomad executor create parent cgroups that rkt is launched within allows the stats collection code used for the exec driver to Just Work. The only downside is that now the Nomad executor's resource utilization counts against the cgroups resource limits just as it does for the exec driver.	2018-04-19 15:14:56 -07:00
Michael Schurter	1a85d0c990	run goimports	2018-04-19 11:16:28 -07:00
Michael Schurter	d77c265d1f	Merge pull request #4168 from ninoles/b-2117-windows-group-process B 2117 windows group process	2018-04-19 11:10:51 -07:00
Michael Schurter	fdbcbd4e5b	Merge pull request #4058 from hashicorp/f-mock-by-default [Post-0.8] test: build with mock_driver by default	2018-04-18 15:57:00 -07:00
Michael Schurter	d3650fb2cd	test: build with mock_driver by default `make release` and `make prerelease` set a `release` tag to disable enabling the `mock_driver`	2018-04-18 14:45:33 -07:00
Michael Schurter	a991923389	tests: fix race in alloc_runner_test.go I could not reproduce the failure locally even with `stress -cpu ...` eating all the cpu it could on my machine. But I think the race was in one of two places: * The task could restart which could create new events * I think there could be a race between the updater's version of events and alloc runners as updates are async I fixed both. Here's hoping that fixes this flaky test.	2018-04-17 17:14:59 -07:00
Fabien Ninoles	c81bec48c9	Merge branch 'master' into b-2117-windows-group-process	2018-04-17 13:47:25 -04:00
Fabien Ninoles	35cf641416	Update based on PR request.	2018-04-17 13:43:04 -04:00
Alex Dadgar	c4ad76091d	Merge pull request #4166 from hashicorp/b-panic-fix-update Fixes races accessing node and updating it during fingerprinting	2018-04-17 10:02:19 -07:00
Chelsea Holland Komlo	9b8a079558	fix up comments	2018-04-17 11:53:08 -04:00
Alex Dadgar	9d612c8cb0	Cleanup	2018-04-16 15:48:34 -07:00
Alex Dadgar	32adaf9dfc	Copy the config given to the alloc runner	2018-04-16 15:45:52 -07:00
Alex Dadgar	3ff2d4d795	fix race node access	2018-04-16 15:45:51 -07:00
Alex Dadgar	4f2a7b6949	Fix copying drivers	2018-04-16 15:45:51 -07:00
Alex Dadgar	0b799822ff	Operate on copy	2018-04-16 15:45:49 -07:00
Fabien Ninoles	27cf4995ce	- Clean up for windows compilation. - Set CREATE_NEW_PROCESS_GROUP for Windows subprocess. - Ensure we only kill actual process that need to.	2018-04-14 13:58:42 -04:00
Michael Schurter	3836b8a335	Merge pull request #3572 from emate/master Create new process group on process startup.	2018-04-13 11:56:38 -07:00
Alex Dadgar	adaf4fa7e0	Remove generated structs	2018-04-12 16:35:31 -07:00
Alex Dadgar	663c4d0433	Version bump and generated files	2018-04-12 16:21:50 -07:00
Alex Dadgar	ff1a1a63e8	Move where attribute for driver detection is set	2018-04-12 15:50:25 -07:00
Chelsea Holland Komlo	5291788b40	delete driver name from only health check attributes	2018-04-12 18:24:41 -04:00
Alex Dadgar	3d53d380f7	Fix tests	2018-04-12 14:29:30 -07:00
Alex Dadgar	f24ce2c50c	Driver health detection cleanups This PR does: 1. Health message based on detection has format "Driver XXX detected" and "Driver XXX not detected" 2. Set initial health description based on detection status and don't wait for the first health check. 3. Combine updating attributes on the node, fingerprint and health checking update for drivers into a single call back. 4. Condensed driver info in `node status` only shows detected drivers and make the output less wide by removing spaces.	2018-04-12 12:46:40 -07:00
Charlie Voiselle	ba88f00ccb	Changed "til" to "until" Should be "till" or "until"; chose "until" because it is unambiguous as to meaning.	2018-04-11 12:36:28 -05:00
Andrei Burd	502d17fa90	Added node class to tagged metrics	2018-04-11 12:20:59 +03:00
Chelsea Komlo	eb5aac16e6	Merge pull request #4111 from hashicorp/b-undetected-set-health-to-false Immediately set driver health status to false when driver moves to undetected	2018-04-10 18:30:31 -04:00
Chelsea Holland Komlo	d58b3e473c	update comment for when the fingerprinter setting health status	2018-04-10 16:53:00 -04:00
Chelsea Holland Komlo	f7ef13cc64	fingerprinter should set health check status if health check is not periodic	2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo	ede4f518bd	add setters for access to the fingerprint manager's node refactor extracting driver info	2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo	f479da19f5	guard against overwriting health status	2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo	ece1618815	immediately set healthy to false when driver moves to undetected	2018-04-10 15:29:51 -04:00
Alex Dadgar	3d367d6fd7	Fix client uptime metric missing client prefix	2018-04-10 10:39:36 -07:00
Seth Vargo	df4fe7e76c	Set user-agent when talking to GCE metadata	2018-04-10 10:36:46 -04:00
Chelsea Komlo	d3bd8fb96e	Merge pull request #4109 from hashicorp/f-shorten-docker-health-timeout Shorten docker health timeout	2018-04-09 15:38:39 -04:00
Chelsea Holland Komlo	ea4b65dd41	only initialize docker clients if they are nil	2018-04-09 14:13:07 -04:00
Chelsea Holland Komlo	288c7a33a1	refactoring simplification from code review	2018-04-09 10:34:17 -04:00
Chelsea Holland Komlo	6e3b056c37	only run health check if driver moves from undetected to detected	2018-04-09 10:10:43 -04:00
Alex Dadgar	ae1f76477e	Start rebalance after discovering new servers	2018-04-05 15:41:59 -07:00
Alex Dadgar	929b6823a3	Merge pull request #4106 from hashicorp/b-servers Improved Client handling of failed RPCs	2018-04-05 13:48:50 -07:00
Alex Dadgar	be2513e0f9	more jitter	2018-04-05 13:48:33 -07:00
Chelsea Holland Komlo	d3637825ef	group similar functions; update comments health check timeout should be 1 minute	2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo	e8743f1f7b	remove do once block when creating a new docker client only set cached connections upon no error	2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo	d0d793fc23	use client with shorter timeouts for health checks	2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo	5d1b2b77cb	refactor docker clients method to be able to extend to creating new clients	2018-04-05 16:19:02 -04:00
Alex Dadgar	bd3345942c	Handle no leader and faster retries near limit Handle the ErrNoLeader case and apply slower retries. Also when we have missed the heartbeat retry aggressively, backing off after we have missed for more than 30 seconds.	2018-04-05 11:22:47 -07:00
Alex Dadgar	279b5c22e5	Scale heartbeat retrying based on remaining heartbeat time	2018-04-05 10:58:13 -07:00
Alex Dadgar	7941f4eb2d	Fire retry only when consul discovers new servers	2018-04-05 10:40:17 -07:00
Preetha	6254d75eee	Merge pull request #4101 from hashicorp/b-rescheduling-edge-fixes Fixes edge cases around timing/ task finish time being set more than once	2018-04-04 16:18:21 -05:00
Preetha Appan	12ba4c45da	remove outdated commented out test code	2018-04-04 15:03:24 -05:00
Preetha Appan	6363a6fb4d	Remove old comment	2018-04-04 15:01:48 -05:00
Preetha Appan	5e4525bd30	Moves setting finishedAt to the right place and adds two unit tests.	2018-04-04 14:38:15 -05:00
Alex Dadgar	86c32358d4	Spelling error	2018-04-03 18:30:01 -07:00
Alex Dadgar	01a6beafbf	RPC Retry Watcher	2018-04-03 18:05:28 -07:00
Preetha Appan	e6bbce3fa0	Add comment	2018-04-03 19:49:03 -05:00
Alex Dadgar	ec844f19d9	randomize servers	2018-04-03 17:46:13 -07:00
Preetha Appan	00537c739b	Fixes edge cases around timing and task finish time being set more than once	2018-04-03 16:34:59 -05:00
Alex Dadgar	58a3ec3fb2	Improve Vault error handling	2018-04-03 14:29:22 -07:00
Alex Dadgar	86f9044676	remove generated files	2018-03-30 16:52:49 -07:00
Alex Dadgar	af81349dbe	Generated files	2018-03-30 16:14:40 -07:00
Michael Schurter	257ba5937d	test: don't rely on alloc runner update count We were incorrectly relying on the count of alloc updates in a number of tests. Since alloc updates are async, their number is non-deterministic and largely meaningless. This should fix quite a few flaky tests in Travis and prevent future mistaken assumptions in tests.	2018-03-30 09:34:33 -07:00
Michael Schurter	62e9553333	Merge pull request #4069 from hashicorp/f-hashealth add HasHealth helper for nil checks	2018-03-29 17:03:20 -07:00
Alex Dadgar	beee130a6e	Always capture the finish time	2018-03-29 11:27:22 -07:00
Michael Schurter	91b5bb58d9	add HasHealth helper for nil checks We performed the DeploymentStatus nil checks a couple different ways, so hopefully this helper will consolidate them and make it more clear what the code is doing.	2018-03-29 09:29:19 -07:00
Chelsea Komlo	4338360da9	Merge pull request #4065 from hashicorp/emit-node-event-on-first-health-change Emit first node event after initialization on health status change	2018-03-29 11:23:25 -04:00
Chelsea Holland Komlo	2174ede6b9	add clarifying comment	2018-03-29 10:58:39 -04:00
Michael Schurter	3a79c32677	Merge pull request #4059 from hashicorp/b-drain-health-svc-only only service allocs should have health watched	2018-03-28 16:49:22 -07:00
Michael Schurter	5eb0cb7176	only service allocs should have health watched	2018-03-28 16:20:11 -07:00
Chelsea Holland Komlo	e3319afee1	emit first node event	2018-03-28 17:26:53 -04:00
Chelsea Komlo	7812ac5abf	Merge pull request #4057 from hashicorp/specify-docker-msg Specify docker name in driver health messages	2018-03-28 13:32:36 -04:00
Preetha	177d2d6010	Merge pull request #4052 from hashicorp/f-specify-total-memory Allow to specify total memory on agent configuration	2018-03-28 12:28:41 -05:00
Chelsea Holland Komlo	efc03e252c	specify driver health messages	2018-03-28 11:35:21 -04:00
Preetha Appan	329428b49f	Code review feedback and unit test	2018-03-28 10:07:15 -05:00
Charlie Voiselle	ea10588227	rkt: logging enhancements (#4044) * Added extra debug logging; extended timeout; added jitter. * small log changes * increase timeout * remove unnecessary uuid	2018-03-27 17:30:06 -07:00
Michael Schurter	fcaee471a0	client: always mark exited sys/svc allocs as failed When restarts.attempts=0 was set in a jobspec a system or service alloc that exited with 0 status would be marked as `completed` instead of `failed`. Since system and service jobs are intended to run until stopped or updated, they should always be marked as failed when they exit even in cases where the exit code is 0.	2018-03-27 14:30:19 -07:00
Mildred Ki'Lya	1017cbe8ab	Allow to specify total memory on agent configuration Allow to set the total memory of an agent in its configuration file. This can be used in case the automatic detection doesn't work or in specific environments when memory overcommit (using swap for example) can be desirable.	2018-03-27 15:46:18 -05:00
Chelsea Holland Komlo	003bc209b9	use time.Time for node events for compatibility	2018-03-27 15:43:57 -04:00
Alex Dadgar	432784dae3	Fix alloc watcher snapshot streaming	2018-03-27 11:14:53 -07:00
Alex Dadgar	05449fea09	drop stats fetching log	2018-03-23 12:01:50 -07:00
Chelsea Komlo	5f0c382021	Merge pull request #4030 from hashicorp/health-check-ux UX improvements to driver health checks	2018-03-23 09:46:50 -04:00
Alex Dadgar	da27fc3880	Driver Info output	2018-03-22 17:18:32 -07:00
Chelsea Holland Komlo	e9005d8cfb	ux improvements to driver health checks	2018-03-22 18:38:29 -04:00
Michael Schurter	a318684738	Merge pull request #4022 from hashicorp/f-more-executor-logging executor: increase level for helpful log lines	2018-03-22 15:21:20 -07:00
Michael Schurter	a4f346abeb	remove spurious TODOs and FIXMEs	2018-03-21 16:55:22 -07:00
Michael Schurter	8b346c6176	test: try to prevent flakiness on travis	2018-03-21 16:51:45 -07:00
Michael Schurter	1b7ac447e9	alloc_runner: watch health for deployed batch jobs	2018-03-21 16:51:45 -07:00
Michael Schurter	62960ed7bd	client: don't monitor health of non-service jobs Also fix system job draining; won't work without deadline fixes	2018-03-21 16:51:44 -07:00
Alex Dadgar	a37329189a	Improve DeadlineTime helper	2018-03-21 16:51:44 -07:00
Alex Dadgar	db4a634072	RPC, FSM, State Store for marking DesiredTransition fix build tag	2018-03-21 16:49:48 -07:00
Michael Schurter	bb0ff44fb4	mock_driver: improve Kill() logging	2018-03-21 16:49:48 -07:00
Michael Schurter	c0542474db	drain: initial drainv2 structs and impl	2018-03-21 16:49:48 -07:00
Chelsea Holland Komlo	f329e45e03	always set initial health status for every driver	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	bbaffe3eca	set driver to unhealthy once if it cannot be detected in periodic check	2018-03-21 15:15:26 -04:00
Alex Dadgar	5df4b3728d	Docker driver doesn't return errors but injects into the DriverInfo	2018-03-21 15:15:26 -04:00
Alex Dadgar	4365bb7f59	Only run health check if driver is detected	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	f801709a0a	fix issue when updating node events	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	285729aee2	function rename and re-arrange functions in fingerprint_manager	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	60f12d206f	improve comments; update watchDriver	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	739784736a	remove unused function	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d92703617c	simplify logic bump log level	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	86b7b3d2d9	fix up health check logic comparison; add node events to client driver checks	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	53a5bc2bb3	Code review feedback	2018-03-21 15:15:26 -04:00
Alex Dadgar	34dc58421c	notes from walk through	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	44b6951dda	improve tests	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d740a6a46e	refresh driver information for non-health checking drivers periodically	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d8f68e5ef8	fix up codereview feedback	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	d5f6c940c4	fix up racy tests	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	0425be8f48	updating comments; locking concurrent node access	2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo	c50d02ae93	go style; update comments	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	3aa726baab	fix scheduler driver name; create node structs file	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	3cba95e8a7	allow nomad to schedule based on the status of a client driver health check Slight updates for go style	2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo	0bde357731	add concept of health checks to fingerprinters and nodes fix up feedback from code review add driver info for all drivers to node	2018-03-21 15:15:25 -04:00
Michael Schurter	1022170bf3	executor: increase level for helpful log lines Should help with debugging issues like #3971	2018-03-21 11:53:58 -07:00
Marcin Matlaszek	6019a88824	Make raw_exec processes cleanup function more precise.	2018-03-20 13:40:21 +01:00
Marcin Matlaszek	bb36c122e2	Fix errors when trying to kill whole process group.	2018-03-20 13:40:21 +01:00
Marcin Matlaszek	86d650d7b0	Make starting & cleaning process group Windows compatible.	2018-03-20 13:40:21 +01:00
Marcin Matlaszek	79c139f2ef	Create new process group on process startup. Clean up by sending SIGKILL to the whole process group.	2018-03-20 13:40:21 +01:00
Michael Schurter	1044bc0feb	Merge pull request #3984 from hashicorp/f-loosen-consul-skipverify Replace Consul TLSSkipVerify handling	2018-03-16 11:21:28 -07:00
Michael Schurter	32ee5e0d53	Merge pull request #3990 from hashicorp/f-rkt-groups rkt: allow specifying --group	2018-03-16 11:19:53 -07:00
Michael Schurter	bd78cfb039	rkt: allow specifying --group	2018-03-16 11:08:22 -07:00
Michael Schurter	fb10ec9c01	docker: make volume errors recoverable The interface+mock just to test this one little error handling may seem like overkill but there was just no other way to write an automated test around this logic as there's no way to simulate this error with stock Docker.	2018-03-15 17:52:43 -07:00
Michael Schurter	0971114f0c	Replace Consul TLSSkipVerify handling Instead of checking Consul's version on startup to see if it supports TLSSkipVerify, assume that it does and only log in the job service handler if we discover Consul does not support TLSSkipVerify. The old code would break TLSSkipVerify support if Nomad started before Consul (such as on system boot) as TLSSkipVerify would default to false if Consul wasn't running. Since TLSSkipVerify has been supported since Consul 0.7.2, it's safe to relax our handling.	2018-03-14 17:43:06 -07:00
Preetha Appan	3c38eededd	Fix spelling in comment	2018-03-14 15:54:25 -05:00
Alex Dadgar	bef4a8ee09	fix clearing node events	2018-03-14 09:48:59 -07:00
Chelsea Komlo	810eedfa2a	Merge pull request #3945 from hashicorp/f-add-node-events Add node events	2018-03-14 08:42:55 -04:00
Preetha	360d6e5a92	Merge pull request #3968 from hashicorp/f-nicer-vault-error Make server side error messages from vault clearer	2018-03-13 20:49:39 -05:00
Alex Dadgar	de6ebb6e6c	small cleanup	2018-03-13 18:08:22 -07:00
Chelsea Holland Komlo	b41501e442	code review feedback	2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo	1488b076d1	code review feedback	2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo	a8655320fd	fix up go check warnings	2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo	0934769b04	add client side emitting of node events Changelog	2018-03-13 18:08:21 -07:00
Preetha Appan	914eaed64f	Address some code review comments	2018-03-13 18:19:16 -05:00
Preetha Appan	09c231ce43	Return the err from server correctly	2018-03-13 18:10:14 -05:00
Preetha Appan	9618f52746	Remove error wrapping and make vault connection server side errors clearer.	2018-03-13 17:09:03 -05:00
Michael Schurter	79df90acb0	Merge pull request #3958 from simplesurance/swappiness fix: disable swap for executor_linux allocations	2018-03-13 10:10:22 -07:00
Fabian Holler	e6af051c93	fix: disable swap for executor_linux allocations A comment in the nomad source code states that swapping for executor_linux allocations is disabled but it wasn't. Nomad wrote -1 to the memsw.limit_in_bytes cgroup file to disable swapping. This has the following problems: 1.) Writing -1 to the file does not disable swapping. It sets the limit for memory and swap to unlimited. 2.) On common Linux distributions like Ubuntu 16.04 LTS the memsw.limit_in_bytes cgroup file does not exist by default. The memsw.limit_in_bytes file only exists if the Linux kernel is built with CONFIG_MEMCG_SWAP=yes and either CONFIG_MEMCG_SWAP_ENABLED=yes or when the kernel parameter swapaccount=1 is passed during boot. Most Linux distributions disable swap accounting by default because of higher memory usage. Nomad silently ignores if writing to the memsw.limit_in_bytes file fails. The allocation succeeds, no message is logged to notify the user. To ensure that disabling swap works on common Linux kernels, disable swapping by writing 0 to the memory.swappiness file. Using the memory.swappiness file only requires that the kernel is compiled with CONFIG_MEMCG=yes. This is the default in common Linux kernels.	2018-03-13 10:52:50 +01:00
Alex Dadgar	4844317cc2	Merge pull request #3890 from hashicorp/b-heartbeat Heartbeat improvements and handling failures during establishing leadership	2018-03-12 14:41:59 -07:00
Michael Schurter	7dd7fbcda2	non-Existent -> nonexistent Reverting from #3963 https://www.merriam-webster.com/dictionary/existent	2018-03-12 11:59:33 -07:00
Josh Soref	18c5659474	spelling: version	2018-03-11 19:13:25 +00:00
Josh Soref	6222bd564e	spelling: verify	2018-03-11 19:13:32 +00:00

3830 commits