open-nomad

Commit Graph

Author	SHA1	Message	Date
Nick Ethier	15989bba8e	ar: cleanup lint errors	2019-07-31 01:03:18 -04:00
Nick Ethier	220cba3e7e	ar: move linux specific code to its own file and add tests	2019-07-31 01:03:18 -04:00
Nick Ethier	548f78ef15	ar: initial driver based network management	2019-07-31 01:03:17 -04:00
Nick Ethier	66c514a388	Add network lifecycle management Adds new Prerun and Postrun hooks to manage set up of network namespaces on linux. Work still needs to be done to make the code platform agnostic and support Docker style network initialization.	2019-07-31 01:03:17 -04:00
Jasmine Dahilig	2157f6ddf1	add formatting for hcl parsing error messages (#5972)	2019-07-19 10:04:39 -07:00
Mahmood Ali	cd6f1d3102	Update consul-template dependency to latest To pick up the fix in https://github.com/hashicorp/consul-template/pull/1231 .	2019-07-18 07:32:03 +07:00
Mahmood Ali	8a82260319	log unrecoverable errors	2019-07-17 11:01:59 +07:00
Mahmood Ali	1a299c7b28	client/taskrunner: fix stats retry logic Previously, if a channel is closed, we retry the Stats call. But, if that call fails, we go in a backoff loop without calling Stats ever again. Here, we use a utility function for calling driverHandle.Stats call that retries as one expects. I aimed to preserve the logging formats but made small improvements as I saw fit.	2019-07-11 13:58:07 +08:00
Preetha Appan	ef9a71c68b	code review feedback	2019-07-10 10:41:06 -05:00
Preetha Appan	990e468edc	Populate task event struct with kill timeout This makes for a nicer task event message	2019-07-09 09:37:09 -05:00
Mahmood Ali	f10201c102	run post-run/post-stop task runner hooks Handle when prestart failed while restoring a task, to prevent accidentally leaking consul/logmon processes.	2019-07-02 18:38:32 +08:00
Mahmood Ali	4afd7835e3	Fail alloc if alloc runner prestart hooks fail When an alloc runner prestart hook fails, the task runners aren't invoked and they remain in a pending state. This leads to terrible results, some of which are: * Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861 * Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed * Alloc not being restarted/rescheduled to another node (as it's still in pending state) * Unexpected restart of alloc on a client restart, potentially days/weeks after alloc expected start time! Here, we treat all tasks to have failed if alloc runner prestart hook fails. This fixes the lockups, and permits the alloc to be rescheduled on another node. While it's desirable to retry alloc runner in such failures, I opted to treat it out of scope. I'm afraid of some subtleties about alloc and task runners and their idempotency that are better handled in a follow up PR. This might be one of the root causes for https://github.com/hashicorp/nomad/issues/5840 .	2019-07-02 18:35:47 +08:00
Mahmood Ali	7614b8f09e	Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2 task runner to avoid running task if terminal	2019-07-02 15:31:17 +08:00
Mahmood Ali	7bfad051b9	address review comments	2019-07-02 14:53:50 +08:00
Mahmood Ali	3d89ae0f1e	task runner to avoid running task if terminal This change fixes a bug where nomad would avoid running alloc tasks if the alloc is client terminal but the server copy on the client isn't marked as running. Here, we fix the case by having task runner use the allocRunner.shouldRun() instead of only checking the server updated alloc. Here, we preserve much of the invariants such that `tr.Run()` is always run, and don't change the overall alloc runner and task runner lifecycles. Fixes https://github.com/hashicorp/nomad/issues/5883	2019-06-27 11:27:34 +08:00
Danielle Lancashire	b9ac184e1f	tr: Fetch Wait channel before killTask in restart Currently, if killTask results in the termination of a process before calling WaitTask, Restart() will incorrectly return a TaskNotFound error when using the raw_exec driver on Windows.	2019-06-26 15:20:57 +02:00
Chris Baker	f71114f5b8	cleanup test	2019-06-18 14:15:25 +00:00
Chris Baker	a2dc351fd0	formatting and clarity	2019-06-18 14:00:57 +00:00
Chris Baker	e0170e1c67	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Danielle	f923b568e0	Merge pull request #5821 from hashicorp/dani/b-5770 trhooks: Add TaskStopHook interface to services	2019-06-12 17:30:49 +02:00
Danielle Lancashire	c326344b57	trt: Fix test	2019-06-12 17:06:11 +02:00
Danielle Lancashire	13d76e35fd	trhooks: Add TaskStopHook interface to services We currently only run cleanup Service Hooks when a task is either Killed, or Exited. However, due to the implementation of a task runner, tasks are only Exited if they ever correctly started running, which is not true when you receive an error early in the task start flow, such as not being able to pull secrets from Vault. This updates the service hook to also call consul deregistration routines during a task Stop lifecycle event, to ensure that any registered checks and services are cleared in such cases. fixes #5770	2019-06-12 16:00:21 +02:00
Mahmood Ali	2acf30fdd3	Fallback to `alloc.TaskResources` for old allocs When a client is running against an old server (e.g. running 0.8), `alloc.AllocatedResources` may be nil, and we need to check the deprecated `alloc.TaskResources` instead. Fixes https://github.com/hashicorp/nomad/issues/5810	2019-06-11 10:32:53 -04:00
Mahmood Ali	7a4900aaa4	client/allocrunner: depend on internal task state Alloc runner already tracks tasks associated with alloc. Here, we become defensive by relying on the alloc runner tracked tasks, rather than depend on server never updating the job unexpectedly.	2019-06-10 18:42:51 -04:00
Mahmood Ali	d30c3d10b0	Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1 More test fixes	2019-06-05 19:09:18 -04:00
Mahmood Ali	935ee86e92	Merge pull request #5737 from fwkz/fix-restart-attempts Fix restart attempts of `restart` stanza in `delay` mode.	2019-06-05 19:05:07 -04:00
Danielle Lancashire	27583ed8c1	client: Pass servers contacted ch to allocrunner This fixes an issue where batch and service workloads would never be restarted due to indefinitely blocking on a nil channel. It also raises the restoration logging message to `Info` to simplify log analysis.	2019-05-22 13:47:35 +02:00
Mahmood Ali	9df1e00f35	tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal Given that Signal may be called multiple times, blocking for `SignalCh` isn't sufficient for synchronizing access to Signals field.	2019-05-21 13:56:58 -04:00
Mahmood Ali	b475ccbe3e	client: synchronize access to ar.alloc `allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's use `allocRunner.Alloc()` helper function to access it.	2019-05-21 09:55:05 -04:00
fwkz	8b84bec95a	Fix restart attempts of `restart` stanza. Number of restarts during 2nd interval is off by one.	2019-05-21 13:27:19 +02:00
Michael Schurter	2fe0768f3b	docs: changelog entry for #5669 and fix comment	2019-05-14 10:54:00 -07:00
Michael Schurter	af9096c8ba	client: register before restoring Registration and restoring allocs don't share state or depend on each other in any way (syncing allocs with servers is done outside of registration). Since restoring is synchronous, start the registration goroutine first. For nodes with lots of allocs to restore or close to their heartbeat deadline, this could be the difference between becoming "lost" or not.	2019-05-14 10:53:27 -07:00
Michael Schurter	e07f73bfe0	client: do not restart dead tasks until server is contacted (try 2) Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162 Switch the MarkLive method for a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call. The new approach does a once.Do call in runAllocs to accomplish the same thing with less work. Able to remove the gate abstraction that did much more than was needed.	2019-05-14 10:53:27 -07:00
Michael Schurter	d7e5ace1ed	client: do not restart dead tasks until server is contacted Fixes #1795 Running restored allocations and pulling what allocations to run from the server happen concurrently. This means that if a client is rebooted, and has its allocations rescheduled, it may restart the dead allocations before it contacts the server and determines they should be dead. This commit makes tasks that fail to reattach on restore wait until the server is contacted before restarting.	2019-05-14 10:53:27 -07:00
Michael Schurter	1c4e585fa7	client: expose allocated memory per task Related to #4280 This PR adds `client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge in bytes to metrics to ease calculating how close a task is to OOMing. ``` 'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000 'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000 'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000 'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000 'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000 ```	2019-05-10 11:12:12 -07:00
Mahmood Ali	919827f2df	Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base nomad exec part 1: plumbing and docker driver	2019-05-09 18:09:27 -04:00
Mahmood Ali	ab2cae0625	implement client endpoint of nomad exec Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking the relevant task handler for execution.	2019-05-09 16:49:08 -04:00
Chris Baker	93ec1293be	stale allocation data leads to incorrect (and even negative) metrics (#5637) * client: was not using up-to-date client state in determining which allocs count towards allocated resources * Update client/client.go Co-Authored-By: cgbaker <cgbaker@hashicorp.com>	2019-05-07 15:54:36 -04:00
Michael Schurter	8c7b3ff45a	Fix comment Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:01:30 -05:00
Michael Schurter	e19fa33f9c	Remove unnecessary boolean clause Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-05-03 10:00:17 -05:00
Preetha Appan	b99a204582	Update deployment health on failed allocations only if health is unset This fixes a confusing UX where a previously successful deployment's healthy/unhealthy count would get updated if any allocations failed after the deployment was already marked as successful.	2019-05-02 22:59:56 -05:00
Danielle	79515496cb	Merge pull request #5515 from hashicorp/dani/f-alloc-signal allocs: Add nomad alloc signal command	2019-04-26 14:21:05 +02:00
Mahmood Ali	bf0a09e270	retry grpc unavailable errors even if not shutting down	2019-04-25 18:39:17 -04:00
Mahmood Ali	81841e8528	try checking process status	2019-04-25 18:16:13 -04:00
Mahmood Ali	fc78521f29	add logging about attempts	2019-04-25 18:09:36 -04:00
Mahmood Ali	e6ca8641a8	try sleeping for stop signal to take effect	2019-04-25 17:16:29 -04:00
Mahmood Ali	ff3a095015	add a test that simulates logmon dying during Start() call	2019-04-25 16:41:17 -04:00
Mahmood Ali	bbac73883c	logmon: retry starting logmon if it exits Retry if we detect shutting down during Start() api call is started, locally.	2019-04-25 15:10:16 -04:00
Danielle Lancashire	3409e0be89	allocs: Add nomad alloc signal command This command will be used to send a signal to either a single task within an allocation, or all of the tasks if <task-name> is omitted. If the sent signal terminates the allocation, it will be treated as if the allocation has crashed, rather than as if it was operator-terminated. Signal validation is currently handled by the driver itself and nomad does not attempt to restrict or validate them.	2019-04-25 12:43:32 +02:00
Michael Schurter	61f17a1043	tweak logging level for failed log line Co-Authored-By: notnoop <mahmood@notnoop.com>	2019-04-22 14:40:17 -04:00
Danielle Lancashire	c31966fc71	logging: Attempt to recover logmon failures Currently, when logmon fails to reattach, we will retry reattachment to the same pid until the task restart specification is exhausted. Because we cannot clear hook state during error conditions, it is not possible for us to signal to a future restart that it _shouldn't_ attempt to reattach to the plugin. Here we revert to explicitly detecting reattachment separately from a launch of a new logmon, so we can recover from scenarios where a logmon plugin has failed. This is a net improvement over the current hard failure situation, as it means in the most common case (the pid has gone away), we can recover. Other reattachment failure modes where the plugin may still be running could potentially cause a duplicate process, or a subsequent failure to launch a new plugin. If there was a duplicate process, it could potentially cause duplicate logging. This is better than a production workload outage. If there was a subsequent failure to launch a new plugin, it would fail in the same (retry until restarts are exhausted) as the current failure mode.	2019-04-18 13:41:56 +02:00
Michael Schurter	f7a7acc345	Merge pull request #5518 from hashicorp/f-simplify-kill client: simplify kill logic	2019-04-15 14:11:58 -07:00
Chris Baker	6848591914	vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation	2019-04-12 15:06:34 +00:00
Danielle Lancashire	e135876493	allocs: Add nomad alloc restart This adds a `nomad alloc restart` command and api that allows a job operator with the alloc-lifecycle acl to perform an in-place restart of a Nomad allocation, or a given subtask.	2019-04-11 14:25:49 +02:00
Chris Baker	c0a7aee610	vault e2e: pass vault version into setup instead of having to infer it from test name	2019-04-10 10:34:10 -05:00
Chris Baker	f0c184fc29	taskrunner: removed some unnecessary config from a test	2019-04-10 10:34:10 -05:00
Chris Baker	170f5239c8	client: gofmt	2019-04-10 10:34:10 -05:00
Chris Baker	a1d7971b2e	taskrunner: pass configured Vault namespace into TaskTemplateConfig	2019-04-10 10:34:10 -05:00
Michael Schurter	f7d4428855	client: simplify kill logic Remove runLaunched tracking as Run is always called for killable TaskRunners. TaskRunners which fail before Run can be called (during NewTaskRunner or Restore) are not killable as they're never added to the client's alloc map.	2019-04-04 15:18:33 -07:00
Michael Schurter	1d569a27dc	Revert "executor/linux: add defensive checks to binary path" This reverts commit cb36f4537e63d53b198c2a87d1e03880895631bd.	2019-04-02 11:17:12 -07:00
Michael Schurter	fc5487dbbc	executor/linux: add defensive checks to binary path	2019-04-02 09:40:53 -07:00
Michael Schurter	7d49bc4c71	executor/linux: make chroot binary paths absolute Avoid libcontainer.Process trying to lookup the binary via $PATH as the executor has already found where the binary is located.	2019-04-01 15:45:31 -07:00
Michael Schurter	a4572919cd	Merge pull request #5456 from hashicorp/test-taskenv tests: port pre-0.9 task env tests	2019-03-25 10:41:38 -07:00
Michael Schurter	8efad12538	tests: port pre-0.9 task env tests I chose to make them more of integration tests since there's a lot more plumbing involved. The internal implementation details of how we craft task envs can now change and these tests will still properly assert the task runtime environment is setup properly.	2019-03-25 09:46:53 -07:00
Nick Ethier	dc18b8928a	logmon: make Start rpc idempotent and simplify hook	2019-03-19 14:02:36 -04:00
Nick Ethier	ac7fbee1b8	logmon:add static check for logmon exited hook	2019-03-18 15:59:43 -04:00
Nick Ethier	7dc3d83634	client/logmon: restart log collection correctly when a task is restarted	2019-03-15 23:59:18 -04:00
Michael Schurter	0ba1a5251b	client: cleanup and document context uses Some of the context uses in TR hooks are useless (Killed during Stop never seems meaningful). None of the hooks are interruptible for graceful shutdown which is unfortunate and probably needs fixing.	2019-03-12 15:03:54 -07:00
Michael Schurter	32d31575cc	client: emit event and call exited hooks during cleanup Builds upon earlier commit that cleans up restored handles of terminal allocs by also emitting terminated events and calling exited hooks when appropriate.	2019-03-05 15:12:02 -08:00
Michael Schurter	64e145ebdb	logmon: drop reattach log level as it's expected Logged once per terminal task on agent restart.	2019-03-04 13:26:01 -08:00
Michael Schurter	c5271d3fa5	client: test logmon cleanup The test is sadly quite complicated and peeks into things (logmon's reattach config) AR doesn't normally have access to. However, I couldn't find another way of asserting logmon got cleaned up without resorting to smaller unit tests. Smaller unit tests risk re-implementing dependencies in an unrealistic way, so I opted for an ugly integration test.	2019-03-04 13:15:15 -08:00
Michael Schurter	ef8d284352	client: ensure task is cleaned up when terminal This commit is a significant change. TR.Run is now always executed, even for terminal allocations. This was changed to allow TR.Run to cleanup (run stop hooks) if a handle was recovered. This is intended to handle the case of Nomad receiving a DesiredStatus=Stop allocation update, persisting it, but crashing before stopping AR/TR. The commit also renames task runner hook data as it was very easy to accidentally set state on Requests instead of Responses using the old field names.	2019-03-01 14:00:23 -08:00
Michael Schurter	812f1679e2	Merge pull request #5352 from hashicorp/b-leaked-logmon logmon fixes	2019-02-26 08:35:46 -08:00
Michael Schurter	e39a10a1f4	tests: move unix-specific test to its own file Other logmon tests should be portable.	2019-02-26 07:56:44 -08:00
Michael Schurter	3b2a592e93	client: restart task on logmon failures This code chooses to be conservative as opposed to optimal: when failing to reattach to logmon simply return a recoverable error instead of immediately trying to restart logmon. The recoverable error will cause the task's restart policy to be applied and a new logmon will be launched upon restart. Trying to do the optimal approach of simply starting a new logmon requires error string comparison and should be tested against a task actively logging to assert the behavior (are writes blocked? dropped?).	2019-02-25 15:42:45 -08:00
Michael Schurter	8830b00866	client: test logmon_hook	2019-02-23 15:36:48 -08:00
Preetha Appan	43679f4ce1	More alloc runner tests ported from 0.8.7	2019-02-22 17:58:06 -06:00
Mahmood Ali	32551fb0e5	emit TaskRestartSignal event on vault restart When Vault token expires and task is restarted, emit `TaskRestartSignal` similar to v0.8.7	2019-02-22 15:56:14 -05:00
Mahmood Ali	8cb4bbcc08	address review comments	2019-02-22 15:56:14 -05:00
Mahmood Ali	216eaa4843	tests: port TestTaskRunner_VaultManager_Signal From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1427	2019-02-22 15:53:04 -05:00
Mahmood Ali	8e9e732319	tests: port TestTaskRunner_VaultManager_Restart From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1352	2019-02-22 15:53:04 -05:00
Mahmood Ali	33122ca7c0	tests: port TestTaskRunner_UnregisterConsul_Retries From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L620	2019-02-22 15:53:04 -05:00
Mahmood Ali	0128b0ce7a	tests: port TestTaskRunner_Template_NewVaultToken From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1275	2019-02-22 15:53:04 -05:00
Mahmood Ali	cfb80583af	tests: port TestTaskRunner_Template_Artifact From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1195	2019-02-22 15:52:59 -05:00
Mahmood Ali	1b14214a88	tests: port TestAllocRunner_RetryArtifact Port TestAllocRunner_RetryArtifact from https://github.com/hashicorp/nomad/blob/v0.8.7/client/alloc_runner_test.go#L610-L672 I changed the test name because it doesn't actually test that artifact hooks is retried	2019-02-22 15:50:39 -05:00
Mahmood Ali	c827e6e05a	tests: port TestAllocRunner_MoveAllocDir test	2019-02-22 15:50:39 -05:00
Michael Schurter	a2e3ea6dc9	logmon: fix reattach configuration There were multiple bugs here: 1. Reattach unmarshalling always returned an error because you can't unmarshal into a nil pointer. 2. The hook data wasn't being saved because it was put on the request struct, not the response struct. 3. The plugin configuration should only have reattach or a command set. Not both. 4. Setting Done=true meant the hook was never re-run on agent restart so reattaching was never attempted.	2019-02-21 15:32:18 -08:00
Michael Schurter	01cabdff88	client: restart on recoverable StartTask errors Fixes restarting on recoverable errors from StartTask. Ports TestTaskRunner_Run_RecoverableStartError from 0.8 which discovered the bug.	2019-02-21 15:30:49 -08:00
Michael Schurter	e3f321cd27	test: port TestTaskRunner_RestartSignalTask_NotRunning from 0.8	2019-02-21 15:30:49 -08:00
Michael Schurter	f3aa945a00	test: port TestTaskRunner_DriverNetwork from 0.8	2019-02-21 15:30:49 -08:00
Michael Schurter	518405ac33	Merge pull request #5322 from hashicorp/b-artifact-retries Fix regression by restarting on artifact download errors	2019-02-21 15:28:51 -08:00
Michael Schurter	2553800eb8	tests: port TestAllocRunner_Destroy from 0.8 Also add destroy(ar) helper to fix a bunch of shutdown races in AR tests.	2019-02-20 12:35:09 -08:00
Michael Schurter	6580ed668e	client: don't redownload completed artifacts on retries Track the download status of each artifact independently so that if only one of many artifacts fails to download, completed artifacts aren't downloaded again.	2019-02-20 08:45:12 -08:00
Michael Schurter	908bfab4c2	client: artifact errors are retry-able 0.9.0beta2 contains a regression where artifact download errors would not cause a task restart and instead immediately fail the task. This restores the pre-0.9 behavior of retrying all artifact errors and adds missing tests.	2019-02-20 07:21:27 -08:00
Michael Schurter	79ccf00b72	tests: add new task runner test helper Adds a new helper and removes a duplicated test.	2019-02-20 07:21:27 -08:00
Michael Schurter	159042a1a3	client: fix setting alloc unhealthy at deadline During the 0.9 client refactor the code to fail a deployment when the deadline was reached was broken. This restores and tests that behavior.	2019-02-19 07:44:14 -08:00
Mahmood Ali	87be233aca	test: improve readability of duration Co-Authored-By: schmichael <michael.schurter@gmail.com>	2019-02-14 08:12:06 -08:00
Mahmood Ali	16d3414842	test: improve failure message Co-Authored-By: schmichael <michael.schurter@gmail.com>	2019-02-14 08:11:37 -08:00
Michael Schurter	4814f0fb0b	tests: port TestTaskRunner_Download_List from 0.8	2019-02-12 15:48:04 -08:00
Michael Schurter	a152e3ef17	consul: fix task deregistration hook Broke ShutdownDelay but the test was timing dependent so it just appeared flaky. Made the test slower so that it should never incorrectly pass.	2019-02-12 15:36:02 -08:00
Michael Schurter	4ad879e75e	tests: port TaskRunner_DeriveToken tests from 0.8	2019-02-12 15:36:02 -08:00
Michael Schurter	6743ed9fdc	tests: port TestTaskRunner_BlockForVault from 0.8 Also fix race conditions in the mock vault client.	2019-02-12 13:46:09 -08:00
Michael Schurter	6c0cc65b2e	simplify hcl2 parsing helper No need to pass in the entire eval context	2019-02-04 11:07:57 -08:00
Alex Dadgar	5062c54874	Fix usage of fsi variable	2019-01-29 14:07:55 -08:00
Alex Dadgar	6f418ebaf0	Always populate task dir environment variables Fixes an issue where if a task was restarted after restarting the client, the task dir environment variables would not be populated. This PR fixes this for both upgrades from 0.8.X and for normal 0.9 restarts.	2019-01-29 13:17:10 -08:00
Alex Dadgar	5da21635fb	Fix env templates having interpolated destinations Fixes an issue where env templates that had interpolated destinations would not work. Fixes https://github.com/hashicorp/nomad/issues/5250	2019-01-28 10:28:53 -08:00
Alex Dadgar	d6412fd8e7	Fix double restart counting for templates This PR fixes an issue where template restarts would count twice since it was emitting a restarting event.	2019-01-25 15:38:13 -08:00
Nick Ethier	a36c4320ff	Merge pull request #5227 from hashicorp/b-client-highcpu-usage Fix bug related to high cpu usage	2019-01-23 14:27:51 -05:00
Michael Schurter	13f061a83f	Merge pull request #5196 from hashicorp/f-plugin-utils Make plugins/shared external and make pluginutils/	2019-01-23 06:59:32 -08:00
Preetha	05bf183ba3	Merge pull request #5225 from hashicorp/b-notaskevent-terminalallocs Don't emit task events after alloc is in a terminal DesiredState	2019-01-23 08:54:10 -06:00
Michael Schurter	32daa7b47b	goimports until make check is happy	2019-01-23 06:27:14 -08:00
Nick Ethier	bcc3935228	tr: use context in as select statement	2019-01-22 20:11:39 -05:00
Michael Schurter	be0bab7c3f	move pluginutils -> helper/pluginutils I wanted a different color bikeshed, so I get to paint it	2019-01-22 15:50:08 -08:00
Alex Dadgar	2ca0e97361	Split hclspec	2019-01-22 15:43:34 -08:00
Alex Dadgar	5ca6dd7988	move hclutils	2019-01-22 15:43:34 -08:00
Alex Dadgar	72a5691897	Driver tests do not use hcl2/hcl, hclspec, or hclutils	2019-01-22 15:43:34 -08:00
Preetha Appan	38422642cb	Use DesiredState to determine whether to stop sending task events	2019-01-22 16:43:32 -06:00
Preetha Appan	862c9b7de5	dont emit events for terminal allocs	2019-01-22 16:26:33 -06:00
Michael Schurter	1fa376cac6	Merge pull request #5211 from hashicorp/test-porting-08 Port some 0.8 TaskRunner tests	2019-01-22 14:05:53 -08:00
Michael Schurter	8ced0adb67	test: port TestTaskRunner_CheckWatcher_Restart Added ability to adjust the number of events the TaskRunner keeps as there's no way to observe all events otherwise. Task events differ slightly from 0.8 because 0.9 emits Terminated every time a task exits instead of only when it exits on its own (not due to restart or kill). 0.9 does not emit Killing/Killed for restarts like 0.8 which seems fine as `Restart Signaled/Terminated/Restarting` is more descriptive. Original v0.8 events emitted: ``` expected := []string{ "Received", "Task Setup", "Started", "Restart Signaled", "Killing", "Killed", "Restarting", "Started", "Restart Signaled", "Killing", "Killed", "Restarting", "Started", "Restart Signaled", "Killing", "Killed", "Not Restarting", } ```	2019-01-22 09:46:46 -08:00
Michael Schurter	1719752a9d	test: port RestartTask from 0.8	2019-01-22 08:08:08 -08:00
Michael Schurter	9edff19625	test: port SignalFailure test from 0.8 Also fix signal error handling in mock_driver.	2019-01-22 08:08:08 -08:00
Preetha Appan	299a5fc821	Rename TaskKillRequest/Response to TaskPreKillRequest/Response	2019-01-22 09:54:02 -06:00
Preetha Appan	5a5b9c5666	Fix log comments	2019-01-22 09:45:58 -06:00
Preetha Appan	06e15f8381	Rename TaskKillHook to TaskPreKillHook to more closely match usage Also added/fixed comments	2019-01-22 09:41:56 -06:00
Michael Schurter	3b02af9386	Fix comment Co-Authored-By: preetapan <preetha@hashicorp.com>	2019-01-22 09:41:21 -06:00
Preetha Appan	09291c689b	Rename TaskKillHook to TaskPreKillHook to more closely match usage Also added/fixed comments	2019-01-22 09:41:21 -06:00
Nick Ethier	47127de671	ar: return error from hooks if occurred	2019-01-18 18:31:02 -05:00
Mahmood Ali	5df63fda7c	Merge pull request #5190 from hashicorp/f-memory-usage Track Basic Memory Usage as reported by cgroups	2019-01-18 16:46:02 -05:00
Chris Baker	290c3f36ad	set TaskGroupName in task_runner	2019-01-18 20:25:11 +00:00
Chris Baker	8917961caa	documenting test for task runner failure to set TaskGroupName	2019-01-18 20:00:49 +00:00
Michael Schurter	cfadacfd95	Merge pull request #5203 from hashicorp/b-terminated client: restore Terminated event on every exit	2019-01-18 08:54:15 -08:00
Preetha Appan	e0b68a19c6	Fix one more place that should be using taskResources taskResources handles new resource fields in a backwards compatible way	2019-01-17 15:52:51 -06:00
Michael Schurter	a20ac7c1de	client: restore Terminated event on every exit v0.9.0-dev started emitting a Terminated event every time a task process exited. While this wasn't true in previous versions, it's a useful task event because it's the only place for job operators to view the task's exit code. This behavior is asserted in the e2e/taskevents tests.	2019-01-17 10:02:25 -08:00
Danielle Tomlinson	a695b3562c	Merge pull request #5193 from hashicorp/dani/logmon-reattach logmon: Reattach to existing loggers	2019-01-16 17:34:13 +01:00
Danielle Tomlinson	99da4c780d	logmon: Reattach to existing loggers This commit prevents us from creating duplicate logmon hooks when restoring allocations by persisting the logmon reattach config using HookData.	2019-01-16 14:56:10 +01:00
Michael Schurter	daa7d029a1	test: porting TestTaskRunner_SimpleRun_Dispatch Porting test from 0.8 to 0.9.	2019-01-15 15:22:13 -08:00
Michael Schurter	48afda786b	Merge pull request #5187 from hashicorp/test-consul Port a bunch of pre-0.9 Consul tests to 0.9	2019-01-15 07:41:50 -08:00
Mahmood Ali	9909d98bee	Track Basic Memory Usage as reported by cgroups Track current memory usage, `memory.usage_in_bytes`, in addition to `memory.max_memory_usage_in_bytes` and friends. This number is closer to what Docker reports. Related to https://github.com/hashicorp/nomad/issues/5165 .	2019-01-14 18:47:52 -05:00
Nick Ethier	c619e70d39	Merge pull request #5018 from hashicorp/f-executor-stats executor: streaming stats api	2019-01-14 15:02:35 -05:00
Michael Schurter	4e7ea460e8	test: port some pre-0.9 DeploymentHealth tests Skipping a failing one as I need to move to some other work and don't want to leave this work orphaned on my machine.	2019-01-14 09:56:53 -08:00
Michael Schurter	ff2f23f5f9	test: assert service interpolation behavior Ported from pre-0.9 tests.	2019-01-14 09:56:53 -08:00
Michael Schurter	e877bb6370	test: assert shutdown delay deregs first Restore a pre-0.9 test that asserts Consul services are deregistered before a task's shutdown delay.	2019-01-14 09:56:53 -08:00
Michael Schurter	1ca858fa92	Update client/allocrunner/taskrunner/stats_hook.go Co-Authored-By: nickethier <ncethier@gmail.com>	2019-01-14 12:31:27 -05:00
Nick Ethier	fbd403df96	tr: stop stats collection on Exited hook	2019-01-14 12:30:14 -05:00
Nick Ethier	597b7b751d	tr: add retry w/ backoff to stats_hook failure	2019-01-12 12:18:24 -05:00
Nick Ethier	7e306afde3	executor: fix failing stats related test	2019-01-12 12:18:23 -05:00
Nick Ethier	9fea54e0dc	executor: implement streaming stats API plugins/driver: update driver interface to support streaming stats client/tr: use streaming stats api TODO: * how to handle errors and closed channel during stats streaming * prevent tight loop if Stats(ctx) returns an error drivers: update drivers TaskStats RPC to handle streaming results executor: better error handling in stats rpc docker: better control and error handling of stats rpc driver: allow stats to return a recoverable error	2019-01-12 12:18:22 -05:00
Preetha Appan	f059ef8a47	Modified destroy failure handling to rely on allocrunner's destroy method Added a unit test with custom statedb implementation that errors, to use to verify destroy errors	2019-01-12 10:37:12 -06:00
Alex Dadgar	bd12e0b1f7	Merge pull request #5168 from hashicorp/b-kill-race Improve Kill handling on task runner	2019-01-09 12:05:10 -08:00
Alex Dadgar	069e181e8f	add more comments	2019-01-09 12:04:22 -08:00
Michael Schurter	e5ddff861c	Spelling fix Co-Authored-By: dadgar <alex@hashicorp.com>	2019-01-09 11:42:40 -08:00
Mahmood Ali	90f3cea187	Merge pull request #5157 from hashicorp/r-drivers-no-cstructs drivers: avoid referencing client/structs package	2019-01-09 13:06:46 -05:00
Alex Dadgar	149dec2169	Improve Kill handling on task runner This PR improves how killing a task is handled. Before the kill function directly orchestrated the killing and was only valid while the task was running. The new behavior is to mark the desired state and wait for the task runner to converge to that state.	2019-01-08 16:42:26 -08:00
Michael Schurter	5925424c7c	client: emit Killing/Killed task events We were just emitting Killed/Terminated events before. In v0.8 we emitted Killing/Killed, but lacked Terminated when explicitly stopping a task. This change makes it so Terminated is always included, whether explicitly stopping a task or it exiting on its own. New output: 2019-01-04T14:58:51-08:00 Killed Task successfully killed 2019-01-04T14:58:51-08:00 Terminated Exit Code: 130, Signal: 2 2019-01-04T14:58:51-08:00 Killing Sent interrupt 2019-01-04T14:58:51-08:00 Leader Task Dead Leader Task in Group dead 2019-01-04T14:58:49-08:00 Started Task started by client 2019-01-04T14:58:49-08:00 Task Setup Building Task Directory 2019-01-04T14:58:49-08:00 Received Task received by client Old (v0.8.6) output: 2019-01-04T22:14:54Z Killed Task successfully killed 2019-01-04T22:14:54Z Killing Sent interrupt. Waiting 5s before force killing 2019-01-04T22:14:54Z Leader Task Dead Leader Task in Group dead 2019-01-04T22:14:53Z Started Task started by client 2019-01-04T22:14:53Z Task Setup Building Task Directory 2019-01-04T22:14:53Z Received Task received by client	2019-01-08 07:20:54 -08:00
Mahmood Ali	916a40bb9e	move cstructs.DeviceNetwork to drivers pkg	2019-01-08 09:11:47 -05:00
Mahmood Ali	9369b123de	use drivers.FSIsolation	2019-01-08 09:11:47 -05:00
Mahmood Ali	f475a56087	remove always false parameter Simplify allocDir.Build() function to avoid depending on client/structs, and remove a parameter that's always set to `false`. The motivation here is to avoid a dependency cycle between drivers/cstructs and alloc_dir.	2019-01-08 09:11:47 -05:00
Alex Dadgar	0106f23aaa	Review comments	2019-01-07 14:50:28 -08:00
Alex Dadgar	79cfe26021	vet	2019-01-07 14:49:41 -08:00
Alex Dadgar	8a35d7b1dd	Test recovery	2019-01-07 14:49:41 -08:00
Alex Dadgar	f40f8ce02e	Mock driver has recovery, stats	2019-01-07 14:49:40 -08:00
Alex Dadgar	3f24e4d6ca	comments	2019-01-07 14:49:40 -08:00
Alex Dadgar	44dca19012	Fix hooks	2019-01-07 14:49:40 -08:00
Alex Dadgar	c9825a9c36	recover	2019-01-07 14:49:40 -08:00
Mahmood Ali	cd3c6cf60b	taskrunner: emit TaskReceived event Preserve pre-0.9, where task runner emits `Received: Task received by client` event on task runner creation.	2019-01-04 14:32:29 -05:00
Danielle Tomlinson	35a4790740	Merge pull request #5142 from hashicorp/dani/cleanup-allocrunner-logs allocrunner: Standardised discard logs	2019-01-03 18:40:48 +01:00
Danielle Tomlinson	29196ca70e	allocrunner: Standardised discard logs Follow up from https://github.com/hashicorp/nomad/pull/5007#pullrequestreview-186739124	2019-01-03 14:04:31 +01:00
Danielle Tomlinson	28aa34ea78	taskrunner: Persist environment from hooks https://github.com/hashicorp/nomad/pull/5032 introduced a regression where the origHookState was used in place of the response from the hook.	2019-01-03 13:13:57 +01:00
Alex Dadgar	d7d32c2f61	Merge pull request #5032 from hashicorp/f-driver-env Store device envs separately and pass to drivers	2018-12-20 13:38:27 -08:00
Nick Ethier	6c43ccf628	client: add proper build flag to allocrunner testing.go	2018-12-19 20:22:07 -05:00
Alex Dadgar	9d34802f7a	Store device envs separately and pass to drivers	2018-12-19 14:23:09 -08:00
Michael Schurter	c84998e996	tests: implement HasHealth for mock health	2018-12-19 10:39:27 -08:00
Michael Schurter	d9ea8252a7	client/state: support upgrading from 0.8->0.9 Also persist and load DeploymentStatus to avoid rechecking health after client restarts.	2018-12-19 10:39:27 -08:00
Michael Schurter	461599ff20	tr: fix HookState Copy() and Equal() methods They did not take into account the Env field.	2018-12-19 09:58:06 -08:00
Danielle Tomlinson	c580512d32	allocrunner: Close updates routine correctly	2018-12-19 18:32:51 +01:00
Nick Ethier	ce1a5cba0e	drivermanager: use allocID and task name to route task events	2018-12-18 23:01:51 -05:00
Nick Ethier	d8a0265e68	client: batch initial fingerprinting in plugin managers drivermanager: fix pr comments/feedback	2018-12-18 22:56:19 -05:00
Nick Ethier	7d23cbf448	client/drivermanager: fixup issues from rebase and address PR comments	2018-12-18 22:55:38 -05:00
Nick Ethier	1543335710	tr: deregister task handler on cleanup	2018-12-18 22:55:38 -05:00
Nick Ethier	82175d1328	client/drivermanager: add driver manager The driver manager is modeled after the device manager and is started by the client. It's responsible for handling driver lifecycle and reattachment state, as well as processing the incoming fingerprint and task events from each driver. The manager exposes a method for registering event handlers for task events that is used by the task runner to update the server when a task has been updated with an event. Since driver fingerprinting has been implemented by the driver manager, it is no longer needed in the fingerprint manager and has been removed.	2018-12-18 22:55:18 -05:00
Alex Dadgar	9d1403d617	Merge pull request #5002 from hashicorp/b-task-config-resources Convert driver resource to AllocatedTaskResource	2018-12-18 16:46:34 -08:00
Danielle Tomlinson	0edc65631a	Merge pull request #5007 from hashicorp/dani/f-allocrunner-async allocrunner: Async api for shutdown/destroy/update	2018-12-19 01:26:41 +01:00
Alex Dadgar	8efac7ec81	Fix unit tests + upgrade pathing resources	2018-12-18 15:50:44 -08:00
Alex Dadgar	b8268d9a46	Lint	2018-12-18 15:50:44 -08:00
Alex Dadgar	66cf3156b2	LinuxResources doesn't use task.Resources	2018-12-18 15:50:44 -08:00
Alex Dadgar	327b551b39	Drivers	2018-12-18 15:50:11 -08:00
Alex Dadgar	b653ae2af7	utilities	2018-12-18 15:48:52 -08:00
Danielle Tomlinson	95a0c4fb29	taskrunner: Use a random suffix for Task Config The RestartCount is not really suitable for use as a source of uniqueness within task invocations as it is not monotonic, and interacts with the restart stanza in a user's config, so conflates restarts due to task failures with restarts due to environmental changes, such as consul template or vault secrets changing. Here we instead use a substring from a uuid, which is more random than we strictly need, but is nicer than rolling our own random string generator here.	2018-12-19 00:38:54 +01:00
Danielle Tomlinson	d6eb084d8a	allocrunner: Drop and log updates after closing waitCh	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	0d91285cd6	allocrunner: Documentation for ShutdownCh/DestroyCh	2018-12-18 23:38:34 +01:00
Danielle Tomlinson	f2bb13818e	fixup: Log when we detect out of order updates	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	986fde0f5a	allocrunner: Handle updates asynchronously This creates a new buffered channel and goroutine on the allocrunner for serializing updates to allocations. This allows us to take updates off the routine that is used for processing updates from the server, without having complicated machinery for tracking update lifetimes, or other external synchronization. This results in a nice performance improvement and significantly better throughput on batch changes such as preempting a large number of jobs for a larger placement.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	d1fbac1aad	allocrunner: Async shutdown and destroy This commit reduces the locking required to shutdown or destroy allocrunners, and allows parallel shutdown and destroy of allocrunners during shutdown.	2018-12-18 23:38:33 +01:00
Danielle Tomlinson	a50ea29da4	taskrunner: Use hook errors for artifacts	2018-12-17 10:39:38 +01:00
Danielle Tomlinson	3647b701a6	taskrunner: Emit task events when a hook fails	2018-12-13 18:20:18 +01:00
Alex Dadgar	20c59df8b9	Merge pull request #4969 from hashicorp/f-alloc-hooks Make alloc health watcher a postrun hook rather than shutdown hook	2018-12-12 14:34:36 -08:00
Danielle Tomlinson	6fb5ca6ad5	allocrunner: Test alloc runners should include a noop migrator	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	83720575de	client: Unify handling of previous and preempted allocs	2018-12-11 13:12:35 +01:00
Danielle Tomlinson	dff7093243	client: Wait for preempted allocs to terminate When starting an allocation that is preempting other allocs, we create a new group allocation watcher, and then wait for the allocations to terminate in the allocation PreRun hooks. If there are no preempted allocations, then we simply provide a NoopAllocWatcher.	2018-12-11 00:59:18 +01:00

434 Commits