open-nomad

Author	SHA1	Message	Date
Chris Baker	e0170e1c67	metrics: add namespace label to allocation metrics	2019-06-17 20:50:26 +00:00
Danielle	f923b568e0	Merge pull request #5821 from hashicorp/dani/b-5770 trhooks: Add TaskStopHook interface to services	2019-06-12 17:30:49 +02:00
Danielle Lancashire	c326344b57	trt: Fix test	2019-06-12 17:06:11 +02:00
Danielle Lancashire	13d76e35fd	trhooks: Add TaskStopHook interface to services We currently only run cleanup Service Hooks when a task is either Killed, or Exited. However, due to the implementation of a task runner, tasks are only Exited if they ever correctly started running, which is not true when you receive an error early in the task start flow, such as not being able to pull secrets from Vault. This updates the service hook to also call consul deregistration routines during a task Stop lifecycle event, to ensure that any registered checks and services are cleared in such cases. fixes #5770	2019-06-12 16:00:21 +02:00
Mahmood Ali	2acf30fdd3	Fallback to `alloc.TaskResources` for old allocs When a client is running against an old server (e.g. running 0.8), `alloc.AllocatedResources` may be nil, and we need to check the deprecated `alloc.TaskResources` instead. Fixes https://github.com/hashicorp/nomad/issues/5810	2019-06-11 10:32:53 -04:00
Mahmood Ali	d30c3d10b0	Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1 More test fixes	2019-06-05 19:09:18 -04:00
Mahmood Ali	935ee86e92	Merge pull request #5737 from fwkz/fix-restart-attempts Fix restart attempts of `restart` stanza in `delay` mode.	2019-06-05 19:05:07 -04:00
Danielle Lancashire	27583ed8c1	client: Pass servers contacted ch to allocrunner This fixes an issue where batch and service workloads would never be restarted due to indefinitely blocking on a nil channel. It also raises the restoration logging message to `Info` to simplify log analysis.	2019-05-22 13:47:35 +02:00
Mahmood Ali	9df1e00f35	tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal Given that Signal may be called multiple times, blocking for `SignalCh` isn't sufficient to synchronize access to the Signals field.	2019-05-21 13:56:58 -04:00
fwkz	8b84bec95a	Fix restart attempts of `restart` stanza. Number of restarts during 2nd interval is off by one.	2019-05-21 13:27:19 +02:00
Michael Schurter	2fe0768f3b	docs: changelog entry for #5669 and fix comment	2019-05-14 10:54:00 -07:00
Michael Schurter	af9096c8ba	client: register before restoring Registration and restoring allocs don't share state or depend on each other in any way (syncing allocs with servers is done outside of registration). Since restoring is synchronous, start the registration goroutine first. For nodes with lots of allocs to restore or close to their heartbeat deadline, this could be the difference between becoming "lost" or not.	2019-05-14 10:53:27 -07:00
Michael Schurter	e07f73bfe0	client: do not restart dead tasks until server is contacted (try 2) Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162 Switch the MarkLive method for a chan that is closed by the client. Thanks to @notnoop for the idea! The old approach called a method on most existing ARs and TRs on every runAllocs call. The new approach does a once.Do call in runAllocs to accomplish the same thing with less work. Able to remove the gate abstraction that did much more than was needed.	2019-05-14 10:53:27 -07:00
Michael Schurter	d7e5ace1ed	client: do not restart dead tasks until server is contacted Fixes #1795 Running restored allocations and pulling what allocations to run from the server happen concurrently. This means that if a client is rebooted, and has its allocations rescheduled, it may restart the dead allocations before it contacts the server and determines they should be dead. This commit makes tasks that fail to reattach on restore wait until the server is contacted before restarting.	2019-05-14 10:53:27 -07:00
Michael Schurter	1c4e585fa7	client: expose allocated memory per task Related to #4280 This PR adds `client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge in bytes to metrics to ease calculating how close a task is to OOMing. ``` 'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000 'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000 'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000 'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000 'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000 'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000 ```	2019-05-10 11:12:12 -07:00
Mahmood Ali	ab2cae0625	implement client endpoint of nomad exec Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking the relevant task handler for execution.	2019-05-09 16:49:08 -04:00
Mahmood Ali	bf0a09e270	retry grpc unavailable errors even if not shutting down	2019-04-25 18:39:17 -04:00
Mahmood Ali	81841e8528	try checking process status	2019-04-25 18:16:13 -04:00
Mahmood Ali	fc78521f29	add logging about attempts	2019-04-25 18:09:36 -04:00
Mahmood Ali	e6ca8641a8	try sleeping for stop signal to take effect	2019-04-25 17:16:29 -04:00
Mahmood Ali	ff3a095015	add a test that simulates logmon dying during Start() call	2019-04-25 16:41:17 -04:00
Mahmood Ali	bbac73883c	logmon: retry starting logmon if it exits Retry if we detect shutting down during Start() api call is started, locally.	2019-04-25 15:10:16 -04:00
Michael Schurter	61f17a1043	tweak logging level for failed log line Co-Authored-By: notnoop <mahmood@notnoop.com>	2019-04-22 14:40:17 -04:00
Danielle Lancashire	c31966fc71	logging: Attempt to recover logmon failures Currently, when logmon fails to reattach, we will retry reattachment to the same pid until the task restart specification is exhausted. Because we cannot clear hook state during error conditions, it is not possible for us to signal to a future restart that it _shouldn't_ attempt to reattach to the plugin. Here we revert to explicitly detecting reattachment separately from a launch of a new logmon, so we can recover from scenarios where a logmon plugin has failed. This is a net improvement over the current hard failure situation, as it means in the most common case (the pid has gone away), we can recover. Other reattachment failure modes where the plugin may still be running could potentially cause a duplicate process, or a subsequent failure to launch a new plugin. If there was a duplicate process, it could potentially cause duplicate logging. This is better than a production workload outage. If there was a subsequent failure to launch a new plugin, it would fail in the same way (retry until restarts are exhausted) as the current failure mode.	2019-04-18 13:41:56 +02:00
Michael Schurter	f7a7acc345	Merge pull request #5518 from hashicorp/f-simplify-kill client: simplify kill logic	2019-04-15 14:11:58 -07:00
Chris Baker	6848591914	vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation	2019-04-12 15:06:34 +00:00
Chris Baker	c0a7aee610	vault e2e: pass vault version into setup instead of having to infer it from test name	2019-04-10 10:34:10 -05:00
Chris Baker	f0c184fc29	taskrunner: removed some unnecessary config from a test	2019-04-10 10:34:10 -05:00
Chris Baker	170f5239c8	client: gofmt	2019-04-10 10:34:10 -05:00
Chris Baker	a1d7971b2e	taskrunner: pass configured Vault namespace into TaskTemplateConfig	2019-04-10 10:34:10 -05:00
Michael Schurter	f7d4428855	client: simplify kill logic Remove runLaunched tracking as Run is always called for killable TaskRunners. TaskRunners which fail before Run can be called (during NewTaskRunner or Restore) are not killable as they're never added to the client's alloc map.	2019-04-04 15:18:33 -07:00
Michael Schurter	1d569a27dc	Revert "executor/linux: add defensive checks to binary path" This reverts commit cb36f4537e63d53b198c2a87d1e03880895631bd.	2019-04-02 11:17:12 -07:00
Michael Schurter	fc5487dbbc	executor/linux: add defensive checks to binary path	2019-04-02 09:40:53 -07:00
Michael Schurter	7d49bc4c71	executor/linux: make chroot binary paths absolute Avoid libcontainer.Process trying to lookup the binary via $PATH as the executor has already found where the binary is located.	2019-04-01 15:45:31 -07:00
Michael Schurter	a4572919cd	Merge pull request #5456 from hashicorp/test-taskenv tests: port pre-0.9 task env tests	2019-03-25 10:41:38 -07:00
Michael Schurter	8efad12538	tests: port pre-0.9 task env tests I chose to make them more of integration tests since there's a lot more plumbing involved. The internal implementation details of how we craft task envs can now change and these tests will still properly assert the task runtime environment is setup properly.	2019-03-25 09:46:53 -07:00
Nick Ethier	dc18b8928a	logmon: make Start rpc idempotent and simplify hook	2019-03-19 14:02:36 -04:00
Nick Ethier	ac7fbee1b8	logmon: add static check for logmon exited hook	2019-03-18 15:59:43 -04:00
Nick Ethier	7dc3d83634	client/logmon: restart log collection correctly when a task is restarted	2019-03-15 23:59:18 -04:00
Michael Schurter	32d31575cc	client: emit event and call exited hooks during cleanup Builds upon earlier commit that cleans up restored handles of terminal allocs by also emitting terminated events and calling exited hooks when appropriate.	2019-03-05 15:12:02 -08:00
Michael Schurter	64e145ebdb	logmon: drop reattach log level as it's expected Logged once per terminal task on agent restart.	2019-03-04 13:26:01 -08:00
Michael Schurter	c5271d3fa5	client: test logmon cleanup The test is sadly quite complicated and peeks into things (logmon's reattach config) AR doesn't normally have access to. However, I couldn't find another way of asserting logmon got cleaned up without resorting to smaller unit tests. Smaller unit tests risk re-implementing dependencies in an unrealistic way, so I opted for an ugly integration test.	2019-03-04 13:15:15 -08:00
Michael Schurter	ef8d284352	client: ensure task is cleaned up when terminal This commit is a significant change. TR.Run is now always executed, even for terminal allocations. This was changed to allow TR.Run to clean up (run stop hooks) if a handle was recovered. This is intended to handle the case of Nomad receiving a DesiredStatus=Stop allocation update, persisting it, but crashing before stopping AR/TR. The commit also renames task runner hook data as it was very easy to accidentally set state on Requests instead of Responses using the old field names.	2019-03-01 14:00:23 -08:00
Michael Schurter	812f1679e2	Merge pull request #5352 from hashicorp/b-leaked-logmon logmon fixes	2019-02-26 08:35:46 -08:00
Michael Schurter	e39a10a1f4	tests: move unix-specific test to its own file Other logmon tests should be portable.	2019-02-26 07:56:44 -08:00
Michael Schurter	3b2a592e93	client: restart task on logmon failures This code chooses to be conservative as opposed to optimal: when failing to reattach to logmon simply return a recoverable error instead of immediately trying to restart logmon. The recoverable error will cause the task's restart policy to be applied and a new logmon will be launched upon restart. Trying to do the optimal approach of simply starting a new logmon requires error string comparison and should be tested against a task actively logging to assert the behavior (are writes blocked? dropped?).	2019-02-25 15:42:45 -08:00
Michael Schurter	8830b00866	client: test logmon_hook	2019-02-23 15:36:48 -08:00
Mahmood Ali	32551fb0e5	emit TaskRestartSignal event on vault restart When Vault token expires and task is restarted, emit `TaskRestartSignal` similar to v0.8.7	2019-02-22 15:56:14 -05:00
Mahmood Ali	8cb4bbcc08	address review comments	2019-02-22 15:56:14 -05:00
Mahmood Ali	216eaa4843	tests: port TestTaskRunner_VaultManager_Signal From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1427	2019-02-22 15:53:04 -05:00


209 commits