Commit graph

251 commits

Author SHA1 Message Date
Michael Schurter d7e5ace1ed client: do not restart dead tasks until server is contacted
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
2019-05-14 10:53:27 -07:00
Michael Schurter 1c4e585fa7 client: expose allocated memory per task
Related to #4280

This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.

```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
2019-05-10 11:12:12 -07:00
Mahmood Ali 919827f2df
Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base
nomad exec part 1: plumbing and docker driver
2019-05-09 18:09:27 -04:00
Mahmood Ali ab2cae0625 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Chris Baker 93ec1293be
stale allocation data leads to incorrect (and even negative) metrics (#5637)
* client: was not using up-to-date client state in determining which alloc count towards allocated resources

* Update client/client.go

Co-Authored-By: cgbaker <cgbaker@hashicorp.com>
2019-05-07 15:54:36 -04:00
Michael Schurter 8c7b3ff45a
Fix comment
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:01:30 -05:00
Michael Schurter e19fa33f9c
Remove unnecessary boolean clause
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:00:17 -05:00
Preetha Appan b99a204582
Update deployment health on failed allocations only if health is unset
This fixes a confusing UX where a previously successful deployment's
healthy/unhealthy count would get updated if any allocations failed after
the deployment was already marked as successful.
2019-05-02 22:59:56 -05:00
Danielle 79515496cb
Merge pull request #5515 from hashicorp/dani/f-alloc-signal
allocs: Add nomad alloc signal command
2019-04-26 14:21:05 +02:00
Mahmood Ali bf0a09e270 retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
Mahmood Ali 81841e8528 try checking process status 2019-04-25 18:16:13 -04:00
Mahmood Ali fc78521f29 add logging about attempts 2019-04-25 18:09:36 -04:00
Mahmood Ali e6ca8641a8 try sleeping for stop signal to take effect 2019-04-25 17:16:29 -04:00
Mahmood Ali ff3a095015 add a test that simulates logmon dying during Start() call 2019-04-25 16:41:17 -04:00
Mahmood Ali bbac73883c logmon: retry starting logmon if it exits
Retry if we detect shutting down during Start() api call is started,
locally.
2019-04-25 15:10:16 -04:00
Danielle Lancashire 3409e0be89 allocs: Add nomad alloc signal command
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.

Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
2019-04-25 12:43:32 +02:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Danielle Lancashire c31966fc71 loggging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment seperately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Michael Schurter f7a7acc345
Merge pull request #5518 from hashicorp/f-simplify-kill
client: simplify kill logic
2019-04-15 14:11:58 -07:00
Chris Baker 6848591914 vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation 2019-04-12 15:06:34 +00:00
Danielle Lancashire e135876493 allocs: Add nomad alloc restart
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
2019-04-11 14:25:49 +02:00
Chris Baker c0a7aee610
vault e2e: pass vault version into setup instead of having to infer it from test name 2019-04-10 10:34:10 -05:00
Chris Baker f0c184fc29
taskrunner: removed some unecessary config from a test 2019-04-10 10:34:10 -05:00
Chris Baker 170f5239c8
client: gofmt 2019-04-10 10:34:10 -05:00
Chris Baker a1d7971b2e
taskrunner: pass configured Vault namespace into TaskTemplateConfig 2019-04-10 10:34:10 -05:00
Michael Schurter f7d4428855 client: simplify kill logic
Remove runLaunched tracking as Run is *always* called for killable
TaskRunners. TaskRunners which fail before Run can be called (during
NewTaskRunner or Restore) are not killable as they're never added to the
client's alloc map.
2019-04-04 15:18:33 -07:00
Michael Schurter 1d569a27dc Revert "executor/linux: add defensive checks to binary path"
This reverts commit cb36f4537e63d53b198c2a87d1e03880895631bd.
2019-04-02 11:17:12 -07:00
Michael Schurter fc5487dbbc executor/linux: add defensive checks to binary path 2019-04-02 09:40:53 -07:00
Michael Schurter 7d49bc4c71 executor/linux: make chroot binary paths absolute
Avoid libcontainer.Process trying to lookup the binary via $PATH as the
executor has already found where the binary is located.
2019-04-01 15:45:31 -07:00
Michael Schurter a4572919cd
Merge pull request #5456 from hashicorp/test-taskenv
tests: port pre-0.9 task env tests
2019-03-25 10:41:38 -07:00
Michael Schurter 8efad12538 tests: port pre-0.9 task env tests
I chose to make them more of integration tests since there's a lot more
plumbing involved. The internal implementation details of how we craft
task envs can now change and these tests will still properly assert the
task runtime environment is setup properly.
2019-03-25 09:46:53 -07:00
Nick Ethier dc18b8928a
logmon: make Start rpc idempotent and simplify hook 2019-03-19 14:02:36 -04:00
Nick Ethier ac7fbee1b8
logmon:add static check for logmon exited hook 2019-03-18 15:59:43 -04:00
Nick Ethier 7dc3d83634
client/logmon: restart log collection correctly when a task is restarted 2019-03-15 23:59:18 -04:00
Michael Schurter 0ba1a5251b client: cleanup and document context uses
Some of the context uses in TR hooks are useless (Killed during Stop
never seems meaningful).

None of the hooks are interruptable for graceful shutdown which is
unfortunate and probably needs fixing.
2019-03-12 15:03:54 -07:00
Michael Schurter 32d31575cc client: emit event and call exited hooks during cleanup
Builds upon earlier commit that cleans up restored handles of terminal
allocs by also emitting terminated events and calling exited hooks when
appropriate.
2019-03-05 15:12:02 -08:00
Michael Schurter 64e145ebdb logmon: drop reattach log level as its expected
Logged once per terminal task on agent restart.
2019-03-04 13:26:01 -08:00
Michael Schurter c5271d3fa5 client: test logmon cleanup
The test is sadly quite complicated and peeks into things (logmon's
reattach config) AR doesn't normally have access to.

However, I couldn't find another way of asserting logmon got cleaned up
without resorting to smaller unit tests. Smaller unit tests risk
re-implementing dependencies in an unrealistic way, so I opted for an
ugly integration test.
2019-03-04 13:15:15 -08:00
Michael Schurter ef8d284352 client: ensure task is cleaned up when terminal
This commit is a significant change. TR.Run is now always executed, even
for terminal allocations. This was changed to allow TR.Run to cleanup
(run stop hooks) if a handle was recovered.

This is intended to handle the case of Nomad receiving a
DesiredStatus=Stop allocation update, persisting it, but crashing before
stopping AR/TR.

The commit also renames task runner hook data as it was very easy to
accidently set state on Requests instead of Responses using the old
field names.
2019-03-01 14:00:23 -08:00
Michael Schurter 812f1679e2
Merge pull request #5352 from hashicorp/b-leaked-logmon
logmon fixes
2019-02-26 08:35:46 -08:00
Michael Schurter e39a10a1f4 tests: move unix-specific test to its own file
Other logmon tests should be portable.
2019-02-26 07:56:44 -08:00
Michael Schurter 3b2a592e93 client: restart task on logmon failures
This code chooses to be conservative as opposed to optimal: when failing
to reattach to logmon simply return a recoverable error instead of
immediately trying to restart logmon.

The recoverable error will cause the task's restart policy to be
applied and a new logmon will be launched upon restart.

Trying to do the optimal approach of simply starting a new logmon
requires error string comparison and should be tested against a task
actively logging to assert the behavior (are writes blocked? dropped?).
2019-02-25 15:42:45 -08:00
Michael Schurter 8830b00866 client: test logmon_hook 2019-02-23 15:36:48 -08:00
Preetha Appan 43679f4ce1
More alloc runner tests ported from 0.8.7 2019-02-22 17:58:06 -06:00
Mahmood Ali 32551fb0e5 emit TaskRestartSignal event on vault restart
When Vault token expires and task is restarted, emit `TaskRestartSignal`
similar to v0.8.7
2019-02-22 15:56:14 -05:00
Mahmood Ali 8cb4bbcc08 address review comments 2019-02-22 15:56:14 -05:00
Mahmood Ali 216eaa4843 tests: port TestTaskRunner_VaultManager_Signal
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1427
2019-02-22 15:53:04 -05:00
Mahmood Ali 8e9e732319 tests: port TestTaskRunner_VaultManager_Restart
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1352
2019-02-22 15:53:04 -05:00
Mahmood Ali 33122ca7c0 tests: port TestTaskRunner_UnregisterConsul_Retries
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L620
2019-02-22 15:53:04 -05:00
Mahmood Ali 0128b0ce7a tests: port TestTaskRunner_Template_NewVaultToken
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1275
2019-02-22 15:53:04 -05:00