Commit graph

4186 commits

Author SHA1 Message Date
Michael Schurter a20ac7c1de client: restore Terminated event on every exit
v0.9.0-dev started emitting a Terminated event every time a task process
exited. While this wasn't true in previous versions, it's a useful task
event because it's the only place for job operators to view the task's
exit code.

This behavior is asserted in the e2e/taskevents tests.
2019-01-17 10:02:25 -08:00
Danielle Tomlinson 11c733faa8 allocwatcher: Stat_t is unavailable on win 2019-01-17 18:43:14 +01:00
Danielle Tomlinson 62e06eda56 chore: Cleanup formatting 2019-01-17 18:43:13 +01:00
Danielle Tomlinson 580b8c5dda client/fs: Skip delete-while-streaming test on win 2019-01-17 18:43:13 +01:00
Danielle Tomlinson 4dbddd0620 client/fs: windows error message for not found 2019-01-17 18:43:13 +01:00
Danielle Tomlinson 915bab2365 vaultclient: use require for error assertions 2019-01-17 18:43:13 +01:00
Danielle Tomlinson dc55d3e353 vaultclient: Update tests for vault 1.0 2019-01-17 18:43:13 +01:00
Danielle Tomlinson 7a5d511349 fingerprinter: Use HCLogger for windows 2019-01-17 18:43:13 +01:00
Danielle Tomlinson a695b3562c
Merge pull request #5193 from hashicorp/dani/logmon-reattach
logmon: Reattach to existing loggers
2019-01-16 17:34:13 +01:00
Danielle Tomlinson 99da4c780d logmon: Reattach to existing loggers
This commit prevents us from creating duplicate logmon hooks when
restoring allocations by persisting the logmon reattach config using
HookData.
2019-01-16 14:56:10 +01:00
Michael Schurter daa7d029a1 test: porting TestTaskRunner_SimpleRun_Dispatch
Porting test from 0.8 to 0.9.
2019-01-15 15:22:13 -08:00
Michael Schurter 48afda786b
Merge pull request #5187 from hashicorp/test-consul
Port a bunch of pre-0.9 Consul tests to 0.9
2019-01-15 07:41:50 -08:00
Alex Dadgar 471fdb3ccf
Merge pull request #5173 from hashicorp/b-log-levels
Plugins use parent loggers
2019-01-14 16:14:30 -08:00
Mahmood Ali 9909d98bee Track Basic Memory Usage as reported by cgroups
Track current memory usage, `memory.usage_in_bytes`, in addition to
`memory.max_memory_usage_in_bytes` and friends.  This number is closer
what Docker reports.

Related to https://github.com/hashicorp/nomad/issues/5165 .
2019-01-14 18:47:52 -05:00
Nick Ethier c619e70d39
Merge pull request #5018 from hashicorp/f-executor-stats
executor: streaming stats api
2019-01-14 15:02:35 -05:00
Michael Schurter 4e7ea460e8 test: port some pre-0.9 DeploymentHealth tests
Skipping a failing one as I need to move to some other work and don't
want to leave this work orphaned on my machine.
2019-01-14 09:56:53 -08:00
Michael Schurter ff2f23f5f9 test: assert service interpolation behavior
Ported from pre-0.9 tests.
2019-01-14 09:56:53 -08:00
Michael Schurter 5746be5844 test: add some extra logging 2019-01-14 09:56:53 -08:00
Michael Schurter e877bb6370 test: assert shutdown delay deregs first
Restore a pre-0.9 test that asserts Consul services are deregistered
before a task's shutdown delay.
2019-01-14 09:56:53 -08:00
Michael Schurter 1ca858fa92
Update client/allocrunner/taskrunner/stats_hook.go
Co-Authored-By: nickethier <ncethier@gmail.com>
2019-01-14 12:31:27 -05:00
Nick Ethier fbd403df96
tr: stop stats collection on Exited hook 2019-01-14 12:30:14 -05:00
Nick Ethier 597b7b751d
tr: add retry /w backoff to stats_hook failure 2019-01-12 12:18:24 -05:00
Nick Ethier 7e306afde3
executor: fix failing stats related test 2019-01-12 12:18:23 -05:00
Nick Ethier 9fea54e0dc
executor: implement streaming stats API
plugins/driver: update driver interface to support streaming stats

client/tr: use streaming stats api

TODO:
 * how to handle errors and closed channel during stats streaming
 * prevent tight loop if Stats(ctx) returns an error

drivers: update drivers TaskStats RPC to handle streaming results

executor: better error handling in stats rpc

docker: better control and error handling of stats rpc

driver: allow stats to return a recoverable error
2019-01-12 12:18:22 -05:00
Preetha Appan 9e8dbf6a4b
linting fixes 2019-01-12 10:38:20 -06:00
Preetha Appan c94179578d
Make unit test for allocrunner failure much nicer 2019-01-12 10:38:20 -06:00
Preetha Appan da0d083b03
Add unit test to simulate alloc runner creation failure 2019-01-12 10:38:20 -06:00
Preetha Appan e7b59ac08c
Only set deployment health if not already set 2019-01-12 10:38:20 -06:00
Michael Schurter dbf4c3a3c8
Apply suggestions from code review
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-01-12 10:38:20 -06:00
Preetha Appan 7bd1440710
REfactor statedb factory config to set it directly in client config 2019-01-12 10:38:20 -06:00
Preetha Appan e237f19b38
Remove invalid allocs 2019-01-12 10:38:20 -06:00
Preetha Appan f059ef8a47
Modified destroy failure handling to rely on allocrunner's destroy method
Added a unit test with custom statedb implementation that errors, to
use to verify destroy errors
2019-01-12 10:37:12 -06:00
Preetha Appan 6c95da8f67
Add back code to mark alloc as failed when restore fails
Also modify restore such that any handled errors don't propagate
back to the client
2019-01-12 10:37:12 -06:00
Preetha Appan 5fde0b0f5c
Revert code that made an alloc update when restore fails
Restore currently shuts down the client so the alloc update cant
always make it to the server
2019-01-12 10:37:12 -06:00
Preetha Appan 41bfdd764b
Handle client initialization errors when adding allocs or restoring allocs
We mark the alloc as failed and track failed allocs so that we don't send
updates after the first time
2019-01-12 10:37:12 -06:00
Alex Dadgar 14ed757a56 Plugins use parent loggers
This PR fixes various instances of plugins being launched without using
the parent loggers. This meant that logs would not all go to the same
output, break formatting etc.
2019-01-11 11:36:37 -08:00
Danielle Tomlinson 3e586e93da client: Cleanup allocrunner access 2019-01-11 18:39:18 +01:00
Mahmood Ali c3eaa0f4c8 tests: enable and fix tests requiring mock driver 2019-01-10 10:10:11 -05:00
Alex Dadgar bd12e0b1f7
Merge pull request #5168 from hashicorp/b-kill-race
Improve Kill handling on task runner
2019-01-09 12:05:10 -08:00
Alex Dadgar 069e181e8f add more comments 2019-01-09 12:04:22 -08:00
Michael Schurter e5ddff861c
Spelling fix
Co-Authored-By: dadgar <alex@hashicorp.com>
2019-01-09 11:42:40 -08:00
Mahmood Ali 90f3cea187
Merge pull request #5157 from hashicorp/r-drivers-no-cstructs
drivers: avoid referencing client/structs package
2019-01-09 13:06:46 -05:00
Mahmood Ali ff48dbb8a9
Merge pull request #5163 from hashicorp/r-minor-changes-20180108
Fix a panic on node.Deregister fail
2019-01-09 09:56:00 -05:00
Mahmood Ali 1f2473263e fix more cases of logging arity errors 2019-01-09 09:22:47 -05:00
Mahmood Ali 4952f2a182
Merge pull request #5159 from hashicorp/r-macos-tests
Fix Travis MacOS job
2019-01-09 08:22:30 -05:00
Alex Dadgar 149dec2169 Improve Kill handling on task runner
This PR improves how killing a task is handled. Before the kill function
directly orchestrated the killing and was only valid while the task was
running. The new behavior is to mark the desired state and wait for the
task runner to converge to that state.
2019-01-08 16:42:26 -08:00
Mahmood Ali 9f7eb1bdfa tests: fix a test job constaints failing in macOS
Allow scheduling mock job when running on MacOS (or Windows) hosts.
2019-01-08 12:37:42 -05:00
Michael Schurter c24f4f94c1
Merge pull request #5151 from hashicorp/b-task-events
Emit Killing task events and add e2e tests
2019-01-08 09:33:04 -08:00
Mahmood Ali 6d36b52412 run gofmt 2019-01-08 11:15:38 -05:00
Michael Schurter 92f9cda5f4
Merge pull request #5035 from hashicorp/test-client
test: re-eanble periodic fingerprint test
2019-01-08 07:37:39 -08:00
Michael Schurter 5925424c7c client: emit Killing/Killed task events
We were just emitting Killed/Terminated events before. In v0.8 we
emitted Killing/Killed, but lacked Terminated when explicitly stopping
a task. This change makes it so Terminated is always included, whether
explicitly stopping a task or it exiting on its own.

New output:

2019-01-04T14:58:51-08:00  Killed            Task successfully killed
2019-01-04T14:58:51-08:00  Terminated        Exit Code: 130, Signal: 2
2019-01-04T14:58:51-08:00  Killing           Sent interrupt
2019-01-04T14:58:51-08:00  Leader Task Dead  Leader Task in Group dead
2019-01-04T14:58:49-08:00  Started           Task started by client
2019-01-04T14:58:49-08:00  Task Setup        Building Task Directory
2019-01-04T14:58:49-08:00  Received          Task received by client

Old (v0.8.6) output:

2019-01-04T22:14:54Z  Killed            Task successfully killed
2019-01-04T22:14:54Z  Killing           Sent interrupt. Waiting 5s before force killing
2019-01-04T22:14:54Z  Leader Task Dead  Leader Task in Group dead
2019-01-04T22:14:53Z  Started           Task started by client
2019-01-04T22:14:53Z  Task Setup        Building Task Directory
2019-01-04T22:14:53Z  Received          Task received by client
2019-01-08 07:20:54 -08:00
Michael Schurter 324e989327
Merge pull request #5034 from hashicorp/test-fix-races
Test fix races
2019-01-08 07:04:09 -08:00
Mahmood Ali 916a40bb9e move cstructs.DeviceNetwork to drivers pkg 2019-01-08 09:11:47 -05:00
Mahmood Ali 9369b123de use drivers.FSIsolation 2019-01-08 09:11:47 -05:00
Mahmood Ali c10a8fd7fe remove deprecated allocrunner 2019-01-08 09:11:47 -05:00
Mahmood Ali f475a56087 remove always false parameter
Simplify allocDir.Build() function to avoid depending on client/structs,
and remove a parameter that's always set to `false`.

The motivation here is to avoid a dependency cycle between
drivers/cstructs and alloc_dir.
2019-01-08 09:11:47 -05:00
Danielle Tomlinson 8df20f49f7 drivers: Add internal interface for Shutdown
This allows us to correctly terminate internal state during runs of the
nomad test suite, e.g closing eventer contexts correctly.
2019-01-08 13:48:49 +01:00
Alex Dadgar edf132758d
Merge pull request #5152 from hashicorp/f-recover
Task runner recovers from external plugin exiting
2019-01-07 15:27:33 -08:00
Alex Dadgar 0106f23aaa Review comments 2019-01-07 14:50:28 -08:00
Alex Dadgar 79cfe26021 vet 2019-01-07 14:49:41 -08:00
Alex Dadgar 8a35d7b1dd Test recovery 2019-01-07 14:49:41 -08:00
Alex Dadgar f40f8ce02e Mock driver has recovery, stats 2019-01-07 14:49:40 -08:00
Alex Dadgar 3f24e4d6ca comments 2019-01-07 14:49:40 -08:00
Alex Dadgar 44dca19012 Fix hooks 2019-01-07 14:49:40 -08:00
Alex Dadgar c9825a9c36 recover 2019-01-07 14:49:40 -08:00
Alex Dadgar c3f05f2476 Don't log event error on driver shutdown 2019-01-07 14:49:40 -08:00
Michael Schurter d686ad51fb
Merge pull request #5043 from hashicorp/b-taskenv-conflicts
taskenv: have maps take precedence over primitives
2019-01-07 12:34:48 -08:00
Mahmood Ali 0ba7b0c132 tests: helper function for checking docker presense 2019-01-07 08:27:06 -05:00
Mahmood Ali cd3c6cf60b taskrunner: emit TaskReceived event
Preserve pre-0.9, where task runner emits `Received: Task received by
client` event on task runner creation.
2019-01-04 14:32:29 -05:00
Michael Schurter 875e231511
Merge pull request #5038 from hashicorp/b-drivermanager-tests
WIP: fix failing tests caused by async driver manager
2019-01-03 12:32:18 -08:00
Danielle Tomlinson 35a4790740
Merge pull request #5142 from hashicorp/dani/cleanup-allocrunner-logs
allocrunner: Standardised discard logs
2019-01-03 18:40:48 +01:00
Preetha 8078cb79f0
Merge pull request #5140 from hashicorp/dani/b-taskrunner
taskrunner: Persist environment from hooks
2019-01-03 09:30:52 -06:00
Danielle Tomlinson 29196ca70e allocrunner: Standardised discard logs
Follow up from https://github.com/hashicorp/nomad/pull/5007#pullrequestreview-186739124
2019-01-03 14:04:31 +01:00
Danielle Tomlinson 1c8baf7db7 chore: Fix environement->environment typo 2019-01-03 13:31:30 +01:00
Danielle Tomlinson 28aa34ea78 taskrunner: Persist environment from hooks
https://github.com/hashicorp/nomad/pull/5032 introduced a regression
where the origHookState was used in place of the response from the hook.
2019-01-03 13:13:57 +01:00
Alex Dadgar d7d32c2f61
Merge pull request #5032 from hashicorp/f-driver-env
Store device envs separately and pass to drivers
2018-12-20 13:38:27 -08:00
Michael Schurter e47a3ceed6 taskenv: have maps take precedence over primitives
**The Bug:**

You may have seen log lines like this when running 0.9.0-dev:

```
... client.alloc_runner.task_runner: some environment variables not available for rendering: ... keys="attr.driver.docker.volumes.enabled, attr.driver.docker.version, attr.driver.docker.bridge_ip, attr.driver.qemu.version"
```

Not only should we not be erroring on builtin driver attributes, but the
results were nondeterministic due to map iteration order!

The root cause is that we have an old root attribute for all drivers
like:

```
attr.driver.docker = "1"
```

When attributes were opaque variable names it was fine to also have
"nested" attributes like:

```
attr.driver.docker.version = "1.2.3"
```

However in the HCLv2 world the variable names are no longer opaque: they
form an object tree. The `docker` object can no longer both hold a value
(`"1"`) *and* nested attributes (`version = "1.2.3"`).

**The Fix:**

Since the old `attr.driver.<name> = "1"` attribues are useless for task
config interpolation, create a new precedence rule for creating the task
config evaluation context:

*Maps take precedence over primitives.*

This means `attr.driver.docker.version` will always take precedence over
`attr.driver.docker`. The results are determinstic and give users access
to the more useful metadata.

I made this a general precedence rule instead of special-casing driver
attrs because it seemed like better default behavior than spamming
WARNings to logs that were likely unactionable by users.
2018-12-20 11:37:46 -08:00
Nick Ethier a96afb6c91
fix tests that fail as a result of async client startup 2018-12-20 00:53:44 -05:00
Nick Ethier 6c43ccf628
client: add proper build flag to allocrunner testing.go 2018-12-19 20:22:07 -05:00
Michael Schurter 0a0fb6f86d test: re-eanble periodic fingerprint test 2018-12-19 17:08:24 -08:00
Michael Schurter add2dd8c2d test: copy AR's Alloc before mutating
Fixes a race in client tests
2018-12-19 15:48:02 -08:00
Michael Schurter 17ed3f27ae drivermgr: fix race in building driver list 2018-12-19 15:48:02 -08:00
Michael Schurter 4448f19413
Merge pull request #5030 from hashicorp/test-client-statusupdate
client: assert alloc status updates work
2018-12-19 14:55:34 -08:00
Alex Dadgar 9d34802f7a Store device envs separately and pass to drivers 2018-12-19 14:23:09 -08:00
Michael Schurter 951100af16 client: assert alloc status updates work
Re-enabling and updating an old test. Able to cut out a ton of extra
work by using WaitForRunning which does almost everything this test
needs.
2018-12-19 11:41:53 -08:00
Michael Schurter ee23bdafbc client/state: missing deploy status isn't an error
Fixes TestClient_SaveRestoreState
2018-12-19 10:39:27 -08:00
Michael Schurter c84998e996 tests: implement HasHealth for mock health 2018-12-19 10:39:27 -08:00
Michael Schurter ba1ddd2238 gofmt -s -w upgrade_int_test.go 2018-12-19 10:39:27 -08:00
Michael Schurter 337d07fdd8 client/state: improve upgradeTaskBucket error handling
And add a test
2018-12-19 10:39:27 -08:00
Michael Schurter c5ddcb6a15 client/state: add context to errors
Unfortunately I don't know how to test these errors. As far as I can
tell they should only happen if there was a programming error in the
upgrade code or the underlying boltdb was corrupted somehow.
2018-12-19 10:39:27 -08:00
Michael Schurter 99bd5b3422 client/state: use 2 as version; test error path 2018-12-19 10:39:27 -08:00
Michael Schurter d9ea8252a7 client/state: support upgrading from 0.8->0.9
Also persist and load DeploymentStatus to avoid rechecking health after
client restarts.
2018-12-19 10:39:27 -08:00
Michael Schurter 0018b2f659 client/state: reorg state buckets to ease transition
* Prefix task bucket with task- to prevent name conflicts
* Shorten device manager bucket name
* Remove commented out outdated var
* Update layout comment
2018-12-19 10:22:28 -08:00
Michael Schurter 461599ff20 tr: fix HookState Copy() and Equal() methods
They did not take into account the Env field.
2018-12-19 09:58:06 -08:00
Danielle Tomlinson c580512d32 allocrunner: Close updates routine correctly 2018-12-19 18:32:51 +01:00
Nick Ethier 969ec51730
devicemanager: fix devicemanager tests 2018-12-19 00:35:12 -05:00
Nick Ethier 6f1777284d
drivermanager: use correct plugin config types 2018-12-18 23:07:01 -05:00
Nick Ethier a02308ee6a
drivermanager: attempt to reattach and shutdown driver plugin if blocked by allow/block lists 2018-12-18 23:01:57 -05:00
Nick Ethier ce1a5cba0e
drivermanager: use allocID and task name to route task events 2018-12-18 23:01:51 -05:00
Nick Ethier bda32f9c79
client/pluginmanager: add plugin manager interface to device/driver managers 2018-12-18 22:56:23 -05:00
Nick Ethier d8a0265e68
client: batch initial fingerprinting in plugin manangers
drivermanager: fix pr comments/feedback
2018-12-18 22:56:19 -05:00
Nick Ethier 7d23cbf448
client/drivermananger: fixup issues from rebase and address PR comments 2018-12-18 22:55:38 -05:00
Nick Ethier 1543335710
tr: deregister task handler on cleanup 2018-12-18 22:55:38 -05:00
Nick Ethier 82175d1328
client/drivermananger: add driver manager
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incomming fingerprint and task events from each driver. The mananger
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.

Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint mananger and has been removed.
2018-12-18 22:55:18 -05:00
Alex Dadgar 730a6f5b9a lint 2018-12-18 16:48:00 -08:00
Alex Dadgar 4c57d2ec4d Add plugin API versioning to plugin loader and plugins 2018-12-18 16:48:00 -08:00
Alex Dadgar 9d1403d617
Merge pull request #5002 from hashicorp/b-task-config-resources
Convert driver resource to AllocatedTaskResource
2018-12-18 16:46:34 -08:00
Danielle Tomlinson 0edc65631a
Merge pull request #5007 from hashicorp/dani/f-allocrunner-async
allocrunner: Async api for shutdown/destroy/update
2018-12-19 01:26:41 +01:00
Alex Dadgar 8efac7ec81 Fix unit tests + upgrade pathing resources 2018-12-18 15:50:44 -08:00
Alex Dadgar b8268d9a46 Lint 2018-12-18 15:50:44 -08:00
Alex Dadgar 66cf3156b2 LinuxResources doesn't use task.Resources 2018-12-18 15:50:44 -08:00
Alex Dadgar 327b551b39 Drivers 2018-12-18 15:50:11 -08:00
Alex Dadgar b653ae2af7 utilities 2018-12-18 15:48:52 -08:00
Danielle Tomlinson 95a0c4fb29 taskrunner: Use a random suffix for Task Config
The RestartCount is not really suitable for use as a source of
uniqueness within task invocations as it is not monotonic, and interacts
with the restart stanza in a users config, so conflates restarts due to
task failures, with restarts due to enviromental changes, such as consul
template or vault secrets changing.

Here we instead use a substring from a uuid, which is more random than
we strictly need, but is nicer than rolling our own random string
generator here.
2018-12-19 00:38:54 +01:00
Danielle Tomlinson 1be0170ebe client: Update tests for async destroy 2018-12-18 23:38:34 +01:00
Danielle Tomlinson d6eb084d8a allocrunner: Drop and log updates after closing waitCh 2018-12-18 23:38:34 +01:00
Danielle Tomlinson 0d91285cd6 allocrunner: Documentation for ShutdownCh/DestroyCh 2018-12-18 23:38:34 +01:00
Danielle Tomlinson f2bb13818e fixup: Log when we detect out of order updates 2018-12-18 23:38:33 +01:00
Danielle Tomlinson 986fde0f5a allocrunner: Handle updates asynchronously
This creates a new buffered channel and goroutine on the allocrunner for
serializing updates to allocations. This allows us to take updates off
the routine that is used from processing updates from the server,
without having complicated machinery for tracking update lifetimes, or
other external synchronization.

This results in a nice performance improvement and signficantly better
throughput on batch changes such as preempting a large number of jobs
for a larger placement.
2018-12-18 23:38:33 +01:00
Danielle Tomlinson f3fa9d1406 gc: Wait for allocrunners to be destroyed 2018-12-18 23:38:33 +01:00
Danielle Tomlinson cb78a90f40 client: Async API for shutdown/destroy allocrunners 2018-12-18 23:38:33 +01:00
Danielle Tomlinson d1fbac1aad allocrunner: Async shutdown and destroy
This commit reduces the locking required to shutdown or destroy
allocrunners, and allows parallel shutdown and destroy of allocrunners during
shutdown.
2018-12-18 23:38:33 +01:00
Danielle Tomlinson d9174d8dcf
Merge pull request #4989 from hashicorp/dani/b-client-update-race-condition
client: Give a copy of clientconfig to allocrunner
2018-12-17 10:49:46 +01:00
Danielle Tomlinson 53aa1bc198
Merge pull request #5004 from hashicorp/dani/f-hook-errors
client: Emit TaskEvents when task hooks fail
2018-12-17 10:42:57 +01:00
Danielle Tomlinson a50ea29da4 taskrunner: Use hook errors for artifacts 2018-12-17 10:39:38 +01:00
Mahmood Ali 2d2c562e18 Remove implicit check
I intended to remove this line in 29ef7ecf2372f980d12a9900e1b2a351568dd415 - see my notes there for details.
2018-12-16 09:14:26 -05:00
Mahmood Ali d58e38e912 tests: avoid implicitly asserting clean shutdown
The assertion here is causing many spurious failures that aren't
actually relevant to the test itself.

We are tracking the cause for this failure independently, and it would
make more sense to have a dedicated test for clean shutdown.
2018-12-15 15:30:09 -05:00
Danielle Tomlinson 3647b701a6 taskrunner: Emit task events when a hook fails 2018-12-13 18:20:18 +01:00
Danielle Tomlinson 8b06e8d297
Merge pull request #4990 from hashicorp/dani/b-alloc-lock
client: updateAlloc release lock after read
2018-12-13 12:43:59 +01:00
Danielle Tomlinson 3823599da9 client: Give a copy of clientconfig to allocrunner
Currently, there is a race condition between creating a taskrunner, and
updating node attributes via fingerprinting.

This is because the taskenv builder will try to iterate over the
clientconfig.Node.Attributes map, which can be concurrently updated by
the fingerprinting process, thus causing a panic.

This fixes that by providing a copy of the clientconfg to the
allocrunner inside the Read lock during config creation.
2018-12-13 12:42:15 +01:00
Alex Dadgar 20c59df8b9
Merge pull request #4969 from hashicorp/f-alloc-hooks
Make alloc health watcher a postrun hook rather than shutdown hook
2018-12-12 14:34:36 -08:00
Danielle Tomlinson 4184eadaf4 client: updateAlloc release lock after read
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid hanging on to an exclusive lock
of the map while applying changes to allocrunners themselves, as they
should be internally consistent.

This fixes a bug where any client allocation api will block during the
shutdown or updating of an allocrunner and its child taskrunners.
2018-12-12 16:30:01 +01:00
Mahmood Ali 3d166e6e9c
Merge pull request #4984 from hashicorp/b-client-update-driver
client: update driver info on new driver fingerprint
2018-12-11 18:01:03 -05:00
Mahmood Ali 69b2355274
Merge pull request #4975 from hashicorp/fix-master-20181209
Some test fixes and remedies
2018-12-11 18:00:21 -05:00
Alex Dadgar 1531b6d534
Merge pull request #4970 from hashicorp/f-no-iops
Deprecate IOPS
2018-12-11 12:51:22 -08:00
Mahmood Ali ba515947c2 client: update driver info on new fingerprint
Fixes a bug where a driver health and attributes are never updated from
their initial status.  If a driver started unhealthy, it may never go
into a healthy status.
2018-12-11 14:25:10 -05:00
Danielle Tomlinson ed1791f4bf client: Style: use fluent style for building loggers 2018-12-11 18:03:45 +01:00
Danielle Tomlinson 805669ead4 client: Correctly pass a noop PrevAllocMigrator when restoring 2018-12-11 15:46:58 +01:00
Mahmood Ali 3babda5d45 tests: no need for buffer channel 2018-12-11 09:35:26 -05:00
Mahmood Ali 5a487ac884 tests: prevent indefinite blocking in some tests
Noticed few places where tests seem to block indefinitely and panic
after the test run reaches the test package timeout.

I intend to follow up with the proper fix later, but timing out is much
better than indefinitely blocking.
2018-12-11 09:35:26 -05:00
Mahmood Ali 4635168f20 test: fix TestFingerprintManager_Run_Combination
Let's use a fingerprinter that doesn't have values prepopulated in test
fixtures.
2018-12-11 09:35:26 -05:00
Danielle Tomlinson 6fb5ca6ad5 allocrunner: Test alloc runners should include a noop migrator 2018-12-11 13:12:35 +01:00
Danielle Tomlinson 4b4b85e3f4 allocwatcher: Cleanup new migrator/watcher interface 2018-12-11 13:12:35 +01:00
Danielle Tomlinson 83720575de client: Unify handling of previous and preempted allocs 2018-12-11 13:12:35 +01:00
Danielle Tomlinson dff7093243 client: Wait for preempted allocs to terminate
When starting an allocation that is preempting other allocs, we create a
new group allocation watcher, and then wait for the allocations to
terminate in the allocation PreRun hooks.

If there's no preempted allocations, then we simply provide a
NoopAllocWatcher.
2018-12-11 00:59:18 +01:00
Danielle Tomlinson 2cdef6a7b4 allocwatcher: Add Group AllocWatcher
The Group Alloc watcher is an implementation of a PrevAllocWatcher that
can wait for multiple previous allocs before terminating.

This is to be used when running an allocation that is preempting upstream
allocations, and thus only supports being ran with a local alloc watcher.

It also currently requires all of its child watchers to correctly handle
context cancellation. Should this be a problem, it should be fairly easy
to implement a replacement using channels rather than a waitgroup.

It obeys the PrevAllocWatcher interface for convenience, but it may be
better to extract Migration capabilities into a seperate interface for
greater clarity.
2018-12-11 00:58:27 +01:00
Marcin Matlaszek 39eec70f31
Recover from any possible io error when invoking Write on FileRotator
As of now, FileRotator uses bufio.Write under the hood to write data to
configured output file. Due to the way how bufio handles any occurred io
error - saves it into `err` variable never resetting it automatically -
any operation like `Write`, `Flush` etc will become a no-op, returning the very same,
saved error (eg. Out of disk space) even when the problem is fixed (eg. disk
space is available again).

That automatically means that FileRotator will stop writing any logs,
reporting the same error over and over again, even if it's no longer
valid.

This PR fixes it by resetting the bufio Writer, which resets any errors
and tries to write requested data.
2018-12-07 18:22:29 +01:00
Alex Dadgar 1e3c3cb287 Deprecate IOPS
IOPS have been modelled as a resource since Nomad 0.1 but has never
actually been detected and there is no plan in the short term to add
detection. This is because IOPS is a bit simplistic of a unit to define
the performance requirements from the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.
2018-12-06 15:09:26 -08:00
Danielle Tomlinson e3621c55fa gc: Fix maxallocs integration test 2018-12-06 21:50:50 +01:00
Alex Dadgar c4b5f80918 Make alloc health watcher a postrun hook rather than shutdown hook 2018-12-06 12:30:31 -08:00
Danielle Tomlinson 62b98e64ca client/gc: Replace GC integration test with unit
The previous integration test was broken during the client refactor, and
it seems to be some sort of race with state updating.

I'm going to try and construct a replacement test as part of work on
performance, but for now, the underlying behaviour is still being
tested.
2018-12-06 12:28:23 +01:00
Danielle Tomlinson f6e474fd55 client: Re-enable GC tests 2018-12-06 12:28:23 +01:00
Danielle Tomlinson d043532cb0 allocrunner: Basic test alloc runner 2018-12-06 12:28:23 +01:00
Alex Dadgar b39c21d49c Fix various bugs with task events
Fixes the following:
* Emitting events when the task fails to start
* Don't double emit events on task shutdown (nomad stop)
* Don't emit a OOM kill metric unless actually OOM'd
2018-12-05 14:27:07 -08:00
Danielle Tomlinson 10b3e68a6d
Merge pull request #4925 from hashicorp/f-driver-plugins-dani
Third Party Driver Plugins Support
2018-12-03 20:48:19 +01:00
Mahmood Ali 88622b97bd
libcontainer to manage /dev and /proc (#4945)
libcontainer already manages `/dev`, overriding task_dir - so let's use it for `/proc` as well and remove deadcode.
2018-12-03 10:41:01 -05:00
Danielle Tomlinson 9bd77e9295 testfix: Fix import cycle in allocdir tests 2018-12-01 17:25:30 +01:00
Danielle Tomlinson 66c521ca17 client: Move fingerprint structs to pkg
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically, drivers. Due to
client/config also depending on the plugin_loader.

It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
2018-12-01 17:10:39 +01:00
Danielle Tomlinson 2db5ae38d8 client: Rename drivers/shared/env => client/taskenv 2018-11-30 12:18:39 +01:00
Danielle Tomlinson f3a77b8084 client: Merge driver/shared/structs and client/structs 2018-11-30 10:56:45 +01:00
Danielle Tomlinson b9295f0d56 client/driver: Remove package 2018-11-30 10:47:08 +01:00
Danielle Tomlinson fdfe93aa25 fixup: executorplugin: fix rkt build 2018-11-30 10:47:08 +01:00
Danielle Tomlinson d72ecd95ec client/driver: Vendor setEnvvars into docker_test 2018-11-30 10:46:13 +01:00
Danielle Tomlinson d26a310db0 client: Move executor plugins into own package 2018-11-30 10:46:13 +01:00
Danielle Tomlinson d259c36844 driver: Flatten SetEnvvars into taskdirhook 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 6b72e96eba client: Move driver/logging to logmon/logging
The logging package is used by logmon and the legacy mock_driver. Because the
legacy drivers are going away, I'm moving it here to signify its actual
ownership.
2018-11-30 10:46:13 +01:00
Danielle Tomlinson 04c8851b4c client: Migrate DriverStats optout to drivers/shared/structs 2018-11-30 10:46:13 +01:00
Danielle Tomlinson dbd82e1af4 client: Remove test dependency on client/driver 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 0544a57abe drivers: Move client/drivers/executor to drivers/shared/executor 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 1a29811169 drivers: Move client/drivers/env to drivers/shared/env
As part of deprecating legacy drivers, we're moving the env package to a
new drivers/shared tree, as it is used by the modern docker and rkt
driver packages, and is useful for 3rd party plugins.
2018-11-30 10:46:13 +01:00
Nick Ethier bbe420718a
Merge pull request #4922 from hashicorp/f-drivermananger
add generic plugin manager interface and orchestration
2018-11-28 22:17:04 -05:00
Preetha 1f526db414
Merge pull request #4919 from hashicorp/f-fingerprint-attribute-type
Modify fingerprint interface to use typed attribute struct
2018-11-28 14:18:28 -06:00
Michael Schurter 1bd9a9f9dd
Merge pull request #4894 from hashicorp/f-device-hook
Device hook and devices affect computed node class
2018-11-28 12:10:43 -06:00
Preetha Appan f89dbcd9cc
modify fingerprint interface to use typed attribute struct 2018-11-28 10:01:03 -06:00
Nick Ethier 60c6907ea5
client/plugin: remove println from plugin group func 2018-11-27 22:45:09 -05:00
Nick Ethier 600738e991
client/plugin: lint/spelling errors 2018-11-27 22:45:09 -05:00
Nick Ethier 45a6bf7acd
client/plugin: add generic plugin mananger interface and orchestration 2018-11-27 22:45:03 -05:00
Mahmood Ali ad1f8d8c20 Fixes in old lxc driver 2018-11-27 21:40:43 -05:00
Michael Schurter 3e56ee005a add nil check around task resources in device hook
Looking at NewTaskRunner I'm unsure whether TaskRunner.TaskResources
(from which req.TaskResources is set) is intended to be nil at times or
if the TODO in NewTaskRunner is intended to ensure it is always non-nil.
2018-11-27 17:25:33 -08:00
Michael Schurter b75e9fce37 assume that slices contain only non-nil items 2018-11-27 17:25:33 -08:00
Michael Schurter 85073f9d29 client: properly support hook env vars
The old approach was incomplete. Hook env vars are now:

 * persisted and restored between agent restarts
 * deterministic (LWW if 2 hooks set the same key)
2018-11-27 17:25:33 -08:00
Alex Dadgar 4ee603c382 Device hook and devices affect computed node class
This PR introduces a device hook that retrieves the device mount
information for an allocation. It also updates the computed node class
computation to take into account devices.

TODO Fix the task runner unit test. The environment variable is being
lost even though it is being properly set in the prestart hook.
2018-11-27 17:25:33 -08:00
Michael Schurter 27e07f657e
Merge pull request #4896 from hashicorp/b-prevalloc-deadlock
Fix deadlock in previous alloc watcher by emitting last alloc update
2018-11-27 19:07:16 -06:00
Michael Schurter b75f79a793 fix test breakage caused by rebase 2018-11-27 16:34:01 -08:00
Michael Schurter 91da566935 fix mispelings 2018-11-27 16:33:55 -08:00
Chris Baker a1fb1f3830
Merge pull request #4891 from hashicorp/b-1150-rkt-volume-names
drivers/rkt: fix invalid volumes
2018-11-27 18:55:00 -05:00
Danielle Tomlinson 3651dbdc25
Merge pull request #4909 from hashicorp/b-restart-delay
taskrunner: Return the restart delay correctly
2018-11-27 23:55:54 +01:00
Michael Schurter 22149a661e client: comment on importance of chan ops ordering 2018-11-27 14:11:32 -08:00
Mahmood Ali 05a958dc21 Update client/structs/broadcaster.go
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2018-11-27 14:06:08 -08:00
Michael Schurter 81b6a24a84 client: fix send-after-close in broadcaster 2018-11-27 14:06:08 -08:00
Michael Schurter c429e6b0ab client: check if prev alloc is already terminated
This is a defensive fast-path as 7c6aa0be already fixed the deadlock.
2018-11-27 14:06:08 -08:00
Michael Schurter 944ea6d38b client: emit last sent alloc to new listeners
Fixes a deadlock where the allocwatcher would block forever waiting for
an update from a terminal alloc.

Made the broadcaster easier to debug as well.
2018-11-27 14:06:08 -08:00
Michael Schurter 1e4ef139dd
Merge pull request #4883 from hashicorp/f-graceful-shutdown
Support graceful shutdowns in agent
2018-11-27 15:55:15 -06:00
Michael Schurter 4f7e6f9464 client: fix races in use of goroutine group
The group utility struct does not support asynchronously launched
goroutines (goroutines-inside-of-goroutines), so switch those uses to a
normal go call.

This means watchNodeUpdates and watchNodeEvents may not be shutdown when
Shutdown() exits. During nomad agent shutdown this does not matter.

During tests this means a test may leak those goroutines or be unable to
know when those goroutines have exited.

Since there's no runtime impact and these goroutines do not affect alloc
state syncing it seems ok to risk leaking them.
2018-11-26 12:52:55 -08:00
Michael Schurter 9f43fb6d29 client: reuse group instead of diy'ing it 2018-11-26 12:52:31 -08:00
Michael Schurter 22771aa19e client/ar: remove useless wait ch from runTasks
Arguably this makes task.WaitCh() useless, but I think exposing a wait
chan from TaskRunners is a generically useful API.
2018-11-26 12:51:18 -08:00
Michael Schurter 2fdd013956 client: document how AR/TR Run methods behave 2018-11-26 12:50:35 -08:00
Chris Baker 9bd4317139 modified TaskConfig to include AllocID
use this for volume names in drivers/rkt to address #1150
2018-11-26 18:54:26 +00:00
Nick Ethier 95362eaa02
Merge pull request #4844 from hashicorp/f-docker-plugin
Docker driver plugin
2018-11-20 20:43:03 -05:00
Mahmood Ali e1994e59bd address review comments 2018-11-20 17:10:54 -05:00
Mahmood Ali 171b73fde7 Emit metric counters for Vault token and renewal failures 2018-11-20 17:10:54 -05:00
Mahmood Ali 5b10da5de6 Set User-Agent header when hitting Vault API 2018-11-20 17:10:54 -05:00
Danielle Tomlinson 093f029d5b taskrunner: Return the restart delay correctly
We were incorrectly returning a 0 duration to the taskrunner when
determining when a task should restart. This would cause tasks to be
restarted immediately, ignoring the restart {} stanza in a users
configuration.

This commit causes us to return the restart duration to the task runner
so it may correctly delay further execution.
2018-11-20 21:52:23 +01:00
Nick Ethier 3e42d6914e
task_runner: use NodeResources instead of deprecated struct 2018-11-20 13:46:39 -05:00
Nick Ethier 93c0200566
task_runner: use task and alloc copies instead of referencing the original pointer 2018-11-20 13:34:46 -05:00
Nick Ethier 29591a7c2e
task_runner: emit event on task exit with exit result details 2018-11-19 22:59:17 -05:00
Nick Ethier 4be8a86ef9
plugins/driver: remove NodeResources from task Resources and use PercentTicks field for docker driver 2018-11-19 22:59:17 -05:00
Nick Ethier 69049d37f5
drivers: added NodeResources to drivers.TaskConfig 2018-11-19 22:59:16 -05:00
Nick Ethier 8f8698b3e1
docker: started work on porting docker driver to new plugin framework 2018-11-19 22:59:15 -05:00
Michael Schurter 88577fe083 client.rpc: don't log errors on shutdown 2018-11-19 16:39:30 -08:00
Michael Schurter 5bd744ac3d client: support graceful shutdowns
Client.Shutdown now blocks until all AllocRunners and TaskRunners have
exited their Run loops. Tasks are left running.
2018-11-19 16:39:30 -08:00
Mahmood Ali 9479015f51
Merge pull request #4884 from hashicorp/f-alloc-devices-cli
Report alloc device statistics in API and CLI
2018-11-16 18:04:54 -05:00
Mahmood Ali f139234372 address review comments 2018-11-16 17:13:01 -05:00
Mahmood Ali f72e599ee7 Populate alloc stats API with device stats
This change makes few compromises:

* Looks up the devices associated with tasks at look up time.  Given
that `nomad alloc status` is called rarely generally (compared to stats
telemetry and general job reporting), it seems fine.  However, the
lookup overhead grows bounded by number of `tasks x total-host-devices`,
which can be significant.

* `client.Client` performs the task devices->statistics lookup.  It
passes self to alloc/task runners so they can look up the device statistics
allocated to them.
  * Currently alloc/task runners are responsible for constructing the
entire RPC response for stats
  * The alternatives for making task runners device statistics aware
don't seem appealing (e.g. having task runners contain reference to hostStats)

* On the alloc aggregation resource usage, I did a naive merging of task device statistics.
  * Personally, I question the value of such aggregation, compared to
costs of struct duplication and bloating the response - but opted to be
consistent in the API.
  * With naive concatination, device instances from a single device group used by separate tasks in the alloc, would be aggregated in two separate device group statistics.
2018-11-16 10:26:32 -05:00
Michael Schurter 0cdb188ae4 tests: fix tests post-rebase 2018-11-15 17:40:56 -08:00
Michael Schurter 59f106ecee client/tr: add a bit of context to envbuilder errors 2018-11-15 16:26:25 -08:00
Michael Schurter 742f8775ba client: remove old proxy references from comments 2018-11-15 16:26:25 -08:00
Michael Schurter 2d0b44c3b4 client: test more env key variations 2018-11-15 16:26:25 -08:00
Michael Schurter 8bcd90d78d client: add new nested variables to task's hcl ctx
The error messages are really bad, but it's extremely difficult to
produce good error messages without the original HCL.
2018-11-15 16:26:25 -08:00
Michael Schurter 5e51e2c2d5 client: turn env into nested objects for task configs 2018-11-15 16:25:57 -08:00
Michael Schurter f8cdd561f0 client: interpolate driver configurations
Also add missing SetDriverNetwork calls.
2018-11-15 16:25:57 -08:00
Mahmood Ali 046f098bac Track Node Device attributes and serve them in API 2018-11-14 14:42:29 -05:00
Mahmood Ali 63acda956c Add Client Device Stats structs in api package 2018-11-14 14:41:19 -05:00
Mahmood Ali b74ccc742c Expose Device Stats in /client/stats API endpoint 2018-11-14 14:41:19 -05:00
Mahmood Ali c5de71a424 Allow nullable fields in StatValues
In state values, we need to be able to distinguish between zero values
(e.g. `false`) and unset values (e.g. `nil`).

We can alternatively use protobuf `oneOf` and nested map to ensure
consistency of fields that are set together, but the golang
representation does not represent that well and introducing a mismatch
between representations.  Thus, I opted not to use it.
2018-11-14 14:41:19 -05:00
Mahmood Ali 713c9fe683 Move Stat{Object|Value} to plugins/shared/structs
Moving them as they may be useful for other packages/plugins besides
devices.
2018-11-14 09:01:26 -05:00
Mahmood Ali 1f4db08f42 Regenerate proto files with protoc-gen-go@v1.2.0 2018-11-14 09:01:26 -05:00
Danielle Tomlinson 0917e93537
Merge pull request #4869 from hashicorp/b-executor-stdout
executor: Fix stdout stderr copy/paste
2018-11-13 19:22:37 -08:00
Mahmood Ali 865419e756 convert all config durations to strings in tests 2018-11-13 10:21:40 -05:00
Mahmood Ali ac3b4571eb Address review comments 2018-11-13 10:21:40 -05:00
Mahmood Ali 69f26783e4 avoid setting resource limit on rkt command
Was accidentally modified in 5b14d24bf4626bab420d00783d92bcf25e0b641e .
2018-11-13 10:21:40 -05:00
Mahmood Ali 8fa26f5521 Fix docker log fetching in tests
We no longer use syslog for tracking logs so tracking them explicitly
here
2018-11-13 10:21:40 -05:00
Mahmood Ali 88fa968623 killing should be done with wait client
Incidentally changed in 5b14d24bf4626bab420d00783d92bcf25e0b641e
2018-11-13 10:21:40 -05:00
Mahmood Ali 7690f389a0 Prioritize checking consumer context cancellation
Tests expect that as soon as eventer shuts down immediately on context
cancellations; but golang does not guarantee priority when multiple
pending channels are ready in a select statement.
2018-11-13 10:21:40 -05:00
Mahmood Ali c62ec124c0 Set clean config for mock driver
The default job here contains some exec task config (for setting
command and args) that aren't used for mock driver.  Now, the alloc
runner seems stricter about validating fields and errors on unexpected
fields.

Updating configs in tests so we can have an explicit task config
whenever driver is set explicitly.
2018-11-13 10:21:40 -05:00
Mahmood Ali e5e6f9a785 Update Docker name parsing lookup
`ParseNamed` function changed in e9f3f2cfee9d729a8642344c4fa4ea70b2d49468
where became `ParsedNormalizedName` with extra checks.
2018-11-13 10:21:40 -05:00
Danielle Tomlinson bfeded1f30 executor: Fix stdout stderr copy/paste 2018-11-12 22:08:04 -08:00
Alex Dadgar c4f9e22aeb fix race 2018-11-07 12:22:07 -08:00
Alex Dadgar a7ca737fb6 review comments 2018-11-07 11:31:52 -08:00
Alex Dadgar f0c7a8159b tests 2018-11-07 10:43:15 -08:00
Alex Dadgar 204ca8230c Device manager
Introduce a device manager that manages the lifecycle of device plugins
on the client. It fingerprints, collects stats, and forwards Reserve
requests to the correct plugin. The manager, also handles device plugins
failing and validates their output.
2018-11-07 10:43:15 -08:00
Michael Schurter a4e6a92d18 client: update alloc status when terminating
Defensively update alloc status whenever killing all tasks.
2018-11-05 15:11:10 -08:00
Michael Schurter 66bf3db455 client: block on context as well as waitCh
For lifecycle operations such as Restart and Kill, the client should not
expect driver plugins to be well behaved and close their waitCh on
context cancelation. Always wait on the passed in context as well as the
waitCh.
2018-11-05 12:32:05 -08:00
Michael Schurter b994f51990 client: fix tr lifecycle logic and shutdown delay
ShutdownDelay must be honored whenever the task is killed or restarted.
Services were not being deregistered prior to restarting.
2018-11-05 12:32:05 -08:00
Michael Schurter 2d3479147a client: fix ar and tr tests 2018-11-05 12:32:05 -08:00
Michael Schurter d29d09023e client: do not run terminal allocs 2018-11-05 12:32:05 -08:00
Michael Schurter 2bbd88888c client: first pass at implementing task restoring
Task restoring works but dead tasks may be restarted
2018-11-05 12:32:05 -08:00
Nick Ethier b0ddc03409
Merge pull request #4765 from jippi/increase-line-scan-limit
fix: increase log rotator line scan limit
2018-10-29 18:46:30 -07:00
Nick Ethier 3fcf8ba7e6
Merge pull request #4795 from hashicorp/f-plugin-config
Pass client configuration to plugins through loader
2018-10-29 18:42:27 -07:00
Nick Ethier bda3b1d3b3
rename NomadConfig to ClientAgentConfig 2018-10-29 21:34:34 -04:00
Michael Schurter 6f2cffb196
Merge pull request #4803 from hashicorp/b-leader-fixes
AR Fixes: task leader handling, restoring, state updating, AR.Destroy deadlocks
2018-10-29 17:38:59 -05:00
Michael Schurter d71a1b4547 tests: more fixes due to api changes 2018-10-29 15:25:22 -07:00
Preetha Appan b85cc38f3d
Stat path to binary to handle raw exec driver interpolated binary path 2018-10-26 17:24:05 -05:00
Preetha Appan 55ac8d3d12
Fix test linting 2018-10-26 10:30:12 -05:00
Michael Schurter b7a9d61a38 ar: initialize allocwatcher on restore
Fixes a panic. Left a comment on how the behavior could be improved, but
this is what releases <0.9.0 did.
2018-10-19 09:45:45 -07:00
Michael Schurter e060174130 ar: fix leader handling, state restoring, and destroying unrun ARs
* Migrated all of the old leader task tests and got them passing
* Refactor and consolidate task killing code in AR to always kill leader
  tasks first
* Fixed lots of issues with state restoring
* Fixed deadlock in AR.Destroy if AR.Run had never been called
* Added a new in memory statedb for testing
2018-10-19 09:45:45 -07:00
Nick Ethier 58b430edae
added driver specific client config struct to plugin configuration 2018-10-18 23:31:01 -04:00
Michael Schurter cefbf00bf0 ar: refactor task killing into 1 method
Update comments and address some PR comments from #4775
2018-10-17 10:06:59 -07:00
Michael Schurter 21d78be961 tests: explicitly cleanup after clients 2018-10-17 10:06:59 -07:00
Michael Schurter 222f6b5741 ar: fix task leader, update, and stop handling 2018-10-17 10:06:59 -07:00
Michael Schurter 1badbb2fc4 tr: cleanup hook logs 2018-10-17 09:42:32 -07:00
Nick Ethier 65adb80ebf
plumb NomadConfig into plugins 2018-10-16 22:47:22 -04:00
Nick Ethier d94b631b6b
drivers/exec: add exec implementation 2018-10-16 22:45:28 -04:00
Michael Schurter 0baaba8b09 templates: fix tests 2018-10-16 16:56:57 -07:00
Michael Schurter 838ddf4d4a fix linter errors 2018-10-16 16:56:57 -07:00
Michael Schurter e27c82ea4d client: remove unused handleproxy 2018-10-16 16:56:56 -07:00
Michael Schurter 4ea5217d72 tr: remove unused DriverHandle interface
was causing typed nil interface panics and served no purpose
2018-10-16 16:56:56 -07:00
Michael Schurter 528c426c53 Port client portion of #4392 to new taskrunner
PR #4392 was merged to master *after* allocrunnerv2 was branched, so the
client-specific portions must be ported from master to arv2.
2018-10-16 16:56:56 -07:00
Michael Schurter f12501d4c3 tr: implement dispatch payload hook
Now passing the TaskDir struct to prestart hooks instead of just the
root task dir itself as dispatch needs local/.
2018-10-16 16:56:56 -07:00
Nick Ethier d9f0cbf4a9 client: log retry during driver fingerprint redispense 2018-10-16 16:56:56 -07:00
Nick Ethier c7ac1186c9 client: add test for driverfailure during fingerprinting 2018-10-16 16:56:56 -07:00
Nick Ethier 8cf669b5aa taskrunner: return error on waitCh 2018-10-16 16:56:56 -07:00
Nick Ethier 047fad2953 client: simplify driver plugin logic from review comments 2018-10-16 16:56:56 -07:00
Nick Ethier 9686e1b258 client: fix broked tests from refactoring 2018-10-16 16:56:56 -07:00
Nick Ethier 3183b33d24 client: review comments and fixup/skip tests 2018-10-16 16:56:56 -07:00
Nick Ethier f192c3752a client: refactor post allocrunnerv2 finalization 2018-10-16 16:56:56 -07:00
Nick Ethier 4a4c7dbbfc client: begin driver plugin integration
client: fingerprint driver plugins
2018-10-16 16:56:56 -07:00
Alex Dadgar 7946a14aa8 Fix lints 2018-10-16 16:56:56 -07:00
Alex Dadgar 89dafaaea9 compile on windows 2018-10-16 16:56:56 -07:00
Alex Dadgar ad4fac526c more test fixes 2018-10-16 16:56:56 -07:00
Alex Dadgar 45e41cca03 allocrunnerv2 -> allocrunner 2018-10-16 16:56:56 -07:00
Alex Dadgar 9baa7402ef fix test compiling 2018-10-16 16:56:55 -07:00
Alex Dadgar 7d9c069f09 skip building deprecated files 2018-10-16 16:56:55 -07:00
Alex Dadgar 6c9d9d5173 move files around 2018-10-16 16:56:55 -07:00
Michael Schurter 5f696608a6 tests: fix missing logger caused by bad merge 2018-10-16 16:56:55 -07:00
Michael Schurter 048510b13e tr: properly comment handle fields 2018-10-16 16:56:55 -07:00
Michael Schurter 9e49ed3464 ar: AllocState should not mutate ar.state
If ar.state.TaskStates has not been set, set it on the copy of ar.state.
That keeps ar.state manipulations in one location and allows AllocState
to only acquire read-locks.
2018-10-16 16:56:55 -07:00
Michael Schurter f279b1d1b1 tests: test logs endpoint against pending task
Although the really exciting change is making WaitForRunning return the
allocations that it started. This should cut down test boilerplate
significantly.
2018-10-16 16:56:55 -07:00
Michael Schurter dd4227f84a tests: make a test client/config easier to generate
Sadly can't move the fingerprint timeout tweak into the helper due to
circular imports.
2018-10-16 16:56:55 -07:00
Michael Schurter 1d747048ea tests: ensure task state is initialized in NewAR
Also expose NoopDB for use in tests.
2018-10-16 16:56:55 -07:00
Michael Schurter 960f3be76c client: expose task state to client
The interesting decision in this commit was to expose AR's state and not
a fully materialized Allocation struct. AR.clientAlloc builds an Alloc
that contains the task state, so I considered simply memoizing and
exposing that method.

However, that would lead to AR having two awkwardly similar methods:
 - Alloc() - which returns the server-sent alloc
 - ClientAlloc() - which returns the fully materialized client alloc

Since ClientAlloc() could be memoized it would be just as cheap to call
as Alloc(), so why not replace Alloc() entirely?

Replacing Alloc() entirely would require Update() to immediately
materialize the task states on server-sent Allocs as there may have been
local task state changes since the server received an Alloc update.

This quickly becomes difficult to reason about: should Update hooks use
the TaskStates? Are state changes caused by TR Update hooks immediately
reflected in the Alloc? Should AR persist its copy of the Alloc? If so,
are its TaskStates canonical or the TaskStates on TR?

So! Forget that. Let's separate the static Allocation from the dynamic
AR & TR state!

 - AR.Alloc() is for static Allocation access (often for the Job)
 - AR.AllocState() is for the dynamic AR & TR runtime state (deployment
   status, task states, etc).

If code needs to know the status of a task: AllocState()
If code needs to know the names of tasks: Alloc()

It should be very easy for a developer to reason about which method they
should call and what they can do with the return values.
2018-10-16 16:56:55 -07:00
Michael Schurter fb4aa74153 client: add comment 2018-10-16 16:56:55 -07:00
Michael Schurter 9a7e6be2b6 client: fix potentially dropped streaming errors 2018-10-16 16:56:55 -07:00
Michael Schurter 4b44b9039b tr: remove unneeded lock; chan synchronizes access 2018-10-16 16:56:55 -07:00
Michael Schurter 211b96bb5c tr: fix shutdown/destroy/WaitResult handling
Multiple receivers raced for the WaitResult when killing tasks which
could lead to a deadlock if the "wrong" receiver won.

Wrap handlers in an ugly little proxy to avoid this. At first I wanted
to push this into drivers, but the result is tied to the TR's handle
lifecycle -- not the lifecycle of an alloc or task.
2018-10-16 16:56:55 -07:00
Michael Schurter 951ed17436 client: do not inspect task state to follow logs
"Ask forgiveness, not permission."

Instead of peaking at TaskStates (which are no longer updated on the
AR.Alloc() view of the world) to only read logs for running tasks, just
try to read the logs and improve the error handling if they don't exist.

This should make log streaming less dependent on AR/TR behavior.

Also fixed a race where the log streamer could exit before reading an
error. This caused no logs or errors to be displayed sometimes when an
error occurred.
2018-10-16 16:56:55 -07:00
Michael Schurter 2325348053 mock_driver: close waitCh after exiting
mock_driver wasn't behaving like other driver handles.
2018-10-16 16:56:55 -07:00
Michael Schurter 8d1419c62b client: fix accessing alloc runners
* GetClientAlloc() gains nothing from using allAllocs()
* getAllocatedResources was calling getAllocRunners() twice
2018-10-16 16:56:55 -07:00
Michael Schurter 55ab491801 tr: remove wip comments 2018-10-16 16:56:55 -07:00
Michael Schurter 3ccc091a72 ar: lock around accessing tasks
Specify that Alloc() does not return updated task states.
2018-10-16 16:56:55 -07:00
Alex Dadgar 6f0ed6184b Fix client reloading and pass the plugin loaders to server and client 2018-10-16 16:56:55 -07:00
Nick Ethier 352c05cdf4 plugin/drivers: plumb in stdout/stderr paths 2018-10-16 16:53:31 -07:00
Nick Ethier 0e3f85222a driver/raw_exec: port existing raw_exec tests and add some testing utilities 2018-10-16 16:53:31 -07:00
Nick Ethier d9628ff394 driver/raw_exec: more tests and bug fixes
added wrapper struct for plugin.ReattachConfig to better handle serialization
2018-10-16 16:53:31 -07:00
Nick Ethier bcc5c4a8bd clientv2: base driver plugin (#4671)
Driver plugin framework to facilitate development of driver plugins.

Implementing plugins only need to implement the DriverPlugin interface.
The framework proxies this interface to the go-plugin GRPC interface generated
from the driver.proto spec.

A testing harness is provided to allow implementing drivers to test the full
lifecycle of the driver plugin. An example use:

func TestMyDriver(t *testing.T) {
    harness := NewDriverHarness(t, &MyDiverPlugin{})
    // The harness implements the DriverPlugin interface and can be used as such
    taskHandle, err := harness.StartTask(...)
}
2018-10-16 16:53:31 -07:00
Michael Schurter 62c1285afc tr: add comments and cleanup call signature
From review comments on #4649 left post-merge.
2018-10-16 16:53:31 -07:00
Nick Ethier 5dee1141d1 executor v2 (#4656)
* client/executor: refactor client to remove interpolation

* executor: POC libcontainer based executor

* vendor: use hashicorp libcontainer fork

* vendor: add libcontainer/nsenter dep

* executor: updated executor interface to simplify operations

* executor: implement logging pipe

* logmon: new logmon plugin to manage task logs

* driver/executor: use logmon for log management

* executor: fix tests and windows build

* executor: fix logging key names

* executor: fix test failures

* executor: add config field to toggle between using libcontainer and standard executors

* logmon: use discover utility to discover nomad executable

* executor: only call libcontainer-shim on main in linux

* logmon: use seperate path configs for stdout/stderr fifos

* executor: windows fixes

* executor: created reusable pid stats collection utility that can be used in an executor

* executor: update fifo.Open calls

* executor: fix build

* remove executor from docker driver

* executor: Shutdown func to kill and cleanup executor and its children

* executor: move linux specific universal executor funcs to seperate file

* move logmon initialization to a task runner hook

* client: doc fixes and renaming from code review


* taskrunner: use shared config struct for logmon fifo fields

* taskrunner: logmon only needs to be started once per task
2018-10-16 16:53:31 -07:00
Michael Schurter e6e2930a00 tr: implement stats collection hook
Tested except for the net/rpc specific error case which may need
changing in the gRPC world.
2018-10-16 16:53:31 -07:00
Michael Schurter 86bd329539 fix build errors post merges 2018-10-16 16:53:31 -07:00
Michael Schurter a977e22028 test: cleanup mock consul service client
Updated to hclog.

It exposed fields that required an unexported lock to access. Created a
getter methodn instead. Only old allocrunner currently used this
feature.
2018-10-16 16:53:31 -07:00
Michael Schurter 6f92b04226 health_hook: simplify locking; test thoroughly
Use doneCh like @dadgar suggested in the original PR.

Thoroughly test hook as concurrent Update calls make for a tricky
concurrency problem.
2018-10-16 16:53:30 -07:00
Alex Dadgar cebfead6bc add logger back 2018-10-16 16:53:30 -07:00
Nick Ethier 03422aa529 fifo: add new fifo package for named pipes (#4665)
* fifo: add new fifo package for named pipes
2018-10-16 16:53:30 -07:00
Alex Dadgar 8504505c0d client uses passed logger and fix fingerprinters 2018-10-16 16:53:30 -07:00
Nick Ethier 66ff12e5f7 Update runc/libcontainer and friends (#4655)
* vendor: bump libcontainer and docker to remove Sirupsen imports

* vendor: fix bad vendoring of archive package

* vendor: fix api changes to cgroups in executor

* vendor: fix docker api changes

* vendor: update github.com/Azure/go-ansiterm to use non capitalized logrus import
2018-10-16 16:53:30 -07:00
Michael Schurter 195b8127fb health_hook: fix panic and add tests
Still more testing to do, but I want to get this panic fixed ASAP.

All new tests pass with -race
2018-10-16 16:53:30 -07:00
Michael Schurter 64efc3d301 Emit events before long operations
Append when there's nothing blocking between appending and sending an
update to the server.
2018-10-16 16:53:30 -07:00
Michael Schurter a2b696c4cf Use a semaphore to block until watcher exits 2018-10-16 16:53:30 -07:00
Michael Schurter a73162c977 ar: use multierror in update hook loop
Make it match TaskRunner update hook behavior
2018-10-16 16:53:30 -07:00
Michael Schurter a7b427718c tr: refactor EmitEvents into Emit+Append
* UpdateState: set state, append event, persist, update servers
* EmitEvent: append event, persist, update servers
* AppendEvent: append event, persist

AppendEvent may not even have to persist, but for the sake of
correctness I'm going with that for now.
2018-10-16 16:53:30 -07:00
Michael Schurter 93f3ac9ed6 ar: create health setting shim for health watcher 2018-10-16 16:53:30 -07:00
Michael Schurter 4d5aaac6d2 fix detection of task transitioning to running 2018-10-16 16:53:30 -07:00
Michael Schurter 4136e59f79 arv2: implement alloc health watching
Also remove initial alloc from broadcaster as it just caused useless
extra processing.
2018-10-16 16:53:30 -07:00
Michael Schurter 5c5c6dc41b refactor ar hooks into their own files
minimize passed dependencies to ease testing
2018-10-16 16:53:30 -07:00
Michael Schurter 0bbf3a93ee make AllocBroadcaster easier to use
And test thoroughly.
2018-10-16 16:53:30 -07:00
Michael Schurter 9d1ea3b228 client: hclog-ify most of the client
Leaving fingerprinters in case that interface changes with plugins.
2018-10-16 16:53:30 -07:00
Michael Schurter e42154fc46 implement stopping, destroying, and disk migration
* Stopping an alloc is implemented via Updates but update hooks are
  *not* run.
* Destroying an alloc is a best effort cleanup.
* AllocRunner destroy hooks implemented.
* Disk migration and blocking on a previous allocation exiting moved to
  its own package to avoid cycles. Now only depends on alloc broadcaster
  instead of also using a waitch.
* AllocBroadcaster now only drops stale allocations and always keeps the
  latest version.
* Made AllocDir safe for concurrent use

Lots of internal contexts that are currently unused. Unsure if they
should be used or removed.
2018-10-16 16:53:30 -07:00
Michael Schurter 4236255686 lots of comment/log fixes 2018-10-16 16:53:30 -07:00
Michael Schurter 5749ede04e keep forgetting lxc 2018-10-16 16:53:30 -07:00
Michael Schurter 357641c364 persist alloc state on changes, not periodically
Allow alloc and task runners to persist their own state when something
changes instead of periodically syncing all state.
2018-10-16 16:53:30 -07:00
Michael Schurter 820af27171 wrap boltdb in a write deduplicator
Saves a tiny bit of cpu and some IO. Sadly doesn't prevent all IO on
duplicate writes as the transactions are still created and committed.

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: github.com/hashicorp/nomad/helper/boltdd
BenchmarkWriteDeduplication_On-4             500           4059591 ns/op           23736 B/op         56 allocs/op
BenchmarkWriteDeduplication_Off-4            300           4115319 ns/op           25942 B/op         55 allocs/op
2018-10-16 16:53:30 -07:00
Michael Schurter 990228a6e2 wip wrap boltdb to get path information
finished but doesn't handle deleting deeply nested buckets
2018-10-16 16:53:30 -07:00
Michael Schurter a3fe0510d1 Move all encoding and put deduping into state db
Still WIP as it does not handle deletions.
2018-10-16 16:53:30 -07:00
Michael Schurter 533bc93b3a implement all boltdb interactions behind StateDB 2018-10-16 16:53:30 -07:00
Michael Schurter d890de036a tr: persist hook state whenever it changes 2018-10-16 16:53:30 -07:00
Michael Schurter fae5e89a0e artifacts: don't emit event when there's no artifacts 2018-10-16 16:53:30 -07:00
Michael Schurter 5383d20505 removing old restoration path before api change 2018-10-16 16:53:30 -07:00
Michael Schurter a5d3e3fb0a Implement alloc updates in arv2
Updates are applied asynchronously but sequentially
2018-10-16 16:53:30 -07:00
Michael Schurter 39b3f3a85b call handle.Network() instead of storing it 2018-10-16 16:53:30 -07:00
Michael Schurter 7132b67c1e Add Network method to Handle interface
Should probably be moved to an Inspect method in the Driver Plugin world
2018-10-16 16:53:30 -07:00
Michael Schurter a4b4d7b266 consul service hook
Deregistration works but difficult to test due to terminal updates not
being fully implemented in the new client/ar/tr.
2018-10-16 16:53:29 -07:00
Michael Schurter 5be982e674 restore vault client 2018-10-16 16:53:29 -07:00
Michael Schurter ce04915c9f log before killing tasks 2018-10-16 16:53:29 -07:00
Michael Schurter a2bf851805 no need to TaskStateUpdated to return an error
also updated comments
2018-10-16 16:53:29 -07:00
Alex Dadgar fd3bc1bd39 Update state with server 2018-10-16 16:53:29 -07:00
Alex Dadgar bc905cc61d Define and thread through state updating interface 2018-10-16 16:53:29 -07:00
Michael Schurter 9a63d6103d tr: add validate task hook 2018-10-16 16:53:29 -07:00
Michael Schurter 7f4ec50906 missed locking around c.allocs access 2018-10-16 16:53:29 -07:00
Alex Dadgar c93cfc89c0 wip 2018-10-16 16:53:29 -07:00
Alex Dadgar 7ddc0eb65c Fix deadlock 2018-10-16 16:53:29 -07:00
Alex Dadgar 3779077052 Remove SetState from interface 2018-10-16 16:53:29 -07:00
Alex Dadgar e1ba73b515 compile 2018-10-16 16:53:29 -07:00
Michael Schurter 6ebdf532ea wip split event emitting and state transitions 2018-10-16 16:53:29 -07:00
Michael Schurter 516d641db0 client: implement all-or-nothing alloc restoration
Restoring calls NewAR -> Restore -> Run

NewAR now calls NewTR
AR.Restore calls TR.Restore
AR.Run calls TR.Run
2018-10-16 16:53:29 -07:00
Alex Dadgar e401c660e7 Implement lifecycle hooks on the task runner 2018-10-16 16:53:29 -07:00
Alex Dadgar 89b4ba9cc8 comments 2018-10-16 16:53:29 -07:00
Alex Dadgar 86e81947b4 Hook renames 2018-10-16 16:53:29 -07:00
Alex Dadgar 2599cf9d74 remove comment 2018-10-16 16:53:29 -07:00
Alex Dadgar 88aa0299a9 Template hook 2018-10-16 16:53:29 -07:00
Alex Dadgar c9765deff1 address comments 2018-10-16 16:53:29 -07:00
Alex Dadgar 80f6ce50c0 vault hook 2018-10-16 16:53:29 -07:00
Michael Schurter 30d377eba4 tr: improve skip log line 2018-10-16 16:53:29 -07:00
Michael Schurter ef213b864b tr: pass context to hooks 2018-10-16 16:53:29 -07:00
Michael Schurter 3a4f387fd3 tr: fix setting done in existing hooks 2018-10-16 16:53:29 -07:00
Michael Schurter b360f6f96e fix hclog level 2018-10-16 16:53:29 -07:00
Michael Schurter ae89b7da95 reimplement success state for tr hooks and state persistence
splits apart local and remote persistence

removes some locking *for now*
2018-10-16 16:53:29 -07:00
Michael Schurter 4f43ff5c51 pass statedb into allocrunnerv2 2018-10-16 16:53:29 -07:00
Michael Schurter 582c76a420 remove unused allocrunner shim 2018-10-16 16:53:29 -07:00
Michael Schurter c5504bd939 tr: cleanup main loop and shutdown hook impl 2018-10-16 16:53:29 -07:00
Michael Schurter 561260d6fe tr: skip error/success saving
All hooks only need to be run once.
Since only one hook can fail per run there's no need to
track errors on a per hook basis.
2018-10-16 16:53:29 -07:00
Michael Schurter 67874e761f tr: don't lock for immutable fields 2018-10-16 16:53:29 -07:00
Michael Schurter f473cd03d6 tr: start update/shutdown logic 2018-10-16 16:53:29 -07:00
Michael Schurter 637ef264ae Copy TR.Config vals to TR
I think I like this pattern better as some Config vals are mutable
(Alloc) and some aren't and some are used to derive other values and
never used directly.

Promoting them onto the TR struct is a little more work but is hopefully
more clear as to how each value is used.
2018-10-16 16:53:29 -07:00
Michael Schurter 0f7dcfdc9a example redis job "runs" on arv2! see below
Tons left to do and lots of churn:
1. No state saving
2. No shutdown or gc
3. Removed AR factory *for now*
4. Made all "Config" structs local to the package they configure
5. Added allocID to GC to avoid a lookup

Really hating how many things use *structs.Allocation. It's not bad
without state saving, but if AllocRunner starts updating its copy things
get racy fast.
2018-10-16 16:53:29 -07:00
Michael Schurter 9a6aa38b0f begin adding AllocRunner.Update 2018-10-16 16:53:29 -07:00
Michael Schurter eae54e2954 artifact task hook 2018-10-16 16:53:29 -07:00
Alex Dadgar b9bed81e6e Initial V2 alloc runner 2018-10-16 16:53:28 -07:00
Alex Dadgar a78cefec18 use int64 2018-10-16 15:34:32 -07:00
Preetha Appan 7c0d8c646c
Change CPU/Disk/MemoryMB to int everywhere in new resource structs 2018-10-16 16:21:42 -05:00
Christian Winther 0c5154100c
fix: increase log rotator line scan limit
In case where gelf/json logging is used, its fairly easy to exceed the 16k limit, resulting in json output being cut up into multiple strings

the result is invalid json lines which can create all kind of badness in the logging server

This fixes https://github.com/hashicorp/nomad/issues/4699

Signed-off-by: Christian Winther <jippignu@gmail.com>
2018-10-09 18:57:18 +02:00
Alex Dadgar 01f8e5b95f renames 2018-10-04 14:57:25 -07:00
Alex Dadgar 52f9cd7637 fixing tests 2018-10-04 14:26:19 -07:00
Alex Dadgar bac5cb1e8b Scheduler uses allocated resources 2018-10-02 17:08:25 -07:00
Alex Dadgar 5c8697667e Node reserved resources 2018-09-29 18:44:55 -07:00
Alex Dadgar 3183153315 Node resources on client 2018-09-29 17:23:41 -07:00
Alex Dadgar 9971b3393f yamux 2018-09-17 14:22:40 -07:00
Alex Dadgar ca28afa3b2 small fixes 2018-09-15 16:42:38 -07:00
Alex Dadgar 7739ef51ce agent + consul 2018-09-13 10:43:40 -07:00
Michael Schurter 08862fc177 fix race around error handling 2018-09-05 17:34:17 -07:00
Michael Schurter 6def5bc4f9 client: set host name when migrating over tls
Not setting the host name led the Go HTTP client to expect a certificate
with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad`
names ephemeral dir migrations were broken when TLS was enabled.

Added an e2e test to ensure this doesn't break again as it's very
difficult to test and the TLS configuration is very easy to get wrong.
2018-09-05 17:24:17 -07:00
Alex Dadgar c6576ddac1 Fix make check errors 2018-09-04 16:03:52 -07:00
Alex Dadgar 089b533047 Fix kill timeout exceeding 5m on Docker driver
Fixes an issue where the Docker API client would timeout before the kill
timeout was hit.
2018-08-17 16:01:09 -07:00
Alex Dadgar 49a1ba9297
Merge pull request #4535 from hashicorp/f-keep-docker-container-0.8.4
Option to prevent removal of container on exit
2018-07-26 11:11:22 -07:00
Charlie Voiselle f319a149cd Option to prevent removal of container on exit 2018-07-26 11:10:48 -07:00
Michael Schurter ddf948001e
Merge pull request #4462 from omame/omame/cpu_cfs_period
Add support for specifying cpu_cfs_period in the Docker driver
2018-07-25 09:34:38 -07:00
Daniele Valeriani b0a14caca2 Add test for cpu_cfs_period 2018-07-16 22:43:34 +02:00
Michael Schurter 91588cb861 rkt: revert to redis 3.2 to favor stability 2018-07-09 16:15:32 -07:00
Michael Schurter c56f899ee9 rkt: speed up tests
Disable networking when it's not needed and improve failure message for
UserGroup test by including the full ps output on failure.
2018-07-09 14:02:27 -07:00
Michael Schurter a1d4f77ce0 rkt: skip retrieving network information when net=none
Even when net=none we would attempt to retrieve network information from
rkt which would spew useless log lines such as:

```
testlog.go:30: 20:37:31.409209 [DEBUG] driver.rkt: failed getting network info for pod UUID 8303cfe6-0c10-4288-84f5-cb79ad6dbf1c attempt 2: no networks found. Sleeping for 970ms
```

It would also delay tests for ~60s during the network information retry
period.

So skip this when net=none. It's unlikely anyone actually uses net=none
outside of tests, so I doubt anyone will notice this change.

Official docs:
https://coreos.com/rkt/docs/latest/networking/overview.html#no-loopback-only-networking
2018-07-09 13:44:43 -07:00
Michael Schurter 0fbc84b81d tests: make alloc id consistent in helper
It worked, but the old code used a different alloc id for the path than
the actual alloc! Use the same alloc id everywhere to prevent confusing
test output.
2018-07-09 13:37:35 -07:00
Michael Schurter f3b8815c96 rkt: fix failing TestRktDriver_UserGroup test
Started failing due to the docker redis image switching from Debian
jessie to stretch:
53f8680550 (diff-acff46b161a3b7d6ed01ba79a032acc9)

Switched from Debian based image to Alpine to get a working `ps` command
again (albeit busybox's stripped down implementation)
2018-07-09 12:19:02 -07:00
Daniele Valeriani 748f6afd89 Validate the value of cpu_cfs_period 2018-07-02 22:30:22 +02:00
Daniele Valeriani 9364446a03 Remove an unnecessary conversion 2018-07-02 17:47:23 +02:00
Daniele Valeriani 906952a2c8 Add support for specifying cpu_cfs_period in the Docker driver 2018-07-02 16:37:04 +02:00
Preetha b567750824
Merge pull request #4392 from burdandrei/telemetry-parametrized-jobs
Parametrized/periodic jobs per child tagged metric emmision
2018-06-21 17:13:36 -05:00
Preetha 043f4c208b
Merge pull request #3882 from burdandrei/telemetry-add-node-class-tag
Added node class to tagged metrics
2018-06-21 17:04:35 -05:00
Andrei Burd 444ee45aff Parametrized/periodic jobs per child tagged metric emmision 2018-06-21 10:40:56 +03:00
James Rasell 75f95ccf09
Merge branch 'master' into f_gh_4381 2018-06-19 17:51:57 +02:00
Alex Dadgar b61051b3cd
Merge pull request #4409 from hashicorp/r-client-packages
Refactor client packages
2018-06-13 17:32:25 -07:00
Alex Dadgar 22757d964e lint 2018-06-13 16:06:39 -07:00
Alex Dadgar af558df94c Fix test using a lot of memory 2018-06-13 15:52:25 -07:00
Alex Dadgar 300b1a7a15 Tests only use testlog package logger 2018-06-13 15:40:56 -07:00
Chelsea Komlo 03075b603a
Merge pull request #4399 from hashicorp/r-reload-refactor
Refactor logic for dynamic reloading
2018-06-13 13:35:12 -04:00
Alex Dadgar 9bab9edf27 test fixes 2018-06-12 17:45:39 -07:00
Alex Dadgar 90c2108bfb Fix gc tests + parallel destroy + small test fixes 2018-06-12 10:23:45 -07:00
Alex Dadgar f5ff509fa5 Refactor - wip 2018-06-12 10:23:45 -07:00
Alex Dadgar ff2ab8f58e Fix vault template test 2018-06-12 09:57:28 -07:00
Alex Dadgar d0043691fb remove structs + bump version 2018-06-11 13:52:19 -07:00
Alex Dadgar af5753d2cd bump version + generated files 2018-06-11 13:39:42 -07:00
Nick Ethier f36eb14360
Merge pull request #4403 from hashicorp/b-fix-dispatched-optional-meta
Fix dispatched optional meta correctly
2018-06-11 16:17:14 -04:00
Nick Ethier e75e3ae665
nomad: use require pkg for tests 2018-06-11 13:50:50 -04:00
Nick Ethier 3aa6241b5c
client/driver/env: fix optional meta test 2018-06-11 12:29:13 -04:00
Nick Ethier c65882cafd
client/driver/env: use 'job.Dispatch' to trigger optional meta logic 2018-06-11 12:15:19 -04:00
Nick Ethier ccb5372813
Revert "Revert "client/driver/env: interpolate empty optional meta params as empty strings""
This reverts commit c17e0fc9dc5fd288935ab2b68fb441b4d25ac189.
2018-06-11 11:59:23 -04:00
Michael Schurter c198cfd8ea executor: fix log line formatting 2018-06-08 14:55:39 -07:00
Michael Schurter d1a60e700e executor: fix Windows blocking on pipe close
Sending the Ctrl-Break signal to PowerShell <6 causes it to drop into
debug mode. Closing its output pipe at that point will block
indefinitely and prevent the process from being killed by Nomad.

See the upstream powershell issue for details:
https://github.com/PowerShell/PowerShell/issues/4254
2018-06-08 14:48:05 -07:00
Chelsea Holland Komlo f74e74b22d add client logic to determine whether TLS RPC connections should reload 2018-06-08 14:38:58 -04:00
James Rasell b9009c419c
Add 'nomad.advertise.address' to client meta via NomadFingerPrint
This change removes the addition of the advertise address to the
exported task env vars and instead moves this work into the
NomadFingerprint.Fingerprint which adds this value to the client
attrs. This can then be used within a Nomad job like
${attr.nomad.advertise.address}.
2018-06-08 09:44:10 +02:00
Alex Dadgar d9b35fab52 Revert "client/driver/env: interpolate empty optional meta params as empty strings"
This reverts commit 84926f759a63a90be7bbcf0fad78deb3f02af23d.
2018-06-07 16:27:47 -07:00
Nick Ethier b3c767fae0
client/driver: drop docker pull progress estimate if its < 0 2018-06-07 15:23:31 -04:00
James Rasell 367a8b5152
Add the local clients advertise address to interpolation env vars
This commit adds the Nomad local client advertise address in the
form host:port to the environment variables passed to each task.
2018-06-07 09:45:15 +02:00
Alex Dadgar 98705824ed
Merge pull request #4185 from jesusvazquez/add-counter-metric-for-oom-killer-events
Add driver.docker counter metric for OOM Killer events
2018-06-04 15:12:51 -07:00
Alex Dadgar 23cd56dc78 remove generated structs 2018-06-01 16:11:28 -07:00
Alex Dadgar bf5b5747ab fix test message 2018-06-01 15:51:54 -07:00
Alex Dadgar 3e3d3c7445 Disable Exec on non-linux platforms
This PR disables exec on non-linux platforms
2018-06-01 15:48:14 -07:00
Alex Dadgar c0386819b3 bump version/lint/generated files 2018-06-01 15:23:10 -07:00
Preetha Appan ce6d4a8d7a
Fix tests and move isClient to constructor 2018-06-01 15:59:53 -05:00
Alex Dadgar a62dd2aadb
Merge pull request #4350 from hashicorp/b-raw-exec-cgroups
Raw exec can use cgroups to manage PIDs
2018-06-01 17:37:49 +00:00
Alex Dadgar 8da42940c9 wait for result 2018-06-01 10:14:53 -07:00
Alex Dadgar 40fec81315
Merge pull request #4277 from hashicorp/f-retry-join-clients
Add go-discover support to Nomad clients
2018-06-01 16:57:40 +00:00
Alex Dadgar 460ecb8705 Comments 2018-05-31 18:05:03 -07:00
Alex Dadgar de98774f2c Add test and docs 2018-05-31 18:05:03 -07:00
Alex Dadgar ff28b04c46 Use more appropriate name than cgroup 2018-05-31 18:05:03 -07:00
Alex Dadgar 37e900b1d3 Only use freezer/devices when in the basic cgroup only 2018-05-31 18:05:03 -07:00
Alex Dadgar ffd9270f2f Use cgroup when possible 2018-05-31 18:05:03 -07:00
Alex Dadgar 0ff0ed290d Fix TestDockerDriver_StartNVersions 2018-05-31 17:14:59 -07:00
Alex Dadgar 7e6dd498c9 Remove debug logging 2018-05-31 15:52:42 -07:00
Alex Dadgar b1b908527f spelling 2018-05-31 15:29:55 -07:00
Alex Dadgar a3b29553a5 Force close stdout/stderr after grace
This commit changes the force closing of the stdout/stderr file
descriptor from closing immediately to being closed after a grace
period. This allows the created process to close its own file and allows
copying of the data.
2018-05-31 15:21:36 -07:00
Alex Dadgar 5e787e2d72 test build 2018-05-31 12:22:31 -07:00
Alex Dadgar ead1b7f423 Log more info for TestExecutor_IsolationAndConstraints 2018-05-31 11:57:44 -07:00
Alex Dadgar b05740ad13
Merge pull request #4341 from hashicorp/f-docker-pids
Support Docker Pids Limit
2018-05-31 17:59:29 +00:00
Chelsea Holland Komlo 064b5481e0 add server join info to server and client 2018-05-31 10:50:03 -07:00
Alex Dadgar f4d4bbdc97 test pid limit 2018-05-30 12:55:24 -07:00
Chelsea Holland Komlo 94d510e969 Support Docker Pids Limit 2018-05-25 19:54:14 -04:00
Alex Dadgar 1685c8ebe4 cleanup 2018-05-24 16:25:20 -07:00
Alex Dadgar 2eacdb6bd6 Force closing of pipe to child process 2018-05-24 16:03:48 -07:00
Chelsea Holland Komlo 38f611a7f2 refactor NewTLSConfiguration to pass in verifyIncoming/verifyOutgoing
add missing fields to TLS merge method
2018-05-23 18:35:30 -04:00
Preetha 9084bb025e
Merge pull request #4303 from hashicorp/b-docker-client-nil-panic
Add nil check before setting timeout on docker client
2018-05-21 19:34:44 -07:00
Jesus Vazquez 23d959e42c Add job, task, taskgroup to open method 2018-05-21 20:37:18 +02:00
Jesus Vazquez 0a062a04c7 Remove allocID from dockerhandle struct 2018-05-21 20:33:01 +02:00
Jesus Vazquez e5a81815bb Rename labels job, task_group and task 2018-05-21 20:32:50 +02:00
Jesus Vazquez ffe1b1a1b6 Remove allocid label from driver.docker.oom counter metric 2018-05-21 20:30:56 +02:00
Alex Dadgar 38762d9bde
Merge pull request #4282 from hashicorp/f-rotator
Avoid splitting log line across two files
2018-05-21 17:52:13 +00:00
Alex Dadgar d95698e2c5
Merge pull request #4298 from justenwalker/docker-driver-digest-tags
driver/docker: pull image with digest
2018-05-21 17:46:14 +00:00
Nick Ethier 6392009dd6
client/driver: use correct repo address when using docker-credential helper (#4266) 2018-05-15 17:39:48 -04:00
Justen Walker a8989f33bb driver/docker: add test for dockerImageRef 2018-05-14 14:24:03 -04:00
Justen Walker 194b2231d6 driver/docker: fix up TestParseDockerImage 2018-05-14 14:23:48 -04:00
Justen Walker 25b2807ce3 driver/docker: fix TestDockerDriver_ForcePull_RepoDigest 2018-05-14 14:23:02 -04:00
Nick Ethier c4d07a2200
client/driver: gaurd authHelper test from running on windows 2018-05-14 13:46:57 -04:00
Justen Walker b23ca7574c driver/docker: cleanup parseDockerImage 2018-05-14 11:11:51 -04:00
Justen Walker 60f7f1aa08 driver/docker: pull image with digest
GH #4290

Add digest support to the docker driver image config. This commit
factors out some common code to print the repo:tag (dockerImageRef) for
events/logs as well as parsing the image to retreive the repo,tag
(parseDockerImage) so that the results are consistent/sane for both
repo:tag and repo@sha256:... references.

When pulling an image with a digest, the tag is blank and the repo
contains the digest. See:
https://github.com/fsouza/go-dockerclient/blob/master/image_test.go#L471
2018-05-14 10:42:58 -04:00
Preetha Appan de66ec7394
Add nil check before setting timeout on docker client 2018-05-11 17:09:26 -05:00
Alex Dadgar 7ad5c76734 Add new line test 2018-05-11 10:52:09 -07:00
Alex Dadgar 3671ed139d Avoid splitting log line across two files
We attempt to avoid splitting a log line between two files by detecting
if we are near the file size limit and scanning for new lines and only
flushing those.

BenchmarkRotator/1KB-8            300000              5613 ns/op
BenchmarkRotator/2KB-8            200000              8384 ns/op
BenchmarkRotator/4KB-8            100000             14604 ns/op
BenchmarkRotator/8KB-8             50000             25002 ns/op
BenchmarkRotator/16KB-8            30000             47572 ns/op
BenchmarkRotator/32KB-8            20000             92080 ns/op
BenchmarkRotator/64KB-8            10000            165883 ns/op
BenchmarkRotator/128KB-8            5000            294405 ns/op
BenchmarkRotator/256KB-8            2000            572374 ns/op
2018-05-10 15:11:01 -07:00
Alex Dadgar f5d91b5338 Benchmark for rotator
BenchmarkRotator/1KB-8            200000              5572 ns/op
BenchmarkRotator/2KB-8            200000              8338 ns/op
BenchmarkRotator/4KB-8            100000             14246 ns/op
BenchmarkRotator/8KB-8             50000             25279 ns/op
BenchmarkRotator/16KB-8            30000             48602 ns/op
BenchmarkRotator/32KB-8            20000             92159 ns/op
BenchmarkRotator/64KB-8            10000            154766 ns/op
BenchmarkRotator/128KB-8            5000            296872 ns/op
BenchmarkRotator/256KB-8            3000            551793 ns/op
2018-05-10 14:15:15 -07:00
Nick Ethier 91603a377e
client/driver: parse repo instead of attempting to pull repo info 2018-05-09 22:34:25 -04:00
Nick Ethier 38a33f9c75
client/driver: add test for docker auth helper 2018-05-09 22:33:56 -04:00
Alex Dadgar e067a9ae06 naming of constants 2018-05-09 16:46:52 -07:00
Chelsea Holland Komlo 796bae6f1b allow configurable cipher suites
disallow 3DES and RC4 ciphers

add documentation for tls_cipher_suites
2018-05-09 17:15:31 -04:00
Alex Dadgar 0e79e1a46e Keep stream and logs in sync for detecting closed pipe 2018-05-09 11:22:52 -07:00
Preetha e7ae6e98d9
Merge pull request #4259 from hashicorp/f-deployment-improvements 2018-05-08 16:37:10 -05:00
Nick Ethier 3598925ca4
client/driver: use correct repo address when using docker-credential helper 2018-05-08 15:17:28 -04:00
Nick Ethier 54c86a0292
client/driver/env: interpolate empty optional meta params as empty strings 2018-05-07 20:19:51 -04:00
Nick Ethier 016ab7a105
client/driver: remove unused const 'dockerPullProgressEmitInterval' 2018-05-07 16:24:48 -04:00
Michael Schurter f1d13683e6
consul: remove services with/without canary tags
Guard against Canary being set to false at the same time as an
allocation is being stopped: this could cause RemoveTask to be called
with the wrong Canary value and leaking a service.

Deleting both Canary values is the safest route.
2018-05-07 14:55:01 -05:00
Michael Schurter 50e04c976e
consul: support canary tags for services
Also refactor Consul ServiceClient to take a struct instead of a massive
set of arguments. Meant updating a lot of code but it should be far
easier to extend in the future as you will only need to update a single
struct instead of every single call site.

Adds an e2e test for canary tags.
2018-05-07 14:55:01 -05:00
Alex Dadgar df8fce4347
Ensure canaries tags are interpolated 2018-05-07 14:50:01 -05:00
Alex Dadgar 552604451c
rework where time gets set 2018-05-07 14:50:01 -05:00
Alex Dadgar ee50789c22
Initial implementation 2018-05-07 14:50:01 -05:00
Nick Ethier d8de354dbf
client/driver: add waiting layer status count to pull progress status msg 2018-05-07 12:18:20 -04:00
Nick Ethier 77af17efbc
client/driver: add seperate handler for emitting pull progress 2018-05-07 12:17:34 -04:00
Nick Ethier 0bdd976b7d
client/driver: remove pull timeout due to race condition that can lead to unexpected timeouts
If two jobs are pulling the same image simultaneously, which ever starts the pull first will set the pull timeout.
This can lead to a poor UX where the first job requested a short timeout while the second job requested a longer timeout
causing the pull to potentially timeout much sooner than expected by the second job.
2018-05-07 12:18:11 -04:00
Nick Ethier 7c5821d7c6
client/driver: do accounting on layer pull progress 2018-05-07 12:17:53 -04:00
Nick Ethier 8efda7dc6c
client/driver: emit progress to all allocs pulling same image 2018-05-07 12:17:34 -04:00
Nick Ethier e35948ab91
client/driver: add image pull progress monitoring 2018-05-07 12:17:38 -04:00
Michael Schurter 0d534d30d6
Merge pull request #4251 from hashicorp/f-grpc-checks
Support Consul gRPC Health Checks
2018-05-04 14:55:16 -07:00
Michael Schurter f6a4713141 consul: make grpc checks more like http checks 2018-05-04 11:08:11 -07:00
Michael Schurter 382caec1e1 consul: initial grpc implementation
Needs to be more like http.
2018-05-04 11:08:11 -07:00
Jesus Vazquez 08a390448b Update counter driver.docker.oom labels 2018-05-04 14:02:34 +08:00
Jesus Vazquez 4f6db56283 Initialize dockerhandle with jobname, taskgroupname, taskname and allocid 2018-05-04 14:02:19 +08:00
Jesus Vazquez 127b764dfb Add Job, taskgroupname, taskname, and allocid to the DockerHandle struct 2018-05-04 14:01:26 +08:00
Jesus Vazquez fd1ff1a0cf Run goimports 2018-05-04 13:46:36 +08:00
Jesus Vazquez 5dd4059527 Add driver.docker counter metric for OOM Killer events 2018-05-04 13:46:36 +08:00
Michael Schurter 526af6a246 framer: fix early exit/truncation in framer 2018-05-02 10:46:16 -07:00
Michael Schurter f1a6aa103a framer: fix race and remove unused error var
In the old code `sending` in the `send()` method shared the Data slice's
underlying backing array with its caller. Clearing StreamFrame.Data
didn't break the reference from the sent frame to the StreamFramer's
data slice.
2018-05-02 10:46:16 -07:00
Michael Schurter 7360fe3a6d client: squelch errors on cleanly closed pipes 2018-05-02 10:46:16 -07:00
Michael Schurter ffff97e25f client: don't spin on read errors 2018-05-02 10:46:16 -07:00
Michael Schurter 5ef0a82e6e client: reset encoders between uses
According to go/codec's docs, Reset(...) should be called on
Decoders/Encoders before reuse:

https://godoc.org/github.com/ugorji/go/codec

I could find no evidence that *not* calling Reset() caused bugs, but
might as well do what the docs say?
2018-05-02 10:46:16 -07:00
Alex Dadgar de4af37249 version bump and remove generated 2018-04-27 11:10:00 -07:00
Alex Dadgar 845a43864a generated files 2018-04-27 10:45:40 -07:00
Alex Dadgar 35e06ddb31 Remove generated and version bump 2018-04-26 16:49:19 -07:00
Alex Dadgar 43192cefae generated files 2018-04-26 16:28:58 -07:00
Michael Schurter 0e602d4779
Merge pull request #4188 from hashicorp/f-rkt-stats
rkt: create parent cgroup to enable stats
2018-04-24 14:54:36 -07:00
Michael Schurter d687761ebf rkt: test Stats() and always run tests
Remove the NOMAD_TEST_RKT flag as a guard for rkt tests. Still require
Linux, root, and rkt to be installed. Only check for rkt installation
once in hopes of speeding up rkt tests a bit.
2018-04-24 11:05:42 -07:00
Javier Palomo Almena 3e6c01ffa1 docker tests: Fix usage of NewDriverContext 2018-04-23 22:51:06 +02:00
Javier Palomo Almena 74d3c5df07 DriverContext: Add the TaskGroup and the Job name
Adding this fields to the DriverContext object, will allow us to pass
them to the drivers.

An use case for this, will be to emit tagged metrics in the drivers,
which contain all relevant information:
- Job
- TaskGroup
- Task
- ...

Ref: https://github.com/hashicorp/nomad/pull/4185
2018-04-23 00:15:29 +02:00
Michael Schurter 4cee6cca6c rkt: create parent cgroup to enable stats
Having the Nomad executor create parent cgroups that rkt is launched
within allows the stats collection code used for the exec driver to Just
Work. The only downside is that now the Nomad executor's resource
utilization counts against the cgroups resource limits just as it does
for the exec driver.
2018-04-19 15:14:56 -07:00
Michael Schurter 1a85d0c990 run goimports 2018-04-19 11:16:28 -07:00
Michael Schurter d77c265d1f
Merge pull request #4168 from ninoles/b-2117-windows-group-process
B 2117 windows group process
2018-04-19 11:10:51 -07:00
Michael Schurter fdbcbd4e5b
Merge pull request #4058 from hashicorp/f-mock-by-default
[Post-0.8] test: build with mock_driver by default
2018-04-18 15:57:00 -07:00
Michael Schurter d3650fb2cd test: build with mock_driver by default
`make release` and `make prerelease` set a `release` tag to disable
enabling the `mock_driver`
2018-04-18 14:45:33 -07:00
Michael Schurter a991923389 tests: fix race in alloc_runner_test.go
I could not reproduce the failure locally even with `stress -cpu ...`
eating all the cpu it could on my machine.

But I think the race was in one of two places:
* The task could restart which could create new events
* I think there could be a race between the updater's version of events
  and alloc runners as updates are async

I fixed both. Here's hoping that fixes this flaky test.
2018-04-17 17:14:59 -07:00
Fabien Ninoles c81bec48c9 Merge branch 'master' into b-2117-windows-group-process 2018-04-17 13:47:25 -04:00
Fabien Ninoles 35cf641416 Update based on PR request. 2018-04-17 13:43:04 -04:00
Alex Dadgar c4ad76091d
Merge pull request #4166 from hashicorp/b-panic-fix-update
Fixes races accessing node and updating it during fingerprinting
2018-04-17 10:02:19 -07:00
Chelsea Holland Komlo 9b8a079558 fix up comments 2018-04-17 11:53:08 -04:00
Alex Dadgar 9d612c8cb0 Cleanup 2018-04-16 15:48:34 -07:00
Alex Dadgar 32adaf9dfc Copy the config given to the alloc runner 2018-04-16 15:45:52 -07:00
Alex Dadgar 3ff2d4d795 fix race node access 2018-04-16 15:45:51 -07:00
Alex Dadgar 4f2a7b6949 Fix copying drivers 2018-04-16 15:45:51 -07:00
Alex Dadgar 0b799822ff Operate on copy 2018-04-16 15:45:49 -07:00
Fabien Ninoles 27cf4995ce - Clean up for windows compilation.
- Set CREATE_NEW_PROCESS_GROUP for Windows subprocess.
- Ensure we only kill actual process that need to.
2018-04-14 13:58:42 -04:00
Michael Schurter 3836b8a335
Merge pull request #3572 from emate/master
Create new process group on process startup.
2018-04-13 11:56:38 -07:00
Alex Dadgar adaf4fa7e0 Remove generated structs 2018-04-12 16:35:31 -07:00
Alex Dadgar 663c4d0433 Version bump and generated files 2018-04-12 16:21:50 -07:00
Alex Dadgar ff1a1a63e8 Move where attribute for driver detection is set 2018-04-12 15:50:25 -07:00
Chelsea Holland Komlo 5291788b40 delete driver name from only health check attributes 2018-04-12 18:24:41 -04:00
Alex Dadgar 3d53d380f7 Fix tests 2018-04-12 14:29:30 -07:00
Alex Dadgar f24ce2c50c Driver health detection cleanups
This PR does:

1. Health message based on detection has format "Driver XXX detected"
and "Driver XXX not detected"
2. Set initial health description based on detection status and don't
wait for the first health check.
3. Combine updating attributes on the node, fingerprint and health
checking update for drivers into a single call back.
4. Condensed driver info in `node status` only shows detected drivers
and make the output less wide by removing spaces.
2018-04-12 12:46:40 -07:00
Charlie Voiselle ba88f00ccb Changed "til" to "until"
Should be "till" or "until"; chose "until" because it is unambiguous as to meaning.
2018-04-11 12:36:28 -05:00
Andrei Burd 502d17fa90 Added node class to tagged metrics 2018-04-11 12:20:59 +03:00
Chelsea Komlo eb5aac16e6
Merge pull request #4111 from hashicorp/b-undetected-set-health-to-false
Immediately set driver health status to false when driver moves to undetected
2018-04-10 18:30:31 -04:00
Chelsea Holland Komlo d58b3e473c update comment for when the fingerprinter setting health status 2018-04-10 16:53:00 -04:00
Chelsea Holland Komlo f7ef13cc64 fingerprinter should set health check status if health check is not periodic 2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo ede4f518bd add setters for access to the fingerprint manager's node
refactor extracting driver info
2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo f479da19f5 guard against overwriting health status 2018-04-10 15:29:51 -04:00
Chelsea Holland Komlo ece1618815 immediately set healthy to false when driver moves to undetected 2018-04-10 15:29:51 -04:00
Alex Dadgar 3d367d6fd7 Fix client uptime metric missing client prefix 2018-04-10 10:39:36 -07:00
Seth Vargo df4fe7e76c
Set user-agent when talking to GCE metadata 2018-04-10 10:36:46 -04:00
Chelsea Komlo d3bd8fb96e
Merge pull request #4109 from hashicorp/f-shorten-docker-health-timeout
Shorten docker health timeout
2018-04-09 15:38:39 -04:00
Chelsea Holland Komlo ea4b65dd41 only initialize docker clients if they are nil 2018-04-09 14:13:07 -04:00
Chelsea Holland Komlo 288c7a33a1 refacotoring simplification from code review 2018-04-09 10:34:17 -04:00
Chelsea Holland Komlo 6e3b056c37 only run health check if driver moves from undetected to detected 2018-04-09 10:10:43 -04:00
Alex Dadgar ae1f76477e Start rebalance after discovering new servers 2018-04-05 15:41:59 -07:00
Alex Dadgar 929b6823a3
Merge pull request #4106 from hashicorp/b-servers
Improved Client handling of failed RPCs
2018-04-05 13:48:50 -07:00
Alex Dadgar be2513e0f9 more jitter 2018-04-05 13:48:33 -07:00
Chelsea Holland Komlo d3637825ef group similar functions; update comments
health check timeout should be 1 minute
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo e8743f1f7b remove do once block when creating a new docker client
only set cached connections upon no error
2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo d0d793fc23 use client with shorter timeouts for health checks 2018-04-05 16:19:02 -04:00
Chelsea Holland Komlo 5d1b2b77cb refactor docker clients method to be able to extend to creating new clients 2018-04-05 16:19:02 -04:00
Alex Dadgar bd3345942c Handle no leader and faster retries near limit
Handle the ErrNoLeader case and apply slower retries. Also when we have
missed the heartbeat retry aggressively, backing off after we have
missed for more than 30 seconds.
2018-04-05 11:22:47 -07:00
Alex Dadgar 279b5c22e5 Scale heartbeat retrying based on remaining heartbeat time 2018-04-05 10:58:13 -07:00
Alex Dadgar 7941f4eb2d Fire retry only when consul discovers new servers 2018-04-05 10:40:17 -07:00
Preetha 6254d75eee
Merge pull request #4101 from hashicorp/b-rescheduling-edge-fixes
Fixes edge cases around timing/ task finish time being set more than once
2018-04-04 16:18:21 -05:00
Preetha Appan 12ba4c45da
remove outdated commented out test code 2018-04-04 15:03:24 -05:00
Preetha Appan 6363a6fb4d
Remove old comment 2018-04-04 15:01:48 -05:00
Preetha Appan 5e4525bd30
Moves setting finishedAt to the right place and adds two unit tests. 2018-04-04 14:38:15 -05:00
Alex Dadgar 86c32358d4
Spelling error 2018-04-03 18:30:01 -07:00
Alex Dadgar 01a6beafbf RPC Retry Watcher 2018-04-03 18:05:28 -07:00
Preetha Appan e6bbce3fa0
Add comment 2018-04-03 19:49:03 -05:00
Alex Dadgar ec844f19d9 randomize servers 2018-04-03 17:46:13 -07:00
Preetha Appan 00537c739b
Fixes edge cases around timing and task finish time being set more than once 2018-04-03 16:34:59 -05:00
Alex Dadgar 58a3ec3fb2 Improve Vault error handling 2018-04-03 14:29:22 -07:00
Alex Dadgar 86f9044676 remove generated files 2018-03-30 16:52:49 -07:00
Alex Dadgar af81349dbe Generated files 2018-03-30 16:14:40 -07:00
Michael Schurter 257ba5937d test: don't rely on alloc runner update count
We were incorrectly relying on the count of alloc updates in a number of
tests. Since alloc updates are async, their number is non-determinstic
and largely meaningless.

This should fix quite a few flaky tests in Travis and prevent future
mistaken assumptions in tests.
2018-03-30 09:34:33 -07:00
Michael Schurter 62e9553333
Merge pull request #4069 from hashicorp/f-hashealth
add HasHealth helper for nil checks
2018-03-29 17:03:20 -07:00
Alex Dadgar beee130a6e Always capture the finish time 2018-03-29 11:27:22 -07:00
Michael Schurter 91b5bb58d9 add HasHealth helper for nil checks
We performed the DeploymentStatus nil checks a couple different ways, so
hopefully this helper will consoldiate them and make it more clear what
the code is doing.
2018-03-29 09:29:19 -07:00
Chelsea Komlo 4338360da9
Merge pull request #4065 from hashicorp/emit-node-event-on-first-health-change
Emit first node event after initialization on health status change
2018-03-29 11:23:25 -04:00
Chelsea Holland Komlo 2174ede6b9 add clarifying comment 2018-03-29 10:58:39 -04:00
Michael Schurter 3a79c32677
Merge pull request #4059 from hashicorp/b-drain-health-svc-only
only service allocs should have health watched
2018-03-28 16:49:22 -07:00
Michael Schurter 5eb0cb7176 only service allocs should have health watched 2018-03-28 16:20:11 -07:00
Chelsea Holland Komlo e3319afee1 emit first node event 2018-03-28 17:26:53 -04:00
Chelsea Komlo 7812ac5abf
Merge pull request #4057 from hashicorp/specify-docker-msg
Specify docker name in driver health messages
2018-03-28 13:32:36 -04:00
Preetha 177d2d6010
Merge pull request #4052 from hashicorp/f-specify-total-memory
Allow to specify total memory on agent configuration
2018-03-28 12:28:41 -05:00
Chelsea Holland Komlo efc03e252c specify driver health messages 2018-03-28 11:35:21 -04:00
Preetha Appan 329428b49f
Code review feedback and unit test 2018-03-28 10:07:15 -05:00
Charlie Voiselle ea10588227 rkt: logging enhancements (#4044)
* Added extra debug logging; extended timeout; added jitter.

* small log changes

* increase timeout

* remove unneccessary uuid
2018-03-27 17:30:06 -07:00
Michael Schurter fcaee471a0 client: always mark exited sys/svc allocs as failed
When restarts.attempts=0 was set in a jobspec a system or service alloc
that exited with 0 status would be marked as `completed` instead of
`failed`. Since system and service jobs are intended to run until
stopped or updated, they should always be marked as failed when they
exit even in cases where the exit code is 0.
2018-03-27 14:30:19 -07:00
Mildred Ki'Lya 1017cbe8ab
Allow to specify total memory on agent configuration
Allow to set the total memory of an agent in its configuration file. This
can be used in case the automatic detection doesn't work or in specific
environments when memory overcommit (using swap for example) can be
desirable.
2018-03-27 15:46:18 -05:00
Chelsea Holland Komlo 003bc209b9 use time.Time for node events for compatibility 2018-03-27 15:43:57 -04:00
Alex Dadgar 432784dae3 Fix alloc watcher snapshot streaming 2018-03-27 11:14:53 -07:00
Alex Dadgar 05449fea09 drop stats fetching log 2018-03-23 12:01:50 -07:00
Chelsea Komlo 5f0c382021
Merge pull request #4030 from hashicorp/health-check-ux
UX improvments to driver health checks
2018-03-23 09:46:50 -04:00
Alex Dadgar da27fc3880 Driver Info output 2018-03-22 17:18:32 -07:00
Chelsea Holland Komlo e9005d8cfb ux improvments to driver health checks 2018-03-22 18:38:29 -04:00
Michael Schurter a318684738
Merge pull request #4022 from hashicorp/f-more-executor-logging
executor: increase level for helpful log lines
2018-03-22 15:21:20 -07:00
Michael Schurter a4f346abeb remove spurious TODOs and FIXMEs 2018-03-21 16:55:22 -07:00
Michael Schurter 8b346c6176 test: try to prevent flakiness on travis 2018-03-21 16:51:45 -07:00
Michael Schurter 1b7ac447e9 alloc_runner: watch health for deployed batch jobs 2018-03-21 16:51:45 -07:00
Michael Schurter 62960ed7bd client: don't monitor health of non-service jobs
Also fix system job draining; won't work without deadline fixes
2018-03-21 16:51:44 -07:00
Alex Dadgar a37329189a Improve DeadlineTime helper 2018-03-21 16:51:44 -07:00
Alex Dadgar db4a634072 RPC, FSM, State Store for marking DesiredTransistion
fix build tag
2018-03-21 16:49:48 -07:00
Michael Schurter bb0ff44fb4 mock_driver: improve Kill() logging 2018-03-21 16:49:48 -07:00
Michael Schurter c0542474db drain: initial drainv2 structs and impl 2018-03-21 16:49:48 -07:00
Chelsea Holland Komlo f329e45e03 always set initial health status for every driver 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo bbaffe3eca set driver to unhealthy once if it cannot be detected in periodic check 2018-03-21 15:15:26 -04:00
Alex Dadgar 5df4b3728d Docker driver doesn't return errors but injects into the DriverInfo 2018-03-21 15:15:26 -04:00
Alex Dadgar 4365bb7f59 Only run health check if driver is detected 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo f801709a0a fix issue when updating node events 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 285729aee2 function rename and re-arrange functions in fingerprint_manager 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 60f12d206f improve comments; update watchDriver 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 739784736a remove unused function 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo d92703617c simplify logic
bump log level
2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 86b7b3d2d9 fix up health check logic comparison; add node events to client driver checks 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 53a5bc2bb3 Code review feedback 2018-03-21 15:15:26 -04:00
Alex Dadgar 34dc58421c notes from walk through 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 44b6951dda improve tests 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo d740a6a46e refresh driver information for non-health checking drivers periodically 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo d8f68e5ef8 fix up codereview feedback 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo d5f6c940c4 fix up racy tests 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo 0425be8f48 updating comments; locking concurrent node access 2018-03-21 15:15:26 -04:00
Chelsea Holland Komlo c50d02ae93 go style; update comments 2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo 3aa726baab fix scheduler driver name; create node structs file 2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo 3cba95e8a7 allow nomad to schedule based on the status of a client driver health check
Slight updates for go style
2018-03-21 15:15:25 -04:00
Chelsea Holland Komlo 0bde357731 add concept of health checks to fingerprinters and nodes
fix up feedback from code review

add driver info for all drivers to node
2018-03-21 15:15:25 -04:00
Michael Schurter 1022170bf3 executor: increase level for helpful log lines
Should help with debugging issues like #3971
2018-03-21 11:53:58 -07:00
Marcin Matlaszek 6019a88824
Make raw_exec processes cleanup function more precise. 2018-03-20 13:40:21 +01:00
Marcin Matlaszek bb36c122e2
Fix errors when trying to kill whole process group. 2018-03-20 13:40:21 +01:00
Marcin Matlaszek 86d650d7b0
Make starting & cleaning process group Windows compatible. 2018-03-20 13:40:21 +01:00
Marcin Matlaszek 79c139f2ef
Create new process group on process startup.
Clean up by sending SIGKILL to the whole process group.
2018-03-20 13:40:21 +01:00
Michael Schurter 1044bc0feb
Merge pull request #3984 from hashicorp/f-loosen-consul-skipverify
Replace Consul TLSSkipVerify handling
2018-03-16 11:21:28 -07:00
Michael Schurter 32ee5e0d53
Merge pull request #3990 from hashicorp/f-rkt-groups
rkt: allow specifying --group
2018-03-16 11:19:53 -07:00
Michael Schurter bd78cfb039 rkt: allow specifying --group 2018-03-16 11:08:22 -07:00
Michael Schurter fb10ec9c01 docker: make volume errors recoverable
The interface+mock just to test this one little error handling may seem
like overkill but there was just no other way to write an automated test
around this logic as there's no way to simluate this error with stock
Docker.
2018-03-15 17:52:43 -07:00
Michael Schurter 0971114f0c Replace Consul TLSSkipVerify handling
Instead of checking Consul's version on startup to see if it supports
TLSSkipVerify, assume that it does and only log in the job service
handler if we discover Consul does not support TLSSkipVerify.

The old code would break TLSSkipVerify support if Nomad started before
Consul (such as on system boot) as TLSSkipVerify would default to false
if Consul wasn't running. Since TLSSkipVerify has been supported since
Consul 0.7.2, it's safe to relax our handling.
2018-03-14 17:43:06 -07:00
Preetha Appan 3c38eededd
Fix spelling in comment 2018-03-14 15:54:25 -05:00
Alex Dadgar bef4a8ee09 fix clearing node events 2018-03-14 09:48:59 -07:00
Chelsea Komlo 810eedfa2a
Merge pull request #3945 from hashicorp/f-add-node-events
Add node events
2018-03-14 08:42:55 -04:00
Preetha 360d6e5a92
Merge pull request #3968 from hashicorp/f-nicer-vault-error
Make server side error messages from vault more clearer
2018-03-13 20:49:39 -05:00
Alex Dadgar de6ebb6e6c small cleanup 2018-03-13 18:08:22 -07:00
Chelsea Holland Komlo b41501e442 code review feedback 2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo 1488b076d1 code review feedback 2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo a8655320fd fix up go check warnings 2018-03-13 18:08:21 -07:00
Chelsea Holland Komlo 0934769b04 add client side emitting of node events
Changelog
2018-03-13 18:08:21 -07:00
Preetha Appan 914eaed64f
Address some code review comments 2018-03-13 18:19:16 -05:00
Preetha Appan 09c231ce43
Return the err from server correctly 2018-03-13 18:10:14 -05:00
Preetha Appan 9618f52746
Remove error wrapping and make vault connection server side errors clearer. 2018-03-13 17:09:03 -05:00
Michael Schurter 79df90acb0
Merge pull request #3958 from simplesurance/swappiness
fix: disable swap for executor_linux allocations
2018-03-13 10:10:22 -07:00
Fabian Holler e6af051c93 fix: disable swap for executor_linux allocations
A comment in the nomad source code states that swapping for
executor_linux allocations is disabled but it wasn't.

Nomad wrote -1 to the memsw.limit_in_bytes cgroup file to disable
swapping.
This has the following problems:

1.) Writing -1 to the file does not disable swapping. It sets
    the limit for memory and swap to unlimited.
2.) On common Linux distributions like Ubuntu 16.04 LTS the
    memsw.limit_in_bytes cgroup file does not exist by default.
    The memsw.limit_in_bytes file only exist if the Linux kernel is
    build with CONFIG_MEMCG_SWAP=yes and either
    CONFIG_MEMCG_SWAP_ENABLED=yes or when the kernel parameter
    swapaccount=1 is passed during boot.
    Most Linux distributions disable swap accounting by default because
    of higher memory usage.
    Nomad silently ignores if writing to the memsw.limit_in_bytes file
    fails. The allocation succeeds, no message is logged to notify the
    user.

To ensure that disabling swap works on common Linux kernels, disable
swapping by writing 0 to the memory.swappiness file.
Using the memory.swappiness file only requires that the kernel is
compiled with CONFIG_MEMCG=yes. This is the default in common Linux
kernels.
2018-03-13 10:52:50 +01:00
Alex Dadgar 4844317cc2
Merge pull request #3890 from hashicorp/b-heartbeat
Heartbeat improvements and handling failures during establishing leadership
2018-03-12 14:41:59 -07:00
Michael Schurter 7dd7fbcda2 non-Existent -> nonexistent
Reverting from #3963

https://www.merriam-webster.com/dictionary/existent
2018-03-12 11:59:33 -07:00
Josh Soref 18c5659474 spelling: version 2018-03-11 19:13:25 +00:00
Josh Soref 6222bd564e spelling: verify 2018-03-11 19:13:32 +00:00
Josh Soref 1359fd2c3d spelling: unexpected 2018-03-11 19:08:07 +00:00
Josh Soref 173ce63fe9 spelling: transition 2018-03-11 19:06:05 +00:00
Josh Soref 782c704de6 spelling: thresholds 2018-03-11 19:03:47 +00:00
Josh Soref ac6d3767da spelling: terminated 2018-03-11 19:01:49 +00:00
Josh Soref 2dda6abab9 spelling: templates 2018-03-11 19:01:39 +00:00
Josh Soref 8978caea28 spelling: shutdown 2018-03-11 18:55:49 +00:00
Josh Soref 8d191c9273 spelling: severity 2018-03-11 18:53:52 +00:00
Josh Soref a79eccaa58 spelling: service 2018-03-11 18:53:47 +00:00
Josh Soref 8149694f3a spelling: server 2018-03-11 18:55:30 +00:00
Josh Soref 3787d8141e spelling: serialize 2018-03-11 18:53:39 +00:00
Josh Soref e37626561c spelling: semantics 2018-03-11 19:00:26 +00:00
Josh Soref e4639ac62f spelling: secrets 2018-03-11 18:53:26 +00:00
Josh Soref cec45c6bc8 spelling: safety 2018-03-11 18:52:54 +00:00
Josh Soref de9d0c7180 spelling: retrieved 2018-03-11 18:51:40 +00:00
Josh Soref e949d23e1b spelling: resource 2018-03-11 18:51:03 +00:00
Josh Soref 82221f9a2b spelling: represents 2018-03-11 18:42:29 +00:00
Josh Soref 1c3b60ae70 spelling: replace 2018-03-11 18:41:53 +00:00
Josh Soref b47ab9ab8c spelling: removes 2018-03-11 18:41:43 +00:00
Josh Soref db166c6cf6 spelling: remnants 2018-03-11 18:41:26 +00:00
Josh Soref 258d76ec13 spelling: registry 2018-03-11 18:41:13 +00:00
Josh Soref 7ad77f568b spelling: purposes 2018-03-11 18:39:35 +00:00
Josh Soref 6fa892a463 spelling: propagated 2018-03-11 18:39:26 +00:00
Josh Soref 1a8204fa11 spelling: previous 2018-03-11 18:38:23 +00:00
Josh Soref f764e5552a spelling: periodically 2018-03-11 18:36:59 +00:00
Josh Soref 96e47bd4c1 spelling: parallelism 2018-03-11 18:35:54 +00:00
Josh Soref 3c1ce6d16d spelling: otherwise 2018-03-11 18:34:27 +00:00
Josh Soref 96dba3e267 spelling: mount 2018-03-11 18:27:18 +00:00
Josh Soref 13e5fb8221 spelling: malicious 2018-03-11 18:26:25 +00:00
Josh Soref 1ef6d6319e spelling: labels 2018-03-11 18:21:44 +00:00
Josh Soref b6ec60fb5f spelling: isolation 2018-03-11 18:19:02 +00:00
Josh Soref 337ac13f0a spelling: interpolation 2018-03-11 18:16:36 +00:00
Josh Soref 75d1240446 spelling: interface 2018-03-11 18:15:37 +00:00
Josh Soref c1a0ae3161 spelling: inspect 2018-03-11 18:15:27 +00:00
Josh Soref 2a1cf2f216 spelling: initialization 2018-03-11 18:18:37 +00:00
Josh Soref b293b48287 spelling: idempotent 2018-03-11 18:14:50 +00:00
Josh Soref 52b83328fc spelling: heartbeating 2018-03-11 18:12:19 +00:00
Josh Soref 3ad579930e spelling: fingerprint 2018-03-11 18:07:37 +00:00
Josh Soref 7f6e4012a0 spelling: existent 2018-03-11 18:30:37 +00:00
Josh Soref 7cd95f6eb3 spelling: executor 2018-03-11 18:05:31 +00:00
Josh Soref b9ce8b9e37 spelling: each 2018-03-11 17:56:19 +00:00
Josh Soref 0fc23b0ba3 spelling: down 2018-03-11 17:55:47 +00:00
Josh Soref e8478c4065 spelling: documentation 2018-03-11 17:55:21 +00:00
Josh Soref 4241ffc5ab spelling: disable 2018-03-11 17:55:12 +00:00
Josh Soref 858b9e809f spelling: directory 2018-03-11 17:55:06 +00:00