Commit graph

4492 commits

Author SHA1 Message Date
Lang Martin 4f6493a301 taskrunner getter set Umask for go-getter, setuid test 2019-08-23 15:59:03 -04:00
Mahmood Ali 3890619100 initialize device manager stats interval
Fixes a bug where the CPU is pegged at 100% due to collecting device
statistics.  The passed stats interval was ignored, and the default zero
value causes a very tight loop of stats collection.

FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a
`g2.2xlarge` ec2 instance.

The stats interval defaults to 1 second and is user configurable.  I
believe this is too frequent as a default, and I may advocate for
reducing it to a value closer to 5s or 10s, but keeping it as is for
now.

Fixes https://github.com/hashicorp/nomad/issues/6057 .
2019-08-23 14:58:34 -04:00
Jerome Gravel-Niquet cbdc1978bf Consul service meta (#6193)
* adds meta object to service in job spec, sends it to consul

* adds tests for service meta

* fix tests

* adds docs

* better hashing for service meta, use helper for copying meta when registering service

* tried to be DRY, but looks like it would be more work to use the
helper function
2019-08-23 12:49:02 -04:00
Nick Ethier 96d379071d
ar: fix bridge networking port mapping when port.To is unset (#6190) 2019-08-22 21:53:52 -04:00
Michael Schurter 59e0b67c7f connect: task hook for bootstrapping envoy sidecar
Fixes #6041

Unlike all other Consul operations, bootstrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
2019-08-22 08:15:32 -07:00
Michael Schurter b008fd1724 connect: register group services with Consul
Fixes #6042

Add new task group service hook for registering group services like
Connect-enabled services.

Does not yet support checks.
2019-08-20 12:25:10 -07:00
lchayoun 2307c9d1d2 allow dash in non-generated environment variable names - should only clean generated environment variables 2019-08-16 11:11:47 +03:00
Nick Ethier 965f00b2fc
Builtin Admission Controller Framework (#6116)
* nomad: add admission controller framework

* nomad: add admission controller framework and Consul Connect hooks

* run admission controllers before checking permissions

* client: add default node meta for connect configurables

* nomad: remove validateJob func since it has been moved to admission controller

* nomad: use new TaskKind type

* client: use consts for connect sidecar image and log level

* Apply suggestions from code review

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

* nomad: add job register test with connect sidecar

* Update nomad/job_endpoint_hooks.go

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-08-15 11:22:37 -04:00
lchayoun c5a38a045a allow dash in non-generated environment variable names - should only clean generated environment variables 2019-08-13 19:23:13 +03:00
Tim Gross 03433f35d4 client/template: configuration for function blacklist and sandboxing
When rendering a task template, the `plugin` function is no longer
permitted by default and will raise an error. An operator can opt-in
to permitting this function with the new `template.function_blacklist`
field in the client configuration.

When rendering a task template, path parameters for the `file`
function will be treated as relative to the task directory by
default. Relative paths or symlinks that point outside the task
directory will raise an error. An operator can opt-out of this
protection with the new `template.disable_file_sandbox` field in the
client configuration.
2019-08-12 16:34:48 -04:00
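The file sandbox rule from 03433f35d4 can be illustrated with a purely lexical check. This is a sketch under stated assumptions: `resolveInSandbox` is a hypothetical helper, it does not resolve symlinks (which the commit also covers), and it is not the code consul-template or Nomad uses.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// resolveInSandbox treats path as relative to taskDir and rejects any result
// that would land outside taskDir. Symlinks are NOT resolved here; a real
// check would also need to evaluate them.
func resolveInSandbox(taskDir, path string) (string, error) {
	full := filepath.Join(taskDir, path) // Join also cleans ".." segments
	rel, err := filepath.Rel(taskDir, full)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes the task directory", path)
	}
	return full, nil
}
```

With a task directory of `/task`, the path `local/app.conf` resolves inside the sandbox, while `../etc/passwd` is rejected.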
Danielle Lancashire 7e6c8e5ac1
Copy documentation to api/tasks 2019-08-12 16:22:27 +02:00
Danielle Lancashire 861caa9564
HostVolumeConfig: Source -> Path 2019-08-12 15:39:08 +02:00
Danielle Lancashire e132a30899
structs: Unify Volume and VolumeRequest 2019-08-12 15:39:08 +02:00
Danielle Lancashire 6ef8d5233e
client: Add volume_hook for mounting volumes 2019-08-12 15:39:08 +02:00
Danielle Lancashire 063e4240c1
client: Add parsing and registration of HostVolume configuration 2019-08-12 15:39:08 +02:00
lchayoun ca892163b2 allow dash in non-generated environment variable names 2019-08-11 12:51:42 +03:00
Nick Ethier 7806f4c597
Revert "client: add autofetch for CNI plugins"
This reverts commit 0bd157cc3b04fb090dd0d54affcae71496102ce8.
2019-08-08 15:10:19 -04:00
Nick Ethier 7d28ece8de
Revert "client: remove debugging lines"
This reverts commit 54ce4d1f7ef4913cb12c03dbc98bcd903f7787c9.
2019-08-08 14:52:52 -04:00
Liel Chayoun 24dcb2379c
Update env_test.go 2019-08-06 11:59:31 +03:00
Mahmood Ali b17bac5101 Render consul templates using task env only (#6055)
When rendering a task consul template, ensure that only task environment
variables are used.

Currently, `consul-template` always falls back to host process
environment variables when key isn't a task env var[1].  Thus, we add
an empty entry for each host process env-var not found in task env-vars.

[1] bfa5d0e133/template/funcs.go (L61-L75)
2019-08-05 16:30:47 -04:00
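The masking trick described in b17bac5101 can be sketched as below. `maskHostEnv` is a hypothetical name, and the map-based representation is an assumption made for illustration; only the idea of adding an empty entry per host variable comes from the commit message.

```go
package main

import "strings"

// maskHostEnv returns a copy of taskEnv with an empty entry added for every
// host variable the task does not define. A lookup then finds the empty
// entry instead of falling back to the host process environment.
// hostEnviron uses the "KEY=value" format returned by os.Environ.
func maskHostEnv(taskEnv map[string]string, hostEnviron []string) map[string]string {
	out := make(map[string]string, len(taskEnv)+len(hostEnviron))
	for k, v := range taskEnv {
		out[k] = v
	}
	for _, kv := range hostEnviron {
		key := strings.SplitN(kv, "=", 2)[0]
		if _, ok := out[key]; !ok {
			out[key] = "" // shadow the host value
		}
	}
	return out
}
```

Task variables always win: a key present in both places keeps the task's value.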
Mahmood Ali f66169cd6a
Merge pull request #6065 from hashicorp/b-nil-driver-exec
Check if driver handle is nil before execing
2019-08-02 09:48:28 -05:00
Mahmood Ali a4670db9b7 Check if driver handle is nil before execing
Defend against tr.getDriverHandle being nil.  Exec handler checks if
task is running, but it may be stopped between check and driver handler
fetching.
2019-08-02 10:07:41 +08:00
Nick Ethier 7de0bec8ab
client/cni: updated comments and simplified logic to auto download plugins 2019-07-31 01:04:10 -04:00
Nick Ethier b16640c50d
Apply suggestions from code review
Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>
2019-07-31 01:04:10 -04:00
Nick Ethier 321d10a041
client: remove debugging lines 2019-07-31 01:04:09 -04:00
Nick Ethier af6b191963
client: add autofetch for CNI plugins 2019-07-31 01:04:09 -04:00
Nick Ethier 1e9dd1b193
remove unused file 2019-07-31 01:04:09 -04:00
Nick Ethier 09a4cfd8d7
fix failing tests 2019-07-31 01:04:07 -04:00
Nick Ethier ef83f0831b
ar: plumb client config for networking into the network hook 2019-07-31 01:04:06 -04:00
Nick Ethier af66a35924
networking: Add new bridge networking mode implementation 2019-07-31 01:04:06 -04:00
Michael Schurter fb487358fb
connect: add group.service stanza support 2019-07-31 01:04:05 -04:00
Nick Ethier 63c5504d56
ar: fix lint errors 2019-07-31 01:03:19 -04:00
Nick Ethier e312201d18
ar: rearrange network hook to support building on windows 2019-07-31 01:03:19 -04:00
Nick Ethier 370533c9c7
ar: fix test that failed due to error renaming 2019-07-31 01:03:19 -04:00
Nick Ethier 2d60ef64d9
plugins/driver: make DriverNetworkManager interface optional 2019-07-31 01:03:19 -04:00
Nick Ethier f87e7e9c9a
ar: plumb error handling into alloc runner hook initialization 2019-07-31 01:03:18 -04:00
Nick Ethier ef1795b344
ar: add tests for network hook 2019-07-31 01:03:18 -04:00
Nick Ethier 15989bba8e
ar: cleanup lint errors 2019-07-31 01:03:18 -04:00
Nick Ethier 220cba3e7e
ar: move linux specific code to its own file and add tests 2019-07-31 01:03:18 -04:00
Nick Ethier 548f78ef15
ar: initial driver based network management 2019-07-31 01:03:17 -04:00
Nick Ethier 66c514a388
Add network lifecycle management
Adds new Prerun and Postrun hooks to manage set up of network namespaces
on linux. Work still needs to be done to make the code platform agnostic and
support Docker style network initialization.
2019-07-31 01:03:17 -04:00
Preetha Appan d048029b5a
remove generated code and change version to 0.10.0 2019-07-30 15:56:05 -05:00
Nomad Release bot e39fb11531 Generate files for 0.9.4 release 2019-07-30 19:05:18 +00:00
Preetha Appan 6b4c40f5a8
remove generated code 2019-07-23 12:07:49 -05:00
Nomad Release bot 04187c8b86 Generate files for 0.9.4-rc1 release 2019-07-22 21:42:36 +00:00
Michael Schurter d90680021e logmon: fix comment formatting
2019-07-22 13:05:01 -07:00
Michael Schurter e37bc3513c logmon: ensure errors are still handled properly
...and add a comment to switch back to the old error handling once we
switch to Go 1.12.
2019-07-22 12:49:48 -07:00
Danielle Lancashire 1bcbbbfbe6
logmon: Workaround golang/go#29119
There's a bug in go1.11 that causes some io operations on windows to
return incorrect errors for some cases when Stat-ing files. To avoid
upgrading to go1.12 in a point release, here we loosen up the cases
where we will attempt to create fifos, and add some logging of
underlying stat errors to help with debugging.
2019-07-22 18:28:12 +02:00
Jasmine Dahilig 2157f6ddf1
add formatting for hcl parsing error messages (#5972) 2019-07-19 10:04:39 -07:00
Mahmood Ali cd6f1d3102 Update consul-template dependency to latest
To pick up the fix in
https://github.com/hashicorp/consul-template/pull/1231 .
2019-07-18 07:32:03 +07:00
Mahmood Ali 8a82260319 log unrecoverable errors 2019-07-17 11:01:59 +07:00
Mahmood Ali 1a299c7b28 client/taskrunner: fix stats retry logic
Previously, if a channel is closed, we retry the Stats call.  But, if that call
fails, we go in a backoff loop without calling Stats ever again.

Here, we use a utility function for calling driverHandle.Stats that retries
as one expects.

I aimed to preserve the logging formats but made small improvements as I saw fit.
2019-07-11 13:58:07 +08:00
Preetha Appan 7d645c5ad9
Test file for detect content type that satisfies linter and encoding 2019-07-10 11:42:04 -05:00
Preetha Appan ef9a71c68b
code review feedback 2019-07-10 10:41:06 -05:00
Preetha Appan 990e468edc
Populate task event struct with kill timeout
This makes for a nicer task event message
2019-07-09 09:37:09 -05:00
Preetha Appan 108a292cc0
fix linting failure in test case file 2019-07-08 11:29:12 -05:00
Michael Lange b2e9570075
Use consistent casing in the JSON representation of the AllocFileInfo struct 2019-07-02 17:27:31 -07:00
Preetha Appan 8495fb9055
Added additional test cases and fixed go test case 2019-07-02 13:25:29 -05:00
Mahmood Ali a97d451ac7
Merge pull request #5905 from hashicorp/b-ar-failed-prestart
Fail alloc if alloc runner prestart hooks fail
2019-07-02 20:25:53 +08:00
Danielle c6872cdf12
Merge pull request #5864 from hashicorp/dani/win-pipe-cleaner
windows: Fix restarts using the raw_exec driver
2019-07-02 13:58:56 +02:00
Danielle Lancashire e20300313f
fifo: Safer access to Conn 2019-07-02 13:12:54 +02:00
Mahmood Ali f10201c102 run post-run/post-stop task runner hooks
Handle when prestart failed while restoring a task, to prevent
accidentally leaking consul/logmon processes.
2019-07-02 18:38:32 +08:00
Mahmood Ali 4afd7835e3 Fail alloc if alloc runner prestart hooks fail
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.

This leads to terrible results, some of which are:
* Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in
  pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after
  alloc expected start time!

Here, we treat all tasks as failed if an alloc runner prestart hook fails.
This fixes the lockups, and permits the alloc to be rescheduled on another node.

While it's desirable to retry the alloc runner after such failures, I opted to
treat it as out of scope.  I'm wary of some subtleties around alloc and task
runners and their idempotency that are better handled in a follow-up PR.

This might be one of the root causes for
https://github.com/hashicorp/nomad/issues/5840 .
2019-07-02 18:35:47 +08:00
Mahmood Ali 7614b8f09e
Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2
task runner to avoid running task if terminal
2019-07-02 15:31:17 +08:00
Mahmood Ali 7bfad051b9 address review comments 2019-07-02 14:53:50 +08:00
Mahmood Ali c0c00ecc07
Merge pull request #5906 from hashicorp/b-alloc-stale-updates
client: defensive against getting stale alloc updates
2019-07-02 12:40:17 +08:00
Preetha Appan c09342903b
Improve test cases for detecting content type 2019-07-01 16:24:48 -05:00
Danielle Lancashire 688f82f07d
fifo: Close connections and cleanup lock handling 2019-07-01 14:14:29 +02:00
Danielle Lancashire 2c7d1f1b99
logmon: Add windows compatibility test 2019-07-01 14:14:06 +02:00
Mahmood Ali c5f5a1fcb9 client: defensive against getting stale alloc updates
When fetching node alloc assignments, be defensive against a stale read before
killing the local node's allocs.

The bug occurs when both the client and servers are restarting: when the client
requests the allocations for the node, it might get stale data because the
server hasn't finished applying all the restored raft transactions to the store.

Consequently, client would kill and destroy the alloc locally, just to fetch it
again moments later when server store is up to date.

The bug can be reproduced quite reliably with single node setup (configured with
persistence).  I suspect it's too edge-casey to occur in production cluster with
multiple servers, but we may need to examine leader failover scenarios more closely.

In this commit, we only remove and destroy allocs if the removal index is more
recent than the alloc index. This seems like a cheap resiliency fix we already
use for detecting alloc updates.

A more proper fix would be to ensure that a nomad server only serves
RPC calls when state store is fully restored or up to date in leadership
transition cases.
2019-06-29 04:17:35 -05:00
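The index comparison from c5f5a1fcb9 can be sketched as follows. The types and names (`localAlloc`, `safeRemovals`, `responseIndex`) are hypothetical simplifications; the commit message only states that an alloc is removed when the removal index is more recent than the alloc index.

```go
package main

// localAlloc is a hypothetical, trimmed-down view of an alloc the client runs.
type localAlloc struct {
	ID          string
	ModifyIndex uint64 // index at which the client last saw this alloc change
}

// safeRemovals returns the IDs of local allocs that are absent from the
// server response AND older than the response. An alloc that is newer than
// the response is kept, because the response is then likely a stale read.
func safeRemovals(local []localAlloc, onServer map[string]bool, responseIndex uint64) []string {
	var remove []string
	for _, a := range local {
		if onServer[a.ID] {
			continue // server still assigns this alloc to the node
		}
		if responseIndex > a.ModifyIndex {
			remove = append(remove, a.ID)
		}
	}
	return remove
}
```

In this sketch a response at index 15 may remove an alloc last modified at index 10, but not one last modified at index 20.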
Preetha Appan 3345ce3ba4
Infer content type in alloc fs stat endpoint 2019-06-28 20:31:28 -05:00
Danielle Lancashire e1151f743b
appveyor: Run logmon tests 2019-06-28 16:01:41 +02:00
Danielle Lancashire 634ada671e
fifo: Require that fifos do not exist for create
Although this operation is safe on linux, it is not safe on Windows when
using the named pipe interface. To provide a ~reasonable common api
abstraction, here we switch to returning File exists errors on the unix
api.
2019-06-28 13:47:18 +02:00
Danielle Lancashire 0ff27cfc0f
vendor: Use dani fork of go-winio 2019-06-28 13:47:18 +02:00
Danielle Lancashire 514a2a6017
logmon: Refactor fifo access for windows safety
On unix platforms, it is safe to re-open fifos for reading after the
first creation if the file is already a fifo, however this is not
possible on windows where this triggers a permissions error on the
socket path, as you cannot recreate it.

We can't transparently handle this in the CreateAndRead handle, because
the Access Is Denied error is too generic to reliably be an IO error.
Instead, we add an explicit API for opening a reader to an existing FIFO,
and check to see if the fifo already exists inside the calling package
(e.g. logmon)
2019-06-28 13:41:54 +02:00
Mahmood Ali 3d89ae0f1e task runner to avoid running task if terminal
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having the task runner use
allocRunner.shouldRun() instead of only checking the server updated
alloc.

Here, we preserve many of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes https://github.com/hashicorp/nomad/issues/5883
2019-06-27 11:27:34 +08:00
Danielle Lancashire b9ac184e1f
tr: Fetch Wait channel before killTask in restart
Currently, if killTask results in the termination of a process before
calling WaitTask, Restart() will incorrectly return a TaskNotFound
error when using the raw_exec driver on Windows.
2019-06-26 15:20:57 +02:00
Mahmood Ali b209584dce
Merge pull request #5726 from hashicorp/b-plugins-via-init
Use init() to handle plugin invocation
2019-06-18 21:09:03 -04:00
Mahmood Ali ac64509c59 comment on use of init() for plugin handlers 2019-06-18 20:54:55 -04:00
Chris Baker f71114f5b8 cleanup test 2019-06-18 14:15:25 +00:00
Chris Baker a2dc351fd0 formatting and clarity 2019-06-18 14:00:57 +00:00
Chris Baker e0170e1c67 metrics: add namespace label to allocation metrics 2019-06-17 20:50:26 +00:00
Mahmood Ali 962921f86c Use init to handle plugin invocation
Currently, nomad "plugin" processes (e.g. executor, logmon, docker_logger) are started as CLI
commands to be handled by command CLI framework.  Plugin launchers use
`discover.NomadBinary()` to identify the binary and start it.

This has a few downsides: The trivial one is that when running tests, one
must re-compile the nomad binary as the tests need to invoke the nomad
executable to start plugin.  This is frequently overlooked, resulting in
puzzlement.

The more significant issue with `executor` in particular is in relation
to external driver:

* Plugin must identify the path of invoking nomad binary, which is not
trivial; `discover.NomadBinary()` now returns the path to the plugin
rather than to nomad, preventing external drivers from launching
executors.

* The external driver may get a different version of executor than it
expects (especially if we make a binary incompatible change in future).

This commit addresses both downsides by handling the plugin invocation
through an `init()` call, similar to how the libcontainer init
handler is done in [1] and recommended by libcontainer [2].  `init()`
will be invoked and handled properly in tests and external drivers.

For external drivers, this change will cause external drivers to launch
the executor that's compiled against.

There are a couple of downsides to this approach:
* These specific packages (i.e. executor, logmon, and dockerlog) need to
be careful in use of `init()`, package initializers.  Must avoid having
command execution rely on any other init in the package.  I prefixed
files with `z_` (golang processes files in lexical order), but ensured
we don't depend on order.
* The command handling is spread in multiple packages making it a bit
less obvious how plugin starts are handled.

[1] drivers/shared/executor/libcontainer_nsenter_linux.go
[2] eb4aeed24f/libcontainer (using-libcontainer)
2019-06-13 16:48:01 -04:00
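The `init()`-based dispatch described in 962921f86c can be sketched roughly as below. All names here (`pluginHandlers`, `dispatch`) and the trivial handlers are hypothetical placeholders, not the functions the commit adds; the sketch only shows the shape of the idea.

```go
package main

import "os"

// pluginHandlers maps a subcommand name to its entry point. Package-level
// variables are initialized before any init() runs, so the map is ready
// when init() consults it. The handlers are placeholders.
var pluginHandlers = map[string]func(args []string) int{
	"executor": func(args []string) int { return 0 },
	"logmon":   func(args []string) int { return 0 },
}

// dispatch runs the handler selected by argv[1], if any, and reports
// whether one was found.
func dispatch(argv []string) (exitCode int, handled bool) {
	if len(argv) < 2 {
		return 0, false
	}
	handler, ok := pluginHandlers[argv[1]]
	if !ok {
		return 0, false
	}
	return handler(argv[2:]), true
}

// init runs before main, so any binary that links this package (the agent,
// a test binary, an external driver) can serve as the plugin process.
func init() {
	if code, handled := dispatch(os.Args); handled {
		os.Exit(code)
	}
}
```

Because the check lives in `init()`, a test binary re-executed with `executor` as its first argument behaves as the plugin without a separately compiled nomad binary.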
Jasmine Dahilig ed9740db10
Merge pull request #5664 from hashicorp/f-http-hcl-region
backfill region from hcl for jobUpdate and jobPlan
2019-06-13 12:25:01 -07:00
Jasmine Dahilig 51e141be7a backfill region from job hcl in jobUpdate and jobPlan endpoints
- updated region in job metadata that gets persisted to nomad datastore
- fixed many unrelated unit tests that used an invalid region value
(they previously passed because hcl wasn't getting picked up and
the job would default to global region)
2019-06-13 08:03:16 -07:00
Mahmood Ali e31159bf1f Prepare for 0.9.4 dev cycle 2019-06-12 18:47:50 +00:00
Nomad Release bot 4803215109 Generate files for 0.9.3 release 2019-06-12 16:11:16 +00:00
Danielle f923b568e0
Merge pull request #5821 from hashicorp/dani/b-5770
trhooks: Add TaskStopHook interface to services
2019-06-12 17:30:49 +02:00
Danielle Lancashire c326344b57
trt: Fix test 2019-06-12 17:06:11 +02:00
Danielle Lancashire 13d76e35fd
trhooks: Add TaskStopHook interface to services
We currently only run cleanup Service Hooks when a task is either
Killed, or Exited. However, due to the implementation of a task runner,
tasks are only Exited if they ever correctly started running, which is
not true when you receive an error early in the task start flow, such as
not being able to pull secrets from Vault.

This updates the service hook to also call consul deregistration
routines during a task Stop lifecycle event, to ensure that any
registered checks and services are cleared in such cases.

fixes #5770
2019-06-12 16:00:21 +02:00
Mahmood Ali 2acf30fdd3 Fallback to alloc.TaskResources for old allocs
When a client is running against an old server (e.g. running 0.8),
`alloc.AllocatedResources` may be nil, and we need to check the
deprecated `alloc.TaskResources` instead.

Fixes https://github.com/hashicorp/nomad/issues/5810
2019-06-11 10:32:53 -04:00
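The fallback in 2acf30fdd3 follows a common compatibility pattern, sketched here with deliberately simplified types. The `allocation` and `resources` structs below are hypothetical stand-ins and do not match the layout of Nomad's real structs; only the "use the new field, fall back to the deprecated one when it is nil" logic is taken from the commit message.

```go
package main

// resources is a hypothetical, trimmed-down resource record.
type resources struct {
	CPU      int
	MemoryMB int
}

// allocation is a hypothetical stand-in for an alloc. In this sketch both
// fields are plain maps keyed by task name.
type allocation struct {
	AllocatedResources map[string]*resources // nil when sent by an old server
	TaskResources      map[string]*resources // deprecated, still set by old servers
}

// resourcesForTask prefers the new field and falls back to the deprecated
// one when the new field was never populated.
func resourcesForTask(a *allocation, task string) *resources {
	if a.AllocatedResources != nil {
		return a.AllocatedResources[task]
	}
	return a.TaskResources[task]
}
```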
Mahmood Ali 7a4900aaa4 client/allocrunner: depend on internal task state
Alloc runner already tracks tasks associated with alloc.  Here, we
become defensive by relying on the alloc runner tracked tasks, rather
than depend on server never updating the job unexpectedly.
2019-06-10 18:42:51 -04:00
Mahmood Ali d30c3d10b0
Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1
More test fixes
2019-06-05 19:09:18 -04:00
Mahmood Ali 935ee86e92
Merge pull request #5737 from fwkz/fix-restart-attempts
Fix restart attempts of `restart` stanza in `delay` mode.
2019-06-05 19:05:07 -04:00
Mahmood Ali 97957fbf75 Prepare for 0.9.3 dev cycle 2019-06-05 14:54:00 +00:00
Nomad Release bot 43bfbf3fcc Generate files for 0.9.2 release 2019-06-05 11:59:27 +00:00
Mahmood Ali a9f81f2daa client config flag to disable remote exec
This exposes a client flag to disable nomad remote exec support in
environments where access to tasks ought to be restricted.

I used `disable_remote_exec` client flag that defaults to allowing
remote exec. Opted for a client config that can be used to disable
remote exec globally, or to a subset of the cluster if necessary.
2019-06-03 15:31:39 -04:00
Mahmood Ali a4ead8ff79 remove 0.9.2-rc1 generated code 2019-05-23 11:14:24 -04:00
Nomad Release bot 6d6bc59732 Generate files for 0.9.2-rc1 release 2019-05-22 19:29:30 +00:00
Michael Schurter a54511b304
Merge pull request #5731 from hashicorp/b-ignore-dc
client: drop unused DC field from servers list
2019-05-22 08:42:15 -07:00
Mahmood Ali 84419f08ce client: synchronize client.invalidAllocs access
invalidAllocs may be accessed and manipulated from different goroutines,
so must be locked.
2019-05-22 09:37:49 -04:00
Danielle Lancashire 27583ed8c1 client: Pass servers contacted ch to allocrunner
This fixes an issue where batch and service workloads would never be
restarted due to indefinitely blocking on a nil channel.

It also raises the restoration logging message to `Info` to simplify log
analysis.
2019-05-22 13:47:35 +02:00
Mahmood Ali 9df1e00f35 tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal
Given that Signal may be called multiple times, blocking for `SignalCh`
isn't sufficient to synchronize access to the Signals field.
2019-05-21 13:56:58 -04:00
Mahmood Ali b06e585713
Merge pull request #5739 from hashicorp/r-rm-logmon-syslog-deadcode
logmon: remove syslog server deadcode
2019-05-21 11:46:48 -04:00
Mahmood Ali eca23bf9c4
Merge pull request #5742 from hashicorp/b-test-fixes-20190520
Grab bag of (primarily race) test fixes
2019-05-21 11:46:36 -04:00
Mahmood Ali e88bb61488
Merge pull request #5740 from hashicorp/b-nomad-exec-term-race
exec: allow drivers to handle stream termination
2019-05-21 11:24:12 -04:00
Mahmood Ali b475ccbe3e client: synchronize access to ar.alloc
`allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's
use `allocRunner.Alloc()` helper function to access it.
2019-05-21 09:55:05 -04:00
Mahmood Ali 2a7b073167 tests: fix fifo lib race
Accidentally accessed outer `err` variable inside a goroutine
2019-05-21 09:49:56 -04:00
Mahmood Ali 296bd41c9e tests: fix data race in client TestDriverManager_Fingerprint_Periodic 2019-05-21 09:49:56 -04:00
Mahmood Ali d9e59eece0 tests: fix client TestFS_Stream data race
Close is invoked in a different goroutine from test
2019-05-21 09:49:56 -04:00
Mahmood Ali 75e0a3f405 exec: allow drivers to handle stream termination
Without this change, alloc_endpoint cancels the context passed to the handler
when we detect EOF.  This races with the driver setting the exit code, and we
run into a case where the exec process terminates cleanly yet we attempt to
mark it as failed with a context error.

Here, we rely on the driver to handle errors returned from Stream, without
racing to set an error.
2019-05-21 09:40:25 -04:00
Mahmood Ali 974bcbecc9 logmon: remove syslog server deadcode
Remove unused syslog server related code that got replaced by the docker
logger in Nomad 0.9
2019-05-21 09:36:43 -04:00
fwkz 8b84bec95a Fix restart attempts of restart stanza.
Number of restarts during 2nd interval is off by one.
2019-05-21 13:27:19 +02:00
Michael Schurter d41abda957 client: drop unused DC field from servers list
See #5730 for details.
2019-05-20 14:19:15 -07:00
Michael Schurter 2fe0768f3b docs: changelog entry for #5669 and fix comment 2019-05-14 10:54:00 -07:00
Michael Schurter af9096c8ba client: register before restoring
Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).

Since restoring is synchronous, start the registration goroutine first.

For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
2019-05-14 10:53:27 -07:00
Michael Schurter e07f73bfe0 client: do not restart dead tasks until server is contacted (try 2)
Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162

Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!

The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.
2019-05-14 10:53:27 -07:00
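The "chan that is closed by the client" plus `once.Do` approach from e07f73bfe0 can be sketched as follows. The type and method names (`serverContact`, `markContacted`, `contacted`) are hypothetical; only the pattern of closing a channel exactly once is taken from the commit message.

```go
package main

import "sync"

// serverContact is a hypothetical one-shot signal. Waiters block on the
// channel; closing it releases all of them at once.
type serverContact struct {
	once sync.Once
	ch   chan struct{}
}

func newServerContact() *serverContact {
	return &serverContact{ch: make(chan struct{})}
}

// markContacted closes the channel the first time it is called. Later calls
// are no-ops, so it is safe to call on every update from the servers.
func (s *serverContact) markContacted() {
	s.once.Do(func() { close(s.ch) })
}

// contacted returns a channel that is closed once servers were contacted.
func (s *serverContact) contacted() <-chan struct{} {
	return s.ch
}
```

A task that failed to reattach on restore would block on `<-contacted()` before restarting. Closing a channel wakes every waiter, which is why no per-runner method call is needed.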
Michael Schurter d7e5ace1ed client: do not restart dead tasks until server is contacted
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
2019-05-14 10:53:27 -07:00
Michael Schurter 3b1f8991a1 client: log when server list changes
Stop logging in the happy path when nothing has changed.
2019-05-13 15:42:55 -07:00
Michael Schurter 48db8135da
Merge pull request #5492 from hashicorp/f-allocated-mem
client: expose allocated memory per task
2019-05-13 13:31:22 -07:00
Lang Martin 1d03a43ce2
Merge pull request #5642 from hashicorp/b-network-fingerprinting-ipv4
network fingerprinting multiple IPs on the configured network device
2019-05-13 11:46:53 -04:00
Michael Schurter 1c4e585fa7 client: expose allocated memory per task
Related to #4280

This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.

```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
2019-05-10 11:12:12 -07:00
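With the allocated gauge from 1c4e585fa7 in place, how close a task is to its limit is a simple ratio. The helper below is a hypothetical sketch of that arithmetic, not part of Nomad; it assumes both gauges are read in bytes, as the commit message states for the allocated value.

```go
package main

// memoryUtilization returns usage as a fraction of the allocated limit.
// Both arguments are in bytes. A non-positive limit yields 0 to avoid
// dividing by zero.
func memoryUtilization(usageBytes, allocatedBytes float64) float64 {
	if allocatedBytes <= 0 {
		return 0
	}
	return usageBytes / allocatedBytes
}
```

For the sample values above (usage 8208384, allocated 268435456) the ratio is roughly 0.03, so the task is using about 3% of its limit.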
Lang Martin f6bc45dd23 client improve a comment in updateNetworks 2019-05-10 11:25:04 -04:00
Mahmood Ali 919827f2df
Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base
nomad exec part 1: plumbing and docker driver
2019-05-09 18:09:27 -04:00
Mahmood Ali ab2cae0625 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Preetha 1d02886bb6
Merge pull request #5654 from hashicorp/b-hearbeat-lockfix
Remove unnecessary locking and serverlist syncing in heartbeats
2019-05-08 13:36:39 -05:00
Preetha Appan 3289e7f4a0
fix typo and add one more test scenario 2019-05-08 10:54:22 -05:00
Preetha Appan db6b291a5a
code review feedback 2019-05-07 16:23:32 -05:00
Chris Baker 93ec1293be
stale allocation data leads to incorrect (and even negative) metrics (#5637)
* client: was not using up-to-date client state in determining which allocs count towards allocated resources

* Update client/client.go

Co-Authored-By: cgbaker <cgbaker@hashicorp.com>
2019-05-07 15:54:36 -04:00
Preetha Appan b063fc81a4
Remove unnecessary locking and serverlist syncing in heartbeats
This removes an unnecessary shared lock between discovery and heartbeating
which was causing heartbeats to be missed upon retries when a single server
fails. Also made a drive by fix to call the periodic server shuffler goroutine.
2019-05-06 14:44:55 -05:00
Michael Schurter 8c7b3ff45a
Fix comment
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:01:30 -05:00
Michael Schurter e19fa33f9c
Remove unnecessary boolean clause
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:00:17 -05:00
Preetha Appan b99a204582
Update deployment health on failed allocations only if health is unset
This fixes a confusing UX where a previously successful deployment's
healthy/unhealthy count would get updated if any allocations failed after
the deployment was already marked as successful.
2019-05-02 22:59:56 -05:00
Lang Martin c32cce51f0 client fingerprinting can keep multi ips on a device 2019-05-02 18:11:28 -04:00
Lang Martin 94f23016a2 client_test new test fingerprinting can keep multi ips on a device 2019-05-02 18:11:28 -04:00
Mahmood Ali 7a32d3f3aa client: handle 0.8 server network resources
Fixes https://github.com/hashicorp/nomad/issues/5587

When a nomad 0.9 client is handling an alloc generated by a nomad 0.8
server, we should check the alloc.TaskResources for networking details
rather than task.Resources.

We check alloc.TaskResources for networking for other tasks in the task
group [1], so it's a bit odd that we used the task.Resources struct
here.  TaskRunner also uses `alloc.TaskResources`[2].

The task.Resources struct in 0.8 was sparsely populated, resulting in
0 being stored in port mapping env vars:

```
vagrant@nomad-server-01:~$ nomad version
Nomad v0.8.7 (21a2d93eecf018ad2209a5eab6aae6c359267933+CHANGES)
vagrant@nomad-server-01:~$ nomad server members
Name                    Address      Port  Status  Leader  Protocol  Build  Datacenter  Region
nomad-server-01.global  10.199.0.11  4648  alive   true    2         0.8.7  dc1         global
vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.Job.TaskGroups[0].Tasks[0].Resources.Networks'
[
  {
    "CIDR": "",
    "Device": "",
    "DynamicPorts": [
      {
        "Label": "db",
        "Value": 0
      }
    ],
    "IP": "",
    "MBits": 10,
    "ReservedPorts": null
  }
]
vagrant@nomad-server-01:~$ nomad alloc status -json 5b34649b | jq '.TaskResources'
{
  "redis": {
    "CPU": 500,
    "DiskMB": 0,
    "IOPS": 0,
    "MemoryMB": 256,
    "Networks": [
      {
        "CIDR": "",
        "Device": "eth1",
        "DynamicPorts": [
          {
            "Label": "db",
            "Value": 21722
          }
        ],
        "IP": "10.199.0.21",
        "MBits": 10,
        "ReservedPorts": null
      }
    ]
  }
}
```

Also, updated the test values to mimic how Nomad 0.8 structs are
represented, and made its result match the non compact values in
`TestEnvironment_AsList`.

[1] 24e9040b18/client/taskenv/env.go (L624-L639)
[2] https://github.com/hashicorp/nomad/blob/master/client/allocrunner/taskrunner/task_runner.go#L287-L303
2019-05-02 12:08:38 -04:00
Mahmood Ali 446f06721d aux: helper method that returns token as well as ACL policy
This helper returns the token as well as the ACL policy, to be used in a later
commit for logging the token info associated with nomad exec invocation.
2019-04-30 10:23:56 -04:00
Lang Martin 371014b781
Merge pull request #5553 from hashicorp/b-fingerprinter-manual-config
client fingerprinter doesn't overwrite manual configuration
2019-04-26 12:55:34 -04:00
Danielle 79515496cb
Merge pull request #5515 from hashicorp/dani/f-alloc-signal
allocs: Add nomad alloc signal command
2019-04-26 14:21:05 +02:00
Danielle Lancashire a8880f9643 alloc_signal: Add autocompletion and cmd tests 2019-04-26 12:47:53 +02:00
Mahmood Ali bf0a09e270 retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
Mahmood Ali 81841e8528 try checking process status 2019-04-25 18:16:13 -04:00
Mahmood Ali fc78521f29 add logging about attempts 2019-04-25 18:09:36 -04:00
Mahmood Ali e6ca8641a8 try sleeping for stop signal to take effect 2019-04-25 17:16:29 -04:00
Mahmood Ali ff3a095015 add a test that simulates logmon dying during Start() call 2019-04-25 16:41:17 -04:00
Mahmood Ali bbac73883c logmon: retry starting logmon if it exits
Retry locally if we detect that logmon is shutting down while the
Start() API call is in progress.
2019-04-25 15:10:16 -04:00
Mahmood Ali b51f00a7f3 logmon client to handle grpc closing errors 2019-04-25 14:32:24 -04:00
Danielle Lancashire 3409e0be89 allocs: Add nomad alloc signal command
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.

Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
2019-04-25 12:43:32 +02:00
Chris Baker 91c4e1eabb
Merge pull request #5541 from hashicorp/b/5540-bad-client-alloc-metrics
client/metrics: fixed stale metrics
2019-04-22 15:07:30 -04:00
Mahmood Ali f515b93b5e
Merge pull request #5577 from hashicorp/dani/b-logmon-unrecoverable
logging: Attempt to recover logmon failures
2019-04-22 14:40:24 -04:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Chris Baker 0b1a4dd206 client/metrics: modified metrics to use (updated) client copy of allocation instead of (unupdated) server copy 2019-04-22 18:31:45 +00:00
Lang Martin eba4e29440 client fingerprinter doesn't overwrite manual configuration
Revert "Revert accidental merge of pr #5482"
This reverts commit c45652ab8c113487b9d4fbfb107782cbcf8a85b0.
2019-04-19 15:23:48 -04:00
Michael Schurter 26f3bdbf8f
Merge pull request #5583 from ygersie/fingerprint_nilpointer
fix nil pointer in fingerprinting AWS env leading to crash
2019-04-19 08:08:59 -07:00
Mahmood Ali 902eed4bf9 clarify cryptic log line 2019-04-19 09:31:43 -04:00
Mahmood Ali f74d60439f client: log detected driver health state
Noticed that the `detected drivers` log line was misleading: when a driver
doesn't fingerprint before the timeout, its health status is the empty
string `""`, which we would mark as detected.

Now, we log all drivers along with their state to ease driver
fingerprint debugging.
2019-04-19 09:15:25 -04:00
Mahmood Ali 6bdc9860b7 client: avoid registering node twice right away
I noticed that `watchNodeUpdates()` calls `retryRegisterNode()` almost
immediately (well, 5 seconds) after `registerAndHeartbeat()` has already
registered the node.

This call is unnecessary and made debugging a bit harder.  So here, we
ensure that we only re-register the node for new node events, not for
the initial registration.
2019-04-19 09:12:50 -04:00
Mahmood Ali f82ea8824f client: wait for batched driver updated
Here we retain the 0.8.7 behavior of waiting for driver fingerprints before
registering a node, with some timeout.  This is needed for system jobs,
as system job scheduling for a node occurs at node registration, and the
race might mean that a system job may not get placed on the node because
of missing drivers.

The timeout isn't strictly necessary, but we are raising it to 1 minute as
that is closer to blocking indefinitely than 1 second.  We need to keep the
value high enough to capture as many drivers/devices as possible, but low
enough that it doesn't risk blocking too long due to a misbehaving plugin.

Fixes https://github.com/hashicorp/nomad/issues/5579
2019-04-19 09:00:24 -04:00
Yorick Gersie 95f81f3eeb fix nil pointer in fingerprinting AWS env leading to crash
The HTTP client returns a nil response if an error has occurred. We first
need to check for an error before being able to check the HTTP response
code.
2019-04-19 11:07:13 +02:00
Danielle Lancashire c31966fc71 logging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment separately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same way (retry until restarts are exhausted) as the current
failure mode.
2019-04-18 13:41:56 +02:00
Michael Schurter a85e7b7cc9 vault: fix data races 2019-04-16 11:22:44 -07:00
Michael Schurter 0aeb3dbd86 vault: fix renewal time
Renewal time was being calculated as 10s+Intn(lease-10s), so the renewal
time could be very rapid or within 1s of the deadline: [10s, lease)

This commit fixes the renewal time by calculating it as:

	(lease/2) +/- 10s

For a lease of 60s this means the renewal will occur in [20s, 40s).
2019-04-16 11:22:44 -07:00
Michael Schurter f7a7acc345
Merge pull request #5518 from hashicorp/f-simplify-kill
client: simplify kill logic
2019-04-15 14:11:58 -07:00
Chris Baker 6848591914 vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation 2019-04-12 15:06:34 +00:00
Lang Martin a2a1e7829d Revert accidental merge of pr #5482
Revert "fingerprint Constraints and Affinities have Equals, as set"
This reverts commit 596f16fb5f1a4a6766a57b3311af806d22382609.

Revert "client tests assert the independent handling of interface and speed"
This reverts commit 7857ac5993a578474d0570819f99b7b6e027de40.

Revert "structs missed applying a style change from the review"
This reverts commit 658916e3274efa438beadc2535f47109d0c2f0f2.

Revert "client, structs comments"
This reverts commit be2838d6baa9d382a5013fa80ea016856f28ade2.

Revert "client fingerprint updateNetworks preserves the network configuration"
This reverts commit fc309cb430e62d8e66267a724f006ae9abe1c63c.

Revert "client_test cleanup comments from review"
This reverts commit bc0bf4efb9114e699bc662f50c8f12319b6b3445.

Revert "client Networks Equals is set equality"
This reverts commit f8d432345b54b1953a4a4c719b9269f845e3e573.

Revert "struct cleanup indentation in RequestedDevice Equals"
This reverts commit f4746411cab328215def6508955b160a53452da3.

Revert "struct Equals checks for identity before value checking"
This reverts commit 0767a4665ed30ab8d9586a59a74db75d51fd9226.

Revert "fix client-test, avoid hardwired platform dependency on lo0"
This reverts commit e89dbb2ab182b6368507dbcd33c3342223eb0ae7.

Revert "refactor error in client fingerprint to include the offending data"
This reverts commit a7fed726c6e0264d42a58410d840adde780a30f5.

Revert "add client updateNodeResources to merge but preserve manual config"
This reverts commit 84bd433c7e1d030193e054ec23474380ff3b9032.

Revert "refactor structs.RequestedDevice to have its own Equals"
This reverts commit 689782524090e51183474516715aa2f34908b8e6.

Revert "refactor structs.Resource.Networks to have its own Equals"
This reverts commit 49e2e6c77bb3eaa4577772b36c62205061c92fa1.

Revert "refactor structs.Resource.Devices to have its own Equals"
This reverts commit 4ede9226bb971ae42cc203560ed0029897aec2c9.

Revert "add COMPAT(0.10): Remove in 0.10 notes to impl for structs.Resources"
This reverts commit 49fbaace5298d5ccf031eb7ebec93906e1d468b5.

Revert "add structs.Resources Equals"
This reverts commit 8528a2a2a6450e4462a1d02741571b5efcb45f0b.

Revert "test that fingerprint resources are updated, net not clobbered"
This reverts commit 8ee02ddd23bafc87b9fce52b60c6026335bb722d.
2019-04-11 10:29:40 -04:00
Lang Martin 5d3596eb7e client tests assert the independent handling of interface and speed 2019-04-11 09:56:22 -04:00
Lang Martin 7258a13c72 client, structs comments 2019-04-11 09:56:22 -04:00
Lang Martin 22d87e4538 client fingerprint updateNetworks preserves the network configuration 2019-04-11 09:56:22 -04:00
Lang Martin 8fe9699e51 client_test cleanup comments from review 2019-04-11 09:56:22 -04:00
Lang Martin 63c993c8ae fix client-test, avoid hardwired platform dependency on lo0 2019-04-11 09:56:22 -04:00
Lang Martin a9db848974 refactor error in client fingerprint to include the offending data 2019-04-11 09:56:22 -04:00
Lang Martin f211500cea add client updateNodeResources to merge but preserve manual config 2019-04-11 09:56:22 -04:00
Lang Martin a4b59130d2 test that fingerprint resources are updated, net not clobbered 2019-04-11 09:56:21 -04:00
Danielle Lancashire e135876493 allocs: Add nomad alloc restart
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
2019-04-11 14:25:49 +02:00
Chris Baker 829a972693
vault client test: minor formatting
vendor: using upstream circonus-gometrics
2019-04-10 10:34:10 -05:00
Chris Baker c0a7aee610
vault e2e: pass vault version into setup instead of having to infer it from test name 2019-04-10 10:34:10 -05:00
Chris Baker f0c184fc29
taskrunner: removed some unnecessary config from a test 2019-04-10 10:34:10 -05:00
Chris Baker a26d4fe1e5
docs: -vault-namespace, VAULT_NAMESPACE, and config
agent: added VAULT_NAMESPACE env-based configuration
2019-04-10 10:34:10 -05:00
Chris Baker 170f5239c8
client: gofmt 2019-04-10 10:34:10 -05:00
Chris Baker a1d7971b2e
taskrunner: pass configured Vault namespace into TaskTemplateConfig 2019-04-10 10:34:10 -05:00
Chris Baker 0eaeef872f
config/docs: added namespace to vault config
server/client: process `namespace` config, setting on the instantiated vault client
2019-04-10 10:34:10 -05:00
Michael Schurter 45b4827ad7 Bump to 0.9.1-dev 2019-04-09 09:01:48 -07:00
Nomad Release bot e307734e4a Generate files for 0.9.0 release 2019-04-09 01:56:00 +00:00
Michael Schurter f7d4428855 client: simplify kill logic
Remove runLaunched tracking as Run is *always* called for killable
TaskRunners. TaskRunners which fail before Run can be called (during
NewTaskRunner or Restore) are not killable as they're never added to the
client's alloc map.
2019-04-04 15:18:33 -07:00
Michael Schurter 3af602b633 Remove 0.9.0-rc2 generated files 2019-04-03 07:41:09 -07:00
Nomad Release bot 16b4336ccf Generate files for 0.9.0-rc2 release 2019-04-03 01:54:29 +00:00
Michael Schurter 923cd91850
Merge pull request #5504 from hashicorp/b-exec-path
executor/linux: make chroot binary paths absolute
2019-04-02 14:09:50 -07:00
Michael Schurter 1d569a27dc Revert "executor/linux: add defensive checks to binary path"
This reverts commit cb36f4537e63d53b198c2a87d1e03880895631bd.
2019-04-02 11:17:12 -07:00
Michael Schurter fc5487dbbc executor/linux: add defensive checks to binary path 2019-04-02 09:40:53 -07:00
Michael Schurter 7d49bc4c71 executor/linux: make chroot binary paths absolute
Avoid libcontainer.Process trying to lookup the binary via $PATH as the
executor has already found where the binary is located.
2019-04-01 15:45:31 -07:00
Mahmood Ali 81f4f07ed7 rename fifo methods for clarity 2019-04-01 16:52:58 -04:00
Mahmood Ali e87afe465b clarify closeDone blocking and field name 2019-04-01 16:10:34 -04:00
Mahmood Ali 9d647713c0 no requires in a test goroutine 2019-04-01 15:38:39 -04:00
Mahmood Ali 2b1f858e1b log when fifo fails to open 2019-04-01 13:18:03 -04:00
Mahmood Ali 967452a3f0 fifo: Use plain fifo file in Unix
This PR switches to using plain fifo files instead of golang structs
managed by containerd/fifo library.

The library's main benefit is management of opening fifo files.  In Linux,
a reader `open()` request would block until a writer opens the file (and
vice-versa).  The library uses goroutines so that it's the first IO
operation that blocks.

This benefit isn't really useful for us: Given that logmon simply
streams output in a separate process, blocking of opening or first read
is effectively the same.

The library additionally complicates managing state and tracking
read/write permissions, which seems like overhead for our use,
compared to using a file directly.

Looking here, I made the following incidental changes:
* document that we handle the case where fifo files already exist, as we
rely on that behavior for logmon restarts
* use type system to lock read vs write: currently, fifo library returns
`io.ReadWriteCloser` even if fifo is opened for writing only!
2019-04-01 13:18:03 -04:00
Michael Schurter a4572919cd
Merge pull request #5456 from hashicorp/test-taskenv
tests: port pre-0.9 task env tests
2019-03-25 10:41:38 -07:00
Michael Schurter 8efad12538 tests: port pre-0.9 task env tests
I chose to make them more of integration tests since there's a lot more
plumbing involved. The internal implementation details of how we craft
task envs can now change and these tests will still properly assert the
task runtime environment is set up properly.
2019-03-25 09:46:53 -07:00
Michael Schurter 9afbc45cff Bump to dev post-0.9.0-rc1 release 2019-03-22 08:26:30 -07:00
Nomad Release bot 3ab3dd4105 Generate files for 0.9.0-rc1 release 2019-03-21 19:06:13 +00:00
Mahmood Ali b08a2744f8
Merge pull request #5428 from hashicorp/b-dropped-logs-on-task-restart
client/logmon: restart log collection correctly when a task is restarted
2019-03-21 14:02:08 -04:00
Mahmood Ali 729458f110 fix TestLogmon_Start_restart 2019-03-21 13:36:46 -04:00
Nick Ethier b252d712df
logmon: fix test assertion 2019-03-20 21:37:17 -04:00
Nick Ethier c1f5011181
logmon: remove sleeps from tests 2019-03-20 10:45:09 -04:00
Nick Ethier e14041bdec
logmon: add tests for rotation and open/closing of fifos 2019-03-19 14:41:23 -04:00
Nick Ethier dc18b8928a
logmon: make Start rpc idempotent and simplify hook 2019-03-19 14:02:36 -04:00
Nick Ethier ac7fbee1b8
logmon: add static check for logmon exited hook 2019-03-18 15:59:43 -04:00
Nick Ethier 7dc3d83634
client/logmon: restart log collection correctly when a task is restarted 2019-03-15 23:59:18 -04:00
Mahmood Ali fb55717b0c
Regenerate Proto files (#5421)
Noticed that the protobuf files are out of sync with ones generated by 1.2.0 protoc go plugin.

The cause seems to be related to release processes, e.g. [0.9.0-beta1 preparation](ecec3d38de (diff-da4da188ee496377d456025c2eab4e87)), and [0.9.0-beta3 preparation](b849d84f2f).

This restores the changes to that of the pinned protoc version and fails build if protobuf files are out of sync.  Sample failing Travis job is that of the first commit change: https://travis-ci.org/hashicorp/nomad/jobs/506285085
2019-03-14 10:56:27 -04:00
Michael Schurter b126e9eec4
Merge pull request #5386 from hashicorp/b-logmon-stop
Fix task/logmon leak after crash
2019-03-12 15:23:02 -07:00
Michael Schurter 0ba1a5251b client: cleanup and document context uses
Some of the context uses in TR hooks are useless (Killed during Stop
never seems meaningful).

None of the hooks are interruptible for graceful shutdown which is
unfortunate and probably needs fixing.
2019-03-12 15:03:54 -07:00
Mahmood Ali 8deb532be2 run TestAllocations_Stats in CI 2019-03-08 07:57:37 -05:00
Michael Schurter 32d31575cc client: emit event and call exited hooks during cleanup
Builds upon earlier commit that cleans up restored handles of terminal
allocs by also emitting terminated events and calling exited hooks when
appropriate.
2019-03-05 15:12:02 -08:00
Michael Schurter a4bc46b6e6 test: fix NewMemDB API change 2019-03-04 13:37:20 -08:00
Michael Schurter 64e145ebdb logmon: drop reattach log level as it's expected
Logged once per terminal task on agent restart.
2019-03-04 13:26:01 -08:00
Michael Schurter c5271d3fa5 client: test logmon cleanup
The test is sadly quite complicated and peeks into things (logmon's
reattach config) AR doesn't normally have access to.

However, I couldn't find another way of asserting logmon got cleaned up
without resorting to smaller unit tests. Smaller unit tests risk
re-implementing dependencies in an unrealistic way, so I opted for an
ugly integration test.
2019-03-04 13:15:15 -08:00
Preetha Appan 0e547d29ad
s/mananger/manager 2019-03-04 12:25:54 -06:00
Michael Schurter ef8d284352 client: ensure task is cleaned up when terminal
This commit is a significant change. TR.Run is now always executed, even
for terminal allocations. This was changed to allow TR.Run to cleanup
(run stop hooks) if a handle was recovered.

This is intended to handle the case of Nomad receiving a
DesiredStatus=Stop allocation update, persisting it, but crashing before
stopping AR/TR.

The commit also renames task runner hook data as it was very easy to
accidentally set state on Requests instead of Responses using the old
field names.
2019-03-01 14:00:23 -08:00
Michael Schurter 3f386e3951 Remove generated files for 0.9.0-beta3 2019-02-26 10:34:08 -08:00
Michael Schurter d74755900e Generate files for 0.9.0-beta3 release 2019-02-26 09:44:49 -08:00
Michael Schurter 812f1679e2
Merge pull request #5352 from hashicorp/b-leaked-logmon
logmon fixes
2019-02-26 08:35:46 -08:00
Michael Schurter e39a10a1f4 tests: move unix-specific test to its own file
Other logmon tests should be portable.
2019-02-26 07:56:44 -08:00
Mahmood Ali 45b6392d4e
tests: port some fingerprint tests from 0.8 (#5359)
Port some integration tests of driver fingerprinting.

Some tests (e.g. `TestFingerprintManager_Run_DriversInBlacklist`) have
been substituted by more isolated tests in
`client/pluginmanager/drivermanager/manager_test.go`
2019-02-26 10:54:16 -05:00
Michael Schurter 3b2a592e93 client: restart task on logmon failures
This code chooses to be conservative as opposed to optimal: when failing
to reattach to logmon simply return a recoverable error instead of
immediately trying to restart logmon.

The recoverable error will cause the task's restart policy to be
applied and a new logmon will be launched upon restart.

Trying to do the optimal approach of simply starting a new logmon
requires error string comparison and should be tested against a task
actively logging to assert the behavior (are writes blocked? dropped?).
2019-02-25 15:42:45 -08:00
Michael Schurter 8830b00866 client: test logmon_hook 2019-02-23 15:36:48 -08:00
Preetha Appan 43679f4ce1
More alloc runner tests ported from 0.8.7 2019-02-22 17:58:06 -06:00
Mahmood Ali 32551fb0e5 emit TaskRestartSignal event on vault restart
When Vault token expires and task is restarted, emit `TaskRestartSignal`
similar to v0.8.7
2019-02-22 15:56:14 -05:00
Mahmood Ali 8cb4bbcc08 address review comments 2019-02-22 15:56:14 -05:00
Mahmood Ali 216eaa4843 tests: port TestTaskRunner_VaultManager_Signal
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1427
2019-02-22 15:53:04 -05:00
Mahmood Ali 8e9e732319 tests: port TestTaskRunner_VaultManager_Restart
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1352
2019-02-22 15:53:04 -05:00
Mahmood Ali 33122ca7c0 tests: port TestTaskRunner_UnregisterConsul_Retries
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L620
2019-02-22 15:53:04 -05:00
Mahmood Ali 0128b0ce7a tests: port TestTaskRunner_Template_NewVaultToken
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1275
2019-02-22 15:53:04 -05:00
Mahmood Ali cfb80583af tests: port TestTaskRunner_Template_Artifact
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1195
2019-02-22 15:52:59 -05:00
Mahmood Ali 1b14214a88 tests: port TestAllocRunner_RetryArtifact
Port TestAllocRunner_RetryArtifact from https://github.com/hashicorp/nomad/blob/v0.8.7/client/alloc_runner_test.go#L610-L672

I changed the test name because it doesn't actually test that the
artifact hook is retried
2019-02-22 15:50:39 -05:00
Mahmood Ali c827e6e05a tests: port TestAllocRunner_MoveAllocDir test 2019-02-22 15:50:39 -05:00
Michael Schurter a2e3ea6dc9 logmon: fix reattach configuration
There were multiple bugs here:

1. Reattach unmarshalling always returned an error because you can't
   unmarshal into a nil pointer.
2. The hook data wasn't being saved because it was put on the request
   struct, not the response struct.
3. The plugin configuration should only have reattach *or* a command
   set. Not both.
4. Setting Done=true meant the hook was never re-run on agent restart so
   reattaching was never attempted.
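
The first bug in the list is easy to reproduce in isolation. The sketch below uses `encoding/json` purely as an assumption for illustration (the commit does not say which encoding is involved), and `reattachConfig` is a hypothetical type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// reattachConfig is a hypothetical stand-in for the persisted reattach data.
type reattachConfig struct {
	Pid  int
	Addr string
}

// decodeIntoNil mirrors the bug: the destination is a nil pointer, so
// Unmarshal has nowhere to write and always returns an error.
func decodeIntoNil(data []byte) error {
	var cfg *reattachConfig
	return json.Unmarshal(data, cfg)
}

// decode allocates the destination first, which is the usual fix.
func decode(data []byte) (*reattachConfig, error) {
	cfg := &reattachConfig{}
	if err := json.Unmarshal(data, cfg); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	data := []byte(`{"Pid": 42, "Addr": "/tmp/plugin.sock"}`)

	fmt.Println("nil destination error:", decodeIntoNil(data) != nil)

	cfg, err := decode(data)
	fmt.Println("allocated destination:", cfg.Pid, cfg.Addr, err)
}
```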
2019-02-21 15:32:18 -08:00
Michael Schurter f5e0dba9d1 fingerprint: improve initial fingerprint message
The initial fingerprint message is actually fairly useful, so I bumped
it to Debug and fixed the output formatting.
2019-02-21 15:32:18 -08:00
Michael Schurter 01cabdff88 client: restart on recoverable StartTask errors
Fixes restarting on recoverable errors from StartTask.

Ports TestTaskRunner_Run_RecoverableStartError from 0.8 which discovered
the bug.
2019-02-21 15:30:49 -08:00
Michael Schurter e3f321cd27 test: port TestTaskRunner_RestartSignalTask_NotRunning from 0.8 2019-02-21 15:30:49 -08:00
Michael Schurter f3aa945a00 test: port TestTaskRunner_DriverNetwork from 0.8 2019-02-21 15:30:49 -08:00
Michael Schurter 518405ac33
Merge pull request #5322 from hashicorp/b-artifact-retries
Fix regression by restarting on artifact download errors
2019-02-21 15:28:51 -08:00
Mahmood Ali 6d30284ec9
Merge pull request #5341 from hashicorp/ci-windows-docker
Run Docker tests in Windows AppVeyor CI
2019-02-21 13:17:33 -05:00
Michael Schurter 2553800eb8 tests: port TestAllocRunner_Destroy from 0.8
Also add destroy(ar) helper to fix a bunch of shutdown races in AR
tests.
2019-02-20 12:35:09 -08:00
Michael Schurter 6580ed668e client: don't redownload completed artifacts on retries
Track the download status of each artifact independently so that if only
one of many artifacts fails to download, completed artifacts aren't
downloaded again.
2019-02-20 08:45:12 -08:00
Michael Schurter 908bfab4c2 client: artifact errors are retry-able
0.9.0-beta2 contains a regression where artifact download errors would
not cause a task restart and instead immediately fail the task.

This restores the pre-0.9 behavior of retrying all artifact errors and
adds missing tests.
2019-02-20 07:21:27 -08:00
Michael Schurter 79ccf00b72 tests: add new task runner test helper
Adds a new helper and removes a duplicated test.
2019-02-20 07:21:27 -08:00
Mahmood Ali 33ff8c3e8d tests: expect Docker on AppVeyor
Prepare to run docker on AppVeyor Windows environment
2019-02-20 07:41:47 -05:00
Michael Schurter 159042a1a3 client: fix setting alloc unhealthy at deadline
During the 0.9 client refactor the code to fail a deployment when the
deadline was reached was broken. This restores and tests that behavior.
2019-02-19 07:44:14 -08:00
Mahmood Ali 87be233aca
test: improve readability of duration
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2019-02-14 08:12:06 -08:00
Mahmood Ali 16d3414842
test: improve failure message
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2019-02-14 08:11:37 -08:00
Michael Schurter 4814f0fb0b tests: port TestTaskRunner_Download_List from 0.8 2019-02-12 15:48:04 -08:00
Michael Schurter a152e3ef17 consul: fix task deregistration hook
Broke ShutdownDelay but the test was timing dependent so it just
appeared flaky. Made the test slower so that it should never incorrectly
pass.
2019-02-12 15:36:02 -08:00
Michael Schurter 4ad879e75e tests: port TaskRunner_DeriveToken tests from 0.8 2019-02-12 15:36:02 -08:00
Michael Schurter 6743ed9fdc tests: port TestTaskRunner_BlockForVault from 0.8
Also fix race conditions in the mock vault client.
2019-02-12 13:46:09 -08:00
Michael Schurter 6c0cc65b2e simplify hcl2 parsing helper
No need to pass in the entire eval context
2019-02-04 11:07:57 -08:00
Michael Schurter fec2752fb2 client: log when allocs have been processed
Will hopefully help us catch deadlocks/livelocks/slowdowns in the
add/remove allocs pipeline which should be fast.
2019-02-04 11:07:57 -08:00
Michael Schurter 2db91425e3 Remove 0.9.0-beta2 generated files 2019-02-01 08:28:44 -08:00
Alex Dadgar 84d0afccae Generate files for 0.9.0-beta2 2019-01-30 13:31:50 -08:00
Alex Dadgar 449e582ffc
Merge pull request #5281 from hashicorp/f-affinity-weight-int
Change types of weights on spread/affinity
2019-01-30 13:25:56 -08:00
Alex Dadgar d2e5ede119 remove generated structs 2019-01-30 12:38:34 -08:00
Nick Ethier e7ea26449e
client: fix bug during 0.8 state upgrade that causes external drivers to fail 2019-01-30 14:22:29 -05:00
Alex Dadgar bc804dda2e Nomad 0.9.0-beta1 generated code 2019-01-30 10:49:44 -08:00
Alex Dadgar 5062c54874 Fix usage of fsi variable 2019-01-29 14:07:55 -08:00
Alex Dadgar 6f418ebaf0 Always populate task dir environment variables
Fixes an issue where if a task was restarted after restarting the client,
the task dir environment variables would not be populated. This PR fixes
this for both upgrades from 0.8.X and for normal 0.9 restarts.
2019-01-29 13:17:10 -08:00
Nick Ethier bcbed3c532
Merge pull request #5248 from hashicorp/b-rawexec-leak
Fix leaked executor in raw_exec
2019-01-28 21:18:31 -05:00
Alex Dadgar 5da21635fb Fix env templates having interpolated destinations
Fixes an issue where env templates that had interpolated destinations
would not work.

Fixes https://github.com/hashicorp/nomad/issues/5250
2019-01-28 10:28:53 -08:00
Nick Ethier 8d7a47340c
drivermanager: don't store nil reattach configs 2019-01-25 23:07:04 -05:00
Alex Dadgar d6412fd8e7 Fix double restart counting for templates
This PR fixes an issue where template restarts would count twice since
it was emitting a restarting event.
2019-01-25 15:38:13 -08:00
Nick Ethier be976d9c9a
Merge branch 'master' into f-driver-upgradepath-test
* master: (23 commits)
  tests: avoid assertion in goroutine
  spell check
  ci: run checkscripts
  tests: deflake TestRktDriver_StartWaitRecoverWaitStop
  drivers/rkt: Remove unused github.com/rkt/rkt
  drivers/rkt: allow development on non-linux
  cli: Hide `nomad docker_logger` from help output
  api: test api and structs are in sync
  goimports until make check is happy
  nil check node resources to prevent panic
  tr: use context in as select statement
  move pluginutils -> helper/pluginutils
  vet
  goimports
  gofmt
  Split hclspec
  move hclutils
  Driver tests do not use hcl2/hcl, hclspec, or hclutils
  move reattach config
  loader and singleton
  ...
2019-01-23 21:01:24 -05:00
Nick Ethier 5b9013528e
drivers: add docker upgrade path and e2e test 2019-01-23 14:44:42 -05:00
Nick Ethier a36c4320ff
Merge pull request #5227 from hashicorp/b-client-highcpu-usage
Fix bug related to high cpu usage
2019-01-23 14:27:51 -05:00
Michael Schurter 13f061a83f
Merge pull request #5196 from hashicorp/f-plugin-utils
Make plugins/shared external and make pluginutls/
2019-01-23 06:59:32 -08:00
Preetha 05bf183ba3
Merge pull request #5225 from hashicorp/b-notaskevent-terminalallocs
Don't emit task events after alloc is in a terminal DesiredState
2019-01-23 08:54:10 -06:00
Michael Schurter 32daa7b47b goimports until make check is happy 2019-01-23 06:27:14 -08:00
Nick Ethier bcc3935228
tr: use context in as select statement 2019-01-22 20:11:39 -05:00
Michael Schurter be0bab7c3f move pluginutils -> helper/pluginutils
I wanted a different color bikeshed, so I get to paint it
2019-01-22 15:50:08 -08:00
Alex Dadgar 4bdccab550 goimports 2019-01-22 15:44:31 -08:00
Alex Dadgar b7a65676fe gofmt 2019-01-22 15:43:34 -08:00
Alex Dadgar 2ca0e97361 Split hclspec 2019-01-22 15:43:34 -08:00
Alex Dadgar 5ca6dd7988 move hclutils 2019-01-22 15:43:34 -08:00
Alex Dadgar 72a5691897 Driver tests do not use hcl2/hcl, hclspec, or hclutils 2019-01-22 15:43:34 -08:00
Alex Dadgar b2c7268843 move reattach config 2019-01-22 15:11:58 -08:00
Alex Dadgar cdcd3c929c loader and singleton 2019-01-22 15:11:57 -08:00
Alex Dadgar 6c2782f037 move catalog + grpcutils 2019-01-22 15:11:57 -08:00
Preetha Appan 38422642cb
Use DesiredState to determine whether to stop sending task events 2019-01-22 16:43:32 -06:00
Preetha Appan 862c9b7de5
dont emit events for terminal allocs 2019-01-22 16:26:33 -06:00
Michael Schurter 1fa376cac6
Merge pull request #5211 from hashicorp/test-porting-08
Port some 0.8 TaskRunner tests
2019-01-22 14:05:53 -08:00
Michael Schurter 8ced0adb67 test: port TestTaskRunner_CheckWatcher_Restart
Added ability to adjust the number of events the TaskRunner keeps as
there's no way to observe all events otherwise.

Task events differ slightly from 0.8 because 0.9 emits Terminated every
time a task exits instead of only when it exits on its own (not due to
restart or kill).

0.9 does not emit Killing/Killed for restarts like 0.8 which seems fine
as `Restart Signaled/Terminated/Restarting` is more descriptive.

Original v0.8 events emitted:
```
	expected := []string{
		"Received",
		"Task Setup",
		"Started",
		"Restart Signaled",
		"Killing",
		"Killed",
		"Restarting",
		"Started",
		"Restart Signaled",
		"Killing",
		"Killed",
		"Restarting",
		"Started",
		"Restart Signaled",
		"Killing",
		"Killed",
		"Not Restarting",
	}
```
2019-01-22 09:46:46 -08:00
Michael Schurter 1719752a9d test: port RestartTask from 0.8 2019-01-22 08:08:08 -08:00
Michael Schurter 9edff19625 test: port SignalFailure test from 0.8
Also fix signal error handling in mock_driver.
2019-01-22 08:08:08 -08:00
Preetha Appan 299a5fc821
Rename TaskKillRequest/Response to TaskPreKillRequest/Response 2019-01-22 09:54:02 -06:00
Preetha Appan 5a5b9c5666
Fix log comments 2019-01-22 09:45:58 -06:00
Preetha Appan 06e15f8381
Rename TaskKillHook to TaskPreKillHook to more closely match usage
Also added/fixed comments
2019-01-22 09:41:56 -06:00
Michael Schurter 3b02af9386
Fix comment
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-01-22 09:41:21 -06:00
Preetha Appan 09291c689b
Rename TaskKillHook to TaskPreKillHook to more closely match usage
Also added/fixed comments
2019-01-22 09:41:21 -06:00
Mahmood Ali a9b73e6b86
Merge pull request #5216 from hashicorp/b-fix-tests-20180118
tests: deflake client TestFS_Logs_TaskPending test
2019-01-21 09:54:15 -05:00
Mahmood Ali d19ba5bd8e tests: deflake client TestFS_Logs_TaskPending test 2019-01-18 21:26:48 -05:00
Nick Ethier 47127de671
ar: return error from hooks if one occurred 2019-01-18 18:31:02 -05:00
Nick Ethier e3c6f89b9a
drivers: use consts for task handle version 2019-01-18 18:31:01 -05:00
Nick Ethier 6804450c69
cleanup code comments and small fixes from refactor 2019-01-18 18:31:01 -05:00
Nick Ethier 05bd369d1f
driver: add pre09 migration logic 2019-01-18 18:31:01 -05:00
Mahmood Ali 5df63fda7c
Merge pull request #5190 from hashicorp/f-memory-usage
Track Basic Memory Usage as reported by cgroups
2019-01-18 16:46:02 -05:00
Chris Baker 290c3f36ad set TaskGroupName in task_runner 2019-01-18 20:25:11 +00:00
Chris Baker 8917961caa documenting test for task runner failure to set TaskGroupName 2019-01-18 20:00:49 +00:00
Michael Schurter cfadacfd95
Merge pull request #5203 from hashicorp/b-terminated
client: restore Terminated event on every exit
2019-01-18 08:54:15 -08:00
Danielle Tomlinson bf21612e2b
Merge pull request #5174 from hashicorp/dani/windows
Some Windows fixes and CI
2019-01-18 11:21:53 +01:00
Preetha Appan e0b68a19c6
Fix one more place that should be using taskResources
taskResources handles new resource fields in a backwards compatible way
2019-01-17 15:52:51 -06:00
Michael Schurter a20ac7c1de client: restore Terminated event on every exit
v0.9.0-dev started emitting a Terminated event every time a task process
exited. While this wasn't true in previous versions, it's a useful task
event because it's the only place for job operators to view the task's
exit code.

This behavior is asserted in the e2e/taskevents tests.
2019-01-17 10:02:25 -08:00
Danielle Tomlinson 11c733faa8 allocwatcher: Stat_t is unavailable on win 2019-01-17 18:43:14 +01:00
Danielle Tomlinson 62e06eda56 chore: Cleanup formatting 2019-01-17 18:43:13 +01:00
Danielle Tomlinson 580b8c5dda client/fs: Skip delete-while-streaming test on win 2019-01-17 18:43:13 +01:00
Danielle Tomlinson 4dbddd0620 client/fs: windows error message for not found 2019-01-17 18:43:13 +01:00
Danielle Tomlinson 915bab2365 vaultclient: use require for error assertions 2019-01-17 18:43:13 +01:00
Danielle Tomlinson dc55d3e353 vaultclient: Update tests for vault 1.0 2019-01-17 18:43:13 +01:00
Danielle Tomlinson 7a5d511349 fingerprinter: Use HCLogger for windows 2019-01-17 18:43:13 +01:00
Danielle Tomlinson a695b3562c
Merge pull request #5193 from hashicorp/dani/logmon-reattach
logmon: Reattach to existing loggers
2019-01-16 17:34:13 +01:00
Danielle Tomlinson 99da4c780d logmon: Reattach to existing loggers
This commit prevents us from creating duplicate logmon hooks when
restoring allocations by persisting the logmon reattach config using
HookData.
2019-01-16 14:56:10 +01:00
Michael Schurter daa7d029a1 test: porting TestTaskRunner_SimpleRun_Dispatch
Porting test from 0.8 to 0.9.
2019-01-15 15:22:13 -08:00
Michael Schurter 48afda786b
Merge pull request #5187 from hashicorp/test-consul
Port a bunch of pre-0.9 Consul tests to 0.9
2019-01-15 07:41:50 -08:00
Alex Dadgar 471fdb3ccf
Merge pull request #5173 from hashicorp/b-log-levels
Plugins use parent loggers
2019-01-14 16:14:30 -08:00
Mahmood Ali 9909d98bee Track Basic Memory Usage as reported by cgroups
Track current memory usage, `memory.usage_in_bytes`, in addition to
`memory.max_memory_usage_in_bytes` and friends.  This number is closer
to what Docker reports.

Related to https://github.com/hashicorp/nomad/issues/5165 .
2019-01-14 18:47:52 -05:00
Nick Ethier c619e70d39
Merge pull request #5018 from hashicorp/f-executor-stats
executor: streaming stats api
2019-01-14 15:02:35 -05:00
Michael Schurter 4e7ea460e8 test: port some pre-0.9 DeploymentHealth tests
Skipping a failing one as I need to move to some other work and don't
want to leave this work orphaned on my machine.
2019-01-14 09:56:53 -08:00
Michael Schurter ff2f23f5f9 test: assert service interpolation behavior
Ported from pre-0.9 tests.
2019-01-14 09:56:53 -08:00
Michael Schurter 5746be5844 test: add some extra logging 2019-01-14 09:56:53 -08:00
Michael Schurter e877bb6370 test: assert shutdown delay deregs first
Restore a pre-0.9 test that asserts Consul services are deregistered
before a task's shutdown delay.
2019-01-14 09:56:53 -08:00
Michael Schurter 1ca858fa92
Update client/allocrunner/taskrunner/stats_hook.go
Co-Authored-By: nickethier <ncethier@gmail.com>
2019-01-14 12:31:27 -05:00
Nick Ethier fbd403df96
tr: stop stats collection on Exited hook 2019-01-14 12:30:14 -05:00
Nick Ethier 597b7b751d
tr: add retry w/ backoff to stats_hook failure 2019-01-12 12:18:24 -05:00
Nick Ethier 7e306afde3
executor: fix failing stats related test 2019-01-12 12:18:23 -05:00
Nick Ethier 9fea54e0dc
executor: implement streaming stats API
plugins/driver: update driver interface to support streaming stats

client/tr: use streaming stats api

TODO:
 * how to handle errors and closed channel during stats streaming
 * prevent tight loop if Stats(ctx) returns an error

drivers: update drivers TaskStats RPC to handle streaming results

executor: better error handling in stats rpc

docker: better control and error handling of stats rpc

driver: allow stats to return a recoverable error
2019-01-12 12:18:22 -05:00
Preetha Appan 9e8dbf6a4b
linting fixes 2019-01-12 10:38:20 -06:00
Preetha Appan c94179578d
Make unit test for allocrunner failure much nicer 2019-01-12 10:38:20 -06:00
Preetha Appan da0d083b03
Add unit test to simulate alloc runner creation failure 2019-01-12 10:38:20 -06:00
Preetha Appan e7b59ac08c
Only set deployment health if not already set 2019-01-12 10:38:20 -06:00
Michael Schurter dbf4c3a3c8
Apply suggestions from code review
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-01-12 10:38:20 -06:00
Preetha Appan 7bd1440710
Refactor statedb factory config to set it directly in client config 2019-01-12 10:38:20 -06:00
Preetha Appan e237f19b38
Remove invalid allocs 2019-01-12 10:38:20 -06:00
Preetha Appan f059ef8a47
Modified destroy failure handling to rely on allocrunner's destroy method
Added a unit test with custom statedb implementation that errors, to
use to verify destroy errors
2019-01-12 10:37:12 -06:00
Preetha Appan 6c95da8f67
Add back code to mark alloc as failed when restore fails
Also modify restore such that any handled errors don't propagate
back to the client
2019-01-12 10:37:12 -06:00
Preetha Appan 5fde0b0f5c
Revert code that made an alloc update when restore fails
Restore currently shuts down the client so the alloc update can't
always make it to the server
2019-01-12 10:37:12 -06:00
Preetha Appan 41bfdd764b
Handle client initialization errors when adding allocs or restoring allocs
We mark the alloc as failed and track failed allocs so that we don't send
updates after the first time
2019-01-12 10:37:12 -06:00
Alex Dadgar 14ed757a56 Plugins use parent loggers
This PR fixes various instances of plugins being launched without using
the parent loggers. This meant that logs would not all go to the same
output, break formatting etc.
2019-01-11 11:36:37 -08:00
Danielle Tomlinson 3e586e93da client: Cleanup allocrunner access 2019-01-11 18:39:18 +01:00
Mahmood Ali c3eaa0f4c8 tests: enable and fix tests requiring mock driver 2019-01-10 10:10:11 -05:00
Alex Dadgar bd12e0b1f7
Merge pull request #5168 from hashicorp/b-kill-race
Improve Kill handling on task runner
2019-01-09 12:05:10 -08:00
Alex Dadgar 069e181e8f add more comments 2019-01-09 12:04:22 -08:00
Michael Schurter e5ddff861c
Spelling fix
Co-Authored-By: dadgar <alex@hashicorp.com>
2019-01-09 11:42:40 -08:00
Mahmood Ali 90f3cea187
Merge pull request #5157 from hashicorp/r-drivers-no-cstructs
drivers: avoid referencing client/structs package
2019-01-09 13:06:46 -05:00
Mahmood Ali ff48dbb8a9
Merge pull request #5163 from hashicorp/r-minor-changes-20180108
Fix a panic on node.Deregister fail
2019-01-09 09:56:00 -05:00
Mahmood Ali 1f2473263e fix more cases of logging arity errors 2019-01-09 09:22:47 -05:00
Mahmood Ali 4952f2a182
Merge pull request #5159 from hashicorp/r-macos-tests
Fix Travis MacOS job
2019-01-09 08:22:30 -05:00
Alex Dadgar 149dec2169 Improve Kill handling on task runner
This PR improves how killing a task is handled. Previously, the kill function
directly orchestrated the killing and was only valid while the task was
running. The new behavior is to mark the desired state and wait for the
task runner to converge to that state.
2019-01-08 16:42:26 -08:00
Mahmood Ali 9f7eb1bdfa tests: fix a test job constraints failing on macOS
Allow scheduling mock job when running on MacOS (or Windows) hosts.
2019-01-08 12:37:42 -05:00
Michael Schurter c24f4f94c1
Merge pull request #5151 from hashicorp/b-task-events
Emit Killing task events and add e2e tests
2019-01-08 09:33:04 -08:00
Mahmood Ali 6d36b52412 run gofmt 2019-01-08 11:15:38 -05:00
Michael Schurter 92f9cda5f4
Merge pull request #5035 from hashicorp/test-client
test: re-enable periodic fingerprint test
2019-01-08 07:37:39 -08:00
Michael Schurter 5925424c7c client: emit Killing/Killed task events
We were just emitting Killed/Terminated events before. In v0.8 we
emitted Killing/Killed, but lacked Terminated when explicitly stopping
a task. This change makes it so Terminated is always included, whether
explicitly stopping a task or it exiting on its own.

New output:

2019-01-04T14:58:51-08:00  Killed            Task successfully killed
2019-01-04T14:58:51-08:00  Terminated        Exit Code: 130, Signal: 2
2019-01-04T14:58:51-08:00  Killing           Sent interrupt
2019-01-04T14:58:51-08:00  Leader Task Dead  Leader Task in Group dead
2019-01-04T14:58:49-08:00  Started           Task started by client
2019-01-04T14:58:49-08:00  Task Setup        Building Task Directory
2019-01-04T14:58:49-08:00  Received          Task received by client

Old (v0.8.6) output:

2019-01-04T22:14:54Z  Killed            Task successfully killed
2019-01-04T22:14:54Z  Killing           Sent interrupt. Waiting 5s before force killing
2019-01-04T22:14:54Z  Leader Task Dead  Leader Task in Group dead
2019-01-04T22:14:53Z  Started           Task started by client
2019-01-04T22:14:53Z  Task Setup        Building Task Directory
2019-01-04T22:14:53Z  Received          Task received by client
2019-01-08 07:20:54 -08:00
Michael Schurter 324e989327
Merge pull request #5034 from hashicorp/test-fix-races
Test fix races
2019-01-08 07:04:09 -08:00
Mahmood Ali 916a40bb9e move cstructs.DeviceNetwork to drivers pkg 2019-01-08 09:11:47 -05:00
Mahmood Ali 9369b123de use drivers.FSIsolation 2019-01-08 09:11:47 -05:00
Mahmood Ali c10a8fd7fe remove deprecated allocrunner 2019-01-08 09:11:47 -05:00
Mahmood Ali f475a56087 remove always false parameter
Simplify allocDir.Build() function to avoid depending on client/structs,
and remove a parameter that's always set to `false`.

The motivation here is to avoid a dependency cycle between
drivers/cstructs and alloc_dir.
2019-01-08 09:11:47 -05:00
Danielle Tomlinson 8df20f49f7 drivers: Add internal interface for Shutdown
This allows us to correctly terminate internal state during runs of the
nomad test suite, e.g closing eventer contexts correctly.
2019-01-08 13:48:49 +01:00
Alex Dadgar edf132758d
Merge pull request #5152 from hashicorp/f-recover
Task runner recovers from external plugin exiting
2019-01-07 15:27:33 -08:00
Alex Dadgar 0106f23aaa Review comments 2019-01-07 14:50:28 -08:00
Alex Dadgar 79cfe26021 vet 2019-01-07 14:49:41 -08:00
Alex Dadgar 8a35d7b1dd Test recovery 2019-01-07 14:49:41 -08:00
Alex Dadgar f40f8ce02e Mock driver has recovery, stats 2019-01-07 14:49:40 -08:00
Alex Dadgar 3f24e4d6ca comments 2019-01-07 14:49:40 -08:00
Alex Dadgar 44dca19012 Fix hooks 2019-01-07 14:49:40 -08:00
Alex Dadgar c9825a9c36 recover 2019-01-07 14:49:40 -08:00
Alex Dadgar c3f05f2476 Don't log event error on driver shutdown 2019-01-07 14:49:40 -08:00
Michael Schurter d686ad51fb
Merge pull request #5043 from hashicorp/b-taskenv-conflicts
taskenv: have maps take precedence over primitives
2019-01-07 12:34:48 -08:00
Mahmood Ali 0ba7b0c132 tests: helper function for checking docker presence 2019-01-07 08:27:06 -05:00
Mahmood Ali cd3c6cf60b taskrunner: emit TaskReceived event
Preserve pre-0.9 behavior, where the task runner emits a `Received: Task
received by client` event on task runner creation.
2019-01-04 14:32:29 -05:00
Michael Schurter 875e231511
Merge pull request #5038 from hashicorp/b-drivermanager-tests
WIP: fix failing tests caused by async driver manager
2019-01-03 12:32:18 -08:00
Danielle Tomlinson 35a4790740
Merge pull request #5142 from hashicorp/dani/cleanup-allocrunner-logs
allocrunner: Standardised discard logs
2019-01-03 18:40:48 +01:00
Preetha 8078cb79f0
Merge pull request #5140 from hashicorp/dani/b-taskrunner
taskrunner: Persist environment from hooks
2019-01-03 09:30:52 -06:00
Danielle Tomlinson 29196ca70e allocrunner: Standardised discard logs
Follow up from https://github.com/hashicorp/nomad/pull/5007#pullrequestreview-186739124
2019-01-03 14:04:31 +01:00
Danielle Tomlinson 1c8baf7db7 chore: Fix environement->environment typo 2019-01-03 13:31:30 +01:00
Danielle Tomlinson 28aa34ea78 taskrunner: Persist environment from hooks
https://github.com/hashicorp/nomad/pull/5032 introduced a regression
where the origHookState was used in place of the response from the hook.
2019-01-03 13:13:57 +01:00
Alex Dadgar d7d32c2f61
Merge pull request #5032 from hashicorp/f-driver-env
Store device envs separately and pass to drivers
2018-12-20 13:38:27 -08:00
Michael Schurter e47a3ceed6 taskenv: have maps take precedence over primitives
**The Bug:**

You may have seen log lines like this when running 0.9.0-dev:

```
... client.alloc_runner.task_runner: some environment variables not available for rendering: ... keys="attr.driver.docker.volumes.enabled, attr.driver.docker.version, attr.driver.docker.bridge_ip, attr.driver.qemu.version"
```

Not only should we not be erroring on builtin driver attributes, but the
results were nondeterministic due to map iteration order!

The root cause is that we have an old root attribute for all drivers
like:

```
attr.driver.docker = "1"
```

When attributes were opaque variable names it was fine to also have
"nested" attributes like:

```
attr.driver.docker.version = "1.2.3"
```

However in the HCLv2 world the variable names are no longer opaque: they
form an object tree. The `docker` object can no longer both hold a value
(`"1"`) *and* nested attributes (`version = "1.2.3"`).

**The Fix:**

Since the old `attr.driver.<name> = "1"` attributes are useless for task
config interpolation, create a new precedence rule for creating the task
config evaluation context:

*Maps take precedence over primitives.*

This means `attr.driver.docker.version` will always take precedence over
`attr.driver.docker`. The results are deterministic and give users access
to the more useful metadata.

I made this a general precedence rule instead of special-casing driver
attrs because it seemed like better default behavior than spamming
WARNings to logs that were likely unactionable by users.
2018-12-20 11:37:46 -08:00
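The "maps take precedence over primitives" rule from the taskenv commit above can be sketched roughly as follows. This is a minimal illustration and not Nomad's actual taskenv code: `buildTree` and `lookup` are hypothetical helpers, assumed here only to show how dotted attribute names could be expanded into an object tree where a map wins over a primitive at the same path, whatever the map iteration order.

```go
package main

import (
	"fmt"
	"strings"
)

// buildTree expands flat dotted keys into a nested object tree.
// When a path would hold both a primitive and a map, the map wins,
// so the result does not depend on map iteration order.
// (Hypothetical helper, not part of Nomad.)
func buildTree(flat map[string]string) map[string]interface{} {
	root := map[string]interface{}{}
	for key, val := range flat {
		parts := strings.Split(key, ".")
		node := root
		for _, p := range parts[:len(parts)-1] {
			child, ok := node[p].(map[string]interface{})
			if !ok {
				// Missing or primitive: put a map here instead.
				child = map[string]interface{}{}
				node[p] = child
			}
			node = child
		}
		leaf := parts[len(parts)-1]
		if _, isMap := node[leaf].(map[string]interface{}); isMap {
			continue // a map already lives here; drop the primitive
		}
		node[leaf] = val
	}
	return root
}

// lookup walks the tree along a dotted path.
func lookup(tree map[string]interface{}, path string) (interface{}, bool) {
	var cur interface{} = tree
	for _, p := range strings.Split(path, ".") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, false
		}
		if cur, ok = m[p]; !ok {
			return nil, false
		}
	}
	return cur, true
}

func main() {
	tree := buildTree(map[string]string{
		"attr.driver.docker":         "1",
		"attr.driver.docker.version": "1.2.3",
	})
	v, ok := lookup(tree, "attr.driver.docker.version")
	fmt.Println(v, ok)
}
```

In this sketch the old `attr.driver.docker = "1"` value is silently dropped once any nested attribute exists, which mirrors the behavior the commit message describes.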
Nick Ethier a96afb6c91
fix tests that fail as a result of async client startup 2018-12-20 00:53:44 -05:00
Nick Ethier 6c43ccf628
client: add proper build flag to allocrunner testing.go 2018-12-19 20:22:07 -05:00
Michael Schurter 0a0fb6f86d test: re-enable periodic fingerprint test 2018-12-19 17:08:24 -08:00
Michael Schurter add2dd8c2d test: copy AR's Alloc before mutating
Fixes a race in client tests
2018-12-19 15:48:02 -08:00
Michael Schurter 17ed3f27ae drivermgr: fix race in building driver list 2018-12-19 15:48:02 -08:00
Michael Schurter 4448f19413
Merge pull request #5030 from hashicorp/test-client-statusupdate
client: assert alloc status updates work
2018-12-19 14:55:34 -08:00
Alex Dadgar 9d34802f7a Store device envs separately and pass to drivers 2018-12-19 14:23:09 -08:00
Michael Schurter 951100af16 client: assert alloc status updates work
Re-enabling and updating an old test. Able to cut out a ton of extra
work by using WaitForRunning which does almost everything this test
needs.
2018-12-19 11:41:53 -08:00
Michael Schurter ee23bdafbc client/state: missing deploy status isn't an error
Fixes TestClient_SaveRestoreState
2018-12-19 10:39:27 -08:00
Michael Schurter c84998e996 tests: implement HasHealth for mock health 2018-12-19 10:39:27 -08:00
Michael Schurter ba1ddd2238 gofmt -s -w upgrade_int_test.go 2018-12-19 10:39:27 -08:00
Michael Schurter 337d07fdd8 client/state: improve upgradeTaskBucket error handling
And add a test
2018-12-19 10:39:27 -08:00
Michael Schurter c5ddcb6a15 client/state: add context to errors
Unfortunately I don't know how to test these errors. As far as I can
tell they should only happen if there was a programming error in the
upgrade code or the underlying boltdb was corrupted somehow.
2018-12-19 10:39:27 -08:00
Michael Schurter 99bd5b3422 client/state: use 2 as version; test error path 2018-12-19 10:39:27 -08:00
Michael Schurter d9ea8252a7 client/state: support upgrading from 0.8->0.9
Also persist and load DeploymentStatus to avoid rechecking health after
client restarts.
2018-12-19 10:39:27 -08:00
Michael Schurter 0018b2f659 client/state: reorg state buckets to ease transition
* Prefix task bucket with task- to prevent name conflicts
* Shorten device manager bucket name
* Remove commented out outdated var
* Update layout comment
2018-12-19 10:22:28 -08:00
Michael Schurter 461599ff20 tr: fix HookState Copy() and Equal() methods
They did not take into account the Env field.
2018-12-19 09:58:06 -08:00
Danielle Tomlinson c580512d32 allocrunner: Close updates routine correctly 2018-12-19 18:32:51 +01:00
Nick Ethier 969ec51730
devicemanager: fix devicemanager tests 2018-12-19 00:35:12 -05:00
Nick Ethier 6f1777284d
drivermanager: use correct plugin config types 2018-12-18 23:07:01 -05:00
Nick Ethier a02308ee6a
drivermanager: attempt to reattach and shutdown driver plugin if blocked by allow/block lists 2018-12-18 23:01:57 -05:00
Nick Ethier ce1a5cba0e
drivermanager: use allocID and task name to route task events 2018-12-18 23:01:51 -05:00
Nick Ethier bda32f9c79
client/pluginmanager: add plugin manager interface to device/driver managers 2018-12-18 22:56:23 -05:00
Nick Ethier d8a0265e68
client: batch initial fingerprinting in plugin managers
drivermanager: fix pr comments/feedback
2018-12-18 22:56:19 -05:00
Nick Ethier 7d23cbf448
client/drivermanager: fixup issues from rebase and address PR comments 2018-12-18 22:55:38 -05:00
Nick Ethier 1543335710
tr: deregister task handler on cleanup 2018-12-18 22:55:38 -05:00
Nick Ethier 82175d1328
client/drivermanager: add driver manager
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incoming fingerprint and task events from each driver. The manager
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.

Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint manager and has been removed.
2018-12-18 22:55:18 -05:00
Alex Dadgar 730a6f5b9a lint 2018-12-18 16:48:00 -08:00
Alex Dadgar 4c57d2ec4d Add plugin API versioning to plugin loader and plugins 2018-12-18 16:48:00 -08:00
Alex Dadgar 9d1403d617
Merge pull request #5002 from hashicorp/b-task-config-resources
Convert driver resource to AllocatedTaskResource
2018-12-18 16:46:34 -08:00
Danielle Tomlinson 0edc65631a
Merge pull request #5007 from hashicorp/dani/f-allocrunner-async
allocrunner: Async api for shutdown/destroy/update
2018-12-19 01:26:41 +01:00
Alex Dadgar 8efac7ec81 Fix unit tests + upgrade pathing resources 2018-12-18 15:50:44 -08:00
Alex Dadgar b8268d9a46 Lint 2018-12-18 15:50:44 -08:00
Alex Dadgar 66cf3156b2 LinuxResources doesn't use task.Resources 2018-12-18 15:50:44 -08:00
Alex Dadgar 327b551b39 Drivers 2018-12-18 15:50:11 -08:00
Alex Dadgar b653ae2af7 utilities 2018-12-18 15:48:52 -08:00
Danielle Tomlinson 95a0c4fb29 taskrunner: Use a random suffix for Task Config
The RestartCount is not really suitable for use as a source of
uniqueness within task invocations as it is not monotonic, and interacts
with the restart stanza in a user's config, so conflates restarts due to
task failures, with restarts due to environmental changes, such as consul
template or vault secrets changing.

Here we instead use a substring from a uuid, which is more random than
we strictly need, but is nicer than rolling our own random string
generator here.
2018-12-19 00:38:54 +01:00
Danielle Tomlinson 1be0170ebe client: Update tests for async destroy 2018-12-18 23:38:34 +01:00
Danielle Tomlinson d6eb084d8a allocrunner: Drop and log updates after closing waitCh 2018-12-18 23:38:34 +01:00
Danielle Tomlinson 0d91285cd6 allocrunner: Documentation for ShutdownCh/DestroyCh 2018-12-18 23:38:34 +01:00
Danielle Tomlinson f2bb13818e fixup: Log when we detect out of order updates 2018-12-18 23:38:33 +01:00
Danielle Tomlinson 986fde0f5a allocrunner: Handle updates asynchronously
This creates a new buffered channel and goroutine on the allocrunner for
serializing updates to allocations. This allows us to take updates off
the routine that is used for processing updates from the server,
without having complicated machinery for tracking update lifetimes, or
other external synchronization.

This results in a nice performance improvement and significantly better
throughput on batch changes such as preempting a large number of jobs
for a larger placement.
2018-12-18 23:38:33 +01:00
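The buffered-channel-plus-goroutine approach described in the "Handle updates asynchronously" commit above might look something like the sketch below. All names (`runner`, `allocUpdate`, `applyInOrder`) are hypothetical and chosen for illustration; the real allocrunner carries far more state, and this only shows how a single goroutine can serialize updates without extra locking.

```go
package main

import "fmt"

// allocUpdate is a hypothetical stand-in for an allocation update.
type allocUpdate struct{ index int }

// runner serializes updates through one goroutine, so callers never
// apply changes themselves and need no external synchronization.
type runner struct {
	updateCh chan allocUpdate
	doneCh   chan struct{}
	applied  []int // written only by the update goroutine
}

func newRunner(buffer int) *runner {
	r := &runner{
		updateCh: make(chan allocUpdate, buffer),
		doneCh:   make(chan struct{}),
	}
	go r.handleUpdates()
	return r
}

// handleUpdates applies updates one at a time, in arrival order.
func (r *runner) handleUpdates() {
	defer close(r.doneCh)
	for u := range r.updateCh {
		r.applied = append(r.applied, u.index)
	}
}

// Update enqueues an update and returns; it blocks only when the
// buffer is full.
func (r *runner) Update(u allocUpdate) { r.updateCh <- u }

// Shutdown stops accepting updates, waits for the queue to drain, and
// returns the indexes that were applied.
func (r *runner) Shutdown() []int {
	close(r.updateCh)
	<-r.doneCh
	return r.applied
}

// applyInOrder pushes n updates through a runner and reports the order
// in which they were applied.
func applyInOrder(n int) []int {
	r := newRunner(n)
	for i := 1; i <= n; i++ {
		r.Update(allocUpdate{index: i})
	}
	return r.Shutdown()
}

func main() {
	fmt.Println(applyInOrder(5))
}
```

The caller (here, the loop in `applyInOrder`) is freed as soon as the update is queued, which is the property the commit credits for better throughput on batch changes.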
Danielle Tomlinson f3fa9d1406 gc: Wait for allocrunners to be destroyed 2018-12-18 23:38:33 +01:00
Danielle Tomlinson cb78a90f40 client: Async API for shutdown/destroy allocrunners 2018-12-18 23:38:33 +01:00
Danielle Tomlinson d1fbac1aad allocrunner: Async shutdown and destroy
This commit reduces the locking required to shutdown or destroy
allocrunners, and allows parallel shutdown and destroy of allocrunners during
shutdown.
2018-12-18 23:38:33 +01:00
Danielle Tomlinson d9174d8dcf
Merge pull request #4989 from hashicorp/dani/b-client-update-race-condition
client: Give a copy of clientconfig to allocrunner
2018-12-17 10:49:46 +01:00
Danielle Tomlinson 53aa1bc198
Merge pull request #5004 from hashicorp/dani/f-hook-errors
client: Emit TaskEvents when task hooks fail
2018-12-17 10:42:57 +01:00
Danielle Tomlinson a50ea29da4 taskrunner: Use hook errors for artifacts 2018-12-17 10:39:38 +01:00
Mahmood Ali 2d2c562e18 Remove implicit check
I intended to remove this line in 29ef7ecf2372f980d12a9900e1b2a351568dd415 - see my notes there for details.
2018-12-16 09:14:26 -05:00
Mahmood Ali d58e38e912 tests: avoid implicitly asserting clean shutdown
The assertion here is causing many spurious failures that aren't
actually relevant to the test itself.

We are tracking the cause for this failure independently, and it would
make more sense to have a dedicated test for clean shutdown.
2018-12-15 15:30:09 -05:00
Danielle Tomlinson 3647b701a6 taskrunner: Emit task events when a hook fails 2018-12-13 18:20:18 +01:00
Danielle Tomlinson 8b06e8d297
Merge pull request #4990 from hashicorp/dani/b-alloc-lock
client: updateAlloc release lock after read
2018-12-13 12:43:59 +01:00
Danielle Tomlinson 3823599da9 client: Give a copy of clientconfig to allocrunner
Currently, there is a race condition between creating a taskrunner, and
updating node attributes via fingerprinting.

This is because the taskenv builder will try to iterate over the
clientconfig.Node.Attributes map, which can be concurrently updated by
the fingerprinting process, thus causing a panic.

This fixes that by providing a copy of the clientconfig to the
allocrunner inside the Read lock during config creation.
2018-12-13 12:42:15 +01:00
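The fix in the clientconfig commit above relies on handing out a copy taken under a read lock, so that later fingerprint updates cannot mutate a map another goroutine is iterating. A minimal sketch, assuming hypothetical `client` and `config` types that only loosely resemble Nomad's:

```go
package main

import (
	"fmt"
	"sync"
)

// config is a hypothetical, trimmed-down client configuration.
type config struct {
	Attributes map[string]string
}

type client struct {
	mu  sync.RWMutex
	cfg *config
}

func newClient(attrs map[string]string) *client {
	return &client{cfg: &config{Attributes: attrs}}
}

// configCopy returns a deep copy of the config, taken while holding
// the read lock, so the caller can iterate it without racing against
// later updates.
func (c *client) configCopy() *config {
	c.mu.RLock()
	defer c.mu.RUnlock()
	cp := &config{Attributes: make(map[string]string, len(c.cfg.Attributes))}
	for k, v := range c.cfg.Attributes {
		cp.Attributes[k] = v
	}
	return cp
}

// setAttr models a fingerprinter updating node attributes.
func (c *client) setAttr(k, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cfg.Attributes[k] = v
}

func main() {
	c := newClient(map[string]string{"kernel.name": "linux"})
	snap := c.configCopy()
	c.setAttr("driver.docker", "1")
	fmt.Println(len(snap.Attributes)) // the snapshot is unaffected
}
```

The trade-off is that the snapshot can go stale; that is acceptable here because the goal is only to avoid concurrent map access while building the task environment.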
Alex Dadgar 20c59df8b9
Merge pull request #4969 from hashicorp/f-alloc-hooks
Make alloc health watcher a postrun hook rather than shutdown hook
2018-12-12 14:34:36 -08:00
Danielle Tomlinson 4184eadaf4 client: updateAlloc release lock after read
The allocLock is used to synchronize access to the alloc runner map, not
to ensure internal consistency of the alloc runners themselves. This
updates the updateAlloc process to avoid hanging on to an exclusive lock
of the map while applying changes to allocrunners themselves, as they
should be internally consistent.

This fixes a bug where any client allocation api will block during the
shutdown or updating of an allocrunner and its child taskrunners.
2018-12-12 16:30:01 +01:00
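The locking change in the updateAlloc commit above can be illustrated with the sketch below: the lock guards only the map lookup and is released before the runner is touched. The types and the `numAllocs` callback are assumptions made for the example, not Nomad's API; the callback stands in for any client allocation API that needs the same lock.

```go
package main

import (
	"fmt"
	"sync"
)

// allocRunner is a hypothetical runner; it is assumed to keep its own
// state consistent without help from the client's map lock.
type allocRunner struct {
	lastIndex   int
	seenRunners int
}

type client struct {
	allocLock sync.RWMutex // guards the allocs map only
	allocs    map[string]*allocRunner
}

func newClient(ids ...string) *client {
	c := &client{allocs: map[string]*allocRunner{}}
	for _, id := range ids {
		c.allocs[id] = &allocRunner{}
	}
	return c
}

// numAllocs models a client API that needs the map lock.
func (c *client) numAllocs() int {
	c.allocLock.RLock()
	defer c.allocLock.RUnlock()
	return len(c.allocs)
}

// update calls back into the client; this would block forever if the
// caller still held allocLock exclusively.
func (r *allocRunner) update(c *client, index int) {
	r.seenRunners = c.numAllocs()
	r.lastIndex = index
}

// updateAlloc holds the lock only for the lookup, then releases it
// before applying the change to the runner.
func (c *client) updateAlloc(id string, index int) bool {
	c.allocLock.RLock()
	ar, ok := c.allocs[id]
	c.allocLock.RUnlock()
	if !ok {
		return false
	}
	ar.update(c, index)
	return true
}

func main() {
	c := newClient("a1", "a2")
	fmt.Println(c.updateAlloc("a1", 7), c.allocs["a1"].lastIndex)
}
```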
Mahmood Ali 3d166e6e9c
Merge pull request #4984 from hashicorp/b-client-update-driver
client: update driver info on new driver fingerprint
2018-12-11 18:01:03 -05:00
Mahmood Ali 69b2355274
Merge pull request #4975 from hashicorp/fix-master-20181209
Some test fixes and remedies
2018-12-11 18:00:21 -05:00
Alex Dadgar 1531b6d534
Merge pull request #4970 from hashicorp/f-no-iops
Deprecate IOPS
2018-12-11 12:51:22 -08:00
Mahmood Ali ba515947c2 client: update driver info on new fingerprint
Fixes a bug where a driver's health and attributes are never updated from
their initial status.  If a driver started unhealthy, it may never go
into a healthy status.
2018-12-11 14:25:10 -05:00
Danielle Tomlinson ed1791f4bf client: Style: use fluent style for building loggers 2018-12-11 18:03:45 +01:00
Danielle Tomlinson 805669ead4 client: Correctly pass a noop PrevAllocMigrator when restoring 2018-12-11 15:46:58 +01:00
Mahmood Ali 3babda5d45 tests: no need for buffer channel 2018-12-11 09:35:26 -05:00
Mahmood Ali 5a487ac884 tests: prevent indefinite blocking in some tests
Noticed a few places where tests seem to block indefinitely and panic
after the test run reaches the test package timeout.

I intend to follow up with the proper fix later, but timing out is much
better than indefinitely blocking.
2018-12-11 09:35:26 -05:00
Mahmood Ali 4635168f20 test: fix TestFingerprintManager_Run_Combination
Let's use a fingerprinter that doesn't have values prepopulated in test
fixtures.
2018-12-11 09:35:26 -05:00
Danielle Tomlinson 6fb5ca6ad5 allocrunner: Test alloc runners should include a noop migrator 2018-12-11 13:12:35 +01:00
Danielle Tomlinson 4b4b85e3f4 allocwatcher: Cleanup new migrator/watcher interface 2018-12-11 13:12:35 +01:00
Danielle Tomlinson 83720575de client: Unify handling of previous and preempted allocs 2018-12-11 13:12:35 +01:00
Danielle Tomlinson dff7093243 client: Wait for preempted allocs to terminate
When starting an allocation that is preempting other allocs, we create a
new group allocation watcher, and then wait for the allocations to
terminate in the allocation PreRun hooks.

If there are no preempted allocations, then we simply provide a
NoopAllocWatcher.
2018-12-11 00:59:18 +01:00
Danielle Tomlinson 2cdef6a7b4 allocwatcher: Add Group AllocWatcher
The Group Alloc watcher is an implementation of a PrevAllocWatcher that
can wait for multiple previous allocs before terminating.

This is to be used when running an allocation that is preempting upstream
allocations, and thus only supports being run with a local alloc watcher.

It also currently requires all of its child watchers to correctly handle
context cancellation. Should this be a problem, it should be fairly easy
to implement a replacement using channels rather than a waitgroup.

It obeys the PrevAllocWatcher interface for convenience, but it may be
better to extract Migration capabilities into a separate interface for
greater clarity.
2018-12-11 00:58:27 +01:00
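The waitgroup-based group watcher described in the commit above could be sketched as follows. The `watcher` interface here is a deliberately reduced, hypothetical version with a single `Wait` method (the commit's PrevAllocWatcher interface also covers migration), and `chanWatcher` is an invented child used only to exercise it.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// watcher is a hypothetical, reduced interface: it only waits.
type watcher interface {
	Wait(ctx context.Context) error
}

// chanWatcher waits for one previous alloc to terminate and honors
// context cancellation, as the group watcher requires of its children.
type chanWatcher struct{ terminated <-chan struct{} }

func (w chanWatcher) Wait(ctx context.Context) error {
	select {
	case <-w.terminated:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// groupWatcher waits for all children using a WaitGroup.
type groupWatcher struct{ children []watcher }

func (g groupWatcher) Wait(ctx context.Context) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(g.children))
	for _, child := range g.children {
		wg.Add(1)
		go func(w watcher) {
			defer wg.Done()
			if err := w.Wait(ctx); err != nil {
				errs <- err
			}
		}(child)
	}
	// Blocks until every child returns, which is why each child must
	// handle cancellation itself.
	wg.Wait()
	close(errs)
	return <-errs // nil when no child reported an error
}

// waitForGroup builds n children and waits on them; with cancelFirst
// set, the allocs never terminate and the context is cancelled instead.
func waitForGroup(n int, cancelFirst bool) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	children := make([]watcher, 0, n)
	for i := 0; i < n; i++ {
		ch := make(chan struct{})
		if !cancelFirst {
			close(ch) // previous alloc already terminated
		}
		children = append(children, chanWatcher{terminated: ch})
	}
	if cancelFirst {
		cancel()
	}
	return groupWatcher{children: children}.Wait(ctx)
}

func main() {
	fmt.Println(waitForGroup(3, false))
	fmt.Println(waitForGroup(3, true))
}
```

A child that ignored `ctx` would stall `wg.Wait()` indefinitely, which is the limitation the commit message calls out.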
Marcin Matlaszek 39eec70f31
Recover from any possible io error when invoking Write on FileRotator
FileRotator currently uses a bufio.Writer under the hood to write data to
the configured output file. Because bufio saves the first io error it
encounters in its `err` field and never resets it automatically, every
later operation such as `Write` or `Flush` becomes a no-op that returns the
same saved error (e.g. out of disk space), even after the problem is fixed
(e.g. disk space is available again).

That automatically means that FileRotator will stop writing any logs,
reporting the same error over and over again, even if it's no longer
valid.

This PR fixes it by resetting the bufio Writer, which resets any errors
and tries to write requested data.
2018-12-07 18:22:29 +01:00
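The sticky-error behavior described in the FileRotator commit above, and the reset-based recovery, can be reproduced with the standard library alone. In this sketch `flakyDisk`, `writeWithReset`, and `simulateDiskFull` are hypothetical names invented for the example; only `bufio.Writer` and its `Reset` method come from Go itself.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
)

// flakyDisk is a hypothetical stand-in for a log file on a full disk.
type flakyDisk struct {
	full bool
	data []byte
}

func (d *flakyDisk) Write(p []byte) (int, error) {
	if d.full {
		return 0, errors.New("no space left on device")
	}
	d.data = append(d.data, p...)
	return len(p), nil
}

func writeAndFlush(bw *bufio.Writer, p []byte) error {
	if _, err := bw.Write(p); err != nil {
		return err
	}
	return bw.Flush()
}

// writeWithReset retries once after Reset, which discards buffered
// data and clears the error bufio.Writer has been holding on to.
func writeWithReset(bw *bufio.Writer, dst io.Writer, p []byte) error {
	if err := writeAndFlush(bw, p); err == nil {
		return nil
	}
	bw.Reset(dst)
	return writeAndFlush(bw, p)
}

// simulateDiskFull shows the sticky error and the recovery.
func simulateDiskFull() (stickyErr, retryErr error, written string) {
	disk := &flakyDisk{full: true}
	bw := bufio.NewWriter(disk)

	_ = writeAndFlush(bw, []byte("lost\n")) // fails: disk is full
	disk.full = false                        // space is freed

	// Without a reset, the writer keeps returning the old error.
	stickyErr = writeAndFlush(bw, []byte("still failing\n"))
	// With a reset, writing works again.
	retryErr = writeWithReset(bw, disk, []byte("recovered\n"))
	return stickyErr, retryErr, string(disk.data)
}

func main() {
	sticky, retry, out := simulateDiskFull()
	fmt.Printf("sticky=%v retry=%v out=%q\n", sticky, retry, out)
}
```

Note that `Reset` throws away whatever was still buffered, so in this sketch the lines written while the disk was full are lost rather than replayed.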
Alex Dadgar 1e3c3cb287 Deprecate IOPS
IOPS has been modelled as a resource since Nomad 0.1 but has never
actually been detected, and there is no plan in the short term to add
detection. This is because IOPS is too simplistic a unit to define
the performance requirements of the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.
2018-12-06 15:09:26 -08:00
Danielle Tomlinson e3621c55fa gc: Fix maxallocs integration test 2018-12-06 21:50:50 +01:00
Alex Dadgar c4b5f80918 Make alloc health watcher a postrun hook rather than shutdown hook 2018-12-06 12:30:31 -08:00
Danielle Tomlinson 62b98e64ca client/gc: Replace GC integration test with unit
The previous integration test was broken during the client refactor, and
it seems to be some sort of race with state updating.

I'm going to try and construct a replacement test as part of work on
performance, but for now, the underlying behaviour is still being
tested.
2018-12-06 12:28:23 +01:00
Danielle Tomlinson f6e474fd55 client: Re-enable GC tests 2018-12-06 12:28:23 +01:00
Danielle Tomlinson d043532cb0 allocrunner: Basic test alloc runner 2018-12-06 12:28:23 +01:00
Alex Dadgar b39c21d49c Fix various bugs with task events
Fixes the following:
* Emitting events when the task fails to start
* Don't double emit events on task shutdown (nomad stop)
* Don't emit a OOM kill metric unless actually OOM'd
2018-12-05 14:27:07 -08:00
Danielle Tomlinson 10b3e68a6d
Merge pull request #4925 from hashicorp/f-driver-plugins-dani
Third Party Driver Plugins Support
2018-12-03 20:48:19 +01:00
Mahmood Ali 88622b97bd
libcontainer to manage /dev and /proc (#4945)
libcontainer already manages `/dev`, overriding task_dir, so let's use it for `/proc` as well and remove dead code.
2018-12-03 10:41:01 -05:00
Danielle Tomlinson 9bd77e9295 testfix: Fix import cycle in allocdir tests 2018-12-01 17:25:30 +01:00
Danielle Tomlinson 66c521ca17 client: Move fingerprint structs to pkg
This removes a cyclical dependency when importing client/structs from
dependencies of the plugin_loader, specifically drivers. The cycle arises
because client/config also depends on the plugin_loader.

It also better reflects the ownership of fingerprint structs, as they
are fairly internal to the fingerprint manager.
2018-12-01 17:10:39 +01:00
Danielle Tomlinson 2db5ae38d8 client: Rename drivers/shared/env => client/taskenv 2018-11-30 12:18:39 +01:00
Danielle Tomlinson f3a77b8084 client: Merge driver/shared/structs and client/structs 2018-11-30 10:56:45 +01:00
Danielle Tomlinson b9295f0d56 client/driver: Remove package 2018-11-30 10:47:08 +01:00
Danielle Tomlinson fdfe93aa25 fixup: executorplugin: fix rkt build 2018-11-30 10:47:08 +01:00
Danielle Tomlinson d72ecd95ec client/driver: Vendor setEnvvars into docker_test 2018-11-30 10:46:13 +01:00
Danielle Tomlinson d26a310db0 client: Move executor plugins into own package 2018-11-30 10:46:13 +01:00
Danielle Tomlinson d259c36844 driver: Flatten SetEnvvars into taskdirhook 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 6b72e96eba client: Move driver/logging to logmon/logging
The logging package is used by logmon and the legacy mock_driver. Because the
legacy drivers are going away, I'm moving it here to signify its actual
ownership.
2018-11-30 10:46:13 +01:00
Danielle Tomlinson 04c8851b4c client: Migrate DriverStats optout to drivers/shared/structs 2018-11-30 10:46:13 +01:00
Danielle Tomlinson dbd82e1af4 client: Remove test dependency on client/driver 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 0544a57abe drivers: Move client/drivers/executor to drivers/shared/executor 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 1a29811169 drivers: Move client/drivers/env to drivers/shared/env
As part of deprecating legacy drivers, we're moving the env package to a
new drivers/shared tree, as it is used by the modern docker and rkt
driver packages, and is useful for 3rd party plugins.
2018-11-30 10:46:13 +01:00
Nick Ethier bbe420718a
Merge pull request #4922 from hashicorp/f-drivermananger
add generic plugin manager interface and orchestration
2018-11-28 22:17:04 -05:00
Preetha 1f526db414
Merge pull request #4919 from hashicorp/f-fingerprint-attribute-type
Modify fingerprint interface to use typed attribute struct
2018-11-28 14:18:28 -06:00
Michael Schurter 1bd9a9f9dd
Merge pull request #4894 from hashicorp/f-device-hook
Device hook and devices affect computed node class
2018-11-28 12:10:43 -06:00
Preetha Appan f89dbcd9cc
modify fingerprint interface to use typed attribute struct 2018-11-28 10:01:03 -06:00
Nick Ethier 60c6907ea5
client/plugin: remove println from plugin group func 2018-11-27 22:45:09 -05:00
Nick Ethier 600738e991
client/plugin: lint/spelling errors 2018-11-27 22:45:09 -05:00
Nick Ethier 45a6bf7acd
client/plugin: add generic plugin manager interface and orchestration 2018-11-27 22:45:03 -05:00
Mahmood Ali ad1f8d8c20 Fixes in old lxc driver 2018-11-27 21:40:43 -05:00
Michael Schurter 3e56ee005a add nil check around task resources in device hook
Looking at NewTaskRunner I'm unsure whether TaskRunner.TaskResources
(from which req.TaskResources is set) is intended to be nil at times or
if the TODO in NewTaskRunner is intended to ensure it is always non-nil.
2018-11-27 17:25:33 -08:00
Michael Schurter b75e9fce37 assume that slices contain only non-nil items 2018-11-27 17:25:33 -08:00
Michael Schurter 85073f9d29 client: properly support hook env vars
The old approach was incomplete. Hook env vars are now:

 * persisted and restored between agent restarts
 * deterministic (LWW if 2 hooks set the same key)
2018-11-27 17:25:33 -08:00
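Note on 85073f9d29: a minimal Go sketch of the "deterministic, last write wins" rule for hook env vars. The `hookEnv` type, the `mergeHookEnvs` function, and the hook names are hypothetical and are not taken from the Nomad source; the only point illustrated is that applying hooks in a fixed order makes the winner of a key collision predictable.

```go
package main

import "fmt"

// hookEnv is a hypothetical record of the env vars set by one named hook.
type hookEnv struct {
	name string
	vars map[string]string
}

// mergeHookEnvs applies hooks in slice order, so a later hook overwrites
// any key set by an earlier one (last write wins). Because the order of
// the slice is fixed, the result is deterministic.
func mergeHookEnvs(hooks []hookEnv) map[string]string {
	out := make(map[string]string)
	for _, h := range hooks {
		for k, v := range h.vars {
			out[k] = v
		}
	}
	return out
}

func main() {
	env := mergeHookEnvs([]hookEnv{
		{name: "first", vars: map[string]string{"A": "1", "B": "1"}},
		{name: "second", vars: map[string]string{"B": "2"}},
	})
	fmt.Println(env["A"], env["B"]) // prints "1 2"
}
```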
Alex Dadgar 4ee603c382 Device hook and devices affect computed node class
This PR introduces a device hook that retrieves the device mount
information for an allocation. It also updates the computed node class
computation to take into account devices.

TODO Fix the task runner unit test. The environment variable is being
lost even though it is being properly set in the prestart hook.
2018-11-27 17:25:33 -08:00
Michael Schurter 27e07f657e
Merge pull request #4896 from hashicorp/b-prevalloc-deadlock
Fix deadlock in previous alloc watcher by emitting last alloc update
2018-11-27 19:07:16 -06:00
Michael Schurter b75f79a793 fix test breakage caused by rebase 2018-11-27 16:34:01 -08:00
Michael Schurter 91da566935 fix misspellings 2018-11-27 16:33:55 -08:00
Chris Baker a1fb1f3830
Merge pull request #4891 from hashicorp/b-1150-rkt-volume-names
drivers/rkt: fix invalid volumes
2018-11-27 18:55:00 -05:00
Danielle Tomlinson 3651dbdc25
Merge pull request #4909 from hashicorp/b-restart-delay
taskrunner: Return the restart delay correctly
2018-11-27 23:55:54 +01:00
Michael Schurter 22149a661e client: comment on importance of chan ops ordering 2018-11-27 14:11:32 -08:00
Mahmood Ali 05a958dc21 Update client/structs/broadcaster.go
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2018-11-27 14:06:08 -08:00
Michael Schurter 81b6a24a84 client: fix send-after-close in broadcaster 2018-11-27 14:06:08 -08:00
Michael Schurter c429e6b0ab client: check if prev alloc is already terminated
This is a defensive fast-path as 7c6aa0be already fixed the deadlock.
2018-11-27 14:06:08 -08:00
Michael Schurter 944ea6d38b client: emit last sent alloc to new listeners
Fixes a deadlock where the allocwatcher would block forever waiting for
an update from a terminal alloc.

Made the broadcaster easier to debug as well.
2018-11-27 14:06:08 -08:00
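Note on 944ea6d38b: a rough Go sketch of the idea of replaying the last sent value to new listeners, so a listener that subscribes after a terminal update does not block forever. The `broadcaster` type here is a simplified, hypothetical stand-in that carries strings; it is not the code in client/structs/broadcaster.go.

```go
package main

import (
	"fmt"
	"sync"
)

// broadcaster is a hypothetical, simplified fan-out of string updates.
type broadcaster struct {
	mu        sync.Mutex
	last      *string
	listeners []chan string
}

// Send records v as the latest value and delivers it to every listener,
// replacing any stale value the listener has not read yet.
func (b *broadcaster) Send(v string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.last = &v
	for _, ch := range b.listeners {
		select {
		case <-ch: // drop the stale value, if any
		default:
		}
		ch <- v // never blocks: capacity is 1 and the slot was just emptied
	}
}

// Listen returns a channel that is pre-loaded with the last sent value,
// if there is one, so late subscribers still observe it.
func (b *broadcaster) Listen() <-chan string {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan string, 1)
	if b.last != nil {
		ch <- *b.last
	}
	b.listeners = append(b.listeners, ch)
	return ch
}

func main() {
	b := &broadcaster{}
	b.Send("terminal")
	ch := b.Listen()  // subscribed after the final update
	fmt.Println(<-ch) // prints "terminal" instead of blocking
}
```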
Michael Schurter 1e4ef139dd
Merge pull request #4883 from hashicorp/f-graceful-shutdown
Support graceful shutdowns in agent
2018-11-27 15:55:15 -06:00
Michael Schurter 4f7e6f9464 client: fix races in use of goroutine group
The group utility struct does not support asynchronously launched
goroutines (goroutines-inside-of-goroutines), so switch those uses to a
normal go call.

This means watchNodeUpdates and watchNodeEvents may not be shut down when
Shutdown() exits. During nomad agent shutdown this does not matter.

During tests this means a test may leak those goroutines or be unable to
know when those goroutines have exited.

Since there's no runtime impact and these goroutines do not affect alloc
state syncing it seems ok to risk leaking them.
2018-11-26 12:52:55 -08:00
Michael Schurter 9f43fb6d29 client: reuse group instead of diy'ing it 2018-11-26 12:52:31 -08:00
Michael Schurter 22771aa19e client/ar: remove useless wait ch from runTasks
Arguably this makes task.WaitCh() useless, but I think exposing a wait
chan from TaskRunners is a generically useful API.
2018-11-26 12:51:18 -08:00
Michael Schurter 2fdd013956 client: document how AR/TR Run methods behave 2018-11-26 12:50:35 -08:00
Chris Baker 9bd4317139 modified TaskConfig to include AllocID
use this for volume names in drivers/rkt to address #1150
2018-11-26 18:54:26 +00:00
Nick Ethier 95362eaa02
Merge pull request #4844 from hashicorp/f-docker-plugin
Docker driver plugin
2018-11-20 20:43:03 -05:00
Mahmood Ali e1994e59bd address review comments 2018-11-20 17:10:54 -05:00
Mahmood Ali 171b73fde7 Emit metric counters for Vault token and renewal failures 2018-11-20 17:10:54 -05:00
Mahmood Ali 5b10da5de6 Set User-Agent header when hitting Vault API 2018-11-20 17:10:54 -05:00
Danielle Tomlinson 093f029d5b taskrunner: Return the restart delay correctly
We were incorrectly returning a 0 duration to the taskrunner when
determining when a task should restart. This would cause tasks to be
restarted immediately, ignoring the restart {} stanza in a user's
configuration.

This commit causes us to return the restart duration to the task runner
so it may correctly delay further execution.
2018-11-20 21:52:23 +01:00
Nick Ethier 3e42d6914e
task_runner: use NodeResources instead of deprecated struct 2018-11-20 13:46:39 -05:00
Nick Ethier 93c0200566
task_runner: use task and alloc copies instead of referencing the original pointer 2018-11-20 13:34:46 -05:00
Nick Ethier 29591a7c2e
task_runner: emit event on task exit with exit result details 2018-11-19 22:59:17 -05:00
Nick Ethier 4be8a86ef9
plugins/driver: remove NodeResources from task Resources and use PercentTicks field for docker driver 2018-11-19 22:59:17 -05:00
Nick Ethier 69049d37f5
drivers: added NodeResources to drivers.TaskConfig 2018-11-19 22:59:16 -05:00
Nick Ethier 8f8698b3e1
docker: started work on porting docker driver to new plugin framework 2018-11-19 22:59:15 -05:00
Michael Schurter 88577fe083 client.rpc: don't log errors on shutdown 2018-11-19 16:39:30 -08:00
Michael Schurter 5bd744ac3d client: support graceful shutdowns
Client.Shutdown now blocks until all AllocRunners and TaskRunners have
exited their Run loops. Tasks are left running.
2018-11-19 16:39:30 -08:00
Mahmood Ali 9479015f51
Merge pull request #4884 from hashicorp/f-alloc-devices-cli
Report alloc device statistics in API and CLI
2018-11-16 18:04:54 -05:00
Mahmood Ali f139234372 address review comments 2018-11-16 17:13:01 -05:00
Mahmood Ali f72e599ee7 Populate alloc stats API with device stats
This change makes a few compromises:

* Looks up the devices associated with tasks at lookup time.  Given
that `nomad alloc status` is generally called rarely (compared to stats
telemetry and general job reporting), it seems fine.  However, the
lookup overhead grows with `tasks x total-host-devices`,
which can be significant.

* `client.Client` performs the task devices->statistics lookup.  It
passes self to alloc/task runners so they can look up the device statistics
allocated to them.
  * Currently alloc/task runners are responsible for constructing the
entire RPC response for stats
  * The alternatives for making task runners device statistics aware
don't seem appealing (e.g. having task runners contain reference to hostStats)

* On the alloc aggregation resource usage, I did a naive merging of task device statistics.
  * Personally, I question the value of such aggregation, compared to
costs of struct duplication and bloating the response - but opted to be
consistent in the API.
  * With naive concatenation, device instances from a single device group used by separate tasks in the alloc would be aggregated in two separate device group statistics.
2018-11-16 10:26:32 -05:00
Michael Schurter 0cdb188ae4 tests: fix tests post-rebase 2018-11-15 17:40:56 -08:00
Michael Schurter 59f106ecee client/tr: add a bit of context to envbuilder errors 2018-11-15 16:26:25 -08:00
Michael Schurter 742f8775ba client: remove old proxy references from comments 2018-11-15 16:26:25 -08:00
Michael Schurter 2d0b44c3b4 client: test more env key variations 2018-11-15 16:26:25 -08:00
Michael Schurter 8bcd90d78d client: add new nested variables to task's hcl ctx
The error messages are really bad, but it's extremely difficult to
produce good error messages without the original HCL.
2018-11-15 16:26:25 -08:00
Michael Schurter 5e51e2c2d5 client: turn env into nested objects for task configs 2018-11-15 16:25:57 -08:00
Michael Schurter f8cdd561f0 client: interpolate driver configurations
Also add missing SetDriverNetwork calls.
2018-11-15 16:25:57 -08:00
Mahmood Ali 046f098bac Track Node Device attributes and serve them in API 2018-11-14 14:42:29 -05:00
Mahmood Ali 63acda956c Add Client Device Stats structs in api package 2018-11-14 14:41:19 -05:00
Mahmood Ali b74ccc742c Expose Device Stats in /client/stats API endpoint 2018-11-14 14:41:19 -05:00
Mahmood Ali c5de71a424 Allow nullable fields in StatValues
In stat values, we need to be able to distinguish between zero values
(e.g. `false`) and unset values (e.g. `nil`).

We can alternatively use protobuf `oneOf` and a nested map to ensure
consistency of fields that are set together, but the golang
representation does not represent that well and would introduce a mismatch
between representations.  Thus, I opted not to use it.
2018-11-14 14:41:19 -05:00
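Note on c5de71a424: a small Go sketch of why nullable (pointer) fields let a reader tell an explicit zero value from an unset one. The `statValue` struct and its field names are hypothetical and simplified; they are not the actual StatValue definition.

```go
package main

import "fmt"

// statValue is a hypothetical, simplified value with nullable fields.
// A nil pointer means "unset"; a non-nil pointer carries a real value,
// which may itself be the zero value (false, 0).
type statValue struct {
	BoolVal *bool
	IntVal  *int64
}

// describe reports which field is set, if any.
func describe(v statValue) string {
	switch {
	case v.BoolVal != nil:
		return fmt.Sprintf("bool=%t", *v.BoolVal)
	case v.IntVal != nil:
		return fmt.Sprintf("int=%d", *v.IntVal)
	default:
		return "unset"
	}
}

func main() {
	f := false
	fmt.Println(describe(statValue{BoolVal: &f})) // prints "bool=false"
	fmt.Println(describe(statValue{}))            // prints "unset"
}
```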
Mahmood Ali 713c9fe683 Move Stat{Object|Value} to plugins/shared/structs
Moving them as they may be useful for other packages/plugins besides
devices.
2018-11-14 09:01:26 -05:00
Mahmood Ali 1f4db08f42 Regenerate proto files with protoc-gen-go@v1.2.0 2018-11-14 09:01:26 -05:00
Danielle Tomlinson 0917e93537
Merge pull request #4869 from hashicorp/b-executor-stdout
executor: Fix stdout stderr copy/paste
2018-11-13 19:22:37 -08:00
Mahmood Ali 865419e756 convert all config durations to strings in tests 2018-11-13 10:21:40 -05:00
Mahmood Ali ac3b4571eb Address review comments 2018-11-13 10:21:40 -05:00
Mahmood Ali 69f26783e4 avoid setting resource limit on rkt command
Was accidentally modified in 5b14d24bf4626bab420d00783d92bcf25e0b641e .
2018-11-13 10:21:40 -05:00
Mahmood Ali 8fa26f5521 Fix docker log fetching in tests
We no longer use syslog for tracking logs, so track them explicitly
here.
2018-11-13 10:21:40 -05:00
Mahmood Ali 88fa968623 killing should be done with wait client
Incidentally changed in 5b14d24bf4626bab420d00783d92bcf25e0b641e
2018-11-13 10:21:40 -05:00
Mahmood Ali 7690f389a0 Prioritize checking consumer context cancellation
Tests expect the eventer to shut down immediately on context
cancellation, but golang does not guarantee priority when multiple
pending channels are ready in a select statement.
2018-11-13 10:21:40 -05:00
Mahmood Ali c62ec124c0 Set clean config for mock driver
The default job here contains some exec task config (for setting
command and args) that aren't used for mock driver.  Now, the alloc
runner seems stricter about validating fields and errors on unexpected
fields.

Updating configs in tests so we can have an explicit task config
whenever driver is set explicitly.
2018-11-13 10:21:40 -05:00
Mahmood Ali e5e6f9a785 Update Docker name parsing lookup
`ParseNamed` function changed in e9f3f2cfee9d729a8642344c4fa4ea70b2d49468,
where it became `ParseNormalizedNamed` with extra checks.
2018-11-13 10:21:40 -05:00
Danielle Tomlinson bfeded1f30 executor: Fix stdout stderr copy/paste 2018-11-12 22:08:04 -08:00
Alex Dadgar c4f9e22aeb fix race 2018-11-07 12:22:07 -08:00
Alex Dadgar a7ca737fb6 review comments 2018-11-07 11:31:52 -08:00
Alex Dadgar f0c7a8159b tests 2018-11-07 10:43:15 -08:00
Alex Dadgar 204ca8230c Device manager
Introduce a device manager that manages the lifecycle of device plugins
on the client. It fingerprints, collects stats, and forwards Reserve
requests to the correct plugin. The manager also handles device plugins
failing and validates their output.
2018-11-07 10:43:15 -08:00
Michael Schurter a4e6a92d18 client: update alloc status when terminating
Defensively update alloc status whenever killing all tasks.
2018-11-05 15:11:10 -08:00
Michael Schurter 66bf3db455 client: block on context as well as waitCh
For lifecycle operations such as Restart and Kill, the client should not
expect driver plugins to be well behaved and close their waitCh on
context cancelation. Always wait on the passed in context as well as the
waitCh.
2018-11-05 12:32:05 -08:00
Michael Schurter b994f51990 client: fix tr lifecycle logic and shutdown delay
ShutdownDelay must be honored whenever the task is killed or restarted.
Services were not being deregistered prior to restarting.
2018-11-05 12:32:05 -08:00
Michael Schurter 2d3479147a client: fix ar and tr tests 2018-11-05 12:32:05 -08:00
Michael Schurter d29d09023e client: do not run terminal allocs 2018-11-05 12:32:05 -08:00
Michael Schurter 2bbd88888c client: first pass at implementing task restoring
Task restoring works but dead tasks may be restarted
2018-11-05 12:32:05 -08:00
Nick Ethier b0ddc03409
Merge pull request #4765 from jippi/increase-line-scan-limit
fix: increase log rotator line scan limit
2018-10-29 18:46:30 -07:00
Nick Ethier 3fcf8ba7e6
Merge pull request #4795 from hashicorp/f-plugin-config
Pass client configuration to plugins through loader
2018-10-29 18:42:27 -07:00
Nick Ethier bda3b1d3b3
rename NomadConfig to ClientAgentConfig 2018-10-29 21:34:34 -04:00
Michael Schurter 6f2cffb196
Merge pull request #4803 from hashicorp/b-leader-fixes
AR Fixes: task leader handling, restoring, state updating, AR.Destroy deadlocks
2018-10-29 17:38:59 -05:00
Michael Schurter d71a1b4547 tests: more fixes due to api changes 2018-10-29 15:25:22 -07:00
Preetha Appan b85cc38f3d
Stat path to binary to handle raw exec driver interpolated binary path 2018-10-26 17:24:05 -05:00
Preetha Appan 55ac8d3d12
Fix test linting 2018-10-26 10:30:12 -05:00
Michael Schurter b7a9d61a38 ar: initialize allocwatcher on restore
Fixes a panic. Left a comment on how the behavior could be improved, but
this is what releases <0.9.0 did.
2018-10-19 09:45:45 -07:00
Michael Schurter e060174130 ar: fix leader handling, state restoring, and destroying unrun ARs
* Migrated all of the old leader task tests and got them passing
* Refactor and consolidate task killing code in AR to always kill leader
  tasks first
* Fixed lots of issues with state restoring
* Fixed deadlock in AR.Destroy if AR.Run had never been called
* Added a new in memory statedb for testing
2018-10-19 09:45:45 -07:00
Nick Ethier 58b430edae
added driver specific client config struct to plugin configuration 2018-10-18 23:31:01 -04:00
Michael Schurter cefbf00bf0 ar: refactor task killing into 1 method
Update comments and address some PR comments from #4775
2018-10-17 10:06:59 -07:00
Michael Schurter 21d78be961 tests: explicitly cleanup after clients 2018-10-17 10:06:59 -07:00
Michael Schurter 222f6b5741 ar: fix task leader, update, and stop handling 2018-10-17 10:06:59 -07:00
Michael Schurter 1badbb2fc4 tr: cleanup hook logs 2018-10-17 09:42:32 -07:00
Nick Ethier 65adb80ebf
plumb NomadConfig into plugins 2018-10-16 22:47:22 -04:00
Nick Ethier d94b631b6b
drivers/exec: add exec implementation 2018-10-16 22:45:28 -04:00
Michael Schurter 0baaba8b09 templates: fix tests 2018-10-16 16:56:57 -07:00
Michael Schurter 838ddf4d4a fix linter errors 2018-10-16 16:56:57 -07:00
Michael Schurter e27c82ea4d client: remove unused handleproxy 2018-10-16 16:56:56 -07:00
Michael Schurter 4ea5217d72 tr: remove unused DriverHandle interface
was causing typed nil interface panics and served no purpose
2018-10-16 16:56:56 -07:00
Michael Schurter 528c426c53 Port client portion of #4392 to new taskrunner
PR #4392 was merged to master *after* allocrunnerv2 was branched, so the
client-specific portions must be ported from master to arv2.
2018-10-16 16:56:56 -07:00
Michael Schurter f12501d4c3 tr: implement dispatch payload hook
Now passing the TaskDir struct to prestart hooks instead of just the
root task dir itself as dispatch needs local/.
2018-10-16 16:56:56 -07:00
Nick Ethier d9f0cbf4a9 client: log retry during driver fingerprint redispense 2018-10-16 16:56:56 -07:00
Nick Ethier c7ac1186c9 client: add test for driver failure during fingerprinting 2018-10-16 16:56:56 -07:00
Nick Ethier 8cf669b5aa taskrunner: return error on waitCh 2018-10-16 16:56:56 -07:00
Nick Ethier 047fad2953 client: simplify driver plugin logic from review comments 2018-10-16 16:56:56 -07:00
Nick Ethier 9686e1b258 client: fix broken tests from refactoring 2018-10-16 16:56:56 -07:00
Nick Ethier 3183b33d24 client: review comments and fixup/skip tests 2018-10-16 16:56:56 -07:00
Nick Ethier f192c3752a client: refactor post allocrunnerv2 finalization 2018-10-16 16:56:56 -07:00
Nick Ethier 4a4c7dbbfc client: begin driver plugin integration
client: fingerprint driver plugins
2018-10-16 16:56:56 -07:00
Alex Dadgar 7946a14aa8 Fix lints 2018-10-16 16:56:56 -07:00
Alex Dadgar 89dafaaea9 compile on windows 2018-10-16 16:56:56 -07:00
Alex Dadgar ad4fac526c more test fixes 2018-10-16 16:56:56 -07:00
Alex Dadgar 45e41cca03 allocrunnerv2 -> allocrunner 2018-10-16 16:56:56 -07:00
Alex Dadgar 9baa7402ef fix test compiling 2018-10-16 16:56:55 -07:00
Alex Dadgar 7d9c069f09 skip building deprecated files 2018-10-16 16:56:55 -07:00
Alex Dadgar 6c9d9d5173 move files around 2018-10-16 16:56:55 -07:00
Michael Schurter 5f696608a6 tests: fix missing logger caused by bad merge 2018-10-16 16:56:55 -07:00
Michael Schurter 048510b13e tr: properly comment handle fields 2018-10-16 16:56:55 -07:00
Michael Schurter 9e49ed3464 ar: AllocState should not mutate ar.state
If ar.state.TaskStates has not been set, set it on the copy of ar.state.
That keeps ar.state manipulations in one location and allows AllocState
to only acquire read-locks.
2018-10-16 16:56:55 -07:00
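Note on 9e49ed3464: a hedged Go sketch of a getter that fills in missing task states on a copy rather than on the shared state, which is what lets it get by with a read lock. All type and field names here (`allocRunner`, `allocState`, `tasks`) are hypothetical simplifications and not the real AllocRunner.

```go
package main

import (
	"fmt"
	"sync"
)

// allocState is a hypothetical, simplified view of runtime state.
type allocState struct {
	TaskStates map[string]string
}

type allocRunner struct {
	mu    sync.RWMutex
	state allocState
	tasks map[string]string // task name -> current state (assumed source of truth)
}

// AllocState returns a copy of the state. If TaskStates was never set,
// it is populated on the copy only, so ar.state is not mutated and a
// read lock is sufficient.
func (ar *allocRunner) AllocState() allocState {
	ar.mu.RLock()
	defer ar.mu.RUnlock()

	cp := allocState{TaskStates: make(map[string]string, len(ar.tasks))}
	for name, st := range ar.state.TaskStates {
		cp.TaskStates[name] = st
	}
	if len(ar.state.TaskStates) == 0 {
		for name, st := range ar.tasks {
			cp.TaskStates[name] = st
		}
	}
	return cp
}

func main() {
	ar := &allocRunner{tasks: map[string]string{"web": "pending"}}
	s := ar.AllocState()
	// The copy is populated; the runner's own state is untouched.
	fmt.Println(s.TaskStates["web"], ar.state.TaskStates == nil) // prints "pending true"
}
```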
Michael Schurter f279b1d1b1 tests: test logs endpoint against pending task
Although the really exciting change is making WaitForRunning return the
allocations that it started. This should cut down test boilerplate
significantly.
2018-10-16 16:56:55 -07:00
Michael Schurter dd4227f84a tests: make a test client/config easier to generate
Sadly can't move the fingerprint timeout tweak into the helper due to
circular imports.
2018-10-16 16:56:55 -07:00
Michael Schurter 1d747048ea tests: ensure task state is initialized in NewAR
Also expose NoopDB for use in tests.
2018-10-16 16:56:55 -07:00
Michael Schurter 960f3be76c client: expose task state to client
The interesting decision in this commit was to expose AR's state and not
a fully materialized Allocation struct. AR.clientAlloc builds an Alloc
that contains the task state, so I considered simply memoizing and
exposing that method.

However, that would lead to AR having two awkwardly similar methods:
 - Alloc() - which returns the server-sent alloc
 - ClientAlloc() - which returns the fully materialized client alloc

Since ClientAlloc() could be memoized it would be just as cheap to call
as Alloc(), so why not replace Alloc() entirely?

Replacing Alloc() entirely would require Update() to immediately
materialize the task states on server-sent Allocs as there may have been
local task state changes since the server received an Alloc update.

This quickly becomes difficult to reason about: should Update hooks use
the TaskStates? Are state changes caused by TR Update hooks immediately
reflected in the Alloc? Should AR persist its copy of the Alloc? If so,
are its TaskStates canonical or the TaskStates on TR?

So! Forget that. Let's separate the static Allocation from the dynamic
AR & TR state!

 - AR.Alloc() is for static Allocation access (often for the Job)
 - AR.AllocState() is for the dynamic AR & TR runtime state (deployment
   status, task states, etc).

If code needs to know the status of a task: AllocState()
If code needs to know the names of tasks: Alloc()

It should be very easy for a developer to reason about which method they
should call and what they can do with the return values.
2018-10-16 16:56:55 -07:00
Michael Schurter fb4aa74153 client: add comment 2018-10-16 16:56:55 -07:00
Michael Schurter 9a7e6be2b6 client: fix potentially dropped streaming errors 2018-10-16 16:56:55 -07:00
Michael Schurter 4b44b9039b tr: remove unneeded lock; chan synchronizes access 2018-10-16 16:56:55 -07:00
Michael Schurter 211b96bb5c tr: fix shutdown/destroy/WaitResult handling
Multiple receivers raced for the WaitResult when killing tasks which
could lead to a deadlock if the "wrong" receiver won.

Wrap handlers in an ugly little proxy to avoid this. At first I wanted
to push this into drivers, but the result is tied to the TR's handle
lifecycle -- not the lifecycle of an alloc or task.
2018-10-16 16:56:55 -07:00
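Note on 211b96bb5c: a rough Go sketch of the general proxy idea, in which a single goroutine consumes the one-shot result and any number of waiters read the stored copy, so receivers no longer race for a single channel send. The `resultProxy` type is hypothetical and uses an `int` in place of a real wait result; it is not the proxy added in this commit.

```go
package main

import (
	"fmt"
	"sync"
)

// resultProxy receives a one-shot result exactly once and lets any
// number of callers wait for it.
type resultProxy struct {
	done   chan struct{}
	result int
}

func newResultProxy(src <-chan int) *resultProxy {
	p := &resultProxy{done: make(chan struct{})}
	go func() {
		p.result = <-src // the only receiver on src
		close(p.done)    // publishes result to all waiters
	}()
	return p
}

// Wait blocks until the result is available and returns it.
func (p *resultProxy) Wait() int {
	<-p.done
	return p.result
}

func main() {
	src := make(chan int, 1)
	src <- 137 // the single exit result

	p := newResultProxy(src)

	var wg sync.WaitGroup
	results := make([]int, 2)
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = p.Wait()
		}(i)
	}
	wg.Wait()
	fmt.Println(results) // prints "[137 137]"
}
```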
Michael Schurter 951ed17436 client: do not inspect task state to follow logs
"Ask forgiveness, not permission."

Instead of peeking at TaskStates (which are no longer updated on the
AR.Alloc() view of the world) to only read logs for running tasks, just
try to read the logs and improve the error handling if they don't exist.

This should make log streaming less dependent on AR/TR behavior.

Also fixed a race where the log streamer could exit before reading an
error. This caused no logs or errors to be displayed sometimes when an
error occurred.
2018-10-16 16:56:55 -07:00
Michael Schurter 2325348053 mock_driver: close waitCh after exiting
mock_driver wasn't behaving like other driver handles.
2018-10-16 16:56:55 -07:00
Michael Schurter 8d1419c62b client: fix accessing alloc runners
* GetClientAlloc() gains nothing from using allAllocs()
* getAllocatedResources was calling getAllocRunners() twice
2018-10-16 16:56:55 -07:00
Michael Schurter 55ab491801 tr: remove wip comments 2018-10-16 16:56:55 -07:00
Michael Schurter 3ccc091a72 ar: lock around accessing tasks
Specify that Alloc() does not return updated task states.
2018-10-16 16:56:55 -07:00
Alex Dadgar 6f0ed6184b Fix client reloading and pass the plugin loaders to server and client 2018-10-16 16:56:55 -07:00
Nick Ethier 352c05cdf4 plugin/drivers: plumb in stdout/stderr paths 2018-10-16 16:53:31 -07:00
Nick Ethier 0e3f85222a driver/raw_exec: port existing raw_exec tests and add some testing utilities 2018-10-16 16:53:31 -07:00
Nick Ethier d9628ff394 driver/raw_exec: more tests and bug fixes
added wrapper struct for plugin.ReattachConfig to better handle serialization
2018-10-16 16:53:31 -07:00
Nick Ethier bcc5c4a8bd clientv2: base driver plugin (#4671)
Driver plugin framework to facilitate development of driver plugins.

Implementing plugins only need to implement the DriverPlugin interface.
The framework proxies this interface to the go-plugin GRPC interface generated
from the driver.proto spec.

A testing harness is provided to allow implementing drivers to test the full
lifecycle of the driver plugin. An example use:

func TestMyDriver(t *testing.T) {
    harness := NewDriverHarness(t, &MyDriverPlugin{})
    // The harness implements the DriverPlugin interface and can be used as such
    taskHandle, err := harness.StartTask(...)
}
2018-10-16 16:53:31 -07:00
Michael Schurter 62c1285afc tr: add comments and cleanup call signature
From review comments on #4649 left post-merge.
2018-10-16 16:53:31 -07:00
Nick Ethier 5dee1141d1 executor v2 (#4656)
* client/executor: refactor client to remove interpolation

* executor: POC libcontainer based executor

* vendor: use hashicorp libcontainer fork

* vendor: add libcontainer/nsenter dep

* executor: updated executor interface to simplify operations

* executor: implement logging pipe

* logmon: new logmon plugin to manage task logs

* driver/executor: use logmon for log management

* executor: fix tests and windows build

* executor: fix logging key names

* executor: fix test failures

* executor: add config field to toggle between using libcontainer and standard executors

* logmon: use discover utility to discover nomad executable

* executor: only call libcontainer-shim on main in linux

* logmon: use separate path configs for stdout/stderr fifos

* executor: windows fixes

* executor: created reusable pid stats collection utility that can be used in an executor

* executor: update fifo.Open calls

* executor: fix build

* remove executor from docker driver

* executor: Shutdown func to kill and cleanup executor and its children

* executor: move linux specific universal executor funcs to separate file

* move logmon initialization to a task runner hook

* client: doc fixes and renaming from code review


* taskrunner: use shared config struct for logmon fifo fields

* taskrunner: logmon only needs to be started once per task
2018-10-16 16:53:31 -07:00
Michael Schurter e6e2930a00 tr: implement stats collection hook
Tested except for the net/rpc specific error case which may need
changing in the gRPC world.
2018-10-16 16:53:31 -07:00
Michael Schurter 86bd329539 fix build errors post merges 2018-10-16 16:53:31 -07:00
Michael Schurter a977e22028 test: cleanup mock consul service client
Updated to hclog.

It exposed fields that required an unexported lock to access. Created a
getter method instead. Only old allocrunner currently used this
feature.
2018-10-16 16:53:31 -07:00
Michael Schurter 6f92b04226 health_hook: simplify locking; test thoroughly
Use doneCh like @dadgar suggested in the original PR.

Thoroughly test hook as concurrent Update calls make for a tricky
concurrency problem.
2018-10-16 16:53:30 -07:00
Alex Dadgar cebfead6bc add logger back 2018-10-16 16:53:30 -07:00
Nick Ethier 03422aa529 fifo: add new fifo package for named pipes (#4665)
* fifo: add new fifo package for named pipes
2018-10-16 16:53:30 -07:00
Alex Dadgar 8504505c0d client uses passed logger and fix fingerprinters 2018-10-16 16:53:30 -07:00
Nick Ethier 66ff12e5f7 Update runc/libcontainer and friends (#4655)
* vendor: bump libcontainer and docker to remove Sirupsen imports

* vendor: fix bad vendoring of archive package

* vendor: fix api changes to cgroups in executor

* vendor: fix docker api changes

* vendor: update github.com/Azure/go-ansiterm to use non capitalized logrus import
2018-10-16 16:53:30 -07:00
Michael Schurter 195b8127fb health_hook: fix panic and add tests
Still more testing to do, but I want to get this panic fixed ASAP.

All new tests pass with -race
2018-10-16 16:53:30 -07:00
Michael Schurter 64efc3d301 Emit events before long operations
Append when there's nothing blocking between appending and sending an
update to the server.
2018-10-16 16:53:30 -07:00
Michael Schurter a2b696c4cf Use a semaphore to block until watcher exits 2018-10-16 16:53:30 -07:00
Michael Schurter a73162c977 ar: use multierror in update hook loop
Make it match TaskRunner update hook behavior
2018-10-16 16:53:30 -07:00
Michael Schurter a7b427718c tr: refactor EmitEvents into Emit+Append
* UpdateState: set state, append event, persist, update servers
* EmitEvent: append event, persist, update servers
* AppendEvent: append event, persist

AppendEvent may not even have to persist, but for the sake of
correctness I'm going with that for now.
2018-10-16 16:53:30 -07:00
Michael Schurter 93f3ac9ed6 ar: create health setting shim for health watcher 2018-10-16 16:53:30 -07:00
Michael Schurter 4d5aaac6d2 fix detection of task transitioning to running 2018-10-16 16:53:30 -07:00
Michael Schurter 4136e59f79 arv2: implement alloc health watching
Also remove initial alloc from broadcaster as it just caused useless
extra processing.
2018-10-16 16:53:30 -07:00
Michael Schurter 5c5c6dc41b refactor ar hooks into their own files
minimize passed dependencies to ease testing
2018-10-16 16:53:30 -07:00
Michael Schurter 0bbf3a93ee make AllocBroadcaster easier to use
And test thoroughly.
2018-10-16 16:53:30 -07:00
Michael Schurter 9d1ea3b228 client: hclog-ify most of the client
Leaving fingerprinters in case that interface changes with plugins.
2018-10-16 16:53:30 -07:00
Michael Schurter e42154fc46 implement stopping, destroying, and disk migration
* Stopping an alloc is implemented via Updates but update hooks are
  *not* run.
* Destroying an alloc is a best effort cleanup.
* AllocRunner destroy hooks implemented.
* Disk migration and blocking on a previous allocation exiting moved to
  its own package to avoid cycles. Now only depends on alloc broadcaster
  instead of also using a waitch.
* AllocBroadcaster now only drops stale allocations and always keeps the
  latest version.
* Made AllocDir safe for concurrent use

Lots of internal contexts that are currently unused. Unsure if they
should be used or removed.
2018-10-16 16:53:30 -07:00
Michael Schurter 4236255686 lots of comment/log fixes 2018-10-16 16:53:30 -07:00
Michael Schurter 5749ede04e keep forgetting lxc 2018-10-16 16:53:30 -07:00
Michael Schurter 357641c364 persist alloc state on changes, not periodically
Allow alloc and task runners to persist their own state when something
changes instead of periodically syncing all state.
2018-10-16 16:53:30 -07:00
Michael Schurter 820af27171 wrap boltdb in a write deduplicator
Saves a tiny bit of cpu and some IO. Sadly doesn't prevent all IO on
duplicate writes as the transactions are still created and committed.

$ go test -bench=. -benchmem
goos: linux
goarch: amd64
pkg: github.com/hashicorp/nomad/helper/boltdd
BenchmarkWriteDeduplication_On-4             500           4059591 ns/op           23736 B/op         56 allocs/op
BenchmarkWriteDeduplication_Off-4            300           4115319 ns/op           25942 B/op         55 allocs/op
2018-10-16 16:53:30 -07:00
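Note on 820af27171: a minimal Go sketch of write deduplication, assuming the wrapper remembers a hash of the last value written per key and skips the write when the hash is unchanged. The `dedupStore` type, the in-memory map standing in for boltdb, and the choice of SHA-256 are all assumptions for illustration; they are not the helper/boltdd implementation, and the sketch does not model transactions.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// dedupStore is a hypothetical in-memory stand-in for a key/value store
// wrapped with write deduplication.
type dedupStore struct {
	hashes map[string][sha256.Size]byte // hash of the last value written per key
	data   map[string][]byte
	writes int // number of writes that actually reached the store
}

func newDedupStore() *dedupStore {
	return &dedupStore{
		hashes: make(map[string][sha256.Size]byte),
		data:   make(map[string][]byte),
	}
}

// Put writes val under key unless it is identical to the last value
// written for that key. It reports whether a write happened.
func (s *dedupStore) Put(key string, val []byte) bool {
	h := sha256.Sum256(val)
	if prev, ok := s.hashes[key]; ok && prev == h {
		return false // duplicate: skip the write
	}
	s.hashes[key] = h
	s.data[key] = append([]byte(nil), val...)
	s.writes++
	return true
}

func main() {
	s := newDedupStore()
	fmt.Println(s.Put("alloc", []byte("v1"))) // prints "true"
	fmt.Println(s.Put("alloc", []byte("v1"))) // prints "false"
	fmt.Println(s.Put("alloc", []byte("v2"))) // prints "true"
	fmt.Println(s.writes)                     // prints "2"
}
```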
Michael Schurter 990228a6e2 wip wrap boltdb to get path information
finished but doesn't handle deleting deeply nested buckets
2018-10-16 16:53:30 -07:00
Michael Schurter a3fe0510d1 Move all encoding and put deduping into state db
Still WIP as it does not handle deletions.
2018-10-16 16:53:30 -07:00
Michael Schurter 533bc93b3a implement all boltdb interactions behind StateDB 2018-10-16 16:53:30 -07:00
Michael Schurter d890de036a tr: persist hook state whenever it changes 2018-10-16 16:53:30 -07:00
Michael Schurter fae5e89a0e artifacts: don't emit event when there's no artifacts 2018-10-16 16:53:30 -07:00
Michael Schurter 5383d20505 removing old restoration path before api change 2018-10-16 16:53:30 -07:00
Michael Schurter a5d3e3fb0a Implement alloc updates in arv2
Updates are applied asynchronously but sequentially
2018-10-16 16:53:30 -07:00
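Note on a5d3e3fb0a: one way to get "asynchronous but sequential" application of updates in Go is a single worker goroutine draining a channel, sketched below. The `updater` type is hypothetical and carries plain integers; it makes no claim about how arv2 actually queues alloc updates.

```go
package main

import "fmt"

// updater applies updates on one background goroutine, in the order
// they were submitted.
type updater struct {
	ch      chan int
	done    chan struct{}
	applied []int
}

func newUpdater() *updater {
	u := &updater{ch: make(chan int, 16), done: make(chan struct{})}
	go func() {
		defer close(u.done)
		for v := range u.ch { // one consumer, so updates never interleave
			u.applied = append(u.applied, v)
		}
	}()
	return u
}

// Update queues v and returns without waiting for it to be applied,
// as long as the buffer has room.
func (u *updater) Update(v int) { u.ch <- v }

// Stop closes the queue, waits for the worker to drain it, and returns
// the updates in the order they were applied.
func (u *updater) Stop() []int {
	close(u.ch)
	<-u.done
	return u.applied
}

func main() {
	u := newUpdater()
	for i := 1; i <= 3; i++ {
		u.Update(i)
	}
	fmt.Println(u.Stop()) // prints "[1 2 3]"
}
```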
Michael Schurter 39b3f3a85b call handle.Network() instead of storing it 2018-10-16 16:53:30 -07:00
Michael Schurter 7132b67c1e Add Network method to Handle interface
Should probably be moved to an Inspect method in the Driver Plugin world
2018-10-16 16:53:30 -07:00
Michael Schurter a4b4d7b266 consul service hook
Deregistration works but difficult to test due to terminal updates not
being fully implemented in the new client/ar/tr.
2018-10-16 16:53:29 -07:00
Michael Schurter 5be982e674 restore vault client 2018-10-16 16:53:29 -07:00
Michael Schurter ce04915c9f log before killing tasks 2018-10-16 16:53:29 -07:00
Michael Schurter a2bf851805 no need for TaskStateUpdated to return an error
also updated comments
2018-10-16 16:53:29 -07:00
Alex Dadgar fd3bc1bd39 Update state with server 2018-10-16 16:53:29 -07:00
Alex Dadgar bc905cc61d Define and thread through state updating interface 2018-10-16 16:53:29 -07:00
Michael Schurter 9a63d6103d tr: add validate task hook 2018-10-16 16:53:29 -07:00
Michael Schurter 7f4ec50906 missed locking around c.allocs access 2018-10-16 16:53:29 -07:00
Alex Dadgar c93cfc89c0 wip 2018-10-16 16:53:29 -07:00
Alex Dadgar 7ddc0eb65c Fix deadlock 2018-10-16 16:53:29 -07:00
Alex Dadgar 3779077052 Remove SetState from interface 2018-10-16 16:53:29 -07:00
Alex Dadgar e1ba73b515 compile 2018-10-16 16:53:29 -07:00
Michael Schurter 6ebdf532ea wip split event emitting and state transitions 2018-10-16 16:53:29 -07:00
Michael Schurter 516d641db0 client: implement all-or-nothing alloc restoration
Restoring calls NewAR -> Restore -> Run

NewAR now calls NewTR
AR.Restore calls TR.Restore
AR.Run calls TR.Run
2018-10-16 16:53:29 -07:00
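Note on the call order above: a minimal Go sketch of the NewAR -> Restore -> Run sequence, reading "all-or-nothing" as "if any alloc fails to restore, none are run". That reading is an assumption, and `allocRunner`, `taskRunner`, `restoreAll`, and the `failRestore` switch are hypothetical names, not Nomad's real types.

```go
package main

import "fmt"

// taskRunner is a hypothetical stand-in for a task runner (TR).
type taskRunner struct {
	name        string
	failRestore bool // test switch: simulate corrupt or missing state
	running     bool
}

func newTaskRunner(name string) *taskRunner { return &taskRunner{name: name} }

func (t *taskRunner) Restore() error {
	if t.failRestore {
		return fmt.Errorf("task %q: restore failed", t.name)
	}
	return nil
}

func (t *taskRunner) Run() { t.running = true }

// allocRunner is a hypothetical stand-in for an alloc runner (AR).
type allocRunner struct {
	tasks []*taskRunner
}

// newAllocRunner mirrors "NewAR now calls NewTR".
func newAllocRunner(taskNames ...string) *allocRunner {
	ar := &allocRunner{}
	for _, n := range taskNames {
		ar.tasks = append(ar.tasks, newTaskRunner(n))
	}
	return ar
}

// Restore mirrors "AR.Restore calls TR.Restore".
func (ar *allocRunner) Restore() error {
	for _, tr := range ar.tasks {
		if err := tr.Restore(); err != nil {
			return err
		}
	}
	return nil
}

// Run mirrors "AR.Run calls TR.Run" (synchronous here for simplicity).
func (ar *allocRunner) Run() {
	for _, tr := range ar.tasks {
		tr.Run()
	}
}

// restoreAll restores every alloc before running any of them.
func restoreAll(ars []*allocRunner) error {
	for _, ar := range ars {
		if err := ar.Restore(); err != nil {
			return err // nothing has been started yet
		}
	}
	for _, ar := range ars {
		ar.Run()
	}
	return nil
}

func main() {
	ars := []*allocRunner{newAllocRunner("redis"), newAllocRunner("web")}
	ars[1].tasks[0].failRestore = true
	fmt.Println(restoreAll(ars) != nil, ars[0].tasks[0].running) // prints true false
}
```

Splitting restore and run into two passes is what makes the failure case clean: an error in the first pass leaves no runner started.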
Alex Dadgar e401c660e7 Implement lifecycle hooks on the task runner 2018-10-16 16:53:29 -07:00
Alex Dadgar 89b4ba9cc8 comments 2018-10-16 16:53:29 -07:00
Alex Dadgar 86e81947b4 Hook renames 2018-10-16 16:53:29 -07:00
Alex Dadgar 2599cf9d74 remove comment 2018-10-16 16:53:29 -07:00
Alex Dadgar 88aa0299a9 Template hook 2018-10-16 16:53:29 -07:00
Alex Dadgar c9765deff1 address comments 2018-10-16 16:53:29 -07:00
Alex Dadgar 80f6ce50c0 vault hook 2018-10-16 16:53:29 -07:00
Michael Schurter 30d377eba4 tr: improve skip log line 2018-10-16 16:53:29 -07:00
Michael Schurter ef213b864b tr: pass context to hooks 2018-10-16 16:53:29 -07:00
Michael Schurter 3a4f387fd3 tr: fix setting done in existing hooks 2018-10-16 16:53:29 -07:00
Michael Schurter b360f6f96e fix hclog level 2018-10-16 16:53:29 -07:00
Michael Schurter ae89b7da95 reimplement success state for tr hooks and state persistence
splits apart local and remote persistence

removes some locking *for now*
2018-10-16 16:53:29 -07:00
Michael Schurter 4f43ff5c51 pass statedb into allocrunnerv2 2018-10-16 16:53:29 -07:00
Michael Schurter 582c76a420 remove unused allocrunner shim 2018-10-16 16:53:29 -07:00
Michael Schurter c5504bd939 tr: cleanup main loop and shutdown hook impl 2018-10-16 16:53:29 -07:00
Michael Schurter 561260d6fe tr: skip error/success saving
All hooks only need to be run once.
Since only one hook can fail per run there's no need to
track errors on a per hook basis.
2018-10-16 16:53:29 -07:00
Michael Schurter 67874e761f tr: don't lock for immutable fields 2018-10-16 16:53:29 -07:00
Michael Schurter f473cd03d6 tr: start update/shutdown logic 2018-10-16 16:53:29 -07:00
Michael Schurter 637ef264ae Copy TR.Config vals to TR
I think I like this pattern better as some Config vals are mutable
(Alloc) and some aren't and some are used to derive other values and
never used directly.

Promoting them onto the TR struct is a little more work but is hopefully
more clear as to how each value is used.
2018-10-16 16:53:29 -07:00
Michael Schurter 0f7dcfdc9a example redis job "runs" on arv2! see below
Tons left to do and lots of churn:
1. No state saving
2. No shutdown or gc
3. Removed AR factory *for now*
4. Made all "Config" structs local to the package they configure
5. Added allocID to GC to avoid a lookup

Really hating how many things use *structs.Allocation. It's not bad
without state saving, but if AllocRunner starts updating its copy things
get racy fast.
2018-10-16 16:53:29 -07:00
Michael Schurter 9a6aa38b0f begin adding AllocRunner.Update 2018-10-16 16:53:29 -07:00
Michael Schurter eae54e2954 artifact task hook 2018-10-16 16:53:29 -07:00
Alex Dadgar b9bed81e6e Initial V2 alloc runner 2018-10-16 16:53:28 -07:00
Alex Dadgar a78cefec18 use int64 2018-10-16 15:34:32 -07:00
Preetha Appan 7c0d8c646c
Change CPU/Disk/MemoryMB to int everywhere in new resource structs 2018-10-16 16:21:42 -05:00
Christian Winther 0c5154100c
fix: increase log rotator line scan limit
When gelf/json logging is used, it's fairly easy to exceed the 16k limit, resulting in JSON output being cut up into multiple strings.

The result is invalid JSON lines, which can create all kinds of badness in the logging server.

This fixes https://github.com/hashicorp/nomad/issues/4699

Signed-off-by: Christian Winther <jippignu@gmail.com>
2018-10-09 18:57:18 +02:00
Alex Dadgar 01f8e5b95f renames 2018-10-04 14:57:25 -07:00
Alex Dadgar 52f9cd7637 fixing tests 2018-10-04 14:26:19 -07:00
Alex Dadgar bac5cb1e8b Scheduler uses allocated resources 2018-10-02 17:08:25 -07:00
Alex Dadgar 5c8697667e Node reserved resources 2018-09-29 18:44:55 -07:00
Alex Dadgar 3183153315 Node resources on client 2018-09-29 17:23:41 -07:00
Alex Dadgar 9971b3393f yamux 2018-09-17 14:22:40 -07:00
Alex Dadgar ca28afa3b2 small fixes 2018-09-15 16:42:38 -07:00
Alex Dadgar 7739ef51ce agent + consul 2018-09-13 10:43:40 -07:00
Michael Schurter 08862fc177 fix race around error handling 2018-09-05 17:34:17 -07:00
Michael Schurter 6def5bc4f9 client: set host name when migrating over tls
Not setting the host name led the Go HTTP client to expect a certificate
with a DNS-resolvable name. Since Nomad uses `${role}.${region}.nomad`
names, ephemeral dir migrations were broken when TLS was enabled.

Added an e2e test to ensure this doesn't break again as it's very
difficult to test and the TLS configuration is very easy to get wrong.
2018-09-05 17:24:17 -07:00
Alex Dadgar c6576ddac1 Fix make check errors 2018-09-04 16:03:52 -07:00
Alex Dadgar 089b533047 Fix kill timeout exceeding 5m on Docker driver
Fixes an issue where the Docker API client would time out before the kill
timeout was hit.
2018-08-17 16:01:09 -07:00
Alex Dadgar 49a1ba9297
Merge pull request #4535 from hashicorp/f-keep-docker-container-0.8.4
Option to prevent removal of container on exit
2018-07-26 11:11:22 -07:00
Charlie Voiselle f319a149cd Option to prevent removal of container on exit 2018-07-26 11:10:48 -07:00