Commit Graph

304 Commits

Author SHA1 Message Date
Nick Ethier 9c3cc63cd1 tr: initialize envoybootstrap prestart hook response.Env field 2020-01-08 13:41:38 -05:00
Nick Ethier 105cbf6df9 tr: expose envoy sidecar admin port as environment variable 2020-01-06 21:53:45 -05:00
Nick Ethier 677e9cdc16 connect: configure envoy such that multiple sidecars can run in the same alloc 2020-01-06 11:26:27 -05:00
Mahmood Ali 4a1cc67f58
Merge pull request #6820 from hashicorp/f-skip-docker-logging-knob
driver: allow disabling log collection
2019-12-13 11:41:20 -05:00
Mahmood Ali 46bc3b57e6 address review comments 2019-12-13 11:21:00 -05:00
Chris Dickson 4d8ba272d1 client: expose allocated CPU per task (#6784) 2019-12-09 15:40:22 -05:00
Mahmood Ali 0b7085ba3a driver: allow disabling log collection
Operators commonly have docker logs aggregated using various tools and
don't need nomad to manage their docker logs.  Worse, Nomad uses a
somewhat heavy docker api call to collect them and it seems to cause
problems when a client runs hundreds of log collections.

Here we add a knob to disable log aggregation completely for nomad.
When log collection is disabled, we avoid running logmon and
docker_logger for the docker tasks in this implementation.

The downside here is once disabled, `nomad logs ...` commands and API
no longer return logs and operators must corrolate alloc-ids with their
aggregated log info.

This is meant as a stop gap measure.  Ideally, we'd follow up with at
least two changes:

First, we should optimize behavior when we can such that operators don't
need to disable docker log collection.  Potentially by reverting to
using pre-0.9 syslog aggregation in linux environments, though with
different trade-offs.

Second, when/if logs are disabled, nomad logs endpoints should lookup
docker logs api on demand.  This ensures that the cost of log collection
is paid sparingly.
2019-12-08 14:15:03 -05:00
Preetha be4a51d5b8
Merge pull request #6349 from hashicorp/b-host-stats
client: Return empty values when host stats fail
2019-11-20 10:13:02 -06:00
Lang Martin aa985ebe21
getter: allow the gcs download scheme (#6692) 2019-11-19 09:10:56 -05:00
Nick Ethier bd454a4c6f
client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00
Michael Schurter 9fed8d1bed client: fix panic from 0.8 -> 0.10 upgrade
makeAllocTaskServices did not do a nil check on AllocatedResources
which causes a panic when upgrading directly from 0.8 to 0.10. While
skipping 0.9 is not supported we intend to fix serious crashers caused
by such upgrades to prevent cluster outages.

I did a quick audit of the client package and everywhere else that
accesses AllocatedResources appears to be properly guarded by a nil
check.
2019-11-01 07:47:03 -07:00
Michael Schurter f54f1cb321
Revert "Revert "Use joint context to cancel prestart hooks"" 2019-10-08 11:34:09 -07:00
Michael Schurter 81a30ae106
Revert "Use joint context to cancel prestart hooks" 2019-10-08 11:27:08 -07:00
Drew Bailey 69eebcd241
simplify logic to check for vault read event
defer shutdown to cleanup after failed run

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

update comment to include ctx note for shutdown
2019-09-30 11:02:14 -07:00
Drew Bailey 7565b8a8d9
Use joint context to cancel prestart hooks
fixes https://github.com/hashicorp/nomad/issues/6382

The prestart hook for templates blocks while it resolves vault secrets.
If the secret is not found it continues to retry. If a task is shutdown
during this time, the prestart hook currently does not receive
shutdownCtxCancel, causing it to hang.

This PR joins the two contexts so either killCtx or shutdownCtx cancel
and stop the task.
2019-09-30 10:48:01 -07:00
Danielle Lancashire 4f2343e1c0
client: Return empty values when host stats fail
Currently, there is an issue when running on Windows whereby under some
circumstances the Windows stats API's will begin to return errors (such
as internal timeouts) when a client is under high load, and potentially
other forms of resource contention / system states (and other unknown
cases).

When an error occurs during this collection, we then short circuit
further metrics emission from the client until the next interval.

This can be problematic if it happens for a sustained number of
intervals, as our metrics aggregator will begin to age out older
metrics, and we will eventually stop emitting various types of metrics
including `nomad.client.unallocated.*` metrics.

However, when metrics collection fails on Linux, gopsutil will in many cases
(e.g cpu.Times) silently return 0 values, rather than an error.

Here, we switch to returning empty metrics in these failures, and
logging the error at the source. This brings the behaviour into line
with Linux/Unix platforms, and although making aggregation a little
sadder on intermittent failures, will result in more desireable overall
behaviour of keeping metrics available for further investigation if
things look unusual.
2019-09-19 01:22:07 +02:00
Danielle Lancashire 78b61de45f
config: Hoist volume.config.source into volume
Currently, using a Volume in a job uses the following configuration:

```
volume "alias-name" {
  type = "volume-type"
  read_only = true

  config {
    source = "host_volume_name"
  }
}
```

This commit migrates to the following:

```
volume "alias-name" {
  type = "volume-type"
  source = "host_volume_name"
  read_only = true
}
```

The original design was based due to being uncertain about the future of storage
plugins, and to allow maxium flexibility.

However, this causes a few issues, namely:
- We frequently need to parse this configuration during submission,
scheduling, and mounting
- It complicates the configuration from and end users perspective
- It complicates the ability to do validation

As we understand the problem space of CSI a little more, it has become
clear that we won't need the `source` to be in config, as it will be
used in the majority of cases:

- Host Volumes: Always need a source
- Preallocated CSI Volumes: Always needs a source from a volume or claim name
- Dynamic Persistent CSI Volumes*: Always needs a source to attach the volumes
                                   to for managing upgrades and to avoid dangling.
- Dynamic Ephemeral CSI Volumes*: Less thought out, but `source` will probably point
                                  to the plugin name, and a `config` block will
                                  allow you to pass meta to the plugin. Or will
                                  point to a pre-configured ephemeral config.
*If implemented

The new design simplifies this by merging the source into the volume
stanza to solve the above issues with usability, performance, and error
handling.
2019-09-13 04:37:59 +02:00
Tim Gross 3fa4bca4a0
script checks: Update needs to update Alloc as well (#6291) 2019-09-06 11:18:00 -04:00
Tim Gross 8ce201854a
client: recreate script checks on Update (#6265)
Splitting the immutable and mutable components of the scriptCheck led
to a bug where the environment interpolation wasn't being incorporated
into the check's ID, which caused the UpdateTTL to update for a check
ID that Consul didn't have (because our Consul client creates the ID
from the structs.ServiceCheck each time we update).

Task group services don't have access to a task environment at
creation, so their checks get registered before the check can be
interpolated. Use the original check ID so they can be updated.
2019-09-05 11:43:23 -04:00
Tim Gross 0f29dcc935
support script checks for task group services (#6197)
In Nomad prior to Consul Connect, all Consul checks work the same
except for Script checks. Because the Task being checked is running in
its own container namespaces, the check is executed by Nomad in the
Task's context. If the Script check passes, Nomad uses the TTL check
feature of Consul to update the check status. This means in order to
run a Script check, we need to know what Task to execute it in.

To support Consul Connect, we need Group Services, and these need to
be registered in Consul along with their checks. We could push the
Service down into the Task, but this doesn't work if someone wants to
associate a service with a task's ports, but do script checks in
another task in the allocation.

Because Nomad is handling the Script check and not Consul anyways,
this moves the script check handling into the task runner so that the
task runner can own the script check's configuration and
lifecycle. This will allow us to pass the group service check
configuration down into a task without associating the service itself
with the task.

When tasks are checked for script checks, we walk back through their
task group to see if there are script checks associated with the
task. If so, we'll spin off script check tasklets for them. The
group-level service and any restart behaviors it needs are entirely
encapsulated within the group service hook.
2019-09-03 15:09:04 -04:00
Michael Schurter 5957030d18
connect: add unix socket to proxy grpc for envoy (#6232)
* connect: add unix socket to proxy grpc for envoy

Fixes #6124

Implement a L4 proxy from a unix socket inside a network namespace to
Consul's gRPC endpoint on the host. This allows Envoy to connect to
Consul's xDS configuration API.

* connect: pointer receiver on structs with mutexes

* connect: warn on all proxy errors
2019-09-03 08:43:38 -07:00
Evan Ercolano fcf66918d0 Remove unused canary param from MakeTaskServiceID 2019-08-31 16:53:23 -04:00
Lang Martin 4f6493a301 taskrunner getter set Umask for go-getter, setuid test 2019-08-23 15:59:03 -04:00
Jerome Gravel-Niquet cbdc1978bf Consul service meta (#6193)
* adds meta object to service in job spec, sends it to consul

* adds tests for service meta

* fix tests

* adds docs

* better hashing for service meta, use helper for copying meta when registering service

* tried to be DRY, but looks like it would be more work to use the
helper function
2019-08-23 12:49:02 -04:00
Michael Schurter 59e0b67c7f connect: task hook for bootstrapping envoy sidecar
Fixes #6041

Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
2019-08-22 08:15:32 -07:00
Tim Gross 03433f35d4 client/template: configuration for function blacklist and sandboxing
When rendering a task template, the `plugin` function is no longer
permitted by default and will raise an error. An operator can opt-in
to permitting this function with the new `template.function_blacklist`
field in the client configuration.

When rendering a task template, path parameters for the `file`
function will be treated as relative to the task directory by
default. Relative paths or symlinks that point outside the task
directory will raise an error. An operator can opt-out of this
protection with the new `template.disable_file_sandbox` field in the
client configuration.
2019-08-12 16:34:48 -04:00
Danielle Lancashire 861caa9564
HostVolumeConfig: Source -> Path 2019-08-12 15:39:08 +02:00
Danielle Lancashire e132a30899
structs: Unify Volume and VolumeRequest 2019-08-12 15:39:08 +02:00
Danielle Lancashire 6ef8d5233e
client: Add volume_hook for mounting volumes 2019-08-12 15:39:08 +02:00
Mahmood Ali b17bac5101 Render consul templates using task env only (#6055)
When rendering a task consul template, ensure that only task environment
variables are used.

Currently, `consul-template` always falls back to host process
environment variables when key isn't a task env var[1].  Thus, we add
an empty entry for each host process env-var not found in task env-vars.

[1] bfa5d0e133/template/funcs.go (L61-L75)
2019-08-05 16:30:47 -04:00
Mahmood Ali f66169cd6a
Merge pull request #6065 from hashicorp/b-nil-driver-exec
Check if driver handle is nil before execing
2019-08-02 09:48:28 -05:00
Mahmood Ali a4670db9b7 Check if driver handle is nil before execing
Defend against tr.getDriverHandle being nil.  Exec handler checks if
task is running, but it may be stopped between check and driver handler
fetching.
2019-08-02 10:07:41 +08:00
Nick Ethier 66c514a388
Add network lifecycle management
Adds a new Prerun and Postrun hooks to manage set up of network namespaces
on linux. Work still needs to be done to make the code platform agnostic and
support Docker style network initalization.
2019-07-31 01:03:17 -04:00
Jasmine Dahilig 2157f6ddf1
add formatting for hcl parsing error messages (#5972) 2019-07-19 10:04:39 -07:00
Mahmood Ali cd6f1d3102 Update consul-template dependency to latest
To pick up the fix in
https://github.com/hashicorp/consul-template/pull/1231 .
2019-07-18 07:32:03 +07:00
Mahmood Ali 8a82260319 log unrecoverable errors 2019-07-17 11:01:59 +07:00
Mahmood Ali 1a299c7b28 client/taskrunner: fix stats stats retry logic
Previously, if a channel is closed, we retry the Stats call.  But, if that call
fails, we go in a backoff loop without calling Stats ever again.

Here, we use a utility function for calling driverHandle.Stats call that retries
as one expects.

I aimed to preserve the logging formats but made small improvements as I saw fit.
2019-07-11 13:58:07 +08:00
Mahmood Ali f10201c102 run post-run/post-stop task runner hooks
Handle when prestart failed while restoring a task, to prevent
accidentally leaking consul/logmon processes.
2019-07-02 18:38:32 +08:00
Mahmood Ali 4afd7835e3 Fail alloc if alloc runner prestart hooks fail
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.

This leads to terrible results, some of which are:
* Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in
  pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after
  alloc expected start time!

Here, we treat all tasks to have failed if alloc runner prestart hook fails.
This fixes the lockups, and permits the alloc to be rescheduled on another node.

While it's desirable to retry alloc runner in such failures, I opted to treat it
out of scope.  I'm afraid of some subtles about alloc and task runners and their
idempotency that's better handled in a follow up PR.

This might be one of the root causes for
https://github.com/hashicorp/nomad/issues/5840 .
2019-07-02 18:35:47 +08:00
Mahmood Ali 7614b8f09e
Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2
task runner to avoid running task if terminal
2019-07-02 15:31:17 +08:00
Mahmood Ali 7bfad051b9 address review comments 2019-07-02 14:53:50 +08:00
Mahmood Ali 3d89ae0f1e task runner to avoid running task if terminal
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having task runner uses the
allocRunner.shouldRun() instead of only checking the server updated
alloc.

Here, we preserve much of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes https://github.com/hashicorp/nomad/issues/5883
2019-06-27 11:27:34 +08:00
Danielle Lancashire b9ac184e1f
tr: Fetch Wait channel before killTask in restart
Currently, if killTask results in the termination of a process before
calling WaitTask, Restart() will incorrectly return a TaskNotFound
error when using the raw_exec driver on Windows.
2019-06-26 15:20:57 +02:00
Chris Baker f71114f5b8 cleanup test 2019-06-18 14:15:25 +00:00
Chris Baker a2dc351fd0 formatting and clarity 2019-06-18 14:00:57 +00:00
Chris Baker e0170e1c67 metrics: add namespace label to allocation metrics 2019-06-17 20:50:26 +00:00
Danielle f923b568e0
Merge pull request #5821 from hashicorp/dani/b-5770
trhooks: Add TaskStopHook interface to services
2019-06-12 17:30:49 +02:00
Danielle Lancashire c326344b57
trt: Fix test 2019-06-12 17:06:11 +02:00
Danielle Lancashire 13d76e35fd
trhooks: Add TaskStopHook interface to services
We currently only run cleanup Service Hooks when a task is either
Killed, or Exited. However, due to the implementation of a task runner,
tasks are only Exited if they every correctly started running, which is
not true when you recieve an error early in the task start flow, such as
not being able to pull secrets from Vault.

This updates the service hook to also call consul deregistration
routines during a task Stop lifecycle event, to ensure that any
registered checks and services are cleared in such cases.

fixes #5770
2019-06-12 16:00:21 +02:00
Mahmood Ali 2acf30fdd3 Fallback to `alloc.TaskResources` for old allocs
When a client is running against an old server (e.g. running 0.8),
`alloc.AllocatedResources` may be nil, and we need to check the
deprecated `alloc.TaskResources` instead.

Fixes https://github.com/hashicorp/nomad/issues/5810
2019-06-11 10:32:53 -04:00
Mahmood Ali d30c3d10b0
Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1
More test fixes
2019-06-05 19:09:18 -04:00
Mahmood Ali 935ee86e92
Merge pull request #5737 from fwkz/fix-restart-attempts
Fix restart attempts of `restart` stanza in `delay` mode.
2019-06-05 19:05:07 -04:00
Danielle Lancashire 27583ed8c1 client: Pass servers contacted ch to allocrunner
This fixes an issue where batch and service workloads would never be
restarted due to indefinitely blocking on a nil channel.

It also raises the restoration logging message to `Info` to simplify log
analysis.
2019-05-22 13:47:35 +02:00
Mahmood Ali 9df1e00f35 tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal
Given that Signal may be called multiple times, blocking for `SignalCh`
isn't sufficient to synchornizing access to Signals field.
2019-05-21 13:56:58 -04:00
fwkz 8b84bec95a Fix restart attempts of `restart` stanza.
Number of restarts during 2nd interval is off by one.
2019-05-21 13:27:19 +02:00
Michael Schurter 2fe0768f3b docs: changelog entry for #5669 and fix comment 2019-05-14 10:54:00 -07:00
Michael Schurter af9096c8ba client: register before restoring
Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).

Since restoring is synchronous, start the registration goroutine first.

For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
2019-05-14 10:53:27 -07:00
Michael Schurter e07f73bfe0 client: do not restart dead tasks until server is contacted (try 2)
Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162

Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!

The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.
2019-05-14 10:53:27 -07:00
Michael Schurter d7e5ace1ed client: do not restart dead tasks until server is contacted
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
2019-05-14 10:53:27 -07:00
Michael Schurter 1c4e585fa7 client: expose allocated memory per task
Related to #4280

This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.

```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
2019-05-10 11:12:12 -07:00
Mahmood Ali ab2cae0625 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Mahmood Ali bf0a09e270 retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
Mahmood Ali 81841e8528 try checking process status 2019-04-25 18:16:13 -04:00
Mahmood Ali fc78521f29 add logging about attempts 2019-04-25 18:09:36 -04:00
Mahmood Ali e6ca8641a8 try sleeping for stop signal to take effect 2019-04-25 17:16:29 -04:00
Mahmood Ali ff3a095015 add a test that simulates logmon dying during Start() call 2019-04-25 16:41:17 -04:00
Mahmood Ali bbac73883c logmon: retry starting logmon if it exits
Retry if we detect shutting down during Start() api call is started,
locally.
2019-04-25 15:10:16 -04:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Danielle Lancashire c31966fc71 loggging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment seperately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Michael Schurter f7a7acc345
Merge pull request #5518 from hashicorp/f-simplify-kill
client: simplify kill logic
2019-04-15 14:11:58 -07:00
Chris Baker 6848591914 vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation 2019-04-12 15:06:34 +00:00
Chris Baker c0a7aee610
vault e2e: pass vault version into setup instead of having to infer it from test name 2019-04-10 10:34:10 -05:00
Chris Baker f0c184fc29
taskrunner: removed some unecessary config from a test 2019-04-10 10:34:10 -05:00
Chris Baker 170f5239c8
client: gofmt 2019-04-10 10:34:10 -05:00
Chris Baker a1d7971b2e
taskrunner: pass configured Vault namespace into TaskTemplateConfig 2019-04-10 10:34:10 -05:00
Michael Schurter f7d4428855 client: simplify kill logic
Remove runLaunched tracking as Run is *always* called for killable
TaskRunners. TaskRunners which fail before Run can be called (during
NewTaskRunner or Restore) are not killable as they're never added to the
client's alloc map.
2019-04-04 15:18:33 -07:00
Michael Schurter 1d569a27dc Revert "executor/linux: add defensive checks to binary path"
This reverts commit cb36f4537e63d53b198c2a87d1e03880895631bd.
2019-04-02 11:17:12 -07:00
Michael Schurter fc5487dbbc executor/linux: add defensive checks to binary path 2019-04-02 09:40:53 -07:00
Michael Schurter 7d49bc4c71 executor/linux: make chroot binary paths absolute
Avoid libcontainer.Process trying to lookup the binary via $PATH as the
executor has already found where the binary is located.
2019-04-01 15:45:31 -07:00
Michael Schurter a4572919cd
Merge pull request #5456 from hashicorp/test-taskenv
tests: port pre-0.9 task env tests
2019-03-25 10:41:38 -07:00
Michael Schurter 8efad12538 tests: port pre-0.9 task env tests
I chose to make them more of integration tests since there's a lot more
plumbing involved. The internal implementation details of how we craft
task envs can now change and these tests will still properly assert the
task runtime environment is setup properly.
2019-03-25 09:46:53 -07:00
Nick Ethier dc18b8928a
logmon: make Start rpc idempotent and simplify hook 2019-03-19 14:02:36 -04:00
Nick Ethier ac7fbee1b8
logmon:add static check for logmon exited hook 2019-03-18 15:59:43 -04:00
Nick Ethier 7dc3d83634
client/logmon: restart log collection correctly when a task is restarted 2019-03-15 23:59:18 -04:00
Michael Schurter 32d31575cc client: emit event and call exited hooks during cleanup
Builds upon earlier commit that cleans up restored handles of terminal
allocs by also emitting terminated events and calling exited hooks when
appropriate.
2019-03-05 15:12:02 -08:00
Michael Schurter 64e145ebdb logmon: drop reattach log level as its expected
Logged once per terminal task on agent restart.
2019-03-04 13:26:01 -08:00
Michael Schurter c5271d3fa5 client: test logmon cleanup
The test is sadly quite complicated and peeks into things (logmon's
reattach config) AR doesn't normally have access to.

However, I couldn't find another way of asserting logmon got cleaned up
without resorting to smaller unit tests. Smaller unit tests risk
re-implementing dependencies in an unrealistic way, so I opted for an
ugly integration test.
2019-03-04 13:15:15 -08:00
Michael Schurter ef8d284352 client: ensure task is cleaned up when terminal
This commit is a significant change. TR.Run is now always executed, even
for terminal allocations. This was changed to allow TR.Run to cleanup
(run stop hooks) if a handle was recovered.

This is intended to handle the case of Nomad receiving a
DesiredStatus=Stop allocation update, persisting it, but crashing before
stopping AR/TR.

The commit also renames task runner hook data as it was very easy to
accidently set state on Requests instead of Responses using the old
field names.
2019-03-01 14:00:23 -08:00
Michael Schurter 812f1679e2
Merge pull request #5352 from hashicorp/b-leaked-logmon
logmon fixes
2019-02-26 08:35:46 -08:00
Michael Schurter e39a10a1f4 tests: move unix-specific test to its own file
Other logmon tests should be portable.
2019-02-26 07:56:44 -08:00
Michael Schurter 3b2a592e93 client: restart task on logmon failures
This code chooses to be conservative as opposed to optimal: when failing
to reattach to logmon simply return a recoverable error instead of
immediately trying to restart logmon.

The recoverable error will cause the task's restart policy to be
applied and a new logmon will be launched upon restart.

Trying to do the optimal approach of simply starting a new logmon
requires error string comparison and should be tested against a task
actively logging to assert the behavior (are writes blocked? dropped?).
2019-02-25 15:42:45 -08:00
Michael Schurter 8830b00866 client: test logmon_hook 2019-02-23 15:36:48 -08:00
Mahmood Ali 32551fb0e5 emit TaskRestartSignal event on vault restart
When Vault token expires and task is restarted, emit `TaskRestartSignal`
similar to v0.8.7
2019-02-22 15:56:14 -05:00
Mahmood Ali 8cb4bbcc08 address review comments 2019-02-22 15:56:14 -05:00
Mahmood Ali 216eaa4843 tests: port TestTaskRunner_VaultManager_Signal
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1427
2019-02-22 15:53:04 -05:00
Mahmood Ali 8e9e732319 tests: port TestTaskRunner_VaultManager_Restart
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1352
2019-02-22 15:53:04 -05:00
Mahmood Ali 33122ca7c0 tests: port TestTaskRunner_UnregisterConsul_Retries
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L620
2019-02-22 15:53:04 -05:00
Mahmood Ali 0128b0ce7a tests: port TestTaskRunner_Template_NewVaultToken
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1275
2019-02-22 15:53:04 -05:00
Mahmood Ali cfb80583af tests: port TestTaskRunner_Template_Artifact
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1195
2019-02-22 15:52:59 -05:00
Michael Schurter a2e3ea6dc9 logmon: fix reattach configuration
There were multiple bugs here:

1. Reattach unmarshalling always returned an error because you can't
   unmarshal into a nil pointer.
2. The hook data wasn't being saved because it was put on the request
   struct, not the response struct.
3. The plugin configuration should only have reattach *or* a command
   set. Not both.
4. Setting Done=true meant the hook was never re-run on agent restart so
   reattaching was never attempted.
2019-02-21 15:32:18 -08:00