Commit graph

512 commits

Author SHA1 Message Date
Tim Gross 24aa32c503 csi: use a blocking initial connection with timeout
The plugin supervisor lazily connects to plugins, but this means we
only get "Unavailable" back from the gRPC call in cases where the
plugin can never be reached (for example, if the Nomad client has the
wrong permissions for the socket).

This changeset improves the operator experience by switching to a
blocking `DialWithContext`. It eagerly connects so that we can
validate the connection is real and get a "failed to open" error in
case where Nomad can't establish the initial connection.
2020-05-14 15:59:19 -04:00
Mahmood Ali 543f08c1ae Deflake TestTaskTemplateManager_BlockedEvents test
This change deflakes TestTaskTemplateManager_BlockedEvents test, because
it is expecting a number of events without accounting for transitional
state.

The test TestTaskTemplateManager_BlockedEvents attempts to ensure that a
template rendering emits blocked events for missing template ksys.

It works by setting a template that requires keys 0,1,2,3,4 and then
eventually sets keys 0,1,2,3 and ensures that we get a final event indicating
that keys 3 and 4 are still missing.

The test waits to get a blocked event for the final state, but it can
fail if receives a blocked event for a transitional state (e.g. one
reporting 2,3,4,5 are missing).

This fixes the test by ensuring that it waits until the final message
before assertion.

Also, it clarifies the intent of the test with stricter assertions and
additional comments.
2020-05-09 14:09:39 -04:00
Drew Bailey 8bfee62b70
Run task shutdown_delay regardless of service registration
task shutdown_delay will currently only run if there are registered
services for the task. This implementation detail isn't explicity stated
anywhere and is defined outside of the service stanza.

This change moves shutdown_delay to be evaluated after prekill hooks are
run, outside of any task runner hooks.

just use time.sleep
2020-04-10 11:06:26 -04:00
Nick Ethier 5166806993
Merge pull request #7600 from hashicorp/b-5767
tr/service_hook: prevent Update from running before Poststart finish
2020-04-06 16:52:42 -04:00
Nick Ethier 567609e101
tr/service_hook: reset initialized flag during deregister 2020-04-06 16:05:36 -04:00
Nick Ethier 3b5d2f8eb8
tr/service_hook: update hook fields during update when poststart hasn't finished 2020-04-02 12:48:19 -04:00
Seth Hoenig e7fcd281ae connect: set consul TLS options on envoy bootstrap
Fixes #6594 #6711 #6714 #7567

e2e testing is still TBD in #6502

Before, we only passed the Nomad agent's configured Consul HTTP
address onto the `consul connect envoy ...` bootstrap command.
This meant any Consul setup with TLS enabled would not work with
Nomad's Connect integration.

This change now sets CLI args and Environment Variables for
configuring TLS options for communicating with Consul when doing
the envoy bootstrap, as described in
https://www.consul.io/docs/commands/connect/envoy.html#usage
2020-04-02 10:30:50 -06:00
Nick Ethier fa271ff1b3
tr/service_hook: prevent Update from running before Poststart has finished 2020-04-02 12:17:36 -04:00
Mahmood Ali a5b024fdea tests: restart restartpolicy for all tasks in tests 2020-03-24 21:52:48 -04:00
Mahmood Ali 7565ac34c0 tests: populate task restart policy properly 2020-03-24 21:44:37 -04:00
Mahmood Ali ceed57b48f per-task restart policy 2020-03-24 17:00:41 -04:00
Tim Gross 5a0bcd39d1 csi: dynamically update plugin registration (#7386)
Allow for faster updates to plugin status when allocations become
terminal by listening for register/deregister events from the dynamic
plugin registry (which in turn are triggered by the plugin supervisor
hook).

The deregistration function closures that we pass up to the CSI plugin
manager don't properly close over the name and type of the
registration, causing monolith-type plugins to deregister only one of
their two plugins on alloc shutdown. Rebind plugin supervisor 
deregistration targets to fix that.

Includes log message and comment improvements
2020-03-23 13:59:25 -04:00
Tim Gross fe926e899e volumes: add task environment interpolation to volume_mount (#7364) 2020-03-23 13:59:25 -04:00
Tim Gross 1cf7ef44ed csi: docstring and log message fixups (#7327)
Fix some docstring typos and fix noisy log message during client restarts.
A log for the common case where the plugin socket isn't ready yet
isn't actionable by the operator so having it at info is just noise.
2020-03-23 13:58:30 -04:00
Tim Gross de4ad6ca38 csi: add Provider field to CSI CLIs and APIs (#7285)
Derive a provider name and version for plugins (and the volumes that
use them) from the CSI identity API `GetPluginInfo`. Expose the vendor
name as `Provider` in the API and CLI commands.
2020-03-23 13:58:30 -04:00
Lang Martin a4784ef258 csi add allocation context to fingerprinting results (#7133)
* structs: CSIInfo include AllocID, CSIPlugins no Jobs

* state_store: eliminate plugin Jobs, delete an empty plugin

* nomad/structs/csi: detect empty plugins correctly

* client/allocrunner/taskrunner/plugin_supervisor_hook: option AllocID

* client/pluginmanager/csimanager/instance: allocID

* client/pluginmanager/csimanager/fingerprint: set AllocID

* client/node_updater: split controller and node plugins

* api/csi: remove Jobs

The CSI Plugin API will map plugins to allocations, which allows
plugins to be defined by jobs in many configurations. In particular,
multiple plugins can be defined in the same job, and multiple jobs can
be used to define a single plugin.

Because we now map the allocation context directly from the node, it's
no longer necessary to track the jobs associated with a plugin
directly.

* nomad/csi_endpoint_test: CreateTestPlugin & register via fingerprint

* client/dynamicplugins: lift AllocID into the struct from Options

* api/csi_test: remove Jobs test

* nomad/structs/csi: CSIPlugins has an array of allocs

* nomad/state/state_store: implement CSIPluginDenormalize

* nomad/state/state_store: CSIPluginDenormalize npe on missing alloc

* nomad/csi_endpoint_test: defer deleteNodes for clarity

* api/csi_test: disable this test awaiting mocks:
https://github.com/hashicorp/nomad/issues/7123
2020-03-23 13:58:30 -04:00
Danielle Lancashire 5b05baf9f6 csi: Add /dev mounts to CSI Plugins
CSI Plugins that manage devices need not just access to the CSI
directory, but also to manage devices inside `/dev`.

This commit introduces a `/dev:/dev` mount to the container so that they
may do so.
2020-03-23 13:58:30 -04:00
Danielle Lancashire 6665bdec2e taskrunner/volume_hook: Cleanup arg order of prepareHostVolumes 2020-03-23 13:58:30 -04:00
Danielle Lancashire 8692ca86bb taskrunner/volume_hook: Mounts for CSI Volumes
This commit implements support for creating driver mounts for CSI
Volumes.

It works by fetching the created mounts from the allocation resources
and then iterates through the volume requests, creating driver mount
configs as required.

It's a little bit messy primarily because there's _so_ much terminology
overlap and it's a bit difficult to follow.
2020-03-23 13:58:30 -04:00
Danielle Lancashire 7a33864edf volume_hook: Loosen validation in host volume prep 2020-03-23 13:58:30 -04:00
Danielle Lancashire d8334cf884 allocrunner: Push state from hooks to taskrunners
This commit is an initial (read: janky) approach to forwarding state
from an allocrunner hook to a taskrunner using a similar `hookResources`
approach that tr's use internally.

It should eventually probably be replaced with something a little bit
more message based, but for things that only come from pre-run hooks,
and don't change, it's probably fine for now.
2020-03-23 13:58:30 -04:00
Danielle Lancashire 3bff9fefae csi: Provide plugin-scoped paths during RPCs
When providing paths to plugins, the path needs to be in the scope of
the plugins container, rather than that of the host.

Here we enable that by providing the mount point through the plugin
registration and then use it when constructing request target paths.
2020-03-23 13:58:29 -04:00
Danielle Lancashire 1a10433b97 csi: Add VolumeManager (#6920)
This changeset is some pre-requisite boilerplate that is required for
introducing CSI volume management for client nodes.

It extracts out fingerprinting logic from the csi instance manager.
This change is to facilitate reusing the csimanager to also manage the
node-local CSI functionality, as it is the easiest place for us to
guaruntee health checking and to provide additional visibility into the
running operations through the fingerprinter mechanism and goroutine.

It also introduces the VolumeMounter interface that will be used to
manage staging/publishing unstaging/unpublishing of volumes on the host.
2020-03-23 13:58:29 -04:00
Danielle Lancashire de5d373001 csi: Setup gRPC Clients with a logger 2020-03-23 13:58:29 -04:00
Danielle Lancashire 426c26d7c0 CSI Plugin Registration (#6555)
This changeset implements the initial registration and fingerprinting
of CSI Plugins as part of #5378. At a high level, it introduces the
following:

* A `csi_plugin` stanza as part of a Nomad task configuration, to
  allow a task to expose that it is a plugin.

* A new task runner hook: `csi_plugin_supervisor`. This hook does two
  things. When the `csi_plugin` stanza is detected, it will
  automatically configure the plugin task to receive bidirectional
  mounts to the CSI intermediary directory. At runtime, it will then
  perform an initial heartbeat of the plugin and handle submitting it to
  the new `dynamicplugins.Registry` for further use by the client, and
  then run a lightweight heartbeat loop that will emit task events
  when health changes.

* The `dynamicplugins.Registry` for handling plugins that run
  as Nomad tasks, in contrast to the existing catalog that requires
  `go-plugin` type plugins and to know the plugin configuration in
  advance.

* The `csimanager` which fingerprints CSI plugins, in a similar way to
  `drivermanager` and `devicemanager`. It currently only fingerprints
  the NodeID from the plugin, and assumes that all plugins are
  monolithic.

Missing features

* We do not use the live updates of the `dynamicplugin` registry in
  the `csimanager` yet.

* We do not deregister the plugins from the client when they shutdown
  yet, they just become indefinitely marked as unhealthy. This is
  deliberate until we figure out how we should manage deploying new
  versions of plugins/transitioning them.
2020-03-23 13:58:28 -04:00
Mahmood Ali e1f53347e9 tr: proceed to mark other tasks as dead if alloc fails 2020-03-21 17:52:58 -04:00
Mahmood Ali e30d26b404 fix test 2020-03-21 17:52:57 -04:00
Jasmine Dahilig 73a64e4397 change jobspec lifecycle stanza to use sidecar attribute instead of
block_until status
2020-03-21 17:52:57 -04:00
Jasmine Dahilig 89778bc88d fix restart policy for system jobs with no lifecycle 2020-03-21 17:52:56 -04:00
Jasmine Dahilig 2e93d7a875 fix failing ci test: TestTaskRunner_UnregisterConsul_Retries 2020-03-21 17:52:54 -04:00
Jasmine Dahilig 4ab39318cc clean up restart conditions and restart tests for task lifecycle 2020-03-21 17:52:50 -04:00
Jasmine Dahilig b9a258ed7b incorporate lifecycle into restart tracker 2020-03-21 17:52:40 -04:00
Mahmood Ali d7354b8920 Add a coordinator for alloc runners 2020-03-21 17:52:38 -04:00
Fredrik Hoem Grelland edb3bd0f3f Update consul-template to v0.24.1 and remove deprecated vault_grace (#7170) 2020-02-23 16:24:53 +01:00
Mahmood Ali 98ad59b1de update rest of consul packages 2020-02-16 16:25:04 -06:00
Seth Hoenig db7bcba027 tests: set consul token for nomad client for testing SIDS TR hook 2020-01-31 19:06:15 -06:00
Seth Hoenig 9b20ca5b25 e2e: setup consul ACLs a little more correctly 2020-01-31 19:06:11 -06:00
Seth Hoenig 4152254c3a tests: skip some SIDS hook tests if running tests as root 2020-01-31 19:05:32 -06:00
Seth Hoenig 441e8c7db7 client: additional test cases around failures in SIDS hook 2020-01-31 19:05:27 -06:00
Seth Hoenig c281b05fc0 client: PR cleanup - improved logging around kill task in SIDS hook 2020-01-31 19:05:23 -06:00
Seth Hoenig 03a4af9563 client: PR cleanup - shadow context variable 2020-01-31 19:05:19 -06:00
Seth Hoenig 587a5d4a8d nomad: make TaskGroup.UsesConnect helper a public helper 2020-01-31 19:05:11 -06:00
Seth Hoenig 057f117592 client: manage TR kill from parent on SI token derivation failure
Re-orient the management of the tr.kill to happen in the parent of
the spawned goroutine that is doing the actual token derivation. This
makes the code a little more straightforward, making it easier to
reason about not leaking the worker goroutine.
2020-01-31 19:05:02 -06:00
Seth Hoenig c8761a3f11 client: set context timeout around SI token derivation
The derivation of an SI token needs to be safegaurded by a context
timeout, otherwise an unresponsive Consul could cause the siHook
to block forever on Prestart.
2020-01-31 19:04:56 -06:00
Seth Hoenig 4ee55fcd6c nomad,client: apply more comment/style PR tweaks 2020-01-31 19:04:52 -06:00
Seth Hoenig be7c671919 nomad,client: apply smaller PR suggestions
Apply smaller suggestions like doc strings, variable names, etc.

Co-Authored-By: Nick Ethier <nethier@hashicorp.com>
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2020-01-31 19:04:40 -06:00
Seth Hoenig 78a7d1e426 comments: cleanup some leftover debug comments and such 2020-01-31 19:04:35 -06:00
Seth Hoenig 5c5da95f34 client: skip task SI token file load failure if testing as root
The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when
running as a non-priveleged user, since it deliberately tries to read
an un-readable file to simulate a failure loading the SI token file.
2020-01-31 19:04:30 -06:00
Seth Hoenig ab7ae8bbb4 client: remove unused indirection for referencing consul executable
Was thinking about using the testing pattern where you create executable
shell scripts as test resources which "mock" the process a bit of code
is meant to fork+exec. Turns out that wasn't really necessary in this case.
2020-01-31 19:04:25 -06:00
Seth Hoenig 2c7ac9a80d nomad: fixup token policy validation 2020-01-31 19:04:08 -06:00
Seth Hoenig d204f2f4f0 client: enable envoy bootstrap hook to set SI token
When creating the envoy bootstrap configuration, we should append
the "-token=<token>" argument in the case where the sidsHook placed
the token in the secrets directory.
2020-01-31 19:04:01 -06:00
Seth Hoenig 9df33f622f nomad: proxy requests for Service Identity tokens between Clients and Consul
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
2020-01-31 19:03:53 -06:00
Seth Hoenig 93cf770edb client: enable nomad client to request and set SI tokens for tasks
When a job is configured with Consul Connect aware tasks (i.e. sidecar),
the Nomad Client should be able to request from Consul (through Nomad Server)
Service Identity tokens specific to those tasks.
2020-01-31 19:03:38 -06:00
Mahmood Ali 9611324654
Merge pull request #6922 from hashicorp/b-alloc-canoncalize
Handle Upgrades and Alloc.TaskResources modification
2020-01-28 15:12:41 -05:00
Mahmood Ali b1b714691c address review comments 2020-01-15 08:57:05 -05:00
Nick Ethier 1f28633954
Merge pull request #6816 from hashicorp/b-multiple-envoy
connect: configure envoy to support multiple sidecars in the same alloc
2020-01-09 23:25:39 -05:00
Mahmood Ali 4e5d867644 client: stop using alloc.TaskResources
Now that alloc.Canonicalize() is called in all alloc sources in the
client (i.e. on state restore and RPC fetching), we no longer need to
check alloc.TaskResources.

alloc.AllocatedResources is always non-nil through alloc runner.
Though, early on, we check for alloc validity, so NewTaskRunner and
TaskEnv must still check.  `TestClient_AddAllocError` test validates
that behavior.
2020-01-09 09:25:07 -05:00
Tim Gross fa4da93578
interpolate environment for services in script checks (#6916)
In 0.10.2 (specifically 387b016) we added interpolation to group
service blocks and centralized the logic for task environment
interpolation. This wasn't also added to script checks, which caused a
regression where the IDs for script checks for services w/
interpolated fields (ex. the service name) didn't match the service ID
that was registered with Consul.

This changeset calls the same taskenv interpolation logic during
`script_check` configuration, and adds tests to reduce the risk of
future regressions by comparing the IDs of service hook and the check hook.
2020-01-09 08:12:54 -05:00
Nick Ethier 9c3cc63cd1 tr: initialize envoybootstrap prestart hook response.Env field 2020-01-08 13:41:38 -05:00
Nick Ethier 105cbf6df9 tr: expose envoy sidecar admin port as environment variable 2020-01-06 21:53:45 -05:00
Nick Ethier 677e9cdc16 connect: configure envoy such that multiple sidecars can run in the same alloc 2020-01-06 11:26:27 -05:00
Mahmood Ali 4a1cc67f58
Merge pull request #6820 from hashicorp/f-skip-docker-logging-knob
driver: allow disabling log collection
2019-12-13 11:41:20 -05:00
Mahmood Ali 46bc3b57e6 address review comments 2019-12-13 11:21:00 -05:00
Chris Dickson 4d8ba272d1 client: expose allocated CPU per task (#6784) 2019-12-09 15:40:22 -05:00
Mahmood Ali 0b7085ba3a driver: allow disabling log collection
Operators commonly have docker logs aggregated using various tools and
don't need nomad to manage their docker logs.  Worse, Nomad uses a
somewhat heavy docker api call to collect them and it seems to cause
problems when a client runs hundreds of log collections.

Here we add a knob to disable log aggregation completely for nomad.
When log collection is disabled, we avoid running logmon and
docker_logger for the docker tasks in this implementation.

The downside here is once disabled, `nomad logs ...` commands and API
no longer return logs and operators must corrolate alloc-ids with their
aggregated log info.

This is meant as a stop gap measure.  Ideally, we'd follow up with at
least two changes:

First, we should optimize behavior when we can such that operators don't
need to disable docker log collection.  Potentially by reverting to
using pre-0.9 syslog aggregation in linux environments, though with
different trade-offs.

Second, when/if logs are disabled, nomad logs endpoints should lookup
docker logs api on demand.  This ensures that the cost of log collection
is paid sparingly.
2019-12-08 14:15:03 -05:00
Preetha be4a51d5b8
Merge pull request #6349 from hashicorp/b-host-stats
client: Return empty values when host stats fail
2019-11-20 10:13:02 -06:00
Lang Martin aa985ebe21
getter: allow the gcs download scheme (#6692) 2019-11-19 09:10:56 -05:00
Nick Ethier bd454a4c6f
client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00
Michael Schurter 9fed8d1bed client: fix panic from 0.8 -> 0.10 upgrade
makeAllocTaskServices did not do a nil check on AllocatedResources
which causes a panic when upgrading directly from 0.8 to 0.10. While
skipping 0.9 is not supported we intend to fix serious crashers caused
by such upgrades to prevent cluster outages.

I did a quick audit of the client package and everywhere else that
accesses AllocatedResources appears to be properly guarded by a nil
check.
2019-11-01 07:47:03 -07:00
Michael Schurter f54f1cb321
Revert "Revert "Use joint context to cancel prestart hooks"" 2019-10-08 11:34:09 -07:00
Michael Schurter 81a30ae106
Revert "Use joint context to cancel prestart hooks" 2019-10-08 11:27:08 -07:00
Drew Bailey 69eebcd241
simplify logic to check for vault read event
defer shutdown to cleanup after failed run

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

update comment to include ctx note for shutdown
2019-09-30 11:02:14 -07:00
Drew Bailey 7565b8a8d9
Use joint context to cancel prestart hooks
fixes https://github.com/hashicorp/nomad/issues/6382

The prestart hook for templates blocks while it resolves vault secrets.
If the secret is not found it continues to retry. If a task is shutdown
during this time, the prestart hook currently does not receive
shutdownCtxCancel, causing it to hang.

This PR joins the two contexts so either killCtx or shutdownCtx cancel
and stop the task.
2019-09-30 10:48:01 -07:00
Danielle Lancashire 4f2343e1c0
client: Return empty values when host stats fail
Currently, there is an issue when running on Windows whereby under some
circumstances the Windows stats API's will begin to return errors (such
as internal timeouts) when a client is under high load, and potentially
other forms of resource contention / system states (and other unknown
cases).

When an error occurs during this collection, we then short circuit
further metrics emission from the client until the next interval.

This can be problematic if it happens for a sustained number of
intervals, as our metrics aggregator will begin to age out older
metrics, and we will eventually stop emitting various types of metrics
including `nomad.client.unallocated.*` metrics.

However, when metrics collection fails on Linux, gopsutil will in many cases
(e.g cpu.Times) silently return 0 values, rather than an error.

Here, we switch to returning empty metrics in these failures, and
logging the error at the source. This brings the behaviour into line
with Linux/Unix platforms, and although making aggregation a little
sadder on intermittent failures, will result in more desireable overall
behaviour of keeping metrics available for further investigation if
things look unusual.
2019-09-19 01:22:07 +02:00
Danielle Lancashire 78b61de45f
config: Hoist volume.config.source into volume
Currently, using a Volume in a job uses the following configuration:

```
volume "alias-name" {
  type = "volume-type"
  read_only = true

  config {
    source = "host_volume_name"
  }
}
```

This commit migrates to the following:

```
volume "alias-name" {
  type = "volume-type"
  source = "host_volume_name"
  read_only = true
}
```

The original design was based due to being uncertain about the future of storage
plugins, and to allow maxium flexibility.

However, this causes a few issues, namely:
- We frequently need to parse this configuration during submission,
scheduling, and mounting
- It complicates the configuration from and end users perspective
- It complicates the ability to do validation

As we understand the problem space of CSI a little more, it has become
clear that we won't need the `source` to be in config, as it will be
used in the majority of cases:

- Host Volumes: Always need a source
- Preallocated CSI Volumes: Always needs a source from a volume or claim name
- Dynamic Persistent CSI Volumes*: Always needs a source to attach the volumes
                                   to for managing upgrades and to avoid dangling.
- Dynamic Ephemeral CSI Volumes*: Less thought out, but `source` will probably point
                                  to the plugin name, and a `config` block will
                                  allow you to pass meta to the plugin. Or will
                                  point to a pre-configured ephemeral config.
*If implemented

The new design simplifies this by merging the source into the volume
stanza to solve the above issues with usability, performance, and error
handling.
2019-09-13 04:37:59 +02:00
Tim Gross 3fa4bca4a0
script checks: Update needs to update Alloc as well (#6291) 2019-09-06 11:18:00 -04:00
Tim Gross 8ce201854a
client: recreate script checks on Update (#6265)
Splitting the immutable and mutable components of the scriptCheck led
to a bug where the environment interpolation wasn't being incorporated
into the check's ID, which caused the UpdateTTL to update for a check
ID that Consul didn't have (because our Consul client creates the ID
from the structs.ServiceCheck each time we update).

Task group services don't have access to a task environment at
creation, so their checks get registered before the check can be
interpolated. Use the original check ID so they can be updated.
2019-09-05 11:43:23 -04:00
Tim Gross 0f29dcc935
support script checks for task group services (#6197)
In Nomad prior to Consul Connect, all Consul checks work the same
except for Script checks. Because the Task being checked is running in
its own container namespaces, the check is executed by Nomad in the
Task's context. If the Script check passes, Nomad uses the TTL check
feature of Consul to update the check status. This means in order to
run a Script check, we need to know what Task to execute it in.

To support Consul Connect, we need Group Services, and these need to
be registered in Consul along with their checks. We could push the
Service down into the Task, but this doesn't work if someone wants to
associate a service with a task's ports, but do script checks in
another task in the allocation.

Because Nomad is handling the Script check and not Consul anyways,
this moves the script check handling into the task runner so that the
task runner can own the script check's configuration and
lifecycle. This will allow us to pass the group service check
configuration down into a task without associating the service itself
with the task.

When tasks are checked for script checks, we walk back through their
task group to see if there are script checks associated with the
task. If so, we'll spin off script check tasklets for them. The
group-level service and any restart behaviors it needs are entirely
encapsulated within the group service hook.
2019-09-03 15:09:04 -04:00
Michael Schurter 5957030d18
connect: add unix socket to proxy grpc for envoy (#6232)
* connect: add unix socket to proxy grpc for envoy

Fixes #6124

Implement a L4 proxy from a unix socket inside a network namespace to
Consul's gRPC endpoint on the host. This allows Envoy to connect to
Consul's xDS configuration API.

* connect: pointer receiver on structs with mutexes

* connect: warn on all proxy errors
2019-09-03 08:43:38 -07:00
Evan Ercolano fcf66918d0 Remove unused canary param from MakeTaskServiceID 2019-08-31 16:53:23 -04:00
Lang Martin 4f6493a301 taskrunner getter set Umask for go-getter, setuid test 2019-08-23 15:59:03 -04:00
Jerome Gravel-Niquet cbdc1978bf Consul service meta (#6193)
* adds meta object to service in job spec, sends it to consul

* adds tests for service meta

* fix tests

* adds docs

* better hashing for service meta, use helper for copying meta when registering service

* tried to be DRY, but looks like it would be more work to use the
helper function
2019-08-23 12:49:02 -04:00
Michael Schurter 59e0b67c7f connect: task hook for bootstrapping envoy sidecar
Fixes #6041

Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
2019-08-22 08:15:32 -07:00
Tim Gross 03433f35d4 client/template: configuration for function blacklist and sandboxing
When rendering a task template, the `plugin` function is no longer
permitted by default and will raise an error. An operator can opt-in
to permitting this function with the new `template.function_blacklist`
field in the client configuration.

When rendering a task template, path parameters for the `file`
function will be treated as relative to the task directory by
default. Relative paths or symlinks that point outside the task
directory will raise an error. An operator can opt-out of this
protection with the new `template.disable_file_sandbox` field in the
client configuration.
2019-08-12 16:34:48 -04:00
Danielle Lancashire 861caa9564
HostVolumeConfig: Source -> Path 2019-08-12 15:39:08 +02:00
Danielle Lancashire e132a30899
structs: Unify Volume and VolumeRequest 2019-08-12 15:39:08 +02:00
Danielle Lancashire 6ef8d5233e
client: Add volume_hook for mounting volumes 2019-08-12 15:39:08 +02:00
Mahmood Ali b17bac5101 Render consul templates using task env only (#6055)
When rendering a task consul template, ensure that only task environment
variables are used.

Currently, `consul-template` always falls back to host process
environment variables when key isn't a task env var[1].  Thus, we add
an empty entry for each host process env-var not found in task env-vars.

[1] bfa5d0e133/template/funcs.go (L61-L75)
2019-08-05 16:30:47 -04:00
Mahmood Ali f66169cd6a
Merge pull request #6065 from hashicorp/b-nil-driver-exec
Check if driver handle is nil before execing
2019-08-02 09:48:28 -05:00
Mahmood Ali a4670db9b7 Check if driver handle is nil before execing
Defend against tr.getDriverHandle being nil.  Exec handler checks if
task is running, but it may be stopped between check and driver handler
fetching.
2019-08-02 10:07:41 +08:00
Nick Ethier 66c514a388
Add network lifecycle management
Adds a new Prerun and Postrun hooks to manage set up of network namespaces
on linux. Work still needs to be done to make the code platform agnostic and
support Docker style network initalization.
2019-07-31 01:03:17 -04:00
Jasmine Dahilig 2157f6ddf1
add formatting for hcl parsing error messages (#5972) 2019-07-19 10:04:39 -07:00
Mahmood Ali cd6f1d3102 Update consul-template dependency to latest
To pick up the fix in
https://github.com/hashicorp/consul-template/pull/1231 .
2019-07-18 07:32:03 +07:00
Mahmood Ali 8a82260319 log unrecoverable errors 2019-07-17 11:01:59 +07:00
Mahmood Ali 1a299c7b28 client/taskrunner: fix stats stats retry logic
Previously, if a channel is closed, we retry the Stats call.  But, if that call
fails, we go in a backoff loop without calling Stats ever again.

Here, we use a utility function for calling driverHandle.Stats call that retries
as one expects.

I aimed to preserve the logging formats but made small improvements as I saw fit.
2019-07-11 13:58:07 +08:00
Mahmood Ali f10201c102 run post-run/post-stop task runner hooks
Handle when prestart failed while restoring a task, to prevent
accidentally leaking consul/logmon processes.
2019-07-02 18:38:32 +08:00
Mahmood Ali 4afd7835e3 Fail alloc if alloc runner prestart hooks fail
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.

This leads to terrible results, some of which are:
* Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in
  pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after
  alloc expected start time!

Here, we treat all tasks to have failed if alloc runner prestart hook fails.
This fixes the lockups, and permits the alloc to be rescheduled on another node.

While it's desirable to retry alloc runner in such failures, I opted to treat it
out of scope.  I'm afraid of some subtles about alloc and task runners and their
idempotency that's better handled in a follow up PR.

This might be one of the root causes for
https://github.com/hashicorp/nomad/issues/5840 .
2019-07-02 18:35:47 +08:00
Mahmood Ali 7614b8f09e
Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2
task runner to avoid running task if terminal
2019-07-02 15:31:17 +08:00
Mahmood Ali 7bfad051b9 address review comments 2019-07-02 14:53:50 +08:00
Mahmood Ali 3d89ae0f1e task runner to avoid running task if terminal
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having task runner uses the
allocRunner.shouldRun() instead of only checking the server updated
alloc.

Here, we preserve much of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes https://github.com/hashicorp/nomad/issues/5883
2019-06-27 11:27:34 +08:00
Danielle Lancashire b9ac184e1f
tr: Fetch Wait channel before killTask in restart
Currently, if killTask results in the termination of a process before
calling WaitTask, Restart() will incorrectly return a TaskNotFound
error when using the raw_exec driver on Windows.
2019-06-26 15:20:57 +02:00
Chris Baker f71114f5b8 cleanup test 2019-06-18 14:15:25 +00:00
Chris Baker a2dc351fd0 formatting and clarity 2019-06-18 14:00:57 +00:00
Chris Baker e0170e1c67 metrics: add namespace label to allocation metrics 2019-06-17 20:50:26 +00:00
Danielle f923b568e0
Merge pull request #5821 from hashicorp/dani/b-5770
trhooks: Add TaskStopHook interface to services
2019-06-12 17:30:49 +02:00
Danielle Lancashire c326344b57
trt: Fix test 2019-06-12 17:06:11 +02:00
Danielle Lancashire 13d76e35fd
trhooks: Add TaskStopHook interface to services
We currently only run cleanup Service Hooks when a task is either
Killed, or Exited. However, due to the implementation of a task runner,
tasks are only Exited if they every correctly started running, which is
not true when you recieve an error early in the task start flow, such as
not being able to pull secrets from Vault.

This updates the service hook to also call consul deregistration
routines during a task Stop lifecycle event, to ensure that any
registered checks and services are cleared in such cases.

fixes #5770
2019-06-12 16:00:21 +02:00
Mahmood Ali 2acf30fdd3 Fallback to alloc.TaskResources for old allocs
When a client is running against an old server (e.g. running 0.8),
`alloc.AllocatedResources` may be nil, and we need to check the
deprecated `alloc.TaskResources` instead.

Fixes https://github.com/hashicorp/nomad/issues/5810
2019-06-11 10:32:53 -04:00
Mahmood Ali d30c3d10b0
Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1
More test fixes
2019-06-05 19:09:18 -04:00
Mahmood Ali 935ee86e92
Merge pull request #5737 from fwkz/fix-restart-attempts
Fix restart attempts of `restart` stanza in `delay` mode.
2019-06-05 19:05:07 -04:00
Danielle Lancashire 27583ed8c1 client: Pass servers contacted ch to allocrunner
This fixes an issue where batch and service workloads would never be
restarted due to indefinitely blocking on a nil channel.

It also raises the restoration logging message to `Info` to simplify log
analysis.
2019-05-22 13:47:35 +02:00
Mahmood Ali 9df1e00f35 tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal
Given that Signal may be called multiple times, blocking for `SignalCh`
isn't sufficient to synchornizing access to Signals field.
2019-05-21 13:56:58 -04:00
fwkz 8b84bec95a Fix restart attempts of restart stanza.
Number of restarts during 2nd interval is off by one.
2019-05-21 13:27:19 +02:00
Michael Schurter 2fe0768f3b docs: changelog entry for #5669 and fix comment 2019-05-14 10:54:00 -07:00
Michael Schurter af9096c8ba client: register before restoring
Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).

Since restoring is synchronous, start the registration goroutine first.

For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
2019-05-14 10:53:27 -07:00
Michael Schurter e07f73bfe0 client: do not restart dead tasks until server is contacted (try 2)
Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162

Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!

The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.
2019-05-14 10:53:27 -07:00
Michael Schurter d7e5ace1ed client: do not restart dead tasks until server is contacted
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
2019-05-14 10:53:27 -07:00
Michael Schurter 1c4e585fa7 client: expose allocated memory per task
Related to #4280

This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.

```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
2019-05-10 11:12:12 -07:00
Mahmood Ali ab2cae0625 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Mahmood Ali bf0a09e270 retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
Mahmood Ali 81841e8528 try checking process status 2019-04-25 18:16:13 -04:00
Mahmood Ali fc78521f29 add logging about attempts 2019-04-25 18:09:36 -04:00
Mahmood Ali e6ca8641a8 try sleeping for stop signal to take effect 2019-04-25 17:16:29 -04:00
Mahmood Ali ff3a095015 add a test that simulates logmon dying during Start() call 2019-04-25 16:41:17 -04:00
Mahmood Ali bbac73883c logmon: retry starting logmon if it exits
Retry if we detect shutting down during Start() api call is started,
locally.
2019-04-25 15:10:16 -04:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Danielle Lancashire c31966fc71 loggging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment seperately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Michael Schurter f7a7acc345
Merge pull request #5518 from hashicorp/f-simplify-kill
client: simplify kill logic
2019-04-15 14:11:58 -07:00
Chris Baker 6848591914 vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation 2019-04-12 15:06:34 +00:00
Chris Baker c0a7aee610
vault e2e: pass vault version into setup instead of having to infer it from test name 2019-04-10 10:34:10 -05:00
Chris Baker f0c184fc29
taskrunner: removed some unecessary config from a test 2019-04-10 10:34:10 -05:00
Chris Baker 170f5239c8
client: gofmt 2019-04-10 10:34:10 -05:00
Chris Baker a1d7971b2e
taskrunner: pass configured Vault namespace into TaskTemplateConfig 2019-04-10 10:34:10 -05:00
Michael Schurter f7d4428855 client: simplify kill logic
Remove runLaunched tracking as Run is *always* called for killable
TaskRunners. TaskRunners which fail before Run can be called (during
NewTaskRunner or Restore) are not killable as they're never added to the
client's alloc map.
2019-04-04 15:18:33 -07:00
Michael Schurter 1d569a27dc Revert "executor/linux: add defensive checks to binary path"
This reverts commit cb36f4537e63d53b198c2a87d1e03880895631bd.
2019-04-02 11:17:12 -07:00
Michael Schurter fc5487dbbc executor/linux: add defensive checks to binary path 2019-04-02 09:40:53 -07:00
Michael Schurter 7d49bc4c71 executor/linux: make chroot binary paths absolute
Avoid libcontainer.Process trying to lookup the binary via $PATH as the
executor has already found where the binary is located.
2019-04-01 15:45:31 -07:00
Michael Schurter a4572919cd
Merge pull request #5456 from hashicorp/test-taskenv
tests: port pre-0.9 task env tests
2019-03-25 10:41:38 -07:00
Michael Schurter 8efad12538 tests: port pre-0.9 task env tests
I chose to make them more of integration tests since there's a lot more
plumbing involved. The internal implementation details of how we craft
task envs can now change and these tests will still properly assert the
task runtime environment is setup properly.
2019-03-25 09:46:53 -07:00
Nick Ethier dc18b8928a
logmon: make Start rpc idempotent and simplify hook 2019-03-19 14:02:36 -04:00
Nick Ethier ac7fbee1b8
logmon:add static check for logmon exited hook 2019-03-18 15:59:43 -04:00
Nick Ethier 7dc3d83634
client/logmon: restart log collection correctly when a task is restarted 2019-03-15 23:59:18 -04:00
Michael Schurter 32d31575cc client: emit event and call exited hooks during cleanup
Builds upon earlier commit that cleans up restored handles of terminal
allocs by also emitting terminated events and calling exited hooks when
appropriate.
2019-03-05 15:12:02 -08:00
Michael Schurter 64e145ebdb logmon: drop reattach log level as its expected
Logged once per terminal task on agent restart.
2019-03-04 13:26:01 -08:00
Michael Schurter c5271d3fa5 client: test logmon cleanup
The test is sadly quite complicated and peeks into things (logmon's
reattach config) AR doesn't normally have access to.

However, I couldn't find another way of asserting logmon got cleaned up
without resorting to smaller unit tests. Smaller unit tests risk
re-implementing dependencies in an unrealistic way, so I opted for an
ugly integration test.
2019-03-04 13:15:15 -08:00
Michael Schurter ef8d284352 client: ensure task is cleaned up when terminal
This commit is a significant change. TR.Run is now always executed, even
for terminal allocations. This was changed to allow TR.Run to cleanup
(run stop hooks) if a handle was recovered.

This is intended to handle the case of Nomad receiving a
DesiredStatus=Stop allocation update, persisting it, but crashing before
stopping AR/TR.

The commit also renames task runner hook data as it was very easy to
accidently set state on Requests instead of Responses using the old
field names.
2019-03-01 14:00:23 -08:00
Michael Schurter 812f1679e2
Merge pull request #5352 from hashicorp/b-leaked-logmon
logmon fixes
2019-02-26 08:35:46 -08:00
Michael Schurter e39a10a1f4 tests: move unix-specific test to its own file
Other logmon tests should be portable.
2019-02-26 07:56:44 -08:00
Michael Schurter 3b2a592e93 client: restart task on logmon failures
This code chooses to be conservative as opposed to optimal: when failing
to reattach to logmon simply return a recoverable error instead of
immediately trying to restart logmon.

The recoverable error will cause the task's restart policy to be
applied and a new logmon will be launched upon restart.

Trying to do the optimal approach of simply starting a new logmon
requires error string comparison and should be tested against a task
actively logging to assert the behavior (are writes blocked? dropped?).
2019-02-25 15:42:45 -08:00
Michael Schurter 8830b00866 client: test logmon_hook 2019-02-23 15:36:48 -08:00
Mahmood Ali 32551fb0e5 emit TaskRestartSignal event on vault restart
When Vault token expires and task is restarted, emit `TaskRestartSignal`
similar to v0.8.7
2019-02-22 15:56:14 -05:00
Mahmood Ali 8cb4bbcc08 address review comments 2019-02-22 15:56:14 -05:00
Mahmood Ali 216eaa4843 tests: port TestTaskRunner_VaultManager_Signal
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1427
2019-02-22 15:53:04 -05:00
Mahmood Ali 8e9e732319 tests: port TestTaskRunner_VaultManager_Restart
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1352
2019-02-22 15:53:04 -05:00
Mahmood Ali 33122ca7c0 tests: port TestTaskRunner_UnregisterConsul_Retries
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L620
2019-02-22 15:53:04 -05:00
Mahmood Ali 0128b0ce7a tests: port TestTaskRunner_Template_NewVaultToken
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1275
2019-02-22 15:53:04 -05:00
Mahmood Ali cfb80583af tests: port TestTaskRunner_Template_Artifact
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1195
2019-02-22 15:52:59 -05:00
Michael Schurter a2e3ea6dc9 logmon: fix reattach configuration
There were multiple bugs here:

1. Reattach unmarshalling always returned an error because you can't
   unmarshal into a nil pointer.
2. The hook data wasn't being saved because it was put on the request
   struct, not the response struct.
3. The plugin configuration should only have reattach *or* a command
   set. Not both.
4. Setting Done=true meant the hook was never re-run on agent restart so
   reattaching was never attempted.
2019-02-21 15:32:18 -08:00
Michael Schurter 01cabdff88 client: restart on recoverable StartTask errors
Fixes restarting on recoverable errors from StartTask.

Ports TestTaskRunner_Run_RecoverableStartError from 0.8 which discovered
the bug.
2019-02-21 15:30:49 -08:00
Michael Schurter e3f321cd27 test: port TestTaskRunner_RestartSignalTask_NotRunning from 0.8 2019-02-21 15:30:49 -08:00
Michael Schurter f3aa945a00 test: port TestTaskRunner_DriverNetwork from 0.8 2019-02-21 15:30:49 -08:00
Michael Schurter 6580ed668e client: don't redownload completed artifacts on retries
Track the download status of each artifact independently so that if only
one of many artifacts fails to download, completed artifacts aren't
downloaded again.
2019-02-20 08:45:12 -08:00
Michael Schurter 908bfab4c2 client: artifact errors are retry-able
0.9.0beta2 contains a regression where artifact download errors would
not cause a task restart and instead immediately fail the task.

This restores the pre-0.9 behavior of retrying all artifact errors and
adds missing tests.
2019-02-20 07:21:27 -08:00
Michael Schurter 79ccf00b72 tests: add new task runner test helper
Adds a new helper and removes a duplicated test.
2019-02-20 07:21:27 -08:00
Mahmood Ali 87be233aca
test: improve readability of duration
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2019-02-14 08:12:06 -08:00
Mahmood Ali 16d3414842
test: improve failure message
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2019-02-14 08:11:37 -08:00
Michael Schurter 4814f0fb0b tests: port TestTaskRunner_Download_List from 0.8 2019-02-12 15:48:04 -08:00
Michael Schurter a152e3ef17 consul: fix task deregistration hook
Broke ShutdownDelay but the test was timing dependent so it just
appeared flaky. Made the test slower so that it should never incorrectly
pass.
2019-02-12 15:36:02 -08:00
Michael Schurter 4ad879e75e tests: port TaskRunner_DeriveToken tests from 0.8 2019-02-12 15:36:02 -08:00
Michael Schurter 6743ed9fdc tests: port TestTaskRunner_BlockForVault from 0.8
Also fix race conditions in the mock vault client.
2019-02-12 13:46:09 -08:00
Michael Schurter 6c0cc65b2e simplify hcl2 parsing helper
No need to pass in the entire eval context
2019-02-04 11:07:57 -08:00
Alex Dadgar 5062c54874 Fix usage of fsi variable 2019-01-29 14:07:55 -08:00
Alex Dadgar 6f418ebaf0 Always populate task dir environment variables
Fixes an issue where if a task was restarted after restating the client,
the task dir environment variables would not be populated. This PR fixes
this for both upgrades from 0.8.X and for normal 0.9 restarts.
2019-01-29 13:17:10 -08:00
Alex Dadgar 5da21635fb Fix env templates having interpolated destinations
Fixes an issue where env templates that had interpolated destinations
would not work.

Fixes https://github.com/hashicorp/nomad/issues/5250
2019-01-28 10:28:53 -08:00
Alex Dadgar d6412fd8e7 Fix double restart counting for templates
This PR fixes an issue where template restarts would count twice since
it was emitting a restarting event.
2019-01-25 15:38:13 -08:00
Nick Ethier a36c4320ff
Merge pull request #5227 from hashicorp/b-client-highcpu-usage
Fix bug related to high cpu usage
2019-01-23 14:27:51 -05:00
Michael Schurter 32daa7b47b goimports until make check is happy 2019-01-23 06:27:14 -08:00
Nick Ethier bcc3935228
tr: use context in as select statement 2019-01-22 20:11:39 -05:00
Michael Schurter be0bab7c3f move pluginutils -> helper/pluginutils
I wanted a different color bikeshed, so I get to paint it
2019-01-22 15:50:08 -08:00
Alex Dadgar 2ca0e97361 Split hclspec 2019-01-22 15:43:34 -08:00
Alex Dadgar 5ca6dd7988 move hclutils 2019-01-22 15:43:34 -08:00
Alex Dadgar 72a5691897 Driver tests do not use hcl2/hcl, hclspec, or hclutils 2019-01-22 15:43:34 -08:00
Michael Schurter 1fa376cac6
Merge pull request #5211 from hashicorp/test-porting-08
Port some 0.8 TaskRunner tests
2019-01-22 14:05:53 -08:00
Michael Schurter 8ced0adb67 test: port TestTaskRunner_CheckWatcher_Restart
Added ability to adjust the number of events the TaskRunner keeps as
there's no way to observe all events otherwise.

Task events differ slightly from 0.8 because 0.9 emits Terminated every
time a task exits instead of only when it exits on its own (not due to
restart or kill).

0.9 does not emit Killing/Killed for restarts like 0.8 which seems fine
as `Restart Signaled/Terminated/Restarting` is more descriptive.

Original v0.8 events emitted:
```
	expected := []string{
		"Received",
		"Task Setup",
		"Started",
		"Restart Signaled",
		"Killing",
		"Killed",
		"Restarting",
		"Started",
		"Restart Signaled",
		"Killing",
		"Killed",
		"Restarting",
		"Started",
		"Restart Signaled",
		"Killing",
		"Killed",
		"Not Restarting",
	}
```
2019-01-22 09:46:46 -08:00
Michael Schurter 1719752a9d test: port RestartTask from 0.8 2019-01-22 08:08:08 -08:00
Michael Schurter 9edff19625 test: port SignalFailure test from 0.8
Also fix signal error handling in mock_driver.
2019-01-22 08:08:08 -08:00
Preetha Appan 299a5fc821
Rename TaskKillRequest/Response to TaskPreKillRequest/Response 2019-01-22 09:54:02 -06:00
Preetha Appan 5a5b9c5666
Fix log comments 2019-01-22 09:45:58 -06:00
Preetha Appan 06e15f8381
Rename TaskKillHook to TaskPreKillHook to more closely match usage
Also added/fixed comments
2019-01-22 09:41:56 -06:00
Michael Schurter 3b02af9386
Fix comment
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-01-22 09:41:21 -06:00
Preetha Appan 09291c689b
Rename TaskKillHook to TaskPreKillHook to more closely match usage
Also added/fixed comments
2019-01-22 09:41:21 -06:00
Mahmood Ali 5df63fda7c
Merge pull request #5190 from hashicorp/f-memory-usage
Track Basic Memory Usage as reported by cgroups
2019-01-18 16:46:02 -05:00
Chris Baker 290c3f36ad set TaskGroupName in task_runner 2019-01-18 20:25:11 +00:00
Chris Baker 8917961caa documenting test for task runner failure to set TaskGroupName 2019-01-18 20:00:49 +00:00
Michael Schurter cfadacfd95
Merge pull request #5203 from hashicorp/b-terminated
client: restore Terminated event on every exit
2019-01-18 08:54:15 -08:00
Preetha Appan e0b68a19c6
Fix one more place that should be using taskResources
taskResources handles new resource fields in a backwards compatible way
2019-01-17 15:52:51 -06:00
Michael Schurter a20ac7c1de client: restore Terminated event on every exit
v0.9.0-dev started emitting a Terminated event every time a task process
exited. While this wasn't true in previous versions, it's a useful task
event because it's the only place for job operators to view the task's
exit code.

This behavior is asserted in the e2e/taskevents tests.
2019-01-17 10:02:25 -08:00
Danielle Tomlinson a695b3562c
Merge pull request #5193 from hashicorp/dani/logmon-reattach
logmon: Reattach to existing loggers
2019-01-16 17:34:13 +01:00
Danielle Tomlinson 99da4c780d logmon: Reattach to existing loggers
This commit prevents us from creating duplicate logmon hooks when
restoring allocations by persisting the logmon reattach config using
HookData.
2019-01-16 14:56:10 +01:00
Michael Schurter daa7d029a1 test: porting TestTaskRunner_SimpleRun_Dispatch
Porting test from 0.8 to 0.9.
2019-01-15 15:22:13 -08:00
Michael Schurter 48afda786b
Merge pull request #5187 from hashicorp/test-consul
Port a bunch of pre-0.9 Consul tests to 0.9
2019-01-15 07:41:50 -08:00
Mahmood Ali 9909d98bee Track Basic Memory Usage as reported by cgroups
Track current memory usage, `memory.usage_in_bytes`, in addition to
`memory.max_memory_usage_in_bytes` and friends.  This number is closer
what Docker reports.

Related to https://github.com/hashicorp/nomad/issues/5165 .
2019-01-14 18:47:52 -05:00
Michael Schurter ff2f23f5f9 test: assert service interpolation behavior
Ported from pre-0.9 tests.
2019-01-14 09:56:53 -08:00
Michael Schurter e877bb6370 test: assert shutdown delay deregs first
Restore a pre-0.9 test that asserts Consul services are deregistered
before a task's shutdown delay.
2019-01-14 09:56:53 -08:00
Michael Schurter 1ca858fa92
Update client/allocrunner/taskrunner/stats_hook.go
Co-Authored-By: nickethier <ncethier@gmail.com>
2019-01-14 12:31:27 -05:00
Nick Ethier fbd403df96
tr: stop stats collection on Exited hook 2019-01-14 12:30:14 -05:00
Nick Ethier 597b7b751d
tr: add retry /w backoff to stats_hook failure 2019-01-12 12:18:24 -05:00
Nick Ethier 7e306afde3
executor: fix failing stats related test 2019-01-12 12:18:23 -05:00
Nick Ethier 9fea54e0dc
executor: implement streaming stats API
plugins/driver: update driver interface to support streaming stats

client/tr: use streaming stats api

TODO:
 * how to handle errors and closed channel during stats streaming
 * prevent tight loop if Stats(ctx) returns an error

drivers: update drivers TaskStats RPC to handle streaming results

executor: better error handling in stats rpc

docker: better control and error handling of stats rpc

driver: allow stats to return a recoverable error
2019-01-12 12:18:22 -05:00
Alex Dadgar bd12e0b1f7
Merge pull request #5168 from hashicorp/b-kill-race
Improve Kill handling on task runner
2019-01-09 12:05:10 -08:00
Alex Dadgar 069e181e8f add more comments 2019-01-09 12:04:22 -08:00
Michael Schurter e5ddff861c
Spelling fix
Co-Authored-By: dadgar <alex@hashicorp.com>
2019-01-09 11:42:40 -08:00
Mahmood Ali 90f3cea187
Merge pull request #5157 from hashicorp/r-drivers-no-cstructs
drivers: avoid referencing client/structs package
2019-01-09 13:06:46 -05:00
Alex Dadgar 149dec2169 Improve Kill handling on task runner
This PR improves how killing a task is handled. Before the kill function
directly orchestrated the killing and was only valid while the task was
running. The new behavior is to mark the desired state and wait for the
task runner to converge to that state.
2019-01-08 16:42:26 -08:00
Michael Schurter 5925424c7c client: emit Killing/Killed task events
We were just emitting Killed/Terminated events before. In v0.8 we
emitted Killing/Killed, but lacked Terminated when explicitly stopping
a task. This change makes it so Terminated is always included, whether
explicitly stopping a task or it exiting on its own.

New output:

2019-01-04T14:58:51-08:00  Killed            Task successfully killed
2019-01-04T14:58:51-08:00  Terminated        Exit Code: 130, Signal: 2
2019-01-04T14:58:51-08:00  Killing           Sent interrupt
2019-01-04T14:58:51-08:00  Leader Task Dead  Leader Task in Group dead
2019-01-04T14:58:49-08:00  Started           Task started by client
2019-01-04T14:58:49-08:00  Task Setup        Building Task Directory
2019-01-04T14:58:49-08:00  Received          Task received by client

Old (v0.8.6) output:

2019-01-04T22:14:54Z  Killed            Task successfully killed
2019-01-04T22:14:54Z  Killing           Sent interrupt. Waiting 5s before force killing
2019-01-04T22:14:54Z  Leader Task Dead  Leader Task in Group dead
2019-01-04T22:14:53Z  Started           Task started by client
2019-01-04T22:14:53Z  Task Setup        Building Task Directory
2019-01-04T22:14:53Z  Received          Task received by client
2019-01-08 07:20:54 -08:00
Mahmood Ali 916a40bb9e move cstructs.DeviceNetwork to drivers pkg 2019-01-08 09:11:47 -05:00
Mahmood Ali 9369b123de use drivers.FSIsolation 2019-01-08 09:11:47 -05:00
Mahmood Ali f475a56087 remove always false parameter
Simplify allocDir.Build() function to avoid depending on client/structs,
and remove a parameter that's always set to `false`.

The motivation here is to avoid a dependency cycle between
drivers/cstructs and alloc_dir.
2019-01-08 09:11:47 -05:00
Alex Dadgar 0106f23aaa Review comments 2019-01-07 14:50:28 -08:00
Alex Dadgar 79cfe26021 vet 2019-01-07 14:49:41 -08:00
Alex Dadgar 8a35d7b1dd Test recovery 2019-01-07 14:49:41 -08:00
Alex Dadgar f40f8ce02e Mock driver has recovery, stats 2019-01-07 14:49:40 -08:00
Alex Dadgar 3f24e4d6ca comments 2019-01-07 14:49:40 -08:00
Alex Dadgar 44dca19012 Fix hooks 2019-01-07 14:49:40 -08:00
Alex Dadgar c9825a9c36 recover 2019-01-07 14:49:40 -08:00
Mahmood Ali cd3c6cf60b taskrunner: emit TaskReceived event
Preserve pre-0.9, where task runner emits `Received: Task received by
client` event on task runner creation.
2019-01-04 14:32:29 -05:00
Danielle Tomlinson 28aa34ea78 taskrunner: Persist environment from hooks
https://github.com/hashicorp/nomad/pull/5032 introduced a regression
where the origHookState was used in place of the response from the hook.
2019-01-03 13:13:57 +01:00
Alex Dadgar 9d34802f7a Store device envs separately and pass to drivers 2018-12-19 14:23:09 -08:00
Michael Schurter d9ea8252a7 client/state: support upgrading from 0.8->0.9
Also persist and load DeploymentStatus to avoid rechecking health after
client restarts.
2018-12-19 10:39:27 -08:00
Michael Schurter 461599ff20 tr: fix HookState Copy() and Equal() methods
They did not take into account the Env field.
2018-12-19 09:58:06 -08:00
Nick Ethier ce1a5cba0e
drivermanager: use allocID and task name to route task events 2018-12-18 23:01:51 -05:00
Nick Ethier d8a0265e68
client: batch initial fingerprinting in plugin manangers
drivermanager: fix pr comments/feedback
2018-12-18 22:56:19 -05:00
Nick Ethier 7d23cbf448
client/drivermananger: fixup issues from rebase and address PR comments 2018-12-18 22:55:38 -05:00
Nick Ethier 1543335710
tr: deregister task handler on cleanup 2018-12-18 22:55:38 -05:00
Nick Ethier 82175d1328
client/drivermananger: add driver manager
The driver manager is modeled after the device manager and is started by the client.
It's responsible for handling driver lifecycle and reattachment state, as well as
processing the incomming fingerprint and task events from each driver. The mananger
exposes a method for registering event handlers for task events that is used by the
task runner to update the server when a task has been updated with an event.

Since driver fingerprinting has been implemented by the driver manager, it is no
longer needed in the fingerprint mananger and has been removed.
2018-12-18 22:55:18 -05:00
Alex Dadgar 8efac7ec81 Fix unit tests + upgrade pathing resources 2018-12-18 15:50:44 -08:00
Alex Dadgar b8268d9a46 Lint 2018-12-18 15:50:44 -08:00
Alex Dadgar 66cf3156b2 LinuxResources doesn't use task.Resources 2018-12-18 15:50:44 -08:00
Alex Dadgar 327b551b39 Drivers 2018-12-18 15:50:11 -08:00
Alex Dadgar b653ae2af7 utilities 2018-12-18 15:48:52 -08:00
Danielle Tomlinson 95a0c4fb29 taskrunner: Use a random suffix for Task Config
The RestartCount is not really suitable for use as a source of
uniqueness within task invocations as it is not monotonic, and interacts
with the restart stanza in a users config, so conflates restarts due to
task failures, with restarts due to enviromental changes, such as consul
template or vault secrets changing.

Here we instead use a substring from a uuid, which is more random than
we strictly need, but is nicer than rolling our own random string
generator here.
2018-12-19 00:38:54 +01:00
Danielle Tomlinson a50ea29da4 taskrunner: Use hook errors for artifacts 2018-12-17 10:39:38 +01:00
Danielle Tomlinson 3647b701a6 taskrunner: Emit task events when a hook fails 2018-12-13 18:20:18 +01:00
Alex Dadgar b39c21d49c Fix various bugs with task events
Fixes the following:
* Emitting events when the task fails to start
* Don't double emit events on task shutdown (nomad stop)
* Don't emit a OOM kill metric unless actually OOM'd
2018-12-05 14:27:07 -08:00
Danielle Tomlinson 2db5ae38d8 client: Rename drivers/shared/env => client/taskenv 2018-11-30 12:18:39 +01:00
Danielle Tomlinson f3a77b8084 client: Merge driver/shared/structs and client/structs 2018-11-30 10:56:45 +01:00
Danielle Tomlinson d259c36844 driver: Flatten SetEnvvars into taskdirhook 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 04c8851b4c client: Migrate DriverStats optout to drivers/shared/structs 2018-11-30 10:46:13 +01:00
Danielle Tomlinson 1a29811169 drivers: Move client/drivers/env to drivers/shared/env
As part of deprecating legacy drivers, we're moving the env package to a
new drivers/shared tree, as it is used by the modern docker and rkt
driver packages, and is useful for 3rd party plugins.
2018-11-30 10:46:13 +01:00
Michael Schurter 3e56ee005a add nil check around task resources in device hook
Looking at NewTaskRunner I'm unsure whether TaskRunner.TaskResources
(from which req.TaskResources is set) is intended to be nil at times or
if the TODO in NewTaskRunner is intended to ensure it is always non-nil.
2018-11-27 17:25:33 -08:00