Commit graph

491 commits

Author SHA1 Message Date
Drew Bailey 24929776a2
shutdown delay for task groups
copy struct values

ensure groupserviceHook implements RunnerPreKillhook

run deregister first

test that shutdown times are delayed

move magic number into variable
2019-12-16 11:38:16 -05:00
Mahmood Ali 4a1cc67f58
Merge pull request #6820 from hashicorp/f-skip-docker-logging-knob
driver: allow disabling log collection
2019-12-13 11:41:20 -05:00
Mahmood Ali 46bc3b57e6 address review comments 2019-12-13 11:21:00 -05:00
Chris Dickson 4d8ba272d1 client: expose allocated CPU per task (#6784) 2019-12-09 15:40:22 -05:00
Mahmood Ali 0b7085ba3a driver: allow disabling log collection
Operators commonly have docker logs aggregated using various tools and
don't need nomad to manage their docker logs.  Worse, Nomad uses a
somewhat heavy docker api call to collect them and it seems to cause
problems when a client runs hundreds of log collections.

Here we add a knob to disable log aggregation completely for nomad.
When log collection is disabled, we avoid running logmon and
docker_logger for the docker tasks in this implementation.

The downside here is once disabled, `nomad logs ...` commands and API
no longer return logs and operators must corrolate alloc-ids with their
aggregated log info.

This is meant as a stop gap measure.  Ideally, we'd follow up with at
least two changes:

First, we should optimize behavior when we can such that operators don't
need to disable docker log collection.  Potentially by reverting to
using pre-0.9 syslog aggregation in linux environments, though with
different trade-offs.

Second, when/if logs are disabled, nomad logs endpoints should lookup
docker logs api on demand.  This ensures that the cost of log collection
is paid sparingly.
2019-12-08 14:15:03 -05:00
Danielle Lancashire d2075ebae9 spellcheck: Fix spelling of retrieve 2019-12-05 18:59:47 -06:00
Preetha be4a51d5b8
Merge pull request #6349 from hashicorp/b-host-stats
client: Return empty values when host stats fail
2019-11-20 10:13:02 -06:00
Lang Martin aa985ebe21
getter: allow the gcs download scheme (#6692) 2019-11-19 09:10:56 -05:00
Nick Ethier bd454a4c6f
client: improve group service stanza interpolation and check_re… (#6586)
* client: improve group service stanza interpolation and check_restart support

Interpolation can now be done on group service stanzas. Note that some task runtime specific information
that was previously available when the service was registered poststart of a task is no longer available.

The check_restart stanza for checks defined on group services will now properly restart the allocation upon
check failures if configured.
2019-11-18 13:04:01 -05:00
Michael Schurter 9fed8d1bed client: fix panic from 0.8 -> 0.10 upgrade
makeAllocTaskServices did not do a nil check on AllocatedResources
which causes a panic when upgrading directly from 0.8 to 0.10. While
skipping 0.9 is not supported we intend to fix serious crashers caused
by such upgrades to prevent cluster outages.

I did a quick audit of the client package and everywhere else that
accesses AllocatedResources appears to be properly guarded by a nil
check.
2019-11-01 07:47:03 -07:00
Michael Schurter f54f1cb321
Revert "Revert "Use joint context to cancel prestart hooks"" 2019-10-08 11:34:09 -07:00
Michael Schurter 81a30ae106
Revert "Use joint context to cancel prestart hooks" 2019-10-08 11:27:08 -07:00
Drew Bailey 69eebcd241
simplify logic to check for vault read event
defer shutdown to cleanup after failed run

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

update comment to include ctx note for shutdown
2019-09-30 11:02:14 -07:00
Drew Bailey 7565b8a8d9
Use joint context to cancel prestart hooks
fixes https://github.com/hashicorp/nomad/issues/6382

The prestart hook for templates blocks while it resolves vault secrets.
If the secret is not found it continues to retry. If a task is shutdown
during this time, the prestart hook currently does not receive
shutdownCtxCancel, causing it to hang.

This PR joins the two contexts so either killCtx or shutdownCtx cancel
and stop the task.
2019-09-30 10:48:01 -07:00
Tim Gross a6aadb3714 connect: remove proxy socket for restarted client 2019-09-25 14:58:17 -04:00
Tim Gross e43d33aa50 client: don't run alloc postrun during shutdown 2019-09-25 14:58:17 -04:00
Tim Gross d965a15490 driver/networking: don't recreate existing network namespaces 2019-09-25 14:58:17 -04:00
Danielle Lancashire 4f2343e1c0
client: Return empty values when host stats fail
Currently, there is an issue when running on Windows whereby under some
circumstances the Windows stats API's will begin to return errors (such
as internal timeouts) when a client is under high load, and potentially
other forms of resource contention / system states (and other unknown
cases).

When an error occurs during this collection, we then short circuit
further metrics emission from the client until the next interval.

This can be problematic if it happens for a sustained number of
intervals, as our metrics aggregator will begin to age out older
metrics, and we will eventually stop emitting various types of metrics
including `nomad.client.unallocated.*` metrics.

However, when metrics collection fails on Linux, gopsutil will in many cases
(e.g cpu.Times) silently return 0 values, rather than an error.

Here, we switch to returning empty metrics in these failures, and
logging the error at the source. This brings the behaviour into line
with Linux/Unix platforms, and although making aggregation a little
sadder on intermittent failures, will result in more desireable overall
behaviour of keeping metrics available for further investigation if
things look unusual.
2019-09-19 01:22:07 +02:00
Danielle ec3ecdecfc
Merge pull request #6321 from hashicorp/dani/remove-config
Hoist Volume.Config.Source into Volume.Source
2019-09-16 10:12:58 -07:00
Tim Gross a6ef8c5d42
client/networking: wrap error message from CNI plugin (#6316) 2019-09-13 08:20:05 -04:00
Danielle Lancashire 78b61de45f
config: Hoist volume.config.source into volume
Currently, using a Volume in a job uses the following configuration:

```
volume "alias-name" {
  type = "volume-type"
  read_only = true

  config {
    source = "host_volume_name"
  }
}
```

This commit migrates to the following:

```
volume "alias-name" {
  type = "volume-type"
  source = "host_volume_name"
  read_only = true
}
```

The original design was based due to being uncertain about the future of storage
plugins, and to allow maxium flexibility.

However, this causes a few issues, namely:
- We frequently need to parse this configuration during submission,
scheduling, and mounting
- It complicates the configuration from and end users perspective
- It complicates the ability to do validation

As we understand the problem space of CSI a little more, it has become
clear that we won't need the `source` to be in config, as it will be
used in the majority of cases:

- Host Volumes: Always need a source
- Preallocated CSI Volumes: Always needs a source from a volume or claim name
- Dynamic Persistent CSI Volumes*: Always needs a source to attach the volumes
                                   to for managing upgrades and to avoid dangling.
- Dynamic Ephemeral CSI Volumes*: Less thought out, but `source` will probably point
                                  to the plugin name, and a `config` block will
                                  allow you to pass meta to the plugin. Or will
                                  point to a pre-configured ephemeral config.
*If implemented

The new design simplifies this by merging the source into the volume
stanza to solve the above issues with usability, performance, and error
handling.
2019-09-13 04:37:59 +02:00
Tim Gross 3fa4bca4a0
script checks: Update needs to update Alloc as well (#6291) 2019-09-06 11:18:00 -04:00
Tim Gross 8ce201854a
client: recreate script checks on Update (#6265)
Splitting the immutable and mutable components of the scriptCheck led
to a bug where the environment interpolation wasn't being incorporated
into the check's ID, which caused the UpdateTTL to update for a check
ID that Consul didn't have (because our Consul client creates the ID
from the structs.ServiceCheck each time we update).

Task group services don't have access to a task environment at
creation, so their checks get registered before the check can be
interpolated. Use the original check ID so they can be updated.
2019-09-05 11:43:23 -04:00
Michael Schurter ee06c36345
Merge pull request #6254 from hashicorp/test-connect-e2e-demo
e2e: test demo job for connect
2019-09-04 14:33:26 -07:00
Nick Ethier e440ba80f1
ar: refactor network bridge config to use go-cni lib (#6255)
* ar: refactor network bridge config to use go-cni lib

* ar: use eth as the iface prefix for bridged network namespaces

* vendor: update containerd/go-cni package

* ar: update network hook to use TODO contexts when calling configurator

* unnecessary conversion
2019-09-04 16:33:25 -04:00
Michael Schurter 93b47f4ddc client: reword error message 2019-09-04 12:40:09 -07:00
Tim Gross 0f29dcc935
support script checks for task group services (#6197)
In Nomad prior to Consul Connect, all Consul checks work the same
except for Script checks. Because the Task being checked is running in
its own container namespaces, the check is executed by Nomad in the
Task's context. If the Script check passes, Nomad uses the TTL check
feature of Consul to update the check status. This means in order to
run a Script check, we need to know what Task to execute it in.

To support Consul Connect, we need Group Services, and these need to
be registered in Consul along with their checks. We could push the
Service down into the Task, but this doesn't work if someone wants to
associate a service with a task's ports, but do script checks in
another task in the allocation.

Because Nomad is handling the Script check and not Consul anyways,
this moves the script check handling into the task runner so that the
task runner can own the script check's configuration and
lifecycle. This will allow us to pass the group service check
configuration down into a task without associating the service itself
with the task.

When tasks are checked for script checks, we walk back through their
task group to see if there are script checks associated with the
task. If so, we'll spin off script check tasklets for them. The
group-level service and any restart behaviors it needs are entirely
encapsulated within the group service hook.
2019-09-03 15:09:04 -04:00
Michael Schurter 5957030d18
connect: add unix socket to proxy grpc for envoy (#6232)
* connect: add unix socket to proxy grpc for envoy

Fixes #6124

Implement a L4 proxy from a unix socket inside a network namespace to
Consul's gRPC endpoint on the host. This allows Envoy to connect to
Consul's xDS configuration API.

* connect: pointer receiver on structs with mutexes

* connect: warn on all proxy errors
2019-09-03 08:43:38 -07:00
Jasmine Dahilig 4edebe389a
add default update stanza and max_parallel=0 disables deployments (#6191) 2019-09-02 10:30:09 -07:00
Evan Ercolano fcf66918d0 Remove unused canary param from MakeTaskServiceID 2019-08-31 16:53:23 -04:00
Nick Ethier cf014c7fd5
ar: ensure network forwarding is allowed for bridged allocs (#6196)
* ar: ensure network forwarding is allowed in iptables for bridged allocs

* ensure filter rule exists at setup time
2019-08-28 10:51:34 -04:00
Mahmood Ali cc460d4804 Write to client store while holding lock
Protect against a race where destroying and persist state goroutines
race.

The downside is that the database io operation will run while holding
the lock and may run indefinitely.  The risk of lock being long held is
slow destruction, but slow io has bigger problems.
2019-08-26 13:45:58 -04:00
Mahmood Ali c132623ffc Don't persist allocs of destroyed alloc runners
This fixes a bug where allocs that have been GCed get re-run again after client
is restarted.  A heavily-used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in
client alloc runner set.  Periodically, they get persisted until alloc is
gced by server.  During that  time, the client db will contain the alloc
but not its individual tasks status nor completed state.  On client restart,
client assumes that alloc is pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

This is a short-term fix, as we should consider revamping client state
management.  Storing alloc and task information in non-transaction non-atomic
concurrently while alloc runner is running and potentially changing state is a
recipe for bugs.

Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
2019-08-25 11:21:28 -04:00
Lang Martin 4f6493a301 taskrunner getter set Umask for go-getter, setuid test 2019-08-23 15:59:03 -04:00
Jerome Gravel-Niquet cbdc1978bf Consul service meta (#6193)
* adds meta object to service in job spec, sends it to consul

* adds tests for service meta

* fix tests

* adds docs

* better hashing for service meta, use helper for copying meta when registering service

* tried to be DRY, but looks like it would be more work to use the
helper function
2019-08-23 12:49:02 -04:00
Nick Ethier 96d379071d
ar: fix bridge networking port mapping when port.To is unset (#6190) 2019-08-22 21:53:52 -04:00
Michael Schurter 59e0b67c7f connect: task hook for bootstrapping envoy sidecar
Fixes #6041

Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
2019-08-22 08:15:32 -07:00
Michael Schurter b008fd1724 connect: register group services with Consul
Fixes #6042

Add new task group service hook for registering group services like
Connect-enabled services.

Does not yet support checks.
2019-08-20 12:25:10 -07:00
Tim Gross 03433f35d4 client/template: configuration for function blacklist and sandboxing
When rendering a task template, the `plugin` function is no longer
permitted by default and will raise an error. An operator can opt-in
to permitting this function with the new `template.function_blacklist`
field in the client configuration.

When rendering a task template, path parameters for the `file`
function will be treated as relative to the task directory by
default. Relative paths or symlinks that point outside the task
directory will raise an error. An operator can opt-out of this
protection with the new `template.disable_file_sandbox` field in the
client configuration.
2019-08-12 16:34:48 -04:00
Danielle Lancashire 861caa9564
HostVolumeConfig: Source -> Path 2019-08-12 15:39:08 +02:00
Danielle Lancashire e132a30899
structs: Unify Volume and VolumeRequest 2019-08-12 15:39:08 +02:00
Danielle Lancashire 6ef8d5233e
client: Add volume_hook for mounting volumes 2019-08-12 15:39:08 +02:00
Mahmood Ali b17bac5101 Render consul templates using task env only (#6055)
When rendering a task consul template, ensure that only task environment
variables are used.

Currently, `consul-template` always falls back to host process
environment variables when key isn't a task env var[1].  Thus, we add
an empty entry for each host process env-var not found in task env-vars.

[1] bfa5d0e133/template/funcs.go (L61-L75)
2019-08-05 16:30:47 -04:00
Mahmood Ali f66169cd6a
Merge pull request #6065 from hashicorp/b-nil-driver-exec
Check if driver handle is nil before execing
2019-08-02 09:48:28 -05:00
Mahmood Ali a4670db9b7 Check if driver handle is nil before execing
Defend against tr.getDriverHandle being nil.  Exec handler checks if
task is running, but it may be stopped between check and driver handler
fetching.
2019-08-02 10:07:41 +08:00
Nick Ethier 321d10a041
client: remove debugging lines 2019-07-31 01:04:09 -04:00
Nick Ethier af6b191963
client: add autofetch for CNI plugins 2019-07-31 01:04:09 -04:00
Nick Ethier 1e9dd1b193
remove unused file 2019-07-31 01:04:09 -04:00
Nick Ethier 09a4cfd8d7
fix failing tests 2019-07-31 01:04:07 -04:00
Nick Ethier ef83f0831b
ar: plumb client config for networking into the network hook 2019-07-31 01:04:06 -04:00
Nick Ethier af66a35924
networking: Add new bridge networking mode implementation 2019-07-31 01:04:06 -04:00
Nick Ethier 63c5504d56
ar: fix lint errors 2019-07-31 01:03:19 -04:00
Nick Ethier e312201d18
ar: rearrange network hook to support building on windows 2019-07-31 01:03:19 -04:00
Nick Ethier 370533c9c7
ar: fix test that failed due to error renaming 2019-07-31 01:03:19 -04:00
Nick Ethier 2d60ef64d9
plugins/driver: make DriverNetworkManager interface optional 2019-07-31 01:03:19 -04:00
Nick Ethier f87e7e9c9a
ar: plumb error handling into alloc runner hook initialization 2019-07-31 01:03:18 -04:00
Nick Ethier ef1795b344
ar: add tests for network hook 2019-07-31 01:03:18 -04:00
Nick Ethier 15989bba8e
ar: cleanup lint errors 2019-07-31 01:03:18 -04:00
Nick Ethier 220cba3e7e
ar: move linux specific code to it's own file and add tests 2019-07-31 01:03:18 -04:00
Nick Ethier 548f78ef15
ar: initial driver based network management 2019-07-31 01:03:17 -04:00
Nick Ethier 66c514a388
Add network lifecycle management
Adds a new Prerun and Postrun hooks to manage set up of network namespaces
on linux. Work still needs to be done to make the code platform agnostic and
support Docker style network initalization.
2019-07-31 01:03:17 -04:00
Jasmine Dahilig 2157f6ddf1
add formatting for hcl parsing error messages (#5972) 2019-07-19 10:04:39 -07:00
Mahmood Ali cd6f1d3102 Update consul-template dependency to latest
To pick up the fix in
https://github.com/hashicorp/consul-template/pull/1231 .
2019-07-18 07:32:03 +07:00
Mahmood Ali 8a82260319 log unrecoverable errors 2019-07-17 11:01:59 +07:00
Mahmood Ali 1a299c7b28 client/taskrunner: fix stats stats retry logic
Previously, if a channel is closed, we retry the Stats call.  But, if that call
fails, we go in a backoff loop without calling Stats ever again.

Here, we use a utility function for calling driverHandle.Stats call that retries
as one expects.

I aimed to preserve the logging formats but made small improvements as I saw fit.
2019-07-11 13:58:07 +08:00
Preetha Appan ef9a71c68b
code review feedback 2019-07-10 10:41:06 -05:00
Preetha Appan 990e468edc
Populate task event struct with kill timeout
This makes for a nicer task event message
2019-07-09 09:37:09 -05:00
Mahmood Ali f10201c102 run post-run/post-stop task runner hooks
Handle when prestart failed while restoring a task, to prevent
accidentally leaking consul/logmon processes.
2019-07-02 18:38:32 +08:00
Mahmood Ali 4afd7835e3 Fail alloc if alloc runner prestart hooks fail
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.

This leads to terrible results, some of which are:
* Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in
  pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after
  alloc expected start time!

Here, we treat all tasks to have failed if alloc runner prestart hook fails.
This fixes the lockups, and permits the alloc to be rescheduled on another node.

While it's desirable to retry alloc runner in such failures, I opted to treat it
out of scope.  I'm afraid of some subtles about alloc and task runners and their
idempotency that's better handled in a follow up PR.

This might be one of the root causes for
https://github.com/hashicorp/nomad/issues/5840 .
2019-07-02 18:35:47 +08:00
Mahmood Ali 7614b8f09e
Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2
task runner to avoid running task if terminal
2019-07-02 15:31:17 +08:00
Mahmood Ali 7bfad051b9 address review comments 2019-07-02 14:53:50 +08:00
Mahmood Ali 3d89ae0f1e task runner to avoid running task if terminal
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having task runner uses the
allocRunner.shouldRun() instead of only checking the server updated
alloc.

Here, we preserve much of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes https://github.com/hashicorp/nomad/issues/5883
2019-06-27 11:27:34 +08:00
Danielle Lancashire b9ac184e1f
tr: Fetch Wait channel before killTask in restart
Currently, if killTask results in the termination of a process before
calling WaitTask, Restart() will incorrectly return a TaskNotFound
error when using the raw_exec driver on Windows.
2019-06-26 15:20:57 +02:00
Chris Baker f71114f5b8 cleanup test 2019-06-18 14:15:25 +00:00
Chris Baker a2dc351fd0 formatting and clarity 2019-06-18 14:00:57 +00:00
Chris Baker e0170e1c67 metrics: add namespace label to allocation metrics 2019-06-17 20:50:26 +00:00
Danielle f923b568e0
Merge pull request #5821 from hashicorp/dani/b-5770
trhooks: Add TaskStopHook interface to services
2019-06-12 17:30:49 +02:00
Danielle Lancashire c326344b57
trt: Fix test 2019-06-12 17:06:11 +02:00
Danielle Lancashire 13d76e35fd
trhooks: Add TaskStopHook interface to services
We currently only run cleanup Service Hooks when a task is either
Killed, or Exited. However, due to the implementation of a task runner,
tasks are only Exited if they every correctly started running, which is
not true when you recieve an error early in the task start flow, such as
not being able to pull secrets from Vault.

This updates the service hook to also call consul deregistration
routines during a task Stop lifecycle event, to ensure that any
registered checks and services are cleared in such cases.

fixes #5770
2019-06-12 16:00:21 +02:00
Mahmood Ali 2acf30fdd3 Fallback to alloc.TaskResources for old allocs
When a client is running against an old server (e.g. running 0.8),
`alloc.AllocatedResources` may be nil, and we need to check the
deprecated `alloc.TaskResources` instead.

Fixes https://github.com/hashicorp/nomad/issues/5810
2019-06-11 10:32:53 -04:00
Mahmood Ali 7a4900aaa4 client/allocrunner: depend on internal task state
Alloc runner already tracks tasks associated with alloc.  Here, we
become defensive by relying on the alloc runner tracked tasks, rather
than depend on server never updating the job unexpectedly.
2019-06-10 18:42:51 -04:00
Mahmood Ali d30c3d10b0
Merge pull request #5747 from hashicorp/b-test-fixes-20190521-1
More test fixes
2019-06-05 19:09:18 -04:00
Mahmood Ali 935ee86e92
Merge pull request #5737 from fwkz/fix-restart-attempts
Fix restart attempts of `restart` stanza in `delay` mode.
2019-06-05 19:05:07 -04:00
Danielle Lancashire 27583ed8c1 client: Pass servers contacted ch to allocrunner
This fixes an issue where batch and service workloads would never be
restarted due to indefinitely blocking on a nil channel.

It also raises the restoration logging message to `Info` to simplify log
analysis.
2019-05-22 13:47:35 +02:00
Mahmood Ali 9df1e00f35 tests: fix data race in client/allocrunner/taskrunner/template TestTaskTemplateManager_Rerender_Signal
Given that Signal may be called multiple times, blocking for `SignalCh`
isn't sufficient to synchornizing access to Signals field.
2019-05-21 13:56:58 -04:00
Mahmood Ali b475ccbe3e client: synchronize access to ar.alloc
`allocRunner.alloc` is protected by `allocRunner.allocLock`, so let's
use `allocRunner.Alloc()` helper function to access it.
2019-05-21 09:55:05 -04:00
fwkz 8b84bec95a Fix restart attempts of restart stanza.
Number of restarts during 2nd interval is off by one.
2019-05-21 13:27:19 +02:00
Michael Schurter 2fe0768f3b docs: changelog entry for #5669 and fix comment 2019-05-14 10:54:00 -07:00
Michael Schurter af9096c8ba client: register before restoring
Registration and restoring allocs don't share state or depend on each
other in any way (syncing allocs with servers is done outside of
registration).

Since restoring is synchronous, start the registration goroutine first.

For nodes with lots of allocs to restore or close to their heartbeat
deadline, this could be the difference between becoming "lost" or not.
2019-05-14 10:53:27 -07:00
Michael Schurter e07f73bfe0 client: do not restart dead tasks until server is contacted (try 2)
Refactoring of 104067bc2b2002a4e45ae7b667a476b89addc162

Switch the MarkLive method for a chan that is closed by the client.
Thanks to @notnoop for the idea!

The old approach called a method on most existing ARs and TRs on every
runAllocs call. The new approach does a once.Do call in runAllocs to
accomplish the same thing with less work. Able to remove the gate
abstraction that did much more than was needed.
2019-05-14 10:53:27 -07:00
Michael Schurter d7e5ace1ed client: do not restart dead tasks until server is contacted
Fixes #1795

Running restored allocations and pulling what allocations to run from
the server happen concurrently. This means that if a client is rebooted,
and has its allocations rescheduled, it may restart the dead allocations
before it contacts the server and determines they should be dead.

This commit makes tasks that fail to reattach on restore wait until the
server is contacted before restarting.
2019-05-14 10:53:27 -07:00
Michael Schurter 1c4e585fa7 client: expose allocated memory per task
Related to #4280

This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.

```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
2019-05-10 11:12:12 -07:00
Mahmood Ali 919827f2df
Merge pull request #5632 from hashicorp/f-nomad-exec-parts-01-base
nomad exec part 1: plumbing and docker driver
2019-05-09 18:09:27 -04:00
Mahmood Ali ab2cae0625 implement client endpoint of nomad exec
Add a client streaming RPC endpoint for processing nomad exec tasks, by invoking
the relevant task handler for execution.
2019-05-09 16:49:08 -04:00
Chris Baker 93ec1293be
stale allocation data leads to incorrect (and even negative) metrics (#5637)
* client: was not using up-to-date client state in determining which alloc count towards allocated resources

* Update client/client.go

Co-Authored-By: cgbaker <cgbaker@hashicorp.com>
2019-05-07 15:54:36 -04:00
Michael Schurter 8c7b3ff45a
Fix comment
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:01:30 -05:00
Michael Schurter e19fa33f9c
Remove unnecessary boolean clause
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-05-03 10:00:17 -05:00
Preetha Appan b99a204582
Update deployment health on failed allocations only if health is unset
This fixes a confusing UX where a previously successful deployment's
healthy/unhealthy count would get updated if any allocations failed after
the deployment was already marked as successful.
2019-05-02 22:59:56 -05:00
Danielle 79515496cb
Merge pull request #5515 from hashicorp/dani/f-alloc-signal
allocs: Add nomad alloc signal command
2019-04-26 14:21:05 +02:00
Mahmood Ali bf0a09e270 retry grpc unavailable errors even if not shutting down 2019-04-25 18:39:17 -04:00
Mahmood Ali 81841e8528 try checking process status 2019-04-25 18:16:13 -04:00
Mahmood Ali fc78521f29 add logging about attempts 2019-04-25 18:09:36 -04:00
Mahmood Ali e6ca8641a8 try sleeping for stop signal to take effect 2019-04-25 17:16:29 -04:00
Mahmood Ali ff3a095015 add a test that simulates logmon dying during Start() call 2019-04-25 16:41:17 -04:00
Mahmood Ali bbac73883c logmon: retry starting logmon if it exits
Retry if we detect shutting down during Start() api call is started,
locally.
2019-04-25 15:10:16 -04:00
Danielle Lancashire 3409e0be89 allocs: Add nomad alloc signal command
This command will be used to send a signal to either a single task within an
allocation, or all of the tasks if <task-name> is omitted. If the sent signal
terminates the allocation, it will be treated as if the allocation has crashed,
rather than as if it was operator-terminated.

Signal validation is currently handled by the driver itself and nomad
does not attempt to restrict or validate them.
2019-04-25 12:43:32 +02:00
Michael Schurter 61f17a1043
tweak logging level for failed log line
Co-Authored-By: notnoop <mahmood@notnoop.com>
2019-04-22 14:40:17 -04:00
Danielle Lancashire c31966fc71 loggging: Attempt to recover logmon failures
Currently, when logmon fails to reattach, we will retry reattachment to
the same pid until the task restart specification is exhausted.

Because we cannot clear hook state during error conditions, it is not
possible for us to signal to a future restart that it _shouldn't_
attempt to reattach to the plugin.

Here we revert to explicitly detecting reattachment seperately from a
launch of a new logmon, so we can recover from scenarios where a logmon
plugin has failed.

This is a net improvement over the current hard failure situation, as it
means in the most common case (the pid has gone away), we can recover.

Other reattachment failure modes where the plugin may still be running
could potentially cause a duplicate process, or a subsequent failure to launch
a new plugin.

If there was a duplicate process, it could potentially cause duplicate
logging. This is better than a production workload outage.

If there was a subsequent failure to launch a new plugin, it would fail
in the same (retry until restarts are exhausted) as the current failure
mode.
2019-04-18 13:41:56 +02:00
Michael Schurter f7a7acc345
Merge pull request #5518 from hashicorp/f-simplify-kill
client: simplify kill logic
2019-04-15 14:11:58 -07:00
Chris Baker 6848591914 vault namespaces: inject VAULT_NAMESPACE alongside VAULT_TOKEN + documentation 2019-04-12 15:06:34 +00:00
Danielle Lancashire e135876493 allocs: Add nomad alloc restart
This adds a `nomad alloc restart` command and api that allows a job operator
with the alloc-lifecycle acl to perform an in-place restart of a Nomad
allocation, or a given subtask.
2019-04-11 14:25:49 +02:00
Chris Baker c0a7aee610
vault e2e: pass vault version into setup instead of having to infer it from test name 2019-04-10 10:34:10 -05:00
Chris Baker f0c184fc29
taskrunner: removed some unecessary config from a test 2019-04-10 10:34:10 -05:00
Chris Baker 170f5239c8
client: gofmt 2019-04-10 10:34:10 -05:00
Chris Baker a1d7971b2e
taskrunner: pass configured Vault namespace into TaskTemplateConfig 2019-04-10 10:34:10 -05:00
Michael Schurter f7d4428855 client: simplify kill logic
Remove runLaunched tracking as Run is *always* called for killable
TaskRunners. TaskRunners which fail before Run can be called (during
NewTaskRunner or Restore) are not killable as they're never added to the
client's alloc map.
2019-04-04 15:18:33 -07:00
Michael Schurter 1d569a27dc Revert "executor/linux: add defensive checks to binary path"
This reverts commit cb36f4537e63d53b198c2a87d1e03880895631bd.
2019-04-02 11:17:12 -07:00
Michael Schurter fc5487dbbc executor/linux: add defensive checks to binary path 2019-04-02 09:40:53 -07:00
Michael Schurter 7d49bc4c71 executor/linux: make chroot binary paths absolute
Avoid libcontainer.Process trying to lookup the binary via $PATH as the
executor has already found where the binary is located.
2019-04-01 15:45:31 -07:00
Michael Schurter a4572919cd
Merge pull request #5456 from hashicorp/test-taskenv
tests: port pre-0.9 task env tests
2019-03-25 10:41:38 -07:00
Michael Schurter 8efad12538 tests: port pre-0.9 task env tests
I chose to make them more of integration tests since there's a lot more
plumbing involved. The internal implementation details of how we craft
task envs can now change and these tests will still properly assert the
task runtime environment is setup properly.
2019-03-25 09:46:53 -07:00
Nick Ethier dc18b8928a
logmon: make Start rpc idempotent and simplify hook 2019-03-19 14:02:36 -04:00
Nick Ethier ac7fbee1b8
logmon:add static check for logmon exited hook 2019-03-18 15:59:43 -04:00
Nick Ethier 7dc3d83634
client/logmon: restart log collection correctly when a task is restarted 2019-03-15 23:59:18 -04:00
Michael Schurter 0ba1a5251b client: cleanup and document context uses
Some of the context uses in TR hooks are useless (Killed during Stop
never seems meaningful).

None of the hooks are interruptable for graceful shutdown which is
unfortunate and probably needs fixing.
2019-03-12 15:03:54 -07:00
Michael Schurter 32d31575cc client: emit event and call exited hooks during cleanup
Builds upon earlier commit that cleans up restored handles of terminal
allocs by also emitting terminated events and calling exited hooks when
appropriate.
2019-03-05 15:12:02 -08:00
Michael Schurter 64e145ebdb logmon: drop reattach log level as its expected
Logged once per terminal task on agent restart.
2019-03-04 13:26:01 -08:00
Michael Schurter c5271d3fa5 client: test logmon cleanup
The test is sadly quite complicated and peeks into things (logmon's
reattach config) AR doesn't normally have access to.

However, I couldn't find another way of asserting logmon got cleaned up
without resorting to smaller unit tests. Smaller unit tests risk
re-implementing dependencies in an unrealistic way, so I opted for an
ugly integration test.
2019-03-04 13:15:15 -08:00
Michael Schurter ef8d284352 client: ensure task is cleaned up when terminal
This commit is a significant change. TR.Run is now always executed, even
for terminal allocations. This was changed to allow TR.Run to cleanup
(run stop hooks) if a handle was recovered.

This is intended to handle the case of Nomad receiving a
DesiredStatus=Stop allocation update, persisting it, but crashing before
stopping AR/TR.

The commit also renames task runner hook data as it was very easy to
accidently set state on Requests instead of Responses using the old
field names.
2019-03-01 14:00:23 -08:00
Michael Schurter 812f1679e2
Merge pull request #5352 from hashicorp/b-leaked-logmon
logmon fixes
2019-02-26 08:35:46 -08:00
Michael Schurter e39a10a1f4 tests: move unix-specific test to its own file
Other logmon tests should be portable.
2019-02-26 07:56:44 -08:00
Michael Schurter 3b2a592e93 client: restart task on logmon failures
This code chooses to be conservative as opposed to optimal: when failing
to reattach to logmon simply return a recoverable error instead of
immediately trying to restart logmon.

The recoverable error will cause the task's restart policy to be
applied and a new logmon will be launched upon restart.

Trying to do the optimal approach of simply starting a new logmon
requires error string comparison and should be tested against a task
actively logging to assert the behavior (are writes blocked? dropped?).
2019-02-25 15:42:45 -08:00
Michael Schurter 8830b00866 client: test logmon_hook 2019-02-23 15:36:48 -08:00
Preetha Appan 43679f4ce1
More alloc runner tests ported from 0.8.7 2019-02-22 17:58:06 -06:00
Mahmood Ali 32551fb0e5 emit TaskRestartSignal event on vault restart
When Vault token expires and task is restarted, emit `TaskRestartSignal`
similar to v0.8.7
2019-02-22 15:56:14 -05:00
Mahmood Ali 8cb4bbcc08 address review comments 2019-02-22 15:56:14 -05:00
Mahmood Ali 216eaa4843 tests: port TestTaskRunner_VaultManager_Signal
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1427
2019-02-22 15:53:04 -05:00
Mahmood Ali 8e9e732319 tests: port TestTaskRunner_VaultManager_Restart
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1352
2019-02-22 15:53:04 -05:00
Mahmood Ali 33122ca7c0 tests: port TestTaskRunner_UnregisterConsul_Retries
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L620
2019-02-22 15:53:04 -05:00
Mahmood Ali 0128b0ce7a tests: port TestTaskRunner_Template_NewVaultToken
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1275
2019-02-22 15:53:04 -05:00
Mahmood Ali cfb80583af tests: port TestTaskRunner_Template_Artifact
From https://github.com/hashicorp/nomad/blob/v0.8.7/client/task_runner_test.go#L1195
2019-02-22 15:52:59 -05:00
Mahmood Ali 1b14214a88 tests: port TestAllocRunner_RetryArtifact
Port TestAllocRunner_RetryArtifact from https://github.com/hashicorp/nomad/blob/v0.8.7/client/alloc_runner_test.go#L610-L672

I changed the test name because it doesn't actually test that artifact
hooks is retried
2019-02-22 15:50:39 -05:00
Mahmood Ali c827e6e05a tests: port TestAllocRunner_MoveAllocDir test 2019-02-22 15:50:39 -05:00
Michael Schurter a2e3ea6dc9 logmon: fix reattach configuration
There were multiple bugs here:

1. Reattach unmarshalling always returned an error because you can't
   unmarshal into a nil pointer.
2. The hook data wasn't being saved because it was put on the request
   struct, not the response struct.
3. The plugin configuration should only have reattach *or* a command
   set. Not both.
4. Setting Done=true meant the hook was never re-run on agent restart so
   reattaching was never attempted.
2019-02-21 15:32:18 -08:00
Michael Schurter 01cabdff88 client: restart on recoverable StartTask errors
Fixes restarting on recoverable errors from StartTask.

Ports TestTaskRunner_Run_RecoverableStartError from 0.8 which discovered
the bug.
2019-02-21 15:30:49 -08:00
Michael Schurter e3f321cd27 test: port TestTaskRunner_RestartSignalTask_NotRunning from 0.8 2019-02-21 15:30:49 -08:00
Michael Schurter f3aa945a00 test: port TestTaskRunner_DriverNetwork from 0.8 2019-02-21 15:30:49 -08:00
Michael Schurter 518405ac33
Merge pull request #5322 from hashicorp/b-artifact-retries
Fix regression by restarting on artifact download errors
2019-02-21 15:28:51 -08:00
Michael Schurter 2553800eb8 tests: port TestAllocRunner_Destroy from 0.8
Also add destroy(ar) helper to fix a bunch of shutdown races in AR
tests.
2019-02-20 12:35:09 -08:00
Michael Schurter 6580ed668e client: don't redownload completed artifacts on retries
Track the download status of each artifact independently so that if only
one of many artifacts fails to download, completed artifacts aren't
downloaded again.
2019-02-20 08:45:12 -08:00
Michael Schurter 908bfab4c2 client: artifact errors are retry-able
0.9.0beta2 contains a regression where artifact download errors would
not cause a task restart and instead immediately fail the task.

This restores the pre-0.9 behavior of retrying all artifact errors and
adds missing tests.
2019-02-20 07:21:27 -08:00
Michael Schurter 79ccf00b72 tests: add new task runner test helper
Adds a new helper and removes a duplicated test.
2019-02-20 07:21:27 -08:00
Michael Schurter 159042a1a3 client: fix setting alloc unhealthy at deadline
During the 0.9 client refactor the code to fail a deployment when the
deadline was reached was broken. This restores and tests that behavior.
2019-02-19 07:44:14 -08:00
Mahmood Ali 87be233aca
test: improve readability of duration
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2019-02-14 08:12:06 -08:00
Mahmood Ali 16d3414842
test: improve failure message
Co-Authored-By: schmichael <michael.schurter@gmail.com>
2019-02-14 08:11:37 -08:00
Michael Schurter 4814f0fb0b tests: port TestTaskRunner_Download_List from 0.8 2019-02-12 15:48:04 -08:00
Michael Schurter a152e3ef17 consul: fix task deregistration hook
Broke ShutdownDelay but the test was timing dependent so it just
appeared flaky. Made the test slower so that it should never incorrectly
pass.
2019-02-12 15:36:02 -08:00
Michael Schurter 4ad879e75e tests: port TaskRunner_DeriveToken tests from 0.8 2019-02-12 15:36:02 -08:00
Michael Schurter 6743ed9fdc tests: port TestTaskRunner_BlockForVault from 0.8
Also fix race conditions in the mock vault client.
2019-02-12 13:46:09 -08:00
Michael Schurter 6c0cc65b2e simplify hcl2 parsing helper
No need to pass in the entire eval context
2019-02-04 11:07:57 -08:00
Alex Dadgar 5062c54874 Fix usage of fsi variable 2019-01-29 14:07:55 -08:00
Alex Dadgar 6f418ebaf0 Always populate task dir environment variables
Fixes an issue where if a task was restarted after restating the client,
the task dir environment variables would not be populated. This PR fixes
this for both upgrades from 0.8.X and for normal 0.9 restarts.
2019-01-29 13:17:10 -08:00
Alex Dadgar 5da21635fb Fix env templates having interpolated destinations
Fixes an issue where env templates that had interpolated destinations
would not work.

Fixes https://github.com/hashicorp/nomad/issues/5250
2019-01-28 10:28:53 -08:00
Alex Dadgar d6412fd8e7 Fix double restart counting for templates
This PR fixes an issue where template restarts would count twice since
it was emitting a restarting event.
2019-01-25 15:38:13 -08:00
Nick Ethier a36c4320ff
Merge pull request #5227 from hashicorp/b-client-highcpu-usage
Fix bug related to high cpu usage
2019-01-23 14:27:51 -05:00
Michael Schurter 13f061a83f
Merge pull request #5196 from hashicorp/f-plugin-utils
Make plugins/shared external and make pluginutls/
2019-01-23 06:59:32 -08:00
Preetha 05bf183ba3
Merge pull request #5225 from hashicorp/b-notaskevent-terminalallocs
Don't emit task events after alloc is in a terminal DesiredState
2019-01-23 08:54:10 -06:00
Michael Schurter 32daa7b47b goimports until make check is happy 2019-01-23 06:27:14 -08:00
Nick Ethier bcc3935228
tr: use context in as select statement 2019-01-22 20:11:39 -05:00
Michael Schurter be0bab7c3f move pluginutils -> helper/pluginutils
I wanted a different color bikeshed, so I get to paint it
2019-01-22 15:50:08 -08:00
Alex Dadgar 2ca0e97361 Split hclspec 2019-01-22 15:43:34 -08:00
Alex Dadgar 5ca6dd7988 move hclutils 2019-01-22 15:43:34 -08:00
Alex Dadgar 72a5691897 Driver tests do not use hcl2/hcl, hclspec, or hclutils 2019-01-22 15:43:34 -08:00
Preetha Appan 38422642cb
Use DesiredState to determine whether to stop sending task events 2019-01-22 16:43:32 -06:00
Preetha Appan 862c9b7de5
dont emit events for terminal allocs 2019-01-22 16:26:33 -06:00
Michael Schurter 1fa376cac6
Merge pull request #5211 from hashicorp/test-porting-08
Port some 0.8 TaskRunner tests
2019-01-22 14:05:53 -08:00
Michael Schurter 8ced0adb67 test: port TestTaskRunner_CheckWatcher_Restart
Added ability to adjust the number of events the TaskRunner keeps as
there's no way to observe all events otherwise.

Task events differ slightly from 0.8 because 0.9 emits Terminated every
time a task exits instead of only when it exits on its own (not due to
restart or kill).

0.9 does not emit Killing/Killed for restarts like 0.8 which seems fine
as `Restart Signaled/Terminated/Restarting` is more descriptive.

Original v0.8 events emitted:
```
	expected := []string{
		"Received",
		"Task Setup",
		"Started",
		"Restart Signaled",
		"Killing",
		"Killed",
		"Restarting",
		"Started",
		"Restart Signaled",
		"Killing",
		"Killed",
		"Restarting",
		"Started",
		"Restart Signaled",
		"Killing",
		"Killed",
		"Not Restarting",
	}
```
2019-01-22 09:46:46 -08:00
Michael Schurter 1719752a9d test: port RestartTask from 0.8 2019-01-22 08:08:08 -08:00
Michael Schurter 9edff19625 test: port SignalFailure test from 0.8
Also fix signal error handling in mock_driver.
2019-01-22 08:08:08 -08:00
Preetha Appan 299a5fc821
Rename TaskKillRequest/Response to TaskPreKillRequest/Response 2019-01-22 09:54:02 -06:00
Preetha Appan 5a5b9c5666
Fix log comments 2019-01-22 09:45:58 -06:00
Preetha Appan 06e15f8381
Rename TaskKillHook to TaskPreKillHook to more closely match usage
Also added/fixed comments
2019-01-22 09:41:56 -06:00
Michael Schurter 3b02af9386
Fix comment
Co-Authored-By: preetapan <preetha@hashicorp.com>
2019-01-22 09:41:21 -06:00
Preetha Appan 09291c689b
Rename TaskKillHook to TaskPreKillHook to more closely match usage
Also added/fixed comments
2019-01-22 09:41:21 -06:00
Nick Ethier 47127de671
ar: return error from hooks if occured 2019-01-18 18:31:02 -05:00
Mahmood Ali 5df63fda7c
Merge pull request #5190 from hashicorp/f-memory-usage
Track Basic Memory Usage as reported by cgroups
2019-01-18 16:46:02 -05:00
Chris Baker 290c3f36ad set TaskGroupName in task_runner 2019-01-18 20:25:11 +00:00
Chris Baker 8917961caa documenting test for task runner failure to set TaskGroupName 2019-01-18 20:00:49 +00:00
Michael Schurter cfadacfd95
Merge pull request #5203 from hashicorp/b-terminated
client: restore Terminated event on every exit
2019-01-18 08:54:15 -08:00
Preetha Appan e0b68a19c6
Fix one more place that should be using taskResources
taskResources handles new resource fields in a backwards compatible way
2019-01-17 15:52:51 -06:00
Michael Schurter a20ac7c1de client: restore Terminated event on every exit
v0.9.0-dev started emitting a Terminated event every time a task process
exited. While this wasn't true in previous versions, it's a useful task
event because it's the only place for job operators to view the task's
exit code.

This behavior is asserted in the e2e/taskevents tests.
2019-01-17 10:02:25 -08:00
Danielle Tomlinson a695b3562c
Merge pull request #5193 from hashicorp/dani/logmon-reattach
logmon: Reattach to existing loggers
2019-01-16 17:34:13 +01:00
Danielle Tomlinson 99da4c780d logmon: Reattach to existing loggers
This commit prevents us from creating duplicate logmon hooks when
restoring allocations by persisting the logmon reattach config using
HookData.
2019-01-16 14:56:10 +01:00
Michael Schurter daa7d029a1 test: porting TestTaskRunner_SimpleRun_Dispatch
Porting test from 0.8 to 0.9.
2019-01-15 15:22:13 -08:00
Michael Schurter 48afda786b
Merge pull request #5187 from hashicorp/test-consul
Port a bunch of pre-0.9 Consul tests to 0.9
2019-01-15 07:41:50 -08:00
Mahmood Ali 9909d98bee Track Basic Memory Usage as reported by cgroups
Track current memory usage, `memory.usage_in_bytes`, in addition to
`memory.max_memory_usage_in_bytes` and friends.  This number is closer
what Docker reports.

Related to https://github.com/hashicorp/nomad/issues/5165 .
2019-01-14 18:47:52 -05:00
Nick Ethier c619e70d39
Merge pull request #5018 from hashicorp/f-executor-stats
executor: streaming stats api
2019-01-14 15:02:35 -05:00
Michael Schurter 4e7ea460e8 test: port some pre-0.9 DeploymentHealth tests
Skipping a failing one as I need to move to some other work and don't
want to leave this work orphaned on my machine.
2019-01-14 09:56:53 -08:00
Michael Schurter ff2f23f5f9 test: assert service interpolation behavior
Ported from pre-0.9 tests.
2019-01-14 09:56:53 -08:00
Michael Schurter e877bb6370 test: assert shutdown delay deregs first
Restore a pre-0.9 test that asserts Consul services are deregistered
before a task's shutdown delay.
2019-01-14 09:56:53 -08:00