Commit Graph

3913 Commits

Author SHA1 Message Date
Pete Woods 49b7d23cea Add node "status" and "scheduling eligibility" tags to client metrics (#6130)
When summing up the capability of your Nomad fleet for scaling purposes, it's important to exclude draining nodes, as they won't accept new jobs.
2019-09-03 12:11:11 -04:00
Michael Schurter 5957030d18
connect: add unix socket to proxy grpc for envoy (#6232)
* connect: add unix socket to proxy grpc for envoy

Fixes #6124

Implement a L4 proxy from a unix socket inside a network namespace to
Consul's gRPC endpoint on the host. This allows Envoy to connect to
Consul's xDS configuration API.

* connect: pointer receiver on structs with mutexes

* connect: warn on all proxy errors
2019-09-03 08:43:38 -07:00
Mahmood Ali 0f401e6b8b lint: ignore generated windows syscall wrappers 2019-09-03 10:59:58 -04:00
Jasmine Dahilig 4edebe389a
add default update stanza and max_parallel=0 disables deployments (#6191) 2019-09-02 10:30:09 -07:00
Evan Ercolano fcf66918d0 Remove unused canary param from MakeTaskServiceID 2019-08-31 16:53:23 -04:00
Danielle Lancashire 4fcb7394e9
client: Fix memory fingerprinting on 32bit
Also introduce regression ci for 32 bit fingerprinting
2019-08-31 18:33:59 +02:00
Mahmood Ali 0bd2eee87f
Merge pull request #6216 from hashicorp/b-recognize-pending-allocs
alloc_runner: wait when starting suspicious allocs
2019-08-28 14:46:09 -04:00
Mahmood Ali e0da3c5d0e rename to hasLocalState, and ignore clientstate
The ClientState being pending isn't a good criteria; as an alloc may
have been updated in-place before it was completed.

Also, updated the logic so we only check for task states.  If an alloc
has deployment state but no persisted tasks at all, restore will still
fail.
2019-08-28 11:44:48 -04:00
Nick Ethier cf014c7fd5
ar: ensure network forwarding is allowed for bridged allocs (#6196)
* ar: ensure network forwarding is allowed in iptables for bridged allocs

* ensure filter rule exists at setup time
2019-08-28 10:51:34 -04:00
Nick Ethier cbb27e74bc
Add environment variables for connect upstreams (#6171)
* taskenv: add connect upstream env vars + test

* set taskenv upstreams instead of appending

* Update client/taskenv/env.go

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-08-27 23:41:38 -04:00
Mahmood Ali 90c5eefbab Alternative approach: avoid restoring
This uses an alternative approach where we avoid restoring the alloc
runner in the first place, if we suspect that the alloc may have been
completed already.
2019-08-27 17:30:55 -04:00
Jasmine Dahilig 4078393bb6
expose nomad namespace as environment variable in allocation #5692 (#6192) 2019-08-27 08:38:07 -07:00
Mahmood Ali 647c1457cb alloc_runner: wait when starting suspicious allocs
This commit aims to help users running with clients suseptible to the
destroyed alloc being restrarted bug upgrade to latest.  Without this,
such users will have their tasks run unexpectedly on upgrade and only
see the bug resolved after subsequent restart.

If, on restore, the client sees a pending alloc without any other
persisted info, then err on the side that it's an corrupt persisted
state of an alloc instead of the client happening to be killed right
when alloc is assigned to client.

Few reasons motivate this behavior:

Statistically speaking, corruption being the cause is more likely.  A
long running client will have higher chance of having allocs persisted
incorrectly with pending state.  Being killed right when an alloc is
about to start is relatively unlikely.

Also, delaying starting an alloc that hasn't started (by hopefully
seconds) is not as severe as launching too many allocs that may bring
client down.

More importantly, this helps customers upgrade their clients without
risking taking their clients down and destablizing their cluster. We
don't want existing users to force triggering the bug while they upgrade
and restart cluster.
2019-08-26 22:05:31 -04:00
Mahmood Ali dfdf0edd3b
Merge pull request #6207 from hashicorp/b-gc-destroyed-allocs-rerun
Don't persist allocs of destroyed alloc runners
2019-08-26 17:26:18 -04:00
Mahmood Ali cc460d4804 Write to client store while holding lock
Protect against a race where destroying and persist state goroutines
race.

The downside is that the database io operation will run while holding
the lock and may run indefinitely.  The risk of lock being long held is
slow destruction, but slow io has bigger problems.
2019-08-26 13:45:58 -04:00
Mahmood Ali 1851820f20 logmon: log stat error to help debugging 2019-08-26 10:10:20 -04:00
Mahmood Ali c132623ffc Don't persist allocs of destroyed alloc runners
This fixes a bug where allocs that have been GCed get re-run again after client
is restarted.  A heavily-used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in
client alloc runner set.  Periodically, they get persisted until alloc is
gced by server.  During that  time, the client db will contain the alloc
but not its individual tasks status nor completed state.  On client restart,
client assumes that alloc is pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

This is a short-term fix, as we should consider revamping client state
management.  Storing alloc and task information in non-transaction non-atomic
concurrently while alloc runner is running and potentially changing state is a
recipe for bugs.

Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
2019-08-25 11:21:28 -04:00
Mahmood Ali 6301725002 logmon: revert workaround for Windows go1.11 bug
Revert e0126123ab1ba848f72458538bc6118c978245e6 now that we are running
with Golang 1.12, and https://github.com/golang/go/issues/29119 is no
longer relevant.
2019-08-24 08:19:44 -04:00
Mahmood Ali df1f3eb9ee
Merge pull request #6201 from hashicorp/b-device-stats-interval
initialize device manager stats interval
2019-08-24 08:16:03 -04:00
Mahmood Ali 07b5f4c530
Merge pull request #6146 from hashicorp/b-config-template-copy
clientConfig.Copy() to copy template config too
2019-08-23 19:00:57 -04:00
Mahmood Ali b98568774b clientConfig.Copy() to copy template config too 2019-08-23 18:43:22 -04:00
Lang Martin 4f6493a301 taskrunner getter set Umask for go-getter, setuid test 2019-08-23 15:59:03 -04:00
Mahmood Ali 3890619100 initialize device manager stats interval
Fixes a bug where we cpu is pigged at 100% due to collecting devices
statistics.  The passed stats interval was ignored, and the default zero
value causes a very tight loop of stats collection.

FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a
`g2.2xlarge` ec2 instance.

The stats interval defaults to 1 second and is user configurable.  I
believe this is too frequent as a default, and I may advocate for
reducing it to a value closer to 5s or 10s, but keeping it as is for
now.

Fixes https://github.com/hashicorp/nomad/issues/6057 .
2019-08-23 14:58:34 -04:00
Jerome Gravel-Niquet cbdc1978bf Consul service meta (#6193)
* adds meta object to service in job spec, sends it to consul

* adds tests for service meta

* fix tests

* adds docs

* better hashing for service meta, use helper for copying meta when registering service

* tried to be DRY, but looks like it would be more work to use the
helper function
2019-08-23 12:49:02 -04:00
Nick Ethier 96d379071d
ar: fix bridge networking port mapping when port.To is unset (#6190) 2019-08-22 21:53:52 -04:00
Michael Schurter 59e0b67c7f connect: task hook for bootstrapping envoy sidecar
Fixes #6041

Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
2019-08-22 08:15:32 -07:00
Michael Schurter b008fd1724 connect: register group services with Consul
Fixes #6042

Add new task group service hook for registering group services like
Connect-enabled services.

Does not yet support checks.
2019-08-20 12:25:10 -07:00
lchayoun 2307c9d1d2 allow dash in non generated environment variable names - should only clean generate environment variables 2019-08-16 11:11:47 +03:00
Nick Ethier 965f00b2fc
Builtin Admission Controller Framework (#6116)
* nomad: add admission controller framework

* nomad: add admission controller framework and Consul Connect hooks

* run admission controllers before checking permissions

* client: add default node meta for connect configurables

* nomad: remove validateJob func since it has been moved to admission controller

* nomad: use new TaskKind type

* client: use consts for connect sidecar image and log level

* Apply suggestions from code review

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

* nomad: add job register test with connect sidecar

* Update nomad/job_endpoint_hooks.go

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-08-15 11:22:37 -04:00
lchayoun c5a38a045a allow dash in non generated environment variable names - should only clean generate environment variables 2019-08-13 19:23:13 +03:00
Tim Gross 03433f35d4 client/template: configuration for function blacklist and sandboxing
When rendering a task template, the `plugin` function is no longer
permitted by default and will raise an error. An operator can opt-in
to permitting this function with the new `template.function_blacklist`
field in the client configuration.

When rendering a task template, path parameters for the `file`
function will be treated as relative to the task directory by
default. Relative paths or symlinks that point outside the task
directory will raise an error. An operator can opt-out of this
protection with the new `template.disable_file_sandbox` field in the
client configuration.
2019-08-12 16:34:48 -04:00
Danielle Lancashire 7e6c8e5ac1
Copy documentation to api/tasks 2019-08-12 16:22:27 +02:00
Danielle Lancashire 861caa9564
HostVolumeConfig: Source -> Path 2019-08-12 15:39:08 +02:00
Danielle Lancashire e132a30899
structs: Unify Volume and VolumeRequest 2019-08-12 15:39:08 +02:00
Danielle Lancashire 6ef8d5233e
client: Add volume_hook for mounting volumes 2019-08-12 15:39:08 +02:00
Danielle Lancashire 063e4240c1
client: Add parsing and registration of HostVolume configuration 2019-08-12 15:39:08 +02:00
lchayoun ca892163b2 allow dash in non generated environment variable names 2019-08-11 12:51:42 +03:00
Nick Ethier 7806f4c597
Revert "client: add autofetch for CNI plugins"
This reverts commit 0bd157cc3b04fb090dd0d54affcae71496102ce8.
2019-08-08 15:10:19 -04:00
Nick Ethier 7d28ece8de
Revert "client: remove debugging lines"
This reverts commit 54ce4d1f7ef4913cb12c03dbc98bcd903f7787c9.
2019-08-08 14:52:52 -04:00
Liel Chayoun 24dcb2379c
Update env_test.go 2019-08-06 11:59:31 +03:00
Mahmood Ali b17bac5101 Render consul templates using task env only (#6055)
When rendering a task consul template, ensure that only task environment
variables are used.

Currently, `consul-template` always falls back to host process
environment variables when key isn't a task env var[1].  Thus, we add
an empty entry for each host process env-var not found in task env-vars.

[1] bfa5d0e133/template/funcs.go (L61-L75)
2019-08-05 16:30:47 -04:00
Mahmood Ali f66169cd6a
Merge pull request #6065 from hashicorp/b-nil-driver-exec
Check if driver handle is nil before execing
2019-08-02 09:48:28 -05:00
Mahmood Ali a4670db9b7 Check if driver handle is nil before execing
Defend against tr.getDriverHandle being nil.  Exec handler checks if
task is running, but it may be stopped between check and driver handler
fetching.
2019-08-02 10:07:41 +08:00
Nick Ethier 7de0bec8ab
client/cni: updated comments and simplified logic to auto download plugins 2019-07-31 01:04:10 -04:00
Nick Ethier b16640c50d
Apply suggestions from code review
Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>
2019-07-31 01:04:10 -04:00
Nick Ethier 321d10a041
client: remove debugging lines 2019-07-31 01:04:09 -04:00
Nick Ethier af6b191963
client: add autofetch for CNI plugins 2019-07-31 01:04:09 -04:00
Nick Ethier 1e9dd1b193
remove unused file 2019-07-31 01:04:09 -04:00
Nick Ethier 09a4cfd8d7
fix failing tests 2019-07-31 01:04:07 -04:00
Nick Ethier ef83f0831b
ar: plumb client config for networking into the network hook 2019-07-31 01:04:06 -04:00
Nick Ethier af66a35924
networking: Add new bridge networking mode implementation 2019-07-31 01:04:06 -04:00
Michael Schurter fb487358fb
connect: add group.service stanza support 2019-07-31 01:04:05 -04:00
Nick Ethier 63c5504d56
ar: fix lint errors 2019-07-31 01:03:19 -04:00
Nick Ethier e312201d18
ar: rearrange network hook to support building on windows 2019-07-31 01:03:19 -04:00
Nick Ethier 370533c9c7
ar: fix test that failed due to error renaming 2019-07-31 01:03:19 -04:00
Nick Ethier 2d60ef64d9
plugins/driver: make DriverNetworkManager interface optional 2019-07-31 01:03:19 -04:00
Nick Ethier f87e7e9c9a
ar: plumb error handling into alloc runner hook initialization 2019-07-31 01:03:18 -04:00
Nick Ethier ef1795b344
ar: add tests for network hook 2019-07-31 01:03:18 -04:00
Nick Ethier 15989bba8e
ar: cleanup lint errors 2019-07-31 01:03:18 -04:00
Nick Ethier 220cba3e7e
ar: move linux specific code to it's own file and add tests 2019-07-31 01:03:18 -04:00
Nick Ethier 548f78ef15
ar: initial driver based network management 2019-07-31 01:03:17 -04:00
Nick Ethier 66c514a388
Add network lifecycle management
Adds a new Prerun and Postrun hooks to manage set up of network namespaces
on linux. Work still needs to be done to make the code platform agnostic and
support Docker style network initalization.
2019-07-31 01:03:17 -04:00
Preetha Appan d048029b5a
remove generated code and change version to 0.10.0 2019-07-30 15:56:05 -05:00
Nomad Release bot e39fb11531 Generate files for 0.9.4 release 2019-07-30 19:05:18 +00:00
Preetha Appan 6b4c40f5a8
remove generated code 2019-07-23 12:07:49 -05:00
Nomad Release bot 04187c8b86 Generate files for 0.9.4-rc1 release 2019-07-22 21:42:36 +00:00
Michael Schurter d90680021e logmon: fix comment formattinglogmon: fix comment formattinglogmon: fix
comment formattinglogmon: fix comment formattinglogmon: fix comment
formatting
2019-07-22 13:05:01 -07:00
Michael Schurter e37bc3513c logmon: ensure errors are still handled properly
...and add a comment to switch back to the old error handling once we
switch to Go 1.12.
2019-07-22 12:49:48 -07:00
Danielle Lancashire 1bcbbbfbe6
logmon: Workaround golang/go#29119
There's a bug in go1.11 that causes some io operations on windows to
return incorrect errors for some cases when Stat-ing files. To avoid
upgrading to go1.12 in a point release, here we loosen up the cases
where we will attempt to create fifos, and add some logging of
underlying stat errors to help with debugging.
2019-07-22 18:28:12 +02:00
Jasmine Dahilig 2157f6ddf1
add formatting for hcl parsing error messages (#5972) 2019-07-19 10:04:39 -07:00
Mahmood Ali cd6f1d3102 Update consul-template dependency to latest
To pick up the fix in
https://github.com/hashicorp/consul-template/pull/1231 .
2019-07-18 07:32:03 +07:00
Mahmood Ali 8a82260319 log unrecoverable errors 2019-07-17 11:01:59 +07:00
Mahmood Ali 1a299c7b28 client/taskrunner: fix stats stats retry logic
Previously, if a channel is closed, we retry the Stats call.  But, if that call
fails, we go in a backoff loop without calling Stats ever again.

Here, we use a utility function for calling driverHandle.Stats call that retries
as one expects.

I aimed to preserve the logging formats but made small improvements as I saw fit.
2019-07-11 13:58:07 +08:00
Preetha Appan 7d645c5ad9
Test file for detect content type that satisfies linter and encoding 2019-07-10 11:42:04 -05:00
Preetha Appan ef9a71c68b
code review feedback 2019-07-10 10:41:06 -05:00
Preetha Appan 990e468edc
Populate task event struct with kill timeout
This makes for a nicer task event message
2019-07-09 09:37:09 -05:00
Preetha Appan 108a292cc0
fix linting failure in test case file 2019-07-08 11:29:12 -05:00
Michael Lange b2e9570075
Use consistent casing in the JSON representation of the AllocFileInfo struct 2019-07-02 17:27:31 -07:00
Preetha Appan 8495fb9055
Added additional test cases and fixed go test case 2019-07-02 13:25:29 -05:00
Mahmood Ali a97d451ac7
Merge pull request #5905 from hashicorp/b-ar-failed-prestart
Fail alloc if alloc runner prestart hooks fail
2019-07-02 20:25:53 +08:00
Danielle c6872cdf12
Merge pull request #5864 from hashicorp/dani/win-pipe-cleaner
windows: Fix restarts using the raw_exec driver
2019-07-02 13:58:56 +02:00
Danielle Lancashire e20300313f
fifo: Safer access to Conn 2019-07-02 13:12:54 +02:00
Mahmood Ali f10201c102 run post-run/post-stop task runner hooks
Handle when prestart failed while restoring a task, to prevent
accidentally leaking consul/logmon processes.
2019-07-02 18:38:32 +08:00
Mahmood Ali 4afd7835e3 Fail alloc if alloc runner prestart hooks fail
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.

This leads to terrible results, some of which are:
* Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in
  pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after
  alloc expected start time!

Here, we treat all tasks to have failed if alloc runner prestart hook fails.
This fixes the lockups, and permits the alloc to be rescheduled on another node.

While it's desirable to retry alloc runner in such failures, I opted to treat it
out of scope.  I'm afraid of some subtles about alloc and task runners and their
idempotency that's better handled in a follow up PR.

This might be one of the root causes for
https://github.com/hashicorp/nomad/issues/5840 .
2019-07-02 18:35:47 +08:00
Mahmood Ali 7614b8f09e
Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2
task runner to avoid running task if terminal
2019-07-02 15:31:17 +08:00
Mahmood Ali 7bfad051b9 address review comments 2019-07-02 14:53:50 +08:00
Mahmood Ali c0c00ecc07
Merge pull request #5906 from hashicorp/b-alloc-stale-updates
client: defensive against getting stale alloc updates
2019-07-02 12:40:17 +08:00
Preetha Appan c09342903b
Improve test cases for detecting content type 2019-07-01 16:24:48 -05:00
Danielle Lancashire 688f82f07d
fifo: Close connections and cleanup lock handling 2019-07-01 14:14:29 +02:00
Danielle Lancashire 2c7d1f1b99
logmon: Add windows compatibility test 2019-07-01 14:14:06 +02:00
Mahmood Ali c5f5a1fcb9 client: defensive against getting stale alloc updates
When fetching node alloc assignments, be defensive against a stale read before
killing local nodes allocs.

The bug is when both client and servers are restarting and the client requests
the node allocation for the node, it might get stale data as server hasn't
finished applying all the restored raft transaction to store.

Consequently, client would kill and destroy the alloc locally, just to fetch it
again moments later when server store is up to date.

The bug can be reproduced quite reliably with single node setup (configured with
persistence).  I suspect it's too edge-casey to occur in production cluster with
multiple servers, but we may need to examine leader failover scenarios more closely.

In this commit, we only remove and destroy allocs if the removal index is more
recent than the alloc index. This seems like a cheap resiliency fix we already
use for detecting alloc updates.

A more proper fix would be to ensure that a nomad server only serves
RPC calls when state store is fully restored or up to date in leadership
transition cases.
2019-06-29 04:17:35 -05:00
Preetha Appan 3345ce3ba4
Infer content type in alloc fs stat endpoint 2019-06-28 20:31:28 -05:00
Danielle Lancashire e1151f743b
appveyor: Run logmon tests 2019-06-28 16:01:41 +02:00
Danielle Lancashire 634ada671e
fifo: Require that fifos do not exist for create
Although this operation is safe on linux, it is not safe on Windows when
using the named pipe interface. To provide a ~reasonable common api
abstraction, here we switch to returning File exists errors on the unix
api.
2019-06-28 13:47:18 +02:00
Danielle Lancashire 0ff27cfc0f
vendor: Use dani fork of go-winio 2019-06-28 13:47:18 +02:00
Danielle Lancashire 514a2a6017
logmon: Refactor fifo access for windows safety
On unix platforms, it is safe to re-open fifo's for reading after the
first creation if the file is already a fifo, however this is not
possible on windows where this triggers a permissions error on the
socket path, as you cannot recreate it.

We can't transparently handle this in the CreateAndRead handle, because
the Access Is Denied error is too generic to reliably be an IO error.
Instead, we add an explict API for opening a reader to an existing FIFO,
and check to see if the fifo already exists inside the calling package
(e.g logmon)
2019-06-28 13:41:54 +02:00
Mahmood Ali 3d89ae0f1e task runner to avoid running task if terminal
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having task runner uses the
allocRunner.shouldRun() instead of only checking the server updated
alloc.

Here, we preserve much of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes https://github.com/hashicorp/nomad/issues/5883
2019-06-27 11:27:34 +08:00
Danielle Lancashire b9ac184e1f
tr: Fetch Wait channel before killTask in restart
Currently, if killTask results in the termination of a process before
calling WaitTask, Restart() will incorrectly return a TaskNotFound
error when using the raw_exec driver on Windows.
2019-06-26 15:20:57 +02:00
Mahmood Ali b209584dce
Merge pull request #5726 from hashicorp/b-plugins-via-init
Use init() to handle plugin invocation
2019-06-18 21:09:03 -04:00
Mahmood Ali ac64509c59 comment on use of init() for plugin handlers 2019-06-18 20:54:55 -04:00