Commit graph

4005 commits

Author SHA1 Message Date
Michael Schurter 8634533e82 client: expose group network ports in env vars
Fixes #6375

Intentionally omitted IPs prior to 0.10.0 release to minimize changes
and risk.
2019-10-21 12:31:13 -07:00
Michael Schurter bb82f365ff connect: upgrade to envoy 1.11.2 and add sha
Append the Docker image sha to the Envoy image to ensure users default
to using the version that Nomad was tested against.
2019-10-18 10:16:58 -07:00
Michael Schurter ee5ea3ecc7 connect: upgrade to envoy 1.11.2 and add sha
Append the Docker image sha to the Envoy image to ensure users default
to using the version that Nomad was tested against.
2019-10-18 07:46:53 -07:00
Mahmood Ali 4e4a9b252c
Merge pull request #6290 from hashicorp/r-generated-code-refactor
dev: avoid codecgen code in downstream projects
2019-10-15 08:22:31 -04:00
Michael Schurter 2992cb80b0 Remove 0.10.0-rc1 generated files 2019-10-10 13:31:42 -07:00
Nomad Release bot 3007f1662e Generate files for 0.10.0-rc1 release 2019-10-10 19:08:23 +00:00
Michael Schurter f54f1cb321
Revert "Revert "Use joint context to cancel prestart hooks"" 2019-10-08 11:34:09 -07:00
Michael Schurter 81a30ae106
Revert "Use joint context to cancel prestart hooks" 2019-10-08 11:27:08 -07:00
Mahmood Ali 4b2ba62e35 acl: check ACL against object namespace
Fix a bug where a millicious user can access or manipulate an alloc in a
namespace they don't have access to.  The allocation endpoints perform
ACL checks against the request namespace, not the allocation namespace,
and performs the allocation lookup independently from namespaces.

Here, we check that the requested can access the alloc namespace
regardless of the declared request namespace.

Ideally, we'd enforce that the declared request namespace matches
the actual allocation namespace.  Unfortunately, we haven't documented
alloc endpoints as namespaced functions; we suspect starting to enforce
this will be very disruptive and inappropriate for a nomad point
release.  As such, we maintain current behavior that doesn't require
passing the proper namespace in request.  A future major release may
start enforcing checking declared namespace.
2019-10-08 12:59:22 -04:00
Drew Bailey 69eebcd241
simplify logic to check for vault read event
defer shutdown to cleanup after failed run

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

update comment to include ctx note for shutdown
2019-09-30 11:02:14 -07:00
Drew Bailey 7565b8a8d9
Use joint context to cancel prestart hooks
fixes https://github.com/hashicorp/nomad/issues/6382

The prestart hook for templates blocks while it resolves vault secrets.
If the secret is not found it continues to retry. If a task is shutdown
during this time, the prestart hook currently does not receive
shutdownCtxCancel, causing it to hang.

This PR joins the two contexts so either killCtx or shutdownCtx cancel
and stop the task.
2019-09-30 10:48:01 -07:00
Mahmood Ali e29ee4c400 nomad: defensive check for namespaces in job registration call
In a job registration request, ensure that the request namespace "header" and job
namespace field match.  This should be the case already in prod, as http
handlers ensures that the values match [1].

This mitigates bugs that exploit bugs where we may check a value but act
on another, resulting into bypassing ACL system.

[1] https://github.com/hashicorp/nomad/blob/v0.9.5/command/agent/job_endpoint.go#L415-L418
2019-09-26 17:02:47 -04:00
Tim Gross a6aadb3714 connect: remove proxy socket for restarted client 2019-09-25 14:58:17 -04:00
Tim Gross e43d33aa50 client: don't run alloc postrun during shutdown 2019-09-25 14:58:17 -04:00
Tim Gross d965a15490 driver/networking: don't recreate existing network namespaces 2019-09-25 14:58:17 -04:00
Jasmine Dahilig 0780adfa7f
timeout after 5 seconds when client opens a data directory (#6348) 2019-09-24 16:28:21 -07:00
Danielle Lancashire c9bcb7e76a
client_stats: Always emit client stats 2019-09-19 01:22:08 +02:00
Danielle Lancashire 4f2343e1c0
client: Return empty values when host stats fail
Currently, there is an issue when running on Windows whereby under some
circumstances the Windows stats API's will begin to return errors (such
as internal timeouts) when a client is under high load, and potentially
other forms of resource contention / system states (and other unknown
cases).

When an error occurs during this collection, we then short circuit
further metrics emission from the client until the next interval.

This can be problematic if it happens for a sustained number of
intervals, as our metrics aggregator will begin to age out older
metrics, and we will eventually stop emitting various types of metrics
including `nomad.client.unallocated.*` metrics.

However, when metrics collection fails on Linux, gopsutil will in many cases
(e.g cpu.Times) silently return 0 values, rather than an error.

Here, we switch to returning empty metrics in these failures, and
logging the error at the source. This brings the behaviour into line
with Linux/Unix platforms, and although making aggregation a little
sadder on intermittent failures, will result in more desireable overall
behaviour of keeping metrics available for further investigation if
things look unusual.
2019-09-19 01:22:07 +02:00
Danielle f8b64ee1ab
Merge pull request #6330 from hashicorp/f-host-vols-fail-startup
client: Fail startup if host volumes do not exist
2019-09-17 10:55:30 -07:00
Danielle ec3ecdecfc
Merge pull request #6321 from hashicorp/dani/remove-config
Hoist Volume.Config.Source into Volume.Source
2019-09-16 10:12:58 -07:00
Danielle Lancashire e3796e9d60
client: Fail startup if host volumes do not exist
Some drivers will automatically create directories when trying to mount
a path into a container, but some will not.

To unify this behaviour, this commit requires that host volumes already exist,
and can be stat'd by the Nomad Agent during client startup.
2019-09-13 23:28:10 +02:00
Ruslan Usifov b3c72d1729 close file handle when FileRotator object will closed. Fixes https://github.com/hashicorp/nomad/issues/6309 (#6323) 2019-09-13 10:31:13 -04:00
Tim Gross a6ef8c5d42
client/networking: wrap error message from CNI plugin (#6316) 2019-09-13 08:20:05 -04:00
Danielle Lancashire 78b61de45f
config: Hoist volume.config.source into volume
Currently, using a Volume in a job uses the following configuration:

```
volume "alias-name" {
  type = "volume-type"
  read_only = true

  config {
    source = "host_volume_name"
  }
}
```

This commit migrates to the following:

```
volume "alias-name" {
  type = "volume-type"
  source = "host_volume_name"
  read_only = true
}
```

The original design was based due to being uncertain about the future of storage
plugins, and to allow maxium flexibility.

However, this causes a few issues, namely:
- We frequently need to parse this configuration during submission,
scheduling, and mounting
- It complicates the configuration from and end users perspective
- It complicates the ability to do validation

As we understand the problem space of CSI a little more, it has become
clear that we won't need the `source` to be in config, as it will be
used in the majority of cases:

- Host Volumes: Always need a source
- Preallocated CSI Volumes: Always needs a source from a volume or claim name
- Dynamic Persistent CSI Volumes*: Always needs a source to attach the volumes
                                   to for managing upgrades and to avoid dangling.
- Dynamic Ephemeral CSI Volumes*: Less thought out, but `source` will probably point
                                  to the plugin name, and a `config` block will
                                  allow you to pass meta to the plugin. Or will
                                  point to a pre-configured ephemeral config.
*If implemented

The new design simplifies this by merging the source into the volume
stanza to solve the above issues with usability, performance, and error
handling.
2019-09-13 04:37:59 +02:00
Mahmood Ali 78f62d3670
Merge pull request #6080 from lchayoun/bug-6079
Allow dash in environment variable names
2019-09-11 11:17:24 -07:00
Mahmood Ali b6bf7f9a6c
Merge pull request #6260 from hashicorp/c-circleci-tweak-20190903
ci: ignore nested pkgs in GOTEST_PKGS_EXCLUDE
2019-09-11 11:17:10 -07:00
Tim Gross b05cd4c430
test: expand symlink for temp dir for macOS compatibility (#6303)
On macOS, `os.TempDir` returns a symlinked path under `/var` which is
outside of the directories shared into the VM used for Docker, and
that fails tests using Docker that need that mount. If we expand the
symlink to get the real path in `/private`, we're in the shared
folders and can safely mount them.
2019-09-10 12:20:09 -04:00
Mahmood Ali 4b8280e51d remove generated code 2019-09-06 19:24:15 +00:00
Nomad Release bot dc7d728a82 Generate files for 0.10.0-beta1 release 2019-09-06 18:47:09 +00:00
Tim Gross 3fa4bca4a0
script checks: Update needs to update Alloc as well (#6291) 2019-09-06 11:18:00 -04:00
Mahmood Ali 01f42053e4 dev: avoid codecgen code in downstream projects
This is an attempt to ease dependency management for external driver
plugins, by avoiding requiring them to compile ugorji/go generated
files.  Plugin developers reported some pain with the brittleness of
ugorji/go dependency in particular, specially when using go mod, the
default go mod manager in golang 1.13.

Context
--------

Nomad uses msgpack to persist and serialize internal structs, using
ugorji/go library.  As an optimization, we use ugorji/go code generation
to speedup process and aovid the relection-based slow path.

We commit these generated files in repository when we cut and tag the
release to ease reproducability and debugging old releases.  Thus,
downstream projects that depend on release tag, indirectly depends on
ugorji/go generated code.

Sadly, the generated code is brittle and specific to the version of
ugorji/go being used.  When go mod picks another version of ugorji/go
then nomad (go mod by default uses release according to semver),
downstream projects face compilation errors.

Interestingly, downstream projects don't commonly serialize nomad
internal structs.  Drivers and device plugins use grpc instead of
msgpack for the most part.  In the few cases where they use msgpag (e.g.
decoding task config), they do without codegen path as they run on
driver specific structs not the nomad internal structs.  Also, the
ugorji/go serialization through reflection is generally backward
compatible (mod some ugorji/go regression bugs that get introduced every
now and then :( ).

Proposal
---------

The proposal here is to keep committing ugorji/go codec generated files
for releases but to use a go tag for them.

All nomad development through the makefile, including releasing, CI and
dev flow, has the tag enabled.

Downstream plugin projects, by default, will skip these files and life
proceed as normal for them.

The downside is that nomad developers who use generated code but avoid
using make must start passing additional go tag argument.  Though this
is not a blessed configuration.
2019-09-06 09:22:00 -04:00
Tim Gross 8ce201854a
client: recreate script checks on Update (#6265)
Splitting the immutable and mutable components of the scriptCheck led
to a bug where the environment interpolation wasn't being incorporated
into the check's ID, which caused the UpdateTTL to update for a check
ID that Consul didn't have (because our Consul client creates the ID
from the structs.ServiceCheck each time we update).

Task group services don't have access to a task environment at
creation, so their checks get registered before the check can be
interpolated. Use the original check ID so they can be updated.
2019-09-05 11:43:23 -04:00
Michael Schurter ee06c36345
Merge pull request #6254 from hashicorp/test-connect-e2e-demo
e2e: test demo job for connect
2019-09-04 14:33:26 -07:00
Nick Ethier e440ba80f1
ar: refactor network bridge config to use go-cni lib (#6255)
* ar: refactor network bridge config to use go-cni lib

* ar: use eth as the iface prefix for bridged network namespaces

* vendor: update containerd/go-cni package

* ar: update network hook to use TODO contexts when calling configurator

* unnecessary conversion
2019-09-04 16:33:25 -04:00
Michael Schurter 93b47f4ddc client: reword error message 2019-09-04 12:40:09 -07:00
Mahmood Ali b76de72943
Merge pull request #6251 from hashicorp/b-port-map-regression
NOMAD_PORT_<label> regression
2019-09-04 11:54:09 -04:00
Danielle 2564a1dabd
Merge pull request #6239 from hashicorp/b-32bitmem
Fix memory fingerprinting on 32bit
2019-09-04 17:39:07 +02:00
Danielle aa5605fce1
fingerprint: Add backwards compatibility comment
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-09-04 17:38:35 +02:00
Mahmood Ali 87f0457973 fix qemu and update docker with tests 2019-09-04 11:27:51 -04:00
Jasmine Dahilig 5b6e39b37c fix portmap envvars in docker driver 2019-09-04 11:26:13 -04:00
Danielle Lancashire 67715d846e
fingerprint: Restore support for legacy memory fingerprint 2019-09-04 17:10:28 +02:00
Tim Gross 0f29dcc935
support script checks for task group services (#6197)
In Nomad prior to Consul Connect, all Consul checks work the same
except for Script checks. Because the Task being checked is running in
its own container namespaces, the check is executed by Nomad in the
Task's context. If the Script check passes, Nomad uses the TTL check
feature of Consul to update the check status. This means in order to
run a Script check, we need to know what Task to execute it in.

To support Consul Connect, we need Group Services, and these need to
be registered in Consul along with their checks. We could push the
Service down into the Task, but this doesn't work if someone wants to
associate a service with a task's ports, but do script checks in
another task in the allocation.

Because Nomad is handling the Script check and not Consul anyways,
this moves the script check handling into the task runner so that the
task runner can own the script check's configuration and
lifecycle. This will allow us to pass the group service check
configuration down into a task without associating the service itself
with the task.

When tasks are checked for script checks, we walk back through their
task group to see if there are script checks associated with the
task. If so, we'll spin off script check tasklets for them. The
group-level service and any restart behaviors it needs are entirely
encapsulated within the group service hook.
2019-09-03 15:09:04 -04:00
Pete Woods 49b7d23cea Add node "status" and "scheduling eligibility" tags to client metrics (#6130)
When summing up the capability of your Nomad fleet for scaling purposes, it's important to exclude draining nodes, as they won't accept new jobs.
2019-09-03 12:11:11 -04:00
Michael Schurter 5957030d18
connect: add unix socket to proxy grpc for envoy (#6232)
* connect: add unix socket to proxy grpc for envoy

Fixes #6124

Implement a L4 proxy from a unix socket inside a network namespace to
Consul's gRPC endpoint on the host. This allows Envoy to connect to
Consul's xDS configuration API.

* connect: pointer receiver on structs with mutexes

* connect: warn on all proxy errors
2019-09-03 08:43:38 -07:00
Mahmood Ali 0f401e6b8b lint: ignore generated windows syscall wrappers 2019-09-03 10:59:58 -04:00
Jasmine Dahilig 4edebe389a
add default update stanza and max_parallel=0 disables deployments (#6191) 2019-09-02 10:30:09 -07:00
Evan Ercolano fcf66918d0 Remove unused canary param from MakeTaskServiceID 2019-08-31 16:53:23 -04:00
Danielle Lancashire 4fcb7394e9
client: Fix memory fingerprinting on 32bit
Also introduce regression ci for 32 bit fingerprinting
2019-08-31 18:33:59 +02:00
Mahmood Ali 0bd2eee87f
Merge pull request #6216 from hashicorp/b-recognize-pending-allocs
alloc_runner: wait when starting suspicious allocs
2019-08-28 14:46:09 -04:00
Mahmood Ali e0da3c5d0e rename to hasLocalState, and ignore clientstate
The ClientState being pending isn't a good criteria; as an alloc may
have been updated in-place before it was completed.

Also, updated the logic so we only check for task states.  If an alloc
has deployment state but no persisted tasks at all, restore will still
fail.
2019-08-28 11:44:48 -04:00
Nick Ethier cf014c7fd5
ar: ensure network forwarding is allowed for bridged allocs (#6196)
* ar: ensure network forwarding is allowed in iptables for bridged allocs

* ensure filter rule exists at setup time
2019-08-28 10:51:34 -04:00
Nick Ethier cbb27e74bc
Add environment variables for connect upstreams (#6171)
* taskenv: add connect upstream env vars + test

* set taskenv upstreams instead of appending

* Update client/taskenv/env.go

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-08-27 23:41:38 -04:00
Mahmood Ali 90c5eefbab Alternative approach: avoid restoring
This uses an alternative approach where we avoid restoring the alloc
runner in the first place, if we suspect that the alloc may have been
completed already.
2019-08-27 17:30:55 -04:00
Jasmine Dahilig 4078393bb6
expose nomad namespace as environment variable in allocation #5692 (#6192) 2019-08-27 08:38:07 -07:00
Mahmood Ali 647c1457cb alloc_runner: wait when starting suspicious allocs
This commit aims to help users running with clients suseptible to the
destroyed alloc being restrarted bug upgrade to latest.  Without this,
such users will have their tasks run unexpectedly on upgrade and only
see the bug resolved after subsequent restart.

If, on restore, the client sees a pending alloc without any other
persisted info, then err on the side that it's an corrupt persisted
state of an alloc instead of the client happening to be killed right
when alloc is assigned to client.

Few reasons motivate this behavior:

Statistically speaking, corruption being the cause is more likely.  A
long running client will have higher chance of having allocs persisted
incorrectly with pending state.  Being killed right when an alloc is
about to start is relatively unlikely.

Also, delaying starting an alloc that hasn't started (by hopefully
seconds) is not as severe as launching too many allocs that may bring
client down.

More importantly, this helps customers upgrade their clients without
risking taking their clients down and destablizing their cluster. We
don't want existing users to force triggering the bug while they upgrade
and restart cluster.
2019-08-26 22:05:31 -04:00
Mahmood Ali dfdf0edd3b
Merge pull request #6207 from hashicorp/b-gc-destroyed-allocs-rerun
Don't persist allocs of destroyed alloc runners
2019-08-26 17:26:18 -04:00
Mahmood Ali cc460d4804 Write to client store while holding lock
Protect against a race where destroying and persist state goroutines
race.

The downside is that the database io operation will run while holding
the lock and may run indefinitely.  The risk of lock being long held is
slow destruction, but slow io has bigger problems.
2019-08-26 13:45:58 -04:00
Mahmood Ali 1851820f20 logmon: log stat error to help debugging 2019-08-26 10:10:20 -04:00
Mahmood Ali c132623ffc Don't persist allocs of destroyed alloc runners
This fixes a bug where allocs that have been GCed get re-run again after client
is restarted.  A heavily-used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in
client alloc runner set.  Periodically, they get persisted until alloc is
gced by server.  During that  time, the client db will contain the alloc
but not its individual tasks status nor completed state.  On client restart,
client assumes that alloc is pending state and re-runs it.

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

This is a short-term fix, as we should consider revamping client state
management.  Storing alloc and task information in non-transaction non-atomic
concurrently while alloc runner is running and potentially changing state is a
recipe for bugs.

Fixes https://github.com/hashicorp/nomad/issues/5984
Related to https://github.com/hashicorp/nomad/pull/5890
2019-08-25 11:21:28 -04:00
Mahmood Ali 6301725002 logmon: revert workaround for Windows go1.11 bug
Revert e0126123ab1ba848f72458538bc6118c978245e6 now that we are running
with Golang 1.12, and https://github.com/golang/go/issues/29119 is no
longer relevant.
2019-08-24 08:19:44 -04:00
Mahmood Ali df1f3eb9ee
Merge pull request #6201 from hashicorp/b-device-stats-interval
initialize device manager stats interval
2019-08-24 08:16:03 -04:00
Mahmood Ali 07b5f4c530
Merge pull request #6146 from hashicorp/b-config-template-copy
clientConfig.Copy() to copy template config too
2019-08-23 19:00:57 -04:00
Mahmood Ali b98568774b clientConfig.Copy() to copy template config too 2019-08-23 18:43:22 -04:00
Lang Martin 4f6493a301 taskrunner getter set Umask for go-getter, setuid test 2019-08-23 15:59:03 -04:00
Mahmood Ali 3890619100 initialize device manager stats interval
Fixes a bug where we cpu is pigged at 100% due to collecting devices
statistics.  The passed stats interval was ignored, and the default zero
value causes a very tight loop of stats collection.

FWIW, in my testing, it took 2.5-3ms to collect nvidia GPU stats, on a
`g2.2xlarge` ec2 instance.

The stats interval defaults to 1 second and is user configurable.  I
believe this is too frequent as a default, and I may advocate for
reducing it to a value closer to 5s or 10s, but keeping it as is for
now.

Fixes https://github.com/hashicorp/nomad/issues/6057 .
2019-08-23 14:58:34 -04:00
Jerome Gravel-Niquet cbdc1978bf Consul service meta (#6193)
* adds meta object to service in job spec, sends it to consul

* adds tests for service meta

* fix tests

* adds docs

* better hashing for service meta, use helper for copying meta when registering service

* tried to be DRY, but looks like it would be more work to use the
helper function
2019-08-23 12:49:02 -04:00
Nick Ethier 96d379071d
ar: fix bridge networking port mapping when port.To is unset (#6190) 2019-08-22 21:53:52 -04:00
Michael Schurter 59e0b67c7f connect: task hook for bootstrapping envoy sidecar
Fixes #6041

Unlike all other Consul operations, boostrapping requires Consul be
available. This PR tries Consul 3 times with a backoff to account for
the group services being asynchronously registered with Consul.
2019-08-22 08:15:32 -07:00
Michael Schurter b008fd1724 connect: register group services with Consul
Fixes #6042

Add new task group service hook for registering group services like
Connect-enabled services.

Does not yet support checks.
2019-08-20 12:25:10 -07:00
lchayoun 2307c9d1d2 allow dash in non generated environment variable names - should only clean generate environment variables 2019-08-16 11:11:47 +03:00
Nick Ethier 965f00b2fc
Builtin Admission Controller Framework (#6116)
* nomad: add admission controller framework

* nomad: add admission controller framework and Consul Connect hooks

* run admission controllers before checking permissions

* client: add default node meta for connect configurables

* nomad: remove validateJob func since it has been moved to admission controller

* nomad: use new TaskKind type

* client: use consts for connect sidecar image and log level

* Apply suggestions from code review

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>

* nomad: add job register test with connect sidecar

* Update nomad/job_endpoint_hooks.go

Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2019-08-15 11:22:37 -04:00
lchayoun c5a38a045a allow dash in non generated environment variable names - should only clean generate environment variables 2019-08-13 19:23:13 +03:00
Tim Gross 03433f35d4 client/template: configuration for function blacklist and sandboxing
When rendering a task template, the `plugin` function is no longer
permitted by default and will raise an error. An operator can opt-in
to permitting this function with the new `template.function_blacklist`
field in the client configuration.

When rendering a task template, path parameters for the `file`
function will be treated as relative to the task directory by
default. Relative paths or symlinks that point outside the task
directory will raise an error. An operator can opt-out of this
protection with the new `template.disable_file_sandbox` field in the
client configuration.
2019-08-12 16:34:48 -04:00
Danielle Lancashire 7e6c8e5ac1
Copy documentation to api/tasks 2019-08-12 16:22:27 +02:00
Danielle Lancashire 861caa9564
HostVolumeConfig: Source -> Path 2019-08-12 15:39:08 +02:00
Danielle Lancashire e132a30899
structs: Unify Volume and VolumeRequest 2019-08-12 15:39:08 +02:00
Danielle Lancashire 6ef8d5233e
client: Add volume_hook for mounting volumes 2019-08-12 15:39:08 +02:00
Danielle Lancashire 063e4240c1
client: Add parsing and registration of HostVolume configuration 2019-08-12 15:39:08 +02:00
lchayoun ca892163b2 allow dash in non generated environment variable names 2019-08-11 12:51:42 +03:00
Nick Ethier 7806f4c597
Revert "client: add autofetch for CNI plugins"
This reverts commit 0bd157cc3b04fb090dd0d54affcae71496102ce8.
2019-08-08 15:10:19 -04:00
Nick Ethier 7d28ece8de
Revert "client: remove debugging lines"
This reverts commit 54ce4d1f7ef4913cb12c03dbc98bcd903f7787c9.
2019-08-08 14:52:52 -04:00
Liel Chayoun 24dcb2379c
Update env_test.go 2019-08-06 11:59:31 +03:00
Mahmood Ali b17bac5101 Render consul templates using task env only (#6055)
When rendering a task consul template, ensure that only task environment
variables are used.

Currently, `consul-template` always falls back to host process
environment variables when key isn't a task env var[1].  Thus, we add
an empty entry for each host process env-var not found in task env-vars.

[1] bfa5d0e133/template/funcs.go (L61-L75)
2019-08-05 16:30:47 -04:00
Mahmood Ali f66169cd6a
Merge pull request #6065 from hashicorp/b-nil-driver-exec
Check if driver handle is nil before execing
2019-08-02 09:48:28 -05:00
Mahmood Ali a4670db9b7 Check if driver handle is nil before execing
Defend against tr.getDriverHandle being nil.  Exec handler checks if
task is running, but it may be stopped between check and driver handler
fetching.
2019-08-02 10:07:41 +08:00
Nick Ethier 7de0bec8ab
client/cni: updated comments and simplified logic to auto download plugins 2019-07-31 01:04:10 -04:00
Nick Ethier b16640c50d
Apply suggestions from code review
Co-Authored-By: Mahmood Ali <mahmood@hashicorp.com>
2019-07-31 01:04:10 -04:00
Nick Ethier 321d10a041
client: remove debugging lines 2019-07-31 01:04:09 -04:00
Nick Ethier af6b191963
client: add autofetch for CNI plugins 2019-07-31 01:04:09 -04:00
Nick Ethier 1e9dd1b193
remove unused file 2019-07-31 01:04:09 -04:00
Nick Ethier 09a4cfd8d7
fix failing tests 2019-07-31 01:04:07 -04:00
Nick Ethier ef83f0831b
ar: plumb client config for networking into the network hook 2019-07-31 01:04:06 -04:00
Nick Ethier af66a35924
networking: Add new bridge networking mode implementation 2019-07-31 01:04:06 -04:00
Michael Schurter fb487358fb
connect: add group.service stanza support 2019-07-31 01:04:05 -04:00
Nick Ethier 63c5504d56
ar: fix lint errors 2019-07-31 01:03:19 -04:00
Nick Ethier e312201d18
ar: rearrange network hook to support building on windows 2019-07-31 01:03:19 -04:00
Nick Ethier 370533c9c7
ar: fix test that failed due to error renaming 2019-07-31 01:03:19 -04:00
Nick Ethier 2d60ef64d9
plugins/driver: make DriverNetworkManager interface optional 2019-07-31 01:03:19 -04:00
Nick Ethier f87e7e9c9a
ar: plumb error handling into alloc runner hook initialization 2019-07-31 01:03:18 -04:00
Nick Ethier ef1795b344
ar: add tests for network hook 2019-07-31 01:03:18 -04:00
Nick Ethier 15989bba8e
ar: cleanup lint errors 2019-07-31 01:03:18 -04:00
Nick Ethier 220cba3e7e
ar: move linux specific code to it's own file and add tests 2019-07-31 01:03:18 -04:00
Nick Ethier 548f78ef15
ar: initial driver based network management 2019-07-31 01:03:17 -04:00
Nick Ethier 66c514a388
Add network lifecycle management
Adds a new Prerun and Postrun hooks to manage set up of network namespaces
on linux. Work still needs to be done to make the code platform agnostic and
support Docker style network initalization.
2019-07-31 01:03:17 -04:00
Preetha Appan d048029b5a
remove generated code and change version to 0.10.0 2019-07-30 15:56:05 -05:00
Nomad Release bot e39fb11531 Generate files for 0.9.4 release 2019-07-30 19:05:18 +00:00
Preetha Appan 6b4c40f5a8
remove generated code 2019-07-23 12:07:49 -05:00
Nomad Release bot 04187c8b86 Generate files for 0.9.4-rc1 release 2019-07-22 21:42:36 +00:00
Michael Schurter d90680021e logmon: fix comment formattinglogmon: fix comment formattinglogmon: fix
comment formattinglogmon: fix comment formattinglogmon: fix comment
formatting
2019-07-22 13:05:01 -07:00
Michael Schurter e37bc3513c logmon: ensure errors are still handled properly
...and add a comment to switch back to the old error handling once we
switch to Go 1.12.
2019-07-22 12:49:48 -07:00
Danielle Lancashire 1bcbbbfbe6
logmon: Workaround golang/go#29119
There's a bug in go1.11 that causes some io operations on windows to
return incorrect errors for some cases when Stat-ing files. To avoid
upgrading to go1.12 in a point release, here we loosen up the cases
where we will attempt to create fifos, and add some logging of
underlying stat errors to help with debugging.
2019-07-22 18:28:12 +02:00
Jasmine Dahilig 2157f6ddf1
add formatting for hcl parsing error messages (#5972) 2019-07-19 10:04:39 -07:00
Mahmood Ali cd6f1d3102 Update consul-template dependency to latest
To pick up the fix in
https://github.com/hashicorp/consul-template/pull/1231 .
2019-07-18 07:32:03 +07:00
Mahmood Ali 8a82260319 log unrecoverable errors 2019-07-17 11:01:59 +07:00
Mahmood Ali 1a299c7b28 client/taskrunner: fix stats stats retry logic
Previously, if a channel is closed, we retry the Stats call.  But, if that call
fails, we go in a backoff loop without calling Stats ever again.

Here, we use a utility function for calling driverHandle.Stats call that retries
as one expects.

I aimed to preserve the logging formats but made small improvements as I saw fit.
2019-07-11 13:58:07 +08:00
Preetha Appan 7d645c5ad9
Test file for detect content type that satisfies linter and encoding 2019-07-10 11:42:04 -05:00
Preetha Appan ef9a71c68b
code review feedback 2019-07-10 10:41:06 -05:00
Preetha Appan 990e468edc
Populate task event struct with kill timeout
This makes for a nicer task event message
2019-07-09 09:37:09 -05:00
Preetha Appan 108a292cc0
fix linting failure in test case file 2019-07-08 11:29:12 -05:00
Michael Lange b2e9570075
Use consistent casing in the JSON representation of the AllocFileInfo struct 2019-07-02 17:27:31 -07:00
Preetha Appan 8495fb9055
Added additional test cases and fixed go test case 2019-07-02 13:25:29 -05:00
Mahmood Ali a97d451ac7
Merge pull request #5905 from hashicorp/b-ar-failed-prestart
Fail alloc if alloc runner prestart hooks fail
2019-07-02 20:25:53 +08:00
Danielle c6872cdf12
Merge pull request #5864 from hashicorp/dani/win-pipe-cleaner
windows: Fix restarts using the raw_exec driver
2019-07-02 13:58:56 +02:00
Danielle Lancashire e20300313f
fifo: Safer access to Conn 2019-07-02 13:12:54 +02:00
Mahmood Ali f10201c102 run post-run/post-stop task runner hooks
Handle when prestart failed while restoring a task, to prevent
accidentally leaking consul/logmon processes.
2019-07-02 18:38:32 +08:00
Mahmood Ali 4afd7835e3 Fail alloc if alloc runner prestart hooks fail
When an alloc runner prestart hook fails, the task runners aren't invoked
and they remain in a pending state.

This leads to terrible results, some of which are:
* Lockup in GC process as reported in https://github.com/hashicorp/nomad/pull/5861
* Lockup in shutdown process as TR.Shutdown() waits for WaitCh to be closed
* Alloc not being restarted/rescheduled to another node (as it's still in
  pending state)
* Unexpected restart of alloc on a client restart, potentially days/weeks after
  alloc expected start time!

Here, we treat all tasks to have failed if alloc runner prestart hook fails.
This fixes the lockups, and permits the alloc to be rescheduled on another node.

While it's desirable to retry alloc runner in such failures, I opted to treat it
out of scope.  I'm afraid of some subtles about alloc and task runners and their
idempotency that's better handled in a follow up PR.

This might be one of the root causes for
https://github.com/hashicorp/nomad/issues/5840 .
2019-07-02 18:35:47 +08:00
Mahmood Ali 7614b8f09e
Merge pull request #5890 from hashicorp/b-dont-start-completed-allocs-2
task runner to avoid running task if terminal
2019-07-02 15:31:17 +08:00
Mahmood Ali 7bfad051b9 address review comments 2019-07-02 14:53:50 +08:00
Mahmood Ali c0c00ecc07
Merge pull request #5906 from hashicorp/b-alloc-stale-updates
client: defensive against getting stale alloc updates
2019-07-02 12:40:17 +08:00
Preetha Appan c09342903b
Improve test cases for detecting content type 2019-07-01 16:24:48 -05:00
Danielle Lancashire 688f82f07d
fifo: Close connections and cleanup lock handling 2019-07-01 14:14:29 +02:00
Danielle Lancashire 2c7d1f1b99
logmon: Add windows compatibility test 2019-07-01 14:14:06 +02:00
Mahmood Ali c5f5a1fcb9 client: defensive against getting stale alloc updates
When fetching node alloc assignments, be defensive against a stale read before
killing local nodes allocs.

The bug is when both client and servers are restarting and the client requests
the node allocation for the node, it might get stale data as server hasn't
finished applying all the restored raft transaction to store.

Consequently, client would kill and destroy the alloc locally, just to fetch it
again moments later when server store is up to date.

The bug can be reproduced quite reliably with single node setup (configured with
persistence).  I suspect it's too edge-casey to occur in production cluster with
multiple servers, but we may need to examine leader failover scenarios more closely.

In this commit, we only remove and destroy allocs if the removal index is more
recent than the alloc index. This seems like a cheap resiliency fix we already
use for detecting alloc updates.

A more proper fix would be to ensure that a nomad server only serves
RPC calls when state store is fully restored or up to date in leadership
transition cases.
2019-06-29 04:17:35 -05:00
Preetha Appan 3345ce3ba4
Infer content type in alloc fs stat endpoint 2019-06-28 20:31:28 -05:00
Danielle Lancashire e1151f743b
appveyor: Run logmon tests 2019-06-28 16:01:41 +02:00
Danielle Lancashire 634ada671e
fifo: Require that fifos do not exist for create
Although this operation is safe on linux, it is not safe on Windows when
using the named pipe interface. To provide a ~reasonable common api
abstraction, here we switch to returning File exists errors on the unix
api.
2019-06-28 13:47:18 +02:00
Danielle Lancashire 0ff27cfc0f
vendor: Use dani fork of go-winio 2019-06-28 13:47:18 +02:00
Danielle Lancashire 514a2a6017
logmon: Refactor fifo access for windows safety
On unix platforms, it is safe to re-open fifo's for reading after the
first creation if the file is already a fifo, however this is not
possible on windows where this triggers a permissions error on the
socket path, as you cannot recreate it.

We can't transparently handle this in the CreateAndRead handle, because
the Access Is Denied error is too generic to reliably be an IO error.
Instead, we add an explict API for opening a reader to an existing FIFO,
and check to see if the fifo already exists inside the calling package
(e.g logmon)
2019-06-28 13:41:54 +02:00
Mahmood Ali 3d89ae0f1e task runner to avoid running task if terminal
This change fixes a bug where nomad would avoid running alloc tasks if
the alloc is client terminal but the server copy on the client isn't
marked as running.

Here, we fix the case by having task runner uses the
allocRunner.shouldRun() instead of only checking the server updated
alloc.

Here, we preserve much of the invariants such that `tr.Run()` is always
run, and don't change the overall alloc runner and task runner
lifecycles.

Fixes https://github.com/hashicorp/nomad/issues/5883
2019-06-27 11:27:34 +08:00
Danielle Lancashire b9ac184e1f
tr: Fetch Wait channel before killTask in restart
Currently, if killTask results in the termination of a process before
calling WaitTask, Restart() will incorrectly return a TaskNotFound
error when using the raw_exec driver on Windows.
2019-06-26 15:20:57 +02:00
Mahmood Ali b209584dce
Merge pull request #5726 from hashicorp/b-plugins-via-init
Use init() to handle plugin invocation
2019-06-18 21:09:03 -04:00
Mahmood Ali ac64509c59 comment on use of init() for plugin handlers 2019-06-18 20:54:55 -04:00
Chris Baker f71114f5b8 cleanup test 2019-06-18 14:15:25 +00:00
Chris Baker a2dc351fd0 formatting and clarity 2019-06-18 14:00:57 +00:00
Chris Baker e0170e1c67 metrics: add namespace label to allocation metrics 2019-06-17 20:50:26 +00:00
Mahmood Ali 962921f86c Use init to handle plugin invocation
Currently, nomad "plugin" processes (e.g. executor, logmon, docker_logger) are started as CLI
commands to be handled by command CLI framework.  Plugin launchers use
`discover.NomadBinary()` to identify the binary and start it.

This has few downsides: The trivial one is that when running tests, one
must re-compile the nomad binary as the tests need to invoke the nomad
executable to start plugin.  This is frequently overlooked, resulting in
puzzlement.

The more significant issue with `executor` in particular is in relation
to external driver:

* Plugin must identify the path of invoking nomad binary, which is not
trivial; `discvoer.NomadBinary()` now returns the path to the plugin
rather than to nomad, preventing external drivers from launching
executors.

* The external driver may get a different version of executor than it
expects (specially if we make a binary incompatible change in future).

This commit addresses both downside by having the plugin invocation
handling through an `init()` call, similar to how libcontainer init
handler is done in [1] and recommened by libcontainer [2].  `init()`
will be invoked and handled properly in tests and external drivers.

For external drivers, this change will cause external drivers to launch
the executor that's compiled against.

There a are a couple of downsides to this approach:
* These specific packages (i.e executor, logmon, and dockerlog) need to
be careful in use of `init()`, package initializers.  Must avoid having
command execution rely on any other init in the package.  I prefixed
files with `z_` (golang processes files in lexical order), but ensured
we don't depend on order.
* The command handling is spread in multiple packages making it a bit
less obvious how plugin starts are handled.

[1] drivers/shared/executor/libcontainer_nsenter_linux.go
[2] eb4aeed24f/libcontainer (using-libcontainer)
2019-06-13 16:48:01 -04:00
Jasmine Dahilig ed9740db10
Merge pull request #5664 from hashicorp/f-http-hcl-region
backfill region from hcl for jobUpdate and jobPlan
2019-06-13 12:25:01 -07:00
Jasmine Dahilig 51e141be7a backfill region from job hcl in jobUpdate and jobPlan endpoints
- updated region in job metadata that gets persisted to nomad datastore
- fixed many unrelated unit tests that used an invalid region value
(they previously passed because hcl wasn't getting picked up and
the job would default to global region)
2019-06-13 08:03:16 -07:00
Mahmood Ali e31159bf1f Prepare for 0.9.4 dev cycle 2019-06-12 18:47:50 +00:00
Nomad Release bot 4803215109 Generate files for 0.9.3 release 2019-06-12 16:11:16 +00:00