Commit graph

377 commits

Author SHA1 Message Date
Mahmood Ali 067fd86a8c
drivers: Capture exit code when task is killed (#10494)
This commit ensures Nomad captures the task code more reliably even when the task is killed. This issue affect to `raw_exec` driver, as noted in https://github.com/hashicorp/nomad/issues/10430 .

We fix this issue by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask` which should return the task exit code on completion. However, it also could return a "context canceled" error if the agent/executor is shutdown.

Previously, when a task is to be stopped, the killTask path makes two WaitTask calls, and the second returns "context canceled" occasionally because of a "race" in task shutting down and depending on driver, and how fast it shuts down after task completes.

By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shutdown or the contexts expired.

I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring 3rd party drivers to update.

Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may retry to kill an already-dead task for up to 5 minutes before giving up.
2021-05-04 10:54:00 -04:00
Michael Schurter 547a718ef6
Merge pull request #10248 from hashicorp/f-remotetask-2021
core: propagate remote task handles
2021-04-30 08:57:26 -07:00
Michael Schurter e62795798d core: propagate remote task handles
Add a new driver capability: RemoteTasks.

When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.

This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.

See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
2021-04-27 15:07:03 -07:00
Seth Hoenig 238ac718f2 connect: use exp backoff when waiting on consul envoy bootstrap
This PR wraps the use of the consul envoy bootstrap command in
an expoenential backoff closure, configured to timeout after 60
seconds. This is an increase over the current behavior of making
3 attempts over 6 seconds.

Should help with #10451
2021-04-27 09:21:50 -06:00
Seth Hoenig f258fc8270
Merge pull request #10401 from hashicorp/cp-cns-ent-test-fixes
cherry-pick fixes from cns ent tests
2021-04-20 08:45:15 -06:00
Seth Hoenig 6e1c71446d client: always set script checks hook
Similar to a bugfix made for the services hook, we need to always
set the script checks hook, in case a task is initially launched
without script checks, but then updated to include script checks.

The scipt checks hook is the thing that handles that new registration.
2021-04-19 15:37:42 -06:00
Seth Hoenig 509490e5d2 e2e: consul namespace tests from nomad ent
(cherry-picked from ent without _ent things)

This is part 2/4 of e2e tests for Consul Namespaces. Took a
first pass at what the parameterized tests can look like, but
only on the ENT side for this PR. Will continue to refactor
in the next PRs.

Also fixes 2 bugs:
 - Config Entries registered by Nomad Server on job registration
   were not getting Namespace set
 - Group level script checks were not getting Namespace set

Those changes will need to be copied back to Nomad OSS.

Nomad OSS + no ACLs (previously, needs refactor)
Nomad ENT + no ACLs (this)
Nomad OSS + ACLs (todo)
Nomad ENT + ALCs (todo)
2021-04-19 15:35:31 -06:00
Nick Ethier 110f982eb3 plugins/drivers: fix deprecated fields 2021-04-16 14:13:29 -04:00
Nick Ethier 1e09ca5cd7 tr: set cpuset cpus if reserved 2021-04-15 13:31:51 -04:00
Nick Ethier 155a2ca5fb client/ar: thread through cpuset manager 2021-04-13 13:28:36 -04:00
Mahmood Ali 2fd9eafc28
only publish measured metrics (#10376) 2021-04-13 11:39:33 -04:00
Seth Hoenig f17ba33f61 consul: plubming for specifying consul namespace in job/group
This PR adds the common OSS changes for adding support for Consul Namespaces,
which is going to be a Nomad Enterprise feature. There is no new functionality
provided by this changeset and hopefully no new bugs.
2021-04-05 10:03:19 -06:00
Mahmood Ali 95d85b9cac oversubscription: set the linux memory limit
Use the MemoryMaxMB as the LinuxResources limit. This is intended to ease
drivers implementation and adoption of the features: drivers that use
`resources.LinuxResources.MemoryLimitBytes` don't need to be updated.

Drivers that use NomadResources will need to updated to track the new
field value. Given that tasks aren't guaranteed to use up the excess
memory limit, this is a reasonable compromise.
2021-03-30 16:55:58 -04:00
Tim Gross f820021f9e deps: bump gopsutil to v3.21.2 2021-03-30 16:02:51 -04:00
Florian Apolloner b9b71e7ac5 Automatically populate CONSUL_HTTP_ADDR for connect native tasks in host networking mode. Fixes #10239 2021-03-28 14:34:31 +02:00
Seth Hoenig 02919a7e89
Merge pull request #10103 from AndrewChubatiuk/service-portlabel-interpolation-fix
fixed service interpolation for sidecar tasks
2021-03-17 10:40:48 -05:00
Adrian Todorov 47e1cb11df
driver/docker: add extra labels ( job name, task and task group name) 2021-03-08 08:59:52 -05:00
AndrewChubatiuk 6a4f3c6c8a fixed service interpolation for sidecar tasks 2021-03-01 10:39:14 +02:00
AndrewChubatiuk 3d0aa2ef56 allocate sidecar task port on host_network interface 2021-02-13 02:42:13 +02:00
Seth Hoenig 8b05efcf88 consul/connect: Add support for Connect terminating gateways
This PR implements Nomad built-in support for running Consul Connect
terminating gateways. Such a gateway can be used by services running
inside the service mesh to access "legacy" services running outside
the service mesh while still making use of Consul's service identity
based networking and ACL policies.

https://www.consul.io/docs/connect/gateways/terminating-gateway

These gateways are declared as part of a task group level service
definition within the connect stanza.

service {
  connect {
    gateway {
      proxy {
        // envoy proxy configuration
      }
      terminating {
        // terminating-gateway configuration entry
      }
    }
  }
}

Currently Envoy is the only supported gateway implementation in
Consul. The gateay task can be customized by configuring the
connect.sidecar_task block.

When the gateway.terminating field is set, Nomad will write/update
the Configuration Entry into Consul on job submission. Because CEs
are global in scope and there may be more than one Nomad cluster
communicating with Consul, there is an assumption that any terminating
gateway defined in Nomad for a particular service will be the same
among Nomad clusters.

Gateways require Consul 1.8.0+, checked by a node constraint.

Closes #9445
2021-01-25 10:36:04 -06:00
Seth Hoenig 5abaf1b86d consul/connect: ensure proxyID in test case 2021-01-20 09:48:12 -06:00
Seth Hoenig a18e63ed55 client: use closed variable in append 2021-01-20 09:20:50 -06:00
Seth Hoenig 991884e715 consul/connect: Enable running multiple ingress gateways per Nomad agent
Connect ingress gateway services were being registered into Consul without
an explicit deterministic service ID. Consul would generate one automatically,
but then Nomad would have no way to register a second gateway on the same agent
as it would not supply 'proxy-id' during envoy bootstrap.

Set the ServiceID for gateways, and supply 'proxy-id' when doing envoy bootstrap.

Fixes #9834
2021-01-19 12:58:36 -06:00
Kris Hicks d71a90c8a4
Fix some errcheck errors (#9811)
* Throw away result of multierror.Append

When given a *multierror.Error, it is mutated, therefore the return
value is not needed.

* Simplify MergeMultierrorWarnings, use StringBuilder

* Hash.Write() never returns an error

* Remove error that was always nil

* Remove error from Resources.Add signature

When this was originally written it could return an error, but that was
refactored away, and callers of it as of today never handle the error.

* Throw away results of io.Copy during Bridge

* Handle errors when computing node class in test
2021-01-14 12:46:35 -08:00
Mahmood Ali 00be4fc63c
tests: deflake TestTaskRunner_StatsHook_Periodic (#9734)
This PR deflakes TestTaskRunner_StatsHook_Periodic tests and adds backoff when the driver closes the channel.

TestTaskRunner_StatsHook_Periodic is currently the most flaky test - failing ~4% of the time (20 out of 486 workflows). A sample failure: https://app.circleci.com/pipelines/github/hashicorp/nomad/14028/workflows/957b674f-cbcc-4228-96d9-1094fdee5b9c/jobs/128563 .

This change has two components:

First, it updates the StatsHook so that it backs off when stats channel is closed. In the context of the test where the mock driver emits a single stats update and closes the channel, the test may make tens of thousands update during the period. In real context, if a driver doesn't implement the stats handler properly or when a task finishes, we may generate way too many Stats queries in a tight loop. Here, the backoff reduces these queries. I've added a failing test that shows 154,458 stats updates within 500ms in https://app.circleci.com/pipelines/github/hashicorp/nomad/14092/workflows/50672445-392d-4661-b19e-e3561ed32746/jobs/129423 .

Second, the test ignores the first stats update after a task exit. Due to the asynchronicity of updates and channel/context use, it's possible that an update is enqueued while the test marks the task as exited, resulting into a spurious update.
2021-01-06 16:03:00 -05:00
Seth Hoenig b4eafe6f2d consul: always include task services hook
Previously, Nomad would optimize out the services task runner
hook for tasks which were initially submitted with no services
defined. This causes a problem when the job is later updated to
include service(s) on that task, which will result in nothing
happening because the hook is not present to handle the service
registration in the .Update.

Instead, always enable the services hook. The group services
alloc runner hook is already always enabled.

Fixes #9707
2021-01-05 08:47:19 -06:00
Chris Baker c7072258af enabled broken test that is no longer broken 2021-01-04 22:25:35 +00:00
Chris Baker 9b125b8837 update template and artifact interpolation to use client-relative paths
resolves #9839
resolves #6929
resolves #6910

e2e: template env interpolation path testing
2021-01-04 22:25:34 +00:00
Tim Gross 1785822386
template: trigger change_mode for dynamic secrets on restore (#9636)
When a task is restored after a client restart, the template runner will
create a new lease for any dynamic secret (ex. Consul or PKI secrets
engines). But because this lease is being created in the prestart hook, we
don't trigger the `change_mode`.

This changeset uses the the existence of the task handle to detect a
previously running task that's been restored, so that we can trigger the
template `change_mode` if the template is changed, as it will be only with
dynamic secrets.
2020-12-16 13:36:19 -05:00
Seth Hoenig beaa6359d5 consul/connect: fix regression where client connect images ignored
Nomad v1.0.0 introduced a regression where the client configurations
for `connect.sidecar_image` and `connect.gateway_image` would be
ignored despite being set. This PR restores that functionality.

There was a missing layer of interpolation that needs to occur for
these parameters. Since Nomad 1.0 now supports dynamic envoy versioning
through the ${NOMAD_envoy_version} psuedo variable, we basically need
to first interpolate

  ${connect.sidecar_image} => envoyproxy/envoy:v${NOMAD_envoy_version}

then use Consul at runtime to resolve to a real image, e.g.

  envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.16.0

Of course, if the version of Consul is too old to provide an envoy
version preference, we then need to know to fallback to the old
version of envoy that we used before.

  envoyproxy/envoy:v${NOMAD_envoy_version} => envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09

Beyond that, we also need to continue to support jobs that set the
sidecar task themselves, e.g.

  sidecar_task { config { image: "custom/envoy" } }

which itself could include teh pseudo envoy version variable.
2020-12-14 09:47:55 -06:00
Kris Hicks 0cf9cae656
Apply some suggested fixes from staticcheck (#9598) 2020-12-10 07:29:18 -08:00
Kris Hicks 0a3a748053
Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
Michael Schurter 15f2b8fe7c client: skip broken test and fix assertion 2020-11-18 10:01:02 -08:00
Michael Schurter ff91bba70e client: fix interpolation in template source
While Nomad v0.12.8 fixed `NOMAD_{ALLOC,TASK,SECRETS}_DIR` use in
`template.destination`, interpolating these variables in
`template.source` caused a path escape error.

**Why not apply the destination fix to source?**

The destination fix forces destination to always be relative to the task
directory. This makes sense for the destination as a destination outside
the task directory would be unreachable by the task. There's no reason
to ever render a template outside the task directory. (Using `..` does
allow destinations to escape the task directory if
`template.disable_file_sandbox = true`. That's just awkward and unsafe
enough I hope no one uses it.)

There is a reason to source a template outside a task
directory. At least if there weren't then I can't think of why we
implemented `template.disable_file_sandbox`. So v0.12.8 left the
behavior of `template.source` the more straightforward "Interpolate and
validate."

However, since outside of `raw_exec` every other driver uses absolute
paths for `NOMAD_*_DIR` interpolation, this means those variables are
unusable unless `disable_file_sandbox` is set.

**The Fix**

The variables are now interpolated as relative paths *only for the
purpose of rendering templates.* This is an unfortunate special case,
but reflects the fact that the templates view of the filesystem is
completely different (unconstrainted) vs the task's view (chrooted).
Arguably the values of these variables *should be context-specific.*
I think it's more reasonable to think of the "hack" as templating
running uncontainerized than that giving templates different paths is a
hack.

**TODO**

- [ ] E2E tests
- [ ] Job validation may still be broken and prevent my fix from
      working?

**raw_exec**

`raw_exec` is actually broken _a different way_ as exercised by tests in
this commit. I think we should probably remove these tests and fix that
in a followup PR/release, but I wanted to leave them in for the initial
review and discussion. Since non-containerized source paths are broken
anyway, perhaps there's another solution to this entire problem I'm
overlooking?
2020-11-17 22:03:04 -08:00
Seth Hoenig 4cc3c01d5b
Merge pull request #9352 from hashicorp/f-artifact-headers
jobspec: add support for headers in artifact stanza
2020-11-13 14:04:27 -06:00
Seth Hoenig bb8a5816a0 jobspec: add support for headers in artifact stanza
This PR adds the ability to set HTTP headers when downloading
an artifact from an `http` or `https` resource.

The implementation in `go-getter` is such that a new `HTTPGetter`
must be created for each artifact that sets headers (as opposed
to conveniently setting headers per-request). This PR maintains
the memoization of the default Getter objects, creating new ones
only for artifacts where headers are set.

Closes #9306
2020-11-13 12:03:54 -06:00
Jasmine Dahilig d6110cbed4
lifecycle: add poststop hook (#8194) 2020-11-12 08:01:42 -08:00
Tim Gross 1fb1c9c5d4
artifact/template: make destination path absolute inside taskdir (#9149)
Prior to Nomad 0.12.5, you could use `${NOMAD_SECRETS_DIR}/mysecret.txt` as
the `artifact.destination` and `template.destination` because we would always
append the destination to the task working directory. In the recent security
patch we treated the `destination` absolute path as valid if it didn't escape
the working directory, but this breaks backwards compatibility and
interpolation of `destination` fields.

This changeset partially reverts the behavior so that we always append the
destination, but we also perform the escape check on that new destination
after interpolation so the security hole is closed.

Also, ConsulTemplate test should exercise interpolation
2020-10-22 15:47:49 -04:00
Tim Gross 6df36e4cdb artifact/template: prevent file sandbox escapes
Ensure that the client honors the client configuration for the
`template.disable_file_sandbox` field when validating the jobspec's
`template.source` parameter, and not just with consul-template's own `file`
function.

Prevent interpolated `template.source`, `template.destination`, and
`artifact.destination` fields from escaping file sandbox.
2020-10-21 14:34:12 -04:00
Nick Ethier 4903e5b114
Consul with CNI and host_network addresses (#9095)
* consul: advertise cni and multi host interface addresses

* structs: add service/check address_mode validation

* ar/groupservices: fetch networkstatus at hook runtime

* ar/groupservice: nil check network status getter before calling

* consul: comment network status can be nil
2020-10-15 15:32:21 -04:00
Michael Schurter 9c3972937b s/0.13/1.0/g
1.0 here we come!
2020-10-14 15:17:47 -07:00
Chris Baker 1d35578bed removed backwards-compatible/untagged metrics deprecated in 0.7 2020-10-13 20:18:39 +00:00
Seth Hoenig ed13e5723f consul/connect: dynamically select envoy sidecar at runtime
As newer versions of Consul are released, the minimum version of Envoy
it supports as a sidecar proxy also gets bumped. Starting with the upcoming
Consul v1.9.X series, Envoy v1.11.X will no longer be supported. Current
versions of Nomad hardcode a version of Envoy v1.11.2 to be used as the
default implementation of Connect sidecar proxy.

This PR introduces a change such that each Nomad Client will query its
local Consul for a list of Envoy proxies that it supports (https://github.com/hashicorp/consul/pull/8545)
and then launch the Connect sidecar proxy task using the latest supported version
of Envoy. If the `SupportedProxies` API component is not available from
Consul, Nomad will fallback to the old version of Envoy supported by old
versions of Consul.

Setting the meta configuration option `meta.connect.sidecar_image` or
setting the `connect.sidecar_task` stanza will take precedence as is
the current behavior for sidecar proxies.

Setting the meta configuration option `meta.connect.gateway_image`
will take precedence as is the current behavior for connect gateways.

`meta.connect.sidecar_image` and `meta.connect.gateway_image` may make
use of the special `${NOMAD_envoy_version}` variable interpolation, which
resolves to the newest version of Envoy supported by the Consul agent.

Addresses #8585 #7665
2020-10-13 09:14:12 -05:00
Yoan Blanc 891accb89a
use allow/deny instead of the colored alternatives (#9019)
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-10-12 08:47:05 -04:00
Fredrik Hoem Grelland a015c52846
configure nomad cluster to use a Consul Namespace [Consul Enterprise] (#8849) 2020-10-02 14:46:36 -04:00
Fredrik Hoem Grelland 953d4de8dd
update consul-template to v0.25.1 (#8988) 2020-10-01 14:08:49 -04:00
Seth Hoenig af9543c997 consul: fix validation of task in group-level script-checks
When defining a script-check in a group-level service, Nomad needs to
know which task is associated with the check so that it can use the
correct task driver to execute the check.

This PR fixes two bugs:
1) validate service.task or service.check.task is configured
2) make service.check.task inherit service.task if it is itself unset

Fixes #8952
2020-09-28 15:02:59 -05:00
Lars Lehtonen 55f0302c46
client/allocrunner/taskrunner: client.Close after err check (#8825) 2020-09-04 08:12:08 -04:00
Jasmine Dahilig 71a694f39c
Merge pull request #8390 from hashicorp/lifecycle-poststart-hook
task lifecycle poststart hook
2020-08-31 13:53:24 -07:00
Seth Hoenig dfe179abc5 consul/connect: fixup some comments and context timeout 2020-08-26 13:17:16 -05:00