Commit graph

579 commits

Author SHA1 Message Date
James Rasell 7205b3f08e
Merge pull request #11402 from hashicorp/document-client-initial-vault-renew
taskrunner: add clarifying initial vault token renew comment.
2022-01-13 16:21:58 +01:00
Alessandro De Blasis e647549ecf
metrics: added mapped_file metric (#11500)
Signed-off-by: Alessandro De Blasis <alex@deblasis.net>
Co-authored-by: Nate <37554478+servusdei2018@users.noreply.github.com>
2022-01-10 15:35:19 -05:00
grembo edd3b8a20c
Un-break templates when using vault stanza change_mode noop (#11783)
Templates in nomad jobs make use of the vault token defined in
the vault stanza when issuing credentials like client certificates.

When using change_mode "noop" in the vault stanza, consul-template
is not informed in case a vault token is re-issued (which can
happen from time to time for various reasons, as described
in https://www.nomadproject.io/docs/job-specification/vault).

As a result, consul-template will keep using the old vault token
to renew credentials and - once the token expired - stop renewing
credentials. The symptom of this problem is a vault_token
file that is newer than the issued credential (e.g., TLS certificate)
in a job's /secrets directory.

This change corrects this, so that h.updater.updatedVaultToken(token)
is called, which will inform stakeholders about the new
token and make sure, the new token is used by consul-template.

Example job template fragment:

    vault {
        policies = ["nomad-job-policy"]
        change_mode = "noop"
    }

    template {
      data = <<-EOH
        {{ with secret "pki_int/issue/nomad-job"
        "common_name=myjob.service.consul" "ttl=90m"
        "alt_names=localhost" "ip_sans=127.0.0.1"}}
        {{ .Data.certificate }}
        {{ .Data.private_key }}
        {{ .Data.issuing_ca }}
        {{ end }}
      EOH
      destination = "${NOMAD_SECRETS_DIR}/myjob.crt"
      change_mode = "noop"
    }

This fix does not alter the meaning of the three change modes of vault

- "noop" - Take no action
- "restart" - Restart the job
- "signal" - send a signal to the task

as the switch statement following line 232 contains the necessary
logic.

It is assumed that "take no action" was never meant to mean "don't tell
consul-template about the new vault token".

Successfully tested in a staging cluster consisting of multiple
nomad client nodes.
2022-01-10 14:41:38 -05:00
Derek Strickland 0a8e03f0f7
Expose Consul template configuration parameters (#11606)
This PR exposes the following existing`consul-template` configuration options to Nomad jobspec authors in the `{job.group.task.template}` stanza.

- `wait`

It also exposes the following`consul-template` configuration to Nomad operators in the `{client.template}` stanza.

- `max_stale`
- `block_query_wait`
- `consul_retry`
- `vault_retry` 
- `wait` 

Finally, it adds the following new Nomad-specific configuration to the `{client.template}` stanza that allows Operators to set bounds on what `jobspec` authors configure.

- `wait_bounds`

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2022-01-10 10:19:07 -05:00
Tim Gross 5eda9be7b0
CSI: tests to exercise csi_hook (#11788)
Small refactoring of the allocrunner hook for CSI to make it more
testable, and a unit test that covers most of its logic.
2022-01-07 15:23:47 -05:00
Tim Gross 265e488ab4
task runner: fix goroutine leak in prestart hook (#11741)
The task runner prestart hooks take a `joincontext` so they have the
option to exit early if either of two contexts are canceled: from
killing the task or client shutdown. Some tasks exit without being
shutdown from the server, so neither of the joined contexts ever gets
canceled and we leak the `joincontext` (48 bytes) and its internal
goroutine. This primarily impacts batch jobs and any task that fails
or completes early such as non-sidecar prestart lifecycle tasks.
Cancel the `joincontext` after the prestart call exits to fix the
leak.
2021-12-23 11:50:51 -05:00
James Rasell 45f4689f9c
chore: fixup inconsistent method receiver names. (#11704) 2021-12-20 11:44:21 +01:00
Tim Gross a0cf5db797
provide -no-shutdown-delay flag for job/alloc stop (#11596)
Some operators use very long group/task `shutdown_delay` settings to
safely drain network connections to their workloads after service
deregistration. But during incident response, they may want to cause
that drain to be skipped so they can quickly shed load.

Provide a `-no-shutdown-delay` flag on the `nomad alloc stop` and
`nomad job stop` commands that bypasses the delay. This sets a new
desired transition state on the affected allocations that the
allocation/task runner will identify during pre-kill on the client.

Note (as documented here) that using this flag will almost always
result in failed inbound network connections for workloads as the
tasks will exit before clients receive updated service discovery
information and won't be gracefully drained.
2021-12-13 14:54:53 -05:00
James Rasell e3537a06bb
taskrunner: add clarifying initial vault token renew comment. 2021-10-28 17:09:22 +02:00
Michael Schurter fd68bbc342 test: update tests to properly use AllocDir
Also use t.TempDir when possible.
2021-10-19 10:49:07 -07:00
Michael Schurter 10c3bad652 client: never embed alloc_dir in chroot
Fixes #2522

Skip embedding client.alloc_dir when building chroot. If a user
configures a Nomad client agent so that the chroot_env will embed the
client.alloc_dir, Nomad will happily infinitely recurse while building
the chroot until something horrible happens. The best case scenario is
the filesystem's path length limit is hit. The worst case scenario is
disk space is exhausted.

A bad agent configuration will look something like this:

```hcl
data_dir = "/tmp/nomad-badagent"

client {
  enabled = true

  chroot_env {
    # Note that the source matches the data_dir
    "/tmp/nomad-badagent" = "/ohno"
    # ...
  }
}
```

Note that `/ohno/client` (the state_dir) will still be created but not
`/ohno/alloc` (the alloc_dir).
While I cannot think of a good reason why someone would want to embed
Nomad's client (and possibly server) directories in chroots, there
should be no cause for harm. chroots are only built when Nomad runs as
root, and Nomad disables running exec jobs as root by default. Therefore
even if client state is copied into chroots, it will be inaccessible to
tasks.

Skipping the `data_dir` and `{client,server}.state_dir` is possible, but
this PR attempts to implement the minimum viable solution to reduce risk
of unintended side effects or bugs.

When running tests as root in a vm without the fix, the following error
occurs:

```
=== RUN   TestAllocDir_SkipAllocDir
    alloc_dir_test.go:520:
                Error Trace:    alloc_dir_test.go:520
                Error:          Received unexpected error:
                                Couldn't create destination file /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/testtask/nomad/test/testtask/.../nomad/test/testtask/secrets/.nomad-mount: open /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/.../testtask/secrets/.nomad-mount: file name too long
                Test:           TestAllocDir_SkipAllocDir
--- FAIL: TestAllocDir_SkipAllocDir (22.76s)
```

Also removed unused Copy methods on AllocDir and TaskDir structs.

Thanks to @eveld for not letting me forget about this!
2021-10-18 09:22:01 -07:00
Mahmood Ali 4d90afb425 gofmt all the files
mostly to handle build directives in 1.17.
2021-10-01 10:14:28 -04:00
James Rasell 0e926ef3fd
allow configuration of Docker hostnames in bridge mode (#11173)
Add a new hostname string parameter to the network block which
allows operators to specify the hostname of the network namespace.
Changing this causes a destructive update to the allocation and it
is omitted if empty from API responses. This parameter also supports
interpolation.

In order to have a hostname passed as a configuration param when
creating an allocation network, the CreateNetwork func of the
DriverNetworkManager interface needs to be updated. In order to
minimize the disruption of future changes, rather than add another
string func arg, the function now accepts a request struct along with
the allocID param. The struct has the hostname as a field.

The in-tree implementations of DriverNetworkManager.CreateNetwork
have been modified to account for the function signature change.
In updating for the change, the enhancement of adding hostnames to
network namespaces has also been added to the Docker driver, whilst
the default Linux manager does not current implement it.
2021-09-16 08:13:09 +02:00
James Rasell b6813f1221
chore: fix incorrect docstring formatting. 2021-08-30 11:08:12 +02:00
Mahmood Ali c37339a8c8
Merge pull request #9160 from hashicorp/f-sysbatch
core: implement system batch scheduler
2021-08-16 09:30:24 -04:00
James Rasell a9a04141a3
consul/connect: avoid warn messages on connect proxy errors
When creating a TCP proxy bridge for Connect tasks, we are at the
mercy of either end for managing the connection state. For long
lived gRPC connections the proxy could reasonably expect to stay
open until the context was cancelled. For the HTTP connections used
by connect native tasks, we experience connection disconnects.
The proxy gets recreated as needed on follow up requests, however
we also emit a WARN log when the connection is broken. This PR
lowers the WARN to a TRACE, because these disconnects are to be
expected.

Ideally we would be able to proxy at the HTTP layer, however Consul
or the connect native task could be configured to expect mTLS, preventing
Nomad from MiTM the requests.

We also can't mange the proxy lifecycle more intelligently, because
we have no control over the HTTP client or server and how they wish
to manage connection state.

What we have now works, it's just noisy.

Fixes #10933
2021-08-05 11:27:35 +02:00
Seth Hoenig 3371214431 core: implement system batch scheduler
This PR implements a new "System Batch" scheduler type. Jobs can
make use of this new scheduler by setting their type to 'sysbatch'.

Like the name implies, sysbatch can be thought of as a hybrid between
system and batch jobs - it is for running short lived jobs intended to
run on every compatible node in the cluster.

As with batch jobs, sysbatch jobs can also be periodic and/or parameterized
dispatch jobs. A sysbatch job is considered complete when it has been run
on all compatible nodes until reaching a terminal state (success or failed
on retries).

Feasibility and preemption are governed the same as with system jobs. In
this PR, the update stanza is not yet supported. The update stanza is sill
limited in functionality for the underlying system scheduler, and is
not useful yet for sysbatch jobs. Further work in #4740 will improve
support for the update stanza and deployments.

Closes #2527
2021-08-03 10:30:47 -04:00
Michael Schurter efe8ea2c2c
Merge pull request #10849 from benbuzbee/benbuz/fix-destroy
Don't treat a failed recover + successful destroy as a successful recover
2021-07-19 10:49:31 -07:00
Seth Hoenig f9d3fedca2 consul/connect: add missing import statements 2021-07-12 09:28:16 -05:00
Seth Hoenig 5540dfc17f
consul/connect: use join host port
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2021-07-12 09:04:54 -05:00
Seth Hoenig f80ae067a8 consul/connect: fix bug causing high cpu with multiple connect sidecars in group
This PR fixes a bug where the underlying Envoy process of a Connect gateway
would consume a full core of CPU if there is more than one sidecar or gateway
in a group. The utilization was being caused by Consul injecting an envoy_ready_listener
on 127.0.0.1:8443, of which only one of the Envoys would be able to bind to.
The others would spin in a hot loop trying to bind the listener.

As a workaround, we now specify -address during the Envoy bootstrap config
step, which is how Consul maps this ready listener. Because there is already
the envoy_admin_listener, and we need to continue supporting running gateways
in host networking mode, and in those case we want to use the same port
value coming from the service.port field, we now bind the admin listener to
the 127.0.0.2 loop-back interface, and the ready listener takes 127.0.0.1.

This shouldn't make a difference in the 99.999% use case where envoy is
being run in its official docker container. Advanced users can reference
${NOMAD_ENVOY_ADMIN_ADDR_<service>} (as they 'ought to) if needed,
as well as the new variable ${NOMAD_ENVOY_READY_ADDR_<service>} for the
envoy_ready_listener.
2021-07-09 14:34:44 -05:00
Seth Hoenig e47ea462fb client: fix logline in group shutdown hook
Fixes #10844
2021-07-08 11:14:37 -05:00
Seth Hoenig c8260c3940 consul: avoid triggering unnecessary sync when removing workload
There are bits of logic in callers of RemoveWorkload on group/task
cleanup hooks which call RemoveWorkload with the "Canary" version
of the workload, in case the alloc is marked as a Canary. This logic
triggers an extra sync with Consul, and also doesn't do the intended
behavior - for which no special casing is necessary anyway. When the
workload is marked for removal, all associated services and checks
will be removed regardless of the Canary status, because the service
and check IDs do not incorporate the canary-ness in the first place.

The only place where canary-ness matters is when updating a workload,
where we need to compute the hash of the services and checks to determine
whether they have been modified, the Canary flag of which is a part of
that.

Fixes #10842
2021-07-06 14:08:42 -05:00
Ben Buzbee e247f8806b Don't treat a failed recover + successful destroy as a successful
recover

This code just seems incorrect. As it stands today it reports a
successful restore if RecoverTask fails and then DestroyTask succeeds.

This can result in a really annoying bug where it then calls RecoverTask
again, whereby it will probably get ErrTaskNotFound and call DestroyTask
once more.

I think the only reason this has not been noticed so far is because most
drivers like Docker will return Success, then nomad will call
RecoverTask, get an error (not found) and call DestroyTask again, and
get a ErrTasksNotFound err.
2021-07-03 01:46:36 +00:00
Seth Hoenig 5aa657c6bd consul/connect: automatically set consul tls sni name for connect native tasks
This PR makes it so that Nomad will automatically set the CONSUL_TLS_SERVER_NAME
environment variable for Connect native tasks running in bridge networking mode
where Consul has TLS enabled. Because of the use of a unix domain socket for
communicating with Consul when in bridge networking mode, the server name is
a file name instead of something compatible with the mTLS certificate Consul
will authenticate against. "localhost" is by default a compatible name, so Nomad
will set the environment variable to that.

Fixes #10804
2021-06-28 08:36:53 -05:00
Tim Gross 59c1237fc9
tests: allocrunner CNI tests are Linux-only (#10783)
Running the `client/allocrunner` tests fail to compile on macOS because the
CNI test file depends on the CNI network configurator, which is in a
Linux-only file.
2021-06-18 11:34:31 -04:00
Tim Gross 7bd61bbf43
docker: generate /etc/hosts file for bridge network mode (#10766)
When `network.mode = "bridge"`, we create a pause container in Docker with no
networking so that we have a process to hold the network namespace we create
in Nomad. The default `/etc/hosts` file of that pause container is then used
for all the Docker tasks that share that network namespace. Some applications
rely on this file being populated.

This changeset generates a `/etc/hosts` file and bind-mounts it to the
container when Nomad owns the network, so that the container's hostname has an
IP in the file as expected. The hosts file will include the entries added by
the Docker driver's `extra_hosts` field.

In this changeset, only the Docker task driver will take advantage of this
option, as the `exec`/`java` drivers currently copy the host's `/etc/hosts`
file and this can't be changed without breaking backwards compatibility. But
the fields are available in the task driver protobuf for community task
drivers to use if they'd like.
2021-06-16 14:55:22 -04:00
James Rasell 492e308846
tests: remove duplicate import statements. 2021-06-11 09:39:22 +02:00
Seth Hoenig 519429a2de consul: probe consul namespace feature before using namespace api
This PR changes Nomad's wrapper around the Consul NamespaceAPI so that
it will detect if the Consul Namespaces feature is enabled before making
a request to the Namespaces API. Namespaces are not enabled in Consul OSS,
and require a suitable license to be used with Consul ENT.

Previously Nomad would check for a 404 status code when makeing a request
to the Namespaces API to "detect" if Consul OSS was being used. This does
not work for Consul ENT with Namespaces disabled, which returns a 500.

Now we avoid requesting the namespace API altogether if Consul is detected
to be the OSS sku, or if the Namespaces feature is not licensed. Since
Consul can be upgraded from OSS to ENT, or a new license applied, we cache
the value for 1 minute, refreshing on demand if expired.

Fixes https://github.com/hashicorp/nomad-enterprise/issues/575

Note that the ticket originally describes using attributes from https://github.com/hashicorp/nomad/issues/10688.
This turns out not to be possible due to a chicken-egg situation between
bootstrapping the agent and setting up the consul client. Also fun: the
Consul fingerprinter creates its own Consul client, because there is no
[currently] no way to pass the agent's client through the fingerprint factory.
2021-06-07 12:19:25 -05:00
Seth Hoenig d026ff1f66 consul/connect: add support for connect mesh gateways
This PR implements first-class support for Nomad running Consul
Connect Mesh Gateways. Mesh gateways enable services in the Connect
mesh to make cross-DC connections via gateways, where each datacenter
may not have full node interconnectivity.

Consul docs with more information:
https://www.consul.io/docs/connect/gateways/mesh-gateway

The following group level service block can be used to establish
a Connect mesh gateway.

service {
  connect {
    gateway {
      mesh {
        // no configuration
      }
    }
  }
}

Services can make use of a mesh gateway by configuring so in their
upstream blocks, e.g.

service {
  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "<service>"
          local_bind_port  = <port>
          datacenter       = "<datacenter>"
          mesh_gateway {
            mode = "<mode>"
          }
        }
      }
    }
  }
}

Typical use of a mesh gateway is to create a bridge between datacenters.
A mesh gateway should then be configured with a service port that is
mapped from a host_network configured on a WAN interface in Nomad agent
config, e.g.

client {
  host_network "public" {
    interface = "eth1"
  }
}

Create a port mapping in the group.network block for use by the mesh
gateway service from the public host_network, e.g.

network {
  mode = "bridge"
  port "mesh_wan" {
    host_network = "public"
  }
}

Use this port label for the service.port of the mesh gateway, e.g.

service {
  name = "mesh-gateway"
  port = "mesh_wan"
  connect {
    gateway {
      mesh {}
    }
  }
}

Currently Envoy is the only supported gateway implementation in Consul.
By default Nomad client will run the latest official Envoy docker image
supported by the local Consul agent. The Envoy task can be customized
by setting `meta.connect.gateway_image` in agent config or by setting
the `connect.sidecar_task` block.

Gateways require Consul 1.8.0+, enforced by the Nomad scheduler.

Closes #9446
2021-06-04 08:24:49 -05:00
Mahmood Ali 067fd86a8c
drivers: Capture exit code when task is killed (#10494)
This commit ensures Nomad captures the task code more reliably even when the task is killed. This issue affect to `raw_exec` driver, as noted in https://github.com/hashicorp/nomad/issues/10430 .

We fix this issue by ensuring that the TaskRunner only calls `driver.WaitTask` once. The TaskRunner monitors the completion of the task by calling `driver.WaitTask` which should return the task exit code on completion. However, it also could return a "context canceled" error if the agent/executor is shutdown.

Previously, when a task is to be stopped, the killTask path makes two WaitTask calls, and the second returns "context canceled" occasionally because of a "race" in task shutting down and depending on driver, and how fast it shuts down after task completes.

By having a single WaitTask call and consistently waiting for the task, we ensure we capture the exit code reliably before the executor is shutdown or the contexts expired.

I opted to change the TaskRunner implementation to avoid changing the driver interface or requiring 3rd party drivers to update.

Additionally, the PR ensures that attempts to kill the task terminate when the task "naturally" dies. Without this change, if the task dies at the right moment, the `killTask` call may retry to kill an already-dead task for up to 5 minutes before giving up.
2021-05-04 10:54:00 -04:00
Michael Schurter 547a718ef6
Merge pull request #10248 from hashicorp/f-remotetask-2021
core: propagate remote task handles
2021-04-30 08:57:26 -07:00
Michael Schurter e62795798d core: propagate remote task handles
Add a new driver capability: RemoteTasks.

When a task is run by a driver with RemoteTasks set, its TaskHandle will
be propagated to the server in its allocation's TaskState. If the task
is replaced due to a down node or draining, its TaskHandle will be
propagated to its replacement allocation.

This allows tasks to be scheduled in remote systems whose lifecycles are
disconnected from the Nomad node's lifecycle.

See https://github.com/hashicorp/nomad-driver-ecs for an example ECS
remote task driver.
2021-04-27 15:07:03 -07:00
Seth Hoenig 238ac718f2 connect: use exp backoff when waiting on consul envoy bootstrap
This PR wraps the use of the consul envoy bootstrap command in
an expoenential backoff closure, configured to timeout after 60
seconds. This is an increase over the current behavior of making
3 attempts over 6 seconds.

Should help with #10451
2021-04-27 09:21:50 -06:00
Seth Hoenig f258fc8270
Merge pull request #10401 from hashicorp/cp-cns-ent-test-fixes
cherry-pick fixes from cns ent tests
2021-04-20 08:45:15 -06:00
Seth Hoenig 6e1c71446d client: always set script checks hook
Similar to a bugfix made for the services hook, we need to always
set the script checks hook, in case a task is initially launched
without script checks, but then updated to include script checks.

The scipt checks hook is the thing that handles that new registration.
2021-04-19 15:37:42 -06:00
Seth Hoenig 509490e5d2 e2e: consul namespace tests from nomad ent
(cherry-picked from ent without _ent things)

This is part 2/4 of e2e tests for Consul Namespaces. Took a
first pass at what the parameterized tests can look like, but
only on the ENT side for this PR. Will continue to refactor
in the next PRs.

Also fixes 2 bugs:
 - Config Entries registered by Nomad Server on job registration
   were not getting Namespace set
 - Group level script checks were not getting Namespace set

Those changes will need to be copied back to Nomad OSS.

Nomad OSS + no ACLs (previously, needs refactor)
Nomad ENT + no ACLs (this)
Nomad OSS + ACLs (todo)
Nomad ENT + ALCs (todo)
2021-04-19 15:35:31 -06:00
Nick Ethier 8140b0160c
Merge pull request #10369 from hashicorp/f-cpu-cores-4
Reserved Cores [4/4]: Implement driver cpuset cgroup path consumption
2021-04-19 14:53:29 -04:00
Nick Ethier 110f982eb3 plugins/drivers: fix deprecated fields 2021-04-16 14:13:29 -04:00
Nick Ethier f6d7285157
Merge pull request #10328 from hashicorp/f-cpu-cores-3
Reserved Cores [3/4]: Client cpuset cgroup managment
2021-04-16 14:11:45 -04:00
Nick Ethier 1e09ca5cd7 tr: set cpuset cpus if reserved 2021-04-15 13:31:51 -04:00
Adam Duncan 7588cf0ec3 networking: Ensure CNI iptables rules are appended to chain and not forced to be first 2021-04-15 10:11:15 -04:00
Nick Ethier 0a4e298221 testing fixes 2021-04-14 10:17:28 -04:00
Nick Ethier 155a2ca5fb client/ar: thread through cpuset manager 2021-04-13 13:28:36 -04:00
Mahmood Ali 2fd9eafc28
only publish measured metrics (#10376) 2021-04-13 11:39:33 -04:00
Michael Schurter a595409ce9
Merge pull request #9895 from hashicorp/b-cni-ipaddr
CNI: add fallback logic if no ip address references sandboxed interface
2021-04-09 08:58:35 -07:00
Michael Schurter 4a53633a1d ar: refactor go-cni results processing & add test
The goal is to always find an interface with an address, preferring
sandbox interfaces, but falling back to the first address found.

A test was added against a known CNI plugin output that was not handled
correctly before.
2021-04-08 09:20:14 -07:00
Tim Gross 276633673d CSI: use AccessMode/AttachmentMode from CSIVolumeClaim
Registration of Nomad volumes previously allowed for a single volume
capability (access mode + attachment mode pair). The recent `volume create`
command requires that we pass a list of requested capabilities, but the
existing workflow for claiming volumes and attaching them on the client
assumed that the volume's single capability was correct and unchanging.

Add `AccessMode` and `AttachmentMode` to `CSIVolumeClaim`, use these fields to
set the initial claim value, and add backwards compatibility logic to handle
the existing volumes that already have claims without these fields.
2021-04-07 11:24:09 -04:00
Nick Ethier 5aed5b7cd4
ar: stringify CNI result debug message 2021-04-05 12:35:34 -04:00
Seth Hoenig f17ba33f61 consul: plubming for specifying consul namespace in job/group
This PR adds the common OSS changes for adding support for Consul Namespaces,
which is going to be a Nomad Enterprise feature. There is no new functionality
provided by this changeset and hopefully no new bugs.
2021-04-05 10:03:19 -06:00