Commit graph

3467 commits

Author SHA1 Message Date
Jasmine Dahilig 71a694f39c
Merge pull request #8390 from hashicorp/lifecycle-poststart-hook
task lifecycle poststart hook
2020-08-31 13:53:24 -07:00
Jasmine Dahilig fbe0c89ab1 task lifecycle poststart: code review fixes 2020-08-31 13:22:41 -07:00
Mahmood Ali 117aec0036 Fix accidental broken clones
Fix CSIMountOptions.Copy() and VolumeRequest.Copy() where they
accidentally returned a reference to self rather than a deep copy.

`&(*ref)` in Golang apparently equivalent to plain `&ref`.
2020-08-28 15:29:22 -04:00
Tim Gross b77fe023b5
MRD: move 'job stop -global' handling into RPC (#8776)
The initial implementation of global job stop for MRD looped over all the
regions in the CLI for expedience. This changeset includes the OSS parts of
moving this into the RPC layer so that API consumers don't have to implement
this logic themselves.
2020-08-28 14:28:13 -04:00
Tim Gross 35b1b3bed7
structs: filter NomadTokenID from job diff (#8773)
Multiregion deployments use the `NomadTokenID` to allow the deploymentwatcher
to send RPCs between regions with the original submitter's ACL token. This ID
should be filtered from diffs so that it doesn't cause a difference for
purposes of job plans.
2020-08-28 13:40:51 -04:00
Lang Martin 7d483f93c0
csi: plugins track jobs in addition to allocations, and use job information to set expected counts (#8699)
* nomad/structs/csi: add explicit job support
* nomad/state/state_store: capture job updates directly
* api/nodes: CSIInfo needs the AllocID
* command/agent/csi_endpoint: AllocID was missing
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2020-08-27 17:20:00 -04:00
Seth Hoenig c4fd1c97aa
Merge pull request #8761 from hashicorp/b-consul-op-token-check
consul/connect: make use of task kind to determine service name in consul token checks
2020-08-27 14:08:33 -05:00
Tim Gross 606df14e78
MRD: deregister regions that are dropped on update (#8763)
This changeset is the OSS hooks for what will be implemented in ENT.
2020-08-27 14:54:45 -04:00
Seth Hoenig 84176c9a41 consul/connect: make use of task kind to determine service name in consul token checks
When consul.allow_unauthenticated is set to false, the job_endpoint hook validates
that a `-consul-token` is provided and validates the token against the privileges
inherent to a Consul Service Identity policy for all the Connect enabled services
defined in the job.

Before, the check was assuming the service was of type sidecar-proxy. This fixes the
check to use the type of the task so we can distinguish between the different connect
types.
2020-08-27 12:14:40 -05:00
Chris Baker 8b9145fabd state_store/fix the prefix bugs for scaling policies documented in 1a9318 2020-08-27 04:25:37 +00:00
Chris Baker 655cbb4d3c documenting tests for prefix bugs around job scaling policies 2020-08-27 03:22:13 +00:00
Seth Hoenig 9f1f2a5673 Merge branch 'master' into f-cc-ingress 2020-08-26 15:31:05 -05:00
Seth Hoenig 5d670c6d01 consul/connect: use context cancel more safely 2020-08-26 14:23:31 -05:00
Seth Hoenig dfe179abc5 consul/connect: fixup some comments and context timeout 2020-08-26 13:17:16 -05:00
Mahmood Ali 45f549e29e
Merge pull request #8691 from hashicorp/b-reschedule-job-versions
Respect alloc job version for lost/failed allocs
2020-08-25 18:02:45 -04:00
Mahmood Ali def768728e Have Plan.AppendAlloc accept the job 2020-08-25 17:22:09 -04:00
Mahmood Ali 18632955f2 clarify PathEscapesAllocDir specification
Clarify how to handle prefix value and path traversal within the alloc
dir but outside the prefix directory.
2020-08-24 20:44:26 -04:00
Mahmood Ali 9794760933 validate parameterized job request meta
Fixes a bug where `keys` metadata wasn't populated, as we iterated over
the empty newly-created `keys` map rather than the request Meta field.
2020-08-24 20:39:01 -04:00
Seth Hoenig 26e77623e5 consul/connect: fixup tests to use new consul sdk 2020-08-24 12:02:41 -05:00
Seth Hoenig c4fa644315 consul/connect: remove envoy dns option from gateway proxy config 2020-08-24 09:11:55 -05:00
Yoan Blanc 327d17e0dc
fixup! vendor: consul/api, consul/sdk v1.6.0
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-08-24 08:59:03 +02:00
Seth Hoenig 5b072029f2 consul/connect: add initial support for ingress gateways
This PR adds initial support for running Consul Connect Ingress Gateways (CIGs) in Nomad. These gateways are declared as part of a task group level service definition within the connect stanza.

```hcl
service {
  connect {
    gateway {
      proxy {
        // envoy proxy configuration
      }
      ingress {
        // ingress-gateway configuration entry
      }
    }
  }
}
```

A gateway can be run in `bridge` or `host` networking mode, with the caveat that host networking necessitates manually specifying the Envoy admin listener (which cannot be disabled) via the service port value.

Currently Envoy is the only supported gateway implementation in Consul, and Nomad only supports running Envoy as a gateway using the docker driver.

Aims to address #8294 and tangentially #8647
2020-08-21 16:21:54 -05:00
Nick Ethier 3cd5f46613
Update UI to use new allocated ports fields (#8631)
* nomad: canonicalize alloc shared resources to populate ports

* ui: network ports

* ui: remove unused task network references and update tests with new shared ports model

* ui: lint

* ui: revert auto formatting

* ui: remove unused page objects

* structs: remove unrelated test from bad conflict resolution

* ui: formatting
2020-08-20 11:07:13 -04:00
Mahmood Ali 8a342926b7 Respect alloc job version for lost/failed allocs
This change fixes a bug where lost/failed allocations are replaced by
allocations with the latest versions, even if the version hasn't been
promoted yet.

Now, when generating a plan for lost/failed allocations, the scheduler
first checks if the current deployment is in Canary stage, and if so, it
ensures that any lost/failed allocations is replaced one with the latest
promoted version instead.
2020-08-19 09:52:48 -04:00
Tim Gross 1aa242c15a
failed core jobs should not have follow-ups (#8682)
If a core job fails more than the delivery limit, the leader will create a new
eval with the TriggeredBy field set to `failed-follow-up`.

Evaluations for core jobs have the leader's ACL, which is not valid on another
leader after an election. The `failed-follow-up` evals do not have ACLs, so
core job evals that fail more than the delivery limit or core job evals that
span leader elections will never succeed and will be re-enqueued forever. So
we should not retry with a `failed-follow-up`.
2020-08-18 16:48:43 -04:00
Tim Gross 38ec70eb8d
multiregion: validation should always return error for OSS (#8687) 2020-08-18 15:35:38 -04:00
Lang Martin e8a5565c1a
nomad/state/state_store: handle type conversion failure explicitly (#8660) 2020-08-12 17:53:12 -04:00
Michael Schurter de08ae8083 test: add allocrunner test for poststart hooks 2020-08-12 09:54:14 -07:00
Mahmood Ali c462f8d0d5
Merge pull request #8524 from hashicorp/b-vault-health-checks
Skip checking Vault health
2020-08-11 16:01:07 -04:00
Lang Martin a27913e699
CSI RPC Token (#8626)
* client/allocrunner/csi_hook: use the Node SecretID
* client/allocrunner/csi_hook: include the namespace for Claim
2020-08-11 13:08:39 -04:00
Lang Martin c82b2a2454
CSI: volume and plugin allocations in the API (#8590)
* command/agent/csi_endpoint: explicitly convert to API structs, and convert allocs for single object get endpoints
2020-08-11 12:24:41 -04:00
Tim Gross def7084be7
msgpack-rpc errors cannot be wrapped (#8633)
Our RPC calls mangle the errors we get, which prevents us from using wrapped
errors and `errors.Is`.

Also fixes log message fields.
2020-08-11 10:25:43 -04:00
Tim Gross 443fdaa86b
csi: nomad volume detach command (#8584)
The soundness guarantees of the CSI specification leave a little to be desired
in our ability to provide a 100% reliable automated solution for managing
volumes. This changeset provides a new command to bridge this gap by providing
the operator the ability to intervene.

The command doesn't take an allocation ID so that the operator doesn't have to
keep track of alloc IDs that may have been GC'd. Handle this case in the
unpublish RPC by sending the client RPC for all the terminal/nil allocs on the
selected node.
2020-08-11 10:18:54 -04:00
Tim Gross fb27082e5c
RPC errors must be wrapped in order to wrap internal errors (#8632)
The CSI client RPC uses error wrapping to detect the type of error bubbling up
from plugins, but if the errors we get aren't wrapped at each layer, we can't
unwrap the inner error.

Also eliminates some unused args.
2020-08-11 09:13:52 -04:00
Lang Martin f245ba91c4
nomad/state/state_store: two cases of incorrect CSIPlugin in-place (#8630) 2020-08-10 18:15:29 -04:00
Mahmood Ali dce1dc44eb distinguish between transient and persistent errors 2020-08-10 16:46:06 -04:00
Seth Hoenig 6ab3d21d2c consul: validate script type when ussing check thresholds 2020-08-10 14:08:09 -05:00
Seth Hoenig fd4804bf26 consul: able to set pass/fail thresholds on consul service checks
This change adds the ability to set the fields `success_before_passing` and
`failures_before_critical` on Consul service check definitions. This is a
feature added to Consul v1.7.0 and later.
  https://www.consul.io/docs/agent/checks#success-failures-before-passing-critical

Nomad doesn't do much besides pass the fields through to Consul.

Fixes #6913
2020-08-10 14:08:09 -05:00
Tim Gross e5496c7994
csi: missing plugins during node delete are not an error (#8619)
When deregistering a client, CSI plugins running on that client may not get a
chance to fingerprint before being stopped. Account for the case where a
plugin allocation is the last instance of the plugin and has been deleted from
the state store to avoid errors during node deregistration.
2020-08-10 11:02:01 -04:00
Mahmood Ali 628985c51e
Merge pull request #8613 from alrs/state-test-errs
nomad/state: fix dropped scaling_policy test errors
2020-08-10 08:14:19 -04:00
Lars Lehtonen f8a42f587f
nomad/state: fix dropped scaling_policy test errors 2020-08-07 23:05:33 -07:00
Tim Gross 69f4f171e5
CSI: fix missing ACL tokens for leader-driven RPCs (#8607)
The volumewatcher and GC job in the leader can't make CSI RPCs when ACLs are
enabled without the leader ACL token being passed thru.
2020-08-07 15:37:27 -04:00
Tim Gross 7d53ed88d6
csi: client RPCs should return wrapped errors for checking (#8605)
When the client-side actions of a CSI client RPC succeed but we get
disconnected during the RPC or we fail to checkpoint the claim state, we want
to be able to retry the client RPC without getting blocked by the client-side
state (ex. mount points) already having been cleaned up in previous calls.
2020-08-07 11:01:36 -04:00
Tim Gross 81b604fa13
csi: controller unpublish should check current alloc count (#8604)
Using the count of node claims from earlier in the `CSIVolume.Unpublish RPC
doesn't correctly account for cases where the RPC was interrupted but
checkpointed. Instead, we'll check the current allocation count and status to
determine whether we need to send a controller unpublish.
2020-08-07 10:43:45 -04:00
Tim Gross 2854298089
csi: release claims via csi_hook postrun unpublish RPC (#8580)
Add a Postrun hook to send the `CSIVolume.Unpublish` RPC to the server. This
may forward client RPCs to the node plugins or to the controller plugins,
depending on whether other allocations on this node have claims on this
volume.

By making clients responsible for running the `CSIVolume.Unpublish` RPC (and
making the RPC available to a `nomad volume detach` command), the
volumewatcher becomes only used by the core GC job and we no longer need
async volume GC from job deregister and node update.
2020-08-06 14:51:46 -04:00
Michael Schurter 057e1c021f
Merge pull request #8597 from hashicorp/b-vault-revoke-log-line
vault: log once per interval if batching revocation
2020-08-06 11:32:47 -07:00
Tim Gross 314458ebdb
csi: update volumewatcher to use unpublish RPC (#8579)
This changeset updates `nomad/volumewatcher` to take advantage of the
`CSIVolume.Unpublish` RPC. This lets us eliminate a bunch of code and
associated tests. The raft batching code can be safely dropped, as the
characteristic times of the CSI RPCs are on the order of seconds or even
minutes, so batching up raft RPCs added complexity without any real world
performance wins.

Includes refactor w/ test cleanup and dead code elimination in volumewatcher
2020-08-06 14:31:18 -04:00
Tim Gross eaa14ab64c
csi: add unpublish RPC (#8572)
This changeset is plumbing for a `nomad volume detach` command that will be
reused by the volumewatcher claim GC as well.
2020-08-06 13:51:29 -04:00
Tim Gross 4bbf18703f
csi: retry controller client RPCs on next controller (#8561)
The documentation encourages operators to run multiple controller plugin
instances for HA, but the client RPCs don't take advantage of this by retrying
when the RPC fails in cases when the plugin is unavailable (because the node
has drained or the alloc has failed but we haven't received an updated
fingerprint yet).

This changeset tries all known controllers on ready nodes before giving up,
and adds tests that exercise the client RPC routing and retries.
2020-08-06 13:24:24 -04:00
Michael Schurter 2385fee0d2 vault: log once per interval if batching revocation
This log line should be rare since:

1. Most tokens should be logged synchronously, not via this async
   batched method. Async revocation only takes place when Vault
   connectivity is lost and after leader election so no revocations are
   missed.
2. There should rarely be >1 batch (1,000) tokens to revoke since the
   above conditions should be brief and infrequent.
3. Interval is 5 minutes, so this log line will be emitted at *most*
   once every 5 minutes.

What makes this log line rare is also what makes it interesting: due to
a bug prior to Nomad 0.11.2 some tokens may never get revoked. Therefore
Nomad tries to re-revoke them on every leader election. This caused a
massive buildup of old tokens that would never be properly revoked and
purged. Nomad 0.11.3 mostly fixed this but still had a bug in purging
revoked tokens via Raft (fixed in #8553).

The nomad.vault.distributed_tokens_revoked metric is only ticked upon
successful revocation and purging, making any bugs or slowness in the
process difficult to detect.

Logging before a potentially slow revocation+purge operation is
performed will give users much better indications of what activity is
going on should the process fail to make it to the metric.
2020-08-05 15:39:21 -07:00