Commit Graph

22481 Commits

Author SHA1 Message Date
Tim Gross 66b4b28b1a
CSI: node unmount from the client before unpublish RPC (#11892)
When an allocation stops, the `csi_hook` makes an unpublish RPC to the
servers to unpublish via the CSI RPCs: first to the node plugins and
then the controller plugins. The controller RPCs must happen after the
node RPCs so that the node has had a chance to unmount the volume
before the controller tries to detach the associated device.

But the client has local access to the node plugins and can
independently determine if it's safe to send unpublish RPC to those
plugins. This will allow the server to treat the node plugin as
abandoned if a client is disconnected and `stop_on_client_disconnect`
is set. This will let the server try to send unpublish RPCs to the
controller plugins, under the assumption that the client will be
trying to unmount the volume on its end first.

Note that the CSI `NodeUnpublishVolume`/`NodeUnstageVolume` RPCs can 
return ignorable errors in the case where the volume has already been
unmounted from the node. Handle all other errors by retrying until we
get success so as to give operators the opportunity to reschedule a
failed node plugin (ex. in the case where they accidentally drained a
node without `-ignore-system`). Fan-out the work for each volume into
its own goroutine so that we can release a subset of volumes if only
one is stuck.
2022-01-28 08:30:31 -05:00
Jai ff9d39a6b3
Merge pull request #11942 from hashicorp/f-ui/test-tooling
ui:  test tooling
2022-01-27 11:21:23 -05:00
Seth Hoenig 56bf1e4e7b
Merge pull request #11951 from hashicorp/b-cgroups-broken-part1-oss
client: change test to not poke cgroupv2 edge case
2022-01-27 10:06:03 -06:00
Tim Gross b20a6c9ffb
CSI: move terminal alloc handling into denormalization (#11931)
* The volume claim GC method and volumewatcher both have logic
collecting terminal allocations that duplicates most of the logic
that's now in the state store's `CSIVolumeDenormalize` method. Copy
this logic into the state store so that all code paths have the same
view of the past claims.
* Remove logic in the volume claim GC that now lives in the state
store's `CSIVolumeDenormalize` method.
* Remove logic in the volumewatcher that now lives in the state
store's `CSIVolumeDenormalize` method.
* Remove logic in the node unpublish RPC that now lives in the state
store's `CSIVolumeDenormalize` method.
2022-01-27 10:39:08 -05:00
Tim Gross a40a20cff8
csi: ensure that PastClaims are populated with correct mode (#11932)
In the client's `(*csiHook) Postrun()` method, we make an unpublish
RPC that includes a claim in the `CSIVolumeClaimStateUnpublishing`
state and using the mode from the client. But then in the
`(*CSIVolume) Unpublish` RPC handler, we query the volume from the
state store (because we only get an ID from the client). And when we
make the client RPC for the node unpublish step, we use the _current
volume's_ view of the mode. If the volume's mode has been changed
before the old allocations can have their claims released, then we end
up making a CSI RPC that will never succeed.

Why does this code path get the mode from the volume and not the
claim? Because the claim written by the GC job in `(*CoreScheduler)
csiVolumeClaimGC` doesn't have a mode. Instead it just writes a claim
in the unpublishing state to ensure the volumewatcher detects a "past
claim" change and reaps all the claims on the volumes.

Fix this by ensuring that the `CSIVolumeDenormalize` creates past
claims for all nil allocations with a correct access mode set.
2022-01-27 10:05:41 -05:00
Tim Gross a2433e35fb
CSI: resolve invalid claim states (#11890)
* csi: resolve invalid claim states on read

It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.

* csi: make feasibility check errors more understandable

When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
2022-01-27 09:30:03 -05:00
Seth Hoenig cade04d3f6 client: change test to not poke cgroupv2 edge case
This PR tweaks the TestCpusetManager_AddAlloc unit test to not break
when being run on a machine using cgroupsv2. The behavior of writing
an empty cpuset.cpu changes in cgroupv2, where such a group now inherits
the value of its parent group, rather than remaining empty.

The test in question was written such that a task would consume all available
cores shared on an alloc, causing the empty set to be written to the shared
group, which works fine on cgroupsv1 but breaks on cgroupsv2. By adjusting
the test to consume only 1 core instead of all cores, it no longer triggers
that edge case.

The actual fix for the new cgroupsv2 behavior will be in #11933
2022-01-27 08:27:40 -06:00
Jai Bhagat e950300ecb fix: differentiate commands for circleci and local use 2022-01-27 09:19:03 -05:00
James Rasell 23cee5067e
Merge pull request #11940 from hashicorp/b-docs-add-client-reserved-cores
docs: add `cores` to client reserved config block.
2022-01-27 08:29:16 +01:00
Luiz Aoqui 5e9f4be2a1
ci: add semgrep (#11934) 2022-01-26 16:32:47 -05:00
André 518fc11dca
ui: move volume link to the source column and fix the link target (#11896)
The link target used the volume name instead of the volume id.
Fixes issue #11884.
2022-01-26 14:17:29 -05:00
Jai Bhagat d88f70265b ui: add local testing script 2022-01-26 13:36:26 -05:00
Jai Bhagat 7e46fb319e ui: replace qunit start tests with ember-exam start 2022-01-26 13:35:59 -05:00
Jai Bhagat df87911cb9 ui: allow parallel test-runs 2022-01-26 13:06:56 -05:00
Jai Bhagat 36b1681d6d ui: add ember-exam 2022-01-26 13:06:00 -05:00
Jai 2ebc512dfe
Merge pull request #11780 from hashicorp/f-ui/job-page-refactor
fix:   authorization bug for `job-client-status-summary`
2022-01-26 13:00:48 -05:00
Jai Bhagat 14adca3a1f ui: add npm script for running ember test server 2022-01-26 12:48:39 -05:00
Jai Bhagat 37886861e2 refact: extract setPolicy into utils 2022-01-26 12:06:18 -05:00
Derek Strickland b3c8ab9be7
Update IsEmpty to check for pre-1.2.4 fields (#11930) 2022-01-26 11:31:37 -05:00
Jai Bhagat 2249722ec6 refact: fix tests after contextual job page changes 2022-01-26 11:31:18 -05:00
Jai Bhagat be43976db0 ui: prettify remaining files 2022-01-26 11:28:21 -05:00
James Rasell a7f569d0e1
docs: add `cores` to client reserved config block. 2022-01-26 15:56:16 +01:00
Seth Hoenig b5c5d59fa3
Merge pull request #11927 from hashicorp/b-hcl1-sidecar_task-resources
connect: fix bug where sidecar_task.resources was ignored with hcl1
2022-01-26 06:32:52 -06:00
Seth Hoenig 86330e43c8 changelog: use pr number not issue number 2022-01-26 06:32:10 -06:00
dependabot[bot] 1ec7b64f6a
build(deps): bump github.com/mitchellh/copystructure from 1.1.1 to 1.2.0
Bumps [github.com/mitchellh/copystructure](https://github.com/mitchellh/copystructure) from 1.1.1 to 1.2.0.
- [Release notes](https://github.com/mitchellh/copystructure/releases)
- [Commits](https://github.com/mitchellh/copystructure/compare/v1.1.1...v1.2.0)

---
updated-dependencies:
- dependency-name: github.com/mitchellh/copystructure
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-01-25 16:43:36 +00:00
Seth Hoenig 3e4ac74335
Merge pull request #11920 from hashicorp/dependabot/go_modules/github.com/rs/cors-1.8.2
build(deps): bump github.com/rs/cors from 1.8.0 to 1.8.2
2022-01-25 10:42:24 -06:00
Seth Hoenig ffe7f87912 connect: fix bug where sidecar_task.resources was ignored with hcl1
The HCL1 parser did not respect connect.sidecar_task.resources if the
connect.sidecar_service block was not set (an optimiztion that no longer
makes sense with connect gateways).

Fixes #10899
2022-01-25 10:17:54 -06:00
Tim Gross 1dad0e597e
fix integer bounds checks (#11815)
* driver: fix integer conversion error

The shared executor incorrectly parsed the user's group into int32 and
then cast to uint32 without bounds checking. This is harmless because
an out-of-bounds gid will throw an error later, but it triggers
security and code quality scans. Parse directly to uint32 so that we
get correct error handling.

* helper: fix integer conversion error

The autopilot flags helper incorrectly parses a uint64 to a uint which
is machine specific size. Although we don't have 32-bit builds, this
sets off security and code quality scaans. Parse to the machine sized
uint.

* driver: restrict bounds of port map

The plugin server doesn't constrain the maximum integer for port
maps. This could result in a user-visible misconfiguration, but it
also triggers security and code quality scans. Restrict the bounds
before casting to int32 and return an error.

* cpuset: restrict upper bounds of cpuset values

Our cpuset configuration expects values in the range of uint16 to
match the expectations set by the kernel, but we don't constrain the
values before downcasting. An underflow could lead to allocations
failing on the client rather than being caught earlier. This also make
security and code quality scanners happy.

* http: fix integer downcast for per_page parameter

The parser for the `per_page` query parameter downcasts to int32
without bounds checking. This could result in underflow and
nonsensical paging, but there's no server-side consequences for
this. Fixing this will silence some security and code quality scanners
though.
2022-01-25 11:16:48 -05:00
James Rasell c93c292dca
Merge pull request #11907 from hashicorp/f-state-store-nomad-file
state: move restore functionality into its own file.
2022-01-25 08:55:49 +01:00
dependabot[bot] c8443011a8
build(deps): bump github.com/rs/cors from 1.8.0 to 1.8.2
Bumps [github.com/rs/cors](https://github.com/rs/cors) from 1.8.0 to 1.8.2.
- [Release notes](https://github.com/rs/cors/releases)
- [Commits](https://github.com/rs/cors/compare/v1.8.0...v1.8.2)

---
updated-dependencies:
- dependency-name: github.com/rs/cors
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-01-24 20:35:06 +00:00
Seth Hoenig 2d950f8403
Merge pull request #11918 from hashicorp/deps-update-api-deps
deps: update api go version and dependencies
2022-01-24 14:33:04 -06:00
Seth Hoenig 5452b972ef
Merge pull request #11883 from hashicorp/dependabot/go_modules/github.com/prometheus/client_golang-1.12.0
build(deps): bump github.com/prometheus/client_golang from 1.7.1 to 1.12.0
2022-01-24 12:26:50 -06:00
Seth Hoenig ef9b84ad82 deps: update api go version and dependencies
This PR sets the minimum Go version for the `api` submodule to Go 1.17.

It also upgrades
 - gorilla/websocket 1.4.1 -> 1.4.2
 - mitchelh/mapstructure 1.4.2 -> 1.4.3
 - stretchr/testify 1.5.1 -> 1.7.0

Closes #11518 #11602 #11528
2022-01-24 12:23:26 -06:00
Seth Hoenig 0e638b6014
Merge pull request #11836 from hashicorp/dependabot/go_modules/github.com/hashicorp/memberlist-0.3.1
chore(deps): bump github.com/hashicorp/memberlist from 0.2.2 to 0.3.1
2022-01-24 11:56:18 -06:00
Tim Gross 04977525dd
csi: update leader's ACL in volumewatcher (#11891)
The volumewatcher that runs on the leader needs to make RPC calls
rather than writing to raft (as we do in the deploymentwatcher)
because the unpublish workflow needs to make RPC calls to the
clients. This requires that the volumewatcher has access to the
leader's ACL token.

But when leadership transitions, the new leader creates a new leader
ACL token. This ACL token needs to be passed into the volumewatcher
when we enable it, otherwise the volumewatcher can find itself with a
stale token.
2022-01-24 11:49:50 -05:00
Dan Norris 160682cf2b
docs: Update volume create/register mount options to use []string example (#11912)
The examples for `nomad volume create` and `nomad volume register` are
not setting `mount_flags` using an array of strings.

This fixes the issue by changing the example to be `mount_flags =
["noatime"]`.
2022-01-24 11:34:21 -05:00
Seth Hoenig 0030424384
Merge pull request #11889 from hashicorp/build-update-circle
build: upgrade circleci configuration
2022-01-24 10:18:21 -06:00
Luiz Aoqui dd3b01ffcd
Merge pull request #11876 from hashicorp/e2e-fix-consul-tls
e2e: enable Consul HTTPS port and always restart Nomad systemd unit
2022-01-24 11:18:09 -05:00
Seth Hoenig c220d3009b
Merge pull request #11910 from hashicorp/deps-update-containernetworking
deps: upgrade containernetworking/plugins
2022-01-24 10:14:50 -06:00
Seth Hoenig 8a96e5d567 deps: add missing cl note 2022-01-24 10:13:13 -06:00
Jai Bhagat 6f32b9d62e fix: no longer need gotoTaskGroup prop 2022-01-24 11:11:05 -05:00
Jai Bhagat 7c3ace8c41 test: add tests for not auth behavior for job-client-status-summary 2022-01-24 11:08:33 -05:00
Jai Bhagat 5495e9f7bc fix: auth node requests with mirage 2022-01-24 11:07:17 -05:00
Jai Bhagat 0b368e43eb fix: protect route if not auth 2022-01-24 11:07:17 -05:00
Jai Bhagat d075d6f7cb feat: add conditional rendering logic to template for not auth concern 2022-01-24 11:07:15 -05:00
Jai Bhagat fe63af5484 fix: prevent async request for node relationship on alloc 2022-01-24 11:06:30 -05:00
Jai Bhagat 9bb3e7a4f7 fix: update component props for glimmer syntax 2022-01-24 11:06:10 -05:00
Jai Bhagat cc8c9672e7 fix: update conditional rendering of clients tab 2022-01-24 11:06:08 -05:00
Jai Bhagat c3a305f807 fix: move node loading to jobs.job route and add auth logic 2022-01-24 11:05:50 -05:00
Luiz Aoqui ee83fb8bb1 ui: add "read client" ability 2022-01-24 11:05:48 -05:00