Commit graph

22277 commits

Author SHA1 Message Date
Seth Hoenig 437bb4b86d
client: check escaping of alloc dir using symlinks
This PR adds symlink resolution when doing validation of paths
to ensure they do not escape client allocation directories.
2022-02-09 19:50:13 -05:00
Seth Hoenig de078e7ac6
client: fix race condition in use of go-getter
go-getter creates a circular dependency between a Client and Getter,
which means each is inherently thread-unsafe if you try to re-use
on or the other.

This PR fixes Nomad to no longer make use of the default Getter objects
provided by the go-getter package. Nomad must create a new Client object
on every artifact download, as the Client object controls the Src and Dst
among other things. When Caling Client.Get, the Getter modifies its own
Client reference, creating the circular reference and race condition.

We can still achieve most of the desired connection caching behavior by
re-using a shared HTTP client with transport pooling enabled.
2022-02-09 19:48:28 -05:00
Nomad Release Bot cf7f0977ff
Release v1.2.5 2022-01-31 15:36:54 +00:00
Nomad Release bot 8af121bfbe Generate files for 1.2.5 release 2022-01-31 14:54:26 +00:00
Tim Gross 18c528313c docs: add 1.2.5 to changelog 2022-01-28 15:08:48 -05:00
Tim Gross 6af1b359ed docs: missing changelog for #11892 (#11959) 2022-01-28 15:08:48 -05:00
Tim Gross d8a74efb07 set LAST_RELEASE to 1.2.4 for the 1.2.5 release branch 2022-01-28 14:50:54 -05:00
Tim Gross 622ed093ae CSI: node unmount from the client before unpublish RPC (#11892)
When an allocation stops, the `csi_hook` makes an unpublish RPC to the
servers to unpublish via the CSI RPCs: first to the node plugins and
then the controller plugins. The controller RPCs must happen after the
node RPCs so that the node has had a chance to unmount the volume
before the controller tries to detach the associated device.

But the client has local access to the node plugins and can
independently determine if it's safe to send unpublish RPC to those
plugins. This will allow the server to treat the node plugin as
abandoned if a client is disconnected and `stop_on_client_disconnect`
is set. This will let the server try to send unpublish RPCs to the
controller plugins, under the assumption that the client will be
trying to unmount the volume on its end first.

Note that the CSI `NodeUnpublishVolume`/`NodeUnstageVolume` RPCs can 
return ignorable errors in the case where the volume has already been
unmounted from the node. Handle all other errors by retrying until we
get success so as to give operators the opportunity to reschedule a
failed node plugin (ex. in the case where they accidentally drained a
node without `-ignore-system`). Fan-out the work for each volume into
its own goroutine so that we can release a subset of volumes if only
one is stuck.
2022-01-28 14:43:58 -05:00
Tim Gross 5773fc93a2 CSI: move terminal alloc handling into denormalization (#11931)
* The volume claim GC method and volumewatcher both have logic
collecting terminal allocations that duplicates most of the logic
that's now in the state store's `CSIVolumeDenormalize` method. Copy
this logic into the state store so that all code paths have the same
view of the past claims.
* Remove logic in the volume claim GC that now lives in the state
store's `CSIVolumeDenormalize` method.
* Remove logic in the volumewatcher that now lives in the state
store's `CSIVolumeDenormalize` method.
* Remove logic in the node unpublish RPC that now lives in the state
store's `CSIVolumeDenormalize` method.
2022-01-28 14:43:50 -05:00
Tim Gross c67c31e543 csi: ensure that PastClaims are populated with correct mode (#11932)
In the client's `(*csiHook) Postrun()` method, we make an unpublish
RPC that includes a claim in the `CSIVolumeClaimStateUnpublishing`
state and using the mode from the client. But then in the
`(*CSIVolume) Unpublish` RPC handler, we query the volume from the
state store (because we only get an ID from the client). And when we
make the client RPC for the node unpublish step, we use the _current
volume's_ view of the mode. If the volume's mode has been changed
before the old allocations can have their claims released, then we end
up making a CSI RPC that will never succeed.

Why does this code path get the mode from the volume and not the
claim? Because the claim written by the GC job in `(*CoreScheduler)
csiVolumeClaimGC` doesn't have a mode. Instead it just writes a claim
in the unpublishing state to ensure the volumewatcher detects a "past
claim" change and reaps all the claims on the volumes.

Fix this by ensuring that the `CSIVolumeDenormalize` creates past
claims for all nil allocations with a correct access mode set.
2022-01-28 14:43:43 -05:00
Tim Gross 951661db04 CSI: resolve invalid claim states (#11890)
* csi: resolve invalid claim states on read

It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.

* csi: make feasibility check errors more understandable

When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GCd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler error produced by these two conditions.
2022-01-28 14:43:35 -05:00
Tim Gross 4e559c6255 csi: update leader's ACL in volumewatcher (#11891)
The volumewatcher that runs on the leader needs to make RPC calls
rather than writing to raft (as we do in the deploymentwatcher)
because the unpublish workflow needs to make RPC calls to the
clients. This requires that the volumewatcher has access to the
leader's ACL token.

But when leadership transitions, the new leader creates a new leader
ACL token. This ACL token needs to be passed into the volumewatcher
when we enable it, otherwise the volumewatcher can find itself with a
stale token.
2022-01-28 14:43:27 -05:00
Derek Strickland 460416e787 Update IsEmpty to check for pre-1.2.4 fields (#11930) 2022-01-28 14:41:49 -05:00
Nomad Release Bot 7b5ee8a3f3
Release v1.2.4 2022-01-19 00:22:47 +00:00
Nomad Release bot de3070d49a Generate files for 1.2.4 release 2022-01-18 23:43:00 +00:00
Luiz Aoqui 28f8cede4b
docs: add 1.2.4 to changelog 2022-01-18 18:31:34 -05:00
Luiz Aoqui 626e633b41
docs: add nomad.plan.node_rejected metric (#11860) 2022-01-18 13:47:20 -05:00
Luiz Aoqui e2e229f659
ui: fix test (#11870) 2022-01-18 10:36:10 -05:00
Dave May 330d24a873
cli: Add event stream capture to nomad operator debug (#11865) 2022-01-17 21:35:51 -05:00
Michael Schurter 99c863f909
cli: improve debug error messages (#11507)
Improves `nomad debug` error messages when contacting agents that do not
have /v1/agent/host endpoints (the endpoint was added in v0.12.0)

Part of #9568 and manually tested against Nomad v0.8.7.

Hopefully isRedirectError can be reused for more cases listed in #9568
2022-01-17 11:15:17 -05:00
Luiz Aoqui ed9f277925
docs: update 1.2.0 upgrade note now that the UI ACL is fixed (#11840) 2022-01-17 11:09:08 -05:00
Luiz Aoqui f981a1ed7e
docs: add HashiBox to the list of community tools (#11861) 2022-01-17 11:08:41 -05:00
Luiz Aoqui 57a61b50f4
changelog: add entry for #11793 (#11862) 2022-01-17 11:08:29 -05:00
James Rasell 29fb0ddb4e
Merge pull request #11849 from hashicorp/b-changelog-11848
changelog: add entry for #11848
2022-01-17 09:35:10 +01:00
Luiz Aoqui b1753d0568
scheduler: detect and log unexpected scheduling collisions (#11793) 2022-01-14 20:09:14 -05:00
Jai 7ad9ec1143
Merge pull request #11820 from hashicorp/f-ui/alloc-legend
feat:  add links to legend items in `allocation-summary`
2022-01-14 14:02:54 -05:00
Tim Gross d7756f8cdb
csi: volume deregistration should require exact ID (#11852)
The command line client sends a specific volume ID, but this isn't
enforced at the API level and we were incorrectly using a prefix match
for volume deregistration, resulting in cases where a volume with a
shorter ID that's a prefix of another volume would be deregistered
instead of the intended volume.
2022-01-14 12:26:03 -05:00
Tim Gross 33f7c6cba4
csi: when warning for multiple prefix matches, use full ID (#11853)
When the `volume deregister` or `volume detach` commands get an ID
prefix that matches multiple volumes, show the full length of the
volume IDs in the list of volumes shown so so that the user can select
the correct one.
2022-01-14 12:25:48 -05:00
Tim Gross 9c4864badd
freebsd: build fix for ARM7 32-bit (#11854)
The size of `stat_t` fields is architecture dependent, which was
reportedly causing a build failure on FreeBSD ARM7 32-bit
systems. This changeset matches the behavior we have on Linux.
2022-01-14 12:25:32 -05:00
Tim Gross 73d0779858
drivers: set world-readable permissions on copied resolv.conf (#11856)
When we copy the system DNS to a task's `resolv.conf`, we should set
the permissions as world-readable so that unprivileged users within
the task can read it.
2022-01-14 12:25:23 -05:00
Jai Bhagat 764936a768 chore: add changelog 2022-01-14 10:23:09 -05:00
Jai Bhagat 862bc0341c test: add test stories for clicking allocation summary 2022-01-14 10:23:09 -05:00
Jai Bhagat 337920263f refact: add data-test-selectors and correct css selectors in summary 2022-01-14 10:23:06 -05:00
Jai Bhagat 2268cc1d6c styling: remove clickable link text decoration override to match new mocks 2022-01-14 10:20:36 -05:00
Jai Bhagat dc85cd29a9 refact: allocation and child summaries into ember-cli-page-object components 2022-01-14 10:20:33 -05:00
Jai Bhagat 2dddb20471 fix: typo in data-test-selector 2022-01-14 10:19:01 -05:00
Jai Bhagat d058029f90 styling: update styling to match new figma mocks 2022-01-14 10:14:44 -05:00
Jai Bhagat fe9f35c587 feat: add clicking functionality to alloc status legend 2022-01-14 10:14:44 -05:00
James Rasell 10405e9c6b
changelog: add entry for #11848 2022-01-14 13:40:50 +01:00
James Rasell c2412a890d
Merge pull request #11842 from hashicorp/b-name-oss-files-consistently
chore: ensure consistent file naming for non-enterprise files.
2022-01-14 08:13:49 +01:00
James Rasell 82b168bf34
Merge pull request #11403 from hashicorp/f-gh-11059
agent/docs: add better clarification when top-level data dir needs setting
2022-01-13 16:41:35 +01:00
James Rasell 7205b3f08e
Merge pull request #11402 from hashicorp/document-client-initial-vault-renew
taskrunner: add clarifying initial vault token renew comment.
2022-01-13 16:21:58 +01:00
Luiz Aoqui d48e50da9a
Fix log level parsing from lines that include a timestamp (#11838) 2022-01-13 09:56:35 -05:00
Seth Hoenig cfb8152158
Merge pull request #11831 from hashicorp/mods-explain-pinned
mods: explain replace statements
2022-01-13 08:53:17 -06:00
James Rasell 1d5c5062f2
chore: ensure consistent file naming for non-enterprise files. 2022-01-13 11:32:16 +01:00
Luiz Aoqui c7ae13a1f3
Fix ACL requirements for job details UI (#11672) 2022-01-12 21:26:02 -05:00
Luiz Aoqui 7e6acf0e68
docs: fix autoscaling Datadog site configuration (#11824) 2022-01-12 21:06:30 -05:00
Michael Schurter ed77b51c3f
Merge pull request #11830 from hashicorp/b-validate-reserved-ports
agent: validate reserved_ports are valid
2022-01-12 17:12:30 -08:00
Michael Schurter 211ae8315a
Merge pull request #11833 from hashicorp/deps-go-getter-v1.5.11
deps: update go-getter to v1.5.11
2022-01-12 16:42:55 -08:00
Michael Schurter ebadaabc71 doc: add changelog for #11830 2022-01-12 14:21:47 -08:00