When an allocation stops, the `csi_hook` makes an unpublish RPC to the
servers, which unpublish via the CSI RPCs: first to the node plugins and
then to the controller plugins. The controller RPCs must happen after the
node RPCs so that the node has had a chance to unmount the volume
before the controller tries to detach the associated device.
But the client has local access to the node plugins and can
independently determine whether it's safe to send unpublish RPCs to those
plugins. This will allow the server to treat the node plugin as
abandoned if a client is disconnected and `stop_on_client_disconnect`
is set, and to try sending unpublish RPCs to the controller plugins
under the assumption that the client will be trying to unmount the
volume on its end first.
Note that the CSI `NodeUnpublishVolume`/`NodeUnstageVolume` RPCs can
return ignorable errors in the case where the volume has already been
unmounted from the node. Handle all other errors by retrying until we
get success, so as to give operators the opportunity to reschedule a
failed node plugin (e.g. in the case where they accidentally drained a
node without `-ignore-system`). Fan out the work for each volume into
its own goroutine so that we can release a subset of volumes even if
only one is stuck.
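A minimal sketch of that fan-out, assuming a hypothetical `unmount` helper and an illustrative "already unmounted" error string (the real hook wires this into the client's CSI plugin manager):

```go
package sketch

import (
	"context"
	"strings"
	"sync"
	"time"
)

// unmountAll fans the node-unmount work out into one goroutine per
// volume so a single stuck volume can't block releasing the others.
func unmountAll(ctx context.Context, volIDs []string, unmount func(string) error) {
	var wg sync.WaitGroup
	for _, id := range volIDs {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			for {
				err := unmount(id)
				// Treat "already unmounted" as success; the error text
				// matched here is illustrative, not a real plugin message.
				if err == nil || strings.Contains(err.Error(), "volume not mounted") {
					return
				}
				// Retry everything else so operators get the chance to
				// reschedule a failed node plugin.
				select {
				case <-ctx.Done():
					return
				case <-time.After(time.Second):
				}
			}
		}(id)
	}
	wg.Wait()
}
```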
* The volume claim GC method and volumewatcher both have logic for
collecting terminal allocations that duplicates most of the logic
that's now in the state store's `CSIVolumeDenormalize` method. Copy
this logic into the state store so that all code paths have the same
view of the past claims.
* Remove logic in the volume claim GC that now lives in the state
store's `CSIVolumeDenormalize` method.
* Remove logic in the volumewatcher that now lives in the state
store's `CSIVolumeDenormalize` method.
* Remove logic in the node unpublish RPC that now lives in the state
store's `CSIVolumeDenormalize` method.
In the client's `(*csiHook) Postrun()` method, we make an unpublish
RPC that includes a claim in the `CSIVolumeClaimStateUnpublishing`
state, using the mode from the client. But then in the
`(*CSIVolume) Unpublish` RPC handler, we query the volume from the
state store (because we only get an ID from the client), and when we
make the client RPC for the node unpublish step, we use the _current
volume's_ view of the mode. If the volume's mode has been changed
before the old allocations have their claims released, we end
up making a CSI RPC that will never succeed.
Why does this code path get the mode from the volume and not the
claim? Because the claim written by the GC job in `(*CoreScheduler)
csiVolumeClaimGC` doesn't have a mode. Instead it just writes a claim
in the unpublishing state to ensure the volumewatcher detects a "past
claim" change and reaps all the claims on the volumes.
Fix this by ensuring that `CSIVolumeDenormalize` creates past
claims for all nil allocations, with the correct access mode set.
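A minimal sketch of the shape of that fix, with simplified stand-ins for the state store's types (the real `CSIVolumeDenormalize` handles more cases):

```go
package sketch

// ClaimState is a simplified stand-in for the claim lifecycle states.
type ClaimState int

const CSIVolumeClaimStateUnpublishing ClaimState = iota

// CSIVolumeClaim and CSIVolume are simplified stand-ins for the state
// store's types.
type CSIVolumeClaim struct {
	AllocationID string
	AccessMode   string
	State        ClaimState
}

type CSIVolume struct {
	ReadClaims  map[string]*CSIVolumeClaim
	WriteClaims map[string]*CSIVolumeClaim
	PastClaims  map[string]*CSIVolumeClaim
}

// denormalizePastClaims registers a past claim, in the unpublishing
// state and carrying the claim's original access mode, for every claim
// whose allocation no longer exists.
func denormalizePastClaims(vol *CSIVolume, allocExists func(id string) bool) {
	if vol.PastClaims == nil {
		vol.PastClaims = map[string]*CSIVolumeClaim{}
	}
	for _, claims := range []map[string]*CSIVolumeClaim{vol.ReadClaims, vol.WriteClaims} {
		for allocID, claim := range claims {
			if !allocExists(allocID) {
				vol.PastClaims[allocID] = &CSIVolumeClaim{
					AllocationID: allocID,
					AccessMode:   claim.AccessMode, // keep the mode the claim was made with
					State:        CSIVolumeClaimStateUnpublishing,
				}
			}
		}
	}
}
```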
* csi: resolve invalid claim states on read
It's currently possible for CSI volumes to be claimed by allocations
that no longer exist. This changeset asserts a reasonable state at
the state store level by registering these nil allocations as "past
claims" on any read. This will cause any pass through the periodic GC
or volumewatcher to trigger the unpublishing workflow for those claims.
* csi: make feasibility check errors more understandable
When the feasibility checker finds we have no free write claims, it
checks to see if any of those claims are for the job we're currently
scheduling (so that earlier versions of a job can't block claims for
new versions) and reports a conflict if the volume can't be scheduled
so that the user can fix their claims. But when the checker hits a
claim that has a GC'd allocation, the state is recoverable by the
server once claim reaping completes and no user intervention is
required; the blocked eval should complete. Differentiate the
scheduler errors produced by these two conditions.
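A hedged sketch of that differentiation, with hypothetical error names (the scheduler's actual constants and messages differ):

```go
package sketch

import "errors"

// Hypothetical sentinel errors for the two conditions described above.
var (
	// Needs user intervention: a live allocation holds the claim.
	errVolumeMaxClaims = errors.New("volume max claims reached")

	// Recoverable by the server: the claim points at a GC'd allocation,
	// so claim reaping will free it and the blocked eval can proceed.
	errVolumeClaimsPendingGC = errors.New("volume claims pending reaping by garbage collection")
)

// checkWriteClaim distinguishes the unrecoverable conflict from the
// transient one so the two show up differently in scheduler output.
func checkWriteClaim(hasFreeWriteClaims, claimedByGCdAlloc bool) error {
	if hasFreeWriteClaims {
		return nil
	}
	if claimedByGCdAlloc {
		return errVolumeClaimsPendingGC
	}
	return errVolumeMaxClaims
}
```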
This PR tweaks the `TestCpusetManager_AddAlloc` unit test so it doesn't break
when run on a machine using cgroupsv2. The behavior of writing
an empty `cpuset.cpus` changes in cgroupsv2, where such a group now inherits
the value of its parent group, rather than remaining empty.
The test in question was written such that a task would consume all cores
available to the alloc, causing the empty set to be written to the shared
group, which works fine on cgroupsv1 but breaks on cgroupsv2. By adjusting
the test to consume only 1 core instead of all cores, it no longer triggers
that edge case.
The actual fix for the new cgroupsv2 behavior will be in #11933
The HCL1 parser did not respect `connect.sidecar_task.resources` if the
`connect.sidecar_service` block was not set (an optimization that no longer
makes sense with Connect gateways).
Fixes #10899
* driver: fix integer conversion error
The shared executor incorrectly parsed the user's group ID into an int32 and
then cast it to uint32 without bounds checking. This is harmless because
an out-of-bounds gid will cause an error later, but it triggers
security and code quality scans. Parse directly to uint32 so that we
get correct error handling.
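As an illustration of that pattern (the function name is hypothetical), parsing at the target width lets `strconv` do the bounds checking:

```go
package sketch

import (
	"fmt"
	"strconv"
)

// parseGid parses a numeric group ID directly at 32-bit width, so
// out-of-range input is rejected by strconv instead of silently
// wrapping on a later cast.
func parseGid(s string) (uint32, error) {
	v, err := strconv.ParseUint(s, 10, 32) // errors if the value exceeds MaxUint32
	if err != nil {
		return 0, fmt.Errorf("invalid gid %q: %w", s, err)
	}
	return uint32(v), nil // safe: ParseUint already bounds-checked it
}
```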
* helper: fix integer conversion error
The autopilot flags helper incorrectly parses a value as a uint64 and
casts it to uint, which is a machine-specific size. Although we don't
have 32-bit builds, this sets off security and code quality scans.
Parse directly to the machine-sized uint instead.
* driver: restrict bounds of port map
The plugin server doesn't constrain the maximum integer for port
maps. This could result in a user-visible misconfiguration, but it
also triggers security and code quality scans. Restrict the bounds
before casting to int32 and return an error.
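A sketch of that bounds check, assuming ports are capped at the uint16 maximum (the same check-before-downcast pattern applies to the cpuset values below):

```go
package sketch

import (
	"fmt"
	"math"
)

// validatePort (an illustrative helper, not the plugin's API) rejects
// out-of-range port mappings before the narrowing cast to int32.
func validatePort(port int) (int32, error) {
	if port < 1 || port > math.MaxUint16 {
		return 0, fmt.Errorf("port %d is out of range [1, %d]", port, math.MaxUint16)
	}
	return int32(port), nil // safe after the range check
}
```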
* cpuset: restrict upper bounds of cpuset values
Our cpuset configuration expects values in the range of uint16 to
match the expectations set by the kernel, but we don't constrain the
values before downcasting. An underflow could lead to allocations
failing on the client rather than being caught earlier. This also makes
security and code quality scanners happy.
* http: fix integer downcast for per_page parameter
The parser for the `per_page` query parameter downcasts to int32
without bounds checking. This could result in underflow and
nonsensical paging, but there are no server-side consequences.
Fixing this will silence some security and code quality scanners,
though.
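For illustration (the helper name is hypothetical), parsing `per_page` at 32-bit width turns out-of-range input into a client error rather than a wrapped value:

```go
package sketch

import (
	"fmt"
	"net/url"
	"strconv"
)

// parsePerPage parses the per_page query parameter at 32-bit width so
// out-of-range input becomes a client error rather than wrapping into
// a nonsensical page size.
func parsePerPage(query url.Values) (int32, error) {
	raw := query.Get("per_page")
	if raw == "" {
		return 0, nil // no paging requested
	}
	v, err := strconv.ParseInt(raw, 10, 32) // rejects values outside int32
	if err != nil {
		return 0, fmt.Errorf("invalid per_page %q: %w", raw, err)
	}
	return int32(v), nil
}
```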
This PR sets the minimum Go version for the `api` submodule to Go 1.17.
It also upgrades
- gorilla/websocket 1.4.1 -> 1.4.2
- mitchellh/mapstructure 1.4.2 -> 1.4.3
- stretchr/testify 1.5.1 -> 1.7.0
Closes #11518 #11602 #11528
The volumewatcher that runs on the leader needs to make RPC calls
rather than writing to raft (as we do in the deploymentwatcher)
because the unpublish workflow needs to make RPC calls to the
clients. This requires that the volumewatcher has access to the
leader's ACL token.
But when leadership transitions, the new leader creates a new leader
ACL token. This ACL token needs to be passed into the volumewatcher
when we enable it, otherwise the volumewatcher can find itself with a
stale token.
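A minimal sketch of the shape of the fix, with simplified types and signatures:

```go
package sketch

import "sync"

// VolumesWatcher is a simplified stand-in for the leader's watcher.
type VolumesWatcher struct {
	mu        sync.Mutex
	enabled   bool
	leaderACL string
}

// SetEnabled is invoked on every leadership transition. Passing the
// current leader ACL token here, instead of capturing it once at
// construction, keeps the watcher's client RPCs from using a stale
// token after leadership moves.
func (w *VolumesWatcher) SetEnabled(enabled bool, leaderACL string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.enabled = enabled
	w.leaderACL = leaderACL
	// ... start or stop the watcher goroutines as needed ...
}
```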
The examples for `nomad volume create` and `nomad volume register` do
not set `mount_flags` using an array of strings. This fixes the issue
by changing the examples to `mount_flags = ["noatime"]`.