open-nomad

Author	SHA1	Message	Date
Danielle Lancashire	e75f057df3	csi: Fix Controller RPCs Currently the handling of CSINode RPCs does not correctly forward RPCs to Nodes. This commit fixes this by introducing a shim RPC (nomad/client_csi_endpoint) that will correctly forward the request to the owning node, or submit the RPC to the client. In the process it also cleans up handling a little bit by adding the `CSIControllerQuery` embedded struct for required forwarding state. The CSIControllerQuery embedding the requirement of a `PluginID` also means we could move node targeting into the shim RPC if wanted in the future.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	d5e255f97a	client: Rename ClientCSI -> CSIController	2020-03-23 13:58:30 -04:00
Danielle Lancashire	5b05baf9f6	csi: Add /dev mounts to CSI Plugins CSI Plugins that manage devices need not just access to the CSI directory, but also to manage devices inside `/dev`. This commit introduces a `/dev:/dev` mount to the container so that they may do so.	2020-03-23 13:58:30 -04:00
Tim Gross	8bc5641438	csi: volume claim garbage collection (#7125) When an alloc is marked terminal (and after node unstage/unpublish have been called), the client syncs the terminal alloc state with the server via the `Node.UpdateAlloc` RPC. For each job that has a terminal alloc, the `Node.UpdateAlloc` RPC handler at the server will emit an eval for a new core job to garbage collect CSI volume claims. When this eval is handled on the core scheduler, it will call a `volumeReap` method to release the claims for all terminal allocs on the job. The volume reap will issue a `ControllerUnpublishVolume` RPC for any node that has no alloc claiming the volume. Once this returns (or is skipped), the volume reap will send a new `CSIVolume.Claim` RPC that releases the volume claim for that allocation in the state store, making it available for scheduling again. This same `volumeReap` method will be called from the core job GC, which gives us a second chance to reclaim volumes during GC if there were controller RPC failures.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	6fc7f7779d	csimanager/volume: Update MountVolume docstring	2020-03-23 13:58:30 -04:00
Danielle Lancashire	cd5b4923d0	api: Register CSIPlugin before registering a Volume	2020-03-23 13:58:30 -04:00
Danielle Lancashire	1b70fb1398	hook resources: Init with empty resources during setup	2020-03-23 13:58:30 -04:00
Danielle Lancashire	511b7775a6	csi: Claim CSI Volumes during csi_hook.Prerun This commit is the initial implementation of claiming volumes from the server and passes through any publishContext information as appropriate. There's nothing too fancy here.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	9d4307a3ef	csi_endpoint: Provide AllocID in req, and return Volume Currently, the client has to ship an entire allocation to the server as part of performing a VolumeClaim. This has a few problems: Firstly, it means the client is sending significantly more data than is required (an allocation contains the entire contents of a Nomad job, alongside other irrelevant state) which has a non-zero (de)serialization cost. Secondly, because the allocation was never re-fetched from the state store, it means that we were potentially open to issues caused by stale state on a misbehaving or malicious client. The change removes both of those issues at the cost of a couple more state store lookups, but they should be relatively cheap. We also now provide the CSIVolume in the response for a claim, so the client can perform a Claim without first fetching all of the volumes.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	f79351915c	csi: Basic volume usage tracking	2020-03-23 13:58:30 -04:00
Danielle Lancashire	0203341033	csi: Add comment to UsageOptions.ToFS()	2020-03-23 13:58:30 -04:00
Danielle Lancashire	c3b1154703	csi: Validate Volumes during registration This PR implements some initial support for doing deeper validation of a volume during its registration with the server. In most cases this allows us to validate the capabilities before users attempt to use the volumes, and it also prevents registering volumes without first setting up a plugin, which should help to catch typos and the like during registration. This does have the downside of requiring users to wait for one instance of a plugin to be running in their cluster before they can register volumes.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	9f1a076bd5	client: Implement ClientCSI.ControllerValidateVolume	2020-03-23 13:58:30 -04:00
Danielle Lancashire	34acb596e3	plugins/csi: Implement ControllerValidateCapabilities RPC	2020-03-23 13:58:30 -04:00
Danielle Lancashire	6b7ee96a88	csi: Move VolumeCapabilities helper to package	2020-03-23 13:58:30 -04:00
Danielle Lancashire	e227f31584	sched/feasible: Return more detailed CSI Failure messages	2020-03-23 13:58:30 -04:00
Danielle Lancashire	da4f6b60a2	csi: Pass through usage options to the csimanager The CSI Spec requires us to attach and stage volumes based on different types of usage information when it may affect how they are bound. Here we pass through some basic usage options in the CSI Hook (specifically the volume alias's ReadOnly field), and the attachment/access mode from the volume. We pass the attachment/access mode separately from the volume as it simplifies some handling and doesn't necessarily force every attachment to use the same mode should more be supported (i.e., if we let each `volume "foo" {}` specify an override in the future).	2020-03-23 13:58:30 -04:00
Danielle Lancashire	a62a90e03c	csi: Unpublish volumes during ar.Postrun This commit introduces initial support for unmounting csi volumes. It takes a relatively simplistic approach to performing NodeUnpublishVolume calls, optimising for cleaning up any leftover state rather than terminating early in the case of errors. This is because it happens during an allocation's shutdown flow and may not always have a corresponding call to `NodePublishVolume` that succeeded.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	6762442199	csiclient: Add grpc.CallOption support to NodeUnpublishVolume	2020-03-23 13:58:30 -04:00
Danielle Lancashire	6665bdec2e	taskrunner/volume_hook: Cleanup arg order of prepareHostVolumes	2020-03-23 13:58:30 -04:00
Danielle Lancashire	8692ca86bb	taskrunner/volume_hook: Mounts for CSI Volumes This commit implements support for creating driver mounts for CSI Volumes. It works by fetching the created mounts from the allocation resources and then iterating through the volume requests, creating driver mount configs as required. It's a little bit messy primarily because there's _so_ much terminology overlap and it's a bit difficult to follow.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	7a33864edf	volume_hook: Loosen validation in host volume prep	2020-03-23 13:58:30 -04:00
Danielle Lancashire	d8334cf884	allocrunner: Push state from hooks to taskrunners This commit is an initial (read: janky) approach to forwarding state from an allocrunner hook to a taskrunner using a similar `hookResources` approach that taskrunners use internally. It should eventually probably be replaced with something a little bit more message based, but for things that only come from pre-run hooks, and don't change, it's probably fine for now.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	3ef41fbb86	csi_hook: Stage/Mount volumes as required This commit introduces the first stage of volume mounting for an allocation. The csimanager.VolumeMounter interface manages the blocking and actual minutiae of the CSI implementation, allowing this hook to do the minimal work of volume retrieval and creating mount info. In the future the `CSIVolume.Get` request should be replaced by `CSIVolume.Claim(Batch?)` to minimize the number of RPCs and to handle external triggering of a ControllerPublishVolume request as required. We also need to ensure that if pre-run hooks fail, we still get a full unwinding of any published and staged volumes to ensure that there are no hanging references to volumes. That is not handled in this commit.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	4a2492ecb1	client: Pass an RPC Client to AllocRunners As part of introducing support for CSI, AllocRunner hooks need to be able to communicate with Nomad Servers for validation of and interaction with storage volumes. Here we create a small RPCer interface and pass the client (rpc client) to the AR in preparation for making these RPCs.	2020-03-23 13:58:30 -04:00
Tim Gross	b03b78b212	csi: server-to-controller publish/unpublish RPCs (#7124) Nomad servers need to make requests to CSI controller plugins running on a client for publish/unpublish. The RPC needs to look up the client node based on the plugin, load balancing across controllers, and then perform the required client RPC to that node (via server forwarding if necessary).	2020-03-23 13:58:30 -04:00
Tim Gross	b9b315f8d1	csi: stub methods for server-to-controller RPC calls (#7117)	2020-03-23 13:58:30 -04:00
Danielle Lancashire	77bcaa8183	csi_endpoint: Support No ACLs and restrict Nodes This commit refactors the ACL code for the CSI endpoint to support environments that run without ACLs enabled (e.g., developer environments) and also provides an easy way to restrict which endpoints may be accessed with a client's SecretID to limit the blast radius of a malicious client on the state of the environment.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	a2e01c4369	sched/feasible: Validate CSIVolumes correctly Previously we were looking up plugins based on the Alias Name for a CSI Volume within the context of its task group. Here we first look up a volume based on its identifier and then validate the existence of the plugin based on its `PluginID`.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	22e8317a53	csi: Disable validation of volume topology	2020-03-23 13:58:30 -04:00
Danielle Lancashire	15c6c05ccf	api: Parse CSI Volumes Previously when deserializing volumes we skipped over volumes that were not of type `host`. This commit ensures that we parse both host and csi volumes correctly.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	e56c677221	sched/feasible: CSI - Filter applicable volumes This commit filters the job's volumes when setting them on the feasibility checker. This ensures that the rest of the checker does not have to worry about non-csi volumes.	2020-03-23 13:58:30 -04:00
Tim Gross	01c704ab9d	csi: add PublishContext to CSIVolumeClaimResponse (#7113) The `ControllerPublishVolumeResponse` CSI RPC includes the publish context intended to be passed by the orchestrator as an opaque value to the node plugins. This changeset adds it to our response to a volume claim request to proxy the controller's response back to the client node.	2020-03-23 13:58:29 -04:00
Tim Gross	60901fa764	csi: implement CSI controller detach request/response (#7107) This changeset implements the minimal structs on the client-side we need to compile the work-in-progress implementation of the server-to-controller RPCs. It doesn't include implementing the `ClientCSI.DettachVolume` RPC on the client.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	f77d3813d1	csi: Fix broken call to newVolumeManager	2020-03-23 13:58:29 -04:00
Danielle Lancashire	3bff9fefae	csi: Provide plugin-scoped paths during RPCs When providing paths to plugins, the path needs to be in the scope of the plugins container, rather than that of the host. Here we enable that by providing the mount point through the plugin registration and then use it when constructing request target paths.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	94e87fbe9c	csimanager: Cleanup volumemanager setup	2020-03-23 13:58:29 -04:00
Danielle Lancashire	ee85c468c0	csimanager: Instantiate fingerprint manager's csiclient	2020-03-23 13:58:29 -04:00
Tim Gross	fb1aad66ee	csi: implement releasing volume claims for terminal allocs (#7076) When an alloc is marked terminal, and after node unstage/unpublish have been called, the client will sync the terminal alloc state with the server via `Node.UpdateAlloc` RPC. This changeset implements releasing the volume claim for each volume associated with the terminal alloc. It doesn't yet implement the RPC call we need to make to the `ControllerUnpublishVolume` CSI RPC.	2020-03-23 13:58:29 -04:00
Tim Gross	d4cd272de3	csi: implement VolumeClaimRPC (#7048) When the client receives an allocation which includes a CSI volume, the alloc runner will block its main `Run` loop. The alloc runner will issue a `VolumeClaim` RPC to the Nomad servers. This changeset implements the portions of the `VolumeClaim` RPC endpoint that have not been previously completed.	2020-03-23 13:58:29 -04:00
Lang Martin	421d7ed2e4	nomad: csi_endpoint send register & deregister requests to raft (#7059)	2020-03-23 13:58:29 -04:00
Lang Martin	7b675f89ac	csi: fix index maintenance for CSIVolume and CSIPlugin tables (#7049) * state_store: csi volumes/plugins store the index in the txn * nomad: csi_endpoint_test require index checks need uint64() * nomad: other tests using int 0 not uint64(0) * structs: pass index into New, but not other struct methods * state_store: csi plugin indexes, use new struct interface * nomad: csi_endpoint_test check index/query meta (on explicit 0) * structs: NewCSIVolume takes an index arg now * scheduler/test: NewCSIVolume takes an index arg now	2020-03-23 13:58:29 -04:00
Danielle Lancashire	bbf6a9c14b	volume_manager: cleanup of mount detection No functional changes, but makes ensure.*Dir follow a nicer return style.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	80b7aa0a31	volume_manager: Add support for publishing volumes	2020-03-23 13:58:29 -04:00
Danielle Lancashire	a5c96ce2e1	csi: Add grpc.CallOption support to NodePublishVolume	2020-03-23 13:58:29 -04:00
Lang Martin	a0a6766740	CSI: Scheduler knows about CSI constraints and availability (#6995) * structs: piggyback csi volumes on host volumes for job specs * state_store: CSIVolumeByID always includes plugins, matches usecase * scheduler/feasible: csi volume checker * scheduler/stack: add csi volumes * contributing: update rpc checklist * scheduler: add volumes to State interface * scheduler/feasible: introduce new checker collection tgAvailable * scheduler/stack: taskGroupCSIVolumes checker is transient * state_store CSIVolumeDenormalizePlugins comment clarity * structs: remove TODO comment in TaskGroup Validate * scheduler/feasible: CSIVolumeChecker hasPlugins improve comment * scheduler/feasible_test: set t.Parallel * Update nomad/state/state_store.go Co-Authored-By: Danielle <dani@hashicorp.com> * Update scheduler/feasible.go Co-Authored-By: Danielle <dani@hashicorp.com> * structs: lift ControllerRequired to each volume * state_store: store plug.ControllerRequired, use it for volume health * feasible: csi match fast path remove stale host volume copied logic * scheduler/feasible: improve comments Co-authored-by: Danielle <dani@builds.terrible.systems>	2020-03-23 13:58:29 -04:00
Danielle Lancashire	e619ae5a42	volume_manager: Initial support for unstaging volumes	2020-03-23 13:58:29 -04:00
Danielle Lancashire	add55e37b8	csi: Expose gRPC Options on NodeUnstageVolume	2020-03-23 13:58:29 -04:00
Tim Gross	8673ea5cba	csi: add empty CSI volume publication GC to scheduled core jobs (#7014) This changeset adds a new core job `CoreJobCSIVolumePublicationGC` to the leader's loop for scheduling core job evals. Right now this is an empty method body without even a config file stanza. Later changesets will implement the logic of volume publication GC.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	6e71baa77d	volume_manager: NodeStageVolume Support This commit introduces support for staging volumes when a plugin implements the STAGE_UNSTAGE_VOLUME capability. See the following for further reference material: `4731db0e0b/spec.md (nodestagevolume)`	2020-03-23 13:58:29 -04:00
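
Commit 8bc5641438 describes the volume reap step only in prose. Below is a minimal Go sketch of that decision logic, under stated assumptions. `Claim`, `Volume`, `Unpublisher`, `reapVolume` and the `recorder` fake are hypothetical names for illustration only. They are not Nomad's actual types or its `volumeReap` implementation. The sketch also assumes that "no alloc claiming the volume" means no non-terminal alloc on that node, and it reduces the state store write and the `CSIVolume.Claim` release RPC to dropping the claim from a slice.

```go
package main

import "fmt"

// Claim is a hypothetical, simplified record of one alloc's claim on a CSI volume.
type Claim struct {
	AllocID  string
	NodeID   string
	Terminal bool // set once the client has reported the alloc as terminal
}

// Volume is a hypothetical, simplified CSI volume holding its current claims.
type Volume struct {
	ID     string
	Claims []Claim
}

// Unpublisher stands in (hypothetically) for the ControllerUnpublishVolume RPC path.
type Unpublisher interface {
	ControllerUnpublish(volID, nodeID string) error
}

// reapVolume releases the claims of terminal allocs. A node is unpublished only
// when no non-terminal alloc on that node still claims the volume. A claim is
// released only after its node's unpublish succeeded or was skipped. Claims on
// nodes whose unpublish failed are kept so that a later GC pass can retry them.
func reapVolume(vol *Volume, ctrl Unpublisher) error {
	liveNodes := map[string]bool{}
	for _, c := range vol.Claims {
		if !c.Terminal {
			liveNodes[c.NodeID] = true
		}
	}

	done := map[string]bool{}   // nodes unpublished during this pass
	failed := map[string]bool{} // nodes whose unpublish returned an error
	var kept []Claim
	var firstErr error
	for _, c := range vol.Claims {
		switch {
		case !c.Terminal:
			kept = append(kept, c) // alloc still running: keep its claim
			continue
		case liveNodes[c.NodeID] || done[c.NodeID]:
			// Skip unpublish: the node still has a live claim, or it was already unpublished.
		case failed[c.NodeID]:
			kept = append(kept, c) // don't retry in this pass; keep for the next GC
			continue
		default:
			if err := ctrl.ControllerUnpublish(vol.ID, c.NodeID); err != nil {
				failed[c.NodeID] = true
				if firstErr == nil {
					firstErr = err
				}
				kept = append(kept, c)
				continue
			}
			done[c.NodeID] = true
		}
		// Falling through to here releases the claim (it is not added to kept).
	}
	vol.Claims = kept
	return firstErr
}

// recorder is a hypothetical fake controller that records every unpublish call.
type recorder struct {
	calls []string
	fail  bool
}

func (r *recorder) ControllerUnpublish(volID, nodeID string) error {
	r.calls = append(r.calls, volID+"@"+nodeID)
	if r.fail {
		return fmt.Errorf("unpublish %s on %s failed", volID, nodeID)
	}
	return nil
}

func main() {
	vol := &Volume{ID: "vol1", Claims: []Claim{
		{AllocID: "a1", NodeID: "n1", Terminal: true},
		{AllocID: "a2", NodeID: "n1"},
		{AllocID: "a3", NodeID: "n2", Terminal: true},
	}}
	r := &recorder{}
	if err := reapVolume(vol, r); err != nil {
		fmt.Println("reap error:", err)
	}
	fmt.Println(r.calls, len(vol.Claims), vol.Claims[0].AllocID) // [vol1@n2] 1 a2
}
```

In the example, `a1` is released without a controller call because `a2` still uses node `n1`. `a3` triggers a single unpublish for `n2` before its claim is dropped. Keeping failed claims in place follows the commit's note that the core job GC gets "a second chance" after controller RPC failures.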


17425 commits