open-nomad

Commit Graph

Author	SHA1	Message	Date
Tim Gross	0856483115	CSI: fingerprint detailed node capabilities In order to support new node RPCs, we need to fingerprint plugin capabilities in more detail. This changeset mirrors recent work to fingerprint controller capabilities, but is not yet in use by any Nomad RPC.	2021-04-01 16:00:58 -04:00
Tim Gross	466b620fa4	CSI: volume snapshot	2021-04-01 11:16:52 -04:00
Tim Gross	9fc4cf1419	CSI: fingerprint detailed controller capabilities In order to support new controller RPCs, we need to fingerprint volume capabilities in more detail and perform controller RPCs only when the specific capability is present. This fixes a bug in Ceph support where the plugin can only support create/delete but we assume that it also supports attach/detach.	2021-03-31 16:37:09 -04:00
Tim Gross	d38008176e	CSI: create/delete/list volume RPCs This commit implements the RPC handlers on the client that talk to the CSI plugins on that client for the Create/Delete/List RPC.	2021-03-31 16:37:09 -04:00
Tim Gross	d97401f60e	CSI: protobuffer mappings for Create/Delete/List volume RPCs Note that unset proto fields for volume create should be nil. The CSI spec handles empty fields and nil fields in the protobuf differently, which may result in validation failures for creating volumes with no prior source (and does in testing with the AWS EBS plugin). Refactor the `CreateVolumeRequest` mapping to the protobuf in the plugin client to avoid this bug.	2021-03-31 16:37:09 -04:00
Tim Gross	29a5454894	csi: loosen ValidateVolumeCapability requirements (#9049) The CSI specification for `ValidateVolumeCapability` says that we shall "reconcile successful capability-validation responses by comparing the validated capabilities with those that it had originally requested" but leaves the details of that reconciliation unspecified. This API is not implemented in Kubernetes, so controller plugins don't have a real-world implementation to verify their behavior against. We have found that CSI plugins in the wild may return "successful" but incomplete `VolumeCapability` responses, so we can't require that all capabilities we expect have been validated, only that the ones that have been validated match. This appears to violate the CSI specification but until that's been resolved in upstream we have to loosen our validation requirements. The tradeoff is that we're more likely to have runtime errors during `NodeStageVolume` instead of at the time of volume registration.	2020-10-08 12:53:24 -04:00
Tim Gross	7d53ed88d6	csi: client RPCs should return wrapped errors for checking (#8605) When the client-side actions of a CSI client RPC succeed but we get disconnected during the RPC or we fail to checkpoint the claim state, we want to be able to retry the client RPC without getting blocked by the client-side state (ex. mount points) already having been cleaned up in previous calls.	2020-08-07 11:01:36 -04:00
Tim Gross	3d38592fbb	csi: add VolumeContext to NodeStage/Publish RPCs (#8239) In #7957 we added support for passing a volume context to the controller RPCs. This is an opaque map that's created by `CreateVolume` or, in Nomad's case, in the volume registration spec. However, we missed passing this field to the `NodeStage` and `NodePublish` RPCs, which prevents certain plugins (such as MooseFS) from making node RPCs.	2020-06-22 13:54:32 -04:00
Tim Gross	0f1946d395	csi: improve plugin error messages and volume validation (#7984) Some CSI plugins don't return much for errors over the gRPC socket above and beyond the bare minimum error codes. This changeset improves the operator experience by unpacking the error codes when available and wrapping the error with some user-friendly direction. Improving these errors also revealed a bad comparison with `require.Error` when `require.EqualError` should be used in the test code for plugin errors. This defect in turn was hiding a bug in volume validation where we're being overly permissive in allowing mount flags, which is now fixed.	2020-05-18 08:23:17 -04:00
Tim Gross	6a463dc13a	csi: use a blocking initial connection with timeout (#7965) The plugin supervisor lazily connects to plugins, but this means we only get "Unavailable" back from the gRPC call in cases where the plugin can never be reached (for example, if the Nomad client has the wrong permissions for the socket). This changeset improves the operator experience by switching to a blocking `DialWithContext`. It eagerly connects so that we can validate the connection is real and get a "failed to open" error in cases where Nomad can't establish the initial connection.	2020-05-15 08:17:11 -04:00
Tim Gross	2082cf738a	csi: support for VolumeContext and VolumeParameters (#7957) The MVP for CSI in the 0.11.0 release of Nomad did not include support for opaque volume parameters or volume context. This changeset adds support for both. This also moves args for ControllerValidateCapabilities into a struct. The CSI plugin `ControllerValidateCapabilities` struct that we turn into a CSI RPC is accumulating arguments, so moving it into a request struct will reduce the churn of this internal API, make the plugin code more readable, and make this method consistent with the other plugin methods in that package.	2020-05-15 08:16:01 -04:00
Tim Gross	24aa32c503	csi: use a blocking initial connection with timeout The plugin supervisor lazily connects to plugins, but this means we only get "Unavailable" back from the gRPC call in cases where the plugin can never be reached (for example, if the Nomad client has the wrong permissions for the socket). This changeset improves the operator experience by switching to a blocking `DialWithContext`. It eagerly connects so that we can validate the connection is real and get a "failed to open" error in cases where Nomad can't establish the initial connection.	2020-05-14 15:59:19 -04:00
Tim Gross	4f54a633a2	csi: refactor internal client field name to ExternalID (#7958) The CSI plugin RPCs require the use of the storage provider's volume ID, rather than the user-defined volume ID. Although changing the RPCs to use the field name `ExternalID` risks breaking backwards compatibility, we can use the `ExternalID` name internally for the client and only use `VolumeID` at the RPC boundaries.	2020-05-14 11:56:07 -04:00
Tim Gross	4374c1a837	csi: support Secrets parameter in CSI RPCs (#7923) CSI plugins can require credentials for some publishing and unpublishing workflow RPCs. Secrets are configured at the time of volume registration, stored in the volume struct, and then passed around as an opaque map by Nomad to the plugins.	2020-05-11 17:12:51 -04:00
Tim Gross	3cca738478	csi: fix mount validation (#7869) Several of the CSI `VolumeCapability` methods return pointers, which we were then comparing to pointers in the request rather than dereferencing them and comparing their contents. This changeset does a more fine-grained comparison of the request vs the capabilities, and adds better error messaging.	2020-05-05 15:13:07 -04:00
Tim Gross	cbae10333c	csi: check returned volume capability validation (#7831) This changeset corrects handling of the `ValidationVolumeCapabilities` response: * The CSI spec for the `ValidationVolumeCapabilities` requires that plugins only set the `Confirmed` field if they've validated all capabilities. The Nomad client improperly assumes that the lack of a `Confirmed` field should be treated as a failure. This breaks the Azure and Linode block storage plugins, which don't set this optional field. * The CSI spec also requires that the orchestrator check the validation responses to guard against older versions of a plugin reporting "valid" for newer fields it doesn't understand.	2020-04-30 17:12:32 -04:00
Tim Gross	f3bae55fae	set safe default for CSI plugin MaxVolumes (#7583)	2020-04-01 11:08:55 -04:00
Tim Gross	6ffd36c4e5	csi: add grpc retries to client controller RPCs (#7549) The CSI Specification defines various gRPC Errors and how they may be retried. After auditing all our CSI RPC calls in #6863, this changeset: * adds retries and backoffs to the RPCs where they were needed but not implemented * annotates those CSI RPCs that do not need retries so that we don't wonder whether it's been left off accidentally * adds a timeout and cancellation context to the `Probe` call, which didn't have one.	2020-03-30 16:26:03 -04:00
Lang Martin	e100444740	csi: add mount_options to volumes and volume requests (#7398) Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host. Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context. closes #7007 * nomad/structs/volumes: add MountOptions to volume request * jobspec/test-fixtures/basic.hcl: add mount_options to volume block * jobspec/parse_test: add expected MountOptions * api/tasks: add mount_options * jobspec/parse_group: use hcl decode not mapstructure, mount_options * client/allocrunner/csi_hook: pass MountOptions through client/allocrunner/csi_hook: add a VolumeMountOptions client/allocrunner/csi_hook: drop Options client/allocrunner/csi_hook: use the structs options * client/pluginmanager/csimanager/interface: UsageOptions.MountOptions * client/pluginmanager/csimanager/volume: pass MountOptions in capabilities * plugins/csi/plugin: remove todo 7007 comment * nomad/structs/csi: MountOptions * api/csi: add options to the api for parsing, match structs * plugins/csi/plugin: move VolumeMountOptions to structs * api/csi: use specific type for mount_options * client/allocrunner/csi_hook: merge MountOptions here * rename CSIOptions to CSIMountOptions * client/allocrunner/csi_hook * client/pluginmanager/csimanager/volume * nomad/structs/csi * plugins/csi/fake/client: add PrevVolumeCapability * plugins/csi/plugin * client/pluginmanager/csimanager/volume_test: remove debugging * client/pluginmanager/csimanager/volume: fix odd merging logic * api: rename CSIOptions -> CSIMountOptions * nomad/csi_endpoint: remove a 7007 comment * command/alloc_status: show mount options in the volume list * nomad/structs/csi: include MountOptions in the volume stub * api/csi: add MountOptions to stub * command/volume_status_csi: clean up csiVolMountOption, add it * command/alloc_status: csiVolMountOption lives in volume_csi_status * command/node_status: display mount flags * nomad/structs/volumes: npe * plugins/csi/plugin: npe in ToCSIRepresentation * jobspec/parse_test: expand volume parse test cases * command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions * command/volume_status_csi: copy paste error * jobspec/test-fixtures/basic: hclfmt * command/volume_status_csi: clean up csiVolMountOption	2020-03-23 13:59:25 -04:00
Tim Gross	32b94bf1a4	csi: stub fingerprint on instance manager shutdown (#7388) Run the plugin fingerprint one last time with a closed client during instance manager shutdown. This will return quickly and will give us a correctly-populated `PluginInfo` marked as unhealthy so the Nomad client can update the server about plugin health.	2020-03-23 13:59:25 -04:00
Tim Gross	de4ad6ca38	csi: add Provider field to CSI CLIs and APIs (#7285) Derive a provider name and version for plugins (and the volumes that use them) from the CSI identity API `GetPluginInfo`. Expose the vendor name as `Provider` in the API and CLI commands.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	247e86bb35	csi: VolumeCapabilities for ControllerPublishVolume This commit introduces support for providing VolumeCapabilities during requests to `ControllerPublishVolumes` as this is a required field.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	34acb596e3	plugins/csi: Implement ControllerValidateCapabilities RPC	2020-03-23 13:58:30 -04:00
Danielle Lancashire	6b7ee96a88	csi: Move VolumeCapabilities helper to package	2020-03-23 13:58:30 -04:00
Danielle Lancashire	6762442199	csiclient: Add grpc.CallOption support to NodeUnpublishVolume	2020-03-23 13:58:30 -04:00
Tim Gross	60901fa764	csi: implement CSI controller detach request/response (#7107) This changeset implements the minimal structs on the client-side we need to compile the work-in-progress implementation of the server-to-controller RPCs. It doesn't include implementing the `ClientCSI.DettachVolume` RPC on the client.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	a5c96ce2e1	csi: Add grpc.CallOption support to NodePublishVolume	2020-03-23 13:58:29 -04:00
Danielle Lancashire	add55e37b8	csi: Expose gRPC Options on NodeUnstageVolume	2020-03-23 13:58:29 -04:00
Danielle Lancashire	65d9ddc9af	csi: Expose grpc.CallOptions for NodeStageVolume	2020-03-23 13:58:29 -04:00
Danielle Lancashire	51270ae0f4	csi: Support for NodeUnpublishVolume RPCs	2020-03-23 13:58:29 -04:00
Danielle Lancashire	a4b96aff33	csi: Nil check ToCSIRepresentation implementations	2020-03-23 13:58:29 -04:00
Danielle Lancashire	02c4612e65	csi: Add NodePublishVolume RPCs	2020-03-23 13:58:29 -04:00
Danielle Lancashire	98f00a9220	csi: Add NodeUnstageVolume RPCs to CSIPlugin	2020-03-23 13:58:29 -04:00
Danielle Lancashire	5c447396fa	csi: Add NodeUnstageVolume as a CSI Dependency	2020-03-23 13:58:29 -04:00
Danielle Lancashire	f208770e94	csi: Add NodeStageVolume to fake client	2020-03-23 13:58:29 -04:00
Danielle Lancashire	07651a5231	csi: Add NodeStageVolume RPC	2020-03-23 13:58:29 -04:00
Danielle Lancashire	317b680744	csi: Add csi.NodeStageVolume to the NodeClient Implements a fake version of NodeStageVolume as a dependency of implementing the client.NodeStageVolume request	2020-03-23 13:58:29 -04:00
Danielle Lancashire	ab1edd4e24	csi: Add Nomad Model for VolumeCapabilities This commit introduces a nomad model for interacting with CSI VolumeCapabilities as a pre-requisite for implementing NodeStageVolume and NodeMountVolume correctly. These fields have a few special characteristics that I've tried to model here - specifically, we make a basic attempt to avoid printing data that should be redacted during debug logs (additional mount flags), and also attempt to make debuggability of other integer fields easier by implementing the fmt.Stringer and fmt.GoStringer interfaces as necessary. We do not currently implement a CSI Protobuf -> Nomad implementation transformation as this is currently not needed by any used RPCs.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	de5d373001	csi: Setup gRPC Clients with a logger	2020-03-23 13:58:29 -04:00
Danielle Lancashire	c16812280c	csi: Add NodeGetCapabilities RPC	2020-03-23 13:58:29 -04:00
Danielle Lancashire	05525c98ae	plugins_csi: Add GetControllerCapabilities RPC	2020-03-23 13:58:28 -04:00
Danielle Lancashire	72ee2d4c1c	csi: Add initial plumbing for controller rpcs	2020-03-23 13:58:28 -04:00
Danielle Lancashire	426c26d7c0	CSI Plugin Registration (#6555) This changeset implements the initial registration and fingerprinting of CSI Plugins as part of #5378. At a high level, it introduces the following: * A `csi_plugin` stanza as part of a Nomad task configuration, to allow a task to expose that it is a plugin. * A new task runner hook: `csi_plugin_supervisor`. This hook does two things. When the `csi_plugin` stanza is detected, it will automatically configure the plugin task to receive bidirectional mounts to the CSI intermediary directory. At runtime, it will then perform an initial heartbeat of the plugin and handle submitting it to the new `dynamicplugins.Registry` for further use by the client, and then run a lightweight heartbeat loop that will emit task events when health changes. * The `dynamicplugins.Registry` for handling plugins that run as Nomad tasks, in contrast to the existing catalog that requires `go-plugin` type plugins and to know the plugin configuration in advance. * The `csimanager` which fingerprints CSI plugins, in a similar way to `drivermanager` and `devicemanager`. It currently only fingerprints the NodeID from the plugin, and assumes that all plugins are monolithic. Missing features * We do not use the live updates of the `dynamicplugin` registry in the `csimanager` yet. * We do not deregister the plugins from the client when they shut down yet, they just become indefinitely marked as unhealthy. This is deliberate until we figure out how we should manage deploying new versions of plugins/transitioning them.	2020-03-23 13:58:28 -04:00

43 Commits