Commit Graph

4122 Commits

Author SHA1 Message Date
Tim Gross f6b3d38eb8
CSI: move node unmount to server-driven RPCs (#7596)
If a volume-claiming alloc stops and the CSI Node plugin that serves
that alloc's volumes is missing, there's no way for the allocrunner
hook to send the `NodeUnpublish` and `NodeUnstage` RPCs.

This changeset addresses this issue with a redesign of the client-side
for CSI. Rather than unmounting in the alloc runner hook, the alloc
runner hook will simply exit. When the server gets the
`Node.UpdateAlloc` for the terminal allocation that had a volume claim,
it creates a volume claim GC job. This job will made client RPCs to a
new node plugin RPC endpoint, and only once that succeeds, move on to
making the client RPCs to the controller plugin. If the node plugin is
unavailable, the GC job will fail and be requeued.
2020-04-02 16:04:56 -04:00
Mahmood Ali 37c0dbcfe6 fix codegen for ugorji/go
When generating ugorji/go package, we should use
github.com/hashicorp/go-msgpack/codec instead.

Also fix the reference for codegen_generated
2020-03-31 21:30:21 -04:00
Seth Hoenig 0266f056b8 connect: enable proxy.passthrough configuration
Enable configuration of HTTP and gRPC endpoints which should be exposed by
the Connect sidecar proxy. This changeset is the first "non-magical" pass
that lays the groundwork for enabling Consul service checks for tasks
running in a network namespace because they are Connect-enabled. The changes
here provide for full configuration of the

  connect {
    sidecar_service {
      proxy {
        expose {
          paths = [{
		path = <exposed endpoint>
                protocol = <http or grpc>
                local_path_port = <local endpoint port>
                listener_port = <inbound mesh port>
	  }, ... ]
       }
    }
  }

stanza. Everything from `expose` and below is new, and partially implements
the precedent set by Consul:
  https://www.consul.io/docs/connect/registration/service-registration.html#expose-paths-configuration-reference

Combined with a task-group level network port-mapping in the form:

  port "exposeExample" { to = -1 }

it is now possible to "punch a hole" through the network namespace
to a specific HTTP or gRPC path, with the anticipated use case of creating
Consul checks on Connect enabled services.

A future PR may introduce more automagic behavior, where we can do things like

1) auto-fill the 'expose.path.local_path_port' with the default value of the
   'service.port' value for task-group level connect-enabled services.

2) automatically generate a port-mapping

3) enable an 'expose.checks' flag which automatically creates exposed endpoints
   for every compatible consul service check (http/grpc checks on connect
   enabled services).
2020-03-31 17:15:27 -06:00
Lang Martin 8d4f39fba1
csi: add node events to report progress mounting and unmounting volumes (#7547)
* nomad/structs/structs: new NodeEventSubsystemCSI

* client/client: pass triggerNodeEvent in the CSIConfig

* client/pluginmanager/csimanager/instance: add eventer to instanceManager

* client/pluginmanager/csimanager/manager: pass triggerNodeEvent

* client/pluginmanager/csimanager/volume: node event on [un]mount

* nomad/structs/structs: use storage, not CSI

* client/pluginmanager/csimanager/volume: use storage, not CSI

* client/pluginmanager/csimanager/volume_test: eventer

* client/pluginmanager/csimanager/volume: event on error

* client/pluginmanager/csimanager/volume_test: check event on error

* command/node_status: remove an extra space in event detail format

* client/pluginmanager/csimanager/volume: use snake_case for details

* client/pluginmanager/csimanager/volume_test: snake_case details
2020-03-31 17:13:52 -04:00
Mahmood Ali 14a461d6c4
Merge pull request #7560 from hashicorp/vendor-go-msgpack-v1.1.5
vendor: explicit use of hashicorp/go-msgpack
2020-03-31 10:09:05 -04:00
Tim Gross 4a834ea0fa
client: use NewNodeEvent builder for consistency (#7559) 2020-03-31 10:02:16 -04:00
Yoan Blanc 225c9c1215 fixup! vendor: explicit use of hashicorp/go-msgpack
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-03-31 09:48:07 -04:00
Yoan Blanc 761d014071 vendor: explicit use of hashicorp/go-msgpack
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-03-31 09:45:21 -04:00
Tim Gross 14b4712f01
csi: annotate remaining missing cancellation contexts (#7552) 2020-03-30 16:46:43 -04:00
Tim Gross 6ffd36c4e5
csi: add grpc retries to client controller RPCs (#7549)
The CSI Specification defines various gRPC Errors and how they may be retried. After auditing all our CSI RPC calls in #6863, this changeset:

* adds retries and backoffs to the where they were needed but not implemented
* annotates those CSI RPCs that do not need retries so that we don't wonder whether it's been left off accidentally
* added a timeout and cancellation context to the `Probe` call, which didn't have one.
2020-03-30 16:26:03 -04:00
Seth Hoenig b3664c628c
Merge pull request #7524 from hashicorp/docs-consul-acl-minimums
consul: annotate Consul interfaces with ACLs
2020-03-30 13:27:27 -06:00
Seth Hoenig 7dbc22539e docs: remove erroneous characters from comment 2020-03-30 13:26:48 -06:00
Seth Hoenig 41cabd3e18
Merge pull request #7542 from jorgemarey/b-fix-lockedUpstreamsUpdate
Add new setUpstreamsLocked function to avoid blocking on Update
2020-03-30 11:27:32 -06:00
Seth Hoenig 0a812ab689 consul: annotate Consul interfaces with ACLs 2020-03-30 10:17:28 -06:00
Mahmood Ali b4a00f8dd7 tests: deflake TestAllocGarbageCollector_MakeRoomFor_MaxAllocs
The test inserts an alloc in the server state, but expect the client to
start the alloc runner for it almost immediately.

Here, we add a retry loop to check that the client start all expected
alloc runners eventually.
2020-03-30 07:06:53 -04:00
Jorge Marey 3731b70e03 Add new setUpstreamsLocked function to avoid lock 2020-03-29 20:34:04 +02:00
Mahmood Ali 7985b1893f fixup! tests: Add tests for EC2 Metadata immitation cases 2020-03-26 11:37:54 -04:00
Mahmood Ali a1e7378c7b fixup! tests: Add tests for EC2 Metadata immitation cases 2020-03-26 11:33:44 -04:00
Mahmood Ali 1d50379bc6 fingerprint: handle incomplete AWS immitation APIs
Fix a regression where we accidentally started treating non-AWS
environments as AWS environments, resulting in bad networking settings.

Two factors some at play:

First, in [1], we accidentally switched the ultimate AWS test from
checking `ami-id` to `instance-id`.  This means that nomad started
treating more environments as AWS; e.g. Hetzner implements `instance-id`
but not `ami-id`.

Second, some of these environments return empty values instead of
errors!  Hetzner returns empty 200 response for `local-ipv4`, resulting
into bad networking configuration.

This change fix the situation by restoring the check to `ami-id` and
ensuring that we only set network configuration when the ip address is
not-empty.  Also, be more defensive around response whitespace input.

[1] https://github.com/hashicorp/nomad/pull/6779
2020-03-26 11:23:15 -04:00
Mahmood Ali b3de5d5721 tests: Add tests for EC2 Metadata immitation cases
Test that nomad doesn't set empty/bad network configuration when in an
environment that does incomplete immitation of EC2 Metadata API.
2020-03-26 11:13:21 -04:00
Mahmood Ali 884d18f068
Merge pull request #7383 from hashicorp/b-health-detect-failing-tasks
health: detect failing tasks
2020-03-25 06:30:05 -04:00
Mahmood Ali a5b024fdea tests: restart restartpolicy for all tasks in tests 2020-03-24 21:52:48 -04:00
Mahmood Ali 7565ac34c0 tests: populate task restart policy properly 2020-03-24 21:44:37 -04:00
Mahmood Ali a45202399c tests: fix TestAllocations_GarbageCollect 2020-03-24 17:38:59 -04:00
Mahmood Ali 5ed346bf05 tests: update AR task restart policy 2020-03-24 17:00:42 -04:00
Mahmood Ali ceed57b48f per-task restart policy 2020-03-24 17:00:41 -04:00
Tim Gross 076fbbf08f
Merge pull request #7012 from hashicorp/f-csi-volumes
Container Storage Interface Support
2020-03-23 14:19:46 -04:00
Lang Martin e100444740 csi: add mount_options to volumes and volume requests (#7398)
Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host.

Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context.

closes #7007

* nomad/structs/volumes: add MountOptions to volume request

* jobspec/test-fixtures/basic.hcl: add mount_options to volume block

* jobspec/parse_test: add expected MountOptions

* api/tasks: add mount_options

* jobspec/parse_group: use hcl decode not mapstructure, mount_options

* client/allocrunner/csi_hook: pass MountOptions through

client/allocrunner/csi_hook: add a VolumeMountOptions

client/allocrunner/csi_hook: drop Options

client/allocrunner/csi_hook: use the structs options

* client/pluginmanager/csimanager/interface: UsageOptions.MountOptions

* client/pluginmanager/csimanager/volume: pass MountOptions in capabilities

* plugins/csi/plugin: remove todo 7007 comment

* nomad/structs/csi: MountOptions

* api/csi: add options to the api for parsing, match structs

* plugins/csi/plugin: move VolumeMountOptions to structs

* api/csi: use specific type for mount_options

* client/allocrunner/csi_hook: merge MountOptions here

* rename CSIOptions to CSIMountOptions

* client/allocrunner/csi_hook

* client/pluginmanager/csimanager/volume

* nomad/structs/csi

* plugins/csi/fake/client: add PrevVolumeCapability

* plugins/csi/plugin

* client/pluginmanager/csimanager/volume_test: remove debugging

* client/pluginmanager/csimanager/volume: fix odd merging logic

* api: rename CSIOptions -> CSIMountOptions

* nomad/csi_endpoint: remove a 7007 comment

* command/alloc_status: show mount options in the volume list

* nomad/structs/csi: include MountOptions in the volume stub

* api/csi: add MountOptions to stub

* command/volume_status_csi: clean up csiVolMountOption, add it

* command/alloc_status: csiVolMountOption lives in volume_csi_status

* command/node_status: display mount flags

* nomad/structs/volumes: npe

* plugins/csi/plugin: npe in ToCSIRepresentation

* jobspec/parse_test: expand volume parse test cases

* command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions

* command/volume_status_csi: copy paste error

* jobspec/test-fixtures/basic: hclfmt

* command/volume_status_csi: clean up csiVolMountOption
2020-03-23 13:59:25 -04:00
Tim Gross 32b94bf1a4 csi: stub fingerprint on instance manager shutdown (#7388)
Run the plugin fingerprint one last time with a closed client during
instance manager shutdown. This will return quickly and will give us a
correctly-populated `PluginInfo` marked as unhealthy so the Nomad
client can update the server about plugin health.
2020-03-23 13:59:25 -04:00
Tim Gross 5a0bcd39d1 csi: dynamically update plugin registration (#7386)
Allow for faster updates to plugin status when allocations become
terminal by listening for register/deregister events from the dynamic
plugin registry (which in turn are triggered by the plugin supervisor
hook).

The deregistration function closures that we pass up to the CSI plugin
manager don't properly close over the name and type of the
registration, causing monolith-type plugins to deregister only one of
their two plugins on alloc shutdown. Rebind plugin supervisor 
deregistration targets to fix that.

Includes log message and comment improvements
2020-03-23 13:59:25 -04:00
Tim Gross fe926e899e volumes: add task environment interpolation to volume_mount (#7364) 2020-03-23 13:59:25 -04:00
Tim Gross 22e9f679c3 csi: implement controller detach RPCs (#7356)
This changeset implements the remaining controller detach RPCs: server-to-client and client-to-controller. The tests also uncovered a bug in our RPC for claims which is fixed here; the volume claim RPC is used for both claiming and releasing a claim on a volume. We should only submit a controller publish RPC when the claim is new and not when it's being released.
2020-03-23 13:59:25 -04:00
Tim Gross eda7be552c csi: add dynamicplugins registry to client state store (#7330)
In order to correctly fingerprint dynamic plugins on client restarts,
we need to persist a handle to the plugin (that is, connection info)
to the client state store.

The dynamic registry will sync automatically to the client state
whenever it receives a register/deregister call.
2020-03-23 13:58:30 -04:00
Lang Martin 6750c262a4 csi: use `ExternalID`, when set, to identify volumes for outside RPC calls (#7326)
* nomad/structs/csi: new RemoteID() uses the ExternalID if set

* nomad/csi_endpoint: pass RemoteID to volume request types

* client/pluginmanager/csimanager/volume: pass RemoteID to NodePublishVolume
2020-03-23 13:58:30 -04:00
Tim Gross 1cf7ef44ed csi: docstring and log message fixups (#7327)
Fix some docstring typos and fix noisy log message during client restarts.
A log for the common case where the plugin socket isn't ready yet
isn't actionable by the operator so having it at info is just noise.
2020-03-23 13:58:30 -04:00
Lang Martin de25fc6cf4 csi: csi-hostpath plugin unimplemented error on controller publish (#7299)
* client/allocrunner/csi_hook: tag errors

* nomad/client_csi_endpoint: tag errors

* nomad/client_rpc: remove an unnecessary error tag

* nomad/state/state_store: ControllerRequired fix intent

We use ControllerRequired to indicate that a volume should use the
publish/unpublish workflow, rather than that it has a controller. We
need to check both RequiresControllerPlugin and SupportsAttachDetach
from the fingerprint to check that.

* nomad/csi_endpoint: tag errors

* nomad/csi_endpoint_test: longer error messages, mock fingerprints
2020-03-23 13:58:30 -04:00
Tim Gross de4ad6ca38 csi: add Provider field to CSI CLIs and APIs (#7285)
Derive a provider name and version for plugins (and the volumes that
use them) from the CSI identity API `GetPluginInfo`. Expose the vendor
name as `Provider` in the API and CLI commands.
2020-03-23 13:58:30 -04:00
Lang Martin a4784ef258 csi add allocation context to fingerprinting results (#7133)
* structs: CSIInfo include AllocID, CSIPlugins no Jobs

* state_store: eliminate plugin Jobs, delete an empty plugin

* nomad/structs/csi: detect empty plugins correctly

* client/allocrunner/taskrunner/plugin_supervisor_hook: option AllocID

* client/pluginmanager/csimanager/instance: allocID

* client/pluginmanager/csimanager/fingerprint: set AllocID

* client/node_updater: split controller and node plugins

* api/csi: remove Jobs

The CSI Plugin API will map plugins to allocations, which allows
plugins to be defined by jobs in many configurations. In particular,
multiple plugins can be defined in the same job, and multiple jobs can
be used to define a single plugin.

Because we now map the allocation context directly from the node, it's
no longer necessary to track the jobs associated with a plugin
directly.

* nomad/csi_endpoint_test: CreateTestPlugin & register via fingerprint

* client/dynamicplugins: lift AllocID into the struct from Options

* api/csi_test: remove Jobs test

* nomad/structs/csi: CSIPlugins has an array of allocs

* nomad/state/state_store: implement CSIPluginDenormalize

* nomad/state/state_store: CSIPluginDenormalize npe on missing alloc

* nomad/csi_endpoint_test: defer deleteNodes for clarity

* api/csi_test: disable this test awaiting mocks:
https://github.com/hashicorp/nomad/issues/7123
2020-03-23 13:58:30 -04:00
Danielle Lancashire 247e86bb35 csi: VolumeCapabilities for ControllerPublishVolume
This commit introduces support for providing VolumeCapabilities during
requests to `ControllerPublishVolumes` as this is a required field.
2020-03-23 13:58:30 -04:00
Danielle Lancashire e75f057df3 csi: Fix Controller RPCs
Currently the handling of CSINode RPCs does not correctly handle
forwarding RPCs to Nodes.

This commit fixes this by introducing a shim RPC
(nomad/client_csi_enpdoint) that will correctly forward the request to
the owning node, or submit the RPC to the client.

In the process it also cleans up handling a little bit by adding the
`CSIControllerQuery` embeded struct for required forwarding state.

The CSIControllerQuery embeding the requirement of a `PluginID` also
means we could move node targetting into the shim RPC if wanted in the
future.
2020-03-23 13:58:30 -04:00
Danielle Lancashire d5e255f97a client: Rename ClientCSI -> CSIController 2020-03-23 13:58:30 -04:00
Danielle Lancashire 5b05baf9f6 csi: Add /dev mounts to CSI Plugins
CSI Plugins that manage devices need not just access to the CSI
directory, but also to manage devices inside `/dev`.

This commit introduces a `/dev:/dev` mount to the container so that they
may do so.
2020-03-23 13:58:30 -04:00
Danielle Lancashire 6fc7f7779d csimanager/volume: Update MountVolume docstring 2020-03-23 13:58:30 -04:00
Danielle Lancashire 1b70fb1398 hook resources: Init with empty resources during setup 2020-03-23 13:58:30 -04:00
Danielle Lancashire 511b7775a6 csi: Claim CSI Volumes during csi_hook.Prerun
This commit is the initial implementation of claiming volumes from the
server and passes through any publishContext information as appropriate.

There's nothing too fancy here.
2020-03-23 13:58:30 -04:00
Danielle Lancashire f79351915c csi: Basic volume usage tracking 2020-03-23 13:58:30 -04:00
Danielle Lancashire 0203341033 csi: Add comment to UsageOptions.ToFS() 2020-03-23 13:58:30 -04:00
Danielle Lancashire 9f1a076bd5 client: Implement ClientCSI.ControllerValidateVolume 2020-03-23 13:58:30 -04:00
Danielle Lancashire 6b7ee96a88 csi: Move VolumeCapabilties helper to package 2020-03-23 13:58:30 -04:00
Danielle Lancashire da4f6b60a2 csi: Pass through usage options to the csimanager
The CSI Spec requires us to attach and stage volumes based on different
types of usage information when it may effect how they are bound. Here
we pass through some basic usage options in the CSI Hook (specifically
the volume aliases ReadOnly field), and the attachment/access mode from
the volume. We pass the attachment/access mode seperately from the
volume as it simplifies some handling and doesn't necessarily force
every attachment to use the same mode should more be supported (I.e if
we let each `volume "foo" {}` specify an override in the future).
2020-03-23 13:58:30 -04:00