open-nomad

Commit Graph

Author	SHA1	Message	Date
Nick Ethier	5166806993	Merge pull request #7600 from hashicorp/b-5767 tr/service_hook: prevent Update from running before Poststart finish	2020-04-06 16:52:42 -04:00
Nick Ethier	567609e101	tr/service_hook: reset initialized flag during deregister	2020-04-06 16:05:36 -04:00
Drew Bailey	0d550049e9	ensure shutdown delay can be removed	2020-04-06 11:33:04 -04:00
Drew Bailey	9874e7b21d	Group shutdown delay fixes Group shutdown delay updates were not properly handled in Update hook. This commit also ensures that plan output is displayed.	2020-04-06 11:29:12 -04:00
Seth Hoenig	60c9b73eba	Merge pull request #7602 from hashicorp/b-connect-bootstrap-tls-config connect: set consul TLS options on envoy bootstrap	2020-04-03 08:50:36 -06:00
Tim Gross	f6b3d38eb8	CSI: move node unmount to server-driven RPCs (#7596) If a volume-claiming alloc stops and the CSI Node plugin that serves that alloc's volumes is missing, there's no way for the allocrunner hook to send the `NodeUnpublish` and `NodeUnstage` RPCs. This changeset addresses this issue with a redesign of the client-side for CSI. Rather than unmounting in the alloc runner hook, the alloc runner hook will simply exit. When the server gets the `Node.UpdateAlloc` for the terminal allocation that had a volume claim, it creates a volume claim GC job. This job will make client RPCs to a new node plugin RPC endpoint, and only once that succeeds, move on to making the client RPCs to the controller plugin. If the node plugin is unavailable, the GC job will fail and be requeued.	2020-04-02 16:04:56 -04:00
Nick Ethier	3b5d2f8eb8	tr/service_hook: update hook fields during update when poststart hasn't finished	2020-04-02 12:48:19 -04:00
Seth Hoenig	e7fcd281ae	connect: set consul TLS options on envoy bootstrap Fixes #6594 #6711 #6714 #7567 e2e testing is still TBD in #6502 Before, we only passed the Nomad agent's configured Consul HTTP address onto the `consul connect envoy ...` bootstrap command. This meant any Consul setup with TLS enabled would not work with Nomad's Connect integration. This change now sets CLI args and Environment Variables for configuring TLS options for communicating with Consul when doing the envoy bootstrap, as described in https://www.consul.io/docs/commands/connect/envoy.html#usage	2020-04-02 10:30:50 -06:00
Nick Ethier	fa271ff1b3	tr/service_hook: prevent Update from running before Poststart has finished	2020-04-02 12:17:36 -04:00
Seth Hoenig	0266f056b8	connect: enable proxy.passthrough configuration Enable configuration of HTTP and gRPC endpoints which should be exposed by the Connect sidecar proxy. This changeset is the first "non-magical" pass that lays the groundwork for enabling Consul service checks for tasks running in a network namespace because they are Connect-enabled. The changes here provide for full configuration of the connect { sidecar_service { proxy { expose { paths = [{ path = <exposed endpoint> protocol = <http or grpc> local_path_port = <local endpoint port> listener_port = <inbound mesh port> }, ... ] } } } stanza. Everything from `expose` and below is new, and partially implements the precedent set by Consul: https://www.consul.io/docs/connect/registration/service-registration.html#expose-paths-configuration-reference Combined with a task-group level network port-mapping in the form: port "exposeExample" { to = -1 } it is now possible to "punch a hole" through the network namespace to a specific HTTP or gRPC path, with the anticipated use case of creating Consul checks on Connect enabled services. A future PR may introduce more automagic behavior, where we can do things like 1) auto-fill the 'expose.path.local_path_port' with the default value of the 'service.port' value for task-group level connect-enabled services. 2) automatically generate a port-mapping 3) enable an 'expose.checks' flag which automatically creates exposed endpoints for every compatible consul service check (http/grpc checks on connect enabled services).	2020-03-31 17:15:27 -06:00
Tim Gross	14b4712f01	csi: annotate remaining missing cancellation contexts (#7552)	2020-03-30 16:46:43 -04:00
Mahmood Ali	884d18f068	Merge pull request #7383 from hashicorp/b-health-detect-failing-tasks health: detect failing tasks	2020-03-25 06:30:05 -04:00
Mahmood Ali	a5b024fdea	tests: restart restartpolicy for all tasks in tests	2020-03-24 21:52:48 -04:00
Mahmood Ali	7565ac34c0	tests: populate task restart policy properly	2020-03-24 21:44:37 -04:00
Mahmood Ali	5ed346bf05	tests: update AR task restart policy	2020-03-24 17:00:42 -04:00
Mahmood Ali	ceed57b48f	per-task restart policy	2020-03-24 17:00:41 -04:00
Lang Martin	e100444740	csi: add mount_options to volumes and volume requests (#7398) Add mount_options to both the volume definition on registration and to the volume block in the group where the volume is requested. If both are specified, the options provided in the request replace the options defined in the volume. They get passed to the NodePublishVolume, which causes the node plugin to actually mount the volume on the host. Individual tasks just mount bind into the host mounted volume (unchanged behavior). An operator can mount the same volume with different options by specifying it twice in the group context. closes #7007 * nomad/structs/volumes: add MountOptions to volume request * jobspec/test-fixtures/basic.hcl: add mount_options to volume block * jobspec/parse_test: add expected MountOptions * api/tasks: add mount_options * jobspec/parse_group: use hcl decode not mapstructure, mount_options * client/allocrunner/csi_hook: pass MountOptions through client/allocrunner/csi_hook: add a VolumeMountOptions client/allocrunner/csi_hook: drop Options client/allocrunner/csi_hook: use the structs options * client/pluginmanager/csimanager/interface: UsageOptions.MountOptions * client/pluginmanager/csimanager/volume: pass MountOptions in capabilities * plugins/csi/plugin: remove todo 7007 comment * nomad/structs/csi: MountOptions * api/csi: add options to the api for parsing, match structs * plugins/csi/plugin: move VolumeMountOptions to structs * api/csi: use specific type for mount_options * client/allocrunner/csi_hook: merge MountOptions here * rename CSIOptions to CSIMountOptions * client/allocrunner/csi_hook * client/pluginmanager/csimanager/volume * nomad/structs/csi * plugins/csi/fake/client: add PrevVolumeCapability * plugins/csi/plugin * client/pluginmanager/csimanager/volume_test: remove debugging * client/pluginmanager/csimanager/volume: fix odd merging logic * api: rename CSIOptions -> CSIMountOptions * nomad/csi_endpoint: remove a 7007 comment * command/alloc_status: show mount options in the volume list * nomad/structs/csi: include MountOptions in the volume stub * api/csi: add MountOptions to stub * command/volume_status_csi: clean up csiVolMountOption, add it * command/alloc_status: csiVolMountOption lives in volume_csi_status * command/node_status: display mount flags * nomad/structs/volumes: npe * plugins/csi/plugin: npe in ToCSIRepresentation * jobspec/parse_test: expand volume parse test cases * command/agent/job_endpoint: ApiTgToStructsTG needs MountOptions * command/volume_status_csi: copy paste error * jobspec/test-fixtures/basic: hclfmt * command/volume_status_csi: clean up csiVolMountOption	2020-03-23 13:59:25 -04:00
Tim Gross	5a0bcd39d1	csi: dynamically update plugin registration (#7386) Allow for faster updates to plugin status when allocations become terminal by listening for register/deregister events from the dynamic plugin registry (which in turn are triggered by the plugin supervisor hook). The deregistration function closures that we pass up to the CSI plugin manager don't properly close over the name and type of the registration, causing monolith-type plugins to deregister only one of their two plugins on alloc shutdown. Rebind plugin supervisor deregistration targets to fix that. Includes log message and comment improvements	2020-03-23 13:59:25 -04:00
Tim Gross	fe926e899e	volumes: add task environment interpolation to volume_mount (#7364)	2020-03-23 13:59:25 -04:00
Tim Gross	1cf7ef44ed	csi: docstring and log message fixups (#7327) Fix some docstring typos and fix noisy log message during client restarts. A log for the common case where the plugin socket isn't ready yet isn't actionable by the operator so having it at info is just noise.	2020-03-23 13:58:30 -04:00
Lang Martin	de25fc6cf4	csi: csi-hostpath plugin unimplemented error on controller publish (#7299) * client/allocrunner/csi_hook: tag errors * nomad/client_csi_endpoint: tag errors * nomad/client_rpc: remove an unnecessary error tag * nomad/state/state_store: ControllerRequired fix intent We use ControllerRequired to indicate that a volume should use the publish/unpublish workflow, rather than that it has a controller. We need to check both RequiresControllerPlugin and SupportsAttachDetach from the fingerprint to check that. * nomad/csi_endpoint: tag errors * nomad/csi_endpoint_test: longer error messages, mock fingerprints	2020-03-23 13:58:30 -04:00
Tim Gross	de4ad6ca38	csi: add Provider field to CSI CLIs and APIs (#7285) Derive a provider name and version for plugins (and the volumes that use them) from the CSI identity API `GetPluginInfo`. Expose the vendor name as `Provider` in the API and CLI commands.	2020-03-23 13:58:30 -04:00
Lang Martin	a4784ef258	csi add allocation context to fingerprinting results (#7133) * structs: CSIInfo include AllocID, CSIPlugins no Jobs * state_store: eliminate plugin Jobs, delete an empty plugin * nomad/structs/csi: detect empty plugins correctly * client/allocrunner/taskrunner/plugin_supervisor_hook: option AllocID * client/pluginmanager/csimanager/instance: allocID * client/pluginmanager/csimanager/fingerprint: set AllocID * client/node_updater: split controller and node plugins * api/csi: remove Jobs The CSI Plugin API will map plugins to allocations, which allows plugins to be defined by jobs in many configurations. In particular, multiple plugins can be defined in the same job, and multiple jobs can be used to define a single plugin. Because we now map the allocation context directly from the node, it's no longer necessary to track the jobs associated with a plugin directly. * nomad/csi_endpoint_test: CreateTestPlugin & register via fingerprint * client/dynamicplugins: lift AllocID into the struct from Options * api/csi_test: remove Jobs test * nomad/structs/csi: CSIPlugins has an array of allocs * nomad/state/state_store: implement CSIPluginDenormalize * nomad/state/state_store: CSIPluginDenormalize npe on missing alloc * nomad/csi_endpoint_test: defer deleteNodes for clarity * api/csi_test: disable this test awaiting mocks: https://github.com/hashicorp/nomad/issues/7123	2020-03-23 13:58:30 -04:00
Danielle Lancashire	5b05baf9f6	csi: Add /dev mounts to CSI Plugins CSI Plugins that manage devices need not just access to the CSI directory, but also to manage devices inside `/dev`. This commit introduces a `/dev:/dev` mount to the container so that they may do so.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	1b70fb1398	hook resources: Init with empty resources during setup	2020-03-23 13:58:30 -04:00
Danielle Lancashire	511b7775a6	csi: Claim CSI Volumes during csi_hook.Prerun This commit is the initial implementation of claiming volumes from the server and passes through any publishContext information as appropriate. There's nothing too fancy here.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	da4f6b60a2	csi: Pass through usage options to the csimanager The CSI Spec requires us to attach and stage volumes based on different types of usage information when it may affect how they are bound. Here we pass through some basic usage options in the CSI Hook (specifically the volume aliases ReadOnly field), and the attachment/access mode from the volume. We pass the attachment/access mode separately from the volume as it simplifies some handling and doesn't necessarily force every attachment to use the same mode should more be supported (i.e. if we let each `volume "foo" {}` specify an override in the future).	2020-03-23 13:58:30 -04:00
Danielle Lancashire	a62a90e03c	csi: Unpublish volumes during ar.Postrun This commit introduces initial support for unmounting csi volumes. It takes a relatively simplistic approach to performing NodeUnpublishVolume calls, optimising for cleaning up any leftover state rather than terminating early in the case of errors. This is because it happens during an allocation's shutdown flow and may not always have a corresponding call to `NodePublishVolume` that succeeded.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	6665bdec2e	taskrunner/volume_hook: Cleanup arg order of prepareHostVolumes	2020-03-23 13:58:30 -04:00
Danielle Lancashire	8692ca86bb	taskrunner/volume_hook: Mounts for CSI Volumes This commit implements support for creating driver mounts for CSI Volumes. It works by fetching the created mounts from the allocation resources and then iterates through the volume requests, creating driver mount configs as required. It's a little bit messy primarily because there's _so_ much terminology overlap and it's a bit difficult to follow.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	7a33864edf	volume_hook: Loosen validation in host volume prep	2020-03-23 13:58:30 -04:00
Danielle Lancashire	d8334cf884	allocrunner: Push state from hooks to taskrunners This commit is an initial (read: janky) approach to forwarding state from an allocrunner hook to a taskrunner using a similar `hookResources` approach that tr's use internally. It should eventually probably be replaced with something a little bit more message based, but for things that only come from pre-run hooks, and don't change, it's probably fine for now.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	3ef41fbb86	csi_hook: Stage/Mount volumes as required This commit introduces the first stage of volume mounting for an allocation. The csimanager.VolumeMounter interface manages the blocking and actual minutia of the CSI implementation allowing this hook to do the minimal work of volume retrieval and creating mount info. In the future the `CSIVolume.Get` request should be replaced by `CSIVolume.Claim(Batch?)` to minimize the number of RPCs and to handle external triggering of a ControllerPublishVolume request as required. We also need to ensure that if pre-run hooks fail, we still get a full unwinding of any publish and staged volumes to ensure that there are no hanging references to volumes. That is not handled in this commit.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	4a2492ecb1	client: Pass an RPC Client to AllocRunners As part of introducing support for CSI, AllocRunner hooks need to be able to communicate with Nomad Servers for validation of and interaction with storage volumes. Here we create a small RPCer interface and pass the client (rpc client) to the AR in preparation for making these RPCs.	2020-03-23 13:58:30 -04:00
Danielle Lancashire	3bff9fefae	csi: Provide plugin-scoped paths during RPCs When providing paths to plugins, the path needs to be in the scope of the plugins container, rather than that of the host. Here we enable that by providing the mount point through the plugin registration and then use it when constructing request target paths.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	1a10433b97	csi: Add VolumeManager (#6920) This changeset is some pre-requisite boilerplate that is required for introducing CSI volume management for client nodes. It extracts out fingerprinting logic from the csi instance manager. This change is to facilitate reusing the csimanager to also manage the node-local CSI functionality, as it is the easiest place for us to guarantee health checking and to provide additional visibility into the running operations through the fingerprinter mechanism and goroutine. It also introduces the VolumeMounter interface that will be used to manage staging/publishing unstaging/unpublishing of volumes on the host.	2020-03-23 13:58:29 -04:00
Danielle Lancashire	de5d373001	csi: Setup gRPC Clients with a logger	2020-03-23 13:58:29 -04:00
Danielle Lancashire	426c26d7c0	CSI Plugin Registration (#6555) This changeset implements the initial registration and fingerprinting of CSI Plugins as part of #5378. At a high level, it introduces the following: * A `csi_plugin` stanza as part of a Nomad task configuration, to allow a task to expose that it is a plugin. * A new task runner hook: `csi_plugin_supervisor`. This hook does two things. When the `csi_plugin` stanza is detected, it will automatically configure the plugin task to receive bidirectional mounts to the CSI intermediary directory. At runtime, it will then perform an initial heartbeat of the plugin and handle submitting it to the new `dynamicplugins.Registry` for further use by the client, and then run a lightweight heartbeat loop that will emit task events when health changes. * The `dynamicplugins.Registry` for handling plugins that run as Nomad tasks, in contrast to the existing catalog that requires `go-plugin` type plugins and to know the plugin configuration in advance. * The `csimanager` which fingerprints CSI plugins, in a similar way to `drivermanager` and `devicemanager`. It currently only fingerprints the NodeID from the plugin, and assumes that all plugins are monolithic. Missing features * We do not use the live updates of the `dynamicplugin` registry in the `csimanager` yet. * We do not deregister the plugins from the client when they shutdown yet, they just become indefinitely marked as unhealthy. This is deliberate until we figure out how we should manage deploying new versions of plugins/transitioning them.	2020-03-23 13:58:28 -04:00
Mahmood Ali	07a30580ac	health: fail health if any task is pending Fixes a bug where an allocation is considered healthy if some of the tasks are being restarted and as such, their checks aren't tracked by consul agent client. Here, we fix the immediate case by ensuring that an alloc is healthy only if tasks are running and the registered checks at the time are healthy. Previously, health tracker tracked task "health" independently from checks and leads to problems when a task restarts. Consider the following series of events: 1. all tasks start running -> `tracker.tasksHealthy` is true 2. one task has unhealthy checks and gets restarted 3. remaining checks are healthy -> `tracker.checksHealthy` is true 4. propagate health status now that `tracker.tasksHealthy` and `tracker.checksHealthy`. This change ensures that we accurately use the latest status of tasks and checks regardless of their status changes. Also, ensures that we only consider check health after tasks are considered healthy, otherwise we risk trusting incomplete checks. This approach accommodates task dependencies well. Service jobs can have prestart short-lived tasks that will terminate before main process runs. These dead tasks that complete successfully will not negate health status.	2020-03-22 11:13:41 -04:00
Mahmood Ali	b0a7e4381b	tests: add a check for failing service checks Add tests to check for failing or missing service checks in consul update.	2020-03-22 11:13:40 -04:00
Mahmood Ali	5801039214	address review feedback	2020-03-21 17:52:58 -04:00
Mahmood Ali	e1f53347e9	tr: proceed to mark other tasks as dead if alloc fails	2020-03-21 17:52:58 -04:00
Mahmood Ali	e30d26b404	fix test	2020-03-21 17:52:57 -04:00
Jasmine Dahilig	73a64e4397	change jobspec lifecycle stanza to use sidecar attribute instead of block_until status	2020-03-21 17:52:57 -04:00
Jasmine Dahilig	89778bc88d	fix restart policy for system jobs with no lifecycle	2020-03-21 17:52:56 -04:00
Jasmine Dahilig	56e0b8e933	refactor TaskHookCoordinator tests to use mock package and add failed init and sidecar test cases	2020-03-21 17:52:56 -04:00
Jasmine Dahilig	2a8dac077c	remove debugging test code from TestAllocRunner_TaskLeader_StopRestoredTG	2020-03-21 17:52:54 -04:00
Jasmine Dahilig	deb26aefab	fix bug in lifecycle restore tests after refactor	2020-03-21 17:52:54 -04:00
Jasmine Dahilig	2e93d7a875	fix failing ci test: TestTaskRunner_UnregisterConsul_Retries	2020-03-21 17:52:54 -04:00
Jasmine Dahilig	d54a83afee	fix linting errors	2020-03-21 17:52:53 -04:00


434 Commits