Commit graph

403 commits

Author SHA1 Message Date
Danielle Lancashire 8692ca86bb taskrunner/volume_hook: Mounts for CSI Volumes
This commit implements support for creating driver mounts for CSI
Volumes.

It works by fetching the created mounts from the allocation resources
and then iterates through the volume requests, creating driver mount
configs as required.

It's a little bit messy primarily because there's _so_ much terminology
overlap and it's a bit difficult to follow.
2020-03-23 13:58:30 -04:00
Danielle Lancashire 7a33864edf volume_hook: Loosen validation in host volume prep 2020-03-23 13:58:30 -04:00
Danielle Lancashire d8334cf884 allocrunner: Push state from hooks to taskrunners
This commit is an initial (read: janky) approach to forwarding state
from an allocrunner hook to a taskrunner using a similar `hookResources`
approach that tr's use internally.

It should eventually probably be replaced with something a little bit
more message based, but for things that only come from pre-run hooks,
and don't change, it's probably fine for now.
2020-03-23 13:58:30 -04:00
Danielle Lancashire 3ef41fbb86 csi_hook: Stage/Mount volumes as required
This commit introduces the first stage of volume mounting for an
allocation. The csimanager.VolumeMounter interface manages the blocking
and actual minutia of the CSI implementation allowing this hook to do
the minimal work of volume retrieval and creating mount info.

In the future the `CSIVolume.Get` request should be replaced by
`CSIVolume.Claim(Batch?)` to minimize the number of RPCs and to handle
external triggering of a ControllerPublishVolume request as required.

We also need to ensure that if pre-run hooks fail, we still get a full
unwinding of any publish and staged volumes to ensure that there are no hanging
references to volumes. That is not handled in this commit.
2020-03-23 13:58:30 -04:00
Danielle Lancashire 4a2492ecb1 client: Pass an RPC Client to AllocRunners
As part of introducing support for CSI, AllocRunner hooks need to be
able to communicate with Nomad Servers for validation of and interaction
with storage volumes. Here we create a small RPCer interface and pass
the client (rpc client) to the AR in preparation for making these RPCs.
2020-03-23 13:58:30 -04:00
Danielle Lancashire 3bff9fefae csi: Provide plugin-scoped paths during RPCs
When providing paths to plugins, the path needs to be in the scope of
the plugins container, rather than that of the host.

Here we enable that by providing the mount point through the plugin
registration and then use it when constructing request target paths.
2020-03-23 13:58:29 -04:00
Danielle Lancashire 1a10433b97 csi: Add VolumeManager (#6920)
This changeset is some pre-requisite boilerplate that is required for
introducing CSI volume management for client nodes.

It extracts out fingerprinting logic from the csi instance manager.
This change is to facilitate reusing the csimanager to also manage the
node-local CSI functionality, as it is the easiest place for us to
guaruntee health checking and to provide additional visibility into the
running operations through the fingerprinter mechanism and goroutine.

It also introduces the VolumeMounter interface that will be used to
manage staging/publishing unstaging/unpublishing of volumes on the host.
2020-03-23 13:58:29 -04:00
Danielle Lancashire de5d373001 csi: Setup gRPC Clients with a logger 2020-03-23 13:58:29 -04:00
Danielle Lancashire 426c26d7c0 CSI Plugin Registration (#6555)
This changeset implements the initial registration and fingerprinting
of CSI Plugins as part of #5378. At a high level, it introduces the
following:

* A `csi_plugin` stanza as part of a Nomad task configuration, to
  allow a task to expose that it is a plugin.

* A new task runner hook: `csi_plugin_supervisor`. This hook does two
  things. When the `csi_plugin` stanza is detected, it will
  automatically configure the plugin task to receive bidirectional
  mounts to the CSI intermediary directory. At runtime, it will then
  perform an initial heartbeat of the plugin and handle submitting it to
  the new `dynamicplugins.Registry` for further use by the client, and
  then run a lightweight heartbeat loop that will emit task events
  when health changes.

* The `dynamicplugins.Registry` for handling plugins that run
  as Nomad tasks, in contrast to the existing catalog that requires
  `go-plugin` type plugins and to know the plugin configuration in
  advance.

* The `csimanager` which fingerprints CSI plugins, in a similar way to
  `drivermanager` and `devicemanager`. It currently only fingerprints
  the NodeID from the plugin, and assumes that all plugins are
  monolithic.

Missing features

* We do not use the live updates of the `dynamicplugin` registry in
  the `csimanager` yet.

* We do not deregister the plugins from the client when they shutdown
  yet, they just become indefinitely marked as unhealthy. This is
  deliberate until we figure out how we should manage deploying new
  versions of plugins/transitioning them.
2020-03-23 13:58:28 -04:00
Mahmood Ali 5801039214 address review feedback 2020-03-21 17:52:58 -04:00
Mahmood Ali e1f53347e9 tr: proceed to mark other tasks as dead if alloc fails 2020-03-21 17:52:58 -04:00
Mahmood Ali e30d26b404 fix test 2020-03-21 17:52:57 -04:00
Jasmine Dahilig 73a64e4397 change jobspec lifecycle stanza to use sidecar attribute instead of
block_until status
2020-03-21 17:52:57 -04:00
Jasmine Dahilig 89778bc88d fix restart policy for system jobs with no lifecycle 2020-03-21 17:52:56 -04:00
Jasmine Dahilig 56e0b8e933 refactor TaskHookCoordinator tests to use mock package and add failed init and sidecar test cases 2020-03-21 17:52:56 -04:00
Jasmine Dahilig 2a8dac077c remove debugging test code from TestAllocRunner_TaskLeader_StopRestoredTG 2020-03-21 17:52:54 -04:00
Jasmine Dahilig deb26aefab fix bug in lifecycle restore tests after refactor 2020-03-21 17:52:54 -04:00
Jasmine Dahilig 2e93d7a875 fix failing ci test: TestTaskRunner_UnregisterConsul_Retries 2020-03-21 17:52:54 -04:00
Jasmine Dahilig d54a83afee fix linting errors 2020-03-21 17:52:53 -04:00
Jasmine Dahilig 3d1ffb9337 add task hook coordinator many init tasks test case 2020-03-21 17:52:53 -04:00
Jasmine Dahilig 80f0256cb4 refactor task hook coordinator helper method and tests 2020-03-21 17:52:53 -04:00
Jasmine Dahilig a0fe570317 clean up restore test 2020-03-21 17:52:52 -04:00
Jasmine Dahilig 7ed08eb75a partial test for restore functionality 2020-03-21 17:52:52 -04:00
Jasmine Dahilig 0c44d0017d account for client restarts in task lifecycle hooks 2020-03-21 17:52:51 -04:00
Jasmine Dahilig 4ab39318cc clean up restart conditions and restart tests for task lifecycle 2020-03-21 17:52:50 -04:00
Jasmine Dahilig 7064deaafb put lifecycle nil and empty checks in api Canonicalize 2020-03-21 17:52:50 -04:00
Jasmine Dahilig c27223207c update task hook coordinator tests 2020-03-21 17:52:46 -04:00
Jasmine Dahilig 12393f90e7 add test for lifecycle coordinator 2020-03-21 17:52:42 -04:00
Jasmine Dahilig b9a258ed7b incorporate lifecycle into restart tracker 2020-03-21 17:52:40 -04:00
Mahmood Ali d7354b8920 Add a coordinator for alloc runners 2020-03-21 17:52:38 -04:00
Fredrik Hoem Grelland edb3bd0f3f Update consul-template to v0.24.1 and remove deprecated vault_grace (#7170) 2020-02-23 16:24:53 +01:00
Mahmood Ali 98ad59b1de update rest of consul packages 2020-02-16 16:25:04 -06:00
Seth Hoenig 0e44094d1a client: enable configuring enable_tag_override for services
Consul provides a feature of Service Definitions where the tags
associated with a service can be modified through the Catalog API,
overriding the value(s) configured in the agent's service configuration.

To enable this feature, the flag enable_tag_override must be configured
in the service definition.

Previously, Nomad did not allow configuring this flag, and thus the default
value of false was used. Now, it is configurable.

Because Nomad itself acts as a state machine around the the service definitions
of the tasks it manages, it's worth describing what happens when this feature
is enabled and why.

Consider the basic case where there is no Nomad, and your service is provided
to consul as a boring JSON file. The ultimate source of truth for the definition
of that service is the file, and is stored in the agent. Later, Consul performs
"anti-entropy" which synchronizes the Catalog (stored only the leaders). Then
with enable_tag_override=true, the tags field is available for "external"
modification through the Catalog API (rather than directly configuring the
service definition file, or using the Agent API). The important observation
is that if the service definition ever changes (i.e. the file is changed &
config reloaded OR the Agent API is used to modify the service), those
"external" tag values are thrown away, and the new service definition is
once again the source of truth.

In the Nomad case, Nomad itself is the source of truth over the Agent in
the same way the JSON file was the source of truth in the example above.
That means any time Nomad sets a new service definition, any externally
configured tags are going to be replaced. When does this happen? Only on
major lifecycle events, for example when a task is modified because of an
updated job spec from the 'nomad job run <existing>' command. Otherwise,
Nomad's periodic re-sync's with Consul will now no longer try to restore
the externally modified tag values (as long as enable_tag_override=true).

Fixes #2057
2020-02-10 08:00:55 -06:00
Seth Hoenig db7bcba027 tests: set consul token for nomad client for testing SIDS TR hook 2020-01-31 19:06:15 -06:00
Seth Hoenig 9b20ca5b25 e2e: setup consul ACLs a little more correctly 2020-01-31 19:06:11 -06:00
Seth Hoenig 4152254c3a tests: skip some SIDS hook tests if running tests as root 2020-01-31 19:05:32 -06:00
Seth Hoenig 441e8c7db7 client: additional test cases around failures in SIDS hook 2020-01-31 19:05:27 -06:00
Seth Hoenig c281b05fc0 client: PR cleanup - improved logging around kill task in SIDS hook 2020-01-31 19:05:23 -06:00
Seth Hoenig 03a4af9563 client: PR cleanup - shadow context variable 2020-01-31 19:05:19 -06:00
Seth Hoenig 587a5d4a8d nomad: make TaskGroup.UsesConnect helper a public helper 2020-01-31 19:05:11 -06:00
Seth Hoenig 057f117592 client: manage TR kill from parent on SI token derivation failure
Re-orient the management of the tr.kill to happen in the parent of
the spawned goroutine that is doing the actual token derivation. This
makes the code a little more straightforward, making it easier to
reason about not leaking the worker goroutine.
2020-01-31 19:05:02 -06:00
Seth Hoenig c8761a3f11 client: set context timeout around SI token derivation
The derivation of an SI token needs to be safegaurded by a context
timeout, otherwise an unresponsive Consul could cause the siHook
to block forever on Prestart.
2020-01-31 19:04:56 -06:00
Seth Hoenig 4ee55fcd6c nomad,client: apply more comment/style PR tweaks 2020-01-31 19:04:52 -06:00
Seth Hoenig be7c671919 nomad,client: apply smaller PR suggestions
Apply smaller suggestions like doc strings, variable names, etc.

Co-Authored-By: Nick Ethier <nethier@hashicorp.com>
Co-Authored-By: Michael Schurter <mschurter@hashicorp.com>
2020-01-31 19:04:40 -06:00
Seth Hoenig 78a7d1e426 comments: cleanup some leftover debug comments and such 2020-01-31 19:04:35 -06:00
Seth Hoenig 5c5da95f34 client: skip task SI token file load failure if testing as root
The TestEnvoyBootstrapHook_maybeLoadSIToken test case only works when
running as a non-priveleged user, since it deliberately tries to read
an un-readable file to simulate a failure loading the SI token file.
2020-01-31 19:04:30 -06:00
Seth Hoenig ab7ae8bbb4 client: remove unused indirection for referencing consul executable
Was thinking about using the testing pattern where you create executable
shell scripts as test resources which "mock" the process a bit of code
is meant to fork+exec. Turns out that wasn't really necessary in this case.
2020-01-31 19:04:25 -06:00
Seth Hoenig 2c7ac9a80d nomad: fixup token policy validation 2020-01-31 19:04:08 -06:00
Seth Hoenig d204f2f4f0 client: enable envoy bootstrap hook to set SI token
When creating the envoy bootstrap configuration, we should append
the "-token=<token>" argument in the case where the sidsHook placed
the token in the secrets directory.
2020-01-31 19:04:01 -06:00
Seth Hoenig 9df33f622f nomad: proxy requests for Service Identity tokens between Clients and Consul
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
2020-01-31 19:03:53 -06:00