Commit graph

15 commits

Author SHA1 Message Date
Tim Gross ce3eef8037
metrics: Add rate metrics to Client CSI endpoints (#15905)
Also tightens up authentication for these endpoints by enforcing the server
certificate name is valid. We protect these endpoints currently by mTLS and
can't use an auth token because these endpoints are (uniquely) called by the
leader and followers for a given node won't have the leader's ephemeral ACL
token. Add a certificate name check that requests come from a server and not a
client, because no client should ever send these RPCs directly.
2023-01-26 16:40:58 -05:00
Tim Gross dfed1ba5bc
remove most static RPC handlers (#15451)
Nomad server components that aren't in the `nomad` package like the deployment
watcher and volume watcher need to make RPC calls but can't import the Server
struct to do so because it creates a circular reference. These components have a
"shim" object that gets populated to pass a "static" handler that has no RPC
context.

Most RPC handlers are never used in this way, but during server setup we were
constructing a set of static handlers for most RPC endpoints anyways. This is
slightly wasteful but also confusing to developers who end up being encouraged
to just copy what was being done for previous RPCs.

This changeset includes the following refactorings:
* Remove the static handlers field on the server
* Instead construct just the specific static handlers we need to pass into the
  deployment watcher and volume watcher.
* Remove the unnecessary static handler from heartbeater
* Update various tests to avoid needing the static endpoints and have them use a
  endpoint constructed on the spot.

Follow-up work will examine whether we can remove the RPCs from deployment
watcher and volume watcher entirely, falling back to raft applies like node
drainer does currently.
2022-12-02 10:12:05 -05:00
Michael Schurter 3b57df33e3
client: fix data races in config handling (#14139)
Before this change, Client had 2 copies of the config object: config and configCopy. There was no guidance around which to use where (other than configCopy's comment to pass it to alloc runners), both are shared among goroutines and mutated in data racy ways. At least at one point I think the idea was to have `config` be mutable and then grab a lock to overwrite `configCopy`'s pointer atomically. This would have allowed alloc runners to read their config copies in data race safe ways, but this isn't how the current implementation worked.

This change takes the following approach to safely handling configs in the client:

1. `Client.config` is the only copy of the config and all access must go through the `Client.configLock` mutex
2. Since the mutex *only protects the config pointer itself and not fields inside the Config struct:* all config mutation must be done on a *copy* of the config, and then Client's config pointer is overwritten while the mutex is acquired. Alloc runners and other goroutines with the old config pointer will not see config updates.
3. Deep copying is implemented on the Config struct to satisfy the previous approach. The TLS Keyloader is an exception because it has its own internal locking to support mutating in place. An unfortunate complication but one I couldn't find a way to untangle in a timely fashion.
4. To facilitate deep copying I made an *internally backward incompatible API change:* our `helper/funcs` used to turn containers (slices and maps) with 0 elements into nils. This probably saves a few memory allocations but makes it very easy to cause panics. Since my new config handling approach uses more copying, it became very difficult to ensure all code that used containers on configs could handle nils properly. Since this code has caused panics in the past, I fixed it: nil containers are copied as nil, but 0-element containers properly return a new 0-element container. No more "downgrading to nil!"
2022-08-18 16:32:04 -07:00
Seth Hoenig 2631659551 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Tim Gross 466b620fa4
CSI: volume snapshot 2021-04-01 11:16:52 -04:00
Tim Gross d38008176e CSI: create/delete/list volume RPCs
This commit implements the RPC handlers on the client that talk to the CSI
plugins on that client for the Create/Delete/List RPC.
2021-03-31 16:37:09 -04:00
Tim Gross 43622680fa test infrastructure for mock client RPCs (#10193)
This commit includes a new test client that allows overriding the RPC
protocols. Only the RPCs that are passed in are registered, which lets you
implement a mock RPC in the server tests. This commit includes an example of
this for the ClientCSI RPC server.
2021-03-31 16:37:09 -04:00
Drew Bailey 6c788fdccd
Events/msgtype cleanup (#9117)
* use msgtype in upsert node

adds message type to signature for upsert node, update tests, remove placeholder method

* UpsertAllocs msg type test setup

* use upsertallocs with msg type in signature

update test usage of delete node

delete placeholder msgtype method

* add msgtype to upsert evals signature, update test call sites with test setup msg type

handle snapshot upsert eval outside of FSM and ignore eval event

remove placeholder upsertevalsmsgtype

handle job plan rpc and prevent event creation for plan

msgtype cleanup upsertnodeevents

updatenodedrain msgtype

msg type 0 is a node registration event, so set the default  to the ignore type

* fix named import

* fix signature ordering on upsertnode to match
2020-10-19 09:30:15 -04:00
Tim Gross 4bbf18703f
csi: retry controller client RPCs on next controller (#8561)
The documentation encourages operators to run multiple controller plugin
instances for HA, but the client RPCs don't take advantage of this by retrying
when the RPC fails in cases when the plugin is unavailable (because the node
has drained or the alloc has failed but we haven't received an updated
fingerprint yet).

This changeset tries all known controllers on ready nodes before giving up,
and adds tests that exercise the client RPC routing and retries.
2020-08-06 13:24:24 -04:00
Mahmood Ali c3c2a85314 tests: wait until clients are in the state store 2020-05-26 18:53:24 -04:00
Tim Gross f37e986b1b
refactor: make nodeForControllerPlugin private to ClientCSI (#7688)
The current design of `ClientCSI` RPC requires that callers in the
server know about the free-standing `nodeForControllerPlugin`
function. This makes it difficult to send `ClientCSI` RPC messages
from subpackages of `nomad` and adds a bunch of boilerplate to every
server-side caller of a controller RPC.

This changeset makes it so that the `ClientCSI` RPCs will populate and
validate the controller's client node ID if it hasn't been passed by
the caller, centralizing the logic of picking and validating
controller targets into the `nomad.ClientCSI` struct.
2020-04-10 16:47:21 -04:00
Tim Gross f6b3d38eb8
CSI: move node unmount to server-driven RPCs (#7596)
If a volume-claiming alloc stops and the CSI Node plugin that serves
that alloc's volumes is missing, there's no way for the allocrunner
hook to send the `NodeUnpublish` and `NodeUnstage` RPCs.

This changeset addresses this issue with a redesign of the client-side
for CSI. Rather than unmounting in the alloc runner hook, the alloc
runner hook will simply exit. When the server gets the
`Node.UpdateAlloc` for the terminal allocation that had a volume claim,
it creates a volume claim GC job. This job will made client RPCs to a
new node plugin RPC endpoint, and only once that succeeds, move on to
making the client RPCs to the controller plugin. If the node plugin is
unavailable, the GC job will fail and be requeued.
2020-04-02 16:04:56 -04:00
Tim Gross 22e9f679c3 csi: implement controller detach RPCs (#7356)
This changeset implements the remaining controller detach RPCs: server-to-client and client-to-controller. The tests also uncovered a bug in our RPC for claims which is fixed here; the volume claim RPC is used for both claiming and releasing a claim on a volume. We should only submit a controller publish RPC when the claim is new and not when it's being released.
2020-03-23 13:59:25 -04:00
Tim Gross b3bf64485e csi: remove DevDisableBootstrap flag from tests (#7267)
In #7252 we removed the `DevDisableBootstrap` flag to require tests to
honor only `BootstrapExpect`, in order to reduce a source of test
flakiness. This changeset applies the same fix to the CSI tests.
2020-03-23 13:58:30 -04:00
Danielle Lancashire e75f057df3 csi: Fix Controller RPCs
Currently the handling of CSINode RPCs does not correctly handle
forwarding RPCs to Nodes.

This commit fixes this by introducing a shim RPC
(nomad/client_csi_enpdoint) that will correctly forward the request to
the owning node, or submit the RPC to the client.

In the process it also cleans up handling a little bit by adding the
`CSIControllerQuery` embeded struct for required forwarding state.

The CSIControllerQuery embeding the requirement of a `PluginID` also
means we could move node targetting into the shim RPC if wanted in the
future.
2020-03-23 13:58:30 -04:00