In #15417 we added a new `Authenticate` method to the server that returns an
`AuthenticatedIdentity` struct. This changeset implements this method for a
small number of RPC endpoints that together represent all the various ways in
which RPCs are sent, so that we can validate that we're happy with this
approach.
Nomad server components that aren't in the `nomad` package like the deployment
watcher and volume watcher need to make RPC calls but can't import the Server
struct to do so because it creates a circular reference. These components have a
"shim" object that gets populated to pass a "static" handler that has no RPC
context.
Most RPC handlers are never used in this way, but during server setup we were
constructing a set of static handlers for most RPC endpoints anyways. This is
slightly wasteful but also confusing to developers who end up being encouraged
to just copy what was being done for previous RPCs.
This changeset includes the following refactorings:
* Remove the static handlers field on the server
* Instead construct just the specific static handlers we need to pass into the
deployment watcher and volume watcher.
* Remove the unnecessary static handler from heartbeater
* Update various tests to avoid needing the static endpoints and have them use a
endpoint constructed on the spot.
Follow-up work will examine whether we can remove the RPCs from deployment
watcher and volume watcher entirely, falling back to raft applies like node
drainer does currently.
When replication of a single key fails, the replication loop breaks early and
therefore keys that fall later in the sorting order will never get
replicated. This is particularly a problem for clusters impacted by the bug that
caused #14981 and that were later upgraded; the keys that were never replicated
can now never be replicated, and so we need to handle them safely.
Included in the replication fix:
* Refactor the replication loop so that each key replicated in a function call
that returns an error, to make the workflow more clear and reduce nesting. Log
the error and continue.
* Improve stability of keyring replication tests. We no longer block leadership
on initializing the keyring, so there's a race condition in the keyring tests
where we can test for the existence of the root key before the keyring has
been initialize. Change this to an "eventually" test.
But these fixes aren't enough to fix#14981 because they'll end up seeing an
error once a second complaining about the missing key, so we also need to fix
keyring GC so the keys can be removed from the state store. Now we'll store the
key ID used to sign a workload identity in the Allocation, and we'll index the
Allocation table on that so we can track whether any live Allocation was signed
with a particular key ID.
The List RPC correctly authorized against the prefix argument. But when
filtering results underneath the prefix, it only checked authorization for
standard ACL tokens and not Workload Identity. This results in WI tokens being
able to read List results (metadata only: variable paths and timestamps) for
variables under the `nomad/` prefix that belong to other jobs in the same
namespace.
Fixes the filtering and split the `handleMixedAuthEndpoint` function into
separate authentication and authorization steps so that we don't need to
re-verify the claim token on each filtered object.
Also includes:
* update semgrep rule for mixed auth endpoints
* variables: List returns empty set when all results are filtered