When native service discovery was added, we used the node secret as the auth
token. Once Workload Identity was added in Nomad 1.4.x we needed to use the
claim token for `template` blocks, and so we allowed valid claims to bypass the
ACL policy check to preserve the existing behavior. (Invalid claims are still
rejected, so this didn't widen any security boundary.)
In reworking authentication for 1.5.0, we unintentionally removed this
bypass. For WIs without a policy attached to their job, everything works as
expected because the resulting `acl.ACL` is nil. But once a policy is attached
to the job the `acl.ACL` is no longer nil and this causes permissions errors.
Fix the regression by adding back the bypass for valid claims. In future work,
we should strongly consider getting turning the implicit policies into real
`ACLPolicy` objects (even if not stored in state) so that we don't have these
kind of brittle exceptions to the auth code.
The signature of the `raftApply` function requires that the caller unwrap the
first returned value (the response from `FSM.Apply`) to see if it's an
error. This puts the burden on the caller to remember to check two different
places for errors, and we've done so inconsistently.
Update `raftApply` to do the unwrapping for us and return any `FSM.Apply` error
as the error value. Similar work was done in Consul in
https://github.com/hashicorp/consul/pull/9991. This eliminates some boilerplate
and surfaces a few minor bugs in the process:
* job deregistrations of already-GC'd jobs were still emitting evals
* reconcile job summaries does not return scheduler errors
* node updates did not report errors associated with inconsistent service
discovery or CSI plugin states
Note that although _most_ of the `FSM.Apply` functions return only errors (which
makes it tempting to remove the first return value entirely), there are few that
return `bool` for some reason and Variables relies on the response value for
proper CAS checking.
This changeset configures the RPC rate metrics that were added in #15515 to all
the RPCs that support authenticated HTTP API requests. These endpoints already
configured with pre-forwarding authentication in #15870, and a handful of others
were done already as part of the proof-of-concept work. So this changeset is
entirely copy-and-pasting one method call into a whole mess of handlers.
Upcoming PRs will wire up pre-forwarding auth and rate metrics for the remaining
set of RPCs that have no API consumers or aren't authenticated, in smaller
chunks that can be more thoughtfully reviewed.
This changeset allows Workload Identities to authenticate to all the RPCs that
support HTTP API endpoints, for use with PR #15864.
* Extends the work done for pre-forwarding authentication to all RPCs that
support a HTTP API endpoint.
* Consolidates the auth helpers used by the CSI, Service Registration, and Node
endpoints that are currently used to support both tokens and client secrets.
Intentionally excluded from this changeset:
* The Variables endpoint still has custom handling because of the implicit
policies. Ideally we'll figure out an efficient way to resolve those into real
policies and then we can get rid of that custom handling.
* The RPCs that don't currently support auth tokens (i.e. those that don't
support HTTP endpoints) have not been updated with the new pre-forwarding auth
We'll be doing this under a separate PR to support RPC rate metrics.
Upcoming work to instrument the rate of RPC requests by consumer (and eventually
rate limit) requires that we thread the `RPCContext` through all RPC
handlers so that we can access the underlying connection. This changeset adds
the context to everywhere we intend to initially support it and intentionally
excludes streaming RPCs and client RPCs.
To improve the ergonomics of adding the context everywhere its needed and to
clarify the requirements of dynamic vs static handlers, I've also done a good
bit of refactoring here:
* canonicalized the RPC handler fields so they're as close to identical as
possible without introducing unused fields (i.e. I didn't add loggers if the
handler doesn't use them already).
* canonicalized the imports in the handler files.
* added a `NewExampleEndpoint` function for each handler that ensures we're
constructing the handlers with the required arguments.
* reordered the registration in server.go to match the order of the files (to
make it easier to see if we've missed one), and added a bunch of commentary
there as to what the difference between static and dynamic handlers is.
Adds a new policy block inside namespaces to control access to secure
variables on the basis of path, with support for globbing.
Splits out VerifyClaim from ResolveClaim.
The ServiceRegistration RPC only needs to be able to verify that a
claim is valid for some allocation in the store; it doesn't care about
implicit policies or capabilities. Split this out to its own method on
the server so that the SecureVariables RPC can reuse it as a separate
step from resolving policies (see next commit).
Support implicit policies based on workload identity
In order to support implicit ACL policies for tasks to get their own
secrets, each task would need to have its own ACL token. This would
add extra raft overhead as well as new garbage collection jobs for
cleaning up task-specific ACL tokens. Instead, Nomad will create a
workload Identity Claim for each task.
An Identity Claim is a JSON Web Token (JWT) signed by the server’s
private key and attached to an Allocation at the time a plan is
applied. The encoded JWT can be submitted as the X-Nomad-Token header
to replace ACL token secret IDs for the RPCs that support identity
claims.
Whenever a key is is added to a server’s keyring, it will use the key
as the seed for a Ed25519 public-private private keypair. That keypair
will be used for signing the JWT and for verifying the JWT.
This implementation is a ruthlessly minimal approach to support the
secure variables feature. When a JWT is verified, the allocation ID
will be checked against the Nomad state store, and non-existent or
terminal allocation IDs will cause the validation to be rejected. This
is sufficient to support the secure variables feature at launch
without requiring implementation of a background process to renew
soon-to-expire tokens.
This PR adds the 'choose' query parameter to the '/v1/service/<service>' endpoint.
The value of 'choose' is in the form '<number>|<key>', number is the number
of desired services and key is a value unique but consistent to the requester
(e.g. allocID).
Folks aren't really expected to use this API directly, but rather through consul-template
which will soon be getting a new helper function making use of this query parameter.
Example,
curl 'localhost:4646/v1/service/redis?choose=2|abc123'
Note: consul-templte v0.29.1 includes the necessary nomadServices functionality.
* services: add pagination and filter support to info RPC.
* cli: add filter flag to service info command.
* docs: add pagination and filter details to services info API.
* paginator: minor updates to comment and func signature.
In the same manner as the delete RPC, the list and read service
registration endpoints can be called either by external operators
or Nomad nodes. The latter occurs when a template is being
rendered which includes Nomad API template funcs. In this case,
the auth token is looked up as the node secret ID for auth.