* Upgrade from hashicorp/go-msgpack v1.1.5 to v2.1.0
Fixes#16808
* Update hashicorp/net-rpc-msgpackrpc to v2 to match go-msgpack
* deps: use go-msgpack v2.0.0
go-msgpack v2.1.0 includes some code changes that we will need to
investigate furthere to assess its impact on Nomad, so keeping this
dependency on v2.0.0 for now since it's no-op.
---------
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Requests without an ACL token that pass thru the client's HTTP API are treated
as though they come from the client itself. This allows bypass of ACLs on RPC
requests where ACL permissions are checked (like `Job.Register`). Invalid tokens
are correctly rejected.
Fix the bypass by only setting a client ID on the identity if we have a valid node secret.
Note that this changeset will break rate metrics for RPCs sent by clients
without a client secret such as `Node.GetClientAllocs`; these requests will be
recorded as anonymous.
Future work should:
* Ensure the node secret is sent with all client-driven RPCs except
`Node.Register` which is TOFU.
* Create a new `acl.ACL` object from client requests so that we
can enforce ACLs for all endpoints in a uniform way that's less error-prone.~
ACL policies can be associated with a job so that the job's Workload Identity
can have expanded access to other policy objects, including other
variables. Policies set on the variables the job automatically has access to
were ignored, but this includes policies with `deny` capabilities.
Additionally, when resolving claims for a workload identity without any attached
policies, the `ResolveClaims` method returned a `nil` ACL object, which is
treated similarly to a management token. While this was safe in Nomad 1.4.x,
when the workload identity token was exposed to the task via the `identity`
block, this allows a user with `submit-job` capabilities to escalate their
privileges.
We originally implemented automatic workload access to Variables as a separate
code path in the Variables RPC endpoint so that we don't have to generate
on-the-fly policies that blow up the ACL policy cache. This is fairly brittle
but also the behavior around wildcard paths in policies different from the rest
of our ACL polices, which is hard to reason about.
Add an `ACLClaim` parameter to the `AllowVariableOperation` method so that we
can push all this logic into the `acl` package and the behavior can be
consistent. This will allow a `deny` policy to override automatic access (and
probably speed up checks of non-automatic variable access).
If a consumer of the new `Authenticate` method gets passed a bogus token that's
a correctly-shaped UUID, it will correctly get an identity without a ACL
token. But most consumers will then panic when they consume this nil `ACLToken`
for authorization.
Because no API client should ever send a bogus auth token, update the
`Authenticate` method to create the identity with remote IP (for metrics
tracking) but also return an `ErrPermissionDenied`.
Implement a metric for RPC requests with labels on the identity, so that
administrators can monitor the source of requests within the cluster. This
changeset demonstrates the change with the new `ACL.WhoAmI` RPC, and we'll wire
up the remaining RPCs once we've threaded the new pre-forwarding authentication
through the all.
Note that metrics are measured after we forward but before we return any
authentication error. This ensures that we only emit metrics on the server that
actually serves the request. We'll perform rate limiting at the same place.
Includes telemetry configuration to omit identity labels.
This changeset covers a sidebar discussion that @schmichael and I had around the
design for pre-forwarding auth. This includes some changes extracted out of
#15513 to make it easier to review both and leave a clean history.
* Remove fast path for NodeID. Previously-connected clients will have a NodeID
set on the context, and because this is a large portion of the RPCs sent we
fast-pathed it at the top of the `Authenticate` method. But the context is
shared for all yamux streams over the same yamux session (and TCP
connection). This lets an authenticated HTTP request to a client use the
NodeID for authentication, which is a privilege escalation. Remove the fast
path and annotate it so that we don't break it again.
* Add context to decisions around AuthenticatedIdentity. The `Authenticate`
method taken on its own looks like it wants to return an `acl.ACL` that folds
over all the various identity types (creating an ephemeral ACL on the fly if
neccessary). But keeping these fields idependent allows RPC handlers to
differentiate between internal and external origins so we most likely want to
avoid this. Leave some docstrings as a warning as to why this is built the way
it is.
* Mutate the request rather than returning. When reviewing #15513 we decided
that forcing the request handler to call `SetIdentity` was repetitive and
error prone. Instead, the `Authenticate` method mutates the request by setting
its `AuthenticatedIdentity`.
Upcoming work to instrument the rate of RPC requests by consumer (and eventually
rate limit) require that we authenticate a RPC request before forwarding. Add a
new top-level `Authenticate` method to the server and have it return an
`AuthenticatedIdentity` struct. RPC handlers will use the relevant fields of
this identity for performing authorization.
This changeset includes:
* The main implementation of `Authenticate`
* Provide a new RPC `ACL.WhoAmI` for debugging authentication. This endpoint
returns the same `AuthenticatedIdentity` that will be used by RPC handlers. At
some point we might want to give this an equivalent HTTP endpoint but I didn't
want to add that to our public API until some of the other Workload Identity
work is solidified, especially if we don't need it yet.
* A full coverage test of the `Authenticate` method. This sets up two server
nodes with mTLS and ACLs, some tokens, and some allocations with workload
identities.
* Wire up an example of using `Authenticate` in the `Namespace.Upsert` RPC and
see how authorization happens after forwarding.
* A new semgrep rule for `Authenticate`, which we'll need to update once we're
ready to wire up more RPC endpoints with authorization steps.
The original design for workload identities and ACLs allows for operators to
extend the automatic capabilities of a workload by using a specially-named
policy. This has shown to be potentially unsafe because of naming collisions, so
instead we'll allow operators to explicitly attach a policy to a workload
identity.
This changeset adds workload identity fields to ACL policy objects and threads
that all the way down to the command line. It also a new secondary index to the
ACL policy table on namespace and job so that claim resolution can efficiently
query for related policies.
ACL Roles along with policies and global token will be replicated
from the authoritative region to all federated regions. This
involves a new replication loop running on the federated leader.
Policies and roles may be replicated at different times, meaning
the policies and role references may not be present within the
local state upon replication upsert. In order to bypass the RPC
and state check, a new RPC request parameter has been added. This
is used by the replication process; all other callers will trigger
the ACL role policy validation check.
There is a new ACL RPC endpoint to allow the reading of a set of
ACL Roles which is required by the replication process and matches
ACL Policies and Tokens. A bug within the ACL Role listing RPC has
also been fixed which returned incorrect data during blocking
queries where a deletion had occurred.
ACL tokens can now utilize ACL roles in order to provide API
authorization. Each ACL token can be created and linked to an
array of policies as well as an array of ACL role links. The link
can be provided via the role name or ID, but internally, is always
resolved to the ID as this is immutable whereas the name can be
changed by operators.
When resolving an ACL token, the policies linked from an ACL role
are unpacked and combined with the policy array to form the
complete auth set for the token.
The ACL token creation endpoint handles deduplicating ACL role
links as well as ensuring they exist within state.
When reading a token, Nomad will also ensure the ACL role link is
current. This handles ACL roles being deleted from under a token
from a UX standpoint.
This commit adds basic expiry checking when performing ACL token
resolution. This expiry checking is local to each server and does
not at this time take into account potential time skew on server
hosts.
A new error message has been created so clients whose token has
expired get a clear message, rather than a generic token not
found.
The ACL resolution tests have been refactored into table driven
tests, so additions are easier in the future.
* upsertaclpolicies
* delete acl policies msgtype
* upsert acl policies msgtype
* delete acl tokens msgtype
* acl bootstrap msgtype
wip unsubscribe on token delete
test that subscriptions are closed after an ACL token has been deleted
Start writing policyupdated test
* update test to use before/after policy
* add SubscribeWithACLCheck to run acl checks on subscribe
* update rpc endpoint to use broker acl check
* Add and use subscriptions.closeSubscriptionFunc
This fixes the issue of not being able to defer unlocking the mutex on
the event broker in the for loop.
handle acl policy updates
* rpc endpoint test for terminating acl change
* add comments
Co-authored-by: Kris Hicks <khicks@hashicorp.com>
allow oss to parse sink duration
clean up audit sink parsing
ent eventer config reload
fix typo
SetEnabled to eventer interface
client acl test
rm dead code
fix failing test
Copy the updated version of freeport (sdk/freeport), and tweak it for use
in Nomad tests. This means staying below port 10000 to avoid conflicts with
the lib/freeport that is still transitively used by the old version of
consul that we vendor. Also provide implementations to find ephemeral ports
of macOS and Windows environments.
Ports acquired through freeport are supposed to be returned to freeport,
which this change now also introduces. Many tests are modified to include
calls to a cleanup function for Server objects.
This should help quite a bit with some flakey tests, but not all of them.
Our port problems will not go away completely until we upgrade our vendor
version of consul. With Go modules, we'll probably do a 'replace' to swap
out other copies of freeport with the one now in 'nomad/helper/freeport'.