* Persist HCP management token from server config
We want to move away from injecting an initial management token into
Consul clusters linked to HCP. The reasoning is that by using a separate
class of token we can have more flexibility in terms of allowing HCP's
token to co-exist with the user's management token.
Down the line we can also more easily adjust the permissions attached to
HCP's token to limit it's scope.
With these changes, the cloud management token is like the initial
management token in that iit has the same global management policy and
if it is created it effectively bootstraps the ACL system.
* Update SDK and mock HCP server
The HCP management token will now be sent in a special field rather than
as Consul's "initial management" token configuration.
This commit also updates the mock HCP server to more accurately reflect
the behavior of the CCM backend.
* Refactor HCP bootstrapping logic and add tests
We want to allow users to link Consul clusters that already exist to
HCP. Existing clusters need care when bootstrapped by HCP, since we do
not want to do things like change ACL/TLS settings for a running
cluster.
Additional changes:
* Deconstruct MaybeBootstrap so that it can be tested. The HCP Go SDK
requires HTTPS to fetch a token from the Auth URL, even if the backend
server is mocked. By pulling the hcp.Client creation out we can modify
its TLS configuration in tests while keeping the secure behavior in
production code.
* Add light validation for data received/loaded.
* Sanitize initial_management token from received config, since HCP will
only ever use the CloudConfig.MangementToken.
* Add changelog entry
All of the current integration tests where Vault is the Connect CA now use non-root tokens for the test. This helps us detect privilege changes in the vault model so we can keep our guides up to date.
One larger change was that the RenewIntermediate function got refactored slightly so it could be used from a test, rather than the large duplicated function we were testing in a test which seemed error prone.
* Remove log line about server mgmt token init
Currently the server management token is only being bootstrapped in the
primary datacenter. That means that servers on the secondary datacenter
will never have this token available, and would log this line any time a
token is resolved.
Bootstrapping the token in secondary datacenters will be done in a
follow-up.
* Add changelog entry
This commit introduces a new ACL token used for internal server
management purposes.
It has a few key properties:
- It has unlimited permissions.
- It is persisted through Raft as System Metadata rather than in the
ACL tokens table. This is to avoid users seeing or modifying it.
- It is re-generated on leadership establishment.
Introduces two new public gRPC endpoints (`Login` and `Logout`) and
includes refactoring of the equivalent net/rpc endpoints to enable the
majority of logic to be reused (i.e. by extracting the `Binder` and
`TokenWriter` types).
This contains the OSS portions of the following enterprise commits:
- 75fcdbfcfa6af21d7128cb2544829ead0b1df603
- bce14b714151af74a7f0110843d640204082630a
- cc508b70fbf58eda144d9af3d71bd0f483985893
This change allows us to remove one of the last remaining duplicate
resolve token methods (Server.ResolveToken).
With this change we are down to only 2, where the second one also
handles setting the default EnterpriseMeta from the token.
Now that ACLResolver is embedded we don't need ResolveTokenToIdentity on
Client and Server.
Moving ResolveTokenAndDefaultMeta to ACLResolver removes the duplicate
implementation.
As part of removing the legacy ACL system ACL upgrading and the flag for
legacy ACLs is removed from Clients.
This commit also removes the 'acls' serf tag from client nodes. The tag is only ever read
from server nodes.
This commit also introduces a constant for the acl serf tag, to make it easier to track where
it is used.
This field has been unnecessary for a while now. It was always set to the same value
as PrimaryDatacenter. So we can remove the duplicate field and use PrimaryDatacenter
directly.
This change was made by GoLand refactor, which did most of the work for me.
This method suffered from similar naming to a couple other methods on Server, and had not great
re-use (2 callers). By copying a few of the lines into one of the callers we can move the
implementation into the second caller.
Once moved, we can see that ResolveTokenAndDefaultMeta is identical in both Client and Server, and
likely should be further refactored, possibly into ACLResolver.
This change is being made to make ACL resolution easier to trace.
This method was an alias for ACLResolver.ResolveTokenToIdentityAndAuthorizer. By removing the
method that does nothing the code becomes easier to trace.
ACL filtering only needs an authorizer and a logger. We can decouple filtering from
the ACLResolver by passing in the necessary logger.
This change is being made in preparation for moving the ACLResolver into an acl package
filterACLWithAuthorizer could never return an error. This change moves us a little bit
closer to being able to enable errcheck and catch problems caused by unhandled error
return values.
This way we only have to wait for the serf barrier to pass once before
we can upgrade to v2 acls. Without this patch every restart needs to
re-compute the change, and potentially if a stray older node joins after
a migration it might regress back to v1 mode which would be problematic.
This allows for client agent to be run in a more stateless manner where they may be abruptly terminated and not expected to come back. If advertising a per-agent reconnect timeout using the advertise_reconnect_timeout configuration when that agent leaves, other agents will wait only that amount of time for the agent to come back before reaping it.
This has the advantageous side effect of causing servers to deregister the node/services/checks for that agent sooner than if the global reconnect_timeout was used.
In all cases (oss/ent, client/server) this method was returning a value from config. Since the
value is consistent, it doesn't need to be part of the delegate interface.
Previously this happened to be using the method on the Server/Client that was meant to allow the ACLResolver to locally resolve tokens. On Servers that had tokens (primary or secondary dc + token replication) this function would lookup the token from raft and return the ACLIdentity. On clients this was always a noop. We inadvertently used this function instead of creating a new one when we added logging accessor ids for permission denied RPC requests.
With this commit, a new method is used for resolving the identity properly via the ACLResolver which may still resolve locally in the case of being on a server with tokens but also supports remote token resolution.
Also reduce the log level of some version checking messages on the server as they can be pretty noisy during upgrades and really are more for debugging purposes.
These changes are necessary to ensure advertisement happens correctly even when datacenters are connected via network areas in Consul enterprise.
This also changes how we check if ACLs can be upgraded within the local datacenter. Previously we would iterate through all LAN members. Now we just use the ServerLookup type to iterate through all known servers in the DC.
Main Changes:
• method signature updates everywhere to account for passing around enterprise meta.
• populate the EnterpriseAuthorizerContext for all ACL related authorizations.
• ACL resource listings now operate like the catalog or kv listings in that the returned entries are filtered down to what the token is allowed to see. With Namespaces its no longer all or nothing.
• Modified the acl.Policy parsing to abstract away basic decoding so that enterprise can do it slightly differently. Also updated method signatures so that when parsing a policy it can take extra ent metadata to use during rules validation and policy creation.
Secondary Changes:
• Moved protobuf encoding functions out of the agentpb package to eliminate circular dependencies.
• Added custom JSON unmarshalers for a few ACL resource types (to support snake case and to get rid of mapstructure)
• AuthMethod validator cache is now an interface as these will be cached per-namespace for Consul Enterprise.
• Added checks for policy/role link existence at the RPC API so we don’t push the request through raft to have it fail internally.
• Forward ACL token delete request to the primary datacenter when the secondary DC doesn’t have the token.
• Added a bunch of ACL test helpers for inserting ACL resource test data.
Roles are named and can express the same bundle of permissions that can
currently be assigned to a Token (lists of Policies and Service
Identities). The difference with a Role is that it not itself a bearer
token, but just another entity that can be tied to a Token.
This lets an operator potentially curate a set of smaller reusable
Policies and compose them together into reusable Roles, rather than
always exploding that same list of Policies on any Token that needs
similar permissions.
This also refactors the acl replication code to be semi-generic to avoid
3x copypasta.
This PR adds two features which will be useful for operators when ACLs are in use.
1. Tokens set in configuration files are now reloadable.
2. If `acl.enable_token_persistence` is set to `true` in the configuration, tokens set via the `v1/agent/token` endpoint are now persisted to disk and loaded when the agent starts (or during configuration reload)
Note that token persistence is opt-in so our users who do not want tokens on the local disk will see no change.
Some other secondary changes:
* Refactored a bunch of places where the replication token is retrieved from the token store. This token isn't just for replicating ACLs and now it is named accordingly.
* Allowed better paths in the `v1/agent/token/` API. Instead of paths like: `v1/agent/token/acl_replication_token` the path can now be just `v1/agent/token/replication`. The old paths remain to be valid.
* Added a couple new API functions to set tokens via the new paths. Deprecated the old ones and pointed to the new names. The names are also generally better and don't imply that what you are setting is for ACLs but rather are setting ACL tokens. There is a minor semantic difference there especially for the replication token as again, its no longer used only for ACL token/policy replication. The new functions will detect 404s and fallback to using the older token paths when talking to pre-1.4.3 agents.
* Docs updated to reflect the API additions and to show using the new endpoints.
* Updated the ACL CLI set-agent-tokens command to use the non-deprecated APIs.
* Add leader token upgrade test and fix various ACL enablement bugs
* Update the leader ACL initialization tests.
* Add a StateStore ACL tests for ACLTokenSet and ACLTokenGetBy* functions
* Advertise the agents acl support status with the agent/self endpoint.
* Make batch token upsert CAS’able to prevent consistency issues with token auto-upgrade
* Finish up the ACL state store token tests
* Finish the ACL state store unit tests
Also rename some things to make them more consistent.
* Do as much ACL replication testing as I can.