* Add leader token upgrade test and fix various ACL enablement bugs
* Update the leader ACL initialization tests.
* Add a StateStore ACL tests for ACLTokenSet and ACLTokenGetBy* functions
* Advertise the agents acl support status with the agent/self endpoint.
* Make batch token upsert CAS’able to prevent consistency issues with token auto-upgrade
* Finish up the ACL state store token tests
* Finish the ACL state store unit tests
Also rename some things to make them more consistent.
* Do as much ACL replication testing as I can.
This PR is almost a complete rewrite of the ACL system within Consul. It brings the features more in line with other HashiCorp products. Obviously there is quite a bit left to do here but most of it is related docs, testing and finishing the last few commands in the CLI. I will update the PR description and check off the todos as I finish them over the next few days/week.
Description
At a high level this PR is mainly to split ACL tokens from Policies and to split the concepts of Authorization from Identities. A lot of this PR is mostly just to support CRUD operations on ACLTokens and ACLPolicies. These in and of themselves are not particularly interesting. The bigger conceptual changes are in how tokens get resolved, how backwards compatibility is handled and the separation of policy from identity which could lead the way to allowing for alternative identity providers.
On the surface and with a new cluster the ACL system will look very similar to that of Nomads. Both have tokens and policies. Both have local tokens. The ACL management APIs for both are very similar. I even ripped off Nomad's ACL bootstrap resetting procedure. There are a few key differences though.
Nomad requires token and policy replication where Consul only requires policy replication with token replication being opt-in. In Consul local tokens only work with token replication being enabled though.
All policies in Nomad are globally applicable. In Consul all policies are stored and replicated globally but can be scoped to a subset of the datacenters. This allows for more granular access management.
Unlike Nomad, Consul has legacy baggage in the form of the original ACL system. The ramifications of this are:
A server running the new system must still support other clients using the legacy system.
A client running the new system must be able to use the legacy RPCs when the servers in its datacenter are running the legacy system.
The primary ACL DC's servers running in legacy mode needs to be a gate that keeps everything else in the entire multi-DC cluster running in legacy mode.
So not only does this PR implement the new ACL system but has a legacy mode built in for when the cluster isn't ready for new ACLs. Also detecting that new ACLs can be used is automatic and requires no configuration on the part of administrators. This process is detailed more in the "Transitioning from Legacy to New ACL Mode" section below.
* [Performance On Large clusters] Checks do update services/nodes only when really modified to avoid too many updates on very large clusters
In a large cluster, when having a few thousands of nodes, the anti-entropy
mechanism performs lots of changes (several per seconds) while
there is no real change. This patch wants to improve this in order
to increase Consul scalability when using many blocking requests on
health for instance.
* [Performance for large clusters] Only updates index of service if service is really modified
* [Performance for large clusters] Only updates index of nodes if node is really modified
* Added comments / ensure IsSame() has clear semantics
* Avoid having modified boolean, return nil directly if stutures are Same
* Fixed unstable unit tests TestLeader_ChangeServerID
* Rewrite TestNode_IsSame() for better readability as suggested by @banks
* Rename ServiceNode.IsSame() into IsSameService() + added unit tests
* Do not duplicate TestStructs_ServiceNode_Conversions() and increase test coverage of IsSameService
* Clearer documentation in IsSameService
* Take into account ServiceProxy into ServiceNode.IsSameService()
* Fixed IsSameService() with all new structures
* Fix CA pruning when CA config uses string durations.
The tl;dr here is:
- Configuring LeafCertTTL with a string like "72h" is how we do it by default and should be supported
- Most of our tests managed to escape this by defining them as time.Duration directly
- Out actual default value is a string
- Since this is stored in a map[string]interface{} config, when it is written to Raft it goes through a msgpack encode/decode cycle (even though it's written from server not over RPC).
- msgpack decode leaves the string as a `[]uint8`
- Some of our parsers required string and failed
- So after 1 hour, a default configured server would throw an error about pruning old CAs
- If a new CA was configured that set LeafCertTTL as a time.Duration, things might be OK after that, but if a new CA was just configured from config file, intialization would cause same issue but always fail still so would never prune the old CA.
- Mostly this is just a janky error that got passed tests due to many levels of complicated encoding/decoding.
tl;dr of the tl;dr: Yay for type safety. Map[string]interface{} combined with msgpack always goes wrong but we somehow get bitten every time in a new way :D
We already fixed this once! The main CA config had the same problem so @kyhavlov already wrote the mapstructure DecodeHook that fixes it. It wasn't used in several places it needed to be and one of those is notw in `structs` which caused a dependency cycle so I've moved them.
This adds a whole new test thta explicitly tests the case that broke here. It also adds tests that would have failed in other places before (Consul and Vaul provider parsing functions). I'm not sure if they would ever be affected as it is now as we've not seen things broken with them but it seems better to explicitly test that and support it to not be bitten a third time!
* Typo fix
* Fix bad Uint8 usage
* Revert "BUGFIX: Unit test relying on WaitForLeader() did not work due to wrong test (#4472)"
This reverts commit cec5d7239621e0732b3f70158addb1899442acb3.
* Revert "CA initialization while boostrapping and TestLeader_ChangeServerID fix. (#4493)"
This reverts commit 589b589b53e56af38de25db9b56967bdf1f2c069.
* Make sure that id and address are set in member created during reaping of catalog nodes that have been removed from serf
* Get address from node table in the state store rather than from service address
* Fix incorrect lookup by checkname instead of node name
* Make sure that serverlookup is called with the right address format, added unit test.
* Address code review comments
* Tweaks style stuff.
* Changes default Raft protocol to 3.
* Changes numPeers() to report only voters.
This should have been there before, but it's more obvious that this
is incorrect now that we default the Raft protocol to 3, which puts
new servers in a read-only state while Autopilot waits for them to
become healthy.
* Fixes TestLeader_RollRaftServer.
* Fixes TestOperator_RaftRemovePeerByAddress.
* Fixes TestServer_*.
Relaxed the check for a given number of voter peers and instead do
a thorough check that all servers see each other in their Raft
configurations.
* Fixes TestACL_*.
These now just check for Raft replication to be set up, and don't
care about the number of voter peers.
* Fixes TestOperator_Raft_ListPeers.
* Fixes TestAutopilot_CleanupDeadServerPeriodic.
* Fixes TestCatalog_ListNodes_ConsistentRead_Fail.
* Fixes TestLeader_ChangeServerID and adjusts the conn pool to throw away
sockets when it sees io.EOF.
* Changes version to 1.0.0 in the options doc.
* Makes metrics test more deterministic with autopilot metrics possible.
* Moves magic check and service constants into shared structs package.
* Removes the "consul" service from local state.
Since this service is added by the leader, it doesn't really make sense to
also keep it in local state (which requires special ACLs to configure), and
requires a bunch of special cases in the local state logic. This requires
fewer special cases and makes ACL bootstrapping cleaner.
* Makes coordinate update ACL log message a warning, similar to other AE warnings.
* Adds much more detailed examples for bootstrapping ACLs.
This can hopefully replace https://gist.github.com/slackpad/d89ce0e1cc0802c3c4f2d84932fa3234.