In TestServer_LANReap autopilot is running, so the alternate flow
through the serf reaping function is possible. In that situation the
ReconnectTimeout is not relevant so for parity also override the
TombstoneTimeout value as well.
For additional parity update the TestServer_WANReap and
TestClient_LANReap versions of this test in the same way even though
autopilot is irrelevant here .
Fix error in detecting raft replication errors.
Detect redacted token secrets and prevent attempting to insert.
Add a Redacted field to the TokenBatchRead and TokenRead RPC endpoints
This will indicate whether token secrets have been redacted.
Ensure any token with a redacted secret in secondary datacenters is removed.
Test that redacted tokens cannot be replicated.
Previously we were fixing up the token links directly on the *ACLToken returned by memdb. This invalidated some assumptions that a snapshot is immutable as well as potentially being able to cause a crash.
The fix here is to give the policy link fixing function copy on write semantics. When no fixes are necessary we can return the memdb object directly, otherwise we copy it and create a new list of links.
Eventually we might find a better way to keep those policy links in sync but for now this fixes the issue.
This PR adds two features which will be useful for operators when ACLs are in use.
1. Tokens set in configuration files are now reloadable.
2. If `acl.enable_token_persistence` is set to `true` in the configuration, tokens set via the `v1/agent/token` endpoint are now persisted to disk and loaded when the agent starts (or during configuration reload)
Note that token persistence is opt-in so our users who do not want tokens on the local disk will see no change.
Some other secondary changes:
* Refactored a bunch of places where the replication token is retrieved from the token store. This token isn't just for replicating ACLs and now it is named accordingly.
* Allowed better paths in the `v1/agent/token/` API. Instead of paths like: `v1/agent/token/acl_replication_token` the path can now be just `v1/agent/token/replication`. The old paths remain to be valid.
* Added a couple new API functions to set tokens via the new paths. Deprecated the old ones and pointed to the new names. The names are also generally better and don't imply that what you are setting is for ACLs but rather are setting ACL tokens. There is a minor semantic difference there especially for the replication token as again, its no longer used only for ACL token/policy replication. The new functions will detect 404s and fallback to using the older token paths when talking to pre-1.4.3 agents.
* Docs updated to reflect the API additions and to show using the new endpoints.
* Updated the ACL CLI set-agent-tokens command to use the non-deprecated APIs.
In order to be able to reload the TLS configuration, we need one way to generate the different configurations.
This PR introduces a `tlsutil.Configurator` which holds a `tlsutil.Config`. Afterwards it is responsible for rendering every `tls.Config`. In this particular PR I moved `IncomingHTTPSConfig`, `IncomingTLSConfig`, and `OutgoingTLSWrapper` into `tlsutil.Configurator`.
This PR is a pure refactoring - not a single feature added. And not a single test added. I only slightly modified existing tests as necessary.
Also in acl_endpoint_test.go:
* convert logical blocks in some token tests to subtests
* remove use of require.New
This removes a lot of noise in a later PR.
There was an errant early-return in PolicyDelete() that bypassed the
rest of the function. This was ok because the only caller of this
function ignores the results.
This removes the early-return making it structurally behave like
TokenDelete() and for both PolicyDelete and TokenDelete clarify the lone
callers to indicate that the return values are ignored.
We may wish to avoid the entire return value as well, but this patch
doesn't go that far.
`establishLeadership` invoked during leadership monitoring may use autopilot to do promotions etc. There was a race with doing that and having autopilot initialized and this fixes it.
Given a query like:
```
{
"Name": "tagged-connect-query",
"Service": {
"Service": "foo",
"Tags": ["tag"],
"Connect": true
}
}
```
And a Consul configuration like:
```
{
"services": [
"name": "foo",
"port": 8080,
"connect": { "sidecar_service": {} },
"tags": ["tag"]
]
}
```
If you executed the query it would always turn up with 0 results. This was because the sidecar service was being created without any tags. You could instead make your config look like:
```
{
"services": [
"name": "foo",
"port": 8080,
"connect": { "sidecar_service": {
"tags": ["tag"]
} },
"tags": ["tag"]
]
}
```
However that is a bit redundant for most cases. This PR ensures that the tags and service meta of the parent service get copied to the sidecar service. If there are any tags or service meta set in the sidecar service definition then this copying does not take place. After the changes, the query will now return the expected results.
A second change was made to prepared queries in this PR which is to allow filtering on ServiceMeta just like we allow for filtering on NodeMeta.
When tests fail, only the logs for the failing run are dumped to the
console which helps in diagnosis. This is easily added to other test
scenarios as they come up.
* Fix 2 remote ACL policy resolution issues
1 - Use the right method to fire async not found errors when the ACL.PolicyResolve RPC returns that error. This was previously accidentally firing a token result instead of a policy result which would have effectively done nothing (unless there happened to be a token with a secret id == the policy id being resolved.
2. When concurrent policy resolution is being done we single flight the requests. The bug before was that for the policy resolution that was going to piggy back on anothers RPC results it wasn’t waiting long enough for the results to come back due to looping with the wrong variable.
* Fix a handful of other edge case ACL scenarios
The main issue was that token specific issues (not able to access a particular policy or the token being deleted after initial fetching) were poisoning the policy cache.
A second issue was that for concurrent token resolutions, the first resolution to get started would go fetch all the policies. If before the policies were retrieved a second resolution request came in, the new request would register watchers for those policies but then never block waiting for them to complete. This resulted in using the default policy when it shouldn't have.
* Support rate limiting and concurrency limiting CSR requests on servers; handle CA rotations gracefully with jitter and backoff-on-rate-limit in client
* Add CSR rate limiting docs
* Fix config naming and add tests for new CA configs
* Store leaf cert indexes in raft and use for the ModifyIndex on the returned certs
This ensures that future certificate signings will have a strictly greater ModifyIndex than any previous certs signed.
## Background
When making a blocking query on a missing service (was never registered, or is not registered anymore) the query returns as soon as any service is updated.
On clusters with frequent updates (5~10 updates/s in our DCs) these queries virtually do not block, and clients with no protections againt this waste ressources on the agent and server side. Clients that do protect against this get updates later than they should because of the backoff time they implement between requests.
## Implementation
While reducing the number of unnecessary updates we still want :
* Clients to be notified as soon as when the last instance of a service disapears.
* Clients to be notified whenever there's there is an update for the service.
* Clients to be notified as soon as the first instance of the requested service is added.
To reduce the number of unnecessary updates we need to block when a request to a missing service is made. However in the following case :
1. Client `client1` makes a query for service `foo`, gets back a node and X-Consul-Index 42
2. `foo` is unregistered
3. `client1` makes a query for `foo` with `index=42` -> `foo` does not exist, the query blocks and `client1` is not notified of the change on `foo`
We could store the last raft index when each service was last alive to know wether we should block on the incoming query or not, but that list could grow indefinetly.
We instead store the last raft index when a service was unregistered and use it when a query targets a service that does not exist.
When a service `srv` is unregistered this "missing service index" is always greater than any X-Consul-Index held by the clients while `srv` was up, allowing us to immediatly notify them.
1. Client `client1` makes a query for service `foo`, gets back a node and `X-Consul-Index: 42`
2. `foo` is unregistered, we set the "missing service index" to 43
3. `client1` makes a blocking query for `foo` with `index=42` -> `foo` does not exist, we check against the "missing service index" and return immediatly with `X-Consul-Index: 43`
4. `client1` makes a blocking query for `foo` with `index=43` -> we block
5. Other changes happen in the cluster, but foo still doesn't exist and "missing service index" hasn't changed, the query is still blocked
6. `foo` is registered again on index 62 -> `foo` exists and its index is greater than 43, we unblock the query
Fixes#4897
Also apparently token deletion could segfault in secondary DCs when attempting to delete non-existant tokens. For that reason both checks are wrapped within the non-nil check.
* Avoid to have infinite recursion in DNS lookups when resolving CNAMEs
This will avoid killing Consul when a Service.Address is using CNAME
to a Consul CNAME that creates an infinite recursion.
This will fix https://github.com/hashicorp/consul/issues/4907
* Use maxRecursionLevel = 3 to allow several recursions
* bugfix: use ServiceTags to generate cahce key hash
* update unit test
* update
* remote print log
* Update .gitignore
* Completely deprecate ServiceTag field internally for clarity
* Add explicit test for CacheInfo cases
This PR both prevents a blank CA config from being written out to
a snapshot and allows Consul to gracefully recover from a snapshot
with an invalid CA config.
Fixes#4954.
Fix catalog service node filtering (ex /v1/catalog/service/srv?tag=tag1)
between agent version <=v1.2.3 and server >=v1.3.0.
New server version did not account for the old field when filtering
hence request made from old agent were not tag-filtered.