Deadlock scenario:
1. Due to scheduling, the state runner sends one snapshot into
snapCh and then attempts to send a second. The first send succeeds
because the channel is buffered, but the second blocks.
2. Separately, Manager.Watch is called by the xDS server after
getting a discovery request from Envoy. This function acquires the
manager lock and then blocks on receiving the CurrentSnapshot from
the state runner.
3. Separately, there is a Manager goroutine that reads the snapshots
from the channel in step 1. These reads are done to notify proxy
watchers, but they require holding the manager lock. This goroutine
goes to acquire that lock, but can't because it is held by step 2.
Now, the goroutine from step 3 is waiting on the one from step 2 to
release the lock. The goroutine from step 2 won't release the lock until
the goroutine in step 1 advances. But the goroutine in step 1 is waiting
for the one in step 3. Deadlock.
By making this send non-blocking step 1 above can proceed. The coalesce
timer will be reset and a new valid snapshot will be delivered after it
elapses or when one is requested by xDS.
Add a skip condition to all tests slower than 100ms.
This change was made using `gotestsum tool slowest` with data from the
last 3 CI runs of master.
See https://github.com/gotestyourself/gotestsum#finding-and-skipping-slow-tests
With this change:
```
$ time go test -count=1 -short ./agent
ok github.com/hashicorp/consul/agent 0.743s
real 0m4.791s
$ time go test -count=1 -short ./agent/consul
ok github.com/hashicorp/consul/agent/consul 4.229s
real 0m8.769s
```
This implements a solution for #7863
It does:
Add a new config cache.entry_fetch_rate to limit the number of calls/s for a given cache entry, default value = rate.Inf
Add cache.entry_fetch_max_burst size of rate limit (default value = 2)
The new configuration now supports the following syntax for instance to allow 1 query every 3s:
command line HCL: -hcl 'cache = { entry_fetch_rate = 0.333}'
in JSON
{
"cache": {
"entry_fetch_rate": 0.333
}
}
The rationale behind removing them is that all of our own code (xDS, builtin connect proxy) use the cache notification mechanism. This ensures that the blocking fetch behind the scenes is always executing. Therefore the only way you might go to get a certificate and have to wait is when 1) the request has never been made for that cert before or 2) you are using the v1/agent/connect/ca/leaf API for retrieving the cert yourself.
In the first case, the refresh change doesn’t alter the behavior. In the second case, it can be mitigated by using blocking queries with that API which just like normal cache notification mechanism will cause the blocking fetch to be initiated and to get leaf certs as soon as needed.
If you are not using blocking queries, or Envoy/xDS, or the builtin connect proxy but are retrieving the certs yourself then the HTTP endpoint might take a little longer to respond.
This also renames the RefreshTimeout field on the register options to QueryTimeout to more accurately reflect that it is used for any type that supports blocking queries.
And fix the 'value not used' issues.
Many of these are not bugs, but a few are tests not checking errors, and
one appears to be a missed error in non-test code.
Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Currently when passing hostname clusters to Envoy, we set each service instance registered with Consul as an LbEndpoint for the cluster.
However, Envoy can only handle one per cluster:
[2020-06-04 18:32:34.094][1][warning][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC config for type.googleapis.com/envoy.api.v2.Cluster rejected: Error adding/updating cluster(s) dc2.internal.ddd90499-9b47-91c5-4616-c0cbf0fc358a.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint, server.dc2.consul: LOGICAL_DNS clusters must have a single locality_lb_endpoint and a single lb_endpoint
Envoy is currently handling this gracefully by only picking one of the endpoints. However, we should avoid passing multiple to avoid these warning logs.
This PR:
* Ensures we only pass one endpoint, which is tied to one service instance.
* We prefer sending an endpoint which is marked as Healthy by Consul.
* If no endpoints are healthy we emit a warning and skip the cluster.
* If multiple unique hostnames are spread across service instances we emit a warning and let the user know which will be resolved.
The DNS resolution will be handled by Envoy and defaults to LOGICAL_DNS. This discovery type can be overridden on a per-gateway basis with the envoy_dns_discovery_type Gateway Option.
If a service contains an instance with a hostname as an address we set the Envoy cluster to use DNS as the discovery type rather than EDS. Since both mesh gateways and terminating gateways route to clusters using SNI, whenever there is a mix of hostnames and IP addresses associated with a service we use the hostname + CDS rather than the IPs + EDS.
Note that we detect hostnames by attempting to parse the service instance's address as an IP. If it is not a valid IP we assume it is a hostname.
Three of the checks are temporarily disabled to limit the size of the
diff, and allow us to enable all the other checks in CI.
In a follow up we can fix the issues reported by the other checks one
at a time, and enable them.
* Standardize support for Tagged and BindAddresses in Ingress Gateways
This updates the TaggedAddresses and BindAddresses behavior for Ingress
to match Mesh/Terminating gateways. The `consul connect envoy` command
now also allows passing an address without a port for tagged/bind
addresses.
* Update command/connect/envoy/envoy.go
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
* PR comments
* Check to see if address is an actual IP address
* Update agent/xds/listeners.go
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
* fix whitespace
Co-authored-by: Chris Piraino <cpiraino@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
- Use correct enterprise metadata for finding config entry
- nil out cancel functions on config snapshot copy
- Look at HostsSet when checking validity