The duo of `makeUpstreamFilterChainForDiscoveryChain` and `makeListenerForDiscoveryChain` were really hard to reason about, and led to concealing a bug in their branching logic. There were several issues here:
- They tried to accomplish too much: determining filter name, cluster name, and whether RDS should be used.
- They embedded logic to handle significantly different kinds of upstream listeners (passthrough, prepared query, typical services, and catch-all)
- They needed to coalesce different data sources (Upstream and CompiledDiscoveryChain)
Rather than handling all of those tasks inside of these functions, this PR pulls out the RDS/clusterName/filterName logic.
This refactor also fixed a bug with the handling of [UpstreamDefaults](https://www.consul.io/docs/connect/config-entries/service-defaults#defaults). These defaults get stored as UpstreamConfig in the proxy snapshot with a DestinationName of "*", since they apply to all upstreams. However, this wildcard destination name must not be used when creating the name of the associated upstream cluster. The coalescing logic in the original functions here was in some situations creating clusters with a `*.` prefix, which is not a valid destination.
Fixes an issue described in #10132, where if two DCs are WAN federated
over mesh gateways, and the gateway in the non-primary DC is terminated
and receives a new IP address (as is commonly the case when running them
on ephemeral compute instances) the primary DC is unable to re-establish
its connection until the agent running on its own gateway is restarted.
This was happening because we always preferred gateways discovered by
the `Internal.ServiceDump` RPC (which would fail because there's no way
to dial the remote DC) over those discovered in the federation state,
which is replicated as long as the primary DC's gateway is reachable.
Previously SAN validation for prepared queries was broken because we
validated against the name, namespace, and datacenter for prepared
queries.
However, prepared queries can target:
- Services with a name that isn't their own
- Services in multiple datacenters
This means that the SpiffeID to validate needs to be based on the
prepared query endpoints, and not the prepared query's upstream
definition.
This commit updates prepared query clusters to account for that.
- The TestNodeService helper created services with the fixed name "web",
and now that name is overridable.
- The discovery chain snapshot didn't have prepared query endpoints so
the endpoints tests were missing data for prepared queries
In the absence of stats_tags to handle this pattern, when we pass
"ingress_upstream.$port" as the stat_prefix, Envoy splits up that prefix
and makes the port a part of the metric name.
For example:
- stat_prefix: ingress_upstream.8080
This leads to metric names like envoy_http_8080_no_route. Changing the
stat_prefix to ingress_upstream_80880 yields the expected metric names
such as envoy_http_no_route.
Note that we don't encode the destination's name/ns/dc in this
stat_prefix because for HTTP services ingress gateways use a single
filter chain. Only cluster metrics are available on a per-upstream
basis.
This change makes it so that the stat prefix for terminating gateways
matches that of connect proxies. By using the structure of
"upstream.svc.ns.dc" we can extract labels for the destination service,
namespace, and datacenter.
Initially we were loading every potential upstream address into Envoy
and then routing traffic to the logical upstream service. The downside
of this behavior is that traffic meant to go to a specific instance
would be load balanced across ALL instances.
Traffic to specific instance IPs should be forwarded to the original
destination and if it's a destination in the mesh then we should ensure
the appropriate certificates are used.
This PR makes transparent proxying a Kubernetes-only feature for now
since support for other environments requires generating virtual IPs,
and Consul does not do that at the moment.
The only thing that needed fixing up pertained to this section of the 1.18.x release notes:
> grpc_stats: the default value for stats_for_all_methods is switched from true to false, in order to avoid possible memory exhaustion due to an untrusted downstream sending a large number of unique method names. The previous default value was deprecated in version 1.14.0. This only changes the behavior when the value is not set. The previous behavior can be used by setting the value to true. This behavior change by be overridden by setting runtime feature envoy.deprecated_features.grpc_stats_filter_enable_stats_for_all_methods_by_default.
For now to maintain status-quo I'm explicitly setting `stats_for_all_methods=true` in all versions to avoid relying upon the default.
Additionally the naming of the emitted metrics for these gRPC requests changed slightly so the integration test assertions for `case-grpc` needed adjusting.
Since we currently do no version switching this removes 75% of the PR
noise.
To generate all *.golden files were removed and then I ran:
go test ./agent/xds -update
Note that this does NOT upgrade to xDS v3. That will come in a future PR.
Additionally:
- Ignored staticcheck warnings about how github.com/golang/protobuf is deprecated.
- Shuffled some agent/xds imports in advance of a later xDS v3 upgrade.
- Remove support for envoy 1.13.x but don't add in 1.17.x yet. We have to wait until the xDS v3 support is added in a follow-up PR.
Fixes#8425
This PR updates the tags that we generate for Envoy stats.
Several of these come with breaking changes, since we can't keep two stats prefixes for a filter.