Compiling this will set an optional SNI field on each DiscoveryTarget.
When set this value should be used for TLS connections to the instances
of the target. If not set the default should be used.
Setting ExternalSNI will disable mesh gateway use for that target. It also
disables several service-resolver features that do not make sense for an
external service.
If the entry is updated for reasons other than protocol it is surprising
that the value is explicitly persisted as 'tcp' rather than leaving it
empty and letting it fall back dynamically on the proxy-defaults value.
Since generated envoy clusters all are named using (mostly) SNI syntax
we can have envoy read the various fields out of that structure and emit
it as stats labels to the various telemetry backends.
I changed the delimiter for the 'customization hash' from ':' to '~'
because ':' is always reencoded by envoy as '_' when generating metrics
keys.
Add parameter local-only to operator keyring list requests to force queries to only hit local servers (no WAN traffic).
HTTP API: GET /operator/keyring?local-only=true
CLI: consul keyring -list --local-only
Sending the local-only flag with any non-GET/list request will result in an error.
The json decoder inside of the HCLv1 hcl.Decode function behaves
unexpectedly when decoding generically into a map[string]interface{} as
is done for 'consul config write' pre-submit decoding.
This results in some subtle (service-router Match and Destinations being
separated) and some not so subtle (service-resolver subsets and failover
panic if multiple subsets are referenced) bugs when subsequently passed
through mapstructure to finish decoding.
Given that HCLv1 is basically frozen and the HCL part of it is fine
instead of trying to figure out what the underlying bug is in the json
decoder for our purposes just sniff the byte slice and selectively use
the stdlib json decoder for JSON and hcl decoder for HCL.
Failover is pushed entirely down to the data plane by creating envoy
clusters and putting each successive destination in a different load
assignment priority band. For example this shows that normally requests
go to 1.2.3.4:8080 but when that fails they go to 6.7.8.9:8080:
- name: foo
load_assignment:
cluster_name: foo
policy:
overprovisioning_factor: 100000
endpoints:
- priority: 0
lb_endpoints:
- endpoint:
address:
socket_address:
address: 1.2.3.4
port_value: 8080
- priority: 1
lb_endpoints:
- endpoint:
address:
socket_address:
address: 6.7.8.9
port_value: 8080
Mesh gateways route requests based solely on the SNI header tacked onto
the TLS layer. Envoy currently only lets you configure the outbound SNI
header at the cluster layer.
If you try to failover through a mesh gateway you ideally would
configure the SNI value per endpoint, but that's not possible in envoy
today.
This PR introduces a simpler way around the problem for now:
1. We identify any target of failover that will use mesh gateway mode local or
remote and then further isolate any resolver node in the compiled discovery
chain that has a failover destination set to one of those targets.
2. For each of these resolvers we will perform a small measurement of
comparative healths of the endpoints that come back from the health API for the
set of primary target and serial failover targets. We walk the list of targets
in order and if any endpoint is healthy we return that target, otherwise we
move on to the next target.
3. The CDS and EDS endpoints both perform the measurements in (2) for the
affected resolver nodes.
4. For CDS this measurement selects which TLS SNI field to use for the cluster
(note the cluster is always going to be named for the primary target)
5. For EDS this measurement selects which set of endpoints will populate the
cluster. Priority tiered failover is ignored.
One of the big downsides to this approach to failover is that the failover
detection and correction is going to be controlled by consul rather than
deferring that entirely to the data plane as with the prior version. This also
means that we are bound to only failover using official health signals and
cannot make use of data plane signals like outlier detection to affect
failover.
In this specific scenario the lack of data plane signals is ok because the
effectiveness is already muted by the fact that the ultimate destination
endpoints will have their data plane signals scrambled when they pass through
the mesh gateway wrapper anyway so we're not losing much.
Another related fix is that we now use the endpoint health from the
underlying service, not the health of the gateway (regardless of
failover mode).