Fixes: #5396
This PR adds a proxy configuration stanza called expose. These flags register
listeners in Connect sidecar proxies to allow requests to specific HTTP paths from outside of the node. This allows services to protect themselves by only
listening on the loopback interface, while still accepting traffic from non
Connect-enabled services.
Under expose there is a boolean checks flag that would automatically expose all
registered HTTP and gRPC check paths.
This stanza also accepts a paths list to expose individual paths. The primary
use case for this functionality would be to expose paths for third parties like
Prometheus or the kubelet.
Listeners for requests to exposed paths are be configured dynamically at run
time. Any time a proxy, or check can be registered, a listener can also be
created.
In this initial implementation requests to these paths are not
authenticated/encrypted.
- Bootstrap escape hatches are OK.
- Public listener/cluster escape hatches are OK.
- Upstream listener/cluster escape hatches are not supported.
If an unsupported escape hatch is configured and the discovery chain is
activated log a warning and act like it was not configured.
Fixes#6160
Compiling this will set an optional SNI field on each DiscoveryTarget.
When set this value should be used for TLS connections to the instances
of the target. If not set the default should be used.
Setting ExternalSNI will disable mesh gateway use for that target. It also
disables several service-resolver features that do not make sense for an
external service.
Since generated envoy clusters all are named using (mostly) SNI syntax
we can have envoy read the various fields out of that structure and emit
it as stats labels to the various telemetry backends.
I changed the delimiter for the 'customization hash' from ':' to '~'
because ':' is always reencoded by envoy as '_' when generating metrics
keys.
Failover is pushed entirely down to the data plane by creating envoy
clusters and putting each successive destination in a different load
assignment priority band. For example this shows that normally requests
go to 1.2.3.4:8080 but when that fails they go to 6.7.8.9:8080:
- name: foo
load_assignment:
cluster_name: foo
policy:
overprovisioning_factor: 100000
endpoints:
- priority: 0
lb_endpoints:
- endpoint:
address:
socket_address:
address: 1.2.3.4
port_value: 8080
- priority: 1
lb_endpoints:
- endpoint:
address:
socket_address:
address: 6.7.8.9
port_value: 8080
Mesh gateways route requests based solely on the SNI header tacked onto
the TLS layer. Envoy currently only lets you configure the outbound SNI
header at the cluster layer.
If you try to failover through a mesh gateway you ideally would
configure the SNI value per endpoint, but that's not possible in envoy
today.
This PR introduces a simpler way around the problem for now:
1. We identify any target of failover that will use mesh gateway mode local or
remote and then further isolate any resolver node in the compiled discovery
chain that has a failover destination set to one of those targets.
2. For each of these resolvers we will perform a small measurement of
comparative healths of the endpoints that come back from the health API for the
set of primary target and serial failover targets. We walk the list of targets
in order and if any endpoint is healthy we return that target, otherwise we
move on to the next target.
3. The CDS and EDS endpoints both perform the measurements in (2) for the
affected resolver nodes.
4. For CDS this measurement selects which TLS SNI field to use for the cluster
(note the cluster is always going to be named for the primary target)
5. For EDS this measurement selects which set of endpoints will populate the
cluster. Priority tiered failover is ignored.
One of the big downsides to this approach to failover is that the failover
detection and correction is going to be controlled by consul rather than
deferring that entirely to the data plane as with the prior version. This also
means that we are bound to only failover using official health signals and
cannot make use of data plane signals like outlier detection to affect
failover.
In this specific scenario the lack of data plane signals is ok because the
effectiveness is already muted by the fact that the ultimate destination
endpoints will have their data plane signals scrambled when they pass through
the mesh gateway wrapper anyway so we're not losing much.
Another related fix is that we now use the endpoint health from the
underlying service, not the health of the gateway (regardless of
failover mode).
In addition to exposing compilation over the API cleaned up the structures that would be exchanged to be cleaner and easier to support and understand.
Also removed ability to configure the envoy OverprovisioningFactor.
This should make them better for sending over RPC or the API.
Instead of a chain implemented explicitly like a linked list (nodes
holding pointers to other nodes) instead switch to a flat map of named
nodes with nodes linking other other nodes by name. The shipped
structure is just a map and a string to indicate which key to start
from.
Other changes:
* inline the compiler option InferDefaults as true
* introduce compiled target config to avoid needing to send back
additional maps of Resolvers; future target-specific compiled state
can go here
* move compiled MeshGateway out of the Resolver and into the
TargetConfig where it makes more sense.
* connect: reconcile how upstream configuration works with discovery chains
The following upstream config fields for connect sidecars sanely
integrate into discovery chain resolution:
- Destination Namespace/Datacenter: Compilation occurs locally but using
different default values for namespaces and datacenters. The xDS
clusters that are created are named as they normally would be.
- Mesh Gateway Mode (single upstream): If set this value overrides any
value computed for any resolver for the entire discovery chain. The xDS
clusters that are created may be named differently (see below).
- Mesh Gateway Mode (whole sidecar): If set this value overrides any
value computed for any resolver for the entire discovery chain. If this
is specifically overridden for a single upstream this value is ignored
in that case. The xDS clusters that are created may be named differently
(see below).
- Protocol (in opaque config): If set this value overrides the value
computed when evaluating the entire discovery chain. If the normal chain
would be TCP or if this override is set to TCP then the result is that
we explicitly disable L7 Routing and Splitting. The xDS clusters that
are created may be named differently (see below).
- Connect Timeout (in opaque config): If set this value overrides the
value for any resolver in the entire discovery chain. The xDS clusters
that are created may be named differently (see below).
If any of the above overrides affect the actual result of compiling the
discovery chain (i.e. "tcp" becomes "grpc" instead of being a no-op
override to "tcp") then the relevant parameters are hashed and provided
to the xDS layer as a prefix for use in naming the Clusters. This is to
ensure that if one Upstream discovery chain has no overrides and
tangentially needs a cluster named "api.default.XXX", and another
Upstream does have overrides for "api.default.XXX" that they won't
cross-pollinate against the operator's wishes.
Fixes#6159
The main change is that we no longer filter service instances by health,
preferring instead to render all results down into EDS endpoints in
envoy and merely label the endpoints as HEALTHY or UNHEALTHY.
When OnlyPassing is set to true we will force consul checks in a
'warning' state to render as UNHEALTHY in envoy.
Fixes#6171
* Make cluster names SNI always
* Update some tests
* Ensure we check for prepared query types
* Use sni for route cluster names
* Proper mesh gateway mode defaulting when the discovery chain is used
* Ignore service splits from PatchSliceOfMaps
* Update some xds golden files for proper test output
* Allow for grpc/http listeners/cluster configs with the disco chain
* Update stats expectation
* connect: allow overriding envoy listener bind_address
* Update agent/xds/config.go
Co-Authored-By: Kyle Havlovitz <kylehav@gmail.com>
* connect: allow overriding envoy listener bind_port
* envoy: support unix sockets for grpc in bootstrap
Add AgentSocket BootstrapTplArgs which if set overrides the AgentAddress
and AgentPort to generate a bootstrap which points Envoy to a unix
socket file instead of an ip:port.
* Add a test for passing the consul addr as a unix socket
* Fix config formatting for envoy bootstrap tests
* Fix listeners test cases for bind addr/port
* Update website/source/docs/connect/proxies/envoy.md
The clusters/endpoints test were still relying on deterministic ordering of clusters/endpoints which cannot be relied upon due to golang purposefully not providing any guarantee about consistent interation ordering of maps.
Also fixed a small bug in the connect proxy cluster generation that was causing the clusters slice to be double the size it needed to with the first half being all nil pointers.
* Upgrade xDS (go-control-plane) API to support Envoy 1.10.
This includes backwards compatibility shim to work around the ext_authz package rename in 1.10.
It also adds integration test support in CI for 1.10.0.
* Fix go vet complaints
* go mod vendor
* Update Envoy version info in docs
* Update website/source/docs/connect/proxies/envoy.md
* Add support for HTTP proxy listeners
* Add customizable bootstrap configuration options
* Debug logging for xDS AuthZ
* Add Envoy Integration test suite with basic test coverage
* Add envoy command tests to cover new cases
* Add tracing integration test
* Add gRPC support WIP
* Merged changes from master Docker. get CI integration to work with same Dockerfile now
* Make docker build optional for integration
* Enable integration tests again!
* http2 and grpc integration tests and fixes
* Fix up command config tests
* Store all container logs as artifacts in circle on fail
* Add retries to outer part of stats measurements as we keep missing them in CI
* Only dump logs on failing cases
* Fix typos from code review
* Review tidying and make tests pass again
* Add debug logs to exec test.
* Fix legit test failure caused by upstream rename in envoy config
* Attempt to reduce cases of bad TLS handshake in CI integration tests
* bring up the right service
* Add prometheus integration test
* Add test for denied AuthZ both HTTP and TCP
* Try ANSI term for Circle
* Connect: Fix Envoy getting stuck during load
Also in this PR:
- Enabled outlier detection on upstreams which will mark instances unhealthy after 5 failures (using Envoy's defaults)
- Enable weighted load balancing where DNS weights are configured
* Fix empty load assignments in the right place
* Fix import names from review
* Move millisecond parse to a helper function
* Start adding tests for cluster override
* Refactor tests for clusters
* Passing tests for custom upstream cluster override
* Added capability to customise local app cluster
* Rename config for local cluster override
For established xDS gRPC streams recheck ACLs for each DiscoveryRequest
or DiscoveryResponse. If more than 5 minutes has elapsed since the last
ACL check, recheck even without an incoming DiscoveryRequest or
DiscoveryResponse. ACL failures will terminate the stream.
This PR is almost a complete rewrite of the ACL system within Consul. It brings the features more in line with other HashiCorp products. Obviously there is quite a bit left to do here but most of it is related docs, testing and finishing the last few commands in the CLI. I will update the PR description and check off the todos as I finish them over the next few days/week.
Description
At a high level this PR is mainly to split ACL tokens from Policies and to split the concepts of Authorization from Identities. A lot of this PR is mostly just to support CRUD operations on ACLTokens and ACLPolicies. These in and of themselves are not particularly interesting. The bigger conceptual changes are in how tokens get resolved, how backwards compatibility is handled and the separation of policy from identity which could lead the way to allowing for alternative identity providers.
On the surface and with a new cluster the ACL system will look very similar to that of Nomads. Both have tokens and policies. Both have local tokens. The ACL management APIs for both are very similar. I even ripped off Nomad's ACL bootstrap resetting procedure. There are a few key differences though.
Nomad requires token and policy replication where Consul only requires policy replication with token replication being opt-in. In Consul local tokens only work with token replication being enabled though.
All policies in Nomad are globally applicable. In Consul all policies are stored and replicated globally but can be scoped to a subset of the datacenters. This allows for more granular access management.
Unlike Nomad, Consul has legacy baggage in the form of the original ACL system. The ramifications of this are:
A server running the new system must still support other clients using the legacy system.
A client running the new system must be able to use the legacy RPCs when the servers in its datacenter are running the legacy system.
The primary ACL DC's servers running in legacy mode needs to be a gate that keeps everything else in the entire multi-DC cluster running in legacy mode.
So not only does this PR implement the new ACL system but has a legacy mode built in for when the cluster isn't ready for new ACLs. Also detecting that new ACLs can be used is automatic and requires no configuration on the part of administrators. This process is detailed more in the "Transitioning from Legacy to New ACL Mode" section below.
* Plumb xDS server and proxyxfg into the agent startup
* Add `consul connect envoy` command to allow running Envoy as a connect sidecar.
* Add test for help tabs; typos and style fixups from review
* Vendor updates for gRPC and xDS server
* xDS server implementation for serving Envoy as a Connect proxy
* Address initial review comments
* consistent envoy package aliases; typos fixed; override TLS and authz for custom listeners
* Moar Typos
* Moar typos