When the envoy healthy panic threshold was explicitly disabled as part
of L7 traffic management it changed how envoy decided to load balance to
endpoints in a cluster. This only matters when envoy is in "panic mode"
aka "when you have a bunch of unhealthy endpoints". Panic mode sends
traffic to unhealthy instances in certain circumstances.
Note: Prior to explicitly disabling the healthy panic threshold, the
default value is 50%.
What was happening is that the test harness was bringing up consul the
sidecars, and the service instances all at once and sometimes the
proxies wouldn't have time to be checked by consul to be labeled as
'passing' in the catalog before a round of EDS happened.
The xDS server in consul effectively queries /v1/health/connect/s2 and
gets 1 result, but that one result has a 'critical' check so the xDS
server sends back that endpoint labeled as UNHEALTHY.
Envoy sees that 100% of the endpoints in the cluster are unhealthy and
would enter panic mode and still send traffic to s2. This is why the
test suites PRIOR to disabling the healthy panic threshold worked. They
were _incorrectly_ passing.
When the healthy panic threshol is disabled, envoy never enters panic
mode in this situation and thus the cluster has zero healthy endpoints
so load balancing goes nowhere and the tests fail.
Why does this only affect the test suites for envoy 1.8.0? My guess is
that https://github.com/envoyproxy/envoy/pull/4442 was merged into the
1.9.x series and somehow that plays a role.
This PR modifies the bats scripts to explicitly wait until the upstream
sidecar is healthy as measured by /v1/health/connect/s2?passing BEFORE
trying to interrogate envoy which should make the tests less racy.
* add connect gateway guide
* Remove stray space
Co-Authored-By: Freddy <freddygv@users.noreply.github.com>
* Specify stanza and exact options
Co-Authored-By: Freddy <freddygv@users.noreply.github.com>
* incorporate comments from freddy
* integrate feedback from matt
* make snippets all json
* incorporate more comments from matt
* added links
* incorporate comments from neena on google doc draft
* make learn lnks relative
* clarify that gateways are new
* change socat to netcat
* add more description about replication token permissions
* Apply suggestions from code review
Co-Authored-By: Matt Keeler <mkeeler@users.noreply.github.com>
* add the prerequisite to enable centralized service config
* finish adding docs links
* website: link to 1.6.0 beta in downloads page
* website: reorganize intention replication/ca federation
* website: remove announcement bar
* Update website/source/docs/connect/connect-internals.html.md
Co-Authored-By: Paul Banks <banks@banksco.de>
* website: update homepage and service mesh page
Aligning messaging to current product.
* website: fix link TODOs
* Add Mesh Gateway to mesh page, update use case wording
* Make cluster names SNI always
* Update some tests
* Ensure we check for prepared query types
* Use sni for route cluster names
* Proper mesh gateway mode defaulting when the discovery chain is used
* Ignore service splits from PatchSliceOfMaps
* Update some xds golden files for proper test output
* Allow for grpc/http listeners/cluster configs with the disco chain
* Update stats expectation
* connect: allow overriding envoy listener bind_address
* Update agent/xds/config.go
Co-Authored-By: Kyle Havlovitz <kylehav@gmail.com>
* connect: allow overriding envoy listener bind_port
* envoy: support unix sockets for grpc in bootstrap
Add AgentSocket BootstrapTplArgs which if set overrides the AgentAddress
and AgentPort to generate a bootstrap which points Envoy to a unix
socket file instead of an ip:port.
* Add a test for passing the consul addr as a unix socket
* Fix config formatting for envoy bootstrap tests
* Fix listeners test cases for bind addr/port
* Update website/source/docs/connect/proxies/envoy.md
- Removes 'type' icons (basically the proxy icon, not the text itself)
- Add support for Mesh Gateways plus their addresses
This adds a 'Mesh Gateway' type label to service and service instance
pages, plus a new 'Addresses' tab if the service is a Mesh Gateway
showing a table of addresses for the service - plus tests
maxIndexWatchTxn was only watching the IndexEntry of the max index of all the entries. It needed to watch all of them regardless of which was the max.
Also plumbed the query source through in the proxy config to help better track requests.
* Implement the test-docker make target
Running tests within docker allows us to resource constrain them better to not take over our systems. Additionally it allows us to run the tests on linux instead of the host OS which often times is macOS.
* Use GOMAXPROCS instead of -p
* Add a comment about docker cpus
The clusters/endpoints test were still relying on deterministic ordering of clusters/endpoints which cannot be relied upon due to golang purposefully not providing any guarantee about consistent interation ordering of maps.
Also fixed a small bug in the connect proxy cluster generation that was causing the clusters slice to be double the size it needed to with the first half being all nil pointers.
* Fix some tests that I broke when refactoring the ConfigSnapshot
* Make sure the MeshGateway config is added to all the right api structs
* Fix some more tests
The general problem was that a the CA config which contained the trust domain was happening outside of the blocking mechanism so if the client started the blocking query before the primary dcs roots had been set then a state trust domain was being pushed down.
This was fixed here but in the future we should probably fixup the CA initialization code to not initialize the CA config twice when it doesn’t need to.