Commit Graph

19555 Commits

Author SHA1 Message Date
freddygv 56b153e57f Add docs about upgrading primary mesh gateways
Care must be taken when replacing mesh gateways in the primary
datacenter, because if the old addresses become unreachable before the
secondary datacenters receive the new addresses then the primary
datacenter overall will become unreachable.

This commit adds docs related to this class of upgrades.
2022-10-18 10:08:43 -06:00
trujillo-adam aba377cee4 removed quotation marks around front matter and revised the introduction 2022-10-18 08:56:38 -07:00
R.B. Boyer 0712e1a456
test: possibly fix flake in TestIntentionGetExact (#15021)
Restructure test setup to be similar to TestAgent_ServerCertificate
and see if that's enough to avoid flaking after join.
2022-10-18 10:51:20 -05:00
freddygv f08acdc092 Update upgrade docs for 1.13.2.
In 1.13.2 we added a new flag called use_auto_cert to address issues
previously documented in the upgrade guide. Originally there was no way
to disable TLS for gRPC when auto-encrypt was in use, because TLS was
enabled for gRPC due to the presence of auto-encrypt certs.

As of 1.13.2, using auto-encrypt certs as the signal to enable TLS for
gRPC is opt-in only. Meaning that if anyone who had upgraded to 1.13
relied on that side-effect, they now need to explicitly configure it.
2022-10-18 09:43:32 -06:00
wenincode 443b73435a Use local-storage service to manage localStorage
Use local-storage service, prototyped here https://github.com/LevelbossMike/local-storage-service, to manage local storage usage in an octane way. Does not write to local storage in tests by default and is easy to stub out.
2022-10-18 09:40:47 -06:00
Dhia Ayachi c5f0f33130
bump relevant modules versions (#14972) 2022-10-18 11:24:26 -04:00
Michael Klein 101a20e03e Improve testability `env`-service 2022-10-18 16:07:12 +02:00
Iryna Shustava 22b6c39092
Support auth method with snapshot agent [ENT] (#15020)
Port of hashicorp/consul-enterprise#3303
2022-10-17 15:57:48 -06:00
R.B. Boyer 9f41cc4a25
cache: prevent goroutine leak in agent cache (#14908)
There is a bug in the error handling code for the Agent cache subsystem discovered:

1. NotifyCallback calls notifyBlockingQuery which calls getWithIndex in
   a loop (which backs off on-error up to 1 minute)

2. getWithIndex calls fetch if there’s no valid entry in the cache

3. fetch starts a goroutine which calls Fetch on the cache-type, waits
   for a while (again with backoff up to 1 minute for errors) and then
   calls fetch to trigger a refresh

The end result being that every 1 minute notifyBlockingQuery spawns an
ancestry of goroutines that essentially lives forever.

This PR ensures that the goroutine started by `fetch` cancels any prior
goroutine spawned by the same line for the same key.

In isolated testing where a cache type was tweaked to indefinitely
error, this patch prevented goroutine counts from skyrocketing.
2022-10-17 14:38:10 -05:00
R.B. Boyer ca916eec32
ca: fix a masked bug in leaf cert generation that would not be notified of root cert rotation after the first one (#15005)
In practice this was masked by #14956 and was only uncovered fixing the
other bug.

  go test ./agent -run TestAgentConnectCALeafCert_goodNotLocal

would fail when only #14956 was fixed.
2022-10-17 13:24:27 -05:00
David Yu 4ba1e75259
docs: formatting on backend application and delete peering CRDs (#15007)
* docs: formatting on backend application and delete peering CRDs

Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
2022-10-17 10:34:05 -07:00
Dan Upton 22ff376bba
proto: deep-copy PeeringTrustBundle using proto.Clone (#15004)
Fixes a `go vet` warning caused by the pragma.DoNotCopy on the protobuf
message type.

Originally I'd hoped we wouldn't need any reflection in the proxycfg hot
path, but it seems proto.Clone is the only supported way to copy a message.
2022-10-17 16:30:35 +01:00
Chris S. Kim 58c041eb6e
Merge pull request #13388 from deblasis/feature/health-checks_windows_service
Feature: Health checks windows service
2022-10-17 09:26:19 -04:00
Dan Upton 90129919a8
proxycfg: fix goroutine leak when service is re-registered (#14988)
Fixes a bug where we'd leak a goroutine in state.run when the given
context was canceled while there was a pending update.
2022-10-17 11:31:10 +01:00
Kyle Havlovitz 73d252c6d8
Merge pull request #14800 from hashicorp/mgw-tcp-keepalives
Add TCP keepalive settings to proxy config for mesh gateways
2022-10-14 19:01:02 -07:00
Kyle Havlovitz 096ca5e4b0 Extend tcp keepalive settings to work for terminating gateways as well 2022-10-14 17:05:46 -07:00
Kyle Havlovitz f8e745315f Update docs and add tcp_keepalive_probes setting 2022-10-14 17:05:46 -07:00
Kyle Havlovitz 526d49c6ff Add TCP keepalive settings to proxy config for mesh gateways 2022-10-14 17:05:46 -07:00
David Yu 5fbb4aaac0
docs: improvements on language from cluster peering steps (#14993)
* docs: improvements on language from cluster peering steps

Co-authored-by: trujillo-adam <47586768+trujillo-adam@users.noreply.github.com>
2022-10-14 14:29:11 -07:00
wenincode f9f4ca8da4 Set postfix for agentless-notice storage key based on partition and dc 2022-10-14 14:08:40 -06:00
wenincode 9777ee0077 Save agentless node notice dismissal per dc 2022-10-14 12:21:25 -06:00
Derek Menteer 25d3d244f0 Fix issue with incorrect method signature on test. 2022-10-14 11:04:57 -05:00
Freddy bbf6b17e44
Merge pull request #14981 from hashicorp/peering/dial-through-gateways 2022-10-14 09:44:56 -06:00
Tyler Wendlandt 4a3801385d
Merge pull request #14986 from hashicorp/ui/feature/filter-node-healthchecks-agentless
UI: filter node healthchecks on agentless service instances
2022-10-14 09:33:45 -06:00
Dan Upton 3b9297f95a
proxycfg: rate-limit delivery of config snapshots (#14960)
Adds a user-configurable rate limiter to proxycfg snapshot delivery,
with a default limit of 250 updates per second.

This addresses a problem observed in our load testing of Consul
Dataplane where updating a "global" resource such as a wildcard
intention or the proxy-defaults config entry could starve the Raft or
Memberlist goroutines of CPU time, causing general cluster instability.
2022-10-14 15:52:00 +01:00
Derek Menteer 6c355134e8 Add tests for peering state snapshots / restores. 2022-10-14 09:48:04 -05:00
Derek Menteer 27bbdced8d Add test for ExportedServicesForAllPeersByName 2022-10-14 09:48:04 -05:00
Alessandro De Blasis fe9078238e
Update website/content/api-docs/agent/check.mdx 2022-10-14 12:32:55 +01:00
Dan Upton 0a0534a094
perf: remove expensive reflection from xDS hot path (#14934)
Replaces the reflection-based implementation of proxycfg's
ConfigSnapshot.Clone with code generated by deep-copy.

While load testing server-based xDS (for consul-dataplane) we discovered
this method is extremely expensive. The ConfigSnapshot struct, directly
or indirectly, contains a copy of many of the structs in the agent/structs
package, which creates a large graph for copystructure.Copy to traverse
at runtime, on every proxy reconfiguration.
2022-10-14 10:26:42 +01:00
Michael Klein 00201936c8
Merge pull request #14977 from hashicorp/ui/fix/scrollbar-bento-box
ui: Bento-Box show scrollbars only when necessary
2022-10-14 09:07:57 +02:00
wenincode b761f583a8 Address linting errors 2022-10-13 19:05:19 -06:00
wenincode 229a97967a Add changelog entry 2022-10-13 18:54:39 -06:00
wenincode e36848111a Add tests for filtering node health checks 2022-10-13 18:45:15 -06:00
freddygv 89596f13c4 Use split var in tests 2022-10-13 17:12:47 -06:00
freddygv b4e48f0a70 Use split wildcard partition name
This way OSS avoids passing a non-empty label, which will be rejected in
OSS consul.
2022-10-13 16:55:28 -06:00
Freddy 909fc33271
Merge pull request #14935 from hashicorp/fix/alias-leak 2022-10-13 16:31:15 -06:00
freddygv c5040b8111 Add changelog entry 2022-10-13 16:09:32 -06:00
freddygv a468cbcce9 Add changelog entry 2022-10-13 16:03:15 -06:00
freddygv 452dc2867c Lint 2022-10-13 15:55:55 -06:00
wenincode c27cc17991 Format healthchecks template 2022-10-13 15:48:18 -06:00
wenincode 9526f9f4f5 Filter healthchecks for synthetic-nodes 2022-10-13 15:47:47 -06:00
David Yu e1093b8576
1.14 dataplane docs beta: Bump to beta3 (#14979)
Bump to beta
2022-10-13 14:40:40 -07:00
Derek Menteer 092e5fd074 Reset wait on ensureServerAddrSubscription 2022-10-13 15:58:26 -05:00
freddygv 437a513d9b Fix CA init error code 2022-10-13 14:58:11 -06:00
freddygv a0bcf4b941 Add integ test for peering through gateways 2022-10-13 14:58:05 -06:00
freddygv 37a765f8df Update leader routine to maybe use gateways 2022-10-13 14:58:00 -06:00
freddygv 239f0e3084 Update peering establishment to maybe use gateways
When peering through mesh gateways we expect outbound dials to peer
servers to flow through the local mesh gateway addresses.

Now when establishing a peering we get a list of dial addresses as a
ring buffer that includes local mesh gateway addresses if the local DC
is configured to peer through mesh gateways. The ring buffer includes
the mesh gateway addresses first, but also includes the remote server
addresses as a fallback.

This fallback is present because it's possible that direct egress from
the servers may be allowed. If not allowed then the leader will cycle
back to a mesh gateway address through the ring.

When attempting to dial the remote servers we retry up to a fixed
timeout. If using mesh gateways we also have an initial wait in
order to allow for the mesh gateways to configure themselves.

Note that if we encounter a permission denied error we do not retry
since that error indicates that the secret in the peering token is
invalid.
2022-10-13 14:57:55 -06:00
malizz 27d0181806
increase protobuf size limit for cluster peering (#14976) 2022-10-13 13:46:51 -07:00
Jasmine W 233a461fd1
Merge pull request #14975 from hashicorp/ui/bugfix/peering-misspelling
UI: Copy changes for peering detail page
2022-10-13 15:28:21 -04:00
Derek Menteer ff01c11672 Address PR comments. 2022-10-13 14:11:02 -05:00