open-consul

Commit Graph

Author	SHA1	Message	Date
Derek Menteer	cad89029dd	Decrease retry time for failed peering connections.	2022-10-31 14:30:27 -05:00
Eric Haberkorn	57fb729547	Fix peering metrics bug (#15178 ) This bug was caused by the peering health metric being set to NaN.	2022-10-28 10:51:12 -04:00
R.B. Boyer	87432a8dd4	chore: update golangci-lint to v1.50.1 (#15022 )	2022-10-24 11:48:02 -05:00
cskh	e18434bcb1	peering: skip registering duplicate node and check from the peer (#14994 ) * peering: skip register duplicate node and check from the peer * Prebuilt the nodes map and checks map to avoid repeated for loop * use key type to struct: node id, service id, and check id	2022-10-18 16:19:24 -04:00
freddygv	37a765f8df	Update leader routine to maybe use gateways	2022-10-13 14:58:00 -06:00
freddygv	239f0e3084	Update peering establishment to maybe use gateways When peering through mesh gateways we expect outbound dials to peer servers to flow through the local mesh gateway addresses. Now when establishing a peering we get a list of dial addresses as a ring buffer that includes local mesh gateway addresses if the local DC is configured to peer through mesh gateways. The ring buffer includes the mesh gateway addresses first, but also includes the remote server addresses as a fallback. This fallback is present because it's possible that direct egress from the servers may be allowed. If not allowed then the leader will cycle back to a mesh gateway address through the ring. When attempting to dial the remote servers we retry up to a fixed timeout. If using mesh gateways we also have an initial wait in order to allow for the mesh gateways to configure themselves. Note that if we encounter a permission denied error we do not retry since that error indicates that the secret in the peering token is invalid.	2022-10-13 14:57:55 -06:00
malizz	27d0181806	increase protobuf size limit for cluster peering (#14976 )	2022-10-13 13:46:51 -07:00
Derek Menteer	d47c9b446c	Prevent consul peer-exports by discovery chain.	2022-10-13 12:45:09 -05:00
Derek Menteer	bfa4adbfce	Add remote peer partition and datacenter info.	2022-10-13 10:37:41 -05:00
Chris S. Kim	9d4fb0445a	Include stream-related information in peering endpoints	2022-10-10 13:20:14 -06:00
freddygv	ae9b3eb662	Fixup test	2022-10-07 09:34:16 -06:00
freddygv	6ef8d329d2	Require Connect and TLS to generate peering tokens By requiring Connect and a gRPC TLS listener we can automatically configure TLS for all peering control-plane traffic.	2022-10-07 09:06:29 -06:00
freddygv	a21e5799f7	Use internal server certificate for peering TLS A previous commit introduced an internally-managed server certificate to use for peering-related purposes. Now the peering token has been updated to match that behavior: - The server name matches the structure of the server cert - The CA PEMs correspond to the Connect CA Note that if Conect is disabled, and by extension the Connect CA, we fall back to the previous behavior of returning the manually configured certs and local server SNI. Several tests were updated to use the gRPC TLS port since they enable Connect by default. This means that the peering token will embed the Connect CA, and the dialer will expect a TLS listener.	2022-10-07 09:05:32 -06:00
Eric Haberkorn	2178e38204	Rename `PeerName` to `Peer` on prepared queries and exported services (#14854 )	2022-10-04 14:46:15 -04:00
Eric Haberkorn	5fd1e6daea	Add exported services event to cluster peering replication. (#14797 )	2022-09-29 15:37:19 -04:00
Freddy	69d99aa8c0	Merge pull request #14364 from hashicorp/peering/term-delete	2022-08-29 15:33:18 -06:00
Chris S. Kim	e4a154c88e	Add heartbeat timeout grace period when accounting for peering health	2022-08-29 16:32:26 -04:00
freddygv	f790d84c04	Add validation to prevent switching dialing mode This prevents unexpected changes to the output of ShouldDial, which should never change unless a peering is deleted and recreated.	2022-08-29 12:31:13 -06:00
Chris S. Kim	a8090268d4	Replace ring buffer with async version (#14314 ) We need to watch for changes to peerings and update the server addresses which get served by the ring buffer. Also, if there is an active connection for a peer, we are getting up-to-date server addresses from the replication stream and can safely ignore the token's addresses which may be stale.	2022-08-26 10:27:13 -04:00
alex	f64af3be24	peering: add peer health metric (#14004 ) Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>	2022-08-25 16:32:59 -07:00
Chris S. Kim	534096a6ac	Handle wrapped errors in isFailedPreconditionErr	2022-08-11 11:16:02 -04:00
freddygv	b089472a12	Pass explicit signal with op for secrets write Previously the updates to the peering secrets UUID table relied on inferring what action triggered the update based on a reconciliation against the existing secrets. Instead we now explicitly require the operation to be given so that the inference isn't necessary. This makes the UUID table logic easier to reason about and fixes some related bugs. There is also an update so that the peering secrets get handled on snapshots/restores.	2022-08-03 17:25:12 -05:00
Freddy	56144cf5f7	Various peering fixes (#13979 ) * Avoid logging StreamSecretID * Wrap additional errors in stream handler * Fix flakiness in leader test and rename servers for clarity. There was a race condition where the peering was being deleted in the test before the stream was active. Now the test waits for the stream to be connected on both sides before deleting the associated peering. * Run flaky test serially	2022-08-01 15:06:18 -06:00
Matt Keeler	795e5830c6	Implement/Utilize secrets for Peering Replication Stream (#13977 )	2022-08-01 10:33:18 -04:00
Luke Kysow	17594a123e	peering: retry establishing connection more quickly on certain errors (#13938 ) When we receive a FailedPrecondition error, retry that more quickly because we expect it will resolve shortly. This is particularly important in the context of Consul servers behind a load balancer because when establishing a connection we have to retry until we randomly land on a leader node. The default retry backoff goes from 2s, 4s, 8s, etc. which can result in very long delays quite quickly. Instead, this backoff retries in 8ms five times, then goes exponentially from there: 16ms, 32ms, ... up to a max of 8152ms.	2022-07-29 13:04:32 -07:00
Luke Kysow	d21f793b74	peering: add config to enable/disable peering (#13867 ) * peering: add config to enable/disable peering Add config: ``` peering { enabled = true } ``` Defaults to true. When disabled: 1. All peering RPC endpoints will return an error 2. Leader won't start its peering establishment goroutines 3. Leader won't start its peering deletion goroutines	2022-07-22 15:20:21 -07:00
alex	7bd55578cc	peering: emit exported services count metric (#13811 ) Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>	2022-07-22 12:05:08 -07:00
alex	64b3705a31	peering: refactor reconcile, cleanup (#13795 ) Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>	2022-07-19 11:43:29 -07:00
alex	4ff097c4cf	peering: track exported services (#13784 ) Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>	2022-07-18 10:20:04 -07:00
Luke Kysow	a8721c33c5	peerstream: dialer should reconnect when stream closes (#13745 ) * peerstream: dialer should reconnect when stream closes If the stream is closed unexpectedly (i.e. when we haven't received a terminated message), the dialer should attempt to re-establish the stream. Previously, the `HandleStream` would return `nil` when the stream was closed. The caller then assumed the stream was terminated on purpose and so didn't reconnect when instead it was stopped unexpectedly and the dialer should have attempted to reconnect.	2022-07-15 11:58:33 -07:00
R.B. Boyer	61ebb38092	server: ensure peer replication can successfully use TLS over external gRPC (#13733 ) Ensure that the peer stream replication rpc can successfully be used with TLS activated. Also: - If key material is configured for the gRPC port but HTTPS is not enabled now TLS will still be activated for the gRPC port. - peerstream replication stream opened by the establishing-side will now ignore grpc.WithBlock so that TLS errors will bubble up instead of being awkwardly delayed or suppressed	2022-07-15 13:15:50 -05:00
alex	70ad4804b6	peering: track imported services (#13718 )	2022-07-15 10:20:43 -07:00
R.B. Boyer	5b801db24b	peering: move peer replication to the external gRPC port (#13698 ) Peer replication is intended to be between separate Consul installs and effectively should be considered "external". This PR moves the peer stream replication bidirectional RPC endpoint to the external gRPC server and ensures that things continue to function.	2022-07-08 12:01:13 -05:00
R.B. Boyer	e7a7232a6b	state: peering ID assignment cannot happen inside of the state store (#13525 ) Move peering ID assignment outisde of the FSM, so that the ID is written to the raft log and the same ID is used by all voters, and after restarts.	2022-06-21 13:04:08 -05:00
freddygv	a288d0c388	Avoid deleting peerings marked as terminated. When our peer deletes the peering it is locally marked as terminated. This termination should kick off deleting all imported data, but should not delete the peering object itself. Keeping peerings marked as terminated acts as a signal that the action took place.	2022-06-14 15:37:09 -06:00
freddygv	a5283e4361	Add leader routine to clean up peerings Once a peering is marked for deletion a new leader routine will now clean up all imported resources and then the peering itself. A lot of the logic was grabbed from the namespace/partitions deferred deletions but with a handful of simplifications: - The rate limiting is not configurable. - Deleting imported nodes/services/checks is done by deleting nodes with the Txn API. The services and checks are deleted as a side-effect. - There is no "round rate limiter" like with namespaces and partitions. This is because peerings are purely local, and deleting a peering in the datacenter does not depend on deleting data from other DCs like with WAN-federated namespaces. All rate limiting is handled by the Raft rate limiter.	2022-06-14 15:36:50 -06:00
freddygv	dbcbf3978f	Fixup stream tear-down steps. 1. Fix a bug where the peering leader routine would not track all active peerings in the "stored" reconciliation map. This could lead to tearing down streams where the token was generated, since the ConnectedStreams() method used for reconciliation returns all streams and not just the ones initiated by this leader routine. 2. Fix a race where stream contexts were being canceled before termination messages were being processed by a peer. Previously the leader routine would tear down streams by canceling their context right after the termination message was sent. This context cancelation could be propagated to the server side faster than the termination message. Now there is a change where the dialing peer uses CloseSend() to signal when no more messages will be sent. Eventually the server peer will read an EOF after receiving and processing the preceding termination message. Using CloseSend() is actually not enough to address the issue mentioned, since it doesn't wait for the server peer to finish processing messages. Because of this now the dialing peer also reads from the stream until an error signals that there are no more messages. Receiving an EOF from our peer indicates that they processed the termination message and have no additional work to do. Given that the stream is being closed, all the messages received by Recv are discarded. We only check for errors to avoid importing new data.	2022-06-13 12:10:42 -06:00
R.B. Boyer	809344a6f5	peering: initial sync (#12842 ) - Add endpoints related to peering: read, list, generate token, initiate peering - Update node/service/check table indexing to account for peers - Foundational changes for pushing service updates to a peer - Plumb peer name through Health.ServiceNodes path see: ENT-1765, ENT-1280, ENT-1283, ENT-1283, ENT-1756, ENT-1739, ENT-1750, ENT-1679, ENT-1709, ENT-1704, ENT-1690, ENT-1689, ENT-1702, ENT-1701, ENT-1683, ENT-1663, ENT-1650, ENT-1678, ENT-1628, ENT-1658, ENT-1640, ENT-1637, ENT-1597, ENT-1634, ENT-1613, ENT-1616, ENT-1617, ENT-1591, ENT-1588, ENT-1596, ENT-1572, ENT-1555 Co-authored-by: R.B. Boyer <rb@hashicorp.com> Co-authored-by: freddygv <freddy@hashicorp.com> Co-authored-by: Chris S. Kim <ckim@hashicorp.com> Co-authored-by: Evan Culver <eculver@hashicorp.com> Co-authored-by: Nitya Dhanushkodi <nitya@hashicorp.com>	2022-04-21 17:34:40 -05:00

38 Commits