Commit Graph

124 Commits

Derek Menteer bfa4adbfce Add remote peer partition and datacenter info. 2022-10-13 10:37:41 -05:00
Paul Glass 8cf430140a
gRPC server metrics (#14922)
* Move stats.go from grpc-internal to grpc-middleware
* Update grpc server metrics with server type label
* Add stats test to grpc-external
* Remove global metrics instance from grpc server tests
2022-10-11 17:00:32 -05:00
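
A minimal Go sketch of the server-type labeling described in the bullets above, using the go-metrics library Consul relies on; the metric name, label value, and helper function are illustrative, not Consul's actual code:

```go
package main

import (
	"fmt"
	"log"
	"time"

	metrics "github.com/armon/go-metrics"
)

// emitStreamStart is a hypothetical helper: the same gRPC stats handler can
// serve both listeners, with a label distinguishing the internal
// (multiplexed) port from the external (dedicated gRPC) port.
func emitStreamStart(serverType string) {
	metrics.IncrCounterWithLabels(
		[]string{"grpc", "server", "streams"}, 1,
		[]metrics.Label{{Name: "server_type", Value: serverType}},
	)
}

func main() {
	// Register an in-memory sink so the counters are visible locally.
	sink := metrics.NewInmemSink(10*time.Second, time.Minute)
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("consul"), sink); err != nil {
		log.Fatal(err)
	}

	emitStreamStart("external")
	emitStreamStart("internal")
	fmt.Println(sink.Data()[0].Counters)
}
```
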
Chris S. Kim 7f48033d0b Fix nil pointer 2022-10-10 13:20:14 -06:00
Chris S. Kim 9d4fb0445a Include stream-related information in peering endpoints 2022-10-10 13:20:14 -06:00
Freddy 8d93f120ea
Merge pull request #14796 from hashicorp/peering/use-connect-ca 2022-10-07 10:37:37 -06:00
freddygv 6ef8d329d2 Require Connect and TLS to generate peering tokens
By requiring Connect and a gRPC TLS listener we can automatically
configure TLS for all peering control-plane traffic.
2022-10-07 09:06:29 -06:00
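
A minimal sketch of the guard this commit describes, with hypothetical function and parameter names; token generation fails fast unless both prerequisites hold:

```go
package main

import (
	"errors"
	"fmt"
)

// validatePeeringTokenPrereqs is illustrative only: peering tokens need
// Connect (for the CA) and a TLS-capable gRPC listener so that TLS for
// peering control-plane traffic can be configured automatically.
func validatePeeringTokenPrereqs(connectEnabled bool, grpcTLSPort int) error {
	if !connectEnabled {
		return errors.New("generating a peering token requires Connect to be enabled")
	}
	if grpcTLSPort <= 0 {
		return errors.New("generating a peering token requires a gRPC TLS listener")
	}
	return nil
}

func main() {
	fmt.Println(validatePeeringTokenPrereqs(true, 0))
}
```
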
freddygv a21e5799f7 Use internal server certificate for peering TLS
A previous commit introduced an internally-managed server certificate
to use for peering-related purposes.

Now the peering token has been updated to match that behavior:
- The server name matches the structure of the server cert
- The CA PEMs correspond to the Connect CA

Note that if Connect is disabled, and by extension the Connect CA, we
fall back to the previous behavior of returning the manually configured
certs and local server SNI.

Several tests were updated to use the gRPC TLS port since they enable
Connect by default. This means that the peering token will embed the
Connect CA, and the dialer will expect a TLS listener.
2022-10-07 09:05:32 -06:00
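
An illustrative Go sketch of the fallback described above; the struct is a pared-down stand-in for the real token in Consul's proto definitions, and all names and values are hypothetical:

```go
package main

import "fmt"

// peeringToken holds just the two fields discussed in the commit message.
type peeringToken struct {
	ServerName string   // matches the structure of the server cert
	CA         []string // CA PEMs
}

// buildTokenTLS prefers the internally-managed server cert's name and the
// Connect CA roots; with Connect disabled it falls back to the manually
// configured certs and the local server SNI.
func buildTokenTLS(connectEnabled bool, connectRoots, manualCerts []string,
	internalServerName, localServerSNI string) peeringToken {
	if connectEnabled {
		return peeringToken{ServerName: internalServerName, CA: connectRoots}
	}
	return peeringToken{ServerName: localServerSNI, CA: manualCerts}
}

func main() {
	t := buildTokenTLS(true, []string{"-----BEGIN CERTIFICATE-----\n..."}, nil,
		"server.dc1.peering.example.consul", "server.dc1.consul")
	fmt.Println(t.ServerName)
}
```
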
DanStough df94470e76 feat: xDS updates for peerings control plane through mesh gw 2022-10-07 08:46:42 -06:00
Eric Haberkorn 2178e38204
Rename `PeerName` to `Peer` on prepared queries and exported services (#14854) 2022-10-04 14:46:15 -04:00
freddygv 17463472b7 Return mesh gateway addrs if peering through mgw 2022-10-03 11:35:10 -06:00
Eric Haberkorn 5fd1e6daea
Add exported services event to cluster peering replication. (#14797) 2022-09-29 15:37:19 -04:00
malizz 5c470b28dd
Support Stale Queries for Trust Bundle Lookups (#14724)
* initial commit

* add tags, add conversations

* add test for query options utility functions

* update previous tests

* fix test

* don't error out on empty context

* add changelog

* update decode config
2022-09-28 09:56:59 -07:00
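
A tiny sketch of what stale reads buy here, with illustrative names: a stale-permitted lookup can be served from any server's local state instead of being forwarded to the leader.

```go
package main

import "fmt"

// queryOptions is a pared-down stand-in for the query options threaded
// through the trust-bundle lookups.
type queryOptions struct {
	AllowStale        bool
	RequireConsistent bool
}

// forwardToLeader sketches the routing decision: consistent reads always
// go to the leader, stale reads never do.
func forwardToLeader(opts queryOptions) bool {
	if opts.RequireConsistent {
		return true
	}
	return !opts.AllowStale
}

func main() {
	fmt.Println(forwardToLeader(queryOptions{AllowStale: true})) // false: any server can answer
	fmt.Println(forwardToLeader(queryOptions{}))                 // true: default goes to the leader
}
```
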
Gabriel Santos 09c00ff39a
Middleware: `RequestRecorder` reports calls below 1ms as decimal value (#12905)
* Typos

* Test failing

* Convert values <1ms to decimal

* Fix test

* Update docs and test error msg

* Applied suggested changes to test case

* Changelog file and suggested changes

* Update .changelog/12905.txt

Co-authored-by: Chris S. Kim <kisunji92@gmail.com>

* suggested change - start duration with microseconds instead of nanoseconds

* fix error

* suggested change - floats

Co-authored-by: alex <8968914+acpana@users.noreply.github.com>
Co-authored-by: Chris S. Kim <kisunji92@gmail.com>
2022-09-15 13:04:37 -04:00
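
A small sketch of the conversion this PR settles on: start from microseconds and divide as floats, so sub-millisecond calls report a decimal instead of truncating to zero. Only the arithmetic is shown; RequestRecorder itself is not reproduced here.

```go
package main

import (
	"fmt"
	"time"
)

// durationToMillis reports a duration in milliseconds as a float, so a
// 250µs request becomes 0.25 rather than 0.
func durationToMillis(d time.Duration) float32 {
	return float32(d.Microseconds()) / 1000
}

func main() {
	fmt.Println(durationToMillis(250 * time.Microsecond)) // 0.25
	fmt.Println(durationToMillis(3 * time.Millisecond))   // 3
}
```
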
DanStough f9b4fa17f1 fix(peering): generate token metrics only for leader 2022-09-14 11:37:30 -04:00
DanStough b37a2ba889 feat(peering): validate server name conflicts on establish 2022-09-14 11:37:30 -04:00
Dan Upton 9fe6c33c0d
xDS Load Balancing (#14397)
Prior to #13244, connect proxies and gateways could only be configured by an
xDS session served by the local client agent.

In an upcoming release, it will be possible to deploy a Consul service mesh
without client agents. In this model, xDS sessions will be handled by the
servers themselves, which necessitates load-balancing to prevent a single
server from receiving a disproportionate amount of load and becoming
overwhelmed.

This introduces a simple form of load-balancing where Consul will attempt to
achieve an even spread of load (xDS sessions) between all healthy servers.
It does so by implementing a concurrent session limiter (limiter.SessionLimiter)
and adjusting the limit according to autopilot state and proxy service
registrations in the catalog.

If a server is already over capacity (i.e. the session limit is lowered),
Consul will begin draining sessions to rebalance the load. This will result
in the client receiving a `RESOURCE_EXHAUSTED` status code. It is the client's
responsibility to observe this response and reconnect to a different server.

Users of the gRPC client connection brokered by the
consul-server-connection-manager library will get this for free.

The rate at which Consul will drain sessions to rebalance load is scaled
dynamically based on the number of proxies in the catalog.
2022-09-09 15:02:01 +01:00
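
A minimal sketch of the limiter.SessionLimiter idea, under stated assumptions (this is not Consul's implementation): a counter with a runtime-adjustable maximum, where admission beyond capacity is refused so the client reconnects to another server.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errResourceExhausted = errors.New("RESOURCE_EXHAUSTED: reconnect to a different server")

// sessionLimiter caps concurrent xDS sessions. A real implementation would
// also drain existing sessions when SetMaxSessions lowers the limit below
// the active count, at a rate scaled by the number of proxies in the catalog.
type sessionLimiter struct {
	mu     sync.Mutex
	max    int
	active int
}

func (l *sessionLimiter) Begin() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.active >= l.max {
		return errResourceExhausted
	}
	l.active++
	return nil
}

func (l *sessionLimiter) End() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.active--
}

// SetMaxSessions is where autopilot state and catalog registrations feed in.
func (l *sessionLimiter) SetMaxSessions(max int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.max = max
}

func main() {
	l := &sessionLimiter{max: 1}
	fmt.Println(l.Begin()) // nil: admitted
	fmt.Println(l.Begin()) // RESOURCE_EXHAUSTED: over capacity
}
```
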
freddygv 19f25fc3a5 Allow terminated peerings to be deleted
Peerings are terminated when a peer decides to delete the peering from
their end. Deleting a peering sends a termination message to the peer
and triggers them to mark the peering as terminated but does NOT delete
the peering itself. This is to prevent peerings from disappearing from
both sides just because one side deleted them.

Previously the Delete endpoint was skipping the deletion if the peering
was not marked as active. However, terminated peerings are also
inactive.

This PR makes some updates so that peerings marked as terminated can be
deleted by users.
2022-08-26 10:52:47 -06:00
Chris S. Kim a8090268d4
Replace ring buffer with async version (#14314)
We need to watch for changes to peerings and update the server addresses which get served by the ring buffer.

Also, if there is an active connection for a peer, we are getting up-to-date server addresses from the replication stream and can safely ignore the token's addresses which may be stale.
2022-08-26 10:27:13 -04:00
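
The precedence rule in this commit, as a one-function sketch (names and addresses illustrative): fresh addresses from an active replication stream win over the token's possibly-stale ones.

```go
package main

import "fmt"

// pickServerAddresses prefers addresses learned over the replication
// stream; the token's embedded addresses are only a bootstrap fallback.
func pickServerAddresses(streamAddrs, tokenAddrs []string) []string {
	if len(streamAddrs) > 0 {
		return streamAddrs
	}
	return tokenAddrs
}

func main() {
	fmt.Println(pickServerAddresses(
		[]string{"10.0.0.7:8502"},    // from the active stream
		[]string{"203.0.113.1:8502"}, // from the peering token
	))
}
```
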
freddygv 01b0cbcbd7 Use proto message for each secrets write op
Previously there was a field indicating the operation that triggered a
secrets write. Now there is a message for each operation and it contains
the secret ID being persisted.
2022-08-08 01:41:00 -06:00
freddygv b089472a12 Pass explicit signal with op for secrets write
Previously the updates to the peering secrets UUID table relied on
inferring what action triggered the update based on a reconciliation
against the existing secrets.

Instead we now explicitly require the operation to be given so that the
inference isn't necessary. This makes the UUID table logic easier to
reason about and fixes some related bugs.

There is also an update so that the peering secrets get handled on
snapshots/restores.
2022-08-03 17:25:12 -05:00
freddygv 544b3603e9 Avoid deleting peering secret UUIDs at dialers
Dialers do not keep track of peering secret UUIDs, so they should not
attempt to clean up data from that table when their peering is deleted.

We also now keep peer server addresses when marking peerings for
deletion. Peer server addresses are used by the ShouldDial() helper
when determining whether the peering is for a dialer or an acceptor.
We need to keep this data so that peering secrets can be cleaned up
accordingly.
2022-08-03 16:34:57 -05:00
Luke Kysow e9960dfdf3
peering: default to false (#13963)
* defaulting to false because peering will be released as beta
* Ignore peering disabled error in bundles cachetype

Co-authored-by: Matt Keeler <mkeeler@users.noreply.github.com>
Co-authored-by: freddygv <freddy@hashicorp.com>
Co-authored-by: Matt Keeler <mjkeeler7@gmail.com>
2022-08-01 15:22:36 -04:00
Matt Keeler 795e5830c6
Implement/Utilize secrets for Peering Replication Stream (#13977) 2022-08-01 10:33:18 -04:00
acpana 778c796ec9
use EqualPartitions
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-27 14:48:30 -07:00
acpana 8042b3aeed
better fix
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-27 14:28:08 -07:00
acpana b03467e3bd
sync w ent
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-27 11:41:39 -07:00
alex 0a66d0188d
peering: prevent peering in same partition (#13851)
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
2022-07-25 18:00:48 -07:00
Nitya Dhanushkodi 03ea6517c9
peering: remove validation that forces peering token server addresses to be an IP, allow hostname based addresses (#13874) 2022-07-25 16:33:47 -07:00
Luke Kysow a8ae88ec59
peering: read endpoints can now return failing status (#13849)
Track streams that have been disconnected due to an error and
set their statuses to failing.
2022-07-25 14:27:53 -07:00
Chris S. Kim 1f8ae56951
Preserve PeeringState on upsert (#13666)
Fixes a bug where, if the generate-token endpoint is called twice, the second call upserts the zero value (undefined) of PeeringState.
2022-07-25 14:37:56 -04:00
freddygv 5bbc0cc615 Add ACL enforcement to peering endpoints 2022-07-25 09:34:29 -06:00
alex b60ebc022e
peering: use ShouldDial to validate peer role (#13823)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-22 15:56:25 -07:00
Luke Kysow d21f793b74
peering: add config to enable/disable peering (#13867)
* peering: add config to enable/disable peering

Add config:

```
peering {
  enabled = true
}
```

Defaults to true. When disabled:
1. All peering RPC endpoints will return an error
2. Leader won't start its peering establishment goroutines
3. Leader won't start its peering deletion goroutines
2022-07-22 15:20:21 -07:00
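
A sketch of behavior (1) from the list above, with hypothetical types; each peering RPC endpoint checks the flag before doing any work:

```go
package main

import (
	"errors"
	"fmt"
)

var errPeeringDisabled = errors.New("peering must be enabled to use this endpoint")

type server struct{ peeringEnabled bool }

// GenerateToken is illustrative: when peering { enabled = false }, the
// endpoint returns an error immediately.
func (s *server) GenerateToken(peerName string) (string, error) {
	if !s.peeringEnabled {
		return "", errPeeringDisabled
	}
	return "token-for-" + peerName, nil // real token generation elided
}

func main() {
	s := &server{peeringEnabled: false}
	_, err := s.GenerateToken("dc2-default")
	fmt.Println(err)
}
```
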
Nitya Dhanushkodi cbafabde16
update generate token endpoint to take external addresses (#13844)
Update the generate-token endpoint (rpc, http, and api module).

If ServerExternalAddresses is set, it overrides any addresses obtained from the "consul" service: the external addresses are embedded in the token instead and are what the dialer connects to. This allows, for example, placing a load balancer in front of the Consul servers.
2022-07-21 14:56:11 -07:00
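
A hedged usage sketch with Consul's Go api module, assuming the request field is exposed there as ServerExternalAddresses (the peer name and addresses are made up):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// The external addresses override whatever is registered for the
	// "consul" service, so the dialing cluster connects to the load
	// balancer instead of individual servers.
	resp, _, err := client.Peerings().GenerateToken(context.Background(),
		api.PeeringGenerateTokenRequest{
			PeerName:                "dc2-default",
			ServerExternalAddresses: []string{"consul-lb.example.com:8502"},
		}, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.PeeringToken)
}
```
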
alex 64b3705a31
peering: refactor reconcile, cleanup (#13795)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-19 11:43:29 -07:00
alex 4ff097c4cf
peering: track exported services (#13784)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-07-18 10:20:04 -07:00
R.B. Boyer 61ebb38092
server: ensure peer replication can successfully use TLS over external gRPC (#13733)
Ensure that the peer stream replication rpc can successfully be used with TLS activated.

Also:

- If key material is configured for the gRPC port but HTTPS is not
  enabled, TLS will now still be activated for the gRPC port.

- The peerstream replication stream opened by the establishing side now
  ignores grpc.WithBlock so that TLS errors bubble up instead of being
  awkwardly delayed or suppressed.
2022-07-15 13:15:50 -05:00
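
An illustrative dial, with made-up addresses, showing the grpc.WithBlock point from the second bullet: dialing is non-blocking, so a TLS handshake failure surfaces as an error on the first RPC rather than being delayed inside a blocking dial.

```go
package main

import (
	"crypto/tls"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	creds := credentials.NewTLS(&tls.Config{ServerName: "server.dc1.consul"})

	// Note what is absent: grpc.WithBlock(). Errors from the TLS handshake
	// bubble up through the first stream/RPC attempt instead.
	conn, err := grpc.Dial("consul.example.com:8502", grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```
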
alex 70ad4804b6
peering: track imported services (#13718) 2022-07-15 10:20:43 -07:00
Dan Upton 34140ff3e0
grpc: rename public/private directories to external/internal (#13721)
Previously, public referred to gRPC services that are both exposed on
the dedicated gRPC port and have their definitions in the proto-public
directory (so were considered usable by 3rd parties). Whereas private
referred to services on the multiplexed server port that are only usable
by agents and other servers.

Now, we're splitting these definitions, such that external/internal
refers to the port and public/private refers to whether they can be used
by 3rd parties.

This is necessary because the peering replication API needs to be
exposed on the dedicated port, but is not (yet) suitable for use by 3rd
parties.
2022-07-13 16:33:48 +01:00
R.B. Boyer 5b801db24b
peering: move peer replication to the external gRPC port (#13698)
Peer replication is intended to be between separate Consul installs and
effectively should be considered "external". This PR moves the peer
stream replication bidirectional RPC endpoint to the external gRPC
server and ensures that things continue to function.
2022-07-08 12:01:13 -05:00
Chris S. Kim 0910c41d95
Revise possible states for a peering. (#13661)
These changes are primarily for Consul's UI, where we want to be more
specific about the state a peering is in.

- The "initial" state was renamed to pending, and no longer applies to
  peerings being established from a peering token.

- Upon request to establish a peering from a peering token, peerings
  will be set as "establishing". This will help distinguish between the
  two roles: the cluster that generates the peering token and the
  cluster that establishes the peering.

- When marked for deletion, peering state will be set to "deleting".
  This way the UI can determine deletion from the state rather than
  from the "DeletedAt" field.

Co-authored-by: freddygv <freddy@hashicorp.com>
2022-07-04 10:47:58 -04:00
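
For reference, the resulting set of states, as a Go summary of the commit messages in this log; the authoritative list lives in Consul's pbpeering protobufs:

```go
package main

import "fmt"

type peeringState string

const (
	statePending      peeringState = "PENDING"      // token generated, nothing dialed yet
	stateEstablishing peeringState = "ESTABLISHING" // establishing from a peering token
	stateActive       peeringState = "ACTIVE"
	stateFailing      peeringState = "FAILING"     // stream disconnected due to an error
	stateDeleting     peeringState = "DELETING"    // marked for deletion; what the UI keys off
	stateTerminated   peeringState = "TERMINATED"  // the peer deleted the peering on their side
)

func main() {
	fmt.Println(stateEstablishing)
}
```
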
Daniel Upton 497df1ca3b proxycfg: server-local config entry data sources
This is the OSS portion of enterprise PR 2056.

This commit provides server-local implementations of the proxycfg.ConfigEntry
and proxycfg.ConfigEntryList interfaces, that source data from streaming events.

It makes use of the LocalMaterializer type introduced for peering replication,
adding the necessary support for authorization.

It also adds support for "wildcard" subscriptions (within a topic) to the event
publisher, as this is needed to fetch service-resolvers for all services when
configuring mesh gateways.

Currently, events will be emitted for just the ingress-gateway, service-resolver,
and mesh config entry types, as these are the only entries required by proxycfg;
the events will be emitted on the IngressGateway, ServiceResolver, and MeshConfig
topics respectively.

Though these events will only be consumed "locally" for now, they can also be
consumed via the gRPC endpoint (confirmed using grpcurl) so using them from
client agents should be a case of swapping the LocalMaterializer for an
RPCMaterializer.
2022-07-04 10:48:36 +01:00
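
A pared-down sketch of the wildcard subscription shape described above; the types are illustrative stand-ins for the event publisher's real request:

```go
package main

import "fmt"

type topic string

const topicServiceResolver topic = "ServiceResolver"

// subject names the entity within a topic; the wildcard matches every
// entity, which is what fetching service-resolvers for all services needs.
type subject string

const subjectWildcard subject = "*"

type subscribeRequest struct {
	Topic   topic
	Subject subject
}

func main() {
	// One subscription covering service-resolver entries for all services,
	// as required when configuring mesh gateways.
	fmt.Printf("%+v\n", subscribeRequest{Topic: topicServiceResolver, Subject: subjectWildcard})
}
```
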
alex 90577810cc
peering: add imported/exported counts to peering (#13644)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
2022-06-29 14:07:30 -07:00
alex a8ae8de20e
peering: reconcile/ hint active state for list (#13619)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-06-29 09:43:50 -07:00
R.B. Boyer 2dba16be52
peering: replicate all SpiffeID values necessary for the importing side to do SAN validation (#13612)
When traversing an exported peered service, the discovery chain
evaluation at the other side may re-route the request to a variety of
endpoints. Furthermore we intend to terminate mTLS at the mesh gateway
for arriving peered traffic that is http-like (L7), so the caller needs
to know the mesh gateway's SpiffeID in that case as well.

The following new SpiffeID values will be shipped back in the peerstream
replication:

- tcp: all possible SpiffeIDs resulting from the service-resolver
        component of the exported discovery chain

- http-like: the SpiffeID of the mesh gateway
2022-06-27 14:37:18 -05:00
R.B. Boyer e7a7232a6b
state: peering ID assignment cannot happen inside of the state store (#13525)
Move peering ID assignment outside of the FSM, so that the ID is written
to the raft log and the same ID is used by all voters, including after
restarts.
2022-06-21 13:04:08 -05:00
R.B. Boyer 93611819e2
xds: mesh gateways now have their own leaf certificate when involved in a peering (#13460)
This is only configured in xDS when a service with an L7 protocol is
exported.

They also load any relevant trust bundles for the peered services to
eventually use for L7 SPIFFE validation during mTLS termination.
2022-06-15 14:36:18 -05:00
freddygv a288d0c388 Avoid deleting peerings marked as terminated.
When our peer deletes the peering it is locally marked as terminated.
This termination should kick off deleting all imported data, but should
not delete the peering object itself.

Keeping peerings marked as terminated acts as a signal that the action
took place.
2022-06-14 15:37:09 -06:00
freddygv a5283e4361 Add leader routine to clean up peerings
Once a peering is marked for deletion a new leader routine will now
clean up all imported resources and then the peering itself.

A lot of the logic was grabbed from the namespace/partitions deferred
deletions but with a handful of simplifications:
- The rate limiting is not configurable.

- Deleting imported nodes/services/checks is done by deleting nodes with
  the Txn API. The services and checks are deleted as a side-effect.

- There is no "round rate limiter" like with namespaces and partitions.
  This is because peerings are purely local, and deleting a peering in
  the datacenter does not depend on deleting data from other DCs like
  with WAN-federated namespaces. All rate limiting is handled by the
  Raft rate limiter.
2022-06-14 15:36:50 -06:00
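
A hedged sketch of the node-level deletion via the Txn API from Consul's Go client (the node name is made up); deleting the node removes its services and checks as a side effect, which is why node operations are all the cleanup needs:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// One transaction deleting an imported node; its services and checks
	// go away with it.
	ops := api.TxnOps{
		&api.TxnOp{Node: &api.NodeTxnOp{
			Verb: api.NodeDelete,
			Node: api.Node{Node: "imported-node-1"},
		}},
	}
	ok, resp, _, err := client.Txn().Txn(ops, nil)
	if err != nil {
		log.Fatal(err)
	}
	if !ok {
		log.Fatalf("txn rolled back: %+v", resp.Errors)
	}
}
```
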
freddygv dbcbf3978f Fixup stream tear-down steps.
1. Fix a bug where the peering leader routine would not track all active
   peerings in the "stored" reconciliation map. This could lead to
   tearing down streams where the token was generated, since the
   ConnectedStreams() method used for reconciliation returns all streams
   and not just the ones initiated by this leader routine.

2. Fix a race where stream contexts were being canceled before
   termination messages were being processed by a peer.

   Previously the leader routine would tear down streams by canceling
   their context right after the termination message was sent. This
   context cancelation could be propagated to the server side faster
   than the termination message. Now there is a change where the
   dialing peer uses CloseSend() to signal when no more messages will
   be sent. Eventually the server peer will read an EOF after receiving
   and processing the preceding termination message.

   Using CloseSend() is actually not enough to address the issue
   mentioned, since it doesn't wait for the server peer to finish
   processing messages. Because of this now the dialing peer also reads
   from the stream until an error signals that there are no more
   messages. Receiving an EOF from our peer indicates that they
   processed the termination message and have no additional work to do.

   Given that the stream is being closed, all the messages received by
   Recv are discarded. We only check for errors to avoid importing new
   data.
2022-06-13 12:10:42 -06:00
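
The teardown order from point 2, sketched against gRPC's generic grpc.ClientStream interface; message types are elided to interface{}, and this is a sketch under those assumptions, not Consul's code:

```go
package peering

import (
	"errors"
	"io"

	"google.golang.org/grpc"
)

// drainStream sends the termination message, signals that no more messages
// will be sent, then keeps receiving until the peer closes the stream.
// An io.EOF from RecvMsg proves the peer processed the termination message.
func drainStream(stream grpc.ClientStream, terminationMsg, scratch interface{}) error {
	if err := stream.SendMsg(terminationMsg); err != nil {
		return err
	}
	if err := stream.CloseSend(); err != nil {
		return err
	}
	for {
		// Discard any in-flight messages; only the error is inspected, so
		// no new data is imported during shutdown.
		if err := stream.RecvMsg(scratch); err != nil {
			if errors.Is(err, io.EOF) {
				return nil
			}
			return err
		}
	}
}
```
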