Protobuf Refactoring for Multi-Module Cleanliness
This commit includes the following:
Moves all packages that were within proto/ to proto/private
Rewrites imports to account for the packages being moved
Adds in buf.work.yaml to enable buf workspaces
Names the proto-public buf module so that we can override the Go package imports within proto/buf.yaml
Bumps the buf version dependency to 1.14.0 (I was trying out the version to see if it would get around an issue - it didn't but it also doesn't break things and it seemed best to keep up with the toolchain changes)
Why:
In the future we will need to consume other protobuf dependencies such as the Google HTTP annotations for openapi generation or grpc-gateway usage.
There were some recent changes to have our own ratelimiting annotations.
The two combined were not working when I was trying to use them together (attempting to rebase another branch)
Buf workspaces should be the solution to the problem
Buf workspaces means that each module will have generated Go code that embeds proto file names relative to the proto dir and not the top level repo root.
This resulted in proto file name conflicts in the Go global protobuf type registry.
The solution to that was to add in a private/ directory into the path within the proto/ directory.
That then required rewriting all the imports.
Is this safe?
AFAICT yes
The gRPC wire protocol doesn't seem to care about the proto file names (although the Go grpc code does tack on the proto file name as Metadata in the ServiceDesc)
Other than imports, there were no changes to any generated code as a result of this.
Enforce lowercase peer names.
Prior to this change peer names could be mixed case.
This can cause issues, as peer names are used as DNS labels
in various locations. It also caused issues with envoy
configuration.
* Protobuf Modernization
Remove direct usage of golang/protobuf in favor of google.golang.org/protobuf
Marshallers (protobuf and json) needed some changes to account for different APIs.
Moved to using the google.golang.org/protobuf/types/known/* for the well known types including replacing some custom Struct manipulation with whats available in the structpb well known type package.
This also updates our devtools script to install protoc-gen-go from the right location so that files it generates conform to the correct interfaces.
* Fix go-mod-tidy make target to work on all modules
Re-add ServerExternalAddresses parameter in GenerateToken endpoint
This reverts commit 5e156772f6a7fba5324eb6804ae4e93c091229a6
and adds extra functionality to support newer peering behaviors.
This commit updates the establish endpoint to bubble up a 403 status
code to callers when the establishment secret from the token is invalid.
This is a signal that a new peering token must be generated.
When peering through mesh gateways we expect outbound dials to peer
servers to flow through the local mesh gateway addresses.
Now when establishing a peering we get a list of dial addresses as a
ring buffer that includes local mesh gateway addresses if the local DC
is configured to peer through mesh gateways. The ring buffer includes
the mesh gateway addresses first, but also includes the remote server
addresses as a fallback.
This fallback is present because it's possible that direct egress from
the servers may be allowed. If not allowed then the leader will cycle
back to a mesh gateway address through the ring.
When attempting to dial the remote servers we retry up to a fixed
timeout. If using mesh gateways we also have an initial wait in
order to allow for the mesh gateways to configure themselves.
Note that if we encounter a permission denied error we do not retry
since that error indicates that the secret in the peering token is
invalid.
A previous commit introduced an internally-managed server certificate
to use for peering-related purposes.
Now the peering token has been updated to match that behavior:
- The server name matches the structure of the server cert
- The CA PEMs correspond to the Connect CA
Note that if Conect is disabled, and by extension the Connect CA, we
fall back to the previous behavior of returning the manually configured
certs and local server SNI.
Several tests were updated to use the gRPC TLS port since they enable
Connect by default. This means that the peering token will embed the
Connect CA, and the dialer will expect a TLS listener.
Peerings are terminated when a peer decides to delete the peering from
their end. Deleting a peering sends a termination message to the peer
and triggers them to mark the peering as terminated but does NOT delete
the peering itself. This is to prevent peerings from disappearing from
both sides just because one side deleted them.
Previously the Delete endpoint was skipping the deletion if the peering
was not marked as active. However, terminated peerings are also
inactive.
This PR makes some updates so that peerings marked as terminated can be
deleted by users.
We need to watch for changes to peerings and update the server addresses which get served by the ring buffer.
Also, if there is an active connection for a peer, we are getting up-to-date server addresses from the replication stream and can safely ignore the token's addresses which may be stale.
Previously there was a field indicating the operation that triggered a
secrets write. Now there is a message for each operation and it contains
the secret ID being persisted.
Previously the updates to the peering secrets UUID table relied on
inferring what action triggered the update based on a reconciliation
against the existing secrets.
Instead we now explicitly require the operation to be given so that the
inference isn't necessary. This makes the UUID table logic easier to
reason about and fixes some related bugs.
There is also an update so that the peering secrets get handled on
snapshots/restores.
Dialers do not keep track of peering secret UUIDs, so they should not
attempt to clean up data from that table when their peering is deleted.
We also now keep peer server addresses when marking peerings for
deletion. Peer server addresses are used by the ShouldDial() helper
when determining whether the peering is for a dialer or an acceptor.
We need to keep this data so that peering secrets can be cleaned up
accordingly.
Update generate token endpoint (rpc, http, and api module)
If ServerExternalAddresses are set, it will override any addresses gotten from the "consul" service, and be used in the token instead, and dialed by the dialer. This allows for setting up a load balancer for example, in front of the consul servers.
Previously, public referred to gRPC services that are both exposed on
the dedicated gRPC port and have their definitions in the proto-public
directory (so were considered usable by 3rd parties). Whereas private
referred to services on the multiplexed server port that are only usable
by agents and other servers.
Now, we're splitting these definitions, such that external/internal
refers to the port and public/private refers to whether they can be used
by 3rd parties.
This is necessary because the peering replication API needs to be
exposed on the dedicated port, but is not (yet) suitable for use by 3rd
parties.
Peer replication is intended to be between separate Consul installs and
effectively should be considered "external". This PR moves the peer
stream replication bidirectional RPC endpoint to the external gRPC
server and ensures that things continue to function.
These changes are primarily for Consul's UI, where we want to be more
specific about the state a peering is in.
- The "initial" state was renamed to pending, and no longer applies to
peerings being established from a peering token.
- Upon request to establish a peering from a peering token, peerings
will be set as "establishing". This will help distinguish between the
two roles: the cluster that generates the peering token and the
cluster that establishes the peering.
- When marked for deletion, peering state will be set to "deleting".
This way the UI determines the deletion via the state rather than the
"DeletedAt" field.
Co-authored-by: freddygv <freddy@hashicorp.com>
When traversing an exported peered service, the discovery chain
evaluation at the other side may re-route the request to a variety of
endpoints. Furthermore we intend to terminate mTLS at the mesh gateway
for arriving peered traffic that is http-like (L7), so the caller needs
to know the mesh gateway's SpiffeID in that case as well.
The following new SpiffeID values will be shipped back in the peerstream
replication:
- tcp: all possible SpiffeIDs resulting from the service-resolver
component of the exported discovery chain
- http-like: the SpiffeID of the mesh gateway
This is only configured in xDS when a service with an L7 protocol is
exported.
They also load any relevant trust bundles for the peered services to
eventually use for L7 SPIFFE validation during mTLS termination.
1. Fix a bug where the peering leader routine would not track all active
peerings in the "stored" reconciliation map. This could lead to
tearing down streams where the token was generated, since the
ConnectedStreams() method used for reconciliation returns all streams
and not just the ones initiated by this leader routine.
2. Fix a race where stream contexts were being canceled before
termination messages were being processed by a peer.
Previously the leader routine would tear down streams by canceling
their context right after the termination message was sent. This
context cancelation could be propagated to the server side faster
than the termination message. Now there is a change where the
dialing peer uses CloseSend() to signal when no more messages will
be sent. Eventually the server peer will read an EOF after receiving
and processing the preceding termination message.
Using CloseSend() is actually not enough to address the issue
mentioned, since it doesn't wait for the server peer to finish
processing messages. Because of this now the dialing peer also reads
from the stream until an error signals that there are no more
messages. Receiving an EOF from our peer indicates that they
processed the termination message and have no additional work to do.
Given that the stream is being closed, all the messages received by
Recv are discarded. We only check for errors to avoid importing new
data.