Commit Graph

1792 Commits

Author SHA1 Message Date
Dhia Ayachi 3deaf767f2
Egress gtw/intention rpc endpoint (#13354)
* update gateway-services table with endpoints

* fix failing test

* remove unneeded config in test

* rename "endpoint" to "destination"

* more endpoint renaming to destination in tests

* update isDestination based on service-defaults config entry creation

* use a 3-state kind to be able to set the kind to unknown (when neither a service nor a destination exists)

* set unknown state to empty to avoid modifying a lot of tests

* fix logic to set the kind correctly on CRUD

* fix failing tests

* add missing tests and fix service delete

* fix failing test

* Apply suggestions from code review

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>

* fix a bug with kind and add relevant test

* fix compile error

* fix failing tests

* add kind to clone

* fix failing tests

* fix failing tests in catalog endpoint

* fix service dump test

* Apply suggestions from code review

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>

* remove duplicate tests

* first draft of destinations intention in connect proxy

* remove ServiceDestinationList

* fix failing tests

* fix agent/consul failing tests

* change to filter intentions in the state store instead of adding a field.

* fix failing tests

* fix comment

* fix comments

* store service kind destination and add relevant tests

* changes based on review

* filter on destinations when querying source match

* change state store API to get an IntentionTarget parameter

* add intentions tests

* add destination upstream endpoint

* fix failing test

* fix failing test and a bug with wildcard intentions

* fix failing test

* Apply suggestions from code review

Co-authored-by: alex <8968914+acpana@users.noreply.github.com>

* add missing test and clarify doc

* fix style

* gofmt intention.go

* fix merge introduced issue

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>
Co-authored-by: alex <8968914+acpana@users.noreply.github.com>
Co-authored-by: github-team-consul-core <github-team-consul-core@hashicorp.com>
2022-06-07 15:55:02 -04:00
Dhia Ayachi 7602b6ebf2
Egress gtw/connect destination intentions (#13341)
* update gateway-services table with endpoints

* fix failing test

* remove unneeded config in test

* rename "endpoint" to "destination"

* more endpoint renaming to destination in tests

* update isDestination based on service-defaults config entry creation

* use a 3-state kind to be able to set the kind to unknown (when neither a service nor a destination exists)

* set unknown state to empty to avoid modifying a lot of tests

* fix logic to set the kind correctly on CRUD

* fix failing tests

* add missing tests and fix service delete

* fix failing test

* Apply suggestions from code review

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>

* fix a bug with kind and add relevant test

* fix compile error

* fix failing tests

* add kind to clone

* fix failing tests

* fix failing tests in catalog endpoint

* fix service dump test

* Apply suggestions from code review

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>

* remove duplicate tests

* first draft of destinations intention in connect proxy

* remove ServiceDestinationList

* fix failing tests

* fix agent/consul failing tests

* change to filter intentions in the state store instead of adding a field.

* fix failing tests

* fix comment

* fix comments

* store service kind destination and add relevant tests

* changes based on review

* filter on destinations when querying source match

* Apply suggestions from code review

Co-authored-by: alex <8968914+acpana@users.noreply.github.com>

* fix style

* Apply suggestions from code review

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>

* rename destinationType to targetType.

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>
Co-authored-by: alex <8968914+acpana@users.noreply.github.com>
Co-authored-by: github-team-consul-core <github-team-consul-core@hashicorp.com>
2022-06-07 15:03:59 -04:00
R.B. Boyer 0681f3571d
peering: allow mesh gateways to proxy L4 peered traffic (#13339)
Mesh gateways will now enable tcp connections with SNI names including peering information so that those connections may be proxied.

Note: this does not change the callers to use these mesh gateways.
2022-06-06 14:20:41 -05:00
alex ff2ad3ba0c
peering: send leader addr (#13342)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>
2022-06-06 10:00:38 -07:00
cskh bd3a4dfeab
Add isLeader metric to track if a server is a leader (#13304)
CTIA-21: add is_leader metric to track if a server is a leader

Co-authored-by: alex <8968914+acpana@users.noreply.github.com>
2022-06-03 13:07:37 -04:00
freddygv ad6dbe081a Add agent cache-type for TrustBundleListByService
There are a handful of changes in this commit:
* When querying trust bundles for a service we need to be able to
  specify the namespace of the service.
* The endpoint needs to track the index because the cache watches use
  it.
* Extracted bulk of the endpoint's logic to a state store function
  so that index tracking could be tested more easily.
* Removed check for service existence, deferring that sort of work to ACL authz
* Added the cache type
2022-06-01 17:05:10 -06:00
freddygv 073c9e3a91 Update assumptions around exported-service config
Given that the exported-services config entry can use wildcards, the
precedence for wildcards is handled as with intentions. The most exact
match is the match that applies for any given service. We do not take
the union of all that apply.

Another update that was made was to reflect that only one
exported-services config entry applies to any given service in a
partition. This is a pre-existing constraint that gets enforced by
the Normalize() method on that config entry type.
2022-06-01 17:03:51 -06:00
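The precedence rule described in 073c9e3a91 (the most exact match applies, and matches are not unioned) can be illustrated with a minimal Go sketch. The `exportRule` type, its fields, and `pickRule` are hypothetical names invented for this illustration; they are not Consul's actual types.

```go
package main

import "fmt"

// exportRule is a hypothetical, simplified stand-in for one service entry in
// an exported-services config entry. Name may be "*" to match any service.
type exportRule struct {
	Name      string
	Consumers []string
}

// pickRule returns the single rule that applies to service. An exact name
// match takes precedence over the wildcard; the two are never merged.
func pickRule(rules []exportRule, service string) (exportRule, bool) {
	var wildcard *exportRule
	for i := range rules {
		switch rules[i].Name {
		case service:
			return rules[i], true
		case "*":
			wildcard = &rules[i]
		}
	}
	if wildcard != nil {
		return *wildcard, true
	}
	return exportRule{}, false
}

func main() {
	rules := []exportRule{
		{Name: "*", Consumers: []string{"peer-a", "peer-b"}},
		{Name: "web", Consumers: []string{"peer-c"}},
	}
	r, _ := pickRule(rules, "web")
	fmt.Println("web:", r.Consumers) // only the exact rule applies
	r, _ = pickRule(rules, "api")
	fmt.Println("api:", r.Consumers) // falls back to the wildcard rule
}
```

Returning one rule instead of merging consumer lists mirrors the statement that the union of all matching entries is not taken.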
Freddy 6ef38eaea7
Configure upstream TLS context with peer root certs (#13321)
For mTLS to work between two proxies in peered clusters with different root CAs,
proxies need to configure their outbound listener to use different root certificates
for validation.

Up until peering was introduced proxies would only ever use one set of root certificates
to validate all mesh traffic, both inbound and outbound. Now an upstream proxy
may have a leaf certificate signed by a CA that's different from the dialing proxy's.

This PR makes changes to proxycfg and xds so that the upstream TLS validation
uses different root certificates depending on which cluster is being dialed.
2022-06-01 15:53:52 -06:00
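The idea in 6ef38eaea7, validating an upstream against different root certificates depending on which cluster is dialed, might look roughly like the following Go sketch. `rootsForUpstream` and its parameters are hypothetical; the real changes live in `proxycfg` and `xds` and are not reproduced here.

```go
package main

import "fmt"

// rootsForUpstream picks the PEM-encoded root certificates used to validate
// an upstream. peer is empty for upstreams in the local cluster; otherwise it
// names the peered cluster being dialed. All names here are illustrative.
func rootsForUpstream(peer string, localRoots []string, peerRoots map[string][]string) ([]string, error) {
	if peer == "" {
		return localRoots, nil
	}
	roots, ok := peerRoots[peer]
	if !ok || len(roots) == 0 {
		return nil, fmt.Errorf("no trust bundle known for peer %q", peer)
	}
	return roots, nil
}

func main() {
	local := []string{"local-root-pem"}
	peers := map[string][]string{"cluster-b": {"cluster-b-root-pem"}}

	roots, err := rootsForUpstream("", local, peers)
	fmt.Println(roots, err)
	roots, err = rootsForUpstream("cluster-b", local, peers)
	fmt.Println(roots, err)
	_, err = rootsForUpstream("cluster-c", local, peers)
	fmt.Println(err)
}
```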
Dhia Ayachi d4a04457e1
update gateway-services table with endpoints (#13217)
* update gateway-services table with endpoints

* fix failing test

* remove unneeded config in test

* rename "endpoint" to "destination"

* more endpoint renaming to destination in tests

* update isDestination based on service-defaults config entry creation

* use a 3-state kind to be able to set the kind to unknown (when neither a service nor a destination exists)

* set unknown state to empty to avoid modifying a lot of tests

* fix logic to set the kind correctly on CRUD

* fix failing tests

* add missing tests and fix service delete

* fix failing test

* Apply suggestions from code review

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>

* fix a bug with kind and add relevant test

* fix compile error

* fix failing tests

* add kind to clone

* fix failing tests

* fix failing tests in catalog endpoint

* fix service dump test

* Apply suggestions from code review

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>

* remove duplicate tests

* rename consts and fix kind when no destination is defined in the service-defaults.

* rename Kind to ServiceKind and change switch to use .(type)

Co-authored-by: Dan Stough <dan.stough@hashicorp.com>
2022-05-31 16:20:12 -04:00
Dan Upton a6a6d5a8ee
Enable servers to configure arbitrary proxies from the catalog (#13244)
OSS port of enterprise PR 1822

Includes the necessary changes to the `proxycfg` and `xds` packages to enable
Consul servers to configure arbitrary proxies using catalog data.

Broadly, `proxycfg.Manager` now has public methods for registering,
deregistering, and listing registered proxies — the existing local agent
state-sync behavior has been moved into a separate component that makes use of
these methods.

When an xDS session is started for a proxy service in the catalog, a goroutine
will be spawned to watch the service in the server's state store and
re-register it with the `proxycfg.Manager` whenever it is updated (and clean
it up when the client goes away).
2022-05-27 12:38:52 +01:00
alex 2d8664d384
monitor leadership in peering service (#13257)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>

Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
2022-05-26 17:55:16 -07:00
Chris S. Kim d73a9522cb
Add support for streaming CA roots to peers (#13260)
Sender watches for changes to CA roots and sends
them through the replication stream. Receiver saves
CA roots to tablePeeringTrustBundle
2022-05-26 15:24:09 -04:00
Riddhi Shah e5f1d8dce4
Add support for merge-central-config query param (#13001)
Adds a new query param merge-central-config for use with the below endpoints:

/catalog/service/:service
/catalog/connect/:service
/health/service/:service
/health/connect/:service

If set on the request, the response will include a fully resolved service definition which is merged with the proxy-defaults/global and service-defaults/:service config entries (on-demand style). This is useful to view the full service definition for a mesh service (connect-proxy kind or gateway kind) which might not be merged before being written into the catalog (example: in case of services in the agentless model).
2022-05-25 13:20:17 -07:00
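A small Go sketch of building a request URL that carries the `merge-central-config` query param from e5f1d8dce4. It assumes the `/v1` prefix and the `localhost:8500` address that appear in other commit messages on this page, and that the presence of the parameter (with an empty value) is enough to enable the behavior; `healthServiceURL` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"net/url"
)

// healthServiceURL builds the URL for /v1/health/service/:service and, when
// requested, adds the merge-central-config query param. Sending the param
// with an empty value is an assumption made for this sketch.
func healthServiceURL(addr, service string, mergeCentralConfig bool) string {
	q := url.Values{}
	if mergeCentralConfig {
		q.Set("merge-central-config", "")
	}
	u := url.URL{
		Scheme:   "http",
		Host:     addr,
		Path:     "/v1/health/service/" + service,
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(healthServiceURL("localhost:8500", "web", true))
	fmt.Println(healthServiceURL("localhost:8500", "web", false))
}
```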
R.B. Boyer bc10055edc
peering: replicate expected SNI, SPIFFE, and service protocol to peers (#13218)
The importing peer will need to know what SNI and SPIFFE name
corresponds to each exported service. Additionally it will need to know
at a high level the protocol in use (L4/L7) to generate the appropriate
connection pool and local metrics.

For replicated connect synthetic entities we edit the `Connect{}` part
of a `NodeService` to have a new section:

    {
      "PeerMeta": {
        "SNI": [
          "web.default.default.owt.external.183150d5-1033-3672-c426-c29205a576b8.consul"
        ],
        "SpiffeID": [
          "spiffe://183150d5-1033-3672-c426-c29205a576b8.consul/ns/default/dc/dc1/svc/web"
        ],
        "Protocol": "tcp"
      }
    }

This data is then replicated and saved as-is at the importing side. Both
SNI and SpiffeID are slices for now until I can be sure we don't need
them for how mesh gateways will ultimately work.
2022-05-25 12:37:44 -05:00
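For readers consuming the `PeerMeta` section shown in bc10055edc, the JSON can be decoded with a pair of small Go structs. The struct names below are hypothetical and only mirror the field names visible in the commit message; they are not the actual Consul types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// peerMeta mirrors the fields shown in the commit message. SNI and SpiffeID
// are slices, matching the JSON above.
type peerMeta struct {
	SNI      []string
	SpiffeID []string
	Protocol string
}

// connectSection is a hypothetical wrapper for the part of Connect{} that
// carries PeerMeta.
type connectSection struct {
	PeerMeta *peerMeta
}

// sample is the JSON from the commit message, reproduced as-is.
const sample = `{
  "PeerMeta": {
    "SNI": [
      "web.default.default.owt.external.183150d5-1033-3672-c426-c29205a576b8.consul"
    ],
    "SpiffeID": [
      "spiffe://183150d5-1033-3672-c426-c29205a576b8.consul/ns/default/dc/dc1/svc/web"
    ],
    "Protocol": "tcp"
  }
}`

// decodeConnect parses the JSON section into connectSection.
func decodeConnect(raw []byte) (*connectSection, error) {
	var c connectSection
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

func main() {
	c, err := decodeConnect([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(c.PeerMeta.Protocol, len(c.PeerMeta.SNI), len(c.PeerMeta.SpiffeID))
}
```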
alex 451dc50f4f
peering: expose IsLeader, hung up on dialer if follower (#13164)
Signed-off-by: acpana <8968914+acpana@users.noreply.github.com>

Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
2022-05-23 11:30:58 -07:00
cskh 39cb731988
Upgrade golangci-lint for go v1.18 (#13176) 2022-05-23 10:26:45 -04:00
R.B. Boyer 3b12a5179f
test: fix flaky test TestEventBufferFuzz (#13175) 2022-05-23 09:22:30 -05:00
Matt Keeler c629e89289
Fix tests broken in #13173 (#13178)
I changed the error type returned in a situation but didn’t update the tests to expect that error.
2022-05-23 10:00:06 -04:00
Matt Keeler 8a968299dd
Fix flaky tests in the agent/grpc/public/services/serverdiscovery package (#13173)
Occasionally we had seen the TestWatchServers_ACLToken_PermissionDenied be flagged as flaky in circleci. This change should fix that.

Why it fixes it is complicated. The test was failing with a panic when a mocked ACL Resolver was being called more times than expected. I struggled for a while to determine how that could be. This test should call authorize once and only once and the error returned should cause the stream to be terminated and the error returned to the gRPC client. Another oddity was no amount of running this test locally seemed to be able to reproduce the issue. I ran the test hundreds of thousands of times and it always passed.

It turns out that there is nothing wrong with the test. It just so happens that the panic from unexpected invocation of a mocked call happened during the test but was caused by a previous test (specifically the TestWatchServers_StreamLifecycle test).

The stream from the previous test remained open after all the test Cleanup functions were run. When the EventPublisher eventually picked up that the context was cancelled during cleanup, it force-closed all subscriptions, which caused some loops to be re-entered and the streams to be reauthorized. It's that looping in response to forced subscription closures that causes the mock to eventually panic. All the components (publisher, server, client) operate based on contexts. We cancel all those contexts but there is no synchronous way to know when they are stopped.

We could have implemented a synchronous stop, but in the context of an actual running Consul, context cancellation + async stopping is perfectly fine. What we (Dan and I) eventually thought was that the behavior of gRPC streams such as this when a server was shutting down wasn’t super helpful. What we would want is for a client to be able to distinguish between subscription closed because something may have changed requiring re-authentication and subscription closed because the server is shutting down. That way we can send back appropriate error messages to detail that the server is shutting down and not confuse users with potentially needing to resubscribe.

So that's what this PR does. We have introduced a shutting down state to our event subscriptions and the various streaming gRPC services that rely on the event publisher will all just behave correctly and actually stop the stream (not attempt transparent reauthorization) if this particular error is the one we get from the stream. Additionally the error that gets transmitted back through gRPC when this does occur indicates to the consumer that the server is going away. That is more helpful so that a client can then attempt to reconnect to another server.
2022-05-23 08:59:13 -04:00
R.B. Boyer 69d3e729a4
agent: allow for service discovery queries involving peer name to use streaming (#13168) 2022-05-20 15:27:01 -05:00
R.B. Boyer 68789effeb
test: TestServer_RPC_MetricsIntercept should use a concurrency-safe metrics store (#13157) 2022-05-19 15:39:28 -05:00
R.B. Boyer 91691eca87 peering: replicate discovery chains information to importing peers
Treat each exported service as a "discovery chain" and replicate one
synthetic CheckServiceNode for each chain and remote mesh gateway.

The health will be a flattened generated check of the checks for that
mesh gateway node.
2022-05-19 14:21:44 -05:00
Freddy 8894365c5a
[OSS] Add upsert handling for receiving CheckServiceNode (#13061) 2022-05-12 15:04:44 -06:00
R.B. Boyer c855df87ec
remove remaining shim runStep functions (#13015)
Wraps up the refactor from #13013
2022-05-10 16:24:45 -05:00
R.B. Boyer 9ad10318cd
add general runstep test helper instead of copying it all over the place (#13013) 2022-05-10 15:25:51 -05:00
Evan Culver d64726c8e9
peering: add store.PeeringsForService implementation (#12957) 2022-05-06 12:35:31 -07:00
Dan Upton 6bfdb48560
acl: gRPC login and logout endpoints (#12935)
Introduces two new public gRPC endpoints (`Login` and `Logout`) and
includes refactoring of the equivalent net/rpc endpoints to enable the
majority of logic to be reused (i.e. by extracting the `Binder` and
`TokenWriter` types).

This contains the OSS portions of the following enterprise commits:

- 75fcdbfcfa6af21d7128cb2544829ead0b1df603
- bce14b714151af74a7f0110843d640204082630a
- cc508b70fbf58eda144d9af3d71bd0f483985893
2022-05-04 17:38:45 +01:00
Kyle Havlovitz 3bd001fb29 Return ACLRemoteError from cache and test it correctly 2022-05-03 10:05:26 -07:00
Kyle Havlovitz f84ed5f70b Store and return rpc error in acl cache entries 2022-04-28 09:08:55 -07:00
R.B. Boyer 642b75b60b
health: ensure /v1/health/service/:service endpoint returns the most recent results when a filter is used with streaming (#12640)
The primary bug here is in the streaming subsystem that makes the overall v1/health/service/:service request behave incorrectly when servicing a blocking request with a filter provided.

There is a secondary non-streaming bug being fixed here that is much less obvious related to when to update the `reply` variable in a `blockingQuery` evaluation. It is unlikely that it is triggerable in practical environments and I could not actually get the bug to manifest, but I fixed it anyway while investigating the original issue.

Simple reproduction (streaming):

1. Register a service with a tag.

        curl -sL --request PUT 'http://localhost:8500/v1/agent/service/register' \
            --header 'Content-Type: application/json' \
            --data-raw '{ "ID": "ID1", "Name": "test", "Tags":[ "a" ], "EnableTagOverride": true }'

2. Do an initial filter query that matches on the tag.

        curl -sLi --get 'http://localhost:8500/v1/health/service/test' --data-urlencode 'filter=a in Service.Tags'

3. Note you get one result. Use the `X-Consul-Index` header to establish
   a blocking query in another terminal; this should not return yet.

        curl -sLi --get "http://localhost:8500/v1/health/service/test?index=$INDEX" --data-urlencode 'filter=a in Service.Tags'

4. Re-register that service with a different tag.

        curl -sL --request PUT 'http://localhost:8500/v1/agent/service/register' \
            --header 'Content-Type: application/json' \
            --data-raw '{ "ID": "ID1", "Name": "test", "Tags":[ "b" ], "EnableTagOverride": true }'

5. Your blocking query from (3) should return with a header
   `X-Consul-Query-Backend: streaming` and, if it works correctly,
   empty results (`[]`).

Attempts to reproduce with non-streaming failed (where you add `&near=_agent` to the read queries and ensure `X-Consul-Query-Backend: blocking-query` shows up in the results).
2022-04-27 10:39:45 -05:00
Dhia Ayachi 9dc5200155
update raft to v1.3.8 (#12844)
* update raft to v1.3.7

* add changelog

* fix compilation error

* fix HeartbeatTimeout

* fix ElectionTimeout to reload only if value is valid

* fix default values for `ElectionTimeout` and `HeartbeatTimeout`

* fix test defaults

* bump raft to v1.3.8
2022-04-25 10:19:26 -04:00
R.B. Boyer 809344a6f5
peering: initial sync (#12842)
- Add endpoints related to peering: read, list, generate token, initiate peering
- Update node/service/check table indexing to account for peers
- Foundational changes for pushing service updates to a peer
- Plumb peer name through Health.ServiceNodes path

see: ENT-1765, ENT-1280, ENT-1283, ENT-1283, ENT-1756, ENT-1739, ENT-1750, ENT-1679,
     ENT-1709, ENT-1704, ENT-1690, ENT-1689, ENT-1702, ENT-1701, ENT-1683, ENT-1663,
     ENT-1650, ENT-1678, ENT-1628, ENT-1658, ENT-1640, ENT-1637, ENT-1597, ENT-1634,
     ENT-1613, ENT-1616, ENT-1617, ENT-1591, ENT-1588, ENT-1596, ENT-1572, ENT-1555

Co-authored-by: R.B. Boyer <rb@hashicorp.com>
Co-authored-by: freddygv <freddy@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: Evan Culver <eculver@hashicorp.com>
Co-authored-by: Nitya Dhanushkodi <nitya@hashicorp.com>
2022-04-21 17:34:40 -05:00
Will Jordan 45ffdc360e
Add timeout to Client RPC calls (#11500)
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.

Co-authored-by: kisunji <ckim@hashicorp.com>
2022-04-21 16:21:35 -04:00
Matt Keeler f49adfaaf0
Implement the ServerDiscovery.WatchServers gRPC endpoint (#12819)
* Implement the ServerDiscovery.WatchServers gRPC endpoint
* Fix the ConnectCA.Sign gRPC endpoints metadata forwarding.
* Unify public gRPC endpoints around the public.TraceID function for request_id logging
2022-04-21 12:56:18 -04:00
Blake Covarrubias 2beea7eb7c
acl: Clarify node/service identities must be lowercase (#12807)
Modify ACL error message for invalid node/service identities names to
clearly state only lowercase alphanumeric characters are supported.
2022-04-21 09:29:16 -07:00
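A minimal Go sketch of the kind of check 2beea7eb7c describes. The pattern below is an assumption that accepts only lowercase letters and digits, as the commit message words it; the real validation rules may allow additional characters (separators, for example), and `validateIdentityName` is a hypothetical function.

```go
package main

import (
	"fmt"
	"regexp"
)

// lowercaseName is an assumed pattern: one or more lowercase letters or
// digits. It is a simplification for illustration only.
var lowercaseName = regexp.MustCompile(`^[a-z0-9]+$`)

// validateIdentityName returns an error that clearly states the lowercase
// requirement when name does not match.
func validateIdentityName(name string) error {
	if !lowercaseName.MatchString(name) {
		return fmt.Errorf("invalid identity name %q: only lowercase alphanumeric characters are supported", name)
	}
	return nil
}

func main() {
	fmt.Println(validateIdentityName("web1"))
	fmt.Println(validateIdentityName("Web1"))
}
```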
R.B. Boyer bbd38e95ce
chore: upgrade mockery to v2 and regenerate (#12836) 2022-04-21 09:48:21 -05:00
Riddhi Shah 1d49f5c84e
[OSS] gRPC call to get envoy bootstrap params (#12825)
Adds a new gRPC endpoint to get envoy bootstrap params. The new consul-dataplane service will use this
endpoint to generate an envoy bootstrap configuration.
2022-04-19 17:24:21 -07:00
Matt Keeler 3badd4c35c
Add event generation for autopilot state updates (#12626)
Whenever autopilot updates its state it notifies Consul. That notification will then trigger Consul to extract out the ready server information. If the ready servers have changed, then an event will be published to notify any subscribers of the full set of ready servers.

All of the ready-server event code is contained within an autopilotevents package instead of the consul package so that it can be imported into the gRPC-related packages.
2022-04-19 13:03:03 -04:00
DanStough a050aa39b9 Update go version to 1.18.1 2022-04-18 11:41:10 -04:00
Dan Upton 769d1d6e8e
ConnectCA.Sign gRPC Endpoint (#12787)
Introduces a gRPC endpoint for signing Connect leaf certificates. It's also
the first of the public gRPC endpoints to perform leader-forwarding, so
establishes the pattern of forwarding over the multiplexed internal RPC port.
2022-04-14 14:26:14 +01:00
Kyle Havlovitz 199f1c7200
Fix namespace default field names in expanded token output 2022-04-13 16:46:39 -07:00
Paul Glass 5eea62b47a
acl: Adjust region handling in AWS IAM auth method (#12774)
* acl: Adjust region handling in AWS IAM auth method
2022-04-13 14:31:37 -05:00
Karl Cardenas b0b197964c
Merge pull request #12562 from hashicorp/docs/blake-agent-config
docs: Agent configuration hierarchy reorganization
2022-04-12 12:33:42 -07:00
FFMMM cf7e6484aa
add more labels to RequestRecorder (#12727)
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
2022-04-12 10:50:25 -07:00
Matt Keeler 2a4ca71d3f
Move to using a shared EventPublisher (#12673)
Previously we had 1 EventPublisher per state.Store. When a state store was closed/abandoned such as during a consul snapshot restore, this had the behavior of force closing subscriptions for that topic and evicting event snapshots from the cache.

The intention of this commit is to keep all that behavior. To that end, the shared EventPublisher now supports the ability to refresh a topic. That will perform the force close + eviction. The FSM upon abandoning the previous state.Store will call RefreshTopic for all the topics with events generated by the state.Store.
2022-04-12 09:47:42 -04:00
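A toy Go sketch of the topic refresh described in 2a4ca71d3f: force-close the subscriptions for a topic and evict its cached snapshot, while other topics on the shared publisher are left alone. Every type and method here is hypothetical and far simpler than the real EventPublisher.

```go
package main

import (
	"fmt"
	"sync"
)

// subscription is a toy subscription that only tracks whether it was closed.
type subscription struct {
	topic  string
	closed bool
}

// publisher is a toy shared publisher holding subscriptions and a cached
// snapshot per topic.
type publisher struct {
	mu        sync.Mutex
	subs      map[string][]*subscription
	snapshots map[string]string
}

func newPublisher() *publisher {
	return &publisher{subs: map[string][]*subscription{}, snapshots: map[string]string{}}
}

// Subscribe registers a subscription and caches a placeholder snapshot.
func (p *publisher) Subscribe(topic string) *subscription {
	p.mu.Lock()
	defer p.mu.Unlock()
	s := &subscription{topic: topic}
	p.subs[topic] = append(p.subs[topic], s)
	p.snapshots[topic] = "snapshot-of-" + topic
	return s
}

// RefreshTopic force-closes all subscriptions for topic and evicts its
// snapshot from the cache. Other topics are not affected.
func (p *publisher) RefreshTopic(topic string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, s := range p.subs[topic] {
		s.closed = true
	}
	delete(p.subs, topic)
	delete(p.snapshots, topic)
}

func main() {
	p := newPublisher()
	health := p.Subscribe("service-health")
	roots := p.Subscribe("ca-roots")
	p.RefreshTopic("service-health")
	fmt.Println(health.closed, roots.closed, len(p.snapshots))
}
```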
Blake Covarrubias 3175bf6b1b Remove .html extensions from docs URLs 2022-04-11 17:38:49 -07:00
Natalie Smith cd17e98800 docs: fix yet more references to agent/options 2022-04-11 17:38:49 -07:00
R.B. Boyer f4eac06b21
xds: ensure that all connect timeout configs can apply equally to tproxy direct dial connections (#12711)
Just like standard upstreams, the order of applicability in descending precedence is:

1. caller's `service-defaults` upstream override for destination
2. caller's `service-defaults` upstream defaults
3. destination's `service-resolver` ConnectTimeout
4. system default of 5s

Co-authored-by: mrspanishviking <kcardenas@hashicorp.com>
2022-04-07 16:58:21 -05:00
Matt Keeler 3447880091
Enable running autopilot state updates on all servers (#12617)
* Fixes a lint warning about t.Errorf not supporting %w

* Enable running autopilot on all servers

Non-leader servers only update the state and do not attempt any modifications.

* Fix the RPC conn limiting tests

Technically they were relying on racy behavior before. Now they should be reliable.
2022-04-07 10:48:48 -04:00
FFMMM 0f68bf879a
[rpc/middleware][consul] plumb intercept off, add server level happy test (#12692) 2022-04-06 14:33:05 -07:00