* Add a test to reproduce the race condition
* Fix race condition by publishing the event after the commit and adding a lock to prevent out of order events.
* split publish to generate the list of events before committing the transaction.
* add changelog
* remove extra func
* Apply suggestions from code review
Co-authored-by: Dan Upton <daniel@floppy.co>
* add comment to explain test
---------
Co-authored-by: Dan Upton <daniel@floppy.co>
Prior to this change, peer services would be targeted by service-default
overrides as long as the new `peer` field was not found in the config entry.
This commit removes that deprecated backwards-compatibility behavior. Now
it is necessary to specify the `peer` field in order for upstream overrides
to apply to a peer upstream.
Currently, if an acceptor peer deletes a peering the dialer's peering
will eventually get to a "terminated" state. If the two clusters need to
be re-peered the acceptor will re-generate the token but the dialer will
encounter this error on the call to establish:
"failed to get addresses to dial peer: failed to refresh peer server
addresses, will continue to use initial addresses: there is no active
peering for "<<<ID>>>""
This is because in `exchangeSecret().GetDialAddresses()` we will get an
error if fetching addresses for an inactive peering. The peering shows
up as inactive at this point because of the existing terminated state.
Rather than checking whether a peering is active we can instead check
whether it was deleted. This way users do not need to delete terminated
peerings in the dialing cluster before re-establishing them.
* Use merge of enterprise meta's rather than new custom method
* Add merge logic for tcp routes
* Add changelog
* Normalize certificate refs on gateways
* Fix infinite call loop
* Explicitly call enterprise meta
This commit swaps the partition field to the local partition for
discovery chains targeting peers. Prior to this change, peer upstreams
would always use a value of default regardless of which partition they
exist in. This caused several issues in xds / proxycfg because of id
mismatches.
Some prior fixes were made to deal with one-off id mismatches that this
PR also cleans up, since they are no longer needed.
* add snapshot restore test
* add logstore as test parameter
* Use the correct image version
* make sure we read the logs from a followers to test the follower snapshot install path.
* update to raf-wal v0.3.0
* add changelog.
* updating changelog for bug description and removed integration test.
* setting up test container builder to only set logStore for 1.15 and higher
---------
Co-authored-by: Paul Banks <pbanks@hashicorp.com>
Co-authored-by: John Murret <john.murret@hashicorp.com>
This commit fixes an issue where trust bundles could not be read
by services in a non-default namespace, unless they had excessive
ACL permissions given to them.
Prior to this change, `service:write` was required in the default
namespace in order to read the trust bundle. Now, `service:write`
to a service in any namespace is sufficient.
If a CA config update did not cause a root change, the codepath would return early and skip some steps which preserve its intermediate certificates and signing key ID. This commit re-orders some code and prevents updates from generating new intermediate certificates.
Co-authored-by: Ashvitha Sridharan <ashvitha.sridharan@hashicorp.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Add a new envoy flag: "envoy_hcp_metrics_bind_socket_dir", a directory
where a unix socket will be created with the name
`<namespace>_<proxy_id>.sock` to forward Envoy metrics.
If set, this will configure:
- In bootstrap configuration a local stats_sink and static cluster.
These will forward metrics to a loopback listener sent over xDS.
- A dynamic listener listening at the socket path that the previously
defined static cluster is sending metrics to.
- A dynamic cluster that will forward traffic received at this listener
to the hcp-metrics-collector service.
Reasons for having a static cluster pointing at a dynamic listener:
- We want to secure the metrics stream using TLS, but the stats sink can
only be defined in bootstrap config. With dynamic listeners/clusters
we can use the proxy's leaf certificate issued by the Connect CA,
which isn't available at bootstrap time.
- We want to intelligently route to the HCP collector. Configuring its
addreess at bootstrap time limits our flexibility routing-wise. More
on this below.
Reasons for defining the collector as an upstream in `proxycfg`:
- The HCP collector will be deployed as a mesh service.
- Certificate management is taken care of, as mentioned above.
- Service discovery and routing logic is automatically taken care of,
meaning that no code changes are required in the xds package.
- Custom routing rules can be added for the collector using discovery
chain config entries. Initially the collector is expected to be
deployed to each admin partition, but in the future could be deployed
centrally in the default partition. These config entries could even be
managed by HCP itself.
Add support for using existing vault auto-auth configurations as the
provider configuration when using Vault's CA provider with AliCloud.
AliCloud requires 2 extra fields to enable it to use STS (it's preferred
auth setup). Our vault-plugin-auth-alicloud package contained a method
to help generate them as they require you to make an http call to
a faked endpoint proxy to get them (url and headers base64 encoded).
* Add some basic ui improvements for api-gateway services
* Add changelog entry
* Use ternary for null check
* Update gateway doc links
* rename changelog entry for new PR
* Fix test
Receiving an "acl not found" error from an RPC in the agent cache and the
streaming/event components will cause any request loops to cease under the
assumption that they will never work again if the token was destroyed. This
prevents log spam (#14144, #9738).
Unfortunately due to things like:
- authz requests going to stale servers that may not have witnessed the token
creation yet
- authz requests in a secondary datacenter happening before the tokens get
replicated to that datacenter
- authz requests from a primary TO a secondary datacenter happening before the
tokens get replicated to that datacenter
The caller will get an "acl not found" *before* the token exists, rather than
just after. The machinery added above in the linked PRs will kick in and
prevent the request loop from looping around again once the tokens actually
exist.
For `consul-dataplane` usages, where xDS is served by the Consul servers
rather than the clients ultimately this is not a problem because in that
scenario the `agent/proxycfg` machinery is on-demand and launched by a new xDS
stream needing data for a specific service in the catalog. If the watching
goroutines are terminated it ripples down and terminates the xDS stream, which
CDP will eventually re-establish and restart everything.
For Consul client usages, the `agent/proxycfg` machinery is ahead-of-time
launched at service registration time (called "local" in some of the proxycfg
machinery) so when the xDS stream comes in the data is already ready to go. If
the watching goroutines terminate it should terminate the xDS stream, but
there's no mechanism to re-spawn the watching goroutines. If the xDS stream
reconnects it will see no `ConfigSnapshot` and will not get one again until
the client agent is restarted, or the service is re-registered with something
changed in it.
This PR fixes a few things in the machinery:
- there was an inadvertent deadlock in fetching snapshot from the proxycfg
machinery by xDS, such that when the watching goroutine terminated the
snapshots would never be fetched. This caused some of the xDS machinery to
get indefinitely paused and not finish the teardown properly.
- Every 30s we now attempt to re-insert all locally registered services into
the proxycfg machinery.
- When services are re-inserted into the proxycfg machinery we special case
"dead" ones such that we unilaterally replace them rather that doing that
conditionally.
Adds support for the approle auth-method. Only handles using the approle
role/secret to auth and it doesn't support the agent's extra management
configuration options (wrap and delete after read) as they are not
required as part of the auth (ie. they are vault agent things).
* Fix issue where terminating gateway service resolvers weren't properly cleaned up
* Add integration test for cleaning up resolvers
* Add changelog entry
* Use state test and drop integration test
* Leverage ServiceResolver ConnectTimeout for route timeouts to make TerminatingGateway upstream timeouts configurable
* Regenerate golden files
* Add RequestTimeout field
* Add changelog entry
Adds support for a jwt token in a file. Simply reads the file and sends
the read in jwt along to the vault login.
It also supports a legacy mode with the jwt string being passed
directly. In which case the path is made optional.
Does the required dance with the local HTTP endpoint to get the required
data for the jwt based auth setup in Azure. Keeps support for 'legacy'
mode where all login data is passed on via the auth methods parameters.
Refactored check for hardcoded /login fields.
Fixes a regression in #16044
The consul acl token read -self cli command should not require an -accessor-id because typically the persona invoking this would not already know the accessor id of their own token.