Adds a new query param merge-central-config for use with the below endpoints:
/catalog/service/:service
/catalog/connect/:service
/health/service/:service
/health/connect/:service
If set on the request, the response will include a fully resolved service definition which is merged with the proxy-defaults/global and service-defaults/:service config entries (on-demand style). This is useful to view the full service definition for a mesh service (connect-proxy kind or gateway kind) which might not be merged before being written into the catalog (example: in case of services in the agentless model).
The importing peer will need to know what SNI and SPIFFE name
corresponds to each exported service. Additionally it will need to know
at a high level the protocol in use (L4/L7) to generate the appropriate
connection pool and local metrics.
For replicated connect synthetic entities we edit the `Connect{}` part
of a `NodeService` to have a new section:
{
"PeerMeta": {
"SNI": [
"web.default.default.owt.external.183150d5-1033-3672-c426-c29205a576b8.consul"
],
"SpiffeID": [
"spiffe://183150d5-1033-3672-c426-c29205a576b8.consul/ns/default/dc/dc1/svc/web"
],
"Protocol": "tcp"
}
}
This data is then replicated and saved as-is at the importing side. Both
SNI and SpiffeID are slices for now until I can be sure we don't need
them for how mesh gateways will ultimately work.
Occasionally we had seen the TestWatchServers_ACLToken_PermissionDenied be flagged as flaky in circleci. This change should fix that.
Why it fixes it is complicated. The test was failing with a panic when a mocked ACL Resolver was being called more times than expected. I struggled for a while to determine how that could be. This test should call authorize once and only once and the error returned should cause the stream to be terminated and the error returned to the gRPC client. Another oddity was no amount of running this test locally seemed to be able to reproduce the issue. I ran the test hundreds of thousands of time and it always passed.
It turns out that there is nothing wrong with the test. It just so happens that the panic from unexpected invocation of a mocked call happened during the test but was caused by a previous test (specifically the TestWatchServers_StreamLifecycle test)
The stream from the previous test remained open after all the test Cleanup functions were run and it just so happened that when the EventPublisher eventually picked up that the context was cancelled during cleanup, it force closes all subscriptions which causes some loops to be re-entered and the streams to be reauthorized. Its that looping in response to forced subscription closures that causes the mock to eventually panic. All the components, publisher, server, client all operate based on contexts. We cancel all those contexts but there is no syncrhonous way to know when they are stopped.
We could have implemented a syncrhonous stop but in the context of an actual running Consul, context cancellation + async stopping is perfectly fine. What we (Dan and I) eventually thought was that the behavior of grpc streams such as this when a server was shutting down wasn’t super helpful. What we would want is for a client to be able to distinguish between subscription closed because something may have changed requiring re-authentication and subscription closed because the server is shutting down. That way we can send back appropriate error messages to detail that the server is shutting down and not confuse users with potentially needing to resubscribe.
So thats what this PR does. We have introduced a shutting down state to our event subscriptions and the various streaming gRPC services that rely on the event publisher will all just behave correctly and actually stop the stream (not attempt transparent reauthorization) if this particular error is the one we get from the stream. Additionally the error that gets transmitted back through gRPC when this does occur indicates to the consumer that the server is going away. That is more helpful so that a client can then attempt to reconnect to another server.
Treat each exported service as a "discovery chain" and replicate one
synthetic CheckServiceNode for each chain and remote mesh gateway.
The health will be a flattened generated check of the checks for that
mesh gateway node.
Introduces two new public gRPC endpoints (`Login` and `Logout`) and
includes refactoring of the equivalent net/rpc endpoints to enable the
majority of logic to be reused (i.e. by extracting the `Binder` and
`TokenWriter` types).
This contains the OSS portions of the following enterprise commits:
- 75fcdbfcfa6af21d7128cb2544829ead0b1df603
- bce14b714151af74a7f0110843d640204082630a
- cc508b70fbf58eda144d9af3d71bd0f483985893
The primary bug here is in the streaming subsystem that makes the overall v1/health/service/:service request behave incorrectly when servicing a blocking request with a filter provided.
There is a secondary non-streaming bug being fixed here that is much less obvious related to when to update the `reply` variable in a `blockingQuery` evaluation. It is unlikely that it is triggerable in practical environments and I could not actually get the bug to manifest, but I fixed it anyway while investigating the original issue.
Simple reproduction (streaming):
1. Register a service with a tag.
curl -sL --request PUT 'http://localhost:8500/v1/agent/service/register' \
--header 'Content-Type: application/json' \
--data-raw '{ "ID": "ID1", "Name": "test", "Tags":[ "a" ], "EnableTagOverride": true }'
2. Do an initial filter query that matches on the tag.
curl -sLi --get 'http://localhost:8500/v1/health/service/test' --data-urlencode 'filter=a in Service.Tags'
3. Note you get one result. Use the `X-Consul-Index` header to establish
a blocking query in another terminal, this should not return yet.
curl -sLi --get 'http://localhost:8500/v1/health/service/test?index=$INDEX' --data-urlencode 'filter=a in Service.Tags'
4. Re-register that service with a different tag.
curl -sL --request PUT 'http://localhost:8500/v1/agent/service/register' \
--header 'Content-Type: application/json' \
--data-raw '{ "ID": "ID1", "Name": "test", "Tags":[ "b" ], "EnableTagOverride": true }'
5. Your blocking query from (3) should return with a header
`X-Consul-Query-Backend: streaming` and empty results if it works
correctly `[]`.
Attempts to reproduce with non-streaming failed (where you add `&near=_agent` to the read queries and ensure `X-Consul-Query-Backend: blocking-query` shows up in the results).
* update raft to v1.3.7
* add changelog
* fix compilation error
* fix HeartbeatTimeout
* fix ElectionTimeout to reload only if value is valid
* fix default values for `ElectionTimeout` and `HeartbeatTimeout`
* fix test defaults
* bump raft to v1.3.8
Adds a timeout (deadline) to client RPC calls, so that streams will no longer hang indefinitely in unstable network conditions.
Co-authored-by: kisunji <ckim@hashicorp.com>
* Implement the ServerDiscovery.WatchServers gRPC endpoint
* Fix the ConnectCA.Sign gRPC endpoints metadata forwarding.
* Unify public gRPC endpoints around the public.TraceID function for request_id logging
Adds a new gRPC endpoint to get envoy bootstrap params. The new consul-dataplane service will use this
endpoint to generate an envoy bootstrap configuration.
Whenever autopilot updates its state it notifies Consul. That notification will then trigger Consul to extract out the ready server information. If the ready servers have changed, then an event will be published to notify any subscribers of the full set of ready servers.
All these ready server event things are contained within an autopilotevents package instead of the consul package to make importing them into the grpc related packages possible
Introduces a gRPC endpoint for signing Connect leaf certificates. It's also
the first of the public gRPC endpoints to perform leader-forwarding, so
establishes the pattern of forwarding over the multiplexed internal RPC port.
Previously we had 1 EventPublisher per state.Store. When a state store was closed/abandoned such as during a consul snapshot restore, this had the behavior of force closing subscriptions for that topic and evicting event snapshots from the cache.
The intention of this commit is to keep all that behavior. To that end, the shared EventPublisher now supports the ability to refresh a topic. That will perform the force close + eviction. The FSM upon abandoning the previous state.Store will call RefreshTopic for all the topics with events generated by the state.Store.
Just like standard upstreams the order of applicability in descending precedence:
1. caller's `service-defaults` upstream override for destination
2. caller's `service-defaults` upstream defaults
3. destination's `service-resolver` ConnectTimeout
4. system default of 5s
Co-authored-by: mrspanishviking <kcardenas@hashicorp.com>
* Fixes a lint warning about t.Errorf not supporting %w
* Enable running autopilot on all servers
On the non-leader servers all they do is update the state and do not attempt any modifications.
* Fix the RPC conn limiting tests
Technically they were relying on racey behavior before. Now they should be reliable.
Adds a new gRPC service and endpoint to return the list of supported
consul dataplane features. The Consul Dataplane will use this API to
customize its interaction with that particular server.
Adds a new gRPC streaming endpoint (WatchRoots) that dataplane clients will
use to fetch the current list of active Connect CA roots and receive new
lists whenever the roots are rotated.
This adds an aws-iam auth method type which supports authenticating to Consul using AWS IAM identities.
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* tlsutil: initial implementation of types/TLSVersion
tlsutil: add test for parsing deprecated agent TLS version strings
tlsutil: return TLSVersionInvalid with error
tlsutil: start moving tlsutil cipher suite lookups over to types/tls
tlsutil: rename tlsLookup to ParseTLSVersion, add cipherSuiteLookup
agent: attempt to use types in runtime config
agent: implement b.tlsVersion validation in config builder
agent: fix tlsVersion nil check in builder
tlsutil: update to renamed ParseTLSVersion and goTLSVersions
tlsutil: fixup TestConfigurator_CommonTLSConfigTLSMinVersion
tlsutil: disable invalid config parsing tests
tlsutil: update tests
auto_config: lookup old config strings from base.TLSMinVersion
auto_config: update endpoint tests to use TLS types
agent: update runtime_test to use TLS types
agent: update TestRuntimeCinfig_Sanitize.golden
agent: update config runtime tests to expect TLS types
* website: update Consul agent tls_min_version values
* agent: fixup TLS parsing and compilation errors
* test: fixup lint issues in agent/config_runtime_test and tlsutil/config_test
* tlsutil: add CHACHA20_POLY1305 cipher suites to goTLSCipherSuites
* test: revert autoconfig tls min version fixtures to old format
* types: add TLSVersions public function
* agent: add warning for deprecated TLS version strings
* agent: move agent config specific logic from tlsutil.ParseTLSVersion into agent config builder
* tlsutil(BREAKING): change default TLS min version to TLS 1.2
* agent: move ParseCiphers logic from tlsutil into agent config builder
* tlsutil: remove unused CipherString function
* agent: fixup import for types package
* Revert "tlsutil: remove unused CipherString function"
This reverts commit 6ca7f6f58d268e617501b7db9500113c13bae70c.
* agent: fixup config builder and runtime tests
* tlsutil: fixup one remaining ListenerConfig -> ProtocolConfig
* test: move TLS cipher suites parsing test from tlsutil into agent config builder tests
* agent: remove parseCiphers helper from auto_config_endpoint_test
* test: remove unused imports from tlsutil
* agent: remove resolved FIXME comment
* tlsutil: remove TODO and FIXME in cipher suite validation
* agent: prevent setting inherited cipher suite config when TLS 1.3 is specified
* changelog: add entry for converting agent config to TLS types
* agent: remove FIXME in runtime test, this is covered in builder tests with invalid tls9 value now
* tlsutil: remove config tests for values checked at agent config builder boundary
* tlsutil: remove tls version check from loadProtocolConfig
* tlsutil: remove tests and TODOs for logic checked in TestBuilder_tlsVersion and TestBuilder_tlsCipherSuites
* website: update search link for supported Consul agent cipher suites
* website: apply review suggestions for tls_min_version description
* website: attempt to clean up markdown list formatting for tls_min_version
* website: moar linebreaks to fix tls_min_version formatting
* Revert "website: moar linebreaks to fix tls_min_version formatting"
This reverts commit 38585927422f73ebf838a7663e566ac245f2a75c.
* autoconfig: translate old values for TLSMinVersion
* agent: rename var for translated value of deprecated TLS version value
* Update agent/config/deprecated.go
Co-authored-by: Dan Upton <daniel@floppy.co>
* agent: fix lint issue
* agent: fixup deprecated config test assertions for updated warning
Co-authored-by: Dan Upton <daniel@floppy.co>
Looks like something got munged at some point. Not sure how it slipped in, but my best guess is that because TestTxn_Apply_ACLDeny is marked flaky we didn't block merge because it failed.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
This extends the acl.AllowAuthorizer with source of authority information.
The next step is to unify the AllowAuthorizer and ACLResolveResult structures; that will be done in a separate PR.
Part of #12481
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Introduces the capability to configure TLS differently for Consul's
listeners/ports (i.e. HTTPS, gRPC, and the internal multiplexed RPC
port) which is useful in scenarios where you may want the HTTPS or
gRPC interfaces to present a certificate signed by a well-known/public
CA, rather than the certificate used for internal communication which
must have a SAN in the form `server.<dc>.consul`.
Currently the config_entry.go subsystem delegates authorization decisions via the ConfigEntry interface CanRead and CanWrite code. Unfortunately this returns a true/false value and loses the details of the source.
This is not helpful, especially since it the config subsystem can be more complex to understand, since it covers so many domains.
This refactors CanRead/CanWrite to return a structured error message (PermissionDenied or the like) with more details about the reason for denial.
Part of #12241
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* First pass for helper for bulk changes
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Convert ACLRead and ACLWrite to new form
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* AgentRead and AgentWRite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fix EventWrite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* KeyRead, KeyWrite, KeyList
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* KeyRing
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* NodeRead NodeWrite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* OperatorRead and OperatorWrite
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* PreparedQuery
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Intention partial
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fix ServiceRead, Write ,etc
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Error check ServiceRead?
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fix Sessionread/Write
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fixup snapshot ACL
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Error fixups for txn
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Add changelog
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Fixup review comments
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Improves tests from #12362
These tests try to setup the following concurrent scenario:
1. (goroutine 1) execute read RPC with index=0
2. (goroutine 1) get response from (1) @ index=10
3. (goroutine 1) execute read RPC with index=10 and block
4. (goroutine 2) WHILE (3) is blocking, start slamming the system with stray writes that will cause the WatchSet to wakeup
5. (goroutine 2) after doing all writes, shut down the reader above
6. (goroutine 1) stops reading, double checks that it only ever woke up once (from 1)
Minor fix for behavior in #12362
IsDefault sometimes returns true even if there was a proxy-defaults or service-defaults config entry that was consulted. This PR fixes that.
before:
$ go test ./agent/consul -run TestLeader_ReapOrLeftMember_IgnoreSelf
ok github.com/hashicorp/consul/agent/consul 21.147s
after:
$ go test ./agent/consul -run TestLeader_ReapOrLeftMember_IgnoreSelf
ok github.com/hashicorp/consul/agent/consul 5.402s
Starting from and extending the mechanism introduced in #12110 we can specially handle the 3 main special Consul RPC endpoints that react to many config entries in a single blocking query in Connect:
- `DiscoveryChain.Get`
- `ConfigEntry.ResolveServiceConfig`
- `Intentions.Match`
All of these will internally watch for many config entries, and at least one of those will likely be not found in any given query. Because these are blends of multiple reads the exact solution from #12110 isn't perfectly aligned, but we can tweak the approach slightly and regain the utility of that mechanism.
### No Config Entries Found
In this case, despite looking for many config entries none may be found at all. Unlike #12110 in this scenario we do not return an empty reply to the caller, but instead synthesize a struct from default values to return. This can be handled nearly identically to #12110 with the first 1-2 replies being non-empty payloads followed by the standard spurious wakeup suppression mechanism from #12110.
### No Change Since Last Wakeup
Once a blocking query loop on the server has completed and slept at least once, there is a further optimization we can make here to detect if any of the config entries that were present at specific versions for the prior execution of the loop are identical for the loop we just woke up for. In that scenario we can return a slightly different internal sentinel error and basically externally handle it similar to #12110.
This would mean that even if 20 discovery chain read RPC handling goroutines wakeup due to the creation of an unrelated config entry, the only ones that will terminate and reply with a blob of data are those that genuinely have new data to report.
### Extra Endpoints
Since this pattern is pretty reusable, other key config-entry-adjacent endpoints used by `agent/proxycfg` also were updated:
- `ConfigEntry.List`
- `Internal.IntentionUpstreams` (tproxy)
Many places in consul already treated node names case insensitively.
The state store indexes already do it, but there are a few places that
did a direct byte comparison which have now been corrected.
One place of particular consideration is ensureCheckIfNodeMatches
which is executed during snapshot restore (among other places). If a
node check used a slightly different casing than the casing of the node
during register then the snapshot restore here would deterministically
fail. This has been fixed.
Primary approach:
git grep -i "node.*[!=]=.*node" -- ':!*_test.go' ':!docs'
git grep -i '\[[^]]*member[^]]*\]
git grep -i '\[[^]]*\(member\|name\|node\)[^]]*\]' -- ':!*_test.go' ':!website' ':!ui' ':!agent/proxycfg/testing.go:' ':!*.md'
There are some cross-config-entry relationships that are enforced during
"graph validation" at persistence time that are required to be
maintained. This means that config entries may form a digraph at times.
Config entry replication procedes in a particular sorted order by kind
and name.
Occasionally there are some fixups to these digraphs that end up
replicating in the wrong order and replicating the leaves
(ingress-gateway) before the roots (service-defaults) leading to
replication halting due to a graph validation error related to things
like mismatched service protocol requirements.
This PR changes replication to give each computed change (upsert/delete)
a fair shot at being applied before deciding to terminate that round of
replication in error. In the case where we've simply tried to do the
operations in the wrong order at least ONE of the outstanding requests
will complete in the right order, leading the subsequent round to have
fewer operations to do, with a smaller likelihood of graph validation
errors.
This does not address all scenarios, but for scenarios where the edits
are being applied in the wrong order this should avoid replication
halting.
Fixes#9319
The scenario that is NOT ADDRESSED by this PR is as follows:
1. create: service-defaults: name=new-web, protocol=http
2. create: service-defaults: name=old-web, protocol=http
3. create: service-resolver: name=old-web, redirect-to=new-web
4. delete: service-resolver: name=old-web
5. update: service-defaults: name=old-web, protocol=grpc
6. update: service-defaults: name=new-web, protocol=grpc
7. create: service-resolver: name=old-web, redirect-to=new-web
If you shutdown dc2 just before (4) and turn it back on after (7)
replication is impossible as there is no single edit you can make to
make forward progress.
Otherwise when the query times out we might incorrectly send a value for
the reply, when we should send an empty reply.
Also document errNotFound and how to handle the result in that case.
By using the query results as state.
Blocking queries are efficient when the query matches some results,
because the ModifyIndex of those results, returned as queryMeta.Mindex,
will never change unless the items themselves change.
Blocking queries for non-existent items are not efficient because the
queryMeta.Index can (and often does) change when other entities are
written.
This commit reduces the churn of these queries by using a different
comparison for "has changed". Instead of using the modified index, we
use the existence of the results. If the previous result was "not found"
and the new result is still "not found", we know we can ignore the
modified index and continue to block.
This is done by setting the minQueryIndex to the returned
queryMeta.Index, which prevents the query from returning before a state
change is observed.
This test shows how blocking queries are not efficient when the query
returns no results. The test fails with 100+ calls instead of the
expected 2.
This test is still a bit flaky because it depends on the timing of the
writes. It can sometimes return 3 calls.
A future commit should fix this and make blocking queries even more
optimal for not-found results.
Follow the Go convention of accepting a small interface that documents
the methods used by the function.
Clarify the rules for implementing a query function passed to
blockingQuery.
This will both save on unnecessary raft operations as well as
unnecessarily incrementing the raft modify index of config entries
subject to no-op updates.
This commit syncs ENT changes to the OSS repo.
Original commit details in ENT:
```
commit 569d25f7f4578981c3801e6e067295668210f748
Author: FFMMM <FFMMM@users.noreply.github.com>
Date: Thu Feb 10 10:23:33 2022 -0800
Vendor fork net rpc (#1538)
* replace net/rpc w consul-net-rpc/net/rpc
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* replace msgpackrpc and go-msgpack with fork from mono repo
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* gofmt all files touched
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
```
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* First phase of refactoring PermissionDeniedError
Add extended type PermissionDeniedByACLError that captures information
about the accessor, particular permission type and the object and name
of the thing being checked.
It may be worth folding the test and error return into a single helper
function, that can happen at a later date.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Due to timing, a transparent proxy could have two upstreams to dial
directly with the same address.
For example:
- The orders service can dial upstreams shipping and payment directly.
- An instance of shipping at address 10.0.0.1 is deregistered.
- Payments is scaled up and scheduled to have address 10.0.0.1.
- The orders service receives the event for the new payments instance
before seeing the deregistration for the shipping instance. At this
point two upstreams have the same passthrough address and Envoy will
reject the listener configuration.
To disambiguate this commit considers the Raft index when storing
passthrough addresses. In the example above, 10.0.0.1 would only be
associated with the newer payments service instance.
This commit makes two changes to the validation.
Previously we would call this validation in GenerateRoot, which happens
both on initialization (when a follower becomes leader), and when a
configuration is updated. We only want to do this validation during
config update so the logic was moved to the UpdateConfiguration
function.
Previously we would compare the config values against the actual cert.
This caused problems when the cert was created manually in Vault (not
created by Consul). Now we compare the new config against the previous
config. Using a already created CA cert should never error now.
Adding the key bit and types to the config should only error when
the previous values were not the defaults.
These two tests require debug logging enabled, because they look for log lines.
Also switched to testify assertions because the previous errors were not clear.
This test found a bug in the secondary. We were appending the root cert
to the PEM, but that cert was already appended. This was failing
validation in Vault here:
https://github.com/hashicorp/vault/blob/sdk/v0.3.0/sdk/helper/certutil/types.go#L329
Previously this worked because self signed certs have the same
SubjectKeyID and AuthorityKeyID. So having the same self-signed cert
repeated doesn't fail that check.
However with an intermediate that is not self-signed, those values are
different, and so we fail the check. A test I added in a previous commit
should show that this continues to work with self-signed root certs as
well.
This is safer than embedding two interface because there are a number of
places where we check the concrete type. If we check the concrete type
on the top-level interface it will fail. So instead expose the
ACLIdentity from a method.
This change allows us to remove one of the last remaining duplicate
resolve token methods (Server.ResolveToken).
With this change we are down to only 2, where the second one also
handles setting the default EnterpriseMeta from the token.
Now that ACLResolver is embedded we don't need ResolveTokenToIdentity on
Client and Server.
Moving ResolveTokenAndDefaultMeta to ACLResolver removes the duplicate
implementation.
set -euo pipefail
unset CDPATH
cd "$(dirname "$0")"
for f in $(git grep '\brequire := require\.New(' | cut -d':' -f1 | sort -u); do
echo "=== require: $f ==="
sed -i '/require := require.New(t)/d' $f
# require.XXX(blah) but not require.XXX(tblah) or require.XXX(rblah)
sed -i 's/\brequire\.\([a-zA-Z0-9_]*\)(\([^tr]\)/require.\1(t,\2/g' $f
# require.XXX(tblah) but not require.XXX(t, blah)
sed -i 's/\brequire\.\([a-zA-Z0-9_]*\)(\(t[^,]\)/require.\1(t,\2/g' $f
# require.XXX(rblah) but not require.XXX(r, blah)
sed -i 's/\brequire\.\([a-zA-Z0-9_]*\)(\(r[^,]\)/require.\1(t,\2/g' $f
gofmt -s -w $f
done
for f in $(git grep '\bassert := assert\.New(' | cut -d':' -f1 | sort -u); do
echo "=== assert: $f ==="
sed -i '/assert := assert.New(t)/d' $f
# assert.XXX(blah) but not assert.XXX(tblah) or assert.XXX(rblah)
sed -i 's/\bassert\.\([a-zA-Z0-9_]*\)(\([^tr]\)/assert.\1(t,\2/g' $f
# assert.XXX(tblah) but not assert.XXX(t, blah)
sed -i 's/\bassert\.\([a-zA-Z0-9_]*\)(\(t[^,]\)/assert.\1(t,\2/g' $f
# assert.XXX(rblah) but not assert.XXX(r, blah)
sed -i 's/\bassert\.\([a-zA-Z0-9_]*\)(\(r[^,]\)/assert.\1(t,\2/g' $f
gofmt -s -w $f
done
Remove some unnecessary comments around query_blocking metric. The only
line that needs any comments in the atomic decrement.
Cleanup the block and return comments and logic. The old comment about
AbandonCh may have been relevant before, but it is expected behaviour
now.
The logic was simplified by inverting the err condition.
This helps keep the logic in blockingQuery more focused. In the future we
may have a separate struct for RPC queries which may allow us to move this
off of Server.
This safeguard should be safe to apply in general. We are already
applying it to non-blocking queries that call blockingQuery, so it
should be fine to apply it to others.
To remove the TODO, and make it more readable.
In general this reduces the scope of variables, making them easier to reason about.
It also introduces more early returns so that we can see the flow from the structure
of the function.
`newIntermediate` is always equal to `needsNewIntermediate`, so we can
remove the extra variable and use the original directly.
Also remove the `activeRoot.ID != newActiveRoot.ID` case from an if,
because that case is already checked above, and `needsNewIntermediate` will
already be true in that case.
This condition now reads a lot better:
> Persist a new root if we did not have one before, or if generated a new intermediate.
In the previous commit the single use of this storedRoot was removed.
In this commit the original objective is completed. The
Provider.ActiveRoot is being removed because
1. the secondary should get the active root from the Consul primary DC,
not the provider, so that secondary DCs do not need to communicate
with a provider instance in a different DC.
2. so that the Provider.ActiveRoot interface can be changed without
impacting other code paths.
This method had only one caller, which always looked for the active
root. This commit moves the lookup into the method to reduce the logic
in the one caller.
This is being done in preparation for a larger change. Keeping this
separate so it is easier to see.
The `storedRootID != primaryRoots.ActiveRootID` is being removed because
these can never be different.
The `storedRootID` comes from `provider.ActiveRoot`, the
`primaryRoots.ActiveRootID` comes from the store `CARoot` from the
primary. In both cases the source of the data is the primary DC.
Technically they could be different if someone modified the provider
outside of Consul, but that would break many things, so is not a
supported flow.
If these were out of sync because of ordering of events then the
secondary will soon receive an update to `primaryRoots` and everything
will be sorted out again.
ActiveRoot should not be called from the secondary DC, because there
should not be a requirement to run the same Vault instance in a
secondary DC. SignIntermediate is called in a secondary DC, so it should
not call ActiveRoot
We would also like to change the interface of ActiveRoot so that we can
support using an intermediate cert as the primary CA in Consul. In
preparation for making that change I am reducing the number of calls to
ActiveRoot, so that there are fewer code paths to modify when the
interface changes.
This change required a change to the mockCAServerDelegate we use in
tests. It was returning the RootCert for SignIntermediate, but that is
not an accurate fake of production. In production this would also be a
separate cert.
Immediately above this line we are already appending the full list of
intermediates. The `provider.ActiveIntermediate` MUST be in this list of
intermediates because it must be available to all the other non-leader
Servers. If it was not in this list of intermediates then any proxy
that received data from a non-leader would have the wrong certs.
This is being removed now because we are planning on changing the
`Provider.ActiveIntermediate` interface, and removing these extra calls ahead of
time helps make that change easier.
Using tracing and cpu profiling I found that the majority of the time in
these test cases is spent generating a private key. We really don't need
separate private keys, so we can generate only one and use it for all
cases.
With this change the test runs much faster.
Fix the name to match the function it is testing
Remove unused code
Fix the signature, instead of returning (error, string) which should be (string, error)
accept a testing.T to emit errors.
Handle the error from encode.
The only function passed to SnapshotRPC today always returns a nil error, so there's no
way to exercise this bug in practice. This change is being made for correctness so that
it doesn't become a problem in the future, if we ever pass a different function to
SnapshotRPC.
Error messages related to service and check operations previously included
the following substrings:
- service %q
- check %q
From this error message, it isn't clear that the expected field is the ID for
the entity, not the name. For example, if the user has a service named test,
the error message would read 'Unknown service "test"'. This is misleading -
a service with that *name* does exist, but not with that *ID*.
The substrings above have been modified to make it clear that ID is needed,
not name:
- service with ID %q
- check with ID %q
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: Dan Upton <daniel@floppy.co>
This query has been incorrectly querying by accessor ID since New ACLs
were added. However, the legacy token compat allowed this to continue to
work, since it made a fallback query for the anonymousToken ID.
PR #11184 removed this legacy token query, which means that the query by
accessor ID is now the only check for the anonymous token's existence.
This PR updates the GetBySecret call to use the secret ID of the token.
These helper functions actually end up hiding important setup details
that should be visible from the test case. We already have a convenient
way of setting this config when calling newTestServerWithConfig.
I suspect one problem was that we set structs.IntermediateCertRenewInterval to 1ms, which meant
that in some cases the intermediate could renew before we stored the original value.
Another problem was that the 'wait for intermediate' loop was calling the provider.ActiveIntermediate,
but the comparison needs to use the RPC endpoint to accurately represent a user request. So
changing the 'wait for' to use the state store ensures we don't race.
Also moves the patching into a separate function.
Removes the addition of ca.CertificateTimeDriftBuffer as part of calculating halfTime. This was added
in a previous commit to attempt to fix the flake, but it did not appear to fix the problem. Adding the
time here was making the tests fail when using the shared patch
function. It's not clear to me why, but there's no reason we should be
including this time in the halfTime calculation.
Use the new verifyLearfCert to show the cert verifies with intermediates
from both sources. This required using the RPC interface so that the
leaf pem was constructed correctly.
Add IndexedCARoots.Active since that is a common operation we see in a
few places.
Previously we had a couple copies that reproduced the FSM operation.
These copies introduce risk that the test does not accurately match
production.
This PR removes the test versions of the FSM operation, and exports the
real production FSM operation so that it can be used in tests.
The consul provider tests did need to change because of this. Previously
we would return a hardcoded value of 2, but in production this value is
always incremented.
Failing over to a partition is more siimilar to failing over to another
datacenter than it is to failing over to a namespace. In a future
release we should update how localities for failover are specified. We
should be able to accept a list of localities which can include both
partition and datacenter.
* Add partition fields to targets like service route destinations
* Update validation to prevent cross-DC + cross-partition references
* Handle partitions when reading config entries for disco chain
* Encode partition in compiled targets
Given that we do not allow wildcard partitions in intentions, no one ixn
can override the DefaultAllow setting. Only the default ACL policy
applies across all partitions.
This table purposefully does not index by partition/namespace. It's a
global view into all service names.
This table is intended to replace the current serviceListTxn watch in
intentionTopologyTxn. For cross-partition transparent proxying we need
to be able to calculate upstreams from intentions in any partition. This
means that the existing serviceListTxn function is insufficient since
it's scoped to a partition.
Moving away from that function is also beneficial because it watches the
main "services" table, so watchers will wake up when any instance is
registered or deregistered.
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused operations_ent.go and operations_oss.go func
* remove unused const
* convert `IndexID` of `session_checks` table
* convert `indexSession` of `session_checks` table
* convert `indexNodeCheck` of `session_checks` table
* partition `indexID` and `indexSession` of `tableSessionChecks`
* fix oss linter
* fix review comments
* remove partition for Checks as it's always use the session partition
* fix tests
* fix tests
* do not namespace nodeChecks index
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Cross port of ent #1383 "Reject non-default datacenter when making partitioned ACLs"
On the OSS side this is a minor refactor to add some more checks that are only applicable to enterprise code.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
The test added in this commit shows the problem. Previously the
SigningKeyID was set to the RootCert not the local leaf signing cert.
This same bug was fixed in two other places back in 2019, but this last one was
missed.
While fixing this bug I noticed I had the same few lines of code in 3
places, so I extracted a new function for them.
There would be 4 places, but currently the InitializeCA flow sets this
SigningKeyID in a different way, so I've left that alone for now.
While working on the CA system it is important to be able to run all the
tests related to the system, without having to wait for unrelated tests.
There are many slow and unrelated tests in agent/consul, so we need some
way to filter to only the relevant tests.
This PR renames all the CA system related tests to start with either
`TestCAMananger` for tests of internal operations that don't have RPC
endpoint, or `TestConnectCA` for tests of RPC endpoints. This allows us
to run all the test with:
go test -run 'TestCAMananger|TestConnectCA' ./agent/consul
The test naming follows an undocumented convention of naming tests as
follows:
Test[<struct name>_]<function name>[_<test case description>]
I tried to always keep Primary/Secondary at the end of the description,
and _Vault_ has to be in the middle because of our regex to run those
tests as a separate CI job.
You may notice some of the test names changed quite a bit. I did my best
to identify the underlying method being tested, but I may have been
slightly off in some cases.
As a method on the struct type this would not be safe to call without first checking
c.isIntermediateUsedToSignLeaf.
So for now, move this logic to the CAMananger, so that it is always correct.
We were not adding the local signing cert to the CARoot. This commit
fixes that bug, and also adds support for fixing existing CARoot on
upgrade.
Also update the tests for both primary and secondary to be more strict.
Check the SigningKeyID is correct after initialization and rotation.
Previously we believe it was necessary for all code that required ports
to use freeport to prevent conflicts.
https://github.com/dnephin/freeport-test shows that it is actually save
to use port 0 (`127.0.0.1:0`) as long as it is passed directly to
`net.Listen`, and the listener holds the port for as long as it is
needed.
This works because freeport explicitly avoids the ephemeral port range,
and port 0 always uses that range. As you can see from the test output
of https://github.com/dnephin/freeport-test, the two systems never use
overlapping ports.
This commit converts all uses of freeport that were being passed
directly to a net.Listen to use port 0 instead. This allows us to remove
a bit of wrapping we had around httptest, in a couple places.
In d2ab767fef21244e9fe3b9887ea70fc177912381 raftApply was changed to handle this check in
a single place, instad of having every caller check it. It looks like these few places
were missed when I did that clean up.
This commit removes the remaining resp.(error) checks, since they are all no-ops now.
This function is only ever called from operations that have already acquired the state lock, so checking
the value of state can never fail.
This change is being made in preparation for splitting out a separate type for the secondary logic. The
state can't easily be shared, so really only the expored top-level functions should acquire the 'state lock'.
This commit removes the actingSecondaryCA field, and removes the stateLock around it. This field
was acting as a proxy for providerRoot != nil, so replace it with that check instead.
The two methods which called secondarySetCAConfigured already set the state, so checking the
state again at this point will not catch runtime errors (only programming errors, which we can catch with tests).
In general, handling state transitions should be done on the "entrypoint" methods where execution starts, not
in every internal method.
This is being done to remove some unnecessary references to c.state, in preparations for extracting
types for primary/secondary.
This makes it easier to fake, which will allow me to use the ConsulProvider as
an 'external PKI' to test a customer setup where the actual root CA is not
the root we use for the Consul CA.
Replaces a call to the state store to fetch the clusterID with the
clusterID field already available on the built-in provider.
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused operations_ent.go and operations_oss.go func
* convert `indexNodeCheck` of `session_checks` table
* partition `indexID` and `indexSession` of `tableSessionChecks`
* remove partition for Checks as it's always use the session partition
* partition sessions index id table
* fix rebase issues
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused operations_ent.go and operations_oss.go func
* remove unused const
* convert `IndexID` of `session_checks` table
* convert `indexSession` of `session_checks` table
* convert `indexNodeCheck` of `session_checks` table
* partition `indexID` and `indexSession` of `tableSessionChecks`
* fix oss linter
* fix review comments
* remove partition for Checks as it's always use the session partition
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
The TrustDomain is populated from the Host() method which includes the
hard-coded "consul" domain. This means that despite having an empty
cluster ID, the TrustDomain won't be empty.
There are two restrictions:
- Writes from the primary DC which explicitly target a secondary DC.
- Writes to a secondary DC that do not explicitly target the primary DC.
The first restriction is because the config entry is not supported in
secondary datacenters.
The second restriction is to prevent the scenario where a user writes
the config entry to a secondary DC, the write gets forwarded to the
primary, but then the config entry does not apply in the secondary.
This makes the scope more explicit.
Currently getCARoots could return an empty object with an empty trust
domain before the CA is initialized. This commit returns an error while
there is no CA config or no trust domain.
There could be a CA config and no trust domain because the CA config can
be created in InitializeCA before initialization succeeds.
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused func `prefixIndexFromServiceNameAsString`
* fix test error check
* remove unused operations_ent.go and operations_oss.go func
* remove unused const
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* partition kvs indexID table
* add `partitionedIndexEntryName` in oss for test purpose
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* remove entmeta reference from oss
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>