* return an error when the index is not valid
* check response as bool when applying `CAOpSetConfig`
* remove check for bool response
* fix error message and add check to test
* fix comment
* add changelog
If multiple instances of a service are co-located on the same node then
their proxies will all share a cache entry for their resolved service
configuration. This is because the cache key contains the name of the
watched service but does not take into account the ID of the watching
proxies.
This means that there will be multiple agent service manager watches
that can wake up on the same cache update. These watchers then
concurrently modify the value in the cache when merging the resolved
config into the local proxy definitions.
To avoid this concurrent map write we will only delete the key from
opaque config in the local proxy definition after the merge, rather
than from the cached value before the merge.
This change adds a new `dns_config.recursor_strategy` option which
controls how Consul queries DNS resolvers listed in the `recursors`
config option. The supported options are `sequential` (default), and
`random`.
Closes#8807
Co-authored-by: Blake Covarrubias <blake@covarrubi.as>
Co-authored-by: Priyanka Sengupta <psengupta@flatiron.com>
A previous commit used SetHash on two of the cases to fix a data race. This commit applies
that change to all cases. Using SetHash in this test helper should ensure that the
test helper behaves closer to production.
These changes ensure that the identity of services dialed is
cryptographically verified.
For all upstreams we validate against SPIFFE IDs in the format used by
Consul's service mesh:
spiffe://<trust-domain>/ns/<namespace>/dc/<datacenter>/svc/<service>
The dnsConfig pulled from the atomic.Value is a pointer, so modifying it in place
creates a data race. Use the exported ReloadConfig interface instead.
The LogOutput io.Writer used by TestAgent must allow concurrent reads and writes, and a
bytes.Buffer does not allow this. The bytes.Buffer must be wrapped with a lock to make this safe.
Some global variables are patched to shorter values in these tests. But the goroutines that read
them can outlive the test because nothing waited for them to exit.
This commit adds a Wait() method to the routine manager, so that tests can wait for the goroutines
to exit. This prevents the data race because the 'reset to original value' can happen
after all other goroutines have stopped.
A previous commit started using QueryRefuced, but that is not correct. QueryRefuced refers to
the OpCode, not the query type.
Instead use errNoAnswer because we have no records for that query type.
Now that trimDNSResponse is handled by the caller we don't need to pass this value
around. We can remove it from both the serviceLookup struct, and two functions.
The dispatch function was called from a single place and did nothing but add a default value.
Removing it makes code easier to trace by removing an unnecessary hop.
The compatv2 integration tests were failing because they use an older CLI version with a newer
HTTP API. This commit restores the GRPCPort field to the DebugConfig output to allow older
CIs to continue to fetch the port.
The DebugConfig in the self endpoint can change at any time. It's not a stable API.
With the previous change to rename GRPCPort to XDSPort this command would have broken.
This commit adds the XDSPort to a stable part of the XDS api, and changes the envoy command to read
this new field.
It includes support for the old API as well, in case a newer CLI is used with an older API, and
adds a test for both cases.
* ca: move provider creation into CAManager
This further decouples the CAManager from Server. It reduces the interface between them and
removes the need for the SetLogger method on providers.
* ca: move SignCertificate to CAManager
To reduce the scope of Server, and keep all the CA logic together
* ca: move SignCertificate to the file where it is used
* auto-config: move autoConfigBackend impl off of Server
Most of these methods are used exclusively for the AutoConfig RPC
endpoint. This PR uses a pattern that we've used in other places as an
incremental step to reducing the scope of Server.
* fix linter issues
* check error when `raftApplyMsgpack`
* ca: move SignCertificate to CAManager
To reduce the scope of Server, and keep all the CA logic together
* check expiry date of the intermediate before using it to sign a leaf
* fix typo in comment
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
* Fix test name
* do not check cert start date
* wrap error to mention it is the intermediate expired
* Fix failing test
* update comment
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use shim to avoid sleep in test
* add root cert validation
* remove duplicate code
* Revert "fix linter issues"
This reverts commit 6356302b54f06c8f2dee8e59740409d49e84ef24.
* fix import issue
* gofmt leader_connect_ca
* add changelog entry
* update error message
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
* fix error message in test
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
Co-authored-by: Freddy <freddygv@users.noreply.github.com>
Most of these methods are used exclusively for the AutoConfig RPC
endpoint. This PR uses a pattern that we've used in other places as an
incremental step to reducing the scope of Server.
raftApply was removed so ApplyCARequest needs to handle all the possible operations
Also set the providerShim to use the mock provider.
other changes are small test improvements that were necessary to debug the failures.
and small refactor to getCAProvider so that GoLand is less confused about what it is doing.
Previously it was reporting that the for condition was always true, which was not the case.
After moving ca.ConsulProviderStateDelegate into the interface we now
have the ApplyCARequest method which does the same thing. Use this more
specific method instead of raftApply.
This field was documented as enabling TLS for outgoing RPC, but that was not the case.
All this field did was set the use_tls serf tag.
Instead of setting this field in a place far from where it is used, move the logic to where
the serf tag is set, so that the code is much more obvious.
tlsutil.Config already presents an excellent structure for this
configuration. Copying the runtime config fields to agent/consul.Config
makes code harder to trace, and provides no advantage.
Instead of copying the fields around, use the tlsutil.Config struct
directly instead.
This is one small step in removing the many layers of duplicate
configuration.
* add intermediate ca metric routine
* add Gauge config for intermediate cert
* Stop metrics routine when stopping leader
* add changelog entry
* updage changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use variables instead of a map
* go imports sort
* Add metrics for primary and secondary ca
* start metrics routine in the right DC
* add telemetry documentation
* update docs
* extract expiry fetching in a func
* merge metrics for primary and secondary into signing ca metric
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
And remove BuildAndValidate. This commit completes some earlier work to reduce the config
interface a single Load function.
The last remaining test was converted to use Load instad of BuildAndValidate.
Previously, for a POST request to the /v1/operator/autopilot/configuration
endpoint, any fields not included in the payload were set to a zero-initialized
value rather than the documented default value.
Now, if an optional field is not included in the payload, it will be set to its
documented default value:
- CleanupDeadServers: true
- LastContactThreshold: "200ms"
- MaxTrailingLogs: 250
- MinQuorum: 0
- ServerStabilizationTime: "10s"
- RedundancyZoneTag: ""
- DisableUpgradeMigration: false
- UpgradeVersionTag: ""
* trim carriage return from certificates when inserting rootCA in the inMemDB
* format rootCA properly when returning the CA on the connect CA endpoint
* Fix linter warnings
* Fix providers to trim certs before returning it
* trim newlines on write when possible
* add changelog
* make sure all provider return a trailing newline after the root and intermediate certs
* Fix endpoint to return trailing new line
* Fix failing test with vault provider
* make test more robust
* make sure all provider return a trailing newline after the leaf certs
* Check for suffix before removing newline and use function
* Add comment to consul provider
* Update change log
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* fix typo
* simplify code callflow
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* extract requireNewLine as shared func
* remove dependency to testify in testing file
* remove extra newline in vault provider
* Add cert newline fix to envoy xds
* remove new line from mock provider
* Remove adding a new line from provider and fix it when the cert is read
* Add a comment to explain the fix
* Add missing for leaf certs
* fix missing new line
* fix missing new line in leaf certs
* remove extra new line in test
* updage changelog
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* fix in vault provider and when reading cache (RPC call)
* fix AWS provider
* fix failing test in the provider
* remove comments and empty lines
* add check for empty cert in test
* fix linter warnings
* add new line for leaf and private key
* use string concat instead of Sprintf
* fix new lines for leaf signing
* preallocate slice and remove append
* Add new line to `SignIntermediate` and `CrossSignCA`
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
The hcl decoding apparently uses strconv.ParseInt, which fails to parse a 64bit int.
Since hcl v1 is basically EOl, it seems unlikely we'll fix this in hcl.
Since this test is only about loading values from config files, the extra large number
doesn't seem important. Trim a few zeros from the numbers so that they parse
properly on 32bit platforms.
Also skip a slow test when -short is used.
sync/atomic must be used with 64-bit aligned fields, and that alignment is difficult to
ensure unless the field is the first one in the struct.
https://golang.org/pkg/sync/atomic/#pkg-note-BUG.
If a value was already available in the local view the request is considered a cache hit.
If the materialized had to wait for a value, it is considered a cache miss.
This test is super racy (it's not just a single line).
This test also starts failing once streaming is enabled, because the
cache rate limit no longer applies to the requests in the test. The
queries use streaming instead of the cache.
This test is no longer valid, and the functionality is already well
tested by TestCacheThrottle. Instead of spending time rewriting this
test, let's remove it.
```
WARNING: DATA RACE
Read at 0x00c01de410fc by goroutine 735:
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:1024 +0x9af
github.com/hashicorp/consul/testrpc.WaitForTestAgent()
/home/daniel/pers/code/consul/testrpc/wait.go:99 +0x209
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:966 +0x1ad
testing.tRunner()
/usr/lib/go/src/testing/testing.go:1193 +0x202
Previous write at 0x00c01de410fc by goroutine 605:
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1.2()
/home/daniel/pers/code/consul/agent/agent_test.go:998 +0xe9
Goroutine 735 (running) created at:
testing.(*T).Run()
/usr/lib/go/src/testing/testing.go:1238 +0x5d7
github.com/hashicorp/consul/agent.TestCacheRateLimit()
/home/daniel/pers/code/consul/agent/agent_test.go:961 +0x375
testing.tRunner()
/usr/lib/go/src/testing/testing.go:1193 +0x202
Goroutine 605 (finished) created at:
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:1022 +0x91e
github.com/hashicorp/consul/testrpc.WaitForTestAgent()
/home/daniel/pers/code/consul/testrpc/wait.go:99 +0x209
github.com/hashicorp/consul/agent.TestCacheRateLimit.func1()
/home/daniel/pers/code/consul/agent/agent_test.go:966 +0x1ad
testing.tRunner()
/usr/lib/go/src/testing/testing.go:1193 +0x202
```
The query metrics are actually reported for all read queries, not only
ones that use a MinIndex to block for updates.
Also clarify the raft.apply metric is only on the leader.
As part of this change, we ensure that the SAN extensions are marked as
critical when the subject is empty so that AWS PCA tolerates the loss of
common names well and continues to function as a Connect CA provider.
Parts of this currently hack around a bug in crypto/x509 and can be
removed after https://go-review.googlesource.com/c/go/+/329129 lands in
a Go release.
Note: the AWS PCA tests do not run automatically, but the following
passed locally for me:
ENABLE_AWS_PCA_TESTS=1 go test ./agent/connect/ca -run TestAWS
* return an invalid record when asked for an addr dns with type other then A and AAAA
* add changelog
* fix ANY use case and add a test for it
* update changelog type
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* return empty response if the question record type do not match for addr
* set comment in the right place
* return A\AAAA record in extra section if record type is not A\AAAA for addr
* Fix failing test
* remove commented code
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* use require for test validation
* use variable to init struct
* fix failing test
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update .changelog/10401.txt
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update agent/dns.go
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* fix compilation error
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>