I suspect one problem was that we set structs.IntermediateCertRenewInterval to 1ms, which meant
that in some cases the intermediate could renew before we stored the original value.
Another problem was that the 'wait for intermediate' loop was calling the provider.ActiveIntermediate,
but the comparison needs to use the RPC endpoint to accurately represent a user request. So
changing the 'wait for' to use the state store ensures we don't race.
Also moves the patching into a separate function.
Removes the addition of ca.CertificateTimeDriftBuffer as part of calculating halfTime. This was added
in a previous commit to attempt to fix the flake, but it did not appear to fix the problem. Adding the
time here was making the tests fail when using the shared patch
function. It's not clear to me why, but there's no reason we should be
including this time in the halfTime calculation.
Use the new verifyLearfCert to show the cert verifies with intermediates
from both sources. This required using the RPC interface so that the
leaf pem was constructed correctly.
Add IndexedCARoots.Active since that is a common operation we see in a
few places.
Previously we had a couple copies that reproduced the FSM operation.
These copies introduce risk that the test does not accurately match
production.
This PR removes the test versions of the FSM operation, and exports the
real production FSM operation so that it can be used in tests.
The consul provider tests did need to change because of this. Previously
we would return a hardcoded value of 2, but in production this value is
always incremented.
Failing over to a partition is more siimilar to failing over to another
datacenter than it is to failing over to a namespace. In a future
release we should update how localities for failover are specified. We
should be able to accept a list of localities which can include both
partition and datacenter.
* Add partition fields to targets like service route destinations
* Update validation to prevent cross-DC + cross-partition references
* Handle partitions when reading config entries for disco chain
* Encode partition in compiled targets
Given that we do not allow wildcard partitions in intentions, no one ixn
can override the DefaultAllow setting. Only the default ACL policy
applies across all partitions.
This table purposefully does not index by partition/namespace. It's a
global view into all service names.
This table is intended to replace the current serviceListTxn watch in
intentionTopologyTxn. For cross-partition transparent proxying we need
to be able to calculate upstreams from intentions in any partition. This
means that the existing serviceListTxn function is insufficient since
it's scoped to a partition.
Moving away from that function is also beneficial because it watches the
main "services" table, so watchers will wake up when any instance is
registered or deregistered.
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused operations_ent.go and operations_oss.go func
* remove unused const
* convert `IndexID` of `session_checks` table
* convert `indexSession` of `session_checks` table
* convert `indexNodeCheck` of `session_checks` table
* partition `indexID` and `indexSession` of `tableSessionChecks`
* fix oss linter
* fix review comments
* remove partition for Checks as it's always use the session partition
* fix tests
* fix tests
* do not namespace nodeChecks index
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Cross port of ent #1383 "Reject non-default datacenter when making partitioned ACLs"
On the OSS side this is a minor refactor to add some more checks that are only applicable to enterprise code.
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
The test added in this commit shows the problem. Previously the
SigningKeyID was set to the RootCert not the local leaf signing cert.
This same bug was fixed in two other places back in 2019, but this last one was
missed.
While fixing this bug I noticed I had the same few lines of code in 3
places, so I extracted a new function for them.
There would be 4 places, but currently the InitializeCA flow sets this
SigningKeyID in a different way, so I've left that alone for now.
While working on the CA system it is important to be able to run all the
tests related to the system, without having to wait for unrelated tests.
There are many slow and unrelated tests in agent/consul, so we need some
way to filter to only the relevant tests.
This PR renames all the CA system related tests to start with either
`TestCAMananger` for tests of internal operations that don't have RPC
endpoint, or `TestConnectCA` for tests of RPC endpoints. This allows us
to run all the test with:
go test -run 'TestCAMananger|TestConnectCA' ./agent/consul
The test naming follows an undocumented convention of naming tests as
follows:
Test[<struct name>_]<function name>[_<test case description>]
I tried to always keep Primary/Secondary at the end of the description,
and _Vault_ has to be in the middle because of our regex to run those
tests as a separate CI job.
You may notice some of the test names changed quite a bit. I did my best
to identify the underlying method being tested, but I may have been
slightly off in some cases.
As a method on the struct type this would not be safe to call without first checking
c.isIntermediateUsedToSignLeaf.
So for now, move this logic to the CAMananger, so that it is always correct.
We were not adding the local signing cert to the CARoot. This commit
fixes that bug, and also adds support for fixing existing CARoot on
upgrade.
Also update the tests for both primary and secondary to be more strict.
Check the SigningKeyID is correct after initialization and rotation.
Previously we believe it was necessary for all code that required ports
to use freeport to prevent conflicts.
https://github.com/dnephin/freeport-test shows that it is actually save
to use port 0 (`127.0.0.1:0`) as long as it is passed directly to
`net.Listen`, and the listener holds the port for as long as it is
needed.
This works because freeport explicitly avoids the ephemeral port range,
and port 0 always uses that range. As you can see from the test output
of https://github.com/dnephin/freeport-test, the two systems never use
overlapping ports.
This commit converts all uses of freeport that were being passed
directly to a net.Listen to use port 0 instead. This allows us to remove
a bit of wrapping we had around httptest, in a couple places.
In d2ab767fef21244e9fe3b9887ea70fc177912381 raftApply was changed to handle this check in
a single place, instad of having every caller check it. It looks like these few places
were missed when I did that clean up.
This commit removes the remaining resp.(error) checks, since they are all no-ops now.
This function is only ever called from operations that have already acquired the state lock, so checking
the value of state can never fail.
This change is being made in preparation for splitting out a separate type for the secondary logic. The
state can't easily be shared, so really only the expored top-level functions should acquire the 'state lock'.
This commit removes the actingSecondaryCA field, and removes the stateLock around it. This field
was acting as a proxy for providerRoot != nil, so replace it with that check instead.
The two methods which called secondarySetCAConfigured already set the state, so checking the
state again at this point will not catch runtime errors (only programming errors, which we can catch with tests).
In general, handling state transitions should be done on the "entrypoint" methods where execution starts, not
in every internal method.
This is being done to remove some unnecessary references to c.state, in preparations for extracting
types for primary/secondary.
This makes it easier to fake, which will allow me to use the ConsulProvider as
an 'external PKI' to test a customer setup where the actual root CA is not
the root we use for the Consul CA.
Replaces a call to the state store to fetch the clusterID with the
clusterID field already available on the built-in provider.
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused operations_ent.go and operations_oss.go func
* convert `indexNodeCheck` of `session_checks` table
* partition `indexID` and `indexSession` of `tableSessionChecks`
* remove partition for Checks as it's always use the session partition
* partition sessions index id table
* fix rebase issues
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused operations_ent.go and operations_oss.go func
* remove unused const
* convert `IndexID` of `session_checks` table
* convert `indexSession` of `session_checks` table
* convert `indexNodeCheck` of `session_checks` table
* partition `indexID` and `indexSession` of `tableSessionChecks`
* fix oss linter
* fix review comments
* remove partition for Checks as it's always use the session partition
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
The TrustDomain is populated from the Host() method which includes the
hard-coded "consul" domain. This means that despite having an empty
cluster ID, the TrustDomain won't be empty.
There are two restrictions:
- Writes from the primary DC which explicitly target a secondary DC.
- Writes to a secondary DC that do not explicitly target the primary DC.
The first restriction is because the config entry is not supported in
secondary datacenters.
The second restriction is to prevent the scenario where a user writes
the config entry to a secondary DC, the write gets forwarded to the
primary, but then the config entry does not apply in the secondary.
This makes the scope more explicit.
Currently getCARoots could return an empty object with an empty trust
domain before the CA is initialized. This commit returns an error while
there is no CA config or no trust domain.
There could be a CA config and no trust domain because the CA config can
be created in InitializeCA before initialization succeeds.
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* partition `tableSessions` table
* fix sessions to use UUID and fix prefix index
* fix oss build
* clean up unused functions
* fix oss compilation
* add a partition indexer for sessions
* Fix oss to not have partition index
* fix oss tests
* remove unused func `prefixIndexFromServiceNameAsString`
* fix test error check
* remove unused operations_ent.go and operations_oss.go func
* remove unused const
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* state: port KV and Tombstone tables to new pattern
* go fmt'ed
* handle wildcards for tombstones
* Fix graveyard ent vs oss
* fix oss compilation error
* add partition to tombstones and kv state store indexes
* refactor to use `indexWithEnterpriseIndexable`
* partition kvs indexID table
* add `partitionedIndexEntryName` in oss for test purpose
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
* add `singleValueID` implementation assertions
* remove entmeta reference from oss
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: R.B. Boyer <4903+rboyer@users.noreply.github.com>
Previously secondaryInitialize would return nil in this case, which prevented the
deferred initialize from happening, and left the CA in an uninitialized state until a config
update or root rotation.
To fix this I extracted the common parts into the delegate implementation. However looking at this
again, it seems like the handling in secondaryUpdateRoots is impossible, because that function
should never be called before the secondary is initialzied. I beleive we can remove some of that
logic in a follow up.
* add root_cert_ttl option for consul connect, vault ca providers
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* Apply suggestions from code review
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
* add changelog, pr feedback
Signed-off-by: FFMMM <FFMMM@users.noreply.github.com>
* Update .changelog/11428.txt, more docs
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
* Update website/content/docs/agent/options.mdx
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: Daniel Nephin <dnephin@hashicorp.com>
Co-authored-by: Kyle Havlovitz <kylehav@gmail.com>
These labels should be set by whatever process scrapes Consul (for
prometheus), or by the agent that receives them (for datadog/statsd).
We need to remove them here because the labels are part of the "metric
key", so we'd have to pre-declare the metrics with the labels. We could
do that, but that is extra work for labels that should be added from
elsewhere.
Also renames the closure to be more descriptive.
Prometheus scrapes metrics from each process, so when leadership transfers to a different node
the previous leader would still be reporting the old cached value.
By setting NaN, I believe we should zero-out the value, so that prometheus should only consider the
value from the new leader.