These tests can flake when we get a notification for an earlier event.
Retry the read from update channel a few times to make sure we get the
event we expect.
Split the TestStreamingClient into the two logical components the real
client uses. This allows us to test multiple clients properly.
Previously writing of ctx from multiple Subscribe calls was showing a
data race.
Once this was fixed a test started to fail because the request had to be
made with a greater index, so that the store.Get call did not return
immediately.
The idleTTL was being written and read concurrently. Instead move the idleTTL to a struct
field so that when one test patches the TTL it does not impact others.
The background goroutines for the store can outlive a test because context cancellation
is async.
Setting this field to a value is equivalent to using the 'near' query paramter.
The intent is to sort the results by proximity to the node requesting
them. However with connect we send the results to envoy, which doesn't
care about the order, so setting this field is increasing the work
performed for no gain.
It is necessary to unset this field now because we would like connect
to use streaming, but streaming does not support sorting by proximity.
Remove View.Result error return value, it was always nil, and seems like it will likely always remain nill
since it is simply reading a stored value.
Also replace some cache types with local types.
And only start expiration time when the last request ends. This makes tracking expiry simpler, and
ensures that no entry can be expired while there are active requests.
Previously canRetry was attempting to retrieve this error from args, however there was never
any callers that would pass an error to args.
With the change to raftApply to move this error to the error return value, it is now possible
to receive this error from the err argument.
This commit updates canRetry to check for ErrChunkingResubmit in err.
Previously we were inconsistently checking the response for errors. This
PR moves the response-is-error check into raftApply, so that all callers
can look at only the error response, instead of having to know that
errors could come from two places.
This should expose a few more errors that were previously hidden because
in some calls to raftApply we were ignoring the response return value.
Also handle errors more consistently. In some cases we would log the
error before returning it. This can be very confusing because it can
result in the same error being logged multiple times. Instead return
a wrapped error.
Previously only a single auth method would be saved to the snapshot. This commit fixes the typo
and adds to the test, to show that all auth methods are now saved.
This way we avoid serializing these when empty. Otherwise users of the
latest version of the api submodule cannot interact with older versions
of Consul, because a new api client would send keys that the older Consul
doesn't recognize yet.
The zero value of these flags was already being excluded in the xDS
generation of circuit breaker/outlier detection config.
See: makeThresholdsIfNeeded and ToOutlierDetection.
This PR replaces the original boolean used to configure transparent
proxy mode. It was replaced with a string mode that can be set to:
- "": Empty string is the default for when the setting should be
defaulted from other configuration like config entries.
- "direct": Direct mode is how applications originally opted into the
mesh. Proxy listeners need to be dialed directly.
- "transparent": Transparent mode enables configuring Envoy as a
transparent proxy. Traffic must be captured and redirected to the
inbound and outbound listeners.
This PR also adds a struct for transparent proxy specific configuration.
Initially this is not stored as a pointer. Will revisit that decision
before GA.
* add http2 ping checks
* fix test issue
* add h2ping check to config resources
* add new test and docs for h2ping
* fix grammatical inconsistency in H2PING documentation
* resolve rebase conflicts, add test for h2ping tls verification failure
* api documentation for h2ping
* update test config data with H2PING
* add H2PING to protocol buffers and update changelog
* fix typo in changelog entry
* Add new consul connect redirect-traffic command for applying traffic redirection rules when Transparent Proxy is enabled.
* Add new iptables package for applying traffic redirection rules with iptables.
* Fix bug in cache where TTLs are effectively ignored
This mostly affects streaming since streaming will immediately return from Fetch calls when the state is Closed on eviction which causes the race condition every time.
However this also affects all other cache types if the fetch call happens to return between the eviction and then next time around the Get loop by any client.
There is a separate bug that allows cache items to be evicted even when there are active clients which is the trigger here.
* Add changelog entry
* Update .changelog/9978.txt
Synthetic upstreams from service-defaults config are stored locally in
the Upstreams list. Since these come from service-defaults they should
be cleaned up locally when no longer present in the service config
response.
This is needed in case the client proxy is in TransparentProxy mode.
Typically they won't have explicit configuration for every upstream, so
this ensures the settings can be applied to all of them when generating
xDS config.
New clients in transparent proxy mode can send requests for service
config resolution without any upstream args because they do not have
explicitly defined upstreams.
Old clients on the other hand will never send requests without the
Upstreams args unless they don't have upstreams, in which case we do not
send back upstream config.
The streaming cache type for service health has no way to handle v1/health/ingress/:service queries as there is no equivalent topic that would return the appropriate data.
Ensure that attempts to use this endpoint will use the old cache-type for now so that they return appropriate data when streaming is enabled.
As part of this change the indexer will now be case insensitive by using
the lower case value. This should be safe because previously we always
had lower case strings.
This change was made out of convenience. All the other indexers use
lowercase, so we can re-use the indexFromQuery function by using
lowercase here as well.
This bug would result in the UI not having the correct settings in
Consul enterprise, which could produce many warnings in the logs.
This bug occured because the index page, which includes a map of configuration
was rendered when the HTTPHandler is first created. This PR changes the
UIServer to instead render the index page when the page is requested.
The rendering does not appear to be all that expensive, so rendering it
when requested should not cause much extra latency.
Previously we were encoding the UUID as a string, but the index it references uses a UUID
so this index can also use an encoded UUID to save a bit of memory.
Prefix queries are generally being used to match part of a partial
index. We can support these indexes by using a function that accept
different types for each subset of the index.
What I found interesting is that in the generic StringFieldIndexer the
implementation for PrefixFromArgs would remove the trailing null, but
at least in these 2 cases we actually want a null terminated string.
We simply want fewer components in the string.
Some TLS servers require SNI, but the Golang HTTP client doesn't
include it in the ClientHello when connecting to an IP address. This
change adds a new TLSServerName field to health check definitions to
optionally set it. This fixes#9473.
The TestServiceHealthEventsFromChanges function was over 1400 lines.
Attempting to debug test failures in test functions this large is
difficult. It requires scrolling to the line which defines the testcase
because the failure message only includes the line number of the
assertion, not the line number of the test case.
This is an excellent example of where test tables stop working well, and
start being a problem. To mitigate this problem, the runCase pattern can
be used. When one of these tests fails, a failure message will print the
line number of both the test case and the assertion. This allows a
developer to quickly jump to both of the relevant lines, signficanting
reducing the time it takes to debug test failures.
For example, one such failure could look like this:
catalog_events_test.go:1610: case: service reg, new node
catalog_events_test.go:1605: assertion failed: values are not equal
ResolveServiceConfig is called by service manager before the proxy
registration is in the catalog. Therefore we should pass proxy
registration flags in the request rather than trying to fetch
them from the state store (where they may not exist yet).
This is done because after removing ID and NodeName from
ServiceConfigRequest we will no longer know whether a request coming in
is for a Consul client earlier than v1.10.
This enables it to be called for many upstreams or downstreams of a
service while only querying intentions once.
Additionally, decisions are now optionally denied due to L7 permissions
being present. This enables the function to be used to filter for
potential upstreams/downstreams of a service.
Previously this type was defined in structs, but unlike the other types in structs this type
is not used by RPC requests. By moving it to state we can better indicate that this is not
an API type, but part of the state implementation.
I added this recently without realizing that the method already existed and was named
NamespaceOrEmpty. Replace all calls to GetNamespace with NamespaceOrEmpty or NamespaceOrDefault
as appropriate.
Document that this comparison should roughly match MatchesKey
Only sort by overrideKey or service name, but not both
Add namespace to the sort.
The client side also builds a map of these based on the namespace/node/service key, so the only order
that really matters is the ordering of register/dereigster events.
Refactored out a function that can be used for both the snapshot and stream of events to translate
an event into an appropriate connect event.
Previously terminating gateway events would have used the wrong key in the snapshot, which would have
caused them to be filtered out later on.
Also removed an unused function, and some commented out code.
Health of a terminating gateway instance changes
- Generate an event for creating/destroying this instance of the terminating gateway,
duplicate it for each affected service
Co-Authored-By: Kyle Havlovitz <kylehav@gmail.com>
Previous to this commit, the API response would include Gateway
Addresses in the form `domain.name.:8080`, which due to the addition of
the port is probably not the expected response.
This commit rightTrims any `.` characters from the end of the domain
before formatting the address to include the port resulting in
`domain.name:8080`
Since we currently do no version switching this removes 75% of the PR
noise.
To generate all *.golden files were removed and then I ran:
go test ./agent/xds -update
Note that this does NOT upgrade to xDS v3. That will come in a future PR.
Additionally:
- Ignored staticcheck warnings about how github.com/golang/protobuf is deprecated.
- Shuffled some agent/xds imports in advance of a later xDS v3 upgrade.
- Remove support for envoy 1.13.x but don't add in 1.17.x yet. We have to wait until the xDS v3 support is added in a follow-up PR.
Fixes#8425
When de-registering in anti-entropy sync, when there is no service or
check token.
The agent token will fall back to the default (aka user) token if no agent
token is set, so the existing behaviour still works, but it will prefer
the agent token over the user token if both are set.
ref: https://www.consul.io/docs/agent/options#acl_tokens
The agent token seems more approrpiate in this case, since this is an
"internal operation", not something initiated by the user.
These new functional indexers provide a few advantages:
1. enterprise differences can be isolated to a single function (the
indexer function), making code easier to change
2. as a consequence of (1) we no longer need to wrap all the calls to
Txn operations, making code easier to read.
3. by removing reflection we should increase the performance of all
operations.
One important change is in making all the function signatures the same.
https://blog.golang.org/errors-are-values
An extra boolean return value for SingleIndexer.FromObject is superfluous.
The error value can indicate when the index value could not be created.
By removing this extra return value we can use the same signature for both
indexer functions.
This has the nice properly of a function being usable for both indexing operations.
By using a new pattern for more specific indexes. This allows us to use
the same index for both service checks and node checks. It removes the
abstraction around memdb.Txn operations, and isolates all of the
enterprise differences in a single place (the indexer).
* A GET of the /acl/auth-method/:name endpoint returns the fields
MaxTokenTTL and TokenLocality, while a LIST (/acl/auth-methods) does
not.
The list command returns a filtered subset of the full set. This is
somewhat deliberate, so that secrets aren't shown, but the TTL and
Locality fields aren't (IMO) security critical, and it is useful for
the front end to be able to show them.
For consistency these changes mirror the 'omit empty' and string
representation choices made for the GET call.
This includes changes to the gRPC and API code in the client.
The new output looks similar to this
curl 'http://localhost:8500/v1/acl/auth-methods' | jq '.'
{
"MaxTokenTTL": "8m20s",
"Name": "minikube-ttl-local2",
"Type": "kubernetes",
"Description": "minikube auth method",
"TokenLocality": "local",
"CreateIndex": 530,
"ModifyIndex": 530,
"Namespace": "default"
}
]
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
* Add changelog
Signed-off-by: Mark Anderson <manderson@hashicorp.com>
Replace the large table of tests with individual calls to run(). By using
runCase, failure messages will include the line number for the test case, as
well as a line number from the test functions.
Example:
=== FAIL: agent/config TestLoad_IntegrationWithFlags/failing_case (0.01s)
runtime_test.go:4721: case: failing case
runtime_test.go:4864: error "data_dir cannot be empty" does not contain "I expected this error"
Previous:
runtime_test.go:4864: error "data_dir cannot be empty" does not contain "I expected this error"
Without the line number to the testCase data, debugging these tests is
difficult. It is impossible to jump directly to the test case, and
difficult to find the location because of many similarly named cases.
AEInterval is overridden by NonUserSource, so there is no way for a user
to set this value. These two cases represented impossible real world
scenarios.
Instead the test is replaced with one that shows that the AEInterval can
not be set by config.
This change allows us to remove the hcltail and jsontail fields from
testCase
Previously a snapshot created as part of a resumse-stream request could have incorrectly
cached the newSnapshotToFollow event. This would cause clients to error because they
received an unexpected framing event.
So that all the client side filtering is in the same place. Previously
only the bexpr filter was in the cache-entry.
Also makes a small change to the filtering so that instead of rebuilding
slices of items, the filtering can return a bool to determine if the
event payload is saved or not.
Send empty array [] instead of [""] in DNS requests when TagFilter is not set
Do not change case sensitivity of services anymore in `getServiceNodes()` since
cache keys are now case insensitive
10 minutes is the default blocking query timeout. Using the same value results in us hitting
the expired cache entry bug frequently. By extending this TTL we at least mitigate the problem.
The underlying bug still needs to be fixed.
The materializer is often reset when an error is received. By resetting
the retryWaiter we effectively never wait. The retryWaiter should only
be reset when we get an event without error. This is done in
Materializer.updateView().
Streaming can not be used for these queries because the near query
paramter indicates a specific sort of the results, and that sort
requires data that is not available to the client from the streaming
API.
This fixes an issue where leaf certificates issued in primary
datacenters using Vault as a Connect CA would be reissued very
frequently (every ~20 seconds) because the logic meant to detect root
rotation was errantly triggering.
The hash of the rootCA was being compared against a hash of the
intermediateCA and always failing. This doesn't apply to the Consul
built-in CA provider because there is no intermediate in use in the
primary DC.
This is reminiscent of #6513
In a situation where the mesh gateway is configured to bind to multiple
network interfaces, we use a feature called 'tagged addresses'.
Sometimes an address is duplicated across multiple tags such as 'lan'
and 'lan_ipv4'.
There is code to deduplicate these things when creating envoy listeners,
but that code doesn't ensure that the same tag wins every time. If the
winning tag flaps between xDS discovery requests it will cause the
listener to be drained and replaced.
registerSchema creates some indirection which is not necessary in this
case. newDBSchema can call each of the tables.
Enterprise tables can be added from the existing withEnterpriseSchema
shim.
Deadlock scenario:
1. Due to scheduling, the state runner sends one snapshot into
snapCh and then attempts to send a second. The first send succeeds
because the channel is buffered, but the second blocks.
2. Separately, Manager.Watch is called by the xDS server after
getting a discovery request from Envoy. This function acquires the
manager lock and then blocks on receiving the CurrentSnapshot from
the state runner.
3. Separately, there is a Manager goroutine that reads the snapshots
from the channel in step 1. These reads are done to notify proxy
watchers, but they require holding the manager lock. This goroutine
goes to acquire that lock, but can't because it is held by step 2.
Now, the goroutine from step 3 is waiting on the one from step 2 to
release the lock. The goroutine from step 2 won't release the lock until
the goroutine in step 1 advances. But the goroutine in step 1 is waiting
for the one in step 3. Deadlock.
By making this send non-blocking step 1 above can proceed. The coalesce
timer will be reset and a new valid snapshot will be delivered after it
elapses or when one is requested by xDS.
This allows setting ForceWithoutCrossSigning when reconfiguring the CA
for any provider, in order to forcibly move to a new root in cases where
the old provider isn't reachable or able to cross-sign for whatever
reason.
This commit makes a number of changes that should make
TestLoad_FullConfig easier to work with, and make the test more like
real world scenarios.
* use separate files in testdata/ dir to store the config source.
Separate files are much easier to edit because editors can syntax
highlight json/hcl, and it makes strings easier to find. Previously
trying to find strings would match strings used in other tests.
* use the exported config.Load interface instead of internal NewBuilder
and BuildAndValidate.
* remove the tail config overrides, which are only necessary with
nonZero works.
This commit reduces the interface to Load() a bit, in preparation for
unexporting NewBuilder and having everything call Load.
The three arguments are reduced to a single argument by moving the other
two into the options struct.
The three return values are reduced to two by moving the RuntimeConfig
and Warnings into a LoadResult struct.
Previously the ServiceManager had to run a separate goroutine so that it could block on a channel
send/receive instead of a lock. Using this mutex with TryLock allows us to cancel the lock when
the serviceConfigWatch is stopped.
Without this change removing the ServiceManager.Start goroutine would not be possible because
when AddService is called it acquires the stateLock. While that lock is held, if there are
existing watches for the service, the old watch will be stopped, and the goroutine holding the
lock will attempt to wait for that watcher goroutine to exit.
If the goroutine is handling an update (serviceConfigWatch.handleUpdate) then it can block on
acquiring the stateLock and deadlock the agent. With this change the context is cancelled as
and the goroutine will exit instead of waiting on the stateLock.
The ServiceManager.Start goroutine was used to serialize calls to
agent.addServiceInternal.
All the goroutines which sent events to the channel would block waiting
for a response from that same goroutine, which is effectively the same
as a synchronous call without any channels.
This commit removes the goroutine and channels, and instead calls
addServiceInternal directly. Since all of these goroutines will need to
take the agent.stateLock, the mutex handles the serializing of calls.
Move the field into the struct for addServiceLocked. Also don't require
setting a default value, so that the callers can leave it as nil if they
don't already have a snapshot.
Replace with the existing AddServiceRequest struct. These structs are
almost identical. Additionally, the only reason the serviceRegistration
struct existed was to recreate an AddServiceRequest.
By storing and re-using the AddServiceRequest we remove the need to
translate into one type and back to the original type.
We also remove the extra parameters to a function, because those values
are already available from the AddServiceRequest field.
Also a minor optimization to only call tokens.AgentToken() when
necessary. Previous it was being called every time, but the value was
being ignored if the AddServiceRequest had a token.
Handle the decision to use ServiceManager in a single place. Instead of
calling ServiceManager.AddService, then calling back into
addServiceInternal, only call ServiceManager.AddService if we are going
to use it.
This change removes some small duplication and removes a branch from the
AddService flow.
The temprorary variables make it much harder to trace where and how struct
fields are used. If a field is only used a small number of times than
refer to the field directly.
The method is only used in tests, and only exists for legacy calls.
There was one other package which used this method in tests. Export
the AddServiceRequest and a couple of its fields so the new function can
be used in those tests.
This way we only have to wait for the serf barrier to pass once before
we can make use of federation state APIs Without this patch every
restart needs to re-compute the change.
* Add templating to inject JSON into an application/json script tag
Plus an external script in order to pick it out and inject the values we
need injecting into ember's environment meta tag.
The UI still uses env style naming (CONSUL_*) but we uses the new style
JSON/golang props behind the scenes.
Co-authored-by: Paul Banks <banks@banksco.de>
In some circumstances this endpoint will have no results in it (dues to
ACLs, Namespaces, filtering or missing configuration).
This ensures that the response is at least an empty array (`[]`) rather
than `null`
These expectations are optional because in a slow CI environment the deadline to cancell the context might occur before the go routine reaches issuing the RPC. Either way we are successfully ensuring context cancellation is working.
Deleting from memdb inside an interation can cause a panic from Iterator.Next. This
case is technically safe (for now) because the iterator is using the root radix tree
not a modified one.
However this could break at any time if someone adds an insert or delete to the coordinates table
before this place in the function.
It also sets a bad example, because generally deletes in an interator are not safe. So this
commit uses the pattern we have in other places to move the deletes out of the iteration.
After fixing that bug I uncovered a couple more:
Fix an issue where we might try to cross sign a cert when we never had a valid root.
Fix a potential issue where reconfiguring the CA could cause either the Vault or AWS PCA CA providers to delete resources that are still required by the new incarnation of the CA.
Using withEnterpriseSchema() we can apply any enterprise schema changes
with a single shim, removing the need to duplicate all of the table
definitions.
Also move all the catalog schemas to a new file to shrink catalog.go a bit.
I believe this commit also fixes a bug. Previously RPCMaxConnsPerClient was not being re-read from the RuntimeConfig, so passing it to Server.ReloadConfig was never changing the value.
Also improve the test runtime by not doing a lot of unnecessary work.
The field was not being included in the cache info key. This would result in a DNS request for
web.service.consul returning the same result as web.ingress.consul, when those results should
not be the same.
include all fields when fuzzing in tests
split tests by struct type
Ensure the new value for the field is different
fuzzer.Fuzz could produce the same value again in some cases.
Use a custom fuzz function for QueryOptions. That type is an embedded struct in the request types
but only one of the fields is important to include in the cache key.
Move enterpriseMetaField to an oss file so that we can change it in enterprise.
* Fix bug in usage metrics that caused a negative count to occur
There were a couple of instances were usage metrics would do the wrong
thing and result in incorrect counts, causing the count to attempt to
decrement below zero and return an error. The usage metrics did not
account for various places where a single transaction could
delete/update/add multiple service instances at once.
We also remove the error when attempting to decrement below zero, and
instead just make sure we do not accidentally underflow the unsigned
integer. This is a more graceful failure than returning an error and not
allowing a transaction to commit.
* Add changelog
This change allows us to re-use these functions in other places without the Builder, and makes it
more explicit about which functions can warn/error and which can not.
A golden file makes the expected value easier to work with. This change also
removes a number of shims for enterprise and replaces them with a single one
for the golden filename.
* Display a warning when rpc.enable_streaming = true is set on a client
This option has no effect when running as an agent
* Added warning when server starts with use_streaming_backend but without rpc.enable_streaming
* Added unit test