Commit Graph

46 Commits

Author SHA1 Message Date
R.B. Boyer a01936442c
cache: refactor agent cache fetching to prevent unnecessary fetches on error (#14956)
This continues the work done in #14908 where a crude solution to prevent a
goroutine leak was implemented. The former code would launch a perpetual
goroutine family every iteration (+1 +1) and the fixed code simply caused a
new goroutine family to first cancel the prior one to prevent the
leak (-1 +1 == 0).

This PR refactors this code completely to:

- make it more understandable
- remove the recursion-via-goroutine strangeness
- prevent unnecessary RPC fetches when the prior one has errored.

The core issue arose from a conflation of the entry.Fetching field to mean:

- there is an RPC (blocking query) in flight right now
- there is a goroutine running to manage the RPC fetch retry loop

The problem is that the goroutine-leak-avoidance check would treat
Fetching like (2), but within the body of a goroutine it would flip that
boolean back to false before the retry sleep. This would cause a new
chain of goroutines to launch which #14908 would correct crudely.

The refactored code uses a plain for-loop and changes the semantics
to track state for "is there a goroutine associated with this cache entry"
instead of the former.

We use a uint64 unique identity per goroutine instead of a boolean so
that any orphaned goroutines can tell when they've been replaced when
the expiry loop deletes a cache entry while the goroutine is still running
and is later replaced.
2022-10-25 10:27:26 -05:00
R.B. Boyer 9f41cc4a25
cache: prevent goroutine leak in agent cache (#14908)
There is a bug in the error handling code for the Agent cache subsystem discovered:

1. NotifyCallback calls notifyBlockingQuery which calls getWithIndex in
   a loop (which backs off on-error up to 1 minute)

2. getWithIndex calls fetch if there’s no valid entry in the cache

3. fetch starts a goroutine which calls Fetch on the cache-type, waits
   for a while (again with backoff up to 1 minute for errors) and then
   calls fetch to trigger a refresh

The end result being that every 1 minute notifyBlockingQuery spawns an
ancestry of goroutines that essentially lives forever.

This PR ensures that the goroutine started by `fetch` cancels any prior
goroutine spawned by the same line for the same key.

In isolated testing where a cache type was tweaked to indefinitely
error, this patch prevented goroutine counts from skyrocketing.
2022-10-17 14:38:10 -05:00
R.B. Boyer 809344a6f5
peering: initial sync (#12842)
- Add endpoints related to peering: read, list, generate token, initiate peering
- Update node/service/check table indexing to account for peers
- Foundational changes for pushing service updates to a peer
- Plumb peer name through Health.ServiceNodes path

see: ENT-1765, ENT-1280, ENT-1283, ENT-1283, ENT-1756, ENT-1739, ENT-1750, ENT-1679,
     ENT-1709, ENT-1704, ENT-1690, ENT-1689, ENT-1702, ENT-1701, ENT-1683, ENT-1663,
     ENT-1650, ENT-1678, ENT-1628, ENT-1658, ENT-1640, ENT-1637, ENT-1597, ENT-1634,
     ENT-1613, ENT-1616, ENT-1617, ENT-1591, ENT-1588, ENT-1596, ENT-1572, ENT-1555

Co-authored-by: R.B. Boyer <rb@hashicorp.com>
Co-authored-by: freddygv <freddy@hashicorp.com>
Co-authored-by: Chris S. Kim <ckim@hashicorp.com>
Co-authored-by: Evan Culver <eculver@hashicorp.com>
Co-authored-by: Nitya Dhanushkodi <nitya@hashicorp.com>
2022-04-21 17:34:40 -05:00
R.B. Boyer 05c7373a28 bulk rewrite using this script
set -euo pipefail

    unset CDPATH

    cd "$(dirname "$0")"

    for f in $(git grep '\brequire := require\.New(' | cut -d':' -f1 | sort -u); do
        echo "=== require: $f ==="
        sed -i '/require := require.New(t)/d' $f
        # require.XXX(blah) but not require.XXX(tblah) or require.XXX(rblah)
        sed -i 's/\brequire\.\([a-zA-Z0-9_]*\)(\([^tr]\)/require.\1(t,\2/g' $f
        # require.XXX(tblah) but not require.XXX(t, blah)
        sed -i 's/\brequire\.\([a-zA-Z0-9_]*\)(\(t[^,]\)/require.\1(t,\2/g' $f
        # require.XXX(rblah) but not require.XXX(r, blah)
        sed -i 's/\brequire\.\([a-zA-Z0-9_]*\)(\(r[^,]\)/require.\1(t,\2/g' $f
        gofmt -s -w $f
    done

    for f in $(git grep '\bassert := assert\.New(' | cut -d':' -f1 | sort -u); do
        echo "=== assert: $f ==="
        sed -i '/assert := assert.New(t)/d' $f
        # assert.XXX(blah) but not assert.XXX(tblah) or assert.XXX(rblah)
        sed -i 's/\bassert\.\([a-zA-Z0-9_]*\)(\([^tr]\)/assert.\1(t,\2/g' $f
        # assert.XXX(tblah) but not assert.XXX(t, blah)
        sed -i 's/\bassert\.\([a-zA-Z0-9_]*\)(\(t[^,]\)/assert.\1(t,\2/g' $f
        # assert.XXX(rblah) but not assert.XXX(r, blah)
        sed -i 's/\bassert\.\([a-zA-Z0-9_]*\)(\(r[^,]\)/assert.\1(t,\2/g' $f
        gofmt -s -w $f
    done
2022-01-20 10:46:23 -06:00
Paul Banks ae5c0aad39
cache: Fix bug where connection errors can cause early cache expiry (#9979)
Fixes a cache bug where TTL is not updated while a value isn't changing or cache entry is returning fetch errors.
2021-04-08 11:11:15 +01:00
Paul Banks b61e00b772
cache: fix bug where TTLs were ignored leading to leaked memory in client agents (#9978)
* Fix bug in cache where TTLs are effectively ignored

This mostly affects streaming since streaming will immediately return from Fetch calls when the state is Closed on eviction which causes the race condition every time.

However this also affects all other cache types if the fetch call happens to return between the eviction and then next time around the Get loop by any client.

There is a separate bug that allows cache items to be evicted even when there are active clients which is the trigger here.

* Add changelog entry

* Update .changelog/9978.txt
2021-04-08 11:08:56 +01:00
Matt Keeler 19c99dc104
Stop background refresh of cached data for requests that result in ACL not found errors (#9738) 2021-02-09 10:15:53 -05:00
Daniel Nephin ef0999547a testing: skip slow tests with -short
Add a skip condition to all tests slower than 100ms.

This change was made using `gotestsum tool slowest` with data from the
last 3 CI runs of master.
See https://github.com/gotestyourself/gotestsum#finding-and-skipping-slow-tests

With this change:

```
$ time go test -count=1 -short ./agent
ok      github.com/hashicorp/consul/agent       0.743s

real    0m4.791s

$ time go test -count=1 -short ./agent/consul
ok      github.com/hashicorp/consul/agent/consul        4.229s

real    0m8.769s
```
2020-12-07 13:42:55 -05:00
Daniel Nephin 0beaced90f cache: fix a bug with Prepopulate
Prepopulate was setting entry.Expiry.HeapIndex to 0. Previously this would result in a call to heap.Fix(0)
which wasn't correct, but was also not really a problem because at worse it would re-notify.

With the recent change to extract cachettl it was changed to call Update(idx), which would have updated
the wrong entry.

A previous commit removed the setting of entry.Expiry so that the HeapIndex would be reported
as -1, and this commit adds a test and handles the -1 heap index.
2020-10-20 19:10:20 -04:00
Daniel Nephin 9d5b738cdb lib/ttlcache: extract package from agent/cache 2020-10-20 19:10:20 -04:00
Daniel Nephin b6f24c6554 cache: extract cache eviction heap
Start creating an interface that doesn't require using heap and hides more of the
entry internals.
2020-10-20 19:10:19 -04:00
Daniel Nephin f857aef4a8 submatview: add a test for handling of NewSnapshotToFollow
Also add some godoc
Rename some vars and functions
Fix a data race in the new cache test for entry closing.
2020-10-06 13:22:02 -04:00
Daniel Nephin 50846a96ff cache-types: Update Streaming health cache-type
To use latest protobuf types
2020-10-06 13:22:02 -04:00
Daniel Nephin e5d37bdf23 agent/cache: Add cache-type and materialized view for streaming health
Extracted from d97412ce4c399a35b41bbdae2716f0e32dce80bf

Co-authored-by: Paul Banks <banks@banksco.de>
2020-10-06 13:21:57 -04:00
Pierre Souchay ee50b55163 Added Unit test for cache reloading 2020-08-28 13:03:58 +02:00
Hans Hasselberg ffdc3057fe
agent/cache test for cache throttling. (#8396) 2020-07-30 14:41:13 +02:00
Matt Keeler 2ec4e46eb2
Default Cache rate limiting options in New
Also get rid of the TestCache helper which was where these defaults were happening previously.
2020-07-28 12:34:35 -04:00
Matt Keeler 6d94900cd7
Disable background cache refresh for Connect Leaf Certs
The rationale behind removing them is that all of our own code (xDS, builtin connect proxy) use the cache notification mechanism. This ensures that the blocking fetch behind the scenes is always executing. Therefore the only way you might go to get a certificate and have to wait is when 1) the request has never been made for that cert before or 2) you are using the v1/agent/connect/ca/leaf API for retrieving the cert yourself.

In the first case, the refresh change doesn’t alter the behavior. In the second case, it can be mitigated by using blocking queries with that API which just like normal cache notification mechanism will cause the blocking fetch to be initiated and to get leaf certs as soon as needed.

If you are not using blocking queries, or Envoy/xDS, or the builtin connect proxy but are retrieving the certs yourself then the HTTP endpoint might take a little longer to respond.

This also renames the RefreshTimeout field on the register options to QueryTimeout to more accurately reflect that it is used for any type that supports blocking queries.
2020-07-21 12:19:25 -04:00
Matt Keeler 976f922abf
Make the Agent Cache more Context aware (#8092)
Blocking queries issues will still be uncancellable (that cannot be helped until we get rid of net/rpc). However this makes it so that if calling getWithIndex (like during a cache Notify go routine) we can cancell the outer routine. Previously it would keep issuing more blocking queries until the result state actually changed.
2020-06-15 11:01:25 -04:00
Daniel Nephin 1251c01b73 agent/cache: Make all cache options RegisterOptions
Previously the SupportsBlocking option was specified by a method on the
type, and all the other options were specified from RegisterOptions.

This change moves RegisterOptions to a method on the type, and moves
SupportsBlocking into the options struct.

Currently there are only 2 cache-types. So all cache-types can implement
this method by embedding a struct with those predefined values. In the
future if a cache type needs to be registered more than once with different
options it can remove the embedded type and implement the method in a way
that allows for paramaterization.
2020-04-16 18:56:34 -04:00
R.B. Boyer 0e152672a1
avoid 'panic: Log in goroutine after TestCacheGet_refreshAge has completed' (#7276) 2020-02-12 10:01:51 -06:00
Matt Keeler 07f2854683 Fixes race condition in Agent Cache (#5796)
* Fix race condition during a cache get

Check the entry we pulled out of the cache while holding the lock had Fetching set.
If it did then we should use the existing Waiter instead of calling fetch. The reason
this is better than just calling fetch is that fetch re-gets the entry out of the
entries map and the previous fetch may have finished. Therefore this prevents
erroneously starting a new fetch because we just missed the last update.

* Fix race condition fully

The first commit still allowed for the following scenario:

• No entry existing when checked in getWithIndex while holding the read lock
• Then by time we had reached fetch it had been created and finished.

* always use ok when returning

* comment mentioning the reading from entries.

* use cacheHit consistently
2019-05-07 11:15:49 +01:00
Paul Banks 1c4dfbcd2e
connect: tame thundering herd of CSRs on CA rotation (#5228)
* Support rate limiting and concurrency limiting CSR requests on servers; handle CA rotations gracefully with jitter and backoff-on-rate-limit in client

* Add CSR rate limiting docs

* Fix config naming and add tests for new CA configs
2019-01-22 17:19:36 +00:00
Paul Banks c4fa66b4c9
connect: agent leaf cert caching improvements (#5091)
* Add State storage and LastResult argument into Cache so that cache.Types can safely store additional data that is eventually expired.

* New Leaf cache type working and basic tests passing. TODO: more extensive testing for the Root change jitter across blocking requests, test concurrent fetches for different leaves interact nicely with rootsWatcher.

* Add multi-client and delayed rotation tests.

* Typos and cleanup error handling in roots watch

* Add comment about how the FetchResult can be used and change ca leaf state to use a non-pointer state.

* Plumb test override of root CA jitter through TestAgent so that tests are deterministic again!

* Fix failing config test
2019-01-10 12:46:11 +00:00
Paul Banks 1bf3a37597
agent: Don't leave old errors around in cache (#5094)
* Fixes #4480. Don't leave old errors around in cache that can be hit in specific circumstances.

* Move error reset to cover extreme edge case of nil Value, nil err Fetch
2019-01-08 10:06:38 +00:00
Paul Banks bf741cbc04
Quick fix for cache age flakiness in CI 2018-10-11 13:12:19 +01:00
Paul Banks 5b0d4db6bc Support Agent Caching for Service Discovery Results (#4541)
* Add cache types for catalog/services and health/services and basic test that caching works

* Support non-blocking cache types with Cache-Control semantics.

* Update API docs to include caching info for every endpoint.

* Comment updates per PR feedback.

* Add note on caching to the 10,000 foot view on the architecture page to make the new data path more clear.

* Document prepared query staleness quirk and force all background requests to AllowStale so we can spread service discovery load across servers.
2018-10-10 16:55:34 +01:00
Paul Banks 217137b775
Fixes #4421: General solution to stop blocking queries with index 0 (#4437)
* Fix theoretical cache collision bug if/when we use more cache types with same result type

* Generalized fix for blocking query handling when state store methods return zero index

* Refactor test retry to only affect CI

* Undo make file merge

* Add hint to error message returned to end-user requests if Connect is not enabled when they try to request cert

* Explicit error for Roots endpoint if connect is disabled

* Fix tests that were asserting old behaviour
2018-07-25 20:26:27 +01:00
Paul Banks 81bd1b43a3 Fix hot loop in cache for RPC returning zero index. 2018-06-25 12:25:37 -07:00
Paul Banks 3d51c2aeac Get agent cache tests passing without global hit count (which is racy).
Few other fixes in here just to get a clean run locally - they are all also fixed in other PRs but shouldn't conflict.

This should be robust to timing between goroutines now.
2018-06-25 12:25:37 -07:00
Mitchell Hashimoto 29af8fb5ab agent/cache: always schedule the refresh 2018-06-25 12:25:14 -07:00
Mitchell Hashimoto a3ef9c2308 agent/cache: always fetch with minimum index of 1 at least 2018-06-25 12:25:12 -07:00
Mitchell Hashimoto ea0270e6aa agent/cache: correct test name 2018-06-25 12:24:07 -07:00
Mitchell Hashimoto b5201276bc agent/cache: change behavior to return error rather than retry
The cache behavior should not be to mask errors and retry. Instead, it
should aim to return errors as quickly as possible. We do that here.
2018-06-25 12:24:07 -07:00
Mitchell Hashimoto 778e318a52 agent/cache: perform backoffs on error retries on blocking queries 2018-06-25 12:24:06 -07:00
Mitchell Hashimoto c42510e1ec
agent/cache: implement refresh backoff 2018-06-14 09:42:14 -07:00
Mitchell Hashimoto 07d878a157
agent/cache: address feedback, clarify comments 2018-06-14 09:42:03 -07:00
Mitchell Hashimoto b319d06276
agent/cache: rework how expiry data is stored to be more efficient 2018-06-14 09:42:03 -07:00
Mitchell Hashimoto 449bbd817d
agent/cache: initial TTL work 2018-06-14 09:42:02 -07:00
Mitchell Hashimoto 3c6acbda5d
agent/cache: send the RefreshTimeout into the backend fetch 2018-06-14 09:42:02 -07:00
Mitchell Hashimoto 257fc34e51
agent/cache: on error, return from Get immediately, don't block forever 2018-06-14 09:42:02 -07:00
Mitchell Hashimoto 4509589427
agent/cache: support timeouts for cache reads and empty fetch results 2018-06-14 09:42:01 -07:00
Mitchell Hashimoto 286217cbd8
agent/cache: partition by DC/ACL token 2018-06-14 09:42:00 -07:00
Mitchell Hashimoto 72c82a9b29
agent/cache: Reorganize some files, RequestInfo struct, prepare for partitioning 2018-06-14 09:42:00 -07:00
Mitchell Hashimoto c69df79e0c
agent/cache: blank cache key means to always fetch 2018-06-14 09:42:00 -07:00
Mitchell Hashimoto 8584e9262e
agent/cache: initial kind-of working cache 2018-06-14 09:42:00 -07:00