open-consul

Commit Graph

Author	SHA1	Message	Date
R.B. Boyer	5e019393d3	Revert "cache: refactor agent cache fetching to prevent unnecessary fetches on error (#14956)" (#16818) (#17046) Co-authored-by: Derek Menteer <105233703+hashi-derek@users.noreply.github.com>	2023-04-19 13:17:21 -05:00
Ronald	dd0e8eec14	copyright headers for agent folder (#16704 ) * copyright headers for agent folder * Ignore test data files * fix proto files and remove headers in agent/uiserver folder * ignore deep-copy files	2023-03-28 14:39:22 -04:00
R.B. Boyer	a01936442c	cache: refactor agent cache fetching to prevent unnecessary fetches on error (#14956) This continues the work done in #14908 where a crude solution to prevent a goroutine leak was implemented. The former code would launch a perpetual goroutine family every iteration (+1 +1) and the fixed code simply caused a new goroutine family to first cancel the prior one to prevent the leak (-1 +1 == 0). This PR refactors this code completely to: - make it more understandable - remove the recursion-via-goroutine strangeness - prevent unnecessary RPC fetches when the prior one has errored. The core issue arose from a conflation of the entry.Fetching field to mean both: (1) there is an RPC (blocking query) in flight right now, and (2) there is a goroutine running to manage the RPC fetch retry loop. The problem is that the goroutine-leak-avoidance check would treat Fetching like (2), but within the body of a goroutine it would flip that boolean back to false before the retry sleep. This would cause a new chain of goroutines to launch, which #14908 would correct crudely. The refactored code uses a plain for-loop and changes the semantics to track state for "is there a goroutine associated with this cache entry" instead of the former. We use a uint64 unique identity per goroutine instead of a boolean so that an orphaned goroutine can tell it has been replaced, which happens when the expiry loop deletes a cache entry while the goroutine is still running and the entry is later replaced.	2022-10-25 10:27:26 -05:00
R.B. Boyer	9f41cc4a25	cache: prevent goroutine leak in agent cache (#14908) A bug was discovered in the error handling code for the Agent cache subsystem: 1. NotifyCallback calls notifyBlockingQuery which calls getWithIndex in a loop (which backs off on-error up to 1 minute) 2. getWithIndex calls fetch if there’s no valid entry in the cache 3. fetch starts a goroutine which calls Fetch on the cache-type, waits for a while (again with backoff up to 1 minute for errors) and then calls fetch to trigger a refresh. The end result is that every 1 minute notifyBlockingQuery spawns an ancestry of goroutines that essentially lives forever. This PR ensures that the goroutine started by `fetch` cancels any prior goroutine spawned by the same line for the same key. In isolated testing where a cache type was tweaked to indefinitely error, this patch prevented goroutine counts from skyrocketing.	2022-10-17 14:38:10 -05:00
skpratt	cf6c1d9388	add non-double-prefixed metrics (#14193 )	2022-09-09 12:13:43 -05:00
R.B. Boyer	4ce9651421	test: update mockery use to put mocks into test files (#13656 ) --testonly doesn't do anything anymore so switch to --filename instead	2022-07-05 16:57:15 -05:00
R.B. Boyer	809344a6f5	peering: initial sync (#12842 ) - Add endpoints related to peering: read, list, generate token, initiate peering - Update node/service/check table indexing to account for peers - Foundational changes for pushing service updates to a peer - Plumb peer name through Health.ServiceNodes path see: ENT-1765, ENT-1280, ENT-1283, ENT-1283, ENT-1756, ENT-1739, ENT-1750, ENT-1679, ENT-1709, ENT-1704, ENT-1690, ENT-1689, ENT-1702, ENT-1701, ENT-1683, ENT-1663, ENT-1650, ENT-1678, ENT-1628, ENT-1658, ENT-1640, ENT-1637, ENT-1597, ENT-1634, ENT-1613, ENT-1616, ENT-1617, ENT-1591, ENT-1588, ENT-1596, ENT-1572, ENT-1555 Co-authored-by: R.B. Boyer <rb@hashicorp.com> Co-authored-by: freddygv <freddy@hashicorp.com> Co-authored-by: Chris S. Kim <ckim@hashicorp.com> Co-authored-by: Evan Culver <eculver@hashicorp.com> Co-authored-by: Nitya Dhanushkodi <nitya@hashicorp.com>	2022-04-21 17:34:40 -05:00
R.B. Boyer	bbd38e95ce	chore: upgrade mockery to v2 and regenerate (#12836 )	2022-04-21 09:48:21 -05:00
R.B. Boyer	cf0c5110be	ca: fix a bug that caused a non blocking leaf cert query after a blocking leaf cert query to block (#12820 ) Fixes #12048 Fixes #12319 Regression introduced in #11693 Local reproduction steps: 1. `consul agent -dev` 2. `curl -sLiv 'localhost:8500/v1/agent/connect/ca/leaf/web'` 3. make note of the `X-Consul-Index` header returned 4. `curl -sLi 'localhost:8500/v1/agent/connect/ca/leaf/web?index=<VALUE_FROM_STEP_3>'` 5. Kill the above curl when it hangs with Ctrl-C 6. Repeat (2) and it should not hang.	2022-04-20 12:21:47 -05:00
Paul Banks	ae5c0aad39	cache: Fix bug where connection errors can cause early cache expiry (#9979 ) Fixes a cache bug where TTL is not updated while a value isn't changing or cache entry is returning fetch errors.	2021-04-08 11:11:15 +01:00
Paul Banks	b61e00b772	cache: fix bug where TTLs were ignored leading to leaked memory in client agents (#9978) * Fix bug in cache where TTLs are effectively ignored This mostly affects streaming since streaming will immediately return from Fetch calls when the state is Closed on eviction which causes the race condition every time. However this also affects all other cache types if the fetch call happens to return between the eviction and the next time around the Get loop by any client. There is a separate bug that allows cache items to be evicted even when there are active clients which is the trigger here. * Add changelog entry * Update .changelog/9978.txt	2021-04-08 11:08:56 +01:00
Daniel Nephin	e47131bfe6	cache: log a warning when Cache.Notify handles an error Without these warnings, errors are silently ignored, which can make debugging problems more challenging.	2021-02-12 13:02:23 -05:00
Matt Keeler	19c99dc104	Stop background refresh of cached data for requests that result in ACL not found errors (#9738 )	2021-02-09 10:15:53 -05:00
Hans Hasselberg	25f9e232af	add missing descriptions for metrics	2020-11-23 22:06:30 +01:00
Kit Patella	af719981f3	finish adding static server metrics	2020-11-13 16:26:08 -08:00
Daniel Nephin	09d62f1df0	lib/ttlcache: unexport key and additional godoc	2020-10-20 19:16:03 -04:00
Daniel Nephin	2601998766	lib/ttlcache: add a constant for NotIndexed	2020-10-20 19:10:20 -04:00
Daniel Nephin	9d5b738cdb	lib/ttlcache: extract package from agent/cache	2020-10-20 19:10:20 -04:00
Daniel Nephin	909b8e674e	cache: export ExpiryHeap and hide internal methods on an unexported type, so that when it is extracted those methods are not exported.	2020-10-20 19:10:20 -04:00
Daniel Nephin	a96646c562	cache: Move more of the expiryLoop into the Heap	2020-10-20 19:10:20 -04:00
Daniel Nephin	b6f24c6554	cache: extract cache eviction heap Start creating an interface that doesn't require using heap and hides more of the entry internals.	2020-10-20 19:10:19 -04:00
Daniel Nephin	f857aef4a8	submatview: add a test for handling of NewSnapshotToFollow Also add some godoc Rename some vars and functions Fix a data race in the new cache test for entry closing.	2020-10-06 13:22:02 -04:00
Daniel Nephin	e5d37bdf23	agent/cache: Add cache-type and materialized view for streaming health Extracted from d97412ce4c399a35b41bbdae2716f0e32dce80bf Co-authored-by: Paul Banks <banks@banksco.de>	2020-10-06 13:21:57 -04:00
Pierre Souchay	084d0e8015	Added `options.Equals()` and minor indentation fixes	2020-08-27 13:44:45 +02:00
Pierre Souchay	dd385f05e6	Ensure that Cache options are reloaded when `consul reload` is performed. This ensures that the cache throttling parameters are properly applied: * cache.EntryFetchMaxBurst * cache.EntryFetchRate When values are updated, a message is logged at info level.	2020-08-24 23:33:10 +02:00
Matt Keeler	2ec4e46eb2	Default Cache rate limiting options in New Also get rid of the TestCache helper which was where these defaults were happening previously.	2020-07-28 12:34:35 -04:00
Pierre Souchay	947d8eb039	Added ratelimit to handle throttling cache (#8226) This implements a solution for #7863. It adds: - a new config cache.entry_fetch_rate to limit the number of calls/s for a given cache entry (default value = rate.Inf) - cache.entry_fetch_max_burst, the burst size of the rate limit (default value = 2). The new configuration supports the following syntax, for instance to allow 1 query every 3s: on the command line with HCL: -hcl 'cache = { entry_fetch_rate = 0.333}'; in JSON: { "cache": { "entry_fetch_rate": 0.333 } }	2020-07-27 23:11:11 +02:00
Matt Keeler	6d94900cd7	Disable background cache refresh for Connect Leaf Certs The rationale behind removing them is that all of our own code (xDS, builtin connect proxy) use the cache notification mechanism. This ensures that the blocking fetch behind the scenes is always executing. Therefore the only way you might go to get a certificate and have to wait is when 1) the request has never been made for that cert before or 2) you are using the v1/agent/connect/ca/leaf API for retrieving the cert yourself. In the first case, the refresh change doesn’t alter the behavior. In the second case, it can be mitigated by using blocking queries with that API which just like normal cache notification mechanism will cause the blocking fetch to be initiated and to get leaf certs as soon as needed. If you are not using blocking queries, or Envoy/xDS, or the builtin connect proxy but are retrieving the certs yourself then the HTTP endpoint might take a little longer to respond. This also renames the RefreshTimeout field on the register options to QueryTimeout to more accurately reflect that it is used for any type that supports blocking queries.	2020-07-21 12:19:25 -04:00
Daniel Nephin	797abe1f00	agent/cache: Use AllowNotModifiedResponse in CatalogListServices Co-authored-by: Pierre Souchay <pierresouchay@users.noreply.github.com>	2020-07-14 18:58:20 -04:00
Daniel Nephin	8aa3335b22	agent/cache: Update some docstrings	2020-07-14 18:58:20 -04:00
Matt Keeler	976f922abf	Make the Agent Cache more Context aware (#8092) Blocking queries that have already been issued will still be uncancellable (that cannot be helped until we get rid of net/rpc). However this makes it so that if calling getWithIndex (like during a cache Notify goroutine) we can cancel the outer routine. Previously it would keep issuing more blocking queries until the result state actually changed.	2020-06-15 11:01:25 -04:00
Daniel Nephin	3114943f8d	agent/cache: remove error return from fetch A previous change removed the only error, so the return value can be removed now.	2020-04-17 11:55:01 -04:00
Daniel Nephin	d015d3c563	agent/cache: reduce function arguments by removing duplicates A few of the unexported functions in agent/cache took a large number of arguments. These arguments were effectively overrides for values that were provided in RequestInfo. By using a struct we can not only reduce the number of arguments, but also simplify the logic by removing the need for overrides.	2020-04-17 11:35:07 -04:00
Daniel Nephin	1251c01b73	agent/cache: Make all cache options RegisterOptions Previously the SupportsBlocking option was specified by a method on the type, and all the other options were specified from RegisterOptions. This change moves RegisterOptions to a method on the type, and moves SupportsBlocking into the options struct. Currently there are only 2 cache-types. So all cache-types can implement this method by embedding a struct with those predefined values. In the future if a cache type needs to be registered more than once with different options it can remove the embedded type and implement the method in a way that allows for parameterization.	2020-04-16 18:56:34 -04:00
Daniel Nephin	fb31212de7	Remove TTL from cacheEntryExpiry This should very slightly reduce the amount of memory required to store each item in the cache. It will also enable setting different TTLs based on the type of result. For example we may want to use a shorter TTL when the result indicates the resource does not exist, as storing these types of records could easily lead to a DOS caused by OOM.	2020-04-13 13:10:38 -04:00
Daniel Nephin	4d398d26ae	agent/cache: Inline the refresh function to make recursion more obvious fetch is already an exceptionally long function, but hiding the recursion in a function call likely does not help.	2020-04-13 13:10:38 -04:00
Daniel Nephin	98ef66e70a	agent/cache: Make the return values of getEntryLocked more obvious Use named returns so that the caller has a better idea of what these bools mean. Return early to reduce the scope, and make it more obvious what values are returned in which cases. Also reduces the number of conditional expressions in each case.	2020-04-13 13:10:38 -04:00
Daniel Nephin	cef60d1547	agent/cache: Small formatting improvements to improve readability Remove Cache.entryKey which called a single function. Format multiline struct creation one field per line.	2020-04-13 12:34:11 -04:00
Daniel Nephin	ab068325da	agent/cache: move typeEntry lookup to the edge This change moves all the typeEntry lookups to the first step in the exported methods, and makes unexported internals accept the typeEntry struct. This change is primarily intended to make it easier to extract the container of caches from the Cache type. It may incidentally reduce locking in fetch, but that was not a goal.	2020-04-03 16:01:56 -04:00
Christian Muehlhaeuser	2602f6907e	Simplified code in various places (#6176 ) All these changes should have no side-effects or change behavior: - Use bytes.Buffer's String() instead of a conversion - Use time.Since and time.Until where fitting - Drop unnecessary returns and assignment	2019-07-20 09:37:19 -04:00
Hans Hasselberg	73c4e9f07c	tls: auto_encrypt enables automatic RPC cert provisioning for consul clients (#5597 )	2019-06-27 22:22:07 +02:00
Matt Keeler	07f2854683	Fixes race condition in Agent Cache (#5796) * Fix race condition during a cache get Check whether the entry we pulled out of the cache while holding the lock had Fetching set. If it did then we should use the existing Waiter instead of calling fetch. The reason this is better than just calling fetch is that fetch re-gets the entry out of the entries map and the previous fetch may have finished. Therefore this prevents erroneously starting a new fetch because we just missed the last update. * Fix race condition fully The first commit still allowed for the following scenario: • No entry existed when checked in getWithIndex while holding the read lock • Then by the time we had reached fetch it had been created and finished. * always use ok when returning * comment mentioning the reading from entries. * use cacheHit consistently	2019-05-07 11:15:49 +01:00
R.B. Boyer	91e78e00c7	fix typos reported by golangci-lint:misspell (#5434 )	2019-03-06 11:13:28 -06:00
Paul Banks	1c4dfbcd2e	connect: tame thundering herd of CSRs on CA rotation (#5228 ) * Support rate limiting and concurrency limiting CSR requests on servers; handle CA rotations gracefully with jitter and backoff-on-rate-limit in client * Add CSR rate limiting docs * Fix config naming and add tests for new CA configs	2019-01-22 17:19:36 +00:00
Matt Keeler	8e54856c46	Implement prepared query upstreams watching for envoy (#5224 ) Fixes #4969 This implements non-blocking request polling at the cache layer which is currently only used for prepared queries. Additionally this enables the proxycfg manager to poll prepared queries for use in envoy proxy upstreams.	2019-01-18 12:44:04 -05:00
Paul Banks	c4fa66b4c9	connect: agent leaf cert caching improvements (#5091 ) * Add State storage and LastResult argument into Cache so that cache.Types can safely store additional data that is eventually expired. * New Leaf cache type working and basic tests passing. TODO: more extensive testing for the Root change jitter across blocking requests, test concurrent fetches for different leaves interact nicely with rootsWatcher. * Add multi-client and delayed rotation tests. * Typos and cleanup error handling in roots watch * Add comment about how the FetchResult can be used and change ca leaf state to use a non-pointer state. * Plumb test override of root CA jitter through TestAgent so that tests are deterministic again! * Fix failing config test	2019-01-10 12:46:11 +00:00
Paul Banks	1bf3a37597	agent: Don't leave old errors around in cache (#5094 ) * Fixes #4480. Don't leave old errors around in cache that can be hit in specific circumstances. * Move error reset to cover extreme edge case of nil Value, nil err Fetch	2019-01-08 10:06:38 +00:00
Paul Banks	a640cc6bc7	Add cache.Notify to abstract watching for cache updates for types that support blocking semantics. (#4695 )	2018-10-10 16:55:34 +01:00
Paul Banks	5b0d4db6bc	Support Agent Caching for Service Discovery Results (#4541 ) * Add cache types for catalog/services and health/services and basic test that caching works * Support non-blocking cache types with Cache-Control semantics. * Update API docs to include caching info for every endpoint. * Comment updates per PR feedback. * Add note on caching to the 10,000 foot view on the architecture page to make the new data path more clear. * Document prepared query staleness quirk and force all background requests to AllowStale so we can spread service discovery load across servers.	2018-10-10 16:55:34 +01:00
Paul Banks	77e0577ff6	Add a Close method to cache that stops background goroutines. (#4746) In a real agent the `cache` instance is alive until the agent shuts down so this is not a real leak in production, however in our test suite, every testAgent that is started and stopped leaks goroutines that never get cleaned up, which accumulate, consuming CPU and memory through subsequent tests in the `agent` package, which doesn't help our test flakiness. This adds a Close method that doesn't invalidate or clean up the cache, and still allows concurrent blocking queries to run (for up to 10 mins which might still affect tests). But at least it doesn't maintain them forever with background refresh and an expiry watcher routine. It would be nice to cancel any outstanding blocking requests as well when we close but that requires much more invasive surgery right into our RPC protocol since we don't have a way to cancel requests currently. Unscientifically this seems to make tests pass a bit quicker and more reliably locally but I can't really be sure of that!	2018-10-04 11:27:11 +01:00
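
The goroutine-identity idea described in a01936442c (#14956) can be illustrated with a small sketch. This is not Consul's code: the `entry`, `store`, `claim`, `stillOwner`, and `evict` names are hypothetical, and real fetching, backoff, and blocking queries are left out. It only shows why a uint64 identity lets an orphaned goroutine notice that its cache entry was evicted and later recreated, something a plain boolean cannot distinguish.

```go
package main

import (
	"fmt"
	"sync"
)

// entry is a hypothetical cache entry. goroutineID is 0 when no
// goroutine is currently managing fetches for the entry.
type entry struct {
	goroutineID uint64
}

// store is a hypothetical container of cache entries.
type store struct {
	mu      sync.Mutex
	nextID  uint64
	entries map[string]*entry
}

func newStore() *store {
	return &store{entries: make(map[string]*entry)}
}

// claim registers a managing goroutine for key if there is none yet.
// It returns the new identity and true, or 0 and false if the entry
// already has an owner.
func (s *store) claim(key string) (uint64, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entries[key]
	if !ok {
		e = &entry{}
		s.entries[key] = e
	}
	if e.goroutineID != 0 {
		return 0, false
	}
	s.nextID++
	e.goroutineID = s.nextID
	return e.goroutineID, true
}

// stillOwner reports whether the goroutine holding id still owns key.
// A goroutine would check this on each loop iteration and exit when
// it returns false.
func (s *store) stillOwner(key string, id uint64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entries[key]
	return ok && e.goroutineID == id
}

// evict removes the entry, as an expiry loop might.
func (s *store) evict(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.entries, key)
}

func main() {
	s := newStore()
	first, _ := s.claim("web")
	s.evict("web")              // expiry deletes the entry while "first" is still running
	second, _ := s.claim("web") // a later request recreates the entry
	fmt.Println("first still owner:", s.stillOwner("web", first))
	fmt.Println("second still owner:", s.stillOwner("web", second))
}
```

With a boolean flag, the first goroutine would see "true" after the entry was recreated and keep running alongside its replacement; comparing identities avoids that.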

78 Commits