mirror of https://github.com/facebook/rocksdb.git
37 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
anand76 | f48b64460e |
Provide a way to invoke a callback for a Cache handle (#12987)
Summary: Add the `ApplyToHandle` method to the `Cache` interface to allow a caller to request the invocation of a callback on the given cache handle. The goal here is to allow a cache that manages multiple cache instances to use a callback on a handle to determine which instance it belongs to. For example, the callback can hash the key and use that to pick the correct target instance. This is useful to redirect methods like `Ref` and `Release`, which don't know the cache key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12987 Reviewed By: pdillinger Differential Revision: D62151907 Pulled By: anand1976 fbshipit-source-id: e4ffbbb96eac9061d2ab0e7e1739eea5ebb1cd58 |
|
Peter Dillinger | 54cb9c77d9 |
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary: The following are risks associated with pointer-to-pointer reinterpret_cast: * Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do. * Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally. I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement: * Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have `struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic. * Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance. With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain. A couple of related interventions included here: * Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle. * Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse). Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work. I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308 Test Plan: existing tests, CI Reviewed By: ltamasi Differential Revision: D53204947 Pulled By: pdillinger fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2 |
|
Peter Dillinger | 88bc91f3cc |
Cap eviction effort (CPU under stress) in HyperClockCache (#12141)
Summary: HyperClockCache is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list. All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem. However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable (they are pinned). Unlike LRU which can find any evictable entries immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system. To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141 Test Plan: unit test included ## Performance test We can use cache_bench to validate no regression (CPU and memory) in normal operation, and to measure change in behavior when cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well). Base command reproducing bad behavior: ``` ./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7 ``` ``` Before, LRU (alternate baseline not exhibiting bad behavior): Rough parallel ops/sec = 2290997 1088060 maxresident Before, AutoHCC (bad behavior): Rough parallel ops/sec = 141011 <- Yes, more than 10x slower 1083932 maxresident ``` Now let us sample a range of values in the solution space: ``` After, AutoHCC, eviction_effort_cap = 1: Rough parallel ops/sec = 3212586 2402216 maxresident After, AutoHCC, eviction_effort_cap = 10: Rough parallel ops/sec = 2371639 1248884 maxresident After, AutoHCC, eviction_effort_cap = 30: Rough parallel ops/sec = 1981092 1131596 maxresident After, AutoHCC, eviction_effort_cap = 100: Rough parallel ops/sec = 1446188 1090976 maxresident After, AutoHCC, eviction_effort_cap = 1000: Rough parallel ops/sec = 549568 1084064 maxresident ``` I looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so is chosen as the default. ``` Change to -pinned_ratio=0.85 Before, LRU: Rough parallel ops/sec = 2108373 1078232 maxresident Before, AutoHCC, averaged over ~20 runs: Rough parallel ops/sec = 2164910 1077312 maxresident After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs: Rough parallel ops/sec = 2145542 1077216 maxresident ``` The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress. ``` Change to -pinned_ratio=0.25 (low stress) Before, AutoHCC, averaged over ~20 runs: Rough parallel ops/sec = 2221149 1076540 maxresident After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs: Rough parallel ops/sec = 2224521 1076664 maxresident ``` No measurable difference under normal circumstances. Some tests repeated with FixedHCC, with similar results. Reviewed By: anand1976 Differential Revision: D52174755 Pulled By: pdillinger fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8 |
|
Peter Dillinger | 65cde19f40 |
Safer wrapper for std::atomic, use in HCC (#12051)
Summary: See new atomic.h file comments for motivation. I have updated HyperClockCache to use the new atomic wrapper, fixing a few cases where an implicit conversion was accidentally used and therefore mixing std::memory_order_seq_cst where release/acquire ordering (or relaxed) was intended. There probably wasn't a real bug because I think all the cases happened to be in single-threaded contexts like constructors/destructors or statistical ops like `GetCapacity()` that don't need any particular ordering constraints. Recommended follow-up: * Replace other uses of std::atomic to help keep them safe from bugs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12051 Test Plan: Did some local correctness stress testing with cache_bench. Also triggered 15 runs of fbcode_blackbox_crash_test and saw no related failures (just 3 failures in ~CacheWithSecondaryAdapter(), already known) No performance difference seen before & after running simultaneously: ``` (while ./cache_bench -cache_type=fixed_hyper_clock_cache -populate_cache=0 -cache_size=3000000000 -ops_per_thread=500000 -threads=12 -histograms=0 2>&1 | grep parallel; do :; done) | awk '{ s += $3; c++; print "Avg time: " (s/c);}' ``` ... for both fixed_hcc and auto_hcc. Reviewed By: jowlyzhang Differential Revision: D51090518 Pulled By: pdillinger fbshipit-source-id: eeb324facb3185584603f9ea0c4de6f32919a2d7 |
|
Peter Dillinger | 16ae3548a2 |
AutoHCC: Improve/fix allocation/detection of grow homes (#12047)
Summary: This change simplifies some code and logic by introducing a new atomic field that tracks the next slot to grow into. It should offer slightly better performance during the growth phase (not measurable; see Test Plan below) and fix a suspected (but unconfirmed) bug like this: * Thread 1 is in non-trivial SplitForGrow() with grow_home=n. * Thread 2 reaches Grow() with grow_home=2n, and waits at the start of SplitForGrow() for the rewrite lock on n. By this point, the head at 2n is marked with the new shift amount but no chain is locked. * Thread 3 reaches Grow() with grow_home=4n, and waits before SplitForGrow() for the rewrite lock on n. By this point, the head at 4n is marked with the new shift amount but no chain is locked. * Thread 4 reaches Grow() with grow_home=8n and meets no resistance to proceeding through a SplitForGrow() on an empty chain, permanently missing out on any entries from chain n that should have ended up here. This is fixed by not updating the shift amount at the grow_home head until we have checked the preconditions that Grow()s feeding into this one have completed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12047 Test Plan: Some manual cache_bench stress runs, and about 20 triggered runs of fbcode_blackbox_crash_test No discernible performance difference on this benchmark, running before & after in parallel for a few minutes: ``` (while ./cache_bench -cache_type=auto_hyper_clock_cache -populate_cache=0 -cache_size=3000000000 -ops_per_thread=50000 -threads=12 -histograms=0 2>&1 | grep parallel; do :; done) | awk '{ s += $3; c++; print "Avg time: " (s/c);}' ``` Reviewed By: jowlyzhang Differential Revision: D51017007 Pulled By: pdillinger fbshipit-source-id: 5f6d6a6194fc966f94693f3205ed75c87cdad269 |
|
Peter Dillinger | f6cb763409 |
Fix major performance bug in AutoHCC growth phase (#11871)
Summary: ## The Problem Mark Callaghan found a performance bug in yet-unreleased AutoHCC (which should have been found in my own testing). The observed behavior is very slow insertion performance as the table is growing into a very large structure. The root cause is the precarious combination of linear hashing (indexing into the table while allowing growth) and linear probing (for finding an empty slot to insert into). Naively combined, this is a disaster because in linear hashing, part of the table is twice as dense as first probing location as the rest. Thus, even a modest load factor like 0.6 could cause the dense part of the table to degrade to linear search. The code had a correction for this imbalance, which works in steady-state operation, but failed to account for the concentrating effect of table growth. Specifically, newly-added slots were underpopulated which allowed old slots to become over-populated and degrade to linear search, even in single-threaded operation. Here's an example: ``` ./cache_bench -cache_type=auto_hyper_clock_cache -threads=1 -populate_cache=0 -value_bytes=500 -cache_size=3000000000 -histograms=0 -report_problems -ops_per_thread=20000000 -resident_ratio=0.6 ``` AutoHCC: Complete in 774.213 s; Rough parallel ops/sec = 25832 FixedHCC: Complete in 19.630 s; Rough parallel ops/sec = 1018840 LRUCache: Complete in 25.842 s; Rough parallel ops/sec = 773947 ## The Fix One small change is apparently sufficient to fix the problem, but I wanted to re-optimize the whole "finding a good empty slot" algorithm to improve safety margins for good performance and to improve typical case performance. The small change is to track the newly-added slot from Grow in Insert, when applicable, and use that slot for insertion if (a) the home slot is already occupied, and (b) the newly-added slot is empty. This appears to sufficiently load new slots while avoiding over-population of either old or new slots. See `likely_empty_slot`. However I've also made the logic much more resilient to parts of the table becoming over-populated. I tested a variant that used double hashing instead of linear probing and found that hurt steady-state average-case performance, presumably due to loss of locality in the chains. And even conventional double hashing might not be ideally robust against density skew in the table (still present because of home location bias), because double hashing might choose a small increment that could take a long time to iterate to the under-populated part of the table. The compromise that seems to bring the best of each approach is this: do linear probing (+1 at a time) within a small bound (chosen bound of 4 based on performance testing) and then fall back on a double-hashing variant if no slot has been found. The double-hashing variant uses a probing increment that is always close to the golden ratio, relative to the table size, so that any under-populated regions of the table can be found relatively quickly, without introducing any additional skew. And the increment is varied slightly to avoid clustering effects that could happen with a fixed increment (regardless of how big it is). And that leaves us with one remaining problem: the double hashing increment might not be relatively prime to the table size, so the probing sequence might be a cycle that does not cover the full set of slots. To solve this we can use a technique I developed many years ago (probably also developed by others) that simply adds one (in modular arithmetic) whenever we finish a (potentially incomplete) cycle. This is a simple and reasonably efficient way to iterate over all the slots without repetition, regardless of whether the increment is not relatively prime to the table size, or even zero. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11871 Test Plan: existing correctness tests, especially ClockCacheTest.ClockTableFull Intended follow-up: make ClockTableFull test more complete for AutoHCC ## Performance Ignoring old AutoHCC performance, as we established above it could be terrible. FixedHCC and LRUCache are unaffected by this change. All tests below include this change. ### Getting up to size, single thread (same cache_bench command as above, all three run at same time) AutoHCC: Complete in 26.724 s; Rough parallel ops/sec = 748400 FixedHCC: Complete in 19.987 s; Rough parallel ops/sec = 1000631 LRUCache: Complete in 28.291 s; Rough parallel ops/sec = 706939 Single-threaded faster than LRUCache (often / sometimes) is good. FixedHCC has an obvious advantage because it starts at full size. ### Multiple threads, steady state, high hit rate ~95% Using `-threads=10 -populate_cache=1 -ops_per_thread=10000000` and still `-resident_ratio=0.6` AutoHCC: Complete in 48.778 s; Rough parallel ops/sec = 2050119 FixedHCC: Complete in 46.569 s; Rough parallel ops/sec = 2147329 LRUCache: Complete in 50.537 s; Rough parallel ops/sec = 1978735 ### Multiple threads, steady state, low hit rate ~50% Change to `-resident_ratio=0.2` AutoHCC: Complete in 49.264 s; Rough parallel ops/sec = 2029884 FixedHCC: Complete in 49.750 s; Rough parallel ops/sec = 2010041 LRUCache: Complete in 53.002 s; Rough parallel ops/sec = 1886713 Don't expect AutoHCC to be consistently faster than FixedHCC, but they are at least similar in these benchmarks. Reviewed By: jowlyzhang Differential Revision: D49548534 Pulled By: pdillinger fbshipit-source-id: 263e4f4d71d0e9a7d91db3795b48fad75408822b |
|
Peter Dillinger | fe3405e80f |
Automatic table sizing for HyperClockCache (AutoHCC) (#11738)
Summary: This change add an experimental next-generation HyperClockCache (HCC) with automatic sizing of the underlying hash table. Both the existing version (stable) and the new version (experimental for now) of HCC are available depending on whether an estimated average entry charge is provided in HyperClockCacheOptions. Internally, we call the two implementations AutoHyperClockCache (new) and FixedHyperClockCache (existing). The performance characteristics and much of the underlying logic are similar enough that AutoHCC is likely to make FixedHCC obsolete, and so it's best considered an evolution of the same technology or solution rather than an alternative. More specifically, both implementations share essentially the same logic for managing the state of individual entries in the cache, including metadata for reference counting and counting clocks for eviction. This metadata, which I like to call the "low-level HCC protocol," includes a read-write lock on entries, but relaxed consistency requirements on the cache (e.g. allowing rare duplication) means high-level cache operations never wait for these low-level per-entry locks. FixedHCC is fully wait-free. AutoHCC is different in how entries are indexed into an efficient hash table. AutoHCC is "essentially wait-free" as there is no pattern of typical high-level operations on a large cache that can lead to one thread waiting on another to complete some work, though it can happen in some unusual/unlucky cases, or atypical uses such as erasing specific cache keys. Table growth and entry reclamation is more complex in AutoHCC compared to FixedHCC, so uses some localized locking to manage that. AutoHCC uses linear hashing to grow the table as needed, with low latency and to a precise size. AutoHCC depends on anonymous mmap support from the OS (currently verified working on Linux, MacOS, and Windows) to allow the array underlying a hash table to grow in place without wasting resident memory on space reserved but unused. AutoHCC uses a form of chaining while FixedHCC uses open addressing and double hashing. More specifics: * In developing this PR, a rare availability bug (minor) was noticed in the existing HCC implementation of Release()+erase_if_last_ref, which is now inherited into AutoHCC. Fixing this without a performance regression will not be simple, so is left for follow-up work. * Some existing unit tests required adjustment of operational parameters or conditions to work with the new behaviors of AutoHCC. A number of bugs were found and fixed in the validation process, including getting unit tests in good working order. * Added an option to cache_bench, `-degenerate_hash_bits` for correctness stress testing described below. For this, the tool uses the reverse-engineered hash function for HCC to generate keys in which the specified number of hash bits, in critical positions, have a fixed value. Essentially each degenerate hash bit will half the number of chain heads utilized and double the average chain length. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11738 Test Plan: unit tests updated, and already added to db crash test. Also ## Correctness The code includes generous assertions to check for unexpected states, especially at destruction time, so should be able to detect critical concurrency bugs. Less serious "availability bugs" in which cache data is hidden or cleanly lost are more difficult to detect, but also less scary for data correctness (as long as performance is good and the design is sound). In average operation, the structure is extremely low stress and low contention (see next section) so stressing the corner case logic requires artificially stressing the operating conditions. First, we keep the structure small to increase the number of threads hitting the same chain or entry, and just one cache shard. Second, we artificially degrade the hashing so that chains are much longer than typical, using the new `-degenerate_hash_bits` option to cache_bench. Third, we re-create the structure from scratch frequently in order to exercise the Grow logic repeatedly and to get the benefit of the consistency checks in the structure's destructor in debug builds. For cache_bench this also means disabling the single-threaded "populate cache" step (normally used for steady state performance testing). And of course use many more threads than cores to have many preemptions. An effective test for working out bugs was this (using debug build of course): ``` while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=8000000 -threads=100 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done ``` Or even smaller cases. This setup has around 27 utilized chains, with around 35 entries each, and yield-waits more than 1 million times per second (very high contention; see next section). I have let this run for hours searching for any lingering issues. I've also run cache_bench under ASAN, UBSAN, and TSAN. ## Essentially wait free There is a counter for number of yield() calls when one thread is waiting on another. When we pre-populate the structure in a single thread, ``` ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -populate_cache=1 -ops_per_thread=200000 2>&1 | grep Yield ``` We see something on the order of 1 yield call per second across 16 threads, even when we load the system other other jobs (parallel compilation). With -populate_cache=0, there are more yield opportunities with parallel table growth. On an otherwise unloaded system, we still see very small (single digit) yield counts, with a chance of getting into the thousands, and getting into 10s of thousands per second during table growth phase if the system is loaded with other jobs. However, I am not worried about this if performance is still good (see next section). ## Overall performance Although cache_bench initially suggested performance very close to FixedHCC, there was a very noticeable performance hit under a db_bench setup like used in validating https://github.com/facebook/rocksdb/issues/10626. Much of the difference has been reduced by optimizing Lookup with a "naive" pass that will almost always find entries quickly, and only falling back to the careful Lookup algorithm when not found in the first pass. Setups (chosen to be sensitive to block cache performance), and compiled with USE_CLANG=1 JEMALLOC=1 PORTABLE=0 DEBUG_LEVEL=0: ``` TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` ### No regression on FixedHCC Running before & after builds at the same time on a 48 core machine. ``` TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=readrandom[-X10],block_cache_entry_stats,cache_report_problems -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=24 -cache_type=fixed_hyper_clock_cache -seed=1234 ``` Before: readrandom [AVG 10 runs] : 847234 (± 8150) ops/sec; 59.2 (± 0.6) MB/sec 703MB max RSS After: readrandom [AVG 10 runs] : 851021 (± 7929) ops/sec; 59.5 (± 0.6) MB/sec 706MB max RSS Probably no material difference. ### Single-threaded performance Using `[-X2]` and `-threads=1` and `-duration=30`, running all three at the same time: lru_cache: 55100 ops/sec, then 55862 ops/sec (627MB max RSS) fixed_hyper_clock_cache: 60496 ops/sec, then 61231 ops/sec (626MB max RSS) auto_hyper_clock_cache: 47560 ops/sec, then 56081 ops/sec (626MB max RSS) So AutoHCC has more ramp-up cost in the first pass as the cache grows to the appropriate size. (In single-threaded operation, the parallelizability and per-op low latency of table growth is overall slower.) However, once up to size, its performance is comparable to LRUCache. FixedHCC's lean operations still win overall when a good estimate is available. If we look at HCC table stats, we can see that this configuration is not favorable to AutoHCC (and I have verified that other memory sizes do not yield substantially different results, until shards are under-sized for the full filters): FixedHCC: Slot occupancy stats: Overall 47% (124991/262144), Min/Max/Window = 28%/64%/500, MaxRun{Pos/Neg} = 17/22 AutoHCC: Slot occupancy stats: Overall 59% (125781/209682), Min/Max/Window = 43%/82%/500, MaxRun{Pos/Neg} = 76/16 Head occupancy stats: Overall 43% (92259/209682), Min/Max/Window = 24%/74%/500, MaxRun{Pos/Neg} = 19/26 Entries at home count: 53350 FixedHCC configuration is relatively good for speed, and not ideal for space utilization. As is typical, AutoHCC has tighter control on metadata usage (209682 x 64 bytes rather than 262144 x 64 bytes), and the higher load factor is slightly worse for speed. LRUCache also has more metadata usage, at 199680 x 96 bytes of tracked metadata (plus roughly another 10% of that untracked in the head pointers), and that metadata is subject to fragmentation. ### Parallel performance, high hit rate Now using `[-X10]` and `-threads=10`, all three at the same time lru_cache: [AVG 10 runs] : 263629 (± 1425) ops/sec; 18.4 (± 0.1) MB/sec 655MB max RSS, 97.1% cache hit rate fixed_hyper_clock_cache: [AVG 10 runs] : 479590 (± 8114) ops/sec; 33.5 (± 0.6) MB/sec 651MB max RSS, 97.1% cache hit rate auto_hyper_clock_cache: [AVG 10 runs] : 418687 (± 5915) ops/sec; 29.3 (± 0.4) MB/sec 657MB max RSS, 97.1% cache hit rate Even with just 10-way parallelism for each cache (though 30+/48 cores busy overall), LRUCache is already showing performance degradation, while AutoHCC is in the neighborhood of FixedHCC. And that brings us to the question of how AutoHCC holds up under extreme parallelism, so now independent runs with `-threads=100` (overloading 48 cores). lru_cache: 438613 ops/sec, 827MB max RSS fixed_hyper_clock_cache: 1651310 ops/sec, 812MB max RSS auto_hyper_clock_cache: 1505875 ops/sec, 821MB max RSS (Yield count: 1089 over 30s) Clearly, AutoHCC holds up extremely well under extreme parallelism, even closing some of the modest performance gap with FixedHCC. ### Parallel performance, low hit rate To get down to roughly 50% cache hit rate, we use `-cache_index_and_filter_blocks=0 -cache_size=1650000000` with `-threads=10`. Here the extra cost of running counting clock eviction, especially on the chains of AutoHCC, are evident, especially with the lower contention of cache_index_and_filter_blocks=0: lru_cache: 725231 ops/sec, 1770MB max RSS, 51.3% hit rate fixed_hyper_clock_cache: 638620 ops/sec, 1765MB max RSS, 50.2% hit rate auto_hyper_clock_cache: 541018 ops/sec, 1777MB max RSS, 50.8% hit rate Reviewed By: jowlyzhang Differential Revision: D48784755 Pulled By: pdillinger fbshipit-source-id: e79813dc087474ac427637dd282a14fa3011a6e4 |
|
anand76 | a1743e85be |
Implement a allow cache hits admission policy for the compressed secondary cache (#11713)
Summary: This PR implements a new admission policy for the compressed secondary cache, which includes the functionality of the existing policy, and also admits items evicted from the primary block cache with the hit bit set. Effectively, the new policy works as follows - 1. When an item is demoted from the primary cache without a hit, a placeholder is inserted in the compressed cache. A second demotion will insert the full entry. 2. When an item is promoted from the compressed cache to the primary cache for the first time, a placeholder is inserted in the primary. The second promotion inserts the full entry, while erasing it form the compressed cache. 3. If an item is demoted from the primary cache with the hit bit set, it is immediately inserted in the compressed secondary cache. The ```TieredVolatileCacheOptions``` has been updated with a new option, ```adm_policy```, which allows the policy to be selected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11713 Reviewed By: pdillinger Differential Revision: D48444512 Pulled By: anand1976 fbshipit-source-id: b4cbf8c169a88097dff08e36e8bc4b3088de1492 |
|
Peter Dillinger | ef6f025563 |
Placeholder for AutoHyperClockCache, more (#11692)
Summary: * The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation. * Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache). * Simplify some existing tests not to depend directly on cache type. * Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation. * Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test. * Add another templated class BaseHyperClockCache to reduce future copy-paste * Added ReportProblems support to cache_bench * Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC: ``` 2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17 ``` In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692 Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in. Reviewed By: jowlyzhang Differential Revision: D48247413 Pulled By: pdillinger fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da |
|
Peter Dillinger | 99daea3481 |
Prepare tests for new HCC naming (#11676)
Summary: I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache. This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676 Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache Reviewed By: ajkr Differential Revision: D48103165 Pulled By: pdillinger fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736 |
|
Peter Dillinger | cdb11f5ce6 |
More minor HCC refactoring + typed mmap (#11670)
Summary: More code leading up to dynamic HCC. * Small enhancements to cache_bench * Extra assertion in Unref * Improve a CAS loop in ChargeUsageMaybeEvictStrict * Put load factor constants in appropriate class * Move `standalone` field to HyperClockTable::HandleImpl because it can be encoded differently in the upcoming dynamic HCC. * Add a typed version of MemMapping to simplify some future code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11670 Test Plan: existing tests, unit test added for TypedMemMapping Reviewed By: jowlyzhang Differential Revision: D48056464 Pulled By: pdillinger fbshipit-source-id: 186b7d3105c5d6d2eb6a592369bc10a97ee14a15 |
|
Peter Dillinger | f4e4039f00 |
Add some more bit operations to internal APIs (#11660)
Summary: BottomNBits() - there is a single fast instruction for this on x86 since BMI2, but testing with godbolt indicates you need at least GCC 10 for the compiler to choose that instruction from the obvious C++ code. https://godbolt.org/z/5a7Ysd41h BitwiseAnd() - this is a convenience function that works around the language flaw that the type of the result of x & y is the larger of the two input types, when it should be the smaller. This can save some ugly static_cast. I expect to use both of these in coming HyperClockCache developments, and have applied them in a couple of places in existing code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11660 Test Plan: unit tests added Reviewed By: jowlyzhang Differential Revision: D47935531 Pulled By: pdillinger fbshipit-source-id: d148c43a1e51df4a1c549b93aaf2725a3f8d3bd6 |
|
Peter Dillinger | c41122b1a0 |
Even more HyperClockCache refactoring (#11630)
Summary: ... ahead of dynamic variant. * Introduce an Unref function for a common pattern. Cases that were previously using std::memory_order_acq_rel we doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value so do not need the acquire semantic, just release. * Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function. * Based on a linter suggesting, use const Func& parameter type instead of Func for templated callable parameters. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11630 Test Plan: existing tests, and performance test with release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster. Reviewed By: jowlyzhang Differential Revision: D47657334 Pulled By: pdillinger fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3 |
|
Peter Dillinger | 846db9d7b1 |
Refactor ClockCache ApplyToEntries (#11609)
Summary: ... ahead of planned dynamic HCC variant. This changes simplifies some logic while still enabling future code sharing between implementation variants. Detail: For complicated reasons, using a std::function parameter to `ConstApplyToEntriesRange` with a lambda argument does not play nice with templated HandleImpl. An explicit conversion to std::function would be needed for it to compile. Templating the function type is the easy work-around. Also made some functions from https://github.com/facebook/rocksdb/issues/11572 private as recommended Pull Request resolved: https://github.com/facebook/rocksdb/pull/11609 Test Plan: existing tests Reviewed By: jowlyzhang Differential Revision: D47407415 Pulled By: pdillinger fbshipit-source-id: 0f65954db16335999b78fb7d2563ec627624cef0 |
|
Peter Dillinger | b1b6f87fbe |
Some small improvements to HyperClockCache (#11601)
Summary: Stacked on https://github.com/facebook/rocksdb/issues/11572 * Minimize use of std::function and lambdas to minimize chances of compiler heap-allocating closures (unnecessary stress on allocator). It appears that converting FindSlot to a template enables inlining the lambda parameters, avoiding heap allocations. * Clean up some logic with FindSlot (FIXMEs from https://github.com/facebook/rocksdb/issues/11572) * Fix handling of rare case of probing all slots, with new unit test. (Previously Insert would not roll back displacements in that case, which would kill performance if it were to happen.) * Add an -early_exit option to cache_bench for gathering memory stats before deallocation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11601 Test Plan: unit test added for probing all slots ## Seeing heap allocations Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache` before https://github.com/facebook/rocksdb/issues/11572 vs. after this change. Before, we see this in the interesting bin statistics: ``` size nrequests ---- --------- 32 578460 64 24340 8192 578460 ``` And after: ``` size nrequests ---- --------- 32 (insignificant) 64 24370 8192 579130 ``` ## Performance test Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench` Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000` in before and after configurations, simultaneously: ``` Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442 After: Complete in 32.773 s; Rough parallel ops/sec = 2441019 ``` Reviewed By: jowlyzhang Differential Revision: D47375092 Pulled By: pdillinger fbshipit-source-id: 46f0f57257ddb374290a0a38c651764ea60ba410 |
|
Peter Dillinger | c3c84b3397 |
Refactor (Hyper)ClockCache code for upcoming changes (#11572)
Summary: Separate out some functionality that will be common to both static and dynamic HCC into BaseClockTable. Table::InsertState and GrowIfNeeded will be used by the dynamic HCC so don't make much sense right now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11572 Test Plan: existing tests. No functional changes intended. Performance test in subsequent PR https://github.com/facebook/rocksdb/issues/11601 Reviewed By: jowlyzhang Differential Revision: D47110496 Pulled By: pdillinger fbshipit-source-id: 379bd433322a42ea28c0043b41ec24956d21e7aa |
|
Peter Dillinger | f4a02f2c52 |
Add hash_seed to Caches (#11391)
Summary: See motivation and description in new ShardedCacheOptions::hash_seed option. Updated db_bench so that its seed param is used for the cache hash seed. Made its code more safe to ensure seed is set before use. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11391 Test Plan: unit tests added / updated **Performance** - no discernible difference seen running cache_bench repeatedly before & after. With lru_cache and hyper_clock_cache. Reviewed By: hx235 Differential Revision: D45557797 Pulled By: pdillinger fbshipit-source-id: 40bf4da6d66f9d41a8a0eb8e5cf4246a4aa07934 |
|
Peter Dillinger | 41a7fbf758 |
Avoid long parameter lists configuring Caches (#11386)
Summary: For better clarity, encouraging more options explicitly specified using fields rather than positionally via constructor parameter lists. Simplifies code maintenance as new fields are added. Deprecate some cases of the confusing pattern of NewWhatever() functions returning shared_ptr. Net reduction of about 70 source code lines (including comments). Pull Request resolved: https://github.com/facebook/rocksdb/pull/11386 Test Plan: existing tests Reviewed By: ajkr Differential Revision: D45059075 Pulled By: pdillinger fbshipit-source-id: d53fa09b268024f9c55254bb973b6c69feebf41a |
|
Peter Dillinger | 204fcff751 |
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5 |
|
Peter Dillinger | 601efe3cf2 |
Misc cleanup of block cache code (#11291)
Summary: ... ahead of a larger change. * Rename confusingly named `is_in_sec_cache` to `kept_in_sec_cache` * Unify naming of "standalone" block cache entries (was "detached" in clock_cache) * Remove some unused definitions in clock_cache.h (leftover from a previous revision) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11291 Test Plan: usual tests and CI, no behavior changes Reviewed By: anand1976 Differential Revision: D43984642 Pulled By: pdillinger fbshipit-source-id: b8bf0c5b90a932a88bcbdb413b2f256834aedf97 |
|
Peter Dillinger | 9f7801c5f1 |
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432 |
|
Peter Dillinger | 3182beeffc |
Observe and warn about misconfigured HyperClockCache (#10965)
Summary: Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused. Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965 Test Plan: unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not. Also manual testing with db_bench. Create db with ``` ./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1 ``` Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge) ``` ./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX ``` Seeing warnings / errors or not as expected. Reviewed By: anand1976 Differential Revision: D41406932 Pulled By: pdillinger fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7 |
|
Peter Dillinger | 8c0f5b1fcf |
Mark HyperClockCache as production-ready (#10963)
Summary: After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963 Test Plan: existing tests, comment changes only Reviewed By: siying Differential Revision: D41384083 Pulled By: pdillinger fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d |
|
Peter Dillinger | cc8c8f6958 |
Refactor (Hyper)ClockCache code (#10887)
Summary: For clean-up and in preparation for some other anticipated changes, including * A new dynamically-scaling variant of HyperClockCache * SecondaryCache support for HyperClockCache This change does some refactoring for current and future code sharing and reusability. (Including follow-up on https://github.com/facebook/rocksdb/issues/10843) ## clock_cache.h * TBD whether new variant will be a HyperClockCache or use some other name, so namespace is just clock_cache for the family of structures. * A number of helper functions introduced and used. * Pre-emptively split ClockHandle (shared among lock-free clock cache variants) and HandleImpl (specific to a kind of Table), and introduce template to plug new Table implementation into ClockCacheShard. ## clock_cache.cc * Mostly using helper functions. Some things like `Rollback()` and `FreeDataMarkEmpty()` were not combined because `Rollback()` is Table-specific while `FreeDataMarkEmpty()` can be used with different table implementations. * Performance testing indicated that despite more opportunities for parallelism, making a local copy of handle data for processing after marking an entry empty was slower than doing that processing before marking the entry empty (but after marking it "under construction"), thus avoiding a few words of copying data. At least for now, this answers the "TODO? Delay freeing?" questions (no). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10887 Test Plan: fixed a unit testing gap; other minor test updates for refactoring No functionality change ## Performance Same setup as https://github.com/facebook/rocksdb/issues/10801: Before: `readrandom [AVG 81 runs] : 627992 (± 5124) ops/sec` After: `readrandom [AVG 81 runs] : 637512 (± 4866) ops/sec` I've been getting some inconsistent results on restarts like the system is not being fair to the two processes, so I'm not sure there's such a real difference. Reviewed By: anand1976 Differential Revision: D40959240 Pulled By: pdillinger fbshipit-source-id: 0a8f3646b3bdb5bc7aaad60b26790b0779189949 |
|
Peter Dillinger | 7555243bcf |
Refactor ShardedCache for more sharing, static polymorphism (#10801)
Summary: The motivations for this change include * Free up space in ClockHandle so that we can add data for secondary cache handling while still keeping within single cache line (64 byte) size. * This change frees up space by eliminating the need for the `hash` field by making the fixed-size key itself a hash, using a 128-bit bijective (lossless) hash. * Generally more customizability of ShardedCache (such as hashing) without worrying about virtual call overheads * ShardedCache now uses static polymorphism (template) instead of dynamic polymorphism (virtual overrides) for the CacheShard. No obvious performance benefit is seen from the change (as mostly expected; most calls to virtual functions in CacheShard could already be optimized to static calls), but offers more flexibility without incurring the runtime cost of adhering to a common interface (without type parameters or static callbacks). * You'll also notice less `reinterpret_cast`ing and other boilerplate in the Cache implementations, as this can go in ShardedCache. More detail: * Don't have LRUCacheShard maintain `std::shared_ptr<SecondaryCache>` copies (extra refcount) when LRUCache can be in charge of keeping a `shared_ptr`. * Renamed `capacity_mutex_` to `config_mutex_` to better represent the scope of what it guards. * Some preparation for 64-bit hash and indexing in LRUCache, but didn't include the full change because of slight performance regression. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10801 Test Plan: Unit test updates were non-trivial because of major changes to the ClockCacheShard interface in handling of key vs. hash. Performance: Create with `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16` Test with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X1000] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=16 ``` Before: `readrandom [AVG 150 runs] : 321147 (± 253) ops/sec` After: `readrandom [AVG 150 runs] : 321530 (± 326) ops/sec` So possibly ~0.1% improvement. And with `-cache_type=hyper_clock_cache`: Before: `readrandom [AVG 30 runs] : 614126 (± 7978) ops/sec` After: `readrandom [AVG 30 runs] : 645349 (± 8087) ops/sec` So roughly 5% improvement! Reviewed By: anand1976 Differential Revision: D40252236 Pulled By: pdillinger fbshipit-source-id: ff8fc70ef569585edc95bcbaaa0386f61355ae5b |
|
Peter Dillinger | 0f91c72adc |
Call experimental new clock cache HyperClockCache (#10684)
Summary: This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons: * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache." * We want to highlight the key feature: it's fast (especially under parallel load) * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements. * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.) Some API detail: * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`. * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated. For performance tests see https://github.com/facebook/rocksdb/pull/10626 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684 Test Plan: no interesting functional changes; tests updated Reviewed By: anand1976 Differential Revision: D39547800 Pulled By: pdillinger fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf |
|
Peter Dillinger | 5724348689 |
Revamp, optimize new experimental clock cache (#10626)
Summary: * Consolidates most metadata into a single word per slot so that more can be accomplished with a single atomic update. In the common case, Lookup was previously about 4 atomic updates, now just 1 atomic update. Common case Release was previously 1 atomic read + 1 atomic update, now just 1 atomic update. * Eliminate spins / waits / yields, which likely threaten some "lock free" benefits. Compare-exchange loops are only used in explicit Erase, and strict_capacity_limit=true Insert. Eviction uses opportunistic compare- exchange. * Relaxes some aggressiveness and guarantees. For example, * Duplicate Inserts will sometimes go undetected and the shadow duplicate will age out with eviction. * In many cases, the older Inserted value for a given cache key will be kept (i.e. Insert does not support overwrite). * Entries explicitly erased (rather than evicted) might not be freed immediately in some rare cases. * With strict_capacity_limit=false, capacity limit is not tracked/enforced as precisely as LRUCache, but is self-correcting and should only deviate by a very small number of extra or fewer entries. * Use smaller "computed default" number of cache shards in many cases, because benefits to larger usage tracking / eviction pools outweigh the small cost of more lock-free atomic contention. The improvement in CPU and I/O is dramatic in some limit-memory cases. * Even without the sharding change, the eviction algorithm is likely more effective than LRU overall because it's more stateful, even though the "hot path" state tracking for it is essentially free with ref counting. It is like a generalized CLOCK with aging (see code comments). I don't have performance numbers showing a specific improvement, but in theory, for a Poisson access pattern to each block, keeping some state allows better estimation of time to next access (Poisson interval) than strict LRU. The bounded randomness in CLOCK can also reduce "cliff" effect for repeated range scans approaching and exceeding cache size. ## Hot path algorithm comparison Rough descriptions, focusing on number and kind of atomic operations: * Old `Lookup()` (2-5 atomic updates per probe): ``` Loop: Increment internal ref count at slot If possible hit: Check flags atomic (and non-atomic fields) If cache hit: Three distinct updates to 'flags' atomic Increment refs for internal-to-external Return Decrement internal ref count while atomic read 'displacements' > 0 ``` * New `Lookup()` (1-2 atomic updates per probe): ``` Loop: Increment acquire counter in meta word (optimistic) If visible entry (already read meta word): If match (read non-atomic fields): Return Else: Decrement acquire counter in meta word Else if invisible entry (rare, already read meta word): Decrement acquire counter in meta word while atomic read 'displacements' > 0 ``` * Old `Release()` (1 atomic update, conditional on atomic read, rarely more): ``` Read atomic ref count If last reference and invisible (rare): Use CAS etc. to remove Return Else: Decrement ref count ``` * New `Release()` (1 unconditional atomic update, rarely more): ``` Increment release counter in meta word If last reference and invisible (rare): Use CAS etc. to remove Return ``` ## Performance test setup Build DB with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` Test with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics ``` Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations: base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6) folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry) gt_clock: experimental ClockCache before this change new_clock: experimental ClockCache with this change ## Performance test results First test "hot path" read performance, with block cache large enough for whole DB: 4181MB 1thread base -> kops/s: 47.761 4181MB 1thread folly -> kops/s: 45.877 4181MB 1thread gt_clock -> kops/s: 51.092 4181MB 1thread new_clock -> kops/s: 53.944 4181MB 16thread base -> kops/s: 284.567 4181MB 16thread folly -> kops/s: 249.015 4181MB 16thread gt_clock -> kops/s: 743.762 4181MB 16thread new_clock -> kops/s: 861.821 4181MB 24thread base -> kops/s: 303.415 4181MB 24thread folly -> kops/s: 266.548 4181MB 24thread gt_clock -> kops/s: 975.706 4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944) 4181MB 32thread base -> kops/s: 311.251 4181MB 32thread folly -> kops/s: 274.952 4181MB 32thread gt_clock -> kops/s: 1045.98 4181MB 32thread new_clock -> kops/s: 1370.38 4181MB 48thread base -> kops/s: 310.504 4181MB 48thread folly -> kops/s: 268.322 4181MB 48thread gt_clock -> kops/s: 1195.65 4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944) 4181MB 64thread base -> kops/s: 307.839 4181MB 64thread folly -> kops/s: 272.172 4181MB 64thread gt_clock -> kops/s: 1204.47 4181MB 64thread new_clock -> kops/s: 1615.37 4181MB 128thread base -> kops/s: 310.934 4181MB 128thread folly -> kops/s: 267.468 4181MB 128thread gt_clock -> kops/s: 1188.75 4181MB 128thread new_clock -> kops/s: 1595.46 Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x. Now test a large block cache with low miss ratio, but some eviction is required: 1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23 1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43 1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4 1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56 1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59 1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8 1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89 1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45 1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98 1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91 1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26 1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63 610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137 610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996 610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934 610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5 610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402 610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742 610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062 610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453 610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457 610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426 610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273 610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812 The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.) Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc. 233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371 233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293 233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844 233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461 233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227 233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738 233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688 233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402 233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84 233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785 233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94 233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016 89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086 89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984 89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441 89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754 89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812 89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418 89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422 89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293 89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43 89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824 89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32 89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223 ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.) Even smaller cache size: 34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914 34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281 34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523 34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125 34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48 34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531 34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465 34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793 34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484 34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457 34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41 34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52 As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn: 13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328 13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633 13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684 13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383 13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492 13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863 13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121 13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758 13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539 13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098 13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77 13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27 gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention: 13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852 13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516 13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688 13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707 13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57 13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219 13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871 13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626 Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN Reviewed By: anand1976 Differential Revision: D39368406 Pulled By: pdillinger fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9 |
|
Guido Tagliavini Ponce | a0798f6f92 |
Enable ClockCache in DB block cache test (#10482)
Summary: A test in db_block_cache_test.cc was skipping ClockCache due to the 16-byte key length requirement. We fixed this. Along the way, we fixed a bug in ApplyToSomeEntries, which assumed the function being applied could modify handle metadata, and thus took an exclusive reference. This is incompatible with calls that need to inspect every element (including externally referenced ones) to gather stats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10482 Test Plan: ``make -j24 check`` Reviewed By: anand1976 Differential Revision: D38553073 Pulled By: guidotag fbshipit-source-id: 0ed63fed4d3b89e5056b35b7091fce579f5647ae |
|
Guido Tagliavini Ponce | d976f68977 |
Fix assertion failure and memory leak in ClockCache. (#10430)
Summary: This fixes two issues: - [T127355728](https://www.internalfb.com/intern/tasks/?t=127355728): In the stress tests, when the ClockCache is operating close to full capacity and a burst of inserts are concurrently executed, every slot in the hash table may become occupied. This contradicts an assertion in the code, which is no longer valid in the lock-free setting. We are removing that assertion and handling the case of an insertion into a full table. - [T127427659](https://www.internalfb.com/intern/tasks/?t=127427659): There was a memory leak when an insertion is performed over capacity, but no handle is provided. In that case, a handle was dynamically allocated, but the pointer wasn't stored anywhere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10430 Test Plan: - ``make -j24 check`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` Reviewed By: pdillinger Differential Revision: D38226114 Pulled By: guidotag fbshipit-source-id: 18f6ab7e6214e11e9721d5ff289db1bf795d0008 |
|
Guido Tagliavini Ponce | 9d7de6517c |
Towards a production-quality ClockCache (#10418)
Summary: In this PR we bring ClockCache closer to production quality. We implement the following changes: 1. Fixed a few bugs in ClockCache. 2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table. 3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test where either the LRU policy is required, and a test that dynamically increases the table capacity. 4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.) 5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache. As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418 Test Plan: - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` Reviewed By: pdillinger Differential Revision: D38170673 Pulled By: guidotag fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a |
|
Guido Tagliavini Ponce | 6a160e1fec |
Lock-free ClockCache (#10390)
Summary: ClockCache completely free of locks. As part of this PR we have also pushed clock algorithm functionality out of ClockCacheShard into ClockHandleTable, so that ClockCacheShard acts more as an interface and less as an actual data structure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10390 Test Plan: - ``make -j24 check`` - ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush`` Reviewed By: pdillinger Differential Revision: D38106945 Pulled By: guidotag fbshipit-source-id: 6cbf6bd2397dc9f582809ccff5118a8a33ea6cb1 |
|
Guido Tagliavini Ponce | efdb428edc |
Lock-free Lookup and Release in ClockCache (#10347)
Summary: This is a prototype of a partially lock-free version of ClockCache. Roughly speaking, reads are lock-free and writes are lock-based: - Lookup is lock-free. - Release is lock-free, unless (i) no references to the element are left and (ii) it was marked for deletion or ``erase_if_last_ref`` is set. - Insert and Erase still use a per-shard lock. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10347 Test Plan: - ``make -j24 check`` - ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush`` Reviewed By: pdillinger Differential Revision: D37898776 Pulled By: guidotag fbshipit-source-id: 6418fd980f786d69b871bf2fe959398e44cd3d80 |
|
Guido Tagliavini Ponce | 9645e66fc9 |
Temporarily return a LRUCache from NewClockCache (#10351)
Summary: ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable. The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify. A new version of the NewClockCache function was created for our internal tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351 Test Plan: ``make -j24 check`` and re-run the pre-release tests. Reviewed By: pdillinger Differential Revision: D37802685 Pulled By: guidotag fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972 |
|
Guido Tagliavini Ponce | c277aeb42c |
Midpoint insertions in ClockCache (#10305)
Summary: When an element is first inserted into the ClockCache, it is now assigned either medium or high clock priority, depending on whether its cache priority is low or high, respectively. This is a variant of LRUCache's midpoint insertions. The main difference is that LRUCache can specify the allocated capacity for high-priority elements via the ``high_pri_pool_ratio`` parameter. Contrarily, in ClockCache, low- and high-priority elements compete for all cache slots, and one group can take over the other (of course, it takes more low-priority insertions to push out high-priority elements). However, just as LRUCache, ClockCache provides the following guarantee: a high-priority element will not be evicted before a low-priority element that was inserted earlier in time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10305 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37607787 Pulled By: guidotag fbshipit-source-id: 24d9f2523d2f4e6415e7f0029cc061fa275c2040 |
|
Guido Tagliavini Ponce | 57a0e2f304 |
Clock cache (#10273)
Summary: This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work: - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash). - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores. - Middle insertions into the clock list. - A test that exercises the clock eviction policy. - Update the Java API of ClockCache and Java calls to C++. Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37522461 Pulled By: guidotag fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943 |
|
Siying Dong | 3c327ac2d0 |
Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75 |
|
Siying Dong | d2dce5611a |
Move some files under util/ to separate dirs
Summary: Move some files under util/ to new directories env/, monitoring/ options/ and cache/ Closes https://github.com/facebook/rocksdb/pull/2090 Differential Revision: D4833681 Pulled By: siying fbshipit-source-id: 2fd8bef |
Renamed from util/clock_cache.h (Browse further)