mirror of
https://github.com/facebook/rocksdb.git
synced 2024-11-25 22:44:05 +00:00
197 commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Peter Dillinger | 41a7fbf758 |
Avoid long parameter lists configuring Caches (#11386)
Summary: For better clarity, encouraging more options explicitly specified using fields rather than positionally via constructor parameter lists. Simplifies code maintenance as new fields are added. Deprecate some cases of the confusing pattern of NewWhatever() functions returning shared_ptr. Net reduction of about 70 source code lines (including comments). Pull Request resolved: https://github.com/facebook/rocksdb/pull/11386 Test Plan: existing tests Reviewed By: ajkr Differential Revision: D45059075 Pulled By: pdillinger fbshipit-source-id: d53fa09b268024f9c55254bb973b6c69feebf41a |
||
Zdenek Korcak | c8552d8c63 |
fix bad implementation of ShardedCache::GetOccupancyCount (#11325)
Summary: copy paste typo Pull Request resolved: https://github.com/facebook/rocksdb/pull/11325 Reviewed By: hx235 Differential Revision: D44378512 Pulled By: ajkr fbshipit-source-id: 509ed2697c06eed975914359ece0459a0ea40312 |
||
Peter Dillinger | 03ccb1cd42 |
Re-clarify SecondaryCache API (#11316)
Summary: I previously misread or misinterpreted API contracts for SecondaryCache and this should correct the record. (Follow-up item from https://github.com/facebook/rocksdb/issues/11301) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11316 Test Plan: comments only Reviewed By: anand1976 Differential Revision: D44245107 Pulled By: pdillinger fbshipit-source-id: 3f8ddec150674b75728f1730f99b963bbf7b76e7 |
||
Peter Dillinger | 204fcff751 |
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5 |
||
Peter Dillinger | ccaa3225b0 |
Simplify tracking entries already in SecondaryCache (#11299)
Summary: In preparation for factoring secondary cache support out of individual Cache implementations, we can get rid of the "in secondary cache" flag on entries through a workable hack: when an entry is promoted from secondary, it is inserted in primary using a helper that lacks secondary cache support, thus preventing re-insertion into secondary cache through existing logic. This adds to the complexity of building CacheItemHelpers, because you always have to be able to get to an equivalent helper without secondary cache support, but that complexity is reasonably isolated within RocksDB typed_cache.h and test code. gcc-7 seems to have problems with constexpr constructor referencing `this` so removed constexpr support on CacheItemHelper. Also refactored some related test code to share common code / functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11299 Test Plan: existing tests Reviewed By: anand1976 Differential Revision: D44101453 Pulled By: pdillinger fbshipit-source-id: 7a59d0a3938ee40159c90c3e65d7004f6a272345 |
||
Peter Dillinger | 601efe3cf2 |
Misc cleanup of block cache code (#11291)
Summary: ... ahead of a larger change. * Rename confusingly named `is_in_sec_cache` to `kept_in_sec_cache` * Unify naming of "standalone" block cache entries (was "detached" in clock_cache) * Remove some unused definitions in clock_cache.h (leftover from a previous revision) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11291 Test Plan: usual tests and CI, no behavior changes Reviewed By: anand1976 Differential Revision: D43984642 Pulled By: pdillinger fbshipit-source-id: b8bf0c5b90a932a88bcbdb413b2f256834aedf97 |
||
Peter Dillinger | 2a23bee963 |
Use CacheWrapper in more places (#11295)
Summary: ... to simplify code and make it less prone to needless updates on refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11295 Test Plan: existing tests (no functional changes intended) Reviewed By: hx235 Differential Revision: D44040260 Pulled By: pdillinger fbshipit-source-id: 1b6badb5c8ca673db0903bfaba3cfbc986f386be |
||
Jaepil Jeong | 969d4e1dd2 |
Fix compile errors in Clang due to unused variables depending on the build configuration (#11234)
Summary: This PR fixes compilation errors in Clang due to unused variables like the below: ``` [109/329] Building CXX object CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o FAILED: CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o ccache /opt/homebrew/opt/llvm/bin/clang++ -DGFLAGS=1 -DGFLAGS_IS_A_DLL=0 -DHAVE_FULLFSYNC -DJEMALLOC_NO_DEMANGLE -DLZ4 -DOS_MACOSX -DROCKSDB_JEMALLOC -DROCKSDB_LIB_IO_POSIX -DROCKSDB_NO_DYNAMIC_EXTENSION -DROCKSDB_PLATFORM_POSIX -DSNAPPY -DTBB -DZLIB -DZSTD -I/Users/jaepil/work/deepsearch/deps/cpp/rocksdb -I/Users/jaepil/work/deepsearch/deps/cpp/rocksdb/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -I/Users/jaepil/app/include -I/opt/homebrew/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include/c++/v1 -W -Wextra -Wall -pthread -Wsign-compare -Wshadow -Wno-unused-parameter -Wno-unused-variable -Woverloaded-virtual -Wnon-virtual-dtor -Wno-missing-field-initializers -Wno-strict-aliasing -Wno-invalid-offsetof -fno-omit-frame-pointer -momit-leaf-frame-pointer -march=armv8-a+crc+crypto -Wno-unused-function -Werror -O2 -g -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk -std=gnu++20 -MD -MT CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o -MF CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o.d -o CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o -c /Users/jaepil/work/deepsearch/deps/cpp/rocksdb/db/version_edit_handler.cc /Users/jaepil/work/deepsearch/deps/cpp/rocksdb/db/version_edit_handler.cc:30:10: error: variable 'recovered_edits' set but not used [-Werror,-Wunused-but-set-variable] size_t recovered_edits = 0; ^ 1 error generated. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11234 Reviewed By: cbi42 Differential Revision: D43458604 Pulled By: ajkr fbshipit-source-id: d8c50e1a108887b037a120cd9f19374ddaeee817 |
||
anand76 | cf09917c18 |
Add filter/index/data secondary cache hits stats (#11246)
Summary: Add more stats for better visibility into the usefulness of the secondary cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11246 Test Plan: Add a new unit test Reviewed By: akankshamahajan15 Differential Revision: D43521364 Pulled By: anand1976 fbshipit-source-id: a92f04884e738a9bf40ad4047acaaaea343838a7 |
||
mrambacher | b6640c3117 |
Remove FactoryFunc from LoadXXXObject (#11203)
Summary: The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available. With the removal of LITE mode, the function was no longer required. Note that the MergeOperator had some private classes defined in header files. To gain access to their constructors (and name methods), the class definitions were moved into header files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203 Reviewed By: cbi42 Differential Revision: D43160255 Pulled By: pdillinger fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c |
||
Peter Dillinger | 64a1f7670f |
Customize CompressedSecondaryCache by block kind (#11204)
Summary: Added `do_not_compress_roles` to `CompressedSecondaryCacheOptions` to disable compression on certain kinds of block. Filter blocks are now not compressed by CompressedSecondaryCache by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11204 Test Plan: unit test added Reviewed By: anand1976 Differential Revision: D43147698 Pulled By: pdillinger fbshipit-source-id: db496975ae975fa18f157f93fe131a16315ac875 |
||
Wentian Guo | 42d6652ba2 |
remove dependency on options.h for port_posix.h andport_win.h (#11214)
Summary: The files in `port/`, such as `port_posix.h`, are layering over the system libraries, so shouldn't include the DB-specific files like `options.h`. This PR remove this dependency. # How The reason that `port_posix.h` (or `port_win.h`) include `options.h` is to use `CpuPriority`, as there is a method `SetCpuPriority()` in `port_posix.h` that uses `CpuPriority.` - I think `SetCpuPriority()` make sense to exist in `port_posix.h` as it provides has platform-dependent implementation - `CpuPriority` enum is defined in `env.h`, but used in `rocksdb/include` and `port/`. Hence, let us define `CpuPriority` enum in a common file, say `port_defs.h`, such that both directories `rocksdb/include` and `port/` can include. When we remove this dependency, some other files have compile errors because they can't find definitions, so add header files to resolve # Test make all check -j Pull Request resolved: https://github.com/facebook/rocksdb/pull/11214 Reviewed By: pdillinger Differential Revision: D43196910 Pulled By: guowentian fbshipit-source-id: 70deccb72844cfb08fcc994f76c6ef6df5d55ab9 |
||
Peter Dillinger | 3cacd4b4ec |
Put Cache and CacheWrapper in new public header (#11192)
Summary: The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment. In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544 Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it. Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192 Test Plan: `make check` Reviewed By: anand1976 Differential Revision: D43055487 Pulled By: pdillinger fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110 |
||
sdong | 4720ba4391 |
Remove RocksDB LITE (#11147)
Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2 |
||
sdong | 2800aa069a |
Remove compressed block cache (#11117)
Summary: Compressed block cache is replaced by compressed secondary cache. Remove the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117 Test Plan: See CI passes Reviewed By: pdillinger Differential Revision: D42700164 fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d |
||
Peter Dillinger | 9f7801c5f1 |
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432 |
||
Peter Dillinger | e079d562af |
Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945)
Summary: Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945 Test Plan: existing tests for CacheDumper, added basic unit test Reviewed By: anand1976 Differential Revision: D41231497 Pulled By: pdillinger fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068 |
||
Peter Dillinger | 3182beeffc |
Observe and warn about misconfigured HyperClockCache (#10965)
Summary: Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused. Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965 Test Plan: unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not. Also manual testing with db_bench. Create db with ``` ./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1 ``` Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge) ``` ./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX ``` Seeing warnings / errors or not as expected. Reviewed By: anand1976 Differential Revision: D41406932 Pulled By: pdillinger fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7 |
||
Peter Dillinger | 8c0f5b1fcf |
Mark HyperClockCache as production-ready (#10963)
Summary: After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963 Test Plan: existing tests, comment changes only Reviewed By: siying Differential Revision: D41384083 Pulled By: pdillinger fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d |
||
Peter Dillinger | 32520df1d9 |
Remove prototype FastLRUCache (#10954)
Summary: This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954 Test Plan: tests updated Reviewed By: akankshamahajan15 Differential Revision: D41310123 Pulled By: pdillinger fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c |
||
Peter Dillinger | f321e8fc98 |
Don't attempt to use SecondaryCache on block_cache_compressed (#10944)
Summary: Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom! This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed. Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache. This problem was discovered during refactoring but didn't want to combine bug fix with that refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944 Test Plan: test added that fails on base revision (at least with ASAN) Reviewed By: akankshamahajan15 Differential Revision: D41205578 Pulled By: pdillinger fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f |
||
Peter Dillinger | cc8c8f6958 |
Refactor (Hyper)ClockCache code (#10887)
Summary: For clean-up and in preparation for some other anticipated changes, including * A new dynamically-scaling variant of HyperClockCache * SecondaryCache support for HyperClockCache This change does some refactoring for current and future code sharing and reusability. (Including follow-up on https://github.com/facebook/rocksdb/issues/10843) ## clock_cache.h * TBD whether new variant will be a HyperClockCache or use some other name, so namespace is just clock_cache for the family of structures. * A number of helper functions introduced and used. * Pre-emptively split ClockHandle (shared among lock-free clock cache variants) and HandleImpl (specific to a kind of Table), and introduce template to plug new Table implementation into ClockCacheShard. ## clock_cache.cc * Mostly using helper functions. Some things like `Rollback()` and `FreeDataMarkEmpty()` were not combined because `Rollback()` is Table-specific while `FreeDataMarkEmpty()` can be used with different table implementations. * Performance testing indicated that despite more opportunities for parallelism, making a local copy of handle data for processing after marking an entry empty was slower than doing that processing before marking the entry empty (but after marking it "under construction"), thus avoiding a few words of copying data. At least for now, this answers the "TODO? Delay freeing?" questions (no). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10887 Test Plan: fixed a unit testing gap; other minor test updates for refactoring No functionality change ## Performance Same setup as https://github.com/facebook/rocksdb/issues/10801: Before: `readrandom [AVG 81 runs] : 627992 (± 5124) ops/sec` After: `readrandom [AVG 81 runs] : 637512 (± 4866) ops/sec` I've been getting some inconsistent results on restarts like the system is not being fair to the two processes, so I'm not sure there's such a real difference. Reviewed By: anand1976 Differential Revision: D40959240 Pulled By: pdillinger fbshipit-source-id: 0a8f3646b3bdb5bc7aaad60b26790b0779189949 |
||
Denis Hananein | 9f3475eccf |
Fix compilation errors, clang++-15 (#10907)
Summary: I've tried to compile the main branch, but there are two minor things which are make CE. I'm not sure about the second one (`num_empty_non_l0_level`), probably there is should be additional assert. ``` -c ../cache/clock_cache.cc [build] ../cache/clock_cache.cc:855:15: error: variable 'i' set but not used [-Werror,-Wunused-but-set-variable] [build] for (size_t i = 0; &array_[current] != h; i++) { [build] ^ ``` ``` [build] ../db/version_set.cc:3665:7: error: variable 'num_empty_non_l0_level' set but not used [-Werror,-Wunused-but-set-variable] [build] int num_empty_non_l0_level = 0; [build] ^ [build] 1 error generated. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10907 Reviewed By: jay-zhuang Differential Revision: D40866667 Pulled By: ajkr fbshipit-source-id: 963b7bd56859d0b3b2779cd36fad229425cb7b17 |
||
Levi Tamasi | 22ff8c5af7 |
Use malloc/free for LRUHandle instead of new[]/delete[] (#10884)
Summary: It's unsafe to call `malloc_usable_size` with an address not returned by a function from the `malloc` family (see https://github.com/facebook/rocksdb/issues/10798). The patch switches from using `new[]` / `delete[]` for `LRUHandle` to `malloc` / `free`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10884 Test Plan: `make check` Reviewed By: pdillinger Differential Revision: D40738089 Pulled By: ltamasi fbshipit-source-id: ac5583f88125fee49c314639be6b6df85937fbee |
||
Peter Dillinger | 7fff38b1fe |
clang-format cache/ and util/ directories (#10867)
Summary: This is purely the result of running `clang-format -i` on files, except some files have been excluded for manual intervention in a separate PR Pull Request resolved: https://github.com/facebook/rocksdb/pull/10867 Test Plan: `make check`, `make check-headers`, `make format` Reviewed By: jay-zhuang Differential Revision: D40682086 Pulled By: pdillinger fbshipit-source-id: 8673d978553ab99b516da7fb63ba0b82523337f8 |
||
Peter Dillinger | b6e33dbc0e |
Fix HyperClockCache Rollback bug in #10801 (#10843)
Summary: In https://github.com/facebook/rocksdb/issues/10801 in ClockHandleTable::Evict, we saved a reference to the hash value (`const UniqueId64x2& hashed_key`) instead of saving the hash value itself before marking the handle as empty and thus free for use by other threads. This could lead to Rollback seeing the wrong hash value for updating the `displacements` after an entry is removed. The fix is (like other places) to copy the hash value before it's released. (We could Rollback while we own the entry, but that creates more dependences between atomic updates, because in that case, based on the code, the Rollback writes would have to happen before or after the entry is released by marking empty. By doing the relaxed Rollback after marking empty, there's more opportunity for re-ordering / ILP.) Intended follow-up: refactoring for better code sharing in clock_cache.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/10843 Test Plan: watch for clean crash test, TSAN Reviewed By: siying Differential Revision: D40579680 Pulled By: pdillinger fbshipit-source-id: 258e43b3b80bc980a161d5c675ccc6708ecb8025 |
||
Peter Dillinger | 7555243bcf |
Refactor ShardedCache for more sharing, static polymorphism (#10801)
Summary: The motivations for this change include * Free up space in ClockHandle so that we can add data for secondary cache handling while still keeping within single cache line (64 byte) size. * This change frees up space by eliminating the need for the `hash` field by making the fixed-size key itself a hash, using a 128-bit bijective (lossless) hash. * Generally more customizability of ShardedCache (such as hashing) without worrying about virtual call overheads * ShardedCache now uses static polymorphism (template) instead of dynamic polymorphism (virtual overrides) for the CacheShard. No obvious performance benefit is seen from the change (as mostly expected; most calls to virtual functions in CacheShard could already be optimized to static calls), but offers more flexibility without incurring the runtime cost of adhering to a common interface (without type parameters or static callbacks). * You'll also notice less `reinterpret_cast`ing and other boilerplate in the Cache implementations, as this can go in ShardedCache. More detail: * Don't have LRUCacheShard maintain `std::shared_ptr<SecondaryCache>` copies (extra refcount) when LRUCache can be in charge of keeping a `shared_ptr`. * Renamed `capacity_mutex_` to `config_mutex_` to better represent the scope of what it guards. * Some preparation for 64-bit hash and indexing in LRUCache, but didn't include the full change because of slight performance regression. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10801 Test Plan: Unit test updates were non-trivial because of major changes to the ClockCacheShard interface in handling of key vs. hash. Performance: Create with `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16` Test with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X1000] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=16 ``` Before: `readrandom [AVG 150 runs] : 321147 (± 253) ops/sec` After: `readrandom [AVG 150 runs] : 321530 (± 326) ops/sec` So possibly ~0.1% improvement. And with `-cache_type=hyper_clock_cache`: Before: `readrandom [AVG 30 runs] : 614126 (± 7978) ops/sec` After: `readrandom [AVG 30 runs] : 645349 (± 8087) ops/sec` So roughly 5% improvement! Reviewed By: anand1976 Differential Revision: D40252236 Pulled By: pdillinger fbshipit-source-id: ff8fc70ef569585edc95bcbaaa0386f61355ae5b |
||
Peter Dillinger | e466173d5c |
Print stack traces on frozen tests in CI (#10828)
Summary: Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off. For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828 Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI Reviewed By: hx235 Differential Revision: D40447634 Pulled By: pdillinger fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1 |
||
Peter Dillinger | b205c6d029 |
Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768)
Summary: We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.) This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768 Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test) Reviewed By: anand1976 Differential Revision: D40034747 Pulled By: anand1976 fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68 |
||
Peter Dillinger | 5f4391dda2 |
Some clean-up of secondary cache (#10730)
Summary: This is intended as a step toward possibly separating secondary cache integration from the Cache implementation as much as possible, to (hopefully) minimize code duplication in adding secondary cache support to HyperClockCache. * Major clarifications to API docs of secondary cache compatible parts of Cache. For example, previously the docs seemed to suggest that Wait() was not needed if IsReady()==true. And it wasn't clear what operations were actually supported on pending handles. * Add some assertions related to these requirements, such as that we don't Release() before Wait() (which would leak a secondary cache handle). * Fix a leaky abstraction with dummy handles, which are supposed to be internal to the Cache. Previously, these just used value=nullptr to indicate dummy handle, which meant that they could be confused with legitimate value=nullptr cases like cache reservations. Also fixed blob_source_test which was relying on this leaky abstraction. * Drop "incomplete" terminology, which was another name for "pending". * Split handle flags into "mutable" ones requiring mutex and "immutable" ones which do not. Because of single-threaded access to pending handles, the "Is Pending" flag can be in the "immutable" set. This allows removal of a TSAN work-around and removing a mutex acquire-release in IsReady(). * Remove some unnecessary handling of charges on handles of failed lookups. Keeping total_charge=0 means no special handling needed. (Removed one unnecessary mutex acquire/release.) * Simplify handling of dummy handle in Lookup(). There is no need to explicitly Ref & Release w/Erase if we generally overwrite the dummy anyway. (Removed one mutex acquire/release, a call to Release().) Intended follow-up: * Clarify APIs in secondary_cache.h * Doesn't SecondaryCacheResultHandle transfer ownership of the Value() on success (implementations should not release the value in destructor)? * Does Wait() need to be called if IsReady() == true? (This would be different from Cache.) * Do Value() and Size() have undefined behavior if IsReady() == false? * Why have a custom API for what is essentially a std::future<std::pair<void*, size_t>>? * Improve unit testing of standalone handle case * Apparent null `e` bug in `free_standalone_handle` case * Clean up secondary cache testing in lru_cache_test * Why does TestSecondaryCacheResultHandle hold on to a Cache::Handle? * Why does TestSecondaryCacheResultHandle::Wait() do nothing? Shouldn't it establish the post-condition IsReady() == true? * (Assuming that is sorted out...) Shouldn't TestSecondaryCache::WaitAll simply wait on each handle in order (no casting required)? How about making that the default implementation? * Why does TestSecondaryCacheResultHandle::Size() check Value() first? If the API is intended to be returning 0 before IsReady(), then that is weird but should at least be documented. Otherwise, if it's intended to be undefined behavior, we should assert IsReady(). * Consider replacing "standalone" and "dummy" entries with a single kind of "weak" entry that deletes its value when it reaches zero refs. Suppose you are using compressed secondary cache and have two iterators at similar places. It will probably common for one iterator to have standalone results pinned (out of cache) when the second iterator needs those same blocks and has to re-load them from secondary cache and duplicate the memory. Combining the dummy and the standalone should fix this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10730 Test Plan: existing tests (minor update), and crash test with sanitizers and secondary cache Performance test for any regressions in LRUCache (primary only): Create DB with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` Test before & after (run at same time) with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X100] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=233000000 -duration 30 -threads=16 ``` Before: readrandom [AVG 100 runs] : 22234 (± 63) ops/sec; 1.6 (± 0.0) MB/sec After: readrandom [AVG 100 runs] : 22197 (± 64) ops/sec; 1.6 (± 0.0) MB/sec That's within 0.2%, which is not significant by the confidence intervals. Reviewed By: anand1976 Differential Revision: D39826010 Pulled By: anand1976 fbshipit-source-id: 3202b4a91f673231c97648ae070e502ae16b0f44 |
||
gitbw95 | 47b57a3731 |
add SetCapacity and GetCapacity for secondary cache (#10712)
Summary: To support tuning secondary cache dynamically, add `SetCapacity()` and `GetCapacity()` for CompressedSecondaryCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10712 Test Plan: Unit Tests Reviewed By: anand1976 Differential Revision: D39685212 Pulled By: gitbw95 fbshipit-source-id: 19573c67237011927320207732b5de083cb87240 |
||
Levi Tamasi | 12f5a1e35c |
Clarify comments for cache priorities and pool options (#10718)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10718 Reviewed By: riversand963 Differential Revision: D39707115 Pulled By: ltamasi fbshipit-source-id: 59aec8c732482f063d0abaad4d9200ba57ebf437 |
||
gitbw95 | 2cc5b39560 |
Add enable_split_merge option for CompressedSecondaryCache (#10690)
Summary: `enable_custom_split_merge` is added for enabling the custom split and merge feature, which split the compressed value into chunks so that they may better fit jemalloc bins. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690 Test Plan: Unit Tests Stress Tests Reviewed By: anand1976 Differential Revision: D39567604 Pulled By: gitbw95 fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7 |
||
Peter Dillinger | 0f91c72adc |
Call experimental new clock cache HyperClockCache (#10684)
Summary: This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons: * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache." * We want to highlight the key feature: it's fast (especially under parallel load) * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements. * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.) Some API detail: * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`. * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated. For performance tests see https://github.com/facebook/rocksdb/pull/10626 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684 Test Plan: no interesting functional changes; tests updated Reviewed By: anand1976 Differential Revision: D39547800 Pulled By: pdillinger fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf |
||
Peter Dillinger | 5724348689 |
Revamp, optimize new experimental clock cache (#10626)
Summary: * Consolidates most metadata into a single word per slot so that more can be accomplished with a single atomic update. In the common case, Lookup was previously about 4 atomic updates, now just 1 atomic update. Common case Release was previously 1 atomic read + 1 atomic update, now just 1 atomic update. * Eliminate spins / waits / yields, which likely threaten some "lock free" benefits. Compare-exchange loops are only used in explicit Erase, and strict_capacity_limit=true Insert. Eviction uses opportunistic compare- exchange. * Relaxes some aggressiveness and guarantees. For example, * Duplicate Inserts will sometimes go undetected and the shadow duplicate will age out with eviction. * In many cases, the older Inserted value for a given cache key will be kept (i.e. Insert does not support overwrite). * Entries explicitly erased (rather than evicted) might not be freed immediately in some rare cases. * With strict_capacity_limit=false, capacity limit is not tracked/enforced as precisely as LRUCache, but is self-correcting and should only deviate by a very small number of extra or fewer entries. * Use smaller "computed default" number of cache shards in many cases, because benefits to larger usage tracking / eviction pools outweigh the small cost of more lock-free atomic contention. The improvement in CPU and I/O is dramatic in some limit-memory cases. * Even without the sharding change, the eviction algorithm is likely more effective than LRU overall because it's more stateful, even though the "hot path" state tracking for it is essentially free with ref counting. It is like a generalized CLOCK with aging (see code comments). I don't have performance numbers showing a specific improvement, but in theory, for a Poisson access pattern to each block, keeping some state allows better estimation of time to next access (Poisson interval) than strict LRU. The bounded randomness in CLOCK can also reduce "cliff" effect for repeated range scans approaching and exceeding cache size. ## Hot path algorithm comparison Rough descriptions, focusing on number and kind of atomic operations: * Old `Lookup()` (2-5 atomic updates per probe): ``` Loop: Increment internal ref count at slot If possible hit: Check flags atomic (and non-atomic fields) If cache hit: Three distinct updates to 'flags' atomic Increment refs for internal-to-external Return Decrement internal ref count while atomic read 'displacements' > 0 ``` * New `Lookup()` (1-2 atomic updates per probe): ``` Loop: Increment acquire counter in meta word (optimistic) If visible entry (already read meta word): If match (read non-atomic fields): Return Else: Decrement acquire counter in meta word Else if invisible entry (rare, already read meta word): Decrement acquire counter in meta word while atomic read 'displacements' > 0 ``` * Old `Release()` (1 atomic update, conditional on atomic read, rarely more): ``` Read atomic ref count If last reference and invisible (rare): Use CAS etc. to remove Return Else: Decrement ref count ``` * New `Release()` (1 unconditional atomic update, rarely more): ``` Increment release counter in meta word If last reference and invisible (rare): Use CAS etc. to remove Return ``` ## Performance test setup Build DB with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` Test with ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics ``` Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations: base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6) folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry) gt_clock: experimental ClockCache before this change new_clock: experimental ClockCache with this change ## Performance test results First test "hot path" read performance, with block cache large enough for whole DB: 4181MB 1thread base -> kops/s: 47.761 4181MB 1thread folly -> kops/s: 45.877 4181MB 1thread gt_clock -> kops/s: 51.092 4181MB 1thread new_clock -> kops/s: 53.944 4181MB 16thread base -> kops/s: 284.567 4181MB 16thread folly -> kops/s: 249.015 4181MB 16thread gt_clock -> kops/s: 743.762 4181MB 16thread new_clock -> kops/s: 861.821 4181MB 24thread base -> kops/s: 303.415 4181MB 24thread folly -> kops/s: 266.548 4181MB 24thread gt_clock -> kops/s: 975.706 4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944) 4181MB 32thread base -> kops/s: 311.251 4181MB 32thread folly -> kops/s: 274.952 4181MB 32thread gt_clock -> kops/s: 1045.98 4181MB 32thread new_clock -> kops/s: 1370.38 4181MB 48thread base -> kops/s: 310.504 4181MB 48thread folly -> kops/s: 268.322 4181MB 48thread gt_clock -> kops/s: 1195.65 4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944) 4181MB 64thread base -> kops/s: 307.839 4181MB 64thread folly -> kops/s: 272.172 4181MB 64thread gt_clock -> kops/s: 1204.47 4181MB 64thread new_clock -> kops/s: 1615.37 4181MB 128thread base -> kops/s: 310.934 4181MB 128thread folly -> kops/s: 267.468 4181MB 128thread gt_clock -> kops/s: 1188.75 4181MB 128thread new_clock -> kops/s: 1595.46 Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x. Now test a large block cache with low miss ratio, but some eviction is required: 1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23 1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43 1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4 1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56 1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59 1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8 1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89 1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45 1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98 1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91 1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26 1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63 610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137 610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996 610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934 610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5 610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402 610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742 610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062 610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453 610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457 610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426 610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273 610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812 The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.) Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc. 233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371 233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293 233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844 233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461 233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227 233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738 233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688 233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402 233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84 233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785 233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94 233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016 89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086 89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984 89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441 89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754 89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812 89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418 89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422 89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293 89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43 89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824 89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32 89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223 ^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.) Even smaller cache size: 34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914 34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281 34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523 34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125 34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48 34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531 34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465 34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793 34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484 34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457 34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41 34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52 As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn: 13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328 13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633 13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684 13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383 13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492 13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863 13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121 13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758 13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539 13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098 13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77 13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27 gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention: 13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852 13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516 13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688 13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707 13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57 13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219 13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871 13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626 Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN Reviewed By: anand1976 Differential Revision: D39368406 Pulled By: pdillinger fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9 |
||
gitbw95 | 0148c4934d |
Add PerfContext counters for CompressedSecondaryCache (#10650)
Summary: Add PerfContext counters for CompressedSecondaryCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10650 Test Plan: Unit Tests. Reviewed By: anand1976 Differential Revision: D39354712 Pulled By: gitbw95 fbshipit-source-id: 1b90d3df99d08ddecd351edfd48d1e3723fdbc15 |
||
Bo Wang | d490bfcdb6 |
Avoid recompressing cold block in CompressedSecondaryCache (#10527)
Summary: **Summary:** When a block is firstly `Lookup` from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don’t erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again, we erase it from the secondary cache and add it into the primary cache. When a block is firstly evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache. **Implementation Details** Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0. (refs >= 1 && in_cache == false && IS_STANDALONE == true) The behaviors of `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache: 1. If a handle is found in primary cache: 1.1. If the handle's value is not nullptr, it is returned immediately. 1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache. - 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr. - 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache. 2. If a handle is not found in primary cache: 2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr. 2.2. If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block) and return a standalone handle. The behaviors of `LRUCacheShard::Promote()` are updated as follows: 1. If `e->sec_handle` has value, one of the following steps can happen: 1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle. 1.2. Insert the item into the primary cache and return the handle to caller. 1.3. Exception handling. 3. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released. The behavior of `CompressedSecondaryCache::Insert()` is updated: 1. If a block is evicted from the primary cache for the first time, a dummy item is inserted. 4. If a dummy item is found for a block, the block is inserted into the secondary cache. The behavior of `CompressedSecondaryCache:::Lookup()` is updated: 1. If a handle is not found or it is a dummy item, a nullptr is returned. 2. If `erase_handle` is true, the handle is erased. The behaviors of `LRUCacheShard::Release()` are adjusted for the standalone handles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527 Test Plan: 1. stress tests. 5. unit tests. 6. CPU profiling for db_bench. Reviewed By: siying Differential Revision: D38747613 Pulled By: gitbw95 fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca |
||
Levi Tamasi | 7818560194 |
Add a dedicated cache entry role for blobs (#10601)
Summary: The patch adds a dedicated cache entry role for blob values and switches to a registered deleter so that blobs show up as a separate bucket (as opposed to "Misc") in the cache occupancy statistics, e.g. ``` Block cache entry stats(count,size,portion): DataBlock(133515,531.73 MB,13.6866%) BlobValue(1824855,3.10 GB,81.7071%) Misc(1,0.00 KB,0%) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10601 Test Plan: Ran `make check` and tested the cache occupancy statistics using `db_bench`. Reviewed By: riversand963 Differential Revision: D39107915 Pulled By: ltamasi fbshipit-source-id: 8446c3b190a41a144030df73f318eeda4398c125 |
||
Gang Liao | 275cd80cdb |
Add a blob-specific cache priority (#10461)
Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461 Reviewed By: siying Differential Revision: D38672823 Pulled By: ltamasi fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743 |
||
Peter Dillinger | 86a1e3e0e7 |
Derive cache keys from SST unique IDs (#10394)
Summary: ... so that cache keys can be derived from DB manifest data before reading the file from storage--so that every part of the file can potentially go in a persistent cache. See updated comments in cache_key.cc for technical details. Importantly, the new cache key encoding uses some fancy but efficient math to pack data into the cache key without depending on the sizes of the various pieces. This simplifies some existing code creating cache keys, like cache warming before the file size is known. This should provide us an essentially permanent mapping between SST unique IDs and base cache keys, with the ability to "upgrade" SST unique IDs (and thus cache keys) with new SST format_versions. These cache keys are of similar, perhaps indistinguishable quality to the previous generation. Before this change (see "corrected" days between collision): ``` ./cache_bench -stress_cache_key -sck_keep_bits=43 18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected) ``` After this change (keep 43 bits, up through 50, to validate "trajectory" is ok on "corrected" days between collision): ``` 19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected) 16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected) 15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected) 15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected) 15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected) 15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected) 15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected) 15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected) ``` However, the change does modify (probably weaken) the "guaranteed unique" promise from this > SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86 to this (see https://github.com/facebook/rocksdb/issues/10388) > With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits I don't think this is a practical concern, though. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394 Test Plan: unit tests updated, see simulation results above Reviewed By: jay-zhuang Differential Revision: D38667529 Pulled By: pdillinger fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba |
||
Guido Tagliavini Ponce | a0798f6f92 |
Enable ClockCache in DB block cache test (#10482)
Summary: A test in db_block_cache_test.cc was skipping ClockCache due to the 16-byte key length requirement. We fixed this. Along the way, we fixed a bug in ApplyToSomeEntries, which assumed the function being applied could modify handle metadata, and thus took an exclusive reference. This is incompatible with calls that need to inspect every element (including externally referenced ones) to gather stats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10482 Test Plan: ``make -j24 check`` Reviewed By: anand1976 Differential Revision: D38553073 Pulled By: guidotag fbshipit-source-id: 0ed63fed4d3b89e5056b35b7091fce579f5647ae |
||
gitbw95 | f060b47ee8 |
Fix the segdefault bug in CompressedSecondaryCache and its tests (#10507)
Summary: This fix is to replace `AllocateBlock()` with `new`. Once I figure out why `AllocateBlock()` might cause the segfault, I will update the implementation. Fix the bug that causes ./compressed_secondary_cache_test output following test failures: ``` Note: Google Test filter = CompressedSecondaryCacheTest.MergeChunksIntoValueTest [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from CompressedSecondaryCacheTest [ RUN ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest [ OK ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest (1 ms) [----------] 1 test from CompressedSecondaryCacheTest (1 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (9 ms total) [ PASSED ] 1 test. t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.MergeChunksIntoValueTest: line 4: 1091086 Segmentation fault (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.MergeChunksIntoValueTest Note: Google Test filter = CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from CompressedSecondaryCacheTest [ RUN ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression [ OK ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression (1 ms) [----------] 1 test from CompressedSecondaryCacheTest (1 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (2 ms total) [ PASSED ] 1 test. t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression: line 4: 1090883 Segmentation fault (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10507 Test Plan: Test 1: ``` $make -j 24 $./compressed_secondary_cache_test ``` Test 2: ``` $COMPILE_WITH_ASAN=1 make -j 24 $./compressed_secondary_cache_test ``` Test 3: ``` $COMPILE_WITH_TSAN=1 make -j 24 $./compressed_secondary_cache_test ``` Reviewed By: anand1976 Differential Revision: D38529885 Pulled By: gitbw95 fbshipit-source-id: d903fa3fadbd4d29f9528728c63a4f61c4396890 |
||
Bo Wang | f28d0c2020 |
Fix data race reported on SetIsInSecondaryCache in LRUCache (#10472)
Summary: Currently, `SetIsInSecondaryCache` is after `Promote`. After `Promote`, a handle can be accessed and its flags can be set. This causes data race. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10472 Test Plan: unit tests stress tests Reviewed By: pdillinger Differential Revision: D38403991 Pulled By: gitbw95 fbshipit-source-id: 0aaa2d2edeaf5bc799fcce605648fe49eb7119c2 |
||
Bo Wang | 87b82f28a1 |
Split cache to minimize internal fragmentation (#10287)
Summary: ### **Summary:** To minimize the internal fragmentation caused by the variable size of the compressed blocks, the original block is split according to the jemalloc bin size in `Insert()` and then merged back in `Lookup()`. Based on the analysis of the results of the following tests, from the overall internal fragmentation perspective, this PR does mitigate the internal fragmentation issue. _Do more myshadow tests with the latest commit. I finished several myshadow AB Testing and the results are promising. For the config of 4GB primary cache and 3GB secondary cache, Jemalloc resident stats shows consistently ~0.15GB memory saving; the allocated and active stats show similar memory savings. The CPU usage is almost the same before and after this PR._ To evaluate the issue of memory fragmentations and the benefits of this PR, I conducted two sets of local tests as follows. **T1** Keys: 16 bytes each (+ 0 bytes user-defined timestamp) Values: 100 bytes each (50 bytes after compression) Entries: 90000000 RawSize: 9956.4 MB (estimated) FileSize: 5664.8 MB (estimated) | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) | | - | - | - | | T1_3 | 4000 | 4000 | | T1_4 | 2000 | 3000 | Populate the DB: ./db_bench --benchmarks=fillrandom --num=90000000 -db=/mem_fragmentation/db_bench_1 Overwrite it to a stable state: ./db_bench --benchmarks=overwrite --num=90000000 -use_existing_db -db=/mem_fragmentation/db_bench_1 Run read tests with differnt cache setting: T1_3: MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_3_20220710 -duration=1800 & T1_4: MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_4_20220710 -duration=1800 & For T1_3 and T1_4, I also conducted the tests before and after this PR. The following table show the important jemalloc stats. | Test Name | T1_3 | T1_3 after mem defrag | T1_4 | T1_4 after mem defrag | | - | - | - | - | - | | allocated (MB) | 8728 | 8076 | 5518 | 5043 | | available (MB) | 8753 | 8092 | 5536 | 5051 | | external fragmentation rate | 0.003 | 0.002 | 0.003 | 0.0016 | | resident (MB) | 8956 | 8365 | 5655 | 5235 | **T2** Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 256 bytes each (128 bytes after compression) Entries: 40000000 RawSize: 10986.3 MB (estimated) FileSize: 6103.5 MB (estimated) | Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) | | - | - | - | | T2_3 | 4000 | 4000 | | T2_4 | 2000 | 3000 | Create DB (10GB): ./db_bench -benchmarks=fillrandom -use_direct_reads=true -num=40000000 -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2 Overwrite it to a stable state: ./db_bench --benchmarks=overwrite --num=40000000 -use_existing_db -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2 Run read tests with differnt cache setting: T2_3: MALLOC_CONF="prof:true,prof_stats:true" ./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_3 -duration=1800 & T2_4: MALLOC_CONF="prof:true,prof_stats:true" ./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_4 -duration=1800 & For T2_3 and T2_4, I also conducted the tests before and after this PR. The following table show the important jemalloc stats. | Test Name | T2_3 | T2_3 after mem defrag | T2_4 | T2_4 after mem defrag | | - | - | - | - | - | | allocated (MB) | 8425 | 8093 | 5426 | 5149 | | available (MB) | 8489 | 8138 | 5435 | 5158 | | external fragmentation rate | 0.008 | 0.0055 | 0.0017 | 0.0017 | | resident (MB) | 8676 | 8392 | 5541 | 5321 | Pull Request resolved: https://github.com/facebook/rocksdb/pull/10287 Test Plan: Unit tests. Reviewed By: anand1976 Differential Revision: D37743362 Pulled By: gitbw95 fbshipit-source-id: 0010c5af08addeacc5ebbc4ffe5be882fb1d38ad |
||
anand76 | 54aebb2cc5 |
Fix cache metrics update when secondary cache is used (#10440)
Summary: If a secondary cache is configured, its possible that a cache lookup will get a hit in the secondary cache. In that case, the ```LRUCacheShard::Lookup``` doesn't immediately update the ```total_charge``` for the item handle if the ```wait``` parameter is false (i.e caller will call later to check the completeness). However, ```BlockBasedTable::GetEntryFromCache``` assumes the handle is complete and calls ```UpdateCacheHitMetrics```, which checks the usage of the cache item and fails the assert in https://github.com/facebook/rocksdb/blob/main/cache/lru_cache.h#L237 (```assert(total_charge >= meta_charge)```). To fix this, we call ```UpdateCacheHitMetrics``` later in ```MultiGet```, after waiting for all cache lookup completions. Test plan - Run crash test with changes from https://github.com/facebook/rocksdb/issues/10160 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10440 Reviewed By: gitbw95 Differential Revision: D38283968 Pulled By: anand1976 fbshipit-source-id: 31c54ef43517726c6e5fdda81899b364241dd7e1 |
||
Peter Dillinger | 65036e4217 |
Revert "Add a blob-specific cache priority (#10309)" (#10434)
Summary:
This reverts commit
|
||
Gang Liao | 8d178090be |
Add a blob-specific cache priority (#10309)
Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309 Reviewed By: ltamasi Differential Revision: D38211655 Pulled By: gangliao fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5 |
||
Guido Tagliavini Ponce | d976f68977 |
Fix assertion failure and memory leak in ClockCache. (#10430)
Summary: This fixes two issues: - [T127355728](https://www.internalfb.com/intern/tasks/?t=127355728): In the stress tests, when the ClockCache is operating close to full capacity and a burst of inserts are concurrently executed, every slot in the hash table may become occupied. This contradicts an assertion in the code, which is no longer valid in the lock-free setting. We are removing that assertion and handling the case of an insertion into a full table. - [T127427659](https://www.internalfb.com/intern/tasks/?t=127427659): There was a memory leak when an insertion is performed over capacity, but no handle is provided. In that case, a handle was dynamically allocated, but the pointer wasn't stored anywhere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10430 Test Plan: - ``make -j24 check`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` Reviewed By: pdillinger Differential Revision: D38226114 Pulled By: guidotag fbshipit-source-id: 18f6ab7e6214e11e9721d5ff289db1bf795d0008 |
||
Guido Tagliavini Ponce | 9d7de6517c |
Towards a production-quality ClockCache (#10418)
Summary: In this PR we bring ClockCache closer to production quality. We implement the following changes: 1. Fixed a few bugs in ClockCache. 2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table. 3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test where either the LRU policy is required, and a test that dynamically increases the table capacity. 4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.) 5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache. As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418 Test Plan: - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` Reviewed By: pdillinger Differential Revision: D38170673 Pulled By: guidotag fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a |
||
Peter Dillinger | 01a2e20299 |
Account for DB ID in stress testing block cache keys (#10388)
Summary: I recently discovered that block cache keys are slightly lower quality than previously thought, because my stress testing tool failed to simulate the effect of DB ID differences. This change updates the tool and gives us data to guide future developments. (No changes to production code here and now.) Nevertheless, the following promise still holds ``` // In fact, if our SST files are all < 4TB (see // BlockBasedTable::kMaxFileSizeStandardEncoding), then SST files generated // in a single process are guaranteed to have unique cache keys, unless/until // number session ids * max file number = 2**86 ... ``` because although different DB IDs could cause collision in file number and offset data, that would have to be using the same DB session (lower) to cause a block cache key collision, which is not possible in the same process. (A session is associated with only one DB ID.) This change fixes cache_bench -stress_cache_key to set and reset DB IDs in a parameterized way to evaluate the effect. Previous results assumed to be representative (using -sck_keep_bits=43): ``` 15 collisions after 15 x 90 days, est 90 days between (1.03763e+20 corrected) ``` or expected collision on a single machine every 104 billion billion days (see "corrected" value). After accounting for DB IDs, test never really changing, intermediate, and very frequently changing (using default -sck_db_count=100): ``` -sck_newdb_nreopen=1000000000: 15 collisions after 2 x 90 days, est 12 days between (1.38351e+19 corrected) -sck_newdb_nreopen=10000: 17 collisions after 2 x 90 days, est 10.5882 days between (1.22074e+19 corrected) -sck_newdb_nreopen=100: 19 collisions after 2 x 90 days, est 9.47368 days between (1.09224e+19 corrected) ``` or roughly 10x more often than previously thought (still extremely if not impossibly rare), and better than random base cache keys (with -sck_randomize), though < 10x better than random: ``` 31 collisions after 1 x 90 days, est 2.90323 days between (3.34719e+18 corrected) ``` If we simply fixed this by ignoring DB ID for cache keys, we would potentially have a shortage of entropy for some cases, such as small file numbers and offsets (e.g. many short-lived processes each using SstFileWriter to create a small file), because existing DB session IDs only provide ~103 bits of entropy. We could upgrade the entropy in DB session IDs to accommodate, but it's not known what all would be affected by changing from 20 digit session IDs to something larger. Instead, my plan is to 1) Move to block cache keys derived from SST unique IDs (so that we can derive block cache keys from manifest data without reading file on storage), and show no significant regression in expected collision rate. 2) Generate better SST unique IDs in format_version=6 (https://github.com/facebook/rocksdb/issues/9058), which should have ~100x lower expected/predicted collision rate based on simulations with this stress test: ``` ./cache_bench -stress_cache_key -sck_keep_bits=39 -sck_newdb_nreopen=100 -sck_footer_unique_id ... 15 collisions after 19 x 90 days, est 114 days between (2.10293e+21 corrected) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10388 Test Plan: no production changes Reviewed By: jay-zhuang Differential Revision: D37986714 Pulled By: pdillinger fbshipit-source-id: e759b2469e3365cb01c6661a69e0ab849ef4c3df |