mirror of https://github.com/facebook/rocksdb.git
252 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Peter Dillinger | b34cef57b7 |
Support pro-actively erasing obsolete block cache entries (#12694)
Summary: Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guaranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries. Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive. This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on. The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizable by CF. Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths. Notable auxiliary code details: * Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce a current failure.) * Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item) Marked EXPERIMENTAL until more thorough validation is complete. Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694 Test Plan: * Unit tests added, including refactoring an existing test to make better use of parameterized tests. * Added to crash test.
* Performance, sample command: ``` for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done ``` Averaging 10 runs each case, block cache data block hit rates ``` lru_cache UA=0 -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0 UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0 fixed_hyper_clock_cache UA=0 -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9 UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2 auto_hyper_clock_cache UA=0 -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5 UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8 ``` Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings). Reviewed By: ajkr Differential Revision: D57932442 Pulled By: pdillinger fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c |
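Since `uncache_aggressiveness` is described above as a mutable ColumnFamilyOptions field, here is a minimal sketch of enabling it on an already-open DB. The option name comes from the PR description; that it is settable through `SetOptions()` is an assumption based on "dynamically changeable", and the value 300 mirrors the benchmark above rather than a recommended default.

```cpp
#include <cassert>
#include "rocksdb/db.h"

// Sketch: turn on obsolete-block-cache erasure for one column family at
// runtime. Assumes `uncache_aggressiveness` can be set via SetOptions(),
// as implied by "dynamically changeable" in the PR description above.
void EnableUncacheAggressiveness(rocksdb::DB* db,
                                 rocksdb::ColumnFamilyHandle* cf) {
  // 0 disables the feature; larger values let RocksDB spend more CPU trying
  // to purge block cache entries for each obsolete file.
  rocksdb::Status s = db->SetOptions(cf, {{"uncache_aggressiveness", "300"}});
  assert(s.ok());
}
```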
|
anand76 | 0ae3d9f98d |
Fix stale memory access with FSBuffer and tiered sec cache (#12712)
Summary: A `BlockBasedTable` with `TieredSecondaryCache` containing an NVM cache inserts blocks into the compressed cache and the corresponding compressed block into the NVM cache. The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e. returning an FS-allocated buffer rather than a caller-provided one), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than a local variable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712 Test Plan: 1. Add a unit test 2. Release validation stress test Reviewed By: jaykorean Differential Revision: D57974026 Pulled By: anand1976 fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd |
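The ownership issue is easiest to see in a stripped-down sketch. The names below are hypothetical and not the actual `BlockFetcher` members; the point is only the lifetime of the FS-allocated buffer.

```cpp
#include <memory>

// Illustration only (hypothetical names, not the real BlockFetcher): why the
// FS-allocated buffer must outlive both ReadBlockContents() and
// GetUncompressedBlock().
struct FsBuffer {
  std::unique_ptr<char[]> bytes;
};

class FetcherSketch {
 public:
  void ReadBlockContents() {
    // Before the fix (conceptually), a *local* unique_ptr owned the buffer
    // returned by the file system, so the compressed bytes were freed as soon
    // as this call returned. Keeping ownership in a member avoids that.
    fs_buf_ = std::make_unique<FsBuffer>();
    fs_buf_->bytes = std::make_unique<char[]>(4096);  // stand-in for FS read
  }
  const char* GetUncompressedBlock() const {
    // Safe only because fs_buf_ (a member) still owns the compressed bytes.
    return fs_buf_ ? fs_buf_->bytes.get() : nullptr;
  }

 private:
  std::unique_ptr<FsBuffer> fs_buf_;
};
```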
|
anand76 | 6cc7ad15b6 |
Implement secondary cache admission policy to allow all evicted blocks (#12599)
Summary: Add a secondary cache admission policy to admit all blocks evicted from the block cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12599 Reviewed By: pdillinger Differential Revision: D56891760 Pulled By: anand1976 fbshipit-source-id: 193c98c055aa3477f4e3a78e5d3daef27a5eacf4 |
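A hedged sketch of selecting that policy when building a tiered cache. The field and enum names here (`TieredCacheOptions`, `adm_policy`, `kAdmPolicyAllowAll`) are taken from the PR descriptions in this log or assumed; verify against `include/rocksdb/cache.h` for the release you use.

```cpp
#include <memory>
#include "rocksdb/cache.h"

// Sketch: primary LRU cache + compressed secondary cache, admitting every
// block evicted from the primary tier into the compressed tier.
std::shared_ptr<rocksdb::Cache> MakeAllowAllTieredCache() {
  rocksdb::LRUCacheOptions primary;  // capacity derived from total_capacity
  rocksdb::CompressedSecondaryCacheOptions compressed;

  rocksdb::TieredCacheOptions tiered;
  tiered.cache_opts = &primary;
  tiered.comp_cache_opts = compressed;
  tiered.total_capacity = size_t{1} << 30;   // 1 GB across both tiers
  tiered.compressed_secondary_ratio = 0.3;   // 30% to the compressed tier
  tiered.adm_policy =
      rocksdb::TieredAdmissionPolicy::kAdmPolicyAllowAll;  // enum name assumed
  return rocksdb::NewTieredCache(tiered);
}
```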
|
Richard Barnes | ee3159e7dd |
Remove extra semi colon from icsp/lib/logging/IcspLogRpcMessage.cpp
Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D55534619 fbshipit-source-id: 26f3c35a51b38a3cbfa12a6f76a2bb783a7b4d8e |
|
Changyu Bi | f77b788545 |
Fix a bug in `LRUCacheShard::LRU_Insert` (#12429)
Summary: we saw the crash test fail with ``` lru_cache.cc:249: void rocksdb::lru_cache::LRUCacheShard::LRU_Remove(rocksdb::lru_cache::LRUHandle *): Assertion `high_pri_pool_usage_ >= e->total_charge' failed. ``` One cause for this is that the `lru_low_pri_` pointer is not updated in `LRU_Insert()` before we try to balance the high pri and low pri pools in `MaintainPoolSize();`. A repro unit test is provided. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12429 Test Plan: Not able to reproduce the failure with db_stress yet. `./lru_cache_test --gtest_filter="*InsertAfterReducingCapacity*"`. It fails the assertion before this PR. Reviewed By: pdillinger Differential Revision: D54908919 Pulled By: cbi42 fbshipit-source-id: f485fdbc0ea61c8092a0be5fe561a59c15c78fd3 |
|
yuzhangyu@fb.com | 1cfdece85d |
Run internal cpp modernizer on RocksDB repo (#12398)
Summary: When the internal cpp modernizer attempts to format rocksdb code, it will replace the macro `ROCKSDB_NAMESPACE` with its default definition `rocksdb` when collapsing nested namespaces. We filed feedback for the tool (T180254030) and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested we run the modernizer tool ourselves so future auto codemod attempts will be smaller. This diff contains: Running `xplat/scripts/codemod_service/cpp_modernizer.sh` in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that have a non-Meta copyright comment) without swapping out the namespace macro `ROCKSDB_NAMESPACE`, followed by RocksDB's own `make format` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398 Test Plan: Auto tests Reviewed By: hx235 Differential Revision: D54382532 Pulled By: jowlyzhang fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa |
|
Richard Barnes | ced333ee45 |
Remove extra semi colon from instagram/ranking/mezql/shots/parser/fast/Token.cpp
Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D54362213 fbshipit-source-id: 0bbc9e5fce917fc4f72423f0a4c8cb2c2b1759dd |
|
Peter Dillinger | 54cb9c77d9 |
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary: The following are risks associated with pointer-to-pointer reinterpret_cast: * Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do. * Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally. I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement: * Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have `struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic. * Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance. With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain. A couple of related interventions included here: * Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle. * Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse). Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work. I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308 Test Plan: existing tests, CI Reviewed By: ltamasi Differential Revision: D53204947 Pulled By: pdillinger fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2 |
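A self-contained example (not RocksDB code) of the pointer-adjustment difference the summary describes for multiple inheritance:

```cpp
#include <cstdio>

struct Base1 { int a = 1; };
struct Base2 { int b = 2; };
struct Derived : public Base1, public Base2 { int c = 3; };

int main() {
  Derived d;
  Derived* dp = &d;

  // static_cast applies the correct offset: the Base2 subobject lives at a
  // nonzero offset inside Derived, so the pointer value changes.
  Base2* good = static_cast<Base2*>(dp);

  // reinterpret_cast reuses the same address; reading through it as Base2
  // would actually read the Base1 subobject's bytes (undefined behavior).
  Base2* bad = reinterpret_cast<Base2*>(dp);

  std::printf("Derived* = %p\nstatic_cast<Base2*>      = %p\n"
              "reinterpret_cast<Base2*> = %p\n",
              static_cast<void*>(dp), static_cast<void*>(good),
              static_cast<void*>(bad));
  std::printf("good->b = %d\n", good->b);  // prints 2
  return 0;
}
```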
|
Peter Dillinger | 76c834e441 |
Remove 'virtual' when implied by 'override' (#12319)
Summary: ... to follow modern C++ style / idioms. Used this hack: ``` for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319 Test Plan: existing tests, CI Reviewed By: jaykorean Differential Revision: D53275303 Pulled By: pdillinger fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377 |
|
anand76 | b49f9cdd3c |
Add CompressionOptions to the compressed secondary cache (#12234)
Summary: Add ```CompressionOptions``` to ```CompressedSecondaryCacheOptions``` to allow users to set options such as compression level. It allows performance to be fine-tuned. Tests - Run db_bench and verify compression options in the LOG file Pull Request resolved: https://github.com/facebook/rocksdb/pull/12234 Reviewed By: ajkr Differential Revision: D52758133 Pulled By: anand1976 fbshipit-source-id: af849fbffce6f84704387c195d8edba40d9548f6 |
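A sketch of the intended usage. The embedded `CompressionOptions` field name (`compression_opts`) is assumed from the PR title; the other fields and `NewCompressedSecondaryCache()` are pre-existing API.

```cpp
#include <memory>
#include "rocksdb/cache.h"

// Sketch: tune the compression level used by the compressed secondary cache.
std::shared_ptr<rocksdb::SecondaryCache> MakeTunedCompressedSecondaryCache() {
  rocksdb::CompressedSecondaryCacheOptions opts;
  opts.capacity = size_t{256} << 20;    // 256 MB
  opts.compression_type = rocksdb::kZSTD;
  opts.compression_opts.level = 3;      // field name assumed; the knob this PR adds
  return rocksdb::NewCompressedSecondaryCache(opts);
}
```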
|
anand76 | cc069f25b3 |
Add some compressed and tiered secondary cache stats (#12150)
Summary: Add statistics for more visibility. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12150 Reviewed By: akankshamahajan15 Differential Revision: D52184633 Pulled By: anand1976 fbshipit-source-id: 9969e05d65223811cd12627102b020bb6d229352 |
|
Peter Dillinger | 88bc91f3cc |
Cap eviction effort (CPU under stress) in HyperClockCache (#12141)
Summary: HyperClockCache is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list. All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem. However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable (they are pinned). Unlike LRU which can find any evictable entries immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system. To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141 Test Plan: unit test included ## Performance test We can use cache_bench to validate no regression (CPU and memory) in normal operation, and to measure change in behavior when cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well). Base command reproducing bad behavior: ``` ./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7 ``` ``` Before, LRU (alternate baseline not exhibiting bad behavior): Rough parallel ops/sec = 2290997 1088060 maxresident Before, AutoHCC (bad behavior): Rough parallel ops/sec = 141011 <- Yes, more than 10x slower 1083932 maxresident ``` Now let us sample a range of values in the solution space: ``` After, AutoHCC, eviction_effort_cap = 1: Rough parallel ops/sec = 3212586 2402216 maxresident After, AutoHCC, eviction_effort_cap = 10: Rough parallel ops/sec = 2371639 1248884 maxresident After, AutoHCC, eviction_effort_cap = 30: Rough parallel ops/sec = 1981092 1131596 maxresident After, AutoHCC, eviction_effort_cap = 100: Rough parallel ops/sec = 1446188 1090976 maxresident After, AutoHCC, eviction_effort_cap = 1000: Rough parallel ops/sec = 549568 1084064 maxresident ``` It looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so it is chosen as the default. ``` Change to -pinned_ratio=0.85 Before, LRU: Rough parallel ops/sec = 2108373 1078232 maxresident Before, AutoHCC, averaged over ~20 runs: Rough parallel ops/sec = 2164910 1077312 maxresident After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs: Rough parallel ops/sec = 2145542 1077216 maxresident ``` The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress.
``` Change to -pinned_ratio=0.25 (low stress) Before, AutoHCC, averaged over ~20 runs: Rough parallel ops/sec = 2221149 1076540 maxresident After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs: Rough parallel ops/sec = 2224521 1076664 maxresident ``` No measurable difference under normal circumstances. Some tests repeated with FixedHCC, with similar results. Reviewed By: anand1976 Differential Revision: D52174755 Pulled By: pdillinger fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8 |
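A hedged sketch of opting into the cap, assuming the tuning parameter is exposed as a `HyperClockCacheOptions` field named `eviction_effort_cap` as the summary suggests (see the public API comments it references for the authoritative name and default):

```cpp
#include <memory>
#include "rocksdb/cache.h"

// Sketch: create a HyperClockCache with a bounded eviction search effort.
std::shared_ptr<rocksdb::Cache> MakeCappedHyperClockCache() {
  rocksdb::HyperClockCacheOptions opts(/*capacity=*/size_t{1} << 30,
                                       /*estimated_entry_charge=*/0);
  // Field name assumed from the PR; 30 is the default chosen above, so only
  // change this to trade CPU under pinned-cache stress against memory overhead.
  opts.eviction_effort_cap = 30;
  return opts.MakeSharedCache();
}
```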
|
Peter Dillinger | c74531b1d2 |
Fix a nuisance compiler warning from clang (#12144)
Summary: Example: ``` cache/clock_cache.cc:56:7: error: fallthrough annotation in unreachable code [-Werror,-Wimplicit-fallthrough] FALLTHROUGH_INTENDED; ^ ./port/lang.h:10:30: note: expanded from macro 'FALLTHROUGH_INTENDED' ^ ``` In clang < 14, this is annoyingly generated from -Wimplicit-fallthrough, but was changed to -Wunreachable-code-fallthrough (implied by -Wunreachable-code) in clang 14. See https://reviews.llvm.org/D107933 for how this nuisance pattern generated false positives similar to ours in the Linux kernel. Just to underscore the ridiculousness of this warning, here an error is reported on the annotation, not the call to do_something(), depending on the constexpr value (https://godbolt.org/z/EvxqdPTdr): ``` #include <atomic> void do_something(); void test(int v) { switch (v) { case 1: if constexpr (std::atomic<long>::is_always_lock_free) { return; } else { do_something(); [[fallthrough]]; } case 2: return; } } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12144 Test Plan: Added the warning to our Makefile for USE_CLANG, which reproduced the warning-as-error as shown above, but is now fixed. Reviewed By: jaykorean Differential Revision: D52139615 Pulled By: pdillinger fbshipit-source-id: ba967ae700c0916d1a478bc465cf917633e337d9 |
|
anand76 | ebb5242d55 |
Sanitize the secondary_cache option in TieredCacheOptions (#12137)
Summary: Sanitize the `secondary_cache` field in the `cache_opts` option of `TieredCacheOptions` to `nullptr` if set by the user. The nvm secondary cache should be directly set in `TieredCacheOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12137 Reviewed By: akankshamahajan15 Differential Revision: D52063817 Pulled By: anand1976 fbshipit-source-id: 255116c665a9b908c8f44109a2d331d4b73e7591 |
|
anand76 | c1b84d0437 |
Fix false negative in TieredSecondaryCache nvm cache lookup (#12134)
Summary: There is a bug in the `TieredSecondaryCache` that can result in a false negative. This can happen when a MultiGet does a cache lookup that gets a hit in the `TieredSecondaryCache` local nvm cache tier, and the result is available before MultiGet calls `WaitAll` (i.e. the nvm cache `SecondaryCacheResultHandle` `IsReady` returns true). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12134 Test Plan: Add a new unit test in tiered_secondary_cache_test Reviewed By: akankshamahajan15 Differential Revision: D52023309 Pulled By: anand1976 fbshipit-source-id: e5ae681226a0f12753fecb2f6acc7e5f254ae72b |
|
anand76 | 336a74db60 |
Add some asserts in ~CacheWithSecondaryAdapter (#12082)
Summary: Add some asserts in the `CacheWithSecondaryAdapter` destructor to help debug a crash test failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12082 Reviewed By: cbi42 Differential Revision: D51486041 Pulled By: anand1976 fbshipit-source-id: 76537beed31ba27ab9ac8b4ce6deb775629e3be5 |
|
anand76 | 2222caec9e |
Make CacheWithSecondaryAdapter reservation accounting more robust (#12059)
Summary: `CacheWithSecondaryAdapter` can distribute placeholder reservations across the primary and secondary caches. The current implementation of the accounting is quite complicated in order to avoid using a mutex. This may cause the accounting to be slightly off after changes to the cache capacity and ratio, resulting in assertion failures. There's also a bug in the unlikely event that the total reservation exceeds the cache capacity. Furthermore, the current implementation is difficult to reason about. This PR simplifies it by doing the accounting while holding a mutex. The reservations are processed in 1MB chunks in order to avoid taking a lock too frequently. As a side effect, this also removes the restriction of not allowing to increase the compressed secondary cache capacity after decreasing it to 0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12059 Test Plan: Existing unit tests, and a new test for capacity increase from 0 Reviewed By: pdillinger Differential Revision: D51278686 Pulled By: anand1976 fbshipit-source-id: 7e1ad2c50694772997072dd59cab35c93c12ba4f |
|
Peter Dillinger | 65cde19f40 |
Safer wrapper for std::atomic, use in HCC (#12051)
Summary: See new atomic.h file comments for motivation. I have updated HyperClockCache to use the new atomic wrapper, fixing a few cases where an implicit conversion was accidentally used and therefore mixing std::memory_order_seq_cst where release/acquire ordering (or relaxed) was intended. There probably wasn't a real bug because I think all the cases happened to be in single-threaded contexts like constructors/destructors or statistical ops like `GetCapacity()` that don't need any particular ordering constraints. Recommended follow-up: * Replace other uses of std::atomic to help keep them safe from bugs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12051 Test Plan: Did some local correctness stress testing with cache_bench. Also triggered 15 runs of fbcode_blackbox_crash_test and saw no related failures (just 3 failures in ~CacheWithSecondaryAdapter(), already known) No performance difference seen before & after running simultaneously: ``` (while ./cache_bench -cache_type=fixed_hyper_clock_cache -populate_cache=0 -cache_size=3000000000 -ops_per_thread=500000 -threads=12 -histograms=0 2>&1 | grep parallel; do :; done) | awk '{ s += $3; c++; print "Avg time: " (s/c);}' ``` ... for both fixed_hcc and auto_hcc. Reviewed By: jowlyzhang Differential Revision: D51090518 Pulled By: pdillinger fbshipit-source-id: eeb324facb3185584603f9ea0c4de6f32919a2d7 |
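A condensed sketch of the wrapper idea (the real class lives in the new atomic.h and is richer than this; the names here are illustrative only):

```cpp
#include <atomic>

// Sketch of the idea only: a wrapper whose named methods make every memory
// ordering explicit, so an accidental implicit conversion or operator= can't
// silently fall back to std::memory_order_seq_cst.
template <typename T>
class AcqRelAtomicSketch {
 public:
  explicit AcqRelAtomicSketch(T v = {}) : v_(v) {}
  T Load() const { return v_.load(std::memory_order_acquire); }
  void Store(T v) { v_.store(v, std::memory_order_release); }
  T LoadRelaxed() const { return v_.load(std::memory_order_relaxed); }
  T FetchAdd(T delta) { return v_.fetch_add(delta, std::memory_order_acq_rel); }
  // Deliberately no implicit conversion to T and no operator=(T), which are
  // what made raw std::atomic<T> usages easy to get subtly wrong.
 private:
  std::atomic<T> v_;
};
```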
|
Peter Dillinger | 9af25a392b |
Clean up AutoHyperClockTable::PurgeImpl (#12052)
Summary: There was some unnecessary logic (e.g. a dead assignment to home_shift) left over from an earlier revision of the code. Also, rename confusing ChainRewriteLock::new_head_ / GetNewHead() to saved_head_ / GetSavedHead(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12052 Test Plan: existing tests Reviewed By: jowlyzhang Differential Revision: D51091499 Pulled By: pdillinger fbshipit-source-id: 4b191b60a2b16085681e59d49c4d97e802869db8 |
|
Peter Dillinger | 16ae3548a2 |
AutoHCC: Improve/fix allocation/detection of grow homes (#12047)
Summary: This change simplifies some code and logic by introducing a new atomic field that tracks the next slot to grow into. It should offer slightly better performance during the growth phase (not measurable; see Test Plan below) and fix a suspected (but unconfirmed) bug like this: * Thread 1 is in non-trivial SplitForGrow() with grow_home=n. * Thread 2 reaches Grow() with grow_home=2n, and waits at the start of SplitForGrow() for the rewrite lock on n. By this point, the head at 2n is marked with the new shift amount but no chain is locked. * Thread 3 reaches Grow() with grow_home=4n, and waits before SplitForGrow() for the rewrite lock on n. By this point, the head at 4n is marked with the new shift amount but no chain is locked. * Thread 4 reaches Grow() with grow_home=8n and meets no resistance to proceeding through a SplitForGrow() on an empty chain, permanently missing out on any entries from chain n that should have ended up here. This is fixed by not updating the shift amount at the grow_home head until we have checked the preconditions that Grow()s feeding into this one have completed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12047 Test Plan: Some manual cache_bench stress runs, and about 20 triggered runs of fbcode_blackbox_crash_test No discernible performance difference on this benchmark, running before & after in parallel for a few minutes: ``` (while ./cache_bench -cache_type=auto_hyper_clock_cache -populate_cache=0 -cache_size=3000000000 -ops_per_thread=50000 -threads=12 -histograms=0 2>&1 | grep parallel; do :; done) | awk '{ s += $3; c++; print "Avg time: " (s/c);}' ``` Reviewed By: jowlyzhang Differential Revision: D51017007 Pulled By: pdillinger fbshipit-source-id: 5f6d6a6194fc966f94693f3205ed75c87cdad269 |
|
Peter Dillinger | 92dc5f3e67 |
AutoHCC: fix a bug with "blind" Insert (#12046)
Summary: I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions. It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`). The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have * Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain. * Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after it is marked for (de)construction). And after the purge of the chain, the entry is marked empty. * Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain. This is easily fixed by holding on to a reference until inserted onto the chain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046 Test Plan: As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed. Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000) Reviewed By: jowlyzhang Differential Revision: D51016979 Pulled By: pdillinger fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b |
|
Peter Dillinger | a399bbc037 |
More fixes and enhancements for cache_bench (#12041)
Summary: Mostly things for using cache_bench for stress/correctness testing. * Make secondary_cache_uri option work with HCC (forgot to update when secondary support was added for HCC) * Add -pinned_ratio option to keep more than just one entry per thread pinned. This can be important for testing eviction stress. * Add -vary_capacity_ratio for testing dynamically changing capacity. Also added some overrides to CacheWrapper to help with diagnostic output. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12041 Test Plan: manual, make check Reviewed By: jowlyzhang Differential Revision: D51013430 Pulled By: pdillinger fbshipit-source-id: 7914adc1218f0afacace05ccd77d3bfb91a878d0 |
|
anand76 | 52be8f54f2 |
Add APIs to query secondary cache capacity and usage for TieredCache (#12011)
Summary: In `TieredCache`, the underlying compressed secondary cache is hidden from the user. So we need a way to query the capacity, as well as the portion of cache reservation charged to the compressed secondary cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12011 Test Plan: Update the unit tests Reviewed By: akankshamahajan15 Differential Revision: D50651943 Pulled By: anand1976 fbshipit-source-id: 06d1cb5edb75a790c919bce718e2ff65f5908220 |
|
Peter Dillinger | ef0c3f08fa |
Fix rare destructor bug in AutoHCC (#11988)
Summary: and some other small enhancements/fixes: * The main bug fixed is that in some rare cases, the "published" table size might be smaller than the actual table size. This is a transient state that can happen with concurrent growth that is normally fixed after enough insertions, but if the cache is destroyed soon enough after growth, it could fail to fully destroy some entries and cause assertion failures. We can fix this by detecting the true table size in the destructor. * Change the "too many iterations" debug threshold from 512 to 768. We might have hit at least one false positive failure. (Failed despite legitimate operation.) * Added some stronger assertions in some places to aid in debugging. * Use COERCE_CONTEXT_SWITCH to make behavior of Grow less predictable in terms of thread interleaving. (Might add in more places.) This was useful in reproducing the destructor bug. * Fix some comments with typos or that were based on earlier revisions of the code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11988 Test Plan: Variants of this bug-finding command: ``` USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 COERCE_CONTEXT_SWITCH=1 DEBUG_LEVEL=2 make -j32 cache_bench && while ROCKSDB_DEBUG=1 ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=80000000 -threads=32 -populate_cache=0 -ops_per_thread=1000 -num_shard_bits=0; do :; done ``` Reviewed By: jowlyzhang Differential Revision: D50470318 Pulled By: pdillinger fbshipit-source-id: d407a8bb0b6d2ddc598a954c319a1640136f12f2 |
|
Peter Dillinger | dc576af0fd |
AutoHCC - fix a rare loop condition in Lookup (#11948)
Summary: Saw this in stress test: ``` db_stress: cache/clock_cache.cc:3152:[...] Assertion `i < 0x2000' failed. ``` The problem is related to Lookups on a chain currently involved in a Grow operation. To avoid Lookup waiting on Grow, Lookup is able to walk a chain whose first part is already migrated and tail is not yet migrated, so is mixed with entries with a different destination home (according to `home_shift`) than what we're looking for. This is fine until we save one of these entries as a safe point in the chain to backtrack to (`read_ref_on_chain`) in case of concurrent modification and end up backtracking to it. In that case, we can get stuck on the wrong destination chain and keep trying to backtrack to an entry that is supposed to be on the correct chain but is not (anymore). For some reason I haven't quite worked out, I believe it's usually able to recover after some 1000+ loop iterations, so reproducibility depends on the threshold at which we consider a Lookup loop to be too many iterations for a plausibly valid Lookup. Detecting and working around this case is relatively simple. We can (and must) keep going on the chain but ensure we don't save it as a safe entry to backtrack to. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11948 Test Plan: The problem could be reproduced in a few minutes with this (debug build): ``` $ while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=80000000 -threads=32 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done ``` At least with a lower threshold on suspiciously high number of iterations. I've lowered the thresholds quite a bit and am no longer able to reproduce a failure. Reviewed By: jowlyzhang Differential Revision: D50236574 Pulled By: pdillinger fbshipit-source-id: 2cb54a4e02bb51d5933eea41fcd489ab9d34aa96 |
|
anand76 | 90e160733e |
Fix runtime error in UpdateTieredCache due to integer underflow (#11949)
Summary: With the introduction of the `UpdateTieredCache` API, it's possible to dynamically change the compressed secondary cache ratio of the total cache capacity. In order to optimize performance, we avoid using a mutex when inserting/releasing placeholder entries, which can result in some inaccuracy in the accounting during the dynamic update. This inaccuracy was causing a runtime error due to an integer underflow in `UpdateCacheReservationRatio`, causing ubsan crash tests to fail. This PR fixes it by explicitly checking for the underflow. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11949 Test Plan: 1. Added a unit test that fails without the fix 2. Run ubsan_crash Reviewed By: akankshamahajan15 Differential Revision: D50240217 Pulled By: anand1976 fbshipit-source-id: d2f7b79da54eec8b61aec2cc1f2943da5d5847ac |
|
anand76 | d367b34cc9 |
Fix TSAN crash test false positive (#11941)
Summary: Fix the TSAN false positive caused by reading a bool flag without synchronization. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11941 Test Plan: Run tsan crash test locally Reviewed By: akankshamahajan15 Differential Revision: D50181799 Pulled By: anand1976 fbshipit-source-id: 889e7237e9f3c9452a9df94a0d949db5fe13bb57 |
|
anand76 | 5b11f5a3a2 |
Add TieredCache and compressed cache capacity change to db_stress (#11935)
Summary: Add `TieredCache` to the cache types tested by db_stress. Also add compressed secondary cache capacity change, and `WriteBufferManager` integration with `TieredCache` for memory charging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11935 Test Plan: Run whitebox/blackbox crash tests locally Reviewed By: akankshamahajan15 Differential Revision: D50135365 Pulled By: anand1976 fbshipit-source-id: 7d73ed00c00a0953d86e49f35cce6bd550ba00f1 |
|
anand76 | 35a0250293 |
Don't call InsertSaved on compressed only secondary cache (#11889)
Summary: In https://github.com/facebook/rocksdb/issues/11812, the ```CacheWithSecondaryAdapter::Insert``` calls ```InsertSaved``` on the secondary cache to warm it up with the compressed blocks. This should only be done if it's a stacked cache with compressed and nvm cache. If it's in-memory compressed only, then don't call ```InsertSaved```. Tests: Add a new unit test Pull Request resolved: https://github.com/facebook/rocksdb/pull/11889 Reviewed By: akankshamahajan15 Differential Revision: D49615758 Pulled By: anand1976 fbshipit-source-id: 156ff968ad014ac319f8840da7a48193e4cebfa9 |
|
anand76 | 48589b961f |
Fix updating the capacity of a tiered cache (#11873)
Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf |
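A sketch of the resulting workflow. `NewTieredCache`, `TieredCacheOptions`, and `UpdateTieredCache` are named in the summary; the exact field names and signatures shown here are assumptions to check against `include/rocksdb/cache.h`.

```cpp
#include <memory>
#include "rocksdb/cache.h"

// Sketch: create a tiered cache from a total budget plus a compressed-
// secondary ratio, then resize it later through UpdateTieredCache().
std::shared_ptr<rocksdb::Cache> MakeAndResizeTieredCache() {
  rocksdb::LRUCacheOptions primary;
  rocksdb::CompressedSecondaryCacheOptions compressed;

  rocksdb::TieredCacheOptions tiered;
  tiered.cache_opts = &primary;
  tiered.comp_cache_opts = compressed;
  tiered.total_capacity = size_t{1} << 30;   // 1 GB total, per this PR's API
  tiered.compressed_secondary_ratio = 0.25;  // 25% to the compressed tier
  auto cache = rocksdb::NewTieredCache(tiered);

  // Later: grow the total budget to 2 GB and shift 40% of it to the
  // compressed tier; the primary/secondary split is re-derived from the ratio.
  rocksdb::UpdateTieredCache(cache, /*total_capacity=*/size_t{2} << 30,
                             /*compressed_secondary_ratio=*/0.4);
  return cache;
}
```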
|
Peter Dillinger | 77a1d6eafb |
Fix assertion failure in AutoHCC (#11877)
Summary: Example crash seen in crash test: ``` db_stress: cache/clock_cache.cc:237: bool rocksdb::clock_cache::{anonymous}::BeginSlotInsert(const rocksdb::clock_cache::ClockHandleBasicData&, rocksdb::clock_cache::ClockHandle&, uint64_t, bool*): Assertion `*already_matches == false' failed. ``` I was intentionally ignoring `already_matches` without resetting it to false for the next call. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11877 Test Plan: Reproducer no longer reproduces: ``` while ./cache_bench -cache_type=auto_hyper_clock_cache -threads=32 -populate_cache=0 -histograms=0 -report_problems -insert_percent=87 -lookup_insert_percent=2 -skew=10 -ops_per_thread=100 -cache_size=1000000; do echo hi; done ``` Reviewed By: cbi42 Differential Revision: D49562065 Pulled By: pdillinger fbshipit-source-id: 941062e6eac7a4b56157925b1cf2a0b15ff9cc9d |
|
Peter Dillinger | f6cb763409 |
Fix major performance bug in AutoHCC growth phase (#11871)
Summary: ## The Problem Mark Callaghan found a performance bug in yet-unreleased AutoHCC (which should have been found in my own testing). The observed behavior is very slow insertion performance as the table is growing into a very large structure. The root cause is the precarious combination of linear hashing (indexing into the table while allowing growth) and linear probing (for finding an empty slot to insert into). Naively combined, this is a disaster because in linear hashing, part of the table is twice as dense, as a first probing location, as the rest. Thus, even a modest load factor like 0.6 could cause the dense part of the table to degrade to linear search. The code had a correction for this imbalance, which works in steady-state operation, but failed to account for the concentrating effect of table growth. Specifically, newly-added slots were underpopulated which allowed old slots to become over-populated and degrade to linear search, even in single-threaded operation. Here's an example: ``` ./cache_bench -cache_type=auto_hyper_clock_cache -threads=1 -populate_cache=0 -value_bytes=500 -cache_size=3000000000 -histograms=0 -report_problems -ops_per_thread=20000000 -resident_ratio=0.6 ``` AutoHCC: Complete in 774.213 s; Rough parallel ops/sec = 25832 FixedHCC: Complete in 19.630 s; Rough parallel ops/sec = 1018840 LRUCache: Complete in 25.842 s; Rough parallel ops/sec = 773947 ## The Fix One small change is apparently sufficient to fix the problem, but I wanted to re-optimize the whole "finding a good empty slot" algorithm to improve safety margins for good performance and to improve typical case performance. The small change is to track the newly-added slot from Grow in Insert, when applicable, and use that slot for insertion if (a) the home slot is already occupied, and (b) the newly-added slot is empty. This appears to sufficiently load new slots while avoiding over-population of either old or new slots. See `likely_empty_slot`. However I've also made the logic much more resilient to parts of the table becoming over-populated. I tested a variant that used double hashing instead of linear probing and found that it hurt steady-state average-case performance, presumably due to loss of locality in the chains. And even conventional double hashing might not be ideally robust against density skew in the table (still present because of home location bias), because double hashing might choose a small increment that could take a long time to iterate to the under-populated part of the table. The compromise that seems to bring the best of each approach is this: do linear probing (+1 at a time) within a small bound (chosen bound of 4 based on performance testing) and then fall back on a double-hashing variant if no slot has been found. The double-hashing variant uses a probing increment that is always close to the golden ratio, relative to the table size, so that any under-populated regions of the table can be found relatively quickly, without introducing any additional skew. And the increment is varied slightly to avoid clustering effects that could happen with a fixed increment (regardless of how big it is). And that leaves us with one remaining problem: the double hashing increment might not be relatively prime to the table size, so the probing sequence might be a cycle that does not cover the full set of slots.
To solve this we can use a technique I developed many years ago (probably also developed by others) that simply adds one (in modular arithmetic) whenever we finish a (potentially incomplete) cycle. This is a simple and reasonably efficient way to iterate over all the slots without repetition, regardless of whether the increment is not relatively prime to the table size, or even zero. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11871 Test Plan: existing correctness tests, especially ClockCacheTest.ClockTableFull Intended follow-up: make ClockTableFull test more complete for AutoHCC ## Performance Ignoring old AutoHCC performance, as we established above it could be terrible. FixedHCC and LRUCache are unaffected by this change. All tests below include this change. ### Getting up to size, single thread (same cache_bench command as above, all three run at same time) AutoHCC: Complete in 26.724 s; Rough parallel ops/sec = 748400 FixedHCC: Complete in 19.987 s; Rough parallel ops/sec = 1000631 LRUCache: Complete in 28.291 s; Rough parallel ops/sec = 706939 Single-threaded faster than LRUCache (often / sometimes) is good. FixedHCC has an obvious advantage because it starts at full size. ### Multiple threads, steady state, high hit rate ~95% Using `-threads=10 -populate_cache=1 -ops_per_thread=10000000` and still `-resident_ratio=0.6` AutoHCC: Complete in 48.778 s; Rough parallel ops/sec = 2050119 FixedHCC: Complete in 46.569 s; Rough parallel ops/sec = 2147329 LRUCache: Complete in 50.537 s; Rough parallel ops/sec = 1978735 ### Multiple threads, steady state, low hit rate ~50% Change to `-resident_ratio=0.2` AutoHCC: Complete in 49.264 s; Rough parallel ops/sec = 2029884 FixedHCC: Complete in 49.750 s; Rough parallel ops/sec = 2010041 LRUCache: Complete in 53.002 s; Rough parallel ops/sec = 1886713 Don't expect AutoHCC to be consistently faster than FixedHCC, but they are at least similar in these benchmarks. Reviewed By: jowlyzhang Differential Revision: D49548534 Pulled By: pdillinger fbshipit-source-id: 263e4f4d71d0e9a7d91db3795b48fad75408822b |
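The cycle-coverage trick described above is easy to demonstrate in isolation. This toy function (not the actual clock_cache.cc code) walks a table of `size` slots by a fixed `increment` and shifts the cycle start by one each time a cycle closes, which visits every slot exactly once even when the increment shares factors with the table size or is zero:

```cpp
#include <cstddef>
#include <functional>

// Toy illustration of "add one whenever we finish a (potentially incomplete)
// cycle". Visits each slot index in [0, size) exactly once.
inline void VisitAllSlots(size_t home, size_t increment, size_t size,
                          const std::function<void(size_t)>& visit) {
  size_t start = home % size;  // start of the current cycle
  size_t pos = start;
  for (size_t count = 0; count < size; ++count) {
    visit(pos);
    pos = (pos + increment) % size;
    if (pos == start) {
      // Cycle closed without covering everything (or increment == 0):
      // shift by one to land on a residue class not yet visited.
      start = (start + 1) % size;
      pos = start;
    }
  }
}
```

For example, with `size = 12` and `increment = 8` (gcd 4), the visit order is 0, 8, 4, then 1, 9, 5, then 2, 10, 6, then 3, 11, 7: all twelve slots, none repeated.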
|
anand76 | 269478ee46 |
Support compressed and local flash secondary cache stacking (#11812)
Summary: This PR implements support for a three-tier cache - primary block cache, compressed secondary cache, and an nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To set up a three-tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e. admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b |
|
anand76 | 548aabfe5f |
Disable compressed secondary cache if capacity is 0 (#11863)
Summary: This PR makes disabling the compressed secondary cache by setting capacity to 0 a bit more efficient. Previously, inserts/lookups would go to the backing LRUCache before getting rejected due to 0 capacity. With this change, insert/lookup would return from ```CompressedSecondaryCache``` itself. Tests: Existing tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11863 Reviewed By: akankshamahajan15 Differential Revision: D49476248 Pulled By: anand1976 fbshipit-source-id: f0f17a5e3df7d8bfc06709f8f23c1302056ba590 |
|
Peter Dillinger | e67ee46642 |
Suppress TSAN reports on AutoHyperClockTable::Lookup (#11806)
Summary: This function uses racing reads for heuristic performance improvement. My change in https://github.com/facebook/rocksdb/issues/11792 only worked for clang, not gcc, and gcc does not accurately handle TSAN suppressions. I would have to mark much more code as suppressed than I want to. So I've taken a different approach: TSAN build does not use the racing reads but substitutes random results, as an extra test that a "correct" value is not needed for correct overall behavior. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11806 Test Plan: manual TSAN builds & tests with cache_bench Reviewed By: ajkr Differential Revision: D49100115 Pulled By: pdillinger fbshipit-source-id: d6d0dfb796d710b953212dd3fc171b6e88fadea1 |
|
Peter Dillinger | d01b1215bd |
Fix TSAN reports on AutoHCC (#11792)
Summary: Forgot to run TSAN test on latest revision of https://github.com/facebook/rocksdb/issues/11738 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11792 Test Plan: Use cache_bench to reproduce TSAN errors and observe fix Reviewed By: ajkr Differential Revision: D48953196 Pulled By: pdillinger fbshipit-source-id: 9e358b4768d8ddde86f84b451863263f661d7b80 |
|
Peter Dillinger | fe3405e80f |
Automatic table sizing for HyperClockCache (AutoHCC) (#11738)
Summary: This change adds an experimental next-generation HyperClockCache (HCC) with automatic sizing of the underlying hash table. Both the existing version (stable) and the new version (experimental for now) of HCC are available depending on whether an estimated average entry charge is provided in HyperClockCacheOptions. Internally, we call the two implementations AutoHyperClockCache (new) and FixedHyperClockCache (existing). The performance characteristics and much of the underlying logic are similar enough that AutoHCC is likely to make FixedHCC obsolete, and so it's best considered an evolution of the same technology or solution rather than an alternative. More specifically, both implementations share essentially the same logic for managing the state of individual entries in the cache, including metadata for reference counting and counting clocks for eviction. This metadata, which I like to call the "low-level HCC protocol," includes a read-write lock on entries, but relaxed consistency requirements on the cache (e.g. allowing rare duplication) mean high-level cache operations never wait for these low-level per-entry locks. FixedHCC is fully wait-free. AutoHCC is different in how entries are indexed into an efficient hash table. AutoHCC is "essentially wait-free" as there is no pattern of typical high-level operations on a large cache that can lead to one thread waiting on another to complete some work, though it can happen in some unusual/unlucky cases, or atypical uses such as erasing specific cache keys. Table growth and entry reclamation are more complex in AutoHCC compared to FixedHCC, so it uses some localized locking to manage that. AutoHCC uses linear hashing to grow the table as needed, with low latency and to a precise size. AutoHCC depends on anonymous mmap support from the OS (currently verified working on Linux, MacOS, and Windows) to allow the array underlying a hash table to grow in place without wasting resident memory on space reserved but unused. AutoHCC uses a form of chaining while FixedHCC uses open addressing and double hashing. More specifics: * In developing this PR, a rare availability bug (minor) was noticed in the existing HCC implementation of Release()+erase_if_last_ref, which is now inherited into AutoHCC. Fixing this without a performance regression will not be simple, so is left for follow-up work. * Some existing unit tests required adjustment of operational parameters or conditions to work with the new behaviors of AutoHCC. A number of bugs were found and fixed in the validation process, including getting unit tests in good working order. * Added an option to cache_bench, `-degenerate_hash_bits` for correctness stress testing described below. For this, the tool uses the reverse-engineered hash function for HCC to generate keys in which the specified number of hash bits, in critical positions, have a fixed value. Essentially each degenerate hash bit will halve the number of chain heads utilized and double the average chain length. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11738 Test Plan: unit tests updated, and already added to db crash test. Also ## Correctness The code includes generous assertions to check for unexpected states, especially at destruction time, so should be able to detect critical concurrency bugs. Less serious "availability bugs" in which cache data is hidden or cleanly lost are more difficult to detect, but also less scary for data correctness (as long as performance is good and the design is sound).
In average operation, the structure is extremely low stress and low contention (see next section) so stressing the corner case logic requires artificially stressing the operating conditions. First, we keep the structure small to increase the number of threads hitting the same chain or entry, and just one cache shard. Second, we artificially degrade the hashing so that chains are much longer than typical, using the new `-degenerate_hash_bits` option to cache_bench. Third, we re-create the structure from scratch frequently in order to exercise the Grow logic repeatedly and to get the benefit of the consistency checks in the structure's destructor in debug builds. For cache_bench this also means disabling the single-threaded "populate cache" step (normally used for steady state performance testing). And of course use many more threads than cores to have many preemptions. An effective test for working out bugs was this (using debug build of course): ``` while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=8000000 -threads=100 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done ``` Or even smaller cases. This setup has around 27 utilized chains, with around 35 entries each, and yield-waits more than 1 million times per second (very high contention; see next section). I have let this run for hours searching for any lingering issues. I've also run cache_bench under ASAN, UBSAN, and TSAN. ## Essentially wait free There is a counter for the number of yield() calls when one thread is waiting on another. When we pre-populate the structure in a single thread, ``` ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -populate_cache=1 -ops_per_thread=200000 2>&1 | grep Yield ``` We see something on the order of 1 yield call per second across 16 threads, even when we load the system with other jobs (parallel compilation). With -populate_cache=0, there are more yield opportunities with parallel table growth. On an otherwise unloaded system, we still see very small (single digit) yield counts, with a chance of getting into the thousands, and getting into 10s of thousands per second during the table growth phase if the system is loaded with other jobs. However, I am not worried about this if performance is still good (see next section). ## Overall performance Although cache_bench initially suggested performance very close to FixedHCC, there was a very noticeable performance hit under a db_bench setup like the one used in validating https://github.com/facebook/rocksdb/issues/10626. Much of the difference has been reduced by optimizing Lookup with a "naive" pass that will almost always find entries quickly, and only falling back to the careful Lookup algorithm when not found in the first pass. Setups (chosen to be sensitive to block cache performance), and compiled with USE_CLANG=1 JEMALLOC=1 PORTABLE=0 DEBUG_LEVEL=0: ``` TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 ``` ### No regression on FixedHCC Running before & after builds at the same time on a 48 core machine.
``` TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=readrandom[-X10],block_cache_entry_stats,cache_report_problems -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=24 -cache_type=fixed_hyper_clock_cache -seed=1234 ``` Before: readrandom [AVG 10 runs] : 847234 (± 8150) ops/sec; 59.2 (± 0.6) MB/sec 703MB max RSS After: readrandom [AVG 10 runs] : 851021 (± 7929) ops/sec; 59.5 (± 0.6) MB/sec 706MB max RSS Probably no material difference. ### Single-threaded performance Using `[-X2]` and `-threads=1` and `-duration=30`, running all three at the same time: lru_cache: 55100 ops/sec, then 55862 ops/sec (627MB max RSS) fixed_hyper_clock_cache: 60496 ops/sec, then 61231 ops/sec (626MB max RSS) auto_hyper_clock_cache: 47560 ops/sec, then 56081 ops/sec (626MB max RSS) So AutoHCC has more ramp-up cost in the first pass as the cache grows to the appropriate size. (In single-threaded operation, the parallelizability and per-op low latency of table growth is overall slower.) However, once up to size, its performance is comparable to LRUCache. FixedHCC's lean operations still win overall when a good estimate is available. If we look at HCC table stats, we can see that this configuration is not favorable to AutoHCC (and I have verified that other memory sizes do not yield substantially different results, until shards are under-sized for the full filters): FixedHCC: Slot occupancy stats: Overall 47% (124991/262144), Min/Max/Window = 28%/64%/500, MaxRun{Pos/Neg} = 17/22 AutoHCC: Slot occupancy stats: Overall 59% (125781/209682), Min/Max/Window = 43%/82%/500, MaxRun{Pos/Neg} = 76/16 Head occupancy stats: Overall 43% (92259/209682), Min/Max/Window = 24%/74%/500, MaxRun{Pos/Neg} = 19/26 Entries at home count: 53350 FixedHCC configuration is relatively good for speed, and not ideal for space utilization. As is typical, AutoHCC has tighter control on metadata usage (209682 x 64 bytes rather than 262144 x 64 bytes), and the higher load factor is slightly worse for speed. LRUCache also has more metadata usage, at 199680 x 96 bytes of tracked metadata (plus roughly another 10% of that untracked in the head pointers), and that metadata is subject to fragmentation. ### Parallel performance, high hit rate Now using `[-X10]` and `-threads=10`, all three at the same time lru_cache: [AVG 10 runs] : 263629 (± 1425) ops/sec; 18.4 (± 0.1) MB/sec 655MB max RSS, 97.1% cache hit rate fixed_hyper_clock_cache: [AVG 10 runs] : 479590 (± 8114) ops/sec; 33.5 (± 0.6) MB/sec 651MB max RSS, 97.1% cache hit rate auto_hyper_clock_cache: [AVG 10 runs] : 418687 (± 5915) ops/sec; 29.3 (± 0.4) MB/sec 657MB max RSS, 97.1% cache hit rate Even with just 10-way parallelism for each cache (though 30+/48 cores busy overall), LRUCache is already showing performance degradation, while AutoHCC is in the neighborhood of FixedHCC. And that brings us to the question of how AutoHCC holds up under extreme parallelism, so now independent runs with `-threads=100` (overloading 48 cores). lru_cache: 438613 ops/sec, 827MB max RSS fixed_hyper_clock_cache: 1651310 ops/sec, 812MB max RSS auto_hyper_clock_cache: 1505875 ops/sec, 821MB max RSS (Yield count: 1089 over 30s) Clearly, AutoHCC holds up extremely well under extreme parallelism, even closing some of the modest performance gap with FixedHCC. ### Parallel performance, low hit rate To get down to roughly 50% cache hit rate, we use `-cache_index_and_filter_blocks=0 -cache_size=1650000000` with `-threads=10`. 
Here the extra cost of running counting clock eviction, especially on the chains of AutoHCC, is evident, particularly with the lower contention of cache_index_and_filter_blocks=0:

lru_cache: 725231 ops/sec, 1770MB max RSS, 51.3% hit rate
fixed_hyper_clock_cache: 638620 ops/sec, 1765MB max RSS, 50.2% hit rate
auto_hyper_clock_cache: 541018 ops/sec, 1777MB max RSS, 50.8% hit rate

Reviewed By: jowlyzhang Differential Revision: D48784755 Pulled By: pdillinger fbshipit-source-id: e79813dc087474ac427637dd282a14fa3011a6e4 |
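As a back-of-envelope check of the metadata figures quoted in the single-threaded section above (assuming, as stated, they are per-slot handle sizes times slot counts), the tracked totals work out as follows:

```
#include <cstddef>

// Back-of-envelope arithmetic only; numbers are the ones quoted above.
constexpr size_t kFixedHCCBytes = 262144 * 64;  // 16,777,216 B (~16.0 MiB)
constexpr size_t kAutoHCCBytes  = 209682 * 64;  // 13,419,648 B (~12.8 MiB)
constexpr size_t kLRUBytes      = 199680 * 96;  // 19,169,280 B (~18.3 MiB)
// LRUCache also has roughly another 10% untracked in head pointers,
// bringing it to roughly 21 MB, and that metadata can fragment.
```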
|
Peter Dillinger | d3420464c3 |
cache_bench enhancements for jemalloc etc. (#11758)
Summary:
* Add some options to cache_bench to use JemallocNodumpAllocator
* Make the num_shard_bits option use and report cache-specific defaults
* Add a usleep option to sleep between operations, for simulating a workload with more CPU idle/wait time
* Use const& for JemallocAllocatorOptions, to improve API usability (e.g. can bind to a temporary `{}`)
* InstallStackTraceHandler()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11758

Test Plan: manual

Reviewed By: jowlyzhang Differential Revision: D48668479 Pulled By: pdillinger fbshipit-source-id: b6032fbe09444cdb8f1443a5e017d2eea4f6205a |
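To illustrate the const& point, a minimal sketch (the function below is hypothetical, not the RocksDB API):

```
struct JemallocAllocatorOptions { /* fields omitted */ };

// Taking the options by const& lets callers pass a temporary, e.g. the
// braced default below. A non-const lvalue reference would reject it.
void ConfigureAllocator(const JemallocAllocatorOptions& opts);

// ConfigureAllocator({});  // OK with const&; ill-formed with a non-const reference
```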
|
Changyu Bi | c2aad555c3 |
Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666)
Summary:
Optionally enable zstd checksum flag (
|
|
anand76 | a1743e85be |
Implement an allow-cache-hits admission policy for the compressed secondary cache (#11713)
Summary: This PR implements a new admission policy for the compressed secondary cache, which includes the functionality of the existing policy and also admits items evicted from the primary block cache with the hit bit set. Effectively, the new policy works as follows (sketched in pseudocode below) -
1. When an item is demoted from the primary cache without a hit, a placeholder is inserted in the compressed cache. A second demotion will insert the full entry.
2. When an item is promoted from the compressed cache to the primary cache for the first time, a placeholder is inserted in the primary. The second promotion inserts the full entry, while erasing it from the compressed cache.
3. If an item is demoted from the primary cache with the hit bit set, it is immediately inserted in the compressed secondary cache.

The ```TieredVolatileCacheOptions``` has been updated with a new option, ```adm_policy```, which allows the policy to be selected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11713

Reviewed By: pdillinger Differential Revision: D48444512 Pulled By: anand1976 fbshipit-source-id: b4cbf8c169a88097dff08e36e8bc4b3088de1492 |
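A hedged pseudocode sketch of the three rules above (all names are hypothetical stand-ins; this is not the RocksDB implementation):

```
#include <string>
#include <unordered_map>

struct Entry {
  std::string key;
  bool hit_bit_set = false;  // was the entry hit while in the primary cache?
};

struct CacheStub {
  std::unordered_map<std::string, bool> map;  // true = full entry, false = placeholder
  bool HasPlaceholder(const std::string& k) {
    auto it = map.find(k);
    return it != map.end() && !it->second;
  }
  void InsertPlaceholder(const std::string& k) { map[k] = false; }
  void InsertFull(const Entry& e) { map[e.key] = true; }
  void Erase(const std::string& k) { map.erase(k); }
};

CacheStub primary, compressed_secondary;

void OnEvictFromPrimary(const Entry& e) {
  if (e.hit_bit_set) {
    compressed_secondary.InsertFull(e);             // rule 3: admit immediately
  } else if (compressed_secondary.HasPlaceholder(e.key)) {
    compressed_secondary.InsertFull(e);             // rule 1: second demotion
  } else {
    compressed_secondary.InsertPlaceholder(e.key);  // rule 1: first demotion
  }
}

void OnPromoteToPrimary(const Entry& e) {
  if (primary.HasPlaceholder(e.key)) {
    primary.InsertFull(e);                          // rule 2: second promotion
    compressed_secondary.Erase(e.key);              // ... erasing it from the compressed cache
  } else {
    primary.InsertPlaceholder(e.key);               // rule 2: first promotion
  }
}
```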
|
Peter Dillinger | ef6f025563 |
Placeholder for AutoHyperClockCache, more (#11692)
Summary:
* The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme-case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation.
* Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache).
* Simplify some existing tests not to depend directly on cache type.
* Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation.
* Added means of choosing auto_hyper_clock_cache in cache_bench, db_bench, and db_stress, including adding it to the crash test.
* Add another templated class BaseHyperClockCache to reduce future copy-paste.
* Added ReportProblems support to cache_bench.
* Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with the linear hashing to be used in the Auto implementation. Example with the current Fixed HCC:

```
2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17
```

In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and the highest 60%. The longest run of occupied slots is 18 and the longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692

Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in.

Reviewed By: jowlyzhang Differential Revision: D48247413 Pulled By: pdillinger fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da |
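A short sketch of how the selection described in the first bullet is planned to work from the caller's side (the capacity value is arbitrary; at the time of this PR, passing 0 is only a placeholder until AutoHCC lands):

```
#include "rocksdb/cache.h"

std::shared_ptr<rocksdb::Cache> MakeAutoSelectedCache() {
  rocksdb::HyperClockCacheOptions opts(
      /*capacity=*/1 << 30,
      /*estimated_entry_charge=*/0);  // 0 is planned to select AutoHyperClockCache
  return opts.MakeSharedCache();
}
```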
|
Peter Dillinger | 99daea3481 |
Prepare tests for new HCC naming (#11676)
Summary: I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache. This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676 Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache Reviewed By: ajkr Differential Revision: D48103165 Pulled By: pdillinger fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736 |
|
Peter Dillinger | cdb11f5ce6 |
More minor HCC refactoring + typed mmap (#11670)
Summary: More code leading up to dynamic HCC.
* Small enhancements to cache_bench
* Extra assertion in Unref
* Improve a CAS loop in ChargeUsageMaybeEvictStrict
* Put load factor constants in appropriate class
* Move `standalone` field to HyperClockTable::HandleImpl because it can be encoded differently in the upcoming dynamic HCC.
* Add a typed version of MemMapping to simplify some future code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11670

Test Plan: existing tests, unit test added for TypedMemMapping

Reviewed By: jowlyzhang Differential Revision: D48056464 Pulled By: pdillinger fbshipit-source-id: 186b7d3105c5d6d2eb6a592369bc10a97ee14a15 |
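For the last bullet, a rough sketch (not the RocksDB class) of what a typed MemMapping wrapper buys: own an anonymous mapping but hand out T* instead of raw void*/char*.

```
#include <sys/mman.h>
#include <cstddef>

// Sketch only; the real TypedMemMapping differs in details.
template <typename T>
class TypedMemMapping {
 public:
  explicit TypedMemMapping(size_t count) : bytes_(count * sizeof(T)) {
    void* addr = mmap(nullptr, bytes_, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    ptr_ = (addr == MAP_FAILED) ? nullptr : static_cast<T*>(addr);
  }
  ~TypedMemMapping() {
    if (ptr_ != nullptr) munmap(ptr_, bytes_);
  }
  T* Get() const { return ptr_; }
  T& operator[](size_t i) { return ptr_[i]; }

 private:
  size_t bytes_;
  T* ptr_ = nullptr;
};
```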
|
Peter Dillinger | f9de217353 |
Some cache_bench enhancements (#11661)
Summary: ... used in validating some HyperClockCache development in progress.
* Revamp the "populate cache" step to avoid redundant insertions (very rare in practice) and more consistently approach the desired resident_ratio while maintaining appropriate skew (still not perfect).
* Track and print hit ratio on lookups, to ensure a fair comparison is happening between implementations etc.
* Add an option to disable tracking and printing histograms (lots of output)
* Add an option to specify a random seed (for more reproducibility)
* Remove confusing/redundant "-skewed" option

Uses BitwiseAnd from https://github.com/facebook/rocksdb/issues/11660 (tested there)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11661

Test Plan: manual

Reviewed By: akankshamahajan15, jowlyzhang Differential Revision: D47937671 Pulled By: pdillinger fbshipit-source-id: 85a2bb881b1bca4f63e015bac684105fd91c9f35 |
|
Peter Dillinger | f4e4039f00 |
Add some more bit operations to internal APIs (#11660)
Summary: BottomNBits() - there is a single fast instruction for this on x86 since BMI2, but testing with godbolt indicates you need at least GCC 10 for the compiler to choose that instruction from the obvious C++ code. https://godbolt.org/z/5a7Ysd41h BitwiseAnd() - this is a convenience function that works around the language flaw that the type of the result of x & y is the larger of the two input types, when it should be the smaller. This can save some ugly static_cast. I expect to use both of these in coming HyperClockCache developments, and have applied them in a couple of places in existing code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11660 Test Plan: unit tests added Reviewed By: jowlyzhang Differential Revision: D47935531 Pulled By: pdillinger fbshipit-source-id: d148c43a1e51df4a1c549b93aaf2725a3f8d3bd6 |
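A minimal sketch of the semantics of the two helpers described above (not the actual RocksDB definitions, which live in internal headers and may differ in details):

```
#include <cstdint>
#include <type_traits>

// BottomNBits(v, n): keep the low n bits of v. With BMI2 and a new-enough
// compiler, the obvious expression compiles to a single instruction.
template <typename T>
inline T BottomNBits(T v, int nbits) {
  static_assert(std::is_unsigned_v<T>, "expects an unsigned type");
  return nbits >= static_cast<int>(8 * sizeof(T))
             ? v
             : static_cast<T>(v & ((T{1} << nbits) - T{1}));
}

// BitwiseAnd(x, y): x & y, but typed as the smaller of the two input types,
// saving a static_cast at call sites.
template <typename A, typename B>
inline std::conditional_t<(sizeof(A) < sizeof(B)), A, B> BitwiseAnd(A x, B y) {
  using Smaller = std::conditional_t<(sizeof(A) < sizeof(B)), A, B>;
  return static_cast<Smaller>(x & y);
}
```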
|
Peter Dillinger | c41122b1a0 |
Even more HyperClockCache refactoring (#11630)
Summary: ... ahead of the dynamic variant.
* Introduce an Unref function for a common pattern. Cases that were previously using std::memory_order_acq_rel were doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value, so we do not need the acquire semantic, just release.
* Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function.
* Based on a linter suggestion, use const Func& parameter type instead of Func for templated callable parameters.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11630

Test Plan: existing tests, and a performance test with a release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster.

Reviewed By: jowlyzhang Differential Revision: D47657334 Pulled By: pdillinger fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3 |
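To illustrate the memory-ordering point in the first bullet, a minimal sketch with hypothetical names (not the actual RocksDB code):

```
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> meta{0};
constexpr uint64_t kRefDelta = 1;

void UnrefKeepingOldValue() {
  // The pre-updated value is used afterwards, so acquire+release is needed.
  uint64_t old_meta = meta.fetch_sub(kRefDelta, std::memory_order_acq_rel);
  (void)old_meta;  // e.g. inspected for state flags
}

void UnrefDiscardingOldValue() {
  // The pre-updated value is discarded, so release alone suffices.
  meta.fetch_sub(kRefDelta, std::memory_order_release);
}
```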
|
Andrew Kryczka | 05c3b8ecac |
Prepare for specialized interface for row cache (#11620)
Summary: An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema. This PR prepares for the specialized `RowCache` interface by updating the migration plan of https://github.com/facebook/rocksdb/issues/11450. I added a release note for the removed APIs and didn't mention the added ones for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11620 Reviewed By: pdillinger Differential Revision: D47536962 Pulled By: ajkr fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691 |
|
Peter Dillinger | 846db9d7b1 |
Refactor ClockCache ApplyToEntries (#11609)
Summary: ... ahead of the planned dynamic HCC variant. This change simplifies some logic while still enabling future code sharing between implementation variants. Detail: For complicated reasons, using a std::function parameter to `ConstApplyToEntriesRange` with a lambda argument does not play nice with templated HandleImpl. An explicit conversion to std::function would be needed for it to compile. Templating the function type is the easy work-around. Also made some functions from https://github.com/facebook/rocksdb/issues/11572 private, as recommended. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11609 Test Plan: existing tests Reviewed By: jowlyzhang Differential Revision: D47407415 Pulled By: pdillinger fbshipit-source-id: 0f65954db16335999b78fb7d2563ec627624cef0 |
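A minimal sketch of the work-around described above (simplified, hypothetical signatures; not the RocksDB code): templating the callable type lets a lambda pass through without an explicit std::function conversion, and also allows inlining.

```
#include <cstddef>
#include <functional>

template <typename HandleImpl>
struct Table {
  HandleImpl* array = nullptr;
  size_t length = 0;

  // With std::function, callers holding a templated HandleImpl may need to
  // convert their lambda explicitly for this to compile.
  void ApplyWithStdFunction(
      const std::function<void(const HandleImpl&)>& fn) const {
    for (size_t i = 0; i < length; ++i) fn(array[i]);
  }

  // Templating the callable type sidesteps the conversion entirely.
  template <typename Func>
  void ApplyWithTemplate(const Func& fn) const {
    for (size_t i = 0; i < length; ++i) fn(array[i]);
  }
};
```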
|
Peter Dillinger | b1b6f87fbe |
Some small improvements to HyperClockCache (#11601)
Summary: Stacked on https://github.com/facebook/rocksdb/issues/11572
* Minimize use of std::function and lambdas to minimize chances of the compiler heap-allocating closures (unnecessary stress on the allocator). It appears that converting FindSlot to a template enables inlining the lambda parameters, avoiding heap allocations.
* Clean up some logic with FindSlot (FIXMEs from https://github.com/facebook/rocksdb/issues/11572)
* Fix handling of the rare case of probing all slots, with a new unit test. (Previously Insert would not roll back displacements in that case, which would kill performance if it were to happen.)
* Add an -early_exit option to cache_bench for gathering memory stats before deallocation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11601

Test Plan: unit test added for probing all slots

## Seeing heap allocations

Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache` before https://github.com/facebook/rocksdb/issues/11572 vs. after this change. Before, we see this in the interesting bin statistics:

```
size  nrequests
----  ---------
32    578460
64    24340
8192  578460
```

And after:

```
size  nrequests
----  ---------
32    (insignificant)
64    24370
8192  579130
```

## Performance test

Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench`

Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000` in before and after configurations, simultaneously:

```
Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442
After: Complete in 32.773 s; Rough parallel ops/sec = 2441019
```

Reviewed By: jowlyzhang Differential Revision: D47375092 Pulled By: pdillinger fbshipit-source-id: 46f0f57257ddb374290a0a38c651764ea60ba410 |
|
Peter Dillinger | c3c84b3397 |
Refactor (Hyper)ClockCache code for upcoming changes (#11572)
Summary: Separate out some functionality that will be common to both static and dynamic HCC into BaseClockTable. Table::InsertState and GrowIfNeeded will be used by the dynamic HCC, so they don't make much sense right now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11572 Test Plan: existing tests. No functional changes intended. Performance test in subsequent PR https://github.com/facebook/rocksdb/issues/11601 Reviewed By: jowlyzhang Differential Revision: D47110496 Pulled By: pdillinger fbshipit-source-id: 379bd433322a42ea28c0043b41ec24956d21e7aa |