Commit graph

254 commits

Author SHA1 Message Date
anand76 f48b64460e Provide a way to invoke a callback for a Cache handle (#12987)
Summary:
Add the `ApplyToHandle` method to the `Cache` interface to allow a caller to request the invocation of a callback on the given cache handle. The goal here is to allow a cache that manages multiple cache instances to use a callback on a handle to determine which instance it belongs to. For example, the callback can hash the key and use that to pick the correct target instance. This is useful to redirect methods like `Ref` and `Release`, which don't know the cache key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12987

Reviewed By: pdillinger

Differential Revision: D62151907

Pulled By: anand1976

fbshipit-source-id: e4ffbbb96eac9061d2ab0e7e1739eea5ebb1cd58
2024-09-06 14:54:09 -07:00
Peter Dillinger 0d3aaf7c0f Ensure SSTs compressed in tiered_secondary_cache_test (#12993)
Summary:
It appears the arm testsuite is failing because it is building without snappy, which is causing the SST files not to be compressed, which somehow causes these tests to fail. Manually setting LZ4 which is already required.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12993

Test Plan: reproduced and verified fix on ARM laptop

Reviewed By: anand1976

Differential Revision: D62216451

Pulled By: pdillinger

fbshipit-source-id: 3f21fcd9be0edaa66c7eca0cb7d56b998171e263
2024-09-05 10:36:29 -07:00
Peter Dillinger b34cef57b7 Support pro-actively erasing obsolete block cache entries (#12694)
Summary:
Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries.

Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive.

This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on.

The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizeable by CF.

Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths.

Notable auxiliary code details:
* Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce an current failure.)
* Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item)

Marked EXPERIMENTAL until more thorough validation is complete.

Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694

Test Plan:
* Unit tests added, including refactoring an existing test to make better use of parameterized tests.
* Added to crash test.
* Performance, sample command:
```
for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done
```

Averaging 10 runs each case, block cache data block hit rates

```
lru_cache
UA=0   -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0
UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0

fixed_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9
UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2

auto_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5
UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8
```

Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings).

Reviewed By: ajkr

Differential Revision: D57932442

Pulled By: pdillinger

fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c
2024-06-07 08:57:11 -07:00
anand76 0ae3d9f98d Fix stale memory access with FSBuffer and tiered sec cache (#12712)
Summary:
A `BlockBasedTable` with `TieredSecondaryCache` containing a NVM cache inserts blocks  into the compressed cache and the corresponding compressed block into the NVM cache.  The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e returning a FS allocated buffer rather than caller provided), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than local variable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712

Test Plan:
1. Add a unit test
2. Release validation stress test

Reviewed By: jaykorean

Differential Revision: D57974026

Pulled By: anand1976

fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd
2024-05-30 12:33:58 -07:00
anand76 6cc7ad15b6 Implement secondary cache admission policy to allow all evicted blocks (#12599)
Summary:
Add a secondary cache admission policy to admit all blocks evicted from the block cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12599

Reviewed By: pdillinger

Differential Revision: D56891760

Pulled By: anand1976

fbshipit-source-id: 193c98c055aa3477f4e3a78e5d3daef27a5eacf4
2024-05-02 11:23:35 -07:00
Richard Barnes ee3159e7dd Remove extra semi colon from icsp/lib/logging/IcspLogRpcMessage.cpp
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D55534619

fbshipit-source-id: 26f3c35a51b38a3cbfa12a6f76a2bb783a7b4d8e
2024-03-31 10:26:34 -07:00
Changyu Bi f77b788545 Fix a bug in LRUCacheShard::LRU_Insert (#12429)
Summary:
we saw crash test fail with
```
lru_cache.cc:249: void rocksdb::lru_cache::LRUCacheShard::LRU_Remove(rocksdb::lru_cache::LRUHandle *): Assertion `high_pri_pool_usage_ >= e->total_charge' failed.
```
One cause for this is that `lru_low_pri_` pointer is not updated in `LRU_insert()` before we try to balance high pri and low pri pool in `MaintainPoolSize();`. A repro unit test is provided.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12429

Test Plan:
Not able to reproduce the failure with db_stress yet.
`./lru_cache_test --gtest_filter="*InsertAfterReducingCapacity*`. It fails the assertion before this PR.

Reviewed By: pdillinger

Differential Revision: D54908919

Pulled By: cbi42

fbshipit-source-id: f485fdbc0ea61c8092a0be5fe561a59c15c78fd3
2024-03-14 14:58:30 -07:00
yuzhangyu@fb.com 1cfdece85d Run internal cpp modernizer on RocksDB repo (#12398)
Summary:
When internal cpp modernizer attempts to format rocksdb code, it will replace macro `ROCKSDB_NAMESPACE`  with its default definition `rocksdb` when collapsing nested namespace. We filed a feedback for the tool T180254030 and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested us to run the modernizer tool ourselves so future auto codemod attempts will be smaller. This diff contains:

Running
`xplat/scripts/codemod_service/cpp_modernizer.sh`
in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that has a non meta copyright comment)
without swapping out the namespace macro `ROCKSDB_NAMESPACE`

Followed by RocksDB's own
`make format`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398

Test Plan: Auto tests

Reviewed By: hx235

Differential Revision: D54382532

Pulled By: jowlyzhang

fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa
2024-03-04 10:08:32 -08:00
Richard Barnes ced333ee45 Remove extra semi colon from instagram/ranking/mezql/shots/parser/fast/Token.cpp
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D54362213

fbshipit-source-id: 0bbc9e5fce917fc4f72423f0a4c8cb2c2b1759dd
2024-03-04 06:32:50 -08:00
Peter Dillinger 54cb9c77d9 Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.

I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`.  If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.

With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.

A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).

Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.

I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308

Test Plan: existing tests, CI

Reviewed By: ltamasi

Differential Revision: D53204947

Pulled By: pdillinger

fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 10:44:11 -08:00
Peter Dillinger 76c834e441 Remove 'virtual' when implied by 'override' (#12319)
Summary:
... to follow modern C++ style / idioms.

Used this hack:
```
for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319

Test Plan: existing tests, CI

Reviewed By: jaykorean

Differential Revision: D53275303

Pulled By: pdillinger

fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377
2024-01-31 13:14:42 -08:00
anand76 b49f9cdd3c Add CompressionOptions to the compressed secondary cache (#12234)
Summary:
Add ```CompressionOptions``` to ```CompressedSecondaryCacheOptions``` to allow users to set options such as compression level. It allows performance to be fine tuned.

Tests -
Run db_bench and verify compression options in the LOG file

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12234

Reviewed By: ajkr

Differential Revision: D52758133

Pulled By: anand1976

fbshipit-source-id: af849fbffce6f84704387c195d8edba40d9548f6
2024-01-16 12:21:27 -08:00
anand76 cc069f25b3 Add some compressed and tiered secondary cache stats (#12150)
Summary:
Add statistics for more visibility.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12150

Reviewed By: akankshamahajan15

Differential Revision: D52184633

Pulled By: anand1976

fbshipit-source-id: 9969e05d65223811cd12627102b020bb6d229352
2023-12-15 11:34:08 -08:00
Peter Dillinger 88bc91f3cc Cap eviction effort (CPU under stress) in HyperClockCache (#12141)
Summary:
HyperClockCache is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list.  All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem.

However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable (they are pinned). Unlike LRU which can find any evictable entries immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system.

To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141

Test Plan:
unit test included

 ## Performance test

We can use cache_bench to validate no regression (CPU and memory) in normal operation, and to measure change in behavior when cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well). Base command reproducing bad behavior:

```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```

```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident

Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```

Now let us sample a range of values in the solution space:

```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident

After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident

After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident

After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident

After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```

I looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so is chosen as the default.

```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident

Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```

The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress.

```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```

No measurable difference under normal circumstances.

Some tests repeated with FixedHCC, with similar results.

Reviewed By: anand1976

Differential Revision: D52174755

Pulled By: pdillinger

fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8
2023-12-14 22:13:32 -08:00
Peter Dillinger c74531b1d2 Fix a nuisance compiler warning from clang (#12144)
Summary:
Example:

```
cache/clock_cache.cc:56:7: error: fallthrough annotation in unreachable code [-Werror,-Wimplicit-fallthrough]
      FALLTHROUGH_INTENDED;
      ^
./port/lang.h:10:30: note: expanded from macro 'FALLTHROUGH_INTENDED'
                             ^
```

In clang < 14, this is annoyingly generated from -Wimplicit-fallthrough, but was changed to -Wunreachable-code-fallthrough (implied by -Wunreachable-code) in clang 14. See https://reviews.llvm.org/D107933 for how this nuisance pattern generated false positives similar to ours in the Linux kernel.

Just to underscore the ridiculousness of this warning, here an error is reported on the annotation, not the call to do_something(), depending on the constexpr value (https://godbolt.org/z/EvxqdPTdr):

```
#include <atomic>
void do_something();
void test(int v) {
    switch (v) {
        case 1:
            if constexpr (std::atomic<long>::is_always_lock_free) {
                return;
            } else {
                do_something();
                [[fallthrough]];
            }
        case 2:
            return;
    }
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12144

Test Plan: Added the warning to our Makefile for USE_CLANG, which reproduced the warning-as-error as shown above, but is now fixed.

Reviewed By: jaykorean

Differential Revision: D52139615

Pulled By: pdillinger

fbshipit-source-id: ba967ae700c0916d1a478bc465cf917633e337d9
2023-12-13 15:58:46 -08:00
anand76 ebb5242d55 Sanitize the secondary_cache option in TieredCacheOptions (#12137)
Summary:
Sanitize the `secondary_cache` field in the `cache_opts` option of `TieredCacheOptions` to `nullptr` if set by the user. The nvm secondary cache should be directly set in `TieredCacheOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12137

Reviewed By: akankshamahajan15

Differential Revision: D52063817

Pulled By: anand1976

fbshipit-source-id: 255116c665a9b908c8f44109a2d331d4b73e7591
2023-12-12 10:58:00 -08:00
anand76 c1b84d0437 Fix false negative in TieredSecondaryCache nvm cache lookup (#12134)
Summary:
There is a bug in the `TieredSecondaryCache` that can result in a false negative. This can happen when a MultiGet does a cache lookup that gets a hit in the `TieredSecondaryCache` local nvm cache tier, and the result is available before MultiGet calls `WaitAll` (i.e the nvm cache `SecondaryCacheResultHandle` `IsReady` returns true).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12134

Test Plan: Add a new unit test in tiered_secondary_cache_test

Reviewed By: akankshamahajan15

Differential Revision: D52023309

Pulled By: anand1976

fbshipit-source-id: e5ae681226a0f12753fecb2f6acc7e5f254ae72b
2023-12-11 16:59:59 -08:00
anand76 336a74db60 Add some asserts in ~CacheWithSecondaryAdapter (#12082)
Summary:
Add some asserts in the `CacheWithSecondaryAdapter` destructor to help debug a crash test failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12082

Reviewed By: cbi42

Differential Revision: D51486041

Pulled By: anand1976

fbshipit-source-id: 76537beed31ba27ab9ac8b4ce6deb775629e3be5
2023-11-20 17:48:17 -08:00
anand76 2222caec9e Make CacheWithSecondaryAdapter reservation accounting more robust (#12059)
Summary:
`CacheWithSecondaryAdapter` can distribute placeholder reservations across the primary and secondary caches. The current implementation of the accounting is quite complicated in order to avoid using a mutex. This may cause the accounting to be slightly off after changes to the cache capacity and ratio, resulting in assertion failures. There's also a bug in the unlikely event that the total reservation exceeds the cache capacity. Furthermore, the current implementation is difficult to reason about.

This PR simplifies it by doing the accounting while holding a mutex. The reservations are processed in 1MB chunks in order to avoid taking a lock too frequently. As a side effect, this also removes the restriction of not allowing to increase the compressed secondary cache capacity after decreasing it to 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12059

Test Plan: Existing unit tests, and a new test for capacity increase from 0

Reviewed By: pdillinger

Differential Revision: D51278686

Pulled By: anand1976

fbshipit-source-id: 7e1ad2c50694772997072dd59cab35c93c12ba4f
2023-11-14 16:25:52 -08:00
Peter Dillinger 65cde19f40 Safer wrapper for std::atomic, use in HCC (#12051)
Summary:
See new atomic.h file comments for motivation.

I have updated HyperClockCache to use the new atomic wrapper, fixing a few cases where an implicit conversion was accidentally used and therefore mixing std::memory_order_seq_cst where release/acquire ordering (or relaxed) was intended. There probably wasn't a real bug because I think all the cases happened to be in single-threaded contexts like constructors/destructors or statistical ops like `GetCapacity()` that don't need any particular ordering constraints.

Recommended follow-up:
* Replace other uses of std::atomic to help keep them safe from bugs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12051

Test Plan:
Did some local correctness stress testing with cache_bench. Also triggered 15 runs of fbcode_blackbox_crash_test and saw no related failures (just 3 failures in ~CacheWithSecondaryAdapter(), already known)

No performance difference seen before & after running simultaneously:
```
(while ./cache_bench -cache_type=fixed_hyper_clock_cache -populate_cache=0 -cache_size=3000000000 -ops_per_thread=500000 -threads=12 -histograms=0 2>&1 | grep parallel; do :; done) | awk '{ s += $3; c++; print "Avg time: " (s/c);}'
```

... for both fixed_hcc and auto_hcc.

Reviewed By: jowlyzhang

Differential Revision: D51090518

Pulled By: pdillinger

fbshipit-source-id: eeb324facb3185584603f9ea0c4de6f32919a2d7
2023-11-08 13:28:43 -08:00
Peter Dillinger 9af25a392b Clean up AutoHyperClockTable::PurgeImpl (#12052)
Summary:
There was some unncessary logic (e.g. a dead assignment to home_shift) left over from earlier revision of the code.

Also, rename confusing ChainRewriteLock::new_head_ / GetNewHead() to saved_head_ / GetSavedHead().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12052

Test Plan: existing tests

Reviewed By: jowlyzhang

Differential Revision: D51091499

Pulled By: pdillinger

fbshipit-source-id: 4b191b60a2b16085681e59d49c4d97e802869db8
2023-11-07 16:35:19 -08:00
Peter Dillinger 16ae3548a2 AutoHCC: Improve/fix allocation/detection of grow homes (#12047)
Summary:
This change simplifies some code and logic by introducing a new atomic field that tracks the next slot to grow into. It should offer slightly better performance during the growth phase (not measurable; see Test Plan below) and fix a suspected (but unconfirmed) bug like this:
* Thread 1 is in non-trivial SplitForGrow() with grow_home=n.
* Thread 2 reaches Grow() with grow_home=2n, and waits at the start of SplitForGrow() for the rewrite lock on n. By this point, the head at 2n is marked with the new shift amount but no chain is locked.
* Thread 3 reaches Grow() with grow_home=4n, and waits before SplitForGrow() for the rewrite lock on n. By this point, the head at 4n is marked with the new shift amount but no chain is locked.
* Thread 4 reaches Grow() with grow_home=8n and meets no resistance to proceeding through a SplitForGrow() on an empty chain, permanently missing out on any entries from chain n that should have ended up here.

This is fixed by not updating the shift amount at the grow_home head until we have checked the preconditions that Grow()s feeding into this one have completed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12047

Test Plan:
Some manual cache_bench stress runs, and about 20 triggered runs of fbcode_blackbox_crash_test

No discernible performance difference on this benchmark, running before & after in parallel for a few minutes:
```
(while ./cache_bench -cache_type=auto_hyper_clock_cache -populate_cache=0 -cache_size=3000000000 -ops_per_thread=50000 -threads=12 -histograms=0 2>&1 | grep parallel; do :; done) | awk '{ s += $3; c++; print "Avg time: " (s/c);}'
```

Reviewed By: jowlyzhang

Differential Revision: D51017007

Pulled By: pdillinger

fbshipit-source-id: 5f6d6a6194fc966f94693f3205ed75c87cdad269
2023-11-07 10:40:39 -08:00
Peter Dillinger 92dc5f3e67 AutoHCC: fix a bug with "blind" Insert (#12046)
Summary:
I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions.

It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`).

The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have
* Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain.
* Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after is it marked for (de)construction). And after the purge of the chain, the entry is marked empty.
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain.

This is easily fixed by holding on to a reference until inserted onto the chain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046

Test Plan:
As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed.

Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000)

Reviewed By: jowlyzhang

Differential Revision: D51016979

Pulled By: pdillinger

fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b
2023-11-06 16:06:01 -08:00
Peter Dillinger a399bbc037 More fixes and enhancements for cache_bench (#12041)
Summary:
Mostly things for using cache_bench for stress/correctness testing.
* Make secondary_cache_uri option work with HCC (forgot to update when secondary support was added for HCC)
* Add -pinned_ratio option to keep more than just one entry per thread pinned. This can be important for testing eviction stress.
* Add -vary_capacity_ratio for testing dynamically changing capacity.

Also added some overrides to CacheWrapper to help with diagnostic output.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12041

Test Plan: manual, make check

Reviewed By: jowlyzhang

Differential Revision: D51013430

Pulled By: pdillinger

fbshipit-source-id: 7914adc1218f0afacace05ccd77d3bfb91a878d0
2023-11-06 09:59:09 -08:00
anand76 52be8f54f2 Add APIs to query secondary cache capacity and usage for TieredCache (#12011)
Summary:
In `TieredCache`, the underlying compressed secondary cache is hidden from the user. So we need a way to query the capacity, as well as the portion of cache reservation charged to the compressed secondary cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12011

Test Plan: Update the unit tests

Reviewed By: akankshamahajan15

Differential Revision: D50651943

Pulled By: anand1976

fbshipit-source-id: 06d1cb5edb75a790c919bce718e2ff65f5908220
2023-10-25 16:54:50 -07:00
Peter Dillinger ef0c3f08fa Fix rare destructor bug in AutoHCC (#11988)
Summary:
and some other small enhancements/fixes:
* The main bug fixed is that in some rare cases, the "published" table size might be smaller than the actual table size. This is a transient state that can happen with concurrent growth that is normally fixed after enough insertions, but if the cache is destroyed soon enough after growth, it could fail to fully destroy some entries and cause assertion failures. We can fix this by detecting the true table size in the destructor.
* Change the "too many iterations" debug threshold from 512 to 768. We might have hit at least one false positive failure. (Failed despite legitimate operation.)
* Added some stronger assertions in some places to aid in debugging.
* Use COERCE_CONTEXT_SWITCH to make behavior of Grow less predictable in terms of thread interleaving. (Might add in more places.) This was useful in reproducing the destructor bug.
* Fix some comments with typos or that were based on earlier revisions of the code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11988

Test Plan:
Variants of this bug-finding command:
```
USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 COERCE_CONTEXT_SWITCH=1 DEBUG_LEVEL=2 make -j32 cache_bench && while ROCKSDB_DEBUG=1 ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=80000000 -threads=32 -populate_cache=0 -ops_per_thread=1000 -num_shard_bits=0; do :; done
```

Reviewed By: jowlyzhang

Differential Revision: D50470318

Pulled By: pdillinger

fbshipit-source-id: d407a8bb0b6d2ddc598a954c319a1640136f12f2
2023-10-19 14:51:22 -07:00
Peter Dillinger dc576af0fd AutoHCC - fix a rare loop condition in Lookup (#11948)
Summary:
Saw this in stress test:
```
db_stress: cache/clock_cache.cc:3152:[...] Assertion `i < 0x2000' failed.
```

The problem is related to Lookups on a chain currently involved in a Grow operation. To avoid Lookup waiting on Grow, Lookup is able to walk a chain whose first part is already migrated and tail is not yet migrated, so is mixed with entries with a different destination home (according to `home_shift`) than what we're looking for. This is fine until we save one of these entries as a safe point in the chain to backtrack to (`read_ref_on_chain`) in case of concurrent modification and end up backtracking to it. In that case, we can get stuck on the wrong destination chain and keep trying to backtrack to an entry that is supposed to be on the correct chain but is not (anymore).

For some reason I haven't quite worked out, I believe it's usually able to recover after some 1000+ looop iterations, so reproducibility depends on the threshold at which we consider a Lookup loop to be too many iterations for a plausibly valid Lookup.

Detecting and working around this case is relatively simple. We can (and must) keep going on the chain but ensure we don't save it as a safe entry to backtrack to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11948

Test Plan:
The problem could be reproduced in a few minutes with this (debug build):
```
$ while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=80000000 -threads=32 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done
```

At least with a lower threshold on suspiciously high number of iterations. I've lowered the thresholds quite a bit and no longer able to reproduce a failure.

Reviewed By: jowlyzhang

Differential Revision: D50236574

Pulled By: pdillinger

fbshipit-source-id: 2cb54a4e02bb51d5933eea41fcd489ab9d34aa96
2023-10-13 09:52:33 -07:00
anand76 90e160733e Fix runtime error in UpdateTieredCache due to integer underflow (#11949)
Summary:
With the introduction of the `UpdateTieredCache` API, its possible to dynamically change the compressed secondary cache ratio of the total cache capacity. In order to optimize performance, we avoid using a mutex when inserting/releasing placeholder entries, which can result in some inaccuracy in the accounting during the dynamic update. This inaccuracy was causing a runtime error due to an integer underflow in `UpdateCacheReservationRatio`, causing ubsan crash tests to fail. This PR fixes it by explicitly checking for the underflow.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11949

Test Plan:
1. Added a unit test that fails without the fix
2. Run ubsan_crash

Reviewed By: akankshamahajan15

Differential Revision: D50240217

Pulled By: anand1976

fbshipit-source-id: d2f7b79da54eec8b61aec2cc1f2943da5d5847ac
2023-10-12 15:09:40 -07:00
anand76 d367b34cc9 Fix TSAN crash test false positive (#11941)
Summary:
Fix the TSAN false positive caused by reading a bool flag without synchronization.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11941

Test Plan: Run tsan crash test locally

Reviewed By: akankshamahajan15

Differential Revision: D50181799

Pulled By: anand1976

fbshipit-source-id: 889e7237e9f3c9452a9df94a0d949db5fe13bb57
2023-10-11 13:28:10 -07:00
anand76 5b11f5a3a2 Add TieredCache and compressed cache capacity change to db_stress (#11935)
Summary:
Add `TieredCache` to the cache types tested by db_stress. Also add compressed secondary cache capacity change, and `WriteBufferManager` integration with `TieredCache` for memory charging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11935

Test Plan: Run whitebox/blackbox crash tests locally

Reviewed By: akankshamahajan15

Differential Revision: D50135365

Pulled By: anand1976

fbshipit-source-id: 7d73ed00c00a0953d86e49f35cce6bd550ba00f1
2023-10-10 13:12:18 -07:00
anand76 35a0250293 Don't call InsertSaved on compressed only secondary cache (#11889)
Summary:
In https://github.com/facebook/rocksdb/issues/11812, the ```CacheWithSecondaryAdapter::Insert``` calls ```InsertSaved``` on the secondary cache to warm it up with the compressed blocks. This should only be done if its a stacked cache with compressed and nvm cache. If its in-memory compressed only, then don't call ```InsertSaved```.

Tests:
Add a new unit test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11889

Reviewed By: akankshamahajan15

Differential Revision: D49615758

Pulled By: anand1976

fbshipit-source-id: 156ff968ad014ac319f8840da7a48193e4cebfa9
2023-09-27 12:08:08 -07:00
anand76 48589b961f Fix updating the capacity of a tiered cache (#11873)
Summary:
Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity.

With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.

Tests:
New unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873

Reviewed By: akankshamahajan15

Differential Revision: D49562250

Pulled By: anand1976

fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-22 18:07:46 -07:00
Peter Dillinger 77a1d6eafb Fix assertion failure in AutoHCC (#11877)
Summary:
Example crash seen in crash test:

```
db_stress: cache/clock_cache.cc:237: bool rocksdb::clock_cache::{anonymous}::BeginSlotInsert(const rocksdb::clock_cache::ClockHandleBasicData&, rocksdb::clock_cache::ClockHandle&, uint64_t, bool*): Assertion `*already_matches == false' failed.
```

I was intentionally ignoring `already_matches` without resetting it to false for the next call.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11877

Test Plan:
Reproducer no longer reproduces:

```
while ./cache_bench -cache_type=auto_hyper_clock_cache -threads=32 -populate_cache=0 -histograms=0 -report_problems -insert_percent=87 -lookup_insert_percent=2 -skew=10 -ops_per_thread=100 -cache_size=1000000; do echo hi; done
```

Reviewed By: cbi42

Differential Revision: D49562065

Pulled By: pdillinger

fbshipit-source-id: 941062e6eac7a4b56157925b1cf2a0b15ff9cc9d
2023-09-22 16:42:52 -07:00
Peter Dillinger f6cb763409 Fix major performance bug in AutoHCC growth phase (#11871)
Summary:
## The Problem
Mark Callaghan found a performance bug in yet-unreleased AutoHCC (which should have been found in my own testing). The observed behavior is very slow insertion performance as the table is growing into a very large structure. The root cause is the precarious combination of linear hashing (indexing into the table while allowing growth) and linear probing (for finding an empty slot to insert into). Naively combined, this is a disaster because in linear hashing, part of the table is twice as dense as first probing location as the rest. Thus, even a modest load factor like 0.6 could cause the dense part of the table to degrade to linear search. The code had a correction for this imbalance, which works in steady-state operation, but failed to account for the concentrating effect of table growth. Specifically, newly-added slots were underpopulated which allowed old slots to become over-populated and degrade to linear search, even in single-threaded operation. Here's an example:

```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=1 -populate_cache=0 -value_bytes=500 -cache_size=3000000000 -histograms=0 -report_problems -ops_per_thread=20000000 -resident_ratio=0.6
```
AutoHCC: Complete in 774.213 s; Rough parallel ops/sec = 25832
FixedHCC: Complete in 19.630 s; Rough parallel ops/sec = 1018840
LRUCache: Complete in 25.842 s; Rough parallel ops/sec = 773947

## The Fix
One small change is apparently sufficient to fix the problem, but I wanted to re-optimize the whole "finding a good empty slot" algorithm to improve safety margins for good performance and to improve typical case performance.

The small change is to track the newly-added slot from Grow in Insert, when applicable, and use that slot for insertion if (a) the home slot is already occupied, and (b) the newly-added slot is empty. This appears to sufficiently load new slots while avoiding over-population of either old or new slots. See `likely_empty_slot`.

However I've also made the logic much more resilient to parts of the table becoming over-populated. I tested a variant that used double hashing instead of linear probing and found that hurt steady-state average-case performance, presumably due to loss of locality in the chains. And even conventional double hashing might not be ideally robust against density skew in the table (still present because of home location bias), because double hashing might choose a small increment that could take a long time to iterate to the under-populated part of the table.

The compromise that seems to bring the best of each approach is this: do linear probing (+1 at a time) within a small bound (chosen bound of 4 based on performance testing) and then fall back on a double-hashing variant if no slot has been found. The double-hashing variant uses a probing increment that is always close to the golden ratio, relative to the table size, so that any under-populated regions of the table can be found relatively quickly, without introducing any additional skew. And the increment is varied slightly to avoid clustering effects that could happen with a fixed increment (regardless of how big it is).

And that leaves us with one remaining problem: the double hashing increment might not be relatively prime to the table size, so the probing sequence might be a cycle that does not cover the full set of slots. To solve this we can use a technique I developed many years ago (probably also developed by others) that simply adds one (in modular arithmetic) whenever we finish a (potentially incomplete) cycle. This is a simple and reasonably efficient way to iterate over all the slots without repetition, regardless of whether the increment is not relatively prime to the table size, or even zero.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11871

Test Plan:
existing correctness tests, especially ClockCacheTest.ClockTableFull

Intended follow-up: make ClockTableFull test more complete for AutoHCC

## Performance
Ignoring old AutoHCC performance, as we established above it could be terrible. FixedHCC and LRUCache are unaffected by this change. All tests below include this change.

### Getting up to size, single thread
(same cache_bench command as above, all three run at same time)

AutoHCC: Complete in 26.724 s; Rough parallel ops/sec = 748400
FixedHCC: Complete in 19.987 s; Rough parallel ops/sec = 1000631
LRUCache: Complete in 28.291 s; Rough parallel ops/sec = 706939

Single-threaded faster than LRUCache (often / sometimes) is good. FixedHCC has an obvious advantage because it starts at full size.

### Multiple threads, steady state, high hit rate ~95%
Using `-threads=10 -populate_cache=1 -ops_per_thread=10000000` and still `-resident_ratio=0.6`

AutoHCC: Complete in 48.778 s; Rough parallel ops/sec = 2050119
FixedHCC: Complete in 46.569 s; Rough parallel ops/sec = 2147329
LRUCache: Complete in 50.537 s; Rough parallel ops/sec = 1978735

### Multiple threads, steady state, low hit rate ~50%
Change to `-resident_ratio=0.2`

AutoHCC: Complete in 49.264 s; Rough parallel ops/sec = 2029884
FixedHCC: Complete in 49.750 s; Rough parallel ops/sec = 2010041
LRUCache: Complete in 53.002 s; Rough parallel ops/sec = 1886713

Don't expect AutoHCC to be consistently faster than FixedHCC, but they are at least similar in these benchmarks.

Reviewed By: jowlyzhang

Differential Revision: D49548534

Pulled By: pdillinger

fbshipit-source-id: 263e4f4d71d0e9a7d91db3795b48fad75408822b
2023-09-22 13:47:31 -07:00
anand76 269478ee46 Support compressed and local flash secondary cache stacking (#11812)
Summary:
This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache.

The basic design is as follows -
1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache.
2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block.
3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache.
4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded.

We can easily implement additional admission policies if desired.

Todo (In a subsequent PR):
1. Add to db_bench and run benchmarks
2. Add to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812

Reviewed By: pdillinger

Differential Revision: D49461842

Pulled By: anand1976

fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-21 20:30:53 -07:00
anand76 548aabfe5f Disable compressed secondary cache if capacity is 0 (#11863)
Summary:
This PR makes disabling the compressed secondary cache by setting capacity to 0 a bit more efficient. Previously, inserts/lookups would go to the backing LRUCache before getting rejected due to 0 capacity. With this change, insert/lookup would return from ```CompressedSecondaryCache``` itself.

Tests:
Existing tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11863

Reviewed By: akankshamahajan15

Differential Revision: D49476248

Pulled By: anand1976

fbshipit-source-id: f0f17a5e3df7d8bfc06709f8f23c1302056ba590
2023-09-20 22:30:17 -07:00
Peter Dillinger e67ee46642 Suppress TSAN reports on AutoHyperClockTable::Lookup (#11806)
Summary:
This function uses racing reads for heuristic performance improvement. My change in https://github.com/facebook/rocksdb/issues/11792 only worked for clang, not gcc, and gcc does not accurately handle TSAN suppressions. I would have to mark much more code as suppressed than I want to.

So I've taken a different approach: TSAN build does not use the racing reads but substitutes random results, as an extra test that a "correct" value is not needed for correct overall behavior.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11806

Test Plan: manual TSAN builds & tests with cache_bench

Reviewed By: ajkr

Differential Revision: D49100115

Pulled By: pdillinger

fbshipit-source-id: d6d0dfb796d710b953212dd3fc171b6e88fadea1
2023-09-08 10:50:47 -07:00
Peter Dillinger d01b1215bd Fix TSAN reports on AutoHCC (#11792)
Summary:
Forgot to run TSAN test on latest revision of https://github.com/facebook/rocksdb/issues/11738

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11792

Test Plan: Use cache_bench to reproduce TSAN errors and observe fix

Reviewed By: ajkr

Differential Revision: D48953196

Pulled By: pdillinger

fbshipit-source-id: 9e358b4768d8ddde86f84b451863263f661d7b80
2023-09-04 10:59:16 -07:00
Peter Dillinger fe3405e80f Automatic table sizing for HyperClockCache (AutoHCC) (#11738)
Summary:
This change add an experimental next-generation HyperClockCache (HCC) with automatic sizing of the underlying hash table. Both the existing version (stable) and the new version (experimental for now) of HCC are available depending on whether an estimated average entry charge is provided in HyperClockCacheOptions.

Internally, we call the two implementations AutoHyperClockCache (new) and FixedHyperClockCache (existing). The performance characteristics and much of the underlying logic are similar enough that AutoHCC is likely to make FixedHCC obsolete, and so it's best considered an evolution of the same technology or solution rather than an alternative. More specifically, both implementations share essentially the same logic for managing the state of individual entries in the cache, including metadata for reference counting and counting clocks for eviction. This metadata, which I like to call the "low-level HCC protocol," includes a read-write lock on entries, but relaxed consistency requirements on the cache (e.g. allowing rare duplication) means high-level cache operations never wait for these low-level per-entry locks. FixedHCC is fully wait-free.

AutoHCC is different in how entries are indexed into an efficient hash table. AutoHCC is "essentially wait-free" as there is no pattern of typical high-level operations on a large cache that can lead to one thread waiting on another to complete some work, though it can happen in some unusual/unlucky cases, or atypical uses such as erasing specific cache keys. Table growth and entry reclamation is more complex in AutoHCC compared to FixedHCC, so uses some localized locking to manage that. AutoHCC uses linear hashing to grow the table as needed, with low latency and to a precise size. AutoHCC depends on anonymous mmap support from the OS (currently verified working on Linux, MacOS, and Windows) to allow the array underlying a hash table to grow in place without wasting resident memory on space reserved but unused. AutoHCC uses a form of chaining while FixedHCC uses open addressing and double hashing.

More specifics:
* In developing this PR, a rare availability bug (minor) was noticed in the existing HCC implementation of Release()+erase_if_last_ref, which is now inherited into AutoHCC. Fixing this without a performance regression will not be simple, so is left for follow-up work.
* Some existing unit tests required adjustment of operational parameters or conditions to work with the new behaviors of AutoHCC. A number of bugs were found and fixed in the validation process, including getting unit tests in good working order.
* Added an option to cache_bench, `-degenerate_hash_bits` for correctness stress testing described below. For this, the tool uses the reverse-engineered hash function for HCC to generate keys in which the specified number of hash bits, in critical positions, have a fixed value. Essentially each degenerate hash bit will half the number of chain heads utilized and double the average chain length.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11738

Test Plan:
unit tests updated, and already added to db crash test. Also

## Correctness
The code includes generous assertions to check for unexpected states, especially at destruction time, so should be able to detect critical concurrency bugs. Less serious "availability bugs" in which cache data is hidden or cleanly lost are more difficult to detect, but also less scary for data correctness (as long as performance is good and the design is sound).

In average operation, the structure is extremely low stress and low contention (see next section) so stressing the corner case logic requires artificially stressing the operating conditions. First, we keep the structure small to increase the number of threads hitting the same chain or entry, and just one cache shard. Second, we artificially degrade the hashing so that chains are much longer than typical, using the new `-degenerate_hash_bits` option to cache_bench. Third, we re-create the structure from scratch frequently in order to exercise the Grow logic repeatedly and to get the benefit of the consistency checks in the structure's destructor in debug builds. For cache_bench this also means disabling the single-threaded "populate cache" step (normally used for steady state performance testing). And of course use many more threads than cores to have many preemptions.

An effective test for working out bugs was this (using debug build of course):
```
while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=8000000 -threads=100 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done
```

Or even smaller cases. This setup has around 27 utilized chains, with around 35 entries each, and yield-waits more than 1 million times per second (very high contention; see next section). I have let this run for hours searching for any lingering issues.

I've also run cache_bench under ASAN, UBSAN, and TSAN.

## Essentially wait free
There is a counter for number of yield() calls when one thread is waiting on another. When we pre-populate the structure in a single thread,
```
./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -populate_cache=1 -ops_per_thread=200000 2>&1 | grep Yield
```
We see something on the order of 1 yield call per second across 16 threads, even when we load the system other other jobs (parallel compilation). With -populate_cache=0, there are more yield opportunities with parallel table growth. On an otherwise unloaded system, we still see very small (single digit) yield counts, with a chance of getting into the thousands, and getting into 10s of thousands per second during table growth phase if the system is loaded with other jobs. However, I am not worried about this if performance is still good (see next section).

## Overall performance
Although cache_bench initially suggested performance very close to FixedHCC, there was a very noticeable performance hit under a db_bench setup like used in validating https://github.com/facebook/rocksdb/issues/10626. Much of the difference has been reduced by optimizing Lookup with a "naive" pass that will almost always find entries quickly, and only falling back to the careful Lookup algorithm when not found in the first pass.

Setups (chosen to be sensitive to block cache performance), and compiled with USE_CLANG=1 JEMALLOC=1 PORTABLE=0 DEBUG_LEVEL=0:
```
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```

### No regression on FixedHCC
Running before & after builds at the same time on a 48 core machine.
```
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=readrandom[-X10],block_cache_entry_stats,cache_report_problems -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=24 -cache_type=fixed_hyper_clock_cache -seed=1234
```

Before:
readrandom [AVG    10 runs] : 847234 (± 8150) ops/sec;   59.2 (± 0.6) MB/sec
703MB max RSS

After:
readrandom [AVG    10 runs] : 851021 (± 7929) ops/sec;   59.5 (± 0.6) MB/sec
706MB max RSS

Probably no material difference.

### Single-threaded performance
Using `[-X2]` and `-threads=1` and `-duration=30`, running all three at the same time:

lru_cache: 55100 ops/sec, then 55862 ops/sec  (627MB max RSS)
fixed_hyper_clock_cache: 60496 ops/sec, then 61231 ops/sec (626MB max RSS)
auto_hyper_clock_cache: 47560 ops/sec, then 56081 ops/sec (626MB max RSS)

So AutoHCC has more ramp-up cost in the first pass as the cache grows to the appropriate size. (In single-threaded operation, the parallelizability and per-op low latency of table growth is overall slower.) However, once up to size, its performance is comparable to LRUCache. FixedHCC's lean operations still win overall when a good estimate is available.

If we look at HCC table stats, we can see that this configuration is not favorable to AutoHCC (and I have verified that other memory sizes do not yield substantially different results, until shards are under-sized for the full filters):

FixedHCC:
Slot occupancy stats: Overall 47% (124991/262144), Min/Max/Window = 28%/64%/500, MaxRun{Pos/Neg} = 17/22

AutoHCC:
Slot occupancy stats: Overall 59% (125781/209682), Min/Max/Window = 43%/82%/500, MaxRun{Pos/Neg} = 76/16
Head occupancy stats: Overall 43% (92259/209682), Min/Max/Window = 24%/74%/500, MaxRun{Pos/Neg} = 19/26
Entries at home count: 53350

FixedHCC configuration is relatively good for speed, and not ideal for space utilization. As is typical, AutoHCC has tighter control on metadata usage (209682 x 64 bytes rather than 262144 x 64 bytes), and the higher load factor is slightly worse for speed. LRUCache also has more metadata usage, at 199680 x 96 bytes of tracked metadata (plus roughly another 10% of that untracked in the head pointers), and that metadata is subject to fragmentation.

### Parallel performance, high hit rate
Now using `[-X10]` and `-threads=10`, all three at the same time

lru_cache: [AVG    10 runs] : 263629 (± 1425) ops/sec;   18.4 (± 0.1) MB/sec
655MB max RSS, 97.1% cache hit rate
fixed_hyper_clock_cache: [AVG    10 runs] : 479590 (± 8114) ops/sec;   33.5 (± 0.6) MB/sec
651MB max RSS, 97.1% cache hit rate
auto_hyper_clock_cache: [AVG    10 runs] : 418687 (± 5915) ops/sec;   29.3 (± 0.4) MB/sec
657MB max RSS, 97.1% cache hit rate

Even with just 10-way parallelism for each cache (though 30+/48 cores busy overall), LRUCache is already showing performance degradation, while AutoHCC is in the neighborhood of FixedHCC. And that brings us to the question of how AutoHCC holds up under extreme parallelism, so now independent runs with `-threads=100` (overloading 48 cores).

lru_cache: 438613 ops/sec, 827MB max RSS
fixed_hyper_clock_cache: 1651310 ops/sec, 812MB max RSS
auto_hyper_clock_cache: 1505875 ops/sec, 821MB max RSS (Yield count: 1089 over 30s)

Clearly, AutoHCC holds up extremely well under extreme parallelism, even closing some of the modest performance gap with  FixedHCC.

### Parallel performance, low hit rate
To get down to roughly 50% cache hit rate, we use `-cache_index_and_filter_blocks=0 -cache_size=1650000000` with `-threads=10`. Here the extra cost of running counting clock eviction, especially on the chains of AutoHCC, are evident, especially with the lower contention of cache_index_and_filter_blocks=0:

lru_cache: 725231 ops/sec, 1770MB max RSS, 51.3% hit rate
fixed_hyper_clock_cache: 638620 ops/sec, 1765MB max RSS, 50.2% hit rate
auto_hyper_clock_cache: 541018 ops/sec, 1777MB max RSS, 50.8% hit rate

Reviewed By: jowlyzhang

Differential Revision: D48784755

Pulled By: pdillinger

fbshipit-source-id: e79813dc087474ac427637dd282a14fa3011a6e4
2023-09-01 15:44:38 -07:00
Peter Dillinger d3420464c3 cache_bench enhancements for jemalloc etc. (#11758)
Summary:
* Add some options to cache_bench to use JemallocNodumpAllocator
* Make num_shard_bits option use and report cache-specific defaults
* Add a usleep option to sleep between operations, for simulating a workload with more CPU idle/wait time.
* Use const& for JemallocAllocatorOptions, to improve API usability (e.g. can bind to temporary `{}`)
* InstallStackTraceHandler()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11758

Test Plan: manual

Reviewed By: jowlyzhang

Differential Revision: D48668479

Pulled By: pdillinger

fbshipit-source-id: b6032fbe09444cdb8f1443a5e017d2eea4f6205a
2023-08-24 19:14:38 -07:00
Changyu Bi c2aad555c3 Add CompressionOptions::checksum for enabling ZSTD checksum (#11666)
Summary:
Optionally enable zstd checksum flag (d857369028/lib/zstd.h (L428)) to detect corruption during decompression. Main changes are in compression.h:
* User can set CompressionOptions::checksum to true to enable this feature.
* We enable this feature in ZSTD by setting the checksum flag in ZSTD compression context: `ZSTD_CCtx`.
* Uses `ZSTD_compress2()` to do compression since it supports frame parameter like the checksum flag. Compression level is also set in compression context as a flag.
* Error handling during decompression to propagate error message from ZSTD.
* Updated microbench to test read performance impact.

About compatibility, the current compression decoders should continue to work with the data created by the new compression API `ZSTD_compress2()`: https://github.com/facebook/zstd/issues/3711.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11666

Test Plan:
* Existing unit tests for zstd compression
* Add unit test `DBTest2.ZSTDChecksum` to test the corruption case
* Manually tested that compression levels, parallel compression, dictionary compression, index compression all work with the new ZSTD_compress2() API.
* Manually tested with `sst_dump --command=recompress` that different compression levels and dictionary compression settings all work.
* Manually tested compiling with older versions of ZSTD: v1.3.8, v1.1.0, v0.6.2.
* Perf impact: from public benchmark data: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for checksum and https://github.com/facebook/zstd#benchmarks, if decompression is 1700MB/s and checksum computation is 70000MB/s, checksum computation is an additional ~2.4% time for decompression. Compression is slower and checksumming should be less noticeable.
* Microbench:
```
TEST_TMPDIR=/dev/shm ./branch_db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:1048576/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:0/compression_type:7/compression_checksum:1/no_blockcache:1/iterations:10000/threads:1 --benchmark_repetitions=100

Min out of 100 runs:
Main:
10390 10436 10456 10484 10499 10535 10544 10545 10565 10568

After this PR, checksum=false
10285 10397 10503 10508 10515 10557 10562 10635 10640 10660

After this PR, checksum=true
10827 10876 10925 10949 10971 11052 11061 11063 11100 11109
```
* db_bench:
```
Write perf
TEST_TMPDIR=/dev/shm/ ./db_bench_ichecksum --benchmarks=fillseq[-X10] --compression_type=zstd --num=10000000 --compression_checksum=..

[FillSeq checksum=0]
fillseq [AVG    10 runs] : 281635 (± 31711) ops/sec;   31.2 (± 3.5) MB/sec
fillseq [MEDIAN 10 runs] : 294027 ops/sec;   32.5 MB/sec

[FillSeq checksum=1]
fillseq [AVG    10 runs] : 286961 (± 34700) ops/sec;   31.7 (± 3.8) MB/sec
fillseq [MEDIAN 10 runs] : 283278 ops/sec;   31.3 MB/sec

Read perf
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=readrandom[-X20] --num=100000000 --reads=1000000 --use_existing_db=true --readonly=1

[Readrandom checksum=1]
readrandom [AVG    20 runs] : 360928 (± 3579) ops/sec;    4.0 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 362468 ops/sec;    4.0 MB/sec

[Readrandom checksum=0]
readrandom [AVG    20 runs] : 380365 (± 2384) ops/sec;    4.2 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 379800 ops/sec;    4.2 MB/sec

Compression
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=compress[-X20] --compression_type=zstd --num=100000000 --compression_checksum=1

checksum=1
compress [AVG    20 runs] : 54074 (± 634) ops/sec;  211.2 (± 2.5) MB/sec
compress [MEDIAN 20 runs] : 54396 ops/sec;  212.5 MB/sec

checksum=0
compress [AVG    20 runs] : 54598 (± 393) ops/sec;  213.3 (± 1.5) MB/sec
compress [MEDIAN 20 runs] : 54592 ops/sec;  213.3 MB/sec

Decompression:
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=uncompress[-X20] --compression_type=zstd --compression_checksum=1

checksum = 0
uncompress [AVG    20 runs] : 167499 (± 962) ops/sec;  654.3 (± 3.8) MB/sec
uncompress [MEDIAN 20 runs] : 167210 ops/sec;  653.2 MB/sec
checksum = 1
uncompress [AVG    20 runs] : 167980 (± 924) ops/sec;  656.2 (± 3.6) MB/sec
uncompress [MEDIAN 20 runs] : 168465 ops/sec;  658.1 MB/sec
```

Reviewed By: ajkr

Differential Revision: D48019378

Pulled By: cbi42

fbshipit-source-id: 674120c6e1853c2ced1436ac8138559d0204feba
2023-08-18 15:01:59 -07:00
anand76 a1743e85be Implement a allow cache hits admission policy for the compressed secondary cache (#11713)
Summary:
This PR implements a new admission policy for the compressed secondary cache, which includes the functionality of the existing policy, and also admits items evicted from the primary block cache with the hit bit set. Effectively, the new policy works as follows -
1. When an item is demoted from the primary cache without a hit, a placeholder is inserted in the compressed cache. A second demotion will insert the full entry.
2. When an item is promoted from the compressed cache to the primary cache for the first time, a placeholder is inserted in the primary. The second promotion inserts the full entry, while erasing it form the compressed cache.
3. If an item is demoted from the primary cache with the hit bit set, it is immediately inserted in the compressed secondary cache.
The ```TieredVolatileCacheOptions``` has been updated with a new option, ```adm_policy```, which allows the policy to be selected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11713

Reviewed By: pdillinger

Differential Revision: D48444512

Pulled By: anand1976

fbshipit-source-id: b4cbf8c169a88097dff08e36e8bc4b3088de1492
2023-08-18 11:19:48 -07:00
Peter Dillinger ef6f025563 Placeholder for AutoHyperClockCache, more (#11692)
Summary:
* The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation.
* Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache).
* Simplify some existing tests not to depend directly on cache type.
* Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation.
* Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test.
* Add another templated class BaseHyperClockCache to reduce future copy-paste
* Added ReportProblems support to cache_bench
* Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC:
```
2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17
```

In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692

Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in.

Reviewed By: jowlyzhang

Differential Revision: D48247413

Pulled By: pdillinger

fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da
2023-08-11 16:27:38 -07:00
Peter Dillinger 99daea3481 Prepare tests for new HCC naming (#11676)
Summary:
I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache.

This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676

Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache

Reviewed By: ajkr

Differential Revision: D48103165

Pulled By: pdillinger

fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736
2023-08-07 18:17:12 -07:00
Peter Dillinger cdb11f5ce6 More minor HCC refactoring + typed mmap (#11670)
Summary:
More code leading up to dynamic HCC.
* Small enhancements to cache_bench
* Extra assertion in Unref
* Improve a CAS loop in ChargeUsageMaybeEvictStrict
* Put load factor constants in appropriate class
* Move `standalone` field to HyperClockTable::HandleImpl because it can be encoded differently in the upcoming dynamic HCC.
* Add a typed version of MemMapping to simplify some future code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11670

Test Plan: existing tests, unit test added for TypedMemMapping

Reviewed By: jowlyzhang

Differential Revision: D48056464

Pulled By: pdillinger

fbshipit-source-id: 186b7d3105c5d6d2eb6a592369bc10a97ee14a15
2023-08-07 12:20:23 -07:00
Peter Dillinger f9de217353 Some cache_bench enhancements (#11661)
Summary:
... used in validating some HyperClockCache development in progress.

* Revamp the "populate cache" step to avoid redundant insertions (very rare in practice) and more consistently approach the desired resident_ratio while maintaining appropriate skew (still not perfect).
* Track and print hit ratio on lookups, to ensure a fair comparison is happening between implementations etc.
* Add an option to disable tracking and printing histograms (lots of output)
* Add an option to specify a random seed (for more reproducibility)
* Remove confusing/redundant "-skewed" option

Uses BitwiseAnd from https://github.com/facebook/rocksdb/issues/11660 (tested there)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11661

Test Plan: manual

Reviewed By: akankshamahajan15, jowlyzhang

Differential Revision: D47937671

Pulled By: pdillinger

fbshipit-source-id: 85a2bb881b1bca4f63e015bac684105fd91c9f35
2023-08-02 13:19:20 -07:00
Peter Dillinger f4e4039f00 Add some more bit operations to internal APIs (#11660)
Summary:
BottomNBits() - there is a single fast instruction for this on x86 since BMI2, but testing with godbolt indicates you need at least GCC 10 for the compiler to choose that instruction from the obvious C++ code. https://godbolt.org/z/5a7Ysd41h

BitwiseAnd() - this is a convenience function that works around the language flaw that the type of the result of x & y is the larger of the two input types, when it should be the smaller. This can save some ugly static_cast.

I expect to use both of these in coming HyperClockCache developments, and have applied them in a couple of places in existing code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11660

Test Plan: unit tests added

Reviewed By: jowlyzhang

Differential Revision: D47935531

Pulled By: pdillinger

fbshipit-source-id: d148c43a1e51df4a1c549b93aaf2725a3f8d3bd6
2023-08-02 11:30:10 -07:00
Peter Dillinger c41122b1a0 Even more HyperClockCache refactoring (#11630)
Summary:
... ahead of dynamic variant.

* Introduce an Unref function for a common pattern. Cases that were previously using std::memory_order_acq_rel we doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value so do not need the acquire semantic, just release.
* Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function.
* Based on a linter suggesting, use const Func& parameter type instead of Func for templated callable parameters.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11630

Test Plan: existing tests, and performance test with release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster.

Reviewed By: jowlyzhang

Differential Revision: D47657334

Pulled By: pdillinger

fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3
2023-07-24 09:36:09 -07:00
Andrew Kryczka 05c3b8ecac Prepare for specialized interface for row cache (#11620)
Summary:
An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema.

This PR prepares for the specialized `RowCache` interface by updating the migration plan of https://github.com/facebook/rocksdb/issues/11450. I added a release note for the removed APIs and didn't mention the added ones for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11620

Reviewed By: pdillinger

Differential Revision: D47536962

Pulled By: ajkr

fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691
2023-07-18 19:12:58 -07:00
Peter Dillinger 846db9d7b1 Refactor ClockCache ApplyToEntries (#11609)
Summary:
... ahead of planned dynamic HCC variant. This changes
simplifies some logic while still enabling future code sharing between
implementation variants.

Detail: For complicated reasons, using a std::function parameter to
`ConstApplyToEntriesRange` with a lambda argument does not play
nice with templated HandleImpl. An explicit conversion to std::function
would be needed for it to compile. Templating the function type is the
easy work-around.

Also made some functions from https://github.com/facebook/rocksdb/issues/11572 private as recommended

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11609

Test Plan: existing tests

Reviewed By: jowlyzhang

Differential Revision: D47407415

Pulled By: pdillinger

fbshipit-source-id: 0f65954db16335999b78fb7d2563ec627624cef0
2023-07-18 12:09:27 -07:00