rocksdb/cache/secondary_cache_adapter.cc

700 lines
28 KiB
C++
Raw Normal View History

HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
// Copyright (c) Meta Platforms, Inc. and affiliates.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
#include "cache/secondary_cache_adapter.h"
#include <atomic>
Support compressed and local flash secondary cache stacking (#11812) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-22 03:30:53 +00:00
#include "cache/tiered_secondary_cache.h"
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
#include "monitoring/perf_context_imp.h"
#include "test_util/sync_point.h"
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
#include "util/cast_util.h"
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
namespace ROCKSDB_NAMESPACE {
namespace {
// A distinct pointer value for marking "dummy" cache entries
struct Dummy {
char val[7] = "kDummy";
};
const Dummy kDummy{};
Cache::ObjectPtr const kDummyObj = const_cast<Dummy*>(&kDummy);
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
const char* kTieredCacheName = "TieredCache";
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
} // namespace
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
// When CacheWithSecondaryAdapter is constructed with the distribute_cache_res
// parameter set to true, it manages the entire memory budget across the
// primary and secondary cache. The secondary cache is assumed to be in
// memory, such as the CompressedSecondaryCache. When a placeholder entry
// is inserted by a CacheReservationManager instance to reserve memory,
// the CacheWithSecondaryAdapter ensures that the reservation is distributed
// proportionally across the primary/secondary caches.
//
// The primary block cache is initially sized to the sum of the primary cache
// budget + teh secondary cache budget, as follows -
// |--------- Primary Cache Configured Capacity -----------|
// |---Secondary Cache Budget----|----Primary Cache Budget-----|
//
// A ConcurrentCacheReservationManager member in the CacheWithSecondaryAdapter,
// pri_cache_res_,
// is used to help with tracking the distribution of memory reservations.
// Initially, it accounts for the entire secondary cache budget as a
// reservation against the primary cache. This shrinks the usable capacity of
// the primary cache to the budget that the user originally desired.
//
// |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---|
//
// When a reservation placeholder is inserted into the adapter, it is inserted
// directly into the primary cache. This means the entire charge of the
// placeholder is counted against the primary cache. To compensate and count
// a portion of it against the secondary cache, the secondary cache Deflate()
// method is called to shrink it. Since the Deflate() causes the secondary
// actual usage to shrink, it is refelcted here by releasing an equal amount
// from the pri_cache_res_ reservation. The Deflate() in the secondary cache
// can be, but is not required to be, implemented using its own cache
// reservation manager.
//
// For example, if the pri/sec ratio is 70/30, and the combined capacity is
// 100MB, the intermediate and final state after inserting a reservation
// placeholder for 10MB would be as follows -
//
// |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|---R---|
// 1. After inserting the placeholder in primary
// |------- 30MB -------------|------- 60MB -------------|-10MB--|
// 2. After deflating the secondary and adjusting the reservation for
// secondary against the primary
// |------- 27MB -------------|------- 63MB -------------|-10MB--|
//
// Likewise, when the user inserted placeholder is released, the secondary
// cache Inflate() method is called to grow it, and the pri_cache_res_
// reservation is increased by an equal amount.
//
// Another way of implementing this would have been to simply split the user
// reservation into primary and seconary components. However, this would
// require allocating a structure to track the associated secondary cache
// reservation, which adds some complexity and overhead.
//
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
CacheWithSecondaryAdapter::CacheWithSecondaryAdapter(
std::shared_ptr<Cache> target,
std::shared_ptr<SecondaryCache> secondary_cache,
TieredAdmissionPolicy adm_policy, bool distribute_cache_res)
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
: CacheWrapper(std::move(target)),
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
secondary_cache_(std::move(secondary_cache)),
adm_policy_(adm_policy),
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
distribute_cache_res_(distribute_cache_res) {
target_->SetEvictionCallback(
[this](const Slice& key, Handle* handle, bool was_hit) {
return EvictionHandler(key, handle, was_hit);
});
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
if (distribute_cache_res_) {
size_t sec_capacity = 0;
pri_cache_res_ = std::make_shared<ConcurrentCacheReservationManager>(
std::make_shared<CacheReservationManagerImpl<CacheEntryRole::kMisc>>(
target_));
Status s = secondary_cache_->GetCapacity(sec_capacity);
assert(s.ok());
// Initially, the primary cache is sized to uncompressed cache budget plsu
// compressed secondary cache budget. The secondary cache budget is then
// taken away from the primary cache through cache reservations. Later,
// when a placeholder entry is inserted by the caller, its inserted
// into the primary cache and the portion that should be assigned to the
// secondary cache is freed from the reservation.
s = pri_cache_res_->UpdateCacheReservation(sec_capacity);
assert(s.ok());
sec_cache_res_ratio_.store((double)sec_capacity / target_->GetCapacity(),
std::memory_order_relaxed);
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
}
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
}
CacheWithSecondaryAdapter::~CacheWithSecondaryAdapter() {
// `*this` will be destroyed before `*target_`, so we have to prevent
// use after free
target_->SetEvictionCallback({});
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
#ifndef NDEBUG
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
if (distribute_cache_res_ && !ratio_changed_) {
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
size_t sec_capacity = 0;
Status s = secondary_cache_->GetCapacity(sec_capacity);
assert(s.ok());
Support compressed and local flash secondary cache stacking (#11812) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-22 03:30:53 +00:00
assert(pri_cache_res_->GetTotalMemoryUsed() == sec_capacity);
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
}
#endif // NDEBUG
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
}
bool CacheWithSecondaryAdapter::EvictionHandler(const Slice& key,
Handle* handle, bool was_hit) {
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
auto helper = GetCacheItemHelper(handle);
Support compressed and local flash secondary cache stacking (#11812) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-22 03:30:53 +00:00
if (helper->IsSecondaryCacheCompatible() &&
adm_policy_ != TieredAdmissionPolicy::kAdmPolicyThreeQueue) {
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
auto obj = target_->Value(handle);
// Ignore dummy entry
if (obj != kDummyObj) {
bool hit = false;
if (adm_policy_ == TieredAdmissionPolicy::kAdmPolicyAllowCacheHits) {
hit = was_hit;
}
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
// Spill into secondary cache.
secondary_cache_->Insert(key, obj, helper, hit).PermitUncheckedError();
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
}
}
// Never takes ownership of obj
return false;
}
bool CacheWithSecondaryAdapter::ProcessDummyResult(Cache::Handle** handle,
bool erase) {
if (*handle && target_->Value(*handle) == kDummyObj) {
target_->Release(*handle, erase);
*handle = nullptr;
return true;
} else {
return false;
}
}
void CacheWithSecondaryAdapter::CleanupCacheObject(
ObjectPtr obj, const CacheItemHelper* helper) {
if (helper->del_cb) {
helper->del_cb(obj, memory_allocator());
}
}
Cache::Handle* CacheWithSecondaryAdapter::Promote(
std::unique_ptr<SecondaryCacheResultHandle>&& secondary_handle,
const Slice& key, const CacheItemHelper* helper, Priority priority,
Statistics* stats, bool found_dummy_entry, bool kept_in_sec_cache) {
assert(secondary_handle->IsReady());
ObjectPtr obj = secondary_handle->Value();
if (!obj) {
// Nothing found.
return nullptr;
}
// Found something.
switch (helper->role) {
case CacheEntryRole::kFilterBlock:
RecordTick(stats, SECONDARY_CACHE_FILTER_HITS);
break;
case CacheEntryRole::kIndexBlock:
RecordTick(stats, SECONDARY_CACHE_INDEX_HITS);
break;
case CacheEntryRole::kDataBlock:
RecordTick(stats, SECONDARY_CACHE_DATA_HITS);
break;
default:
break;
}
PERF_COUNTER_ADD(secondary_cache_hit_count, 1);
RecordTick(stats, SECONDARY_CACHE_HITS);
// Note: SecondaryCache::Size() is really charge (from the CreateCallback)
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
size_t charge = secondary_handle->Size();
Handle* result = nullptr;
// Insert into primary cache, possibly as a standalone+dummy entries.
if (secondary_cache_->SupportForceErase() && !found_dummy_entry) {
// Create standalone and insert dummy
// Allow standalone to be created even if cache is full, to avoid
// reading the entry from storage.
result =
CreateStandalone(key, obj, helper, charge, /*allow_uncharged*/ true);
assert(result);
PERF_COUNTER_ADD(block_cache_standalone_handle_count, 1);
// Insert dummy to record recent use
// TODO: try to avoid case where inserting this dummy could overwrite a
// regular entry
Status s = Insert(key, kDummyObj, &kNoopCacheItemHelper, /*charge=*/0,
/*handle=*/nullptr, priority);
s.PermitUncheckedError();
// Nothing to do or clean up on dummy insertion failure
} else {
// Insert regular entry into primary cache.
// Don't allow it to spill into secondary cache again if it was kept there.
Status s = Insert(
key, obj, kept_in_sec_cache ? helper->without_secondary_compat : helper,
charge, &result, priority);
if (s.ok()) {
assert(result);
PERF_COUNTER_ADD(block_cache_real_handle_count, 1);
} else {
// Create standalone result instead, even if cache is full, to avoid
// reading the entry from storage.
result =
CreateStandalone(key, obj, helper, charge, /*allow_uncharged*/ true);
assert(result);
PERF_COUNTER_ADD(block_cache_standalone_handle_count, 1);
}
}
return result;
}
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
Status CacheWithSecondaryAdapter::Insert(const Slice& key, ObjectPtr value,
const CacheItemHelper* helper,
size_t charge, Handle** handle,
Support compressed and local flash secondary cache stacking (#11812) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-22 03:30:53 +00:00
Priority priority,
const Slice& compressed_value,
CompressionType type) {
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
Status s = target_->Insert(key, value, helper, charge, handle, priority);
if (s.ok() && value == nullptr && distribute_cache_res_) {
size_t sec_charge = static_cast<size_t>(
charge * (sec_cache_res_ratio_.load(std::memory_order_relaxed)));
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
s = secondary_cache_->Deflate(sec_charge);
assert(s.ok());
s = pri_cache_res_->UpdateCacheReservation(sec_charge, /*increase=*/false);
assert(s.ok());
}
Support compressed and local flash secondary cache stacking (#11812) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-22 03:30:53 +00:00
// Warm up the secondary cache with the compressed block. The secondary
// cache may choose to ignore it based on the admission policy.
if (value != nullptr && !compressed_value.empty() &&
adm_policy_ == TieredAdmissionPolicy::kAdmPolicyThreeQueue) {
Support compressed and local flash secondary cache stacking (#11812) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-22 03:30:53 +00:00
Status status = secondary_cache_->InsertSaved(key, compressed_value, type);
assert(status.ok() || status.IsNotSupported());
Support compressed and local flash secondary cache stacking (#11812) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-22 03:30:53 +00:00
}
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
return s;
}
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
Cache::Handle* CacheWithSecondaryAdapter::Lookup(const Slice& key,
const CacheItemHelper* helper,
CreateContext* create_context,
Priority priority,
Statistics* stats) {
// NOTE: we could just StartAsyncLookup() and Wait(), but this should be a bit
// more efficient
Handle* result =
target_->Lookup(key, helper, create_context, priority, stats);
bool secondary_compatible = helper && helper->IsSecondaryCacheCompatible();
bool found_dummy_entry =
ProcessDummyResult(&result, /*erase=*/secondary_compatible);
if (!result && secondary_compatible) {
// Try our secondary cache
bool kept_in_sec_cache = false;
std::unique_ptr<SecondaryCacheResultHandle> secondary_handle =
secondary_cache_->Lookup(key, helper, create_context, /*wait*/ true,
found_dummy_entry, /*out*/ kept_in_sec_cache);
if (secondary_handle) {
result = Promote(std::move(secondary_handle), key, helper, priority,
stats, found_dummy_entry, kept_in_sec_cache);
}
}
return result;
}
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
bool CacheWithSecondaryAdapter::Release(Handle* handle,
bool erase_if_last_ref) {
if (erase_if_last_ref) {
ObjectPtr v = target_->Value(handle);
if (v == nullptr && distribute_cache_res_) {
size_t charge = target_->GetCharge(handle);
size_t sec_charge = static_cast<size_t>(
charge * (sec_cache_res_ratio_.load(std::memory_order_relaxed)));
TEST_SYNC_POINT("CacheWithSecondaryAdapter::Release:ChargeSecCache1");
TEST_SYNC_POINT("CacheWithSecondaryAdapter::Release:ChargeSecCache2");
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
Status s = secondary_cache_->Inflate(sec_charge);
assert(s.ok());
s = pri_cache_res_->UpdateCacheReservation(sec_charge, /*increase=*/true);
assert(s.ok());
}
}
return target_->Release(handle, erase_if_last_ref);
}
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
Cache::ObjectPtr CacheWithSecondaryAdapter::Value(Handle* handle) {
ObjectPtr v = target_->Value(handle);
// TODO with stacked secondaries: might fail in EvictionHandler
assert(v != kDummyObj);
return v;
}
void CacheWithSecondaryAdapter::StartAsyncLookupOnMySecondary(
AsyncLookupHandle& async_handle) {
assert(!async_handle.IsPending());
assert(async_handle.result_handle == nullptr);
std::unique_ptr<SecondaryCacheResultHandle> secondary_handle =
secondary_cache_->Lookup(async_handle.key, async_handle.helper,
async_handle.create_context, /*wait*/ false,
async_handle.found_dummy_entry,
/*out*/ async_handle.kept_in_sec_cache);
if (secondary_handle) {
// TODO with stacked secondaries: Check & process if already ready?
async_handle.pending_handle = secondary_handle.release();
async_handle.pending_cache = secondary_cache_.get();
}
}
void CacheWithSecondaryAdapter::StartAsyncLookup(
AsyncLookupHandle& async_handle) {
target_->StartAsyncLookup(async_handle);
if (!async_handle.IsPending()) {
bool secondary_compatible =
async_handle.helper &&
async_handle.helper->IsSecondaryCacheCompatible();
async_handle.found_dummy_entry |= ProcessDummyResult(
&async_handle.result_handle, /*erase=*/secondary_compatible);
if (async_handle.Result() == nullptr && secondary_compatible) {
// Not found and not pending on another secondary cache
StartAsyncLookupOnMySecondary(async_handle);
}
}
}
void CacheWithSecondaryAdapter::WaitAll(AsyncLookupHandle* async_handles,
size_t count) {
if (count == 0) {
// Nothing to do
return;
}
// Requests that are pending on *my* secondary cache, at the start of this
// function
std::vector<AsyncLookupHandle*> my_pending;
// Requests that are pending on an "inner" secondary cache (managed somewhere
// under target_), as of the start of this function
std::vector<AsyncLookupHandle*> inner_pending;
// Initial accounting of pending handles, excluding those already handled
// by "outer" secondary caches. (See cur->pending_cache = nullptr.)
for (size_t i = 0; i < count; ++i) {
AsyncLookupHandle* cur = async_handles + i;
if (cur->pending_cache) {
assert(cur->IsPending());
assert(cur->helper);
assert(cur->helper->IsSecondaryCacheCompatible());
if (cur->pending_cache == secondary_cache_.get()) {
my_pending.push_back(cur);
// Mark as "to be handled by this caller"
cur->pending_cache = nullptr;
} else {
// Remember as potentially needing a lookup in my secondary
inner_pending.push_back(cur);
}
}
}
// Wait on inner-most cache lookups first
// TODO with stacked secondaries: because we are not using proper
// async/await constructs here yet, there is a false synchronization point
// here where all the results at one level are needed before initiating
// any lookups at the next level. Probably not a big deal, but worth noting.
if (!inner_pending.empty()) {
target_->WaitAll(async_handles, count);
}
// For those that failed to find something, convert to lookup in my
// secondary cache.
for (AsyncLookupHandle* cur : inner_pending) {
if (cur->Result() == nullptr) {
// Not found, try my secondary
StartAsyncLookupOnMySecondary(*cur);
if (cur->IsPending()) {
assert(cur->pending_cache == secondary_cache_.get());
my_pending.push_back(cur);
// Mark as "to be handled by this caller"
cur->pending_cache = nullptr;
}
}
}
// Wait on all lookups on my secondary cache
{
std::vector<SecondaryCacheResultHandle*> my_secondary_handles;
for (AsyncLookupHandle* cur : my_pending) {
my_secondary_handles.push_back(cur->pending_handle);
}
secondary_cache_->WaitAll(std::move(my_secondary_handles));
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
}
// Process results
for (AsyncLookupHandle* cur : my_pending) {
std::unique_ptr<SecondaryCacheResultHandle> secondary_handle(
cur->pending_handle);
cur->pending_handle = nullptr;
cur->result_handle = Promote(
std::move(secondary_handle), cur->key, cur->helper, cur->priority,
cur->stats, cur->found_dummy_entry, cur->kept_in_sec_cache);
assert(cur->pending_cache == nullptr);
}
}
std::string CacheWithSecondaryAdapter::GetPrintableOptions() const {
std::string str = target_->GetPrintableOptions();
str.append(" secondary_cache:\n");
str.append(secondary_cache_->GetPrintableOptions());
return str;
}
const char* CacheWithSecondaryAdapter::Name() const {
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
if (distribute_cache_res_) {
return kTieredCacheName;
} else {
// To the user, at least for now, configure the underlying cache with
// a secondary cache. So we pretend to be that cache
return target_->Name();
}
}
// Update the total cache capacity. If we're distributing cache reservations
// to both primary and secondary, then update the pri_cache_res_reservation
// as well. At the moment, we don't have a good way of handling the case
// where the new capacity < total cache reservations.
void CacheWithSecondaryAdapter::SetCapacity(size_t capacity) {
size_t sec_capacity = static_cast<size_t>(
capacity * (distribute_cache_res_
? sec_cache_res_ratio_.load(std::memory_order_relaxed)
: 0.0));
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
size_t old_sec_capacity = 0;
if (distribute_cache_res_) {
MutexLock m(&mutex_);
Status s = secondary_cache_->GetCapacity(old_sec_capacity);
if (!s.ok()) {
return;
}
if (old_sec_capacity > sec_capacity) {
// We're shrinking the cache. We do things in the following order to
// avoid a temporary spike in usage over the configured capacity -
// 1. Lower the secondary cache capacity
// 2. Credit an equal amount (by decreasing pri_cache_res_) to the
// primary cache
// 3. Decrease the primary cache capacity to the total budget
s = secondary_cache_->SetCapacity(sec_capacity);
if (s.ok()) {
s = pri_cache_res_->UpdateCacheReservation(
old_sec_capacity - sec_capacity,
/*increase=*/false);
assert(s.ok());
target_->SetCapacity(capacity);
}
} else {
// We're expanding the cache. Do it in the following order to avoid
// unnecessary evictions -
// 1. Increase the primary cache capacity to total budget
// 2. Reserve additional memory in primary on behalf of secondary (by
// increasing pri_cache_res_ reservation)
// 3. Increase secondary cache capacity
target_->SetCapacity(capacity);
s = pri_cache_res_->UpdateCacheReservation(
sec_capacity - old_sec_capacity,
/*increase=*/true);
assert(s.ok());
s = secondary_cache_->SetCapacity(sec_capacity);
assert(s.ok());
}
} else {
// No cache reservation distribution. Just set the primary cache capacity.
target_->SetCapacity(capacity);
}
}
// Update the secondary/primary allocation ratio (remember, the primary
// capacity is the total memory budget when distribute_cache_res_ is true).
// When the ratio changes, we may accumulate some error in the calculations
// for secondary cache inflate/deflate and pri_cache_res_ reservations.
// This is due to the rounding of the reservation amount.
//
// We rely on the current pri_cache_res_ total memory used to estimate the
// new secondary cache reservation after the ratio change. For this reason,
// once the ratio is lowered to 0.0 (effectively disabling the secondary
// cache and pri_cache_res_ total mem used going down to 0), we cannot
// increase the ratio and re-enable it, We might remove this limitation
// in the future.
Status CacheWithSecondaryAdapter::UpdateCacheReservationRatio(
double compressed_secondary_ratio) {
if (!distribute_cache_res_ ||
sec_cache_res_ratio_.load(std::memory_order_relaxed) == 0.0) {
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
return Status::NotSupported();
}
MutexLock m(&mutex_);
size_t pri_capacity = target_->GetCapacity();
size_t sec_capacity =
static_cast<size_t>(pri_capacity * compressed_secondary_ratio);
size_t old_sec_capacity;
Status s = secondary_cache_->GetCapacity(old_sec_capacity);
if (!s.ok()) {
return s;
}
TEST_SYNC_POINT(
"CacheWithSecondaryAdapter::UpdateCacheReservationRatio:Begin");
// There's a possible race condition here. Since the read of pri_cache_res_
// memory used (secondary cache usage charged to primary cache), and the
// change to sec_cache_res_ratio_ are not guarded by a mutex, its possible
// that an Insert/Release in another thread might decrease/increase the
// pri_cache_res_ reservation by the wrong amount. This should not be a
// problem because updating the sec/pri ratio is a rare operation, and
// the worst that can happen is we may over/under charge the secondary
// cache usage by a little bit. But we do need to protect against
// underflow of old_sec_reserved.
// TODO: Make the accounting more accurate by tracking the total memory
// reservation on the primary cache. This will also allow us to remove
// the restriction of not being able to change the sec/pri ratio from
// 0.0 to higher.
size_t sec_charge_to_pri = pri_cache_res_->GetTotalMemoryUsed();
size_t old_sec_reserved = (old_sec_capacity > sec_charge_to_pri)
? (old_sec_capacity - sec_charge_to_pri)
: 0;
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
// Calculate the new secondary cache reservation
size_t sec_reserved = static_cast<size_t>(
old_sec_reserved *
(double)(compressed_secondary_ratio /
sec_cache_res_ratio_.load(std::memory_order_relaxed)));
sec_cache_res_ratio_.store(compressed_secondary_ratio,
std::memory_order_relaxed);
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
if (sec_capacity > old_sec_capacity) {
// We're increasing the ratio, thus ending up with a larger secondary
// cache and a smaller usable primary cache capacity. Similar to
// SetCapacity(), we try to avoid a temporary increase in total usage
// beyond teh configured capacity -
// 1. A higher secondary cache ratio means it gets a higher share of
// cache reservations. So first account for that by deflating the
// secondary cache
// 2. Increase pri_cache_res_ reservation to reflect the new secondary
// cache utilization (increase in capacity - increase in share of cache
// reservation)
// 3. Increase secondary cache capacity
s = secondary_cache_->Deflate(sec_reserved - old_sec_reserved);
assert(s.ok());
s = pri_cache_res_->UpdateCacheReservation(
(sec_capacity - old_sec_capacity) - (sec_reserved - old_sec_reserved),
/*increase=*/true);
assert(s.ok());
s = secondary_cache_->SetCapacity(sec_capacity);
assert(s.ok());
} else {
// We're shrinking the ratio. Try to avoid unnecessary evictions -
// 1. Lower the secondary cache capacity
// 2. Decrease pri_cache_res_ reservation to relect lower secondary
// cache utilization (decrease in capacity - decrease in share of cache
// reservations)
// 3. Inflate the secondary cache to give it back the reduction in its
// share of cache reservations
s = secondary_cache_->SetCapacity(sec_capacity);
if (s.ok()) {
s = pri_cache_res_->UpdateCacheReservation(
(old_sec_capacity - sec_capacity) - (old_sec_reserved - sec_reserved),
/*increase=*/false);
assert(s.ok());
s = secondary_cache_->Inflate(old_sec_reserved - sec_reserved);
assert(s.ok());
}
}
TEST_SYNC_POINT("CacheWithSecondaryAdapter::UpdateCacheReservationRatio:End");
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
#ifndef NDEBUG
// As mentioned in the function comments, we may accumulate some erros when
// the ratio is changed. We set a flag here which disables some assertions
// in the destructor
ratio_changed_ = true;
#endif
return s;
}
Status CacheWithSecondaryAdapter::UpdateAdmissionPolicy(
TieredAdmissionPolicy adm_policy) {
adm_policy_ = adm_policy;
return Status::OK();
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
}
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
std::shared_ptr<Cache> NewTieredCache(const TieredCacheOptions& _opts) {
if (!_opts.cache_opts) {
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
return nullptr;
}
TieredCacheOptions opts = _opts;
{
bool valid_adm_policy = true;
switch (_opts.adm_policy) {
case TieredAdmissionPolicy::kAdmPolicyAuto:
// Select an appropriate default policy
if (opts.adm_policy == TieredAdmissionPolicy::kAdmPolicyAuto) {
if (opts.nvm_sec_cache) {
opts.adm_policy = TieredAdmissionPolicy::kAdmPolicyThreeQueue;
} else {
opts.adm_policy = TieredAdmissionPolicy::kAdmPolicyPlaceholder;
}
}
break;
case TieredAdmissionPolicy::kAdmPolicyPlaceholder:
case TieredAdmissionPolicy::kAdmPolicyAllowCacheHits:
if (opts.nvm_sec_cache) {
valid_adm_policy = false;
}
break;
case TieredAdmissionPolicy::kAdmPolicyThreeQueue:
if (!opts.nvm_sec_cache) {
valid_adm_policy = false;
}
break;
default:
valid_adm_policy = false;
}
if (!valid_adm_policy) {
return nullptr;
}
}
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
std::shared_ptr<Cache> cache;
if (opts.cache_type == PrimaryCacheType::kCacheTypeLRU) {
LRUCacheOptions cache_opts =
*(static_cast_with_check<LRUCacheOptions, ShardedCacheOptions>(
opts.cache_opts));
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
cache_opts.capacity = opts.total_capacity;
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
cache = cache_opts.MakeSharedCache();
} else if (opts.cache_type == PrimaryCacheType::kCacheTypeHCC) {
HyperClockCacheOptions cache_opts =
*(static_cast_with_check<HyperClockCacheOptions, ShardedCacheOptions>(
opts.cache_opts));
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
cache_opts.capacity = opts.total_capacity;
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
cache = cache_opts.MakeSharedCache();
} else {
return nullptr;
}
std::shared_ptr<SecondaryCache> sec_cache;
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
opts.comp_cache_opts.capacity = static_cast<size_t>(
opts.total_capacity * opts.compressed_secondary_ratio);
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
sec_cache = NewCompressedSecondaryCache(opts.comp_cache_opts);
Support compressed and local flash secondary cache stacking (#11812) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-22 03:30:53 +00:00
if (opts.nvm_sec_cache) {
if (opts.adm_policy == TieredAdmissionPolicy::kAdmPolicyThreeQueue) {
Support compressed and local flash secondary cache stacking (#11812) Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-22 03:30:53 +00:00
sec_cache = std::make_shared<TieredSecondaryCache>(
sec_cache, opts.nvm_sec_cache,
TieredAdmissionPolicy::kAdmPolicyThreeQueue);
} else {
return nullptr;
}
}
return std::make_shared<CacheWithSecondaryAdapter>(
cache, sec_cache, opts.adm_policy, /*distribute_cache_res=*/true);
Integrate CacheReservationManager with compressed secondary cache (#11449) Summary: This draft PR implements charging of reserved memory, for write buffers, table readers, and other purposes, proportionally to the block cache and the compressed secondary cache. The basic flow of memory reservation is maintained - clients use ```CacheReservationManager``` to request reservations, and ```CacheReservationManager``` inserts placeholder entries, i.e null value and non-zero charge, into the block cache. The ```CacheWithSecondaryAdapter``` wrapper uses its own instance of ```CacheReservationManager``` to keep track of reservations charged to the secondary cache, while the placeholder entries are inserted into the primary block cache. The design is as follows. When ```CacheWithSecondaryAdapter``` is constructed with the ```distribute_cache_res``` parameter set to true, it manages the entire memory budget across the primary and secondary cache. The secondary cache is assumed to be in memory, such as the ```CompressedSecondaryCache```. When a placeholder entry is inserted by a CacheReservationManager instance to reserve memory, the ```CacheWithSecondaryAdapter```ensures that the reservation is distributed proportionally across the primary/secondary caches. The primary block cache is initially sized to the sum of the primary cache budget + the secondary cache budget, as follows - |--------- Primary Cache Configured Capacity -----------| |---Secondary Cache Budget----|----Primary Cache Budget-----| A ```ConcurrentCacheReservationManager``` member in the ```CacheWithSecondaryAdapter```, ```pri_cache_res_```, is used to help with tracking the distribution of memory reservations. Initially, it accounts for the entire secondary cache budget as a reservation against the primary cache. This shrinks the usable capacity of the primary cache to the budget that the user originally desired. |--Reservation for Sec Cache--|-Pri Cache Usable Capacity---| When a reservation placeholder is inserted into the adapter, it is inserted directly into the primary cache. This means the entire charge of the placeholder is counted against the primary cache. To compensate and count a portion of it against the secondary cache, the secondary cache ```Deflate()``` method is called to shrink it. Since the ```Deflate()``` causes the secondary actual usage to shrink, it is reflected here by releasing an equal amount from the ```pri_cache_res_``` reservation. For example, if the pri/sec ratio is 50/50, this would be the state after placeholder insertion - |-Reservation for Sec Cache-|-Pri Cache Usable Capacity-|-R-| Likewise, when the user inserted placeholder is released, the secondary cache ```Inflate()``` method is called to grow it, and the ```pri_cache_res_``` reservation is increased by an equal amount. Other alternatives - 1. Another way of implementing this would have been to simply split the user reservation in ```CacheWithSecondaryAdapter``` into primary and secondary components. However, this would require allocating a structure to track the associated secondary cache reservation, which adds some complexity and overhead. 2. Yet another option is to implement the splitting directly in ```CacheReservationManager```. However, there are multiple instances of ```CacheReservationManager``` in a DB instance, making it complicated to keep track of them. The PR contains the following changes - 1. A new cache allocator, ```NewTieredVolatileCache()```, is defined for allocating a tiered primary block cache and compressed secondary cache. This internally allocates an instance of ```CacheWithSecondaryAdapter```. 3. New interfaces, ```Deflate()``` and ```Inflate()```, are added to the ```SecondaryCache``` interface. The default implementaion returns ```NotSupported``` with overrides in ```CompressedSecondaryCache```. 4. The ```CompressedSecondaryCache``` uses a ```ConcurrentCacheReservationManager``` instance to manage reservations done using ```Inflate()/Deflate()```. 5. The ```CacheWithSecondaryAdapter``` optionally distributes memory reservations across the primary and secondary caches. The primary cache is sized to the total memory budget (primary + secondary), and the capacity allocated to secondary cache is "reserved" against the primary cache. For any subsequent reservations, the primary cache pre-reserved capacity is adjusted. Benchmarks - Baseline ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true ``` ``` readseq : 3.301 micros/op 9694317 ops/sec 66.018 seconds 640000000 operations; 9763.0 MB/s readwhilewriting : 22.921 micros/op 1396058 ops/sec 300.021 seconds 418846968 operations; 1405.9 MB/s (13068999 of 13068999 found) real 6m31.052s user 152m5.660s sys 26m18.738s ``` With TieredVolatileCache ``` time ~/rocksdb_anand76/db_bench --db=/dev/shm/comp_cache_res/base --use_existing_db=true --benchmarks="readseq,readwhilewriting" --key_size=32 --value_size=1024 --num=20000000 --threads=32 --bloom_bits=10 --cache_size=30000000000 --use_compressed_secondary_cache=true --compressed_secondary_cache_size=5000000000 --duration=300 --cost_write_buffer_to_cache=true --use_tiered_volatile_cache=true ``` ``` readseq : 4.064 micros/op 7873915 ops/sec 81.281 seconds 640000000 operations; 7929.7 MB/s readwhilewriting : 20.944 micros/op 1527827 ops/sec 300.020 seconds 458378968 operations; 1538.6 MB/s (14296999 of 14296999 found) real 6m42.743s user 157m58.972s sys 33m16.671 ``` ``` readseq : 3.484 micros/op 9184967 ops/sec 69.679 seconds 640000000 operations; 9250.0 MB/s readwhilewriting : 21.261 micros/op 1505035 ops/sec 300.024 seconds 451545968 operations; 1515.7 MB/s (14101999 of 14101999 found) real 6m31.469s user 155m16.570s sys 27m47.834s ``` ToDo - 1. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11449 Reviewed By: pdillinger Differential Revision: D46197388 Pulled By: anand1976 fbshipit-source-id: 42d16f0254df683db4929db20d06ff26030e90df
2023-05-30 21:05:48 +00:00
}
Fix updating the capacity of a tiered cache (#11873) Summary: Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity. With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy. Tests: New unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873 Reviewed By: akankshamahajan15 Differential Revision: D49562250 Pulled By: anand1976 fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-23 01:07:46 +00:00
Status UpdateTieredCache(const std::shared_ptr<Cache>& cache,
int64_t total_capacity,
double compressed_secondary_ratio,
TieredAdmissionPolicy adm_policy) {
if (!cache || strcmp(cache->Name(), kTieredCacheName)) {
return Status::InvalidArgument();
}
CacheWithSecondaryAdapter* tiered_cache =
static_cast<CacheWithSecondaryAdapter*>(cache.get());
Status s;
if (total_capacity > 0) {
tiered_cache->SetCapacity(total_capacity);
}
if (compressed_secondary_ratio >= 0.0 && compressed_secondary_ratio <= 1.0) {
s = tiered_cache->UpdateCacheReservationRatio(compressed_secondary_ratio);
}
if (adm_policy < TieredAdmissionPolicy::kAdmPolicyMax) {
s = tiered_cache->UpdateAdmissionPolicy(adm_policy);
}
return s;
}
HyperClockCache support for SecondaryCache, with refactoring (#11301) Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-18 03:23:49 +00:00
} // namespace ROCKSDB_NAMESPACE