rocksdb/cache/sharded_cache.h
Peter Dillinger 204fcff751 HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.

This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.

It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.

## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.

## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.

These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.

* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)

I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.

## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.

## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.

## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.

## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.

## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.

## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.

## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.

Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).

Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.

## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301

Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.

## Crash/stress test
Updated to use the new combination.

## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.

```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```

**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127

**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)

This is an acceptable difference IMHO.

Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.

Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec

So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:

**Before**:
multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec

Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.

Let's also look at Get() in db_bench:

```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
**After**:
readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec

That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:

**Before**:
readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
**After**:
readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec

Reviewed By: ltamasi

Differential Revision: D44177044

Pulled By: pdillinger

fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-17 20:23:49 -07:00

311 lines
11 KiB
C++

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#pragma once
#include <atomic>
#include <cstdint>
#include <string>
#include "port/lang.h"
#include "port/port.h"
#include "rocksdb/advanced_cache.h"
#include "util/hash.h"
#include "util/mutexlock.h"
namespace ROCKSDB_NAMESPACE {
// Optional base class for classes implementing the CacheShard concept
class CacheShardBase {
public:
explicit CacheShardBase(CacheMetadataChargePolicy metadata_charge_policy)
: metadata_charge_policy_(metadata_charge_policy) {}
using DeleterFn = Cache::DeleterFn;
// Expected by concept CacheShard (TODO with C++20 support)
// Some Defaults
std::string GetPrintableOptions() const { return ""; }
using HashVal = uint64_t;
using HashCref = uint64_t;
static inline HashVal ComputeHash(const Slice& key) {
return GetSliceNPHash64(key);
}
static inline uint32_t HashPieceForSharding(HashCref hash) {
return Lower32of64(hash);
}
void AppendPrintableOptions(std::string& /*str*/) const {}
// Must be provided for concept CacheShard (TODO with C++20 support)
/*
struct HandleImpl { // for concept HandleImpl
HashVal hash;
HashCref GetHash() const;
...
};
Status Insert(const Slice& key, HashCref hash, Cache::ObjectPtr value,
const Cache::CacheItemHelper* helper, size_t charge,
HandleImpl** handle, Cache::Priority priority,
bool standalone) = 0;
Handle* CreateStandalone(const Slice& key, HashCref hash, ObjectPtr obj,
const CacheItemHelper* helper,
size_t charge, bool allow_uncharged) = 0;
HandleImpl* Lookup(const Slice& key, HashCref hash,
const Cache::CacheItemHelper* helper,
Cache::CreateContext* create_context,
Cache::Priority priority,
Statistics* stats) = 0;
bool Release(HandleImpl* handle, bool useful, bool erase_if_last_ref) = 0;
bool Ref(HandleImpl* handle) = 0;
void Erase(const Slice& key, HashCref hash) = 0;
void SetCapacity(size_t capacity) = 0;
void SetStrictCapacityLimit(bool strict_capacity_limit) = 0;
size_t GetUsage() const = 0;
size_t GetPinnedUsage() const = 0;
size_t GetOccupancyCount() const = 0;
size_t GetTableAddressCount() const = 0;
// Handles iterating over roughly `average_entries_per_lock` entries, using
// `state` to somehow record where it last ended up. Caller initially uses
// *state == 0 and implementation sets *state = SIZE_MAX to indicate
// completion.
void ApplyToSomeEntries(
const std::function<void(const Slice& key, ObjectPtr value,
size_t charge,
const Cache::CacheItemHelper* helper)>& callback,
size_t average_entries_per_lock, size_t* state) = 0;
void EraseUnRefEntries() = 0;
*/
protected:
const CacheMetadataChargePolicy metadata_charge_policy_;
};
// Portions of ShardedCache that do not depend on the template parameter
class ShardedCacheBase : public Cache {
public:
ShardedCacheBase(size_t capacity, int num_shard_bits,
bool strict_capacity_limit,
std::shared_ptr<MemoryAllocator> memory_allocator);
virtual ~ShardedCacheBase() = default;
int GetNumShardBits() const;
uint32_t GetNumShards() const;
uint64_t NewId() override;
bool HasStrictCapacityLimit() const override;
size_t GetCapacity() const override;
using Cache::GetUsage;
size_t GetUsage(Handle* handle) const override;
std::string GetPrintableOptions() const override;
protected: // fns
virtual void AppendPrintableOptions(std::string& str) const = 0;
size_t GetPerShardCapacity() const;
size_t ComputePerShardCapacity(size_t capacity) const;
protected: // data
std::atomic<uint64_t> last_id_; // For NewId
const uint32_t shard_mask_;
// Dynamic configuration parameters, guarded by config_mutex_
bool strict_capacity_limit_;
size_t capacity_;
mutable port::Mutex config_mutex_;
};
// Generic cache interface that shards cache by hash of keys. 2^num_shard_bits
// shards will be created, with capacity split evenly to each of the shards.
// Keys are typically sharded by the lowest num_shard_bits bits of hash value
// so that the upper bits of the hash value can keep a stable ordering of
// table entries even as the table grows (using more upper hash bits).
// See CacheShardBase above for what is expected of the CacheShard parameter.
template <class CacheShard>
class ShardedCache : public ShardedCacheBase {
public:
using HashVal = typename CacheShard::HashVal;
using HashCref = typename CacheShard::HashCref;
using HandleImpl = typename CacheShard::HandleImpl;
ShardedCache(size_t capacity, int num_shard_bits, bool strict_capacity_limit,
std::shared_ptr<MemoryAllocator> allocator)
: ShardedCacheBase(capacity, num_shard_bits, strict_capacity_limit,
allocator),
shards_(reinterpret_cast<CacheShard*>(port::cacheline_aligned_alloc(
sizeof(CacheShard) * GetNumShards()))),
destroy_shards_in_dtor_(false) {}
virtual ~ShardedCache() {
if (destroy_shards_in_dtor_) {
ForEachShard([](CacheShard* cs) { cs->~CacheShard(); });
}
port::cacheline_aligned_free(shards_);
}
CacheShard& GetShard(HashCref hash) {
return shards_[CacheShard::HashPieceForSharding(hash) & shard_mask_];
}
const CacheShard& GetShard(HashCref hash) const {
return shards_[CacheShard::HashPieceForSharding(hash) & shard_mask_];
}
void SetCapacity(size_t capacity) override {
MutexLock l(&config_mutex_);
capacity_ = capacity;
auto per_shard = ComputePerShardCapacity(capacity);
ForEachShard([=](CacheShard* cs) { cs->SetCapacity(per_shard); });
}
void SetStrictCapacityLimit(bool s_c_l) override {
MutexLock l(&config_mutex_);
strict_capacity_limit_ = s_c_l;
ForEachShard(
[s_c_l](CacheShard* cs) { cs->SetStrictCapacityLimit(s_c_l); });
}
Status Insert(const Slice& key, ObjectPtr obj, const CacheItemHelper* helper,
size_t charge, Handle** handle = nullptr,
Priority priority = Priority::LOW) override {
assert(helper);
HashVal hash = CacheShard::ComputeHash(key);
auto h_out = reinterpret_cast<HandleImpl**>(handle);
return GetShard(hash).Insert(key, hash, obj, helper, charge, h_out,
priority);
}
Handle* CreateStandalone(const Slice& key, ObjectPtr obj,
const CacheItemHelper* helper, size_t charge,
bool allow_uncharged) override {
assert(helper);
HashVal hash = CacheShard::ComputeHash(key);
HandleImpl* result = GetShard(hash).CreateStandalone(
key, hash, obj, helper, charge, allow_uncharged);
return reinterpret_cast<Handle*>(result);
}
Handle* Lookup(const Slice& key, const CacheItemHelper* helper = nullptr,
CreateContext* create_context = nullptr,
Priority priority = Priority::LOW,
Statistics* stats = nullptr) override {
HashVal hash = CacheShard::ComputeHash(key);
HandleImpl* result = GetShard(hash).Lookup(key, hash, helper,
create_context, priority, stats);
return reinterpret_cast<Handle*>(result);
}
void Erase(const Slice& key) override {
HashVal hash = CacheShard::ComputeHash(key);
GetShard(hash).Erase(key, hash);
}
bool Release(Handle* handle, bool useful,
bool erase_if_last_ref = false) override {
auto h = reinterpret_cast<HandleImpl*>(handle);
return GetShard(h->GetHash()).Release(h, useful, erase_if_last_ref);
}
bool Ref(Handle* handle) override {
auto h = reinterpret_cast<HandleImpl*>(handle);
return GetShard(h->GetHash()).Ref(h);
}
bool Release(Handle* handle, bool erase_if_last_ref = false) override {
return Release(handle, true /*useful*/, erase_if_last_ref);
}
using ShardedCacheBase::GetUsage;
size_t GetUsage() const override {
return SumOverShards2(&CacheShard::GetUsage);
}
size_t GetPinnedUsage() const override {
return SumOverShards2(&CacheShard::GetPinnedUsage);
}
size_t GetOccupancyCount() const override {
return SumOverShards2(&CacheShard::GetPinnedUsage);
}
size_t GetTableAddressCount() const override {
return SumOverShards2(&CacheShard::GetTableAddressCount);
}
void ApplyToAllEntries(
const std::function<void(const Slice& key, ObjectPtr value, size_t charge,
const CacheItemHelper* helper)>& callback,
const ApplyToAllEntriesOptions& opts) override {
uint32_t num_shards = GetNumShards();
// Iterate over part of each shard, rotating between shards, to
// minimize impact on latency of concurrent operations.
std::unique_ptr<size_t[]> states(new size_t[num_shards]{});
size_t aepl = opts.average_entries_per_lock;
aepl = std::min(aepl, size_t{1});
bool remaining_work;
do {
remaining_work = false;
for (uint32_t i = 0; i < num_shards; i++) {
if (states[i] != SIZE_MAX) {
shards_[i].ApplyToSomeEntries(callback, aepl, &states[i]);
remaining_work |= states[i] != SIZE_MAX;
}
}
} while (remaining_work);
}
virtual void EraseUnRefEntries() override {
ForEachShard([](CacheShard* cs) { cs->EraseUnRefEntries(); });
}
void DisownData() override {
// Leak data only if that won't generate an ASAN/valgrind warning.
if (!kMustFreeHeapAllocations) {
destroy_shards_in_dtor_ = false;
}
}
protected:
inline void ForEachShard(const std::function<void(CacheShard*)>& fn) {
uint32_t num_shards = GetNumShards();
for (uint32_t i = 0; i < num_shards; i++) {
fn(shards_ + i);
}
}
inline size_t SumOverShards(
const std::function<size_t(CacheShard&)>& fn) const {
uint32_t num_shards = GetNumShards();
size_t result = 0;
for (uint32_t i = 0; i < num_shards; i++) {
result += fn(shards_[i]);
}
return result;
}
inline size_t SumOverShards2(size_t (CacheShard::*fn)() const) const {
return SumOverShards([fn](CacheShard& cs) { return (cs.*fn)(); });
}
// Must be called exactly once by derived class constructor
void InitShards(const std::function<void(CacheShard*)>& placement_new) {
ForEachShard(placement_new);
destroy_shards_in_dtor_ = true;
}
void AppendPrintableOptions(std::string& str) const override {
shards_[0].AppendPrintableOptions(str);
}
private:
CacheShard* const shards_;
bool destroy_shards_in_dtor_;
};
// 512KB is traditional minimum shard size.
int GetDefaultCacheShardBits(size_t capacity,
size_t min_shard_size = 512U * 1024U);
} // namespace ROCKSDB_NAMESPACE