// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
//
// A Cache is an interface that maps keys to values. It has internal
// synchronization and may be safely accessed concurrently from
// multiple threads. It may automatically evict entries to make room
// for new entries. Values have a specified charge against the cache
// capacity. For example, a cache where the values are variable
// length strings may use the length of the string as the charge for
// the string.
//
// A builtin cache implementation with a least-recently-used eviction
// policy is provided. Clients may use their own implementations if
// they want something more sophisticated (like scan-resistance, a
// custom eviction policy, variable cache sizing, etc.)

#pragma once
#include <cstdint>
#include <functional>
#include <memory>
#include <string>

#include "rocksdb/compression_type.h"
#include "rocksdb/memory_allocator.h"
#include "rocksdb/slice.h"
#include "rocksdb/statistics.h"
#include "rocksdb/status.h"
namespace ROCKSDB_NAMESPACE {

class Cache;
struct ConfigOptions;
class SecondaryCache;

extern const bool kDefaultToAdaptiveMutex;

enum CacheMetadataChargePolicy {
  kDontChargeCacheMetadata,
  kFullChargeCacheMetadata
};
const CacheMetadataChargePolicy kDefaultCacheMetadataChargePolicy =
    kFullChargeCacheMetadata;

struct LRUCacheOptions {
  // Capacity of the cache.
  size_t capacity = 0;

  // Cache is sharded into 2^num_shard_bits shards,
  // by hash of key. Refer to NewLRUCache for further
  // information.
  int num_shard_bits = -1;

  // If strict_capacity_limit is set,
  // insert to the cache will fail when cache is full.
  bool strict_capacity_limit = false;

  // Percentage of cache reserved for high priority entries.
  // If greater than zero, the LRU list will be split into a high-pri
  // list and a low-pri list. High-pri entries will be inserted to the
  // tail of high-pri list, while low-pri entries will be first inserted to
  // the low-pri list (the midpoint). This is referred to as
  // midpoint insertion strategy to make entries that never get hit in cache
  // age out faster.
  //
  // See also
  // BlockBasedTableOptions::cache_index_and_filter_blocks_with_high_priority.
  double high_pri_pool_ratio = 0.5;

  // If non-nullptr, use this allocator instead of the system allocator when
  // allocating memory for cache blocks. Set this before you start using
  // the cache!
  //
  // Caveat: when the cache is used as block cache, the memory allocator is
  // ignored when dealing with compression libraries that allocate memory
  // internally (currently only XPRESS).
  std::shared_ptr<MemoryAllocator> memory_allocator;

  // Whether to use adaptive mutexes for cache shards. Note that adaptive
  // mutexes need to be supported by the platform in order for this to have any
  // effect. The default value is true if RocksDB is compiled with
  // -DROCKSDB_DEFAULT_TO_ADAPTIVE_MUTEX, false otherwise.
  bool use_adaptive_mutex = kDefaultToAdaptiveMutex;

  CacheMetadataChargePolicy metadata_charge_policy =
      kDefaultCacheMetadataChargePolicy;

  // A SecondaryCache instance to use as the non-volatile tier.
  std::shared_ptr<SecondaryCache> secondary_cache;

  LRUCacheOptions() {}
  LRUCacheOptions(size_t _capacity, int _num_shard_bits,
                  bool _strict_capacity_limit, double _high_pri_pool_ratio,
                  std::shared_ptr<MemoryAllocator> _memory_allocator = nullptr,
                  bool _use_adaptive_mutex = kDefaultToAdaptiveMutex,
                  CacheMetadataChargePolicy _metadata_charge_policy =
                      kDefaultCacheMetadataChargePolicy)
      : capacity(_capacity),
        num_shard_bits(_num_shard_bits),
        strict_capacity_limit(_strict_capacity_limit),
        high_pri_pool_ratio(_high_pri_pool_ratio),
        memory_allocator(std::move(_memory_allocator)),
        use_adaptive_mutex(_use_adaptive_mutex),
        metadata_charge_policy(_metadata_charge_policy) {}
};

// Create a new cache with a fixed size capacity. The cache is sharded
// to 2^num_shard_bits shards, by hash of the key. The total capacity
// is divided and evenly assigned to each shard. If strict_capacity_limit
// is set, insert to the cache will fail when cache is full. User can also
// set the percentage of the cache reserved for high priority entries via
// high_pri_pool_ratio.
// num_shard_bits = -1 means it is automatically determined: every shard
// will be at least 512KB and number of shard bits will not exceed 6.
extern std::shared_ptr<Cache> NewLRUCache(
    size_t capacity, int num_shard_bits = -1,
    bool strict_capacity_limit = false, double high_pri_pool_ratio = 0.5,
    std::shared_ptr<MemoryAllocator> memory_allocator = nullptr,
    bool use_adaptive_mutex = kDefaultToAdaptiveMutex,
    CacheMetadataChargePolicy metadata_charge_policy =
        kDefaultCacheMetadataChargePolicy);

extern std::shared_ptr<Cache> NewLRUCache(const LRUCacheOptions& cache_opts);
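
// Example (illustrative sketch, not part of the original documentation):
// the two equivalent ways of creating an LRU cache declared above. The
// 64 MB capacity and 6 shard bits are arbitrary values chosen for the
// example.
//
//   std::shared_ptr<Cache> cache =
//       NewLRUCache(64 << 20 /* capacity */, 6 /* num_shard_bits */);
//
//   LRUCacheOptions opts;
//   opts.capacity = 64 << 20;
//   opts.num_shard_bits = 6;
//   opts.high_pri_pool_ratio = 0.5;
//   std::shared_ptr<Cache> same_cache = NewLRUCache(opts);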

// EXPERIMENTAL
// Options structure for configuring a SecondaryCache instance based on
// LRUCache. The LRUCacheOptions.secondary_cache is not used and
// should not be set.
struct LRUSecondaryCacheOptions : LRUCacheOptions {
  // The compression method (if any) that is used to compress data.
  CompressionType compression_type = CompressionType::kLZ4Compression;

  // compress_format_version can have two values:
  // compress_format_version == 1 -- decompressed size is not included in the
  //   block header.
  // compress_format_version == 2 -- decompressed size is included in the block
  //   header in varint32 format.
  uint32_t compress_format_version = 2;

  LRUSecondaryCacheOptions() {}
  LRUSecondaryCacheOptions(
      size_t _capacity, int _num_shard_bits, bool _strict_capacity_limit,
      double _high_pri_pool_ratio,
      std::shared_ptr<MemoryAllocator> _memory_allocator = nullptr,
      bool _use_adaptive_mutex = kDefaultToAdaptiveMutex,
      CacheMetadataChargePolicy _metadata_charge_policy =
          kDefaultCacheMetadataChargePolicy,
      CompressionType _compression_type = CompressionType::kLZ4Compression,
      uint32_t _compress_format_version = 2)
      : LRUCacheOptions(_capacity, _num_shard_bits, _strict_capacity_limit,
                        _high_pri_pool_ratio, std::move(_memory_allocator),
                        _use_adaptive_mutex, _metadata_charge_policy),
        compression_type(_compression_type),
        compress_format_version(_compress_format_version) {}
};

// EXPERIMENTAL
// Create a new Secondary Cache that is implemented on top of LRUCache.
extern std::shared_ptr<SecondaryCache> NewLRUSecondaryCache(
    size_t capacity, int num_shard_bits = -1,
    bool strict_capacity_limit = false, double high_pri_pool_ratio = 0.5,
    std::shared_ptr<MemoryAllocator> memory_allocator = nullptr,
    bool use_adaptive_mutex = kDefaultToAdaptiveMutex,
    CacheMetadataChargePolicy metadata_charge_policy =
        kDefaultCacheMetadataChargePolicy,
    CompressionType compression_type = CompressionType::kLZ4Compression,
    uint32_t compress_format_version = 2);

extern std::shared_ptr<SecondaryCache> NewLRUSecondaryCache(
    const LRUSecondaryCacheOptions& opts);
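
// Example (illustrative sketch, not part of the original documentation):
// creating a compressed LRU-based secondary cache and attaching it to a
// primary LRU cache via LRUCacheOptions::secondary_cache. The capacities
// are arbitrary example values.
//
//   std::shared_ptr<SecondaryCache> sec_cache =
//       NewLRUSecondaryCache(256 << 20 /* capacity */);
//
//   LRUCacheOptions primary_opts;
//   primary_opts.capacity = 64 << 20;
//   primary_opts.secondary_cache = sec_cache;
//   std::shared_ptr<Cache> cache = NewLRUCache(primary_opts);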

// Similar to NewLRUCache, but creates a cache based on the CLOCK algorithm
// with better concurrent performance in some cases. See util/clock_cache.cc
// for more detail.
//
// Returns nullptr if it is not supported.
//
// BROKEN: ClockCache is known to have bugs that could lead to crash or
// corruption, so should not be used until fixed. Use NewLRUCache instead.
extern std::shared_ptr<Cache> NewClockCache(
    size_t capacity, int num_shard_bits = -1,
    bool strict_capacity_limit = false,
    CacheMetadataChargePolicy metadata_charge_policy =
        kDefaultCacheMetadataChargePolicy);

class Cache {
 public:
  // Depending on implementation, cache entries with high priority could be
  // less likely to get evicted than low priority entries.
  enum class Priority { HIGH, LOW };

  // A set of callbacks to allow objects in the primary block cache to be
  // persisted in a secondary cache. The purpose of the secondary cache
  // is to support other ways of caching the object, such as persistent or
  // compressed data, that may require the object to be parsed and transformed
  // in some way. Since the primary cache holds C++ objects and the secondary
  // cache may only hold flat data that doesn't need relocation, these
  // callbacks need to be provided by the user of the block
  // cache to do the conversion.
  // The CacheItemHelper is passed to Insert() and Lookup(). It has pointers
  // to callback functions for size, saving and deletion of the
  // object. The callbacks are defined in C-style in order to make them
  // stateless and not add to the cache metadata size.
  // Saving multiple std::function objects will take up 32 bytes per
  // function, even if it's not bound to an object and captures nothing.
  //
  // All the callbacks are C-style function pointers in order to simplify
  // lifecycle management. Objects in the cache can outlive the parent DB,
  // so anything required for these operations should be contained in the
  // object itself.
  //
  // The SizeCallback takes a void* pointer to the object and returns the size
  // of the persistable data. It can be used by the secondary cache to allocate
  // memory if needed.
  //
  // RocksDB callbacks are NOT exception-safe. A callback completing with an
  // exception can lead to undefined behavior in RocksDB, including data loss,
  // unreported corruption, deadlocks, and more.
  using SizeCallback = size_t (*)(void* obj);

  // The SaveToCallback takes a void* object pointer and saves the persistable
  // data into a buffer. The secondary cache may decide to not store it in a
  // contiguous buffer, in which case this callback will be called multiple
  // times with increasing offset.
  using SaveToCallback = Status (*)(void* from_obj, size_t from_offset,
                                    size_t length, void* out);

  // A function pointer type for custom destruction of an entry's
  // value. The Cache is responsible for copying and reclaiming space
  // for the key, but values are managed by the caller.
  using DeleterFn = void (*)(const Slice& key, void* value);

  // A struct with pointers to helper functions for spilling items from the
  // cache into the secondary cache. May be extended in the future. An
  // instance of this struct is expected to outlive the cache.
  struct CacheItemHelper {
    SizeCallback size_cb;
    SaveToCallback saveto_cb;
    DeleterFn del_cb;

    CacheItemHelper() : size_cb(nullptr), saveto_cb(nullptr), del_cb(nullptr) {}
    CacheItemHelper(SizeCallback _size_cb, SaveToCallback _saveto_cb,
                    DeleterFn _del_cb)
        : size_cb(_size_cb), saveto_cb(_saveto_cb), del_cb(_del_cb) {}
  };
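
  // Example (illustrative sketch, not part of the original documentation):
  // a helper for a hypothetical value type MyBlock that owns a flat byte
  // buffer. MyBlock and the callback names are assumptions made for the
  // example (memcpy requires <cstring>).
  //
  //   struct MyBlock {
  //     std::string data;
  //   };
  //   size_t MyBlockSize(void* obj) {
  //     return static_cast<MyBlock*>(obj)->data.size();
  //   }
  //   Status MyBlockSaveTo(void* from_obj, size_t from_offset, size_t length,
  //                        void* out) {
  //     const MyBlock* block = static_cast<MyBlock*>(from_obj);
  //     memcpy(out, block->data.data() + from_offset, length);
  //     return Status::OK();
  //   }
  //   void MyBlockDeleter(const Slice& /*key*/, void* value) {
  //     delete static_cast<MyBlock*>(value);
  //   }
  //   static const Cache::CacheItemHelper kMyBlockHelper{
  //       MyBlockSize, MyBlockSaveTo, MyBlockDeleter};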

  // The CreateCallback is passed by the block cache user to Lookup(). It
  // takes in a buffer from the NVM cache and constructs an object using
  // it. The callback doesn't have ownership of the buffer and should
  // copy the contents into its own buffer.
  using CreateCallback = std::function<Status(const void* buf, size_t size,
                                              void** out_obj, size_t* charge)>;
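
  // Example (illustrative sketch, not part of the original documentation):
  // a CreateCallback that rebuilds the hypothetical MyBlock from the buffer
  // handed back by the secondary cache.
  //
  //   Cache::CreateCallback my_block_create_cb =
  //       [](const void* buf, size_t size, void** out_obj, size_t* charge) {
  //         auto* block = new MyBlock{
  //             std::string(static_cast<const char*>(buf), size)};
  //         *out_obj = block;
  //         *charge = size;
  //         return Status::OK();
  //       };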

  Cache(std::shared_ptr<MemoryAllocator> allocator = nullptr)
      : memory_allocator_(std::move(allocator)) {}
  // No copying allowed
  Cache(const Cache&) = delete;
  Cache& operator=(const Cache&) = delete;

  // Creates a new Cache based on the input value string and returns the
  // result. Currently, this method can be used to create LRUCaches only.
  // @param config_options
  // @param value  The value might be:
  //   - an old-style cache ("1M") -- equivalent to NewLRUCache(1024*1024)
  //   - Name-value option pairs -- "capacity=1M; num_shard_bits=4;"
  //     For the LRUCache, the values are defined in LRUCacheOptions.
  // @param result The new Cache object
  // @return OK if the cache was successfully created
  // @return NotFound if an invalid name was specified in the value
  // @return InvalidArgument if the options were not valid
  static Status CreateFromString(const ConfigOptions& config_options,
                                 const std::string& value,
                                 std::shared_ptr<Cache>* result);
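
  // Example (illustrative sketch, not part of the original documentation),
  // using the name-value form described above:
  //
  //   ConfigOptions config_options;
  //   std::shared_ptr<Cache> cache;
  //   Status s = Cache::CreateFromString(
  //       config_options, "capacity=1M;num_shard_bits=4", &cache);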

  // Destroys all existing entries by calling the "deleter"
  // function that was passed via the Insert() function.
  //
  // @See Insert
  virtual ~Cache() {}

  // Opaque handle to an entry stored in the cache.
  struct Handle {};

  // The type of the Cache
  virtual const char* Name() const = 0;

  // Insert a mapping from key->value into the volatile cache only
  // and assign it the specified charge against the total cache capacity.
  // If strict_capacity_limit is true and cache reaches its full capacity,
  // return Status::Incomplete.
  //
  // If handle is not nullptr, returns a handle that corresponds to the
  // mapping. The caller must call this->Release(handle) when the returned
  // mapping is no longer needed. In case of error, the caller is responsible
  // for cleaning up the value (i.e. calling "deleter").
  //
  // If handle is nullptr, it is as if Release is called immediately after
  // insert. In case of error, the value will be cleaned up.
  //
  // When the inserted entry is no longer needed, the key and
  // value will be passed to "deleter" which must delete the value.
  // (The Cache is responsible for copying and reclaiming space for
  // the key.)
  virtual Status Insert(const Slice& key, void* value, size_t charge,
                        DeleterFn deleter, Handle** handle = nullptr,
                        Priority priority = Priority::LOW) = 0;
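
  // Example (illustrative sketch, not part of the original documentation):
  // inserting a heap-allocated value with a matching deleter. The value type,
  // key, and charge are assumptions made for the example.
  //
  //   void DeleteString(const Slice& /*key*/, void* value) {
  //     delete static_cast<std::string*>(value);
  //   }
  //   ...
  //   auto* value = new std::string("payload");
  //   Status s = cache->Insert("my_key", value, value->size(), DeleteString,
  //                            nullptr /* handle */, Cache::Priority::LOW);
  //   // Insert passes ownership of `value` to the cache (via DeleterFn).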

  // If the cache has no mapping for "key", returns nullptr.
  //
  // Else return a handle that corresponds to the mapping. The caller
  // must call this->Release(handle) when the returned mapping is no
  // longer needed.
  // If stats is not nullptr, relevant tickers may be updated inside the
  // function.
  virtual Handle* Lookup(const Slice& key, Statistics* stats = nullptr) = 0;

  // Increments the reference count for the handle if it refers to an entry in
  // the cache. Returns true if refcount was incremented; otherwise, returns
  // false.
  // REQUIRES: handle must have been returned by a method on *this.
  virtual bool Ref(Handle* handle) = 0;

  /**
   * Release a mapping returned by a previous Lookup(). A released entry might
   * still remain in cache in case it is later looked up by others. If
   * erase_if_last_ref is set then it also erases it from the cache if there is
   * no other reference to it. Erasing it should call the deleter function that
   * was provided when the entry was inserted.
   *
   * Returns true if the entry was also erased.
   */
  // REQUIRES: handle must not have been released yet.
  // REQUIRES: handle must have been returned by a method on *this.
  virtual bool Release(Handle* handle, bool erase_if_last_ref = false) = 0;

  // Return the value encapsulated in a handle returned by a
  // successful Lookup().
  // REQUIRES: handle must not have been released yet.
  // REQUIRES: handle must have been returned by a method on *this.
  virtual void* Value(Handle* handle) = 0;
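
  // Example (illustrative sketch, not part of the original documentation):
  // the typical Lookup / Value / Release sequence for an entry inserted as a
  // std::string, as in the Insert example above.
  //
  //   Cache::Handle* handle = cache->Lookup("my_key");
  //   if (handle != nullptr) {
  //     auto* value = static_cast<std::string*>(cache->Value(handle));
  //     // ... use *value ...
  //     cache->Release(handle);
  //   }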

  // If the cache contains an entry for key, erase it. Note that the
  // underlying entry will be kept around until all existing handles
  // to it have been released.
  virtual void Erase(const Slice& key) = 0;

  // Return a new numeric id. May be used by multiple clients who are
  // sharding the same cache to partition the key space. Typically the
  // client will allocate a new id at startup and prepend the id to
  // its cache keys.
  virtual uint64_t NewId() = 0;
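
  // Example (illustrative sketch, not part of the original documentation):
  // using NewId() to build a per-client key prefix, as described above.
  // `user_key` is an assumed Slice, and the raw-byte encoding of the id is
  // just one possible scheme chosen for the example.
  //
  //   uint64_t id = cache->NewId();
  //   std::string key(reinterpret_cast<const char*>(&id), sizeof(id));
  //   key.append(user_key.data(), user_key.size());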

  // Sets the maximum configured capacity of the cache. When the new
  // capacity is less than the old capacity and the existing usage is
  // greater than the new capacity, the implementation will do its best to
  // purge the released entries from the cache in order to lower the usage.
  virtual void SetCapacity(size_t capacity) = 0;

  // Set whether to return error on insertion when cache reaches its full
  // capacity.
  virtual void SetStrictCapacityLimit(bool strict_capacity_limit) = 0;

  // Get the flag whether to return error on insertion when cache reaches its
  // full capacity.
  virtual bool HasStrictCapacityLimit() const = 0;

  // Returns the maximum configured capacity of the cache.
  virtual size_t GetCapacity() const = 0;

  // Returns the memory size for the entries residing in the cache.
  virtual size_t GetUsage() const = 0;

  // Returns the memory size for a specific entry in the cache.
  virtual size_t GetUsage(Handle* handle) const = 0;

  // Returns the memory size for the entries in use by the system.
  virtual size_t GetPinnedUsage() const = 0;

  // Returns the charge for the specific entry in the cache.
  virtual size_t GetCharge(Handle* handle) const = 0;

// Returns the deleter for the specified entry. This might seem useless
// as the Cache itself is responsible for calling the deleter, but
// the deleter can essentially verify that a cache entry is of an
// expected type from an expected code source.
virtual DeleterFn GetDeleter(Handle* handle) const = 0;
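//
// Illustrative sketch (hypothetical value type and deleter; shows how a
// caller can recognize entries it inserted itself):
//
//   static void DeleteMyValue(const Slice& /*key*/, void* value) {
//     delete static_cast<MyValue*>(value);
//   }
//   bool IsMyEntry(Cache* cache, Cache::Handle* h) {
//     return cache->GetDeleter(h) == &DeleteMyValue;
//   }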

// Call this on shutdown if you want to speed it up. Cache will disown
// any underlying data and will not free it on delete. This call will leak
// memory - call this only if you're shutting down the process.
// Any attempt to use the cache after this call will fail terribly.
// Always delete the DB object before calling this method!
virtual void DisownData() {
  // default implementation is a no-op
}
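//
// Illustrative shutdown sequence (a sketch; `db` and `cache` are the
// caller's own objects):
//
//   delete db;            // destroy the DB before disowning its block cache
//   cache->DisownData();  // deliberately leak cached memory to exit faster
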
struct ApplyToAllEntriesOptions {
  // If the Cache uses locks, setting `average_entries_per_lock` to
  // a higher value suggests iterating over more entries each time a lock
  // is acquired, likely reducing the time for ApplyToAllEntries but
  // increasing latency for concurrent users of the Cache. Setting
  // `average_entries_per_lock` to a smaller value could be helpful if the
  // callback is relatively expensive, such as one using large data
  // structures.
  size_t average_entries_per_lock = 256;
};

// Apply a callback to all entries in the cache. The Cache must ensure
// thread safety but does not guarantee that a consistent snapshot of all
// entries is iterated over if other threads are also operating on the
// Cache.
virtual void ApplyToAllEntries(
    const std::function<void(const Slice& key, void* value, size_t charge,
                             DeleterFn deleter)>& callback,
    const ApplyToAllEntriesOptions& opts) = 0;
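//
// Illustrative sketch (an accounting pass over the whole cache; counts
// entries and sums their charges without modifying anything):
//
//   size_t count = 0;
//   size_t total_charge = 0;
//   Cache::ApplyToAllEntriesOptions opts;
//   opts.average_entries_per_lock = 256;  // default; tune for callback cost
//   cache->ApplyToAllEntries(
//       [&](const Slice& /*key*/, void* /*value*/, size_t charge,
//           Cache::DeleterFn /*deleter*/) {
//         ++count;
//         total_charge += charge;
//       },
//       opts);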

// DEPRECATED version of above. (Default implementation uses above.)
virtual void ApplyToAllCacheEntries(void (*callback)(void* value,
                                                     size_t charge),
                                    bool /*thread_safe*/) {
  ApplyToAllEntries([callback](const Slice&, void* value, size_t charge,
                               DeleterFn) { callback(value, charge); },
                    {});
}

// Remove all entries.
// Prerequisite: no entry is referenced.
virtual void EraseUnRefEntries() = 0;

virtual std::string GetPrintableOptions() const { return ""; }

MemoryAllocator* memory_allocator() const { return memory_allocator_.get(); }

// EXPERIMENTAL
// The following APIs are experimental and might change in the future.
// The Insert and Lookup APIs below are intended to allow cached objects
// to be demoted/promoted between the primary block cache and a secondary
// cache. The secondary cache could be a non-volatile cache, and will
// likely store the object in a different representation more suitable
// for on-disk storage. They rely on a per-object CacheItemHelper to do
// the conversions.
// The secondary cache may persist across process and system restarts,
// and may even be moved between hosts. Therefore, the cache key must
// be repeatable across restarts/reboots, and globally unique if
// multiple DBs share the same cache and the set of DBs can change
// over time.

// Insert a mapping from key->value into the cache and assign it
// the specified charge against the total cache capacity.
// If strict_capacity_limit is true and the cache reaches its full capacity,
// returns Status::Incomplete.
//
// The helper argument is saved by the cache and will be used when the
// inserted object is evicted or promoted to the secondary cache. It,
// therefore, must outlive the cache.
//
// If handle is not nullptr, returns a handle that corresponds to the
// mapping. The caller must call this->Release(handle) when the returned
// mapping is no longer needed. In case of error, the caller is responsible
// for cleaning up the value (i.e. calling "deleter").
//
// If handle is nullptr, it is as if Release is called immediately after
// insert. In case of error, the value will be cleaned up.
//
// Regardless of whether the item was inserted into the primary cache,
// the cache will attempt to insert it into the secondary cache if one is
// configured and the helper supports it. If the cache implementation does
// not support a secondary cache, the item is only inserted into the
// primary cache. The implementation may defer the insertion to the
// secondary cache as it sees fit.
//
// When the inserted entry is no longer needed, the key and
// value will be passed to "deleter".
virtual Status Insert(const Slice& key, void* value,
                      const CacheItemHelper* helper, size_t charge,
                      Handle** handle = nullptr,
                      Priority priority = Priority::LOW) {
  if (!helper) {
    return Status::InvalidArgument();
  }
  return Insert(key, value, charge, helper->del_cb, handle, priority);
}
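//
// Illustrative sketch (hypothetical value type `MyValue` and helper
// `my_helper`; `key` is a Slice owned by the caller, and the helper is a
// long-lived object whose del_cb deletes a MyValue):
//
//   MyValue* value = new MyValue();
//   Cache::Handle* h = nullptr;
//   Status s = cache->Insert(key, value, &my_helper, sizeof(MyValue), &h,
//                            Cache::Priority::HIGH);
//   if (s.ok()) {
//     // ... use cache->Value(h) ...
//     cache->Release(h);
//   } else {
//     delete value;  // on error, the caller cleans up the value itself
//   }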

// Lookup the key in the primary and secondary caches (if one is configured).
// The create_cb callback function object will be used to construct the
// cached object.
// If none of the caches have the mapping for the key, returns nullptr.
// Else, returns a handle that corresponds to the mapping.
//
// This call may promote the object from the secondary cache (if one is
// configured, and has the given key) to the primary cache.
//
// The helper argument should be provided if the caller wants the lookup
// to include the secondary cache (if one is configured) and the object,
// if it exists, to be promoted to the primary cache. The helper may be
// saved and used later when the object is evicted. Therefore, it must
// outlive the cache.
//
// The handle returned may not be ready. The caller should call IsReady()
// to check if the item value is ready, and call Wait() or WaitAll() if
// it is not ready. The caller should then call Value() to check if the
// item was successfully retrieved. If unsuccessful (perhaps due to an
// IO error), Value() will return nullptr.
virtual Handle* Lookup(const Slice& key, const CacheItemHelper* /*helper_cb*/,
                       const CreateCallback& /*create_cb*/,
                       Priority /*priority*/, bool /*wait*/,
                       Statistics* stats = nullptr) {
  return Lookup(key, stats);
}
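//
// Illustrative sketch (assumes `my_helper` and a CreateCallback `create_cb`
// set up by the caller, and a configured secondary cache):
//
//   Cache::Handle* h = cache->Lookup(key, &my_helper, create_cb,
//                                    Cache::Priority::LOW, /*wait=*/false);
//   if (h != nullptr) {
//     if (!cache->IsReady(h)) {
//       cache->Wait(h);  // block until the (possibly async) lookup completes
//     }
//     void* value = cache->Value(h);  // nullptr if retrieval failed
//     cache->Release(h, /*useful=*/value != nullptr,
//                    /*erase_if_last_ref=*/false);
//   }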

// Release a mapping returned by a previous Lookup(). The "useful"
// parameter specifies whether the data was actually used or not,
// which may be used by the cache implementation to decide whether
// to consider it as a hit for retention purposes.
virtual bool Release(Handle* handle, bool /*useful*/,
                     bool erase_if_last_ref) {
  return Release(handle, erase_if_last_ref);
}

// Determines if the handle returned by Lookup() has a valid value yet. The
// call is not thread safe and should be called only by someone holding a
// reference to the handle.
virtual bool IsReady(Handle* /*handle*/) { return true; }

// If the handle returned by Lookup() is not ready yet, wait till it
// becomes ready.
// Note: A ready handle doesn't necessarily mean it has a valid value. The
// user should call Value() and check for nullptr.
virtual void Wait(Handle* /*handle*/) {}

// Wait for a vector of handles to become ready. As with Wait(), the user
// should check the Value() of each handle for nullptr. This call is not
// thread safe and should only be called by the caller holding a reference
// to each of the handles.
virtual void WaitAll(std::vector<Handle*>& /*handles*/) {}
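//
// Illustrative sketch (batched lookup; `keys`, `my_helper`, and `create_cb`
// are assumed to be defined by the caller):
//
//   std::vector<Cache::Handle*> pending;
//   for (const Slice& k : keys) {
//     Cache::Handle* h = cache->Lookup(k, &my_helper, create_cb,
//                                      Cache::Priority::LOW, /*wait=*/false);
//     if (h != nullptr && !cache->IsReady(h)) {
//       pending.push_back(h);
//     }
//   }
//   cache->WaitAll(pending);  // block until all pending lookups complete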

private:
  std::shared_ptr<MemoryAllocator> memory_allocator_;
};

}  // namespace ROCKSDB_NAMESPACE