Re-implement GetApproximateMemTableStats for skip lists (#13047)

Summary:
GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations;   11.7 MB/s
approximatememtablestats :       3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 2344.1611  StdDev: 26587.27
Min: 0  Median: 965.8555  Max: 835273
Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97
------------------------------------------------------
[       0,       1 ]   131344  13.134%  13.134% ###
(       1,       2 ]      115   0.011%  13.146%
(       2,       3 ]      106   0.011%  13.157%
(       3,       4 ]      190   0.019%  13.176%
(       4,       6 ]      214   0.021%  13.197%
(       6,      10 ]      522   0.052%  13.249%
(      10,      15 ]      748   0.075%  13.324%
(      15,      22 ]     1002   0.100%  13.424%
(      22,      34 ]     1948   0.195%  13.619%
(      34,      51 ]     3067   0.307%  13.926%
(      51,      76 ]     4213   0.421%  14.347%
(      76,     110 ]     5721   0.572%  14.919%
(     110,     170 ]    11375   1.137%  16.056%
(     170,     250 ]    17928   1.793%  17.849%
(     250,     380 ]    36597   3.660%  21.509% #
(     380,     580 ]    77882   7.788%  29.297% ##
(     580,     870 ]   160193  16.019%  45.317% ###
(     870,    1300 ]   210098  21.010%  66.326% ####
(    1300,    1900 ]   167461  16.746%  83.072% ###
(    1900,    2900 ]    78678   7.868%  90.940% ##
(    2900,    4400 ]    47743   4.774%  95.715% #
(    4400,    6600 ]    17650   1.765%  97.480%
(    6600,    9900 ]    11895   1.190%  98.669%
(    9900,   14000 ]     4993   0.499%  99.168%
(   14000,   22000 ]     2384   0.238%  99.407%
(   22000,   33000 ]     1966   0.197%  99.603%
(   50000,   75000 ]     2968   0.297%  99.900%
(  570000,  860000 ]      999   0.100% 100.000%

readrandom   :       1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations;    8.2 MB/s (1000000 of 1000000 found)
```

Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast.

I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic in the number of memtable entries, but with somewhat larger constant factors. The relative standard deviation from the true count is around 20% or less, and the CPU cost is roughly that of two memtable point look-ups. See code comments for detail.
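
To restate the error analysis from the new code comments as a quick back-of-the-envelope (notation is mine; the numbers come straight from the comment block in the new implementation): if $c$ entries of the range are observed at skip-list level $\ell$ and $B$ is the branching factor (4 by default), the level-0 count is estimated as

$$\hat{n} = c \cdot B^{\ell}, \qquad \frac{\operatorname{stddev}(\hat{n})}{\hat{n}} \approx \frac{1}{\sqrt{c}}, \qquad \text{P99+ relative error} \approx \frac{3}{\sqrt{c}}$$

so a stopping rule that requires at least ~10 observed entries (more at higher levels) keeps the per-query relative standard deviation roughly in the 15-30% range, consistent with the ~20% figure above.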

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations;   11.0 MB/s
approximatememtablestats :       2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 1073.5158  StdDev: 197.80
Min: 608  Median: 1079.9506  Max: 2176
Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00
------------------------------------------------------
(     580,     870 ]   134848  13.485%  13.485% ###
(     870,    1300 ]   747868  74.787%  88.272% ###############
(    1300,    1900 ]   116536  11.654%  99.925% ##
(    1900,    2900 ]      748   0.075% 100.000%

readrandom   :       1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations;    8.1 MB/s (1000000 of 1000000 found)
```

We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters. Let's look at how this behavior generalizes, first with a *much* larger range:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000
filluniquerandom :       1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations;   11.7 MB/s
approximatememtablestats :       1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations;
Reported entry count stats (expected 30000):
Count: 1000000 Average: 31098.8795  StdDev: 3601.47
Min: 21504  Median: 29333.9303  Max: 43008
Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00
------------------------------------------------------
(   14000,   22000 ]      408   0.041%   0.041%
(   22000,   33000 ]   749327  74.933%  74.974% ###############
(   33000,   50000 ]   250265  25.027% 100.000% #####

readrandom   :       1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations;    8.5 MB/s (989989 of 1000000 found)
```

This is *even faster* and relatively *more accurate*, with relative standard deviation closer to 10%; the code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences (the arithmetic behind these is sketched right after this list):
* When the actual number of entries in the range is >= 40, the minimum return value is 40.
* When the actual number is <= 10, it is guaranteed to return that exact count.
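
Both quirks follow from the new stopping rule (`sufficient_samples` in the skip list changes below), assuming the default branching factor of 4. A tiny, purely illustrative sketch of the arithmetic (the names here are mine, not from the change):

```cpp
// Illustration only: thresholds implied by the new stopping rule, assuming
// the default skip list branching factor of 4.
#include <cstdint>

constexpr uint64_t kBranching = 4;  // InlineSkipList's default branching_factor

constexpr uint64_t SufficientSamples(uint64_t level) {
  // Mirrors `sufficient_samples` in the new ApproximateNumEntries()
  return level * kBranching + 10;
}

// Level 0's threshold is 10 samples: a range that never yields 10 samples at
// a higher level falls through to an exact count at level 0, which is why
// actual counts <= 10 come back exact.
static_assert(SufficientSamples(0) == 10, "exact counting for small ranges");

// The smallest scaled (non-exact) estimate is 10 level-1 samples * 4 = 40,
// hence the minimum return value of 40 for larger ranges.
static_assert(SufficientSamples(0) * kBranching == 40, "minimum scaled estimate");
```
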
```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75
...
filluniquerandom :       1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations;   11.4 MB/s
approximatememtablestats :       3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations;
Reported entry count stats (expected 75):
Count: 1000000 Average: 75.1210  StdDev: 15.02
Min: 40  Median: 71.9395  Max: 256
Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78
------------------------------------------------------
(      34,      51 ]    38867   3.887%   3.887% #
(      51,      76 ]   550554  55.055%  58.942% ###########
(      76,     110 ]   398854  39.885%  98.828% ########
(     110,     170 ]    11353   1.135%  99.963%
(     170,     250 ]      364   0.036%  99.999%
(     250,     380 ]        8   0.001% 100.000%

readrandom   :       1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations;    8.7 MB/s (999974 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25
...
filluniquerandom :       1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations;   10.8 MB/s
approximatememtablestats :       5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations;
Reported entry count stats (expected 25):
Count: 1000000 Average: 26.2392  StdDev: 4.58
Min: 25  Median: 28.4590  Max: 72
Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00
------------------------------------------------------
(      22,      34 ]   928936  92.894%  92.894% ###################
(      34,      51 ]    67960   6.796%  99.690% #
(      51,      76 ]     3104   0.310% 100.000%

readrandom   :       1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations;    8.6 MB/s (1000000 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10
...
filluniquerandom :       1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations;    9.9 MB/s
approximatememtablestats :       3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations;
Reported entry count stats (expected 10):
Count: 1000000 Average: 10.0000  StdDev: 0.00
Min: 10  Median: 10.0000  Max: 10
Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00
------------------------------------------------------
(       6,      10 ]  1000000 100.000% 100.000% ####################

readrandom   :       1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
```

Remarkably consistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047

Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated.

Reviewed By: cbi42

Differential Revision: D63722003

Pulled By: pdillinger

fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26
Peter Dillinger, 2024-10-02 14:25:50 -07:00 (committed by Facebook GitHub Bot)
Parent: 389e66bef5
Commit: dd23e84cad
8 changed files with 209 additions and 74 deletions


@@ -1826,21 +1826,30 @@ TEST_F(DBTest, GetApproximateMemTableStats) {
uint64_t count;
uint64_t size;
// Because Random::GetTLSInstance() seed is reset in DBTestBase,
// this test is deterministic.
std::string start = Key(50);
std::string end = Key(60);
Range r(start, end);
db_->GetApproximateMemTableStats(r, &count, &size);
ASSERT_GT(count, 0);
ASSERT_LE(count, N);
ASSERT_GT(size, 6000);
ASSERT_LT(size, 204800);
// When actual count is <= 10, it returns that as the minimum
EXPECT_EQ(count, 10);
EXPECT_EQ(size, 10440);
start = Key(20);
end = Key(100);
r = Range(start, end);
db_->GetApproximateMemTableStats(r, &count, &size);
EXPECT_EQ(count, 72);
EXPECT_EQ(size, 75168);
start = Key(500);
end = Key(600);
r = Range(start, end);
db_->GetApproximateMemTableStats(r, &count, &size);
ASSERT_EQ(count, 0);
ASSERT_EQ(size, 0);
EXPECT_EQ(count, 0);
EXPECT_EQ(size, 0);
ASSERT_OK(Flush());
@@ -1848,8 +1857,8 @@ TEST_F(DBTest, GetApproximateMemTableStats) {
end = Key(60);
r = Range(start, end);
db_->GetApproximateMemTableStats(r, &count, &size);
ASSERT_EQ(count, 0);
ASSERT_EQ(size, 0);
EXPECT_EQ(count, 0);
EXPECT_EQ(size, 0);
for (int i = 0; i < N; i++) {
ASSERT_OK(Put(Key(1000 + i), rnd.RandomString(1024)));
@@ -1857,10 +1866,11 @@ TEST_F(DBTest, GetApproximateMemTableStats) {
start = Key(100);
end = Key(1020);
// Actually 20 keys in the range ^^
r = Range(start, end);
db_->GetApproximateMemTableStats(r, &count, &size);
ASSERT_GT(count, 20);
ASSERT_GT(size, 6000);
EXPECT_EQ(count, 20);
EXPECT_EQ(size, 20880);
}
TEST_F(DBTest, ApproximateSizes) {


@@ -1031,8 +1031,9 @@ DEFINE_int32(continuous_verification_interval, 1000,
"disables continuous verification.");
DEFINE_int32(approximate_size_one_in, 64,
"If non-zero, DB::GetApproximateSizes() will be called against"
" random key ranges.");
"If non-zero, DB::GetApproximateSizes() and "
"DB::GetApproximateMemTableStats() will be called against "
"random key ranges.");
DEFINE_int32(read_fault_one_in, 1000,
"On non-zero, enables fault injection on read");


@@ -2427,22 +2427,31 @@ Status StressTest::TestApproximateSize(
std::string key1_str = Key(key1);
std::string key2_str = Key(key2);
Range range{Slice(key1_str), Slice(key2_str)};
SizeApproximationOptions sao;
sao.include_memtables = thread->rand.OneIn(2);
if (sao.include_memtables) {
sao.include_files = thread->rand.OneIn(2);
}
if (thread->rand.OneIn(2)) {
if (thread->rand.OneIn(2)) {
sao.files_size_error_margin = 0.0;
} else {
sao.files_size_error_margin =
static_cast<double>(thread->rand.Uniform(3));
if (thread->rand.OneIn(3)) {
// Call GetApproximateMemTableStats instead
uint64_t count, size;
db_->GetApproximateMemTableStats(column_families_[rand_column_families[0]],
range, &count, &size);
return Status::OK();
} else {
// Call GetApproximateSizes
SizeApproximationOptions sao;
sao.include_memtables = thread->rand.OneIn(2);
if (sao.include_memtables) {
sao.include_files = thread->rand.OneIn(2);
}
if (thread->rand.OneIn(2)) {
if (thread->rand.OneIn(2)) {
sao.files_size_error_margin = 0.0;
} else {
sao.files_size_error_margin =
static_cast<double>(thread->rand.Uniform(3));
}
}
uint64_t result;
return db_->GetApproximateSizes(
sao, column_families_[rand_column_families[0]], &range, 1, &result);
}
uint64_t result;
return db_->GetApproximateSizes(
sao, column_families_[rand_column_families[0]], &range, 1, &result);
}
Status StressTest::TestCheckpoint(ThreadState* thread,


@@ -141,8 +141,9 @@ class InlineSkipList {
// Returns true iff an entry that compares equal to key is in the list.
bool Contains(const char* key) const;
// Return estimated number of entries smaller than `key`.
uint64_t EstimateCount(const char* key) const;
// Return estimated number of entries from `start_ikey` to `end_ikey`.
uint64_t ApproximateNumEntries(const Slice& start_ikey,
const Slice& end_ikey) const;
// Validate correctness of the skip-list.
void TEST_Validate() const;
@@ -673,31 +674,88 @@ InlineSkipList<Comparator>::FindRandomEntry() const {
}
template <class Comparator>
uint64_t InlineSkipList<Comparator>::EstimateCount(const char* key) const {
uint64_t count = 0;
uint64_t InlineSkipList<Comparator>::ApproximateNumEntries(
const Slice& start_ikey, const Slice& end_ikey) const {
// The number of entries at a given level for the given range, in terms of
// the actual number of entries in that range (level 0), follows a binomial
// distribution, which is very well approximated by the Poisson distribution.
// That has stddev sqrt(x) where x is the expected number of entries (mean)
// at this level, and the best predictor of x is the number of observed
// entries (at this level). To predict the number of entries on level 0 we use
// x * kBranching ^ level. From the standard deviation, the P99+ relative
// error is roughly 3 * sqrt(x) / x. Thus, a reasonable approach would be to
// find the smallest level with at least some moderate constant number of entries
// in range. E.g. with at least ~40 entries, we expect P99+ relative error
// (approximation accuracy) of ~ 50% = 3 * sqrt(40) / 40; P95 error of
// ~30%; P75 error of < 20%.
//
// However, there are two issues with this approach, and an observation:
// * Pointer chasing on the larger (bottom) levels is much slower because of
// cache hierarchy effects, so when the result is smaller, getting the result
// will be substantially slower, despite traversing a similar number of
// entries. (We could be clever about pipelining our pointer chasing but
// that's complicated.)
// * The larger (bottom) levels also have lower variance because there's a
// chance (or certainty) that we reach level 0 and return the exact answer.
// * For applications in query planning, we can also tolerate more variance on
// small results because the impact of misestimating is likely smaller.
//
// These factors point us to an approach in which we have a higher minimum
// threshold number of samples for higher levels and lower for lower levels
// (see sufficient_samples below). This seems to yield roughly consistent
// relative error (stddev around 20%, less for large results) and roughly
// consistent query time around the time of two memtable point queries.
//
// Engineering observation: it is tempting to think that taking into account
// what we already found in how many entries occur on higher levels, not just
// the first iterated level with a sufficient number of samples, would yield
// a more accurate estimate. But that doesn't work because of the particular
// correlations and independences of the data: each level higher is just an
// independently probabilistic filtering of the level below it. That
// filtering from level l to l+1 has no more information about levels
// 0 .. l-1 than we can get from level l. The structure of RandomHeight() is
// a clue to these correlations and independences.
Node* x = head_;
int level = GetMaxHeight() - 1;
const DecodedKey key_decoded = compare_.decode_key(key);
while (true) {
assert(x == head_ || compare_(x->Key(), key_decoded) < 0);
Node* next = x->Next(level);
if (next != nullptr) {
PREFETCH(next->Next(level), 0, 1);
Node* lb = head_;
Node* ub = nullptr;
uint64_t count = 0;
for (int level = GetMaxHeight() - 1; level >= 0; level--) {
auto sufficient_samples = static_cast<uint64_t>(level) * kBranching_ + 10U;
if (count >= sufficient_samples) {
// No more counting; apply powers of kBranching and avoid floating point
count *= kBranching_;
continue;
}
if (next == nullptr || compare_(next->Key(), key_decoded) >= 0) {
if (level == 0) {
return count;
} else {
// Switch to next list
count *= kBranching_;
level--;
count = 0;
Node* next;
// Get a more precise lower bound (for start key)
for (;;) {
next = lb->Next(level);
if (next == ub) {
break;
}
assert(next != nullptr);
if (compare_(next->Key(), start_ikey) >= 0) {
break;
}
lb = next;
}
// Count entries on this level until upper bound (for end key)
for (;;) {
if (next == ub) {
break;
}
assert(next != nullptr);
if (compare_(next->Key(), end_ikey) >= 0) {
// Save refined upper bound to potentially save key comparison
ub = next;
break;
}
} else {
x = next;
count++;
next = next->Next(level);
}
}
return count;
}
template <class Comparator>


@@ -64,8 +64,9 @@ class SkipList {
// Returns true iff an entry that compares equal to key is in the list.
bool Contains(const Key& key) const;
// Return estimated number of entries smaller than `key`.
uint64_t EstimateCount(const Key& key) const;
// Return estimated number of entries from `start_ikey` to `end_ikey`.
uint64_t ApproximateNumEntries(const Slice& start_ikey,
const Slice& end_ikey) const;
// Iteration over the contents of a skip list
class Iterator {
@@ -383,27 +384,49 @@ typename SkipList<Key, Comparator>::Node* SkipList<Key, Comparator>::FindLast()
}
template <typename Key, class Comparator>
uint64_t SkipList<Key, Comparator>::EstimateCount(const Key& key) const {
uint64_t SkipList<Key, Comparator>::ApproximateNumEntries(
const Slice& start_ikey, const Slice& end_ikey) const {
// See InlineSkipList<Comparator>::ApproximateNumEntries() (copy-paste)
Node* lb = head_;
Node* ub = nullptr;
uint64_t count = 0;
Node* x = head_;
int level = GetMaxHeight() - 1;
while (true) {
assert(x == head_ || compare_(x->key, key) < 0);
Node* next = x->Next(level);
if (next == nullptr || compare_(next->key, key) >= 0) {
if (level == 0) {
return count;
} else {
// Switch to next list
count *= kBranching_;
level--;
for (int level = GetMaxHeight() - 1; level >= 0; level--) {
auto sufficient_samples = static_cast<uint64_t>(level) * kBranching_ + 10U;
if (count >= sufficient_samples) {
// No more counting; apply powers of kBranching and avoid floating point
count *= kBranching_;
continue;
}
count = 0;
Node* next;
// Get a more precise lower bound (for start key)
for (;;) {
next = lb->Next(level);
if (next == ub) {
break;
}
assert(next != nullptr);
if (compare_(next->key, start_ikey) >= 0) {
break;
}
lb = next;
}
// Count entries on this level until upper bound (for end key)
for (;;) {
if (next == ub) {
break;
}
assert(next != nullptr);
if (compare_(next->key, end_ikey) >= 0) {
// Save refined upper bound to potentially save key comparison
ub = next;
break;
}
} else {
x = next;
count++;
next = next->Next(level);
}
}
return count;
}
template <typename Key, class Comparator>


@@ -108,11 +108,7 @@ class SkipListRep : public MemTableRep {
uint64_t ApproximateNumEntries(const Slice& start_ikey,
const Slice& end_ikey) override {
std::string tmp;
uint64_t start_count =
skip_list_.EstimateCount(EncodeKey(&tmp, start_ikey));
uint64_t end_count = skip_list_.EstimateCount(EncodeKey(&tmp, end_ikey));
return (end_count >= start_count) ? (end_count - start_count) : 0;
return skip_list_.ApproximateNumEntries(start_ikey, end_ikey);
}
void UniqueRandomSample(const uint64_t num_entries,


@@ -153,10 +153,11 @@ DEFINE_string(
"randomtransaction,"
"randomreplacekeys,"
"timeseries,"
"getmergeoperands,",
"getmergeoperands,"
"readrandomoperands,"
"backup,"
"restore"
"restore,"
"approximatememtablestats",
"Comma-separated list of operations to run in the specified"
" order. Available benchmarks:\n"
@@ -243,9 +244,14 @@ DEFINE_string(
"operation includes a rare but possible retry in case it got "
"`Status::Incomplete()`. This happens upon encountering more keys than "
"have ever been seen by the thread (or eight initially)\n"
"\tbackup -- Create a backup of the current DB and verify that a new backup is corrected. "
"\tbackup -- Create a backup of the current DB and verify that a new "
"backup is corrected. "
"Rate limit can be specified through --backup_rate_limit\n"
"\trestore -- Restore the DB from the latest backup available, rate limit can be specified through --restore_rate_limit\n");
"\trestore -- Restore the DB from the latest backup available, rate limit "
"can be specified through --restore_rate_limit\n"
"\tapproximatememtablestats -- Tests accuracy of "
"GetApproximateMemTableStats, ideally\n"
"after fillrandom, where actual answer is batch_size");
DEFINE_int64(num, 1000000, "Number of key/values to place in database");
@@ -3621,6 +3627,8 @@ class Benchmark {
fprintf(stderr, "entries_per_batch = %" PRIi64 "\n",
entries_per_batch_);
method = &Benchmark::ApproximateSizeRandom;
} else if (name == "approximatememtablestats") {
method = &Benchmark::ApproximateMemtableStats;
} else if (name == "mixgraph") {
method = &Benchmark::MixGraph;
} else if (name == "readmissing") {
@@ -6298,6 +6306,35 @@ class Benchmark {
thread->stats.AddMessage(msg);
}
void ApproximateMemtableStats(ThreadState* thread) {
const size_t batch_size = entries_per_batch_;
std::unique_ptr<const char[]> skey_guard;
Slice skey = AllocateKey(&skey_guard);
std::unique_ptr<const char[]> ekey_guard;
Slice ekey = AllocateKey(&ekey_guard);
Duration duration(FLAGS_duration, reads_);
if (FLAGS_num < static_cast<int64_t>(batch_size)) {
std::terminate();
}
uint64_t range = static_cast<uint64_t>(FLAGS_num) - batch_size;
auto count_hist = std::make_shared<HistogramImpl>();
while (!duration.Done(1)) {
DB* db = SelectDB(thread);
uint64_t start_key = thread->rand.Uniform(range);
GenerateKeyFromInt(start_key, FLAGS_num, &skey);
uint64_t end_key = start_key + batch_size;
GenerateKeyFromInt(end_key, FLAGS_num, &ekey);
uint64_t count = UINT64_MAX;
uint64_t size = UINT64_MAX;
db->GetApproximateMemTableStats({skey, ekey}, &count, &size);
count_hist->Add(count);
thread->stats.FinishedOps(nullptr, db, 1, kOthers);
}
thread->stats.AddMessage("\nReported entry count stats (expected " +
std::to_string(batch_size) + "):");
thread->stats.AddMessage("\n" + count_hist->ToString());
}
// Calls ApproximateSize over random key ranges.
void ApproximateSizeRandom(ThreadState* thread) {
int64_t size_sum = 0;


@@ -0,0 +1 @@
* `GetApproximateMemTableStats()` could return disastrously bad estimates 5-25% of the time. The function has been re-engineered to return much better estimates with similar CPU cost.
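
For reference, the public API exercised by the new db_bench and db_stress coverage above is called roughly like this (a minimal sketch, not code from this change; the exact declarations live in `include/rocksdb/db.h`, and the path and key layout below are arbitrary examples):

```cpp
// Estimate the number of entries and bytes in the memtables for a key range.
#include <cstdint>
#include <iostream>
#include <string>

#include "rocksdb/db.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/approx_memtable_demo", &db);
  if (!s.ok()) {
    std::cerr << s.ToString() << "\n";
    return 1;
  }
  // Fixed-width keys so lexicographic order matches numeric order.
  for (int i = 0; i < 1000; ++i) {
    s = db->Put(rocksdb::WriteOptions(), "key" + std::to_string(100000 + i),
                "value");
    if (!s.ok()) {
      std::cerr << s.ToString() << "\n";
      return 1;
    }
  }
  uint64_t count = 0;
  uint64_t size = 0;
  // About 100 of the keys written above fall in this range; nothing has been
  // flushed, so the estimate covers only the memtable.
  rocksdb::Range range("key100100", "key100200");
  db->GetApproximateMemTableStats(range, &count, &size);
  std::cout << "approx count=" << count << " approx size=" << size << "\n";
  delete db;
  return 0;
}
```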