Optimize, simplify filter block building (fix regression) (#12931)

Summary:
This is in part a refactoring / simplification to set up for "decoupled" partitioned filters, and in part a fix for a performance regression knowingly introduced by the correctness fix in https://github.com/facebook/rocksdb/issues/12872. Basically, we are taking some complexity out of the filter block builders and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient because it operates on hashes rather than copied keys.

Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that was essentially redundant. Now we always track just the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away, since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capability, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward, to avoid a small CPU regression in some cases.)
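The hash-based de-duplication idea can be sketched as follows. This is a minimal illustrative sketch, not the actual RocksDB class: `HashOf` is a hypothetical stand-in for `GetSliceHash64`, and `entries` stands in for the builder's hash-entry vector. Only the de-dup logic mirrors the diff below.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for RocksDB's GetSliceHash64; any 64-bit hash
// works for illustration.
static uint64_t HashOf(const std::string& s) {
  return std::hash<std::string>{}(s);
}

// Sketch of AddKeyAndAlt-style de-duplication: instead of caching copied
// keys and prefixes, cache only the previous alt (prefix) hash; the back of
// the entries vector already serves as the previous key hash.
struct DedupSketch {
  std::vector<uint64_t> entries;
  std::optional<uint64_t> prev_alt_hash_;

  void AddKeyAndAlt(const std::string& key, const std::string& alt) {
    uint64_t key_hash = HashOf(key);
    uint64_t alt_hash = HashOf(alt);
    std::optional<uint64_t> prev_key_hash;
    std::optional<uint64_t> prev_alt_hash = prev_alt_hash_;
    if (!entries.empty()) {
      prev_key_hash = entries.back();
    }
    // Add alt first, so that entries.back() keeps tracking the previous key.
    if (alt_hash != prev_alt_hash && alt_hash != key_hash &&
        alt_hash != prev_key_hash) {
      entries.push_back(alt_hash);
    }
    prev_alt_hash_ = alt_hash;
    // A key might equal its own prefix, or equal the previous alt (e.g. with
    // a reverse byte-wise comparator), hence the prev_alt_hash check.
    if (key_hash != prev_key_hash && key_hash != prev_alt_hash) {
      entries.push_back(key_hash);
    }
  }
};
```

The member `prev_alt_hash_` is copied to a local before being overwritten, so that the key check still compares against the *previous* call's alt, matching the ordering in the real `XXPH3FilterBitsBuilder::AddKeyAndAlt`.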

Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now.

The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock addresses what would otherwise have been a slight performance regression in some cases. The split allows for more instruction-level parallelism by reducing unnecessary control dependencies.
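The control-flow shape of that split can be sketched roughly as below. Everything here is illustrative only: the struct, fields, and bodies are hypothetical simplifications, and only the ordering (decide once up front, then branch) reflects the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative-only sketch: computing the cut decision once, before any
// branching on prefix/whole-key handling, removes a control dependency from
// the hot Add() path, leaving more room for instruction-level parallelism.
struct PartitionSketch {
  size_t entries_in_current = 0;
  size_t entries_per_partition = 4;  // hypothetical fixed budget
  std::vector<size_t> finished_partitions;

  bool DecideCutAFilterBlock() const {
    // >= rather than ==, because the count can grow by more than one per Add.
    return entries_in_current >= entries_per_partition;
  }
  void CutAFilterBlock() {
    finished_partitions.push_back(entries_in_current);
    entries_in_current = 0;
  }
  void Add(bool whole_key, bool has_prefix) {
    bool cut = DecideCutAFilterBlock();  // decided before any branching
    if (cut) {
      CutAFilterBlock();
    }
    // Whole key plus prefix contributes up to two filter entries.
    entries_in_current += (whole_key && has_prefix) ? 2 : 1;
  }
};
```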

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12931

Test Plan:
existing tests (with some minor updates)

Also manually ported over the pre-broken regression test described in https://github.com/facebook/rocksdb/issues/12870 and ran it (passed).

Performance:
Here we validate that an entire series of recent related PRs is a net improvement in aggregate. "Before" is with these PRs reverted: https://github.com/facebook/rocksdb/issues/12872 #12911 https://github.com/facebook/rocksdb/issues/12874 #12867 https://github.com/facebook/rocksdb/issues/12903 #12904. "After" includes this PR (and all of those, with base revision 16c21af). Simultaneous test script designed to depend maximally on SST construction efficiency:

```
for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```

Showing average ops/sec of "after" vs. "before"

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```

As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key.

Reviewed By: jowlyzhang

Differential Revision: D61138271

Pulled By: pdillinger

fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043
Commit f63428bcc7 (parent d458331ee9), authored by Peter Dillinger on 2024-08-14 15:13:16 -07:00 and committed by Facebook GitHub Bot.
10 changed files with 266 additions and 168 deletions


@@ -692,15 +692,20 @@ namespace {
class AlwaysTrueBitsBuilder : public FilterBitsBuilder {
public:
void AddKey(const Slice&) override {}
size_t EstimateEntriesAdded() override { return 0U; }
void AddKey(const Slice&) override { ++count_; }
void AddKeyAndAlt(const Slice&, const Slice&) override { count_ += 2; }
size_t EstimateEntriesAdded() override { return count_; }
Slice Finish(std::unique_ptr<const char[]>* /* buf */) override {
count_ = 0;
// Interpreted as "always true" filter (0 probes over 1 byte of
// payload, 5 bytes metadata)
return Slice("\0\0\0\0\0\0", 6);
}
using FilterBitsBuilder::Finish;
size_t ApproximateNumEntries(size_t) override { return SIZE_MAX; }
private:
size_t count_ = 0;
};
class AlwaysTrueFilterPolicy : public ReadOnlyBuiltinFilterPolicy {
@@ -827,6 +832,43 @@ TEST_P(DBBloomFilterTestWithFormatParams, SkipFilterOnEssentiallyZeroBpk) {
EXPECT_EQ(props["filter_size"], "0");
}
TEST_P(DBBloomFilterTestWithFormatParams, FilterBitsBuilderDedup) {
BlockBasedTableOptions table_options;
SetInTableOptions(&table_options);
FilterBuildingContext context{table_options};
std::unique_ptr<FilterBitsBuilder> builder{
table_options.filter_policy->GetBuilderWithContext(context)};
ASSERT_EQ(builder->EstimateEntriesAdded(), 0U);
// Check for sufficient de-duplication between regular keys and alt keys
// (prefixes), keeping in mind that the key might equal its prefix.
builder->AddKey("abc");
ASSERT_EQ(builder->EstimateEntriesAdded(), 1U);
builder->AddKeyAndAlt("abc1", "abc");
ASSERT_EQ(builder->EstimateEntriesAdded(), 2U);
builder->AddKeyAndAlt("bcd", "bcd");
ASSERT_EQ(builder->EstimateEntriesAdded(), 3U);
builder->AddKeyAndAlt("cde-1", "cde");
ASSERT_EQ(builder->EstimateEntriesAdded(), 5U);
builder->AddKeyAndAlt("cde", "cde");
ASSERT_EQ(builder->EstimateEntriesAdded(), 5U);
builder->AddKeyAndAlt("cde1", "cde");
ASSERT_EQ(builder->EstimateEntriesAdded(), 6U);
builder->AddKeyAndAlt("def-1", "def");
ASSERT_EQ(builder->EstimateEntriesAdded(), 8U);
builder->AddKeyAndAlt("def", "def");
ASSERT_EQ(builder->EstimateEntriesAdded(), 8U);
builder->AddKey("def$$"); // Like not in extractor domain
ASSERT_EQ(builder->EstimateEntriesAdded(), 9U);
builder->AddKey("def$$");
ASSERT_EQ(builder->EstimateEntriesAdded(), 9U);
builder->AddKeyAndAlt("efg42", "efg");
ASSERT_EQ(builder->EstimateEntriesAdded(), 11U);
builder->AddKeyAndAlt("efg", "efg"); // Like extra "alt" on a partition
ASSERT_EQ(builder->EstimateEntriesAdded(), 11U);
}
#if !defined(ROCKSDB_VALGRIND_RUN) || defined(ROCKSDB_FULL_VALGRIND_RUN)
INSTANTIATE_TEST_CASE_P(
FormatDef, DBBloomFilterTestDefFormatVersion,


@@ -47,7 +47,8 @@ class FilterBitsReader;
// structs because this is expected to be a temporary, stack-allocated object.
struct FilterBuildingContext {
// This constructor is for internal use only and subject to change.
FilterBuildingContext(const BlockBasedTableOptions& table_options);
// Keeps a reference to table_options.
explicit FilterBuildingContext(const BlockBasedTableOptions& table_options);
// Options for the table being built
const BlockBasedTableOptions& table_options;


@@ -82,21 +82,32 @@ class XXPH3FilterBitsBuilder : public BuiltinFilterBitsBuilder {
// requirements.
if (hash_entries_info_.entries.empty() ||
hash != hash_entries_info_.entries.back()) {
if (detect_filter_construct_corruption_) {
hash_entries_info_.xor_checksum ^= hash;
}
hash_entries_info_.entries.push_back(hash);
if (cache_res_mgr_ &&
// Traditional rounding to whole bucket size
((hash_entries_info_.entries.size() %
kUint64tHashEntryCacheResBucketSize) ==
kUint64tHashEntryCacheResBucketSize / 2)) {
hash_entries_info_.cache_res_bucket_handles.emplace_back(nullptr);
Status s = cache_res_mgr_->MakeCacheReservation(
kUint64tHashEntryCacheResBucketSize * sizeof(hash),
&hash_entries_info_.cache_res_bucket_handles.back());
s.PermitUncheckedError();
}
AddHash(hash);
}
}
void AddKeyAndAlt(const Slice& key, const Slice& alt) override {
uint64_t key_hash = GetSliceHash64(key);
uint64_t alt_hash = GetSliceHash64(alt);
std::optional<uint64_t> prev_key_hash;
std::optional<uint64_t> prev_alt_hash = hash_entries_info_.prev_alt_hash;
if (!hash_entries_info_.entries.empty()) {
prev_key_hash = hash_entries_info_.entries.back();
}
// Add alt first, so that entries.back() always contains previous key
// ASSUMING a change from one alt to the next implies a change to
// corresponding key
if (alt_hash != prev_alt_hash && alt_hash != key_hash &&
alt_hash != prev_key_hash) {
AddHash(alt_hash);
}
// Overwrite prev_alt_hash for cases like alt_hash == key_hash
hash_entries_info_.prev_alt_hash = alt_hash;
// NOTE: checking key_hash != prev_alt_hash for cases like
// key == prefix(key) at the end of a prefix grouping as in reverse
// byte-wise comparator
if (key_hash != prev_key_hash && key_hash != prev_alt_hash) {
AddHash(key_hash);
}
}
@@ -116,6 +127,24 @@ class XXPH3FilterBitsBuilder : public BuiltinFilterBitsBuilder {
CacheEntryRole::kFilterConstruction>::GetDummyEntrySize() /
sizeof(uint64_t);
void AddHash(uint64_t hash) {
if (detect_filter_construct_corruption_) {
hash_entries_info_.xor_checksum ^= hash;
}
hash_entries_info_.entries.push_back(hash);
if (cache_res_mgr_ &&
// Traditional rounding to whole bucket size
((hash_entries_info_.entries.size() %
kUint64tHashEntryCacheResBucketSize) ==
kUint64tHashEntryCacheResBucketSize / 2)) {
hash_entries_info_.cache_res_bucket_handles.emplace_back(nullptr);
Status s = cache_res_mgr_->MakeCacheReservation(
kUint64tHashEntryCacheResBucketSize * sizeof(hash),
&hash_entries_info_.cache_res_bucket_handles.back());
s.PermitUncheckedError();
}
}
// For delegating between XXPH3FilterBitsBuilders
void SwapEntriesWith(XXPH3FilterBitsBuilder* other) {
assert(other != nullptr);
@@ -282,17 +311,22 @@ class XXPH3FilterBitsBuilder : public BuiltinFilterBitsBuilder {
// Otherwise, it is 0.
uint64_t xor_checksum = 0;
// A single-element cache to help AddKeyAndAlt
std::optional<uint64_t> prev_alt_hash;
void Swap(HashEntriesInfo* other) {
assert(other != nullptr);
std::swap(entries, other->entries);
std::swap(cache_res_bucket_handles, other->cache_res_bucket_handles);
std::swap(xor_checksum, other->xor_checksum);
std::swap(prev_alt_hash, other->prev_alt_hash);
}
void Reset() {
entries.clear();
cache_res_bucket_handles.clear();
xor_checksum = 0;
prev_alt_hash = {};
}
};
@@ -331,6 +365,14 @@ class FastLocalBloomBitsBuilder : public XXPH3FilterBitsBuilder {
Slice Finish(std::unique_ptr<const char[]>* buf, Status* status) override {
size_t num_entries = hash_entries_info_.entries.size();
if (num_entries == 0) {
// This case migrated from FullFilterBlockBuilder::Finish
if (status) {
*status = Status::OK();
}
return FinishAlwaysFalse(buf);
}
size_t len_with_metadata = CalculateSpace(num_entries);
std::unique_ptr<char[]> mutable_buf;
@@ -1023,6 +1065,7 @@ class LegacyBloomBitsBuilder : public BuiltinFilterBitsBuilder {
~LegacyBloomBitsBuilder() override;
void AddKey(const Slice& key) override;
void AddKeyAndAlt(const Slice& key, const Slice& alt) override;
size_t EstimateEntriesAdded() override { return hash_entries_.size(); }
@@ -1050,6 +1093,9 @@ class LegacyBloomBitsBuilder : public BuiltinFilterBitsBuilder {
int bits_per_key_;
int num_probes_;
std::vector<uint32_t> hash_entries_;
// A single-element cache to help AddKeyAndAlt. (-1 == empty)
int64_t prev_alt_hash_ = -1;
Logger* info_log_;
// Get total bits, optimized for CPU cache line
@@ -1079,14 +1125,39 @@ LegacyBloomBitsBuilder::~LegacyBloomBitsBuilder() = default;
void LegacyBloomBitsBuilder::AddKey(const Slice& key) {
uint32_t hash = BloomHash(key);
if (hash_entries_.size() == 0 || hash != hash_entries_.back()) {
if (hash_entries_.empty() || hash_entries_.back() != hash) {
hash_entries_.push_back(hash);
}
}
void LegacyBloomBitsBuilder::AddKeyAndAlt(const Slice& key, const Slice& alt) {
// Modified from XXPH3FilterBitsBuilder::AddKeyAndAlt
uint32_t key_hash = BloomHash(key);
uint32_t alt_hash = BloomHash(alt);
int64_t prev_key_hash = -1;
int64_t prev_alt_hash = prev_alt_hash_;
if (!hash_entries_.empty()) {
prev_key_hash = hash_entries_.back();
}
if (alt_hash != prev_alt_hash && alt_hash != key_hash &&
alt_hash != prev_key_hash) {
hash_entries_.push_back(alt_hash);
}
prev_alt_hash_ = alt_hash;
if (key_hash != prev_key_hash && key_hash != prev_alt_hash) {
hash_entries_.push_back(key_hash);
}
}
Slice LegacyBloomBitsBuilder::Finish(std::unique_ptr<const char[]>* buf) {
uint32_t total_bits, num_lines;
size_t num_entries = hash_entries_.size();
if (num_entries == 0) {
// This case migrated from FullFilterBlockBuilder::Finish
return FinishAlwaysFalse(buf);
}
uint32_t total_bits, num_lines;
char* data =
ReserveSpace(static_cast<int>(num_entries), &total_bits, &num_lines);
assert(data);
@@ -1127,6 +1198,7 @@ Slice LegacyBloomBitsBuilder::Finish(std::unique_ptr<const char[]>* buf) {
const char* const_data = data;
buf->reset(const_data);
hash_entries_.clear();
prev_alt_hash_ = -1;
return Slice(data, total_bits / 8 + kMetadataLen);
}

View File

@@ -31,10 +31,22 @@ class FilterBitsBuilder {
// added.
virtual void AddKey(const Slice& key) = 0;
// Add two entries to the filter, typically a key and, as the alternate,
// its prefix. This differs from AddKey(key); AddKey(alt); in that there
// is extra state for de-duplicating successive `alt` entries, as well
// as successive `key` entries. And there is de-duplication between `key`
// and `alt` entries, even in adjacent calls, because a whole key might
// be its own prefix. More specifically,
// AddKey(k1);
// AddKeyAndAlt(k2, a2); // de-dup k2<>k1, k2<>a2, a2<>k1
// AddKeyAndAlt(k3, a3); // de-dup k3<>k2, a3<>a2, k3<>a2, a3<>k2
// AddKey(k4); // de-dup k4<>k3 BUT NOT k4<>a3
virtual void AddKeyAndAlt(const Slice& key, const Slice& alt) = 0;
// Called by RocksDB before Finish to populate
// TableProperties::num_filter_entries, so should represent the
// number of unique keys (and/or prefixes) added, but does not have
// to be exact. `return 0;` may be used to conspicuously indicate "unknown".
// number of unique keys (and/or prefixes) added. MUST return 0
// if and only if none have been added, but otherwise can be estimated.
virtual size_t EstimateEntriesAdded() = 0;
// Generate the filter using the keys that are added

View File

@@ -20,13 +20,8 @@ namespace ROCKSDB_NAMESPACE {
FullFilterBlockBuilder::FullFilterBlockBuilder(
const SliceTransform* _prefix_extractor, bool whole_key_filtering,
FilterBitsBuilder* filter_bits_builder)
: need_last_prefix_(whole_key_filtering && _prefix_extractor != nullptr),
prefix_extractor_(_prefix_extractor),
whole_key_filtering_(whole_key_filtering),
last_whole_key_recorded_(false),
last_prefix_recorded_(false),
last_key_in_domain_(false),
any_added_(false) {
: prefix_extractor_(_prefix_extractor),
whole_key_filtering_(whole_key_filtering) {
assert(filter_bits_builder != nullptr);
filter_bits_builder_.reset(filter_bits_builder);
}
@@ -36,94 +31,24 @@ size_t FullFilterBlockBuilder::EstimateEntriesAdded() {
}
void FullFilterBlockBuilder::Add(const Slice& key_without_ts) {
const bool add_prefix =
prefix_extractor_ && prefix_extractor_->InDomain(key_without_ts);
if (need_last_prefix_ && !last_prefix_recorded_ && last_key_in_domain_) {
// We can reach here when a new filter partition starts in partitioned
// filter. The last prefix in the previous partition should be added if
// necessary regardless of key_without_ts, to support prefix SeekForPrev.
AddKey(last_prefix_str_);
last_prefix_recorded_ = true;
}
if (whole_key_filtering_) {
if (!add_prefix) {
AddKey(key_without_ts);
if (prefix_extractor_ && prefix_extractor_->InDomain(key_without_ts)) {
Slice prefix = prefix_extractor_->Transform(key_without_ts);
if (whole_key_filtering_) {
filter_bits_builder_->AddKeyAndAlt(key_without_ts, prefix);
} else {
// if both whole_key and prefix are added to bloom then we will have whole
// key_without_ts and prefix addition being interleaved and thus cannot
// rely on the bits builder to properly detect the duplicates by comparing
// with the last item.
Slice last_whole_key = Slice(last_whole_key_str_);
if (!last_whole_key_recorded_ ||
last_whole_key.compare(key_without_ts) != 0) {
AddKey(key_without_ts);
last_whole_key_recorded_ = true;
last_whole_key_str_.assign(key_without_ts.data(),
key_without_ts.size());
}
filter_bits_builder_->AddKey(prefix);
}
} else if (whole_key_filtering_) {
filter_bits_builder_->AddKey(key_without_ts);
}
if (add_prefix) {
last_key_in_domain_ = true;
AddPrefix(key_without_ts);
} else {
last_key_in_domain_ = false;
}
}
// Add key to filter if needed
inline void FullFilterBlockBuilder::AddKey(const Slice& key) {
filter_bits_builder_->AddKey(key);
any_added_ = true;
}
// Add prefix to filter if needed
void FullFilterBlockBuilder::AddPrefix(const Slice& key) {
assert(prefix_extractor_ && prefix_extractor_->InDomain(key));
Slice prefix = prefix_extractor_->Transform(key);
if (need_last_prefix_) {
// WART/FIXME: Because last_prefix_str_ is needed above to make
// SeekForPrev work with partitioned + prefix filters, we are currently
// use this inefficient code in that case (in addition to prefix+whole
// key). Hopefully this can be optimized with some refactoring up the call
// chain to BlockBasedTableBuilder. Even in PartitionedFilterBlockBuilder,
// we don't currently have access to the previous key/prefix by the time we
// know we are starting a new partition.
// if both whole_key and prefix are added to bloom then we will have whole
// key and prefix addition being interleaved and thus cannot rely on the
// bits builder to properly detect the duplicates by comparing with the last
// item.
Slice last_prefix = Slice(last_prefix_str_);
if (!last_prefix_recorded_ || last_prefix.compare(prefix) != 0) {
AddKey(prefix);
last_prefix_recorded_ = true;
last_prefix_str_.assign(prefix.data(), prefix.size());
}
} else {
AddKey(prefix);
}
}
void FullFilterBlockBuilder::Reset() {
last_whole_key_recorded_ = false;
last_prefix_recorded_ = false;
}
Status FullFilterBlockBuilder::Finish(
const BlockHandle& /*last_partition_block_handle*/, Slice* filter,
std::unique_ptr<const char[]>* filter_owner) {
Reset();
Status s = Status::OK();
if (any_added_) {
any_added_ = false;
*filter = filter_bits_builder_->Finish(
filter_owner ? filter_owner : &filter_data_, &s);
} else {
*filter = Slice{};
}
*filter = filter_bits_builder_->Finish(
filter_owner ? filter_owner : &filter_data_, &s);
return s;
}

View File

@@ -50,7 +50,9 @@ class FullFilterBlockBuilder : public FilterBlockBuilder {
~FullFilterBlockBuilder() {}
void Add(const Slice& key_without_ts) override;
bool IsEmpty() const override { return !any_added_; }
bool IsEmpty() const override {
return filter_bits_builder_->EstimateEntriesAdded() == 0;
}
size_t EstimateEntriesAdded() override;
Status Finish(const BlockHandle& last_partition_block_handle, Slice* filter,
std::unique_ptr<const char[]>* filter_owner = nullptr) override;
@@ -63,30 +65,17 @@ class FullFilterBlockBuilder : public FilterBlockBuilder {
}
protected:
virtual void AddKey(const Slice& key);
const SliceTransform* prefix_extractor() const { return prefix_extractor_; }
bool whole_key_filtering() const { return whole_key_filtering_; }
std::unique_ptr<FilterBitsBuilder> filter_bits_builder_;
virtual void Reset();
void AddPrefix(const Slice& key);
const SliceTransform* prefix_extractor() { return prefix_extractor_; }
const std::string& last_prefix_str() const { return last_prefix_str_; }
bool need_last_prefix_;
private:
// important: all of these might point to invalid addresses
// at the time of destruction of this filter block. destructor
// should NOT dereference them.
const SliceTransform* prefix_extractor_;
bool whole_key_filtering_;
bool last_whole_key_recorded_;
std::string last_whole_key_str_;
bool last_prefix_recorded_;
std::string last_prefix_str_;
// Whether prefix_extractor_->InDomain(last_whole_key_) is true.
// Used in partitioned filters so that the last prefix from the previous
// filter partition will be added to the current partition if
// last_key_in_domain_ is true, regardless of the current key.
bool last_key_in_domain_;
bool any_added_;
const SliceTransform* const prefix_extractor_;
const bool whole_key_filtering_;
std::unique_ptr<const char[]> filter_data_;
};

View File

@@ -29,6 +29,10 @@ class TestFilterBitsBuilder : public FilterBitsBuilder {
void AddKey(const Slice& key) override {
hash_entries_.push_back(Hash(key.data(), key.size(), 1));
}
void AddKeyAndAlt(const Slice& key, const Slice& alt) override {
AddKey(key);
AddKey(alt);
}
using FilterBitsBuilder::Finish;
@@ -203,6 +207,11 @@ class CountUniqueFilterBitsBuilderWrapper : public FilterBitsBuilder {
b_->AddKey(key);
uniq_.insert(key.ToString());
}
void AddKeyAndAlt(const Slice& key, const Slice& alt) override {
b_->AddKeyAndAlt(key, alt);
uniq_.insert(key.ToString());
uniq_.insert(alt.ToString());
}
using FilterBitsBuilder::Finish;

View File

@@ -43,8 +43,6 @@ PartitionedFilterBlockBuilder::PartitionedFilterBlockBuilder(
BlockBasedTableOptions::kDataBlockBinarySearch /* index_type */,
0.75 /* data_block_hash_table_util_ratio */, ts_sz,
persist_user_defined_timestamps, true /* is_user_key */) {
// See FullFilterBlockBuilder::AddPrefix
need_last_prefix_ = prefix_extractor() != nullptr;
// Compute keys_per_partition_
keys_per_partition_ = static_cast<uint32_t>(
filter_bits_builder_->ApproximateNumEntries(partition_size));
@@ -74,31 +72,41 @@ PartitionedFilterBlockBuilder::~PartitionedFilterBlockBuilder() {
partitioned_filters_construction_status_.PermitUncheckedError();
}
void PartitionedFilterBlockBuilder::MaybeCutAFilterBlock(
const Slice* next_key) {
// (NOTE: Can't just use ==, because keys_added_to_partition_ might be
// incremented by more than one.)
if (keys_added_to_partition_ >= keys_per_partition_) {
bool PartitionedFilterBlockBuilder::DecideCutAFilterBlock() {
// NOTE: Can't just use ==, because estimated might be incremented by more
// than one. +1 is for adding next_prefix below.
if (filter_bits_builder_->EstimateEntriesAdded() + 1 >= keys_per_partition_) {
// Currently only index builder is in charge of cutting a partition. We keep
// requesting until it is granted.
p_index_builder_->RequestPartitionCut();
}
if (!p_index_builder_->ShouldCutFilterBlock()) {
return;
}
return p_index_builder_->ShouldCutFilterBlock();
}
// Add the prefix of the next key before finishing the partition without
// updating last_prefix_str_. This hack fixes a bug with format_version=3
// where seeking for the prefix would lead us to the previous partition.
const bool maybe_add_prefix =
next_key && prefix_extractor() && prefix_extractor()->InDomain(*next_key);
if (maybe_add_prefix) {
const Slice next_key_prefix = prefix_extractor()->Transform(*next_key);
if (next_key_prefix.compare(last_prefix_str()) != 0) {
AddKey(next_key_prefix);
void PartitionedFilterBlockBuilder::CutAFilterBlock(const Slice* next_key,
const Slice* next_prefix) {
// When there is a next partition, add the prefix of the first key in the
// next partition before closing this one out. This is needed to support
// prefix Seek, because there could exist a key k where
// * last_key < k < next_key
// * prefix(last_key) != prefix(k)
// * prefix(k) == prefix(next_key)
// * seeking to k lands in this partition, not the next
// in which case the iterator needs to find next_key despite starting in
// the partition before it. (This fixes a bug in the original implementation
// of format_version=3.)
if (next_prefix) {
if (whole_key_filtering()) {
// NOTE: At the end of building filter bits, we need a special case for
// treating prefix as an "alt" entry. See AddKeyAndAlt() comment. This is
// a reasonable hack for that.
filter_bits_builder_->AddKeyAndAlt(*next_prefix, *next_prefix);
} else {
filter_bits_builder_->AddKey(*next_prefix);
}
}
// Cut the partition
total_added_in_built_ += filter_bits_builder_->EstimateEntriesAdded();
std::unique_ptr<const char[]> filter_data;
Status filter_construction_status = Status::OK();
@@ -111,18 +119,43 @@ void PartitionedFilterBlockBuilder::MaybeCutAFilterBlock(
{p_index_builder_->GetPartitionKey(), std::move(filter_data), filter});
partitioned_filters_construction_status_.UpdateIfOk(
filter_construction_status);
keys_added_to_partition_ = 0;
Reset();
// If we are building another filter partition, the last prefix in the
// previous partition should be added to support prefix SeekForPrev.
// (Analogous to above fix for prefix Seek.)
if (next_key && last_key_in_domain_) {
// NOTE: At the beginning of building filter bits, we don't need a special
// case for treating prefix as an "alt" entry.
// See DBBloomFilterTest.FilterBitsBuilderDedup
filter_bits_builder_->AddKey(last_prefix_str_);
}
}
void PartitionedFilterBlockBuilder::Add(const Slice& key_without_ts) {
MaybeCutAFilterBlock(&key_without_ts);
FullFilterBlockBuilder::Add(key_without_ts);
}
void PartitionedFilterBlockBuilder::AddKey(const Slice& key) {
FullFilterBlockBuilder::AddKey(key);
keys_added_to_partition_++;
// When filter partitioning is coupled to index partitioning, we need to
// check for cutting a block even if we aren't adding anything this time.
bool cut = DecideCutAFilterBlock();
if (prefix_extractor() && prefix_extractor()->InDomain(key_without_ts)) {
Slice prefix = prefix_extractor()->Transform(key_without_ts);
if (cut) {
CutAFilterBlock(&key_without_ts, &prefix);
}
if (whole_key_filtering()) {
filter_bits_builder_->AddKeyAndAlt(key_without_ts, prefix);
} else {
filter_bits_builder_->AddKey(prefix);
}
last_key_in_domain_ = true;
last_prefix_str_.assign(prefix.data(), prefix.size());
} else {
if (cut) {
CutAFilterBlock(&key_without_ts, nullptr);
}
if (whole_key_filtering()) {
filter_bits_builder_->AddKey(key_without_ts);
}
last_key_in_domain_ = false;
}
}
size_t PartitionedFilterBlockBuilder::EstimateEntriesAdded() {
@@ -158,7 +191,10 @@ Status PartitionedFilterBlockBuilder::Finish(
filters_.pop_front();
} else {
assert(last_partition_block_handle == BlockHandle{});
MaybeCutAFilterBlock(nullptr);
if (filter_bits_builder_->EstimateEntriesAdded() > 0) {
CutAFilterBlock(nullptr, nullptr);
}
assert(filter_bits_builder_->EstimateEntriesAdded() == 0);
}
Status s = partitioned_filters_construction_status_;

View File

@@ -36,8 +36,11 @@ class PartitionedFilterBlockBuilder : public FullFilterBlockBuilder {
virtual ~PartitionedFilterBlockBuilder();
void AddKey(const Slice& key) override;
void Add(const Slice& key_without_ts) override;
bool IsEmpty() const override {
return filter_bits_builder_->EstimateEntriesAdded() == 0 &&
filters_.empty();
}
size_t EstimateEntriesAdded() override;
Status Finish(const BlockHandle& last_partition_block_handle, Slice* filter,
@@ -60,8 +63,9 @@ class PartitionedFilterBlockBuilder : public FullFilterBlockBuilder {
}
private: // fns
// The policy of when cut a filter block and Finish it
void MaybeCutAFilterBlock(const Slice* next_key);
// Whether to cut a filter block before the next key
bool DecideCutAFilterBlock();
void CutAFilterBlock(const Slice* next_key, const Slice* next_prefix);
private: // data
// Currently we keep the same number of partitions for filters and indexes.
@@ -80,12 +84,15 @@ class PartitionedFilterBlockBuilder : public FullFilterBlockBuilder {
// used in building the index
// The desired number of keys per partition
uint32_t keys_per_partition_;
// The number of keys added to the last partition so far
uint32_t keys_added_to_partition_ = 0;
// According to the bits builders, how many keys/prefixes added
// in all the filters we have fully built
uint64_t total_added_in_built_ = 0;
// Tracking state about previous prefix, to solve issue with prefix Seeks
// at partition boundaries.
bool last_key_in_domain_ = false;
std::string last_prefix_str_;
// Set to the first non-okay status if any of the filter
// partitions experiences construction error.
// If partitioned_filters_construction_status_ is non-okay,

View File

@@ -164,7 +164,8 @@ class PartitionedFilterBlockTest
}
PartitionedFilterBlockReader* NewReader(
PartitionedFilterBlockBuilder* builder, PartitionedIndexBuilder* pib) {
PartitionedFilterBlockBuilder* builder, PartitionedIndexBuilder* pib,
bool expect_empty = false) {
BlockHandle bh;
Status status;
Slice slice;
@@ -172,6 +173,11 @@ class PartitionedFilterBlockTest
do {
status = builder->Finish(bh, &slice, &filter_data);
bh = Write(slice);
if (expect_empty) {
// Ensure most efficient "empty" filter is used
EXPECT_OK(status);
EXPECT_EQ(0, slice.size());
}
} while (status.IsIncomplete());
constexpr bool skip_filters = false;
@@ -196,7 +202,7 @@ class PartitionedFilterBlockTest
void VerifyReader(PartitionedFilterBlockBuilder* builder,
PartitionedIndexBuilder* pib, bool empty = false) {
std::unique_ptr<PartitionedFilterBlockReader> reader(
NewReader(builder, pib));
NewReader(builder, pib, empty));
// Querying added keys
std::vector<std::string> keys = PrepareKeys(keys_without_ts, kKeyNum);
for (const auto& key : keys) {
@@ -312,10 +318,9 @@ class PartitionedFilterBlockTest
void CutABlock(PartitionedIndexBuilder* builder, const std::string& user_key,
const std::string& next_user_key) {
// Assuming a block is cut, add an entry to the index
std::string key =
std::string(*InternalKey(user_key, 0, ValueType::kTypeValue).rep());
std::string next_key = std::string(
*InternalKey(next_user_key, 0, ValueType::kTypeValue).rep());
std::string key = *InternalKey(user_key, 0, ValueType::kTypeValue).rep();
std::string next_key =
*InternalKey(next_user_key, 0, ValueType::kTypeValue).rep();
BlockHandle dont_care_block_handle(1, 1);
Slice slice = Slice(next_key.data(), next_key.size());
std::string scratch;
@@ -339,7 +344,7 @@ class PartitionedFilterBlockTest
INSTANTIATE_TEST_CASE_P(
FormatVersions, PartitionedFilterBlockTest,
testing::Combine(testing::ValuesIn(std::set<uint32_t>{
2, 3, 4, test::kDefaultFormatVersion,
2, 3, 4, 5, test::kDefaultFormatVersion,
kLatestFormatVersion}),
testing::ValuesIn(test::GetUDTTestModes())));