Fix and generalize framework for filtering range queries, etc. (#13005)

Summary:
There was a subtle design/contract bug in the previous version of range filtering in experimental.h. If someone implemented a key segments extractor with "all or nothing" fixed-size segments, the result could be unsafe range filtering. For example, with two segments of width 3:
```
x = 0x|12 34 56|78 9A 00|
y = 0x|12 34 56||78 9B
z = 0x|12 34 56|78 9C 00|
```
With all-or-nothing extraction, the short key y gets an empty segment 1 (shown as `||`), which is out of order with segment 1 of x and z.

I have re-worked the contract to make it clear what does work, and implemented a standard extractor for fixed-size segments, CappedKeySegmentsExtractor. The safe approach for filtering is to consume as much as is available for a segment in the case of a short key.

I have also added support for min-max filtering with the reverse byte-wise comparator, which is probably the 2nd most common comparator for RocksDB users (because of MySQL). It might seem that a min-max filter doesn't care about forward vs. reverse ordering, but it does when trying to determine whether an input range from segment value v1 to v2, where v2 happens to be byte-wise less than v1, is an empty forward interval or a non-empty reverse interval. At least in the current setup, we don't have that context.

A new unit test (with some refactoring) tests CappedKeySegmentsExtractor, reverse byte-wise comparator, and the corresponding min-max filter.

I have also (contractually / mathematically) generalized the framework to comparators other than the byte-wise comparator, and made other generalizations that tie the extractor limitations more explicitly to the particular filters and filtering used--at least in the documentation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13005

Test Plan: added unit tests as described

Reviewed By: jowlyzhang

Differential Revision: D62769784

Pulled By: pdillinger

fbshipit-source-id: 0d41f0d0273586bdad55e4aa30381ebc861f7044
This commit is contained in:
Peter Dillinger 2024-09-18 15:26:37 -07:00 committed by Facebook GitHub Bot
parent 0611eb5b9d
commit 10984e8c26
4 changed files with 672 additions and 179 deletions


@ -3734,14 +3734,33 @@ TEST_F(DBBloomFilterTest, WeirdPrefixExtractorWithFilter3) {
}
}
TEST_F(DBBloomFilterTest, SstQueryFilter) {
using experimental::KeySegmentsExtractor;
using experimental::MakeSharedBytewiseMinMaxSQFC;
using experimental::SelectKeySegment;
using experimental::SstQueryFilterConfigs;
using experimental::SstQueryFilterConfigsManager;
using KeyCategorySet = KeySegmentsExtractor::KeyCategorySet;
using experimental::KeySegmentsExtractor;
using experimental::MakeSharedBytewiseMinMaxSQFC;
using experimental::MakeSharedCappedKeySegmentsExtractor;
using experimental::MakeSharedReverseBytewiseMinMaxSQFC;
using experimental::SelectKeySegment;
using experimental::SstQueryFilterConfigs;
using experimental::SstQueryFilterConfigsManager;
using KeyCategorySet = KeySegmentsExtractor::KeyCategorySet;
static std::vector<std::string> RangeQueryKeys(
SstQueryFilterConfigsManager::Factory& factory, DB& db, const Slice& lb,
const Slice& ub) {
ReadOptions ro;
ro.iterate_lower_bound = &lb;
ro.iterate_upper_bound = &ub;
ro.table_filter = factory.GetTableFilterForRangeQuery(lb, ub);
auto it = db.NewIterator(ro);
std::vector<std::string> ret;
for (it->Seek(lb); it->Valid(); it->Next()) {
ret.push_back(it->key().ToString());
}
EXPECT_OK(it->status());
delete it;
return ret;
};
TEST_F(DBBloomFilterTest, SstQueryFilter) {
struct MySegmentExtractor : public KeySegmentsExtractor {
char min_first_char;
char max_first_char;
@ -3890,101 +3909,86 @@ TEST_F(DBBloomFilterTest, SstQueryFilter) {
ASSERT_OK(Flush());
using Keys = std::vector<std::string>;
auto RangeQueryKeys =
auto RangeQuery =
[factory, db = db_](
std::string lb, std::string ub,
std::shared_ptr<SstQueryFilterConfigsManager::Factory> alt_factory =
nullptr) {
Slice lb_slice = lb;
Slice ub_slice = ub;
ReadOptions ro;
ro.iterate_lower_bound = &lb_slice;
ro.iterate_upper_bound = &ub_slice;
ro.table_filter = (alt_factory ? alt_factory : factory)
->GetTableFilterForRangeQuery(lb_slice, ub_slice);
auto it = db->NewIterator(ro);
Keys ret;
for (it->Seek(lb_slice); it->Valid(); it->Next()) {
ret.push_back(it->key().ToString());
}
EXPECT_OK(it->status());
delete it;
return ret;
return RangeQueryKeys(alt_factory ? *alt_factory : *factory, *db, lb,
ub);
};
// Control 1: range is not filtered but min/max filter is checked
// because of common prefix leading up to 2nd segment
// TODO/future: statistics for when filter is checked vs. not applicable
EXPECT_EQ(RangeQueryKeys("abc_150", "abc_249"),
EXPECT_EQ(RangeQuery("abc_150", "abc_249"),
Keys({"abc_156_987", "abc_234", "abc_245_567"}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 2);
// Test 1: range is filtered to just lowest level, fully containing the
// segments in that category
EXPECT_EQ(RangeQueryKeys("abc_100", "abc_179"),
EXPECT_EQ(RangeQuery("abc_100", "abc_179"),
Keys({"abc_123", "abc_13", "abc_156_987"}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 1);
// Test 2: range is filtered to just lowest level, partial overlap
EXPECT_EQ(RangeQueryKeys("abc_1500_x_y", "abc_16QQ"), Keys({"abc_156_987"}));
EXPECT_EQ(RangeQuery("abc_1500_x_y", "abc_16QQ"), Keys({"abc_156_987"}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 1);
// Test 3: range is filtered to just highest level, fully containing the
// segments in that category but would be overlapping the range for the other
// file if the filter included all categories
EXPECT_EQ(RangeQueryKeys("abc_200", "abc_300"),
EXPECT_EQ(RangeQuery("abc_200", "abc_300"),
Keys({"abc_234", "abc_245_567", "abc_25"}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 1);
// Test 4: range is filtered to just highest level, partial overlap (etc.)
EXPECT_EQ(RangeQueryKeys("abc_200", "abc_249"),
Keys({"abc_234", "abc_245_567"}));
EXPECT_EQ(RangeQuery("abc_200", "abc_249"), Keys({"abc_234", "abc_245_567"}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 1);
// Test 5: range is filtered from both levels, because of category scope
EXPECT_EQ(RangeQueryKeys("abc_300", "abc_400"), Keys({}));
EXPECT_EQ(RangeQuery("abc_300", "abc_400"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 0);
// Control 2: range is not filtered because association between 1st and
// 2nd segment is not represented
EXPECT_EQ(RangeQueryKeys("abc_170", "abc_190"), Keys({}));
EXPECT_EQ(RangeQuery("abc_170", "abc_190"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 2);
// Control 3: range is not filtered because there's no (bloom) filter on
// 1st segment (like prefix filtering)
EXPECT_EQ(RangeQueryKeys("baa_170", "baa_190"), Keys({}));
EXPECT_EQ(RangeQuery("baa_170", "baa_190"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 2);
// Control 4: range is not filtered because difference in segments leading
// up to 2nd segment
EXPECT_EQ(RangeQueryKeys("abc_500", "abd_501"), Keys({}));
EXPECT_EQ(RangeQuery("abc_500", "abd_501"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 2);
// TODO: exclusive upper bound tests
// ======= Testing 3rd segment (cross-category filter) =======
// Control 5: not filtered because of segment range overlap
EXPECT_EQ(RangeQueryKeys(" z__700", " z__750"), Keys({}));
EXPECT_EQ(RangeQuery(" z__700", " z__750"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 2);
// Test 6: filtered on both levels
EXPECT_EQ(RangeQueryKeys(" z__100", " z__300"), Keys({}));
EXPECT_EQ(RangeQuery(" z__100", " z__300"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 0);
// Control 6: finding something, with 2nd segment filter helping
EXPECT_EQ(RangeQueryKeys("abc_156_9", "abc_156_99"), Keys({"abc_156_987"}));
EXPECT_EQ(RangeQuery("abc_156_9", "abc_156_99"), Keys({"abc_156_987"}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 1);
EXPECT_EQ(RangeQueryKeys("abc_245_56", "abc_245_57"), Keys({"abc_245_567"}));
EXPECT_EQ(RangeQuery("abc_245_56", "abc_245_57"), Keys({"abc_245_567"}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 1);
// Test 6: filtered on both levels, for different segments
EXPECT_EQ(RangeQueryKeys("abc_245_900", "abc_245_999"), Keys({}));
EXPECT_EQ(RangeQuery("abc_245_900", "abc_245_999"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 0);
// ======= Testing extractor read portability =======
EXPECT_EQ(RangeQueryKeys("abc_300", "abc_400"), Keys({}));
EXPECT_EQ(RangeQuery("abc_300", "abc_400"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 0);
// Only modifies how filters are written
@ -3992,20 +3996,168 @@ TEST_F(DBBloomFilterTest, SstQueryFilter) {
ASSERT_EQ(factory->GetFilteringVersion(), 0U);
ASSERT_EQ(factory->GetConfigs().IsEmptyNotFound(), true);
EXPECT_EQ(RangeQueryKeys("abc_300", "abc_400"), Keys({}));
EXPECT_EQ(RangeQuery("abc_300", "abc_400"), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 0);
// Even a different config name with different extractor can read
EXPECT_EQ(RangeQueryKeys("abc_300", "abc_400", MakeFactory("bar", 43)),
Keys({}));
EXPECT_EQ(RangeQuery("abc_300", "abc_400", MakeFactory("bar", 43)), Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 0);
// Or a "not found" config name
EXPECT_EQ(RangeQueryKeys("abc_300", "abc_400", MakeFactory("blah", 43)),
EXPECT_EQ(RangeQuery("abc_300", "abc_400", MakeFactory("blah", 43)),
Keys({}));
EXPECT_EQ(TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA), 0);
}
static std::vector<int> ExtractedSizes(const KeySegmentsExtractor& ex,
const Slice& k) {
KeySegmentsExtractor::Result r;
ex.Extract(k, KeySegmentsExtractor::kFullUserKey, &r);
std::vector<int> ret;
uint32_t last = 0;
for (const auto i : r.segment_ends) {
ret.push_back(static_cast<int>(i - last));
last = i;
}
return ret;
}
TEST_F(DBBloomFilterTest, FixedWidthSegments) {
// Unit tests for
auto extractor_none = MakeSharedCappedKeySegmentsExtractor({});
auto extractor0b = MakeSharedCappedKeySegmentsExtractor({0});
auto extractor1b = MakeSharedCappedKeySegmentsExtractor({1});
auto extractor4b = MakeSharedCappedKeySegmentsExtractor({4});
auto extractor4b0b = MakeSharedCappedKeySegmentsExtractor({4, 0});
auto extractor4b0b4b = MakeSharedCappedKeySegmentsExtractor({4, 0, 4});
auto extractor1b3b0b4b = MakeSharedCappedKeySegmentsExtractor({1, 3, 0, 4});
ASSERT_EQ(extractor_none->GetId(), "CappedKeySegmentsExtractor");
ASSERT_EQ(extractor0b->GetId(), "CappedKeySegmentsExtractor0b");
ASSERT_EQ(extractor1b->GetId(), "CappedKeySegmentsExtractor1b");
ASSERT_EQ(extractor4b->GetId(), "CappedKeySegmentsExtractor4b");
ASSERT_EQ(extractor4b0b->GetId(), "CappedKeySegmentsExtractor4b0b");
ASSERT_EQ(extractor4b0b4b->GetId(), "CappedKeySegmentsExtractor4b0b4b");
ASSERT_EQ(extractor1b3b0b4b->GetId(), "CappedKeySegmentsExtractor1b3b0b4b");
using V = std::vector<int>;
ASSERT_EQ(V({}), ExtractedSizes(*extractor_none, {}));
ASSERT_EQ(V({0}), ExtractedSizes(*extractor0b, {}));
ASSERT_EQ(V({0}), ExtractedSizes(*extractor1b, {}));
ASSERT_EQ(V({0}), ExtractedSizes(*extractor4b, {}));
ASSERT_EQ(V({0, 0}), ExtractedSizes(*extractor4b0b, {}));
ASSERT_EQ(V({0, 0, 0}), ExtractedSizes(*extractor4b0b4b, {}));
ASSERT_EQ(V({0, 0, 0, 0}), ExtractedSizes(*extractor1b3b0b4b, {}));
ASSERT_EQ(V({3}), ExtractedSizes(*extractor4b, "bla"));
ASSERT_EQ(V({3, 0}), ExtractedSizes(*extractor4b0b, "bla"));
ASSERT_EQ(V({1, 2, 0, 0}), ExtractedSizes(*extractor1b3b0b4b, "bla"));
ASSERT_EQ(V({}), ExtractedSizes(*extractor_none, "blah"));
ASSERT_EQ(V({0}), ExtractedSizes(*extractor0b, "blah"));
ASSERT_EQ(V({1}), ExtractedSizes(*extractor1b, "blah"));
ASSERT_EQ(V({4}), ExtractedSizes(*extractor4b, "blah"));
ASSERT_EQ(V({4, 0}), ExtractedSizes(*extractor4b0b, "blah"));
ASSERT_EQ(V({4, 0, 0}), ExtractedSizes(*extractor4b0b4b, "blah"));
ASSERT_EQ(V({1, 3, 0, 0}), ExtractedSizes(*extractor1b3b0b4b, "blah"));
ASSERT_EQ(V({4, 0}), ExtractedSizes(*extractor4b0b, "blah1"));
ASSERT_EQ(V({4, 0, 1}), ExtractedSizes(*extractor4b0b4b, "blah1"));
ASSERT_EQ(V({4, 0, 2}), ExtractedSizes(*extractor4b0b4b, "blah12"));
ASSERT_EQ(V({4, 0, 3}), ExtractedSizes(*extractor4b0b4b, "blah123"));
ASSERT_EQ(V({1, 3, 0, 3}), ExtractedSizes(*extractor1b3b0b4b, "blah123"));
ASSERT_EQ(V({4}), ExtractedSizes(*extractor4b, "blah1234"));
ASSERT_EQ(V({4, 0}), ExtractedSizes(*extractor4b0b, "blah1234"));
ASSERT_EQ(V({4, 0, 4}), ExtractedSizes(*extractor4b0b4b, "blah1234"));
ASSERT_EQ(V({1, 3, 0, 4}), ExtractedSizes(*extractor1b3b0b4b, "blah1234"));
ASSERT_EQ(V({4, 0, 4}), ExtractedSizes(*extractor4b0b4b, "blah12345"));
ASSERT_EQ(V({1, 3, 0, 4}), ExtractedSizes(*extractor1b3b0b4b, "blah12345"));
// Filter config for second and fourth segment
auto filter1 =
MakeSharedReverseBytewiseMinMaxSQFC(experimental::SelectKeySegment(1));
auto filter3 =
MakeSharedReverseBytewiseMinMaxSQFC(experimental::SelectKeySegment(3));
SstQueryFilterConfigs configs1 = {{filter1, filter3}, extractor1b3b0b4b};
SstQueryFilterConfigsManager::Data data = {{42, {{"foo", configs1}}}};
std::shared_ptr<SstQueryFilterConfigsManager> configs_manager;
ASSERT_OK(SstQueryFilterConfigsManager::MakeShared(data, &configs_manager));
std::shared_ptr<SstQueryFilterConfigsManager::Factory> f;
ASSERT_OK(configs_manager->MakeSharedFactory("foo", 42, &f));
ASSERT_EQ(f->GetConfigsName(), "foo");
ASSERT_EQ(f->GetConfigs().IsEmptyNotFound(), false);
Options options = CurrentOptions();
options.statistics = CreateDBStatistics();
options.table_properties_collector_factories.push_back(f);
// Next most common comparator after bytewise
options.comparator = ReverseBytewiseComparator();
DestroyAndReopen(options);
ASSERT_OK(Put("abcd1234", "val0"));
ASSERT_OK(Put("abcd1245", "val1"));
ASSERT_OK(Put("abcd99", "val2")); // short key
ASSERT_OK(Put("aqua1200", "val3"));
ASSERT_OK(Put("aqua1230", "val4"));
ASSERT_OK(Put("zen", "val5")); // very short key
ASSERT_OK(Put("azur1220", "val6"));
ASSERT_OK(Put("blah", "val7"));
ASSERT_OK(Put("blah2", "val8"));
ASSERT_OK(Flush());
using Keys = std::vector<std::string>;
// Range is not filtered but segment 1 min/max filter is checked
EXPECT_EQ(RangeQueryKeys(*f, *db_, "aczz0000", "acdc0000"), Keys({}));
EXPECT_EQ(1, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
// Found (can't use segment 3 filter)
EXPECT_EQ(RangeQueryKeys(*f, *db_, "aqzz0000", "aqdc0000"),
Keys({"aqua1230", "aqua1200"}));
EXPECT_EQ(1, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
// Filtered because of segment 1 min-max not intersecting [aaa, abb]
EXPECT_EQ(RangeQueryKeys(*f, *db_, "zabb9999", "zaaa0000"), Keys({}));
EXPECT_EQ(0, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
// Found
EXPECT_EQ(RangeQueryKeys(*f, *db_, "aqua1200ZZZ", "aqua1000ZZZ"),
Keys({"aqua1200"}));
EXPECT_EQ(1, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
// Found despite short key
EXPECT_EQ(RangeQueryKeys(*f, *db_, "aqua121", "aqua1"), Keys({"aqua1200"}));
EXPECT_EQ(1, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
// Filtered because of segment 3 min-max not intersecting [1000, 1100]
// Note that the empty string is tracked outside of the min-max range.
EXPECT_EQ(RangeQueryKeys(*f, *db_, "aqua1100ZZZ", "aqua1000ZZZ"), Keys({}));
EXPECT_EQ(0, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
// Also filtered despite short key
EXPECT_EQ(RangeQueryKeys(*f, *db_, "aqua11", "aqua1"), Keys({}));
EXPECT_EQ(0, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
// Found
EXPECT_EQ(RangeQueryKeys(*f, *db_, "blah21", "blag"),
Keys({"blah2", "blah"}));
EXPECT_EQ(1, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
// Found
EXPECT_EQ(RangeQueryKeys(*f, *db_, "blah0", "blag"), Keys({"blah"}));
EXPECT_EQ(1, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
// Filtered because of segment 3 min-max not intersecting [0, 1]
// Note that the empty string is tracked outside of the min-max range.
EXPECT_EQ(RangeQueryKeys(*f, *db_, "blah1", "blah0"), Keys({}));
EXPECT_EQ(0, TestGetAndResetTickerCount(options, NON_LAST_LEVEL_SEEK_DATA));
}
} // namespace ROCKSDB_NAMESPACE
int main(int argc, char** argv) {


@ -152,6 +152,85 @@ Status UpdateManifestForFilesState(
// EXPERIMENTAL new filtering features
namespace {
template <size_t N>
class SemiStaticCappedKeySegmentsExtractor : public KeySegmentsExtractor {
public:
SemiStaticCappedKeySegmentsExtractor(const uint32_t* byte_widths) {
id_ = kName();
uint32_t prev_end = 0;
if constexpr (N > 0) { // Suppress a compiler warning
for (size_t i = 0; i < N; ++i) {
prev_end = prev_end + byte_widths[i];
ideal_ends_[i] = prev_end;
id_ += std::to_string(byte_widths[i]) + "b";
}
}
}
static const char* kName() { return "CappedKeySegmentsExtractor"; }
const char* Name() const override { return kName(); }
std::string GetId() const override { return id_; }
void Extract(const Slice& key_or_bound, KeyKind /*kind*/,
Result* result) const override {
// Optimistic assignment
result->segment_ends.assign(ideal_ends_.begin(), ideal_ends_.end());
if constexpr (N > 0) { // Suppress a compiler warning
uint32_t key_size = static_cast<uint32_t>(key_or_bound.size());
if (key_size < ideal_ends_.back()) {
// Need to fix up (should be rare)
for (size_t i = 0; i < N; ++i) {
result->segment_ends[i] = std::min(key_size, result->segment_ends[i]);
}
}
}
}
private:
std::array<uint32_t, N> ideal_ends_;
std::string id_;
};
class DynamicCappedKeySegmentsExtractor : public KeySegmentsExtractor {
public:
DynamicCappedKeySegmentsExtractor(const std::vector<uint32_t>& byte_widths) {
id_ = kName();
uint32_t prev_end = 0;
for (size_t i = 0; i < byte_widths.size(); ++i) {
prev_end = prev_end + byte_widths[i];
ideal_ends_[i] = prev_end;
id_ += std::to_string(byte_widths[i]) + "b";
}
final_ideal_end_ = prev_end;
}
static const char* kName() { return "CappedKeySegmentsExtractor"; }
const char* Name() const override { return kName(); }
std::string GetId() const override { return id_; }
void Extract(const Slice& key_or_bound, KeyKind /*kind*/,
Result* result) const override {
// Optimistic assignment
result->segment_ends = ideal_ends_;
uint32_t key_size = static_cast<uint32_t>(key_or_bound.size());
if (key_size < final_ideal_end_) {
// Need to fix up (should be rare)
for (size_t i = 0; i < ideal_ends_.size(); ++i) {
result->segment_ends[i] = std::min(key_size, result->segment_ends[i]);
}
}
}
private:
std::vector<uint32_t> ideal_ends_;
uint32_t final_ideal_end_;
std::string id_;
};
void GetFilterInput(FilterInput select, const Slice& key,
const KeySegmentsExtractor::Result& extracted,
Slice* out_input, Slice* out_leadup) {
@ -211,12 +290,6 @@ void GetFilterInput(FilterInput select, const Slice& key,
assert(false);
return Slice();
}
Slice operator()(SelectValue) {
// TODO
assert(false);
return Slice();
}
};
Slice input = std::visit(FilterInputGetter(key, extracted), select);
@ -256,9 +329,6 @@ const char* DeserializeFilterInput(const char* p, const char* limit,
case 3:
*out = SelectColumnName{};
return p;
case 4:
*out = SelectValue{};
return p;
default:
// Reserved for future use
return nullptr;
@ -315,7 +385,6 @@ void SerializeFilterInput(std::string* out, const FilterInput& select) {
void operator()(SelectLegacyKeyPrefix) { out->push_back(1); }
void operator()(SelectUserTimestamp) { out->push_back(2); }
void operator()(SelectColumnName) { out->push_back(3); }
void operator()(SelectValue) { out->push_back(4); }
void operator()(SelectKeySegment select) {
// TODO: expand supported cases
assert(select.segment_index < 16);
@ -372,6 +441,7 @@ enum BuiltinSstQueryFilters : char {
// and filtered independently because it might be a special case that is
// not representative of the minimum in a spread of values.
kBytewiseMinMaxFilter = 0x10,
kRevBytewiseMinMaxFilter = 0x11,
};
class SstQueryFilterBuilder {
@ -459,7 +529,10 @@ class CategoryScopeFilterWrapperBuilder : public SstQueryFilterBuilder {
class BytewiseMinMaxSstQueryFilterConfig : public SstQueryFilterConfigImpl {
public:
using SstQueryFilterConfigImpl::SstQueryFilterConfigImpl;
explicit BytewiseMinMaxSstQueryFilterConfig(
const FilterInput& input,
const KeySegmentsExtractor::KeyCategorySet& categories, bool reverse)
: SstQueryFilterConfigImpl(input, categories), reverse_(reverse) {}
std::unique_ptr<SstQueryFilterBuilder> NewBuilder(
bool sanity_checks) const override {
@ -477,11 +550,13 @@ class BytewiseMinMaxSstQueryFilterConfig : public SstQueryFilterConfigImpl {
const KeySegmentsExtractor::Result& lower_bound_extracted,
const Slice& upper_bound_excl,
const KeySegmentsExtractor::Result& upper_bound_extracted) {
assert(!filter.empty() && filter[0] == kBytewiseMinMaxFilter);
assert(!filter.empty() && (filter[0] == kBytewiseMinMaxFilter ||
filter[0] == kRevBytewiseMinMaxFilter));
if (filter.size() <= 4) {
// Missing some data
return true;
}
bool reverse = (filter[0] == kRevBytewiseMinMaxFilter);
bool empty_included = (filter[1] & kEmptySeenFlag) != 0;
const char* p = filter.data() + 2;
const char* limit = filter.data() + filter.size();
@ -528,8 +603,13 @@ class BytewiseMinMaxSstQueryFilterConfig : public SstQueryFilterConfigImpl {
// May match if both the upper bound and lower bound indicate there could
// be overlap
return upper_bound_input.compare(smallest) >= 0 &&
lower_bound_input.compare(largest) <= 0;
if (reverse) {
return upper_bound_input.compare(smallest) <= 0 &&
lower_bound_input.compare(largest) >= 0;
} else {
return upper_bound_input.compare(smallest) >= 0 &&
lower_bound_input.compare(largest) <= 0;
}
}
protected:
@ -551,19 +631,11 @@ class BytewiseMinMaxSstQueryFilterConfig : public SstQueryFilterConfigImpl {
&prev_leadup);
int compare = prev_leadup.compare(leadup);
if (compare > 0) {
status = Status::Corruption(
"Ordering invariant violated from 0x" +
prev_key->ToString(/*hex=*/true) + " with prefix 0x" +
prev_leadup.ToString(/*hex=*/true) + " to 0x" +
key.ToString(/*hex=*/true) + " with prefix 0x" +
leadup.ToString(/*hex=*/true));
return;
} else if (compare == 0) {
if (compare == 0) {
// On the same prefix leading up to the segment, the segments must
// not be out of order.
compare = prev_input.compare(input);
if (compare > 0) {
if (parent.reverse_ ? compare < 0 : compare > 0) {
status = Status::Corruption(
"Ordering invariant violated from 0x" +
prev_key->ToString(/*hex=*/true) + " with segment 0x" +
@ -573,6 +645,9 @@ class BytewiseMinMaxSstQueryFilterConfig : public SstQueryFilterConfigImpl {
return;
}
}
// NOTE: it is not strictly required that the leadup be ordered, just
// satisfy the "common segment prefix property" which would be
// expensive to check
}
// Now actually update state for the filter inputs
@ -598,7 +673,8 @@ class BytewiseMinMaxSstQueryFilterConfig : public SstQueryFilterConfigImpl {
return 0;
}
return 2 + GetFilterInputSerializedLength(parent.input_) +
VarintLength(smallest.size()) + smallest.size() + largest.size();
VarintLength(parent.reverse_ ? largest.size() : smallest.size()) +
smallest.size() + largest.size();
}
void Finish(std::string& append_to) override {
@ -610,23 +686,27 @@ class BytewiseMinMaxSstQueryFilterConfig : public SstQueryFilterConfigImpl {
}
size_t old_append_to_size = append_to.size();
append_to.reserve(old_append_to_size + encoded_length);
append_to.push_back(kBytewiseMinMaxFilter);
append_to.push_back(parent.reverse_ ? kRevBytewiseMinMaxFilter
: kBytewiseMinMaxFilter);
append_to.push_back(empty_seen ? kEmptySeenFlag : 0);
SerializeFilterInput(&append_to, parent.input_);
PutVarint32(&append_to, static_cast<uint32_t>(smallest.size()));
append_to.append(smallest);
// The end of `largest` is given by the end of the filter
append_to.append(largest);
auto& minv = parent.reverse_ ? largest : smallest;
auto& maxv = parent.reverse_ ? smallest : largest;
PutVarint32(&append_to, static_cast<uint32_t>(minv.size()));
append_to.append(minv);
// The end of `maxv` is given by the end of the filter
append_to.append(maxv);
assert(append_to.size() == old_append_to_size + encoded_length);
}
const BytewiseMinMaxSstQueryFilterConfig& parent;
const bool sanity_checks;
// Smallest and largest segment seen, excluding the empty segment which
// is tracked separately
// is tracked separately. "Reverse" from parent is only applied at
// serialization time, for efficiency.
std::string smallest;
std::string largest;
bool empty_seen = false;
@ -635,6 +715,8 @@ class BytewiseMinMaxSstQueryFilterConfig : public SstQueryFilterConfigImpl {
Status status;
};
bool reverse_;
private:
static constexpr char kEmptySeenFlag = 0x1;
};
@ -1036,6 +1118,7 @@ class SstQueryFilterConfigsManagerImpl : public SstQueryFilterConfigsManager {
may_match = MayMatch_CategoryScopeFilterWrapper(filter, *state);
break;
case kBytewiseMinMaxFilter:
case kRevBytewiseMinMaxFilter:
if (state == nullptr) {
// TODO? Report problem
// No filtering
@ -1189,14 +1272,63 @@ const std::string SstQueryFilterConfigsManagerImpl::kTablePropertyName =
"rocksdb.sqfc";
} // namespace
std::shared_ptr<const KeySegmentsExtractor>
MakeSharedCappedKeySegmentsExtractor(const std::vector<size_t>& byte_widths) {
std::vector<uint32_t> byte_widths_checked;
byte_widths_checked.resize(byte_widths.size());
size_t final_end = 0;
for (size_t i = 0; i < byte_widths.size(); ++i) {
final_end += byte_widths[i];
if (byte_widths[i] > UINT32_MAX / 2 || final_end > UINT32_MAX) {
// Better to crash than to proceed unsafely
return nullptr;
}
byte_widths_checked[i] = static_cast<uint32_t>(byte_widths[i]);
}
switch (byte_widths_checked.size()) {
case 0:
return std::make_shared<SemiStaticCappedKeySegmentsExtractor<0>>(
byte_widths_checked.data());
case 1:
return std::make_shared<SemiStaticCappedKeySegmentsExtractor<1>>(
byte_widths_checked.data());
case 2:
return std::make_shared<SemiStaticCappedKeySegmentsExtractor<2>>(
byte_widths_checked.data());
case 3:
return std::make_shared<SemiStaticCappedKeySegmentsExtractor<3>>(
byte_widths_checked.data());
case 4:
return std::make_shared<SemiStaticCappedKeySegmentsExtractor<4>>(
byte_widths_checked.data());
case 5:
return std::make_shared<SemiStaticCappedKeySegmentsExtractor<5>>(
byte_widths_checked.data());
case 6:
return std::make_shared<SemiStaticCappedKeySegmentsExtractor<6>>(
byte_widths_checked.data());
default:
return std::make_shared<DynamicCappedKeySegmentsExtractor>(
byte_widths_checked);
}
}
bool SstQueryFilterConfigs::IsEmptyNotFound() const {
return this == &kEmptyNotFoundSQFC;
}
std::shared_ptr<SstQueryFilterConfig> MakeSharedBytewiseMinMaxSQFC(
FilterInput input, KeySegmentsExtractor::KeyCategorySet categories) {
return std::make_shared<BytewiseMinMaxSstQueryFilterConfig>(input,
categories);
return std::make_shared<BytewiseMinMaxSstQueryFilterConfig>(
input, categories,
/*reverse=*/false);
}
std::shared_ptr<SstQueryFilterConfig> MakeSharedReverseBytewiseMinMaxSQFC(
FilterInput input, KeySegmentsExtractor::KeyCategorySet categories) {
return std::make_shared<BytewiseMinMaxSstQueryFilterConfig>(input, categories,
/*reverse=*/true);
}
Status SstQueryFilterConfigsManager::MakeShared(


@ -179,13 +179,18 @@ class Comparator : public Customizable, public CompareInterface {
size_t timestamp_size_;
};
// Return a builtin comparator that uses lexicographic byte-wise
// ordering. The result remains the property of this module and
// must not be deleted.
// Return a builtin comparator that uses lexicographic ordering
// on unsigned bytes, so the empty string is ordered before everything
// else and a sufficiently long string of \xFF orders after anything.
// CanKeysWithDifferentByteContentsBeEqual() == false
// Returns an immortal pointer that must not be deleted by the caller.
const Comparator* BytewiseComparator();
// Return a builtin comparator that uses reverse lexicographic byte-wise
// ordering.
// Return a builtin comparator that is the reverse ordering of
// BytewiseComparator(), so the empty string is ordered after everything
// else and a sufficiently long string of \xFF orders before anything.
// CanKeysWithDifferentByteContentsBeEqual() == false
// Returns an immortal pointer that must not be deleted by the caller.
const Comparator* ReverseBytewiseComparator();
// Returns a builtin comparator that enables user-defined timestamps (formatted


@ -61,99 +61,17 @@ Status UpdateManifestForFilesState(
// EXPERIMENTAL new filtering features
// ****************************************************************************
// A class for splitting a key into meaningful pieces, or "segments" for
// filtering purposes. Keys can also be put in "categories" to simplify
// some configuration and handling. To simplify satisfying some filtering
// requirements, the segments must encompass a complete key prefix (or the whole
// key) and segments cannot overlap.
// KeySegmentsExtractor - A class for splitting a key into meaningful pieces, or
// "segments" for filtering purposes. We say the first key segment has segment
// ordinal 0, the second has segment ordinal 1, etc. To simplify satisfying some
// filtering requirements, the segments must encompass a complete key prefix (or
// the whole key). There cannot be gaps between segments (though segments are
// allowed to be essentially unused), and segments cannot overlap.
//
// Once in production, the behavior associated with a particular Name()
// cannot change. Introduce a new Name() when introducing new behaviors.
// See also SstQueryFilterConfigsManager below.
//
// OTHER CURRENT LIMITATIONS (maybe relaxed in the future for segments only
// needing point query or WHERE filtering):
// * Assumes the (default) byte-wise comparator is used.
// Keys can also be put in "categories" to simplify some configuration and
// handling. A "legal" key or bound is one that does not return an error (as a
// special, unused category) from the extractor. It is also allowed for all
// keys in a category to return an empty sequence of segments.
//
// To eliminate a confusing distinction between a segment that is empty vs.
// "not present" for a particular key, each key is logically assiciated with
@ -161,6 +79,280 @@ Status UpdateManifestForFilesState(
// segments. In practice, we only represent a finite sequence that (at least)
// covers the non-trivial segments.
//
// Once in production, the behavior associated with a particular GetId()
// cannot change. Introduce a new GetId() when introducing new behaviors.
// See also SstQueryFilterConfigsManager below.
//
// This feature hasn't yet been validated with user timestamp.
//
// = A SIMPLIFIED MODEL =
// Let us start with the easiest set of constraints to satisfy with a key
// segments extractor that generally allows for correct point and range
// filtering, and add complexity from there. Here we first assume
// * The column family is using the byte-wise comparator, or reverse byte-wise
// * A single category is assigned to all keys (by the extractor)
// * Using simplified criteria for legal segment extraction, the "segment
// maximal prefix property"
//
// SEGMENT MAXIMAL PREFIX PROPERTY: The segment that a byte is assigned to can
// only depend on the bytes that come before it, not on the byte itself nor
// anything later including the full length of the key or bound.
//
// Equivalently, two keys or bounds must agree on the segment assignment of
// position i if the two keys share a common byte-wise prefix up to at least
// position i - 1 (and i is within bounds of both keys).
//
// This specifically excludes "all or nothing" segments where it is only
// included if it reaches a particular width or delimiter. A segment resembling
// the FixedPrefixTransform would be illegal (without other assumptions); it
// must be like CappedPrefixTransform.
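To make the distinction concrete, here is a hedged, self-contained sketch (plain C++ for illustration, not the RocksDB extractor API) contrasting "all or nothing" fixed-width extraction with capped extraction on width-3 segments, using keys where the middle key is short:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// "All or nothing" fixed-width extraction: a key too short for the full
// segment yields an empty segment. This VIOLATES the segment maximal prefix
// property, because a byte's segment assignment depends on the key's length.
std::string AllOrNothingSeg(const std::string& key, std::size_t ordinal) {
  std::size_t begin = ordinal * 3;
  return begin + 3 <= key.size() ? key.substr(begin, 3) : std::string();
}

// Capped extraction: consume whatever is available for the segment, in the
// spirit of CappedPrefixTransform. This satisfies the property.
std::string CappedSeg(const std::string& key, std::size_t ordinal) {
  std::size_t begin = ordinal * 3;
  return begin < key.size() ? key.substr(begin, 3) : std::string();
}
```

With keys x < y < z where y is short, the all-or-nothing segment 1 of y is empty and falls out of key order, while the capped segment 1 stays in order.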
//
// This basically matches the notion of parsing prefix codes (see
// https://en.wikipedia.org/wiki/Prefix_code) except we have to include any
// partial segment (code word) at the end whenever an extension to that key
// might produce a full segment. An example would be parsing UTF-8 into
// segments corresponding to encoded code points, where any incomplete code
// at the end must be part of a trailing segment. Note a three-way
// correspondence between
// (a) byte-wise ordering of encoded code points, e.g.
// { D0 98 E2 82 AC }
// { E2 82 AC D0 98 }
// (b) lexicographic-then-byte-wise ordering of segments that are each an
// encoded code point, e.g.
// {{ D0 98 } { E2 82 AC }}
// {{ E2 82 AC } { D0 98 }}
// and (c) lexicographic ordering of the decoded code points, e.g.
// { U+0418 U+20AC }
// { U+20AC U+0418 }
// The correspondence between (a) and (b) is a result of the segment maximal
// prefix property and is critical for correct application of filters to
// range queries. The correspondence with (c) is a handy attribute of UTF-8
// (with no over-long encodings) and might be useful to the application.
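The (a)/(b) correspondence can be spot-checked mechanically. A minimal sketch (plain C++, illustrative only; the general claim relies on the segment maximal prefix property, which this demo does not prove) using the two byte sequences above:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Concatenate segments back into the raw key bytes.
std::string Concat(const std::vector<std::string>& segs) {
  std::string out;
  for (const std::string& s : segs) out += s;
  return out;
}
```

`std::string` comparison is byte-wise (as unsigned bytes) and `std::vector<std::string>` comparison is lexicographic over segments, so they model orderings (a) and (b) directly.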
//
// Example types of key segments that can be freely mixed in any order:
// * Capped number of bytes or codewords. The number cap for the segment
// could be the same for all keys or encoded earlier in the key.
// * Up to *and including* a delimiter byte or codeword.
// * Any/all remaining bytes to the end of the key, though this implies all
// subsequent segments will be empty.
// As part of the segment maximal prefix property, if the segments do not
// extend to the end of the key, that must be implied by the bytes that are
// in segments, NOT because the potential contents of a segment were considered
// incomplete.
//
// For example, keys might consist of
// * Segment 0: Any sequence of bytes up to and including the first ':'
// character, or the whole key if no ':' is present.
// * Segment 1: The next four bytes, or less if we reach end of key.
// * Segment 2: An unsigned byte indicating the number of additional bytes in
// the segment, and then that many bytes (or less up to the end of the key).
// * Segment 3: Any/all remaining bytes in the key
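A hedged sketch of this four-segment scheme (plain C++; `ExtractSegments` is a made-up name for illustration, not the RocksDB extractor interface):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

std::vector<std::string> ExtractSegments(const std::string& key) {
  std::vector<std::string> segs;
  // Segment 0: up to and including the first ':', or the whole key.
  std::size_t colon = key.find(':');
  std::size_t i = (colon == std::string::npos) ? key.size() : colon + 1;
  segs.push_back(key.substr(0, i));
  // Segment 1: the next four bytes, or fewer if we reach the end of the key.
  std::size_t end1 = std::min(i + 4, key.size());
  segs.push_back(key.substr(i, end1 - i));
  i = end1;
  // Segment 2: a length byte plus that many bytes (or fewer at end of key).
  std::string seg2;
  if (i < key.size()) {
    std::size_t len = static_cast<unsigned char>(key[i]);
    std::size_t end2 = std::min(i + 1 + len, key.size());
    seg2 = key.substr(i, end2 - i);
    i = end2;
  }
  segs.push_back(seg2);
  // Segment 3: any/all remaining bytes.
  segs.push_back(key.substr(i));
  return segs;
}
```

Note that every "short" case truncates rather than bailing out, consistent with the segment maximal prefix property.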
//
// For an example of what can go wrong, consider using '4' as a delimiter
// but not including it with the segment leading up to it. Suppose we have
// these keys and corresponding first segments:
// "123456" -> "123" (in file 1)
// "124536" -> "12" (in file 2)
// "125436" -> "125" (in file 1)
// Notice how byte-wise comparator ordering of the segments does not follow
// the ordering of the keys. This means we cannot safely use a filter with
// a range of segment values for filtering key range queries. For example,
// we might get a range query for ["123", "125Z") and miss that key "124536"
// in file 2 is in range because its first segment "12" is out of the range
// of the first segments on the bounds, "123" and "125". We cannot even safely
// use this for prefix-like range querying with a Bloom filter on the segments.
// For a query ["12", "124Z"), segment "12" would likely not match the Bloom
// filter in file 1 and miss "123456".
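The failure above takes only a few lines to reproduce (hedged sketch, plain C++): extract segment 0 by stopping *before* the delimiter '4' and observe that segment order disagrees with key order:

```cpp
#include <cassert>
#include <string>

// BROKEN extractor: segment 0 is everything before the first '4', with the
// delimiter excluded. This violates the segment maximal prefix property.
std::string BrokenFirstSegment(const std::string& key) {
  std::string::size_type pos = key.find('4');
  return pos == std::string::npos ? key : key.substr(0, pos);
}
```

The keys are increasing byte-wise, yet the extracted first segments are not, which is exactly what makes range filtering on them unsafe.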
//
// CATEGORIES: The KeySegmentsExtractor is allowed to place keys in categories
// so that different parts of the key space can use different filtering
// strategies. The following property is generally recommended for safe filter
// applicability
// * CATEGORY CONTIGUOUSNESS PROPERTY: each category is contiguous in
// comparator order. In other words, any key between two keys of category c
// must also be in category c.
// An alternative to categories when distinct kinds of keys are interspersed
// is to leave some segments empty when they do not apply to that key.
// Filters are generally set up to handle an empty segment specially so that
// it doesn't interfere with tracking accurate ranges on non-empty occurrences
// of the segment.
//
// = BEYOND THE SIMPLIFIED MODEL =
//
// DETAILED GENERAL REQUIREMENTS (incl OTHER COMPARATORS): The exact
// requirements on a key segments extractor depend on whether and how we use
// filters to answer queries that they cannot answer directly. To understand
// this, we describe
// (A) the types of filters in terms of data they represent and can directly
// answer queries about,
// (B) the types of read queries that we want to use filters for, and
// (C) the assumptions that need to be satisfied to connect those two.
//
// TYPES OF FILTERS: Although not exhaustive, here are some useful categories
// of filter data:
// * Equivalence class filtering - Represents or over-approximates a set of
// equivalence classes on keys. The size of the representation is roughly
// proportional to the number of equivalence classes added. Bloom and ribbon
// filters are examples.
// * Order-based filtering - Represents one or more subranges of a key space or
// key segment space. A filter query only requires application of the CF
// comparator. The size of the representation is roughly proportional to the
// number of subranges and to the key or segment size. For example, we call a
// simple filter representing a minimum and a maximum value for a segment a
// min-max filter.
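As a concrete illustration, a byte-wise min-max filter over a single segment can be sketched in a few lines (hypothetical helper, not the RocksDB implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Order-based filter tracking the minimum and maximum observed values of one
// key segment, compared byte-wise.
struct MinMaxFilter {
  bool empty = true;
  std::string min_v, max_v;

  void Add(const std::string& seg) {
    if (empty) {
      min_v = max_v = seg;
      empty = false;
      return;
    }
    min_v = std::min(min_v, seg);
    max_v = std::max(max_v, seg);
  }

  // Whether a segment value in [lo, hi] might be present: true iff the query
  // interval intersects [min_v, max_v]. May return false positives, never
  // false negatives.
  bool MayMatch(const std::string& lo, const std::string& hi) const {
    return !empty && !(hi < min_v) && !(max_v < lo);
  }
};
```

Note the over-approximation: a query interval strictly inside [min, max] "may match" even if no added value falls in it.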
//
// TYPES OF READ QUERIES and their DIRECT FILTERS:
// * Point query - Whether there {definitely isn't, might be} an entry for a
// particular key in an SST file (or partition, etc.).
// The DIRECT FILTER for a point query is an equivalence class filter on the
// whole key.
// * Range query - Whether there {definitely isn't, might be} any entries
// within a lower and upper key bound, in an SST file (or partition, etc.).
// NOTE: For this discussion, we ignore the detail of inclusive vs.
// exclusive bounds by assuming a generalized notion of "bound" (vs. key)
// that conveniently represents spaces between keys. For details, see
// https://github.com/facebook/rocksdb/pull/11434
// The DIRECT FILTER for a range query is an order-based filter on the whole
// key (non-empty intersection of bounds/keys). Simple minimum and maximum
// keys for each SST file are automatically provided by metadata and used in
// the read path for filtering (as well as binary search indexing).
// PARTITIONING NOTE: SST metadata partitions do not have recorded minimum
// and maximum keys, so require some special handling for range query
// filtering. See https://github.com/facebook/rocksdb/pull/12872 etc.
// * Where clauses - Additional constraints that can be put on range queries.
// Specifically, a where clause is a tuple <i,j,c,b1,b2> representing that the
// concatenated sequence of segments from i to j (inclusive) compares between
// b1 and b2 according to comparator c.
// EXAMPLE: To represent that segment of ordinal i is equal to s, that would
// be <i,i,bytewise_comparator,before(s),after(s)>.
// NOTE: To represent something like segment has a particular prefix, you
// would need to split the key into more segments appropriately. There is
// little loss of generality because we can combine adjacent segments for
// specifying where clauses and implementing filters.
// The DIRECT FILTER for a where clause is an order-based filter on the same
// sequence of segments and comparator (non-empty intersection of bounds/keys),
// or in the special case of an equality clause (see example), an equivalence
// class filter on the sequence of segments.
//
// GENERALIZING FILTERS (INDIRECT):
// * Point queries can utilize essentially any kind of filter by extracting
// applicable segments of the query key (if not using whole key) and querying
// the corresponding equivalence class or trivial range.
// NOTE: There is NO requirement e.g. that the comparator used by the filter
// match the CF key comparator or similar. The extractor simply needs to be
// a pure function that does not return "out of bounds" segments.
// FOR EXAMPLE, a min-max filter on the 4th segment of keys can also be
// used for filtering point queries (Get/MultiGet) and could be as
// effective and much more space efficient than a Bloom filter, depending
// on the workload.
//
// Beyond point queries, we generally expect the key comparator to be a
// lexicographic / big endian ordering at a high level (or the reverse of that
// ordering), while each segment can use an arbitrary comparator.
// FOR EXAMPLE, with a custom key comparator and segments extractor,
// segment 0 could be a 4-byte unsigned little-endian integer,
// segment 1 could be an 8-byte signed big-endian integer. This framework
// requires segment 0 to come before segment 1 in the key and to take
// precedence in key ordering (i.e. segment 1 order is only consulted when
// keys are equal in segment 0).
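A hedged sketch of such a comparator (plain C++; the layout and function names are hypothetical): segment 0 is a 4-byte unsigned little-endian integer at offset 0, segment 1 an 8-byte signed big-endian integer at offset 4, and segment 0 takes precedence:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode segment 0: 4-byte unsigned little-endian at offset 0.
uint32_t Seg0(const std::string& k) {
  uint32_t v = 0;
  for (int i = 3; i >= 0; --i) {
    v = (v << 8) | static_cast<unsigned char>(k[i]);
  }
  return v;
}

// Decode segment 1: 8-byte signed big-endian (two's complement) at offset 4.
int64_t Seg1(const std::string& k) {
  uint64_t u = 0;
  for (int i = 0; i < 8; ++i) {
    u = (u << 8) | static_cast<unsigned char>(k[4 + i]);
  }
  return static_cast<int64_t>(u);
}

// CF comparator: segment 0 first; segment 1 only breaks ties.
int CompareKeys(const std::string& a, const std::string& b) {
  if (Seg0(a) != Seg0(b)) return Seg0(a) < Seg0(b) ? -1 : 1;
  if (Seg1(a) != Seg1(b)) return Seg1(a) < Seg1(b) ? -1 : 1;
  return 0;
}
```

This key order intentionally differs from byte-wise order (little-endian and signed fields), which is why each segment may use its own comparator.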
//
// * Equivalence class filters can apply to range queries under conditions
// resembling legacy prefix filtering (prefix_extractor). An equivalence class
// filter on segments i through j and category set s is applicable to a range
// query from lb to ub if
// * All segments through j extracted from lb and ub are equal.
// NOTE: being in the same filtering equivalence class is insufficient, as
// that could be unrelated inputs with a hash collision. Here we are
// omitting details that would formally accommodate comparators in which
// different bytes can be considered equal.
// * The categories of lb and ub are in the category set s.
// * COMMON SEGMENT PREFIX PROPERTY (for all x, y, z; params j, s): if
// * Keys x and z have equal segments up through ordinal j, and
// * Keys x and z are in categories in category set s, and
// * Key y is ordered x < y < z according to the CF comparator,
// then both
// * Key y has equal segments up through ordinal j (compared to x and z)
// * Key y is in a category in category set s
// (This is implied by the SEGMENT MAXIMAL PREFIX PROPERTY in the simplified
// model.)
//
// * Order-based filters on segments (rather than whole key) can apply to range
// queries (with "whole key" bounds). Specifically, an order-based filter on
// segments i through j and category set s is applicable to a range query from
// lb to ub if
// * All segments through i-1 extracted from lb and ub are equal
// * The categories of lb and ub are in the category set s.
// * SEGMENT ORDERING PROPERTY for ordinal i through j, segments
// comparator c, category set s, for all x, y, and z: if
// * Keys x and z have equal segments up through ordinal i-1, and
// * Keys x and z are in categories in category set s, and
// * Key y is ordered x < y < z according to the CF comparator,
// then both
// * The common segment prefix property is satisfied through ordinal i-1
// and with category set s
// * x_i..j <= y_i..j <= z_i..j according to segment comparator c, where
// x_i..j is the concatenation of segments i through j of key x (etc.).
// (This is implied by the SEGMENT MAXIMAL PREFIX PROPERTY in the simplified
// model.)
//
// INTERESTING EXAMPLES:
// Consider a segment encoding called BadVarInt1 in which a byte with
// highest-order bit 1 means "start a new segment". Also consider BadVarInt0
// which starts a new segment on highest-order bit 0.
//
// Configuration: bytewise comp, BadVarInt1 format for segments 0-3 with
// segment 3 also continuing to the end of the key
// x = 0x 20 21|82 23|||
// y = 0x 20 21|82 23 24|85||
// z = 0x 20 21|82 23|84 25||
//
// For i=j=1, this set of keys violate the common segment prefix property and
// segment ordering property, so can lead to incorrect equivalence class
// filtering or order-based filtering.
//
// Suppose we modify the configuration so that "short" keys (empty in segment
// 2) are placed in an unfiltered category. In that case, x above doesn't meet
// the precondition for being limited by segment properties. Consider these
// keys instead:
// x = 0x 20 21|82 23 24|85||
// y = 0x 20 21|82 23 24|85 26|87|
// z = 0x 20 21|82 23 24|85|86|
// m = 0x 20 21|82 23 25|85|86|
// n = 0x 20 21|82 23|84 25||
//
// Although segment 1 values might be out of order with key order,
// re-categorizing the short keys has allowed satisfying the common segment
// prefix property with j=1 (and with j=0), so we can use equivalence class
// filters on segment 1, segment 0, or segments 0 to 1. However, violation of
// the segment ordering property for i=j=1 (see z, m, n) means we can't use
// order-based filtering on segment 1.
//
// p = 0x 20 21|82 23|84 25 26||
// q = 0x 20 21|82 23|84 25|86|
//
// But keys can still be short from segment 2 to 3, and thus we are violating
// the common segment prefix property for segment 2 (see n, p, q).
//
// Configuration: bytewise comp, BadVarInt0 format for segments 0-3 with
// segment 3 also continuing to the end of the key. No short key category.
// x = 0x 80 81|22 83|||
// y = 0x 80 81|22 83|24 85||
// z = 0x 80 81|22 83 84|25||
// m = 0x 80 82|22 83|||
// n = 0x 80 83|22 84|24 85||
//
// Even though this violates the segment maximal prefix property of the
// simplified model, the common segment prefix property and segment ordering
// property are satisfied for the various segment ordinals. In broader terms,
// the usual rule of the delimiter going with the segment before it can be
// violated if every byte value below some threshold starts a segment. (This
// has not been formally verified and is not recommended.)
//
// Suppose that we are paranoid, however, and decide to place short keys
// (empty in segment 2) into an unfiltered category. This is potentially a
// dangerous decision because loss of continuity at least affects the
// ability to filter on segment 0 (common segment prefix property violated
// with i=j=0; see z, m, n; m not in category set). Thus, excluding short keys
// with categories is not a recommended solution either.
class KeySegmentsExtractor {
public:
// The extractor assigns keys to categories so that it is easier to
@ -269,6 +461,14 @@ class KeySegmentsExtractor {
Result* result) const = 0;
};
// Constructs a KeySegmentsExtractor for fixed-width key segments that safely
// handles short keys by truncating segments at the end of the input key.
// See comments on KeySegmentsExtractor for why this is much safer for
// filtering than "all or nothing" fixed-size segments. This is essentially
// a generalization of (New)CappedPrefixTransform.
std::shared_ptr<const KeySegmentsExtractor>
MakeSharedCappedKeySegmentsExtractor(const std::vector<size_t>& byte_widths);
// Alternatives for filtering inputs
// An individual key segment.
@ -305,13 +505,13 @@ struct SelectUserTimestamp {};
struct SelectColumnName {};
struct SelectValue {};
// NOTE: more variants might be added in the future.
// NOTE2: filtering on values is not supported because it could easily break
// overwrite semantics. (Filter out SST with newer, non-matching value but
// see obsolete value that does match.)
using FilterInput =
std::variant<SelectWholeKey, SelectKeySegment, SelectKeySegmentRange,
                 SelectLegacyKeyPrefix, SelectUserTimestamp, SelectColumnName>;
// Base class for individual filtering schemes in terms of chosen
// FilterInputs, but not tied to a particular KeySegmentsExtractor.
@ -336,6 +536,10 @@ std::shared_ptr<SstQueryFilterConfig> MakeSharedBytewiseMinMaxSQFC(
FilterInput select, KeySegmentsExtractor::KeyCategorySet categories =
KeySegmentsExtractor::KeyCategorySet::All());
std::shared_ptr<SstQueryFilterConfig> MakeSharedReverseBytewiseMinMaxSQFC(
FilterInput select, KeySegmentsExtractor::KeyCategorySet categories =
KeySegmentsExtractor::KeyCategorySet::All());
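The reason the min-max filter must know the comparator direction can be shown directly (hedged sketch; `SegmentInterval` is a made-up helper, not part of this API). Given segment values extracted from a range query's lower and upper bounds, the byte-wise probe interval flips under the reverse comparator:

```cpp
#include <cassert>
#include <string>

struct ByteInterval {
  std::string lo, hi;  // byte-wise endpoints to probe against a min-max filter
  bool empty;          // true if the interval contains no segment values
};

// from_lb/from_ub: segment values extracted from the query's lower/upper
// bounds (in CF comparator order). Under the reverse byte-wise comparator,
// the lower bound holds the byte-wise LARGEST value, so the roles swap.
ByteInterval SegmentInterval(const std::string& from_lb,
                             const std::string& from_ub,
                             bool reverse_comparator) {
  const std::string& lo = reverse_comparator ? from_ub : from_lb;
  const std::string& hi = reverse_comparator ? from_lb : from_ub;
  return ByteInterval{lo, hi, hi < lo};
}
```

Without knowing the direction, the same pair of extracted values is ambiguous between an empty forward interval and a non-empty reverse interval.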
// TODO: more kinds of filters, eventually including Bloom/ribbon filters
// and replacing the old filter configuration APIs