mirror of https://github.com/facebook/rocksdb.git
33 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Peter Dillinger | 7fff38b1fe |
clang-format cache/ and util/ directories (#10867)
Summary: This is purely the result of running `clang-format -i` on files, except some files have been excluded for manual intervention in a separate PR Pull Request resolved: https://github.com/facebook/rocksdb/pull/10867 Test Plan: `make check`, `make check-headers`, `make format` Reviewed By: jay-zhuang Differential Revision: D40682086 Pulled By: pdillinger fbshipit-source-id: 8673d978553ab99b516da7fb63ba0b82523337f8 |
|
Bo Wang | 8e63d90ff8 |
Pass rate_limiter_priority through filter block reader functions to FS (#10251)
Summary: With https://github.com/facebook/rocksdb/pull/9996 , we can pass the rate_limiter_priority to FS for most cases. This PR is to update the code path for filter block reader. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251 Test Plan: Current unit tests should pass. Reviewed By: pdillinger Differential Revision: D37427667 Pulled By: gitbw95 fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433 |
|
Peter Dillinger | 126c223714 |
Remove deprecated block-based filter (#10184)
Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80 |
|
Hui Xiao | 3573558ec5 |
Rewrite memory-charging feature's option API (#9926)
Summary: **Context:** Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`. Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags. **Summary:** - Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that - Added missing db bench/stress test for some memory charging features - Renamed related test suite to indicate they are under the same theme of memory charging - Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication - Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926 Test Plan: - New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)` - New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)` - CI - db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465** 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078** - db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36054712 Pulled By: hx235 fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af |
|
Peter Dillinger | 8c681087c7 |
Refactor FilterPolicies toward Customizable (#9567)
Summary: Some changes to make it easier to make FilterPolicy customizable. Especially, create distinct classes for the different testing-only and user-facing built-in FilterPolicy modes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9567 Test Plan: tests updated, with no intended difference in functionality tested. No difference in test performance seen as a result of moving to string-based filter type configuration. Reviewed By: mrambacher Differential Revision: D34234694 Pulled By: pdillinger fbshipit-source-id: 8a94931a9e04c3bcca863a4f524cfd064aaf0122 |
|
Hui Xiao | 920386f2b7 |
Detect (new) Bloom/Ribbon Filter construction corruption (#9342)
Summary: Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393 **Context:** (Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following: a) set of keys to add to filter b) set of hashes to add to filter (64-bit hash applied to each key) c) set of Bloom indices to set in filter, with duplicates d) set of Bloom indices to set in filter, deduplicated e) final filter and its checksum This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level. - b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`) - c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO. Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default. **Summary:** - Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()` - Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption - See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design - Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries` - Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()` - When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342 Test Plan: - Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption` - Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl - For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing break. - Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()` - FastLocalBloom - Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s** - After change: - `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)** - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)** - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)** - Standard128Ribbon - Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s** - After change: - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)** - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)** - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)** - Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true` - Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status. Reviewed By: pdillinger Differential Revision: D33746928 Pulled By: hx235 fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57 |
|
Hui Xiao | 74544d582f |
Account Bloom/Ribbon filter construction memory in global memory limit (#9073)
Summary: Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge. **Context:** Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`. The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option. **Summary:** - Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction: - hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance), - banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`), - final filter (i.e, `mutable_buf`'s size). - Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table. - Added option fo run `filter_bench` with this memory reservation feature Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073 Test Plan: - Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory` - To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around **3000 - 6000 ms** and others take around **1500 - 2000 ms**, in total adding **20000 - 25000 ms** to the test suit running locally - Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter - Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over **20** run as below: - FastLocalBloom - baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`: - **Build avg ns/key: 29.56295** (DEBUG_LEVEL=1), **29.98153** (DEBUG_LEVEL=0) - new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`: - **Build avg ns/key: 30.99046** (DEBUG_LEVEL=1), **30.48867** (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` : - **Build avg ns/key: 31.146975** (DEBUG_LEVEL=1), **30.08165** (DEBUG_LEVEL=0) - Ribbon128 - baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`: - **Build avg ns/key: 129.17585** (DEBUG_LEVEL=1), **130.5225** (DEBUG_LEVEL=0) - new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `: - **Build avg ns/key: 131.61645** (DEBUG_LEVEL=1), **132.98075** (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` : - **Build avg ns/key: 52.032965** (DEBUG_LEVEL=1), **52.597825** (DEBUG_LEVEL=0) - And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console. Reviewed By: pdillinger Differential Revision: D31991348 Pulled By: hx235 fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e |
|
mrambacher | 0ca6d6297f |
Rename variables in ImmutableCFOptions to avoid conflicts with ImmutableDBOptions (#8227)
Summary: Renaming ImmutableCFOptions::info_log and statistics to logger and stats. This is stage 2 in creating an ImmutableOptions class. It is necessary because the names match those in ImmutableOptions and have different types. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8227 Reviewed By: jay-zhuang Differential Revision: D28000967 Pulled By: mrambacher fbshipit-source-id: 3bf2aa04e8f1e8724d825b7deacf41080c14420b |
|
mrambacher | 01e460d538 |
Make types of Immutable/Mutable Options fields match that of the underlying Option (#8176)
Summary: This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively. readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release). There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176 Reviewed By: pdillinger Differential Revision: D27954339 Pulled By: mrambacher fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad |
|
mrambacher | 3dff28cf9b |
Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033)
Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67 |
|
mrambacher | 12f1137355 |
Add a SystemClock class to capture the time functions of an Env (#7858)
Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90 |
|
Peter Dillinger | 003e72b201 |
Use size_t for filter APIs, protect against overflow (#7726)
Summary: Deprecate CalculateNumEntry and replace with ApproximateNumEntries (better name) using size_t instead of int and uint32_t, to minimize confusing casts and bad overflow behavior (possible though probably not realistic). Bloom sizes are now explicitly capped at max size supported by implementations: just under 4GiB for fv=5 Bloom, and just under 512MiB for fv<5 Legacy Bloom. This hardening could help to set up for fuzzing. Also, since RocksDB only uses this information as an approximation for trying to hit certain sizes for partitioned filters, it's more important that the function be reasonably fast than for it to be completely accurate. It's hard enough to be 100% accurate for Ribbon (currently reversing CalculateSpace) that adding optimize_filters_for_memory into the mix is just not worth trying to be 100% accurate for num entries for bytes. Also: - Cleaned up filter_policy.h to remove MSVC warning handling and potentially unsafe use of exception for "not implemented" - Correct the number of entries limit beyond which current Ribbon implementation falls back on Bloom instead. - Consistently use "num_entries" rather than "num_entry" - Remove LegacyBloomBitsBuilder::CalculateNumEntry as it's essentially obsolete from general implementation BuiltinFilterBitsBuilder::CalculateNumEntries. - Fix filter_bench to skip some tests that don't make sense when only one or a small number of filters has been generated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7726 Test Plan: expanded existing unit tests for CalculateSpace / ApproximateNumEntries. Also manually used filter_bench to verify Legacy and fv=5 Bloom size caps work (much too expensive for unit test). Note that the actual bits per key is below requested due to space cap. $ ./filter_bench -impl=0 -bits_per_key=20 -average_keys_per_filter=256000000 -vary_key_count_ratio=0 -m_keys_total_max=256 -allow_bad_fp_rate ... Total size (MB): 511.992 Bits/key stored: 16.777 ... $ ./filter_bench -impl=2 -bits_per_key=20 -average_keys_per_filter=2000000000 -vary_key_count_ratio=0 -m_keys_total_max=2000 ... Total size (MB): 4096 Bits/key stored: 17.1799 ... $ Reviewed By: jay-zhuang Differential Revision: D25239800 Pulled By: pdillinger fbshipit-source-id: f94e6d065efd31e05ec630ae1a82e6400d8390c4 |
|
Peter Dillinger | 60af964372 |
Experimental (production candidate) SST schema for Ribbon filter (#7658)
Summary: Added experimental public API for Ribbon filter: NewExperimentalRibbonFilterPolicy(). This experimental API will take a "Bloom equivalent" bits per key, and configure the Ribbon filter for the same FP rate as Bloom would have but ~30% space savings. (Note: optimize_filters_for_memory is not yet implemented for Ribbon filter. That can be added with no effect on schema.) Internally, the Ribbon filter is configured using a "one_in_fp_rate" value, which is 1 over desired FP rate. For example, use 100 for 1% FP rate. I'm expecting this will be used in the future for configuring Bloom-like filters, as I expect people to more commonly hold constant the filter accuracy and change the space vs. time trade-off, rather than hold constant the space (per key) and change the accuracy vs. time trade-off, though we might make that available. ### Benchmarking ``` $ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 34.1341 Number of filters: 1993 Total size (MB): 238.488 Reported total allocated memory (MB): 262.875 Reported internal fragmentation: 10.2255% Bits/key stored: 10.0029 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 18.7508 Random filter net ns/op: 258.246 Average FP rate %: 0.968672 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing Building... Build avg ns/key: 130.851 Number of filters: 1993 Total size (MB): 168.166 Reported total allocated memory (MB): 183.211 Reported internal fragmentation: 8.94626% Bits/key stored: 7.05341 ---------------------------- Mixed inside/outside queries... Single filter net ns/op: 58.4523 Random filter net ns/op: 363.717 Average FP rate %: 0.952978 ---------------------------- Done. (For more info, run with -legend or -help.) ``` 168.166 / 238.488 = 0.705 -> 29.5% space reduction 130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction) ### Working around a hashing "flaw" bloom_test discovered a flaw in the simple hashing applied in StandardHasher when num_starts == 1 (num_slots == 128), showing an excessively high FP rate. The problem is that when many entries, on the order of number of hash bits or kCoeffBits, are associated with the same start location, the correlation between the CoeffRow and ResultRow (for efficiency) can lead to a solution that is "universal," or nearly so, for entries mapping to that start location. (Normally, variance in start location breaks the effective association between CoeffRow and ResultRow; the same value for CoeffRow is effectively different if start locations are different.) Without kUseSmash and with num_starts > 1 (thus num_starts ~= num_slots), this flaw should be completely irrelevant. Even with 10M slots, the chances of a single slot having just 16 (or more) entries map to it--not enough to cause an FP problem, which would be local to that slot if it happened--is 1 in millions. This spreadsheet formula shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE))) As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is intended for CPU efficiency of filters with many more entries/slots than kCoeffBits, a very reasonable work-around is to disallow num_starts==1 when !kUseSmash, by making the minimum non-zero number of slots 2*kCoeffBits. This is the work-around I've applied. This also means that the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not space-efficient for less than a few hundred entries. Because of this, I have made it fall back on constructing a Bloom filter, under existing schema, when that is more space efficient for small filters. (We can change this in the future if we want.) TODO: better unit tests for this case in ribbon_test, and probably update StandardHasher for kUseSmash case so that it can scale nicely to small filters. ### Other related changes * Add Ribbon filter to stress/crash test * Add Ribbon filter to filter_bench as -impl=3 * Add option string support, as in "filter_policy=experimental_ribbon:5.678;" where 5.678 is the Bloom equivalent bits per key. * Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom * Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on binary searching CalculateSpace (inefficient), so that subclasses (especially experimental ones) don't have to provide an efficient implementation inverting CalculateSpace. * Minor refactor FastLocalBloomBitsBuilder for new base class XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder, which allows the latter to fall back on Bloom construction in some extreme cases. * Mostly updated bloom_test for Ribbon filter, though a test like FullBloomTest::Schema is a next TODO to ensure schema stability (in case this becomes production-ready schema as it is). * Add some APIs to ribbon_impl.h for configuring Ribbon filters. Although these are reasonably covered by bloom_test, TODO more unit tests in ribbon_test * Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data for constructing the linear approximations in GetNumSlotsFor95PctSuccess. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658 Test Plan: Some unit tests updated but other testing is left TODO. This is considered experimental but laying down schema compatibility as early as possible in case it proves production-quality. Also tested in stress/crash test. Reviewed By: jay-zhuang Differential Revision: D24899349 Pulled By: pdillinger fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8 |
|
Peter Dillinger | 08552b19d3 |
Genericize and clean up FastRange (#7436)
Summary: A generic algorithm in progress depends on a templatized version of fastrange, so this change generalizes it and renames it to fit our style guidelines, FastRange32, FastRange64, and now FastRangeGeneric. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436 Test Plan: added a few more test cases Reviewed By: jay-zhuang Differential Revision: D23958153 Pulled By: pdillinger fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8 |
|
Peter Dillinger | 5b2bbacb6f |
Minimize memory internal fragmentation for Bloom filters (#6427)
Summary: New experimental option BBTO::optimize_filters_for_memory builds filters that maximize their use of "usable size" from malloc_usable_size, which is also used to compute block cache charges. Rather than always "rounding up," we track state in the BloomFilterPolicy object to mix essentially "rounding down" and "rounding up" so that the average FP rate of all generated filters is the same as without the option. (YMMV as heavily accessed filters might be unluckily lower accuracy.) Thus, the option near-minimizes what the block cache considers as "memory used" for a given target Bloom filter false positive rate and Bloom filter implementation. There are no forward or backward compatibility issues with this change, though it only works on the format_version=5 Bloom filter. With Jemalloc, we see about 10% reduction in memory footprint (and block cache charge) for Bloom filters, but 1-2% increase in storage footprint, due to encoding efficiency losses (FP rate is non-linear with bits/key). Why not weighted random round up/down rather than state tracking? By only requiring malloc_usable_size, we don't actually know what the next larger and next smaller usable sizes for the allocator are. We pick a requested size, accept and use whatever usable size it has, and use the difference to inform our next choice. This allows us to narrow in on the right balance without tracking/predicting usable sizes. Why not weight history of generated filter false positive rates by number of keys? This could lead to excess skew in small filters after generating a large filter. Results from filter_bench with jemalloc (irrelevant details omitted): (normal keys/filter, but high variance) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.6278 Number of filters: 5516 Total size (MB): 200.046 Reported total allocated memory (MB): 220.597 Reported internal fragmentation: 10.2732% Bits/key stored: 10.0097 Average FP rate %: 0.965228 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.5104 Number of filters: 5464 Total size (MB): 200.015 Reported total allocated memory (MB): 200.322 Reported internal fragmentation: 0.153709% Bits/key stored: 10.1011 Average FP rate %: 0.966313 (very few keys / filter, optimization not as effective due to ~59 byte internal fragmentation in blocked Bloom filter representation) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.5649 Number of filters: 162950 Total size (MB): 200.001 Reported total allocated memory (MB): 224.624 Reported internal fragmentation: 12.3117% Bits/key stored: 10.2951 Average FP rate %: 0.821534 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 31.8057 Number of filters: 159849 Total size (MB): 200 Reported total allocated memory (MB): 208.846 Reported internal fragmentation: 4.42297% Bits/key stored: 10.4948 Average FP rate %: 0.811006 (high keys/filter) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.7017 Number of filters: 164 Total size (MB): 200.352 Reported total allocated memory (MB): 221.5 Reported internal fragmentation: 10.5552% Bits/key stored: 10.0003 Average FP rate %: 0.969358 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.7131 Number of filters: 160 Total size (MB): 200.928 Reported total allocated memory (MB): 200.938 Reported internal fragmentation: 0.00448054% Bits/key stored: 10.1852 Average FP rate %: 0.963387 And from db_bench (block cache) with jemalloc: $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }' 17063835 $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }' 17430747 $ #^ 2.1% additional filter storage $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8440400 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 21087528 rocksdb.bloom.filter.useful COUNT : 4963889 rocksdb.bloom.filter.full.positive COUNT : 1214081 rocksdb.bloom.filter.full.true.positive COUNT : 1161999 $ #^ 1.04 % observed FP rate $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8448592 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 18220328 rocksdb.bloom.filter.useful COUNT : 5360933 rocksdb.bloom.filter.full.positive COUNT : 1321315 rocksdb.bloom.filter.full.true.positive COUNT : 1262999 $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters (Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427 Test Plan: unit test added, 'make check' with gcc, clang and valgrind Reviewed By: siying Differential Revision: D22124374 Pulled By: pdillinger fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e |
|
Derrick Pallas | 5272305437 |
Fix FilterBench when RTTI=0 (#6732)
Summary: The dynamic_cast in the filter benchmark causes release mode to fail due to no-rtti. Replace with static_cast_with_check. Signed-off-by: Derrick Pallas <derrick@pallas.us> Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732 Reviewed By: ltamasi Differential Revision: D21304260 Pulled By: pdillinger fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c |
|
Peter Dillinger | ab65278b1f |
Misc filter_bench improvements (#6444)
Summary: Useful in validating/testing internal fragmentation changes (https://github.com/facebook/rocksdb/issues/6427) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6444 Test Plan: manual (no changes to production code) Differential Revision: D20040076 Pulled By: pdillinger fbshipit-source-id: 32d26f363d2a9ab9f5bebd281dcebd9915ae340e |
|
sdong | fdf882ded2 |
Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e |
|
Burton Li | c9a5e48762 |
fix build warnnings on MSVC (#6309)
Summary: Fix build warnings on MSVC. siying Pull Request resolved: https://github.com/facebook/rocksdb/pull/6309 Differential Revision: D19455012 Pulled By: ltamasi fbshipit-source-id: 940739f2c92de60e47cc2bed8dd7f921459545a9 |
|
Peter Dillinger | 8aa99fc71e |
Warn on excessive keys for legacy Bloom filter with 32-bit hash (#6317)
Summary: With many millions of keys, the old Bloom filter implementation for the block-based table (format_version <= 4) would have excessive FP rate due to the limitations of feeding the Bloom filter with a 32-bit hash. This change computes an estimated inflated FP rate due to this effect and warns in the log whenever an SST filter is constructed (almost certainly a "full" not "partitioned" filter) that exceeds 1.5x FP rate due to this effect. The detailed condition is only checked if 3 million keys or more have been added to a filter, as this should be a lower bound for common bits/key settings (< 20). Recommended remedies include smaller SST file size, using format_version >= 5 (for new Bloom filter), or using partitioned filters. This does not change behavior other than generating warnings for some constructed filters using the old implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6317 Test Plan: Example with warning, 15M keys @ 15 bits / key: (working_mem_size_mb is just to stop after building one filter if it's large) $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=15000000 2>&1 | grep 'FP rate' [WARN] [/block_based/filter_policy.cc:292] Using legacy SST/BBT Bloom filter with excessive key count (15.0M @ 15bpk), causing estimated 1.8x higher filter FP rate. Consider using new Bloom with format_version>=5, smaller SST file size, or partitioned filters. Predicted FP rate %: 0.766702 Average FP rate %: 0.66846 Example without warning (150K keys): $ ./filter_bench -quick -impl=0 -working_mem_size_mb=1 -bits_per_key=15 -average_keys_per_filter=150000 2>&1 | grep 'FP rate' Predicted FP rate %: 0.422857 Average FP rate %: 0.379301 $ With more samples at 15 bits/key: 150K keys -> no warning; actual: 0.379% FP rate (baseline) 1M keys -> no warning; actual: 0.396% FP rate, 1.045x 9M keys -> no warning; actual: 0.563% FP rate, 1.485x 10M keys -> warning (1.5x); actual: 0.564% FP rate, 1.488x 15M keys -> warning (1.8x); actual: 0.668% FP rate, 1.76x 25M keys -> warning (2.4x); actual: 0.880% FP rate, 2.32x At 10 bits/key: 150K keys -> no warning; actual: 1.17% FP rate (baseline) 1M keys -> no warning; actual: 1.16% FP rate 10M keys -> no warning; actual: 1.32% FP rate, 1.13x 25M keys -> no warning; actual: 1.63% FP rate, 1.39x 35M keys -> warning (1.6x); actual: 1.81% FP rate, 1.55x At 5 bits/key: 150K keys -> no warning; actual: 9.32% FP rate (baseline) 25M keys -> no warning; actual: 9.62% FP rate, 1.03x 200M keys -> no warning; actual: 12.2% FP rate, 1.31x 250M keys -> warning (1.5x); actual: 12.8% FP rate, 1.37x 300M keys -> warning (1.6x); actual: 13.4% FP rate, 1.43x The reason for the modest inaccuracy at low bits/key is that the assumption of independence between a collision between 32-hash values feeding the filter and an FP in the filter is not quite true for implementations using "simple" logic to compute indices from the stock hash result. There's math on this in my dissertation, but I don't think it's worth the effort just for these extreme cases (> 100 million keys and low-ish bits/key). Differential Revision: D19471715 Pulled By: pdillinger fbshipit-source-id: f80c96893a09bf1152630ff0b964e5cdd7e35c68 |
|
Peter Dillinger | 4b86fe1123 |
Log warning for high bits/key in legacy Bloom filter (#6312)
Summary: Help users that would benefit most from new Bloom filter implementation by logging a warning that recommends the using format_version >= 5. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6312 Test Plan: $ (for BPK in 10 13 14 19 20 50; do ./filter_bench -quick -impl=0 -bits_per_key=$BPK -m_queries=1 2>&1; done) | grep 'its/key' Bits/key actual: 10.0647 Bits/key actual: 13.0593 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (14) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 14.0581 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (19) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 19.0542 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 20.0584 [WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (50) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5. Bits/key actual: 50.0577 Differential Revision: D19457191 Pulled By: pdillinger fbshipit-source-id: 073d94cde5c70e03a160f953e1100c15ea83eda4 |
|
Peter Dillinger | a92bd0a183 |
Optimize memory and CPU for building new Bloom filter (#6175)
Summary: The filter bits builder collects all the hashes to add in memory before adding them (because the number of keys is not known until we've walked over all the keys). Existing code uses a std::vector for this, which can mean up to 2x than necessary space allocated (and not freed) and up to ~2x write amplification in memory. Using std::deque uses close to minimal space (for large filters, the only time it matters), no write amplification, frees memory while building, and no need for large contiguous memory area. The only cost is more calls to allocator, which does not appear to matter, at least in benchmark test. For now, this change only applies to the new (format_version=5) Bloom filter implementation, to ease before-and-after comparison downstream. Temporary memory use during build is about the only way the new Bloom filter could regress vs. the old (because of upgrade to 64-bit hash) and that should only matter for full filters. This change should largely mitigate that potential regression. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6175 Test Plan: Using filter_bench with -new_builder option and 6M keys per filter is like large full filter (improvement). 10k keys and no -new_builder is like partitioned filters (about the same). (Corresponding configurations run simultaneously on devserver.) std::vector impl (before) $ /usr/bin/time -v ./filter_bench -impl=2 -quick -new_builder -working_mem_size_mb=1000 - average_keys_per_filter=6000000 Build avg ns/key: 52.2027 Maximum resident set size (kbytes): 1105016 $ /usr/bin/time -v ./filter_bench -impl=2 -quick -working_mem_size_mb=1000 - average_keys_per_filter=10000 Build avg ns/key: 30.5694 Maximum resident set size (kbytes): 1208152 std::deque impl (after) $ /usr/bin/time -v ./filter_bench -impl=2 -quick -new_builder -working_mem_size_mb=1000 - average_keys_per_filter=6000000 Build avg ns/key: 39.0697 Maximum resident set size (kbytes): 1087196 $ /usr/bin/time -v ./filter_bench -impl=2 -quick -working_mem_size_mb=1000 - average_keys_per_filter=10000 Build avg ns/key: 30.9348 Maximum resident set size (kbytes): 1207980 Differential Revision: D19053431 Pulled By: pdillinger fbshipit-source-id: 2888e748723a19d9ea40403934f13cbb8483430c |
|
Peter Dillinger | ca3b6c28c9 |
Expose and elaborate FilterBuildingContext (#6088)
Summary: This change enables custom implementations of FilterPolicy to wrap a variety of NewBloomFilterPolicy and select among them based on contextual information such as table level and compaction style. * Moves FilterBuildingContext to public API and elaborates it with more useful data. (It would be nice to put more general options-like data, but at the time this object is constructed, we are using internal APIs ImmutableCFOptions and MutableCFOptions and don't have easy access to ColumnFamilyOptions that I can tell.) * Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to GetBuilderWithContext, because it's now public. * Plumbs through the table's "level_at_creation" for filter building context. * Simplified some tests by adding GetBuilder() to MockBlockBasedTableTester. * Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including sample wrapper class LevelAndStyleCustomFilterPolicy. * Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits where it does not reset perf context. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088 Test Plan: make check, valgrind on db_bloom_filter_test Differential Revision: D18697817 Pulled By: pdillinger fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4 |
|
Peter Dillinger | 57f3032285 |
Allow fractional bits/key in BloomFilterPolicy (#6092)
Summary: There's no technological impediment to allowing the Bloom filter bits/key to be non-integer (fractional/decimal) values, and it provides finer control over the memory vs. accuracy trade-off. This is especially handy in using the format_version=5 Bloom filter in place of the old one, because bits_per_key=9.55 provides the same accuracy as the old bits_per_key=10. This change not only requires refining the logic for choosing the best num_probes for a given bits/key setting, it revealed a flaw in that logic. As bits/key gets higher, the best num_probes for a cache-local Bloom filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a standard Bloom filter. For example, at 16 bits per key, the best num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%). This change fixes and refines that logic (for the format_version=5 Bloom filter only, just in case) based on empirical tests to find accuracy inflection points between each num_probes. Although bits_per_key is now specified as a double, the new Bloom filter converts/rounds this to "millibits / key" for predictable/precise internal computations. Just in case of unforeseen compatibility issues, we round to the nearest whole number bits / key for the legacy Bloom filter, so as not to unlock new behaviors for it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092 Test Plan: unit tests included Differential Revision: D18711313 Pulled By: pdillinger fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a |
|
Peter Dillinger | 0306e01233 |
Fixes for g++ 4.9.2 compatibility (#6053)
Summary: Taken from merryChris in https://github.com/facebook/rocksdb/issues/6043 Stackoverflow ref on {{}} vs. {}: https://stackoverflow.com/questions/26947704/implicit-conversion-failure-from-initializer-list Note to reader: .clear() does not empty out an ostringstream, but .str("") suffices because we don't have to worry about clearing error flags. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6053 Test Plan: make check, manual run of filter_bench Differential Revision: D18602259 Pulled By: pdillinger fbshipit-source-id: f6190f83b8eab4e80e7c107348839edabe727841 |
|
Peter Dillinger | f059c7d9b9 |
New Bloom filter implementation for full and partitioned filters (#6007)
Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f |
|
Peter Dillinger | 18f57f5ef8 |
Add new persistent 64-bit hash (#5984)
Summary: For upcoming new SST filter implementations, we will use a new 64-bit hash function (XXH3 preview, slightly modified). This change updates hash.{h,cc} for that change, adds unit tests, and out-of-lines the implementations to keep hash.h as clean/small as possible. In developing the unit tests, I discovered that the XXH3 preview always returns zero for the empty string. Zero is problematic for some algorithms (including an upcoming SST filter implementation) if it occurs more often than at the "natural" rate, so it should not be returned from trivial values using trivial seeds. I modified our fork of XXH3 to return a modest hash of the seed for the empty string. With hash function details out-of-lines in hash.h, it makes sense to enable XXH_INLINE_ALL, so that direct calls to XXH64/XXH32/XXH3p are inlined. To fix array-bounds warnings on some inline calls, I injected some casts to uintptr_t in xxhash.cc. (Issue reported to Yann.) Revised: Reverted using XXH_INLINE_ALL for now. Some Facebook checks are unhappy about #include on xxhash.cc file. I would fix that by rename to xxhash_cc.h, but to best preserve history I want to do that in a separate commit (PR) from the uintptr casts. Also updated filter_bench for this change, improving the performance predictability of dry run hashing and adding support for 64-bit hash (for upcoming new SST filter implementations, minor dead code in the tool for now). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5984 Differential Revision: D18246567 Pulled By: pdillinger fbshipit-source-id: 6162fbf6381d63c8cc611dd7ec70e1ddc883fbb8 |
|
Peter Dillinger | 26dc29633e |
filter_bench not needed for ROCKSDB_LITE (#5978)
Summary: filter_bench is a specialized micro-benchmarking tool that should not be needed with ROCKSDB_LITE. This should fix the LITE build. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5978 Test Plan: make LITE=1 check Differential Revision: D18177941 Pulled By: pdillinger fbshipit-source-id: b73a171404661e09e018bc99afcf8d4bf1e2949c |
|
Peter Dillinger | 3f891c40a0 |
More improvements to filter_bench (#5968)
Summary: * Adds support for plain table filter. This is not critical right now, but does add a -impl flag that will be useful for new filter implementations initially targeted at block-based table (and maybe later ported to plain table) * Better mixing of inside vs. outside queries, for more realism * A -best_case option handy for implementation tuning inner loop * Option for whether to include hashing time in dry run / net timings No modifications to production code, just filter_bench. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5968 Differential Revision: D18139872 Pulled By: pdillinger fbshipit-source-id: 5b09eba963111b48f9e0525a706e9921070990e8 |
|
Peter Dillinger | 2837008525 |
Vary key size and alignment in filter_bench (#5933)
Summary: The first version of filter_bench has selectable key size but that size does not vary throughout a test run. This artificially favors "branchy" hash functions like the existing BloomHash, MurmurHash1, probably because of optimal return for branch prediction. This change primarily varies those key sizes from -2 to +2 bytes vs. the average selected size. We also set the default key size at 24 to better reflect our best guess of typical key size. But steadily random key sizes may not be realistic either. So this change introduces a new filter_bench option: -vary_key_size_log2_interval=n where the same key size is used 2^n times and then changes to another size. I've set the default at 5 (32 times same size) as a compromise between deployments with rather consistent vs. rather variable key sizes. On my Skylake system, the performance boost to MurmurHash1 largely lies between n=10 and n=15. Also added -vary_key_alignment (bool, now default=true), though this doesn't currently seem to matter in hash functions under consideration. This change also does a "dry run" for each testing scenario, to improve the accuracy of those numbers, as there was more difference between scenarios than expected. Subtracting gross test run times from dry run times is now also embedded in the output, because these "net" times are generally the most useful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5933 Differential Revision: D18121683 Pulled By: pdillinger fbshipit-source-id: 3c7efee1c5661a5fe43de555e786754ddf80dc1e |
|
Levi Tamasi | 29ccf2075c |
Store the filter bits reader alongside the filter block contents (#5936)
Summary: Amongst other things, PR https://github.com/facebook/rocksdb/issues/5504 refactored the filter block readers so that only the filter block contents are stored in the block cache (as opposed to the earlier design where the cache stored the filter block reader itself, leading to potentially dangling pointers and concurrency bugs). However, this change introduced a performance hit since with the new code, the metadata fields are re-parsed upon every access. This patch reunites the block contents with the filter bits reader to eliminate this overhead; since this is still a self-contained pure data object, it is safe to store it in the cache. (Note: this is similar to how the zstd digest is handled.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5936 Test Plan: make asan_check filter_bench results for the old code: ``` $ ./filter_bench -quick WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.7153 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 33.4258 Single filter ns/op: 42.5974 Random filter ns/op: 217.861 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.4217 Single filter ns/op: 50.9855 Random filter ns/op: 219.167 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -quick -use_full_block_reader WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.5172 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 32.3556 Single filter ns/op: 83.2239 Random filter ns/op: 370.676 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.2265 Single filter ns/op: 93.5651 Random filter ns/op: 408.393 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) ``` With the new code: ``` $ ./filter_bench -quick WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 25.4285 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 31.0594 Single filter ns/op: 43.8974 Random filter ns/op: 226.075 ---------------------------- Outside queries... Dry run (25d) ns/op: 31.0295 Single filter ns/op: 50.3824 Random filter ns/op: 226.805 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) $ ./filter_bench -quick -use_full_block_reader WARNING: Assertions are enabled; benchmarks unnecessarily slow Building... Build avg ns/key: 26.5308 Number of filters: 16669 Total memory (MB): 200.009 Bits/key actual: 10.0647 ---------------------------- Inside queries... Dry run (46b) ns/op: 33.2968 Single filter ns/op: 58.6163 Random filter ns/op: 291.434 ---------------------------- Outside queries... Dry run (25d) ns/op: 32.1839 Single filter ns/op: 66.9039 Random filter ns/op: 292.828 Average FP rate %: 1.13993 ---------------------------- Done. (For more info, run with -legend or -help.) ``` Differential Revision: D17991712 Pulled By: ltamasi fbshipit-source-id: 7ea205550217bfaaa1d5158ebd658e5832e60f29 |
|
Peter Dillinger | 90e285efde |
Fix some implicit conversions in filter_bench (#5894)
Summary: Fixed some spots where converting size_t or uint_fast32_t to uint32_t. Wrapped mt19937 in a new Random32 class to avoid future such traps. NB: I tried using Random32::Uniform (std::uniform_int_distribution) in filter_bench instead of fastrange, but that more than doubled the dry run time! So I added fastrange as Random32::Uniformish. ;) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5894 Test Plan: USE_CLANG=1 build, and manual re-run filter_bench Differential Revision: D17825131 Pulled By: pdillinger fbshipit-source-id: 68feee333b5f8193c084ded760e3d6679b405ecd |
|
Peter Dillinger | 46ca51d430 |
filter_bench - a prelim tool for SST filter benchmarking (#5825)
Summary: Example: using the tool before and after PR https://github.com/facebook/rocksdb/issues/5784 shows that the refactoring, presumed performance-neutral, actually sped up SST filters by about 3% to 8% (repeatable result): Before: - Dry run ns/op: 22.4725 - Single filter ns/op: 51.1078 - Random filter ns/op: 120.133 After: + Dry run ns/op: 22.2301 + Single filter run ns/op: 47.4313 + Random filter ns/op: 115.9 Only tests filters for the block-based table (full filters and partitioned filters - same implementation; not block-based filters), which seems to be the recommended format/implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5825 Differential Revision: D17804987 Pulled By: pdillinger fbshipit-source-id: 0f18a9c254c57f7866030d03e7fa4ba503bac3c5 |