rocksdb


Peter Dillinger 4d3518951a Option to decouple index and filter partitions (#12939 )

Summary: Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating the tail latencies and thrash conditions that can arise with the large metadata blocks (sometimes megabytes each) found in large SST files. In general, the cost of partitioned metadata is more CPU in accesses (especially for filters, where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads.

However, the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed.

Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair: keeping it in block cache costs proportionally less, and most of the cost of a miss to re-load it (random IO) is not proportional to its size (similar latency etc. up to ~32KB).

## This change

Adds a new table option decouple_partitioned_filters that allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder. This change includes refactoring to pass that previous key to the filter builder when available, with the filter builder caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still an overall improvement, especially with https://github.com/facebook/rocksdb/issues/12931.

Suggested follow-up:
* Update confusing use of "last key" to refer to "previous key"
* Expand unit test coverage with parallel compression and dictionary training
* Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939

Test Plan: unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature.

## SST write performance (CPU)

Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931, but with -decouple_partitioned_filters=1 in the "after" configuration (which benchmarking shows makes almost no difference in terms of SST write CPU), "after" vs. "before" this PR:

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
923691 vs. 924851 (-0.13%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
921398 vs. 922973 (-0.17%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
902259 vs. 908756 (-0.71%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
917932 vs. 916901 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
912755 vs. 907298 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
899754 vs. 892433 (+0.82%)
```

I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations.

## Read performance

Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with a larger metadata block size setting (8k instead of 4k).

Basic setup:

```
(for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 | tee results-$CS; done)
```

And read ops/s results:

```CSV
Cache size MB,After/decoupled/4k,Before/4k,Before/8k
3,15593,15158,12826
6,16295,16693,14134
10,20427,20813,18459
20,27035,26836,27384
30,33250,31810,33846
60,35518,32585,35329
100,36612,31805,35292
300,35780,31492,35481
1000,34145,31551,35411
1100,35219,31380,34302
1200,35060,31037,34322
```

If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: it handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration.

Reviewed By: jowlyzhang

Differential Revision: D61376772

Pulled By: pdillinger

fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f

2024-08-16 15:34:31 -07:00
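The partition-cutting behavior described in the commit message above can be illustrated with a toy model. This is only a sketch under stated assumptions, not RocksDB code: `PartitionSizes`, `Simulate`, and the fixed per-key byte costs are hypothetical names and simplifications invented for illustration, and real index entries are not added per key. It contrasts cutting index and filter partitions together (whenever either reaches the target size) with cutting each one independently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only (hypothetical, not RocksDB's implementation): every key adds
// a fixed number of bytes to the current index partition and to the current
// filter partition.
struct PartitionSizes {
  std::vector<std::size_t> index;   // final size in bytes of each index partition
  std::vector<std::size_t> filter;  // final size in bytes of each filter partition
};

PartitionSizes Simulate(std::size_t num_keys, std::size_t index_bytes_per_key,
                        std::size_t filter_bytes_per_key,
                        std::size_t metadata_block_size, bool decoupled) {
  assert(metadata_block_size > 0);
  PartitionSizes out;
  std::size_t index_cur = 0;
  std::size_t filter_cur = 0;
  for (std::size_t i = 0; i < num_keys; ++i) {
    index_cur += index_bytes_per_key;
    filter_cur += filter_bytes_per_key;
    const bool index_full = index_cur >= metadata_block_size;
    const bool filter_full = filter_cur >= metadata_block_size;
    if (decoupled) {
      // Each kind of partition is cut only when it reaches the target itself.
      if (index_full) {
        out.index.push_back(index_cur);
        index_cur = 0;
      }
      if (filter_full) {
        out.filter.push_back(filter_cur);
        filter_cur = 0;
      }
    } else if (index_full || filter_full) {
      // Coupled: whichever reaches the target first forces both to be cut.
      out.index.push_back(index_cur);
      out.filter.push_back(filter_cur);
      index_cur = 0;
      filter_cur = 0;
    }
  }
  // Flush whatever is left at the end of the (simulated) file.
  if (index_cur > 0) out.index.push_back(index_cur);
  if (filter_cur > 0) out.filter.push_back(filter_cur);
  return out;
}
```

With assumed inputs of 16384 keys, 1 index byte and 4 filter bytes per key, and a 4096-byte target, `Simulate(16384, 1, 4, 4096, false)` produces 16 index partitions of 1024 bytes each, while passing `true` produces 4 index partitions of 4096 bytes each. The filter partitions are 16 of 4096 bytes in both cases, which mirrors the under-sized index partitions the commit message describes for the coupled scheme.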
.circleci	Enable io_uring in stress test (#12313 )	2024-01-31 12:37:42 -08:00
.github	Attempt to fix the nightly build-linux-clang-13-asan-ubsan-with-folly build	2024-08-01 13:29:56 -07:00
buckifier	add export_file to rockdb TARGETS generator and re-gen	2024-05-25 17:10:12 -07:00
build_tools	Fix folly build (#12795 )	2024-06-22 15:15:02 -07:00
cache	Support pro-actively erasing obsolete block cache entries (#12694 )	2024-06-07 08:57:11 -07:00
cmake	Fix zstd typo in cmake (#12309 )	2024-02-22 14:39:05 -08:00
coverage	Remove platform009 and default to platform010 (#11333 )	2023-03-30 09:56:37 -07:00
db	Option to decouple index and filter partitions (#12939 )	2024-08-16 15:34:31 -07:00
db_stress_tool	Option to decouple index and filter partitions (#12939 )	2024-08-16 15:34:31 -07:00
docs	Java FFI blog post - Post-publication issues with images (2) (#12372 )	2024-02-22 15:01:55 -08:00
env	Add some documentation for Env related interfaces (#12813 )	2024-06-28 18:56:40 -07:00
examples	Prefer static_cast in place of most reinterpret_cast (#12308 )	2024-02-07 10:44:11 -08:00
file	Fix file deletions in DestroyDB not rate limited (#12891 )	2024-08-02 19:31:55 -07:00
fuzz	Block per key-value checksum (#11287 )	2023-04-25 12:08:23 -07:00
include/rocksdb	Option to decouple index and filter partitions (#12939 )	2024-08-16 15:34:31 -07:00
java	Add ticker stats for read corruption retries (#12923 )	2024-08-12 15:32:07 -07:00
logging	Fix data race in AutoRollLogger (#12436 )	2024-03-14 14:28:33 -07:00
memory	Set optimize_filters_for_memory by default (#12377 )	2024-04-30 08:33:31 -07:00
memtable	Prefer static_cast in place of most reinterpret_cast (#12308 )	2024-02-07 10:44:11 -08:00
microbench	internal_repo_rocksdb (-8794174668376270091) (#12114 )	2023-12-01 11:10:30 -08:00
monitoring	Add ticker stats for read corruption retries (#12923 )	2024-08-12 15:32:07 -07:00
options	Option to decouple index and filter partitions (#12939 )	2024-08-16 15:34:31 -07:00
plugin	Add initial CMake support to plugin (#9214 )	2021-11-30 17:16:53 -08:00
port	Fix CondVar::TimedWait for Windows (#12815 )	2024-07-08 21:38:21 -07:00
table	Option to decouple index and filter partitions (#12939 )	2024-08-16 15:34:31 -07:00
test_util	Remove redundant no_io parameters to filter functions (#12762 )	2024-06-12 18:47:11 -07:00
third-party	fix optimization-disabled test builds with platform010 (#11361 )	2023-04-10 13:59:44 -07:00
tools	Option to decouple index and filter partitions (#12939 )	2024-08-16 15:34:31 -07:00
trace_replay	Remove 'virtual' when implied by 'override' (#12319 )	2024-01-31 13:14:42 -08:00
unreleased_history	Option to decouple index and filter partitions (#12939 )	2024-08-16 15:34:31 -07:00
util	fix the non initialized bug in StderrLogger. (#12839 )	2024-07-08 15:59:02 -07:00
utilities	Do not add unprep_seqs when WriteImpl() fails in unprepared txn (#12927 )	2024-08-15 09:16:29 -07:00
.clang-format	A script that automatically reformat affected lines	2014-01-14 12:21:24 -08:00
.gitignore	add gtags files ignore (#12747 )	2024-06-12 21:46:40 -07:00
.lgtm.yml	Create lgtm.yml for LGTM.com C/C++ analysis (#4058 )	2018-06-26 12:43:04 -07:00
.watchmanconfig	Added .watchmanconfig file to rocksdb repo (#5593 )	2019-07-19 15:00:33 -07:00
AUTHORS	Update RocksDB Authors File	2017-10-18 14:42:10 -07:00
CMakeLists.txt	Fix folly build (#12795 )	2024-06-22 15:15:02 -07:00
CODE_OF_CONDUCT.md	Adopt Contributor Covenant	2019-08-29 23:21:01 -07:00
CONTRIBUTING.md	Add Code of Conduct	2017-12-05 18:42:35 -08:00
COPYING	Add GPLv2 as an alternative license.	2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md	Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363 )	2022-01-18 17:31:03 -08:00
DUMP_FORMAT.md	First version of rocksdb_dump and rocksdb_undump.	2015-06-19 16:24:36 -07:00
HISTORY.md	Update history and version for 9.5.fb release (#12880 )	2024-07-22 13:15:09 -07:00
INSTALL.md	fix out of date macos instructions in INSTALL.md (#12393 )	2024-02-28 12:38:15 -08:00
LANGUAGE-BINDINGS.md	Add grocksdb in Go language bindings (#10498 )	2022-08-23 15:02:10 -07:00
LICENSE.Apache	Change RocksDB License	2017-07-15 16:11:23 -07:00
LICENSE.leveldb	Add back the LevelDB license file	2017-07-16 18:42:18 -07:00
Makefile	Update snappy dependency for Java releases. (#12207 )	2024-07-05 09:30:28 -07:00
PLUGINS.md	Add encfs plugin link (#12070 )	2023-11-14 07:33:21 -08:00
README.md	Remove deprecated integration tests from README.md (#11354 )	2023-04-07 16:52:50 -07:00
TARGETS	Add experimental range filters to stress/crash test (#12769 )	2024-06-18 16:16:09 -07:00
USERS.md	Add Qdrant to USERS.md (#12072 )	2023-11-16 10:35:08 -08:00
Vagrantfile	Adding CentOS 7 Vagrantfile & build script	2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md	Update branch name in WINDOWS_PORT.md (#8745 )	2021-09-01 19:26:39 -07:00
common.mk	Clean up variables for temporary directory (#9961 )	2022-05-06 16:38:06 -07:00
crash_test.mk	Stress/Crash Test for OptimisticTransactionDB (#11513 )	2023-06-17 16:27:37 -07:00
issue_template.md	Add Google Group to Issue Template	2020-01-28 14:40:37 -08:00
rocksdb.pc.in	build: fix pkg-config file generation (#9953 )	2022-05-30 12:46:40 -07:00
src.mk	Fix folly build (#12795 )	2024-06-22 15:15:02 -07:00
thirdparty.inc	Fix build jemalloc api (#5470 )	2019-06-24 17:40:32 -07:00

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.