f291eefb02
Summary: Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved the inefficient here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either read (This PR) or write path (https://github.com/facebook/rocksdb/issues/10584). With both implementation, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed.`CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables. The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (https://github.com/facebook/rocksdb/issues/10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547 Test Plan: - TestGet() in stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones. ``` ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000 ``` Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used. micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom. | |fillrandom main | write path caching | read path caching |memtable V3 (https://github.com/facebook/rocksdb/issues/10308) | readrandom main | write path caching | read path caching |memtable V3 | |--- |--- |--- |--- |--- | --- | --- | --- | --- | | 0 |6.35 |6.15 |5.82 |6.12 |2.24 |2.26 |2.03 |2.07 | | 1 |5.99 |5.88 |5.77 |6.28 |2.65 |2.27 |2.24 |2.5 | | 10 |6.15 |6.02 |5.92 |5.95 |5.15 |2.61 |2.31 |2.53 | | 100 |5.95 |5.78 |5.88 |6.23 |28.31 |2.34 |2.45 |2.94 | | 100 25 threads |52.01 |45.85 |46.18 |47.52 |35.97 |3.34 |3.34 |3.56 | | 1000 |6.0 |7.07 |5.98 |6.08 |333.18 |2.86 |2.7 |3.6 | | 1000 25 threads |52.6 |148.86 |79.06 |45.52 |473.49 |3.66 |3.48 |4.38 | - Benchmark performance of`readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes` readrandom micros/op: | |main |write path caching |read path caching |memtable V3 | |---|---|---|---|---| | single thread |48.28 |1.55 |1.52 |1.96 | | 25 threads |64.3 |2.55 |2.67 |2.64 | Reviewed By: ajkr Differential Revision: D38895410 Pulled By: cbi42 fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559 |
||
---|---|---|
.circleci | ||
.github/workflows | ||
buckifier | ||
build_tools | ||
cache | ||
cmake | ||
coverage | ||
db | ||
db_stress_tool | ||
docs | ||
env | ||
examples | ||
file | ||
fuzz | ||
include/rocksdb | ||
java | ||
logging | ||
memory | ||
memtable | ||
microbench | ||
monitoring | ||
options | ||
plugin | ||
port | ||
table | ||
test_util | ||
third-party | ||
tools | ||
trace_replay | ||
util | ||
utilities | ||
.clang-format | ||
.gitignore | ||
.lgtm.yml | ||
.watchmanconfig | ||
AUTHORS | ||
CMakeLists.txt | ||
CODE_OF_CONDUCT.md | ||
common.mk | ||
CONTRIBUTING.md | ||
COPYING | ||
crash_test.mk | ||
DEFAULT_OPTIONS_HISTORY.md | ||
DUMP_FORMAT.md | ||
HISTORY.md | ||
INSTALL.md | ||
issue_template.md | ||
LANGUAGE-BINDINGS.md | ||
LICENSE.Apache | ||
LICENSE.leveldb | ||
Makefile | ||
PLUGINS.md | ||
README.md | ||
rocksdb.pc.in | ||
ROCKSDB_LITE.md | ||
src.mk | ||
TARGETS | ||
thirdparty.inc | ||
USERS.md | ||
Vagrantfile | ||
WINDOWS_PORT.md |
RocksDB: A Persistent Key-Value Store for Flash and RAM Storage
RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)
This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.
Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples
See the github wiki for more explanation.
The public interface is in include/
. Callers should not include or
rely on the details of any other header files in this package. Those
internal APIs may be changed without warning.
Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.
License
RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.