Changyu Bi f291eefb02 Cache fragmented range tombstone list for mutable memtables (#10547)
Summary:
Each read from a memtable used to read and fragment all of its range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved this inefficiency by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone list can be constructed in either the read path (this PR) or the write path (https://github.com/facebook/rocksdb/issues/10584). With both implementations, each `DeleteRange()` invalidates the cache; the difference is where the cache is re-constructed. `CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core has a shared_ptr to a shared_ptr pointing to the current cache. Each read thread only updates the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.
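
The core-local trick can be illustrated with a short, self-contained C++ sketch. Everything here (`PerCoreCache`, `Cache`, `Ref`) is hypothetical naming, not the PR's actual code, which uses `CoreLocalArray` and differs in detail; the sketch only shows how the shared_ptr-to-shared_ptr layout keeps reference-count updates core-local:
```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for the real fragmented tombstone list type.
struct FragmentedRangeTombstoneList {};

struct Cache {
  std::unique_ptr<FragmentedRangeTombstoneList> tombstones;
};

class PerCoreCache {
 public:
  explicit PerCoreCache(size_t num_cores) : slots_(num_cores) { Reset(); }

  // Writer path: DeleteRange() invalidates by swapping in a fresh, empty
  // cache object. Each core gets its own outer control block (`new_ref`);
  // the aliasing constructor makes the stored pointer refer to the shared
  // Cache while ownership tracks the per-core outer shared_ptr.
  void Reset() {
    auto new_cache = std::make_shared<Cache>();
    for (auto& slot : slots_) {
      auto new_ref = std::make_shared<std::shared_ptr<Cache>>(new_cache);
      std::atomic_store_explicit(
          &slot, std::shared_ptr<Cache>(new_ref, new_cache.get()),
          std::memory_order_relaxed);
    }
  }

  // Reader path: copying this core's slot bumps a refcount in a control
  // block private to this core, so readers on different cores do not
  // contend on one shared refcount cache line.
  std::shared_ptr<Cache> Ref(size_t core_id) {
    return std::atomic_load_explicit(&slots_[core_id % slots_.size()],
                                     std::memory_order_relaxed);
  }

 private:
  // The real CoreLocalArray also pads/aligns slots per core; omitted here
  // for brevity.
  std::vector<std::shared_ptr<Cache>> slots_;
};
```
The aliasing constructor is what lets each core hold a distinct control block while all cores still point at one shared cache object; invalidation then only needs to install fresh outer pointers.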

The choice between the write path and the read path is not an easy one: both are improvements over the lack of caching in the current implementation, but each favors one operation and could regress the other (read vs. write). The write path caching in https://github.com/facebook/rocksdb/issues/10584 leads to a cleaner implementation, but I chose the read path caching here to avoid a significant regression in write performance when there is a considerable number of range tombstones in a single memtable (the numbers from the benchmark below suggest >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, that construction could block other writes from proceeding, and hence affects overall write performance; read path caching instead shifts the cost onto the first read after an invalidation, as sketched below.
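
For concreteness, here is a minimal sketch of the read-path (lazy) rebuild chosen in this PR, assuming a hypothetical `Cache` that guards the list with a reader-side mutex (the actual RocksDB types and locking differ):
```cpp
#include <memory>
#include <mutex>

struct FragmentedRangeTombstoneList {};

struct Cache {
  std::mutex reader_mutex;
  std::unique_ptr<FragmentedRangeTombstoneList> tombstones;
};

// Stand-in for fragmenting the memtable's raw range tombstones.
std::unique_ptr<FragmentedRangeTombstoneList> BuildFragmentedList() {
  return std::make_unique<FragmentedRangeTombstoneList>();
}

const FragmentedRangeTombstoneList* GetOrBuild(Cache& cache) {
  // The first reader after a DeleteRange() pays the fragmentation cost;
  // subsequent readers hit the cache. Writers never wait here, which is
  // why read-path caching avoids the write regression seen with many
  // tombstones in one memtable.
  std::lock_guard<std::mutex> guard(cache.reader_mutex);
  if (cache.tombstones == nullptr) {
    cache.tombstones = BuildFragmentedList();
  }
  return cache.tombstones.get();
}
```
Write-path caching would instead perform this rebuild inside `DeleteRange()` itself, which is cleaner but puts the fragmentation work on the write path.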

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547

Test Plan:
- TestGet() in the stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare the Get() result against the expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
```
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
```
With write path caching, write perf regressed since the cost of constructing the fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see the performance impact on multi-threaded reads if `CoreLocalArray` is not used.

micros/op averaged over 5 runs: the first 4 columns are for fillrandom, the last 4 for readrandom; rows are the number of range tombstones per memtable.
| # range tombstones | fillrandom main | write path caching | read path caching | memtable V3 (https://github.com/facebook/rocksdb/issues/10308) | readrandom main | write path caching | read path caching | memtable V3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.35 | 6.15 | 5.82 | 6.12 | 2.24 | 2.26 | 2.03 | 2.07 |
| 1 | 5.99 | 5.88 | 5.77 | 6.28 | 2.65 | 2.27 | 2.24 | 2.5 |
| 10 | 6.15 | 6.02 | 5.92 | 5.95 | 5.15 | 2.61 | 2.31 | 2.53 |
| 100 | 5.95 | 5.78 | 5.88 | 6.23 | 28.31 | 2.34 | 2.45 | 2.94 |
| 100, 25 threads | 52.01 | 45.85 | 46.18 | 47.52 | 35.97 | 3.34 | 3.34 | 3.56 |
| 1000 | 6.0 | 7.07 | 5.98 | 6.08 | 333.18 | 2.86 | 2.7 | 3.6 |
| 1000, 25 threads | 52.6 | 148.86 | 79.06 | 45.52 | 473.49 | 3.66 | 3.48 | 4.38 |

- Benchmark performance of `readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes`

readrandom micros/op:
| concurrency | main | write path caching | read path caching | memtable V3 |
|---|---|---|---|---|
| single thread | 48.28 | 1.55 | 1.52 | 1.96 |
| 25 threads | 64.3 | 2.55 | 2.67 | 2.64 |

Reviewed By: ajkr

Differential Revision: D38895410

Pulled By: cbi42

fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
2022-09-13 20:07:28 -07:00
utilities Option migration tool to break down files for FIFO compaction (#10600) 2022-08-31 12:08:23 -07:00
advanced_options.h Add memtable per key-value checksum (#10281) 2022-08-12 13:51:32 -07:00
c.h Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449) 2022-09-02 09:51:19 -07:00
cache.h Avoid recompressing cold block in CompressedSecondaryCache (#10527) 2022-09-07 19:00:27 -07:00
cache_bench_tool.h Allow cache_bench/db_bench to use a custom secondary cache (#8312) 2021-05-19 15:26:18 -07:00
cleanable.h Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
compaction_filter.h Rename kRemoveWithSingleDelete to kPurge (#9951) 2022-05-05 08:16:20 -07:00
compaction_job_stats.h Tiered Compaction: per key placement support (#9964) 2022-07-13 20:54:49 -07:00
comparator.h Make InternalKeyComparator not configurable (#10342) 2022-07-14 10:09:31 -07:00
compression_type.h Move CompressionType to its own header file (#7162) 2020-08-03 15:49:31 -07:00
concurrent_task_limiter.h Some API clarifications (#9080) 2021-11-02 20:30:07 -07:00
configurable.h Improve performance of SliceTransform::AsString (#9401) 2022-01-27 10:05:33 -08:00
convenience.h Specify largest_seqno in VerifyChecksum (#9919) 2022-05-02 10:22:08 -07:00
customizable.h Mark destructors as override (#9404) 2022-01-20 08:44:27 -08:00
data_structure.h Add (Live)FileStorageInfo API (#8968) 2021-10-16 10:04:32 -07:00
db.h Cache fragmented range tombstone list for mutable memtables (#10547) 2022-09-13 20:07:28 -07:00
db_bench_tool.h
db_dump_tool.h
db_stress_tool.h
env.h Support reservation in thread pool (#10278) 2022-07-08 19:48:09 -07:00
env_encryption.h Some API clarifications (#9080) 2021-11-02 20:30:07 -07:00
experimental.h Add manifest fix-up utility for file temperatures (#9683) 2022-03-18 16:35:51 -07:00
file_checksum.h Mark destructors as override (#9404) 2022-01-20 08:44:27 -08:00
file_system.h Use EnvLogger instead of PosixLogger (#10436) 2022-08-01 14:37:18 -07:00
filter_policy.h Fix a major performance bug in 7.0 re: filter compatibility (#9736) 2022-03-23 10:00:54 -07:00
flush_block_policy.h Some API clarifications (#9080) 2021-11-02 20:30:07 -07:00
functor_wrapper.h Fix and detect headers with missing dependencies (#8893) 2021-09-10 10:00:26 -07:00
io_status.h Implement AbortIO using io_uring (#10125) 2022-06-13 18:07:24 -07:00
iostats_context.h Use EnvLogger instead of PosixLogger (#10436) 2022-08-01 14:37:18 -07:00
iterator.h Fix a few documentation errors including in public APIs (#9789) 2022-04-01 10:30:17 -07:00
ldb_tool.h
listener.h Add temperature information to the event listener callbacks (#9591) 2022-02-18 11:23:18 -08:00
memory_allocator.h Make MemoryAllocator into a Customizable class (#8980) 2021-12-17 04:20:47 -08:00
memtablerep.h Added GetFactoryCount/Names/Types to ObjectRegistry (#9358) 2022-05-16 09:44:43 -07:00
merge_operator.h Fix compile warnings (#9199) 2021-11-24 11:19:06 -08:00
metadata.h Pass the size of blob files to SstFileManager during DB open (#10062) 2022-05-27 05:58:43 -07:00
options.h Always verify SST unique IDs on SST file open (#10532) 2022-09-07 22:52:42 -07:00
perf_context.h Add PerfContext counters for CompressedSecondaryCache (#10650) 2022-09-08 16:35:57 -07:00
perf_level.h
persistent_cache.h Check for and disallow shared key space in block caches (#9172) 2021-11-16 11:16:05 -08:00
rate_limiter.h Make RateLimiter not Customizable (#10378) 2022-07-18 14:48:42 -07:00
rocksdb_namespace.h Fix and detect headers with missing dependencies (#8893) 2021-09-10 10:00:26 -07:00
secondary_cache.h Avoid recompressing cold block in CompressedSecondaryCache (#10527) 2022-09-07 19:00:27 -07:00
slice.h Avoid allocations/copies for large GetMergeOperands() results (#10458) 2022-08-04 00:42:13 -07:00
slice_transform.h Document design/specification bugs with auto_prefix_mode (#10144) 2022-06-13 11:08:50 -07:00
snapshot.h Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
sst_dump_tool.h Add --version and --help to ldb and sst_dump (#6951) 2020-06-09 10:04:01 -07:00
sst_file_manager.h Some API clarifications (#9080) 2021-11-02 20:30:07 -07:00
sst_file_reader.h
sst_file_writer.h Support timestamps in SstFileWriter (#8899) 2021-09-09 18:58:01 -07:00
sst_partitioner.h Mark destructors as override (#9404) 2022-01-20 08:44:27 -08:00
statistics.h Update statistics for async scan readaheads (#10585) 2022-08-29 14:37:44 -07:00
stats_history.h More refactoring ahead of footer & meta changes (#9240) 2021-12-10 08:13:26 -08:00
status.h Migrate to docker for CI run (#10496) 2022-08-10 17:34:38 -07:00
system_clock.h Fix compile warnings (#9199) 2021-11-24 11:19:06 -08:00
table.h Add new option num_file_reads_for_auto_readahead in BlockBasedTableOptions (#10556) 2022-09-01 11:56:00 -07:00
table_properties.h Add seqno to time mapping (#10338) 2022-07-14 21:49:34 -07:00
thread_status.h Remove ROCKSDB_SUPPORT_THREAD_LOCAL define because it's a part of C++11 (#10015) 2022-05-18 15:25:19 -07:00
threadpool.h Support reservation in thread pool (#10278) 2022-07-08 19:48:09 -07:00
trace_reader_writer.h Update comments, fix typos. (#8721) 2021-08-27 13:16:32 -07:00
trace_record.h Refactor: Add BlockTypes to make them imply C++ type in block cache (#10098) 2022-06-06 11:16:12 -07:00
trace_record_result.h Add IteratorTraceExecutionResult for iterator related trace records. (#8687) 2021-08-20 15:35:56 -07:00
transaction_log.h Replace most typedef with using= (#8751) 2021-09-07 11:31:59 -07:00
types.h Add new value type for wide-column entities (#10211) 2022-06-20 18:04:08 -07:00
unique_id.h Adjust public APIs to prefer 128-bit SST unique ID (#10009) 2022-05-17 18:43:48 -07:00
universal_compaction.h Incremental Space Amp Compactions in Universal Style (#8655) 2021-10-20 10:04:13 -07:00
version.h Post 7.6 branch cut changes (#10546) 2022-08-21 20:42:12 -07:00
wal_filter.h Fix compile warnings (#9199) 2021-11-24 11:19:06 -08:00
wide_columns.h Add support for wide-column point lookups (#10540) 2022-08-19 11:51:12 -07:00
write_batch.h WriteBatch reorder fields to reduce padding (#10266) 2022-06-29 13:02:48 -07:00
write_batch_base.h Add API for writing wide-column entities (#10242) 2022-06-25 15:30:47 -07:00
write_buffer_manager.h Account memory of big memory users in BlockBasedTable in global memory limit (#9748) 2022-04-06 10:33:00 -07:00