rocksdb/db
Mike Kolupaev b4d7209428 Add an option to put first key of each sst block in the index (#5289)
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.

This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289

Differential Revision: D15256423

Pulled By: al13n321

fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
2019-06-24 20:54:04 -07:00
..
compaction Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
db_impl Fix ingested file and direcotry not being sync (#5435) 2019-06-21 10:15:38 -07:00
builder.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
builder.h Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
c.cc Unordered Writes (#5218) 2019-05-13 17:47:21 -07:00
c_test.c add missing rocksdb_flush_cf in c (#5243) 2019-04-25 11:25:43 -07:00
column_family.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
column_family.h Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
column_family_test.cc Make format 2019-05-31 15:24:43 -07:00
compact_files_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
compacted_db_impl.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
compacted_db_impl.h Make format 2019-05-31 15:24:43 -07:00
comparator_db_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
convenience.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
corruption_test.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
cuckoo_table_db_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
db_basic_test.cc Add support for timestamp in Get/Put (#5079) 2019-06-05 23:10:47 -07:00
db_blob_index_test.cc fix lite build 2017-10-17 08:57:09 -07:00
db_block_cache_test.cc Move the index readers out of the block cache (#5298) 2019-05-30 11:53:27 -07:00
db_bloom_filter_test.cc Unordered Writes (#5218) 2019-05-13 17:47:21 -07:00
db_compaction_filter_test.cc Apply modernize-use-override (2nd iteration) 2019-02-14 14:41:36 -08:00
db_compaction_test.cc Combine the read-ahead logic for user reads and compaction reads (#5431) 2019-06-19 14:10:46 -07:00
db_dynamic_level_test.cc Fix flaky DBDynamicLevelTest.DynamicLevelMaxBytesBase2 (#4668) 2018-11-12 16:42:16 -08:00
db_encryption_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
db_filesnapshot.cc Add missing check before calling PurgeObsoleteFiles in EnableFileDeletions (#5448) 2019-06-13 14:43:13 -07:00
db_flush_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
db_info_dumper.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
db_info_dumper.h Change RocksDB License 2017-07-15 16:11:23 -07:00
db_inplace_update_test.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
db_io_failure_test.cc Disable DBIOFailureTest.NoSpaceCompactRange in LITE (#4596) 2018-10-29 14:36:31 -07:00
db_iter.cc Revert "Reduce iterator key comparison for upper/lower bound check (#5111)" (#5440) 2019-06-11 16:23:41 -07:00
db_iter.h Make format 2019-05-31 15:24:43 -07:00
db_iter_stress_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
db_iter_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
db_iterator_test.cc Add an option to put first key of each sst block in the index (#5289) 2019-06-24 20:54:04 -07:00
db_log_iter_test.cc Apply modernize-use-override (2nd iteration) 2019-02-14 14:41:36 -08:00
db_memtable_test.cc Fix tsan complaint in ConcurrentMergeWrite test (#5308) 2019-05-15 11:21:48 -07:00
db_merge_operator_test.cc WriteUnPrepared: less virtual in iterator callback (#5049) 2019-04-02 14:47:16 -07:00
db_options_test.cc fix rocksdb lite and clang contrun test failures (#5477) 2019-06-17 21:16:29 -07:00
db_properties_test.cc Deprecate ttl option from CompactionOptionsFIFO (#4965) 2019-02-15 09:51:41 -08:00
db_range_del_test.cc Fix merging range tombstone covering put during flush/compaction (#5406) 2019-06-04 10:24:14 -07:00
db_sst_test.cc Increase Trash/DB size ratio in DBSSTTest.RateLimitedWALDelete (#5366) 2019-05-30 11:12:59 -07:00
db_statistics_test.cc Make statistics's stats_level change thread-safe (#5030) 2019-03-01 10:42:09 -08:00
db_table_properties_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
db_tailing_iter_test.cc Remove managed iterator 2018-07-17 14:43:18 -07:00
db_test.cc sanitize and limit block_size under 4GB (#5492) 2019-06-20 11:45:08 -07:00
db_test2.cc Fix flaky DBTest2.PresetCompressionDict test (#5378) 2019-05-30 16:11:27 -07:00
db_test_util.cc Unordered Writes (#5218) 2019-05-13 17:47:21 -07:00
db_test_util.h simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
db_universal_compaction_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
db_wal_test.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
db_write_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
dbformat.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
dbformat.h Add support for timestamp in Get/Put (#5079) 2019-06-05 23:10:47 -07:00
dbformat_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
deletefile_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
error_handler.cc Make format 2019-05-31 15:24:43 -07:00
error_handler.h Fix typos in comments (#4456) 2018-10-04 20:46:50 -07:00
error_handler_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
event_helpers.cc Log file_creation_time table property (#5232) 2019-04-22 15:30:07 -07:00
event_helpers.h Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
experimental.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
external_sst_file_basic_test.cc Fix ingested file and direcotry not being sync (#5435) 2019-06-21 10:15:38 -07:00
external_sst_file_ingestion_job.cc Fix ingested file and direcotry not being sync (#5435) 2019-06-21 10:15:38 -07:00
external_sst_file_ingestion_job.h Fix ingested file and direcotry not being sync (#5435) 2019-06-21 10:15:38 -07:00
external_sst_file_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
fault_injection_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
file_indexer.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
file_indexer.h Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
file_indexer_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
filename_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
flush_job.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
flush_job.h Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
flush_job_test.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
flush_scheduler.cc Remove global locks from FlushScheduler (#5372) 2019-06-10 16:50:26 -07:00
flush_scheduler.h Unordered Writes (#5218) 2019-05-13 17:47:21 -07:00
forward_iterator.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
forward_iterator.h Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
forward_iterator_bench.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
internal_stats.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
internal_stats.h Collect compaction stats by priority and dump to info LOG (#5050) 2019-03-19 17:28:19 -07:00
job_context.h WritePrepared: Fix visible key compacted out by compaction (#4883) 2019-01-15 21:34:38 -08:00
listener_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
log_format.h Fix an inaccurate comment (#4315) 2018-08-24 18:13:20 -07:00
log_reader.cc Support for single-primary, multi-secondary instances (#4899) 2019-03-26 16:45:31 -07:00
log_reader.h secondary instance: add support for WAL tailing on `OpenAsSecondary` 2019-04-24 12:08:44 -07:00
log_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
log_writer.cc LogWriter to only flush after finish generating whole record (#5328) 2019-05-21 12:33:17 -07:00
log_writer.h Close WAL files before deletion (#5233) 2019-04-25 10:11:41 -07:00
logs_with_prep_tracker.cc Skip deleted WALs during recovery 2018-05-03 15:43:09 -07:00
logs_with_prep_tracker.h Skip deleted WALs during recovery 2018-05-03 15:43:09 -07:00
lookup_key.h Introduce a new MultiGet batching implementation (#5011) 2019-04-11 14:28:26 -07:00
malloc_stats.cc Detect if Jemalloc is linked with the binary (#4844) 2019-01-03 16:30:12 -08:00
malloc_stats.h Change RocksDB License 2017-07-15 16:11:23 -07:00
manual_compaction_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
memtable.cc Add support for timestamp in Get/Put (#5079) 2019-06-05 23:10:47 -07:00
memtable.h Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
memtable_list.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
memtable_list.h Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
memtable_list_test.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
merge_context.h Introduce a new MultiGet batching implementation (#5011) 2019-04-11 14:28:26 -07:00
merge_helper.cc Fix merging range tombstone covering put during flush/compaction (#5406) 2019-06-04 10:24:14 -07:00
merge_helper.h Remove v1 RangeDelAggregator (#4778) 2018-12-17 17:33:46 -08:00
merge_helper_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
merge_operator.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
merge_test.cc Make format 2019-05-31 15:24:43 -07:00
obsolete_files_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
options_file_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
perf_context_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
pinned_iterators_manager.h Change RocksDB License 2017-07-15 16:11:23 -07:00
plain_table_db_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
pre_release_callback.h WritePrepared: reduce prepared_mutex_ overhead (#5420) 2019-06-10 11:53:31 -07:00
prefix_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
range_del_aggregator.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
range_del_aggregator.h Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
range_del_aggregator_bench.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
range_del_aggregator_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
range_tombstone_fragmenter.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
range_tombstone_fragmenter.h Add compaction logic to RangeDelAggregatorV2 (#4758) 2018-12-17 13:20:51 -08:00
range_tombstone_fragmenter_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
read_callback.h WritePrepared: fix race condition in reading batch with duplicate keys (#5147) 2019-04-12 14:40:41 -07:00
repair.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
repair_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
snapshot_checker.h WritePrepared: fix issue with snapshot released during compaction (#4858) 2019-01-16 09:55:32 -08:00
snapshot_impl.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
snapshot_impl.h Refresh snapshot list during long compactions (2nd attempt) (#5278) 2019-05-03 17:30:22 -07:00
table_cache.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
table_cache.h Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
table_properties_collector.cc Feature for sampling and reporting compressibility (#4842) 2019-03-18 12:15:34 -07:00
table_properties_collector.h Feature for sampling and reporting compressibility (#4842) 2019-03-18 12:15:34 -07:00
table_properties_collector_test.cc Organizing rocksdb/db directory 2019-05-31 11:57:01 -07:00
transaction_log_impl.cc Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474) 2019-06-19 08:10:08 -07:00
transaction_log_impl.h Move some file related files outside util/ (#5375) 2019-05-29 20:47:06 -07:00
version_builder.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
version_builder.h Support for single-primary, multi-secondary instances (#4899) 2019-03-26 16:45:31 -07:00
version_builder_test.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
version_edit.cc Move some logging related files to logging/ (#5387) 2019-05-31 17:23:59 -07:00
version_edit.h Make RocksDB secondary instance respect atomic groups in version edits. (#5411) 2019-06-04 10:56:19 -07:00
version_edit_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
version_set.cc Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
version_set.h Add more callers for table reader. (#5454) 2019-06-20 14:31:48 -07:00
version_set_test.cc Integrate block cache tracer into db_impl (#5433) 2019-06-13 15:43:10 -07:00
wal_manager.cc simplify include directive involving inttypes (#5402) 2019-06-06 13:56:07 -07:00
wal_manager.h improve comment for WalManager (#5350) 2019-05-24 10:40:30 -07:00
wal_manager_test.cc Replace Corruption with TryAgain status when new tail is not visible to TransactionLogIterator (#5474) 2019-06-19 08:10:08 -07:00
write_batch.cc Make format 2019-05-31 15:24:43 -07:00
write_batch_base.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
write_batch_internal.h WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069) 2018-06-28 18:58:29 -07:00
write_batch_test.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
write_callback.h Change RocksDB License 2017-07-15 16:11:23 -07:00
write_callback_test.cc WritePrepared: reduce prepared_mutex_ overhead (#5420) 2019-06-10 11:53:31 -07:00
write_controller.cc Change RocksDB License 2017-07-15 16:11:23 -07:00
write_controller.h Change RocksDB License 2017-07-15 16:11:23 -07:00
write_controller_test.cc Move test related files under util/ to test_util/ (#5377) 2019-05-30 11:25:51 -07:00
write_thread.cc Move some memory related files from util/ to memory/ (#5382) 2019-05-30 17:44:09 -07:00
write_thread.h Fix skip WAL for whole write_group when leader's callback fail (#4838) 2019-01-03 12:40:42 -08:00