mirror of
https://github.com/facebook/rocksdb.git
synced 2024-11-26 16:30:56 +00:00
7103559f49
Summary: This PR extends the improvements in #3282 to also work when using Direct IO. We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash. **Description:** This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. **Implementation Details:** - Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead. - `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled. - `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer. - Made sure not to re-read partial chunks of data that were already available in the buffer, from device again. - Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date. **Constraints:** - Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). - Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously. - Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them. **Benchmarks:** I used the same benchmark as used in #3282. Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` ``` Before: seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found) With this change: seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found) ``` ~4.5X perf improvement. Taken on an average of 3 runs. Closes https://github.com/facebook/rocksdb/pull/3884 Differential Revision: D8082143 Pulled By: sagar0 fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb |
||
---|---|---|
.. | ||
adaptive_table_factory.cc | ||
adaptive_table_factory.h | ||
block.cc | ||
block.h | ||
block_based_filter_block.cc | ||
block_based_filter_block.h | ||
block_based_filter_block_test.cc | ||
block_based_table_builder.cc | ||
block_based_table_builder.h | ||
block_based_table_factory.cc | ||
block_based_table_factory.h | ||
block_based_table_reader.cc | ||
block_based_table_reader.h | ||
block_builder.cc | ||
block_builder.h | ||
block_fetcher.cc | ||
block_fetcher.h | ||
block_prefix_index.cc | ||
block_prefix_index.h | ||
block_test.cc | ||
bloom_block.cc | ||
bloom_block.h | ||
cleanable_test.cc | ||
cuckoo_table_builder.cc | ||
cuckoo_table_builder.h | ||
cuckoo_table_builder_test.cc | ||
cuckoo_table_factory.cc | ||
cuckoo_table_factory.h | ||
cuckoo_table_reader.cc | ||
cuckoo_table_reader.h | ||
cuckoo_table_reader_test.cc | ||
filter_block.h | ||
flush_block_policy.cc | ||
format.cc | ||
format.h | ||
full_filter_bits_builder.h | ||
full_filter_block.cc | ||
full_filter_block.h | ||
full_filter_block_test.cc | ||
get_context.cc | ||
get_context.h | ||
index_builder.cc | ||
index_builder.h | ||
internal_iterator.h | ||
iter_heap.h | ||
iterator.cc | ||
iterator_wrapper.h | ||
merger_test.cc | ||
merging_iterator.cc | ||
merging_iterator.h | ||
meta_blocks.cc | ||
meta_blocks.h | ||
mock_table.cc | ||
mock_table.h | ||
partitioned_filter_block.cc | ||
partitioned_filter_block.h | ||
partitioned_filter_block_test.cc | ||
persistent_cache_helper.cc | ||
persistent_cache_helper.h | ||
persistent_cache_options.h | ||
plain_table_builder.cc | ||
plain_table_builder.h | ||
plain_table_factory.cc | ||
plain_table_factory.h | ||
plain_table_index.cc | ||
plain_table_index.h | ||
plain_table_key_coding.cc | ||
plain_table_key_coding.h | ||
plain_table_reader.cc | ||
plain_table_reader.h | ||
scoped_arena_iterator.h | ||
sst_file_writer.cc | ||
sst_file_writer_collectors.h | ||
table_builder.h | ||
table_properties.cc | ||
table_properties_internal.h | ||
table_reader.h | ||
table_reader_bench.cc | ||
table_test.cc | ||
two_level_iterator.cc | ||
two_level_iterator.h |