rocksdb/table/block_based
Hui Xiao 58fc9d61b7 Trim readahead size based on prefix during prefix scan (#13040)
Summary:
**Context/Summary:**
During prefix scan, prefetched data blocks containing keys not in the same prefix as the `Seek()`'s key will be wasted when `ReadOptions::prefix_same_as_start = true` since they won't be returned to the user. This PR is to exclude those data blocks from being prefetched in a similar manner like trimming according to `ReadOptions::iterate_upper_bound`.

Bonus: refactoring to some existing prefetch test so they are easier to extend and read

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13040

Test Plan:
- New UT, integration to existing UTs
- Benchmark to ensure no regression from CPU due to more trimming logic
```
// Build DB with one sorted run under the same prefix
./db_bench --benchmarks=fillrandom --prefix_size=3 --keys_per_prefix=5000000 --num=5000000 --db=/dev/shm/db_bench --disable_auto_compactions=1
```
```
// Augment the existing db bench to call `Seek()` instead of `SeekToFirst()` in `void ReadSequential(){..}` to trigger the logic in this PR

+++ b/tools/db_bench_tool.cc
@@ -5900,7 +5900,12 @@ class Benchmark {
     Iterator* iter = db->NewIterator(options);
     int64_t i = 0;
     int64_t bytes = 0;
-    for (iter->SeekToFirst(); i < reads_ && iter->Valid(); iter->Next()) {
+
+    iter->SeekToFirst();
+    assert(iter->status().ok() && iter->Valid());
+    auto prefix = prefix_extractor_->Transform(iter->key());
+
+    for (iter->Seek(prefix); i < reads_ && iter->Valid(); iter->Next()) {
       bytes += iter->key().size() + iter->value().size();
       thread->stats.FinishedOps(nullptr, db, 1, kRead);
       ++i;
:
```
```
// Compare prefix scan performance
./db_bench --benchmarks=readseq[-X20] --prefix_size=3  --prefix_same_as_start=1 --auto_readahead_size=1 --cache_size=1 --use_existing_db=1 --db=/dev/shm/db_bench --disable_auto_compactions=1

// Before PR
readseq [AVG    20 runs] : 2449011 (± 50238) ops/sec;  270.9 (± 5.6) MB/sec
readseq [MEDIAN 20 runs] : 2499167 ops/sec;  276.5 MB/sec

// After PR  (regress 0.4 %)
readseq [AVG    20 runs] : 2439098 (± 42931) ops/sec;  269.8 (± 4.7) MB/sec
readseq [MEDIAN 20 runs] : 2460859 ops/sec;  272.2 MB/sec

```

- Stress test: randomly set `prefix_same_as_start` in `TestPrefixScan()`. Run below for a while
```
python3 tools/db_crashtest.py --simple blackbox --prefix_size=5 --prefixpercent=65 --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=-1 --bloom_bits=3 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=0 --compaction_style=2 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0  --db_write_buffer_size=8388608 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=0 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=15 --index_shortening=2 --index_type=2 --ingest_external_file_one_in=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32  --readpercent=10 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --uncache_aggressiveness=1 --universal_max_read_amp=10 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=10
```

Reviewed By: anand1976

Differential Revision: D64367065

Pulled By: hx235

fbshipit-source-id: 5750c05ccc835c3e9dc81c961b76deaf30bd23c2
2024-10-17 15:52:55 -07:00
..
binary_search_index_reader.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
binary_search_index_reader.h
block.cc Add initial support for TimedPut API (#12419) 2024-03-14 15:44:55 -07:00
block.h Add some per key optimization for UDT in memtable only feature (#13031) 2024-10-03 17:57:50 -07:00
block_based_table_builder.cc Record largest seqno in table properties and verify in file ingestion (#12951) 2024-08-21 16:24:18 -07:00
block_based_table_builder.h Fix/cleanup SeqnoToTimeMapping (#12253) 2024-01-19 21:50:38 -08:00
block_based_table_factory.cc Make simple BlockBasedTableOptions mutable (#10021) 2024-10-14 17:49:26 -07:00
block_based_table_factory.h Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
block_based_table_iterator.cc Trim readahead size based on prefix during prefix scan (#13040) 2024-10-17 15:52:55 -07:00
block_based_table_iterator.h Trim readahead size based on prefix during prefix scan (#13040) 2024-10-17 15:52:55 -07:00
block_based_table_reader.cc Add an ingestion option to not fill block cache (#13067) 2024-10-16 14:11:22 -07:00
block_based_table_reader.h Add an ingestion option to not fill block cache (#13067) 2024-10-16 14:11:22 -07:00
block_based_table_reader_impl.h Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
block_based_table_reader_sync_and_async.h Add ticker stats for read corruption retries (#12923) 2024-08-12 15:32:07 -07:00
block_based_table_reader_test.cc Prefer static_cast in place of most reinterpret_cast (#12308) 2024-02-07 10:44:11 -08:00
block_builder.cc internal_repo_rocksdb (-8794174668376270091) (#12114) 2023-12-01 11:10:30 -08:00
block_builder.h Logically strip timestamp during flush (#11557) 2023-06-29 15:50:50 -07:00
block_cache.cc Support compressed and local flash secondary cache stacking (#11812) 2023-09-21 20:30:53 -07:00
block_cache.h Support pro-actively erasing obsolete block cache entries (#12694) 2024-06-07 08:57:11 -07:00
block_prefetcher.cc Respect ReadOptions::read_tier in prefetching (#12782) 2024-06-19 09:53:59 -07:00
block_prefetcher.h Refactor FilePrefetchBuffer code (#12097) 2024-01-05 09:29:01 -08:00
block_prefix_index.cc
block_prefix_index.h
block_test.cc internal_repo_rocksdb (-8794174668376270091) (#12114) 2023-12-01 11:10:30 -08:00
block_type.h
cachable_entry.h Support pro-actively erasing obsolete block cache entries (#12694) 2024-06-07 08:57:11 -07:00
data_block_footer.cc
data_block_footer.h
data_block_hash_index.cc
data_block_hash_index.h
data_block_hash_index_test.cc Refactor `table_factory` into MutableCFOptions (#13077) 2024-10-17 14:13:20 -07:00
filter_block.h Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
filter_block_reader_common.cc Remove redundant no_io parameters to filter functions (#12762) 2024-06-12 18:47:11 -07:00
filter_block_reader_common.h Remove redundant no_io parameters to filter functions (#12762) 2024-06-12 18:47:11 -07:00
filter_policy.cc More valgrind fixes (#12990) 2024-09-06 10:11:34 -07:00
filter_policy_internal.h Optimize, simplify filter block building (fix regression) (#12931) 2024-08-14 15:13:16 -07:00
flush_block_policy.cc Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
flush_block_policy_impl.h Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
full_filter_block.cc Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
full_filter_block.h Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
full_filter_block_test.cc Optimize, simplify filter block building (fix regression) (#12931) 2024-08-14 15:13:16 -07:00
hash_index_reader.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
hash_index_reader.h
index_builder.cc Refactor IndexBuilder::AddIndexEntry (#12867) 2024-07-22 14:27:31 -07:00
index_builder.h Refactor IndexBuilder::AddIndexEntry (#12867) 2024-07-22 14:27:31 -07:00
index_reader_common.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
index_reader_common.h Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
mock_block_based_table.h
parsed_full_filter_block.cc
parsed_full_filter_block.h
partitioned_filter_block.cc Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
partitioned_filter_block.h Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
partitioned_filter_block_test.cc Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
partitioned_index_iterator.cc Refactor FilePrefetchBuffer code (#12097) 2024-01-05 09:29:01 -08:00
partitioned_index_iterator.h
partitioned_index_reader.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
partitioned_index_reader.h Support pro-actively erasing obsolete block cache entries (#12694) 2024-06-07 08:57:11 -07:00
reader_common.cc Prefer static_cast in place of most reinterpret_cast (#12308) 2024-02-07 10:44:11 -08:00
reader_common.h Remove unnecessary, confusing 'extern' (#12300) 2024-01-29 10:38:08 -08:00
uncompression_dict_reader.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
uncompression_dict_reader.h Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00