rocksdb/table/block_based
Hui Xiao 8f763bdeab Record and use the tail size to prefetch table tail (#11406)
Summary:
**Context:**
We prefetch the tail of an SST file (i.e., the blocks after the data blocks through the end of the file) on each SST file open, hoping to fetch everything needed for later reads (e.g., footer, meta index, filter/index blocks) in one go. The existing approach estimates the tail size to prefetch through the `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156. It can cause small reads in unlucky cases (e.g., a small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a size smaller than it should) and is hard to debug. Therefore we record the exact tail size and use it directly to prefetch the tail of the SST file instead of relying on heuristics.

**Summary:**
- Obtain the tail size in `BlockBasedTableBuilder::Finish()` and record it in the manifest
   - For backward compatibility, we fall back to `TailPrefetchStats` and, as a last resort, to a simple heuristic that the tail size is a linear portion of the file size - see PR conversation for more.
- Make `tail_start_offset` part of the table properties and deduce from it the tail size to record in the manifest for external files (e.g., file ingestion, import CF) and db repair (with no access to the manifest).
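
The fallback order and the `tail_start_offset` derivation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `TailSizeFromStartOffset` and `ChooseTailPrefetchSize`, the use of `std::optional` to model "not recorded", and the `fallback_fraction` value are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper (not a RocksDB function): derive the tail size of a
// file from the offset at which its tail begins, e.g. for external files
// whose tail size was never written to a manifest.
inline uint64_t TailSizeFromStartOffset(uint64_t file_size,
                                        uint64_t tail_start_offset) {
  assert(tail_start_offset <= file_size);
  return file_size - tail_start_offset;
}

// Hypothetical helper (not a RocksDB function): pick how many bytes to
// prefetch from the end of the file, in the order described above:
//   1. the exact tail size recorded at build time, if available;
//   2. otherwise a suggestion from a statistics-based heuristic;
//   3. otherwise a fixed fraction of the file size.
// `fallback_fraction` is an illustrative placeholder, not a value taken
// from the PR.
inline uint64_t ChooseTailPrefetchSize(
    uint64_t file_size, std::optional<uint64_t> recorded_tail_size,
    std::optional<uint64_t> stats_suggestion,
    double fallback_fraction = 0.1) {
  assert(fallback_fraction >= 0.0 && fallback_fraction <= 1.0);
  uint64_t size = 0;
  if (recorded_tail_size && *recorded_tail_size > 0) {
    size = *recorded_tail_size;
  } else if (stats_suggestion && *stats_suggestion > 0) {
    size = *stats_suggestion;
  } else {
    size = static_cast<uint64_t>(file_size * fallback_fraction);
  }
  // Never ask for more bytes than the file contains.
  return std::min(size, file_size);
}
```

For instance, under these assumptions a 1000-byte file whose tail starts at offset 900 yields a tail size of 100, and `ChooseTailPrefetchSize` returns that recorded value ahead of any heuristic suggestion.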

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406

Test Plan:
1. New UT
2. db bench
Note: db bench on /tmp/, where direct read is supported, is too slow to finish, and the default pinning setting in db bench is not helpful for profiling the number of SST reads of Get. Therefore I applied the following hack to obtain the comparison below.
```
 diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
 --- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
                            &tail_prefetch_size);

   // Try file system prefetch
-  if (!file->use_direct_io() && !force_direct_prefetch) {
+  if (false && !file->use_direct_io() && !force_direct_prefetch) {
     if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
              .IsNotSupported()) {
       prefetch_buffer->reset(new FilePrefetchBuffer(
 diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
 --- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
           std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
     } else {
       BlockBasedTableOptions block_based_options;
+      block_based_options.metadata_cache_options.partition_pinning =
+      PinningTier::kAll;
       block_based_options.checksum =
           static_cast<ChecksumType>(FLAGS_checksum_type);
       if (FLAGS_use_hash_search) {
```
Create DB
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
ReadRandom
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
(a) Existing (Use TailPrefetchStats for tail size + use separate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```

(b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```

As we can see, this PR increases the prefetch tail hit count (3395 to 14257) and decreases the SST read count (999905 to 998547).

3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.

Reviewed By: pdillinger

Differential Revision: D45413346

Pulled By: hx235

fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
2023-05-08 13:14:28 -07:00
binary_search_index_reader.cc Use user-provided ReadOptions for metadata block reads more often (#11208) 2023-04-04 16:53:14 -07:00
binary_search_index_reader.h
block.cc Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
block.h Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
block_based_table_builder.cc Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
block_based_table_builder.h Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
block_based_table_factory.cc Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
block_based_table_factory.h Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
block_based_table_iterator.cc
block_based_table_iterator.h Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
block_based_table_reader.cc Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
block_based_table_reader.h Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
block_based_table_reader_impl.h Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) 2023-04-21 09:07:18 -07:00
block_based_table_reader_sync_and_async.h Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) 2023-04-21 09:07:18 -07:00
block_based_table_reader_test.cc Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
block_builder.cc Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
block_builder.h Clarify SstFileWriter::DeleteRange() ordering requirements (#11390) 2023-04-20 13:02:16 -07:00
block_cache.cc Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
block_cache.h Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
block_prefetcher.cc Fix stress test failure for async_io (#10660) 2022-09-12 14:48:06 -07:00
block_prefetcher.h
block_prefix_index.cc
block_prefix_index.h
block_test.cc Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
block_type.h
cachable_entry.h HyperClockCache support for SecondaryCache, with refactoring (#11301) 2023-03-17 20:23:49 -07:00
data_block_footer.cc
data_block_footer.h
data_block_hash_index.cc Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
data_block_hash_index.h Fix build with gcc 13 by including <cstdint> (#11118) 2023-01-25 14:30:32 -08:00
data_block_hash_index_test.cc Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
filter_block.h Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
filter_block_reader_common.cc Use user-provided ReadOptions for metadata block reads more often (#11208) 2023-04-04 16:53:14 -07:00
filter_block_reader_common.h Use user-provided ReadOptions for metadata block reads more often (#11208) 2023-04-04 16:53:14 -07:00
filter_policy.cc Remove RocksDB LITE (#11147) 2023-01-27 13:14:19 -08:00
filter_policy_internal.h
flush_block_policy.cc Remove FactoryFunc from LoadXXXObject (#11203) 2023-02-17 12:54:07 -08:00
flush_block_policy.h
full_filter_block.cc Use user-provided ReadOptions for metadata block reads more often (#11208) 2023-04-04 16:53:14 -07:00
full_filter_block.h Use user-provided ReadOptions for metadata block reads more often (#11208) 2023-04-04 16:53:14 -07:00
full_filter_block_test.cc Use user-provided ReadOptions for metadata block reads more often (#11208) 2023-04-04 16:53:14 -07:00
hash_index_reader.cc Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) 2023-04-21 09:07:18 -07:00
hash_index_reader.h
index_builder.cc
index_builder.h Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
index_reader_common.cc Use user-provided ReadOptions for metadata block reads more often (#11208) 2023-04-04 16:53:14 -07:00
index_reader_common.h Use user-provided ReadOptions for metadata block reads more often (#11208) 2023-04-04 16:53:14 -07:00
mock_block_based_table.h
parsed_full_filter_block.cc
parsed_full_filter_block.h Major Cache refactoring, CPU efficiency improvement (#10975) 2023-01-11 14:20:40 -08:00
partitioned_filter_block.cc Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
partitioned_filter_block.h Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
partitioned_filter_block_test.cc Use user-provided ReadOptions for metadata block reads more often (#11208) 2023-04-04 16:53:14 -07:00
partitioned_index_iterator.cc
partitioned_index_iterator.h Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
partitioned_index_reader.cc Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
partitioned_index_reader.h Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
reader_common.cc
reader_common.h Put Cache and CacheWrapper in new public header (#11192) 2023-02-09 12:12:02 -08:00
uncompression_dict_reader.cc Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) 2023-04-21 09:07:18 -07:00
uncompression_dict_reader.h Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) 2023-04-21 09:07:18 -07:00