rocksdb/table
Changyu Bi 6a0f637633 Compare the number of input keys and processed keys for compactions (#11571)
Summary:
... to improve data integrity validation during compaction.

A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. The verification is skipped when a compaction filter returns kRemoveAndSkipUntil, because this can cause CompactionIterator to seek ahead to some key, so it cannot keep track of the number of keys processed.

For the expected number of input keys, we sum (number of total keys - number of range tombstones) over all compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code.
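
The expected-count computation described above can be sketched as follows. This is a minimal, self-contained illustration, not the RocksDB implementation: `InputFileInfo` and `ExpectedInputRecords` are hypothetical names, and the sketch assumes the per-file counts are already available (from file metadata or table properties).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-file counts consulted by the check.
struct InputFileInfo {
  uint64_t num_entries;          // all entries, including range tombstones
  uint64_t num_range_deletions;  // range tombstone entries only
};

// Expected number of keys the compaction should process:
// sum over input files of (total keys - range tombstones).
uint64_t ExpectedInputRecords(const std::vector<InputFileInfo>& files) {
  uint64_t total = 0;
  for (const auto& f : files) {
    total += f.num_entries - f.num_range_deletions;
  }
  return total;
}
```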

For the actual number of keys processed, each subcompaction records its count in `sub_compact->compaction_job_stats.num_input_records`, and the counts are aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). When a subcompaction encounters kRemoveAndSkipUntil from a compaction filter and does not have an accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`.
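
The aggregation and comparison described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than the code in this change: `SubcompactionStats`, `AggregatedStats`, `Aggregate`, `VerifyResult`, and `VerifyRecordCount` are hypothetical names; only the field names `num_input_records` and `has_num_input_records` come from the description above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-subcompaction stats, mirroring the two fields named above.
struct SubcompactionStats {
  uint64_t num_input_records = 0;
  bool has_num_input_records = true;  // false after kRemoveAndSkipUntil
};

struct AggregatedStats {
  uint64_t num_input_records = 0;
  bool has_num_input_records = true;
};

// Sum the per-subcompaction counts; the total is only trustworthy
// if every subcompaction reported an accurate count.
AggregatedStats Aggregate(const std::vector<SubcompactionStats>& subs) {
  AggregatedStats agg;
  for (const auto& s : subs) {
    agg.num_input_records += s.num_input_records;
    agg.has_num_input_records =
        agg.has_num_input_records && s.has_num_input_records;
  }
  return agg;
}

enum class VerifyResult { kOk, kSkipped, kMismatch };

// Compare expected and processed counts, skipping the check when the
// processed count is known to be inaccurate.
VerifyResult VerifyRecordCount(uint64_t expected,
                               const std::vector<SubcompactionStats>& subs) {
  const AggregatedStats agg = Aggregate(subs);
  if (!agg.has_num_input_records) {
    return VerifyResult::kSkipped;
  }
  return agg.num_input_records == expected ? VerifyResult::kOk
                                           : VerifyResult::kMismatch;
}
```

In this sketch a mismatch is only reported as a result value; how the real compaction job surfaces the error is not shown here.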

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571

Test Plan:
* Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case.
* All other existing unit tests cover the non-corrupted case.
* Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple`

Reviewed By: ajkr

Differential Revision: D47131965

Pulled By: cbi42

fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92
2023-07-28 09:47:31 -07:00
adaptive Remove RocksDB LITE (#11147) 2023-01-27 13:14:19 -08:00
block_based Move prefetching responsibility to page cache for compaction read under non directIO usecase (#11631) 2023-07-21 14:52:52 -07:00
cuckoo Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) 2023-04-21 09:07:18 -07:00
plain Add missing table properties in plaintable GetTableProperties() (#11267) 2023-07-21 17:55:25 -07:00
block_fetcher.cc Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
block_fetcher.h Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
block_fetcher_test.cc Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
cleanable_test.cc Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899) 2022-04-26 21:59:24 -07:00
compaction_merging_iterator.cc Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113) 2023-02-22 12:28:18 -08:00
compaction_merging_iterator.h Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113) 2023-02-22 12:28:18 -08:00
format.cc Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
format.h Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
get_context.cc Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
get_context.h Merge operator failed subcode (#11231) 2023-02-17 10:58:46 -08:00
internal_iterator.h remove unused InternalIteratorBase::is_mutable_ (#11104) 2023-01-19 13:28:58 -08:00
iter_heap.h Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
iterator.cc Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
iterator_wrapper.h Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
merger_test.cc Print stack traces on frozen tests in CI (#10828) 2022-10-18 00:35:35 -07:00
merging_iterator.cc Improve documentation for MergingIterator (#11161) 2023-03-03 12:17:30 -08:00
merging_iterator.h Improve documentation for MergingIterator (#11161) 2023-03-03 12:17:30 -08:00
meta_blocks.cc Record the `persist_user_defined_timestamps` flag in manifest (#11515) 2023-06-21 21:49:01 -07:00
meta_blocks.h Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) 2023-04-21 09:07:18 -07:00
mock_table.cc Compare the number of input keys and processed keys for compactions (#11571) 2023-07-28 09:47:31 -07:00
mock_table.h Align compaction output file boundaries to the next level ones (#10655) 2022-09-29 19:43:55 -07:00
multiget_context.h Add a new MultiGetEntity API (#11222) 2023-02-15 09:34:17 -08:00
persistent_cache_helper.cc Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
persistent_cache_helper.h Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
persistent_cache_options.h Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
scoped_arena_iterator.h Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
sst_file_dumper.cc `sst_dump --command=verify` should verify block checksums (#11576) 2023-07-05 14:12:06 -07:00
sst_file_dumper.h Remove RocksDB LITE (#11147) 2023-01-27 13:14:19 -08:00
sst_file_reader.cc Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
sst_file_reader_test.cc Remove RocksDB LITE (#11147) 2023-01-27 13:14:19 -08:00
sst_file_writer.cc validate SstFileWriter range tombstones cover positive ranges (#11322) 2023-03-22 21:03:13 -07:00
sst_file_writer_collectors.h Refactor to avoid confusing "raw block" (#10408) 2022-09-22 11:25:32 -07:00
table_builder.h Add support to strip / pad timestamp when creating / reading a block based table (#11495) 2023-06-01 11:10:03 -07:00
table_factory.cc Remove FactoryFunc from LoadXXXObject (#11203) 2023-02-17 12:54:07 -08:00
table_properties.cc Record the `persist_user_defined_timestamps` flag in manifest (#11515) 2023-06-21 21:49:01 -07:00
table_properties_internal.h Improve / clean up meta block code & integrity (#9163) 2021-11-18 11:43:44 -08:00
table_reader.h Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) 2023-04-21 09:07:18 -07:00
table_reader_bench.cc Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
table_test.cc Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
two_level_iterator.cc Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
two_level_iterator.h Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
unique_id.cc Derive cache keys from SST unique IDs (#10394) 2022-08-12 13:49:49 -07:00
unique_id_impl.h Derive cache keys from SST unique IDs (#10394) 2022-08-12 13:49:49 -07:00