rocksdb/db_stress_tool
Changyu Bi de16611a50 Fix WAL corruption in stress test (#12834)
Summary:
We are seeing WAL corruption in crash tests where wal_compression and recycled_wal are enabled. With wal_compression, we write a SetCompression record when creating a WAL, which can happen during DB open time. Our current stress test set up may write directly to the underlying WAL file during DB open, while writing to a buffer under TestFSWritableFile later for sync fault injection. This causes the last synced position to be inaccurately tracked in TestFSWritableFile and causes reads to return incorrect data.

This PR removes the line that causes this mixture of WAL writes. Also updated TestFSWritableFile to avoid such a mixture of buffered and direct writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12834

Test Plan:
the following command repros WAL corruption before this PR
```
 ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=0 --bloom_bits=8 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=600 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=16777216 --compression_checksum=1 --compression_max_dict_buffer_bytes=1099511627775 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db_write_buffer_size=134217728 --default_temperature=kHot --default_write_temperature=kCold --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=crc32c --fill_cache=0 --flush_one_in=1000000 --format_version=2 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=13 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=2 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=113 --universal_max_read_amp=4 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=1 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=5 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=35 --preserve_unverified_changes=1 --db=/dev/shm/rocksdb_test/blackbox --expected_values_dir=/dev/shm/rocksdb_test/expected
Choosing random keys with no overwrite
...
(Re-)verified 0 unique IDs
2024/07/01-16:42:46  Initializing worker threads
Crash-recovery verification passed :)
2024/07/01-16:42:46  Starting database operations
^C

./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=0 --bloom_bits=8 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=600 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=16777216 --compression_checksum=1 --compression_max_dict_buffer_bytes=1099511627775 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db_write_buffer_size=134217728 --default_temperature=kHot --default_write_temperature=kCold --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=crc32c --fill_cache=0 --flush_one_in=1000000 --format_version=2 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=13 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=2 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=113 --universal_max_read_amp=4 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=1 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=5 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=35 --preserve_unverified_changes=1 --db=/dev/shm/rocksdb_test/blackbox --expected_values_dir=/dev/shm/rocksdb_test/expected
Choosing random keys with no overwrite
...
Crash-recovery verification passed :)
2024/07/01-16:43:02  Starting database operations
Failure in BackupEngine::CreateNewBackup with: Corruption: bad record length under specified BackupEngineOptions: share_table_files: 1, share_files_with_checksum: 1, share_files_with_checksum_naming: 2147483650, schema_version: 1, max_background_operations: 1, backup_rate_limiter: 0x7f2373676280, restore_rate_limiter: 0, current_temperatures_override_manifest: 1, CreateBackupOptions: flush_before_backup: 0, decrease_background_thread_cpu_priority: 0, background_thread_cpu_priority: 2, RestoreOptions: keep_log_files: 1 (Empty string or missing field indicates default option or value is used)
Verification failed: Backup/restore failed: Corruption: bad record length
db_stress: db_stress_tool/db_stress_test_base.cc:528: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, const rocksdb::Status&, bool) const: Assertion `false' failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
^CCouldn't get CS register: No such process.
Couldn't get registers: No such process.
[Inferior 1 (process 2097222) detached]
```

Reviewed By: pdillinger

Differential Revision: D59260401

Pulled By: cbi42

fbshipit-source-id: fdcdaaab2e14b527b26fbdfa819b4fe3f745a4de
2024-07-02 13:02:39 -07:00
..
batched_ops_stress.cc Add experimental range filters to stress/crash test (#12769) 2024-06-18 16:16:09 -07:00
cf_consistency_stress.cc Remove unnecessary injected error logging in crash test (#12807) 2024-06-24 20:51:39 -07:00
CMakeLists.txt Add experimental range filters to stress/crash test (#12769) 2024-06-18 16:16:09 -07:00
db_stress.cc Disable tiered storage + BlobDB stress test (#10699) 2022-09-19 15:39:31 -07:00
db_stress_common.cc Fix nullptr access and race to fault_fs_guard (#12799) 2024-06-24 16:10:36 -07:00
db_stress_common.h Inject more errors to more files in stress test (#12713) 2024-06-19 08:42:00 -07:00
db_stress_compaction_filter.h Enable compaction filter for db_stress with user-defined timestamp (#10259) 2022-06-27 11:53:09 -07:00
db_stress_driver.cc Decouple sync fault and write injection in FaultInjectionTestFS & fix tracing issue under WAL write error injection (#12797) 2024-06-26 14:56:35 -07:00
db_stress_driver.h Remove unnecessary, confusing 'extern' (#12300) 2024-01-29 10:38:08 -08:00
db_stress_env_wrapper.h Change ReadAsync callback API to remove const from FSReadRequest (#11649) 2024-02-16 09:14:55 -08:00
db_stress_filters.cc Add experimental range filters to stress/crash test (#12769) 2024-06-18 16:16:09 -07:00
db_stress_filters.h Add experimental range filters to stress/crash test (#12769) 2024-06-18 16:16:09 -07:00
db_stress_gflags.cc Decouple sync fault and write injection in FaultInjectionTestFS & fix tracing issue under WAL write error injection (#12797) 2024-06-26 14:56:35 -07:00
db_stress_listener.cc Fix/improve temperature handling for file ingestion (#12402) 2024-03-05 16:56:08 -08:00
db_stress_listener.h Decouple sync fault and write injection in FaultInjectionTestFS & fix tracing issue under WAL write error injection (#12797) 2024-06-26 14:56:35 -07:00
db_stress_shared_state.cc Remove ROCKSDB_SUPPORT_THREAD_LOCAL define because it's a part of C++11 (#10015) 2022-05-18 15:25:19 -07:00
db_stress_shared_state.h Decouple sync fault and write injection in FaultInjectionTestFS & fix tracing issue under WAL write error injection (#12797) 2024-06-26 14:56:35 -07:00
db_stress_stat.cc Fix Statistics in db_stress (#9260) 2021-12-07 16:24:22 -08:00
db_stress_stat.h Fix Statistics in db_stress (#9260) 2021-12-07 16:24:22 -08:00
db_stress_table_properties_collector.h Remove 'virtual' when implied by 'override' (#12319) 2024-01-31 13:14:42 -08:00
db_stress_test_base.cc Fix WAL corruption in stress test (#12834) 2024-07-02 13:02:39 -07:00
db_stress_test_base.h Decouple sync fault and write injection in FaultInjectionTestFS & fix tracing issue under WAL write error injection (#12797) 2024-06-26 14:56:35 -07:00
db_stress_tool.cc Decouple sync fault and write injection in FaultInjectionTestFS & fix tracing issue under WAL write error injection (#12797) 2024-06-26 14:56:35 -07:00
db_stress_wide_merge_operator.cc Add the wide-column aware merge API to the stress tests (#11906) 2023-09-29 08:54:50 -07:00
db_stress_wide_merge_operator.h Add the wide-column aware merge API to the stress tests (#11906) 2023-09-29 08:54:50 -07:00
expected_state.cc Add TimedPut to stress test (#12559) 2024-04-30 15:40:35 -07:00
expected_state.h Improve comment of ExpectedValue in db stress (#11456) 2023-05-18 09:44:15 -07:00
expected_value.cc Improve comment of ExpectedValue in db stress (#11456) 2023-05-18 09:44:15 -07:00
expected_value.h Ensures PendingExpectedValue either Commit or Rollback (#12244) 2024-01-24 11:04:40 -08:00
multi_ops_txns_stress.cc Add experimental range filters to stress/crash test (#12769) 2024-06-18 16:16:09 -07:00
multi_ops_txns_stress.h Disable AttributeGroup in multiops txn test (#12781) 2024-06-18 16:05:18 -07:00
no_batched_ops_stress.cc Remove unnecessary injected error logging in crash test (#12807) 2024-06-24 20:51:39 -07:00