rocksdb

Commit Graph

Author	SHA1	Message	Date
Symious	c989c51ed7	Fix non-ASCII character (#12972 ) Summary: Met the following error while compiling the project. ``` build_tools/check-sources.sh utilities/fault_injection_fs.cc:509: // If there<E2><80><99>s no injected error, then cb will be called asynchronously when utilities/fault_injection_fs.cc:510: // target_ actually finishes the read. But if there<E2><80><99>s an injected error, it utilities/fault_injection_fs.cc:512: // isn<E2><80><99>t invoked at all. ^^^^ Use only ASCII characters in source files make[1]: * [Makefile:1291: check-sources] Error 1 make[1]: Leaving directory '/home/janus/Github/symious/rocksdb' make: * [Makefile:1084: check] Error 2 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12972 Reviewed By: hx235 Differential Revision: D61923865 Pulled By: cbi42 fbshipit-source-id: 63af0a38fea15e09a860895bdd5ed0a57700e447	2024-09-03 14:41:55 -07:00
Jay Huh	b99baa5ae7	Do not add unprep_seqs when WriteImpl() fails in unprepared txn (#12927 ) Summary: With the following scenario, we see assertion failure in write_unprepared_txn 1. The write is the first one in the transaction (`log_number_` is not set yet - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L376-L379) 2. `WriteToWAL()` fails inside `db_impl_->WriteImpl()` due to fault injection. `last_log_number_`is 0. https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L386 3. `prepare_batch_cnt_` is still added to `unprep_seqs_` https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L395-L398 4. When the transaction gets destructed after failed, it expects `log_number_ > 0` if `unprep_seqs_` is not empty - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L54-L55. Step 3 is wrong. `unprep_seqs_` is for the ordered list of unprep sequence numbers that we have already written to the DB. If `db_impl_->WriteImpl()` failed, `unprep_seqs_` shouldn't be set. `prepare_seq` value would be wrong anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12927 Test Plan: Following command steadily repros the issue. This no longer fails after the fix ``` ./db_stress \ --WAL_size_limit_MB=0 \ --WAL_ttl_seconds=60 \ --acquire_snapshot_one_in=10000 \ --adaptive_readahead=1 \ --adm_policy=2 \ --advise_random_on_open=0 \ --allow_data_in_errors=True \ --allow_fallocate=1 \ --async_io=1 \ --auto_readahead_size=0 \ --avoid_flush_during_recovery=0 \ --avoid_flush_during_shutdown=1 \ --avoid_unnecessary_blocking_io=0 \ --backup_max_size=104857600 \ --backup_one_in=100000 \ --batch_protection_bytes_per_key=8 \ --bgerror_resume_retry_interval=100 \ --block_align=1 \ --block_protection_bytes_per_key=8 \ --block_size=16384 \ --bloom_before_level=7 \ --bloom_bits=4.1280979878467345 \ --bottommost_compression_type=none \ --bottommost_file_compaction_delay=600 \ --bytes_per_sync=262144 \ --cache_index_and_filter_blocks=1 \ --cache_index_and_filter_blocks_with_high_priority=0 \ --cache_size=33554432 \ --cache_type=lru_cache \ --charge_compression_dictionary_building_buffer=1 \ --charge_file_metadata=1 \ --charge_filter_construction=1 \ --charge_table_reader=1 \ --check_multiget_consistency=0 \ --check_multiget_entity_consistency=1 \ --checkpoint_one_in=0 \ --checksum_type=kXXH3 \ --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 \ --compact_range_one_in=1000000 \ --compaction_pri=4 \ --compaction_readahead_size=1048576 \ --compaction_ttl=0 \ --compress_format_version=1 \ --compressed_secondary_cache_ratio=0.0 \ --compressed_secondary_cache_size=0 \ --compression_checksum=1 \ --compression_max_dict_buffer_bytes=4294967295 \ --compression_max_dict_bytes=16384 \ --compression_parallel_threads=8 \ --compression_type=none \ --compression_use_zstd_dict_trainer=1 \ --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 \ --create_timestamped_snapshot_one_in=0 \ --daily_offpeak_time_utc=04:00-08:00 \ --data_block_index_type=0 \ --db=$db \ --db_write_buffer_size=8388608 \ --default_temperature=kHot \ --default_write_temperature=kUnknown \ --delete_obsolete_files_period_micros=21600000000 \ --delpercent=5 \ --delrangepercent=0 \ --destroy_db_initially=0 \ --detect_filter_construct_corruption=0 \ --disable_file_deletions_one_in=10000 \ --disable_manual_compaction_one_in=10000 \ --disable_wal=0 \ --dump_malloc_stats=1 \ --enable_checksum_handoff=1 \ --enable_compaction_filter=0 \ --enable_custom_split_merge=1 \ --enable_do_not_compress_roles=0 \ --enable_index_compression=1 \ --enable_memtable_insert_with_hint_prefix_extractor=0 \ --enable_pipelined_write=0 \ --enable_sst_partitioner_factory=0 \ --enable_thread_tracking=1 \ --enable_write_thread_adaptive_yield=1 \ --error_recovery_with_no_fault_injection=1 \ --exclude_wal_from_write_fault_injection=0 \ --expected_values_dir=$exp \ --fail_if_options_file_error=0 \ --fifo_allow_compaction=1 \ --file_checksum_impl=crc32c \ --fill_cache=1 \ --flush_one_in=1000 \ --format_version=2 \ --get_all_column_family_metadata_one_in=1000000 \ --get_current_wal_file_one_in=0 \ --get_live_files_apis_one_in=1000000 \ --get_properties_of_all_tables_one_in=100000 \ --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 \ --hard_pending_compaction_bytes_limit=274877906944 \ --high_pri_pool_ratio=0 \ --index_block_restart_interval=15 \ --index_shortening=0 \ --index_type=2 \ --ingest_external_file_one_in=0 \ --initial_auto_readahead_size=524288 \ --inplace_update_support=0 \ --iterpercent=10 \ --key_len_percent_dist=1,30,69 \ --key_may_exist_one_in=100000 \ --kill_random_test=888887 \ --last_level_temperature=kCold \ --level_compaction_dynamic_level_bytes=1 \ --lock_wal_one_in=1000000 \ --log2_keys_per_lock=10 \ --log_file_time_to_roll=0 \ --log_readahead_size=0 \ --long_running_snapshots=1 \ --low_pri_pool_ratio=0 \ --lowest_used_cache_tier=1 \ --manifest_preallocation_size=5120 \ --manual_wal_flush_one_in=0 \ --mark_for_compaction_one_file_in=10 \ --max_auto_readahead_size=0 \ --max_background_compactions=20 \ --max_bytes_for_level_base=10485760 \ --max_key=100000 \ --max_key_len=3 \ --max_log_file_size=1048576 \ --max_manifest_file_size=1073741824 \ --max_sequential_skip_in_iterations=2 \ --max_total_wal_size=0 \ --max_write_batch_group_size_bytes=16 \ --max_write_buffer_number=10 \ --max_write_buffer_size_to_maintain=0 \ --memtable_insert_hint_per_batch=0 \ --memtable_max_range_deletions=1000 \ --memtable_prefix_bloom_size_ratio=0.1 \ --memtable_protection_bytes_per_key=8 \ --memtable_whole_key_filtering=1 \ --memtablerep=skip_list \ --metadata_charge_policy=0 \ --metadata_read_fault_one_in=32 \ --metadata_write_fault_one_in=0 \ --min_write_buffer_number_to_merge=2 \ --mmap_read=1 \ --mock_direct_io=False \ --nooverwritepercent=1 \ --num_file_reads_for_auto_readahead=2 \ --open_files=-1 \ --open_metadata_read_fault_one_in=0 \ --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 \ --open_write_fault_one_in=0 \ --ops_per_thread=20000000 \ --optimize_filters_for_hits=1 \ --optimize_filters_for_memory=0 \ --optimize_multiget_for_io=0 \ --paranoid_file_checks=1 \ --partition_filters=0 \ --partition_pinning=0 \ --pause_background_one_in=1000000 \ --periodic_compaction_seconds=1 \ --prefix_size=5 \ --prefixpercent=5 \ --prepopulate_block_cache=0 \ --preserve_internal_time_seconds=60 \ --progress_reports=0 \ --promote_l0_one_in=0 \ --read_amp_bytes_per_bit=32 \ --read_fault_one_in=1000 \ --readahead_size=0 \ --readpercent=45 \ --recycle_log_file_num=1 \ --reopen=20 \ --report_bg_io_stats=1 \ --reset_stats_one_in=10000 \ --sample_for_compression=0 \ --secondary_cache_fault_one_in=0 \ --sync=0 \ --sync_fault_injection=0 \ --table_cache_numshardbits=6 \ --target_file_size_base=524288 \ --target_file_size_multiplier=2 \ --test_batches_snapshots=0 \ --top_level_index_pinning=0 \ --txn_write_policy=2 \ --uncache_aggressiveness=126 \ --universal_max_read_amp=4 \ --unordered_write=0 \ --unpartitioned_pinning=1 \ --use_adaptive_mutex=0 \ --use_adaptive_mutex_lru=1 \ --use_attribute_group=1 \ --use_delta_encoding=1 \ --use_direct_io_for_flush_and_compaction=0 \ --use_direct_reads=0 \ --use_full_merge_v1=0 \ --use_get_entity=0 \ --use_merge=1 \ --use_multi_cf_iterator=0 \ --use_multi_get_entity=0 \ --use_multiget=1 \ --use_optimistic_txn=0 \ --use_put_entity_one_in=0 \ --use_timed_put_one_in=0 \ --use_txn=1 \ --use_write_buffer_manager=0 \ --user_timestamp_size=0 \ --value_size_mult=32 \ --verification_only=0 \ --verify_checksum=1 \ --verify_checksum_one_in=1000000 \ --verify_compression=0 \ --verify_db_one_in=10000 \ --verify_file_checksums_one_in=1000000 \ --verify_iterator_with_expected_state_one_in=5 \ --verify_sst_unique_id_in_manifest=1 \ --wal_bytes_per_sync=0 \ --wal_compression=none \ --write_buffer_size=4194304 \ --write_dbid_to_manifest=1 \ --write_fault_one_in=128 \ --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D61048774 Pulled By: jaykorean fbshipit-source-id: 22200d55fd0b22b68732b12516e681a6c6e2c601	2024-08-15 09:16:29 -07:00
Hui Xiao	16c21afc06	Fix failure to clean the temporary directory due to NotFound in crash test checkpoint creation (#12919 ) Summary: Context/Summary: `b26b395e0a` propagates `CleanStagingDirectory()` status to `CreateCheckpoint()`. However, we didn't return early when `Status s = db_->GetEnv()->FileExists(full_private_path);` return non-NotFound non-ok stratus in `CleanStagingDirectory()`. Therefore we can proceed to the next step when `full_private_path` doesn't exist. ``` Verification failed: Checkpoint failed: Operation aborted: Failed to clean the temporary directory /dev/shm/rocksdb.J4Su/rocksdb_crashtest_blackbox/.checkpoint28.tmp needed before checkpoint creation : NotFound: db_stress: db_stress_tool/db_stress_test_base.cc:549: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, const rocksdb::Status&, bool) const: Assertion `false' failed. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12919 Test Plan: Below failed before the fix and passes after ``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2 --bloom_bits=4 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb.J4Su/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --default_temperature=kUnknown --default_write_temperature=kHot --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/dev/shm/rocksdb.J4Su/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=13 --index_shortening=0 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=128 --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=500000 --open_metadata_read_fault_one_in=8 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=0 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=3 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=1 --uncache_aggressiveness=0 --universal_max_read_amp=0 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=1 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D60938952 Pulled By: hx235 fbshipit-source-id: 5696cd6b00f33c9f9a256944fecb4e2f4d52a2e6	2024-08-08 15:37:19 -07:00
Hui Xiao	b26b395e0a	Fix CreateCheckpoint not handling failed CleanStagingDirectory well (#12894 ) Summary: Context/Summary: `CleanStagingDirectory()` is called when the temporary .tmp folder we use to create checkpoint is not empty to begin with. Expanded fault injection can make this call fail e.g, `Delete file /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/.checkpoint17.tmp/012393.sst -- IO error: injected metadata write error`. But The result of `CleanStagingDirectory()` is ignored in `CreateCheckpoint()`. So the injected IO error can't be propagated to db stress test and handled correctly. Hence we see `While mkdir: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/.checkpoint17.tmp: File exists` when we try to re-use a non-empty .tmp folder for new snapshots. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12894 Test Plan: Monitor CI Reviewed By: ltamasi Differential Revision: D60422849 Pulled By: hx235 fbshipit-source-id: 6f735c98eaa05d2b97ba4f781e0928357a50377a	2024-08-06 11:24:29 -07:00
Yu Zhang	719c96125c	Add a TransactionOptions to enable tracking timestamp size info inside WriteBatch (#12864 ) Summary: In normal use cases, meta info like column family's timestamp size is tracked at the transaction layer, so it's not necessary and even detrimental to track such info inside the internal WriteBatch because it may let anti-patterns like bypassing Transaction write APIs and directly write to its internal WriteBatch like this: `9d0a754dc9/storage/rocksdb/ha_rocksdb.cc (L4949-L4950)` Setting this option to true will keep aforementioned use case continue to work before it's refactored out. This option is only for this purpose and it will be gradually deprecated after aforementioned MyRocks use case are refactored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12864 Test Plan: Added unit tests Reviewed By: cbi42 Differential Revision: D60194094 Pulled By: jowlyzhang fbshipit-source-id: 64a98822167e99aa7e4fa2a60085d44a5deaa45c	2024-08-05 13:06:45 -07:00
Yu Zhang	d12aaf23ca	Fix file deletions in DestroyDB not rate limited (#12891 ) Summary: Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`. It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it. This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891 Test Plan: Added unit tests, existing tests. Reviewed By: anand1976 Differential Revision: D60300220 Pulled By: jowlyzhang fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7	2024-08-02 19:31:55 -07:00
anand76	55877d8893	Make transaction name conflict check more robust (#12895 ) Summary: The `PessimisticTransaction::SetName()` code checks for an existing txn of the given name before registering the new txn. However, this is not atomic, which could result in a race condition if two txns try to register with the same name. Both might succeed and lead to unpredictable behavior. This PR makes the test and set atomic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12895 Reviewed By: pdillinger Differential Revision: D60460482 Pulled By: anand1976 fbshipit-source-id: e8afeb2356e1b8f4e8df785cb73532739f82579d	2024-07-30 12:31:02 -07:00
Yu Zhang	24d86f7b41	Add an option to toggle timestamp based validation for the whole DB (#12857 ) Summary: As titled. This PR adds a `TransactionDBOptions` field `enable_udt_validation` to allow user to toggle the timestamp based validation behavior across the whole DB. When it is true, which is the default value and the existing behavior. A recap of what this behavior is: `GetForUpdate` does timestamp based conflict checking to make sure no other transaction has committed a version of the key tagged with a timestamp equal to or newer than the calling transaction's `read_timestamp_` the user set via `SetReadTimestampForValidation`. When this field is set to false, we disable timestamp based validation for the whole DB. MyRocks find it hard to find a read timestamp for this validation API, so we added this flexibility. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12857 Test Plan: Added unit test Reviewed By: ltamasi Differential Revision: D60194134 Pulled By: jowlyzhang fbshipit-source-id: b8507f8ddc37fc7a2948cf492ce5c599ae646fef	2024-07-29 13:54:37 -07:00
Hui Xiao	408e8d4c85	Handle injected write error after successful WAL write in crash test + misc (#12838 ) Summary: Context/Summary: We discovered the following false positive in our crash test lately: (1) PUT() writes k/v to WAL but fails in `ApplyWALToManifest()`. The k/v is in the WAL (2) Current stress test logic will rollback the expected state of such k/v since PUT() fails (3) If the DB crashes before recovery finishes and reopens, the WAL will be replayed and the k/v is in the DB while the expected state have been roll-backed. We decided to leave those expected state to be pending until the loop-write of the same key succeeds. Bonus: Now that I realized write to manifest can also fail the write which faces the similar problem as https://github.com/facebook/rocksdb/pull/12797, I decided to disable fault injection on user write per thread (instead of globally) when tracing is needed for prefix recovery; some refactory Pull Request resolved: https://github.com/facebook/rocksdb/pull/12838 Test Plan: Rehearsal CI Run below command (varies on sync_fault_injection=1,0 to verify ExpectedState behavior) for a while to ensure crash recovery validation works fine ``` python3 tools/db_crashtest.py --simple blackbox --interval=30 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=20 --delrangepercent=20 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=8 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=8 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=10 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=8 --writepercent=40 ``` Reviewed By: cbi42 Differential Revision: D59377075 Pulled By: hx235 fbshipit-source-id: 91f602fd67e2d339d378cd28b982095fd073dcb6	2024-07-29 13:51:49 -07:00
Jay Huh	086849aa4f	Properly disable MultiCFIterator in WritePrepared/UnPreparedTxnDBs (#12883 ) Summary: MultiCfIterators (`CoalescingIterator` and `AttributeGroupIterator`) are not yet compatible with write-prepared/write-unprepared transactions, yet (write-committed is fine). This fix includes the following. - Properly return `ErrorIterator` if the user attempts to use the `CoalescingIterator` or `AttributeGroupIterator` in WritePreparedTxnDB (and WriteUnpreparedTxnDB) - Set `use_multi_cf_iterator = 0` if `use_txn=1` and `txn_write_policy != 0 (WRITE_COMMITTED)` in stress test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12883 Test Plan: Works ``` ./db_stress ... --use_txn=1 --txn_write_policy=0 --use_multi_cf_iterator=1 ``` Fails ``` ./db_stress ... --use_txn=1 --txn_write_policy=1 --use_multi_cf_iterator=1 ``` Reviewed By: cbi42 Differential Revision: D60190784 Pulled By: jaykorean fbshipit-source-id: 3bc1093e81a4ef5753ba9b32c5aea997c21bfd33	2024-07-24 16:50:12 -07:00
Peter Dillinger	0e3e43f4d1	FaultInjectionTestFS follow-up and clean-up (#12861 ) Summary: In follow-up to https://github.com/facebook/rocksdb/issues/12852: * Use std::copy in place of copy_n for potentially overlapping buffer * Get rid of troublesome -1 idiom from `pos_at_last_append_` and `pos_at_last_sync_` * Small improvements to test FaultInjectionFSTest.ReadUnsyncedData Pull Request resolved: https://github.com/facebook/rocksdb/pull/12861 Test Plan: CI, crash test, etc. Reviewed By: cbi42 Differential Revision: D59757484 Pulled By: pdillinger fbshipit-source-id: c6fbdc2e97c959983184925a855cc8b0285fa23f	2024-07-15 10:28:34 -07:00
Peter Dillinger	72438a6788	Support read & write with unsynced data in FaultInjectionTestFS (#12852 ) Summary: Follow-up to https://github.com/facebook/rocksdb/issues/12729 and others to fix FaultInjectionTestFS handling the case where a live WAL is being appended to and synced while also being copied for checkpoint or backup, up to a known flushed (but not necessarily synced) prefix of the file. It was tricky to structure the code in a way that could handle a tricky race with Sync in another thread (see code comments, thanks Changyu) while maintaining good performance and test-ability. For more context, see the call to FlushWAL() in DBImpl::GetLiveFilesStorageInfo(). Also, the unit test for https://github.com/facebook/rocksdb/issues/12729 was neutered by https://github.com/facebook/rocksdb/issues/12797, and this re-enables the functionality it is testing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12852 Test Plan: unit test expanded/updated. Local runs of blackbox_crash_test. The implementation is structured so that a multi-threaded unit test is not needed to cover at least the code lines, as the race handling is folded into "catch up after returning unsynced and then a sync." Reviewed By: cbi42 Differential Revision: D59594045 Pulled By: pdillinger fbshipit-source-id: 94667bb72255e2952586c53bae2c2dd384e85a50	2024-07-12 16:01:57 -07:00
Yu Zhang	3db030d7ee	Fix bug for recovering a prepared but not committed txn (#12856 ) Summary: This PR fix a bug for recovering a prepared Transaction that can contain user-defined timestamps. The `Transaction::Put` type of APIs expect the key provided to be user key without timestamps. When the original transaction added a key for a column family that enables user-defined timestamps, say of size 8. Internally `WriteBatch::Put` will leave a placeholder 8 bytes for the final commit timestamp. For example: `cec28aa90f/db/write_batch.cc (L937)` When rebuilding this transaction from a `WriteBatch` from WAL log, we should consider this and remove the tailing 8 bytes of a key before adding it via the public Transaction write APIs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12856 Test Plan: Added unit test that would fail without this fix Reviewed By: cbi42 Differential Revision: D59656399 Pulled By: jowlyzhang fbshipit-source-id: c716aefa4d548770b691efe96ac8e6d7dab458b9	2024-07-11 16:25:35 -07:00
Hui Xiao	ebe2116240	Remove false-postive assertion in `FaultInjectionTestFS::RenameFile` (#12828 ) Summary: Context/Summary: The assertion `tlist.find(tdn.second) == tlist.end()` `9eebaf11cb/utilities/fault_injection_fs.cc (L1003)` can catch us false positive. Some context (1) When fault injection is enabled and db open fails because of that, crash test will retry open without injected error in order to proceed with a clean open: `9eebaf11cb/db_stress_tool/db_stress_test_base.cc (L3559)` `9eebaf11cb/db_stress_tool/db_stress_test_base.cc (L3586-L3639)` (2) a. `FaultInjectionTestFS::dir_to_new_files_since_last_sync` records files that are created but not yet synced. b. When we create CURRENT, we will first create a temp file and rename it as "CURRENT". As part of the renaming, we will [assert](`9eebaf11cb/utilities/fault_injection_fs.cc (L1003)`) `FaultInjectionTestFS::dir_to_new_files_since_last_sync ` doesn't already have a file named `CURRENT`. Suppose the following sequence of events happened: (1) 1st open, with metadata write error 1. As part of creating CURRENT file, added "CURRENT" to `FaultInjectionTestFS::dir_to_new_files_since_last_sync_` `9eebaf11cb/utilities/fault_injection_fs.cc (L735)` 2. `SyncDir()` here `9eebaf11cb/file/filename.cc (L412)` failed with injected metadata write error. Therefore, "CURRENT" file didn't get removed from `FaultInjectionTestFS::dir_to_new_files_since_last_sync_` as it would if `SyncDir()` succeeded `9eebaf11cb/utilities/fault_injection_fs.h (L344)` (2) 2st open 1. Attempted to create a CURRENT file and failed during renaming since `FaultInjectionTestFS::dir_to_new_files_since_last_sync_` already had a file called CURRENT. So will fail ``` assertion failed - tlist.find(tdn.second) == tlist.end() ``` This PR fixed this by removing the assertion. It used to catch us some missing sync of some directory (e.,g https://github.com/facebook/rocksdb/pull/10573) so we will keep thinking about a better way to catch that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12828 Test Plan: Command constantly failed before the fix but passed after the PR running for 10 minutes ``` python3 tools/db_crashtest.py --simple blackbox --interval=10 --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --adm_policy=2 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=1 --bloom_bits=10 --bottommost_compression_type=lz4hc --bottommost_file_compaction_delay=86400 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=tiered_auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=1 --compress_format_version=1 --compressed_secondary_cache_ratio=0.5 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=15 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kHot --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=2097152 --high_pri_pool_ratio=0 --index_block_restart_interval=2 --index_shortening=0 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=1 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=1000000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=32 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=1582 --universal_max_read_amp=4 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=1 --use_multi_get_entity=1 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=1 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --write_fault_one_in=8 --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D59241548 Pulled By: hx235 fbshipit-source-id: 5bb49e6a94943273f47578a2caf3d08ca5b67e5f	2024-07-09 15:35:54 -07:00
Yu Zhang	2e1b3f921f	Remove unreachable code (#12846 ) Summary: Removing some unreachable code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12846 Reviewed By: cbi42 Differential Revision: D59498423 Pulled By: jowlyzhang fbshipit-source-id: 6b2c51732d94b1f69a8ba7474b16a171d4e6d640	2024-07-09 09:24:43 -07:00
Changyu Bi	de16611a50	Fix WAL corruption in stress test (#12834 ) Summary: We are seeing WAL corruption in crash tests where wal_compression and recycled_wal are enabled. With wal_compression, we write a SetCompression record when creating a WAL, which can happen during DB open time. Our current stress test set up may write directly to the underlying WAL file during DB open, while writing to a buffer under TestFSWritableFile later for sync fault injection. This causes the last synced position to be inaccurately tracked in TestFSWritableFile and causes reads to return incorrect data. This PR removes the line that causes this mixture of WAL writes. Also updated TestFSWritableFile to avoid such a mixture of buffered and direct writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12834 Test Plan: the following command repros WAL corruption before this PR ``` ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=0 --bloom_bits=8 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=600 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=16777216 --compression_checksum=1 --compression_max_dict_buffer_bytes=1099511627775 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db_write_buffer_size=134217728 --default_temperature=kHot --default_write_temperature=kCold --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=crc32c --fill_cache=0 --flush_one_in=1000000 --format_version=2 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=13 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=2 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=113 --universal_max_read_amp=4 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=1 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=5 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=35 --preserve_unverified_changes=1 --db=/dev/shm/rocksdb_test/blackbox --expected_values_dir=/dev/shm/rocksdb_test/expected Choosing random keys with no overwrite ... (Re-)verified 0 unique IDs 2024/07/01-16:42:46 Initializing worker threads Crash-recovery verification passed :) 2024/07/01-16:42:46 Starting database operations ^C ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=0 --bloom_bits=8 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=600 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=16777216 --compression_checksum=1 --compression_max_dict_buffer_bytes=1099511627775 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db_write_buffer_size=134217728 --default_temperature=kHot --default_write_temperature=kCold --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=crc32c --fill_cache=0 --flush_one_in=1000000 --format_version=2 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=13 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=2 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=113 --universal_max_read_amp=4 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=1 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=5 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=35 --preserve_unverified_changes=1 --db=/dev/shm/rocksdb_test/blackbox --expected_values_dir=/dev/shm/rocksdb_test/expected Choosing random keys with no overwrite ... Crash-recovery verification passed :) 2024/07/01-16:43:02 Starting database operations Failure in BackupEngine::CreateNewBackup with: Corruption: bad record length under specified BackupEngineOptions: share_table_files: 1, share_files_with_checksum: 1, share_files_with_checksum_naming: 2147483650, schema_version: 1, max_background_operations: 1, backup_rate_limiter: 0x7f2373676280, restore_rate_limiter: 0, current_temperatures_override_manifest: 1, CreateBackupOptions: flush_before_backup: 0, decrease_background_thread_cpu_priority: 0, background_thread_cpu_priority: 2, RestoreOptions: keep_log_files: 1 (Empty string or missing field indicates default option or value is used) Verification failed: Backup/restore failed: Corruption: bad record length db_stress: db_stress_tool/db_stress_test_base.cc:528: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, const rocksdb::Status&, bool) const: Assertion `false' failed. Received signal 6 (Aborted) Invoking GDB for stack trace... ^CCouldn't get CS register: No such process. Couldn't get registers: No such process. [Inferior 1 (process 2097222) detached] ``` Reviewed By: pdillinger Differential Revision: D59260401 Pulled By: cbi42 fbshipit-source-id: fdcdaaab2e14b527b26fbdfa819b4fe3f745a4de	2024-07-02 13:02:39 -07:00
Peter Dillinger	0bb939611d	Avoid unnecessary work in internal calls to GetSortedWalFiles (#12831 ) Summary: We are seeing a number of crash test failures coming from checkpoint and backup code, likely from WalManager::GetSortedWalFiles -> ... -> WalManager::ReadFirstLine and this code path is not needed, because we don't need to know the sequence numbers of WAL files going into a checkpoint or backup. We can minimize the impact of whatever inconsistency is causing that problem by not relying on it where it's not needed. Similarly, when we only need a roughly accurate set of current WAL files, we don't need to query all the archived WAL files (and redundantly the live ones again). So this reduces filesystem queries and DB mutex acquires in creating backups and checkpoints. Needed follow-up: Figure out what is causing various failures with an apparent inconsistency where GetSortedWalFiles fails on reading a WAL file. If it's an injected failure, perhaps it's not propagating that injected failure appropriately. It might also be an inconsistency between what the DB knows is flushed and what WalManager reads from the filesystem (which we know is dubious and should be phased out, which this is arguably another step toward). Or completing that phase-out might solve the problem without a full diagnosis. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12831 Test Plan: existing tests (easily caught when I went too far in initally developing this change) Update to BackupUsingDirectIO test so that there's a WAL file in what is backed up. (Was relying on some oddity.) Reviewed By: cbi42 Differential Revision: D59252649 Pulled By: pdillinger fbshipit-source-id: 7ad4187a1c70caa59a6d6c1c643ef95232b929f5	2024-07-01 23:29:02 -07:00
HypenZou	5bb7f95ed6	Don't take archived log size into account when calculating log size for flush (#12680 ) Summary: Context/Summary: It seems unreasonable to take the archived log size into account when calculating log size for flush in method CreateCheckpoint. If the user sets WAL_ttl_seconds or WAL_size_limit_MB, the argument _log_size_for_flush_ can easily be reached due to the size of the archived dir. As a result, the flush may always be triggered. Test corverd by ./checkpoint_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/12680 Reviewed By: jaykorean Differential Revision: D59097904 Pulled By: ajkr fbshipit-source-id: 0ed29c1b078d8f40b85288541b008e00dbc517d3	2024-06-28 11:56:26 -07:00
Hui Xiao	0d93c8a6ca	Decouple sync fault and write injection in FaultInjectionTestFS & fix tracing issue under WAL write error injection (#12797 ) Summary: Context/Summary: After injecting write error to WAL, we started to see crash recovery verification failure in prefix recovery. That's because the current tracing implementation traces every write before it writes to WAL even when the WAL write can fail with write error injection. One consequence of that is the traced writes in trace files does not corresponding to write sequence sequence anymore e.g, it has more traced writes that the actual assigned sequence number to successful writes. Therefore `b4a84efb4e/db_stress_tool/expected_state.cc (L674)` won't restore the ExpectedState to the correct sequence number we want. Ideally, we should have a prepare-commit mechanism for tracing just like our ExpectedState so we can ignore the traced write if the write fails later. But for now, to simplify, we simply don't inject WAL error (and metadata write error cuz it could fail write when sync WAL dir fails) To do so, we need to be able to exclude WAL from write injection but still allow sync fault injection in it to maintain its original sync fault testing coverage. This prompts us to decouple sync fault and write injection in FaultInjectionTestFS. And this is what this PR mainly about. So now `FaultInjectionTestFS` works as the following: - If direct_writable is true, then `FaultInjectionTestFS` is bypassed for writable file - Otherwise, FaultInjectionTestFS` can buffer data for sync fault injection (if inject_unsynced_data_loss_ == true, global settings) and/or inject write error (if MaybeInjectThreadLocalError(), thread-local settings). WAL file can be optionally excluded from write injection Bonus: better naming of relevant variables Pull Request resolved: https://github.com/facebook/rocksdb/pull/12797 Test Plan: - The follow commands failed before this fix but passes after ``` python3 tools/db_crashtest.py --simple blackbox \ --interval=5 \ --preserve_unverified_changes=1 \ --threads=32 \ --disable_auto_compactions=1 \ --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=0 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=3.2003682301518492 --bottommost_compression_type=zlib --bottommost_file_compaction_delay=600 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=2 --compaction_readahead_size=0 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=16777216 --compression_checksum=1 --compression_max_dict_buffer_bytes=549755813887 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=00:00-23:59 --data_block_index_type=0 \ --db_write_buffer_size=0 --delete_obsolete_files_period_micros=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=0 --disable_manual_compaction_one_in=0 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --fill_cache=0 --flush_one_in=100 --format_version=4 --get_all_column_family_metadata_one_in=0 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=0 --get_properties_of_all_tables_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=9 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=0 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=1000 --max_key_len=3 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 \ --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=0 --readpercent=0 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=0 --subcompactions=1 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=9890 --universal_max_read_amp=-1 --unpartitioned_pinning=3 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=0 --verify_checksum_one_in=0 --verify_compression=1 --verify_db_one_in=0 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=335544320 --write_dbid_to_manifest=1 --write_fault_one_in=100 --writepercent=100 ``` - CI Reviewed By: cbi42 Differential Revision: D58917145 Pulled By: hx235 fbshipit-source-id: b6397036bea035a92341c2b05fb01872db2153d7	2024-06-26 14:56:35 -07:00
Hui Xiao	41c6b4549a	Revert back to previous ReadAsync error injection (#12811 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/12713 adjusted the error injection in ReadAsync. See original behavior here `71f9e6b5b3/utilities/fault_injection_fs.cc (L456-L484)` The PR returns the injected error instead of the ReadAsync() status. It also allows cb to be call in `TestFSRandomAccessFile` layer when ReadAsync() and the cb can called within `FSRandomAccessFile` layer so cb can be double called. It appears to be the root-cause of the following frequent error`AddressSanitizer: heap-use-after-free on rocksdb::RandomAccessFileReader::ReadAsync` though I don't have a confirmed repro yet. Considering this change to mostly revert to previous behavior, it should be safe to proceed anyway. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12811 Test Plan: Monitor CI Reviewed By: jaykorean Differential Revision: D59067927 Pulled By: hx235 fbshipit-source-id: 8645e5a52d44b7ed2186438f885b4ea13f10b59d	2024-06-26 13:29:10 -07:00
Hui Xiao	6f79496475	Tag FaultInjectionIOType::kRead for FaultInjectionTestFS new read file & fix unrelease snapshot (#12810 ) Summary: Context/Summary: It makes more sense to mark error injection during creation as read file as "kread" so we don't get confusing msg like below ``` stderr: error : Get() returns IO error: injected metadata write error for key: 000000000000004F000000000000012B00000000000000EF. Verification failed :( ``` Also an early return here `e0ddbee76f/db_stress_tool/db_stress_test_base.cc (L2871)` can lead to unreleased snapshot upon DB restart `Non-ok close status: Operation aborted: Cannot close DB with unreleased snapshot`. This PR fixed it too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12810 Test Plan: CI Reviewed By: jowlyzhang Differential Revision: D59022154 Pulled By: hx235 fbshipit-source-id: 18c489d4692e2eb4fb32937967f57c8a81010cc3	2024-06-25 15:44:07 -07:00
Hui Xiao	56f7ef50d7	Fix nullptr access and race to fault_fs_guard (#12799 ) Summary: Context/Summary: There are a couple places where we forgot to check fault_fs_guard before accessing it. So we can see something like this occasionally ``` =138831==Hint: address points to the zero page. SCARINESS: 10 (null-deref) AddressSanitizer:DEADLYSIGNAL #0 0x18b9e0b in rocksdb::ThreadLocalPtr::Get() const fbcode/internal_repo_rocksdb/repo/util/thread_local.cc:503 https://github.com/facebook/rocksdb/issues/1 0x83d8b7 in rocksdb::StressTest::TestCompactRange(rocksdb::ThreadState, long, rocksdb::Slice const&, rocksdb::ColumnFamilyHandle) fbcode/internal_repo_rocksdb/repo/utilities/fault_injection_fs.h ``` Also accessing of `io_activties_exempted_from_fault_injection.find` not fully synced so we see the following ``` WARNING: ThreadSanitizer: data race (pid=90939) Write of size 8 at 0x7b4c000004d0 by thread T762 (mutexes: write M0): #0 std::_Rb_tree<rocksdb::Env::IOActivity, rocksdb::Env::IOActivity, std::_Identity<rocksdb::Env::IOActivity>, std::less<rocksdb::Env::IOActivity>, std::allocator<rocksdb::Env::IOActivity>>::operator=(std::_Rb_tree<rocksdb::Env::IOActivity, rocksdb::Env::IOActivity, std::_Identity<rocksdb::Env::IOActivity>, std::less<rocksdb::Env::IOActivity>, std::allocator<rocksdb::Env::IOActivity>> const&) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/stl_tree.h:208 (db_stress+0x411c32) (BuildId: b803e5aca22c6b080defed8e85b7bfec) https://github.com/facebook/rocksdb/issues/1 rocksdb::DbStressListener::OnErrorRecoveryCompleted(rocksdb::Status) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/stl_set.h:298 (db_stress+0x4112e5) (BuildId: b803e5aca22c6b080defed8e85b7bfec) https://github.com/facebook/rocksdb/issues/2 rocksdb::EventHelpers::NotifyOnErrorRecoveryEnd(std::vector<std::shared_ptr<rocksdb::EventListener>, std::allocator<std::shared_ptr<rocksdb::EventListener>>> const&, rocksdb::Status const&, rocksdb::Status const&, rocksdb::InstrumentedMutex) fbcode/internal_repo_rocksdb/repo/db/event_helpers.cc:239 (db_stress+0xa09d60) (BuildId: b803e5aca22c6b080defed8e85b7bfec) Previous read of size 8 at 0x7b4c000004d0 by thread T131 (mutexes: write M1): #0 rocksdb::FaultInjectionTestFS::MaybeInjectThreadLocalError(rocksdb::FaultInjectionIOType, rocksdb::IOOptions const&, rocksdb::FaultInjectionTestFS::ErrorOperation, rocksdb::Slice, bool, char, bool, bool) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/stl_tree.h:798 (db_stress+0xf7d0f3) (BuildId: b803e5aca22c6b080defed8e85b7bfec) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12799 Test Plan: CI Reviewed By: jowlyzhang Differential Revision: D58917449 Pulled By: hx235 fbshipit-source-id: f24fc1acc2a7d91f9f285447a97ba41397f48dbd	2024-06-24 16:10:36 -07:00
Hui Xiao	1adb935720	Inject more errors to more files in stress test (#12713 ) Summary: Context: We currently have partial error injection: - DB operation: all read, SST write - DB open: all read, SST write, all metadata write. This PR completes the error injection (with some limitations below): - DB operation & open: all read, all write, all metadata write, all metadata read Summary: - Inject retryable metadata read, metadata write error concerning directory (e.g, dir sync, ) or file metadata (e.g, name, size, file creation/deletion...) - Inject retryable errors to all major file types: random access file, sequential file, writable file - Allow db stress test operations to handle above injected errors gracefully without crashing - Change all error injection to thread-local implementation for easier disabling and enabling in the same thread. For example, we can control error handling thread to have no error injection. It's also cleaner in code. - Limitation: compared to before, we now don't have write fault injection for backup/restore CopyOrCreateFiles work threads since they use anonymous background threads as well as read injection for db open bg thread - Add a new flag to test error recovery without error injection so we can test the path where error recovery actually succeeds - Some Refactory & fix to db stress test framework (see PR review comments) - Fix some minor bugs surfaced (see PR review comments) - Limitation: had to disable backup restore with metadata read/write injection since it surfaces too many testing issues. Will add it back later to focus on surfacing actual code/internal bugs first. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12713 Test Plan: - Existing UT - CI with no trivial error failure Reviewed By: pdillinger Differential Revision: D58326608 Pulled By: hx235 fbshipit-source-id: 011b5195aaeb6011641ae0a9194f7f2a0e325ad7	2024-06-19 08:42:00 -07:00
Hui Xiao	d0259c2c98	Enable reading un-synced data in db stress test (#12752 ) Summary: Context/Summary: There are a few blockers to enabling reading un-synced data in db stress test (1) GetFileSize() will always return 0 for file written under direct IO because we don't track the last flushed position for `TestFSWritableFile` under direct IO. So it will surface as ``` Verification failed: VerifyChecksum failed: Corruption: file is too short (0 bytes) to be an sstable: /tmp/rocksdb_crashtest_blackbox4deg_c5e/000009.sst db_stress: db_stress_tool/db_stress_test_base.cc:518: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState, std::string, const rocksdb::Status&, bool) const: Assertion `false' failed. Received signal 6 (Aborted) Invoking GDB for stack trace... ``` (2) A couple minor FIXME in left in https://github.com/facebook/rocksdb/pull/12729. This PR fixed (1) and (2) and enabled reading un-synced data in stress test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12752 Test Plan: - The following command failed before this PR and passed after. ``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=100 --adaptive_readahead=0 --adm_policy=0 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=10000 --block_align=0 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=37.92024930098943 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=1000000 --checksum_type=kXXH3 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=10 --compress_format_version=2 --compressed_secondary_cache_size=8388608 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_blackbox4deg_c5e --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kHot --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=1 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_8whyhdxm --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --fill_cache=1 --flush_one_in=1000 --format_version=4 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=9 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=100 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=2 --max_total_wal_size=0 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=0 --stats_history_buffer_size=0 --strict_bytes_per_sync=0 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --uncache_aggressiveness=1 --universal_max_read_amp=-1 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=1 --use_multiget=0 --use_put_entity_one_in=5 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=10 --verify_compression=1 --verify_db_one_in=10000 --verify_file_checksums_one_in=10 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35 Verification failed: VerifyChecksum failed: Corruption: file is too short (0 bytes) to be an sstable: /tmp/rocksdb_crashtest_blackbox4deg_c5e/000009.sst db_stress: db_stress_tool/db_stress_test_base.cc:518: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState, std::string, const rocksdb::Status&, bool) const: Assertion `false' failed. Received signal 6 (Aborted) Invoking GDB for stack trace... ``` - Run python3 tools/db_crashtest.py --simple blackbox --lock_wal_one_in=10 --backup_one_in=10 --sync_fault_injection=0 --use_direct_io_for_flush_and_compaction=0 for 1 hour - Monitor stress test CI Reviewed By: pdillinger Differential Revision: D58395807 Pulled By: hx235 fbshipit-source-id: 7d4b321acc0a0af3501b62dc417a7f6e2d318265	2024-06-18 14:41:14 -07:00
Yu Zhang	c73cf7a878	Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760 ) Summary: This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier. A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following: 1) Setting the ratio to be equal to or smaller than 0 disables this collector 2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met. 3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction. For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760 Test Plan: Added unit tests Reviewed By: pdillinger Differential Revision: D58734976 Pulled By: jowlyzhang fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81	2024-06-18 10:51:29 -07:00
Peter Dillinger	0646ec6e2d	Ensure Close() before LinkFile() for WALs in Checkpoint (#12734 ) Summary: POSIX semantics for LinkFile (hard links) allow linking a file that is still being written two, with both the source and destination showing any subsequent writes to the source. This may not be practical semantics for some FileSystem implementations such as remote storage. They might only link the flushed or sync-ed file contents at time of LinkFile, or might even have undefined behavior if LinkFile is called on a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731 to bring more hygiene to our handling of WAL files in Checkpoint. Specifically, we now Close WAL files as soon as they are either (a) inactive and fully synced, or (b) inactive and obsolete (so maybe never fully synced), rather than letting Close() happen in handling obsolete files (maybe a background thread). This should not be a performance issue as Close() should be trivial cost relative to other IO ops, but just in case: * We don't Close() while holding a mutex, to avoid blocking, and * The old behavior is available with a new kill switch option `background_close_inactive_wals`. Stacked on https://github.com/facebook/rocksdb/issues/12731 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734 Test Plan: Extended existing unit test, especially adding a hygiene check to FaultInjectionTestFS to detect LinkFile() on a file still open for writes. FaultInjectionTestFS already has relevant tracking data, and tests can opt out of the new check, as in a smoke test I have left for the old, deprecated functionality `background_close_inactive_wals=true`. Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK with the crash test. (The only place I can find we use LinkFile in production is Checkpoint.) Reviewed By: cbi42 Differential Revision: D58295284 Pulled By: pdillinger fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6	2024-06-12 11:48:45 -07:00
Peter Dillinger	98393f0139	Fix Checkpoint hard link of inactive but unsynced WAL (#12731 ) Summary: Background: there is one active WAL file but there can be several more WAL files in various states. Those other WALs are always in a "flushed" state but could be on the `logs_` list not yet fully synced. We currently allow any WAL that is not the active WAL to be hard-linked when creating a Checkpoint, as although it might still be open for write, we are not appending any more data to it. The problem is that a created Checkpoint is supposed to be fully synced on return of that function, and a hard-linked WAL in the state described above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083, it would synced if using track_and_verify_wals_in_manifest=true.) The fix is a step toward a long term goal of removing the need to query the filesystem to determine WAL files and their state. (I consider it dubious any time we independently read from or query metadata from a file we have open for writing, as this makes us more susceptible to FileSystem deficiencies or races.) More specifically: * Detect which WALs might not be fully synced, according to our DBImpl metadata, and prevent hard linking those (with `trim_to_size=true` from `GetLiveFilesStorageInfo()`. And while we're at it, use our known flushed sizes for those WALs. * To avoid a race between that and GetSortedWalFiles(), track a maximum needed WAL number for the Checkpoint/GetLiveFilesStorageInfo. * Because of the level of consistency provided by those two, we no longer need to consider syncing as part of the FlushWAL in GetLiveFilesStorageInfo. (We determine the max WAL number consistent with the manifest file size, while holding DB mutex. Should make track_and_verify_wals_in_manifest happy.) This makes the premise of test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback no longer hit) so the test is removed, with crash test as backstop for related issues. See https://github.com/facebook/rocksdb/issues/10185 Stacked on https://github.com/facebook/rocksdb/issues/12729 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731 Test Plan: Expanded an existing test, which now fails before fix. Also long runs of blackbox_crash_test with amplified checkpoint frequency. Reviewed By: cbi42 Differential Revision: D58199629 Pulled By: pdillinger fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382	2024-06-05 17:40:09 -07:00
Peter Dillinger	9f4c597d83	FaultInjectionTestFS read unsynced data by default (#12729 ) Summary: In places (e.g. GetSortedWals()) RocksDB relies on querying the file size or even reading the contents of files currently open for writing, and as in POSIX semantics, expects to see the flushed size and contents regardless of what has been synced. FaultInjectionTestFS historically did not emulate this behavior, only showing synced data from such read operations. (Different from FaultInjectionTestEnv--sigh.) This change makes the "proper" behavior the default behavior, at least for GetFileSize and FSSequentialFile. However, this new functionality is disabled in db_stress because of undiagnosed, unresolved issues. Also removes unused and confusing field `pos_at_last_flush_` This change is needed to support testing a relevant bug fix (in a follow-up diff). Other suggested follow-up: * Fix db_stress not to rely on the old behavior, and fix a related FIXME in db_stress_test_base.cc in LockWAL testing. * Fill in some corner cases in the FileSystem API for reading unsynced data (see new TODO items). * Consider deprecating and removing Flush() API functions from FileSystem APIs. It is not clear to me that there is a supported scenario in which they do anything but confuse API users and developers. If there is a use for them, it doesn't appear to be tested. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12729 Test Plan: applies to all unit tests successfully, just updating the unit test from https://github.com/facebook/rocksdb/issues/12556 due to relying on the errant behavior. Also added a specific unit test Reviewed By: hx235 Differential Revision: D58091835 Pulled By: pdillinger fbshipit-source-id: f47a63b2b000f5875b6293a98577bff663d7fd33	2024-06-04 15:25:23 -07:00
Yu Zhang	fc59d8f9c6	Add public API `WriteWithCallback` to support custom callbacks (#12603 ) Summary: This PR adds a `DB::WriteWithCallback` API that does the same things as `DB::Write` while takes an argument `UserWriteCallback` to execute custom callback functions during the write. We currently support two types of callback functions: `OnWriteEnqueued` and `OnWalWriteFinish`. The former is invoked after the write is enqueued, and the later is invoked after WAL write finishes when applicable. These callback functions are intended for users to use to improve synchronization between concurrent writes, their execution is on the write's critical path so it will impact the write's latency if not used properly. The documentation for the callback interface mentioned this and suggest user to keep these callback functions' implementation minimum. Although transaction interfaces' writes doesn't yet allow user to specify such a user write callback argument, the `DBImpl::Write*` type of APIs do not differentiate between regular DB writes or writes coming from the transaction layer when it comes to supporting this `UserWriteCallback`. These callbacks works for all the write modes including: default write mode, Options.two_write_queues, Options.unordered_write, Options.enable_pipelined_write Pull Request resolved: https://github.com/facebook/rocksdb/pull/12603 Test Plan: Added unit test in ./write_callback_test Reviewed By: anand1976 Differential Revision: D58044638 Pulled By: jowlyzhang fbshipit-source-id: 87a84a0221df8f589ec8fc4d74597e72ce97e4cd	2024-05-31 19:30:19 -07:00
Jaepil Jeong	c115eb6162	Fix compile errors in C++23 (#12106 ) Summary: This PR fixes compile errors in C++23. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12106 Reviewed By: cbi42 Differential Revision: D57826279 Pulled By: ajkr fbshipit-source-id: 594abfd8eceaf51eaf3bbabf7696c0bb5e0e9a68	2024-05-28 15:33:57 -07:00
Peter Dillinger	d2ef70872f	Rename, deprecate `LogFile` and `VectorLogPtr` (#12695 ) Summary: These names are confusing with `Logger` etc. so moving to `WalFile` etc. Other small, related name refactorings. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12695 Test Plan: Left most unit tests using old names as an API compatibility test. Non-test code compiles with deprecated names removed. No functional changes. Reviewed By: ajkr Differential Revision: D57747458 Pulled By: pdillinger fbshipit-source-id: 7b77596b9c20d865d43b9dc66c30c8bd2b3b424f	2024-05-28 09:24:49 -07:00
Yu Zhang	9a72cf1a61	Add timestamp support in dump_wal/dump/idump (#12690 ) Summary: As titled. For dumping wal files, since a mapping from column family id to the user comparator object is needed to print the timestamp in human readable format, option `[--db=<db_path>]` is added to `dump_wal` command to allow the user to choose to optionally open the DB as read only instance and dump the wal file with better timestamp formatting. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12690 Test Plan: Manually tested dump_wal: [dump a wal file specified with --walfile] ``` >> ./ldb --walfile=$TEST_DB/000004.log dump_wal --print_value >>1,1,28,13,PUT(0) : 0x666F6F0100000000000000 : 0x7631 (Column family id: [0] contained in WAL are not opened in DB. Applied default hex formatting for user key. Specify --db=<db_path> to open DB for better user key formatting if it contains timestamp.) ``` [dump with --db specified for better timestamp formatting] ``` >> ./ldb --walfile=$TEST_DB/000004.log dump_wal --db=$TEST_DB --print_value >> 1,1,28,13,PUT(0) : 0x666F6F\|timestamp:1 : 0x7631 ``` dump: [dump a file specified with --path] ``` >>./ldb --path=/tmp/rocksdbtest-501/column_family_test_75359_17910784957761284041/000004.log dump Sequence,Count,ByteSize,Physical Offset,Key(s) : value 1,1,28,13,PUT(0) : 0x666F6F0100000000000000 : 0x7631 (Column family id: [0] contained in WAL are not opened in DB. Applied default hex formatting for user key. Specify --db=<db_path> to open DB for better user key formatting if it contains timestamp.) ``` [dump db specified with --db] ``` >> ./ldb --db=/tmp/rocksdbtest-501/column_family_test_75359_17910784957761284041 dump >> foo\|timestamp:1 ==> v1 Keys in range: 1 ``` idump ``` ./ldb --db=$TEST_DB idump 'foo\|timestamp:1' seq:1, type:1 => v1 Internal keys in range: 1 ``` Reviewed By: ltamasi Differential Revision: D57755382 Pulled By: jowlyzhang fbshipit-source-id: a0a2ef80c92801cbf7bfccc64769c1191824362e	2024-05-23 20:26:57 -07:00
Levi Tamasi	62600cb2d4	Fix rebuilding transactions containing PutEntity (#12681 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12681 When rebuilding transactions during recovery, `MemtableInserter::PutCFImpl` currently calls `WriteBatchInternal::Put` regardless of value type, which is incorrect for `PutEntity` entries, as well as `TimedPut`s and the blob indexes used by the old BlobDB implementation. The patch fixes the handling of `PutEntity` and returns `NotSupported` for `TimedPut`s and blob indices. Reviewed By: jaykorean, jowlyzhang Differential Revision: D57636355 fbshipit-source-id: 833de4e4aa0b42ff6638b72c4181f981d12d0f15	2024-05-21 17:22:20 -07:00
Levi Tamasi	c87f5cf91c	Add GetEntityForUpdate to optimistic and WriteCommitted pessimistic transactions (#12668 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12668 The patch adds a new `GetEntityForUpdate` API to optimistic and WriteCommitted pessimistic transactions, which provides transactional wide-column point lookup functionality with concurrency control. For WriteCommitted transactions, user-defined timestamps are also supported similarly to the `GetForUpdate` API. Reviewed By: jaykorean Differential Revision: D57458304 fbshipit-source-id: 7eadbac531ca5446353e494abbd0635d63f62d24	2024-05-20 10:43:05 -07:00
raffertyyu	4dd084f66d	fix gcc warning about dangling-reference in backup_engine_test (#12637 ) Summary: gcc 14.1 reports some warnings about dangling-reference occured in backup_engine_test. ```c++ /data/rocksdb/utilities/backup/backup_engine_test.cc: In member function 'virtual void rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test::TestBody()': /data/rocksdb/utilities/backup/backup_engine_test.cc:4411:64: error: possibly dangling reference to a temporary [-Werror=dangling-reference] 4411 \| std::make_pair(alt_backup_engine, backup_engine_.get())}) { \| ^ /data/rocksdb/utilities/backup/backup_engine_test.cc:4410:23: note: the temporary was destroyed at the end of the full expression 'std::make_pair<rocksdb::BackupEngine, rocksdb::BackupEngine&>(((rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test)this)->rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test::rocksdb::{anonymous}::BackupEngineTest.rocksdb::{anonymous}::BackupEngineTest::backup_engine_.std::unique_ptr<rocksdb::BackupEngine>::get(), alt_backup_engine)' 4410 \| {std::make_pair(backup_engine_.get(), alt_backup_engine), \| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /data/rocksdb/utilities/backup/backup_engine_test.cc:4411:64: error: possibly dangling reference to a temporary [-Werror=dangling-reference] 4411 \| std::make_pair(alt_backup_engine, backup_engine_.get())}) { \| ^ /data/rocksdb/utilities/backup/backup_engine_test.cc:4411:23: note: the temporary was destroyed at the end of the full expression 'std::make_pair<rocksdb::BackupEngine&, rocksdb::BackupEngine>(alt_backup_engine, ((rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test)this)->rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test::rocksdb::{anonymous}::BackupEngineTest.rocksdb::{anonymous}::BackupEngineTest::backup_engine_.std::unique_ptr<rocksdb::BackupEngine>::get())' 4411 \| std::make_pair(alt_backup_engine, backup_engine_.get())}) { \| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``` It seems to be related to this update in gcc: https://gcc.gnu.org/gcc-14/changes.html#:~:text=%2DWdangling%2Dreference%20false%20positives%20have%20been%20reduced.%20The%20warning%20does%20not%20warn%20about%20std%3A%3Aspan%2Dlike%20classes%3B%20there%20is%20also%20a%20new%20attribute%20gnu%3A%3Ano_dangling%20to%20suppress%20the%20warning.%20See%20the%20manual%20for%20more%20info. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12637 Reviewed By: cbi42 Differential Revision: D57263996 Pulled By: ajkr fbshipit-source-id: 1e416c38240d3d1adda787fc484c0392e28bb7f1	2024-05-18 18:01:19 -07:00
Peter Dillinger	b75438f986	Allow disableWAL+recycle with WritePreparedTxnDB internals (#12639 ) Summary: Follow-up from https://github.com/facebook/rocksdb/issues/12403 The crash test was periodically failing with the "disableWAL option is not supported if recycle_log_file_num > 0" failure, despite not setting the disableWAL from the user side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12639 Test Plan: db_stress reproducer now passes. Added WAL recycling to txn DB unit tests, which is generally more difficult for correctness. Many tests now cover this change and pass. Reviewed By: anand1976 Differential Revision: D57227617 Pulled By: pdillinger fbshipit-source-id: db9abefeb505bce624b45bc64009694d2a5baed9	2024-05-10 17:56:40 -07:00
Levi Tamasi	b92d874c8b	Support MultiGetEntity in optimistic and WriteCommitted pessimistic transactions (#12634 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12634 The patch implements support for the `MultiGetEntity` API in optimistic transactions and pessimistic transactions with the WriteCommitted policy. Similarly to the other wide-column transaction APIs, the implementation leverages the `WriteBatchWithIndex` layer. Reviewed By: jaykorean Differential Revision: D57177638 fbshipit-source-id: 2d9f9f287fc97e7c126830b48d21457c7c35db3f	2024-05-09 16:49:38 -07:00
Levi Tamasi	97e70906fa	Improve the sanity checks in (Multi)GetEntity and friends (#12630 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12630 The patch cleans up, improves, and brings into sync (to the extent possible without API signature changes) the sanity checks around the `GetEntity` / `MultiGetEntity` family of APIs, including the read-your-own-writes (`WriteBatchWithIndex`) and transaction layers. The checks are centralized in two main sets of entry points, namely in `DB(Impl)` and the "main" `GetEntityFromBatchAndDB` / `MultiGetEntityFromBatchAndDB` overloads in `WriteBatchWithIndex`. This eliminates the need to duplicate the checks in the transaction classes. Reviewed By: jaykorean Differential Revision: D57125741 fbshipit-source-id: 4dd059ef644a9b173fbba767538943397e4cc6cd	2024-05-09 12:25:19 -07:00
Levi Tamasi	eaa3226ef7	Add support for GetEntity in optimistic and WriteCommitted pessimistic transactions (#12623 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12623 The PR adds support for the `GetEntity` API to optimistic and WriteCommitted pessimistic transactions. `MultiGetEntity` support and the `ForUpdate` variants of these read APIs will be implemented in subsequent PRs. Reviewed By: jaykorean Differential Revision: D57030879 fbshipit-source-id: 1f0aed6418782975fe537b6b3d437fad31fcbd43	2024-05-07 10:20:26 -07:00
Levi Tamasi	45c290660a	Add PutEntity support for optimistic and WritePrepared pessimistic transactions (#12606 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12606 The patch extends optimistic transactions and WriteCommitted pessimistic transactions with support for the `PutEntity` API. Similarly to the other APIs, `PutEntity` is available via both the `Transaction` and `TransactionDB` interfaces, where using the latter executes the write in a single-operation transaction as usual. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added in separate PRs. Reviewed By: jaykorean Differential Revision: D56911242 fbshipit-source-id: 57cf8bb6c6b1b40ba4a8a652831c13a617644289	2024-05-06 14:41:00 -07:00
Andrew Kryczka	0fef690bd5	Sync non-latest WALs during flush for 2PC, single-CF DBs (#12622 ) Summary: Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs". This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622 Reviewed By: cbi42 Differential Revision: D56987303 Pulled By: ajkr fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48	2024-05-06 11:56:16 -07:00
anand76	97991960e9	Retry DB::Open upon a corruption detected while reading the MANIFEST (#12518 ) Summary: This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427 . On file systems that support storage level data checksum and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but its a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518 Reviewed By: ajkr Differential Revision: D55932155 Pulled By: anand1976 fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9	2024-04-18 17:36:33 -07:00
Levi Tamasi	ef38d99edc	Sanity check the keys parameter in MultiGetEntityFromBatchAndDB (#12564 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12564 Similarly to how `db`, `column_family`, and `results` are handled, bail out early from `WriteBatchWithIndex::MultiGetEntityFromBatchAndDB` if `keys` is `nullptr`. Note that these checks are best effort in the sense that with the current method signature, the callee has no way of reporting an error if `statuses` is `nullptr` or catching other types of invalid pointers (e.g. when `keys` and/or `results` is non-`nullptr` but do not point to a contiguous range of `num_keys` objects). We can improve this (and many similar RocksDB APIs) using `std::span` in a major release once we move to C++20. Reviewed By: jaykorean Differential Revision: D56318179 fbshipit-source-id: bc7a258eda82b5f6c839f212ab824130e773a4f0	2024-04-18 14:26:58 -07:00
Levi Tamasi	0df601ab07	Reset user-facing wide-column stuctures upon deserialization failures (#12562 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12562 The patch makes a small usability improvement by consistently resetting any user-facing wide-column structures (`DBIter::columns()`, `BaseDeltaIterator::columns()`, and any `PinnableWideColumns` objects) upon encountering any deserialization failures. Reviewed By: jaykorean Differential Revision: D56312764 fbshipit-source-id: 44efed0d1720cc06bf6facf928f73ce39a1bd2ca	2024-04-18 13:08:34 -07:00
Levi Tamasi	c0aef2a28e	Add MultiGetEntityFromBatchAndDB to WriteBatchWithIndex (#12539 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12539 As a follow-up to https://github.com/facebook/rocksdb/pull/12533, this PR extends `WriteBatchWithIndex` with a `MultiGetEntityFromBatchAndDB` API that enables users to perform batched wide-column point lookups with read-your-own-writes consistency. This API transparently combines data from the indexed write batch and the underlying database as needed and presents the results in the form of a wide-column entity. Reviewed By: jaykorean Differential Revision: D56153145 fbshipit-source-id: 537967051b7521bb41b04070ac1a78a1d8873c08	2024-04-16 08:58:04 -07:00
Levi Tamasi	491c4fb0ed	Add GetEntityFromBatchAndDB to WriteBatchWithIndex (#12533 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12533 The PR extends `WriteBatchWithIndex` with a new wide-column point lookup API `GetEntityFromBatchAndDB`. Similarly to `GetFromBatchAndDB`, the new API can transparently combine data from the write batch with data from the underlying database as needed. Like `DB::GetEntity`, it returns any result in the form of a wide-column entity (i.e. plain key-values are wrapped into an entity with a single anonymous column). Reviewed By: jaykorean Differential Revision: D56069132 fbshipit-source-id: 4f19cdeea4ce136497ce79fc9d28c925de59e220	2024-04-15 09:20:47 -07:00
liuhu	6fbd02f258	Add dump all keys for cache dumper impl (#12500 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/12501 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12500 Reviewed By: jowlyzhang Differential Revision: D55922379 Pulled By: ajkr fbshipit-source-id: 0759afcec148d256a2d1cd5ef76fd988fab4a9af	2024-04-12 10:47:13 -07:00
Andrew Kryczka	8897bf2d04	Drop unsynced data in `TestFSWritableFile::Close()` (#12528 ) Summary: Our `FileSystem` for simulating unsynced data loss should not sync during `Close()` because it masks bugs where we forgot to sync as long as we closed the file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12528 Test Plan: Peeled back https://github.com/facebook/rocksdb/issues/10560 fix and verified it is caught much faster now (few seconds vs. ???) with command like ``` $ TEST_TMPDIR=./ python3 tools/db_crashtest.py blackbox --disable_wal=0 --max_key=1000 --write_buffer_size=131072 --max_bytes_for_level_base=524288 --target_file_size_base=131072 --interval=3 --sync_fault_injection=1 --enable_blob_files=0 --manual_wal_flush_one_in=10 --sync_wal_one_in=0 --get_live_files_one_in=0 --get_sorted_wal_files_one_in=0 --backup_one_in=0 --checkpoint_one_in=0 --write_fault_one_in=0 --read_fault_one_in=0 --open_write_fault_one_in=0 --compact_range_one_in=0 --compact_files_one_in=0 --open_read_fault_one_in=0 --get_property_one_in=0 --writepercent=100 -readpercent=0 -prefixpercent=0 -delpercent=0 -delrangepercent=0 -iterpercent=0 ``` Reviewed By: anand1976 Differential Revision: D56033250 Pulled By: ajkr fbshipit-source-id: 6bbf480d79a06c46f08f6214010937f6654af5ca	2024-04-12 09:57:56 -07:00
liuhu	b7f1eeb0ca	Cache dumper exit early due to deadline or max_dumped_size (#12491 ) Summary: In production, we need to control the duration time or max size of cache dumper to get better performance. Fixes https://github.com/facebook/rocksdb/issues/12494 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12491 Reviewed By: hx235 Differential Revision: D55905826 Pulled By: ajkr fbshipit-source-id: 9196a5e852c344d6783f7a8234e997c87215bd19	2024-04-11 21:56:45 -07:00
Andrew Kryczka	7dcd0bd833	Add GetLiveFilesStorageInfo to legacy BlobDB (#12468 ) Summary: It is an important function and should be correct on legacy BlobDB, even though using legacy BlobDB is not recommended Pull Request resolved: https://github.com/facebook/rocksdb/pull/12468 Reviewed By: cbi42 Differential Revision: D55231038 Pulled By: ajkr fbshipit-source-id: 2ac18e4c149590b373eb79cd92c0ca5e7fce94f2	2024-04-05 13:50:27 -07:00

1 2 3 4 5 ...

1602 Commits