rocksdb

Commit Graph

Author	SHA1	Message	Date
Peter Dillinger	f63428bcc7	Optimize, simplify filter block building (fix regression) (#12931 ) Summary: This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in https://github.com/facebook/rocksdb/issues/12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys). Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capabilty, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.) Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now. The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more intruction-level parallelism by reducing unnecessary control dependencies. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12931 Test Plan: existing tests (with some minor updates) Also manually ported over the pre-broken regression test described in https://github.com/facebook/rocksdb/issues/12870 and ran it (passed). Performance: Here we validate that an entire series of recent related PRs are a net improvement in aggregate. "Before" is with these PRs reverted: https://github.com/facebook/rocksdb/issues/12872 #12911 https://github.com/facebook/rocksdb/issues/12874 #12867 https://github.com/facebook/rocksdb/issues/12903 #12904. "After" includes this PR (and all of those, with base revision `16c21af`). Simultaneous test script designed to maximally depend on SST construction efficiency: ``` for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] \|\| (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 \| grep micros/op; done) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) \| tee results ``` Showing average ops/sec of "after" vs. "before" ``` -partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1 935586 vs. 928176 (+0.79%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0 930171 vs. 926801 (+0.36%) -partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1 910727 vs. 894397 (+1.8%) -partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1 929795 vs. 922007 (+0.84%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0 921924 vs. 917285 (+0.51%) -partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1 903393 vs. 887340 (+1.8%) ``` As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key. Reviewed By: jowlyzhang Differential Revision: D61138271 Pulled By: pdillinger fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043	2024-08-14 15:13:16 -07:00
Yu Zhang	d458331ee9	Move file tracking in VersionEditHandlerPointInTime to VersionBuilder (#12928 ) Summary: `VersionEditHandlerPointInTime` is tracking found files, missing files, intermediate files in order to decide to build a `Version` on negative edge trigger (transition from valid to invalid) without applying the current `VersionEdit`. However, applying `VersionEdit` and check completeness of a `Version` are specialization of `VersionBuilder`. More importantly, when we augment best efforts recovery to recover not just complete point in time Version but also a prefix of seqno for a point in time Version, such checks need to be duplicated in `VersionEditHandlerPointInTime` and `VersionBuilder`. To avoid this, this refactor move all the file tracking functionality in `VersionEditHandlerPointInTime` into `VersionBuilder`. To continue to let `VersionEditHandlerPIT` do the edge trigger check and build a `Version` before applying the current `VersionEdit`, a suite of APIs to supporting creating a save point and its associated functions are added in `VersionBuilder` to achieve this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12928 Test Plan: Existing tests Reviewed By: anand1976 Differential Revision: D61171320 Pulled By: jowlyzhang fbshipit-source-id: 604f66f8b1e3a3e13da59d8ba357c74e8a366dbc	2024-08-12 21:09:37 -07:00
anand76	c21fe1a47f	Add ticker stats for read corruption retries (#12923 ) Summary: Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923 Test Plan: Update existing tests Reviewed By: jowlyzhang Differential Revision: D61024687 Pulled By: anand1976 fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06	2024-08-12 15:32:07 -07:00
Hui Xiao	b65e29a4a9	Loosen a strong assertion in ExpectedValue::Exists() (#12932 ) Summary: Context/Summary: .... since it won't work in the PrepareDelete() path Pull Request resolved: https://github.com/facebook/rocksdb/pull/12932 Test Plan: CI Reviewed By: cbi42 Differential Revision: D61155155 Pulled By: hx235 fbshipit-source-id: 99b0784f6c903d70c7b3b88b53ae8e2c885de96f	2024-08-12 15:21:27 -07:00
SGZW	6727f0f58a	fix compaction_picker_test asan heap use after free (#12908 ) Summary: ![image](https://github.com/user-attachments/assets/3290fe18-aca2-4691-b072-fbbc96a15fb1) this testcase set syncpoint function which reference this test case heap variable "enable_per_key_placement_" and this sync point function will be triggered by another testcase, so asan will report asan heap use after free error Pull Request resolved: https://github.com/facebook/rocksdb/pull/12908 Reviewed By: hx235 Differential Revision: D60973363 Pulled By: cbi42 fbshipit-source-id: df4f488f51e7741784d5a92fc0a5fc538c5d5b1a	2024-08-09 15:06:37 -07:00
SGZW	5c456c4c08	fix compaction speedup for marked files ut (#12912 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12912 Reviewed By: hx235 Differential Revision: D60973460 Pulled By: cbi42 fbshipit-source-id: ebaa343757f09f7281884a512ebe3a7d6845c8b3	2024-08-09 15:05:02 -07:00
Hui Xiao	112bf15dca	Fix false-positive TestBackupRestore corruption (#12917 ) Summary: Context: https://github.com/facebook/rocksdb/pull/12838 allows a write thread encountered certain injected error to release the lock and sleep before retrying write in order to reduce performance cost. This requires adding checks like [this](`b26b395e0a/db_stress_tool/expected_value.cc (L29-L31)`) to prevent writing to the same key from another thread. The added check causes a false-positive failure when delete range + file ingestion + backup is used. Consider the following scenario: (1) Issue a delete range covering some key that do not exist and a key does exist (named as k1). k1 will have "pending delete" state while the keys that does not exit will have whatever state they already have since we don't delete a key that does not exist already. (2) After https://github.com/facebook/rocksdb/pull/12838, `PrepareDeleteRange(... &prepared)` will return `prepared = false`. So below logic will be executed and k1's "pending delete" won't get roll-backed nor committed. ``` std::vector<PendingExpectedValue> pending_expected_values = shared->PrepareDeleteRange(rand_column_family, rand_key, rand_key + FLAGS_range_deletion_width, &prepared); if (!prepared) { for (PendingExpectedValue& pending_expected_value : pending_expected_values) { pending_expected_value.PermitUnclosedPendingState(); } return s; } ``` (3) Issue an file ingestion covering k1 and another key k2. Similar to (2), we will have `shared->PreparePut(column_family, key, &prepared)` return `prepared = false` for k1 while k2 will have a "pending put" state. So below logic will be executed and k2's "pending put" state won't get roll-backed nor committed. ``` for (int64_t key = key_base; s.ok() && key < shared->GetMaxKey() && static_cast<int32_t>(keys.size()) < FLAGS_ingest_external_file_width; ++key) PendingExpectedValue pending_expected_value = shared->PreparePut(column_family, key, &prepared); if (!prepared) { pending_expected_value.PermitUnclosedPendingState(); for (PendingExpectedValue& pev : pending_expected_values) { pev.PermitUnclosedPendingState(); } return; } } ``` (4) Issue a backup and verify on k2. Below logic decides that k2 should exist in restored DB since it has a pending write state while k2 is never ingested into the original DB as (3) returns early. ``` bool Exists() const { return PendingPut() \|\| !IsDeleted(); } TestBackupRestore() { ... Status get_status = restored_db->Get( read_opts, restored_cf_handles[rand_column_families[i]], key, &restored_value); bool exists = thread->shared->Exists(rand_column_families[i], rand_keys[0]); if (get_status.ok()) { if (!exists && from_latest && ShouldAcquireMutexOnKey()) { std::ostringstream oss; oss << "0x" << key.ToString(true) << " exists in restore but not in original db"; s = Status::Corruption(oss.str()); } } else if (get_status.IsNotFound()) { if (exists && from_latest && ShouldAcquireMutexOnKey()) { std::ostringstream oss; oss << "0x" << key.ToString(true) << " exists in original db but not in restore"; s = Status::Corruption(oss.str()); } } ... } ``` So we see false-positive corruption like `Failure in a backup/restore operation with: Corruption: 0x000000000000017B0000000000000073787878 exists in original db but not in restore` A simple fix is to remove `PendingPut()` from `bool Exists() ` since it's called under a lock and should never see a pending write. However, in order for "under a lock and should never see a pending write" to be true, we need to remove the logic of releasing the lock during sleep in the write thread, which expose pending write to other thread that can call Exists() like back up thread. The downside of holding lock during sleep is blocking other write thread of the same key to proceed cuz they need to wait for the lock. This should happen rarely as the key of a thread is selected randomly in crash test like below. ``` void StressTest::OperateDb(ThreadState* thread) { for (uint64_t i = 0; i < ops_per_open; i++) { ... int64_t rand_key = GenerateOneKey(thread, i); ... } } ``` Summary: - Removed the "lock release" part and related checks - Printed recovery time if the write thread waited more than 10 seconds - Reverted regression in testing coverage when deleting a non-existent key Pull Request resolved: https://github.com/facebook/rocksdb/pull/12917 Test Plan: Below command repro-ed frequently before the fix and not after. ``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --blob_cache_size=8388608 --blob_compaction_readahead_size=1048576 --blob_compression_type=none --blob_file_size=1073741824 --blob_file_starting_level=1 --blob_garbage_collection_age_cutoff=0.0 --blob_garbage_collection_force_threshold=0.75 --block_align=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=16.216959977115277 --bottommost_compression_type=xpress --bottommost_file_compaction_delay=600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=1000000 --checksum_type=kXXH3 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=0 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=10 --compress_format_version=2 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=2097151 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=0 --delrangepercent=5 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_blob_files=0 --enable_blob_garbage_collection=1 --enable_checksum_handoff=1 --enable_compaction_filter=1 --enable_custom_split_merge=1 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=2 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=2097152 --high_pri_pool_ratio=0.5 --index_block_restart_interval=1 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=1000 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=0 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=1 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=0 --min_blob_size=16 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=10000 --periodic_compaction_seconds=10 --prefix_size=8 --prefixpercent=0 --prepopulate_blob_cache=1 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=60 --recycle_log_file_num=1 --reopen=20 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=118 --universal_max_read_amp=-1 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_blob_cache=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D60890580 Pulled By: hx235 fbshipit-source-id: 401f90d6d351c7ee11088cad06fb00e54062d416	2024-08-09 14:51:36 -07:00
Hui Xiao	16c21afc06	Fix failure to clean the temporary directory due to NotFound in crash test checkpoint creation (#12919 ) Summary: Context/Summary: `b26b395e0a` propagates `CleanStagingDirectory()` status to `CreateCheckpoint()`. However, we didn't return early when `Status s = db_->GetEnv()->FileExists(full_private_path);` return non-NotFound non-ok stratus in `CleanStagingDirectory()`. Therefore we can proceed to the next step when `full_private_path` doesn't exist. ``` Verification failed: Checkpoint failed: Operation aborted: Failed to clean the temporary directory /dev/shm/rocksdb.J4Su/rocksdb_crashtest_blackbox/.checkpoint28.tmp needed before checkpoint creation : NotFound: db_stress: db_stress_tool/db_stress_test_base.cc:549: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, const rocksdb::Status&, bool) const: Assertion `false' failed. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12919 Test Plan: Below failed before the fix and passes after ``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2 --bloom_bits=4 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb.J4Su/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --default_temperature=kUnknown --default_write_temperature=kHot --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/dev/shm/rocksdb.J4Su/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=13 --index_shortening=0 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=128 --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=500000 --open_metadata_read_fault_one_in=8 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=0 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=3 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=1 --uncache_aggressiveness=0 --universal_max_read_amp=0 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=1 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=35 ``` Reviewed By: cbi42 Differential Revision: D60938952 Pulled By: hx235 fbshipit-source-id: 5696cd6b00f33c9f9a256944fecb4e2f4d52a2e6	2024-08-08 15:37:19 -07:00
Changyu Bi	b32d899482	Fix MultiGet dropping memtable kv checksum corruption (#12842 ) Summary: Corruption status returned by `GetFromTable()` could be overwritten here: `b6c3495a71/db/version_set.cc (L2614)` This PR fixes this issue by setting `*(s->found_final_value) = true;` in SaveValue. Also makes the handling of the return value of `GetFromTable()` more robust and added asserts in a couple places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12842 Test Plan: Updated an existing unit test to cover MultiGet. It fails the assertion here before this PR: `b6c3495a71/db/version_set.cc (L2601)` Reviewed By: anand1976 Differential Revision: D59498203 Pulled By: cbi42 fbshipit-source-id: 1f071c1b2c5b66fb71264b547a9e670d1cf592f0	2024-08-08 13:34:11 -07:00
Peter Dillinger	d33d25f903	Disable WAL recycling in crash test; reproducer for recovery data loss (#12918 ) Summary: I was investigating a crash test failure with "Corruption: SST file is ahead of WALs" which I haven't reproduced, but I did reproduce a data loss issue on recovery which I suspect could be the same root problem. The problem is already somewhat known (see https://github.com/facebook/rocksdb/issues/12403 and https://github.com/facebook/rocksdb/issues/12639) where it's only safe to recovery multiple recycled WAL files with trailing old data if the sequence numbers between them are adjacent (to ensure we didn't lose anything in the corrupt/obsolete WAL tail). However, aside from disableWAL=true, there are features like external file ingestion that can increment the sequence numbers without writing to the WAL. It is simply unsustainable to worry about this kind of feature interaction limiting where we can consume sequence numbers. It is very hard to test and audit as well. For reliable crash recovery of recycled WALs, we need a better way of detecting that we didn't drop data from one WAL to the next. Until then, let's disable WAL recycling in the crash test, to help stabilize it. Ideas for follow-up to fix the underlying problem: (a) With recycling, we could always sync the WAL before opening the next one. HOWEVER, this potentially very large sync could cause a big hiccup in writes (vs. O(1) sized manifest sync). (a1) The WAL sync could ensure it is truncated to size, or (a2) By requiring track_and_verify_wals_in_manifest, we could assume that the last synced size in the manifest is the final usable size of the WAL. (It might also be worth avoiding truncating recycled WALs.) (b) Add a new mechanism to record and verify the final size of a WAL without requiring a sync. (b1) By requiring track_and_verify_wals_in_manifest, this could be new WAL metadata recorded in the manifest (at the time of switching WALs). Note that new fields of WalMetadata are not forward-compatible, but a new kind of manifest record (next to WalAddition, WalDeletion; e.g. WalCompletion) is IIRC forward-compatible. (b2) A new kind of WAL header entry (not forward compatible, unfortunately) could record the final size of the previous WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12918 Test Plan: Added disabled reproducer for non-linear data loss on recovery Reviewed By: hx235 Differential Revision: D60917527 Pulled By: pdillinger fbshipit-source-id: 3663d79aec81851f5cf41669f84a712bb4563fd7	2024-08-07 14:20:45 -07:00
Peter Dillinger	b15f8c7f0e	Refactor db_bloom_filter_test (#12911 ) Summary: Ahead of a "decoupled" variant of partitioned filters, refactoring this unit test file to make it easier to incorporate that new variant. * bool test param to new enum class FilterPartitioning * Some cases of iterating over that bool to new parameterized test * Combine some common functionality for configuring parameterized options Pull Request resolved: https://github.com/facebook/rocksdb/pull/12911 Test Plan: no production changes, and no intentional changes to scope or conditions of tests Differential Revision: D60701287 fbshipit-source-id: 3497e3230e29a4f62c934bcb75693965a2df41d8	2024-08-07 11:28:16 -07:00
Hui Xiao	b26b395e0a	Fix CreateCheckpoint not handling failed CleanStagingDirectory well (#12894 ) Summary: Context/Summary: `CleanStagingDirectory()` is called when the temporary .tmp folder we use to create checkpoint is not empty to begin with. Expanded fault injection can make this call fail e.g, `Delete file /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/.checkpoint17.tmp/012393.sst -- IO error: injected metadata write error`. But The result of `CleanStagingDirectory()` is ignored in `CreateCheckpoint()`. So the injected IO error can't be propagated to db stress test and handled correctly. Hence we see `While mkdir: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/.checkpoint17.tmp: File exists` when we try to re-use a non-empty .tmp folder for new snapshots. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12894 Test Plan: Monitor CI Reviewed By: ltamasi Differential Revision: D60422849 Pulled By: hx235 fbshipit-source-id: 6f735c98eaa05d2b97ba4f781e0928357a50377a	2024-08-06 11:24:29 -07:00
Yu Zhang	719c96125c	Add a TransactionOptions to enable tracking timestamp size info inside WriteBatch (#12864 ) Summary: In normal use cases, meta info like column family's timestamp size is tracked at the transaction layer, so it's not necessary and even detrimental to track such info inside the internal WriteBatch because it may let anti-patterns like bypassing Transaction write APIs and directly write to its internal WriteBatch like this: `9d0a754dc9/storage/rocksdb/ha_rocksdb.cc (L4949-L4950)` Setting this option to true will keep aforementioned use case continue to work before it's refactored out. This option is only for this purpose and it will be gradually deprecated after aforementioned MyRocks use case are refactored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12864 Test Plan: Added unit tests Reviewed By: cbi42 Differential Revision: D60194094 Pulled By: jowlyzhang fbshipit-source-id: 64a98822167e99aa7e4fa2a60085d44a5deaa45c	2024-08-05 13:06:45 -07:00
Yu Zhang	36b061a6c7	Fix test breakage (#12915 ) Summary: https://github.com/facebook/rocksdb/issues/12891 updated this deletion rate in the test to be much higher, which makes the test flaky. The rate is being intentionally set to very low to maximize the retention of a ".log.trash" file after DB closes. This PR just change it back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12915 Reviewed By: ltamasi Differential Revision: D60776312 Pulled By: jowlyzhang fbshipit-source-id: d193557a042c65816fcc337cceb09905e042e9f6	2024-08-05 12:26:18 -07:00
Yu Zhang	d12aaf23ca	Fix file deletions in DestroyDB not rate limited (#12891 ) Summary: Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`. It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it. This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891 Test Plan: Added unit tests, existing tests. Reviewed By: anand1976 Differential Revision: D60300220 Pulled By: jowlyzhang fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7	2024-08-02 19:31:55 -07:00
Peter Dillinger	9d5c8c89a1	Fix filter partition size logic (#12904 ) Summary: Was checking == a desired number of entries added to a filter, when the combination of whole key and prefix filtering could add more than one entry per table internal key. This could lead to unnecessarily large filter partitions, which could affect performance and block cache fairness. Also (only somewhat related because of other work in progress): * Some variable renaming and a new assertion in BlockBasedTableBuilder, to add some clarity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12904 Test Plan: If you add assertion logic to the base revision checking that the partition cut is requested whenever `keys_added_to_partition_ >= keys_per_partition_`, it fails on a number of db_bloom_filter_test tests. However, such an assertion in the revised code would be essentially redundant with the new logic. If I added a regression test for this, it would be tricky and fragile, so I don't think it's important enough to chase and maintain. (Open to suggestions / input.) Reviewed By: jowlyzhang Differential Revision: D60557827 Pulled By: pdillinger fbshipit-source-id: 77a56097d540da6e7851941a26d26ced2d944373	2024-08-02 14:49:02 -07:00
Levi Tamasi	2e8a1a14ef	Fix a data race affecting the background error status (#12910 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12910 There is currently a call to `GetBGError()` in `DBImpl::WriteImplWALOnly()` where the DB mutex is (incorrectly) not held, leading to a data race. Technically, we could acquire the mutex here but instead, the patch removes the affected check altogether, since the same check is already performed (in a thread-safe manner) in the subsequent call to `PreprocessWrite()`. Reviewed By: cbi42 Differential Revision: D60682008 fbshipit-source-id: 54b67975dcf57d67c068cac71e8ada09a1793ec5	2024-08-02 14:11:08 -07:00
Peter Dillinger	9245550e8b	Clean up/refactor (Partitioned)FilterBlockBuilder (#12903 ) Summary: This is ahead of some related changes/enhancements. Refactorings here: * Restructure some state of PartitionedFilterBlockBuilder to reduce redundancy in state tracking, improve clarity. * Changed some function signatures to better match standard practice (return Status) * Improve comments, arrange related fields * Discourage/prevent production use of Finish without status (now TEST_Finish) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12903 Test Plan: existing tests Reviewed By: jowlyzhang Differential Revision: D60548613 Pulled By: pdillinger fbshipit-source-id: d7dbc79951fcc3b837877227d58f713698ad2596	2024-08-02 13:35:45 -07:00
Hui Xiao	5e203c76a2	SyncWAL() before Close() when FLAGS_avoid_flush_during_shutdown=true in crash test (#12900 ) Summary: Context/Summary: When we use WAL and don't flush data during shutdown `FLAGS_avoid_flush_during_shutdown=true`, then we rely on WAL to recover data in next Open() so will need to sync WAL in crash test. Currently the condition is flipped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12900 Test Plan: Below fails with data loss `Verification failed. Expected state has key 000000000000015D000000000000012B0000000000000147, iterator is at key 000000000000015D000000000000012B0000000000000152` before the fix but not after the fix ``` ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=3 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=100 --block_align=0 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=0 --bloom_bits=10 --bottommost_compression_type=disable --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=tiered_auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_style=1 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_ratio=0.3333333333333333 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=8 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox_2 --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_2 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000000 --format_version=5 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=13 --index_shortening=1 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kHot --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --report_bg_io_stats=1 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --uncache_aggressiveness=4404 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=1 --verify_db_one_in=10000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35 ``` Reviewed By: anand1976, ltamasi Differential Revision: D60489038 Pulled By: hx235 fbshipit-source-id: fb35889ae1509eb1bac27b015bb24a07d3b95268	2024-08-02 10:45:34 -07:00
Changyu Bi	8be824e316	Use compensated file size for intra-L0 compaction (#12878 ) Summary: In leveled compaction, we pick intra-L0 compaction instead of L0->Lbase whenever L0 size is small. When L0 files contain many deletions, it makes more sense to compact then down instead of accumulating tombstones in L0. This PR uses compensated_file_size when computing L0 size for determining intra-L0 compaction. Also scale down the limit on total L0 size further to be more cautious about accumulating data in L0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12878 Test Plan: updated unit test. Reviewed By: hx235 Differential Revision: D59932421 Pulled By: cbi42 fbshipit-source-id: 9de973ac51eb7df81b38b8c68110072b1aa06321	2024-08-01 17:49:34 -07:00
Yu Zhang	005256bcc8	Fix same user collected property being re-added in stress tests (#12907 ) Summary: As titled. The `emplace_back` below will add the same collector factory again during Reopen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12907 Reviewed By: pdillinger Differential Revision: D60614170 Pulled By: jowlyzhang fbshipit-source-id: a79498d209e4910a5e94a5cb742935015277918c	2024-08-01 16:21:39 -07:00
Levi Tamasi	8767267315	Attempt to fix the nightly build-linux-clang-13-asan-ubsan-with-folly build Summary: https://github.com/facebook/rocksdb/pull/12801 updated the version of `folly` used in RocksDB builds to a revision that requires `g++` version 10 when built with a GNU toolchain. This shouldn't really matter for this nightly GitHub Actions job, since we're supposed to be building with `clang++-13`; however, due to the way the compilers had been set, seems like we were historically only building RocksDB with `clang` (and `folly` with `gcc-9`, which led to a broken build after the update). Attempt to fix this by setting `CC` / `CXX` to `clang` / `clang++` in the job's environment. Reviewed By: pdillinger Differential Revision: D60534452 fbshipit-source-id: c7b5a02409fb1ea50e4524731237f7bc8d3f7ca6	2024-08-01 13:29:56 -07:00
Yu Zhang	319374ae67	Add some checks at property block creation side (#12898 ) Summary: Crash test encountered this failure: ```file ingestion error: Corruption: properties unsorted under specified IngestExternalFileOptions: move_files: 0, verify_checksums_before_ingest: 1, verify_checksums_readahead_size: 1048576 (Empty string or missing field indicates default option or value is used``` Further inspection showed out of order table properties in an external file created by `SstFileWriter` for ingestion, and the file is likely created like this because it passed the initial checksum check. This change added some assertions to check invariant at the properties creation and collecting side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12898 Test Plan: Existing tests Reviewed By: hx235 Differential Revision: D60459817 Pulled By: jowlyzhang fbshipit-source-id: 91474943d2f9d7795f00b6031c08a13ab91e2470	2024-07-31 13:28:17 -07:00
Peter Dillinger	2595476541	Fix rare WAL handling crash (#12899 ) Summary: A crash test failure in log sync in DBImpl::WriteToWAL is due to a missed case in https://github.com/facebook/rocksdb/issues/12734. Just need to apply similar logic from DBImpl::SyncWalImpl to check for an already closed WAL (nullptr writer). This is extremely rare because it only comes from failed Sync on a closed WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12899 Test Plan: watch crash test Reviewed By: cbi42 Differential Revision: D60481652 Pulled By: pdillinger fbshipit-source-id: 4a176bb6a53dcf077f88344710a110c2f946c386	2024-07-30 17:38:30 -07:00
anand76	55877d8893	Make transaction name conflict check more robust (#12895 ) Summary: The `PessimisticTransaction::SetName()` code checks for an existing txn of the given name before registering the new txn. However, this is not atomic, which could result in a race condition if two txns try to register with the same name. Both might succeed and lead to unpredictable behavior. This PR makes the test and set atomic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12895 Reviewed By: pdillinger Differential Revision: D60460482 Pulled By: anand1976 fbshipit-source-id: e8afeb2356e1b8f4e8df785cb73532739f82579d	2024-07-30 12:31:02 -07:00
Peter Dillinger	9058fd037c	Small CPU optimization to experimental range filters (#12893 ) Summary: By reusing an object that owns a vector. The vector allocation/sizing was substantial in a CPU profile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12893 Test Plan: existing tests Reviewed By: jowlyzhang Differential Revision: D60405139 Pulled By: pdillinger fbshipit-source-id: 8bfbc07cd9b4829f2ac9015e90f2b4eba61fd984	2024-07-29 14:23:35 -07:00
Yu Zhang	24d86f7b41	Add an option to toggle timestamp based validation for the whole DB (#12857 ) Summary: As titled. This PR adds a `TransactionDBOptions` field `enable_udt_validation` to allow user to toggle the timestamp based validation behavior across the whole DB. When it is true, which is the default value and the existing behavior. A recap of what this behavior is: `GetForUpdate` does timestamp based conflict checking to make sure no other transaction has committed a version of the key tagged with a timestamp equal to or newer than the calling transaction's `read_timestamp_` the user set via `SetReadTimestampForValidation`. When this field is set to false, we disable timestamp based validation for the whole DB. MyRocks find it hard to find a read timestamp for this validation API, so we added this flexibility. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12857 Test Plan: Added unit test Reviewed By: ltamasi Differential Revision: D60194134 Pulled By: jowlyzhang fbshipit-source-id: b8507f8ddc37fc7a2948cf492ce5c599ae646fef	2024-07-29 13:54:37 -07:00
Hui Xiao	408e8d4c85	Handle injected write error after successful WAL write in crash test + misc (#12838 ) Summary: Context/Summary: We discovered the following false positive in our crash test lately: (1) PUT() writes k/v to WAL but fails in `ApplyWALToManifest()`. The k/v is in the WAL (2) Current stress test logic will rollback the expected state of such k/v since PUT() fails (3) If the DB crashes before recovery finishes and reopens, the WAL will be replayed and the k/v is in the DB while the expected state have been roll-backed. We decided to leave those expected state to be pending until the loop-write of the same key succeeds. Bonus: Now that I realized write to manifest can also fail the write which faces the similar problem as https://github.com/facebook/rocksdb/pull/12797, I decided to disable fault injection on user write per thread (instead of globally) when tracing is needed for prefix recovery; some refactory Pull Request resolved: https://github.com/facebook/rocksdb/pull/12838 Test Plan: Rehearsal CI Run below command (varies on sync_fault_injection=1,0 to verify ExpectedState behavior) for a while to ensure crash recovery validation works fine ``` python3 tools/db_crashtest.py --simple blackbox --interval=30 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=20 --delrangepercent=20 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=8 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=8 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=10 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=8 --writepercent=40 ``` Reviewed By: cbi42 Differential Revision: D59377075 Pulled By: hx235 fbshipit-source-id: 91f602fd67e2d339d378cd28b982095fd073dcb6	2024-07-29 13:51:49 -07:00
Yu Zhang	d94c2adc28	Add entry for bug fix in #12882 (#12892 ) Summary: As titled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12892 Reviewed By: hx235 Differential Revision: D60400651 Pulled By: jowlyzhang fbshipit-source-id: 2dd60c2287143f464ecab0de859715af6ab3a825	2024-07-29 12:20:50 -07:00
Yu Zhang	9883b5f497	Fix manifest_number_ point to invalid file (#12882 ) Summary: This PR fix `VersionSet`'s `manifest_number_` could be pointing to an invalid number intermediately. This happens when a new manifest roll is attempted but fast failed after loading table handlers and before the new manifest file creation/writing is actually attempted. In theory, a later manifest roll effort will overthrow this intermediate invalid in memory state. There is on harm when the DB crashes in this invalid state either. But efforts that takes a file snapshot of the DB like backup will incorrectly try to copy a non existing manifest file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12882 Reviewed By: cbi42 Differential Revision: D60204956 Pulled By: jowlyzhang fbshipit-source-id: effbdb124b582f879d114988af06ac63867fc549	2024-07-24 17:50:08 -07:00
Yu Zhang	05c9c9aeed	Fix race between test and recovery flush switch memtable (#12884 ) Summary: As titled, to fix this type of data race: https://github.com/facebook/rocksdb/actions/runs/10066814221/job/27829003372?pr=12882 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12884 Test Plan: COMPILE_WITH_TSAN=1 make -j10 db_wal_test ./db_wal_test --gtest_filter=DBWALTest.RecoveryFlushSwitchWALOnEmptyMemtable --gtest_repeat=100 Reviewed By: anand1976 Differential Revision: D60197834 Pulled By: jowlyzhang fbshipit-source-id: 89524cdb4d17a1b647295bcccf5eb2d7d425bc6a	2024-07-24 17:06:16 -07:00
Jay Huh	086849aa4f	Properly disable MultiCFIterator in WritePrepared/UnPreparedTxnDBs (#12883 ) Summary: MultiCfIterators (`CoalescingIterator` and `AttributeGroupIterator`) are not yet compatible with write-prepared/write-unprepared transactions, yet (write-committed is fine). This fix includes the following. - Properly return `ErrorIterator` if the user attempts to use the `CoalescingIterator` or `AttributeGroupIterator` in WritePreparedTxnDB (and WriteUnpreparedTxnDB) - Set `use_multi_cf_iterator = 0` if `use_txn=1` and `txn_write_policy != 0 (WRITE_COMMITTED)` in stress test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12883 Test Plan: Works ``` ./db_stress ... --use_txn=1 --txn_write_policy=0 --use_multi_cf_iterator=1 ``` Fails ``` ./db_stress ... --use_txn=1 --txn_write_policy=1 --use_multi_cf_iterator=1 ``` Reviewed By: cbi42 Differential Revision: D60190784 Pulled By: jaykorean fbshipit-source-id: 3bc1093e81a4ef5753ba9b32c5aea997c21bfd33	2024-07-24 16:50:12 -07:00
Peter Dillinger	f456a7213f	Refactor IndexBuilder::AddIndexEntry (#12867 ) Summary: Something I am working on is going to expand usage of `BlockBasedTableBuilder::Rep::last_key`, but the existing code contract for `IndexBuilder::AddIndexEntry` makes that difficult because it modifies its `last_key` parameter to be the separator value recorded in the index, often something between the two boundary keys. This change primarily changes the contract of that function and related functions to separate function inputs and outputs, without sacrificing efficiency. For efficiency, a reusable scratch string buffer is provided by the caller, which the callee can use (or not) in returning a result Slice. That should yield a performance improvement as we are reusing a buffer for keys rather than copying into a new one each time in the FindShort* functions, without any additional string copies or conditional branches. Additional improvements in PartitionedIndexBuilder specifically: * Reduce string copies by eliminating `sub_index_last_key_` and instead tracking the key for the next partition in a placeholder Entry. * Simplify code and improve code quality by changing `sub_index_builder_` to unique_ptr. * Eliminate unnecessary NewFlushBlockPolicy call/object. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12867 Test Plan: existing tests, crash test. Will validate performance along with the change this is setting up. Reviewed By: anand1976 Differential Revision: D59793119 Pulled By: pdillinger fbshipit-source-id: 556da75cf13b967511f84702b2713d152f536a07	2024-07-22 14:27:31 -07:00
Hui Xiao	15d9988ab2	Update history and version for 9.5.fb release (#12880 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12880 Reviewed By: jaykorean, jowlyzhang Differential Revision: D60057955 Pulled By: hx235 fbshipit-source-id: 1c599a5334aff1f424bb473275efe4349b17d41d	2024-07-22 13:15:09 -07:00
Hui Xiao	349b1ec08f	Fix duplicate WAL entries caused by write after error recovery (#12873 ) Summary: Context/Summary: We recently discovered a case where write of the same key right after error recovery of a previous failed write of the same key finishes causes two same WAL entries, violating our assertion. This is because we don't advance seqno on failed write and reuse the same WAL containing the failed write for the new write if the memtable at the time is empty. This PR reuses the flush path for an empty memtable to switch WAL and update min WAL to keep in error recovery flush as well as updates the INFO log message for clarity. ``` 2024/07/17-15:01:32.271789 327757 (Original Log Time 2024/07/17-15:01:25.942234) [/flush_job.cc:1017] [default] [JOB 2] Level-0 flush table https://github.com/facebook/rocksdb/issues/9: 0 bytes OK It's an empty SST file from a successful flush so won't be kept in the DB 2024/07/17-15:01:32.271798 327757 (Original Log Time 2024/07/17-15:01:32.269954) [/memtable_list.cc:560] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9 started 2024/07/17-15:01:32.271802 327757 (Original Log Time 2024/07/17-15:01:32.271217) [/memtable_list.cc:760] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9: memtable https://github.com/facebook/rocksdb/issues/1 done ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12873 Test Plan: New UT that failed before this PR with following assertion failure (i.e, duplicate WAL entries) and passes after ``` db_wal_test: db/write_batch.cc:2254: rocksdb::Status rocksdb::{anonymous}::MemTableInserter::PutCFImpl(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::ValueType, RebuildTxnOp, const ProtectionInfoKVOS64) [with RebuildTxnOp = rocksdb::{anonymous}::MemTableInserter::PutCF(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)::<lambda(rocksdb::WriteBatch, uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)>; uint32_t = unsigned int; rocksdb::ProtectionInfoKVOS64 = rocksdb::ProtectionInfoKVOS<long unsigned int>]: Assertion `seq_per_batch_' failed. ``` Reviewed By: anand1976 Differential Revision: D59884468 Pulled By: hx235 fbshipit-source-id: 5d854b719092552c69727a979f269fb7f6c39756	2024-07-22 12:40:25 -07:00
Changyu Bi	c064ac3bc5	Avoid opening table files and reading table properties under mutex (#12879 ) Summary: InitInputTableProperties() can open and do IOs and is called under mutex_. This PR removes it from FinalizeInputInfo(). It is now called in CompactionJob::Run() and BuildCompactionJobInfo() (called in NotifyOnCompactionBegin()) without holding mutex_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12879 Test Plan: existing unit tests. Added assert in GetInputTableProperties() to ensure that input_table_properties_ is initialized whenever it's called. Reviewed By: hx235 Differential Revision: D59933195 Pulled By: cbi42 fbshipit-source-id: c8089e13af8567fa3ab4b94d9ec384ae98ab2ec8	2024-07-19 19:12:45 -07:00
Changyu Bi	4384dd5eee	Support ingesting SST files generated by a live DB (#12750 ) Summary: ... to enable use cases like using RocksDB to merge sort data for ingestion. A new file ingestion option `IngestExternalFileOptions::allow_db_generated_files` is introduced to allows users to ingest SST files generated by live DBs instead of SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number AND do not overlap with any data in the DB (so we can assign seqno 0 which matches the seqno of all ingested keys). The feature is marked the option as experimental for now. Main changes needed to enable this: - ignore CF id mismatch during ingestion - ignore the missing external file version table property Rest of the change is mostly in new unit tests. A previous attempt is in https://github.com/facebook/rocksdb/issues/5602. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12750 Test Plan: - new unit tests Reviewed By: ajkr, jowlyzhang Differential Revision: D58396673 Pulled By: cbi42 fbshipit-source-id: aae513afad7b1ff5d4faa48104df5f384926bf03	2024-07-19 16:14:54 -07:00
anand76	0fca5e31b4	Fix race between manifest error recovery and file ingestion (#12871 ) Summary: This PR fixes an assertion failure in `DBImpl::ResumeImpl` - `assert(!versions_->descriptor_log_)`. In `VersionSet`, `descriptor_log_` has a pointer to the current MANIFEST writer. When there's an error updating the manifest, `descriptor_log_` is reset, and the error recovery thread checks `io_status()` in `VersionSet` and attempts to write a new MANIFEST. If another DB manipulation happens at the same time (like external file ingestion, column family manipulation etc), it calls `LogAndApply`, which also attempts to write a new MANIFEST. The assertion in `ResumeImpl` might fail in this case since the other MANIFEST writer may have updated `descriptor_log_`. To prevent the assertion, this fix updates both `io_status_` and `descriptor_log_` while holding the DB mutex. The other option would have been to simply remove the assert. But I think its important to have it to ensure the invariant that `io_status_` is cleared if the MANIFEST is written successfully, and this fix makes things easier to reason about. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12871 Test Plan: Existing tests and crash test Reviewed By: hx235 Differential Revision: D59926947 Pulled By: anand1976 fbshipit-source-id: af9ad18da3e29fc62c7ec2e30e0738aa33d4e5f1	2024-07-19 10:37:51 -07:00
Peter Dillinger	de6d0e5ec3	Reduce cases of impacted performance from bug fix (#12874 ) Summary: https://github.com/facebook/rocksdb/issues/12872 was a bit too gross of a fix, because we still don't need to track previous prefix in FullFilterBlockBuilder for many non-partitioned use cases. This basically narrows the fix (and potentail CPU regression) to partitioned+prefix filter cases, which are the cases that needed to be fixed. A better efficiency fix would still be nice but not as high of a priority. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12874 Test Plan: existing tests (just added in https://github.com/facebook/rocksdb/issues/12872) Reviewed By: jowlyzhang Differential Revision: D59885591 Pulled By: pdillinger fbshipit-source-id: 8f273fc3e14c4b60c8a55501dc4bbcc325cd17a1	2024-07-17 16:42:27 -07:00
Peter Dillinger	93b163d1a2	Fix major bug with prefixes, SeekForPrev, and partitioned filters (#12872 ) Summary: Basically, the fix in https://github.com/facebook/rocksdb/issues/8137 was incomplete (and I missed it in the review), because if `whole_key_filtering` is false, then `last_prefix_str_` will never be set to non-empty and the fix doesn't work. Also related to https://github.com/facebook/rocksdb/issues/5835. This is intended as a safe, simple fix that will regress CPU efficiency slightly (for `whole_key_filtering=false` cases, because of extra prefix string copies during flush & compaction). An efficient fix is not possible without some substantial refactoring. Also in this PR: new test DBBloomFilterTest.FilterNumEntriesCoalesce tests an adjacent code path that was previously untested for its effect of ensuring the number of unique prefixes and keys is tracked properly when both prefixes and whole keys are going into a filter. (Test fails when either of the two code segments checking for duplicates is disabled.) In addition, the same test would fail before the main bug fix here because the code would inappropriately add the empty string to the filter (because of unmodified `last_prefix_str_`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12872 Test Plan: In addition to DBBloomFilterTest.FilterNumEntriesCoalesce, extended DBBloomFilterTest.SeekForPrevWithPartitionedFilters to cover the broken case. (Mostly whitespace change.) Reviewed By: jowlyzhang Differential Revision: D59873793 Pulled By: pdillinger fbshipit-source-id: 2a7b7f09ca73dc188fb4dab833826ad6da7ebb11	2024-07-17 14:08:35 -07:00
Hui Xiao	21db55f816	Move WAL sync before memtable insertion (#12869 ) Summary: Context/Summary: WAL sync currently happens after memtable write. This causes inconvenience in stress test as we can't simply rollback the ExpectedState when write fails due to injected WAL sync error so something complicated like https://github.com/facebook/rocksdb/pull/12838 might be needed. After moving WAL sync before memtable insertion, there should not be injected IO error after memtable insertion so we can keep the current simple way of handling failed write in stress test with ExpectedState rollback. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12869 Test Plan: 1. Below command failed with `iterator has key 0000000000000207000000000000012B0000000000000013, but expected state does not.` before this PR and passes after ``` ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=50 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=0 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=50 Reviewed By: jowlyzhang Differential Revision: D59825730 Pulled By: hx235 fbshipit-source-id: 7d77aaf177ded2f99bf1ce19f5a4bd0783b9ca92	2024-07-17 13:39:14 -07:00
Hui Xiao	6870cc1187	Temporally disable log recycle with testing GetLiveFilesStorageInfo() (#12868 ) Summary: Context/Summary: We recently discovered a case where `GetLiveFilesStorageInfo()` failed when `Options::recycle_log_file_num` > 0. Before fixing the incompatibility, we disable these combination in stress test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12868 Test Plan: monitor CI Reviewed By: jowlyzhang Differential Revision: D59820802 Pulled By: hx235 fbshipit-source-id: 7b09063af6d72ae0ba187b4cf8887abd8a78e5e8	2024-07-16 12:37:50 -07:00
Hui Xiao	9e4ee7f0c6	Fix non-okay status being ignored in write path under two_write_queues_ (#12866 ) Summary: Context/Summary: see above, though the impact is small. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12866 Test Plan: exiting UT Reviewed By: anand1976 Differential Revision: D59782913 Pulled By: hx235 fbshipit-source-id: ec02843645cce49466bde602035d2e61c31965b8	2024-07-16 10:55:08 -07:00
anand76	5aa675457e	Fix unhandled MANIFEST write errors (#12865 ) Summary: The failure of `WriteCurrentStateToManifest()` in `VersionSet::ProcessManifestWrites()` was not handled properly. If it failed, `manifest_io_status` was not updated, leading to `manifest_file_number_` being updated to the newly created manifest even though its bad. This would lead to the bad manifest immediately getting deleted, and also the good manifest (referenced by `CURRENT`) getting deleted by obsolete file deletion because of `manifest_file_number_` not referencing its number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12865 Reviewed By: hx235 Differential Revision: D59782940 Pulled By: anand1976 fbshipit-source-id: f752fb9a1c23fd3d734616e273613cbac204301b	2024-07-15 19:13:29 -07:00
Hui Xiao	4ff35afb42	Fix a bug where `OnErrorRecoveryBegin()` is not called before auto-recovery (#12860 ) Summary: Context/Summary: `*auto_recovery` needs to be set true in order for `OnErrorRecoveryBegin()` to be called before auto-recovery `3db030d7ee/db/event_helpers.cc (L64-L66)` Currently it's set false for auto-recovery. This PR fixes it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12860 Test Plan: - Manual observation that it is called - Existing UT Reviewed By: jowlyzhang Differential Revision: D59693315 Pulled By: hx235 fbshipit-source-id: 3f428c5b1e9818bb7697fdcd7f245d11378eb14a	2024-07-15 17:00:14 -07:00
WangQian	755010f8d3	Fix the bug with using the user comparator to compare prefix. (#12862 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/12855 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12862 Reviewed By: cbi42 Differential Revision: D59771651 Pulled By: jowlyzhang fbshipit-source-id: ffe0025143f51f9ce1b46900c3fef6a20eb34f4a	2024-07-15 15:13:29 -07:00
Peter Dillinger	0e3e43f4d1	FaultInjectionTestFS follow-up and clean-up (#12861 ) Summary: In follow-up to https://github.com/facebook/rocksdb/issues/12852: * Use std::copy in place of copy_n for potentially overlapping buffer * Get rid of troublesome -1 idiom from `pos_at_last_append_` and `pos_at_last_sync_` * Small improvements to test FaultInjectionFSTest.ReadUnsyncedData Pull Request resolved: https://github.com/facebook/rocksdb/pull/12861 Test Plan: CI, crash test, etc. Reviewed By: cbi42 Differential Revision: D59757484 Pulled By: pdillinger fbshipit-source-id: c6fbdc2e97c959983184925a855cc8b0285fa23f	2024-07-15 10:28:34 -07:00
Changyu Bi	b800b5eb6a	Deflake ThreadStatus related unit tests (#12858 ) Summary: Unit tests `DBTest.ThreadStatusFlush` and `DBTestWithParam.ThreadStatusSingleCompaction` have been flaky and fail with error message ``` [ RUN ] DBTest.ThreadStatusFlush op_count: 0, expected_count 1 thread id: 718113, thread status: , cf_name thread id: 718114, thread status: , cf_name pikachu /__w/rocksdb/rocksdb/db/db_test.cc:4817: Failure Value of: VerifyOperationCount(env_, ThreadStatus::OP_FLUSH, 1) Actual: false Expected: true [ FAILED ] DBTest.ThreadStatusFlush (106 ms) [ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0 db/db_test.cc:4673: Failure Expected equality of these values: op_count Which is: 0 expected_count Which is: 1 [ FAILED ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) ``` One cause for this is that before flush/compaction finishes, we will go through `~WritableFileWriter()`, either for WAL or SST file, and temporarily set thread_operation to UNKNOWN. This UNKNOWN thread operation seem to be there for some stress test verification. This PR fixes these tests by setting the IOActivity in ~WritableFileWriter() for debug build. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12858 Test Plan: monitor future test failure. Reviewed By: hx235 Differential Revision: D59691564 Pulled By: cbi42 fbshipit-source-id: 3f96998bba9d42aba50d1830c2b51bef2dd6705f	2024-07-15 09:56:09 -07:00
Peter Dillinger	72438a6788	Support read & write with unsynced data in FaultInjectionTestFS (#12852 ) Summary: Follow-up to https://github.com/facebook/rocksdb/issues/12729 and others to fix FaultInjectionTestFS handling the case where a live WAL is being appended to and synced while also being copied for checkpoint or backup, up to a known flushed (but not necessarily synced) prefix of the file. It was tricky to structure the code in a way that could handle a tricky race with Sync in another thread (see code comments, thanks Changyu) while maintaining good performance and test-ability. For more context, see the call to FlushWAL() in DBImpl::GetLiveFilesStorageInfo(). Also, the unit test for https://github.com/facebook/rocksdb/issues/12729 was neutered by https://github.com/facebook/rocksdb/issues/12797, and this re-enables the functionality it is testing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12852 Test Plan: unit test expanded/updated. Local runs of blackbox_crash_test. The implementation is structured so that a multi-threaded unit test is not needed to cover at least the code lines, as the race handling is folded into "catch up after returning unsynced and then a sync." Reviewed By: cbi42 Differential Revision: D59594045 Pulled By: pdillinger fbshipit-source-id: 94667bb72255e2952586c53bae2c2dd384e85a50	2024-07-12 16:01:57 -07:00
Yu Zhang	3db030d7ee	Fix bug for recovering a prepared but not committed txn (#12856 ) Summary: This PR fix a bug for recovering a prepared Transaction that can contain user-defined timestamps. The `Transaction::Put` type of APIs expect the key provided to be user key without timestamps. When the original transaction added a key for a column family that enables user-defined timestamps, say of size 8. Internally `WriteBatch::Put` will leave a placeholder 8 bytes for the final commit timestamp. For example: `cec28aa90f/db/write_batch.cc (L937)` When rebuilding this transaction from a `WriteBatch` from WAL log, we should consider this and remove the tailing 8 bytes of a key before adding it via the public Transaction write APIs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12856 Test Plan: Added unit test that would fail without this fix Reviewed By: cbi42 Differential Revision: D59656399 Pulled By: jowlyzhang fbshipit-source-id: c716aefa4d548770b691efe96ac8e6d7dab458b9	2024-07-11 16:25:35 -07:00

1 2 3 4 5 ...

12971 Commits All Branches Search

12971 Commits

All Branches