Commit Graph

12916 Commits

Author SHA1 Message Date
Peter Dillinger 10984e8c26 Fix and generalize framework for filtering range queries, etc. (#13005)
Summary:
There was a subtle design/contract bug in the previous version of range filtering in experimental.h  If someone implemented a key segments extractor with "all or nothing" fixed size segments, that could result in unsafe range filtering. For example, with two segments of width 3:
```
x = 0x|12 34 56|78 9A 00|
y = 0x|12 34 56||78 9B
z = 0x|12 34 56|78 9C 00|
```
Segment 1 of y (empty) is out of order with segment 1 of x and z.

I have re-worked the contract to make it clear what does work, and implemented a standard extractor for fixed-size segments, CappedKeySegmentsExtractor. The safe approach for filtering is to consume as much as is available for a segment in the case of a short key.

I have also added support for min-max filtering with reverse byte-wise comparator, which is probably the 2nd most common comparator for RocksDB users (because of MySQL). It might seem that a min-max filter doesn't care about forward or reverse ordering, but it does when trying to determine whether in input range from segment values v1 to v2, where it so happens that v2 is byte-wise less than v1, is an empty forward interval or a non-empty reverse interval. At least in the current setup, we don't have that context.

A new unit test (with some refactoring) tests CappedKeySegmentsExtractor, reverse byte-wise comparator, and the corresponding min-max filter.

I have also (contractually / mathematically) generalized the framework to comparators other than the byte-wise comparator, and made other generalizations to make the extractor limitations more explicitly connected to the particular filters and filtering used--at least in description.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13005

Test Plan: added unit tests as described

Reviewed By: jowlyzhang

Differential Revision: D62769784

Pulled By: pdillinger

fbshipit-source-id: 0d41f0d0273586bdad55e4aa30381ebc861f7044
2024-09-18 15:26:37 -07:00
Nick Brekhus 0611eb5b9d Fix orphaned files in SstFileManager (#13015)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13015

`Close()`ing a database now releases tracked files in `SstFileManager`. Previously this space would be leaked until the database was later reopened.

Reviewed By: jowlyzhang

Differential Revision: D62590773

fbshipit-source-id: 5461bd253d974ac4967ad52fee92e2650f8a9a28
2024-09-18 13:27:44 -07:00
Yu Zhang f411c8bc97 Update folly Github hash (#13017)
Summary:
The internal codebase is updated for the coro directory's graduation from experimental. Updating our build script for a newer version with this change too. Using this hash: 03041f014b

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13017

Reviewed By: nickbrekhus

Differential Revision: D62763932

Pulled By: jowlyzhang

fbshipit-source-id: 1b211707fbc7d974d6d6ceaf577e174424bb44ed
2024-09-17 17:47:10 -07:00
Changyu Bi f97e33454f Fix a bug with auto recovery on WAL write error (#12995)
Summary:
A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:

```
0. manual_wal_flush is true. There are multiple CFs in a DB.
1. Submit a write batch with updates to multiple CF
2. A FlushWAL or a memtable swtich that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped: 4b1d595306/file/writable_file_writer.cc (L624)
The error needs to be retryable to start background auto recovery.
3. One CF successfully flushes its memtable during auto recovery.
4. Crash the process.
5. Reopen the DB, one CF will have the update as a result of successful flush. Other CFs will miss all the updates in the write batch since WAL does not have them.
```

This can happen if a users configures manual_wal_flush, uses more than one CF, and can hit retryable error for WAL writes. This PR is a short-term fix that upgrades WAL related errors to fatal and not trigger auto recovery.

A long-term fix may be not drop buffered WAL data by checking how much data is actually written, or require atomically flushing all column families during error recovery from this kind of errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12995

Test Plan:
added unit test to check error severity and if recovery is triggered. A crash test repro command that fails in a few runs before this PR:
```
python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0  --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944  --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1
```

Reviewed By: hx235

Differential Revision: D62888510

Pulled By: cbi42

fbshipit-source-id: 308bdbbb8d897cc8eba950155cd0e37cf7eb76fe
2024-09-17 14:10:33 -07:00
Jon Janzen a164576252 Remove last user of AutoHeaders.RECURSIVE_GLOB
Summary: I came across this code while buckifying parts of folly and fizz in open source. This is pretty hacky code and cleaning it up doesn't seem that hard, so I did it.

Reviewed By: zertosh, pdillinger

Differential Revision: D62781766

fbshipit-source-id: 43714bce992c53149d1e619063d803297362fb5d
2024-09-17 13:21:57 -07:00
leipeng 8648fbcba3 Add missing RemapFileSystem::ReopenWritableFile (#12941)
Summary:
`RemapFileSystem::ReopenWritableFile` is missing, add it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12941

Reviewed By: pdillinger

Differential Revision: D61822540

Pulled By: cbi42

fbshipit-source-id: dc228f7e8b842216f63de8b925cb663898455345
2024-09-17 13:08:25 -07:00
Nicholas Ormrod 0e04ef1a96 Deshim coro in fbcode/internal_repo_rocksdb
Summary:
The following rules were deshimmed:
```
//folly/experimental/coro:accumulate -> //folly/coro:accumulate
//folly/experimental/coro:async_generator -> //folly/coro:async_generator
//folly/experimental/coro:async_pipe -> //folly/coro:async_pipe
//folly/experimental/coro:async_scope -> //folly/coro:async_scope
//folly/experimental/coro:async_stack -> //folly/coro:async_stack
//folly/experimental/coro:baton -> //folly/coro:baton
//folly/experimental/coro:blocking_wait -> //folly/coro:blocking_wait
//folly/experimental/coro:collect -> //folly/coro:collect
//folly/experimental/coro:concat -> //folly/coro:concat
//folly/experimental/coro:coroutine -> //folly/coro:coroutine
//folly/experimental/coro:current_executor -> //folly/coro:current_executor
//folly/experimental/coro:detach_on_cancel -> //folly/coro:detach_on_cancel
//folly/experimental/coro:detail_barrier -> //folly/coro:detail_barrier
//folly/experimental/coro:detail_barrier_task -> //folly/coro:detail_barrier_task
//folly/experimental/coro:detail_current_async_frame -> //folly/coro:detail_current_async_frame
//folly/experimental/coro:detail_helpers -> //folly/coro:detail_helpers
//folly/experimental/coro:detail_malloc -> //folly/coro:detail_malloc
//folly/experimental/coro:detail_manual_lifetime -> //folly/coro:detail_manual_lifetime
//folly/experimental/coro:detail_traits -> //folly/coro:detail_traits
//folly/experimental/coro:filter -> //folly/coro:filter
//folly/experimental/coro:future_util -> //folly/coro:future_util
//folly/experimental/coro:generator -> //folly/coro:generator
//folly/experimental/coro:gmock_helpers -> //folly/coro:gmock_helpers
//folly/experimental/coro:gtest_helpers -> //folly/coro:gtest_helpers
//folly/experimental/coro:inline_task -> //folly/coro:inline_task
//folly/experimental/coro:invoke -> //folly/coro:invoke
//folly/experimental/coro:merge -> //folly/coro:merge
//folly/experimental/coro:mutex -> //folly/coro:mutex
//folly/experimental/coro:promise -> //folly/coro:promise
//folly/experimental/coro:result -> //folly/coro:result
//folly/experimental/coro:retry -> //folly/coro:retry
//folly/experimental/coro:rust_adaptors -> //folly/coro:rust_adaptors
//folly/experimental/coro:scope_exit -> //folly/coro:scope_exit
//folly/experimental/coro:shared_lock -> //folly/coro:shared_lock
//folly/experimental/coro:shared_mutex -> //folly/coro:shared_mutex
//folly/experimental/coro:sleep -> //folly/coro:sleep
//folly/experimental/coro:small_unbounded_queue -> //folly/coro:small_unbounded_queue
//folly/experimental/coro:task -> //folly/coro:task
//folly/experimental/coro:timed_wait -> //folly/coro:timed_wait
//folly/experimental/coro:timeout -> //folly/coro:timeout
//folly/experimental/coro:traits -> //folly/coro:traits
//folly/experimental/coro:transform -> //folly/coro:transform
//folly/experimental/coro:unbounded_queue -> //folly/coro:unbounded_queue
//folly/experimental/coro:via_if_async -> //folly/coro:via_if_async
//folly/experimental/coro:with_async_stack -> //folly/coro:with_async_stack
//folly/experimental/coro:with_cancellation -> //folly/coro:with_cancellation
//folly/experimental/coro:bounded_queue -> //folly/coro:bounded_queue
//folly/experimental/coro:shared_promise -> //folly/coro:shared_promise
//folly/experimental/coro:cleanup -> //folly/coro:cleanup
//folly/experimental/coro:auto_cleanup_fwd -> //folly/coro:auto_cleanup_fwd
//folly/experimental/coro:auto_cleanup -> //folly/coro:auto_cleanup
```

The following headers were deshimmed:
```
folly/experimental/coro/Accumulate.h -> folly/coro/Accumulate.h
folly/experimental/coro/Accumulate-inl.h -> folly/coro/Accumulate-inl.h
folly/experimental/coro/AsyncGenerator.h -> folly/coro/AsyncGenerator.h
folly/experimental/coro/AsyncPipe.h -> folly/coro/AsyncPipe.h
folly/experimental/coro/AsyncScope.h -> folly/coro/AsyncScope.h
folly/experimental/coro/AsyncStack.h -> folly/coro/AsyncStack.h
folly/experimental/coro/Baton.h -> folly/coro/Baton.h
folly/experimental/coro/BlockingWait.h -> folly/coro/BlockingWait.h
folly/experimental/coro/Collect.h -> folly/coro/Collect.h
folly/experimental/coro/Collect-inl.h -> folly/coro/Collect-inl.h
folly/experimental/coro/Concat.h -> folly/coro/Concat.h
folly/experimental/coro/Concat-inl.h -> folly/coro/Concat-inl.h
folly/experimental/coro/Coroutine.h -> folly/coro/Coroutine.h
folly/experimental/coro/CurrentExecutor.h -> folly/coro/CurrentExecutor.h
folly/experimental/coro/DetachOnCancel.h -> folly/coro/DetachOnCancel.h
folly/experimental/coro/detail/Barrier.h -> folly/coro/detail/Barrier.h
folly/experimental/coro/detail/BarrierTask.h -> folly/coro/detail/BarrierTask.h
folly/experimental/coro/detail/CurrentAsyncFrame.h -> folly/coro/detail/CurrentAsyncFrame.h
folly/experimental/coro/detail/Helpers.h -> folly/coro/detail/Helpers.h
folly/experimental/coro/detail/Malloc.h -> folly/coro/detail/Malloc.h
folly/experimental/coro/detail/ManualLifetime.h -> folly/coro/detail/ManualLifetime.h
folly/experimental/coro/detail/Traits.h -> folly/coro/detail/Traits.h
folly/experimental/coro/Filter.h -> folly/coro/Filter.h
folly/experimental/coro/Filter-inl.h -> folly/coro/Filter-inl.h
folly/experimental/coro/FutureUtil.h -> folly/coro/FutureUtil.h
folly/experimental/coro/Generator.h -> folly/coro/Generator.h
folly/experimental/coro/GmockHelpers.h -> folly/coro/GmockHelpers.h
folly/experimental/coro/GtestHelpers.h -> folly/coro/GtestHelpers.h
folly/experimental/coro/detail/InlineTask.h -> folly/coro/detail/InlineTask.h
folly/experimental/coro/Invoke.h -> folly/coro/Invoke.h
folly/experimental/coro/Merge.h -> folly/coro/Merge.h
folly/experimental/coro/Merge-inl.h -> folly/coro/Merge-inl.h
folly/experimental/coro/Mutex.h -> folly/coro/Mutex.h
folly/experimental/coro/Promise.h -> folly/coro/Promise.h
folly/experimental/coro/Result.h -> folly/coro/Result.h
folly/experimental/coro/Retry.h -> folly/coro/Retry.h
folly/experimental/coro/RustAdaptors.h -> folly/coro/RustAdaptors.h
folly/experimental/coro/ScopeExit.h -> folly/coro/ScopeExit.h
folly/experimental/coro/SharedLock.h -> folly/coro/SharedLock.h
folly/experimental/coro/SharedMutex.h -> folly/coro/SharedMutex.h
folly/experimental/coro/Sleep.h -> folly/coro/Sleep.h
folly/experimental/coro/Sleep-inl.h -> folly/coro/Sleep-inl.h
folly/experimental/coro/SmallUnboundedQueue.h -> folly/coro/SmallUnboundedQueue.h
folly/experimental/coro/Task.h -> folly/coro/Task.h
folly/experimental/coro/TimedWait.h -> folly/coro/TimedWait.h
folly/experimental/coro/Timeout.h -> folly/coro/Timeout.h
folly/experimental/coro/Timeout-inl.h -> folly/coro/Timeout-inl.h
folly/experimental/coro/Traits.h -> folly/coro/Traits.h
folly/experimental/coro/Transform.h -> folly/coro/Transform.h
folly/experimental/coro/Transform-inl.h -> folly/coro/Transform-inl.h
folly/experimental/coro/UnboundedQueue.h -> folly/coro/UnboundedQueue.h
folly/experimental/coro/ViaIfAsync.h -> folly/coro/ViaIfAsync.h
folly/experimental/coro/WithAsyncStack.h -> folly/coro/WithAsyncStack.h
folly/experimental/coro/WithCancellation.h -> folly/coro/WithCancellation.h
folly/experimental/coro/BoundedQueue.h -> folly/coro/BoundedQueue.h
folly/experimental/coro/SharedPromise.h -> folly/coro/SharedPromise.h
folly/experimental/coro/Cleanup.h -> folly/coro/Cleanup.h
folly/experimental/coro/AutoCleanup-fwd.h -> folly/coro/AutoCleanup-fwd.h
folly/experimental/coro/AutoCleanup.h -> folly/coro/AutoCleanup.h
```

This is a codemod. It was automatically generated and will be landed once it is approved and tests are passing in sandcastle.
You have been added as a reviewer by Sentinel or Butterfly.

Autodiff project: dcoro
Autodiff partition: fbcode.internal_repo_rocksdb
Autodiff bookmark: ad.dcoro.fbcode.internal_repo_rocksdb

Reviewed By: dtolnay

Differential Revision: D62684411

fbshipit-source-id: 8dbd31ab64fcdd99435d322035b9668e3200e0a3
2024-09-14 09:48:21 -07:00
Nick Brekhus 40adb2bab7 Fix wraparound in SstFileManager (#13010)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13010

The OnAddFile cur_compactions_reserved_size_ accounting causes wraparound when re-opening a database with an unowned SstFileManager and during recovery. It was introduced in #4164 which addresses out of space recovery with an unclear purpose. Compaction jobs do this accounting via EnoughRoomForCompaction/OnCompactionCompletion and to my understanding would never reuse a sst file name.

Reviewed By: anand1976

Differential Revision: D62535775

fbshipit-source-id: a7c44d6e0a4b5ff74bc47abfe57c32ca6770243d
2024-09-13 14:39:48 -07:00
anand76 cabd2d8718 Fix a couple of missing cases of retry on corruption (#13007)
Summary:
For SST checksum mismatch corruptions in the read path, RocksDB retries the read if the underlying file system supports verification and reconstruction of data (`FSSupportedOps::kVerifyAndReconstructRead`). There were a couple of places where the retry was missing - reading the SST footer and the properties block. This PR fixes the retry in those cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13007

Test Plan: Add new unit tests

Reviewed By: jaykorean

Differential Revision: D62519186

Pulled By: anand1976

fbshipit-source-id: 50aa38f18f2a53531a9fc8d4ccdf34fbf034ed59
2024-09-13 13:56:49 -07:00
Changyu Bi e490f2b051 Fix a bug in ReFitLevel() where `FileMetaData::being_compacted` is not cleared (#13009)
Summary:
in ReFitLevel(), we were not setting being_compacted to false after ReFitLevel() is done. This is not a issue if refit level is successful, since new FileMetaData is created for files at the target level. However, if there's an error during RefitLevel(), e.g., Manifest write failure, we should clear the being_compacted field for these files. Otherwise, these files will not be picked for compaction until db reopen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13009

Test Plan:
existing test.
- stress test failure in T200339331 should not happen anymore.

Reviewed By: hx235

Differential Revision: D62597169

Pulled By: cbi42

fbshipit-source-id: 0ba659806da6d6d4b42384fc95268b2d7bad720e
2024-09-12 15:19:14 -07:00
Yu Zhang 43bc71fef6 Add an internal API MemTableList::GetEditForDroppingCurrentVersion (#13001)
Summary:
Prepare this internal API to be used by atomic data replacement. The main purpose of this API is to get a `VersionEdit` to mark the entire current `MemTableListVersion` as dropped.  Flush needs the similar functionality when installing results, so that logic is refactored into a util function `GetDBRecoveryEditForObsoletingMemTables` to be shared by flush and this internal API.

To test this internal API, flush's result installation is redirected to use this API when it is flushing all the immutable MemTables in debug mode. It should achieve the exact same results, just with a duplicated `VersionEdit::log_number` field that doesn't upsets the recovery logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13001

Test Plan: Existing tests

Reviewed By: pdillinger

Differential Revision: D62309591

Pulled By: jowlyzhang

fbshipit-source-id: e25914d9a2e281c25ab7ee31a66eaf6adfae4b88
2024-09-10 13:23:13 -07:00
Levi Tamasi 55ac0b729e Support building db_bench using buck (#13004)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13004

The patch extends the buckifier script so it generates a target for `db_bench` as well.

Reviewed By: cbi42

Differential Revision: D62407071

fbshipit-source-id: 0cb98a324ce0598ad84a8675aa77b7d0f91bf40c
2024-09-10 11:37:29 -07:00
anand76 f48b64460e Provide a way to invoke a callback for a Cache handle (#12987)
Summary:
Add the `ApplyToHandle` method to the `Cache` interface to allow a caller to request the invocation of a callback on the given cache handle. The goal here is to allow a cache that manages multiple cache instances to use a callback on a handle to determine which instance it belongs to. For example, the callback can hash the key and use that to pick the correct target instance. This is useful to redirect methods like `Ref` and `Release`, which don't know the cache key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12987

Reviewed By: pdillinger

Differential Revision: D62151907

Pulled By: anand1976

fbshipit-source-id: e4ffbbb96eac9061d2ab0e7e1739eea5ebb1cd58
2024-09-06 14:54:09 -07:00
Yu Zhang 0c6e9c036a Make compaction always use the input version with extra ref protection (#12992)
Summary:
`Compaction` is already creating its own ref for the input Version: 4b1d595306/db/compaction/compaction.cc (L73)

And properly Unref it during destruction:
4b1d595306/db/compaction/compaction.cc (L450)

This PR redirects compaction's access of `cfd->current()` to this input `Version`, to prepare for when a column family's data can be replaced all together, and `cfd->current()` is not safe to access for a compaction job. Because a new `Version` with just some other external files could be installed as `cfd->current()`. The compaction job's expectation of the current `Version` and the corresponding storage info to always have its input files will no longer be guaranteed.

My next follow up is to do a similar thing for flush, also to prepare it for when a column family's data can be replaced. I will make it create its own reference of the current `MemTableListVersion` and use it as input, all flush job's access of memtables will be wired to that input `MemTableListVersion`. Similarly this reference will be unreffed during a flush job's destruction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12992

Test Plan: Existing tests

Reviewed By: pdillinger

Differential Revision: D62212625

Pulled By: jowlyzhang

fbshipit-source-id: 9a781213469cf366857a128d50a702af683a046a
2024-09-06 14:07:33 -07:00
Yu Zhang a24574e80a Add documentation for background job's state transition (#12994)
Summary:
The `SchedulePending*` API is a bit confusing since it doesn't immediately schedule the work and can be confused with the actual scheduling. So I have changed these to be `EnqueuePending*` and added some documentation for the corresponding state transitions of these background work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12994

Test Plan: existing tests

Reviewed By: cbi42

Differential Revision: D62252746

Pulled By: jowlyzhang

fbshipit-source-id: ee68be6ed33070cad9a5004b7b3e16f5bcb041bf
2024-09-06 13:08:34 -07:00
Changyu Bi 0bea5a2cfe Disable WAL fault injection in some case (#13000)
Summary:
when manual_wal_flush is true and when there are more than 1 CF, WAL fault injection can cause CFs to be inconsistent. See more explanation and repro in T199157789. Disable the combination for now until we have a fix that allows auto recovery. This also helps to see if there's other cause of stress test failures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13000

Test Plan:
the following command could repro db consistency failure in a few runs before this PR.
From stress test output we can also see that exclude_wal_from_write_fault_injection and metadata_write_fault_one_in are sanitized to 0.
```
python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0  --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944  --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1
```

Reviewed By: hx235

Differential Revision: D62303631

Pulled By: cbi42

fbshipit-source-id: d9441188ee84d53e5e7916f3305e50843fe9fde2
2024-09-06 11:05:49 -07:00
Peter Dillinger 4577b672d5 More valgrind fixes (#12990)
Summary:
* https://github.com/facebook/rocksdb/issues/12936 was insufficient to fix the std::optional false positives. Making a fix validated in CI this time (see https://github.com/facebook/rocksdb/issues/12991)
* valgrind grinds to a halt on startup on my dev machine apparently because it expects internet access. Disable its attempts to access the internet when git is using a proxy.
* Move PORTABLE=1 from CI job to the Makefile. Without it, valgrind complains about illegal instructions (too new)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12990

Test Plan: manual, watch nightly valgrind job

Reviewed By: ltamasi

Differential Revision: D62203242

Pulled By: pdillinger

fbshipit-source-id: a611b08da7dbd173b0709ed7feb0578729553a17
2024-09-06 10:11:34 -07:00
Peter Dillinger 0d3aaf7c0f Ensure SSTs compressed in tiered_secondary_cache_test (#12993)
Summary:
It appears the arm testsuite is failing because it is building without snappy, which is causing the SST files not to be compressed, which somehow causes these tests to fail. Manually setting LZ4 which is already required.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12993

Test Plan: reproduced and verified fix on ARM laptop

Reviewed By: anand1976

Differential Revision: D62216451

Pulled By: pdillinger

fbshipit-source-id: 3f21fcd9be0edaa66c7eca0cb7d56b998171e263
2024-09-05 10:36:29 -07:00
Peter Dillinger 4b1d595306 Fix and clarify ignore_unknown_options (#12989)
Summary:
`ignore_unknown_options=true` had an undocumented behavior of having no effect (disallow unknown options) if reading from the same or older major.minor version. Presumably this was intended to catch unintentional addition of new options in a patch release, but there is no automated version compatibility testing between patch releases. So this was a bad choice without such testing support, because it just means users would hit the failure in case of adding features to a patch release.

In this diff we respect ignore_unknown_options when reading a file from any newer version, even patch versions, and document this behavior in the API.

I don't think it's practical or necessary to test among patch releases in check_format_compatible.sh. This seems like an exceptional case of applying a *different semantics* to patch version updates than to minor/major versions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12989

Test Plan: unit test updated (and refactored)

Reviewed By: jaykorean

Differential Revision: D62168738

Pulled By: pdillinger

fbshipit-source-id: fb3c3ef30f0bbad0d5ffcc4570fb9ef963e7daac
2024-09-04 11:42:04 -07:00
Jay Huh 064c0ad53d Fix the check format compatible change (#12988)
Summary:
`check_format_compatible` script was broken due to extra comma added in 5b8f5cbcf4
e.g. https://github.com/facebook/rocksdb/actions/runs/10505042711/job/29101787220
```
...
2024-08-23T11:44:15.0175202Z == Building 9.5.fb, debug
2024-08-23T11:44:15.0190592Z fatal: ambiguous argument '_tmp_origin/9.5.fb,': unknown revision or path not in the working tree.
...
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12988

Test Plan:
```
tools/check_format_compatible.sh
```
```
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.6.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.6.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.6.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.7.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.7.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.7.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.8.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.8.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.8.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.9.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.9.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.9.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.10.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.10.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.10.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 8.11.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/8.11.fb to /tmp/rocksdb_format_compatible_jewoongh/db/8.11.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.0.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.0.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.0.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.1.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.1.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.1.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.2.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.2.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.2.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.3.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.3.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.3.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.4.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.4.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.4.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.5.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.5.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.5.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
== Use HEAD (48339d2a65670211bc9c204364a2127ba9b2a460) to open DB generated using 9.6.fb...
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/9.6.fb to /tmp/rocksdb_format_compatible_jewoongh/db/9.6.fb/db_dump.txt
== Dumping data from /tmp/rocksdb_format_compatible_jewoongh/db/current to /tmp/rocksdb_format_compatible_jewoongh/db/current/db_dump.txt
==== Compatibility Test PASSED ====
```

Reviewed By: pdillinger

Differential Revision: D62162454

Pulled By: jaykorean

fbshipit-source-id: 562225c6cb27e0eb66f241a6f9424dc624d8c837
2024-09-03 22:17:31 -07:00
Symious c989c51ed7 Fix non-ASCII character (#12972)
Summary:
Met the following error while compiling the project.

```
build_tools/check-sources.sh
utilities/fault_injection_fs.cc:509:    // If there<E2><80><99>s no injected error, then cb will be called asynchronously when
utilities/fault_injection_fs.cc:510:    // target_ actually finishes the read. But if there<E2><80><99>s an injected error, it
utilities/fault_injection_fs.cc:512:    // isn<E2><80><99>t invoked at all.
^^^^ Use only ASCII characters in source files
make[1]: *** [Makefile:1291: check-sources] Error 1
make[1]: Leaving directory '/home/janus/Github/symious/rocksdb'
make: *** [Makefile:1084: check] Error 2
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12972

Reviewed By: hx235

Differential Revision: D61923865

Pulled By: cbi42

fbshipit-source-id: 63af0a38fea15e09a860895bdd5ed0a57700e447
2024-09-03 14:41:55 -07:00
Changyu Bi cd6f802ccb Add a new file ingestion option `link_files` (#12980)
Summary:
Add option `IngestExternalFileOptions::link_files` that hard links input files and preserves original file links after ingestion, unlike `move_files` which will unlink input files after ingestion. This can be useful when being used together with `allow_db_generated_files` to ingest files from another DB. Also reverted the change to `move_files` in https://github.com/facebook/rocksdb/issues/12959 to simplify the contract so that it will always unlink input files without exception.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12980

Test Plan: updated unit test `ExternSSTFileLinkFailFallbackTest.LinkFailFallBackExternalSst` to test that input files will not be unlinked.

Reviewed By: pdillinger

Differential Revision: D61925111

Pulled By: cbi42

fbshipit-source-id: eadaca72e1ae5288bdd195d57158466e5656fa62
2024-09-03 13:06:25 -07:00
Changyu Bi 92ad4a88f3 Small CPU optimization in InlineSkipList::Insert() (#12975)
Summary:
reuse decode key in more places to avoid decoding length prefixed key x->Key().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12975

Test Plan:
ran benchmarks simultaneously for "before" and "after"
* fillseq:
```
(for I in $(seq 1 50); do ./db_bench --benchmarks=fillseq --disable_auto_compactions=1 --min_write_buffer_number_to_merge=100 --max_write_buffer_number=1000  --write_buffer_size=268435456 --num=5000000 --seed=1723056275 --disable_wal=1 2>&1 | grep "fillseq"
done;) | awk '{ t += $5; c++; print } END { printf ("%9.3f\n", 1.0 * t / c) }';

before: 1483191
after: 1490555 (+0.5%)
```

* fillrandom:
```
(for I in $(seq 1 2); do ./db_bench_imain --benchmarks=fillrandom --disable_auto_compactions=1 --min_write_buffer_number_to_merge=100 --max_write_buffer_number=1000  --write_buffer_size=268435456 --num=2500000 --seed=1723056275 --disable_wal=1 2>&1 | grep "fillrandom"

before: 255463
after: 256128 (+0.26%)
```

Reviewed By: anand1976

Differential Revision: D61835340

Pulled By: cbi42

fbshipit-source-id: 70345510720e348bacd51269acb5d2dd5a62bf0a
2024-08-27 13:57:40 -07:00
anand76 f31b4d80ff Retain previous trace file in db_stress for debugging purposes (#12978)
Summary:
There are several crash test failures due to DB verification failure. Retain some trace history in the expected state directory to make debugging easier.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12978

Reviewed By: cbi42

Differential Revision: D61864921

Pulled By: anand1976

fbshipit-source-id: 9f3f37b7e1e958bc89a3cf0373182354c2c1aa3b
2024-08-27 12:43:47 -07:00
Jay Huh 0082907bf2 Scope down workflow permissions (#12973)
Summary:
Followed instruction per https://docs.github.com/en/actions/writing-workflows/workflow-syntax-for-github-actions#defining-access-for-the-github_token-scopes

It turns out that we did not need any of these except `Metadata: read`.

Before
```
GITHUB_TOKEN Permissions
  Actions: write
  Attestations: write
  Checks: write
  Contents: write
  Deployments: write
  Discussions: write
  Issues: write
  Metadata: read
  Packages: write
  Pages: write
  PullRequests: write
  RepositoryProjects: write
  SecurityEvents: write
  Statuses: write
```

After
```
GITHUB_TOKEN Permissions
  Metadata: read
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12973

Test Plan: GitHub Actions triggered by this PR

Reviewed By: cbi42

Differential Revision: D61812651

Pulled By: jaykorean

fbshipit-source-id: 4413756c93f503e8b2fb77eb8b684ef9e6a6c13d
2024-08-26 15:28:17 -07:00
Peter Dillinger d96e67c2bf Fix flaky test DBTest2.VariousFileTemperatures (#12974)
Summary:
... apparently due to potentially not purging obsolete files after CompactRange

Example: https://github.com/facebook/rocksdb/actions/runs/10564621261/job/29267393711?pr=12959

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12974

Test Plan: reproduced failure with USE_CLANG=1 COERCE_CONTEXT_SWITCH=1, now fixed

Reviewed By: cbi42

Differential Revision: D61812600

Pulled By: pdillinger

fbshipit-source-id: d4b23e1a179bb8ec39875ed7a8ce1649fa3344bd
2024-08-26 14:08:21 -07:00
Changyu Bi 4eb5878ab2 Support ingesting db generated files using hard link (#12959)
Summary:
so `IngestExternalFileOptions::move_files` and `IngestExternalFileOptions::allow_db_generated_files` are now compatible. The original file links won't be removed if `allow_db_generated_files` is true. This is to prevent deleting files from another DB.

There was a [comment](https://github.com/facebook/rocksdb/pull/12750#discussion_r1684509620) in https://github.com/facebook/rocksdb/issues/12750 about how exactly-once ingestion would work with `move_files`. I've discussed with customer and decided that it can be done by reading the target DB to see if it contains any ingested key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12959

Test Plan: updated unit tests `IngestDBGeneratedFileTest*` to enable `move_files`.

Reviewed By: jowlyzhang

Differential Revision: D61703480

Pulled By: cbi42

fbshipit-source-id: 6b4294369767f989a2f36bbace4ca3c0257aeaf7
2024-08-26 12:25:16 -07:00
Peter Dillinger 8ac14f525e Add Temperature to FileAttributes (#12965)
Summary:
.. so that appropriate implementations can return temperature information from GetChildrenFileAttributes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12965

Test Plan: just an API placeholder for now

Reviewed By: anand1976

Differential Revision: D61748199

Pulled By: pdillinger

fbshipit-source-id: b457e324cb451e836611a0bf630c3da0f30a8abf
2024-08-23 20:06:41 -07:00
Peter Dillinger 96340dbce2 Options for file temperature for more files (#12957)
Summary:
We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options:
* `metadata_write_temperature` for DB "small" files that don't contain much user data
* `wal_write_temperature` for WALs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957

Test Plan: Unit test included, though it's hard to be sure we've covered all the places

Reviewed By: jowlyzhang

Differential Revision: D61664815

Pulled By: pdillinger

fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b
2024-08-23 19:49:25 -07:00
Jay Huh d6aed64de4 Disable benchmark-linux (#12964)
Summary:
Disabling the job temporarily. We will re-enable this when ready again

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12964

Reviewed By: ltamasi

Differential Revision: D61740941

Pulled By: jaykorean

fbshipit-source-id: 167e50c4f5e38d508a8e56633261611467f30690
2024-08-23 18:39:16 -07:00
eniac1024 f5c5f881d2 Fix MultiGet with timestamps (#12943)
Summary:
Issue: MultiGet(PinnableSlice) can't read out all timestamps.
Fixed the impl, and added an UT as well. In the original impl, if MultiGet reads multiple column families, a later column family would clean up timestamps of previous column family.
Fix: https://github.com/facebook/rocksdb/issues/12950#issue-2476996580

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12943

Reviewed By: anand1976

Differential Revision: D61729257

Pulled By: pdillinger

fbshipit-source-id: 55267c26076c8a59acedd27e14714711729a40df
2024-08-23 13:58:04 -07:00
Changyu Bi c62de54c7c Record largest seqno in table properties and verify in file ingestion (#12951)
Summary:
this helps to avoid scanning input files when ingesting db generated files: ecb844babd/db/external_sst_file_ingestion_job.cc (L917-L935)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12951

Test Plan:
* `IngestDBGeneratedFileTest.FailureCase` is updated to verify that this table property is verified during ingestion
* existing unit tests for other ingestion use cases.

Reviewed By: jowlyzhang

Differential Revision: D61608285

Pulled By: cbi42

fbshipit-source-id: b5b7aae9741531349ab247be6ffaa3f3628b76ca
2024-08-21 16:24:18 -07:00
Jay Huh 4df71db246 Fix build for macos-arm64-macosx-clang17-no-san (#12949)
Summary:
When merged into internal code base we see the following error. This should fix it.

```
Actions failed:
    [2024-08-20T07:45:53.879-07:00] Action failed: fbcode//rocksdb/src:rocksdb_lib (cfg:macos-arm64-macosx-clang17-no-san#e5847010950663ca) (cxx_compile util/write_batch_util.cc)
[2024-08-20T07:45:53.879-07:00] Remote command returned non-zero exit code 1
[2024-08-20T07:45:53.879-07:00] Remote action, reproduce with: `frecli cas download-action 2fe3749f2d3ea6107cce103d4e2be1dcc76a9df797bae308cde5eaccc65201b7:145`
fbcode/rocksdb/src/include/rocksdb/write_batch.h:460:14: error: no template named 'unordered_map' in namespace 'std'; did you mean 'unordered_set'?
  const std::unordered_map<uint32_t, size_t>& GetColumnFamilyToTimestampSize() {
        ~~~~~^~~~~~~~~~~~~
fbcode/rocksdb/src/include/rocksdb/write_batch.h:540:8: error: no template named 'unordered_map' in namespace 'std'; did you mean 'unordered_set'?
  std::unordered_map<uint32_t, size_t> cf_id_to_ts_sz_;
  ~~~~~^~~~~~~~~~~~~
/paragon/pods/259551525/home/execution/3/202ac945754041b6bc424b0c35e42c9d/work/buck-out/v2/gen/fbsource/a90614bbe22ec1d7/xplat/toolchains/minimal_xcode/__clang_genrule__/out/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__memory/compressed_pair.h:113:3: error: static_assert failed due to requirement '!is_same<unsigned long, unsigned long>::value' "__compressed_pair cannot be instantiated when T1 and T2 are the same type; The current implementation is NOT ABI-compatible with the previous implementation for this configuration"
  static_assert((!is_same<_T1, _T2>::value),
  ^              ~~~~~~~~~~~~~~~~~~~~~~~~~
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12949

Test Plan: CI

Reviewed By: jowlyzhang, cbi42

Differential Revision: D61577604

Pulled By: jaykorean

fbshipit-source-id: 3584a2cd550a303346d80ccc5cc90f4a9b3e2da2
2024-08-21 10:27:50 -07:00
Yu Zhang 945f60b157 Add some documentation for version edit handlers (#12948)
Summary:
As titled. No functional change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12948

Reviewed By: hx235

Differential Revision: D61551254

Pulled By: jowlyzhang

fbshipit-source-id: ccf53d78bd2f18f174d7e61972e5de467c96ce76
2024-08-21 10:09:10 -07:00
Radek Hubner ecb844babd Enable Continuous Benchmarking of RocksDB (#12885)
Summary:
This pull request transitions the benchmarking process from CircleCI to GitHub Actions. The benchmarking jobs will now be executed on a self-hosted runner. Unlike the previous CircleCI configuration, where jobs were queued due to the long execution time (nearly 60 minutes per job), the new setup schedules the benchmarking tasks to run every two hours.

Closes https://github.com/facebook/rocksdb/issues/12615

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12885

Reviewed By: pdillinger

Differential Revision: D61422468

Pulled By: jaykorean

fbshipit-source-id: 10535865c849797825f9652e4e9ef367b3d73599
2024-08-20 11:47:52 -07:00
Yu Zhang 81d52bdc1a Fix UDT in memtable only assertions (#12946)
Summary:
Empty memtables can be legitimately created and flushed, for example by error recovery flush attempts:

273b3eadf0/db/db_impl/db_impl_compaction_flush.cc (L2309-L2312)

This check is updated to be considerate of this.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12946

Reviewed By: hx235

Differential Revision: D61492477

Pulled By: jowlyzhang

fbshipit-source-id: 7d16fcaea457948546072f85b3650fd1cc24f9db
2024-08-20 09:19:52 -07:00
Jay Huh d223d34bf3 Fix HISTORY.md file for 9.6 release (#12947)
Summary:
# Summary

Mistakenly double-updated the HISTORY.md file by running `unreleased_history/release.sh` after the first commit in  https://github.com/facebook/rocksdb/issues/12945. Manually fixing the file to reflect the correct content

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12947

Test Plan: N/A. History file change.

Reviewed By: cbi42

Differential Revision: D61512756

Pulled By: jaykorean

fbshipit-source-id: 50dc7e92a945fa80c7dfd01cc89243fd5eaf0548
2024-08-19 22:39:17 -07:00
Jay Huh 5b8f5cbcf4 Update main branch for 9.6 release (#12945)
Summary:
Main branch cut at defd97bc9.
Updated HISTORY.md, version and format compatibility test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12945

Test Plan: CI

Reviewed By: jowlyzhang

Differential Revision: D61482149

Pulled By: jaykorean

fbshipit-source-id: 4edf7c0a8c6e4df8fcc938bc778dfd02981d0c55
2024-08-19 17:36:23 -07:00
Changyu Bi defd97bc9d Add an option to verify memtable key order during reads (#12889)
Summary:
add a new CF option `paranoid_memory_checks` that allows additional data integrity validations during read/scan. Currently, skiplist-based memtable will validate the order of keys visited. Further data validation can be added in different layers. The option will be opt-in due to performance overhead.

The motivation for this feature is for services where data correctness is critical and want to detect in-memory corruption earlier. For a corrupted memtable key, this feature can help to detect it during during reads instead of during flush with existing protections (OutputValidator that verifies key order or per kv checksum). See internally linked task for more context.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12889

Test Plan:
* new unit test added for paranoid_memory_checks=true.
* existing unit test for paranoid_memory_checks=false.
* enable in stress test.

Performance Benchmark: we check for performance regression in read path where data is in memtable only. For each benchmark, the script was run at the same time for main and this PR:
* Memtable-only randomread ops/sec:
```
(for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000  --seed=1723056275 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }';

Main: 608146
PR with paranoid_memory_checks=false: 607727 (- %0.07)
PR with paranoid_memory_checks=true: 521889 (-%14.2)
```

* Memtable-only sequential scan ops/sec:
```
(for I in $(seq 1 50); do ./db_bench--benchmarks=fillseq,readseq[-X10] --write_buffer_size=268435456 --num=1000000  --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';

Main: 9180077
PR with paranoid_memory_checks=false: 9536241 (+%3.8)
PR with paranoid_memory_checks=true: 7653934 (-%16.6)
```

* Memtable-only reverse scan ops/sec:
```
(for I in $(seq 1 20); do ./db_bench --benchmarks=fillseq,readreverse[-X10] --write_buffer_size=268435456 --num=1000000  --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';

 Main: 1285719
 PR with integrity_checks=false: 1431626 (+%11.3)
 PR with integrity_checks=true: 811031 (-%36.9)
```

The `readrandom` benchmark shows no regression. The scanning benchmarks show improvement that I can't explain.

Reviewed By: pdillinger

Differential Revision: D60414267

Pulled By: cbi42

fbshipit-source-id: a70b0cbeea131f1a249a5f78f9dc3a62dacfaa91
2024-08-19 13:53:25 -07:00
Jay Huh 273b3eadf0 Add Remote Compaction Installation Callback Function (#12940)
Summary:
Add an optional callback function upon remote compaction temp output installation. This will be internally used for setting the final status in the Offload Infra.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12940

Test Plan:
Unit Test added
```
./compaction_service_test
```

_Also internally tested by manually merging into internal code base_

Reviewed By: anand1976

Differential Revision: D61419157

Pulled By: jaykorean

fbshipit-source-id: 66831685bc403949c26bfc65840dd1900d2a5a67
2024-08-19 11:22:43 -07:00
Yu Zhang 295326b6ee Best efforts recovery recover seqno prefix (#12938)
Summary:
This PR make best efforts recovery more permissive by allowing it to recover incomplete Version that presents a valid point in time view from the user's perspective. Currently, a Version is only valid and saved if all files consisting that Version can be found. With this change, if only a suffix of L0 files (and their associated blob files) are missing,  a valid Version is also available to be saved and recover to. Note that we don't do this if the column family was atomically flushed. Because atomic flush also need a consistent view across the column families, we cannot guarantee that if we are recovering to incomplete version.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12938

Test Plan: Existing tests and added unit tests.

Reviewed By: anand1976

Differential Revision: D61414381

Pulled By: jowlyzhang

fbshipit-source-id: f9b73deb34d35ad696ab42315928b656d586262a
2024-08-16 17:18:54 -07:00
Peter Dillinger 4d3518951a Option to decouple index and filter partitions (#12939)
Summary:
Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating tail latencies and thrash conditions that can arise with large metadata blocks (sometimes megabytes each) that can arise with large SST files. In general, the cost to partitioned metadata is more CPU in accesses (especially for filters where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads.

However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed.

Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair because of its despite its proportionally smaller cost of keeping in block cache, and most of the cost of a miss to re-load it (random IO) is not proportional to the size (similar latency etc. up to ~32KB).

 ## This change

Adds a new table option decouple_partitioned_filters allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder.

This change includes refactoring to pass that previous key to the filter builder when available, with the filter building caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still overall improvement especially with https://github.com/facebook/rocksdb/issues/12931.

Suggested follow-up:
* Update confusing use of "last key" to refer to "previous key"
* Expand unit test coverage with parallel compression and dictionary training
* Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939

Test Plan:
unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature.

 ## SST write performance (CPU)

Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931 but with -decouple_partitioned_filters=1 in the "after" configuration, which benchmarking shows makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR
```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
923691 vs. 924851 (-0.13%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
921398 vs. 922973 (-0.17%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
902259 vs. 908756 (-0.71%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
917932 vs. 916901 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
912755 vs. 907298 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
899754 vs. 892433 (+0.82%)
```
I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations.

 ## Read performance

Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with larger metadata block size setting (8k instead of 4k). Basic setup:

```
(for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 | tee results-$CS; done)
```

And read ops/s results:

```CSV
Cache size MB,After/decoupled/4k,Before/4k,Before/8k
3,15593,15158,12826
6,16295,16693,14134
10,20427,20813,18459
20,27035,26836,27384
30,33250,31810,33846
60,35518,32585,35329
100,36612,31805,35292
300,35780,31492,35481
1000,34145,31551,35411
1100,35219,31380,34302
1200,35060,31037,34322
```

If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration.

Reviewed By: jowlyzhang

Differential Revision: D61376772

Pulled By: pdillinger

fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f
2024-08-16 15:34:31 -07:00
Hui Xiao 75a1230ce8 Fix improper ExpectedValute::Exists() usages and disable compaction during VerifyDB() in crash test (#12933)
Summary:
**Context:**
Adding assertion `!PendingPut()&&!PendingDelete()` in `ExpectedValute::Exists()` surfaced a couple improper usages of `ExpectedValute::Exists()` in the crash test
- Commit phase of `ExpectedValue::Delete()`/`SyncDelete()`:
When we issue delete to expected value during commit phase or `SyncDelete()` (used in crash recovery verification) as below, we don't really care about the result.
d458331ee9/db_stress_tool/expected_state.cc (L73)
d458331ee9/db_stress_tool/expected_value.cc (L52)
That means, we don't really need to check for `Exists()` d458331ee9/db_stress_tool/expected_value.cc (L24-L26).
This actually gives an alternative solution to b65e29a4a9 to solve false-positive assertion violation.
- TestMultiGetXX() path: `Exists()` is called without holding the lock as required

f63428bcc7/db_stress_tool/no_batched_ops_stress.cc (L2688)
```
void MaybeAddKeyToTxnForRYW(
      ThreadState* thread, int column_family, int64_t key, Transaction* txn,
      std::unordered_map<std::string, ExpectedValue>& ryw_expected_values) {
    assert(thread);
    assert(txn);

    SharedState* const shared = thread->shared;
    assert(shared);

    if (!shared->AllowsOverwrite(key) && shared->Exists(column_family, key)) {
      // Just do read your write checks for keys that allow overwrites.
      return;
    }

    // With a 1 in 10 probability, insert the just added key in the batch
    // into the transaction. This will create an overlap with the MultiGet
    // keys and exercise some corner cases in the code
    if (thread->rand.OneIn(10)) {
```

f63428bcc7/db_stress_tool/expected_state.h (L74-L76)

The assertion also failed if db stress compaction filter was invoked before crash recovery verification (`VerifyDB()`->`VerifyOrSyncValue()`) finishes.
f63428bcc7/db_stress_tool/db_stress_compaction_filter.h (L53)
It failed because it can encounter a key with pending state when checking for `Exists()` since that key's expected state has not been sync-ed with db state in `VerifyOrSyncValue()`.
f63428bcc7/db_stress_tool/no_batched_ops_stress.cc (L2579-L2591)

**Summary:**
This PR fixes above issues by
- not checking `Exists()` in commit phase/`SyncDelete()`
- using the concurrent version of key existence check like in other read
- conditionally temporarily disabling compaction till after crash recovery verification succeeds()

And add back the assertion `!PendingPut()&&!PendingDelete()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12933

Test Plan: Rehearsal CI

Reviewed By: cbi42

Differential Revision: D61214889

Pulled By: hx235

fbshipit-source-id: ef25ba896e64330ddf330182314981516880c3e4
2024-08-15 12:32:59 -07:00
Peter Dillinger 21da4ba4aa Attempt fix valgrind FP on std::optional (#12936)
Summary:
```
[ RUN      ]
BlockBasedTableReaderTest/BlockBasedTableReaderTest.MultiGet/347
==49577== Thread 4:
==49577== Conditional jump or move depends on uninitialised value(s)
==49577==    at 0x518AF93: operator!=<long unsigned int, long unsigned int> (optional:1115)
==49577==    by 0x518AF93: rocksdb::(anonymous namespace)::XXPH3FilterBitsBuilder::AddKeyAndAlt(rocksdb::Slice const&, rocksdb::Slice const&) (filter_policy.cc:100)
==49577==    by 0x5192722: Add (full_filter_block.cc:37)
==49577==    by 0x5192722: rocksdb::FullFilterBlockBuilder::Add(rocksdb::Slice const&) (full_filter_block.cc:33)
==49577==    by 0x5125DDB: rocksdb::BlockBasedTableBuilder::BGWorkWriteMaybeCompressedBlock() (block_based_table_builder.cc:1473)
==49577==    by 0x570C6B3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
==49577==    by 0x5617608: start_thread (pthread_create.c:477)
==49577==    by 0x5988132: clone (clone.S:95)
==49577==
```

Seems to be explained by ASM that valgrind doesn't like. https://stackoverflow.com/q/51616179

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12936

Test Plan: Wasn't able to reproduce locally

Reviewed By: hx235

Differential Revision: D61338401

Pulled By: pdillinger

fbshipit-source-id: b5b10f7f5c6a8c9eb088c00e5699046100167cb7
2024-08-15 10:55:29 -07:00
Jay Huh b99baa5ae7 Do not add unprep_seqs when WriteImpl() fails in unprepared txn (#12927)
Summary:
With the following scenario, we see assertion failure in write_unprepared_txn

1. The write is the first one in the transaction (`log_number_` is not set yet - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L376-L379)
2. `WriteToWAL()` fails inside `db_impl_->WriteImpl()` due to fault injection. `last_log_number_`is 0. https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L386
3. `prepare_batch_cnt_` is still added to `unprep_seqs_` https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L395-L398
4. When the transaction gets destructed after failed, it expects `log_number_ > 0` if `unprep_seqs_` is not empty - https://github.com/facebook/rocksdb/blob/main/utilities/transactions/write_unprepared_txn.cc#L54-L55.

Step 3 is wrong. `unprep_seqs_` is for the ordered list of unprep sequence numbers that we have already written to the DB. If `db_impl_->WriteImpl()` failed, `unprep_seqs_` shouldn't be set. `prepare_seq` value would be wrong anyway.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12927

Test Plan:
Following command steadily repros the issue. This no longer fails after the fix
```
./db_stress \
  --WAL_size_limit_MB=0 \
  --WAL_ttl_seconds=60 \
  --acquire_snapshot_one_in=10000 \
  --adaptive_readahead=1 \
  --adm_policy=2 \
  --advise_random_on_open=0 \
  --allow_data_in_errors=True \
  --allow_fallocate=1 \
  --async_io=1 \
  --auto_readahead_size=0 \
  --avoid_flush_during_recovery=0 \
  --avoid_flush_during_shutdown=1 \
  --avoid_unnecessary_blocking_io=0 \
  --backup_max_size=104857600 \
  --backup_one_in=100000 \
  --batch_protection_bytes_per_key=8 \
  --bgerror_resume_retry_interval=100 \
  --block_align=1 \
  --block_protection_bytes_per_key=8 \
  --block_size=16384 \
  --bloom_before_level=7 \
  --bloom_bits=4.1280979878467345 \
  --bottommost_compression_type=none \
  --bottommost_file_compaction_delay=600 \
  --bytes_per_sync=262144 \
  --cache_index_and_filter_blocks=1 \
  --cache_index_and_filter_blocks_with_high_priority=0 \
  --cache_size=33554432 \
  --cache_type=lru_cache \
  --charge_compression_dictionary_building_buffer=1 \
  --charge_file_metadata=1 \
  --charge_filter_construction=1 \
  --charge_table_reader=1 \
  --check_multiget_consistency=0 \
  --check_multiget_entity_consistency=1 \
  --checkpoint_one_in=0 \
  --checksum_type=kXXH3 \
  --clear_column_family_one_in=0 \
  --compact_files_one_in=1000000 \
  --compact_range_one_in=1000000 \
  --compaction_pri=4 \
  --compaction_readahead_size=1048576 \
  --compaction_ttl=0 \
  --compress_format_version=1 \
  --compressed_secondary_cache_ratio=0.0 \
  --compressed_secondary_cache_size=0 \
  --compression_checksum=1 \
  --compression_max_dict_buffer_bytes=4294967295 \
  --compression_max_dict_bytes=16384 \
  --compression_parallel_threads=8 \
  --compression_type=none \
  --compression_use_zstd_dict_trainer=1 \
  --compression_zstd_max_train_bytes=0 \
  --continuous_verification_interval=0 \
  --create_timestamped_snapshot_one_in=0 \
  --daily_offpeak_time_utc=04:00-08:00 \
  --data_block_index_type=0 \
  --db=$db \
  --db_write_buffer_size=8388608 \
  --default_temperature=kHot \
  --default_write_temperature=kUnknown \
  --delete_obsolete_files_period_micros=21600000000 \
  --delpercent=5 \
  --delrangepercent=0 \
  --destroy_db_initially=0 \
  --detect_filter_construct_corruption=0 \
  --disable_file_deletions_one_in=10000 \
  --disable_manual_compaction_one_in=10000 \
  --disable_wal=0 \
  --dump_malloc_stats=1 \
  --enable_checksum_handoff=1 \
  --enable_compaction_filter=0 \
  --enable_custom_split_merge=1 \
  --enable_do_not_compress_roles=0 \
  --enable_index_compression=1 \
  --enable_memtable_insert_with_hint_prefix_extractor=0 \
  --enable_pipelined_write=0 \
  --enable_sst_partitioner_factory=0 \
  --enable_thread_tracking=1 \
  --enable_write_thread_adaptive_yield=1 \
  --error_recovery_with_no_fault_injection=1 \
  --exclude_wal_from_write_fault_injection=0 \
  --expected_values_dir=$exp \
  --fail_if_options_file_error=0 \
  --fifo_allow_compaction=1 \
  --file_checksum_impl=crc32c \
  --fill_cache=1 \
  --flush_one_in=1000 \
  --format_version=2 \
  --get_all_column_family_metadata_one_in=1000000 \
  --get_current_wal_file_one_in=0 \
  --get_live_files_apis_one_in=1000000 \
  --get_properties_of_all_tables_one_in=100000 \
  --get_property_one_in=1000000 \
  --get_sorted_wal_files_one_in=0 \
  --hard_pending_compaction_bytes_limit=274877906944 \
  --high_pri_pool_ratio=0 \
  --index_block_restart_interval=15 \
  --index_shortening=0 \
  --index_type=2 \
  --ingest_external_file_one_in=0 \
  --initial_auto_readahead_size=524288 \
  --inplace_update_support=0 \
  --iterpercent=10 \
  --key_len_percent_dist=1,30,69 \
  --key_may_exist_one_in=100000 \
  --kill_random_test=888887 \
  --last_level_temperature=kCold \
  --level_compaction_dynamic_level_bytes=1 \
  --lock_wal_one_in=1000000 \
  --log2_keys_per_lock=10 \
  --log_file_time_to_roll=0 \
  --log_readahead_size=0 \
  --long_running_snapshots=1 \
  --low_pri_pool_ratio=0 \
  --lowest_used_cache_tier=1 \
  --manifest_preallocation_size=5120 \
  --manual_wal_flush_one_in=0 \
  --mark_for_compaction_one_file_in=10 \
  --max_auto_readahead_size=0 \
  --max_background_compactions=20 \
  --max_bytes_for_level_base=10485760 \
  --max_key=100000 \
  --max_key_len=3 \
  --max_log_file_size=1048576 \
  --max_manifest_file_size=1073741824 \
  --max_sequential_skip_in_iterations=2 \
  --max_total_wal_size=0 \
  --max_write_batch_group_size_bytes=16 \
  --max_write_buffer_number=10 \
  --max_write_buffer_size_to_maintain=0 \
  --memtable_insert_hint_per_batch=0 \
  --memtable_max_range_deletions=1000 \
  --memtable_prefix_bloom_size_ratio=0.1 \
  --memtable_protection_bytes_per_key=8 \
  --memtable_whole_key_filtering=1 \
  --memtablerep=skip_list \
  --metadata_charge_policy=0 \
  --metadata_read_fault_one_in=32 \
  --metadata_write_fault_one_in=0 \
  --min_write_buffer_number_to_merge=2 \
  --mmap_read=1 \
  --mock_direct_io=False \
  --nooverwritepercent=1 \
  --num_file_reads_for_auto_readahead=2 \
  --open_files=-1 \
  --open_metadata_read_fault_one_in=0 \
  --open_metadata_write_fault_one_in=0 \
  --open_read_fault_one_in=0 \
  --open_write_fault_one_in=0 \
  --ops_per_thread=20000000 \
  --optimize_filters_for_hits=1 \
  --optimize_filters_for_memory=0 \
  --optimize_multiget_for_io=0 \
  --paranoid_file_checks=1 \
  --partition_filters=0 \
  --partition_pinning=0 \
  --pause_background_one_in=1000000 \
  --periodic_compaction_seconds=1 \
  --prefix_size=5 \
  --prefixpercent=5 \
  --prepopulate_block_cache=0 \
  --preserve_internal_time_seconds=60 \
  --progress_reports=0 \
  --promote_l0_one_in=0 \
  --read_amp_bytes_per_bit=32 \
  --read_fault_one_in=1000 \
  --readahead_size=0 \
  --readpercent=45 \
  --recycle_log_file_num=1 \
  --reopen=20 \
  --report_bg_io_stats=1 \
  --reset_stats_one_in=10000 \
  --sample_for_compression=0 \
  --secondary_cache_fault_one_in=0 \
  --sync=0 \
  --sync_fault_injection=0 \
  --table_cache_numshardbits=6 \
  --target_file_size_base=524288 \
  --target_file_size_multiplier=2 \
  --test_batches_snapshots=0 \
  --top_level_index_pinning=0 \
  --txn_write_policy=2 \
  --uncache_aggressiveness=126 \
  --universal_max_read_amp=4 \
  --unordered_write=0 \
  --unpartitioned_pinning=1 \
  --use_adaptive_mutex=0 \
  --use_adaptive_mutex_lru=1 \
  --use_attribute_group=1 \
  --use_delta_encoding=1 \
  --use_direct_io_for_flush_and_compaction=0 \
  --use_direct_reads=0 \
  --use_full_merge_v1=0 \
  --use_get_entity=0 \
  --use_merge=1 \
  --use_multi_cf_iterator=0 \
  --use_multi_get_entity=0 \
  --use_multiget=1 \
  --use_optimistic_txn=0 \
  --use_put_entity_one_in=0 \
  --use_timed_put_one_in=0 \
  --use_txn=1 \
  --use_write_buffer_manager=0 \
  --user_timestamp_size=0 \
  --value_size_mult=32 \
  --verification_only=0 \
  --verify_checksum=1 \
  --verify_checksum_one_in=1000000 \
  --verify_compression=0 \
  --verify_db_one_in=10000 \
  --verify_file_checksums_one_in=1000000 \
  --verify_iterator_with_expected_state_one_in=5 \
  --verify_sst_unique_id_in_manifest=1 \
  --wal_bytes_per_sync=0 \
  --wal_compression=none \
  --write_buffer_size=4194304 \
  --write_dbid_to_manifest=1 \
  --write_fault_one_in=128 \
  --writepercent=35
```

Reviewed By: cbi42

Differential Revision: D61048774

Pulled By: jaykorean

fbshipit-source-id: 22200d55fd0b22b68732b12516e681a6c6e2c601
2024-08-15 09:16:29 -07:00
Peter Dillinger f63428bcc7 Optimize, simplify filter block building (fix regression) (#12931)
Summary:
This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in https://github.com/facebook/rocksdb/issues/12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys).

Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capabilty, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.)

Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now.

The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more intruction-level parallelism by reducing unnecessary control dependencies.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12931

Test Plan:
existing tests (with some minor updates)

Also manually ported over the pre-broken regression test described in
 https://github.com/facebook/rocksdb/issues/12870 and ran it (passed).

Performance:
Here we validate that an entire series of recent related PRs are a net improvement in aggregate. "Before" is with these PRs reverted: https://github.com/facebook/rocksdb/issues/12872 #12911 https://github.com/facebook/rocksdb/issues/12874 #12867 https://github.com/facebook/rocksdb/issues/12903 #12904. "After" includes this PR (and all
of those, with base revision 16c21af). Simultaneous test script designed to maximally depend on SST construction efficiency:

```
for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```

Showing average ops/sec of "after" vs. "before"

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```

As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key.

Reviewed By: jowlyzhang

Differential Revision: D61138271

Pulled By: pdillinger

fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043
2024-08-14 15:13:16 -07:00
Yu Zhang d458331ee9 Move file tracking in VersionEditHandlerPointInTime to VersionBuilder (#12928)
Summary:
`VersionEditHandlerPointInTime` is tracking found files, missing files, intermediate files in order to decide to build a `Version` on negative edge trigger (transition from valid to invalid) without applying  the current `VersionEdit`.  However, applying `VersionEdit` and check completeness of a `Version` are specialization of `VersionBuilder`.  More importantly, when we augment best efforts recovery to recover not just complete point in time Version but also a prefix of seqno for a point in time Version, such checks need to be duplicated in `VersionEditHandlerPointInTime` and `VersionBuilder`.

To avoid this, this refactor move all the file tracking functionality in `VersionEditHandlerPointInTime` into `VersionBuilder`.  To continue to let `VersionEditHandlerPIT` do the edge trigger check and  build a `Version` before applying the current `VersionEdit`, a suite of APIs to supporting creating a save point and its associated functions are added in `VersionBuilder` to achieve this.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12928

Test Plan: Existing tests

Reviewed By: anand1976

Differential Revision: D61171320

Pulled By: jowlyzhang

fbshipit-source-id: 604f66f8b1e3a3e13da59d8ba357c74e8a366dbc
2024-08-12 21:09:37 -07:00
anand76 c21fe1a47f Add ticker stats for read corruption retries (#12923)
Summary:
Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923

Test Plan: Update existing tests

Reviewed By: jowlyzhang

Differential Revision: D61024687

Pulled By: anand1976

fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06
2024-08-12 15:32:07 -07:00
Hui Xiao b65e29a4a9 Loosen a strong assertion in ExpectedValue::Exists() (#12932)
Summary:
**Context/Summary:** .... since it won't work in the PrepareDelete() path

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12932

Test Plan: CI

Reviewed By: cbi42

Differential Revision: D61155155

Pulled By: hx235

fbshipit-source-id: 99b0784f6c903d70c7b3b88b53ae8e2c885de96f
2024-08-12 15:21:27 -07:00
SGZW 6727f0f58a fix compaction_picker_test asan heap use after free (#12908)
Summary:
![image](https://github.com/user-attachments/assets/3290fe18-aca2-4691-b072-fbbc96a15fb1)

this testcase set syncpoint function which reference this test case heap variable "enable_per_key_placement_" and this sync point function will be triggered by another testcase, so asan will report asan heap use after free error

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12908

Reviewed By: hx235

Differential Revision: D60973363

Pulled By: cbi42

fbshipit-source-id: df4f488f51e7741784d5a92fc0a5fc538c5d5b1a
2024-08-09 15:06:37 -07:00