Commit graph

5565 commits

Author SHA1 Message Date
Jay Huh c449867236 MultiCfIterator Impl Follow up (#12465)
Summary:
As a follow-up to https://github.com/facebook/rocksdb/issues/12422, this PR includes the following two changes.
- Removal of `direction_` from the MultiCfIterator
- Use of a member function template instead of `std::function` (a sketch of the pattern follows below)
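
A minimal sketch of the second change, with illustrative names (this is not the actual MultiCfIterator code): a member function template accepts the callable's concrete type directly, avoiding the type-erasure and potential allocation overhead of `std::function`.

```
#include <memory>
#include <vector>

#include "rocksdb/iterator.h"

class MultiCfIteratorSketch {
 public:
  // Before: void SeekCommon(const std::function<void(rocksdb::Iterator*)>&);
  template <typename ChildSeekFuncType>
  void SeekCommon(ChildSeekFuncType child_seek_func) {
    for (auto& child : children_) {
      child_seek_func(child.get());  // direct call, no std::function wrapper
    }
  }

 private:
  std::vector<std::unique_ptr<rocksdb::Iterator>> children_;
};
```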

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12465

Test Plan:
```
./multi_cf_iterator_test
```

Reviewed By: pdillinger, ltamasi

Differential Revision: D55208448

Pulled By: jaykorean

fbshipit-source-id: 8b3167c1d59839d076afc29097b5ad21a453460a
2024-03-22 14:51:16 -07:00
Peter Dillinger b515a5db3f Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470)
Summary:
ScopedArenaIterator is not an iterator; it is a pointer wrapper. And we don't need a custom-implemented pointer wrapper when std::unique_ptr can be instantiated with what we want.

So this adds ScopedArenaPtr<T> to replace those uses.
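
As a minimal sketch of the idea (details here are illustrative, not the exact RocksDB definition), a destroy-only deleter makes `std::unique_ptr` suitable for arena-allocated objects, whose memory is reclaimed with the arena rather than by `delete`:

```
#include <memory>

// Run the destructor but do not free the memory; the arena owns it.
template <typename T>
struct Destroyer {
  void operator()(T* ptr) const { ptr->~T(); }
};

template <typename T>
using ScopedArenaPtr = std::unique_ptr<T, Destroyer<T>>;
```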

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470

Test Plan: CI (including ASAN/UBSAN)

Reviewed By: jowlyzhang

Differential Revision: D55254362

Pulled By: pdillinger

fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18
2024-03-22 13:40:42 -07:00
anand76 3b736c4aa3 Fix heap use after free error on retry after checksum mismatch (#12464)
Summary:
Fix the heap use after free bug caused by freeing the file system IO buffer in `BlockFetcher::ReadBlock()` instead of the caller.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12464

Test Plan: Update the `DBIOCorruptionTest` tests

Reviewed By: akankshamahajan15

Differential Revision: D55206920

Pulled By: anand1976

fbshipit-source-id: fd6b608a61cd229b20c1e5f348ff3cc92328de0f
2024-03-21 16:19:09 -07:00
Andrew Kryczka bf98dcf9a8 Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462)
Summary:
The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file.

There are also fixes for bugs encountered along the way where non-ok statuses were ignored or overwritten, and a bit of plumbing work for functions that had no capability to return a status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462

Test Plan:
A repro command

```
db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp
./db_stress \
        --clear_column_family_one_in=0 \
        --test_batches_snapshots=0 \
        --write_fault_one_in=0 \
        --use_put_entity_one_in=0 \
        --prefixpercent=0 \
        --read_fault_one_in=0 \
        --readpercent=0 \
        --reopen=0 \
        --set_options_one_in=10000 \
        --delpercent=0 \
        --delrangepercent=0 \
        --open_metadata_write_fault_one_in=0 \
        --open_read_fault_one_in=0 \
        --open_write_fault_one_in=0 \
        --destroy_db_initially=0 \
        --ingest_external_file_one_in=0 \
        --iterpercent=0 \
        --nooverwritepercent=0 \
        --db=$db \
        --enable_blob_files=1 \
        --expected_values_dir=$exp \
        --max_background_compactions=20 \
        --max_bytes_for_level_base=2097152 \
        --max_key=100000 \
        --min_blob_size=0 \
        --open_files=-1 \
        --ops_per_thread=100000000 \
        --prefix_size=-1 \
        --target_file_size_base=524288 \
        --use_merge=1 \
        --value_size_mult=32 \
        --write_buffer_size=524288 \
        --writepercent=100
```

It used to fail like:

```
...
frame #9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3
frame #10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, rocksdb::PinnedIteratorsManager*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33
frame #11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492
frame #12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408
frame #13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109
frame #14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656
frame #15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607
frame #16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505
frame #17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084
frame #18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807
frame #19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312
frame #20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157
frame #21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544
frame #22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72
frame #23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317
frame #24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148
...
```

Reviewed By: ltamasi

Differential Revision: D55157795

Pulled By: ajkr

fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764
2024-03-21 12:38:53 -07:00
anand76 63a105a481 Enable recycle_log_file_num option for point in time recovery (#12403)
Summary:
This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL.

This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path.
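
As a sketch, the combination this PR re-enables looks like the following (values illustrative):

```
rocksdb::Options options;
// Reuse old WAL files instead of deleting and recreating them.
options.recycle_log_file_num = 4;
// Point-in-time recovery, previously incompatible with WAL recycling.
options.wal_recovery_mode = rocksdb::WALRecoveryMode::kPointInTimeRecovery;
```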

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403

Test Plan:
1. Add new unit tests to verify behavior upon corruption
2. Re-enable disabled tests for verifying recycling behavior

Reviewed By: ajkr

Differential Revision: D54544824

Pulled By: anand1976

fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a
2024-03-21 12:29:35 -07:00
Yu Zhang 13e1c32a18 Follow ups for TimedPut and write time property (#12455)
Summary:
This PR contains a few follow-ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428, including:

1) Handling of a special case for `WriteBatch::TimedPut`: when the user-specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error; instead, a regular `Put` entry is created and written.

2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries.

3) FlushJob is updated to use the seqno-to-time mapping copy in `SuperVersion`. FlushJob previously copied the DB's seqno-to-time mapping while holding the db mutex, and only copied the part of interest, i.e., the part that goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to a preferred seqno as much as possible.

Test plans:
Added unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455

Reviewed By: pdillinger

Differential Revision: D55165422

Pulled By: jowlyzhang

fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47
2024-03-21 10:00:15 -07:00
Richard Barnes 6a1c2abe9d Remove extra semi colon from hbt/src/tagstack/tests/SlicerTest.cpp (#12461)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12461

X-link: https://github.com/facebookincubator/dynolog/pull/233

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: rahku

Differential Revision: D55087324

fbshipit-source-id: e8a03d33cad72a7d378e58f85eb550a03f6c2897
2024-03-20 12:44:50 -07:00
Kshitij Wadhwa 4ce1dc930c don't run ZSTD_TrainDictionary in BlockBasedTableBuilder if there isn't compression needed (#12453)
Summary:
fixes https://github.com/facebook/rocksdb/issues/12409

### Issue

ZSTD_TrainDictionary (`table/block_based/block_based_table_builder.cc:1894` at commit a53ed91691) runs during `SstFileWriter::Finish` even when the `bottommost_compression` option is set to `kNoCompression`. This reduces throughput for `SstFileWriter::Finish`.

We construct RocksDB options using ZSTD compression for levels 2 and above. For levels 0 and 1, we set it to kNoCompression. We also set zstd_max_train_bytes to a non-zero positive value (which is applicable only to levels with ZSTD compression enabled). These options are used for the database and also passed to SstFileWriter for creating sst files to be later added to that database. Since BlockBasedTableBuilder::Finish (`table/block_based/block_based_table_builder.cc:1892` at commit a53ed91691) only checks for zstd_max_train_bytes being a non-zero positive value, it runs ZSTD_TrainDictionary even when it shouldn't, since SstFileWriter is operating at the bottommost level.
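
A sketch of the options shape described above (values illustrative):

```
rocksdb::Options options;
// kNoCompression for levels 0 and 1, ZSTD for levels 2 and above.
options.compression_per_level = {
    rocksdb::kNoCompression, rocksdb::kNoCompression, rocksdb::kZSTD,
    rocksdb::kZSTD, rocksdb::kZSTD, rocksdb::kZSTD, rocksdb::kZSTD};
// Non-zero positive value; applicable only to the ZSTD levels, but prior
// to this fix it alone triggered dictionary training in SstFileWriter.
options.compression_opts.zstd_max_train_bytes = 1 << 20;
```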

### Fix

If the compression type is set to kNoCompression, don't run ZSTD_TrainDictionary or dictionary building.

### Testing

I see we have tests for SstFileWriter with the compression type set/unset. Let me know if anything isn't covered and I can extend the tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12453

Reviewed By: cbi42

Differential Revision: D55030484

Pulled By: ajkr

fbshipit-source-id: 834de2174c2b087d61bf045ca1ae29f337b821a7
2024-03-20 11:07:32 -07:00
Jay Huh 3f3f4660bd wal_read_status check in RecoverLogFiles (#12460)
Summary:
Fixing the not-checked status failure as in https://github.com/facebook/rocksdb/actions/runs/8334988399/job/22809612148.

When the status is already not ok() for any reason, checking `wal_read_status` is unnecessary; leaving it unchecked was causing the test failure when running with `ASSERT_STATUS_CHECKED=1`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12460

Test Plan: Existing tests

Reviewed By: ajkr

Differential Revision: D55104844

Pulled By: jaykorean

fbshipit-source-id: 919b1fddca835494f9087c51c4da6eabc9e8df2b
2024-03-20 08:09:09 -07:00
anand76 4868c10b44 Retry block reads on checksum mismatch (#12427)
Summary:
On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.
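
A hedged sketch of how a `FileSystem` implementation might advertise the capability (the bit-flag encoding shown follows other `FSSupportedOps` usage and is an assumption):

```
#include "rocksdb/file_system.h"

class ReconstructingFS : public rocksdb::FileSystemWrapper {
 public:
  explicit ReconstructingFS(const std::shared_ptr<rocksdb::FileSystem>& t)
      : FileSystemWrapper(t) {}
  const char* Name() const override { return "ReconstructingFS"; }

  void SupportedOps(int64_t& supported_ops) override {
    FileSystemWrapper::SupportedOps(supported_ops);
    // Advertise storage-level checksum verification and reconstruction,
    // enabling RocksDB's retry on checksum mismatch.
    supported_ops |= (1 << rocksdb::FSSupportedOps::kVerifyAndReconstructRead);
  }
};
```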

Tests:
Add new unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427

Reviewed By: ajkr

Differential Revision: D55025941

Pulled By: anand1976

fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
2024-03-18 16:16:05 -07:00
Jay Huh b4e9f5a400 Update Remote Compaction Tests to include more than one CF (#12430)
Summary:
Update `compaction_service_test` to make sure remote compaction works with a multiple-column-family setup, with a minor refactor to get rid of duplicate code.

Also fixing one quick bug in the existing test util: `FilesPerLevel` didn't honor `cf_id` properly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12430

Test Plan:
```
./compaction_service_test
```

Reviewed By: ajkr

Differential Revision: D54883035

Pulled By: jaykorean

fbshipit-source-id: 83b4f6f566fed5c4824bfef7de01074354a72b44
2024-03-18 15:40:48 -07:00
Hui Xiao 2443ebf810 Don't write to WAL after previous WAL write error (#12448)
Summary:
**Context/Summary:**
A WAL write can continue onto a WAL file that has already encountered an error, and thus crash at `file/writable_file_writer.cc:67` (commit 3f5bd46a07) in the scenario below:

[Screenshot 2024-03-15 at 1:52:45 PM: the two-thread interleaving that triggers the crash]

Note that GetLiveFilesStorageInfo() can happen concurrently with Put() for the non-WAL-write part, where the db lock isn't held.

This PR added an error check in LogWriter layer to prevent thread 2 from starting to write WAL after thread 1's write error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12448

Test Plan:
Step 1 Apply the patch below to simulate frequent WAL write error for the purpose of repro
```
diff --git a/db_stress_tool/db_stress_driver.cc b/db_stress_tool/db_stress_driver.cc
index b47fa89e6..31930e976 100644
--- a/db_stress_tool/db_stress_driver.cc
+++ b/db_stress_tool/db_stress_driver.cc
@@ -98,7 +98,7 @@ bool RunStressTestImpl(SharedState* shared) {
     //  MANIFEST, CURRENT, and WAL files.
     fault_fs_guard->SetRandomWriteError(
         shared->GetSeed(), FLAGS_write_fault_one_in, error_msg,
-        /*inject_for_all_file_types=*/false, {FileType::kTableFile});
+        /*inject_for_all_file_types=*/false, {FileType::kWalFile});
     fault_fs_guard->SetFilesystemDirectWritable(false);
     fault_fs_guard->EnableWriteErrorInjection();
   }

diff --git a/utilities/fault_injection_fs.cc b/utilities/fault_injection_fs.cc
index 0ffb43ea6..589912cf4 100644
--- a/utilities/fault_injection_fs.cc
+++ b/utilities/fault_injection_fs.cc
@@ -1042,7 +1042,7 @@ IOStatus FaultInjectionTestFS::InjectWriteError(const std::string& file_name) {
   }

   if (allowed_type) {
-    if (write_error_rand_.OneIn(write_error_one_in_)) {
+    if (write_error_rand_.OneIn(1)) {
       return GetError();
     }
   }
```
Step 2 Run below
```
./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=41.19540459544058 --bottommost_compression_type=disable --bottommost_file_compaction_delay=3600 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=0 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=1 --compression_max_dict_buffer_bytes=68719476735 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_index_compression=1 --enable_pipelined_write=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=10000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=15 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=1048576 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 
--mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=500000 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --report_bg_io_stats=0 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=0 --strict_bytes_per_sync=0 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=35
```
Pre-PR:
```
db_stress: ./file/writable_file_writer.h:309: rocksdb::IOStatus rocksdb::WritableFileWriter::AssertFalseAndGetStatusForPrevError(): Assertion `sync_without_flush_called_' failed.
```
Post-PR
```
2024/03/15-13:44:08  Starting database operations
put or merge error: IO error: Retryable injected write error
```

Note: the patch is NOT included in the PR, as we first need to figure out how to handle this type of failed write in the stress test (planned for the near future). For the purpose of this PR, it's sufficient to show that the stress test no longer crashes as it did pre-PR.

Reviewed By: ajkr

Differential Revision: D54969287

Pulled By: hx235

fbshipit-source-id: 0ba4eabfec44ea7656d4d7117836f388897562f2
2024-03-18 12:27:49 -07:00
Jay Huh db1dea22b1 MultiCfIterator Implementations (#12422)
Summary:
This PR continues https://github.com/facebook/rocksdb/issues/12153 by implementing the missing `Iterator` APIs - `Seek()`, `SeekForPrev()`, `SeekToLast()`, and `Prev()`. A max-heap implementation has been added to handle the reverse direction.

The current implementation does not include upper/lower bounds yet. These will be added in subsequent PRs. The API is still marked as under construction and will be lifted after being added to the stress test.

Please note that changing the iterator direction in the middle of iteration is expensive, as it requires seeking the element in each iterator again in the opposite direction and rebuilding the heap along the way. The first `Next()` after `SeekForPrev()` requires changing the direction under the current implementation. We may optimize this in later PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12422

Test Plan: The `multi_cf_iterator_test` has been extended to cover the API implementations.

Reviewed By: pdillinger

Differential Revision: D54820754

Pulled By: jaykorean

fbshipit-source-id: 9eb741508df0f7bad598fb8e6bd5cdffc39e81d1
2024-03-18 09:05:30 -07:00
Changyu Bi 3d5be596a5 Fix a bug in iterator with UDT + ReadOptions::pin_data (#12451)
Summary:
With https://github.com/facebook/rocksdb/issues/12414 enabling `ReadOptions::pin_data`, this bug surfaced as a corrupted per key-value checksum during the crash test. `saved_key_.GetUserKey()` could be a pinned user key, so DBIter should not overwrite it.

In one case, the bug only surfaces when the iterator skips many keys of the same user key. To stress that code path, this PR also adds `max_sequential_skip_in_iterations` to the crash test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12451

Test Plan:
- Set ReadOptions::pin_data to true, the bug can be reproed quickly with `./db_stress --persist_user_defined_timestamps=1 --user_timestamp_size=8 --writepercent=35 --delpercent=4 --delrangepercent=1 --iterpercent=20 --nooverwritepercent=1 --prefix_size=8 --prefixpercent=10 --readpercent=30 --memtable_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --clear_column_family_one_in=0`.
    - Set max_sequential_skip_in_iterations to 1 for the other occurrence of the bug.

Reviewed By: jowlyzhang

Differential Revision: D55003766

Pulled By: cbi42

fbshipit-source-id: 23e1049129456684dafb028b6132b70e0afc07fb
2024-03-18 09:05:11 -07:00
Yu Zhang f2546b6623 Support returning write unix time in iterator property (#12428)
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from an SST file, the sequence-number-to-time mapping recorded in that file's table properties is used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown, except when the entry's sequence number is zero, in which case 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` are toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.

2) If the entry comes from a memtable, we use the DB's sequence-number-to-write-time mapping to do similar things. A copy of the DB's seqno-to-write-time mapping is kept in SuperVersion to allow iterators lock-free access. This also means a new `SuperVersion` is installed each time the DB's seqno-to-time mapping updates, as originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928. Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.
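
A hedged sketch of querying the write time from an iterator (the property name is assumed from this PR; the returned string holds a fixed-width 64-bit value):

```
#include <cstring>

std::string prop;
rocksdb::Status s = iter->GetProperty("rocksdb.iterator.write-time", &prop);
if (s.ok() && prop.size() == sizeof(uint64_t)) {
  uint64_t write_unix_time = 0;
  std::memcpy(&write_unix_time, prop.data(), sizeof(write_unix_time));
  // std::numeric_limits<uint64_t>::max() means the write time is unknown.
}
```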

Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special-cased: it's already specified by the user, so we can return it directly.

2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.

3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428

Test Plan: Added unit test

Reviewed By: pdillinger

Differential Revision: D54967067

Pulled By: jowlyzhang

fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
2024-03-15 15:37:37 -07:00
Andrew Kryczka 4d5ebad971 Fix kBlockCacheTier read with table cache miss (#12443)
Summary:
Thanks ltamasi for pointing out this bug.

We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status.
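
For context, the memory-only read mode involved here:

```
rocksdb::ReadOptions read_opts;
read_opts.read_tier = rocksdb::kBlockCacheTier;  // serve only from memory
std::string value;
rocksdb::Status s = db->Get(read_opts, "key", &value);
if (s.IsIncomplete()) {
  // The read would have required filesystem access (e.g., a table cache
  // miss). After this fix the Incomplete status is propagated instead of
  // being overwritten with OK.
}
```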

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443

Reviewed By: cbi42

Differential Revision: D54930128

Pulled By: ajkr

fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d
2024-03-15 14:41:58 -07:00
Andrew Kryczka 3f5bd46a07 Add ContinueCallback to GetMergeOperands() (#12438)
Summary:
The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because:

- Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge
- The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me

If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with.
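
A hedged sketch of the new option (the exact `continue_cb` signature is an assumption; the predicate shown is hypothetical):

```
rocksdb::GetMergeOperandsOptions opts;
opts.expected_max_number_of_operands = 32;
// Returning false stops reading deeper LSM components once the operands
// gathered so far are sufficient for the caller.
opts.continue_cb = [](rocksdb::Slice operand) {
  return !LooksLikeFullOverwrite(operand);  // hypothetical predicate
};
std::vector<rocksdb::PinnableSlice> operands(32);
int num_operands = 0;
rocksdb::Status s = db->GetMergeOperands(
    rocksdb::ReadOptions(), db->DefaultColumnFamily(), "key",
    operands.data(), &opts, &num_operands);
```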

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438

Reviewed By: hx235

Differential Revision: D54914669

Pulled By: ajkr

fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a
2024-03-15 12:25:49 -07:00
Changyu Bi 096fb9b67d Fix data race in WalManager (#12439)
Summary:
Crash tests were failing due to a data race in accessing `purge_wal_files_last_run_`. This PR changes it to an atomic.
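
An illustrative sketch of the fix pattern (the surrounding code is not the actual WalManager implementation):

```
#include <atomic>
#include <cstdint>

class WalManagerSketch {
 public:
  // before: uint64_t purge_wal_files_last_run_;  // raced between threads
  bool ShouldPurge(uint64_t now_micros, uint64_t period_micros) {
    uint64_t last = purge_wal_files_last_run_.load(std::memory_order_relaxed);
    if (now_micros - last < period_micros) {
      return false;
    }
    purge_wal_files_last_run_.store(now_micros, std::memory_order_relaxed);
    return true;
  }

 private:
  std::atomic<uint64_t> purge_wal_files_last_run_{0};
};
```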

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439

Test Plan:
- existing UT
- not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests

Reviewed By: anand1976

Differential Revision: D54920817

Pulled By: cbi42

fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6
2024-03-14 21:24:06 -07:00
Yu Zhang 1104eaa35e Add initial support for TimedPut API (#12419)
Summary:
This PR adds support for `TimedPut` API. We introduced a new type `kTypeValuePreferredSeqno` for entries added to the DB via the `TimedPut` API.

The life cycle of such an entry on the write/flush/compaction paths are:

1) It is initially added to memtable as:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, write_unix_time}`

2) When it's flushed to L0 sst files, it's converted to:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, preferred_seqno}`
when we have easy access to the seqno-to-time mapping.

3) During compaction, if certain conditions are met, we swap in the `preferred_seqno` and the entry will become:
`<user_key, preferred_seqno, kTypeValue>: value`. This step helps fast track these entries to the cold tier if they are eligible after the sequence number swap.

On the read path:
A `kTypeValuePreferredSeqno` entry acts the same as a `kTypeValue` entry; the unix_write_time/preferred_seqno part packed in the value is completely ignored.
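
A hedged sketch of the write-path entry point (the signature is inferred from the summary; the API is still under construction):

```
rocksdb::WriteBatch batch;
uint64_t write_unix_time = 1700000000;  // illustrative unix timestamp
rocksdb::Status s = batch.TimedPut(db->DefaultColumnFamily(), "key", "value",
                                   write_unix_time);
if (s.ok()) {
  s = db->Write(rocksdb::WriteOptions(), &batch);
}
```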

Needed follow ups:
1) The seqno to time mapping accessible in flush needs to be extended to cover the `write_unix_time` for possible `kTypeValuePreferredSeqno` entries. This also means we need to track these `write_unix_time` in memtable.

2) Compaction filter support for the new `kTypeValuePreferredSeqno` type for feature parity with other `kTypeValue` and equivalent types.

3) Stress test coverage for the feature

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12419

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D54920296

Pulled By: jowlyzhang

fbshipit-source-id: c8b43f7a7c465e569141770e93c748371ff1da9e
2024-03-14 15:44:55 -07:00
Peter Dillinger dd24bda137 Fix windows build and CI (#12426)
Summary:
Issue https://github.com/facebook/rocksdb/issues/12421 describes a regression in the migration from CircleCI to GitHub Actions in which failing build steps no longer fail Windows CI jobs. In GHA with pwsh (new preferred powershell command), only the last non-builtin command (or something like that) affects the overall success/failure result, and failures in external commands do not exit the script, even with `$ErrorActionPreference = 'Stop'` and `$PSNativeCommandErrorActionPreference = $true`. Switching to `powershell` causes some obscure failure (not seen in CircleCI) about the `-Lo` option to `curl`.

Here we work around this using the only reasonable-but-ugly way known: explicitly check the result after every non-trivial build step. This leaves us highly susceptible to future regressions with unchecked build steps in the future, but a clean solution is not known.

This change also fixes the build errors that were allowed to creep in because of the CI regression. Also decreased the unnecessarily long running time of DBWriteTest.WriteThreadWaitNanosCounter.

For background, this problem explicitly contradicts GitHub's documentation, and GitHub has known about the problem for more than a year, with no evidence of caring or intending to fix. https://github.com/actions/runner-images/issues/6668 Somehow CircleCI doesn't have this problem. And even though cmd.exe and powershell have been perpetuating DOS-isms for decades, they still seem to be a somewhat active "hot mess" when it comes to sensible, consistent, and documented behavior.

Fixes https://github.com/facebook/rocksdb/issues/12421

A history of some things I tried in development is here: https://github.com/facebook/rocksdb/compare/main...pdillinger:rocksdb:debug_windows_ci_orig

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12426

Test Plan: CI, including https://github.com/facebook/rocksdb/issues/12434 where I have temporarily enabled other Windows builds on PR with this change

Reviewed By: cbi42

Differential Revision: D54903698

Pulled By: pdillinger

fbshipit-source-id: 116bcbebbbf98f347c7cf7dfdeebeaaed7f76827
2024-03-14 12:04:41 -07:00
Peter Dillinger c0ae5be934 Disable flaky part of TransactionLogIteratorCheckWhenArchive (#12423)
Summary:
https://github.com/facebook/rocksdb/issues/12397 attempted to make the test more honest about its failures, and they're really showing up in CI now (but not locally). Disabling pending investigation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12423

Test Plan: watch CI

Reviewed By: ltamasi

Differential Revision: D54817705

Pulled By: pdillinger

fbshipit-source-id: 4721834c49b225ac52d1a28ecb06b9d05de977b3
2024-03-12 12:54:53 -07:00
Peter Dillinger 7622029101 Fix flaky TransactionLogIteratorCheckWhenArchive (#12397)
Summary:
Seen in https://github.com/facebook/rocksdb/actions/runs/8086592802/job/22096691572?pr=12388

```
[ RUN      ] DBTestXactLogIterator.TransactionLogIteratorCheckWhenArchive
db/db_log_iter_test.cc:173:23: runtime error: member call on address 0x0000023956f0 which does not point to an object of type 'rocksdb::DBTestXactLogIterator'
0x0000023956f0: note: object is of type 'rocksdb::DBTestBase'
 00 00 00 00  98 ae f7 da 75 7f 00 00  a0 5d 39 02 00 00 00 00  80 ff 39 02 00 00 00 00  95 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'rocksdb::DBTestBase'
 UndefinedBehaviorSanitizer: undefined-behavior db/db_log_iter_test.cc:173:23 in
```

This is almost certainly caused by the sync point callback happening on asynchronous file deletion in the DB while the end of the test is reached and the destruction of the `DBTestXactLogIterator` has reached `DBTestBase::~DBTestBase()`. Either closing the DB or disabling sync points before the end of the test should suffice to fix, and we'll do both. And assert that the sync point callback is actually hit each time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12397

Test Plan: unable to reproduce, but ran 1000 iterations of the test with UBSAN

Reviewed By: ltamasi

Differential Revision: D54326687

Pulled By: pdillinger

fbshipit-source-id: cc09a4dcd2f237d5b45d910364d6aa56bbd46d50
2024-03-12 08:43:47 -07:00
Andrew Kryczka 27a2473668 Best-effort recovery support for atomic flush (#12406)
Summary:
This PR updates `VersionEditHandlerPointInTime` to recover all or none of the updates in an AtomicGroup. This makes best-effort recovery properly handle atomic flushes during recovery, so the two features are now allowed to be enabled together.

The new logic requires that AtomicGroups do not contain column family additions or removals. AtomicGroups are currently written for atomic flush, which does not include such edits.

Column family additions or removals are recovered independently of AtomicGroups. The new logic needs to be aware of removal, though, so that a dropped CF does not prevent completion of an AtomicGroup recovery.

The new logic treats each AtomicGroup as if it contains updates for all existing column families, even though it is possible to create AtomicGroups that only affect a subset of column families. This simplifies the logic at the expense of recovering less data in certain edge case scenarios.

The usage of `MaybeCreateVersion()` is pretty tricky. The goal is to create a barrier at the start of an AtomicGroup such that all valid states up to that point will be applied to `versions_`. Here is a summary.

- `MaybeCreateVersion(..., false)` creates a `Version` on a negative edge trigger (transition from valid to invalid). It was previously called when applying each update. Now, it is only called when applying non-AtomicGroup updates.
- `MaybeCreateVersion(..., true)` creates a `Version` on a positive level trigger (valid state). It was previously called only at the end of iteration. Now, it is additionally called before processing an AtomicGroup.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12406

Reviewed By: jaykorean, cbi42

Differential Revision: D54494904

Pulled By: ajkr

fbshipit-source-id: 0114a9fe1d04b471d086dcab5978ea8a3a56ad52
2024-03-06 14:40:40 -08:00
Peter Dillinger a53ed91691 Fix/improve temperature handling for file ingestion (#12402)
Summary:
Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388

In terms of public API:
* Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre.
* Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general).
* Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature.
* Note that "fail_if_not_bottommost_level" is obsolete/confusing use of "bottommost"

In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed.
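
A hedged sketch of the new `SstFileWriter::Open` temperature parameter (its exact position in the signature is an assumption):

```
rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), options);
// Write the to-be-ingested file at its intended temperature up front.
rocksdb::Status s =
    writer.Open("/tmp/to_ingest.sst", rocksdb::Temperature::kCold);
if (s.ok()) {
  s = writer.Put("key1", "value1");
}
if (s.ok()) {
  s = writer.Finish();
}
```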

Eventual suggested follow-up:
* Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare).
* More temperature handling for CreateColumnFamilyWithImport and Checkpoints.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402

Test Plan:
Deeply revamped ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer.

Used macros instead of functions for better tracing to critical source location on test failures.

Some enhancements to FileTemperatureTestFS in the process of developing the revamped test.

Reviewed By: jowlyzhang

Differential Revision: D54442794

Pulled By: pdillinger

fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995
2024-03-05 16:56:08 -08:00
Jay Huh 3412195367 Introduce MultiCfIterator (#12153)
Summary:
This PR introduces a new implementation of `Iterator` via a new public API called `NewMultiCfIterator()`. The new API takes a vector of column family handles to build a cross-column-family iterator, which internally maintains multiple `DBIter`s as child iterators from a consistent database state. When a key exists in multiple column families, the iterator selects the value (and wide columns) from the first column family containing the key, following the order provided in the `column_families` parameter. Similar to the merging iterator, a min heap is used to iterate across the child iterators. Backward iteration and direction change functionalities will be implemented in future PRs.

The comparator used to compare keys across different column families will be derived from the iterator of the first column family specified in `column_families`. This comparator will be checked against the comparators from all other column families that the iterator will traverse. If there's a mismatch with any of the comparators, the initialization of the iterator will fail.

Please note that this PR is not enough for users to start using `MultiCfIterator`. The `MultiCfIterator` and related APIs are still marked as "**DO NOT USE - UNDER CONSTRUCTION**". This PR is just the first of many PRs that will follow soon.

This PR includes the following:
- Introduction and partial implementation of the `MultiCfIterator`, which implements the generic `Iterator` interface. The implementation includes the construction of the iterator, `SeekToFirst()`, `Next()`, `Valid()`, `key()`, `value()`, and `columns()`.
- Unit tests to verify iteration across multiple column families in two distinct scenarios: (1) keys are unique across all column families, and (2) the same keys exist in multiple column families.
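
A hedged sketch of eventual usage per the description above (the CF handles and consumer are hypothetical, and the exact return type may differ):

```
std::vector<rocksdb::ColumnFamilyHandle*> cfhs = {cfh_default, cfh_extra};
auto iter = db->NewMultiCfIterator(rocksdb::ReadOptions(), cfhs);
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
  // For a key present in multiple CFs, the value comes from the first
  // CF in `cfhs` that contains the key.
  Consume(iter->key(), iter->value());  // hypothetical consumer
}
```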

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12153

Reviewed By: pdillinger

Differential Revision: D52308697

Pulled By: jaykorean

fbshipit-source-id: b03e69f13b40af5a8f0598d0f43a0bec01ef8294
2024-03-05 10:22:43 -08:00
yuzhangyu@fb.com 1cfdece85d Run internal cpp modernizer on RocksDB repo (#12398)
Summary:
When the internal cpp modernizer attempts to format RocksDB code, it replaces the macro `ROCKSDB_NAMESPACE` with its default definition `rocksdb` when collapsing nested namespaces. We filed feedback for the tool (T180254030), and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested we run the modernizer tool ourselves so future auto-codemod attempts will be smaller. This diff contains:

Running `xplat/scripts/codemod_service/cpp_modernizer.sh` in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that have a non-Meta copyright comment) without swapping out the namespace macro `ROCKSDB_NAMESPACE`, followed by RocksDB's own `make format`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398

Test Plan: Auto tests

Reviewed By: hx235

Differential Revision: D54382532

Pulled By: jowlyzhang

fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa
2024-03-04 10:08:32 -08:00
Richard Barnes d7b8756976 Remove extra semi colon from internal_repo_rocksdb/repo/db/table_cache_sync_and_async.h
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D54362208

fbshipit-source-id: a47acd4c794c899fccb65285b116b50d9566ea12
2024-03-04 06:34:44 -08:00
Richard Barnes ced333ee45 Remove extra semi colon from instagram/ranking/mezql/shots/parser/fast/Token.cpp
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D54362213

fbshipit-source-id: 0bbc9e5fce917fc4f72423f0a4c8cb2c2b1759dd
2024-03-04 06:32:50 -08:00
Jay Huh c00c16855d Access DBImpl* and CFD* by CFHImpl* in Iterators (#12395)
Summary:
In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and the Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR https://github.com/facebook/rocksdb/issues/12153, which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`), which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: https://github.com/facebook/rocksdb/issues/11925, https://github.com/facebook/rocksdb/issues/11943, and https://github.com/facebook/rocksdb/issues/11977.

To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12395

Test Plan:
There should be no behavior changes. Existing tests and CI for the correctness tests.

**Test for Perf Regression**
Build
```
$> make -j64 release
```
Setup
```
$> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000
```

Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.552 micros/op 1810157 ops/sec 0.552 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       4.502 micros/op 222143 ops/sec 4.502 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.520 micros/op 1924401 ops/sec 0.520 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       4.532 micros/op 220657 ops/sec 4.532 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: pdillinger

Differential Revision: D54332713

Pulled By: jaykorean

fbshipit-source-id: b28d897ad519e58b1ca82eb068a6319544a4fae5
2024-03-01 10:28:20 -08:00
Jay Huh 5bcc184975 Update APIs to support generic unique identifier format (#12384)
Summary:
The current design proposes using a combination of `job_id`, `db_id`, and `db_session_id` to create a unique identifier for remote compaction jobs. However, this approach may not be suitable for users who prefer a different format for the unique identifier.

At Meta, we are utilizing generic compute offload to offload compaction tasks to remote workers. The compute offload client generates a UUID for each task, which requires an update to the current RocksDB API for onboarding purposes.

Users still have the option to create the unique identifier by combining `job_id`, `db_id`, and `db_session_id` if they prefer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12384

Test Plan:
```
$> ./compaction_service_test
[==========] Running 14 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 14 tests from CompactionServiceTest
[ RUN      ] CompactionServiceTest.BasicCompactions
[       OK ] CompactionServiceTest.BasicCompactions (2642 ms)
[ RUN      ] CompactionServiceTest.ManualCompaction
[       OK ] CompactionServiceTest.ManualCompaction (454 ms)
[ RUN      ] CompactionServiceTest.CancelCompactionOnRemoteSide
[       OK ] CompactionServiceTest.CancelCompactionOnRemoteSide (1643 ms)
[ RUN      ] CompactionServiceTest.FailedToStart
[       OK ] CompactionServiceTest.FailedToStart (1332 ms)
[ RUN      ] CompactionServiceTest.InvalidResult
[       OK ] CompactionServiceTest.InvalidResult (1516 ms)
[ RUN      ] CompactionServiceTest.SubCompaction
[       OK ] CompactionServiceTest.SubCompaction (551 ms)
[ RUN      ] CompactionServiceTest.CompactionFilter
[       OK ] CompactionServiceTest.CompactionFilter (563 ms)
[ RUN      ] CompactionServiceTest.Snapshot
[       OK ] CompactionServiceTest.Snapshot (124 ms)
[ RUN      ] CompactionServiceTest.ConcurrentCompaction
[       OK ] CompactionServiceTest.ConcurrentCompaction (660 ms)
[ RUN      ] CompactionServiceTest.CompactionInfo
[       OK ] CompactionServiceTest.CompactionInfo (984 ms)
[ RUN      ] CompactionServiceTest.FallbackLocalAuto
[       OK ] CompactionServiceTest.FallbackLocalAuto (343 ms)
[ RUN      ] CompactionServiceTest.FallbackLocalManual
[       OK ] CompactionServiceTest.FallbackLocalManual (380 ms)
[ RUN      ] CompactionServiceTest.RemoteEventListener
[       OK ] CompactionServiceTest.RemoteEventListener (491 ms)
[ RUN      ] CompactionServiceTest.TablePropertiesCollector
[       OK ] CompactionServiceTest.TablePropertiesCollector (169 ms)
[----------] 14 tests from CompactionServiceTest (11854 ms total)

[----------] Global test environment tear-down
[==========] 14 tests from 1 test case ran. (11855 ms total)
[  PASSED  ] 14 tests.
```

Reviewed By: hx235

Differential Revision: D54220339

Pulled By: jaykorean

fbshipit-source-id: 5a9054f31933d1996adca02082eb37b6d5353224
2024-03-01 09:55:30 -08:00
Changyu Bi 4aed229fa7 Add write_memtable_time to perf level kEnableWait (#12394)
Summary:
... so that write time can be measured under the new perf level for single-threaded writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12394

Test Plan: * add a new UT `PerfContextTest.WriteMemtableTimePerfLevel`

Reviewed By: anand1976

Differential Revision: D54326263

Pulled By: cbi42

fbshipit-source-id: d0e334d9581851ba6cf53c776c0bd876365d1e00
2024-02-29 15:08:26 -08:00
Peter Dillinger 13ef21c22e default_write_temperature option (#12388)
Summary:
Currently, SST files to which neither last_level_temperature nor file_temperature_age_thresholds applies are written with temperature kUnknown, which is a little weird and doesn't support CF-based tiering. The default_temperature option only affects how kUnknown is interpreted for stats.

This change adds a new per-CF option default_write_temperature that determines the temperature of new SST files when those other options do not apply.

Also made a change to ignore last_level_temperature with FIFO compaction, because I found that could lead to an infinite loop in compaction.

Needed follow-up: Fix temperature handling with external file ingestion
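
A sketch of the new per-CF option alongside the existing one (values illustrative):

```
rocksdb::ColumnFamilyOptions cf_opts;
cf_opts.last_level_temperature = rocksdb::Temperature::kCold;
// New in this PR: used when neither last_level_temperature nor
// file_temperature_age_thresholds applies to a new SST file.
cf_opts.default_write_temperature = rocksdb::Temperature::kWarm;
```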

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12388

Test Plan: unit tests extended appropriately. (Ignore whitespace changes when reviewing.)

Reviewed By: jowlyzhang

Differential Revision: D54266574

Pulled By: pdillinger

fbshipit-source-id: c9ec9a74dbf22be6e986f77f9689d05fea8ef0bb
2024-02-28 14:36:13 -08:00
奏之章 1fa5dff7d1 WriteThread::EnterAsBatchGroupLeader reorder writers (#12138)
Summary:
Reorder the writers list so that a leader can take as many commits as possible, to maximize the throughput of the system and reduce IOPS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12138

Reviewed By: hx235

Differential Revision: D53955592

Pulled By: ajkr

fbshipit-source-id: 4d899d038faef691b63801d9d85f5cc079b7bbb5
2024-02-27 15:23:54 -08:00
zaidoon 3104e55f29 update DB::DumpSupportInfo to log whether jemalloc is supported or not (#12386)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12386

Reviewed By: cbi42

Differential Revision: D54231896

Pulled By: ajkr

fbshipit-source-id: 6b3357b2e97d3599955e303810088bb5d5896199
2024-02-27 15:07:00 -08:00
Peter Dillinger d780e7a561 Remove bottommost_temperature (#12389)
Summary:
This removes a deprecated option already replaced by `last_level_temperature`. (Recognition of the option in old options files is kept.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12389

Test Plan: tests updated

Reviewed By: jowlyzhang, cbi42

Differential Revision: D54267946

Pulled By: pdillinger

fbshipit-source-id: 65c49b15e7394829c1f3b44edd4179d2daff6017
2024-02-27 14:48:00 -08:00
anand76 d9c0d44dab Add a perf level for measuring user thread block time (#12368)
Summary:
Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration and point lookup breakdowns. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e., RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delay, or stalls. It does not include time spent waiting to acquire mutexes or waiting for IO.
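
A hedged sketch of measuring the new counter (the perf level name `kEnableWait` is taken from the follow-up commit above; the counter name appears in the test plan below):

```
rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableWait);
rocksdb::get_perf_context()->Reset();
db->Put(rocksdb::WriteOptions(), "key", "value");
uint64_t wait_nanos = rocksdb::get_perf_context()->write_thread_wait_nanos;
```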

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368

Test Plan: Add a unit test for write_thread_wait_nanos

Reviewed By: ajkr

Differential Revision: D54021583

Pulled By: anand1976

fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48
2024-02-22 12:14:53 -08:00
Yu Zhang f1ca47b904 Add support to bulk load external files for UDT in memtable only feature (#12356)
Summary:
This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for the external file's comparator name and user-defined-timestamps-related flag. With this, it now supports ingesting files into a column family that enables the user-defined timestamps in Memtable only feature.

Two fields in the table properties are used for the aforementioned check: 1) the comparator name, which records what comparator was used to create the external sst file, and 2) the flag `user_defined_timestamps_persisted`. We compare these two fields with the column family's settings. The details are in the util function `ValidateUserDefinedTimestampsOptions`.

To optimize for the majority of cases where the sanity check should pass and the table properties read should not affect how `TableReader` is constructed, instead of reading the table properties block separately and using it for the sanity check before creating a `TableReader`, we continue using the current flow: first create a `TableReader`, use it to read table properties and do sanity checks, and reset the `TableReader` for the case where the column family enables UDTs in the memtable-only feature and the external file does not contain user-defined timestamps.

This PR also groups the other table properties related sanity checks in `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356

Test Plan:
added unit test
existing unit test

Reviewed By: cbi42

Differential Revision: D54025116

Pulled By: jowlyzhang

fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554
2024-02-21 15:41:53 -08:00
Andrew Kryczka 8e29f243c9 No filesystem reads during Merge() writes (#12365)
Summary:
This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case.
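
A sketch of the relevant options, assuming the rollback flag is exposed as a column family option named `strict_max_successive_merges` as above:

```
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  // options.merge_operator = ...;  // e.g. a string-append operator
  options.max_successive_merges = 100;
  // Assumed rollback flag from this PR: true restores the old behavior of
  // reading the base value from the filesystem to enforce the limit exactly.
  options.strict_max_successive_merges = false;
  return 0;
}
```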

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365

Test Plan:
"rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read.

```
$ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true |& grep 'block.cache.data.add COUNT' ; done
strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0
strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1
```

Reviewed By: hx235

Differential Revision: D53982520

Pulled By: ajkr

fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4
2024-02-21 13:15:27 -08:00
Alex Wied f2732d0586 Export GetSequenceNumber functionality for Snapshots (#12354)
Summary:
This PR adds `Snapshot->GetSequenceNumber()` functionality to the C API.
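
For reference, the underlying C++ API that the new C wrapper exposes (a minimal sketch, error handling omitted):

```
#include "rocksdb/db.h"
#include "rocksdb/snapshot.h"

rocksdb::SequenceNumber SnapshotSeqno(rocksdb::DB* db) {
  const rocksdb::Snapshot* snap = db->GetSnapshot();
  rocksdb::SequenceNumber seq = snap->GetSequenceNumber();
  db->ReleaseSnapshot(snap);
  return seq;
}
```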

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12354

Reviewed By: akankshamahajan15

Differential Revision: D53836085

Pulled By: cbi42

fbshipit-source-id: 4a14daeba9210a69bcb74e4c1c0666deff1b4837
2024-02-16 10:28:41 -08:00
anand76 d227276147 Deprecate some variants of Get and MultiGet (#12327)
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. This PR tries to consolidate them by marking the variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for them.
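
A sketch of the non-deprecated pattern: the caller passes a timestamp out-parameter, and a DB that doesn't support timestamps returns Status::NotSupported() when one is requested:

```
#include <string>

#include "rocksdb/db.h"

rocksdb::Status GetWithTimestamp(rocksdb::DB* db, const rocksdb::Slice& key,
                                 std::string* value_out) {
  rocksdb::PinnableSlice value;
  std::string timestamp;
  rocksdb::Status s =
      db->Get(rocksdb::ReadOptions(), db->DefaultColumnFamily(), key, &value,
              &timestamp);
  if (s.ok()) {
    value_out->assign(value.data(), value.size());
  }
  return s;
}
```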

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327

Reviewed By: pdillinger

Differential Revision: D53828151

Pulled By: anand1976

fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
2024-02-16 09:21:06 -08:00
Akanksha Mahajan 956f1dfde3 Change ReadAsync callback API to remove const from FSReadRequest (#11649)
Summary:
Modify the ReadAsync callback API to remove const from FSReadRequest, as const prevents moving the ownership of fs_scratch.
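
A sketch of why the const had to go, assuming the post-change callback shape; with a mutable request, `fs_scratch` (an `FSAllocationPtr`) can be moved out instead of copied:

```
#include <utility>

#include "rocksdb/file_system.h"

// With a non-const FSReadRequest, the completion callback can take ownership
// of the filesystem-allocated buffer instead of being forced to copy it.
void OnReadComplete(rocksdb::FSReadRequest& req, void* /*cb_arg*/) {
  rocksdb::FSAllocationPtr buffer = std::move(req.fs_scratch);
  // ... hand `buffer` off to the caller / block fetcher ...
}
```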

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11649

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D53585309

Pulled By: akankshamahajan15

fbshipit-source-id: 3bff9035db0e6fbbe34721a5963443355807420d
2024-02-16 09:14:55 -08:00
anand76 28c1c15c29 Sync tickers and histograms across C++ and Java (#12355)
Summary:
The RocksDB ticker and histogram statistics were out of sync between the C++ and Java code, with a number of newer stats missing in TickerType.java and HistogramType.java. Also, there were gaps in numbering in portal.h, which could soon become an issue due to the number of tickers and the fact that we're limited to 1 byte in Java. This PR adds the missing stats, and re-numbers all of them. It also moves some stats around to try to group related stats together. Since this will go into a major release, compatibility shouldn't be an issue.

This should be automated at some point, since the current process is somewhat error prone.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12355

Reviewed By: jaykorean

Differential Revision: D53825324

Pulled By: anand1976

fbshipit-source-id: 298c180872f4b9f1ee54b8bb22f4e280458e7e09
2024-02-15 17:22:03 -08:00
Peter Dillinger 12018136d8 KeySegmentsExtractor and prototype higher-dimensional filtering (#12075)
Summary:
This change contains a prototype new API for "higher dimensional" filtering of read queries. Existing filters treat keys as one-dimensional, either as distinct points (whole key) or as contiguous ranges in comparator order (prefix filters). The proposed KeySegmentsExtractor allows treating keys as multi-dimensional for filtering purposes even though they still have a single total order across dimensions. For example, consider these keys in different LSM levels:

L0:
abc_0123
abc_0150
def_0114
ghi_0134

L1:
abc_0045
bcd_0091
def_0077
xyz_0080

If we get a range query for [def_0100, def_0200), a prefix filter (up to the underscore) will tell us that both levels are potentially relevant. However, if each SST file stores a simple range of the values for the second segment of the key, we would see that L1 only has [0045, 0091] which (under certain required assumptions) we are sure does not overlap with the given range query. Thus, we can filter out processing or reading any index or data blocks from L1 for the query.
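
A toy sketch of this per-file check; `SegmentRange` and `MayOverlap` are illustrative only, and the actual prototype APIs live in experimental.h:

```
#include <cassert>
#include <string>

// Hypothetical per-file summary: min/max of the second key segment.
struct SegmentRange {
  std::string min;
  std::string max;
};

// True if the file might contain keys whose second segment falls in
// [query_min, query_max); false means the file can be skipped entirely.
bool MayOverlap(const SegmentRange& file, const std::string& query_min,
                const std::string& query_max) {
  return file.max >= query_min && file.min < query_max;
}

int main() {
  SegmentRange l1{"0045", "0091"};
  // Range query [def_0100, def_0200): second segment range is [0100, 0200).
  assert(!MayOverlap(l1, "0100", "0200"));  // L1 is filtered out
  return 0;
}
```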

This kind of case shows up with time-ordered data but is more general than filtering based on user timestamp. See https://github.com/facebook/rocksdb/issues/11332 . Here the "time" segments of the keys are meaningfully ordered with respect to each other even when the previous segment is different, so summarizing data along an alternate dimension of the key like this can work well for filtering.

This prototype implementation simply leverages existing APIs for user table properties and table filtering, which is not very CPU efficient. Eventually, we expect to create a native implementation. However, I have put some significant thought and engineering into the new APIs overall, which I expect to be close to refined enough for production.

For details, see new public APIs in experimental.h. For a detailed example, see the new unit test in db_bloom_filter_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12075

Test Plan: Unit test included

Reviewed By: jowlyzhang

Differential Revision: D53619406

Pulled By: pdillinger

fbshipit-source-id: 9e6e7b82b4db8d815db76a6ab340e90db2c191f2
2024-02-15 15:39:55 -08:00
Yu Zhang 4bea83aa44 Remove the force mode for EnableFileDeletions API (#12337)
Summary:
There is no strong reason for users to need this mode, while on the other hand its behavior is destructive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337

Reviewed By: hx235

Differential Revision: D53630393

Pulled By: jowlyzhang

fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
2024-02-13 18:36:25 -08:00
Yu Zhang 10d02456b6 Add support to bulk load external files with user-defined timestamps (#12343)
Summary:
This PR adds initial support for bulk loading external sst files with user-defined timestamps.

The following invariant must be maintained while ingesting external files:
     assume there are two internal keys <K, ts1, seq1> and <K, ts2, seq2>; then
     ts1 < ts2 iff. seq1 < seq2

These extra requirements are added for ingesting external files with user-defined timestamps:
1) A file with a user key (without timestamp) range that overlaps with the db cannot be ingested. This is because we cannot ensure the above invariant is met without checking each overlapping key's timestamp and comparing it with the timestamp in the db, which is an expensive step. This bulk loading feature will be used by MyRocks, and their current usage guarantees that an ingested file's key range doesn't overlap with the db.
4f3a57a13f/storage/rocksdb/ha_rocksdb.cc (L3312)
We could consider loosening this requirement by doing this check in the future; this initial support simply disallows it.

2) Files whose user key (without timestamp) ranges overlap with each other are not allowed to be ingested. For similar reasons, it's hard to ensure the above invariant is met. For example, if we have two files where user keys are interleaved like this:
file1: [c10, c8, f10, f5]
file2: [b5, c11, f4]
Either file1 gets a bigger global seqno than file2, or the other way around; in both cases the above invariant cannot be met.
So we disallow this.

3) When a column family enables user-defined timestamps, it doesn't support ingestion behind mode. Ingestion behind currently simply puts the file at the bottommost level and assigns it a global seqno of 0. We would need to do a similar search through the LSM tree for key range overlap checks to make sure the aforementioned invariant is met, so this initial support disallows this mode. We can consider adding it in the future. A sketch of the supported flow is shown below.
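
A sketch of the supported flow, assuming the timestamp-aware `SstFileWriter::Put(key, timestamp, value)` overload and an 8-byte timestamp encoding; `BulkLoadWithUdt` and its keys are illustrative only:

```
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/sst_file_writer.h"

rocksdb::Status BulkLoadWithUdt(rocksdb::DB* db, const rocksdb::Options& opts,
                                const std::string& file_path) {
  rocksdb::SstFileWriter writer(rocksdb::EnvOptions(), opts);
  rocksdb::Status s = writer.Open(file_path);
  if (!s.ok()) return s;
  // Assumed 8-byte user-defined timestamp encoding; keys must not overlap the
  // db's key range (requirement 1) or another ingested file's (requirement 2).
  std::string ts(8, '\0');
  ts[7] = '\x01';
  s = writer.Put("key1", ts, "v1");
  if (!s.ok()) return s;
  s = writer.Finish();
  if (!s.ok()) return s;
  rocksdb::IngestExternalFileOptions ifo;  // ingest_behind stays false (req. 3)
  return db->IngestExternalFile({file_path}, ifo);
}
```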

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12343

Test Plan: Add unit tests

Reviewed By: cbi42

Differential Revision: D53686182

Pulled By: jowlyzhang

fbshipit-source-id: f05e3fb27967f7974ed40179d78634c40ecfb136
2024-02-13 11:15:28 -08:00
Levi Tamasi de1e3ff6ea Fix a data race in DBImpl::RenameTempFileToOptionsFile (#12347)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347

`DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels).
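
A minimal sketch of the fix pattern (not the actual RocksDB code): snapshot the counter under the mutex, then perform the slow deletion outside the critical section.

```
#include <mutex>
#include <string>

// Hypothetical stand-in for the DBImpl logic; names are illustrative.
class OptionsFileCleaner {
 public:
  void RenameTempFileToOptionsFile(const std::string& obsolete_file) {
    int saved_counter;
    {
      // The counter is only read while holding the mutex.
      std::lock_guard<std::mutex> lock(mutex_);
      saved_counter = disable_delete_obsolete_files_;
    }
    if (saved_counter == 0) {
      DeleteObsoleteOptionsFile(obsolete_file);  // slow I/O, outside the lock
    }
  }

 private:
  void DeleteObsoleteOptionsFile(const std::string&) { /* ... */ }
  std::mutex mutex_;
  int disable_delete_obsolete_files_ = 0;
};
```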

Reviewed By: jowlyzhang

Differential Revision: D53675153

fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481
2024-02-12 13:26:09 -08:00
Peter Dillinger 54cb9c77d9 Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast of a non-standard-layout type, though in practice it would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object, as illustrated in the sketch after the next list). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.

I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have `struct Derived : public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technically UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where previously the Base2 pointer might not have been dereferenced. This could potentially affect performance.
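
A self-contained illustration of the multiple-inheritance pitfall mentioned above: static_cast adjusts the pointer to the correct subobject, while reinterpret_cast preserves the raw address.

```
#include <cassert>

struct Base1 { int a = 1; };
struct Base2 { int b = 2; };
struct Derived : public Base1, public Base2 {};

int main() {
  Derived d;
  // static_cast knows the class layout and offsets the pointer to the Base2
  // subobject, so this is correct.
  Base2* good = static_cast<Base2*>(&d);
  assert(good->b == 2);
  // reinterpret_cast keeps the raw address, which points at the start of the
  // Derived object (the Base1 subobject on typical ABIs); dereferencing
  // through it is undefined behavior.
  Base2* bad = reinterpret_cast<Base2*>(&d);
  (void)bad;
  return 0;
}
```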

With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.

A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).

Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.

I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308

Test Plan: existing tests, CI

Reviewed By: ltamasi

Differential Revision: D53204947

Pulled By: pdillinger

fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 10:44:11 -08:00
Yu Zhang e3e8fbb497 Add a separate range classes for internal usage (#12071)
Summary:
Introduce separate range classes `UserKeyRange` and `UserKeyRangePtr` to be used by the internal implementation. The `Range` class is used both in public APIs like `DB::GetApproximateSizes`, `DB::GetApproximateMemTableStats`, `DB::GetPropertiesOfTablesInRange` etc. and in internal implementations like `ColumnFamilyData::RangesOverlapWithMemtables`, `VersionSet::GetPropertiesOfTablesInRange`.

These APIs have different expectations of what keys this range class contains. Public API users are supposed to populate the range with user keys without timestamps, in the same way that the point lookup and range scan APIs expect only the user key without timestamp as input. The internal implementations expect a user key whose format is compatible with the user comparator, i.e., a user key with the timestamp.
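
For contrast, a sketch of public-API usage where the range holds plain user keys without timestamps:

```
#include <cstdint>

#include "rocksdb/db.h"

uint64_t ApproximateSize(rocksdb::DB* db) {
  // Public `Range` holds plain user keys without timestamps, just like the
  // keys passed to point lookups and range scans.
  rocksdb::Range r("start_key", "end_key");
  uint64_t size = 0;
  rocksdb::Status s = db->GetApproximateSizes(&r, 1, &size);
  return s.ok() ? size : 0;
}
```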

This PR contains:
1) introducing the counterpart range classes `UserKeyRange` and `UserKeyRangePtr` for the internal implementation, while leaving the existing `Range` and `RangePtr` classes for public APIs only. Internal implementations are updated to use the new classes instead.
2) adding user-defined timestamp support for the `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12071

Test Plan:
existing tests
Added test for `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs for when user-defined timestamp is enabled.
The change in external_file_ingestion_job doesn't have test case coverage with user-defined timestamps enabled; one will be added in a follow-up PR that adds file ingestion support for UDT.

Reviewed By: ltamasi

Differential Revision: D53292608

Pulled By: jowlyzhang

fbshipit-source-id: 9a9279e23c640a6d8f8232636501a95aef7638b8
2024-02-06 18:35:36 -08:00
Hui Xiao 1a885fe730 Remove deprecated Options::access_hint_on_compaction_start (#11654)
Summary:
**Context:**
`Options::access_hint_on_compaction_start` is marked deprecated and is now ready to be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11654

Test Plan:
Multiple db_stress runs with pre-PR and post-PR binaries chosen randomly to ensure forward/backward compatibility on options: 36a5686ec0
`python3 tools/db_crashtest.py --simple blackbox --interval=30`

Reviewed By: cbi42

Differential Revision: D47892459

Pulled By: hx235

fbshipit-source-id: a62f46a0377fe143be7638e218978d5431c15c56
2024-02-05 13:35:19 -08:00
Peter Dillinger 6e88126dd3 Don't log an error when an auxiliary dir is missing (#12326)
Summary:
An error is logged to info_log when wal_dir or a db_path/cf_path is missing, even though under this condition the directory is created later (in DBImpl::Recover -> Directories::SetDirectories) with no error status returned.

To avoid error spam in logs, change these to a descriptive "header" log entry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12326

Test Plan: manual with DBBasicTest.DBCloseAllDirectoryFDs which exercises this code

Reviewed By: jowlyzhang

Differential Revision: D53374743

Pulled By: pdillinger

fbshipit-source-id: 32d1ce18809da13a25bdd6183d661f66a3b6a111
2024-02-05 10:26:41 -08:00