Commit Graph

Hui Xiao 349b1ec08f Fix duplicate WAL entries caused by write after error recovery (#12873)
Summary:
**Context/Summary:**
We recently discovered a case where a write of a key, issued right after error recovery of a previous failed write of the same key finishes, produces two identical WAL entries, violating our assertion. This is because we don't advance the seqno on a failed write, and we reuse the WAL containing the failed write for the new write if the memtable at the time is empty.

This PR reuses the flush path for an empty memtable to switch the WAL and to update the minimum WAL to keep during the error recovery flush. It also updates the INFO log message for clarity.

```
2024/07/17-15:01:32.271789 327757 (Original Log Time 2024/07/17-15:01:25.942234) [/flush_job.cc:1017] [default] [JOB 2] Level-0 flush table #9: 0 bytes OK It's an empty SST file from a successful flush so won't be kept in the DB
2024/07/17-15:01:32.271798 327757 (Original Log Time 2024/07/17-15:01:32.269954) [/memtable_list.cc:560] [default] Level-0 commit flush result of table #9 started
2024/07/17-15:01:32.271802 327757 (Original Log Time 2024/07/17-15:01:32.271217) [/memtable_list.cc:760] [default] Level-0 commit flush result of table #9: memtable #1 done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12873

Test Plan:
New UT that failed before this PR with the following assertion failure (i.e., duplicate WAL entries) and passes after it:
```
db_wal_test: db/write_batch.cc:2254: rocksdb::Status rocksdb::{anonymous}::MemTableInserter::PutCFImpl(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::ValueType, RebuildTxnOp, const ProtectionInfoKVOS64*) [with RebuildTxnOp = rocksdb::{anonymous}::MemTableInserter::PutCF(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)::<lambda(rocksdb::WriteBatch*, uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)>; uint32_t = unsigned int; rocksdb::ProtectionInfoKVOS64 = rocksdb::ProtectionInfoKVOS<long unsigned int>]: Assertion `seq_per_batch_' failed.
```

Reviewed By: anand1976

Differential Revision: D59884468

Pulled By: hx235

fbshipit-source-id: 5d854b719092552c69727a979f269fb7f6c39756
2024-07-22 12:40:25 -07:00
Changyu Bi c064ac3bc5 Avoid opening table files and reading table properties under mutex (#12879)
Summary:
InitInputTableProperties() can open table files and do IO, and it was called while holding mutex_. This PR removes it from FinalizeInputInfo(). It is now called in CompactionJob::Run() and in BuildCompactionJobInfo() (called in NotifyOnCompactionBegin()) without holding mutex_.
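
The general pattern, as a minimal sketch (illustrative names, not the actual RocksDB code): perform the IO without the lock, then take the lock only to publish the result.

```c++
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the expensive open/read work done by
// InitInputTableProperties(); note that it needs no lock.
std::vector<std::string> ReadPropertiesFromDisk() { return {}; }

class CompactionStateSketch {
 public:
  void InitProperties() {
    // Do the IO first, without holding the mutex...
    std::vector<std::string> props = ReadPropertiesFromDisk();
    // ...then hold the mutex only long enough to publish the result.
    std::lock_guard<std::mutex> guard(mutex_);
    input_table_properties_ = std::move(props);
  }

 private:
  std::mutex mutex_;
  std::vector<std::string> input_table_properties_;
};
```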

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12879

Test Plan: existing unit tests. Added assert in GetInputTableProperties() to ensure that input_table_properties_ is initialized whenever it's called.

Reviewed By: hx235

Differential Revision: D59933195

Pulled By: cbi42

fbshipit-source-id: c8089e13af8567fa3ab4b94d9ec384ae98ab2ec8
2024-07-19 19:12:45 -07:00
Changyu Bi 4384dd5eee Support ingesting SST files generated by a live DB (#12750)
Summary:
... to enable use cases like using RocksDB to merge-sort data for ingestion. A new file ingestion option `IngestExternalFileOptions::allow_db_generated_files` is introduced to allow users to ingest SST files generated by live DBs instead of by SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number AND do not overlap with any data in the DB (so we can assign seqno 0, which matches the seqno of all ingested keys).

The option is marked as experimental for now.
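
A minimal usage sketch under the constraints above (assuming an open `rocksdb::DB*` and SST files that satisfy the zero-seqno, no-overlap requirements):

```c++
#include <cassert>
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Ingest SST files produced by another live DB rather than SstFileWriter.
void IngestFromLiveDb(rocksdb::DB* db, const std::vector<std::string>& files) {
  rocksdb::IngestExternalFileOptions opts;
  opts.allow_db_generated_files = true;  // experimental option from this PR
  rocksdb::Status s = db->IngestExternalFile(files, opts);
  assert(s.ok());
}
```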

Main changes needed to enable this:
- ignore CF id mismatch during ingestion
- ignore the missing external file version table property

The rest of the change is mostly in new unit tests.

A previous attempt is in https://github.com/facebook/rocksdb/issues/5602.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12750

Test Plan: - new unit tests

Reviewed By: ajkr, jowlyzhang

Differential Revision: D58396673

Pulled By: cbi42

fbshipit-source-id: aae513afad7b1ff5d4faa48104df5f384926bf03
2024-07-19 16:14:54 -07:00
anand76 0fca5e31b4 Fix race between manifest error recovery and file ingestion (#12871)
Summary:
This PR fixes an assertion failure in `DBImpl::ResumeImpl` - `assert(!versions_->descriptor_log_)`. In `VersionSet`, `descriptor_log_` has a pointer to the current MANIFEST writer. When there's an error updating the manifest, `descriptor_log_` is reset, and the error recovery thread checks `io_status()` in `VersionSet` and attempts to write a new MANIFEST. If another DB manipulation happens at the same time (like external file ingestion, column family manipulation, etc.), it calls `LogAndApply`, which also attempts to write a new MANIFEST. The assertion in `ResumeImpl` might fail in this case, since the other MANIFEST writer may have updated `descriptor_log_`. To prevent the assertion failure, this fix updates both `io_status_` and `descriptor_log_` while holding the DB mutex.

The other option would have been to simply remove the assert. But I think it's important to have it to ensure the invariant that `io_status_` is cleared if the MANIFEST is written successfully, and this fix makes things easier to reason about.
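
A minimal sketch of the invariant this enforces (illustrative only, not the actual RocksDB code): the error flag and the MANIFEST writer pointer change together inside one critical section, so no thread can observe one updated without the other.

```c++
#include <memory>
#include <mutex>

struct ManifestStateSketch {
  std::mutex db_mutex;                  // stand-in for the DB mutex
  std::unique_ptr<int> descriptor_log;  // stand-in for the MANIFEST writer
  bool io_error = false;                // stand-in for io_status_

  void OnManifestWriteFailure() {
    std::lock_guard<std::mutex> guard(db_mutex);
    io_error = true;         // both fields are updated under the same
    descriptor_log.reset();  // critical section, preserving the invariant
  }
};
```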

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12871

Test Plan: Existing tests and crash test

Reviewed By: hx235

Differential Revision: D59926947

Pulled By: anand1976

fbshipit-source-id: af9ad18da3e29fc62c7ec2e30e0738aa33d4e5f1
2024-07-19 10:37:51 -07:00
Peter Dillinger 93b163d1a2 Fix major bug with prefixes, SeekForPrev, and partitioned filters (#12872)
Summary:
Basically, the fix in https://github.com/facebook/rocksdb/issues/8137 was incomplete (and I missed it in the review), because if `whole_key_filtering` is false, then `last_prefix_str_` will never be set to non-empty and the fix doesn't work. Also related to https://github.com/facebook/rocksdb/issues/5835.

This is intended as a safe, simple fix that will regress CPU efficiency slightly (for `whole_key_filtering=false` cases, because of extra prefix string copies during flush & compaction). An efficient fix is not possible without some substantial refactoring.

Also in this PR: new test DBBloomFilterTest.FilterNumEntriesCoalesce tests an adjacent code path that was previously untested for its effect of ensuring the number of unique prefixes and keys is tracked properly when both prefixes and whole keys are going into a filter. (Test fails when either of the two code segments checking for duplicates is disabled.) In addition, the same test would fail before the main bug fix here because the code would inappropriately add the empty string to the filter (because of unmodified `last_prefix_str_`).
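
For reference, a sketch of the option combination affected by this bug (prefix seeks plus partitioned filters with `whole_key_filtering=false`); the specific values are illustrative, not recommendations:

```c++
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"
#include "rocksdb/table.h"

rocksdb::Options MakePartitionedFilterPrefixOptions() {
  rocksdb::Options options;
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(3));
  rocksdb::BlockBasedTableOptions table_options;
  table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10));
  // Partitioned filters require the two-level index.
  table_options.partition_filters = true;
  table_options.index_type =
      rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
  table_options.whole_key_filtering = false;  // the case the earlier fix missed
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```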

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12872

Test Plan: In addition to DBBloomFilterTest.FilterNumEntriesCoalesce, extended DBBloomFilterTest.SeekForPrevWithPartitionedFilters to cover the broken case. (Mostly whitespace change.)

Reviewed By: jowlyzhang

Differential Revision: D59873793

Pulled By: pdillinger

fbshipit-source-id: 2a7b7f09ca73dc188fb4dab833826ad6da7ebb11
2024-07-17 14:08:35 -07:00
Hui Xiao 21db55f816 Move WAL sync before memtable insertion (#12869)
Summary:
**Context/Summary:**
WAL sync currently happens after memtable write. This causes inconvenience in the stress test, as we can't simply roll back the ExpectedState when a write fails due to an injected WAL sync error, so something complicated like https://github.com/facebook/rocksdb/pull/12838 might be needed. After moving WAL sync before memtable insertion, there should be no injected IO error after memtable insertion, so we can keep the current simple way of handling failed writes in the stress test with an ExpectedState rollback.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12869

Test Plan:
1. The command below failed with `iterator has key 0000000000000207000000000000012B0000000000000013, but expected state does not.` before this PR and passes after it:
```
./db_stress  --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=50 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 
--manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=0 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=50
```

Reviewed By: jowlyzhang

Differential Revision: D59825730

Pulled By: hx235

fbshipit-source-id: 7d77aaf177ded2f99bf1ce19f5a4bd0783b9ca92
2024-07-17 13:39:14 -07:00
Hui Xiao 9e4ee7f0c6 Fix non-okay status being ignored in write path under two_write_queues_ (#12866)
Summary:
Context/Summary: see the PR title; the impact is small.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12866

Test Plan: existing UT

Reviewed By: anand1976

Differential Revision: D59782913

Pulled By: hx235

fbshipit-source-id: ec02843645cce49466bde602035d2e61c31965b8
2024-07-16 10:55:08 -07:00
anand76 5aa675457e Fix unhandled MANIFEST write errors (#12865)
Summary:
The failure of `WriteCurrentStateToManifest()` in `VersionSet::ProcessManifestWrites()` was not handled properly. If it failed, `manifest_io_status` was not updated, leading to `manifest_file_number_` being updated to the newly created manifest even though it's bad. This would lead to the bad manifest immediately getting deleted, and also to the good manifest (referenced by `CURRENT`) getting deleted by obsolete file deletion, because `manifest_file_number_` would not reference its number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12865

Reviewed By: hx235

Differential Revision: D59782940

Pulled By: anand1976

fbshipit-source-id: f752fb9a1c23fd3d734616e273613cbac204301b
2024-07-15 19:13:29 -07:00
Hui Xiao 4ff35afb42 Fix a bug where `OnErrorRecoveryBegin()` is not called before auto-recovery (#12860)
Summary:
**Context/Summary:**
`*auto_recovery` needs to be set to true in order for `OnErrorRecoveryBegin()` to be called before auto-recovery:
3db030d7ee/db/event_helpers.cc (L64-L66)
Currently it's set to false for auto-recovery. This PR fixes that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12860

Test Plan:
- Manual observation that it is called
- Existing UT

Reviewed By: jowlyzhang

Differential Revision: D59693315

Pulled By: hx235

fbshipit-source-id: 3f428c5b1e9818bb7697fdcd7f245d11378eb14a
2024-07-15 17:00:14 -07:00
WangQian 755010f8d3 Fix the bug with using the user comparator to compare prefix. (#12862)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12855

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12862

Reviewed By: cbi42

Differential Revision: D59771651

Pulled By: jowlyzhang

fbshipit-source-id: ffe0025143f51f9ce1b46900c3fef6a20eb34f4a
2024-07-15 15:13:29 -07:00
Peter Dillinger 0e3e43f4d1 FaultInjectionTestFS follow-up and clean-up (#12861)
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/12852:
* Use std::copy in place of copy_n for potentially overlapping buffers
* Get rid of troublesome -1 idiom from `pos_at_last_append_` and `pos_at_last_sync_`
* Small improvements to test FaultInjectionFSTest.ReadUnsyncedData

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12861

Test Plan: CI, crash test, etc.

Reviewed By: cbi42

Differential Revision: D59757484

Pulled By: pdillinger

fbshipit-source-id: c6fbdc2e97c959983184925a855cc8b0285fa23f
2024-07-15 10:28:34 -07:00
Peter Dillinger 72438a6788 Support read & write with unsynced data in FaultInjectionTestFS (#12852)
Summary:
Follow-up to https://github.com/facebook/rocksdb/issues/12729 and others to fix FaultInjectionTestFS handling of the case where a live WAL is being appended to and synced while also being copied for a checkpoint or backup, up to a known flushed (but not necessarily synced) prefix of the file. It was tricky to structure the code in a way that could handle a tricky race with Sync in another thread (see code comments; thanks Changyu) while maintaining good performance and testability.

For more context, see the call to FlushWAL() in DBImpl::GetLiveFilesStorageInfo().
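
For illustration, a minimal sketch of that call (assuming an open `rocksdb::DB*`); `FlushWAL(false)` pushes buffered WAL writes to the file without fsync, producing exactly the flushed-but-not-necessarily-synced prefix discussed above.

```c++
#include "rocksdb/db.h"

rocksdb::Status FlushWalForCopy(rocksdb::DB* db) {
  // sync=false: flush the WAL write buffer to the file without fsync, so a
  // copier (checkpoint/backup) sees a consistent flushed prefix of the WAL.
  return db->FlushWAL(/*sync=*/false);
}
```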

Also, the unit test for https://github.com/facebook/rocksdb/issues/12729 was neutered by https://github.com/facebook/rocksdb/issues/12797, and this re-enables the functionality it is testing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12852

Test Plan:
unit test expanded/updated. Local runs of blackbox_crash_test.

The implementation is structured so that a multi-threaded unit test is not needed to cover at least the code lines, as the race handling is folded into "catch up after returning unsynced and then a sync."

Reviewed By: cbi42

Differential Revision: D59594045

Pulled By: pdillinger

fbshipit-source-id: 94667bb72255e2952586c53bae2c2dd384e85a50
2024-07-12 16:01:57 -07:00
Konstantin Ilin 5ecb92760a Create C API function to iterate over WriteBatch for custom Column Families (#12718)
Summary:
Adds functions to the C API that expose column-family-aware methods for iterating over a WriteBatch: put_cf, delete_cf, and merge_cf. This is required when one needs to read changes for any non-default column family; without this functionality it is impossible to iterate over changes in the WAL that are relevant to custom column families.
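
A hedged usage sketch (the function name and callback shapes below are assumptions based on the put_cf/delete_cf/merge_cf callbacks mentioned above; check rocksdb/c.h for the exact signatures):

```c++
#include <cstddef>
#include <cstdint>
#include <cstdio>

#include "rocksdb/c.h"

static void put_cf(void* state, uint32_t cfid, const char* k, size_t klen,
                   const char* v, size_t vlen) {
  std::printf("put cf=%u key=%.*s\n", cfid, static_cast<int>(klen), k);
}
static void deleted_cf(void* state, uint32_t cfid, const char* k,
                       size_t klen) {
  std::printf("delete cf=%u key=%.*s\n", cfid, static_cast<int>(klen), k);
}
static void merge_cf(void* state, uint32_t cfid, const char* k, size_t klen,
                     const char* v, size_t vlen) {
  std::printf("merge cf=%u key=%.*s\n", cfid, static_cast<int>(klen), k);
}

void DumpBatch(rocksdb_writebatch_t* wb) {
  // Assumed signature; consult rocksdb/c.h for the function actually added.
  rocksdb_writebatch_iterate_cf(wb, /*state=*/nullptr, put_cf, deleted_cf,
                                merge_cf);
}
```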

Fixes https://github.com/facebook/rocksdb/issues/12790

Testing:
Added WriteBatch iteration test to "columnfamilies" section of C API unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12718

Reviewed By: cbi42

Differential Revision: D59483601

Pulled By: ajkr

fbshipit-source-id: b68b900636304528a38620a8c3ad82fdce4b60cb
2024-07-09 12:05:08 -07:00
w41ter b837d41ab1 Expose SizeApproximationFlags to C API (#12836)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12836

Reviewed By: cbi42

Differential Revision: D59502673

Pulled By: ajkr

fbshipit-source-id: fc9f77d6740d8efa45d9357662f0f827dbd0511f
2024-07-09 12:00:50 -07:00
Chdy 110ce5f4a3 fix: Round-Robin pri under leveled compaction allows subcompactions b… (#12843)
Summary:
### Summary: Round-Robin pri under leveled compaction allows subcompactions by default, which is not compatible with PlainTable

```c++
bool Compaction::ShouldFormSubcompactions() const {
  if (cfd_ == nullptr) {
    return false;
  }

  // Round-Robin pri under leveled compaction allows subcompactions by default
  // and the number of subcompactions can be larger than max_subcompactions_
  if (cfd_->ioptions()->compaction_pri == kRoundRobin &&
      cfd_->ioptions()->compaction_style == kCompactionStyleLevel) {
    return output_level_ > 0;
  }

  if (max_subcompactions_ <= 1) {
    return false;
  }
```

PlainTable does not support subcompactions, including when AdaptiveTable is applied to PlainTable. Enabling subcompactions by default will result in the following error in some scenarios.

```c++
void PlainTableIterator::Seek(const Slice& target) {
  if (use_prefix_seek_ != !table_->IsTotalOrderMode()) {
    // This check is done here instead of NewIterator() to permit creating an
    // iterator with total_order_seek = true even if we won't be able to Seek()
    // it. This is needed for compaction: it creates iterator with
    // total_order_seek = true but usually never does Seek() on it,
    // only SeekToFirst().
    status_ = Status::InvalidArgument(
        "total_order_seek not implemented for PlainTable.");
    offset_ = next_offset_ = table_->file_info_.data_end_offset;
    return;
  }
```
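
For reference, a sketch of the option combination that could hit this error before the fix (illustrative values only):

```c++
#include "rocksdb/options.h"
#include "rocksdb/slice_transform.h"
#include "rocksdb/table.h"

rocksdb::Options MakePlainTableRoundRobinOptions() {
  rocksdb::Options options;
  options.compaction_style = rocksdb::kCompactionStyleLevel;
  // kRoundRobin under leveled compaction enables subcompactions by default.
  options.compaction_pri = rocksdb::kRoundRobin;
  // PlainTable needs a prefix extractor and does not support subcompactions.
  options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
  options.table_factory.reset(
      rocksdb::NewPlainTableFactory(rocksdb::PlainTableOptions()));
  return options;
}
```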

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12843

Reviewed By: ajkr

Differential Revision: D59433477

Pulled By: cbi42

fbshipit-source-id: fb780ba7f7e8efdfedb7480abf14dd38e0b63677
2024-07-08 12:25:11 -07:00
Peter Dillinger 0bb939611d Avoid unnecessary work in internal calls to GetSortedWalFiles (#12831)
Summary:
We are seeing a number of crash test failures coming from checkpoint and backup code, likely from WalManager::GetSortedWalFiles -> ... -> WalManager::ReadFirstLine. This code path is not needed, because we don't need to know the sequence numbers of WAL files going into a checkpoint or backup. We can minimize the impact of whatever inconsistency is causing that problem by not relying on it where it's not needed.

Similarly, when we only need a roughly accurate set of current WAL files, we don't need to query all the archived WAL files (and redundantly the live ones again).

So this reduces filesystem queries and DB mutex acquires in creating backups and checkpoints.

Needed follow-up:
Figure out what is causing various failures with an apparent inconsistency where GetSortedWalFiles fails on reading a WAL file. If it's an injected failure, perhaps it's not propagating that injected failure appropriately. It might also be an inconsistency between what the DB knows is flushed and what WalManager reads from the filesystem (which we know is dubious and should be phased out, which this is arguably another step toward). Or completing that phase-out might solve the problem without a full diagnosis.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12831

Test Plan:
existing tests (easily caught when I went too far in initially developing this change)

Updated the BackupUsingDirectIO test so that there's a WAL file in what is backed up. (It was relying on some oddity.)

Reviewed By: cbi42

Differential Revision: D59252649

Pulled By: pdillinger

fbshipit-source-id: 7ad4187a1c70caa59a6d6c1c643ef95232b929f5
2024-07-01 23:29:02 -07:00
Jeffery 093f4ef82c Fix db_rate_limiter_test for win (#12816)
Summary:
We didn't implement file system prefetch for Windows. During table open, `FilePrefetchBuffer` is used instead, and it does only 1 read instead of 4 in BufferedIO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12816

Reviewed By: jaykorean

Differential Revision: D59181835

Pulled By: ajkr

fbshipit-source-id: 18b8f0247408cd1a80f289357ede5232ae5a3c66
2024-07-01 16:14:19 -07:00
HypenZou 5bb7f95ed6 Don't take archived log size into account when calculating log size for flush (#12680)
Summary:
**Context/Summary:**
It seems unreasonable to take the archived log size into account when calculating the log size **for flush** in the CreateCheckpoint method. If the user sets WAL_ttl_seconds or WAL_size_limit_MB, the threshold _log_size_for_flush_ can easily be reached due to the size of the archived dir. As a result, the flush may always be triggered.
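
For context, a minimal sketch of the API involved (assuming an open `rocksdb::DB*`); when _log_size_for_flush_ is nonzero, CreateCheckpoint triggers a flush instead of copying the WAL once the total log size reaches that threshold:

```c++
#include <memory>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/utilities/checkpoint.h"

rocksdb::Status TakeCheckpoint(rocksdb::DB* db, const std::string& dir) {
  rocksdb::Checkpoint* raw = nullptr;
  rocksdb::Status s = rocksdb::Checkpoint::Create(db, &raw);
  if (!s.ok()) {
    return s;
  }
  std::unique_ptr<rocksdb::Checkpoint> checkpoint(raw);
  // 64 MB threshold, illustrative only; before this fix, archived logs
  // counted toward the size compared against this threshold.
  return checkpoint->CreateCheckpoint(dir, /*log_size_for_flush=*/64 << 20);
}
```
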
**Test**
covered by ./checkpoint_test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12680

Reviewed By: jaykorean

Differential Revision: D59097904

Pulled By: ajkr

fbshipit-source-id: 0ed29c1b078d8f40b85288541b008e00dbc517d3
2024-06-28 11:56:26 -07:00
Changyu Bi a31fe52173 Remove the return value of `SetBGError()` (#12792)
Summary:
The return value of `ErrorHandler::SetBGError(error)` seems to be not well-defined: it can be `bg_error_` (no matter whether `bg_error_` is set to the input error), an OK status, or [`recovery_error_`](3ee4d5a11a/db/error_handler.cc (L669)) from `StartRecoverFromRetryableBGIOError()`. The `recovery_error_` returned may be an OK status.

Only a few places use the return value of `SetBGError()`, and they don't need to do so. Using the return value may even be wrong, for example in 3ee4d5a11a/db/db_impl/db_impl_write.cc (L2365), where a non-OK `s` could be overwritten to OK. This PR changes SetBGError() to return void and cleans up the relevant code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12792

Test Plan: existing unit tests and go over all places where return value of `SetBGError()` is used.

Reviewed By: hx235

Differential Revision: D58904898

Pulled By: cbi42

fbshipit-source-id: d58a20ba5a40e3f35367c6034a32c755088c3653
2024-06-26 18:17:05 -07:00
Jason Volk 8bf1f6f87f Add info logging via callback to C api. (#12537)
Summary:
I'd like to get this in so the Rust folks can integrate with their splendid logging/tracing frameworks; will be hugely appreciated. 🙏🏻
The infolog capabilities for C embeddings are quite spartan. LOG files were generated involuntarily until redirection to stderr was added by https://github.com/facebook/rocksdb/issues/12262; still insufficient for apps which cannot tolerate pollution of their stdio and tend to have existing logging frameworks to tie into for that.

Adds a very minimal derivation of Logger around a C callback, written in the same spirit, useful for FFI interfaces from other languages to integrate infolog.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12537

Reviewed By: ajkr

Differential Revision: D57597766

Pulled By: cbi42

fbshipit-source-id: ec684ce4ddf77a0a6ebbf013a1bacb4ff2e49eb0
2024-06-26 10:29:05 -07:00
Peter Dillinger 39455974cb Fix possible double-free on TruncatedRangeDelIterator (#12805)
Summary:
Not sure where or how it happens, but using a recent CircleCI failure I got a reliable db_stress reproducer.

Using std::unique_ptr appropriately for managing them has apparently (and unsurprisingly) fixed the problem without needing to know exactly where the problem was.

Suggested follow-up:
* Three or even four levels of pointers are very confusing to work with. Surely this part can be cleaned up to be simpler.
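
A generic illustration of why std::unique_ptr resolves this class of bug (a sketch, not the actual RocksDB code):

```c++
#include <memory>
#include <utility>

struct TruncatedIter {};  // stand-in for TruncatedRangeDelIterator

void OwnershipSketch() {
  // With a raw pointer, two "owners" can each call delete -> double free.
  // unique_ptr enforces a single owner and makes every transfer explicit:
  std::unique_ptr<TruncatedIter> a = std::make_unique<TruncatedIter>();
  std::unique_ptr<TruncatedIter> b = std::move(a);  // a is now null
  // The object is destroyed exactly once, when b goes out of scope.
}
```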

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12805

Test Plan:
Reproducer passes, plus ASAN test and crash test runs. I don't think it's worth the extra work to track down the details and create a careful unit test.

```
./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=2 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=tiered_lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=1 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=0 --compaction_readahead_size=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0.2 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb.gpxs/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --expected_values_dir=/dev/shm/rocksdb.gpxs/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=4 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kHot --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 
--max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=60 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=1 --uncache_aggressiveness=5 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```

Reviewed By: cbi42

Differential Revision: D58958390

Pulled By: pdillinger

fbshipit-source-id: 1271cfdcc3c574f78cd59f3c68148f7ed4a19c47
2024-06-24 11:51:16 -07:00
Yu Zhang fa4ffc816c Fix race condition between event listener and error handler (#12803)
Summary:
Fix a race when accessing `bg_error_` after the mutex is released. We make some copies before releasing the mutex to avoid this.
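
The pattern, as a minimal sketch (names are illustrative, not the actual RocksDB code): copy the shared field while the mutex is held, then use only the copy after release.

```c++
#include <mutex>
#include <string>

class ErrorNotifierSketch {
 public:
  void NotifyListeners() {
    std::string bg_error_copy;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      bg_error_copy = bg_error_;  // copy while the lock is held
    }
    // Only the copy is used after the mutex is released; another thread may
    // change bg_error_ now, but it can no longer race with this snapshot.
    Report(bg_error_copy);
  }

 private:
  void Report(const std::string& /*err*/) { /* invoke event listener */ }
  std::mutex mutex_;
  std::string bg_error_;
};
```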

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12803

Reviewed By: cbi42

Differential Revision: D58957557

Pulled By: jowlyzhang

fbshipit-source-id: 3c7369a3b8c8707aebc0044ff98288c898c05cb8
2024-06-24 11:45:28 -07:00
Hui Xiao 9d64ca55b7 Proceed for new memtable on okay status (#12798)
Summary:
**Context/Summary:**
The relevant code logs info about the newly created WAL and proceeds to `ConstructFragmentedRangeTombstones()` even when the previous step fails. This PR fixes that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12798

Test Plan: Existing tests

Reviewed By: cbi42

Differential Revision: D58917246

Pulled By: hx235

fbshipit-source-id: f395210d91e50617195cb9a8047cf5d82db0c40e
2024-06-22 16:17:59 -07:00
Changyu Bi b4a84efb4e Fix assertion failure in ConstructFragmentedRangeTombstones() (#12796)
Summary:
The assertion `assert(!IsFragmentedRangeTombstonesConstructed(false));` assumes ConstructFragmentedRangeTombstones() is called only once per memtable. This is not true, since SwitchMemtable() can be called multiple times on the same live memtable if a previous attempt fails. So this PR removes the assertion and simplifies the relevant code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12796

Test Plan: the exact condition that triggers the manifest write in SwitchMemtable() is complicated to reproduce. We will monitor the crash test to confirm there are no more failures.

Reviewed By: hx235

Differential Revision: D58913310

Pulled By: cbi42

fbshipit-source-id: 458bb9eebcf6743e9001186fcb757e4b50e8a5d2
2024-06-22 11:31:16 -07:00
Peter Dillinger 3ee4d5a11a Fix possible crash in failure to sync some WALs (#12789)
Summary:
I believe this was possible with recyclable logs before recent work like https://github.com/facebook/rocksdb/issues/12734, but this cleans up a couple of possible crashes revealed by the crash test.  A WAL with a nullptr file writer (already closed) can persist in `logs_` if a later WAL fails to sync. In case of any WAL sync failures, we don't record WAL syncs to the manifest. Thus, even if a WAL is fully synced and closed, we might need to keep it on the `logs_` list so that we know to record its sync to the manifest if there should be a successful sync next time. (However, I believe that's future-looking because currently any failure in WAL sync is considered non-recoverable.)

I don't believe this was likely enough before recent changes to warrant a release note (if it was possible).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12789

Test Plan: A unit test that would reveal the crashes, now fixed

Reviewed By: cbi42

Differential Revision: D58874154

Pulled By: pdillinger

fbshipit-source-id: bc69407cd9cbcd080af9585d502d4e33dafc3d29
2024-06-21 12:56:21 -07:00
Jay Huh cce51f0664 Fix heap-use-after-free in MultiCfIteratorImpl (#12784)
Summary:
When changing the direction of the multi-CF iterator, we do so by calling `Seek(current_key)` (when changing from backward to forward) or `SeekForPrev(current_key)` (forward to backward) on the child iterators and rebuilding the heap.

`Slice target` is just a pointer, and its contents are not guaranteed to be the same after re-initializing the heap.
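
A sketch of the hazard and one way around it (illustrative, not the exact fix): copy the key into owned storage before re-seeking, so the target stays valid while the child iterators move and the heap is rebuilt.

```c++
#include <string>

#include "rocksdb/slice.h"

// rocksdb::Slice is a non-owning (pointer, length) view. If the buffer it
// points into is freed or reallocated (as a child iterator's key buffer can
// be during the re-seek), the Slice dangles.
void ReseekSafely(const rocksdb::Slice& current_key) {
  std::string pinned(current_key.data(), current_key.size());  // own the bytes
  rocksdb::Slice target(pinned);  // valid for as long as `pinned` is alive
  // ... call Seek(target) / SeekForPrev(target) on the child iterators ...
  (void)target;
}
```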

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12784

Test Plan:
I was able to steadily repro this by building with `COMPILE_WITH_ASAN=1` and running db_stress.
```
COMPILE_WITH_ASAN=1 make -j64 dbg
```
```
./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=2 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=62.9095874568401 --bottommost_compression_type=none --bottommost_file_compaction_delay=600 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=1 --compaction_readahead_size=0 --compaction_ttl=100 --compress_format_version=2 --compressed_secondary_cache_size=8388608 --compression_checksum=1 --compression_max_dict_buffer_bytes=1099511627775 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=crc32c --fill_cache=0 --flush_one_in=1000000 --format_version=4 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --kill_random_test=888887 --last_level_temperature=kHot --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=60 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 
--max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --metadata_read_fault_one_in=1000 --metadata_write_fault_one_in=128 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=20000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --persist_user_defined_timestamps=1 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=0 --readpercent=50 --recycle_log_file_num=0 --reopen=20 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=bar --sqfc_version=0 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=1 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=0 --uncache_aggressiveness=14 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_cf_iterator=1 --use_multi_get_entity=1 --use_multiget=1 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_txn=0 --use_write_buffer_manager=0 --user_timestamp_size=8 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```
```
==1606272==ERROR: AddressSanitizer: heap-use-after-free on address 0x6060000b0cc0 at pc 0x7f733469c7de bp 0x7f7311bfcfe0 sp 0x7f7311bfc790
READ of size 40 at 0x6060000b0cc0 thread T57
    #0 0x7f733469c7dd in __interceptor_memcpy /home/engshare/third-party2/gcc/11.x/src/gcc-11.x/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:827
    #1 0x7f7331f65f7e in rocksdb::IterKey::SetInternalKey(rocksdb::Slice const&, rocksdb::Slice const&, unsigned long, rocksdb::ValueType, rocksdb::Slice const*) db/dbformat.h:761
    #2 0x7f7331f661ee in rocksdb::IterKey::SetInternalKey(rocksdb::Slice const&, unsigned long, rocksdb::ValueType, rocksdb::Slice const*) db/dbformat.h:776
    #3 0x7f73323039ff in rocksdb::DBIter::SetSavedKeyToSeekTarget(rocksdb::Slice const&) db/db_iter.cc:1462
    #4 0x7f7332304eb8 in rocksdb::DBIter::Seek(rocksdb::Slice const&) db/db_iter.cc:1540
    #5 0x7f7331d94abd in rocksdb::ArenaWrappedDBIter::Seek(rocksdb::Slice const&) (/data/users/jewoongh/rocksdb/librocksdb.so.9.4+0x1394abd)
    #6 0x7f73320f1a52 in rocksdb::MultiCfIteratorImpl::Seek(rocksdb::Slice const&)::{lambda(rocksdb::Iterator*)#2}::operator()(rocksdb::Iterator*) const db/multi_cf_iterator_impl.h:73
    #7 0x7f73320fccf0 in void rocksdb::MultiCfIteratorImpl::SeekCommon<rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::greater<int> > >, rocksdb::MultiCfIteratorImpl::Seek(rocksdb::Slice const&)::{lambda(rocksdb::Iterator*)#2}>(rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::greater<int> > >&, rocksdb::MultiCfIteratorImpl::Seek(rocksdb::Slice const&)::{lambda(rocksdb::Iterator*)#2}) (/data/users/jewoongh/rocksdb/librocksdb.so.9.4+0x16fccf0)
    #8 0x7f73320f1a93 in rocksdb::MultiCfIteratorImpl::Seek(rocksdb::Slice const&) db/multi_cf_iterator_impl.h:73
    #9 0x7f73320f1dbe in rocksdb::MultiCfIteratorImpl::Next()::{lambda()#1}::operator()() const db/multi_cf_iterator_impl.h:90
    #10 0x7f73320fe159 in rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::greater<int> > >& rocksdb::MultiCfIteratorImpl::GetHeap<rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::greater<int> > >, rocksdb::MultiCfIteratorImpl::Next()::{lambda()#1}>(rocksdb::MultiCfIteratorImpl::Next()::{lambda()#1}) (/data/users/jewoongh/rocksdb/librocksdb.so.9.4+0x16fe159)
    #11 0x7f73320f1ec9 in rocksdb::MultiCfIteratorImpl::Next() db/multi_cf_iterator_impl.h:87
    #12 0x7f73320f3255 in rocksdb::CoalescingIterator::Next() db/coalescing_iterator.h:34
    #13 0x66f28a in TestIterateImpl<rocksdb::Iterator, rocksdb::StressTest::TestIterate(rocksdb::ThreadState*, const rocksdb::ReadOptions&, const std::vector<int>&, const std::vector<long int>&)::<lambda(const rocksdb::ReadOptions&)>, rocksdb::StressTest::TestIterate(rocksdb::ThreadState*, const rocksdb::ReadOptions&, const std::vector<int>&, const std::vector<long int>&)::<lambda(rocksdb::Iterator*)> > db_stress_tool/db_stress_test_base.cc:1718
    #14 0x6440b4 in rocksdb::StressTest::TestIterate(rocksdb::ThreadState*, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&) db_stress_tool/db_stress_test_base.cc:1504
    #15 0x640cb0 in rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) db_stress_tool/db_stress_test_base.cc:1376
    #16 0x6004f6 in rocksdb::ThreadBody(void*) db_stress_tool/db_stress_driver.cc:39
    #17 0x7f73327caed4 in StartThreadWrapper env/env_posix.cc:469
    #18 0x7f733029abc8 in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434
    #19 0x7f733032cf5b in __GI___clone3 (/usr/local/fbcode/platform010/lib/libc.so.6+0x12cf5b)

0x6060000b0cc0 is located 0 bytes inside of 55-byte region [0x6060000b0cc0,0x6060000b0cf7)
freed by thread T57 here:
    #0 0x7f73346d1d77 in operator delete[](void*) /home/engshare/third-party2/gcc/11.x/src/gcc-11.x/libsanitizer/asan/asan_new_delete.cpp:163
    #1 0x7f7331d9274b in rocksdb::IterKey::ResetBuffer() db/dbformat.h:830
    #2 0x7f73323146b9 in rocksdb::IterKey::EnlargeBuffer(unsigned long) db/dbformat.cc:278
    #3 0x7f7331f33031 in rocksdb::IterKey::EnlargeBufferIfNeeded(unsigned long) db/dbformat.h:846
    #4 0x7f7331f65ee0 in rocksdb::IterKey::SetInternalKey(rocksdb::Slice const&, rocksdb::Slice const&, unsigned long, rocksdb::ValueType, rocksdb::Slice const*) db/dbformat.h:757
    #5 0x7f7331f661ee in rocksdb::IterKey::SetInternalKey(rocksdb::Slice const&, unsigned long, rocksdb::ValueType, rocksdb::Slice const*) db/dbformat.h:776
    #6 0x7f73323039ff in rocksdb::DBIter::SetSavedKeyToSeekTarget(rocksdb::Slice const&) db/db_iter.cc:1462
    #7 0x7f7332304eb8 in rocksdb::DBIter::Seek(rocksdb::Slice const&) db/db_iter.cc:1540
    #8 0x7f7331d94abd in rocksdb::ArenaWrappedDBIter::Seek(rocksdb::Slice const&) (/data/users/jewoongh/rocksdb/librocksdb.so.9.4+0x1394abd)
    #9 0x7f73320f1a52 in rocksdb::MultiCfIteratorImpl::Seek(rocksdb::Slice const&)::{lambda(rocksdb::Iterator*)#2}::operator()(rocksdb::Iterator*) const db/multi_cf_iterator_impl.h:73
    #10 0x7f73320fccf0 in void rocksdb::MultiCfIteratorImpl::SeekCommon<rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::greater<int> > >, rocksdb::MultiCfIteratorImpl::Seek(rocksdb::Slice const&)::{lambda(rocksdb::Iterator*)#2}>(rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::greater<int> > >&, rocksdb::MultiCfIteratorImpl::Seek(rocksdb::Slice const&)::{lambda(rocksdb::Iterator*)#2}) (/data/users/jewoongh/rocksdb/librocksdb.so.9.4+0x16fccf0)
    #11 0x7f73320f1a93 in rocksdb::MultiCfIteratorImpl::Seek(rocksdb::Slice const&) db/multi_cf_iterator_impl.h:73
    #12 0x7f73320f1dbe in rocksdb::MultiCfIteratorImpl::Next()::{lambda()#1}::operator()() const db/multi_cf_iterator_impl.h:90
    #13 0x7f73320fe159 in rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::greater<int> > >& rocksdb::MultiCfIteratorImpl::GetHeap<rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::greater<int> > >, rocksdb::MultiCfIteratorImpl::Next()::{lambda()#1}>(rocksdb::MultiCfIteratorImpl::Next()::{lambda()#1}) (/data/users/jewoongh/rocksdb/librocksdb.so.9.4+0x16fe159)
    #14 0x7f73320f1ec9 in rocksdb::MultiCfIteratorImpl::Next() db/multi_cf_iterator_impl.h:87
    #15 0x7f73320f3255 in rocksdb::CoalescingIterator::Next() db/coalescing_iterator.h:34
    #16 0x66f28a in TestIterateImpl<rocksdb::Iterator, rocksdb::StressTest::TestIterate(rocksdb::ThreadState*, const rocksdb::ReadOptions&, const std::vector<int>&, const std::vector<long int>&)::<lambda(const rocksdb::ReadOptions&)>, rocksdb::StressTest::TestIterate(rocksdb::ThreadState*, const rocksdb::ReadOptions&, const std::vector<int>&, const std::vector<long int>&)::<lambda(rocksdb::Iterator*)> > db_stress_tool/db_stress_test_base.cc:1718
    #17 0x6440b4 in rocksdb::StressTest::TestIterate(rocksdb::ThreadState*, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&) db_stress_tool/db_stress_test_base.cc:1504
    #18 0x640cb0 in rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) db_stress_tool/db_stress_test_base.cc:1376
    #19 0x6004f6 in rocksdb::ThreadBody(void*) db_stress_tool/db_stress_driver.cc:39
    #20 0x7f73327caed4 in StartThreadWrapper env/env_posix.cc:469
    #21 0x7f733029abc8 in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434

previously allocated by thread T57 here:
    #0 0x7f73346d13b7 in operator new[](unsigned long) /home/engshare/third-party2/gcc/11.x/src/gcc-11.x/libsanitizer/asan/asan_new_delete.cpp:102
    #1 0x7f73323146c5 in rocksdb::IterKey::EnlargeBuffer(unsigned long) db/dbformat.cc:279
    #2 0x7f7331f33031 in rocksdb::IterKey::EnlargeBufferIfNeeded(unsigned long) db/dbformat.h:846
    #3 0x7f7331f65ee0 in rocksdb::IterKey::SetInternalKey(rocksdb::Slice const&, rocksdb::Slice const&, unsigned long, rocksdb::ValueType, rocksdb::Slice const*) db/dbformat.h:757
    #4 0x7f7331f661ee in rocksdb::IterKey::SetInternalKey(rocksdb::Slice const&, unsigned long, rocksdb::ValueType, rocksdb::Slice const*) db/dbformat.h:776
    #5 0x7f7332303e1e in rocksdb::DBIter::SetSavedKeyToSeekForPrevTarget(rocksdb::Slice const&) db/db_iter.cc:1479
    #6 0x7f7332306302 in rocksdb::DBIter::SeekForPrev(rocksdb::Slice const&) db/db_iter.cc:1615
    #7 0x7f7331d94b0f in rocksdb::ArenaWrappedDBIter::SeekForPrev(rocksdb::Slice const&) (/data/users/jewoongh/rocksdb/librocksdb.so.9.4+0x1394b0f)
    #8 0x7f73320f1c5a in rocksdb::MultiCfIteratorImpl::SeekForPrev(rocksdb::Slice const&)::{lambda(rocksdb::Iterator*)#2}::operator()(rocksdb::Iterator*) const db/multi_cf_iterator_impl.h:82
    #9 0x7f73320fdc1e in void rocksdb::MultiCfIteratorImpl::SeekCommon<rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::less<int> > >, rocksdb::MultiCfIteratorImpl::SeekForPrev(rocksdb::Slice const&)::{lambda(rocksdb::Iterator*)#2}>(rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::less<int> > >&, rocksdb::MultiCfIteratorImpl::SeekForPrev(rocksdb::Slice const&)::{lambda(rocksdb::Iterator*)#2}) (/data/users/jewoongh/rocksdb/librocksdb.so.9.4+0x16fdc1e)
    #10 0x7f73320f1c9b in rocksdb::MultiCfIteratorImpl::SeekForPrev(rocksdb::Slice const&) db/multi_cf_iterator_impl.h:81
    #11 0x7f73320f2002 in rocksdb::MultiCfIteratorImpl::Prev()::{lambda()#1}::operator()() const db/multi_cf_iterator_impl.h:99
    #12 0x7f73320ff223 in rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::less<int> > >& rocksdb::MultiCfIteratorImpl::GetHeap<rocksdb::BinaryHeap<rocksdb::MultiCfIteratorInfo, rocksdb::MultiCfIteratorImpl::MultiCfHeapItemComparator<std::less<int> > >, rocksdb::MultiCfIteratorImpl::Prev()::{lambda()#1}>(rocksdb::MultiCfIteratorImpl::Prev()::{lambda()#1}) (/data/users/jewoongh/rocksdb/librocksdb.so.9.4+0x16ff223)
    #13 0x7f73320f210d in rocksdb::MultiCfIteratorImpl::Prev() db/multi_cf_iterator_impl.h:96
    #14 0x7f73320f3275 in rocksdb::CoalescingIterator::Prev() db/coalescing_iterator.h:35
    #15 0x66f440 in TestIterateImpl<rocksdb::Iterator, rocksdb::StressTest::TestIterate(rocksdb::ThreadState*, const rocksdb::ReadOptions&, const std::vector<int>&, const std::vector<long int>&)::<lambda(const rocksdb::ReadOptions&)>, rocksdb::StressTest::TestIterate(rocksdb::ThreadState*, const rocksdb::ReadOptions&, const std::vector<int>&, const std::vector<long int>&)::<lambda(rocksdb::Iterator*)> > db_stress_tool/db_stress_test_base.cc:1725
    #16 0x6440b4 in rocksdb::StressTest::TestIterate(rocksdb::ThreadState*, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&) db_stress_tool/db_stress_test_base.cc:1504
    #17 0x640cb0 in rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) db_stress_tool/db_stress_test_base.cc:1376
    #18 0x6004f6 in rocksdb::ThreadBody(void*) db_stress_tool/db_stress_driver.cc:39
    #19 0x7f73327caed4 in StartThreadWrapper env/env_posix.cc:469
    #20 0x7f733029abc8 in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434

Thread T57 created by T0 here:
    #0 0x7f7334642136 in __interceptor_pthread_create /home/engshare/third-party2/gcc/11.x/src/gcc-11.x/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x7f73327cb008 in StartThread env/env_posix.cc:479
    #2 0x7f733276b406 in rocksdb::CompositeEnvWrapper::StartThread(void (*)(void*), void*) env/composite_env_wrapper.h:316
    #3 0x7f733276b406 in rocksdb::CompositeEnvWrapper::StartThread(void (*)(void*), void*) env/composite_env_wrapper.h:316
    #4 0x6013d9 in rocksdb::RunStressTestImpl(rocksdb::SharedState*) db_stress_tool/db_stress_driver.cc:108
    #5 0x603083 in rocksdb::RunStressTest(rocksdb::SharedState*) db_stress_tool/db_stress_driver.cc:248
    #6 0x4e6ab3 in rocksdb::db_stress_tool(int, char**) db_stress_tool/db_stress_tool.cc:365
    #7 0x4e260a in main db_stress_tool/db_stress.cc:23
    #8 0x7f733022c656 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #9 0x7f733022c717 in __libc_start_main_impl ../csu/libc-start.c:409
    #10 0x4e2530 in _start (/data/users/jewoongh/rocksdb/db_stress+0x4e2530)
```

The `heap-use-after-free` no longer occurred with the same command after making this change.

Reviewed By: pdillinger

Differential Revision: D58871081

Pulled By: jaykorean

fbshipit-source-id: 0194c34ffec5f16a6556c6bf3941a27253a4ecb4
2024-06-21 11:56:10 -07:00
Hui Xiao 1adb935720 Inject more errors to more files in stress test (#12713)
Summary:
**Context:**
We currently have partial error injection:
- DB operation: all read, SST write
- DB open: all read, SST write, all metadata write.

This PR completes the error injection (with some limitations below):
- DB operation & open: all read, all write, all metadata write, all metadata read

**Summary:**
- Inject retryable metadata read and metadata write errors concerning directories (e.g., dir sync) or file metadata (e.g., name, size, file creation/deletion...)
- Inject retryable errors to all major file types: random access file, sequential file, writable file
- Allow db stress test operations to handle the above injected errors gracefully without crashing
- Change all error injection to a thread-local implementation for easier disabling and enabling within the same thread. For example, we can keep the error handling thread free of error injection. It also makes the code cleaner.
   - Limitation: compared to before, we no longer have write fault injection for backup/restore CopyOrCreateFiles worker threads (since they use anonymous background threads), nor read injection for the db open background thread
- Add a new flag to test error recovery without error injection so we can test the path where error recovery actually succeeds
- Some refactoring & fixes to the db stress test framework (see PR review comments)
- Fix some minor bugs surfaced (see PR review comments)
- Limitation: had to disable backup restore with metadata read/write injection since it surfaces too many testing issues. Will add it back later to focus on surfacing actual code/internal bugs first.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12713

Test Plan:
- Existing UT
- CI with no trivial error failure

Reviewed By: pdillinger

Differential Revision: D58326608

Pulled By: hx235

fbshipit-source-id: 011b5195aaeb6011641ae0a9194f7f2a0e325ad7
2024-06-19 08:42:00 -07:00
Yu Zhang c73cf7a878 Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760)
Summary:
This PR adds the user property collector factory `CompactForTieringCollectorFactory` to support observing SST files and marking them as needing compaction, to fast-track data to the proper tier.

A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following:
1) Setting the ratio to be equal to or smaller than 0 disables this collector
2) Setting the ratio to be within (0, 1] writes the number of observed eligible entries into a user property and marks a file as needing compaction when that ratio of eligible entries is reached.
3) Setting the ratio to be higher than 1 can be used to just write the user table property without marking any file as needing compaction.
For a column family that does not enable the tiering feature, this collector is disabled even if an effective configuration is provided. The collector is also disabled for files that are already on the last level.
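To illustrate the mechanism, here is a minimal sketch of a collector built on the public `TablePropertiesCollector` API (the base class and its methods are real; the property name and eligibility logic here are made up, and the actual `CompactForTieringCollectorFactory` lives inside RocksDB):

```
#include <cstdint>
#include <string>

#include "rocksdb/table_properties.h"

class TieringCollectorSketch : public rocksdb::TablePropertiesCollector {
 public:
  explicit TieringCollectorSketch(double trigger_ratio)
      : trigger_ratio_(trigger_ratio) {}

  rocksdb::Status AddUserKey(const rocksdb::Slice& /*key*/,
                             const rocksdb::Slice& /*value*/,
                             rocksdb::EntryType /*type*/,
                             rocksdb::SequenceNumber /*seq*/,
                             uint64_t /*file_size*/) override {
    ++total_entries_;
    // A real implementation would decide eligibility from the entry's
    // write time vs. the tiering cutoff; we count every entry here.
    ++eligible_entries_;
    return rocksdb::Status::OK();
  }

  rocksdb::Status Finish(rocksdb::UserCollectedProperties* props) override {
    (*props)["eligible.entries"] = std::to_string(eligible_entries_);  // made-up name
    need_compaction_ =
        trigger_ratio_ > 0 && trigger_ratio_ <= 1 && total_entries_ > 0 &&
        eligible_entries_ >=
            static_cast<uint64_t>(trigger_ratio_ * total_entries_);
    return rocksdb::Status::OK();
  }

  rocksdb::UserCollectedProperties GetReadableProperties() const override {
    return {};
  }
  const char* Name() const override { return "TieringCollectorSketch"; }
  bool NeedCompact() const override { return need_compaction_; }

 private:
  double trigger_ratio_;
  uint64_t total_entries_ = 0;
  uint64_t eligible_entries_ = 0;
  bool need_compaction_ = false;
};
```

Such a collector would be produced by a `TablePropertiesCollectorFactory` registered in `ColumnFamilyOptions::table_properties_collector_factories`.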

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D58734976

Pulled By: jowlyzhang

fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81
2024-06-18 10:51:29 -07:00
Jonah Gao 9f95aa8269 `GetAggregatedIntProperty` accumulates property once per block cache (#12755)
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/12687.

A block cache may be shared by multiple column families. Therefore, when getting the aggregated property of the block cache, we need to deduplicate by instances of the block cache, meaning the same instance should only be counted once.
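A hedged sketch of the deduplication idea, using the public `Cache::GetUsage()` accessor (the helper and the way caches are gathered per column family are illustrative):

```
#include <cstdint>
#include <unordered_set>
#include <vector>

#include "rocksdb/cache.h"

// Count each distinct cache instance exactly once, even if several column
// families point at the same shared block cache.
uint64_t AggregateBlockCacheUsage(
    const std::vector<rocksdb::Cache*>& per_cf_caches) {
  std::unordered_set<const rocksdb::Cache*> seen;
  uint64_t total = 0;
  for (const rocksdb::Cache* cache : per_cf_caches) {
    if (cache != nullptr && seen.insert(cache).second) {
      total += cache->GetUsage();  // only on first encounter
    }
  }
  return total;
}
```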

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12755

Reviewed By: jowlyzhang

Differential Revision: D58508819

Pulled By: ajkr

fbshipit-source-id: 3b746841d7eac59f900387ec3b8c19dbcd20aae4
2024-06-18 10:46:55 -07:00
Peter Dillinger 3758e31f3f Fix rare failure in DBBlockCacheTypeTest.Uncache (#12775)
Summary:
Following up on https://github.com/facebook/rocksdb/issues/12748 after seeing recurrence in https://github.com/facebook/rocksdb/actions/runs/9522985253/job/26253605587?pr=12774

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12775

Test Plan: Was able to reproduce failure and verify fix this time using COERCE_CONTEXT_SWITCH=1 :)

Reviewed By: jowlyzhang

Differential Revision: D58623461

Pulled By: pdillinger

fbshipit-source-id: d93a5e6a4977675eac54bbd42e70ae7b29b950a4
2024-06-14 20:50:36 -07:00
Jay Huh 0ab60b8a8c MultiCfIterator - Handle case of invalid key from child iter manual prefix iteration (#12773)
Summary:
Instead of completely disallowing `MultiCfIterator` when one or more child iterators will do manual prefix iteration (as suggested in https://github.com/facebook/rocksdb/issues/12770), just let `MultiCfIterator` operate as-is even when there's a possibility of undefined results from child iterators. If one or more child iterators cause the heap to be empty, just return early, and `Valid()` will return false.

It is still possible that heap is not empty when one or more child iterators are returning wrong keys. Basically, MultiCfIterator behaves the same as what we described in https://github.com/facebook/rocksdb/wiki/Prefix-Seek#manual-prefix-iterating - "RocksDB will not return error when it is misused and the iterating result will be undefined."
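A hedged sketch of the caller-visible behavior (assuming the `DB::NewCoalescingIterator` API that sits on top of `MultiCfIteratorImpl`):

```
// If manual prefix iteration misuse empties the internal heap, the iterator
// simply becomes invalid; no error status is surfaced.
std::unique_ptr<rocksdb::Iterator> iter =
    db->NewCoalescingIterator(rocksdb::ReadOptions(), cf_handles);
for (iter->SeekForPrev("target"); iter->Valid(); iter->Prev()) {
  // Under misuse, the keys seen here are undefined rather than an error,
  // mirroring the manual prefix iterating caveat in the wiki.
}
```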

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12773

Test Plan:
MultiCfIterator added back to the stress test
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2
```

Reviewed By: cbi42

Differential Revision: D58612055

Pulled By: jaykorean

fbshipit-source-id: e0dd942bed98382c59d463412dd8f163e6790b93
2024-06-14 15:59:17 -07:00
Yu Zhang f5e44f3490 Fix manual flush hanging on waiting for no stall for UDT in memtable … (#12771)
Summary:
This PR fixes a scenario where a manual flush can hang because its expectation that others will clear out excessive memtables is not met. The root cause is that the FlushRequest rescheduling logic uses stricter criteria for what "a write stall is about to happen" means than `WaitUntilFlushWouldNotStallWrites` does. Currently, the former considers a write stall to be imminent when the last memtable is half full, in which case it reschedules the queued FlushRequest and does not actually proceed with the flush. The latter, however, considers that once we have already started to use the last memtable, we should wait until some other background flush jobs clear out some memtables before proceeding with this manual flush.

If we make them use the same criteria, we can guarantee that at any time when `WaitUntilFlushWouldNotStallWrites` is waiting, it's not because the rescheduling logic is holding it back.
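As a rough illustration (names are illustrative, not RocksDB internals), the fix amounts to both code paths consulting one shared notion of an imminent write stall:

```
// Unify "a write stall is about to happen" so the FlushRequest rescheduling
// logic and WaitUntilFlushWouldNotStallWrites() can never disagree.
bool WriteStallImminent(int num_unflushed_memtables,
                        int max_write_buffer_number) {
  // e.g. "we have already started using the last allowed memtable"
  return num_unflushed_memtables >= max_write_buffer_number - 1;
}
```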

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12771

Test Plan: Added unit test

Reviewed By: ajkr

Differential Revision: D58603746

Pulled By: jowlyzhang

fbshipit-source-id: 9fa1c87c0175d47a40f584dfb1b497baa576755b
2024-06-14 13:37:37 -07:00
Yu Zhang 13c758f986 Change the behavior of manual flush to not retain UDT (#12737)
Summary:
When user-defined timestamps in Memtable only feature is enabled, all scheduled flushes go through a check to see if it's eligible to be rescheduled to retain user-defined timestamps. However when the user makes a manual flush request, their intention is for all the in memory data to be persisted into SST files as soon as possible. These two sides have some conflict of interest, the user can implement some workaround like https://github.com/facebook/rocksdb/issues/12631 to explicitly mark which one takes precedence. The implementation for this can be nuanced since the user needs to be aware of all the scenarios that can trigger a manual flush and handle the concurrency well etc.

In this PR, we updated the default behavior to give manual flush precedence when it's requested. The user-defined timestamps rescheduling mechanism is turned off when a manual flush is requested. Likewise, all error recovery triggered flushes skips the rescheduling mechanism too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12737

Test Plan: Add unit tests

Reviewed By: ajkr

Differential Revision: D58538246

Pulled By: jowlyzhang

fbshipit-source-id: 0b9b3d1af3e8d882f2d6a2406adda19324ba0694
2024-06-13 13:18:10 -07:00
Peter Dillinger 0646ec6e2d Ensure Close() before LinkFile() for WALs in Checkpoint (#12734)
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written to, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731
to bring more hygiene to our handling of WAL files in Checkpoint.

Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
  `background_close_inactive_wals`.
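A sketch of opting back into the old behavior via the kill switch (the option name comes from the list above):

```
rocksdb::Options options;
// Deprecated old behavior: let Close() of inactive WALs happen while
// handling obsolete files, possibly on a background thread.
options.background_close_inactive_wals = true;
```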

Stacked on https://github.com/facebook/rocksdb/issues/12731

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734

Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.

Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)

Reviewed By: cbi42

Differential Revision: D58295284

Pulled By: pdillinger

fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
2024-06-12 11:48:45 -07:00
Peter Dillinger 21eb82ebec Disable "uncache" behavior in DB shutdown (#12751)
Summary:
Crash test showed a potential use-after-free where a file marked as obsolete and eligible for uncache on destruction is destroyed in the VersionSet destructor, which only happens as part of DB shutdown. At that point, the in-memory column families have already been destroyed, so attempting to uncache could use-after-free on stuff like getting the `user_comparator()` from the `internal_comparator()`.

I attempted to make it smarter, but wasn't able to untangle the destruction dependencies in a way that was safe, understandable, and maintainable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12751

Test Plan:
Reproduced by adding uncache_aggressiveness to an existing (but otherwise unrelated) test. This makes it a fair regression test.

Also added testing to ensure that trivial moves and DB close & reopen are well behaved with uncache_aggressiveness. Specifically, this issue doesn't seem to be because things are uncached inappropriately in those cases.

Reviewed By: ltamasi

Differential Revision: D58390058

Pulled By: pdillinger

fbshipit-source-id: 66ac9cb13bf02638fa80ee5b7218153d8bc7cfd3
2024-06-11 15:57:40 -07:00
Evan Jones af50823069 c.h: Add set_track_and_verify_wals_in_manifest to C API (#12749)
Summary:
This option is recommended to be set for production use:

    We recommend to set track_and_verify_wals_in_manifest to true
    for production

https://github.com/facebook/rocksdb/wiki/Track-WAL-in-MANIFEST

This adds this setting to the C API, so it can be used by other languages.
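A short usage sketch; the exact C function name is inferred from the commit title and should be treated as an assumption:

```
rocksdb_options_t* opts = rocksdb_options_create();
// Recommended for production, per the wiki page cited above.
rocksdb_options_set_track_and_verify_wals_in_manifest(opts, 1 /* true */);
```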

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12749

Reviewed By: ltamasi

Differential Revision: D58382892

Pulled By: ajkr

fbshipit-source-id: 885de4539745a3119b6b2a162ab4fca9fa975283
2024-06-10 16:26:52 -07:00
Peter Dillinger 68112b3beb Attempt fix rare failure in DBBlockCacheTypeTest.Uncache (#12748)
Summary:
I haven't been able to reproduce the failure, seen in https://github.com/facebook/rocksdb/actions/runs/9420830905/job/25953696902?pr=12734

```
[ RUN      ] DBBlockCacheTypeTestInstance/DBBlockCacheTypeTest.Uncache/2
db/db_block_cache_test.cc:1415: Failure
Expected equality of these values:
  cache->GetOccupancyCount()
    Which is: 37
  kBaselineCount + kNumDataBlocks + meta_blocks_per_file
    Which is: 15
Google Test trace:
db/db_block_cache_test.cc:1346: ua=10000
db/db_block_cache_test.cc:1344: partitioned=1
db/db_block_cache_test.cc:1418: Failure
...
```

But it's consistent with a SuperVersion reference sticking around beyond the CompactRange, as I can reproduce the result with a dangling Iterator. As some other tests have had trouble with periodic stats popping up randomly, I suspect that could be the explanation in this case.
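A sketch of the suspected mechanism, mirroring the test's shape: a live iterator pins a SuperVersion, so blocks from pre-compaction files can stay counted in the cache past the `CompactRange()`:

```
{
  std::unique_ptr<rocksdb::Iterator> it(
      db->NewIterator(rocksdb::ReadOptions()));
  db->CompactRange(rocksdb::CompactRangeOptions(), nullptr, nullptr);
  // cache->GetOccupancyCount() may still include entries for old files here.
}  // Iterator destroyed; the pinned SuperVersion can now be released.
```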

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12748

Test Plan: Watch for similar future failures

Reviewed By: ltamasi

Differential Revision: D58366031

Pulled By: pdillinger

fbshipit-source-id: b812ca8837b8c8b9cbda1b201d76316d145fa3ec
2024-06-10 13:31:46 -07:00
Evan Jones 32e6825bc6 c.h: Add GetDbIdentity, Options::write_dbid_to_manifest (#12736)
Summary:
The write_dbid_to_manifest option is documented as "We recommend setting this flag to true". However, there is no way to set this flag from the C API.

Add the following functions to the C API:

* rocksdb_get_db_identity
* rocksdb_options_get_write_dbid_to_manifest
* rocksdb_options_set_write_dbid_to_manifest

Add a test that this option preserves the ID across checkpoints.

c.cc:
* Remove outdated comments about missing C API functions that exist.
* Document that CopyString is intended for binary data and is not NUL terminated.
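A hedged usage sketch of the new functions; the exact `rocksdb_get_db_identity` signature (length out-parameter, caller-freed copy) is an assumption based on the CopyString note above:

```
#include <stdio.h>
#include <stdlib.h>

#include "rocksdb/c.h"

void print_identity(rocksdb_t* db, rocksdb_options_t* opts) {
  rocksdb_options_set_write_dbid_to_manifest(opts, 1 /* true */);
  // ... applies to DBs (re)opened with `opts` ...
  size_t id_len = 0;
  char* id = rocksdb_get_db_identity(db, &id_len);  // assumed signature
  printf("DB identity: %.*s\n", (int)id_len, id);
  free(id);  // CopyString-style results are caller-owned
}
```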

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12736

Reviewed By: ltamasi

Differential Revision: D58202117

Pulled By: ajkr

fbshipit-source-id: 707b110df5c4bd118d65548327428a53a9dc3019
2024-06-07 16:53:43 -07:00
Peter Dillinger b34cef57b7 Support pro-actively erasing obsolete block cache entries (#12694)
Summary:
Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guaranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so it could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries.

Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive.

This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on.

The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizeable by CF.

Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths.
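A sketch of configuring the option, both at open time and dynamically (the option and `SetOptions()` API are real; the values are arbitrary):

```
rocksdb::ColumnFamilyOptions cf_opts;
cf_opts.uncache_aggressiveness = 100;  // 0 disables; higher burns more CPU

// Later, adjust on a live DB since the option is dynamically changeable:
db->SetOptions(cf_handle, {{"uncache_aggressiveness", "300"}});
```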

Notable auxiliary code details:
* Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce a current failure.)
* Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item)

Marked EXPERIMENTAL until more thorough validation is complete.

Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694

Test Plan:
* Unit tests added, including refactoring an existing test to make better use of parameterized tests.
* Added to crash test.
* Performance, sample command:
```
for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done
```

Averaging 10 runs each case, block cache data block hit rates

```
lru_cache
UA=0   -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0
UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0

fixed_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9
UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2

auto_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5
UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8
```

Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings).

Reviewed By: ajkr

Differential Revision: D57932442

Pulled By: pdillinger

fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c
2024-06-07 08:57:11 -07:00
Yu Zhang 44aceb88d0 Add a OnManualFlushScheduled callback in event listener (#12631)
Summary:
As titled. Also added the newest user-defined timestamp into the `MemTableInfo`. This can be useful info in the callback.

Added some unit tests as examples of how users can use two separate approaches to allow manual flushes / manual compactions to go through when the user-defined timestamps in memtable only feature is enabled. One approach relies on selectively increasing the cutoff timestamp in the `OnMemTableSealed` callback when the seal is initiated by a manual flush. Another approach is to increase the cutoff timestamp in the `OnManualFlushScheduled` callback. The caveats of both approaches are documented in the unit tests.
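A minimal sketch of the first approach, using the long-standing `EventListener::OnMemTableSealed` callback; how the cutoff timestamp is derived and applied is elided since it is application-specific:

```
class CutoffBumpListener : public rocksdb::EventListener {
 public:
  void OnMemTableSealed(const rocksdb::MemTableInfo& info) override {
    // When the seal was initiated by a manual flush, selectively raise the
    // user-defined timestamp cutoff (e.g., based on the newest user-defined
    // timestamp now exposed in MemTableInfo) so the flush is not
    // rescheduled. Mind the concurrency caveats documented in the tests.
  }
};
```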

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12631

Reviewed By: ajkr

Differential Revision: D58260528

Pulled By: jowlyzhang

fbshipit-source-id: bf446d7140affdf124744095e0a179fa6e427532
2024-06-06 17:29:01 -07:00
Hui Xiao 390fc55ba1 Revert PR 12684 and 12556 (#12738)
Summary:
**Context/Summary:** a better API design was decided on recently, so we are reverting these two changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12738

Test Plan: - CI

Reviewed By: ajkr

Differential Revision: D58162165

Pulled By: hx235

fbshipit-source-id: 9bbe4d2fe9fbe39213f4cf137a2d419e6ffb8e16
2024-06-06 11:46:16 -07:00
Peter Dillinger 98393f0139 Fix Checkpoint hard link of inactive but unsynced WAL (#12731)
Summary:
Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the `logs_` list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083,
it would be synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:
* Detect which WALs might not be fully synced, according to our DBImpl
  metadata, and prevent hard linking those (with `trim_to_size=true`
  from `GetLiveFilesStorageInfo()`. And while we're at it, use our known
  flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum
  needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no
  longer need to consider syncing as part of the FlushWAL in
  GetLiveFilesStorageInfo. (We determine the max WAL number consistent
  with the manifest file size, while holding DB mutex. Should make
  track_and_verify_wals_in_manifest happy.) This makes the premise of
  test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
  no longer hit) so the test is removed, with crash test as backstop for
  related issues. See https://github.com/facebook/rocksdb/issues/10185

Stacked on https://github.com/facebook/rocksdb/issues/12729

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731

Test Plan:
Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.

Reviewed By: cbi42

Differential Revision: D58199629

Pulled By: pdillinger

fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382
2024-06-05 17:40:09 -07:00
Peter Dillinger 9f4c597d83 FaultInjectionTestFS read unsynced data by default (#12729)
Summary:
In places (e.g. GetSortedWals()) RocksDB relies on querying the file size or even reading the contents of files currently open for writing, and as in POSIX semantics, expects to see the flushed size and contents regardless of what has been synced. FaultInjectionTestFS historically did not emulate this behavior, only showing synced data from such read operations. (Different from FaultInjectionTestEnv--sigh.)

This change makes the "proper" behavior the default behavior, at least for GetFileSize and FSSequentialFile. However, this new functionality is disabled in db_stress because of undiagnosed, unresolved issues.

Also removes unused and confusing field `pos_at_last_flush_`

This change is needed to support testing a relevant bug fix (in a follow-up diff).  Other suggested follow-up:
* Fix db_stress not to rely on the old behavior, and fix a related FIXME in db_stress_test_base.cc in LockWAL testing.
* Fill in some corner cases in the FileSystem API for reading unsynced data (see new TODO items).
* Consider deprecating and removing Flush() API functions from FileSystem APIs. It is not clear to me that there is a supported scenario in which they do anything but confuse API users and developers. If there is a use for them, it doesn't appear to be tested.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12729

Test Plan: applies to all unit tests successfully, just updating the unit test from https://github.com/facebook/rocksdb/issues/12556 due to relying on the errant behavior. Also added a specific unit test

Reviewed By: hx235

Differential Revision: D58091835

Pulled By: pdillinger

fbshipit-source-id: f47a63b2b000f5875b6293a98577bff663d7fd33
2024-06-04 15:25:23 -07:00
Yu Zhang 8523f0a86a Use extended file boundary for key range overlap check during file ingestion (#12735)
Summary:
When https://github.com/facebook/rocksdb/issues/12343 added support to bulk load external files while column family enables user-defined timestamps, it's a requirement that the external file doesn't overlap with the DB in key ranges. More specifically, the external file should not contain a user key (without timestamp) that already have some entries in the DB.

All the `*Overlap*` functions like `RangeOverlapWithMemtable`, `RangeOverlapWithCompaction` are using `CompareWithoutTimestamp` to check for overlap already. One thing that is missing here is we need to extend the external file's user key boundary for this check to avoid missing the checks for the boundary user keys. For example, with the current way of checking things where `external_file_info.smallest.user_key()` is used as the left boundary, and `external_file_info.largest.user_key()` is used as the right boundary, a file with the entry (b, 40) would incorrectly pass the overlap check against a DB containing the two entries (b, 30) and (c, 20), even though user key b overlaps.

To avoid this, we extend the user key boundaries used for overlap check, by updating the left boundary with the maximum timestamp and the right boundary with the minimum timestamp.
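A conceptual sketch of the extension; `AppendTs`, `kMaxTs`, and `kMinTs` are hypothetical stand-ins for RocksDB's internal timestamp encoding helpers:

```
// Pad the file's user-key boundaries before the overlap check so boundary
// user keys are not missed.
std::string start = AppendTs(file_smallest_user_key, kMaxTs);  // left bound
std::string limit = AppendTs(file_largest_user_key, kMinTs);   // right bound
// Checking overlap over [start, limit] now catches e.g. ingesting (b, 40)
// into a DB that already contains (b, 30) and (c, 20).
```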

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12735

Test Plan: Added unit test

Reviewed By: ltamasi

Differential Revision: D58152117

Pulled By: jowlyzhang

fbshipit-source-id: 9cba61e7357f6d76ad44c258381c35073ebbf347
2024-06-04 13:39:51 -07:00
hgy@ruijie.com.cn 21a16f9e64 add c-api to get default cf handle (#12514)
Summary:
rocksdb_batched_multi_get_cf performs better than normal multi_get; however, it needs a cf_handle arg, so add a C API to get and destroy the default cf_handle, as many users only use the default cf.

Fixes https://github.com/facebook/rocksdb/issues/12316
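A hedged usage sketch; the getter's name is inferred from the PR description and should be treated as an assumption, while `rocksdb_column_family_handle_destroy` is pre-existing:

```
rocksdb_column_family_handle_t* default_cf =
    rocksdb_get_default_column_family_handle(db);  // assumed name
// ... pass default_cf to rocksdb_batched_multi_get_cf(...) ...
rocksdb_column_family_handle_destroy(default_cf);
```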

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12514

Reviewed By: hx235

Differential Revision: D55922517

Pulled By: ajkr

fbshipit-source-id: c4cc4289f2cfd9efbb8f390a44a9d8d1ed08d9f0
2024-06-03 11:09:35 -07:00
Yu Zhang fc59d8f9c6 Add public API `WriteWithCallback` to support custom callbacks (#12603)
Summary:
This PR adds a `DB::WriteWithCallback` API that does the same things as `DB::Write` while takes an argument `UserWriteCallback` to execute custom callback functions during the write.

We currently support two types of callback functions: `OnWriteEnqueued` and `OnWalWriteFinish`. The former is invoked after the write is enqueued, and the latter is invoked after the WAL write finishes, when applicable.

These callback functions are intended for users to use to improve synchronization between concurrent writes. Their execution is on the write's critical path, so it will impact the write's latency if not used properly. The documentation for the callback interface mentions this and suggests users keep these callback functions' implementations minimal.

Although the transaction interfaces' writes don't yet allow users to specify such a user write callback argument, the `DBImpl::Write*` type of APIs do not differentiate between regular DB writes and writes coming from the transaction layer when it comes to supporting this `UserWriteCallback`. These callbacks work for all the write modes including: default write mode, Options.two_write_queues, Options.unordered_write, Options.enable_pipelined_write
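A hedged sketch of the API; the exact hook signatures and the `WriteWithCallback` parameter list are assumptions based on the description above:

```
class MyWriteCallback : public rocksdb::UserWriteCallback {
 public:
  void OnWriteEnqueued() override {
    // e.g., record that this write now has a fixed position in the queue.
  }
  void OnWalWriteFinish() override {
    // e.g., signal a waiter; keep this minimal since it runs on the
    // write's critical path.
  }
};

rocksdb::WriteBatch batch;
batch.Put("key", "value");
MyWriteCallback cb;
rocksdb::Status s =
    db->WriteWithCallback(rocksdb::WriteOptions(), &batch, &cb);
```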

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12603

Test Plan: Added unit test in ./write_callback_test

Reviewed By: anand1976

Differential Revision: D58044638

Pulled By: jowlyzhang

fbshipit-source-id: 87a84a0221df8f589ec8fc4d74597e72ce97e4cd
2024-05-31 19:30:19 -07:00
Peter Dillinger 7127119ae9 Refactor SyncWAL and SyncClosedLogs for code sharing (#12707)
Summary:
These functions were very similar and did not make sense for maintaining separately. This is not a pure refactor but I think bringing the behaviors closer together should reduce long term risk of unintentionally divergent behavior. This change is motivated by some forthcoming WAL handling fixes for Checkpoint and Backups.

* Sync() is always used on closed WALs, like the old SyncClosedLogs. SyncWithoutFlush() is only used on the active (maybe) WAL. Perhaps SyncWithoutFlush() should be used whenever available, but I don't know which is preferred, as the previous state of the code was inconsistent.
* Syncing the WAL dir is selective based on need, like old SyncWAL, rather than done always like old SyncClosedLogs. This could be a performance improvement that was never applied to SyncClosedLogs but now is. We might still sync the dir more times than necessary in the case of parallel SyncWAL variants, but on a good FileSystem that's probably not too different performance-wise from us implementing something to have threads wait on each other.

Cosmetic changes:

* Rename internal function SyncClosedLogs to SyncClosedWals
* Merging the sync points into the common implementation between the two entry points isn't pretty, but should be fine.

Recommended follow-up:

* Clean up more confusing naming like log_dir_synced_

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12707

Test Plan: existing tests

Reviewed By: anand1976

Differential Revision: D57870856

Pulled By: pdillinger

fbshipit-source-id: 5455fba016d25dd5664fa41b253f18db2ca8919a
2024-05-30 14:53:13 -07:00
Rulin Huang 20777b96cb Optimizations in notify-one (#12545)
Summary:
We tested on an Icelake server (vcpu=160). The default configuration is allow_concurrent_memtable_write=1, with thread number equal to the active core count. With our optimizations, the improvement can reach up to 184% in the fillseq case. op/s is used as the performance indicator in db_bench, and the following are performance improvements in some cases in db_bench.
| case name          | optimized/original  |
|-------------------:|--------------------:|
| fillrandom         | 182%                |
| fillseq            | 184%                |
| fillsync           | 136%                |
| overwrite          | 179%                |
| randomreplacekeys  | 180%                |
| randomtransaction  | 161%                |
| updaterandom       | 163%                |
| xorupdaterandom    | 165%                |

With analysis, we find that although writing to the memtables is done in parallel, waking up the writers is not, which means that only one writer is responsible for sequentially waking up all the other writers. The following is our method to optimize this process.

Assume there are currently n threads in total; we parallelize SetState in LaunchParallelMemTableWriters. To wake up each writer to write its own memtable, the leader writer first wakes up (n^0.5 - 1) caller writers, and then those callers and the leader each wake up roughly n^0.5 writers to write to the memtable. This reduces the leader's work from calling SetState on n-1 writers to about 2*(n^0.5) writers in turn.
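An illustrative sketch of the two-level wake-up (the types and function names are hypothetical stand-ins for RocksDB's write-thread internals):

```
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Writer {
  std::pair<size_t, size_t> wake_range{0, 0};  // slice delegated to this writer
};

void SetState(Writer* /*w*/) { /* atomically set state, then futex/condvar wake */ }

// The leader notifies only ~sqrt(n) helpers; each helper, once running,
// notifies the writers in its delegated slice, so no single thread performs
// all n-1 notifications (leader cost drops to about 2*sqrt(n)).
void LeaderWake(std::vector<Writer*>& writers) {
  const size_t n = writers.size();
  const size_t stride = static_cast<size_t>(std::sqrt(n)) + 1;
  for (size_t i = 0; i < n; i += stride) {
    writers[i]->wake_range = {i + 1, std::min(i + stride, n)};
    SetState(writers[i]);
  }
}
```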

A reproduction script:
./db_bench --benchmarks="fillrandom"  --threads ${number of all activate vcpu}  --seed 1708494134896523  --duration 60

![image](https://github.com/facebook/rocksdb/assets/22110918/c5eca02f-93b3-4434-bba2-5155fc892a97)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12545

Reviewed By: ajkr

Differential Revision: D57422827

Pulled By: cbi42

fbshipit-source-id: 94127937c0c61e4241720bd902c82c607b7b2431
2024-05-30 09:10:44 -07:00
Changyu Bi af3be5255a Fail DeleteRange() early when row_cache is configured (#12710)
Summary:
https://github.com/facebook/rocksdb/issues/12512 added the sanity check for this incompatible combination. However, it does the check during memtable insertion, which can turn the DB into read-only mode. This PR moves the check earlier so that this write failure will not turn the DB into read-only mode and affect other DB operations.
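A sketch of the visible behavior change (APIs shown are real; the DB-open step is elided):

```
rocksdb::Options options;
options.row_cache = rocksdb::NewLRUCache(8 << 20);  // incompatible with DeleteRange
// ... open the DB with `options` ...
rocksdb::Status s = db->DeleteRange(rocksdb::WriteOptions(),
                                    db->DefaultColumnFamily(), "a", "z");
// s reports the failure up front, and the DB is not switched to read-only
// mode, so subsequent writes still succeed:
rocksdb::Status s2 = db->Put(rocksdb::WriteOptions(), "k", "v");
```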

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12710

Test Plan: * updated unit test `DBRangeDelTest.RowCache` to write to DB after a failed DeleteRange(). The test fails before this PR.

Reviewed By: ajkr

Differential Revision: D57925188

Pulled By: cbi42

fbshipit-source-id: 8bf001bd3fcf05635411ba28bc4a037321942879
2024-05-29 15:03:15 -07:00
anand76 9cc6168c98 Add LDB command and option for follower instances (#12682)
Summary:
Add the `--leader_path` option to specify the directory path of the leader for a follower RocksDB instance. This PR also adds a `count` command to the repl shell. While not specific to followers, it is useful for testing purposes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12682

Reviewed By: jowlyzhang

Differential Revision: D57642296

Pulled By: anand1976

fbshipit-source-id: 53767d496ecadc363ff92cd958b8e15a7bf3b151
2024-05-28 23:21:32 -07:00
HypenZou 8765a0f546 Fix version edit dump in json (#12703)
Summary:
**Context/Summary:**
the flag --json of manifest_dump in the ldb tool has no effect.
The bug may have been introduced by PR https://github.com/facebook/rocksdb/pull/8378

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12703

Reviewed By: cbi42

Differential Revision: D57848094

Pulled By: ajkr

fbshipit-source-id: 3d1ce65528bf4ce9c53593a7208406ab90e8994b
2024-05-28 16:44:25 -07:00
muthukrishnan24 259f21e695 Add WB, WBWI Create, UpdateTimestamp, Iterator::Refresh in C API (#10529)
Summary:
This PR adds the UpdateTimestamp API of WriteBatch and WBWI, creation of WB and WBWI with all options, and Iterator::Refresh to the C API

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10529

Reviewed By: cbi42

Differential Revision: D57826913

Pulled By: ajkr

fbshipit-source-id: d2ec840129f61a1d3a5a12e859728be98ebbad2f
2024-05-28 15:36:09 -07:00
Jaepil Jeong c115eb6162 Fix compile errors in C++23 (#12106)
Summary:
This PR fixes compile errors in C++23.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12106

Reviewed By: cbi42

Differential Revision: D57826279

Pulled By: ajkr

fbshipit-source-id: 594abfd8eceaf51eaf3bbabf7696c0bb5e0e9a68
2024-05-28 15:33:57 -07:00
Daniel Vasquez Lopez 7c6c632ea9 Use `std::optional` instead of `std::unique_ptr` to conditionally create a read lock. (#12704)
Summary:
This change replaces the use of `std::unique_ptr` with `std::optional` for conditionally constructing a `ReadLock` object. The read lock object was recently introduced in PR https://github.com/facebook/rocksdb/issues/12624. This change makes the code more concise and clarifies that the lock is not meant to be transferred (as `std::unique_ptr` is movable). It also avoids a heap allocation.

There are no functional changes.
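The general pattern, sketched with standard library types:

```
#include <optional>
#include <shared_mutex>

void MaybeLockedRead(std::shared_mutex& mu, bool need_lock) {
  std::optional<std::shared_lock<std::shared_mutex>> lock;
  if (need_lock) {
    lock.emplace(mu);  // constructed in place: no heap allocation,
  }                    // and nothing suggests the lock can be handed off
  // ... perform the read; the lock (if any) releases at scope exit ...
}
```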

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12704

Reviewed By: cbi42

Differential Revision: D57848192

Pulled By: ajkr

fbshipit-source-id: da48c77aac33b51ba5dcc238f98fc48ccf234a21
2024-05-28 15:31:45 -07:00
Peter Dillinger d2ef70872f Rename, deprecate `LogFile` and `VectorLogPtr` (#12695)
Summary:
These names are confusing with `Logger` etc. so moving to `WalFile` etc.

Other small, related name refactorings.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12695

Test Plan: Left most unit tests using old names as an API compatibility test. Non-test code compiles with deprecated names removed. No functional changes.

Reviewed By: ajkr

Differential Revision: D57747458

Pulled By: pdillinger

fbshipit-source-id: 7b77596b9c20d865d43b9dc66c30c8bd2b3b424f
2024-05-28 09:24:49 -07:00
Changyu Bi 0ee7f8bacb Fix `max_read_amp` value in crash test (#12701)
Summary:
It should be no less than `level0_file_num_compaction_trigger` (which defaults to 4) when set to a positive value. Otherwise DB open will fail.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12701

Test Plan: crash test not failing DB open due to this option value.

Reviewed By: ajkr

Differential Revision: D57825062

Pulled By: cbi42

fbshipit-source-id: 22d8e12aeceb5cef815157845995a8448552e2d2
2024-05-26 17:26:55 -07:00
Changyu Bi fecb10c2fa Improve universal compaction sorted-run trigger (#12477)
Summary:
Universal compaction currently uses `level0_file_num_compaction_trigger` for two purposes:
1. the trigger for checking if there is any compaction to do, and
2. the limit on the number of sorted runs. RocksDB will do compaction to keep the number of sorted runs no more than the value of this option.

This can make the option inflexible. A value that is too small causes higher write amp: more compactions to reduce the number of sorted runs. A value that is too big delays potential compaction work and causes worse read performance. This PR introduces an option `CompactionOptionsUniversal::max_read_amp` for only the second purpose: to specify
the hard limit on the number of sorted runs.

For backward compatibility, `max_read_amp = -1` by default, which means to fallback to the current behavior.
When `max_read_amp > 0`, `level0_file_num_compaction_trigger` will only serve as a trigger to find potential compaction.
When `max_read_amp = 0`, RocksDB will auto-tune the limit on the number of sorted runs. The estimation is based on DB size, write_buffer_size and size_ratio, so it is adaptive to the size change of the DB. See more in `UniversalCompactionBuilder::PickCompaction()`.
Alternatively, users can now configure `max_read_amp` to a very big value and keep `level0_file_num_compaction_trigger` small. This allows `size_ratio` and `max_size_amplification_percent` to control the number of sorted runs, essentially disabling compactions with reason kUniversalSortedRunNum.
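A configuration sketch of the new knob (member names are real; values are arbitrary):

```
rocksdb::Options options;
options.compaction_style = rocksdb::kCompactionStyleUniversal;
options.level0_file_num_compaction_trigger = 5;  // now only a check trigger
options.compaction_options_universal.max_read_amp = 0;  // auto-tuned run limit
// max_read_amp = -1 (default) keeps the old coupled behavior;
// max_read_amp > 0 is a hard limit on the number of sorted runs.
```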

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12477

Test Plan:
* new unit test
* existing unit test for default behavior
* updated crash test with the new option
* benchmark:
  * Create a DB that is roughly 24GB in the last level. When `max_read_amp = 0`, we estimate that the DB needs 9 levels to avoid excessive compactions to reduce the number of sorted runs.
  * We then run fillrandom to ingest another 24GB data to compare write amp.
     * case 1: small level0 trigger: `level0_file_num_compaction_trigger=5, max_read_amp=-1`
       * write-amp: 4.8
     * case 2: auto-tune: `level0_file_num_compaction_trigger=5, max_read_amp=0`
       *  write-amp: 3.6
     * case 3: auto-tune with minimal trigger: `level0_file_num_compaction_trigger=1, max_read_amp=0`
       *  write-amp: 3.8
     * case 4: hard-code a good value for trigger: `level0_file_num_compaction_trigger=9`
       * write-amp: 2.8
```
Case 1:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   1.0      0.0     0.0      0.0      22.6     22.6       0.0   1.0      0.0    163.2    141.94            111.10       108    1.314       0      0       0.0       0.0
 L45      8/0    1.81 GB   0.0     39.6    11.1     28.5      39.3     10.8       0.0   3.5    209.0    207.3    194.25            191.29        43    4.517    348M  2498K       0.0       0.0
 L46     13/0    3.12 GB   0.0     15.3     9.5      5.8      15.0      9.3       0.0   1.6    203.1    199.3     77.13             75.88        16    4.821    134M  2362K       0.0       0.0
 L47     19/0    4.68 GB   0.0     15.4    10.5      4.9      14.7      9.8       0.0   1.4    204.0    194.9     77.38             76.15         8    9.673    135M  5920K       0.0       0.0
 L48     38/0    9.42 GB   0.0     19.6    11.7      7.9      17.3      9.4       0.0   1.5    206.5    182.3     97.15             95.02         4   24.287    172M    20M       0.0       0.0
 L49     91/0   22.70 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum    169/0   41.74 GB   0.0     89.9    42.9     47.0     109.0     61.9       0.0   4.8    156.7    189.8    587.85            549.45       179    3.284    791M    31M       0.0       0.0

Case 2:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      1/0   214.47 MB   1.2      0.0     0.0      0.0      22.6     22.6       0.0   1.0      0.0    164.5    140.81            109.98       108    1.304       0      0       0.0       0.0
 L44      0/0    0.00 KB   0.0      1.3     1.3      0.0       1.2      1.2       0.0   1.0    206.1    204.9      6.24              5.98         3    2.081     11M    51K       0.0       0.0
 L45      4/0   844.36 MB   0.0      7.1     5.4      1.7       7.0      5.4       0.0   1.3    194.6    192.9     37.41             36.00        13    2.878     62M   489K       0.0       0.0
 L46     11/0    2.57 GB   0.0     14.6     9.8      4.8      14.3      9.5       0.0   1.5    193.7    189.8     77.09             73.54        17    4.535    128M  2411K       0.0       0.0
 L47     24/0    5.81 GB   0.0     19.8    12.0      7.8      18.8     11.0       0.0   1.6    191.4    181.1    106.19            101.21         9   11.799    174M  9166K       0.0       0.0
 L48     38/0    9.42 GB   0.0     19.6    11.8      7.9      17.3      9.4       0.0   1.5    197.3    173.6    101.97             97.23         4   25.491    172M    20M       0.0       0.0
 L49     91/0   22.70 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum    169/0   41.54 GB   0.0     62.4    40.3     22.1      81.3     59.2       0.0   3.6    136.1    177.2    469.71            423.94       154    3.050    549M    32M       0.0       0.0

Case 3:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   5.0      0.0     0.0      0.0      22.6     22.6       0.0   1.0      0.0    163.8    141.43            111.13       108    1.310       0      0       0.0       0.0
 L44      0/0    0.00 KB   0.0      0.8     0.8      0.0       0.8      0.8       0.0   1.0    201.4    200.2      4.26              4.19         2    2.130   7360K    33K       0.0       0.0
 L45      4/0   844.38 MB   0.0      6.3     5.0      1.2       6.2      5.0       0.0   1.2    202.0    200.3     31.81             31.50        12    2.651     55M   403K       0.0       0.0
 L46      7/0    1.62 GB   0.0     13.3     8.8      4.6      13.1      8.6       0.0   1.5    198.9    195.7     68.72             67.89        17    4.042    117M  1696K       0.0       0.0
 L47     24/0    5.81 GB   0.0     21.7    12.9      8.8      20.6     11.8       0.0   1.6    198.5    188.6    112.04            109.97        12    9.336    191M  9352K       0.0       0.0
 L48     41/0   10.14 GB   0.0     24.8    13.0     11.8      21.9     10.1       0.0   1.7    198.6    175.6    127.88            125.36         6   21.313    218M    25M       0.0       0.0
 L49     91/0   22.70 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum    167/0   41.10 GB   0.0     67.0    40.5     26.4      85.4     58.9       0.0   3.8    141.1    179.8    486.13            450.04       157    3.096    589M    36M       0.0       0.0

Case 4:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.7      0.0     0.0      0.0      22.6     22.6       0.0   1.0      0.0    158.6    146.02            114.68       108    1.352       0      0       0.0       0.0
 L42      0/0    0.00 KB   0.0      1.7     1.7      0.0       1.7      1.7       0.0   1.0    185.4    184.3      9.25              8.96         4    2.314     14M    67K       0.0       0.0
 L43      0/0    0.00 KB   0.0      2.5     2.5      0.0       2.5      2.5       0.0   1.0    197.8    195.6     13.01             12.65         4    3.253     22M   202K       0.0       0.0
 L44      4/0   844.40 MB   0.0      4.2     4.2      0.0       4.1      4.1       0.0   1.0    188.1    185.1     22.81             21.89         5    4.562     36M   503K       0.0       0.0
 L45     13/0    3.12 GB   0.0      7.5     6.5      1.0       7.2      6.2       0.0   1.1    188.7    181.8     40.69             39.32         5    8.138     65M  2282K       0.0       0.0
 L46     17/0    4.18 GB   0.0      8.3     7.1      1.2       7.9      6.6       0.0   1.1    192.2    181.8     44.23             43.06         4   11.058     73M  3846K       0.0       0.0
 L47     22/0    5.34 GB   0.0      8.9     7.5      1.4       8.2      6.8       0.0   1.1    189.1    174.1     48.12             45.37         3   16.041     78M  6098K       0.0       0.0
 L48     27/0    6.58 GB   0.0      9.2     7.6      1.6       8.2      6.6       0.0   1.1    195.2    172.9     48.52             47.11         2   24.262     81M  9217K       0.0       0.0
 L49     91/0   22.70 GB   0.0      0.0     0.0      0.0       0.0      0.0       0.0   0.0      0.0      0.0      0.00              0.00         0    0.000       0      0       0.0       0.0
 Sum    174/0   42.74 GB   0.0     42.3    37.0      5.3      62.4     57.1       0.0   2.8    116.3    171.3    372.66            333.04       135    2.760    372M    22M       0.0       0.0

setup:
./db_bench --benchmarks=fillseq,compactall,waitforcompaction --num=200000000 --compression_type=none --disable_wal=1 --compaction_style=1 --num_levels=50 --target_file_size_base=268435456 --max_compaction_bytes=6710886400 --level0_file_num_compaction_trigger=10 --write_buffer_size=268435456 --seed 1708494134896523

benchmark:
./db_bench --benchmarks=overwrite,waitforcompaction,stats --num=200000000 --compression_type=none --disable_wal=1 --compaction_style=1 --write_buffer_size=268435456 --level0_file_num_compaction_trigger=5 --target_file_size_base=268435456 --use_existing_db=1 --num_levels=50 --writes=200000000 --universal_max_read_amp=-1 --seed=1716488324800233

```

Reviewed By: ajkr

Differential Revision: D55370922

Pulled By: cbi42

fbshipit-source-id: 9be69979126b840d08e93e7059260e76a878bb2a
2024-05-24 10:10:31 -07:00
Andrew Kryczka c72ee4531b Fix recycled WAL detection when wal_compression is enabled (#12643)
Summary:
I think the point of the `if (end_of_buffer_offset_ - buffer_.size() == 0)` was to only set `recycled_` when the first record was read. However, the condition was false when reading the first record if the WAL began with a `kSetCompressionType` record, because we had already dropped that record from `buffer_`. To fix this, I used `first_record_read_` instead.

Also, it was pretty confusing to treat the WAL as non-recycled when a recyclable record first appeared in a non-first record. I changed it to return an error if that happens.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12643

Reviewed By: hx235

Differential Revision: D57238099

Pulled By: ajkr

fbshipit-source-id: e20a2a0c9cf0c9510a7b6af463650a05d559239e
2024-05-22 15:34:37 -07:00
Hui Xiao 733150f6aa Flush WAL upon DB close (#12684)
Summary:
**Context/Summary:** https://github.com/facebook/rocksdb/pull/12556 with `avoid_sync_during_shutdown=false` missed an edge case where `manual_wal_flush == true`, so WAL sync would still miss unflushed WAL data. This PR fixes it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12684

Test Plan: modified UT to include this case `manual_wal_flush==true`

Reviewed By: cbi42

Differential Revision: D57655861

Pulled By: hx235

fbshipit-source-id: c9f49fe260e8b38b3ea387558432dcd9a3dbec19
2024-05-22 11:08:16 -07:00
Levi Tamasi 014368f62c Fix the names of function objects added in PR 12681 (#12689)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12689

These should be in `snake_case` (not `camelCase`) per our style guide.

Reviewed By: jowlyzhang

Differential Revision: D57676418

fbshipit-source-id: 82ad6a87d1540f0b29c2f864ca0128287fe95a9e
2024-05-22 11:06:52 -07:00
Levi Tamasi 62600cb2d4 Fix rebuilding transactions containing PutEntity (#12681)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12681

When rebuilding transactions during recovery, `MemtableInserter::PutCFImpl` currently calls `WriteBatchInternal::Put` regardless of value type, which is incorrect for `PutEntity` entries, as well as `TimedPut`s and the blob indexes used by the old BlobDB implementation. The patch fixes the handling of `PutEntity` and returns `NotSupported` for `TimedPut`s and blob indices.

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D57636355

fbshipit-source-id: 833de4e4aa0b42ff6638b72c4181f981d12d0f15
2024-05-21 17:22:20 -07:00
Peter Dillinger d89ab23bec Disallow memtable flush and sst ingest while WAL is locked (#12652)
Summary:
We recently noticed that some memtable flushes and file
ingestions could proceed during LockWAL, in violation of its stated
contract. (Note: we aren't 100% sure it's actually needed by MySQL, but
we want it to be in a clean state nonetheless.)

Despite earlier skepticism that this could be done safely (https://github.com/facebook/rocksdb/issues/12666), I
found a place to wait for LockWAL to be cleared before allowing
these operations to proceed: WaitForPendingWrites()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12652

Test Plan:
Added to unit tests. Extended how db_stress validates LockWAL
and re-enabled combination of ingestion and LockWAL in crash test, in
follow-up to https://github.com/facebook/rocksdb/issues/12642

Ran blackbox_crash_test for a long while with relevant features
amplified.

Suggested follow-up: fix FaultInjectionTestFS to report file sizes
consistent with what the user has requested to be flushed.

Reviewed By: jowlyzhang

Differential Revision: D57622142

Pulled By: pdillinger

fbshipit-source-id: aef265fce69465618974b4ec47f4636257c676ce
2024-05-21 10:17:34 -07:00
Hui Xiao d7b938882e Sync WAL during db Close() (#12556)
Summary:
**Context/Summary:**
The crash test below found that we don't sync the WAL upon DB close, which can lead to loss of unsynced data. This PR syncs it.
```
./db_stress --threads=1 --disable_auto_compactions=1 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=1 --bloom_bits=29.895303579352174 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=0 --compaction_readahead_size=0 --compaction_style=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --default_temperature=kUnknown --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=6 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 
--max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=3 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=2 --report_bg_io_stats=1 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=0 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=3 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=100

 Verification failed for column family 0 key 000000000000B9D1000000000000012B000000000000017D (4756691): value_from_db: , value_from_expected: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, msg: Iterator verification: Value not found: NotFound:
Verification failed :(
```
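
For context, a minimal sketch (hypothetical helper; `SyncWAL()` and `Close()` are existing `DB` APIs) of the user-side workaround that was needed before this fix:
```
#include <cassert>

#include "rocksdb/db.h"

// Hypothetical helper: sync the WAL explicitly before closing so buffered
// WAL data reaches storage even on versions without this fix.
void CloseWithSyncedWal(rocksdb::DB* db) {
  rocksdb::Status s = db->SyncWAL();
  assert(s.ok());
  s = db->Close();
  assert(s.ok());
  delete db;
}
```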

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12556

Test Plan:
- New UT
- Same stress test command failed before this fix but pass after
- CI

Reviewed By: ajkr

Differential Revision: D56267964

Pulled By: hx235

fbshipit-source-id: af1b7e8769c129f64ba1c7f1ff17102f1239b929
2024-05-20 17:33:43 -07:00
Levi Tamasi ef1d4955ba Fix the output of `ldb dump_wal` for PutEntity records (#12677)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12677

The patch contains two fixes related to printing `PutEntity` records with `ldb dump_wal`:
1) It adds the key to the printout (it was missing earlier).
2) It restores the formatting flags of the output stream after dumping the wide-column structure so that any `hex` flag that might have been set does not affect subsequent printing of e.g. sequence numbers.
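
A minimal, standalone sketch of fix 2), with a hypothetical helper name, showing the save/restore pattern for stream formatting flags:
```
#include <iostream>

// Save the stream's formatting flags, print in hex, then restore the flags
// so that subsequent output (e.g. sequence numbers) stays in decimal.
void PrintHexThenRestore(std::ostream& os, unsigned long long v) {
  const std::ios_base::fmtflags orig = os.flags();
  os << std::hex << v;
  os.flags(orig);
}
```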

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D57591295

fbshipit-source-id: af4e3e219f0082ad39bbdfd26f8c5a57ebb898be
2024-05-20 17:04:14 -07:00
anand76 0ed93552f4 Implement obsolete file deletion (GC) in follower (#12657)
Summary:
This PR implements deletion of obsolete files in a follower RocksDB instance. The follower tails the leader's MANIFEST and creates links to newly added SST files. These links need to be deleted once those files become obsolete in order to reclaim space. There are three cases to be considered -
1. New files added and links created, but the Version could not be installed due to some missing files. Those links need to be preserved so a subsequent catch up attempt can succeed. We insert the next file number in the `VersionSet` to `pending_outputs_` to prevent their deletion.
2. Files deleted from the previous successfully installed `Version`. These are deleted as usual in `PurgeObsoleteFiles`.
3. New files added by a `VersionEdit` and deleted by a subsequent `VersionEdit`, both processed in the same catchup attempt. Links will be created for the new files when verifying a candidate `Version`. Those need to be deleted explicitly as they're never added to `VersionStorageInfo`, and thus not deleted by `PurgeObsoleteFiles`.

Test Plan:
New unit tests in `db_follower_test`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12657

Reviewed By: jowlyzhang

Differential Revision: D57462697

Pulled By: anand1976

fbshipit-source-id: 898f15570638dd4930f839ffd31c560f9cb73916
2024-05-17 19:13:33 -07:00
Changyu Bi ffd7930312 Add more debug print to `DBTestWithParam.ThreadStatusSingleCompaction` (#12661)
Summary:
This test is flaky and a recent failure prints the following:
```
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 1842811, thread status:
thread id: 1842803, thread status:
db/db_test.cc:4697: Failure
Expected equality of these values:
  op_count
    Which is: 0
  expected_count
    Which is: 1
[  FAILED  ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (307 ms)
```
Empty thread status implies that the operation_type of the threads is OP_UNKNOWN. From 3ed46e0668/monitoring/thread_status_updater.cc (L197), this can be due to thread_data->operation_type being OP_UNKNOWN or thread_data->cf_key not being in `cf_info_map_`, potentially due to how cf_key_ is accessed with relaxed memory order. This PR adds a debug print of the cf_name to check this.

This PR also prints num_running_compaction and the LSM state to check whether a compaction is indeed running, removes some unneeded options, and ensures that exactly 4 L0 files are created.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12661

Test Plan:
- Cannot repro the failure locally: `gtest-parallel --repeat=10000 --workers=200 ./db_test --gtest_filter="*ThreadStatusSingleCompaction*"`
- New failure message will look like:
```
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
op_count: 1, expected_count 2
thread id: 6104100864, thread status: , cf_name
thread id: 6103527424, thread status: Compaction, cf_name default
running compaction: 1 lsm state: 4
db/db_test.cc:4885: Failure
Value of: match
  Actual: false
Expected: true
[  FAILED  ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (115 ms)
```

Reviewed By: hx235

Differential Revision: D57422755

Pulled By: cbi42

fbshipit-source-id: 635663f26052b20e485dfa06a7c0f1f318ac1099
2024-05-16 17:23:56 -07:00
Peter Dillinger 131c8ccfcd Add EntryType for TimedPut (#12669)
Summary:
Represent the internal kTypeValuePreferredSeqno in the public API as kEntryTimedPut (so named because it is created by TimedPut; it remains until the entry can be safely converted to a regular value entry in compaction).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12669

Test Plan: deferred to follow-up work that actually uses it. But putting this in place in the public API gives us more flexibility in rolling out that follow-up work (e.g. as a user extension or patch if needed).

Reviewed By: jowlyzhang

Differential Revision: D57459637

Pulled By: pdillinger

fbshipit-source-id: 160ccf7c4e524ee479558846b2a207d51b8b3d9c
2024-05-16 15:18:12 -07:00
Andrew Kryczka 4eaf628120 Add `Iterator` property "rocksdb.iterator.is-value-pinned" (#12659)
Summary:
`ReadOptions::pin_data` already has the effect of pinning the `Slice` returned by `Iterator::value()` when the value is stored inline (e.g., `kTypeValue`). This PR adds a bit of visibility into that via a new `Iterator` property, "rocksdb.iterator.is-value-pinned", as well as some documentation and tests.

See also: https://github.com/facebook/rocksdb/issues/12658
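
A minimal sketch of querying the new property; that the property returns "1" when pinned is an assumption modeled on similar properties such as "rocksdb.iterator.is-key-pinned":
```
#include <memory>
#include <string>

#include "rocksdb/db.h"

bool IsValuePinned(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  ro.pin_data = true;  // prerequisite for inline values to be pinned
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  it->SeekToFirst();
  std::string prop;
  return it->GetProperty("rocksdb.iterator.is-value-pinned", &prop).ok() &&
         prop == "1";
}
```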

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12659

Reviewed By: cbi42

Differential Revision: D57391200

Pulled By: ajkr

fbshipit-source-id: 0caa8db27ca1aba86ee2addc3dfd6f0e003d32e2
2024-05-15 19:11:52 -07:00
Peter Dillinger 3ed46e0668 Handle early exit in DBErrorHandlingFSTests (#12655)
Summary:
To avoid use-after-free on custom env on ASSERT_WHATEVER failure.

This is motivated by a rare crash seen in DBErrorHandlingFSTest.WALWriteError (VersionSet::GetObsoleteFiles in a SstFileManagerImpl::ClearError thread) and wanting to rule out this being related to that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12655

Test Plan: manually seeing ASSERT_WHATEVER failures, especially under ASAN

Reviewed By: cbi42

Differential Revision: D57358202

Pulled By: pdillinger

fbshipit-source-id: 4da2a0d73a54380b257e5cc1ab6c666e26b83973
2024-05-14 16:44:32 -07:00
Andrii Lysenko b9fc13db69 Add padding before timestamp size record if it doesn't fit into a WAL block. (#12614)
Summary:
If a timestamp size record doesn't fit into a block, then without padding, `Writer::EmitPhysicalRecord` fails an assertion in debug builds (either `assert(block_offset_ + kHeaderSize + n <= kBlockSize);` or `assert(block_offset_ + kRecyclableHeaderSize + n <= kBlockSize)`, depending on whether recycling log files is enabled). In release builds, the current block grows beyond 32K, `block_offset_` gets reset on the next `AddRecord`, and all subsequent blocks are no longer aligned to the block size.
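
A minimal, self-contained sketch of the padding rule this fix applies for a small record like the timestamp size record; the constants mirror RocksDB's WAL log format, while the `std::string` and the function name are stand-ins for the real Writer state:
```
#include <cstddef>
#include <string>

constexpr size_t kBlockSize = 32768;  // 32KB WAL blocks
constexpr size_t kHeaderSize = 7;     // checksum (4B) + length (2B) + type (1B)

void PadIfRecordDoesNotFit(std::string& wal, size_t payload_size) {
  const size_t leftover = kBlockSize - (wal.size() % kBlockSize);
  if (leftover < kHeaderSize + payload_size) {
    wal.append(leftover, '\0');  // zero-filled trailer; readers skip it
  }
}
```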

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12614

Reviewed By: ltamasi

Differential Revision: D57302140

Pulled By: jowlyzhang

fbshipit-source-id: cacb5cefb7586885e52a8137ae23a95e1aefca2d
2024-05-14 15:54:02 -07:00
Yu Zhang 1567d50a27 Temporarily disable file ingestion if LockWAL is tested (#12642)
Summary:
As titled. A proper fix would probably be to fail file ingestion while the DB is in a LockWAL state, since LockWAL promises to "freeze the logical state of the DB".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12642

Reviewed By: pdillinger

Differential Revision: D57235869

Pulled By: jowlyzhang

fbshipit-source-id: c70031463842220f865621eb6f53424df27d29e9
2024-05-14 09:27:48 -07:00
Yu Zhang c110091d36 Support read timestamp in ldb (#12641)
Summary:
As titled. Also updated sst_dump to print out user-defined timestamps separately and in a human-readable format.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12641

Test Plan:
manually tested:
Example success run:
./ldb --db=$TEST_DB multi_get_entity 0x00000000000000950000000000000120 --key_hex --column_family=7  --read_timestamp=1115613683797942 --value_hex
0x00000000000000950000000000000120 ==> :0x0E0000000A0B080906070405020300011E1F1C1D1A1B181916171415121310112E2F2C2D2A2B282926272425222320213E3F3C3D3A3B383936373435323330314E4F4C4D4A4B484946474445424340415E5F5C5D5A5B58595657545552535051
Example failed run:
Failed: GetEntity failed: Invalid argument: column family enables user-defined timestamp while --read_timestamp is not a valid uint64 value.

sst_dump print out:
'000000000000015D000000000000012B000000000000013B|timestamp:1113554493256671' seq:2330405, type:1 => 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E

Reviewed By: ltamasi

Differential Revision: D57297006

Pulled By: jowlyzhang

fbshipit-source-id: 8486d91468e4f6c0d42dca3c9629f1c45a92bf5a
2024-05-13 15:43:12 -07:00
Hui Xiao 20213d01a3 Fix crash in CompactFiles() of conflict range under `preclude_last_level_data_seconds > 0` (#12628)
Summary:
**Context/Summary:**

Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflicts when sanitizing input files, while it later used `FilesRangeOverlapWithCompaction()` to assert that there was no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e., compaction can output to the second-to-last level). So we ran into assertion violations in `CompactFiles()` like the one below
```
 Assertion `output_level == 0 || !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed.
```

This PR makes `CompactFiles()` use `FilesRangeOverlapWithCompaction()` and return an Aborted status upon range conflict instead of crashing (in debug builds) or proceeding incorrectly (in non-debug builds). To do so cleanly, I included a refactoring that makes `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`.
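
A minimal sketch of the new user-visible behavior, with placeholder inputs; `CompactFiles()` is the existing public API:
```
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// `input_file_names` and the output level are placeholders.
rocksdb::Status TryCompactFiles(
    rocksdb::DB* db, const std::vector<std::string>& input_file_names) {
  rocksdb::CompactionOptions copts;
  rocksdb::Status s =
      db->CompactFiles(copts, input_file_names, /*output_level=*/1);
  if (s.IsAborted()) {
    // A conflicting compaction owns part of the range; retry later.
  }
  return s;
}
```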

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628

Test Plan: New UT crashed before the fix and returns the correct status after the fix.

Reviewed By: cbi42

Differential Revision: D57123536

Pulled By: hx235

fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430
2024-05-13 13:12:06 -07:00
Peter Dillinger b75438f986 Allow disableWAL+recycle with WritePreparedTxnDB internals (#12639)
Summary:
Follow-up from https://github.com/facebook/rocksdb/issues/12403

The crash test was periodically failing with the
"disableWAL option is not supported if recycle_log_file_num > 0" failure, despite not setting the disableWAL from the user side.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12639

Test Plan: db_stress reproducer now passes. Added WAL recycling to txn DB unit tests, which is generally more difficult for correctness. Many tests now cover this change and pass.

Reviewed By: anand1976

Differential Revision: D57227617

Pulled By: pdillinger

fbshipit-source-id: db9abefeb505bce624b45bc64009694d2a5baed9
2024-05-10 17:56:40 -07:00
Levi Tamasi 97e70906fa Improve the sanity checks in (Multi)GetEntity and friends (#12630)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12630

The patch cleans up, improves, and brings into sync (to the extent possible without API signature changes) the sanity checks around the `GetEntity` / `MultiGetEntity` family of APIs, including the read-your-own-writes (`WriteBatchWithIndex`) and transaction layers. The checks are centralized in two main sets of entry points, namely in `DB(Impl)` and the "main" `GetEntityFromBatchAndDB` / `MultiGetEntityFromBatchAndDB` overloads in `WriteBatchWithIndex`. This eliminates the need to duplicate the checks in the transaction classes.

Reviewed By: jaykorean

Differential Revision: D57125741

fbshipit-source-id: 4dd059ef644a9b173fbba767538943397e4cc6cd
2024-05-09 12:25:19 -07:00
Wei Liu 1a3357648f Error log update to db_impl_compaction_flush.cc (#12608)
Summary:
Make the error message look better.

There is some inconsistency here and [here](e46ab9d4f0/db/db_impl/db_impl_compaction_flush.cc (L2701-L2702))

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12608

Reviewed By: ajkr

Differential Revision: D57134933

Pulled By: cbi42

fbshipit-source-id: 2f19f077f388d196652a4e3afd2526f18bf75b2d
2024-05-09 11:36:24 -07:00
Yu Zhang 9dc171e3bb Fix issue that cause false alarm corruption report (#12626)
Summary:
The state of `saved_seq_for_penul_check_` is not correctly maintained with the current flow. It's supposed to store the original sequence number for a `kTypeValuePreferredSeqno` entry for use in the `DecideOutputLevel` function. However, it's not always properly cleared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12626

Test Plan:
Added unit test that would fail before the fix
./tiered_compaction_test --gtest_filter="*InterleavedTimedPutAndPut*"

Reviewed By: pdillinger

Differential Revision: D57123469

Pulled By: jowlyzhang

fbshipit-source-id: 8d73214b3b6dc152daf19b6bd6ee9063581dc277
2024-05-08 15:51:38 -07:00
Andrew Kryczka 933ac0e05c Fix locking for `ColumnFamilyOptions::inplace_update_support` (#12624)
Summary:
In `SaveValue()`, the read lock needs to be obtained before `VerifyEntryChecksum()` because the KV checksum verification reads the entire value metadata+data, which is all mutable when `ColumnFamilyOptions::inplace_update_support == true`.

In `MemTable::Update()`, the write lock needs to be obtained before mutating the value metadata (changing the value size) because it can be read concurrently.
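
A minimal, standalone sketch of the lock ordering this fix enforces; the per-key lock and the toy checksum are hypothetical stand-ins for the memtable internals:
```
#include <cstddef>
#include <cstdint>
#include <shared_mutex>

std::shared_mutex key_lock;  // stands in for the memtable's per-key RW lock

bool VerifyEntryChecksum(const char* data, size_t len, uint32_t expected) {
  uint32_t sum = 0;  // toy checksum for illustration only
  for (size_t i = 0; i < len; ++i) {
    sum += static_cast<unsigned char>(data[i]);
  }
  return sum == expected;
}

bool SafeVerify(const char* data, size_t len, uint32_t expected) {
  // Take the read lock BEFORE reading value metadata+data, since an
  // in-place update may be mutating them concurrently.
  std::shared_lock<std::shared_mutex> guard(key_lock);
  return VerifyEntryChecksum(data, len, expected);
}
```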

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12624

Test Plan:
```
$ make COMPILE_WITH_TSAN=1 -j56 db_stress
...
$ python3 tools/db_crashtest.py blackbox --simple --max_key=10 --inplace_update_support=1 --interval=10 --allow_concurrent_memtable_write=0
```

Reviewed By: cbi42

Differential Revision: D57034571

Pulled By: ajkr

fbshipit-source-id: 3dddf881ad87923143acdf6bfec12ce47bb13a48
2024-05-08 08:30:12 -07:00
Changyu Bi 5bf2c00a35 Clarify manual compaction and file ingestion behavior with FIFO compaction (#12618)
Summary:
For manual compaction, FIFO compaction will always skip key-range overlap checking with SST files. If CompactRange() is called with CompactRangeOptions::change_level=true, a CF with FIFO compaction will now return Status::NotSupported.
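
A minimal sketch of the new behavior, assuming the DB's default column family uses FIFO compaction:
```
#include <cassert>

#include "rocksdb/db.h"

void TryChangeLevel(rocksdb::DB* db) {
  rocksdb::CompactRangeOptions cro;
  cro.change_level = true;
  rocksdb::Status s = db->CompactRange(cro, nullptr, nullptr);
  assert(s.IsNotSupported());  // with kCompactionStyleFIFO
}
```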

For file ingestion, we will always ingest into L0. Previously, it was possible to ingest files into non-L0 levels with FIFO compaction.

These changes also help to fix [this](a178d15baf/db/db_impl/db_impl_compaction_flush.cc (L1269)) assertion failure in crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12618

Test Plan: added unit tests to verify the new behavior.

Reviewed By: hx235

Differential Revision: D56962401

Pulled By: cbi42

fbshipit-source-id: 19812a1509650b4162b379ca5bee02f2e9d9569d
2024-05-07 12:00:15 -07:00
Zaidoon Abd Al Hadi 36ab251c07 Expose block based metadata cache options via C API (#12611)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12611

Reviewed By: jaykorean

Differential Revision: D56961823

Pulled By: ajkr

fbshipit-source-id: aa062cdb49a0bb2c1148a81d4c882a4733c7790e
2024-05-06 16:49:11 -07:00
Andrew Kryczka 0fef690bd5 Sync non-latest WALs during flush for 2PC, single-CF DBs (#12622)
Summary:
Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs".

This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622

Reviewed By: cbi42

Differential Revision: D56987303

Pulled By: ajkr

fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48
2024-05-06 11:56:16 -07:00
Changyu Bi 6fdc4c5282 Fix a corruption bug in `CreateColumnFamilyWithImport()` (#12602)
Summary:
When importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires each original CF to have a disjoint key range, after an intra-L0 compaction we can still end up with L0 files that have the same epoch number but overlapping key ranges. This PR attempts to fix this by reassigning epoch numbers when importing multiple CFs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602

Test Plan:
a new repro unit test. Before this PR, it fails with
```
[ RUN      ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF
db/import_column_family_test.cc:1048: Failure
db_->WaitForCompact(o)
Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range https://github.com/facebook/rocksdb/issues/44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file https://github.com/facebook/rocksdb/issues/36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3
```

Reviewed By: hx235

Differential Revision: D56851808

Pulled By: cbi42

fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84
2024-05-06 11:01:38 -07:00
Peter Dillinger a178d15baf More checks around num_entries vs. num_deletions (#12600)
Summary:
We've seen an internal crash test+sanitizer failure seemingly caused by underflow on `current_num_non_deletions_` which would happen if num_entries < num_deletions. (T186407810)

This change adds an additional check (fail earlier?) and coerces read table properties to satisfy the invariant that is supposed to be provided by https://github.com/facebook/rocksdb/pull/4841 but could be violated by older files, due to
https://github.com/facebook/rocksdb/pull/4016.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12600

Test Plan: existing tests

Reviewed By: ajkr

Differential Revision: D56796191

Pulled By: pdillinger

fbshipit-source-id: 6d22cc40eb74974c42b311293ee2775c6af95afc
2024-05-03 16:40:07 -07:00
Zaidoon Abd Al Hadi ed01babd07 Expose compaction pri through C API (#12604)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12604

Reviewed By: cbi42

Differential Revision: D56914066

Pulled By: ajkr

fbshipit-source-id: 64b51ab2b7b5ec0b5fde5a5f61d076bac1c3a8ad
2024-05-02 18:39:24 -07:00
Changyu Bi e2ef349f56 Deflake unit test `DBCompactionTest.CompactionLimiter` (#12596)
Summary:
The test has been flaky for a long time. A recent [failure](https://github.com/facebook/rocksdb/actions/runs/8820808355/job/24215219590?pr=12578) shows that there is still a flush running when the assertion fails. I think this is because `WaitForFlushMemTable()` may return before a flush schedules the next compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12596

Test Plan: I could not repro the failure locally: `gtest-parallel --repeat=8000 --workers=100 ./db_compaction_test --gtest_filter="*CompactionLimiter*"`

Reviewed By: ajkr

Differential Revision: D56715874

Pulled By: cbi42

fbshipit-source-id: f5f64eb30fff7e115c19beedad2dc22afa06258d
2024-05-02 17:10:06 -07:00
Yu Zhang 241253053a Fix delete obsolete files on recovery not rate limited (#12590)
Summary:
This PR fixes the issue that deletions of obsolete files during DB::Open are not rate limited.

The root cause is that slow deletion is disabled if the trash/DB size ratio exceeds the configured `max_trash_db_ratio` d610e14f93/include/rocksdb/sst_file_manager.h (L126); however, the current handling in DB::Open starts by tracking nothing but the obsolete files, which makes the ratio always look like 1.

In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move these two places that attempts to delete file after the tracking are done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call.

There are some other aesthetic changes, like refactoring the collection of all DB paths into a function and renaming `DBImpl::DeleteUnreferencedSstFiles` to `DBImpl::MaybeUpdateNextFileNumber`, as it doesn't actually delete the files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590

Test Plan: Added unit test and verified with manual testing

Reviewed By: anand1976

Differential Revision: D56830519

Pulled By: jowlyzhang

fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885
2024-05-01 12:26:54 -07:00
Yu Zhang 8b3d9e6bfe Add TimedPut to stress test (#12559)
Summary:
This also updates WriteBatch's protection info to include the write time, since there are several places in the memtable that by default protect the whole value slice.

This PR is stacked on https://github.com/facebook/rocksdb/issues/12543

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12559

Reviewed By: pdillinger

Differential Revision: D56308285

Pulled By: jowlyzhang

fbshipit-source-id: 5524339fe0dd6c918dc940ca2f0657b5f2111c56
2024-04-30 15:40:35 -07:00
Yu Zhang 2c02a9b76f Preserve TimedPut on penultimate level until it actually expires (#12543)
Summary:
To make sure `TimedPut` entries are placed on the proper tier before and when they become eligible for the cold tier:
1) flush and compaction need to keep the relevant seqno-to-time mapping not just for the sequence numbers contained in internal keys, but also for the preferred sequence numbers of `TimedPut` entries.

This PR also fixes some bugs in handling `TimedPut` during compaction:
1) Dealing with an edge case where a `TimedPut` entry's internal key is the right bound of the penultimate level: after swapping in its preferred sequence number, the internal key falls outside of the penultimate range, because the preferred sequence number is smaller than the original sequence number. The entry, however, is still safe to place on the penultimate level, so we keep track of the `TimedPut` entry's original sequence number for this check. The idea behind this is that as long as it's safe for the original key to be placed on the penultimate level, it's safe for the entry with the swapped-in preferred sequence number to be placed there too, because we only swap in the preferred sequence number when that entry is visible to the earliest snapshot and there are no other data points with the same user key in lower levels. On the other hand, as long as it's not safe for the original key to be placed on the penultimate level, we will not place the entry on the penultimate level after swapping in the preferred seqno either.

2) The assertion that the preferred seqno is always bigger than the original sequence number may fail if this logic is only exercised after the sequence number is zeroed out. We adjust the assertion to handle that case too: in this case, we don't swap in the preferred seqno, but we adjust the entry's type to `kTypeValue`.

3) There was special-case handling for when a range deletion may end up incorrectly covering an entry if the preferred seqno is swapped in. But it missed the case where the original entry is already covered by a range deletion; the original handling would mistakenly output the entry instead of omitting it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12543

Test Plan:
./tiered_compaction_test --gtest_filter="PrecludeLastLevelTest.PreserveTimedPutOnPenultimateLevel"
./compaction_iterator_test --gtest_filter="*TimedPut*"

Reviewed By: pdillinger

Differential Revision: D56195096

Pulled By: jowlyzhang

fbshipit-source-id: 37ebb09d2513abbd9e90cda0217e26874584b8f3
2024-04-30 11:16:02 -07:00
Peter Dillinger 45c105104b Set optimize_filters_for_memory by default (#12377)
Summary:
This feature has been around for a couple of years and users haven't reported any problems with it.
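
For users who prefer the old behavior, a minimal sketch of opting back out; the option and the table factory are existing public APIs:
```
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptionsWithOldFilterBehavior() {
  rocksdb::BlockBasedTableOptions bbto;
  bbto.optimize_filters_for_memory = false;  // the default before this PR
  rocksdb::Options options;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
  return options;
}
```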

Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377

Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in.

Reviewed By: jowlyzhang

Differential Revision: D54129517

Pulled By: pdillinger

fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
2024-04-30 08:33:31 -07:00
Changyu Bi 5c1334f763 DeleteRange() return NotSupported if row_cache is configured (#12512)
Summary:
...since this feature combination is not supported yet (https://github.com/facebook/rocksdb/issues/4122).
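
A minimal sketch of the new behavior, assuming the DB was opened with a row cache configured (e.g. via `NewLRUCache`):
```
#include <cassert>

#include "rocksdb/db.h"

void TryDeleteRange(rocksdb::DB* db) {
  rocksdb::Status s = db->DeleteRange(rocksdb::WriteOptions(),
                                      db->DefaultColumnFamily(), "a", "z");
  assert(s.IsNotSupported());  // when options.row_cache is set
}
```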

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12512

Test Plan: new unit test.

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D55820323

Pulled By: cbi42

fbshipit-source-id: eeb5e97d15c9bdc388793a2fb8e52cfa47e34bcf
2024-04-29 16:33:13 -07:00
Andrew Kryczka b2931a5c53 Fixed `MultiGet()` error handling to not skip blob dereference (#12597)
Summary:
See comment at top of the test case and release note.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597

Reviewed By: jaykorean

Differential Revision: D56718786

Pulled By: ajkr

fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f
2024-04-29 14:18:42 -07:00
anand76 e36b0a2da4 Fix corruption bug when recycle_log_file_num changed from 0 (#12591)
Summary:
When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 || s >= first_seqno_` in `rocksdb::MemTable::Add`.

We could fix this by either 1) writing a special record when reusing a log file, or 2) implementing more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) not reusing files created by a previous DB session. We choose option 3 as it's the simplest, and flipping `recycle_log_file_num` is expected to be a rare event.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591

Test Plan: 1. Add a unit test to verify the bug and fix

Reviewed By: jowlyzhang

Differential Revision: D56655812

Pulled By: anand1976

fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87
2024-04-29 12:25:00 -07:00
Andrew Kryczka 2ec25a3e54 Prevent data block compression with `BlockBasedTableOptions::block_align` (#12592)
Summary:
Made `BlockBasedTableOptions::block_align` incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`. Previously it was only incompatible with `ColumnFamilyOptions::compression`.
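
A minimal sketch of one newly rejected combination; the options used are existing public APIs:
```
#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/table.h"

void OpenAlignedWithCompression(const std::string& path) {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::BlockBasedTableOptions bbto;
  bbto.block_align = true;
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
  options.bottommost_compression = rocksdb::kZSTD;  // newly incompatible
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
  assert(s.IsInvalidArgument());
}
```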

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12592

Reviewed By: hx235

Differential Revision: D56650862

Pulled By: ajkr

fbshipit-source-id: f5201602c2ce436e6d8d30893caa6a161a61f141
2024-04-26 20:05:30 -07:00
Andrew Kryczka 177ccd3904 Print more debug info in test when `SyncWAL()` fails (#12580)
Summary:
Example failure (cannot reproduce):

```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBWriteTestInstance/DBWriteTest
[ RUN      ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
[  FAILED  ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0 (49 ms)
[----------] 1 test from DBWriteTestInstance/DBWriteTest (49 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (49 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0
```

I have no idea why `SyncWAL()` would not be supported from what is presumably a `SpecialEnv`, so I added more debug info in case it fails again in CI. The last failure was https://github.com/facebook/rocksdb/actions/runs/8731304938/job/23956487511 and it only happened once ever AFAIK.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12580

Reviewed By: hx235

Differential Revision: D56541996

Pulled By: ajkr

fbshipit-source-id: 1eab17567db783c11054fa85dd8b8880eacd3a50
2024-04-25 14:34:11 -07:00
Jay Huh f16ba42116 Fix IteratorsConsistentView tests (#12582)
Summary:
Fixing the failure in IteratorsConsistentViewExplicitSnapshot as shown in https://github.com/facebook/rocksdb/actions/runs/8825927545/job/24230854140?pr=12581

The failure was due to the timing of the `flush()` for the later column family in the loop. If the flush for the later CF installs the new super version before the SV for the iterator is obtained, the assertion succeeds; but if the order flips, the SV will be obsolete and the assertion can fail.

This PR simplifies the test so that we do only one `flush()`, allowing `SYNC_POINT` to guarantee the order of operations. The ImplicitSnapshot test now just triggers a flush for the second CF after obtaining the SV for the first CF. The ExplicitSnapshot test now triggers an atomic `flush()` for all CFs after obtaining the SV for the first CF.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12582

Test Plan:
```
./db_iterator_test --gtest_filter="*IteratorsConsistentView*"
./multi_cf_iterator_test -- --gtest_filter="*ConsistentView*
```

Reviewed By: ajkr, jowlyzhang

Differential Revision: D56557234

Pulled By: jaykorean

fbshipit-source-id: 7aa2f6d0e12a915b6e16cd240389bcfb5b4a5b62
2024-04-25 14:06:46 -07:00
Jay Huh 1fca175eec MultiCFSnapshot for NewIterators() API (#12573)
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/12561 and https://github.com/facebook/rocksdb/issues/12566, the `NewIterators()` API has not been providing a consistent view of the DB across multiple column families. This PR addresses that by utilizing the `MultiCFSnapshot()` function, which has been used for the `MultiGet()` APIs. To be able to obtain the thread-local super version with a reference, an `sv_exclusive_access` parameter has been added to `MultiCFSnapshot()` so that we can call either `GetReferencedSuperVersion()` or `GetAndRefSuperVersion()` depending on the parameter, and support the `Refresh()` API for MultiCfIterators.
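
A minimal sketch of the user-facing API this PR fixes; the helper name is hypothetical:
```
#include <cassert>
#include <vector>

#include "rocksdb/db.h"

// The returned iterators now share one consistent view across CFs; the
// caller owns and must delete each iterator.
std::vector<rocksdb::Iterator*> OpenConsistentIterators(
    rocksdb::DB* db, const std::vector<rocksdb::ColumnFamilyHandle*>& cfs) {
  std::vector<rocksdb::Iterator*> iters;
  rocksdb::Status s = db->NewIterators(rocksdb::ReadOptions(), cfs, &iters);
  assert(s.ok());
  return iters;
}
```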

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12573

Test Plan:
**Unit Tests Added**

```
./db_iterator_test --gtest_filter="*IteratorsConsistentView*"
```
```
./multi_cf_iterator_test -- --gtest_filter="*ConsistentView*"
```

**Performance Check**

Setup
```
make -j64 release
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none
```

Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000
```
Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom :       6.374 micros/op 156892 ops/sec 6.374 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom :       6.265 micros/op 159627 ops/sec 6.265 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: jowlyzhang

Differential Revision: D56444066

Pulled By: jaykorean

fbshipit-source-id: 327ce73c072da30c221e18d4f3389f49115b8f99
2024-04-24 15:28:55 -07:00
Andrew Kryczka 6807da0b44 Fix `DisableManualCompaction()` hang (#12578)
Summary:
Prior to this PR the following sequence could happen:

1. `RunManualCompaction()` A schedules compaction to thread pool and waits
2. `RunManualCompaction()` B waits without scheduling anything due to conflict
3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both
4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done
5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool
6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction
7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution
8. `DisableManualCompaction()` returns as there are no more manual compactions running

Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward:

5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done
6'. `DisableManualCompaction()` returns as there are no more manual compactions running

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578

Reviewed By: cbi42

Differential Revision: D56528144

Pulled By: ajkr

fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b
2024-04-24 12:40:36 -07:00
Andrew Kryczka 3f3045a405 fix DeleteRange+memtable_insert_with_hint_prefix_extractor interaction (#12558)
Summary:
Previously `insert_hints_` was used for both point key table (`table_`) and range deletion table (`range_del_table_`). Hints include pointers to table data, so mixing hints for different tables together without tracking which hint corresponds to which table was problematic. We can just make the hints dedicated to the point key table only.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12558

Reviewed By: hx235

Differential Revision: D56279019

Pulled By: ajkr

fbshipit-source-id: 00fe5ce72f9f11a1c1cba5f1977b908b2d518f29
2024-04-22 20:13:58 -07:00
Levi Tamasi bcfe4a0dcf Make sure DBImplFollower::stop_requested_ is initialized (#12572)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12572

Reviewed By: jowlyzhang, anand1976

Differential Revision: D56426800

fbshipit-source-id: a31f86d8869148092325924db4e7fbfad28777a4
2024-04-22 12:02:28 -07:00
anand76 d8fb849b7e Basic RocksDB follower implementation (#12540)
Summary:
A basic implementation of RocksDB follower mode, which opens a remote database (referred to as leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but is different in some key ways -
1. It has its own directory with links to the leader's database
2. Periodically refreshes itself
3. (Future) Snapshot support
4. (Future) Garbage collection of obsolete links
5. (Long term) Memtable replication

There are two main classes implementing this functionality - `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Similar to `DBImplSecondary`, it implements recovery and catch up through MANIFEST tailing using the `ReactiveVersionSet`, but does not consider logs. In a future PR, we will implement memtable replication, which will eliminate the need to catch up using logs. In addition, the recovery and catch-up tries to avoid directory listing as repeated metadata operations are expensive.

The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and creates the illusion of the follower directory as a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created which prevent the underlying files from getting deallocated even if the leader deletes the files.

TODOs: Deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability etc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12540

Reviewed By: jowlyzhang

Differential Revision: D56315718

Pulled By: anand1976

fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b
2024-04-19 19:13:31 -07:00
Jay Huh 909ff2c208 MultiCFSnapshot Refactor - separate multiget key range info from CFD & superversion info (#12561)
Summary:
While implementing MultiCFIterators (CoalescingIterator and AttributeGroupIterator), we found that the existing `NewIterators()` API does not ensure a uniform view of the DB across all column families. The `NewIterators()` function is utilized to generate child iterators for the MultiCfIterators, and it's expected that all child iterators maintain a consistent view of the DB.

For example, within the loop where the super version for each CF is being obtained, if a CF undergoes compaction after the super versions for previous CFs have already been retrieved, we lose the consistent view of the CFs for the iterators, because the API does not run under the DB mutex.

This preliminary refactoring of `MultiCFSnapshot` aims to address this issue in the `NewIterators()` API in the later PR. Currently, `MultiCFSnapshot` is used to achieve a consistent view across CFs in `MultiGet`. The `MultiGetColumnFamilyData` contains MultiGet-specific information that can be decoupled from the cfd and sv, allowing `MultiCFSnapshot` to be used in other places.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12561

Test Plan:
**Existing Unit Tests for `MultiCFSnapshot()`**

```
./db_basic_test -- --gtest_filter="*MultiGet*"
```

**Performance Test**

Setup
```
make -j64 release

TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000
```
Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom :       4.760 micros/op 210072 ops/sec 4.760 seconds 1000000 operations; (0 of 1000000 found)
```

After the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom :       4.593 micros/op 217727 ops/sec 4.593 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: anand1976

Differential Revision: D56309422

Pulled By: jaykorean

fbshipit-source-id: 7a9164d12c810b6c2d2db062827fcc4a36cbc77b
2024-04-18 20:11:01 -07:00
anand76 97991960e9 Retry DB::Open upon a corruption detected while reading the MANIFEST (#12518)
Summary:
This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427. On file systems that support storage-level data checksums and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but it's a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in a checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518

Reviewed By: ajkr

Differential Revision: D55932155

Pulled By: anand1976

fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9
2024-04-18 17:36:33 -07:00
Levi Tamasi 0df601ab07 Reset user-facing wide-column stuctures upon deserialization failures (#12562)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12562

The patch makes a small usability improvement by consistently resetting any user-facing wide-column structures (`DBIter::columns()`, `BaseDeltaIterator::columns()`, and any `PinnableWideColumns` objects) upon encountering any deserialization failures.

Reviewed By: jaykorean

Differential Revision: D56312764

fbshipit-source-id: 44efed0d1720cc06bf6facf928f73ce39a1bd2ca
2024-04-18 13:08:34 -07:00
Levi Tamasi e82fe7c0b7 Fix the move semantics of PinnableWideColumns (#12557)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12557

Unlike for other sequence containers, the C++ standard allows moving an `std::string` to invalidate pointers/iterators/references. In practice, this happens with short strings which are stored "inline" in the `std::string` object (small string optimization). Since `PinnableSlice` uses `std::string` as its internal buffer, and `PinnableWideColumns` in turn is implemented in terms of `PinnableSlice`, this means that the default compiler-generated move operations can invalidate the column index stored in `PinnableWideColumns::columns_`. The PR fixes this by providing custom move constructor/move assignment implementations for `PinnableWideColumns` that recreate the `columns_` index upon move.
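
A standalone illustration of the underlying pitfall (not RocksDB code):
```
#include <iostream>
#include <string>
#include <utility>

int main() {
  std::string src = "short";  // small enough to be stored inline (SSO)
  const char* p = src.data();
  std::string dst = std::move(src);  // inline bytes are copied, not stolen
  // p almost certainly differs from dst.data(), so any index of pointers
  // built against src must be rebuilt against dst -- which is what the
  // custom move operations do for PinnableWideColumns::columns_.
  std::cout << (p == dst.data() ? "same" : "different") << " buffer\n";
}
```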

Reviewed By: jaykorean

Differential Revision: D56275054

fbshipit-source-id: e8648c003dbcf1c39ec122ad229780c28138e730
2024-04-17 18:56:23 -07:00
Jay Huh 4f584652ab Add an option to wait for purge in WaitForCompact (#12520)
Summary:
Adding an option to wait for purge to complete in `WaitForCompact` API.

Internally, RocksDB has a way to wait for purge to complete (e.g. TEST_WaitForPurge() in db_impl_debug.cc), but there's no public API available to gracefully wait for purge to complete.
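
A minimal sketch of using the new option; the field name `wait_for_purge` is an assumption based on the summary above:
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::Status WaitForCompactionAndPurge(rocksdb::DB* db) {
  rocksdb::WaitForCompactOptions opts;
  opts.wait_for_purge = true;  // assumed field name (see lead-in)
  return db->WaitForCompact(opts);
}
```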

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12520

Test Plan:
Unit Test Added - `WaitForCompactWithWaitForPurgeOptionTest`
```
./deletefile_test -- --gtest_filter="*WaitForCompactWithWaitForPurgeOptionTest*"
```

Existing Tests
```
./db_compaction_test -- --gtest_filter="*WaitForCompactWithOption*"
```

Reviewed By: ajkr

Differential Revision: D55888283

Pulled By: jaykorean

fbshipit-source-id: cfc6d6e8657deaefab8961890b36e390095c9f65
2024-04-17 17:33:27 -07:00
Andrew Kryczka 7027265417 Fix `max_successive_merges` counting CPU overhead regression (#12546)
Summary:
In https://github.com/facebook/rocksdb/issues/12365 we made `max_successive_merges` non-strict by default. Before https://github.com/facebook/rocksdb/issues/12365, `CountSuccessiveMergeEntries()`'s scan was implicitly limited to `max_successive_merges` entries for a given key, because after that the merge operator would be invoked and the merge chain would be collapsed. After https://github.com/facebook/rocksdb/issues/12365, the merge chain will not be collapsed no matter how long it is when the chain's operands are not all in memory. Since `CountSuccessiveMergeEntries()` scanned the whole merge chain, https://github.com/facebook/rocksdb/issues/12365 had a side effect that it would scan more memtable entries. This PR introduces a limit so it won't scan more entries than it could before.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12546

Reviewed By: jaykorean

Differential Revision: D56193693

Pulled By: ajkr

fbshipit-source-id: b070ba0703ef733e0ff230f89cd5cca5233b84da
2024-04-17 12:11:24 -07:00
Jay Huh 02ea0d6367 Reserve vector in advance to avoid resizing in GetLiveFilesMetaData (#12554)
Summary:
As title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12554

Test Plan: Existing CI

Reviewed By: ajkr

Differential Revision: D56252201

Pulled By: jaykorean

fbshipit-source-id: 06211555a54ce5e6bf656b81109022494e6787ea
2024-04-17 11:01:06 -07:00
Jay Huh b7319d8a10 MultiCfIterator - Tests for lower/upper bounds (#12548)
Summary:
Thanks to how we are using `DBIter` as child iterators in MultiCfIterators (both `CoalescingIterator` and `AttributeGroupIterator`), we got the lower/upper bound feature for free. This PR simply adds unit test coverage to ensure that the lower/upper bounds are working as expected in the MultiCfIterators.
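
A minimal sketch of how bounds are supplied; since the child iterators are plain `DBIter`s, the same `ReadOptions` fields apply to the MultiCfIterators:
```
#include <memory>

#include "rocksdb/db.h"

void ScanBounded(rocksdb::DB* db) {
  rocksdb::Slice lower("key_000");
  rocksdb::Slice upper("key_999");  // exclusive
  rocksdb::ReadOptions ro;
  ro.iterate_lower_bound = &lower;
  ro.iterate_upper_bound = &upper;
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // All keys fall in [key_000, key_999).
  }
}
```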

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12548

Test Plan:
UnitTest Added
```
./multi_cf_iterator_test
```

Reviewed By: ltamasi

Differential Revision: D56197966

Pulled By: jaykorean

fbshipit-source-id: fa51cc70705dbc5efd836ac006a7c6a49d05707a
2024-04-16 14:20:13 -07:00
Jay Huh d34712e0ac MultiCfIterator - AttributeGroupIter Impl & CoalescingIter Optimization (#12534)
Summary:
Continuing from the previous MultiCfIterator Implementations - (https://github.com/facebook/rocksdb/issues/12422, https://github.com/facebook/rocksdb/issues/12480 #12465), this PR completes the `AttributeGroupIterator` by implementing `AttributeGroupIteratorImpl::AddToAttributeGroups()`. While implementing the `AttributeGroupIterator`, we had to make some changes in `MultiCfIteratorImpl` and found an opportunity to improve `Coalesce()` in `CoalescingIterator`.

Lifted the `UNDER CONSTRUCTION - DO NOT USE` comment by replacing it with `EXPERIMENTAL`.

Here are some implementation details:
- `IteratorAttributeGroups` is introduced to avoid having to copy all `WideColumn` objects during iteration.
- `PopulateIterator()` no longer advances non-top iterators that have the same key as the top iterator in the heap.
- `AdvanceIterator()` needs to advance the non-top iterators when they have the same key as the top iterator in the heap.
- Instead of populating one by one, `PopulateIterator()` now collects all items with the same key and calls `populate_func(items)` at once.
- This allowed optimization in `Coalesce()` such that we no longer do K-1 rounds of 2-way merge, but do one K-way merge instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12534

Test Plan:
Uncommented the assertions in `verifyAttributeGroupIterator()`

```
./multi_cf_iterator_test
```

Reviewed By: ltamasi

Differential Revision: D56089019

Pulled By: jaykorean

fbshipit-source-id: 6b0b4247e221f69b40b147d41492008cc9b15054
2024-04-16 08:45:38 -07:00
Yu Zhang b166ca8b74 Second attempt #12386 (#12529)
Summary:
Checking https://github.com/facebook/rocksdb/issues/12386 back in now that we have figured out the MyRocks build failure and unblocked it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12529

Reviewed By: ajkr

Differential Revision: D56047495

Pulled By: jowlyzhang

fbshipit-source-id: f90664b9e72c085e068f174720f126b80ad4e8ea
2024-04-12 10:14:44 -07:00
Andrew Kryczka 8897bf2d04 Drop unsynced data in `TestFSWritableFile::Close()` (#12528)
Summary:
Our `FileSystem` for simulating unsynced data loss should not sync during `Close()` because it masks bugs where we forgot to sync as long as we closed the file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12528

Test Plan:
Peeled back https://github.com/facebook/rocksdb/issues/10560 fix and verified it is caught much faster now (few seconds vs. ???) with command like

```
$ TEST_TMPDIR=./ python3 tools/db_crashtest.py blackbox --disable_wal=0 --max_key=1000 --write_buffer_size=131072 --max_bytes_for_level_base=524288 --target_file_size_base=131072 --interval=3 --sync_fault_injection=1 --enable_blob_files=0 --manual_wal_flush_one_in=10 --sync_wal_one_in=0 --get_live_files_one_in=0 --get_sorted_wal_files_one_in=0 --backup_one_in=0 --checkpoint_one_in=0 --write_fault_one_in=0 --read_fault_one_in=0 --open_write_fault_one_in=0 --compact_range_one_in=0 --compact_files_one_in=0 --open_read_fault_one_in=0 --get_property_one_in=0 --writepercent=100 -readpercent=0 -prefixpercent=0 -delpercent=0 -delrangepercent=0 -iterpercent=0
```

Reviewed By: anand1976

Differential Revision: D56033250

Pulled By: ajkr

fbshipit-source-id: 6bbf480d79a06c46f08f6214010937f6654af5ca
2024-04-12 09:57:56 -07:00
Vershinin Maxim 00873208 70d3fc3b6f Fix error for CF smallest and largest keys computation in ImportColumnFamilyJob::Prepare (#12526)
Summary:
This PR fixes an error in the computation of CF smallest and largest keys in ImportColumnFamilyJob::Prepare.
Before this fix, the smallest and largest keys for a CF were computed incorrectly, and the ImportColumnFamilyJob::Prepare function might not have detected overlaps between CFs. I added a test to detect this error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12526

Reviewed By: hx235

Differential Revision: D56046044

Pulled By: ajkr

fbshipit-source-id: d562fbfc9cc2d9624372d24d34a649198a960691
2024-04-11 21:54:51 -07:00
Jay Huh 58a98bded9 MultiCFIterator Refactor - CoalescingIterator & AttributeGroupIterator (#12480)
Summary:
There are a couple of reasons to modify the current implementation of the MultiCfIterator, which implements the generic `Iterator` interface.
- The default behavior of `value()`/`columns()` returning data from different Column Families for different keys can be prone to errors, even though there might be valid use cases where users do not care about the origin of the value/columns.
- The `attribute_groups()` API, which is not yet implemented, will not be useful for a single-CF iterator.

In this PR, we are implementing the following changes:
- `IteratorBase` introduced, which includes all basic iterator functions except `value()` and `columns()`.
- `Iterator`, which now inherits from `IteratorBase`, includes `value()` and `columns()`.
- New public interface `AttributeGroupIterator` inherits from `IteratorBase` and additionally includes `attribute_groups()` (to be implemented).
- Renamed former `MultiCfIterator` to `CoalescingIterator` which inherits from `Iterator`
- Existing MultiCfIteratorTest has been split into two - `CoalescingIteratorTest` and `AttributeGroupIteratorTest`.
- Moved AttributeGroup related code from `wide_columns.h` to a new file, `attribute_groups.h`.

Some Implementation Details
- `MultiCfIteratorImpl` takes two functions - `populate_func` and `reset_func` and use them to populate `value_` and `columns_` in CoalescingIterator and `attribute_groups_` in AttributeGroupIterator. In CoalescingIterator, populate_func is `Coalesce()`, in AttributeGroupIterator populate_func is `AddToAttributeGroups()`. `reset_func` clears populated value_, columns_ and attribute_groups_ accordingly.
- `Coalesce()` merge-sorts columns from multiple CFs when a key exists in more than one CF. A column that appears in a later CF overwrites the prior ones.

For example, if CF1 has `"key_1" ==> {"col_1": "foo",  "col_2", "baz"}` and CF2 has `"key_1" ==> {"col_2": "quux", "col_3", "bla"}`, and when the iterator is at `key_1`, `columns()` will return `{"col_1": "foo", "col_2", "quux", "col_3", "bla"}`

In this example, `value()` will be empty, because none of them have values for `kDefaultColumnName`
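
A minimal usage sketch; the `NewCoalescingIterator` entry point and its signature are assumed from this PR series:
```
#include <memory>
#include <vector>

#include "rocksdb/db.h"

void ScanCoalesced(rocksdb::DB* db,
                   const std::vector<rocksdb::ColumnFamilyHandle*>& cfs) {
  std::unique_ptr<rocksdb::Iterator> it =
      db->NewCoalescingIterator(rocksdb::ReadOptions(), cfs);
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // it->columns() holds coalesced wide columns; a column from a later
    // CF in `cfs` overwrites one with the same name from an earlier CF.
  }
}
```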

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12480

Test Plan:
## Unit Test
```
./multi_cf_iterator_test
```

## Performance Test

To make sure this change does not impact existing `Iterator` performance

**Build**
```
$> make -j64 release
```
**Setup**
```
$> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none
```
**Run**
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000
```

**Before the change**
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.519 micros/op 1927904 ops/sec 0.519 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       5.302 micros/op 188589 ops/sec 5.303 seconds 1000000 operations; (0 of 1000000 found)
```
**After the change**
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.497 micros/op 2011012 ops/sec 0.497 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       5.252 micros/op 190405 ops/sec 5.252 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: ltamasi

Differential Revision: D55353909

Pulled By: jaykorean

fbshipit-source-id: 8d7786ffee09e022261ce34aa60e8633685e1946
2024-04-11 11:34:04 -07:00
Yu Zhang fab9dd9635 Temporary revert #12386 to unblock MyRocks build (#12523)
Summary:
MyRocks reports a build failure with this change (build failures in this diff: https://www.internalfb.com/diff/D55924596) https://github.com/facebook/rocksdb/issues/12386; we haven't figured out how to fix it yet, so we are temporarily reverting it to unblock them.

This reverts commit 3104e55f29.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12523

Reviewed By: hx235

Differential Revision: D55981751

Pulled By: jowlyzhang

fbshipit-source-id: 1d7edd42b65ca847cec67549644a2b1e5775841e
2024-04-10 13:47:52 -07:00
Hui Xiao abdbeedba6 Miscellaneous improvement to info printing (#12504)
Summary:
**Context/Summary:**
Debugging the crash test made me realize there are a few places that could use improved logging of more info

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12504

Test Plan:
Manual testing
Debug build
```
2024/04/04-16:12:12.289791 1636007 [/db_filesnapshot.cc:156] Number of log files 2 (0 required by manifest)
...
2024/04/04-16:12:12.289814 1636007 [/db_filesnapshot.cc:171] Log files : /000004.log /000008.log  .Log files required by manifest: .
```
Non-debug build
```
2024/04/04-16:19:23.222168 1685043 [/db_filesnapshot.cc:156] Number of log files 1 (0 required by manifest)
```
CI

Reviewed By: jaykorean

Differential Revision: D55710013

Pulled By: hx235

fbshipit-source-id: 9964d46cfb0a2074620f31571cf9fd29d0a88819
2024-04-05 10:23:31 -07:00
Changyu Bi a0aade7e62 Add some debug print for flaky test `DBCompactionTest.CompactionLimiter` (#12509)
Summary:
The unit test fails occasionally and cannot be reproed locally.
```
[ RUN      ] DBCompactionTest.CompactionLimiter
db/db_compaction_test.cc:6139: Failure
Expected equality of these values:
  cf_count
    Which is: 17
  env_->GetThreadPoolQueueLen(Env::LOW)
    Which is: 15
[  FAILED  ] DBCompactionTest.CompactionLimiter (512 ms)
```

Add some debug print to help triaging if it fails again.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12509

Reviewed By: jowlyzhang

Differential Revision: D55770552

Pulled By: cbi42

fbshipit-source-id: 2a39b2199f80352fcf2c6cd2b9c8b81c727eee8c
2024-04-04 15:21:40 -07:00
Changyu Bi 796011e5ad Limit compaction input files expansion (#12484)
Summary:
We removed the limit in https://github.com/facebook/rocksdb/issues/10835 and the option in https://github.com/facebook/rocksdb/issues/12323. Usually input level is much smaller than output level, which is likely why we have not seen issues with not applying a limit. It should be safer to add a safe guard as suggested in https://github.com/facebook/rocksdb/pull/12323#issuecomment-2016687321.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12484

Test Plan: * new and existing UT

Reviewed By: ajkr

Differential Revision: D55438870

Pulled By: cbi42

fbshipit-source-id: 0511d0465a70398c36230ed7cced5291ff1a6c19
2024-03-29 11:34:29 -07:00
Hui Xiao d985902ef4 Disallow refitting more than 1 file from non-L0 to L0 (#12481)
Summary:
**Context/Summary:**
We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than one file to L0. This refitting can cause read performance regression (as we need to go through every file in L0), corruption in some edge cases, and false-positive corruption caught by the force consistency check. We decided to explicitly disallow such behavior.
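
For illustration, the now-restricted call pattern looks roughly like this (a sketch; the exact error status returned on rejection is an assumption):

```
#include "rocksdb/db.h"

void RefitToL0(rocksdb::DB* db) {
  rocksdb::CompactRangeOptions opts;
  opts.change_level = true;
  opts.target_level = 0;  // refit the result files into L0
  // After this PR, the call fails with a non-OK status instead of
  // silently refitting more than one non-L0 file into L0.
  rocksdb::Status s =
      db->CompactRange(opts, /*begin=*/nullptr, /*end=*/nullptr);
  (void)s;  // inspect/handle the status in real code
}
```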

A related change to OptionChangeMigration():
- When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrate data into a couple L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data upon migration finishes when the migrated data is larger than max_table_files_size. This is achieved by first compacting all the data into a couple non-L0 files and refitting those files from non-L0 to L0 if needed. In that way, only some data instead of all data will be dropped immediately after migration to FIFO with a max_table_files_size.
- Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and explicitly state such risk in API comment.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481

Test Plan:
- New UT
- Modified UT

Reviewed By: cbi42

Differential Revision: D55351178

Pulled By: hx235

fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80
2024-03-29 10:52:36 -07:00
Jay Huh c449867236 MultiCfIterator Impl Follow up (#12465)
Summary:
As a follow up for https://github.com/facebook/rocksdb/issues/12422 , this PR includes the following two changes.
- Removal of `direction_` in the MultiCfIterator
- Use of Member Func Template instead of `std::function`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12465

Test Plan:
```
./multi_cf_iterator_test
```

Reviewed By: pdillinger, ltamasi

Differential Revision: D55208448

Pulled By: jaykorean

fbshipit-source-id: 8b3167c1d59839d076afc29097b5ad21a453460a
2024-03-22 14:51:16 -07:00
Peter Dillinger b515a5db3f Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470)
Summary:
ScopedArenaIterator is not an iterator. It is a pointer wrapper, and we don't need a custom-implemented pointer wrapper when std::unique_ptr can be instantiated with what we want.

So this adds ScopedArenaPtr<T> to replace those uses.
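
Conceptually, the replacement boils down to the following sketch (assuming details; not the exact definition): a unique_ptr whose deleter runs only the destructor, since the arena owns the memory.

```
#include <memory>

// Deleter for arena-allocated objects: destroy in place, never free().
template <typename T>
struct ArenaDestroyer {
  void operator()(T* ptr) const {
    if (ptr != nullptr) {
      ptr->~T();  // the arena reclaims the raw memory later, in bulk
    }
  }
};

template <typename T>
using ScopedArenaPtr = std::unique_ptr<T, ArenaDestroyer<T>>;
```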

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470

Test Plan: CI (including ASAN/UBSAN)

Reviewed By: jowlyzhang

Differential Revision: D55254362

Pulled By: pdillinger

fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18
2024-03-22 13:40:42 -07:00
anand76 3b736c4aa3 Fix heap use after free error on retry after checksum mismatch (#12464)
Summary:
Fix the heap use after free bug caused by freeing the file system IO buffer in `BlockFetcher::ReadBlock()` instead of the caller.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12464

Test Plan: Update the `DBIOCorruptionTest` tests

Reviewed By: akankshamahajan15

Differential Revision: D55206920

Pulled By: anand1976

fbshipit-source-id: fd6b608a61cd229b20c1e5f348ff3cc92328de0f
2024-03-21 16:19:09 -07:00
Andrew Kryczka bf98dcf9a8 Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462)
Summary:
The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file.

There's also fixes for bugs encountered along the way where non-ok statuses were ignored/overwritten, and a bit of plumbing work for functions that had no capability to return a status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462

Test Plan:
A repro command

```
db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp
./db_stress \
        --clear_column_family_one_in=0 \
        --test_batches_snapshots=0 \
        --write_fault_one_in=0 \
        --use_put_entity_one_in=0 \
        --prefixpercent=0 \
        --read_fault_one_in=0 \
        --readpercent=0 \
        --reopen=0 \
        --set_options_one_in=10000 \
        --delpercent=0 \
        --delrangepercent=0 \
        --open_metadata_write_fault_one_in=0 \
        --open_read_fault_one_in=0 \
        --open_write_fault_one_in=0 \
        --destroy_db_initially=0 \
        --ingest_external_file_one_in=0 \
        --iterpercent=0 \
        --nooverwritepercent=0 \
        --db=$db \
        --enable_blob_files=1 \
        --expected_values_dir=$exp \
        --max_background_compactions=20 \
        --max_bytes_for_level_base=2097152 \
        --max_key=100000 \
        --min_blob_size=0 \
        --open_files=-1 \
        --ops_per_thread=100000000 \
        --prefix_size=-1 \
        --target_file_size_base=524288 \
        --use_merge=1 \
        --value_size_mult=32 \
        --write_buffer_size=524288 \
        --writepercent=100
```

It used to fail like:

```
...
frame https://github.com/facebook/rocksdb/issues/9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3
frame https://github.com/facebook/rocksdb/issues/10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, rocksdb::PinnedIteratorsManager*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33
frame https://github.com/facebook/rocksdb/issues/11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492
frame https://github.com/facebook/rocksdb/issues/12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408
frame https://github.com/facebook/rocksdb/issues/13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109
frame https://github.com/facebook/rocksdb/issues/14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656
frame https://github.com/facebook/rocksdb/issues/15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607
frame https://github.com/facebook/rocksdb/issues/16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505
frame https://github.com/facebook/rocksdb/issues/17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084
frame https://github.com/facebook/rocksdb/issues/18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807
frame https://github.com/facebook/rocksdb/issues/19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312
frame https://github.com/facebook/rocksdb/issues/20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157
frame https://github.com/facebook/rocksdb/issues/21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544
frame https://github.com/facebook/rocksdb/issues/22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72
frame https://github.com/facebook/rocksdb/issues/23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317
frame https://github.com/facebook/rocksdb/issues/24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148
...
```

Reviewed By: ltamasi

Differential Revision: D55157795

Pulled By: ajkr

fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764
2024-03-21 12:38:53 -07:00
anand76 63a105a481 Enable recycle_log_file_num option for point in time recovery (#12403)
Summary:
This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL.

This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path.
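
A hedged example of the option combination this re-enables:

```
#include "rocksdb/options.h"

rocksdb::Options RecyclingOptions() {
  rocksdb::Options options;
  options.recycle_log_file_num = 1;  // reuse old WAL files instead of deleting
  // Point-in-time recovery can now be combined with WAL recycling.
  options.wal_recovery_mode =
      rocksdb::WALRecoveryMode::kPointInTimeRecovery;
  return options;
}
```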

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403

Test Plan:
1. Add new unit tests to verify behavior upon corruption
2. Re-enable disabled tests for verifying recycling behavior

Reviewed By: ajkr

Differential Revision: D54544824

Pulled By: anand1976

fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a
2024-03-21 12:29:35 -07:00
Yu Zhang 13e1c32a18 Follow ups for TimedPut and write time property (#12455)
Summary:
This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including:

1) Handle a special case for `WriteBatch::TimedPut`. When the user-specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error; instead, it creates and writes a regular `Put` entry (see the sketch after this list).

2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries.

3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob previously copied the DB's seqno to time mapping while holding the db mutex and only copied the part of interest, i.e., the part that goes back no further than the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to a preferred seqno as much as possible.
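
A hedged illustration of special case (1), assuming the experimental `WriteBatch::TimedPut(cf, key, value, write_unix_time)` parameter order:

```
#include <cstdint>
#include <limits>

#include "rocksdb/db.h"

void TimedPutWithMaxTime(rocksdb::DB* db) {
  rocksdb::WriteBatch batch;
  // uint64_t max as the write time is no longer an error: the entry is
  // written as a plain Put rather than a kTypeValuePreferredSeqno entry.
  rocksdb::Status s =
      batch.TimedPut(db->DefaultColumnFamily(), "key", "value",
                     std::numeric_limits<uint64_t>::max());
  if (s.ok()) {
    s = db->Write(rocksdb::WriteOptions(), &batch);
  }
}
```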

Test plans:
Added unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455

Reviewed By: pdillinger

Differential Revision: D55165422

Pulled By: jowlyzhang

fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47
2024-03-21 10:00:15 -07:00
Richard Barnes 6a1c2abe9d Remove extra semi colon from hbt/src/tagstack/tests/SlicerTest.cpp (#12461)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12461

X-link: https://github.com/facebookincubator/dynolog/pull/233

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: rahku

Differential Revision: D55087324

fbshipit-source-id: e8a03d33cad72a7d378e58f85eb550a03f6c2897
2024-03-20 12:44:50 -07:00
Kshitij Wadhwa 4ce1dc930c don't run ZSTD_TrainDictionary in BlockBasedTableBuilder if there isn't compression needed (#12453)
Summary:
fixes https://github.com/facebook/rocksdb/issues/12409

### Issue

ZSTD_TrainDictionary [[link](a53ed91691/table/block_based/block_based_table_builder.cc (L1894))] runs during SstFileWriter::Finish even when the bottommost_compression option is set to kNoCompression. This reduces throughput for SstFileWriter::Finish.

We construct rocksdb options using ZSTD compression for levels 2 and above. For levels 0 and 1, we set it to kNoCompression. We also set zstd_max_train_bytes to a non-zero positive value (which is applicable for levels with ZSTD compression enabled). These options are used for the database and also passed to SstFileWriter for creating sst files to be later added to that database. Since BlockBasedTableBuilder::Finish [[link](a53ed91691/table/block_based/block_based_table_builder.cc (L1892))] only checks whether zstd_max_train_bytes is a non-zero positive value, it runs ZSTD_TrainDictionary even when it shouldn't, since SstFileWriter is operating at the bottommost level

### Fix

If compression_type is set to kNoCompression, don't run ZSTD_TrainDictionary or dictionary building at all.
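
A hedged sketch of the configuration described above:

```
#include "rocksdb/options.h"

rocksdb::Options ZstdPerLevelOptions() {
  rocksdb::Options options;
  // No compression for L0/L1, ZSTD for levels 2 and above.
  options.compression_per_level = {
      rocksdb::kNoCompression, rocksdb::kNoCompression, rocksdb::kZSTD,
      rocksdb::kZSTD, rocksdb::kZSTD, rocksdb::kZSTD, rocksdb::kZSTD};
  options.compression_opts.max_dict_bytes = 16 * 1024;
  options.compression_opts.zstd_max_train_bytes = 1 << 20;
  // With this fix, an SstFileWriter that ends up writing with
  // kNoCompression no longer runs ZSTD_TrainDictionary merely because
  // zstd_max_train_bytes > 0.
  return options;
}
```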

### Testing

I see we have tests for sst file writer with compression type set/unset. Let me know if it isn't covered and I can extend

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12453

Reviewed By: cbi42

Differential Revision: D55030484

Pulled By: ajkr

fbshipit-source-id: 834de2174c2b087d61bf045ca1ae29f337b821a7
2024-03-20 11:07:32 -07:00
Jay Huh 3f3f4660bd wal_read_status check in RecoverLogFiles (#12460)
Summary:
Fixing the not-checked status failure as in https://github.com/facebook/rocksdb/actions/runs/8334988399/job/22809612148.

When the status is not ok() for any reason, we do not check `wal_read_status` because it's not necessary. The unchecked status was causing the test failure when running with `ASSERT_STATUS_CHECKED=1`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12460

Test Plan: Existing tests

Reviewed By: ajkr

Differential Revision: D55104844

Pulled By: jaykorean

fbshipit-source-id: 919b1fddca835494f9087c51c4da6eabc9e8df2b
2024-03-20 08:09:09 -07:00
anand76 4868c10b44 Retry block reads on checksum mismatch (#12427)
Summary:
On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.
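
A hedged sketch of how a file system could advertise this capability:

```
#include "rocksdb/file_system.h"

class ReconstructingFS : public rocksdb::FileSystemWrapper {
 public:
  explicit ReconstructingFS(const std::shared_ptr<rocksdb::FileSystem>& base)
      : rocksdb::FileSystemWrapper(base) {}
  const char* Name() const override { return "ReconstructingFS"; }
  void SupportedOps(int64_t& supported_ops) override {
    // Advertise that reads can verify checksums and reconstruct corrupted
    // data, so RocksDB will retry block reads on a checksum mismatch.
    supported_ops = 1 << rocksdb::FSSupportedOps::kVerifyAndReconstructRead;
  }
};
```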

Tests:
Add new unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427

Reviewed By: ajkr

Differential Revision: D55025941

Pulled By: anand1976

fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
2024-03-18 16:16:05 -07:00
Jay Huh b4e9f5a400 Update Remote Compaction Tests to include more than one CF (#12430)
Summary:
Update `compaction_service_test` to make sure remote compaction works with multiple column family set up. Minor refactor to get rid of duplicate code

Also fixing one quick bug in the existing test util: `FilesPerLevel` didn't honor `cf_id` properly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12430

Test Plan:
```
./compaction_service_test
```

Reviewed By: ajkr

Differential Revision: D54883035

Pulled By: jaykorean

fbshipit-source-id: 83b4f6f566fed5c4824bfef7de01074354a72b44
2024-03-18 15:40:48 -07:00
Hui Xiao 2443ebf810 Don't write to WAL after previous WAL write error (#12448)
Summary:
**Context/Summary:**
A WAL write can continue onto the WAL file that has encountered a write error and thus crash at 3f5bd46a07/file/writable_file_writer.cc (L67) in the scenario below:
<img width="698" alt="Screenshot 2024-03-15 at 1 52 45 PM" src="https://github.com/facebook/rocksdb/assets/83968999/cd631ef2-c87c-4926-91ab-a0dc10f1eb14">

Note that GetLiveFilesStorageInfo() can run concurrently with Put() during the non-WAL-write part, where the db lock isn't held.

This PR adds an error check in the LogWriter layer to prevent thread 2 from starting to write to the WAL after thread 1's write error.
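
A minimal sketch of the guard pattern (member and method names are assumed for illustration, not the actual code):

```
#include "rocksdb/io_status.h"
#include "rocksdb/slice.h"

class LogWriterSketch {
 public:
  rocksdb::IOStatus AddRecord(const rocksdb::Slice& record) {
    if (!last_status_.ok()) {
      // Thread 2 lands here instead of appending after thread 1's error.
      return last_status_;
    }
    rocksdb::IOStatus s = DoAppend(record);
    if (!s.ok()) {
      last_status_ = s;  // make the first error sticky
    }
    return s;
  }

 private:
  rocksdb::IOStatus DoAppend(const rocksdb::Slice& /*record*/) {
    return rocksdb::IOStatus::OK();  // stand-in for the real file append
  }
  rocksdb::IOStatus last_status_;  // defaults to OK
};
```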

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12448

Test Plan:
Step 1 Apply the patch below to simulate frequent WAL write error for the purpose of repro
```
 diff --git a/db_stress_tool/db_stress_driver.cc b/db_stress_tool/db_stress_driver.cc
index b47fa89e6..31930e976 100644
 --- a/db_stress_tool/db_stress_driver.cc
+++ b/db_stress_tool/db_stress_driver.cc
@@ -98,7 +98,7 @@ bool RunStressTestImpl(SharedState* shared) {
     //  MANIFEST, CURRENT, and WAL files.
     fault_fs_guard->SetRandomWriteError(
         shared->GetSeed(), FLAGS_write_fault_one_in, error_msg,
-        /*inject_for_all_file_types=*/false, {FileType::kTableFile});
+        /*inject_for_all_file_types=*/false, {FileType::kWalFile});
     fault_fs_guard->SetFilesystemDirectWritable(false);
     fault_fs_guard->EnableWriteErrorInjection();
   }

 diff --git a/utilities/fault_injection_fs.cc b/utilities/fault_injection_fs.cc
index 0ffb43ea6..589912cf4 100644
 --- a/utilities/fault_injection_fs.cc
+++ b/utilities/fault_injection_fs.cc
@@ -1042,7 +1042,7 @@ IOStatus FaultInjectionTestFS::InjectWriteError(const std::string& file_name) {
   }

   if (allowed_type) {
-    if (write_error_rand_.OneIn(write_error_one_in_)) {
+    if (write_error_rand_.OneIn(1)) {
       return GetError();
     }
   }
```
Step 2 Run below
```
./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=41.19540459544058 --bottommost_compression_type=disable --bottommost_file_compaction_delay=3600 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=0 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=1 --compression_max_dict_buffer_bytes=68719476735 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_index_compression=1 --enable_pipelined_write=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=10000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=15 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=1048576 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 
--mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=500000 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --report_bg_io_stats=0 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=0 --strict_bytes_per_sync=0 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=35
```
Pre-PR:
```
db_stress: ./file/writable_file_writer.h:309: rocksdb::IOStatus rocksdb::WritableFileWriter::AssertFalseAndGetStatusForPrevError(): Assertion `sync_without_flush_called_' failed.
```
Post-PR
```
2024/03/15-13:44:08  Starting database operations
put or merge error: IO error: Retryable injected write error
```

Note: The patch is NOT included in the PR as we first need to figure out how to handle this type of failed write in stress test (planned for the near future). It's sufficient to show the stress test does not crash as pre-PR for the purpose of this PR.

Reviewed By: ajkr

Differential Revision: D54969287

Pulled By: hx235

fbshipit-source-id: 0ba4eabfec44ea7656d4d7117836f388897562f2
2024-03-18 12:27:49 -07:00
Jay Huh db1dea22b1 MultiCfIterator Implementations (#12422)
Summary:
This PR continues https://github.com/facebook/rocksdb/issues/12153 by implementing the missing `Iterator` APIs - `Seek()`, `SeekForPrev()`, `SeekToLast()`, and `Prev()`. A MaxHeap implementation has been added to handle the reverse direction.

The current implementation does not include upper/lower bounds yet. These will be added in subsequent PRs. The API is still marked as under construction and will be lifted after being added to the stress test.

Please note that changing the iterator direction in the middle of iteration is expensive, as it requires seeking the element in each iterator again in the opposite direction and rebuilding the heap along the way. The first `Next()` after `SeekForPrev()` requires changing the direction under the current implementation. We may optimize this in later PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12422

Test Plan: The `multi_cf_iterator_test` has been extended to cover the API implementations.

Reviewed By: pdillinger

Differential Revision: D54820754

Pulled By: jaykorean

fbshipit-source-id: 9eb741508df0f7bad598fb8e6bd5cdffc39e81d1
2024-03-18 09:05:30 -07:00
Changyu Bi 3d5be596a5 Fix a bug in iterator with UDT + `ReadOptions::pin_data` (#12451)
Summary:
With https://github.com/facebook/rocksdb/issues/12414 enabling `ReadOptions::pin_data`, this bug surfaced as a corrupted per key-value checksum during crash test. `saved_key_.GetUserKey()` could be a pinned user key, so DBIter should not overwrite it.

In one case, the bug only surfaces when the iterator skips many entries of the same user key. To stress that code path, this PR also added `max_sequential_skip_in_iterations` to the crash test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12451

Test Plan:
- Set ReadOptions::pin_data to true, the bug can be reproed quickly with `./db_stress --persist_user_defined_timestamps=1 --user_timestamp_size=8 --writepercent=35 --delpercent=4 --delrangepercent=1 --iterpercent=20 --nooverwritepercent=1 --prefix_size=8 --prefixpercent=10 --readpercent=30 --memtable_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --clear_column_family_one_in=0`.
    - Set max_sequential_skip_in_iterations to 1 for the other occurrence of the bug.

Reviewed By: jowlyzhang

Differential Revision: D55003766

Pulled By: cbi42

fbshipit-source-id: 23e1049129456684dafb028b6132b70e0afc07fb
2024-03-18 09:05:11 -07:00
Yu Zhang f2546b6623 Support returning write unix time in iterator property (#12428)
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.

2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in  https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.
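
A hedged usage sketch; the property name and its fixed-width 64-bit encoding are assumptions based on this PR:

```
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>

#include "rocksdb/db.h"

void ReadWriteTime(rocksdb::DB* db) {
  std::unique_ptr<rocksdb::Iterator> iter(
      db->NewIterator(rocksdb::ReadOptions()));
  iter->SeekToFirst();
  std::string prop;
  if (iter->Valid() &&
      iter->GetProperty("rocksdb.iterator.write-time", &prop).ok() &&
      prop.size() == sizeof(uint64_t)) {
    uint64_t write_time = 0;  // unix time; uint64_t max means "unknown"
    std::memcpy(&write_time, prop.data(), sizeof(write_time));
  }
}
```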

Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it.

2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.

3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428

Test Plan: Added unit test

Reviewed By: pdillinger

Differential Revision: D54967067

Pulled By: jowlyzhang

fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
2024-03-15 15:37:37 -07:00
Andrew Kryczka 4d5ebad971 Fix kBlockCacheTier read with table cache miss (#12443)
Summary:
Thanks ltamasi for pointing out this bug.

We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status.
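
A hedged sketch of the affected read path:

```
#include <string>

#include "rocksdb/db.h"

void CacheOnlyGet(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  ro.read_tier = rocksdb::kBlockCacheTier;  // memory-only, no file I/O
  std::string value;
  rocksdb::Status s = db->Get(ro, "key", &value);
  if (s.IsIncomplete()) {
    // With this fix, a table cache miss is reported as Incomplete again
    // instead of being masked as OK; fall back to a normal read if needed.
  }
}
```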

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443

Reviewed By: cbi42

Differential Revision: D54930128

Pulled By: ajkr

fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d
2024-03-15 14:41:58 -07:00
Andrew Kryczka 3f5bd46a07 Add `ContinueCallback` to `GetMergeOperands()` (#12438)
Summary:
The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because:

- Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge
- The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me

If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with.
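
A hedged usage sketch; the `continue_cb` field and its bool(Slice) shape come from this PR, while the operand-inspection logic is made up for illustration:

```
#include <vector>

#include "rocksdb/db.h"

void ReadOperands(rocksdb::DB* db) {
  rocksdb::GetMergeOperandsOptions opts;
  opts.expected_max_number_of_operands = 16;
  opts.continue_cb = [](rocksdb::Slice operand) {
    // Returning false stops the lookup from descending further into the
    // LSM; here a hypothetical "full state" operand ends the search.
    return !operand.starts_with("full-state:");
  };
  std::vector<rocksdb::PinnableSlice> operands(16);
  int num_operands = 0;
  rocksdb::Status s = db->GetMergeOperands(
      rocksdb::ReadOptions(), db->DefaultColumnFamily(), "key",
      operands.data(), &opts, &num_operands);
  (void)s;
}
```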

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438

Reviewed By: hx235

Differential Revision: D54914669

Pulled By: ajkr

fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a
2024-03-15 12:25:49 -07:00
Changyu Bi 096fb9b67d Fix data race in WalManager (#12439)
Summary:
Crash tests were failing due to data race in accessing `purge_wal_files_last_run_`. This PR changes it to atomic.
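
A minimal sketch of the fix pattern (simplified; relaxed memory ordering is an assumption about what suffices for a rate-limiting timestamp):

```
#include <atomic>
#include <cstdint>

// Shared across threads; previously a plain member, now atomic.
std::atomic<uint64_t> purge_wal_files_last_run_{0};

void MaybePurgeObsoleteWalFiles(uint64_t now_micros, uint64_t period_micros) {
  uint64_t last = purge_wal_files_last_run_.load(std::memory_order_relaxed);
  if (now_micros - last < period_micros) {
    return;  // purged recently; skip this round
  }
  purge_wal_files_last_run_.store(now_micros, std::memory_order_relaxed);
  // ... purge obsolete WAL files ...
}
```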

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439

Test Plan:
- existing UT
- not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests

Reviewed By: anand1976

Differential Revision: D54920817

Pulled By: cbi42

fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6
2024-03-14 21:24:06 -07:00
Yu Zhang 1104eaa35e Add initial support for TimedPut API (#12419)
Summary:
This PR adds support for `TimedPut` API. We introduced a new type `kTypeValuePreferredSeqno` for entries added to the DB via the `TimedPut` API.

The life cycle of such an entry on the write/flush/compaction paths are:

1) It is initially added to memtable as:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, write_unix_time}`

2) When it's flushed to L0 sst files, it's converted to:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, preferred_seqno}`
 when we have easy access to the seqno to time mapping.

3) During compaction, if certain conditions are met, we swap in the `preferred_seqno` and the entry will become:
`<user_key, preferred_seqno, kTypeValue>: value`. This step helps fast track these entries to the cold tier if they are eligible after the sequence number swap.

On the read path:
A `kTypeValuePreferredSeqno` entry acts the same as a `kTypeValue` entry, the unix_write_time/preferred seqno part packed in value is completely ignored.

Needed follow ups:
1) The seqno to time mapping accessible in flush needs to be extended to cover the `write_unix_time` for possible `kTypeValuePreferredSeqno` entries. This also means we need to track these `write_unix_time` in memtable.

2) Compaction filter support for the new `kTypeValuePreferredSeqno` type for feature parity with other `kTypeValue` and equivalent types.

3) Stress test coverage for the feature

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12419

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D54920296

Pulled By: jowlyzhang

fbshipit-source-id: c8b43f7a7c465e569141770e93c748371ff1da9e
2024-03-14 15:44:55 -07:00
Peter Dillinger dd24bda137 Fix windows build and CI (#12426)
Summary:
Issue https://github.com/facebook/rocksdb/issues/12421 describes a regression in the migration from CircleCI to GitHub Actions in which failing build steps no longer fail Windows CI jobs. In GHA with pwsh (new preferred powershell command), only the last non-builtin command (or something like that) affects the overall success/failure result, and failures in external commands do not exit the script, even with `$ErrorActionPreference = 'Stop'` and `$PSNativeCommandErrorActionPreference = $true`. Switching to `powershell` causes some obscure failure (not seen in CircleCI) about the `-Lo` option to `curl`.

Here we work around this using the only reasonable-but-ugly way known: explicitly check the result after every non-trivial build step. This leaves us highly susceptible to future regressions with unchecked build steps in the future, but a clean solution is not known.

This change also fixes the build errors that were allowed to creep in because of the CI regression. Also decreased the unnecessarily long running time of DBWriteTest.WriteThreadWaitNanosCounter.

For background, this problem explicitly contradicts GitHub's documentation, and GitHub has known about the problem for more than a year, with no evidence of caring or intending to fix. https://github.com/actions/runner-images/issues/6668 Somehow CircleCI doesn't have this problem. And even though cmd.exe and powershell have been perpetuating DOS-isms for decades, they still seem to be a somewhat active "hot mess" when it comes to sensible, consistent, and documented behavior.

Fixes https://github.com/facebook/rocksdb/issues/12421

A history of some things I tried in development is here: https://github.com/facebook/rocksdb/compare/main...pdillinger:rocksdb:debug_windows_ci_orig

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12426

Test Plan: CI, including https://github.com/facebook/rocksdb/issues/12434 where I have temporarily enabled other Windows builds on PR with this change

Reviewed By: cbi42

Differential Revision: D54903698

Pulled By: pdillinger

fbshipit-source-id: 116bcbebbbf98f347c7cf7dfdeebeaaed7f76827
2024-03-14 12:04:41 -07:00
Peter Dillinger c0ae5be934 Disable flaky part of TransactionLogIteratorCheckWhenArchive (#12423)
Summary:
https://github.com/facebook/rocksdb/issues/12397 attempted to make the test more honest about its failures, and they're really showing up in CI now (but not locally). Disable pending investigation

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12423

Test Plan: watch CI

Reviewed By: ltamasi

Differential Revision: D54817705

Pulled By: pdillinger

fbshipit-source-id: 4721834c49b225ac52d1a28ecb06b9d05de977b3
2024-03-12 12:54:53 -07:00
Peter Dillinger 7622029101 Fix flaky TransactionLogIteratorCheckWhenArchive (#12397)
Summary:
Seen in https://github.com/facebook/rocksdb/actions/runs/8086592802/job/22096691572?pr=12388

```
[ RUN      ] DBTestXactLogIterator.TransactionLogIteratorCheckWhenArchive
db/db_log_iter_test.cc:173:23: runtime error: member call on address 0x0000023956f0 which does not point to an object of type 'rocksdb::DBTestXactLogIterator'
0x0000023956f0: note: object is of type 'rocksdb::DBTestBase'
 00 00 00 00  98 ae f7 da 75 7f 00 00  a0 5d 39 02 00 00 00 00  80 ff 39 02 00 00 00 00  95 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'rocksdb::DBTestBase'
 UndefinedBehaviorSanitizer: undefined-behavior db/db_log_iter_test.cc:173:23 in
```

This is almost certainly caused by the sync point callback happening on asynchronous file deletion in the DB while the end of the test is reached and the destruction of the `DBTestXactLogIterator` has reached `DBTestBase::~DBTestBase()`. Either closing the DB or disabling sync points before the end of the test should suffice to fix, and we'll do both. And assert that the sync point callback is actually hit each time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12397

Test Plan: unable to reproduce, but ran 1000 iterations of the test with UBSAN

Reviewed By: ltamasi

Differential Revision: D54326687

Pulled By: pdillinger

fbshipit-source-id: cc09a4dcd2f237d5b45d910364d6aa56bbd46d50
2024-03-12 08:43:47 -07:00
Andrew Kryczka 27a2473668 Best-effort recovery support for atomic flush (#12406)
Summary:
This PR updates `VersionEditHandlerPointInTime` to recover all or none of the updates in an AtomicGroup. This makes best-effort recovery properly handle atomic flushes during recovery, so the two features can now be enabled together.

The new logic requires that AtomicGroups do not contain column family additions or removals. AtomicGroups are currently written for atomic flush, which does not include such edits.

Column family additions or removals are recovered independently of AtomicGroups. The new logic needs to be aware of removal, though, so that a dropped CF does not prevent completion of an AtomicGroup recovery.

The new logic treats each AtomicGroup as if it contains updates for all existing column families, even though it is possible to create AtomicGroups that only affect a subset of column families. This simplifies the logic at the expense of recovering less data in certain edge case scenarios.
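
A hedged example of the now-permitted combination:

```
#include "rocksdb/options.h"

rocksdb::Options BestEffortsAtomicFlushOptions() {
  rocksdb::Options options;
  options.atomic_flush = true;           // flush all CFs atomically
  options.best_efforts_recovery = true;  // recover a best point-in-time state
  return options;
}
```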

The usage of `MaybeCreateVersion()` is pretty tricky. The goal is to create a barrier at the start of an AtomicGroup such that all valid states up to that point will be applied to `versions_`. Here is a summary.

- `MaybeCreateVersion(..., false)` creates a `Version` on a negative edge trigger (transition from valid to invalid). It was  previously called when applying each update. Now, it is only called when applying non-AtomicGroup updates.
- `MaybeCreateVersion(..., true)` creates a `Version` on a positive level trigger (valid state). It was previously called only at the end of iteration. Now, it is additionally called before processing an AtomicGroup.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12406

Reviewed By: jaykorean, cbi42

Differential Revision: D54494904

Pulled By: ajkr

fbshipit-source-id: 0114a9fe1d04b471d086dcab5978ea8a3a56ad52
2024-03-06 14:40:40 -08:00
Peter Dillinger a53ed91691 Fix/improve temperature handling for file ingestion (#12402)
Summary:
Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388

In terms of public API:
* Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre.
* Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general).
* Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature.
* Note that "fail_if_not_bottommost_level" is obsolete/confusing use of "bottommost"

In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed.
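
To illustrate the clarified public API, a hedged sketch (the file path is a placeholder):

```
#include "rocksdb/db.h"

void IngestWarmFile(rocksdb::DB* db) {
  rocksdb::IngestExternalFileArg arg;
  arg.column_family = db->DefaultColumnFamily();
  arg.external_files = {"/path/to/file.sst"};
  // A hint about the temperature of the existing file being ingested,
  // used when opening/copying it, not the destination temperature.
  arg.file_temperature = rocksdb::Temperature::kWarm;
  rocksdb::Status s = db->IngestExternalFiles({arg});
  (void)s;
}
```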

Eventual suggested follow-up:
* Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare).
* More temperature handling for CreateColumnFamilyWithImport and Checkpoints.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402

Test Plan:
Deeply revamped
ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer.

Used macros instead of functions for better tracing to critical source location on test failures.

Some enhancements to FileTemperatureTestFS in the process of developing the revamped test.

Reviewed By: jowlyzhang

Differential Revision: D54442794

Pulled By: pdillinger

fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995
2024-03-05 16:56:08 -08:00
Jay Huh 3412195367 Introduce MultiCfIterator (#12153)
Summary:
This PR introduces a new implementation of `Iterator` via a new public API called `NewMultiCfIterator()`. The new API takes a vector of column family handles to build a cross-column-family iterator, which internally maintains multiple `DBIter`s as child iterators from a consistent database state. When a key exists in multiple column families, the iterator selects the value (and wide columns) from the first column family containing the key, following the order provided in the `column_families` parameter. Similar to the merging iterator, a min heap is used to iterate across the child iterators. Backward iteration and direction change functionalities will be implemented in future PRs.

The comparator used to compare keys across different column families will be derived from the iterator of the first column family specified in `column_families`. This comparator will be checked against the comparators from all other column families that the iterator will traverse. If there's a mismatch with any of the comparators, the initialization of the iterator will fail.
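
A hedged usage sketch (the API was still marked "DO NOT USE - UNDER CONSTRUCTION" at this point; the return type and the presence of a ReadOptions parameter are assumptions):

```
#include <vector>

#include "rocksdb/db.h"

void ScanAcrossCFs(rocksdb::DB* db,
                   const std::vector<rocksdb::ColumnFamilyHandle*>& cfhs) {
  auto iter = db->NewMultiCfIterator(rocksdb::ReadOptions(), cfhs);
  for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
    // When a key exists in multiple CFs, value()/columns() come from the
    // first CF in `cfhs` that contains the key.
    rocksdb::Slice key = iter->key();
    rocksdb::Slice value = iter->value();
    (void)key;
    (void)value;
  }
}
```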

Please note that this PR is not enough for users to start using `MultiCfIterator`. The `MultiCfIterator` and related APIs are still marked as "**DO NOT USE - UNDER CONSTRUCTION**". This PR is just the first of many PRs that will follow soon.

This PR includes the following:
- Introduction and partial implementation of the `MultiCfIterator`, which implements the generic `Iterator` interface. The implementation includes the construction of the iterator, `SeekToFirst()`, `Next()`, `Valid()`, `key()`, `value()`, and `columns()`.
- Unit tests to verify iteration across multiple column families in two distinct scenarios: (1) keys are unique across all column families, and (2) the same keys exist in multiple column families.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12153

Reviewed By: pdillinger

Differential Revision: D52308697

Pulled By: jaykorean

fbshipit-source-id: b03e69f13b40af5a8f0598d0f43a0bec01ef8294
2024-03-05 10:22:43 -08:00
yuzhangyu@fb.com 1cfdece85d Run internal cpp modernizer on RocksDB repo (#12398)
Summary:
When internal cpp modernizer attempts to format rocksdb code, it will replace macro `ROCKSDB_NAMESPACE`  with its default definition `rocksdb` when collapsing nested namespace. We filed a feedback for the tool T180254030 and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested us to run the modernizer tool ourselves so future auto codemod attempts will be smaller. This diff contains:

Running
`xplat/scripts/codemod_service/cpp_modernizer.sh`
in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that has a non meta copyright comment)
without swapping out the namespace macro `ROCKSDB_NAMESPACE`

Followed by RocksDB's own
`make format`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398

Test Plan: Auto tests

Reviewed By: hx235

Differential Revision: D54382532

Pulled By: jowlyzhang

fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa
2024-03-04 10:08:32 -08:00
Richard Barnes d7b8756976 Remove extra semi colon from internal_repo_rocksdb/repo/db/table_cache_sync_and_async.h
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D54362208

fbshipit-source-id: a47acd4c794c899fccb65285b116b50d9566ea12
2024-03-04 06:34:44 -08:00
Richard Barnes ced333ee45 Remove extra semi colon from instagram/ranking/mezql/shots/parser/fast/Token.cpp
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D54362213

fbshipit-source-id: 0bbc9e5fce917fc4f72423f0a4c8cb2c2b1759dd
2024-03-04 06:32:50 -08:00
Jay Huh c00c16855d Access DBImpl* and CFD* by CFHImpl* in Iterators (#12395)
Summary:
In the current implementation of iterators, `DBImpl*` and `ColumnFamilyData*` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR https://github.com/facebook/rocksdb/issues/12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs:  https://github.com/facebook/rocksdb/issues/11925 #11943, and https://github.com/facebook/rocksdb/issues/11977.

To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle*` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl*` and `ColumnFamilyData*` can be easily accessed via `ColumnFamilyHandleImpl*`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12395

Test Plan:
There should be no behavior changes. Existing tests and CI for the correctness tests.

**Test for Perf Regression**
Build
```
$> make -j64 release
```
Setup
```
$> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000
```

Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.552 micros/op 1810157 ops/sec 0.552 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       4.502 micros/op 222143 ops/sec 4.502 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator  :       0.520 micros/op 1924401 ops/sec 0.520 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom   :       4.532 micros/op 220657 ops/sec 4.532 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: pdillinger

Differential Revision: D54332713

Pulled By: jaykorean

fbshipit-source-id: b28d897ad519e58b1ca82eb068a6319544a4fae5
2024-03-01 10:28:20 -08:00
Jay Huh 5bcc184975 Update APIs to support generic unique identifier format (#12384)
Summary:
The current design proposes using a combination of `job_id`, `db_id`, and `db_session_id` to create a unique identifier for remote compaction jobs. However, this approach may not be suitable for users who prefer a different format for the unique identifier.

At Meta, we are utilizing generic compute offload to offload compaction tasks to remote workers. The compute offload client generates a UUID for each task, which requires an update to the current RocksDB API for onboarding purposes.

Users still have the option to create the unique identifier by combining `job_id`, `db_id`, and `db_session_id` if they prefer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12384

Test Plan:
```
$> ./compaction_service_test                                                                                                                             13:29:35
[==========] Running 14 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 14 tests from CompactionServiceTest
[ RUN      ] CompactionServiceTest.BasicCompactions
[       OK ] CompactionServiceTest.BasicCompactions (2642 ms)
[ RUN      ] CompactionServiceTest.ManualCompaction
[       OK ] CompactionServiceTest.ManualCompaction (454 ms)
[ RUN      ] CompactionServiceTest.CancelCompactionOnRemoteSide
[       OK ] CompactionServiceTest.CancelCompactionOnRemoteSide (1643 ms)
[ RUN      ] CompactionServiceTest.FailedToStart
[       OK ] CompactionServiceTest.FailedToStart (1332 ms)
[ RUN      ] CompactionServiceTest.InvalidResult
[       OK ] CompactionServiceTest.InvalidResult (1516 ms)
[ RUN      ] CompactionServiceTest.SubCompaction
[       OK ] CompactionServiceTest.SubCompaction (551 ms)
[ RUN      ] CompactionServiceTest.CompactionFilter
[       OK ] CompactionServiceTest.CompactionFilter (563 ms)
[ RUN      ] CompactionServiceTest.Snapshot
[       OK ] CompactionServiceTest.Snapshot (124 ms)
[ RUN      ] CompactionServiceTest.ConcurrentCompaction
[       OK ] CompactionServiceTest.ConcurrentCompaction (660 ms)
[ RUN      ] CompactionServiceTest.CompactionInfo
[       OK ] CompactionServiceTest.CompactionInfo (984 ms)
[ RUN      ] CompactionServiceTest.FallbackLocalAuto
[       OK ] CompactionServiceTest.FallbackLocalAuto (343 ms)
[ RUN      ] CompactionServiceTest.FallbackLocalManual
[       OK ] CompactionServiceTest.FallbackLocalManual (380 ms)
[ RUN      ] CompactionServiceTest.RemoteEventListener
[       OK ] CompactionServiceTest.RemoteEventListener (491 ms)
[ RUN      ] CompactionServiceTest.TablePropertiesCollector
[       OK ] CompactionServiceTest.TablePropertiesCollector (169 ms)
[----------] 14 tests from CompactionServiceTest (11854 ms total)

[----------] Global test environment tear-down
[==========] 14 tests from 1 test case ran. (11855 ms total)
[  PASSED  ] 14 tests.
```

Reviewed By: hx235

Differential Revision: D54220339

Pulled By: jaykorean

fbshipit-source-id: 5a9054f31933d1996adca02082eb37b6d5353224
2024-03-01 09:55:30 -08:00
Changyu Bi 4aed229fa7 Add `write_memtable_time` to perf level `kEnableWait` (#12394)
Summary:
.. so write time can be measured under the new perf level for single-threaded writes.
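
A hedged measurement sketch using the public perf context API:

```
#include <cstdint>

#include "rocksdb/db.h"
#include "rocksdb/perf_context.h"
#include "rocksdb/perf_level.h"

void MeasureMemtableWrite(rocksdb::DB* db) {
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableWait);
  rocksdb::get_perf_context()->Reset();
  db->Put(rocksdb::WriteOptions(), "key", "value");
  // Time spent writing the memtable for the Put above.
  uint64_t t = rocksdb::get_perf_context()->write_memtable_time;
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
  (void)t;
}
```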

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12394

Test Plan: * add a new UT `PerfContextTest.WriteMemtableTimePerfLevel`

Reviewed By: anand1976

Differential Revision: D54326263

Pulled By: cbi42

fbshipit-source-id: d0e334d9581851ba6cf53c776c0bd876365d1e00
2024-02-29 15:08:26 -08:00
Peter Dillinger 13ef21c22e default_write_temperature option (#12388)
Summary:
Currently, SST files to which neither last_level_temperature nor file_temperature_age_thresholds applies are written with temperature kUnknown, which is a little weird and doesn't support CF-based tiering. The default_temperature option only affects how kUnknown is interpreted for stats.

This change adds a new per-CF option default_write_temperature that determines the temperature of new SST files when those other options do not apply.
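
A hedged configuration sketch combining the new option with an existing tiering option:

```
#include "rocksdb/options.h"

rocksdb::Options TieredWriteOptions() {
  rocksdb::Options options;
  options.last_level_temperature = rocksdb::Temperature::kCold;
  // New per-CF option: the temperature for new SST files not covered by
  // last_level_temperature or file_temperature_age_thresholds.
  options.default_write_temperature = rocksdb::Temperature::kWarm;
  return options;
}
```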

Also made a change to ignore last_level_temperature with FIFO compaction, because I found that could lead to an infinite loop in compaction.

Needed follow-up: Fix temperature handling with external file ingestion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12388

Test Plan: unit tests extended appropriately. (Ignore whitespace changes when reviewing.)

Reviewed By: jowlyzhang

Differential Revision: D54266574

Pulled By: pdillinger

fbshipit-source-id: c9ec9a74dbf22be6e986f77f9689d05fea8ef0bb
2024-02-28 14:36:13 -08:00