Commit graph

739 commits

Author SHA1 Message Date
Jay Huh f22557886e Fix Compaction Stats (#13071)
Summary:
Compaction stats code is not so straightforward to understand. Here's a bit of context for this PR and why this change was made.

- **CompactionStats (compaction_stats_.stats):** Internal stats about the compaction used for logging and public metrics.
- **CompactionJobStats (compaction_job_stats_)**: The public stats at job level. It's part of Compaction event listener and included in the CompactionResult.
- **CompactionOutputsStats**: output stats only. resides in CompactionOutputs. It gets aggregated toward the CompactionStats (internal stats).

The internal stats, `compaction_stats_.stats`, has the output information recorded from the compaction iterator, but it does not have any input information (input records, input output files) until `UpdateCompactionStats()` gets called. We cannot simply call `UpdateCompactionStats()` to fill in the input information in the remote compaction (which is a subcompaction of the primary host's compaction) because the `compaction->inputs()` have the full list of input files and `UpdateCompactionStats()` takes the entire list of records in all files. `num_input_records` gets double-counted if multiple sub-compactions are submitted to the remote worker.

The job level stats (in the case of remote compaction, it's subcompaction level stat), `compaction_job_stats_`, has the correct input records, but has no output information. We can use `UpdateCompactionJobStats(compaction_stats_.stats)` to set the output information (num_output_records, num_output_files, etc.) from the `compaction_stats_.stats`, but it also sets all other fields including the input information which sets all back to 0.

Therefore, we are overriding `UpdateCompactionJobStats()` in remote worker only to update job level stats, `compaction_job_stats_`, with output information of the internal stats.

Baiscally, we are merging the aggregated output info from the internal stats and aggregated input info from the compaction job stats.

In this PR we are also fixing how we are setting `is_remote_compaction` in CompactionJobStats.
- OnCompactionBegin event, if options.compaction_service is set, `is_remote_compaction=true` for all compactions except for trivial moves
- OnCompactionCompleted event, if any of the sub_compactions were done remotely, compaction level stats's `is_remote_compaction` will be true

Other minor changes
- num_output_records is already available in CompactionJobStats. No need to store separately in CompactionResult.
- total_bytes is not needed.
- Renamed `SubcompactionState::AggregateCompactionStats()` to `SubcompactionState::AggregateCompactionOutputStats()` to make it clear that it's only aggregating output stats.
- Renamed `SetTotalBytes()` to `AddBytesWritten()` to make it more clear that it's adding total written bytes from the compaction output.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13071

Test Plan:
Unit Tests added and updated
```
./compaction_service_test
```

Reviewed By: anand1976

Differential Revision: D64479657

Pulled By: jaykorean

fbshipit-source-id: a7a776a00dc718abae95d856b661bcbafd3b0ed5
2024-10-16 19:20:37 -07:00
Changyu Bi 787730c859 Add an ingestion option to not fill block cache (#13067)
Summary:
add `IngestExternalFileOptions::fill_cache` to allow users to ingest files without loading index/filter/data and other blocks into block cache during file ingestion. This can be useful when users are ingesting files into a CF that is not available to readers yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13067

Test Plan:
* unit test: `ExternalSSTFileTest.NoBlockCache`
* ran one round of crash test with fill_cache disabled: `python3 ./tools/db_crashtest.py --simple blackbox --ops_per_thread=1000000 --interval=30 --ingest_external_file_one_in=200 --level0_stop_writes_trigger=200 --level0_slowdown_writes_trigger=100 --sync_fault_injection=0 --disable_wal=0 --manual_wal_flush_one_in=0`

Reviewed By: jowlyzhang

Differential Revision: D64356424

Pulled By: cbi42

fbshipit-source-id: b380c26f5987238e1ed7d42ceef0390cfaa0b8e2
2024-10-16 14:11:22 -07:00
Jay Huh da5e11310b Preserve Options File (#13074)
Summary:
In https://github.com/facebook/rocksdb/issues/13025 , we made a change to load the latest options file in the remote worker instead of serializing the entire set of options.

That was done under assumption that OPTIONS file do not get purged often. While testing, we learned that this happens more often than we want it to be, so we want to prevent the OPTIONS file from getting purged anytime between when the remote compaction is scheduled and the option is loaded in the remote worker.

Like how we are protecting new SST files from getting purged using `min_pending_output`, we are doing the same by keeping track of `min_options_file_number`. Any OPTIONS file with number greater than `min_options_file_number` will be protected from getting purged. Just like `min_pending_output`, `min_options_file_number` gets bumped when the compaction is done. This is only applicable when `options.compaction_service` is set.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13074

Test Plan:
```
./compaction_service_test --gtest_filter="*PreservedOptionsLocalCompaction*"
./compaction_service_test --gtest_filter="*PreservedOptionsRemoteCompaction*"
```

Reviewed By: anand1976

Differential Revision: D64433795

Pulled By: jaykorean

fbshipit-source-id: 0d902773f0909d9481dec40abf0b4c54ce5e86b2
2024-10-16 09:22:51 -07:00
Yu Zhang 8592517c89 Remove stale entries from L0 files when UDT is not persisted (#13035)
Summary:
When user-defined timestamps are not persisted, currently we replace the actual timestamp with min timestamp after an entry is output from compaction iterator. Compaction iterator won't be able to help with removing stale entries this way. This PR adds a wrapper iterator `TimestampStrippingIterator` for `MemTableIterator` that does the min timestamp replacement at the memtable iteration step. It is used by flush and can help remove stale entries from landing in L0 files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13035

Test Plan: Added unit test

Reviewed By: pdillinger, cbi42

Differential Revision: D63423682

Pulled By: jowlyzhang

fbshipit-source-id: 087dcc9cee97b9ea51b8d2b88dc91c2984d54e55
2024-10-14 12:28:35 -07:00
Yu Zhang 9375c3b635 Fix needs_flush assertion in file ingestion (#13045)
Summary:
This PR makes file ingestion job's flush wait a bit further until the SuperVersion is also updated. This is necessary since follow up operations will use the current SuperVersion to do range overlapping check and level assignment.

In debug mode, file ingestion job's second `NeedsFlush` call could have been invoked when the memtables are flushed but the SuperVersion hasn't been updated yet, triggering the assertion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13045

Test Plan:
Existing tests
Manually stress tested

Reviewed By: cbi42

Differential Revision: D63671151

Pulled By: jowlyzhang

fbshipit-source-id: 95a169e58a7e59f6dd4125e7296e9060fe4c63a7
2024-10-02 17:19:18 -07:00
Peter Dillinger a1a102ffce Steps toward deprecating implicit prefix seek, related fixes (#13026)
Summary:
With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly.

Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`.

Related fixes / changes:
* Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation).
* Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.)

Suggested follow-up:
* Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness.
* Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek.
* Add `prefix_same_as_start` testing to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026

Test Plan:
tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while.

Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all).

Reviewed By: ltamasi

Differential Revision: D63147378

Pulled By: pdillinger

fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e
2024-09-20 15:54:19 -07:00
Jay Huh 5f4a8c3da4 Load latest options from OPTIONS file in Remote host (#13025)
Summary:
We've been serializing and deserializing DBOptions and CFOptions (and other CF into) as part of `CompactionServiceInput`. These are all readily available in the OPTIONS file and the remote worker can read the OPTIONS file to obtain the same information. This helps reducing the size of payload significantly.

In a very rare scenario if the OPTIONS file is purged due to options change by primary host at the same time while the remote host is loading the latest options, it may fail. In this case, we just retry once.

This also solves the problem where we had to open the default CF with the CFOption from another CF if the remote compaction is for a non-default column family. (TODO comment in /db_impl_secondary.cc)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13025

Test Plan:
Unit Tests
```
./compaction_service_test
```
```
./compaction_job_test
```

Also tested with Meta's internal Offload Infra

Reviewed By: anand1976, cbi42

Differential Revision: D63100109

Pulled By: jaykorean

fbshipit-source-id: b7162695e31e2c5a920daa7f432842163a5b156d
2024-09-20 13:26:02 -07:00
Peter Dillinger 98c33cb8e3 Steps toward making IDENTITY file obsolete (#13019)
Summary:
* Set write_dbid_to_manifest=true by default
* Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file
* Refactor related DB open code to minimize code duplication

_Recommend hiding whitespace changes for review_

Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019

Test Plan: unit tests and stress test updated for new functionality

Reviewed By: anand1976

Differential Revision: D62898229

Pulled By: pdillinger

fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0
2024-09-19 14:05:21 -07:00
Nick Brekhus 0611eb5b9d Fix orphaned files in SstFileManager (#13015)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13015

`Close()`ing a database now releases tracked files in `SstFileManager`. Previously this space would be leaked until the database was later reopened.

Reviewed By: jowlyzhang

Differential Revision: D62590773

fbshipit-source-id: 5461bd253d974ac4967ad52fee92e2650f8a9a28
2024-09-18 13:27:44 -07:00
Changyu Bi f97e33454f Fix a bug with auto recovery on WAL write error (#12995)
Summary:
A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:

```
0. manual_wal_flush is true. There are multiple CFs in a DB.
1. Submit a write batch with updates to multiple CF
2. A FlushWAL or a memtable swtich that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped: 4b1d595306/file/writable_file_writer.cc (L624)
The error needs to be retryable to start background auto recovery.
3. One CF successfully flushes its memtable during auto recovery.
4. Crash the process.
5. Reopen the DB, one CF will have the update as a result of successful flush. Other CFs will miss all the updates in the write batch since WAL does not have them.
```

This can happen if a users configures manual_wal_flush, uses more than one CF, and can hit retryable error for WAL writes. This PR is a short-term fix that upgrades WAL related errors to fatal and not trigger auto recovery.

A long-term fix may be not drop buffered WAL data by checking how much data is actually written, or require atomically flushing all column families during error recovery from this kind of errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12995

Test Plan:
added unit test to check error severity and if recovery is triggered. A crash test repro command that fails in a few runs before this PR:
```
python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0  --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944  --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1
```

Reviewed By: hx235

Differential Revision: D62888510

Pulled By: cbi42

fbshipit-source-id: 308bdbbb8d897cc8eba950155cd0e37cf7eb76fe
2024-09-17 14:10:33 -07:00
Changyu Bi e490f2b051 Fix a bug in ReFitLevel() where FileMetaData::being_compacted is not cleared (#13009)
Summary:
in ReFitLevel(), we were not setting being_compacted to false after ReFitLevel() is done. This is not a issue if refit level is successful, since new FileMetaData is created for files at the target level. However, if there's an error during RefitLevel(), e.g., Manifest write failure, we should clear the being_compacted field for these files. Otherwise, these files will not be picked for compaction until db reopen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13009

Test Plan:
existing test.
- stress test failure in T200339331 should not happen anymore.

Reviewed By: hx235

Differential Revision: D62597169

Pulled By: cbi42

fbshipit-source-id: 0ba659806da6d6d4b42384fc95268b2d7bad720e
2024-09-12 15:19:14 -07:00
Yu Zhang 43bc71fef6 Add an internal API MemTableList::GetEditForDroppingCurrentVersion (#13001)
Summary:
Prepare this internal API to be used by atomic data replacement. The main purpose of this API is to get a `VersionEdit` to mark the entire current `MemTableListVersion` as dropped.  Flush needs the similar functionality when installing results, so that logic is refactored into a util function `GetDBRecoveryEditForObsoletingMemTables` to be shared by flush and this internal API.

To test this internal API, flush's result installation is redirected to use this API when it is flushing all the immutable MemTables in debug mode. It should achieve the exact same results, just with a duplicated `VersionEdit::log_number` field that doesn't upsets the recovery logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13001

Test Plan: Existing tests

Reviewed By: pdillinger

Differential Revision: D62309591

Pulled By: jowlyzhang

fbshipit-source-id: e25914d9a2e281c25ab7ee31a66eaf6adfae4b88
2024-09-10 13:23:13 -07:00
Yu Zhang 0c6e9c036a Make compaction always use the input version with extra ref protection (#12992)
Summary:
`Compaction` is already creating its own ref for the input Version: 4b1d595306/db/compaction/compaction.cc (L73)

And properly Unref it during destruction:
4b1d595306/db/compaction/compaction.cc (L450)

This PR redirects compaction's access of `cfd->current()` to this input `Version`, to prepare for when a column family's data can be replaced all together, and `cfd->current()` is not safe to access for a compaction job. Because a new `Version` with just some other external files could be installed as `cfd->current()`. The compaction job's expectation of the current `Version` and the corresponding storage info to always have its input files will no longer be guaranteed.

My next follow up is to do a similar thing for flush, also to prepare it for when a column family's data can be replaced. I will make it create its own reference of the current `MemTableListVersion` and use it as input, all flush job's access of memtables will be wired to that input `MemTableListVersion`. Similarly this reference will be unreffed during a flush job's destruction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12992

Test Plan: Existing tests

Reviewed By: pdillinger

Differential Revision: D62212625

Pulled By: jowlyzhang

fbshipit-source-id: 9a781213469cf366857a128d50a702af683a046a
2024-09-06 14:07:33 -07:00
Yu Zhang a24574e80a Add documentation for background job's state transition (#12994)
Summary:
The `SchedulePending*` API is a bit confusing since it doesn't immediately schedule the work and can be confused with the actual scheduling. So I have changed these to be `EnqueuePending*` and added some documentation for the corresponding state transitions of these background work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12994

Test Plan: existing tests

Reviewed By: cbi42

Differential Revision: D62252746

Pulled By: jowlyzhang

fbshipit-source-id: ee68be6ed33070cad9a5004b7b3e16f5bcb041bf
2024-09-06 13:08:34 -07:00
Changyu Bi cd6f802ccb Add a new file ingestion option link_files (#12980)
Summary:
Add option `IngestExternalFileOptions::link_files` that hard links input files and preserves original file links after ingestion, unlike `move_files` which will unlink input files after ingestion. This can be useful when being used together with `allow_db_generated_files` to ingest files from another DB. Also reverted the change to `move_files` in https://github.com/facebook/rocksdb/issues/12959 to simplify the contract so that it will always unlink input files without exception.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12980

Test Plan: updated unit test `ExternSSTFileLinkFailFallbackTest.LinkFailFallBackExternalSst` to test that input files will not be unlinked.

Reviewed By: pdillinger

Differential Revision: D61925111

Pulled By: cbi42

fbshipit-source-id: eadaca72e1ae5288bdd195d57158466e5656fa62
2024-09-03 13:06:25 -07:00
Changyu Bi 4eb5878ab2 Support ingesting db generated files using hard link (#12959)
Summary:
so `IngestExternalFileOptions::move_files` and `IngestExternalFileOptions::allow_db_generated_files` are now compatible. The original file links won't be removed if `allow_db_generated_files` is true. This is to prevent deleting files from another DB.

There was a [comment](https://github.com/facebook/rocksdb/pull/12750#discussion_r1684509620) in https://github.com/facebook/rocksdb/issues/12750 about how exactly-once ingestion would work with `move_files`. I've discussed with customer and decided that it can be done by reading the target DB to see if it contains any ingested key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12959

Test Plan: updated unit tests `IngestDBGeneratedFileTest*` to enable `move_files`.

Reviewed By: jowlyzhang

Differential Revision: D61703480

Pulled By: cbi42

fbshipit-source-id: 6b4294369767f989a2f36bbace4ca3c0257aeaf7
2024-08-26 12:25:16 -07:00
Peter Dillinger 96340dbce2 Options for file temperature for more files (#12957)
Summary:
We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options:
* `metadata_write_temperature` for DB "small" files that don't contain much user data
* `wal_write_temperature` for WALs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957

Test Plan: Unit test included, though it's hard to be sure we've covered all the places

Reviewed By: jowlyzhang

Differential Revision: D61664815

Pulled By: pdillinger

fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b
2024-08-23 19:49:25 -07:00
eniac1024 f5c5f881d2 Fix MultiGet with timestamps (#12943)
Summary:
Issue: MultiGet(PinnableSlice) can't read out all timestamps.
Fixed the impl, and added an UT as well. In the original impl, if MultiGet reads multiple column families, a later column family would clean up timestamps of previous column family.
Fix: https://github.com/facebook/rocksdb/issues/12950#issue-2476996580

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12943

Reviewed By: anand1976

Differential Revision: D61729257

Pulled By: pdillinger

fbshipit-source-id: 55267c26076c8a59acedd27e14714711729a40df
2024-08-23 13:58:04 -07:00
anand76 c21fe1a47f Add ticker stats for read corruption retries (#12923)
Summary:
Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923

Test Plan: Update existing tests

Reviewed By: jowlyzhang

Differential Revision: D61024687

Pulled By: anand1976

fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06
2024-08-12 15:32:07 -07:00
Changyu Bi b32d899482 Fix MultiGet dropping memtable kv checksum corruption (#12842)
Summary:
Corruption status returned by `GetFromTable()` could be overwritten here: b6c3495a71/db/version_set.cc (L2614)

This PR fixes this issue by setting `*(s->found_final_value) = true;` in SaveValue. Also makes the handling of the return value of `GetFromTable()` more robust and added asserts in a couple places.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12842

Test Plan: Updated an existing unit test to cover MultiGet. It fails the assertion here before this PR: b6c3495a71/db/version_set.cc (L2601)

Reviewed By: anand1976

Differential Revision: D59498203

Pulled By: cbi42

fbshipit-source-id: 1f071c1b2c5b66fb71264b547a9e670d1cf592f0
2024-08-08 13:34:11 -07:00
Yu Zhang d12aaf23ca Fix file deletions in DestroyDB not rate limited (#12891)
Summary:
Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`.

It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it.  This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891

Test Plan: Added unit tests, existing tests.

Reviewed By: anand1976

Differential Revision: D60300220

Pulled By: jowlyzhang

fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7
2024-08-02 19:31:55 -07:00
Levi Tamasi 2e8a1a14ef Fix a data race affecting the background error status (#12910)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12910

There is currently a call to `GetBGError()` in `DBImpl::WriteImplWALOnly()` where the DB mutex is (incorrectly) not held, leading to a data race. Technically, we could acquire the mutex here but instead, the patch removes the affected check altogether, since the same check is already performed (in a thread-safe manner) in the subsequent call to `PreprocessWrite()`.

Reviewed By: cbi42

Differential Revision: D60682008

fbshipit-source-id: 54b67975dcf57d67c068cac71e8ada09a1793ec5
2024-08-02 14:11:08 -07:00
Peter Dillinger 2595476541 Fix rare WAL handling crash (#12899)
Summary:
A crash test failure in log sync in DBImpl::WriteToWAL is due to a missed case in https://github.com/facebook/rocksdb/issues/12734. Just need to apply similar logic from DBImpl::SyncWalImpl to check for an already closed WAL (nullptr writer). This is extremely rare because it only comes from failed Sync on a closed WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12899

Test Plan: watch crash test

Reviewed By: cbi42

Differential Revision: D60481652

Pulled By: pdillinger

fbshipit-source-id: 4a176bb6a53dcf077f88344710a110c2f946c386
2024-07-30 17:38:30 -07:00
Hui Xiao 408e8d4c85 Handle injected write error after successful WAL write in crash test + misc (#12838)
Summary:
**Context/Summary:**
We discovered the following false positive in our crash test lately:
(1) PUT() writes k/v to WAL but fails in `ApplyWALToManifest()`. The k/v is in the WAL
(2) Current stress test logic will rollback the expected state of such k/v since PUT() fails
(3) If the DB crashes before recovery finishes and reopens, the WAL will be replayed and the k/v is in the DB while the expected state have been roll-backed.

We decided to leave those expected state to be pending until the loop-write of the same key succeeds.

Bonus: Now that I realized write to manifest can also fail the write which faces the similar problem as https://github.com/facebook/rocksdb/pull/12797, I decided to disable fault injection on user write per thread (instead of globally) when tracing is needed for prefix recovery; some refactory

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12838

Test Plan:
Rehearsal CI
Run below command (varies on sync_fault_injection=1,0 to verify ExpectedState behavior) for a while to ensure crash recovery validation works fine

```
python3 tools/db_crashtest.py --simple blackbox --interval=30 --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=20 --delrangepercent=20 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=8 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=8 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=10 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=8 --writepercent=40
```

Reviewed By: cbi42

Differential Revision: D59377075

Pulled By: hx235

fbshipit-source-id: 91f602fd67e2d339d378cd28b982095fd073dcb6
2024-07-29 13:51:49 -07:00
Hui Xiao 349b1ec08f Fix duplicate WAL entries caused by write after error recovery (#12873)
Summary:
**Context/Summary:**
We recently discovered a case where write of the same key right after error recovery of a previous failed write of the same key finishes causes two same WAL entries, violating our assertion. This is because we don't advance seqno on failed write and reuse the same WAL containing the failed write for the new write if the memtable at the time is empty.

This PR reuses the flush path for an empty memtable to switch WAL and update min WAL to keep in error recovery flush
 as well as updates the INFO log message for clarity.

```
2024/07/17-15:01:32.271789 327757 (Original Log Time 2024/07/17-15:01:25.942234) [/flush_job.cc:1017] [default] [JOB 2] Level-0 flush table https://github.com/facebook/rocksdb/issues/9: 0 bytes OK It's an empty SST file from a successful flush so won't be kept in the DB
2024/07/17-15:01:32.271798 327757 (Original Log Time 2024/07/17-15:01:32.269954) [/memtable_list.cc:560] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9 started
2024/07/17-15:01:32.271802 327757 (Original Log Time 2024/07/17-15:01:32.271217) [/memtable_list.cc:760] [default] Level-0 commit flush result of table https://github.com/facebook/rocksdb/issues/9: memtable https://github.com/facebook/rocksdb/issues/1 done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12873

Test Plan:
New UT that failed before this PR with following assertion failure (i.e, duplicate WAL entries) and passes after
```
db_wal_test: db/write_batch.cc:2254: rocksdb::Status rocksdb::{anonymous}::MemTableInserter::PutCFImpl(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&, rocksdb::ValueType, RebuildTxnOp, const ProtectionInfoKVOS64*) [with RebuildTxnOp = rocksdb::{anonymous}::MemTableInserter::PutCF(uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)::<lambda(rocksdb::WriteBatch*, uint32_t, const rocksdb::Slice&, const rocksdb::Slice&)>; uint32_t = unsigned int; rocksdb::ProtectionInfoKVOS64 = rocksdb::ProtectionInfoKVOS<long unsigned int>]: Assertion `seq_per_batch_' failed.
```

Reviewed By: anand1976

Differential Revision: D59884468

Pulled By: hx235

fbshipit-source-id: 5d854b719092552c69727a979f269fb7f6c39756
2024-07-22 12:40:25 -07:00
Changyu Bi c064ac3bc5 Avoid opening table files and reading table properties under mutex (#12879)
Summary:
InitInputTableProperties() can open and do IOs and is called under mutex_. This PR removes it from FinalizeInputInfo(). It is now called in CompactionJob::Run() and BuildCompactionJobInfo() (called in NotifyOnCompactionBegin()) without holding mutex_.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12879

Test Plan: existing unit tests. Added assert in GetInputTableProperties() to ensure that input_table_properties_ is initialized whenever it's called.

Reviewed By: hx235

Differential Revision: D59933195

Pulled By: cbi42

fbshipit-source-id: c8089e13af8567fa3ab4b94d9ec384ae98ab2ec8
2024-07-19 19:12:45 -07:00
Changyu Bi 4384dd5eee Support ingesting SST files generated by a live DB (#12750)
Summary:
... to enable use cases like using RocksDB to merge sort data for ingestion. A new file ingestion option `IngestExternalFileOptions::allow_db_generated_files` is introduced to allows users to ingest SST files generated by live DBs instead of SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number AND do not overlap with any data in the DB (so we can assign seqno 0 which matches the seqno of all ingested keys).

The feature is marked the option as experimental for now.

Main changes needed to enable this:
- ignore CF id mismatch during ingestion
- ignore the missing external file version table property

Rest of the change is mostly in new unit tests.

A previous attempt is in https://github.com/facebook/rocksdb/issues/5602.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12750

Test Plan: - new unit tests

Reviewed By: ajkr, jowlyzhang

Differential Revision: D58396673

Pulled By: cbi42

fbshipit-source-id: aae513afad7b1ff5d4faa48104df5f384926bf03
2024-07-19 16:14:54 -07:00
Hui Xiao 21db55f816 Move WAL sync before memtable insertion (#12869)
Summary:
**Context/Summary:**
WAL sync currently happens after memtable write. This causes inconvenience in stress test as we can't simply rollback the ExpectedState when write fails due to injected WAL sync error so something complicated like https://github.com/facebook/rocksdb/pull/12838 might be needed. After moving WAL sync before memtable insertion, there should not be injected IO error after memtable insertion so we can keep the current simple way of handling failed write in stress test with ExpectedState rollback.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12869

Test Plan:
1. Below command failed with `iterator has key 0000000000000207000000000000012B0000000000000013, but expected state does not.` before this PR and passes after
```
./db_stress  --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=4 --bloom_bits=56.810257702625165 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=04:00-08:00 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=0 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=4 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=50 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=0 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=4 --sync=1 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=239 --universal_max_read_amp=-1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=128 --writepercent=50

Reviewed By: jowlyzhang

Differential Revision: D59825730

Pulled By: hx235

fbshipit-source-id: 7d77aaf177ded2f99bf1ce19f5a4bd0783b9ca92
2024-07-17 13:39:14 -07:00
Hui Xiao 9e4ee7f0c6 Fix non-okay status being ignored in write path under two_write_queues_ (#12866)
Summary:
Context/Summary: see above, though the impact is small.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12866

Test Plan: exiting UT

Reviewed By: anand1976

Differential Revision: D59782913

Pulled By: hx235

fbshipit-source-id: ec02843645cce49466bde602035d2e61c31965b8
2024-07-16 10:55:08 -07:00
Peter Dillinger 0bb939611d Avoid unnecessary work in internal calls to GetSortedWalFiles (#12831)
Summary:
We are seeing a number of crash test failures coming from checkpoint and backup code, likely from WalManager::GetSortedWalFiles -> ... -> WalManager::ReadFirstLine and this code path is not needed, because we don't need to know the sequence numbers of WAL files going into a checkpoint or backup. We can minimize the impact of whatever inconsistency is causing that problem by not relying on it where it's not needed.

Similarly, when we only need a roughly accurate set of current WAL files, we don't need to query all the archived WAL files (and redundantly the live ones again).

So this reduces filesystem queries and DB mutex acquires in creating backups and checkpoints.

Needed follow-up:
Figure out what is causing various failures with an apparent inconsistency where GetSortedWalFiles fails on reading a WAL file. If it's an injected failure, perhaps it's not propagating that injected failure appropriately. It might also be an inconsistency between what the DB knows is flushed and what WalManager reads from the filesystem (which we know is dubious and should be phased out, which this is arguably another step toward). Or completing that phase-out might solve the problem without a full diagnosis.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12831

Test Plan:
existing tests (easily caught when I went too far in initally developing this change)

Update to BackupUsingDirectIO test so that there's a WAL file in what is backed up. (Was relying on some oddity.)

Reviewed By: cbi42

Differential Revision: D59252649

Pulled By: pdillinger

fbshipit-source-id: 7ad4187a1c70caa59a6d6c1c643ef95232b929f5
2024-07-01 23:29:02 -07:00
Changyu Bi a31fe52173 Remove the return value of SetBGError() (#12792)
Summary:
the return value for `ErrorHandler::SetBGError(error)` seems to be not well-defined, it can be `bg_error_` (no matter if the `bg_error_` is set to the input error), ok status or [`recovery_error_`](3ee4d5a11a/db/error_handler.cc (L669)) from `StartRecoverFromRetryableBGIOError()`.  The `recovery_error_` returned may be an OK status.

We have only a few places that use the return value of  `SetBGError()` and they don't need to do so. Using the return value may even be wrong for example in 3ee4d5a11a/db/db_impl/db_impl_write.cc (L2365) where a non-ok `s` could be overwritten to OK. This PR changes SetBGError() to return void and clean up relevant code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12792

Test Plan: existing unit tests and go over all places where return value of `SetBGError()` is used.

Reviewed By: hx235

Differential Revision: D58904898

Pulled By: cbi42

fbshipit-source-id: d58a20ba5a40e3f35367c6034a32c755088c3653
2024-06-26 18:17:05 -07:00
Peter Dillinger 39455974cb Fix possible double-free on TruncatedRangeDelIterator (#12805)
Summary:
Not sure where or how it happens, but using a recent CircleCI failure I got a reliable db_stress reproducer.

Using std::unique_ptr appropriately for managing them has apparently (and unsurprisingly) fixed the problem without needing to know exactly where the problem was.

Suggested follow-up:
* Three or even four levels of pointers is very confusing to work with. Surely this part can be cleaned up to be simpler.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12805

Test Plan:
Reproducer passes, plus ASAN test and crash test runs. I don't think it's worth the extra work to track down the details and create a careful unit test.

```
./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=2 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=tiered_lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=1 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=0 --compaction_readahead_size=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0.2 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb.gpxs/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --expected_values_dir=/dev/shm/rocksdb.gpxs/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=4 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kHot --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=60 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=1 --uncache_aggressiveness=5 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```

Reviewed By: cbi42

Differential Revision: D58958390

Pulled By: pdillinger

fbshipit-source-id: 1271cfdcc3c574f78cd59f3c68148f7ed4a19c47
2024-06-24 11:51:16 -07:00
Hui Xiao 9d64ca55b7 Proceed for new memtable on okay status (#12798)
Summary:
**Context/Summary:**
The relevant code logs info of newly created WAL and proceeds to "ConstructFragmentedRangeTombstones()" even when the previous step fails. This PR fixes it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12798

Test Plan: Existing tests

Reviewed By: cbi42

Differential Revision: D58917246

Pulled By: hx235

fbshipit-source-id: f395210d91e50617195cb9a8047cf5d82db0c40e
2024-06-22 16:17:59 -07:00
Peter Dillinger 3ee4d5a11a Fix possible crash in failure to sync some WALs (#12789)
Summary:
I believe this was possible with recyclable logs before recent work like https://github.com/facebook/rocksdb/issues/12734, but this cleans up a couple of possible crashes revealed by the crash test.  A WAL with a nullptr file writer (already closed) can persist in `logs_` if a later WAL fails to sync. In case of any WAL sync failures, we don't record WAL syncs to the manifest. Thus, even if a WAL is fully synced and closed, we might need to keep it on the `logs_` list so that we know to record its sync to the manifest if there should be a successful sync next time. (However, I believe that's future-looking because currently any failure in WAL sync is considered non-recoverable.)

I don't believe this was likely enough before recent changes to warrant a release note (if it was possible).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12789

Test Plan: A unit test that would reveal the crashes, now fixed

Reviewed By: cbi42

Differential Revision: D58874154

Pulled By: pdillinger

fbshipit-source-id: bc69407cd9cbcd080af9585d502d4e33dafc3d29
2024-06-21 12:56:21 -07:00
Hui Xiao 1adb935720 Inject more errors to more files in stress test (#12713)
Summary:
**Context:**
We currently have partial error injection:
- DB operation: all read, SST write
- DB open: all read, SST write, all metadata write.

This PR completes the error injection (with some limitations below):
- DB operation & open: all read, all write, all metadata write, all metadata read

**Summary:**
- Inject retryable metadata read, metadata write error concerning directory (e.g, dir sync, ) or file metadata (e.g, name, size, file creation/deletion...)
- Inject retryable errors to all major file types: random access file, sequential file, writable file
- Allow db stress test operations to handle above injected errors gracefully without crashing
- Change all error injection to thread-local implementation for easier disabling and enabling in the same thread. For example, we can control error handling thread to have no error injection. It's also cleaner in code.
   - Limitation: compared to before, we now don't have write fault injection for backup/restore CopyOrCreateFiles work threads since they use anonymous background threads as well as read injection for db open bg thread
- Add a new flag to test error recovery without error injection so we can test the path where error recovery actually succeeds
- Some Refactory & fix to db stress test framework (see PR review comments)
- Fix some minor bugs surfaced (see PR review comments)
- Limitation: had to disable backup restore with metadata read/write injection since it surfaces too many testing issues. Will add it back later to focus on surfacing actual code/internal bugs first.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12713

Test Plan:
- Existing UT
- CI with no trivial error failure

Reviewed By: pdillinger

Differential Revision: D58326608

Pulled By: hx235

fbshipit-source-id: 011b5195aaeb6011641ae0a9194f7f2a0e325ad7
2024-06-19 08:42:00 -07:00
Yu Zhang c73cf7a878 Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760)
Summary:
This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier.

A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following:
1) Setting the ratio to be equal to or smaller than 0 disables this collector
2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met.
3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction.
 For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D58734976

Pulled By: jowlyzhang

fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81
2024-06-18 10:51:29 -07:00
Jonah Gao 9f95aa8269 GetAggregatedIntProperty accumulates property once per block cache (#12755)
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/12687.

A block cache may be shared by multiple column families. Therefore, when getting the aggregated property of the block cache, we need to deduplicate by instances of the block cache, meaning the same instance should only be counted once.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12755

Reviewed By: jowlyzhang

Differential Revision: D58508819

Pulled By: ajkr

fbshipit-source-id: 3b746841d7eac59f900387ec3b8c19dbcd20aae4
2024-06-18 10:46:55 -07:00
Yu Zhang f5e44f3490 Fix manual flush hanging on waiting for no stall for UDT in memtable … (#12771)
Summary:
This PR fix a possible manual flush hanging scenario because of its expectation that others will clear out excessive memtables was not met. The root cause is the FlushRequest rescheduling logic is using a stricter criteria for what a write stall is about to happen means than `WaitUntilFlushWouldNotStallWrites` does. Currently, the former thinks a write stall is about to happen when the last memtable is half full, and it will instead reschedule queued FlushRequest and not actually proceed with the flush. While the latter thinks if we already start to use the last memtable, we should wait until some other background flush jobs clear out some memtables before proceed this manual flush.

If we make them use the same criteria, we can guarantee that at any time when`WaitUntilFlushWouldNotStallWrites` is waiting, it's not because the rescheduling logic is holding it back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12771

Test Plan: Added unit test

Reviewed By: ajkr

Differential Revision: D58603746

Pulled By: jowlyzhang

fbshipit-source-id: 9fa1c87c0175d47a40f584dfb1b497baa576755b
2024-06-14 13:37:37 -07:00
Yu Zhang 13c758f986 Change the behavior of manual flush to not retain UDT (#12737)
Summary:
When user-defined timestamps in Memtable only feature is enabled, all scheduled flushes go through a check to see if it's eligible to be rescheduled to retain user-defined timestamps. However when the user makes a manual flush request, their intention is for all the in memory data to be persisted into SST files as soon as possible. These two sides have some conflict of interest, the user can implement some workaround like https://github.com/facebook/rocksdb/issues/12631 to explicitly mark which one takes precedence. The implementation for this can be nuanced since the user needs to be aware of all the scenarios that can trigger a manual flush and handle the concurrency well etc.

In this PR, we updated the default behavior to give manual flush precedence when it's requested. The user-defined timestamps rescheduling mechanism is turned off when a manual flush is requested. Likewise, all error recovery triggered flushes skips the rescheduling mechanism too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12737

Test Plan: Add unit tests

Reviewed By: ajkr

Differential Revision: D58538246

Pulled By: jowlyzhang

fbshipit-source-id: 0b9b3d1af3e8d882f2d6a2406adda19324ba0694
2024-06-13 13:18:10 -07:00
Peter Dillinger 0646ec6e2d Ensure Close() before LinkFile() for WALs in Checkpoint (#12734)
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731
to bring more hygiene to our handling of WAL files in Checkpoint.

Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
  `background_close_inactive_wals`.

Stacked on https://github.com/facebook/rocksdb/issues/12731

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734

Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.

Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)

Reviewed By: cbi42

Differential Revision: D58295284

Pulled By: pdillinger

fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
2024-06-12 11:48:45 -07:00
Peter Dillinger b34cef57b7 Support pro-actively erasing obsolete block cache entries (#12694)
Summary:
Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries.

Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive.

This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on.

The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizeable by CF.

Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths.

Notable auxiliary code details:
* Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce an current failure.)
* Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item)

Marked EXPERIMENTAL until more thorough validation is complete.

Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694

Test Plan:
* Unit tests added, including refactoring an existing test to make better use of parameterized tests.
* Added to crash test.
* Performance, sample command:
```
for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done
```

Averaging 10 runs each case, block cache data block hit rates

```
lru_cache
UA=0   -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0
UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0

fixed_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9
UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2

auto_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5
UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8
```

Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings).

Reviewed By: ajkr

Differential Revision: D57932442

Pulled By: pdillinger

fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c
2024-06-07 08:57:11 -07:00
Yu Zhang 44aceb88d0 Add a OnManualFlushScheduled callback in event listener (#12631)
Summary:
As titled. Also added the newest user-defined timestamp into the `MemTableInfo`. This can be a useful info in the callback.

Added some unit tests as examples for how users can use two separate approaches to allow manual flush / manual compactions to go through when the user-defined timestamps in memtable only feature is enabled. One approach relies on selectively increase cutoff timestamp in `OnMemtableSeal` callback when it's initiated by a manual flush. Another approach is to increase cutoff timestamp in `OnManualFlushScheduled` callback. The caveats of the approaches are also documented in the unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12631

Reviewed By: ajkr

Differential Revision: D58260528

Pulled By: jowlyzhang

fbshipit-source-id: bf446d7140affdf124744095e0a179fa6e427532
2024-06-06 17:29:01 -07:00
Hui Xiao 390fc55ba1 Revert PR 12684 and 12556 (#12738)
Summary:
**Context/Summary:** a better API design is decided lately so we decided to revert these two changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12738

Test Plan: - CI

Reviewed By: ajkr

Differential Revision: D58162165

Pulled By: hx235

fbshipit-source-id: 9bbe4d2fe9fbe39213f4cf137a2d419e6ffb8e16
2024-06-06 11:46:16 -07:00
Peter Dillinger 98393f0139 Fix Checkpoint hard link of inactive but unsynced WAL (#12731)
Summary:
Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the `logs_` list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083,
it would synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:
* Detect which WALs might not be fully synced, according to our DBImpl
  metadata, and prevent hard linking those (with `trim_to_size=true`
  from `GetLiveFilesStorageInfo()`. And while we're at it, use our known
  flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum
  needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no
  longer need to consider syncing as part of the FlushWAL in
  GetLiveFilesStorageInfo. (We determine the max WAL number consistent
  with the manifest file size, while holding DB mutex. Should make
  track_and_verify_wals_in_manifest happy.) This makes the premise of
  test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
  no longer hit) so the test is removed, with crash test as backstop for
  related issues. See https://github.com/facebook/rocksdb/issues/10185

Stacked on https://github.com/facebook/rocksdb/issues/12729

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731

Test Plan:
Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.

Reviewed By: cbi42

Differential Revision: D58199629

Pulled By: pdillinger

fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382
2024-06-05 17:40:09 -07:00
Yu Zhang fc59d8f9c6 Add public API WriteWithCallback to support custom callbacks (#12603)
Summary:
This PR adds a `DB::WriteWithCallback` API that does the same things as `DB::Write` while takes an argument `UserWriteCallback` to execute custom callback functions during the write.

We currently support two types of callback functions: `OnWriteEnqueued` and `OnWalWriteFinish`. The former is invoked   after the write is enqueued, and the later is invoked after WAL write finishes when applicable.

These callback functions are intended for users to use to improve synchronization between concurrent writes, their execution is on the write's critical path so it will impact the write's latency if not used properly. The documentation for the callback interface mentioned this and suggest user to keep these callback functions' implementation minimum.

Although transaction interfaces' writes doesn't yet allow user to specify such a user write callback argument, the `DBImpl::Write*` type of APIs do not differentiate between regular DB writes or writes coming from the transaction layer when it comes to supporting this `UserWriteCallback`. These callbacks works for all the write modes including: default write mode, Options.two_write_queues, Options.unordered_write, Options.enable_pipelined_write

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12603

Test Plan: Added unit test in ./write_callback_test

Reviewed By: anand1976

Differential Revision: D58044638

Pulled By: jowlyzhang

fbshipit-source-id: 87a84a0221df8f589ec8fc4d74597e72ce97e4cd
2024-05-31 19:30:19 -07:00
Peter Dillinger 7127119ae9 Refactor SyncWAL and SyncClosedLogs for code sharing (#12707)
Summary:
These functions were very similar and did not make sense for maintaining separately. This is not a pure refactor but I think bringing the behaviors closer together should reduce long term risk of unintentionally divergent behavior. This change is motivated by some forthcoming WAL handling fixes for Checkpoint and Backups.

* Sync() is always used on closed WALs, like the old SyncClosedWals. SyncWithoutFlush() is only used on the active (maybe) WAL. Perhaps SyncWithoutFlush() should be used whenever available, but I don't know which is preferred, as the previous state of the code was inconsistent.
* Syncing the WAL dir is selective based on need, like old SyncWAL, rather than done always like old SyncClosedLogs. This could be a performance improvement that was never applied to SyncClosedLogs but now is. We might still sync the dir more times than necessary in the case of parallel SyncWAL variants, but on a good FileSystem that's probably not too different performance-wise from us implementing something to have threads wait on each other.

Cosmetic changes:

* Rename internal function SyncClosedLogs to SyncClosedWals
* Merging the sync points into the common implementation between the two entry points isn't pretty, but should be fine.

Recommended follow-up:

* Clean up more confusing naming like log_dir_synced_

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12707

Test Plan: existing tests

Reviewed By: anand1976

Differential Revision: D57870856

Pulled By: pdillinger

fbshipit-source-id: 5455fba016d25dd5664fa41b253f18db2ca8919a
2024-05-30 14:53:13 -07:00
Rulin Huang 20777b96cb Optimizations in notify-one (#12545)
Summary:
We tested on icelake server (vcpu=160). The default configuration is allow_concurrent_memtable_write=1, thread number =activate core number. With our optimizations, the improvement can reach up to 184% in fillseq case. op/s is as the performance indicator in db_bench, and the following are performance improvements in some cases in db_bench.
| case name          | optimized/original  |
|-------------------:|--------------------:|
| fillrandom         | 182%                |
| fillseq            | 184%                |
| fillsync           | 136%                |
| overwrite          | 179%                |
| randomreplacekeys  | 180%                |
| randomtransaction  | 161%                |
| updaterandom       | 163%                |
| xorupdaterandom    | 165%                |

With analysis, we find that although the process of writing memtable is processed in parallel, the process of waking up the writers is not processed in parallel, which means that only one writers is responsible for the sequential waking up other writers. The following is our method to optimize this process.

Assume that there are currently n threads in total, we parallelize SetState in LaunchParallelMemTableWriters. To wake up each writer to write its own memtable, the leader writer first wakes up the (n^0.5-1) caller writers, and then those callers and the leader will wake up n/x separately to write to the memtable. This reduces the number for the leader's to SetState n-1 writers to 2*(n^0.5) writers in turn.

A reproduction script:
./db_bench --benchmarks="fillrandom"  --threads ${number of all activate vcpu}  --seed 1708494134896523  --duration 60

![image](https://github.com/facebook/rocksdb/assets/22110918/c5eca02f-93b3-4434-bba2-5155fc892a97)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12545

Reviewed By: ajkr

Differential Revision: D57422827

Pulled By: cbi42

fbshipit-source-id: 94127937c0c61e4241720bd902c82c607b7b2431
2024-05-30 09:10:44 -07:00
Changyu Bi af3be5255a Fail DeleteRange() early when row_cache is configured (#12710)
Summary:
https://github.com/facebook/rocksdb/issues/12512 added the sanity check for this incompatible combination. However, it does the check during memtable insertion which can turn the DB into read-only mode. This PR moves the check earlier so that this write failure will not turn the DB into read-only mode and affect other DB operations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12710

Test Plan: * updated unit test `DBRangeDelTest.RowCache` to write to DB after a failed DeleteRange(). The test fails before this PR.

Reviewed By: ajkr

Differential Revision: D57925188

Pulled By: cbi42

fbshipit-source-id: 8bf001bd3fcf05635411ba28bc4a037321942879
2024-05-29 15:03:15 -07:00
anand76 9cc6168c98 Add LDB command and option for follower instances (#12682)
Summary:
Add the `--leader_path` option to specify the directory path of the leader for a follower RocksDB instance. This PR also adds a `count` command to the repl shell. While not specific to followers, it is useful for testing purposes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12682

Reviewed By: jowlyzhang

Differential Revision: D57642296

Pulled By: anand1976

fbshipit-source-id: 53767d496ecadc363ff92cd958b8e15a7bf3b151
2024-05-28 23:21:32 -07:00
Peter Dillinger d2ef70872f Rename, deprecate LogFile and VectorLogPtr (#12695)
Summary:
These names are confusing with `Logger` etc. so moving to `WalFile` etc.

Other small, related name refactorings.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12695

Test Plan: Left most unit tests using old names as an API compatibility test. Non-test code compiles with deprecated names removed. No functional changes.

Reviewed By: ajkr

Differential Revision: D57747458

Pulled By: pdillinger

fbshipit-source-id: 7b77596b9c20d865d43b9dc66c30c8bd2b3b424f
2024-05-28 09:24:49 -07:00