Commit Graph

12971 Commits

Author SHA1 Message Date
Peter Dillinger efba8f5b27 Respect ReadOptions::read_tier in prefetching (#12782)
Summary:
a pre-existing flaw revealed by crash test with uncache behavior. Easy fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12782

Test Plan: Modified unit test PrefetchTest.Basic (fails without fix)

Reviewed By: hx235

Differential Revision: D58757916

Pulled By: pdillinger

fbshipit-source-id: 23c0240c7cf0cb0b69a372f9531c07af920e09da
2024-06-19 09:53:59 -07:00
Hui Xiao 1adb935720 Inject more errors to more files in stress test (#12713)
Summary:
**Context:**
We currently have partial error injection:
- DB operation: all read, SST write
- DB open: all read, SST write, all metadata write.

This PR completes the error injection (with some limitations below):
- DB operation & open: all read, all write, all metadata write, all metadata read

**Summary:**
- Inject retryable metadata read, metadata write error concerning directory (e.g, dir sync, ) or file metadata (e.g, name, size, file creation/deletion...)
- Inject retryable errors to all major file types: random access file, sequential file, writable file
- Allow db stress test operations to handle above injected errors gracefully without crashing
- Change all error injection to thread-local implementation for easier disabling and enabling in the same thread. For example, we can control error handling thread to have no error injection. It's also cleaner in code.
   - Limitation: compared to before, we now don't have write fault injection for backup/restore CopyOrCreateFiles work threads since they use anonymous background threads as well as read injection for db open bg thread
- Add a new flag to test error recovery without error injection so we can test the path where error recovery actually succeeds
- Some Refactory & fix to db stress test framework (see PR review comments)
- Fix some minor bugs surfaced (see PR review comments)
- Limitation: had to disable backup restore with metadata read/write injection since it surfaces too many testing issues. Will add it back later to focus on surfacing actual code/internal bugs first.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12713

Test Plan:
- Existing UT
- CI with no trivial error failure

Reviewed By: pdillinger

Differential Revision: D58326608

Pulled By: hx235

fbshipit-source-id: 011b5195aaeb6011641ae0a9194f7f2a0e325ad7
2024-06-19 08:42:00 -07:00
Peter Dillinger 71f9e6b5b3 Add experimental range filters to stress/crash test (#12769)
Summary:
Implemented two key segment extractors that satisfy the "segment prefix property," one with variable segment widths and one with fixed. Used these to create a couple of named configs and versions that are randomly selected by the crash test. On the read side, the required table_filter is set up everywhere I found the stress test uses iterator_upper_bound.

Writing filters on new SST files and applying filters on SST files to range queries are configured independently, to potentially help with isolating different sides of the functionality.

Not yet implemented / possible follow-up:
* Consider manipulating/skewing the query bounds to better exercise filters
* Not yet using categories in the extractors
* Not yet dynamically changing the filtering version

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12769

Test Plan: Some stress test trial runs, including with ASAN. Inserted some temporary probes to ensure code was being exercised (more or less) as intended.

Reviewed By: hx235

Differential Revision: D58547462

Pulled By: pdillinger

fbshipit-source-id: f7b1596dd668426268c5293ac17615f749703f52
2024-06-18 16:16:09 -07:00
Jay Huh f26e2fedb3 Disable AttributeGroup in multiops txn test (#12781)
Summary:
AttributeGroup is not yet supported in MultiOpsTxn Test. Disabling it for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12781

Test Plan: Disabling in the test

Reviewed By: hx235

Differential Revision: D58757042

Pulled By: jaykorean

fbshipit-source-id: 8c3c85376e6ec0d1c7027b83abeb91eddc64236f
2024-06-18 16:05:18 -07:00
Hui Xiao d0259c2c98 Enable reading un-synced data in db stress test (#12752)
Summary:
**Context/Summary:**
There are a few blockers to enabling reading un-synced data in db stress test
(1) GetFileSize() will always return 0 for file written under direct IO because we don't track the last flushed position for `TestFSWritableFile` under direct IO. So it will surface as
```
Verification failed: VerifyChecksum failed: Corruption: file is too short (0 bytes) to be an sstable: /tmp/rocksdb_crashtest_blackbox4deg_c5e/000009.sst
db_stress: db_stress_tool/db_stress_test_base.cc:518: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, const rocksdb::Status&, bool) const: Assertion `false' failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
```
(2) A couple minor FIXME in left in https://github.com/facebook/rocksdb/pull/12729.

This PR fixed (1) and (2) and enabled reading un-synced data in stress test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12752

Test Plan:
- The following command failed before this PR and passed after.

```
./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=100 --adaptive_readahead=0 --adm_policy=0 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=10000 --block_align=0 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=37.92024930098943 --bottommost_compression_type=disable --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=1000000 --checksum_type=kXXH3 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=10 --compress_format_version=2 --compressed_secondary_cache_size=8388608 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_blackbox4deg_c5e --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kHot --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=1 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_8whyhdxm --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --fill_cache=1 --flush_one_in=1000 --format_version=4 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=9 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=100 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=0 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=2 --max_total_wal_size=0 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=0 --stats_history_buffer_size=0 --strict_bytes_per_sync=0 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --uncache_aggressiveness=1 --universal_max_read_amp=-1 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=1 --use_multiget=0 --use_put_entity_one_in=5 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=10 --verify_compression=1 --verify_db_one_in=10000 --verify_file_checksums_one_in=10 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35

Verification failed: VerifyChecksum failed: Corruption: file is too short (0 bytes) to be an sstable: /tmp/rocksdb_crashtest_blackbox4deg_c5e/000009.sst
db_stress: db_stress_tool/db_stress_test_base.cc:518: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, const rocksdb::Status&, bool) const: Assertion `false' failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
```
- Run python3 tools/db_crashtest.py --simple blackbox --lock_wal_one_in=10 --backup_one_in=10 --sync_fault_injection=0 --use_direct_io_for_flush_and_compaction=0 for 1 hour
- Monitor stress test CI

Reviewed By: pdillinger

Differential Revision: D58395807

Pulled By: hx235

fbshipit-source-id: 7d4b321acc0a0af3501b62dc417a7f6e2d318265
2024-06-18 14:41:14 -07:00
Yu Zhang c73cf7a878 Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760)
Summary:
This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier.

A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following:
1) Setting the ratio to be equal to or smaller than 0 disables this collector
2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met.
3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction.
 For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D58734976

Pulled By: jowlyzhang

fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81
2024-06-18 10:51:29 -07:00
Jonah Gao 9f95aa8269 `GetAggregatedIntProperty` accumulates property once per block cache (#12755)
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/12687.

A block cache may be shared by multiple column families. Therefore, when getting the aggregated property of the block cache, we need to deduplicate by instances of the block cache, meaning the same instance should only be counted once.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12755

Reviewed By: jowlyzhang

Differential Revision: D58508819

Pulled By: ajkr

fbshipit-source-id: 3b746841d7eac59f900387ec3b8c19dbcd20aae4
2024-06-18 10:46:55 -07:00
Jay Huh b8c9a2576a Add AttributeGroupIterator to Stress Test (#12776)
Summary:
As title. Changes include the following
- `Refresh()` moved from `Iterator` interface to `IteratorBase` so that `AttributeGroupIterator` can also have Refresh() API (implemention will be added in the future PR)
- `TestIterate()`'s main logic refactored into `TestIterateImpl()` so that it can be shared with `TestIterateAttributeGroups()`
- `VerifyIterator()` also changed so that verification code can be shared between `Iterator` and `AttributeGroupIterator`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12776

Test Plan:
Single CF Iterator
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=0 --verify_iterator_with_expected_state_one_in=2
```

CoalescingIterator
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2
```

AttributeGroupIterator
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2
```

Reviewed By: cbi42

Differential Revision: D58626165

Pulled By: jaykorean

fbshipit-source-id: 3e0a6ff72e51ecef9e06b65acfa53605a24d742e
2024-06-17 11:25:30 -07:00
Peter Dillinger 3758e31f3f Fix rare failure in DBBlockCacheTypeTest.Uncache (#12775)
Summary:
Following up on https://github.com/facebook/rocksdb/issues/12748 after seeing recurrence in https://github.com/facebook/rocksdb/actions/runs/9522985253/job/26253605587?pr=12774

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12775

Test Plan: Was able to reproduce failure and verify fix this time using COERCE_CONTEXT_SWITCH=1 :)

Reviewed By: jowlyzhang

Differential Revision: D58623461

Pulled By: pdillinger

fbshipit-source-id: d93a5e6a4977675eac54bbd42e70ae7b29b950a4
2024-06-14 20:50:36 -07:00
Peter Dillinger d6979bda40 Verify public headers do not reference internal ones (#12774)
Summary:
This is not currently caught by our public CI so adding a form of this check to `make check-headers`, which is part of the build-linux-unity-and-headers GHA job.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12774

Test Plan: manually added a violation, which was caught. Also caught an existing trivial violation (fixed). CI will verify it plays nice with GHA.

Reviewed By: jowlyzhang

Differential Revision: D58616601

Pulled By: pdillinger

fbshipit-source-id: e656ce82709660c088a3d3a5e41dd07655cb40e0
2024-06-14 16:32:28 -07:00
Jay Huh 0ab60b8a8c MultiCfIterator - Handle case of invalid key from child iter manual prefix iteration (#12773)
Summary:
Instead of completely disallowing `MultiCfIterator` when one or more child iterators will do manual prefix iteration (as suggested in https://github.com/facebook/rocksdb/issues/12770 ), just let `MultiCfIterator` operate as is even when there's a possibility of undefined result from child iterators. If one or more child iterators cause the heap to be empty, just return early and `Valid()` will return false.

It is still possible that heap is not empty when one or more child iterators are returning wrong keys. Basically, MultiCfIterator behaves the same as what we described in https://github.com/facebook/rocksdb/wiki/Prefix-Seek#manual-prefix-iterating - "RocksDB will not return error when it is misused and the iterating result will be undefined."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12773

Test Plan:
MultiCfIterator added back to the stress test
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2
```

Reviewed By: cbi42

Differential Revision: D58612055

Pulled By: jaykorean

fbshipit-source-id: e0dd942bed98382c59d463412dd8f163e6790b93
2024-06-14 15:59:17 -07:00
Yu Zhang f5e44f3490 Fix manual flush hanging on waiting for no stall for UDT in memtable … (#12771)
Summary:
This PR fix a possible manual flush hanging scenario because of its expectation that others will clear out excessive memtables was not met. The root cause is the FlushRequest rescheduling logic is using a stricter criteria for what a write stall is about to happen means than `WaitUntilFlushWouldNotStallWrites` does. Currently, the former thinks a write stall is about to happen when the last memtable is half full, and it will instead reschedule queued FlushRequest and not actually proceed with the flush. While the latter thinks if we already start to use the last memtable, we should wait until some other background flush jobs clear out some memtables before proceed this manual flush.

If we make them use the same criteria, we can guarantee that at any time when`WaitUntilFlushWouldNotStallWrites` is waiting, it's not because the rescheduling logic is holding it back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12771

Test Plan: Added unit test

Reviewed By: ajkr

Differential Revision: D58603746

Pulled By: jowlyzhang

fbshipit-source-id: 9fa1c87c0175d47a40f584dfb1b497baa576755b
2024-06-14 13:37:37 -07:00
Yu Zhang 13c758f986 Change the behavior of manual flush to not retain UDT (#12737)
Summary:
When user-defined timestamps in Memtable only feature is enabled, all scheduled flushes go through a check to see if it's eligible to be rescheduled to retain user-defined timestamps. However when the user makes a manual flush request, their intention is for all the in memory data to be persisted into SST files as soon as possible. These two sides have some conflict of interest, the user can implement some workaround like https://github.com/facebook/rocksdb/issues/12631 to explicitly mark which one takes precedence. The implementation for this can be nuanced since the user needs to be aware of all the scenarios that can trigger a manual flush and handle the concurrency well etc.

In this PR, we updated the default behavior to give manual flush precedence when it's requested. The user-defined timestamps rescheduling mechanism is turned off when a manual flush is requested. Likewise, all error recovery triggered flushes skips the rescheduling mechanism too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12737

Test Plan: Add unit tests

Reviewed By: ajkr

Differential Revision: D58538246

Pulled By: jowlyzhang

fbshipit-source-id: 0b9b3d1af3e8d882f2d6a2406adda19324ba0694
2024-06-13 13:18:10 -07:00
Richard Barnes a8dd15ad41 Fix deprecated dynamic exception in internal_repo_rocksdb/repo/java/rocksjni/kv_helper.h +1
Summary:
LLVM has detected a violation of `-Wdeprecated-dynamic-exception-spec`. Dynamic exceptions were removed in C++17. This diff fixes the deprecated instance(s).

See [Dynamic exception specification](https://en.cppreference.com/w/cpp/language/except_spec) and [noexcept specifier](https://en.cppreference.com/w/cpp/language/noexcept_spec).

Reviewed By: palmje

Differential Revision: D58528375

fbshipit-source-id: 130fecd3aa556e4cdb955feea53c442bd9fbc864
2024-06-13 12:41:13 -07:00
mingwei 762031531d add gtags files ignore (#12747)
Summary:
add 3 files ignore:
1.GPATH
2.GRTAGS
3.GTAGS

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12747

Reviewed By: jowlyzhang

Differential Revision: D58448441

Pulled By: ajkr

fbshipit-source-id: 245de6ba5ab26e2a592f33665a3a82d5624fd503
2024-06-12 21:46:40 -07:00
Peter Dillinger abf9ebc4bf Remove redundant no_io parameters to filter functions (#12762)
Summary:
Consolidate on already-present ReadOptions::read_tier

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12762

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D58450516

Pulled By: pdillinger

fbshipit-source-id: 1eec58c60beca73c6d5f2e9ae4442644920f8c30
2024-06-12 18:47:11 -07:00
Peter Dillinger 3abcba8470 Propagate more ReadOptions to ApproximateOffsetOf/Sizes (#12764)
Summary:
Unknown why these would ignore options like deadline and read_tier. Setting total_order_seek=true is unnecessary because of the disable_prefix_seek (= true) parameter to NewIndexIterator. This is only used by the hash index, which uses total order seek if either the ReadOption or the parameter is true. The parameter is arguably redundant with the total_order_seek option, meaning it could be eliminated, but I think this case is exceptional (compared to e.g. no_io):
* Prefix seek is particular to user iterators, though might be usable, carefully, for other read operations.
* The historical default of total_order_seek=false in a sense is "wrong result by default" so cannot be interpreted as an intent to force prefix seek in an operation for which it might be usual or give bad results.

Also added a generic release note to cover this and related PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12764

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D58474240

Pulled By: pdillinger

fbshipit-source-id: 79014d9822ba8f09d57ce4524363aa0973017b68
2024-06-12 16:25:47 -07:00
Peter Dillinger 5cf3bed00f Eliminate some parameters redundant with ReadOptions (#12761)
Summary:
... in Index and CompressionDict readers (Filters in another PR). no_io and verify_checksums should be inferred from ReadOptions rather than specified redundantly.

Fixes incomplete propagation of ReadOptions in
UncompressionDictReader::GetOrReadUncompressionDictionar so is technically a functional change. (Related to https://github.com/facebook/rocksdb/issues/12757)

Also there was hardcoded no verify_checksums in DumpTable, but only for UncompressionDict, which doesn't make sense. Now using consistent ReadOptions and verify_checksum can be controlled for more reads together.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12761

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D58450392

Pulled By: pdillinger

fbshipit-source-id: 0faed22832d664cb3b04a4c03ee77119977c200b
2024-06-12 15:44:37 -07:00
Hui Xiao a2f772910e Fix manual WAL flush causing false-positive inconsistent values in TestBackupRestore() (#12758)
Summary:
**Context/Summary:**
When manual WAL flush is used, the following can happen:

t1: Issued Put(k1) to original DB. It entered WAL buffer since manual_wal_flush_one_in > 0. It never made it to WAL file without FlushWAL()
t2: The same WAL got back-up and restored to restore DB. So the restore DB's WAL does not contain this Put()
t3: The same WAL in the original DB got FlushWAL() so it got the Put() entry

Querying k1 in original and restored DB will give different result and fail our consistency check in stress test.

```
Failure in a backup/restore operation with: Corruption: 0x000000000000000178 exists in original db but not in restore
```

This PR fixed it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12758

Test Plan:
```

./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=13 --bottommost_compression_type=none --bottommost_file_compaction_delay=600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_1 --db_write_buffer_size=0 --default_temperature=kCold --default_write_temperature=kHot --delete_obsolete_files_period_micros=21600000000 --delpercent=40 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=none --fill_cache=0 --flush_one_in=1000000 --format_version=2 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=5 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=1 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=10 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=0 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --uncache_aggressiveness=709 --universal_max_read_amp=0 --unpartitioned_pinning=0 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=335544 --write_dbid_to_manifest=1 --write_fault_one_in=128 --writepercent=45
```
Repro-ed quickly before the fix and stably run after the fix.

Reviewed By: jowlyzhang

Differential Revision: D58426535

Pulled By: hx235

fbshipit-source-id: 611e56086e76f8c06d292624e60fd96e511ce723
2024-06-12 12:17:45 -07:00
Peter Dillinger 0646ec6e2d Ensure Close() before LinkFile() for WALs in Checkpoint (#12734)
Summary:
POSIX semantics for LinkFile (hard links) allow linking a file
that is still being written two, with both the source and destination
showing any subsequent writes to the source. This may not be practical
semantics for some FileSystem implementations such as remote storage.
They might only link the flushed or sync-ed file contents at time of
LinkFile, or might even have undefined behavior if LinkFile is called on
a file still open for write (not yet "sealed"). This change builds on https://github.com/facebook/rocksdb/issues/12731
to bring more hygiene to our handling of WAL files in Checkpoint.

Specifically, we now Close WAL files as soon as they are either
(a) inactive and fully synced, or (b) inactive and obsolete (so maybe
never fully synced), rather than letting Close() happen in handling
obsolete files (maybe a background thread). This should not be a
performance issue as Close() should be trivial cost relative to other
IO ops, but just in case:
* We don't Close() while holding a mutex, to avoid blocking, and
* The old behavior is available with a new kill switch option
  `background_close_inactive_wals`.

Stacked on https://github.com/facebook/rocksdb/issues/12731

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12734

Test Plan:
Extended existing unit test, especially adding a hygiene
check to FaultInjectionTestFS to detect LinkFile() on a file still open
for writes. FaultInjectionTestFS already has relevant tracking data, and
tests can opt out of the new check, as in a smoke test I have left for
the old, deprecated functionality `background_close_inactive_wals=true`.

Also ran lengthy blackbox_crash_test to ensure the hygiene check is OK
with the crash test. (The only place I can find we use LinkFile in
production is Checkpoint.)

Reviewed By: cbi42

Differential Revision: D58295284

Pulled By: pdillinger

fbshipit-source-id: 64d90ed8477e2366c19eaf9c4c5ad60b82cac5c6
2024-06-12 11:48:45 -07:00
Peter Dillinger d64eac28d3 Fix a failure to propagate ReadOptions (#12757)
Summary:
The crash test revealed a case in which the uncache functionality in ~BlockBasedTableReader could initiate an block read (IO), despite setting ReadOptions::read_tier = kBlockCacheTier.

The root cause is a place in the code where many people have over time decided to opt-in propagating ReadOptions and no one took the initiative to propagate ReadOptions by default (opt out / override only as needed). The fix is in partitioned_index_reader.cc. Here,
ReadOptions::readahead_size is opted-out to avoid churn in prefetch_test that is not clearly an improvement or regression. It's hard to tell given the poor state of relevant documentation https://github.com/facebook/rocksdb/issues/12756. The affected unit test was added in https://github.com/facebook/rocksdb/issues/10602.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12757

Test Plan: (Now postponed to a follow-up diff) I have added some new infrastructure to DEBUG builds to catch this specific kind of violation in unit tests and in the stress/crash test. `EnforceReadOpts` establishes a thread-local context under which we assert no IOs are performed if ReadOptions said it should be forbidden. With this new checking, the Uncache unit test would catch the critical step toward a violation (inner ReadOptions allowing IO, even if no IO is actually performed), which is fixed with the production code change.

Reviewed By: hx235

Differential Revision: D58421526

Pulled By: pdillinger

fbshipit-source-id: 9e9917a0e320c78967e751bd887926a2ed231d37
2024-06-11 21:41:21 -07:00
Peter Dillinger 961468f92e Fix TSAN-reported data race with uncache_aggressiveness (#12753)
Summary:
Data race reported on
BlockBasedTableReader::Rep::uncache_aggressiveness because apparently a file can be marked obsolete through multiple table cache references in parallel. Using a relaxed atomic should resolve the race quite reasonably, especially considering this is a rare case and the racing writes should be storing the same value anyway.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12753

Test Plan: watch for TSAN crash test results

Reviewed By: ltamasi

Differential Revision: D58397473

Pulled By: pdillinger

fbshipit-source-id: 3e78b6adac4f7a7056790754bee42b3cb244f037
2024-06-11 16:53:13 -07:00
Peter Dillinger 21eb82ebec Disable "uncache" behavior in DB shutdown (#12751)
Summary:
Crash test showed a potential use-after-free where a file marked as obsolete and eligible for uncache on destruction is destroyed in the VersionSet destructor, which only happens as part of DB shutdown. At that point, the in-memory column families have already been destroyed, so attempting to uncache could use-after-free on stuff like getting the `user_comparator()` from the `internal_comparator()`.

I attempted to make it smarter, but wasn't able to untangle the destruction dependencies in a way that was safe, understandable, and maintainable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12751

Test Plan:
Reproduced by adding uncache_aggressiveness to an existing (but otherwise unrelated) test. This makes it a fair regression test.

Also added testing to ensure that trivial moves and DB close & reopen are well behaved with uncache_aggressiveness. Specifically, this issue doesn't seem to be because things are uncached inappropriately in those cases.

Reviewed By: ltamasi

Differential Revision: D58390058

Pulled By: pdillinger

fbshipit-source-id: 66ac9cb13bf02638fa80ee5b7218153d8bc7cfd3
2024-06-11 15:57:40 -07:00
Evan Jones af50823069 c.h: Add set_track_and_verify_wals_in_manifest to C API (#12749)
Summary:
This option is recommended to be set for production use:

    We recommend to set track_and_verify_wals_in_manifest to true
    for production

https://github.com/facebook/rocksdb/wiki/Track-WAL-in-MANIFEST

This adds this setting to the C API, so it can be used by other languages.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12749

Reviewed By: ltamasi

Differential Revision: D58382892

Pulled By: ajkr

fbshipit-source-id: 885de4539745a3119b6b2a162ab4fca9fa975283
2024-06-10 16:26:52 -07:00
Peter Dillinger 68112b3beb Attempt fix rare failure in DBBlockCacheTypeTest.Uncache (#12748)
Summary:
I haven't been able to reproduce the failure, seen in https://github.com/facebook/rocksdb/actions/runs/9420830905/job/25953696902?pr=12734

```
[ RUN      ] DBBlockCacheTypeTestInstance/DBBlockCacheTypeTest.Uncache/2
db/db_block_cache_test.cc:1415: Failure
Expected equality of these values:
  cache->GetOccupancyCount()
    Which is: 37
  kBaselineCount + kNumDataBlocks + meta_blocks_per_file
    Which is: 15
Google Test trace:
db/db_block_cache_test.cc:1346: ua=10000
db/db_block_cache_test.cc:1344: partitioned=1
db/db_block_cache_test.cc:1418: Failure
...
```

But it's consistent with a SuperVersion reference sticking around beyond the CompactRange, as I can reproduce the result with a dangling Iterator. Like some other tests have had trouble with periodic stats popping up randomly, I suspect that could be the explanation in this case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12748

Test Plan: Watch for similar future failures

Reviewed By: ltamasi

Differential Revision: D58366031

Pulled By: pdillinger

fbshipit-source-id: b812ca8837b8c8b9cbda1b201d76316d145fa3ec
2024-06-10 13:31:46 -07:00
Hui Xiao d3c4b7fe0b Enable reopen with un-synced data loss in crash test (#12746)
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/12567 disabled reopen with un-synced data loss in crash test since we discovered un-synced WAL loss and we currently don't support prefix recovery in reopen. This PR explicitly sync WAL data before close to avoid such data loss case from happening and add back the testing coverage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12746

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D58326890

Pulled By: hx235

fbshipit-source-id: 0865f715e97c5948d7cb3aea62fe2a626cb6522a
2024-06-10 12:35:53 -07:00
Eduardo Menges Mattje 43906597f5 Fixed CMake builds for iOS (#12744)
Summary:
This [ports the flags which were already defined in `build_tools`](44aceb88d0/build_tools/build_detect_platform (L151)) for CMake.

The flag `OS_MACOSX` in specific is necessary for proper endiness detection, [since they're required for proper inclusion of OSX endian header](44aceb88d0/port/port_posix.h (L27)).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12744

Reviewed By: ltamasi

Differential Revision: D58317987

Pulled By: ajkr

fbshipit-source-id: 407e623ddb6afc9c48939d52f610281f59cf99af
2024-06-07 22:13:49 -07:00
Evan Jones 32e6825bc6 c.h: Add GetDbIdentity, Options::write_dbid_to_manifest (#12736)
Summary:
The write_dbid_to_manifest option is documented as "We recommend setting this flag to true". However, there is no way to set this flag from the C API.

Add the following functions to the C API:

* rocksdb_get_db_identity
* rocksdb_options_get_write_dbid_to_manifest
* rocksdb_options_set_write_dbid_to_manifest

Add a test that this option preserves the ID across checkpoints.

c.cc:
* Remove outdated comments about missing C API functions that exist.
* Document that CopyString is intended for binary data and is not NUL terminated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12736

Reviewed By: ltamasi

Differential Revision: D58202117

Pulled By: ajkr

fbshipit-source-id: 707b110df5c4bd118d65548327428a53a9dc3019
2024-06-07 16:53:43 -07:00
Peter Dillinger b34cef57b7 Support pro-actively erasing obsolete block cache entries (#12694)
Summary:
Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries.

Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive.

This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on.

The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizeable by CF.

Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths.

Notable auxiliary code details:
* Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce an current failure.)
* Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item)

Marked EXPERIMENTAL until more thorough validation is complete.

Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694

Test Plan:
* Unit tests added, including refactoring an existing test to make better use of parameterized tests.
* Added to crash test.
* Performance, sample command:
```
for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done
```

Averaging 10 runs each case, block cache data block hit rates

```
lru_cache
UA=0   -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0
UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0

fixed_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9
UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2

auto_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5
UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8
```

Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings).

Reviewed By: ajkr

Differential Revision: D57932442

Pulled By: pdillinger

fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c
2024-06-07 08:57:11 -07:00
Yu Zhang 44aceb88d0 Add a OnManualFlushScheduled callback in event listener (#12631)
Summary:
As titled. Also added the newest user-defined timestamp into the `MemTableInfo`. This can be a useful info in the callback.

Added some unit tests as examples for how users can use two separate approaches to allow manual flush / manual compactions to go through when the user-defined timestamps in memtable only feature is enabled. One approach relies on selectively increase cutoff timestamp in `OnMemtableSeal` callback when it's initiated by a manual flush. Another approach is to increase cutoff timestamp in `OnManualFlushScheduled` callback. The caveats of the approaches are also documented in the unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12631

Reviewed By: ajkr

Differential Revision: D58260528

Pulled By: jowlyzhang

fbshipit-source-id: bf446d7140affdf124744095e0a179fa6e427532
2024-06-06 17:29:01 -07:00
Hui Xiao 390fc55ba1 Revert PR 12684 and 12556 (#12738)
Summary:
**Context/Summary:** a better API design is decided lately so we decided to revert these two changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12738

Test Plan: - CI

Reviewed By: ajkr

Differential Revision: D58162165

Pulled By: hx235

fbshipit-source-id: 9bbe4d2fe9fbe39213f4cf137a2d419e6ffb8e16
2024-06-06 11:46:16 -07:00
Peter Dillinger 98393f0139 Fix Checkpoint hard link of inactive but unsynced WAL (#12731)
Summary:
Background: there is one active WAL file but there can be
several more WAL files in various states. Those other WALs are always
in a "flushed" state but could be on the `logs_` list not yet fully
synced. We currently allow any WAL that is not the active WAL to be
hard-linked when creating a Checkpoint, as although it might still be
open for write, we are not appending any more data to it.

The problem is that a created Checkpoint is supposed to be fully synced
on return of that function, and a hard-linked WAL in the state described
above might not be fully synced. (Through some prudence in https://github.com/facebook/rocksdb/issues/10083,
it would synced if using track_and_verify_wals_in_manifest=true.)

The fix is a step toward a long term goal of removing the need to query
the filesystem to determine WAL files and their state. (I consider it
dubious any time we independently read from or query metadata from a
file we have open for writing, as this makes us more susceptible to
FileSystem deficiencies or races.) More specifically:
* Detect which WALs might not be fully synced, according to our DBImpl
  metadata, and prevent hard linking those (with `trim_to_size=true`
  from `GetLiveFilesStorageInfo()`. And while we're at it, use our known
  flushed sizes for those WALs.
* To avoid a race between that and GetSortedWalFiles(), track a maximum
  needed WAL number for the Checkpoint/GetLiveFilesStorageInfo.
* Because of the level of consistency provided by those two, we no
  longer need to consider syncing as part of the FlushWAL in
  GetLiveFilesStorageInfo. (We determine the max WAL number consistent
  with the manifest file size, while holding DB mutex. Should make
  track_and_verify_wals_in_manifest happy.) This makes the premise of
  test PutRaceWithCheckpointTrackedWalSync obsolete (sync point callback
  no longer hit) so the test is removed, with crash test as backstop for
  related issues. See https://github.com/facebook/rocksdb/issues/10185

Stacked on https://github.com/facebook/rocksdb/issues/12729

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12731

Test Plan:
Expanded an existing test, which now fails before fix.
Also long runs of blackbox_crash_test with amplified checkpoint frequency.

Reviewed By: cbi42

Differential Revision: D58199629

Pulled By: pdillinger

fbshipit-source-id: 376e55f4a2b082cd2adb6408a41209de14422382
2024-06-05 17:40:09 -07:00
Adam Kupczyk a211e06552 Remove close when fd == -1. (#12732)
Summary:
Its polluting my valgrind runs:
==3733139== Warning: invalid file descriptor -1 in syscall close()
==3733139== Warning: invalid file descriptor -1 in syscall close()
==3733139== Warning: invalid file descriptor -1 in syscall close()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12732

Reviewed By: ltamasi

Differential Revision: D58170009

Pulled By: ajkr

fbshipit-source-id: 1fc6944c2667641996676a75aa3e91984070ba49
2024-06-05 12:52:48 -07:00
wuruilong 61d10fe0c3 Fix compile errors on loongarch (#12739)
Summary:
Failed to compile when using cmake on loongarch architecture with the following error details:[https://buildd.debian.org/status/fetch.php?pkg=rocksdb&arch=loong64&ver=9.2.1-2&stamp= 1717362107&raw=0](url). The reason for the error is that loongarch does not support the mcpu option, refer to the link for details: [https://gcc.gnu.org/onlinedocs/gcc-13.2.0/gcc/LoongArch-Options.html](url)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12739

Reviewed By: ltamasi

Differential Revision: D58200695

Pulled By: ajkr

fbshipit-source-id: 00e1a51e15defaa8983524cdd3fc25240833c08b
2024-06-05 12:46:55 -07:00
Peter Dillinger 9f4c597d83 FaultInjectionTestFS read unsynced data by default (#12729)
Summary:
In places (e.g. GetSortedWals()) RocksDB relies on querying the file size or even reading the contents of files currently open for writing, and as in POSIX semantics, expects to see the flushed size and contents regardless of what has been synced. FaultInjectionTestFS historically did not emulate this behavior, only showing synced data from such read operations. (Different from FaultInjectionTestEnv--sigh.)

This change makes the "proper" behavior the default behavior, at least for GetFileSize and FSSequentialFile. However, this new functionality is disabled in db_stress because of undiagnosed, unresolved issues.

Also removes unused and confusing field `pos_at_last_flush_`

This change is needed to support testing a relevant bug fix (in a follow-up diff).  Other suggested follow-up:
* Fix db_stress not to rely on the old behavior, and fix a related FIXME in db_stress_test_base.cc in LockWAL testing.
* Fill in some corner cases in the FileSystem API for reading unsynced data (see new TODO items).
* Consider deprecating and removing Flush() API functions from FileSystem APIs. It is not clear to me that there is a supported scenario in which they do anything but confuse API users and developers. If there is a use for them, it doesn't appear to be tested.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12729

Test Plan: applies to all unit tests successfully, just updating the unit test from https://github.com/facebook/rocksdb/issues/12556 due to relying on the errant behavior. Also added a specific unit test

Reviewed By: hx235

Differential Revision: D58091835

Pulled By: pdillinger

fbshipit-source-id: f47a63b2b000f5875b6293a98577bff663d7fd33
2024-06-04 15:25:23 -07:00
Yu Zhang 8523f0a86a Use extended file boundary for key range overlap check during file ingestion (#12735)
Summary:
When https://github.com/facebook/rocksdb/issues/12343 added support to bulk load external files while column family enables user-defined timestamps, it's a requirement that the external file doesn't overlap with the DB in key ranges. More specifically, the external file should not contain a user key (without timestamp) that already have some entries in the DB.

All the `*Overlap*` functions like `RangeOverlapWithMemtable`, `RangeOverlapWithCompaction` are using `CompareWithoutTimestamp` to check for overlap  already. One thing that is missing here is we need to extend the external file's user key boundary for this check to avoid missing the checks for the boundary user keys. For example, with the current way of checking things where `external_file_info.smallest.user_key()` is used as the left boundary, and `external_file_info.largest.user_key()` is used as the right boundary, a file with this entry: (b, 40) can fit into a DB with these two entries: (b, 30), (c, 20).

To avoid this, we extend the user key boundaries used for overlap check, by updating the left boundary with the maximum timestamp and the right boundary with the minimum timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12735

Test Plan: Added unit test

Reviewed By: ltamasi

Differential Revision: D58152117

Pulled By: jowlyzhang

fbshipit-source-id: 9cba61e7357f6d76ad44c258381c35073ebbf347
2024-06-04 13:39:51 -07:00
Valery Mironov a8a52e5b4d Fix AddressSanitizer container-overflow (#12722)
Summary:
```
ERROR: AddressSanitizer: container-overflow on address 0x506000682221 at pc 0x5583da569f76 bp 0x7f0ec8a9ffb0 sp 0x7f0ec8a9f780
WRITE of size 53 at 0x506000682221 thread T29
    #0 0x5583da569f75 in pread
    https://github.com/facebook/rocksdb/issues/1 0x5583e334fde4 in rocksdb::PosixRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const /rocksdb/env/io_posix.cc:580:9
    https://github.com/facebook/rocksdb/issues/2 0x5583e2cac42b in rocksdb::(anonymous namespace)::CompositeRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const /rocksdb/env/composite_env.cc:61:21
    https://github.com/facebook/rocksdb/issues/3 0x5583e2c8a8e4 in rocksdb::(anonymous namespace)::LegacyRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const /rocksdb/env/env.cc:152:41
    https://github.com/facebook/rocksdb/issues/4 0x5583e2d6cbfb in rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Slice*, char*, std::__2::unique_ptr<char [], std::__2::default_delete<char []>>*, rocksdb::Env::IOPriority) const /rocksdb/file/random_access_file_reader.cc:204:25
    https://github.com/facebook/rocksdb/issues/5 0x5583e307c614 in rocksdb::ReadFooterFromFile(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, unsigned long, rocksdb::Footer*, unsigned long) /rocksdb/table/format.cc:383:17
    https://github.com/facebook/rocksdb/issues/6 0x5583e2f88456 in rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::CacheReservationManager>, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, unsigned long) /rocksdb/table/block_based/block_based_table_reader.cc:610:9
    https://github.com/facebook/rocksdb/issues/7 0x5583e2ef7837 in rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, bool) const /rocksdb/table/block_based/block_based_table_factory.cc:599:10
    https://github.com/facebook/rocksdb/issues/8 0x5583e2ab873c in rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, bool, bool, rocksdb::HistogramImpl*, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:142:34
    https://github.com/facebook/rocksdb/issues/9 0x5583e2aba5f6 in rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Cache::Handle**, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:190:16
    https://github.com/facebook/rocksdb/issues/10 0x5583e2abb7e1 in rocksdb::TableCache::NewIterator(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::RangeDelAggregator*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, rocksdb::TableReader**, rocksdb::HistogramImpl*, rocksdb::TableReaderCaller, rocksdb::Arena*, bool, int, unsigned long, rocksdb::InternalKey const*, rocksdb::InternalKey const*, bool) /rocksdb/db/table_cache.cc:235:9
    https://github.com/facebook/rocksdb/issues/11 0x5583e28d14cf in rocksdb::BuildTable(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::__2::vector<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>, std::__2::allocator<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>>>, rocksdb::FileMetaData*, std::__2::vector<rocksdb::BlobFileAddition, std::__2::allocator<rocksdb::BlobFileAddition>>*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::__2::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const*, rocksdb::BlobFileCompletionCallback*, unsigned long*, unsigned long*, unsigned long*) /rocksdb/db/builder.cc:335:57
    https://github.com/facebook/rocksdb/issues/12 0x5583e29bf29d in rocksdb::FlushJob::WriteLevel0Table() /rocksdb/db/flush_job.cc:919:11
    https://github.com/facebook/rocksdb/issues/13 0x5583e29b33ac in rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*, bool*) /rocksdb/db/flush_job.cc:276:9
    https://github.com/facebook/rocksdb/issues/14 0x5583e27a4781 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:258:19
    https://github.com/facebook/rocksdb/issues/15 0x5583e27a7a96 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:377:14
    https://github.com/facebook/rocksdb/issues/16 0x5583e27d6777 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2778:14
    https://github.com/facebook/rocksdb/issues/17 0x5583e27d14e2 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2817:16
    https://github.com/facebook/rocksdb/issues/18 0x5583e323d353 in std::__2::__function::__policy_func<void ()>::operator()[abi:ne180100]() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:714:12
    https://github.com/facebook/rocksdb/issues/19 0x5583e323d353 in std::__2::function<void ()>::operator()() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:981:10
    https://github.com/facebook/rocksdb/issues/20 0x5583e323d353 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /rocksdb/util/threadpool_imp.cc:266:5
    https://github.com/facebook/rocksdb/issues/21 0x5583e3243d18 in decltype(std::declval<void (*)(void*)>()(std::declval<rocksdb::BGThreadMetadata*>())) std::__2::__invoke[abi:ne180100]<void (*)(void*), rocksdb::BGThreadMetadata*>(void (*&&)(void*), rocksdb::BGThreadMetadata*&&) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__type_traits/invoke.h:344:25
    https://github.com/facebook/rocksdb/issues/22 0x5583e3243d18 in void std::__2::__thread_execute[abi:ne180100]<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*, 2ul>(std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>&, std::__2::__tuple_indices<2ul>) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:193:3
    https://github.com/facebook/rocksdb/issues/23 0x5583e3243d18 in void* std::__2::__thread_proxy[abi:ne180100]<std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>>(void*) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:202:3
    https://github.com/facebook/rocksdb/issues/24 0x5583da5e819e in asan_thread_start(void*) crtstuff.c
    https://github.com/facebook/rocksdb/issues/25 0x7f0eda362a93 in start_thread nptl/pthread_create.c:447:8
    https://github.com/facebook/rocksdb/issues/26 0x7f0eda3efc3b in clone3 misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

0x506000682221 is located 1 bytes inside of 56-byte region [0x506000682220,0x506000682258)
allocated by thread T29 here:
    #0 0x5583da6281d1 in operator new(unsigned long)
    https://github.com/facebook/rocksdb/issues/1 0x5583da6c987d in __libcpp_operator_new<unsigned long> /root/build/3rdParty/llvm/runtimes/include/c++/v1/new:271:10
    https://github.com/facebook/rocksdb/issues/2 0x5583da6c987d in __libcpp_allocate /root/build/3rdParty/llvm/runtimes/include/c++/v1/new:295:10
    https://github.com/facebook/rocksdb/issues/3 0x5583da6c987d in allocate /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocator.h:125:32
    https://github.com/facebook/rocksdb/issues/4 0x5583da6c987d in allocate_at_least /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocator.h:131:13
    https://github.com/facebook/rocksdb/issues/5 0x5583da6c987d in allocate_at_least<std::__2::allocator<char> > /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocate_at_least.h:34:20
    https://github.com/facebook/rocksdb/issues/6 0x5583da6c987d in __allocate_at_least<std::__2::allocator<char> > /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocate_at_least.h:42:10
    https://github.com/facebook/rocksdb/issues/7 0x5583da6c987d in std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>::__shrink_or_extend[abi:ne180100](unsigned long) /root/build/3rdParty/llvm/runtimes/include/c++/v1/string:3236:27
    https://github.com/facebook/rocksdb/issues/8 0x5583e307c5aa in std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>::reserve(unsigned long) /root/build/3rdParty/llvm/runtimes/include/c++/v1/string:3207:3
    https://github.com/facebook/rocksdb/issues/9 0x5583e307c5aa in rocksdb::ReadFooterFromFile(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, unsigned long, rocksdb::Footer*, unsigned long) /rocksdb/table/format.cc:382:18
    https://github.com/facebook/rocksdb/issues/10 0x5583e2f88456 in rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::CacheReservationManager>, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, unsigned long) /rocksdb/table/block_based/block_based_table_reader.cc:610:9
    https://github.com/facebook/rocksdb/issues/11 0x5583e2ef7837 in rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, bool) const /rocksdb/table/block_based/block_based_table_factory.cc:599:10
    https://github.com/facebook/rocksdb/issues/12 0x5583e2ab873c in rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, bool, bool, rocksdb::HistogramImpl*, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:142:34
    https://github.com/facebook/rocksdb/issues/13 0x5583e2aba5f6 in rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Cache::Handle**, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:190:16
    https://github.com/facebook/rocksdb/issues/14 0x5583e2abb7e1 in rocksdb::TableCache::NewIterator(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::RangeDelAggregator*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, rocksdb::TableReader**, rocksdb::HistogramImpl*, rocksdb::TableReaderCaller, rocksdb::Arena*, bool, int, unsigned long, rocksdb::InternalKey const*, rocksdb::InternalKey const*, bool) /rocksdb/db/table_cache.cc:235:9
    https://github.com/facebook/rocksdb/issues/15 0x5583e28d14cf in rocksdb::BuildTable(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::__2::vector<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>, std::__2::allocator<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>>>, rocksdb::FileMetaData*, std::__2::vector<rocksdb::BlobFileAddition, std::__2::allocator<rocksdb::BlobFileAddition>>*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::__2::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const*, rocksdb::BlobFileCompletionCallback*, unsigned long*, unsigned long*, unsigned long*) /rocksdb/db/builder.cc:335:57
    https://github.com/facebook/rocksdb/issues/16 0x5583e29bf29d in rocksdb::FlushJob::WriteLevel0Table() /rocksdb/db/flush_job.cc:919:11
    https://github.com/facebook/rocksdb/issues/17 0x5583e29b33ac in rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*, bool*) /rocksdb/db/flush_job.cc:276:9
    https://github.com/facebook/rocksdb/issues/18 0x5583e27a4781 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:258:19
    https://github.com/facebook/rocksdb/issues/19 0x5583e27a7a96 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:377:14
    https://github.com/facebook/rocksdb/issues/20 0x5583e27d6777 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2778:14
    https://github.com/facebook/rocksdb/issues/21 0x5583e27d14e2 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2817:16
    https://github.com/facebook/rocksdb/issues/22 0x5583e323d353 in std::__2::__function::__policy_func<void ()>::operator()[abi:ne180100]() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:714:12
    https://github.com/facebook/rocksdb/issues/23 0x5583e323d353 in std::__2::function<void ()>::operator()() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:981:10
    https://github.com/facebook/rocksdb/issues/24 0x5583e323d353 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /rocksdb/util/threadpool_imp.cc:266:5
    https://github.com/facebook/rocksdb/issues/25 0x5583e3243d18 in decltype(std::declval<void (*)(void*)>()(std::declval<rocksdb::BGThreadMetadata*>())) std::__2::__invoke[abi:ne180100]<void (*)(void*), rocksdb::BGThreadMetadata*>(void (*&&)(void*), rocksdb::BGThreadMetadata*&&) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__type_traits/invoke.h:344:25
    https://github.com/facebook/rocksdb/issues/26 0x5583e3243d18 in void std::__2::__thread_execute[abi:ne180100]<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*, 2ul>(std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>&, std::__2::__tuple_indices<2ul>) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:193:3
    https://github.com/facebook/rocksdb/issues/27 0x5583e3243d18 in void* std::__2::__thread_proxy[abi:ne180100]<std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>>(void*) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:202:3
    https://github.com/facebook/rocksdb/issues/28 0x5583da5e819e in asan_thread_start(void*) crtstuff.c

HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_container_overflow=0.
If you suspect a false positive see also: https://github.com/google/sanitizers/wiki/AddressSanitizerContainerOverflow.
 AddressSanitizer:container-overflow in pread
Shadow bytes around the buggy address:
  0x506000681f80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x506000682000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x506000682080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x506000682100: fa fa fa fa fa fa fa fa fa fa fa fa 00 00 00 00
  0x506000682180: 00 00 00 fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x506000682200: fa fa fa fa[01]fc fc fc fc fc fc fa fa fa fa fa
  0x506000682280: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x506000682300: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 01
  0x506000682380: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
  0x506000682400: fd fd fd fd fd fd fd fa fa fa fa fa fd fd fd fd
  0x506000682480: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12722

Reviewed By: hx235

Differential Revision: D58118264

Pulled By: ajkr

fbshipit-source-id: 0dd914c886c022d82697b769d664ba52de0770de
2024-06-04 09:41:53 -07:00
Po-Chuan Hsieh b03d415660 Fix build on i386 (#12719)
Summary:
Cited from https://pkg-status.freebsd.org/beefy21/data/140i386-default/02faf78f4c9b/logs/rocksdb-9.2.1.log
The error message is as follows:
```
mkdir -p db_stress_tool && clang++ -O2 -pipe  -DOS_FREEBSD -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing   -Wno-inconsistent-missing-override -Wno-unused-parameter -Wno-unused-variable -Wno-unused-private-field -isystem /usr/local/include -std=c++17   -fPIC -DROCKSDB_DLL -DROCKSDB_USE_RTTI   -g -W -Wextra -Wall -Wsign-compare -Wshadow -Wunused-parameter -Werror -I. -I./include -std=c++17 -O2 -pipe  -DOS_FREEBSD -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing   -Wno-inconsistent-missing-override -Wno-unused-parameter -Wno-unused-variable -Wno-unused-private-field -isystem /usr/local/include -std=c++17  -faligned-new -DHAVE_ALIGNED_NEW -DROCKSDB_PLATFORM_POSIX -DROCKSDB_LIB_IO_POSIX -O2 -pipe  -DOS_FREEBSD -fstack-protector-strong -isystem /usr/local/include -fno-strict-aliasing  -D_REENTRANT -DOS_FREEBSD -DSNAPPY -DGFLAGS=1 -DZLIB -DBZIP2 -DLZ4 -DZSTD -DROCKSDB_MALLOC_USABLE_SIZE -DROCKSDB_PTHREAD_ADAPTIVE_MUTEX -DROCKSDB_BACKTRACE -DROCKSDB_SCHED_GETCPU_PRESENT   -isystem third-party/gtest-1.8.1/fused-src -O2 -fno-omit-frame-pointer -momit-leaf-frame-pointer -DNDEBUG -Woverloaded-virtual -Wnon-virtual-dtor -Wno-missing-field-initializers -Wno-invalid-offsetof -c db_stress_tool/db_stress_common.cc -o db_stress_tool/db_stress_common.o
db_stress_tool/db_stress_common.cc:204:17: error: format specifies type 'unsigned long' but the argument has type 'size_t' (aka 'unsigned int') [-Werror,-Wformat]
                block_cache->GetCapacity());
                ^~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12719

Reviewed By: hx235

Differential Revision: D58093539

Pulled By: jaykorean

fbshipit-source-id: 400cae3a4b0d23b168937a5388065ef1c4b8b56e
2024-06-04 09:36:00 -07:00
Andrii Lysenko e4428b7eb9 More details for 'tail prefetch size is calculated based on' (#12667)
Summary:
These messages indicate that SST file was created by a pre-9.0.0 RocksDB. Eventually, `TailPrefetchStats` might be removed, so it would be more informative if log message also included name of the affected SST file.

Issue: https://github.com/facebook/rocksdb/issues/12664

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12667

Reviewed By: ajkr

Differential Revision: D57464025

Pulled By: hx235

fbshipit-source-id: 12f2f2635e3092f8c29362aa132462492b5c1417
2024-06-03 11:37:35 -07:00
hgy@ruijie.com.cn 21a16f9e64 add c-api to get default cf handle (#12514)
Summary:
rocksdb_batched_multi_get_cf has performance improvement than normal multi_get, however it needs a cf_handle arg, so add a C-API to get and destroy the default cf_handle, as many user only use the default cf.

Fixes https://github.com/facebook/rocksdb/issues/12316

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12514

Reviewed By: hx235

Differential Revision: D55922517

Pulled By: ajkr

fbshipit-source-id: c4cc4289f2cfd9efbb8f390a44a9d8d1ed08d9f0
2024-06-03 11:09:35 -07:00
Levi Tamasi b9e82f5162 Mention two fixes (#12677 and #12681) in the changelog (#12730)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12730

Reviewed By: jowlyzhang

Differential Revision: D58092079

fbshipit-source-id: 708437e705f9d9f770c9ba1a1a9b3c369b0a4b79
2024-06-03 10:50:41 -07:00
Andrew Kryczka c3ae569792 Update the main branch for the 9.3 release (#12726)
Summary:
Cut the 9.3.fb branch as of 5/17 11:59pm. Also, cherry-picked all bug fixes that have happened since then. Removed their files from unreleased_history/ since those fixes will appear in 9.3.0, so there seems no use repeating them in any later release.

Release branch: https://github.com/facebook/rocksdb/tree/9.3.fb
Tests: https://github.com/facebook/rocksdb/actions/runs/9342097111

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12726

Reviewed By: ltamasi

Differential Revision: D58069263

Pulled By: ajkr

fbshipit-source-id: c4f557bc8dbc20ce53021ac7e97a24f930542bf9
2024-06-02 22:10:24 -07:00
Jay Huh b7fc9ada9e Temporarily disable multi_cf_iter in stress test (#12728)
Summary:
We plan to re-enable the test after fixing the test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12728

Test Plan: N/A. Disabling the test

Reviewed By: hx235

Differential Revision: D58071284

Pulled By: jaykorean

fbshipit-source-id: af6b45ec7654f9c7b40c36d3b59c7087e27a7af9
2024-06-02 21:58:12 -07:00
Levi Tamasi 023a808417 Disable iterator refresh for CoalescingIterator in TestIterateAgainstExpected (#12723)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12723

`CoalescingIterator` doesn't support `Refresh` currently; the patch adds a check that was missing from https://github.com/facebook/rocksdb/pull/12721 to disable this operation when multi-CF iterators are in use in the stress test.

Reviewed By: jaykorean

Differential Revision: D58053334

fbshipit-source-id: 3146f0e7e87230b49b244cecdfcee345c0ce78fa
2024-06-01 09:52:48 -07:00
Yu Zhang fc59d8f9c6 Add public API `WriteWithCallback` to support custom callbacks (#12603)
Summary:
This PR adds a `DB::WriteWithCallback` API that does the same things as `DB::Write` while takes an argument `UserWriteCallback` to execute custom callback functions during the write.

We currently support two types of callback functions: `OnWriteEnqueued` and `OnWalWriteFinish`. The former is invoked   after the write is enqueued, and the later is invoked after WAL write finishes when applicable.

These callback functions are intended for users to use to improve synchronization between concurrent writes, their execution is on the write's critical path so it will impact the write's latency if not used properly. The documentation for the callback interface mentioned this and suggest user to keep these callback functions' implementation minimum.

Although transaction interfaces' writes doesn't yet allow user to specify such a user write callback argument, the `DBImpl::Write*` type of APIs do not differentiate between regular DB writes or writes coming from the transaction layer when it comes to supporting this `UserWriteCallback`. These callbacks works for all the write modes including: default write mode, Options.two_write_queues, Options.unordered_write, Options.enable_pipelined_write

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12603

Test Plan: Added unit test in ./write_callback_test

Reviewed By: anand1976

Differential Revision: D58044638

Pulled By: jowlyzhang

fbshipit-source-id: 87a84a0221df8f589ec8fc4d74597e72ce97e4cd
2024-05-31 19:30:19 -07:00
Jay Huh f3b7e959b3 Add CoalescingIterator to TestIterateAgainstExpected (#12721)
Summary:
Continuing from https://github.com/facebook/rocksdb/pull/12706. Adding the CoalescingIterator to `TestIterateAgainstExpected` as well when `use_multi_cf_iterator` is set True

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12721

Test Plan:
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1 --verify_iterator_with_expected_state_one_in=2
```

Reviewed By: ltamasi

Differential Revision: D58033811

Pulled By: jaykorean

fbshipit-source-id: 7caf39883e277e695b653e295ad72b1004169ca0
2024-05-31 15:17:06 -07:00
Levi Tamasi 6f17056e40 Add transactional/read-your-own-write MultiGetEntity stress test (#12717)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12717

The PR adds `Transaction::MultiGetEntity` to the stress tests. Similarly to what we do for `Transaction::MultiGet`, in this mode we open a transaction and randomly add writes for some of the queried keys to it while keeping track of the values written on a per-key basis. The results of `Transaction::MultiGetEntity` can then be validated against these expected values (in order to test the read-your-own-writes functionality) as well as the results returned by `Transaction::GetEntity` for the same keys.

Reviewed By: jaykorean

Differential Revision: D57990210

fbshipit-source-id: 9bf3bb292051c2c57757f86b517919197b03c524
2024-05-31 12:01:59 -07:00
Jay Huh a901ef48f0 Introduce use_multi_cf_iterator in stress test (#12706)
Summary:
Introduce `use_multi_cf_iterator`, and when it's set, use `CoalescingIterator` in `TestIterate()`. Because all the column families contain the same data in today's Stress Test, we can compare `CoalescingIterator` against any `DBIter` from any of the column families. Currently, coalescing logic verification is done by unit tests, but we can extend the stress test to support different data in different column families in the future.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12706

Test Plan:
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1
```

**More PRs to come**
- Use `AttributeGroupIterator` when both `use_multi_cf_iterator` and `use_attribute_group` are true
- Support `Refresh()` in `CoalescingIterator`
- Extend Stress Test to support different data in different CFs (Long-term)

Reviewed By: ltamasi

Differential Revision: D58020247

Pulled By: jaykorean

fbshipit-source-id: 8e2483b85cf2bb0f5a9bb44851601bbf063484ec
2024-05-31 10:50:15 -07:00
Po-Chuan Hsieh 76aa0d9ee2 Fix build on FreeBSD (#12714)
Summary:
The error message is as follows:
```
port/stack_trace.cc:286:7: error: use of undeclared identifier 'waitpid'
      waitpid(child_pid, &wstatus, 0);
      ^
port/stack_trace.cc:287:11: error: use of undeclared identifier 'WIFEXITED'
      if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == EXIT_SUCCESS) {
          ^
port/stack_trace.cc:287:33: error: use of undeclared identifier 'WEXITSTATUS'
      if (WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == EXIT_SUCCESS) {
                                ^
3 errors generated.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12714

Reviewed By: ajkr

Differential Revision: D57970244

Pulled By: jaykorean

fbshipit-source-id: afdad9af16b4bfe5e059bc82180f74b2c3260ed9
2024-05-30 17:48:17 -07:00
Yu Zhang 8a462eefae Add user timestamp support into interactive query command (#12716)
Summary:
As titled. This PR also makes the interactive query tool more permissive by allowing the user to continue to try out a different command after the previous command received some allowed errors, such as `Status::NotFound`, `Status::InvalidArgument`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12716

Test Plan:
Manually tested:
```
yuzhangyu@yuzhangyu-mbp rocksdb % ./ldb --db=$TEST_DB --key_hex --value_hex query
get 0x0000000000000000 --read_timestamp=1115559245398440
0x0000000000000000|timestamp:1115559245398440 ==> 0x07000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B
put 0x0000000000000000 0x0000
put 0x0000000000000000 => 0x0000 failed: Invalid argument: cannot call this method on column family default that enables timestamp
put 0x0000000000000000 aha 0x0000
put gets invalid argument: Invalid argument: user provided timestamp is not a valid uint64 value.
put 0x0000000000000000 1115559245398441 0x08000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B
put 0x0000000000000000 write_ts: 1115559245398441 => 0x08000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B succeeded
delete 0x0000000000000000
delete 0x0000000000000000 failed: Invalid argument: cannot call this method on column family default that enables timestamp
delete 0x0000000000000000 1115559245398442
delete 0x0000000000000000 write_ts: 1115559245398442 succeeded
get 0x0000000000000000 --read_timestamp=1115559245398442
get 0x0000000000000000 read_timestamp: 1115559245398442 status: NotFound:
get 0x0000000000000000 --read_timestamp=1115559245398441
0x0000000000000000|timestamp:1115559245398441 ==> 0x08000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B
count --from=0x0000000000000000 --to=0x0000000000000001
scan from 0x0000000000000000 to 0x0000000000000001failed: Invalid argument: cannot call this method on column family default that enables timestamp
count --from=0x0000000000000000 --to=0x0000000000000001 --read_timestamp=1115559245398442
0
count --from=0x0000000000000000 --to=0x0000000000000001 --read_timestamp=1115559245398441
1
```

Reviewed By: ltamasi

Differential Revision: D57992183

Pulled By: jowlyzhang

fbshipit-source-id: 720525de22412d16aa952870e088f2c371459ece
2024-05-30 17:23:38 -07:00