Commit Graph

1602 Commits

Author SHA1 Message Date
Hui Xiao 58fc9d61b7 Trim readahead size based on prefix during prefix scan (#13040)
Summary:
**Context/Summary:**
During prefix scan, prefetched data blocks containing keys not in the same prefix as the `Seek()`'s key will be wasted when `ReadOptions::prefix_same_as_start = true` since they won't be returned to the user. This PR is to exclude those data blocks from being prefetched in a similar manner like trimming according to `ReadOptions::iterate_upper_bound`.

Bonus: refactoring to some existing prefetch test so they are easier to extend and read

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13040

Test Plan:
- New UT, integration to existing UTs
- Benchmark to ensure no regression from CPU due to more trimming logic
```
// Build DB with one sorted run under the same prefix
./db_bench --benchmarks=fillrandom --prefix_size=3 --keys_per_prefix=5000000 --num=5000000 --db=/dev/shm/db_bench --disable_auto_compactions=1
```
```
// Augment the existing db bench to call `Seek()` instead of `SeekToFirst()` in `void ReadSequential(){..}` to trigger the logic in this PR

+++ b/tools/db_bench_tool.cc
@@ -5900,7 +5900,12 @@ class Benchmark {
     Iterator* iter = db->NewIterator(options);
     int64_t i = 0;
     int64_t bytes = 0;
-    for (iter->SeekToFirst(); i < reads_ && iter->Valid(); iter->Next()) {
+
+    iter->SeekToFirst();
+    assert(iter->status().ok() && iter->Valid());
+    auto prefix = prefix_extractor_->Transform(iter->key());
+
+    for (iter->Seek(prefix); i < reads_ && iter->Valid(); iter->Next()) {
       bytes += iter->key().size() + iter->value().size();
       thread->stats.FinishedOps(nullptr, db, 1, kRead);
       ++i;
:
```
```
// Compare prefix scan performance
./db_bench --benchmarks=readseq[-X20] --prefix_size=3  --prefix_same_as_start=1 --auto_readahead_size=1 --cache_size=1 --use_existing_db=1 --db=/dev/shm/db_bench --disable_auto_compactions=1

// Before PR
readseq [AVG    20 runs] : 2449011 (± 50238) ops/sec;  270.9 (± 5.6) MB/sec
readseq [MEDIAN 20 runs] : 2499167 ops/sec;  276.5 MB/sec

// After PR  (regress 0.4 %)
readseq [AVG    20 runs] : 2439098 (± 42931) ops/sec;  269.8 (± 4.7) MB/sec
readseq [MEDIAN 20 runs] : 2460859 ops/sec;  272.2 MB/sec

```

- Stress test: randomly set `prefix_same_as_start` in `TestPrefixScan()`. Run below for a while
```
python3 tools/db_crashtest.py --simple blackbox --prefix_size=5 --prefixpercent=65 --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=-1 --bloom_bits=3 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=0 --compaction_style=2 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0  --db_write_buffer_size=8388608 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=0 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=15 --index_shortening=2 --index_type=2 --ingest_external_file_one_in=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32  --readpercent=10 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --uncache_aggressiveness=1 --universal_max_read_amp=10 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=10
```

Reviewed By: anand1976

Differential Revision: D64367065

Pulled By: hx235

fbshipit-source-id: 5750c05ccc835c3e9dc81c961b76deaf30bd23c2
2024-10-17 15:52:55 -07:00
Peter Dillinger ac24f152a1 Refactor `table_factory` into MutableCFOptions (#13077)
Summary:
This is setting up for a fix to a data race in SetOptions on BlockBasedTableOptions (BBTO), https://github.com/facebook/rocksdb/issues/10079
The race will be fixed by replacing `table_factory` with a modified copy whenever we want to modify a BBTO field.

An argument could be made that this change creates more entaglement between features (e.g. BlobSource <-> MutableCFOptions), rather than (conceptually) minimizing the dependencies of each feature, but
* Most of these things already depended on ImmutableOptions
* Historically there has been a lot of plumbing (and possible small CPU overhead) involved in adding features that need to reach a lot of places, like `block_protection_bytes_per_key`. Keeping those wrapped up in options simplifies that.
* SuperVersion management generally takes care of lifetime management of MutableCFOptions, so is not that difficult. (Crash test agrees so far.)

There are some FIXME places where it is known to be unsafe to replace `block_cache` unless/until we handle shared_ptr tracking properly. HOWEVER, replacing `block_cache` is generally dubious, at least while existing users of the old block cache (e.g. table readers) can continue indefinitely.

The change to cf_options.cc is essentially just moving code (not changing).

I'm not concerned about the performance of copying another shared_ptr with MutableCFOptions, but I left a note about considering an improvement if more shared_ptr are added to it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13077

Test Plan:
existing tests, crash test.

Unit test DBOptionsTest.GetLatestCFOptions updated with some temporary logic. MemoryTest required some refactoring (simplification) for the change.

Reviewed By: cbi42

Differential Revision: D64546903

Pulled By: pdillinger

fbshipit-source-id: 69ae97ce5cf4c01b58edc4c5d4687eb1e5bf5855
2024-10-17 14:13:20 -07:00
Changyu Bi 8ad4c7efc4 Add an API to check if an SST file is generated by SstFileWriter (#13072)
Summary:
Some users want to check if a file in their DB was created by SstFileWriter and ingested into hte DB. Files created by SstFileWriter records additional table properties cbebbad7d9/table/sst_file_writer_collectors.h (L17-L21). These are not exposed so I'm adding an API to SstFileWriter to check for them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13072

Test Plan: added some assertions.

Reviewed By: jowlyzhang

Differential Revision: D64411518

Pulled By: cbi42

fbshipit-source-id: 279bfef48615aa6f78287b643d8445a1471e7f07
2024-10-16 16:57:05 -07:00
Changyu Bi 787730c859 Add an ingestion option to not fill block cache (#13067)
Summary:
add `IngestExternalFileOptions::fill_cache` to allow users to ingest files without loading index/filter/data and other blocks into block cache during file ingestion. This can be useful when users are ingesting files into a CF that is not available to readers yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13067

Test Plan:
* unit test: `ExternalSSTFileTest.NoBlockCache`
* ran one round of crash test with fill_cache disabled: `python3 ./tools/db_crashtest.py --simple blackbox --ops_per_thread=1000000 --interval=30 --ingest_external_file_one_in=200 --level0_stop_writes_trigger=200 --level0_slowdown_writes_trigger=100 --sync_fault_injection=0 --disable_wal=0 --manual_wal_flush_one_in=0`

Reviewed By: jowlyzhang

Differential Revision: D64356424

Pulled By: cbi42

fbshipit-source-id: b380c26f5987238e1ed7d42ceef0390cfaa0b8e2
2024-10-16 14:11:22 -07:00
Peter Dillinger 351d2fd2b6 Make simple BlockBasedTableOptions mutable (#10021)
Summary:
In theory, there should be no danger in mutability, as table
builders and readers work from copies of BlockBasedTableOptions.
However, there is currently an unresolved read-write race that
affecting SetOptions on BBTO fields. This should be generally
acceptable for non-pointer options of 64 bits or less, but a fix
is needed to make it mutability general here. See
https://github.com/facebook/rocksdb/issues/10079

This change systematically sets all of those "simple" options (and future
such options) as mutable. (Resurrecting this PR perhaps preferable to
proposed https://github.com/facebook/rocksdb/issues/13063)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10021

Test Plan: Some unit test updates. XXX comment added to stress test code

Reviewed By: cbi42

Differential Revision: D64360967

Pulled By: pdillinger

fbshipit-source-id: ff220fa778331852fe331b42b76ac4adfcd2d760
2024-10-14 17:49:26 -07:00
Yu Zhang 8181dfb1c4 Fix a bug for surfacing write unix time (#13057)
Summary:
The write unix time from non L0 files are not surfaced properly because the level's wrapper iterator doesn't have a `write_unix_time` implementation that delegates to the corresponding file. The unit test didn't catch this because it incorrectly destroy the old db and reopen to check write time, instead of just reopen and check. This fix also include a change to support ldb's scan command to get write time for easier debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13057

Test Plan: Updated unit tests

Reviewed By: pdillinger

Differential Revision: D64015107

Pulled By: jowlyzhang

fbshipit-source-id: 244474f78a034f80c9235eea2aa8a0f4e54dff59
2024-10-08 11:31:51 -07:00
Yu Zhang 32dd657bad Add some per key optimization for UDT in memtable only feature (#13031)
Summary:
This PR added some optimizations for the per key handling for SST file for the user-defined timestamps in Memtable only feature. CPU profiling shows this part is a big culprit for regression. This optimization saves some string construction/destruction/appending/copying. vector operations like reserve/emplace_back.

When iterating keys in a block, we need to copy some shared bytes from previous key, put it together with the non shared bytes and find a right location to pad the min timestamp. Previously, we create a tmp local string buffer to first construct the key from its pieces, and then copying this local string's content into `IterKey`'s buffer. To avoid having this local string and to avoid this extra copy. Instead of piecing together the key in a local string first, we just track all the pieces that make this key in a reused Slice array. And then copy the pieces in order into `IterKey`'s buffer. Since the previous key should be kept intact while we are copying some shared bytes from it,  we added a secondary buffer in `IterKey` and alternate between primary buffer and secondary buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13031

Test Plan: Existing tests.

Reviewed By: ltamasi

Differential Revision: D63416531

Pulled By: jowlyzhang

fbshipit-source-id: 9819b0e02301a2dbc90621b2fe4f651bc912113c
2024-10-03 17:57:50 -07:00
Peter Dillinger 12960f0b57 Disable uncache_aggressiveness with allow_mmap_reads (#13051)
Summary:
There was a crash test Bus Error crash in `IndexBlockIter::SeekToFirstImpl()` <- .. <-
`BlockBasedTable::~BlockBasedTable()` with `--mmap_read=1`, which suggests some kind of incompatibility that I haven't diagnosed. Bus Error is uncommon these days as CPUs support unaligned reads, but are associated with mmap problems.

Because mmap reads really only make sense without block cache, it's not a concerning loss to essentially disable the combination.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13051

Test Plan: watch crash test

Reviewed By: jowlyzhang

Differential Revision: D63795069

Pulled By: pdillinger

fbshipit-source-id: 6c823c619840086b5c9cff53dbc7470662b096be
2024-10-02 18:16:50 -07:00
Peter Dillinger a1a102ffce Steps toward deprecating implicit prefix seek, related fixes (#13026)
Summary:
With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly.

Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`.

Related fixes / changes:
* Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation).
* Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.)

Suggested follow-up:
* Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness.
* Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek.
* Add `prefix_same_as_start` testing to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026

Test Plan:
tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while.

Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all).

Reviewed By: ltamasi

Differential Revision: D63147378

Pulled By: pdillinger

fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e
2024-09-20 15:54:19 -07:00
Nicholas Ormrod 0e04ef1a96 Deshim coro in fbcode/internal_repo_rocksdb
Summary:
The following rules were deshimmed:
```
//folly/experimental/coro:accumulate -> //folly/coro:accumulate
//folly/experimental/coro:async_generator -> //folly/coro:async_generator
//folly/experimental/coro:async_pipe -> //folly/coro:async_pipe
//folly/experimental/coro:async_scope -> //folly/coro:async_scope
//folly/experimental/coro:async_stack -> //folly/coro:async_stack
//folly/experimental/coro:baton -> //folly/coro:baton
//folly/experimental/coro:blocking_wait -> //folly/coro:blocking_wait
//folly/experimental/coro:collect -> //folly/coro:collect
//folly/experimental/coro:concat -> //folly/coro:concat
//folly/experimental/coro:coroutine -> //folly/coro:coroutine
//folly/experimental/coro:current_executor -> //folly/coro:current_executor
//folly/experimental/coro:detach_on_cancel -> //folly/coro:detach_on_cancel
//folly/experimental/coro:detail_barrier -> //folly/coro:detail_barrier
//folly/experimental/coro:detail_barrier_task -> //folly/coro:detail_barrier_task
//folly/experimental/coro:detail_current_async_frame -> //folly/coro:detail_current_async_frame
//folly/experimental/coro:detail_helpers -> //folly/coro:detail_helpers
//folly/experimental/coro:detail_malloc -> //folly/coro:detail_malloc
//folly/experimental/coro:detail_manual_lifetime -> //folly/coro:detail_manual_lifetime
//folly/experimental/coro:detail_traits -> //folly/coro:detail_traits
//folly/experimental/coro:filter -> //folly/coro:filter
//folly/experimental/coro:future_util -> //folly/coro:future_util
//folly/experimental/coro:generator -> //folly/coro:generator
//folly/experimental/coro:gmock_helpers -> //folly/coro:gmock_helpers
//folly/experimental/coro:gtest_helpers -> //folly/coro:gtest_helpers
//folly/experimental/coro:inline_task -> //folly/coro:inline_task
//folly/experimental/coro:invoke -> //folly/coro:invoke
//folly/experimental/coro:merge -> //folly/coro:merge
//folly/experimental/coro:mutex -> //folly/coro:mutex
//folly/experimental/coro:promise -> //folly/coro:promise
//folly/experimental/coro:result -> //folly/coro:result
//folly/experimental/coro:retry -> //folly/coro:retry
//folly/experimental/coro:rust_adaptors -> //folly/coro:rust_adaptors
//folly/experimental/coro:scope_exit -> //folly/coro:scope_exit
//folly/experimental/coro:shared_lock -> //folly/coro:shared_lock
//folly/experimental/coro:shared_mutex -> //folly/coro:shared_mutex
//folly/experimental/coro:sleep -> //folly/coro:sleep
//folly/experimental/coro:small_unbounded_queue -> //folly/coro:small_unbounded_queue
//folly/experimental/coro:task -> //folly/coro:task
//folly/experimental/coro:timed_wait -> //folly/coro:timed_wait
//folly/experimental/coro:timeout -> //folly/coro:timeout
//folly/experimental/coro:traits -> //folly/coro:traits
//folly/experimental/coro:transform -> //folly/coro:transform
//folly/experimental/coro:unbounded_queue -> //folly/coro:unbounded_queue
//folly/experimental/coro:via_if_async -> //folly/coro:via_if_async
//folly/experimental/coro:with_async_stack -> //folly/coro:with_async_stack
//folly/experimental/coro:with_cancellation -> //folly/coro:with_cancellation
//folly/experimental/coro:bounded_queue -> //folly/coro:bounded_queue
//folly/experimental/coro:shared_promise -> //folly/coro:shared_promise
//folly/experimental/coro:cleanup -> //folly/coro:cleanup
//folly/experimental/coro:auto_cleanup_fwd -> //folly/coro:auto_cleanup_fwd
//folly/experimental/coro:auto_cleanup -> //folly/coro:auto_cleanup
```

The following headers were deshimmed:
```
folly/experimental/coro/Accumulate.h -> folly/coro/Accumulate.h
folly/experimental/coro/Accumulate-inl.h -> folly/coro/Accumulate-inl.h
folly/experimental/coro/AsyncGenerator.h -> folly/coro/AsyncGenerator.h
folly/experimental/coro/AsyncPipe.h -> folly/coro/AsyncPipe.h
folly/experimental/coro/AsyncScope.h -> folly/coro/AsyncScope.h
folly/experimental/coro/AsyncStack.h -> folly/coro/AsyncStack.h
folly/experimental/coro/Baton.h -> folly/coro/Baton.h
folly/experimental/coro/BlockingWait.h -> folly/coro/BlockingWait.h
folly/experimental/coro/Collect.h -> folly/coro/Collect.h
folly/experimental/coro/Collect-inl.h -> folly/coro/Collect-inl.h
folly/experimental/coro/Concat.h -> folly/coro/Concat.h
folly/experimental/coro/Concat-inl.h -> folly/coro/Concat-inl.h
folly/experimental/coro/Coroutine.h -> folly/coro/Coroutine.h
folly/experimental/coro/CurrentExecutor.h -> folly/coro/CurrentExecutor.h
folly/experimental/coro/DetachOnCancel.h -> folly/coro/DetachOnCancel.h
folly/experimental/coro/detail/Barrier.h -> folly/coro/detail/Barrier.h
folly/experimental/coro/detail/BarrierTask.h -> folly/coro/detail/BarrierTask.h
folly/experimental/coro/detail/CurrentAsyncFrame.h -> folly/coro/detail/CurrentAsyncFrame.h
folly/experimental/coro/detail/Helpers.h -> folly/coro/detail/Helpers.h
folly/experimental/coro/detail/Malloc.h -> folly/coro/detail/Malloc.h
folly/experimental/coro/detail/ManualLifetime.h -> folly/coro/detail/ManualLifetime.h
folly/experimental/coro/detail/Traits.h -> folly/coro/detail/Traits.h
folly/experimental/coro/Filter.h -> folly/coro/Filter.h
folly/experimental/coro/Filter-inl.h -> folly/coro/Filter-inl.h
folly/experimental/coro/FutureUtil.h -> folly/coro/FutureUtil.h
folly/experimental/coro/Generator.h -> folly/coro/Generator.h
folly/experimental/coro/GmockHelpers.h -> folly/coro/GmockHelpers.h
folly/experimental/coro/GtestHelpers.h -> folly/coro/GtestHelpers.h
folly/experimental/coro/detail/InlineTask.h -> folly/coro/detail/InlineTask.h
folly/experimental/coro/Invoke.h -> folly/coro/Invoke.h
folly/experimental/coro/Merge.h -> folly/coro/Merge.h
folly/experimental/coro/Merge-inl.h -> folly/coro/Merge-inl.h
folly/experimental/coro/Mutex.h -> folly/coro/Mutex.h
folly/experimental/coro/Promise.h -> folly/coro/Promise.h
folly/experimental/coro/Result.h -> folly/coro/Result.h
folly/experimental/coro/Retry.h -> folly/coro/Retry.h
folly/experimental/coro/RustAdaptors.h -> folly/coro/RustAdaptors.h
folly/experimental/coro/ScopeExit.h -> folly/coro/ScopeExit.h
folly/experimental/coro/SharedLock.h -> folly/coro/SharedLock.h
folly/experimental/coro/SharedMutex.h -> folly/coro/SharedMutex.h
folly/experimental/coro/Sleep.h -> folly/coro/Sleep.h
folly/experimental/coro/Sleep-inl.h -> folly/coro/Sleep-inl.h
folly/experimental/coro/SmallUnboundedQueue.h -> folly/coro/SmallUnboundedQueue.h
folly/experimental/coro/Task.h -> folly/coro/Task.h
folly/experimental/coro/TimedWait.h -> folly/coro/TimedWait.h
folly/experimental/coro/Timeout.h -> folly/coro/Timeout.h
folly/experimental/coro/Timeout-inl.h -> folly/coro/Timeout-inl.h
folly/experimental/coro/Traits.h -> folly/coro/Traits.h
folly/experimental/coro/Transform.h -> folly/coro/Transform.h
folly/experimental/coro/Transform-inl.h -> folly/coro/Transform-inl.h
folly/experimental/coro/UnboundedQueue.h -> folly/coro/UnboundedQueue.h
folly/experimental/coro/ViaIfAsync.h -> folly/coro/ViaIfAsync.h
folly/experimental/coro/WithAsyncStack.h -> folly/coro/WithAsyncStack.h
folly/experimental/coro/WithCancellation.h -> folly/coro/WithCancellation.h
folly/experimental/coro/BoundedQueue.h -> folly/coro/BoundedQueue.h
folly/experimental/coro/SharedPromise.h -> folly/coro/SharedPromise.h
folly/experimental/coro/Cleanup.h -> folly/coro/Cleanup.h
folly/experimental/coro/AutoCleanup-fwd.h -> folly/coro/AutoCleanup-fwd.h
folly/experimental/coro/AutoCleanup.h -> folly/coro/AutoCleanup.h
```

This is a codemod. It was automatically generated and will be landed once it is approved and tests are passing in sandcastle.
You have been added as a reviewer by Sentinel or Butterfly.

Autodiff project: dcoro
Autodiff partition: fbcode.internal_repo_rocksdb
Autodiff bookmark: ad.dcoro.fbcode.internal_repo_rocksdb

Reviewed By: dtolnay

Differential Revision: D62684411

fbshipit-source-id: 8dbd31ab64fcdd99435d322035b9668e3200e0a3
2024-09-14 09:48:21 -07:00
anand76 cabd2d8718 Fix a couple of missing cases of retry on corruption (#13007)
Summary:
For SST checksum mismatch corruptions in the read path, RocksDB retries the read if the underlying file system supports verification and reconstruction of data (`FSSupportedOps::kVerifyAndReconstructRead`). There were a couple of places where the retry was missing - reading the SST footer and the properties block. This PR fixes the retry in those cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13007

Test Plan: Add new unit tests

Reviewed By: jaykorean

Differential Revision: D62519186

Pulled By: anand1976

fbshipit-source-id: 50aa38f18f2a53531a9fc8d4ccdf34fbf034ed59
2024-09-13 13:56:49 -07:00
Peter Dillinger 4577b672d5 More valgrind fixes (#12990)
Summary:
* https://github.com/facebook/rocksdb/issues/12936 was insufficient to fix the std::optional false positives. Making a fix validated in CI this time (see https://github.com/facebook/rocksdb/issues/12991)
* valgrind grinds to a halt on startup on my dev machine apparently because it expects internet access. Disable its attempts to access the internet when git is using a proxy.
* Move PORTABLE=1 from CI job to the Makefile. Without it, valgrind complains about illegal instructions (too new)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12990

Test Plan: manual, watch nightly valgrind job

Reviewed By: ltamasi

Differential Revision: D62203242

Pulled By: pdillinger

fbshipit-source-id: a611b08da7dbd173b0709ed7feb0578729553a17
2024-09-06 10:11:34 -07:00
Changyu Bi c62de54c7c Record largest seqno in table properties and verify in file ingestion (#12951)
Summary:
this helps to avoid scanning input files when ingesting db generated files: ecb844babd/db/external_sst_file_ingestion_job.cc (L917-L935)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12951

Test Plan:
* `IngestDBGeneratedFileTest.FailureCase` is updated to verify that this table property is verified during ingestion
* existing unit tests for other ingestion use cases.

Reviewed By: jowlyzhang

Differential Revision: D61608285

Pulled By: cbi42

fbshipit-source-id: b5b7aae9741531349ab247be6ffaa3f3628b76ca
2024-08-21 16:24:18 -07:00
Peter Dillinger 4d3518951a Option to decouple index and filter partitions (#12939)
Summary:
Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating tail latencies and thrash conditions that can arise with large metadata blocks (sometimes megabytes each) that can arise with large SST files. In general, the cost to partitioned metadata is more CPU in accesses (especially for filters where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads.

However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed.

Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair because of its despite its proportionally smaller cost of keeping in block cache, and most of the cost of a miss to re-load it (random IO) is not proportional to the size (similar latency etc. up to ~32KB).

 ## This change

Adds a new table option decouple_partitioned_filters allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder.

This change includes refactoring to pass that previous key to the filter builder when available, with the filter building caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still overall improvement especially with https://github.com/facebook/rocksdb/issues/12931.

Suggested follow-up:
* Update confusing use of "last key" to refer to "previous key"
* Expand unit test coverage with parallel compression and dictionary training
* Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939

Test Plan:
unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature.

 ## SST write performance (CPU)

Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931 but with -decouple_partitioned_filters=1 in the "after" configuration, which benchmarking shows makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR
```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
923691 vs. 924851 (-0.13%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
921398 vs. 922973 (-0.17%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
902259 vs. 908756 (-0.71%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
917932 vs. 916901 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
912755 vs. 907298 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
899754 vs. 892433 (+0.82%)
```
I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations.

 ## Read performance

Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with larger metadata block size setting (8k instead of 4k). Basic setup:

```
(for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 | tee results-$CS; done)
```

And read ops/s results:

```CSV
Cache size MB,After/decoupled/4k,Before/4k,Before/8k
3,15593,15158,12826
6,16295,16693,14134
10,20427,20813,18459
20,27035,26836,27384
30,33250,31810,33846
60,35518,32585,35329
100,36612,31805,35292
300,35780,31492,35481
1000,34145,31551,35411
1100,35219,31380,34302
1200,35060,31037,34322
```

If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration.

Reviewed By: jowlyzhang

Differential Revision: D61376772

Pulled By: pdillinger

fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f
2024-08-16 15:34:31 -07:00
Peter Dillinger 21da4ba4aa Attempt fix valgrind FP on std::optional (#12936)
Summary:
```
[ RUN      ]
BlockBasedTableReaderTest/BlockBasedTableReaderTest.MultiGet/347
==49577== Thread 4:
==49577== Conditional jump or move depends on uninitialised value(s)
==49577==    at 0x518AF93: operator!=<long unsigned int, long unsigned int> (optional:1115)
==49577==    by 0x518AF93: rocksdb::(anonymous namespace)::XXPH3FilterBitsBuilder::AddKeyAndAlt(rocksdb::Slice const&, rocksdb::Slice const&) (filter_policy.cc:100)
==49577==    by 0x5192722: Add (full_filter_block.cc:37)
==49577==    by 0x5192722: rocksdb::FullFilterBlockBuilder::Add(rocksdb::Slice const&) (full_filter_block.cc:33)
==49577==    by 0x5125DDB: rocksdb::BlockBasedTableBuilder::BGWorkWriteMaybeCompressedBlock() (block_based_table_builder.cc:1473)
==49577==    by 0x570C6B3: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.29)
==49577==    by 0x5617608: start_thread (pthread_create.c:477)
==49577==    by 0x5988132: clone (clone.S:95)
==49577==
```

Seems to be explained by ASM that valgrind doesn't like. https://stackoverflow.com/q/51616179

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12936

Test Plan: Wasn't able to reproduce locally

Reviewed By: hx235

Differential Revision: D61338401

Pulled By: pdillinger

fbshipit-source-id: b5b10f7f5c6a8c9eb088c00e5699046100167cb7
2024-08-15 10:55:29 -07:00
Peter Dillinger f63428bcc7 Optimize, simplify filter block building (fix regression) (#12931)
Summary:
This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in https://github.com/facebook/rocksdb/issues/12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys).

Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capabilty, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.)

Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now.

The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more intruction-level parallelism by reducing unnecessary control dependencies.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12931

Test Plan:
existing tests (with some minor updates)

Also manually ported over the pre-broken regression test described in
 https://github.com/facebook/rocksdb/issues/12870 and ran it (passed).

Performance:
Here we validate that an entire series of recent related PRs are a net improvement in aggregate. "Before" is with these PRs reverted: https://github.com/facebook/rocksdb/issues/12872 #12911 https://github.com/facebook/rocksdb/issues/12874 #12867 https://github.com/facebook/rocksdb/issues/12903 #12904. "After" includes this PR (and all
of those, with base revision 16c21af). Simultaneous test script designed to maximally depend on SST construction efficiency:

```
for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```

Showing average ops/sec of "after" vs. "before"

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```

As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key.

Reviewed By: jowlyzhang

Differential Revision: D61138271

Pulled By: pdillinger

fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043
2024-08-14 15:13:16 -07:00
anand76 c21fe1a47f Add ticker stats for read corruption retries (#12923)
Summary:
Add a couple of ticker stats for corruption retry count and successful retries. This PR also eliminates an extra read attempt when there's a checksum mismatch in a block read from the prefetch buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12923

Test Plan: Update existing tests

Reviewed By: jowlyzhang

Differential Revision: D61024687

Pulled By: anand1976

fbshipit-source-id: 3a08403580ab244000e0d480b7ee0f5a03d76b06
2024-08-12 15:32:07 -07:00
Peter Dillinger 9d5c8c89a1 Fix filter partition size logic (#12904)
Summary:
Was checking == a desired number of entries added to a filter, when the combination of whole key and prefix filtering could add more than one entry per table internal key. This could lead to unnecessarily large filter partitions, which could affect performance and block cache fairness.

Also (only somewhat related because of other work in progress):
* Some variable renaming and a new assertion in BlockBasedTableBuilder, to add some clarity.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12904

Test Plan:
If you add assertion logic to the base revision checking that the partition cut is requested whenever `keys_added_to_partition_ >= keys_per_partition_`, it fails on a number of db_bloom_filter_test tests. However, such an assertion in the revised code would be essentially redundant with the new logic.

If I added a regression test for this, it would be tricky and fragile, so I don't think it's important enough to chase and maintain.  (Open to suggestions / input.)

Reviewed By: jowlyzhang

Differential Revision: D60557827

Pulled By: pdillinger

fbshipit-source-id: 77a56097d540da6e7851941a26d26ced2d944373
2024-08-02 14:49:02 -07:00
Peter Dillinger 9245550e8b Clean up/refactor (Partitioned)FilterBlockBuilder (#12903)
Summary:
This is ahead of some related changes/enhancements. Refactorings here:
* Restructure some state of PartitionedFilterBlockBuilder to reduce redundancy in state tracking, improve clarity.
* Changed some function signatures to better match standard practice (return Status)
* Improve comments, arrange related fields
* Discourage/prevent production use of Finish without status (now TEST_Finish)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12903

Test Plan: existing tests

Reviewed By: jowlyzhang

Differential Revision: D60548613

Pulled By: pdillinger

fbshipit-source-id: d7dbc79951fcc3b837877227d58f713698ad2596
2024-08-02 13:35:45 -07:00
Yu Zhang 319374ae67 Add some checks at property block creation side (#12898)
Summary:
Crash test encountered this failure:
```file ingestion error: Corruption: properties unsorted under specified IngestExternalFileOptions: move_files: 0, verify_checksums_before_ingest: 1, verify_checksums_readahead_size: 1048576 (Empty string or missing field indicates default option or value is used```

Further inspection showed out of order table properties in an external file created by `SstFileWriter` for ingestion, and the file is likely created like this because it passed the initial checksum check. This change added some assertions to check invariant at the properties creation and collecting side.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12898

Test Plan: Existing tests

Reviewed By: hx235

Differential Revision: D60459817

Pulled By: jowlyzhang

fbshipit-source-id: 91474943d2f9d7795f00b6031c08a13ab91e2470
2024-07-31 13:28:17 -07:00
Peter Dillinger f456a7213f Refactor IndexBuilder::AddIndexEntry (#12867)
Summary:
Something I am working on is going to expand usage of `BlockBasedTableBuilder::Rep::last_key`, but the existing code contract for `IndexBuilder::AddIndexEntry` makes that difficult because it modifies its `last_key` parameter to be the separator value recorded in the index, often something between the two boundary keys.

This change primarily changes the contract of that function and related functions to separate function inputs and outputs, without sacrificing efficiency. For efficiency, a reusable scratch string buffer is provided by the caller, which the callee can use (or not) in returning a result Slice. That should yield a performance improvement as we are reusing a buffer for keys rather than copying into a new one each time in the FindShort* functions, without any additional string copies or conditional branches.

Additional improvements in PartitionedIndexBuilder specifically:
* Reduce string copies by eliminating `sub_index_last_key_` and instead tracking the key for the next partition in a placeholder Entry.
* Simplify code and improve code quality by changing `sub_index_builder_` to unique_ptr.
* Eliminate unnecessary NewFlushBlockPolicy call/object.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12867

Test Plan: existing tests, crash test. Will validate performance along with the change this is setting up.

Reviewed By: anand1976

Differential Revision: D59793119

Pulled By: pdillinger

fbshipit-source-id: 556da75cf13b967511f84702b2713d152f536a07
2024-07-22 14:27:31 -07:00
Changyu Bi 4384dd5eee Support ingesting SST files generated by a live DB (#12750)
Summary:
... to enable use cases like using RocksDB to merge sort data for ingestion. A new file ingestion option `IngestExternalFileOptions::allow_db_generated_files` is introduced to allows users to ingest SST files generated by live DBs instead of SstFileWriter. For now this only works if the SST files being ingested have zero as their largest sequence number AND do not overlap with any data in the DB (so we can assign seqno 0 which matches the seqno of all ingested keys).

The feature is marked the option as experimental for now.

Main changes needed to enable this:
- ignore CF id mismatch during ingestion
- ignore the missing external file version table property

Rest of the change is mostly in new unit tests.

A previous attempt is in https://github.com/facebook/rocksdb/issues/5602.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12750

Test Plan: - new unit tests

Reviewed By: ajkr, jowlyzhang

Differential Revision: D58396673

Pulled By: cbi42

fbshipit-source-id: aae513afad7b1ff5d4faa48104df5f384926bf03
2024-07-19 16:14:54 -07:00
Peter Dillinger de6d0e5ec3 Reduce cases of impacted performance from bug fix (#12874)
Summary:
https://github.com/facebook/rocksdb/issues/12872 was a bit too gross of a fix, because we still don't need to track previous prefix in FullFilterBlockBuilder for many non-partitioned use cases. This basically narrows the fix (and potentail CPU regression) to partitioned+prefix filter cases, which are the cases that needed to be fixed.

A better efficiency fix would still be nice but not as high of a priority.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12874

Test Plan: existing tests (just added in https://github.com/facebook/rocksdb/issues/12872)

Reviewed By: jowlyzhang

Differential Revision: D59885591

Pulled By: pdillinger

fbshipit-source-id: 8f273fc3e14c4b60c8a55501dc4bbcc325cd17a1
2024-07-17 16:42:27 -07:00
Peter Dillinger 93b163d1a2 Fix major bug with prefixes, SeekForPrev, and partitioned filters (#12872)
Summary:
Basically, the fix in https://github.com/facebook/rocksdb/issues/8137 was incomplete (and I missed it in the review), because if `whole_key_filtering` is false, then `last_prefix_str_` will never be set to non-empty and the fix doesn't work. Also related to https://github.com/facebook/rocksdb/issues/5835.

This is intended as a safe, simple fix that will regress CPU efficiency slightly (for `whole_key_filtering=false` cases, because of extra prefix string copies during flush & compaction). An efficient fix is not possible without some substantial refactoring.

Also in this PR: new test DBBloomFilterTest.FilterNumEntriesCoalesce tests an adjacent code path that was previously untested for its effect of ensuring the number of unique prefixes and keys is tracked properly when both prefixes and whole keys are going into a filter. (Test fails when either of the two code segments checking for duplicates is disabled.) In addition, the same test would fail before the main bug fix here because the code would inappropriately add the empty string to the filter (because of unmodified `last_prefix_str_`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12872

Test Plan: In addition to DBBloomFilterTest.FilterNumEntriesCoalesce, extended DBBloomFilterTest.SeekForPrevWithPartitionedFilters to cover the broken case. (Mostly whitespace change.)

Reviewed By: jowlyzhang

Differential Revision: D59873793

Pulled By: pdillinger

fbshipit-source-id: 2a7b7f09ca73dc188fb4dab833826ad6da7ebb11
2024-07-17 14:08:35 -07:00
Hui Xiao 0b10e7dbba Ignore more non-critical IO error in BlockCacheLookupForReadAheadSize() in crash test (#12822)
Summary:
**Context/Summary:**
Similar to https://github.com/facebook/rocksdb/pull/12814#issue-2376803461

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12822

Test Plan: CI

Reviewed By: jaykorean

Differential Revision: D59166188

Pulled By: hx235

fbshipit-source-id: 68c2eb7b319103ede0ba34944a0737440aecb17f
2024-06-28 12:01:13 -07:00
Hui Xiao aec15eebec Ignore non-critical IO error in `BlockCacheLookupForReadAheadSize()` in crash test (#12814)
Summary:
**Context/Summary:**

Error in `BlockCacheLookupForReadAheadSize()` is not critical enough to return such error in read path. That's because the worst case is to not have any read ahead. See below comment. a31fe52173/table/block_based/block_based_table_iterator.cc (L867-L871)

Therefore we should allow the read to return ok() even when we inject read error there.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12814

Test Plan:
Below command failed with ` Didn't get expected error from PrefixScan` before the fix but passes after

```
./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=100 --adaptive_readahead=0 --adm_policy=3 --advise_random_on_open=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_align=0 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=5 --bloom_bits=29.31310447925055 --bottommost_compression_type=lz4hc --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=tiered_auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=0 --compaction_readahead_size=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0.6666666666666666 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --default_temperature=kHot --default_write_temperature=kHot --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=10000 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=1 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=2 --index_shortening=0 --index_type=2 --ingest_external_file_one_in=1000 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kHot --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --log_file_time_to_roll=60 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=16384 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=0 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=128 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=1 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=32 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --report_bg_io_stats=1 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=foo --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=0 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --uncache_aggressiveness=203 --universal_max_read_amp=10 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=1 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=1000 --writepercent=35
```

Reviewed By: jaykorean

Differential Revision: D59092430

Pulled By: hx235

fbshipit-source-id: 39558c34461ce92275cae706c33dfd00e6f0ecce
2024-06-26 23:02:28 -07:00
Peter Dillinger 39455974cb Fix possible double-free on TruncatedRangeDelIterator (#12805)
Summary:
Not sure where or how it happens, but using a recent CircleCI failure I got a reliable db_stress reproducer.

Using std::unique_ptr appropriately for managing them has apparently (and unsurprisingly) fixed the problem without needing to know exactly where the problem was.

Suggested follow-up:
* Three or even four levels of pointers is very confusing to work with. Surely this part can be cleaned up to be simpler.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12805

Test Plan:
Reproducer passes, plus ASAN test and crash test runs. I don't think it's worth the extra work to track down the details and create a careful unit test.

```
./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=2 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=tiered_lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=1 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=0 --compaction_readahead_size=0 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_ratio=0.2 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/dev/shm/rocksdb.gpxs/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --expected_values_dir=/dev/shm/rocksdb.gpxs/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --fill_cache=1 --flush_one_in=1000000 --format_version=3 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=4 --index_shortening=0 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kHot --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=60 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=1 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=1 --uncache_aggressiveness=5 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```

Reviewed By: cbi42

Differential Revision: D58958390

Pulled By: pdillinger

fbshipit-source-id: 1271cfdcc3c574f78cd59f3c68148f7ed4a19c47
2024-06-24 11:51:16 -07:00
Peter Dillinger efba8f5b27 Respect ReadOptions::read_tier in prefetching (#12782)
Summary:
a pre-existing flaw revealed by crash test with uncache behavior. Easy fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12782

Test Plan: Modified unit test PrefetchTest.Basic (fails without fix)

Reviewed By: hx235

Differential Revision: D58757916

Pulled By: pdillinger

fbshipit-source-id: 23c0240c7cf0cb0b69a372f9531c07af920e09da
2024-06-19 09:53:59 -07:00
Yu Zhang c73cf7a878 Add CompactForTieringCollector to support automatically trigger compaction for tiering use case (#12760)
Summary:
This PR adds user property collector factory `CompactForTieringCollectorFactory` to support observe SST file and mark it as need compaction for fast tracking data to the proper tier.

A triggering ratio `compaction_trigger_ratio_` can be configured to achieve the following:
1) Setting the ratio to be equal to or smaller than 0 disables this collector
2) Setting the ratio to be within (0, 1] will write the number of observed eligible entries into a user property and marks a file as need-compaction when aforementioned condition is met.
3) Setting the ratio to be higher than 1 can be used to just writes the user table property, and not mark any file as need compaction.
 For a column family that does not enable tiering feature, even if an effective configuration is provided, this collector is still disabled. For a file that is already on the last level, this collector is also disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12760

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D58734976

Pulled By: jowlyzhang

fbshipit-source-id: 6daab2c4f62b5c6689c3c03e3b3907bbbe6b7a81
2024-06-18 10:51:29 -07:00
Peter Dillinger abf9ebc4bf Remove redundant no_io parameters to filter functions (#12762)
Summary:
Consolidate on already-present ReadOptions::read_tier

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12762

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D58450516

Pulled By: pdillinger

fbshipit-source-id: 1eec58c60beca73c6d5f2e9ae4442644920f8c30
2024-06-12 18:47:11 -07:00
Peter Dillinger 3abcba8470 Propagate more ReadOptions to ApproximateOffsetOf/Sizes (#12764)
Summary:
Unknown why these would ignore options like deadline and read_tier. Setting total_order_seek=true is unnecessary because of the disable_prefix_seek (= true) parameter to NewIndexIterator. This is only used by the hash index, which uses total order seek if either the ReadOption or the parameter is true. The parameter is arguably redundant with the total_order_seek option, meaning it could be eliminated, but I think this case is exceptional (compared to e.g. no_io):
* Prefix seek is particular to user iterators, though might be usable, carefully, for other read operations.
* The historical default of total_order_seek=false in a sense is "wrong result by default" so cannot be interpreted as an intent to force prefix seek in an operation for which it might be usual or give bad results.

Also added a generic release note to cover this and related PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12764

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D58474240

Pulled By: pdillinger

fbshipit-source-id: 79014d9822ba8f09d57ce4524363aa0973017b68
2024-06-12 16:25:47 -07:00
Peter Dillinger 5cf3bed00f Eliminate some parameters redundant with ReadOptions (#12761)
Summary:
... in Index and CompressionDict readers (Filters in another PR). no_io and verify_checksums should be inferred from ReadOptions rather than specified redundantly.

Fixes incomplete propagation of ReadOptions in
UncompressionDictReader::GetOrReadUncompressionDictionar so is technically a functional change. (Related to https://github.com/facebook/rocksdb/issues/12757)

Also there was hardcoded no verify_checksums in DumpTable, but only for UncompressionDict, which doesn't make sense. Now using consistent ReadOptions and verify_checksum can be controlled for more reads together.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12761

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D58450392

Pulled By: pdillinger

fbshipit-source-id: 0faed22832d664cb3b04a4c03ee77119977c200b
2024-06-12 15:44:37 -07:00
Peter Dillinger d64eac28d3 Fix a failure to propagate ReadOptions (#12757)
Summary:
The crash test revealed a case in which the uncache functionality in ~BlockBasedTableReader could initiate an block read (IO), despite setting ReadOptions::read_tier = kBlockCacheTier.

The root cause is a place in the code where many people have over time decided to opt-in propagating ReadOptions and no one took the initiative to propagate ReadOptions by default (opt out / override only as needed). The fix is in partitioned_index_reader.cc. Here,
ReadOptions::readahead_size is opted-out to avoid churn in prefetch_test that is not clearly an improvement or regression. It's hard to tell given the poor state of relevant documentation https://github.com/facebook/rocksdb/issues/12756. The affected unit test was added in https://github.com/facebook/rocksdb/issues/10602.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12757

Test Plan: (Now postponed to a follow-up diff) I have added some new infrastructure to DEBUG builds to catch this specific kind of violation in unit tests and in the stress/crash test. `EnforceReadOpts` establishes a thread-local context under which we assert no IOs are performed if ReadOptions said it should be forbidden. With this new checking, the Uncache unit test would catch the critical step toward a violation (inner ReadOptions allowing IO, even if no IO is actually performed), which is fixed with the production code change.

Reviewed By: hx235

Differential Revision: D58421526

Pulled By: pdillinger

fbshipit-source-id: 9e9917a0e320c78967e751bd887926a2ed231d37
2024-06-11 21:41:21 -07:00
Peter Dillinger 961468f92e Fix TSAN-reported data race with uncache_aggressiveness (#12753)
Summary:
Data race reported on
BlockBasedTableReader::Rep::uncache_aggressiveness because apparently a file can be marked obsolete through multiple table cache references in parallel. Using a relaxed atomic should resolve the race quite reasonably, especially considering this is a rare case and the racing writes should be storing the same value anyway.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12753

Test Plan: watch for TSAN crash test results

Reviewed By: ltamasi

Differential Revision: D58397473

Pulled By: pdillinger

fbshipit-source-id: 3e78b6adac4f7a7056790754bee42b3cb244f037
2024-06-11 16:53:13 -07:00
Peter Dillinger b34cef57b7 Support pro-actively erasing obsolete block cache entries (#12694)
Summary:
Currently, when files become obsolete, the block cache entries associated with them just age out naturally. With pure LRU, this is not too bad, as once you "use" enough cache entries to (re-)fill the cache, you are guranteed to have purged the obsolete entries. However, HyperClockCache is a counting clock cache with a somewhat longer memory, so could be more negatively impacted by previously-hot cache entries becoming obsolete, and taking longer to age out than newer single-hit entries.

Part of the reason we still have this natural aging-out is that there's almost no connection between block cache entries and the file they are associated with. Everything is hashed into the same pool(s) of entries with nothing like a secondary index based on file. Keeping track of such an index could be expensive.

This change adds a new, mutable CF option `uncache_aggressiveness` for erasing obsolete block cache entries. The process can be speculative, lossy, or unproductive because not all potential block cache entries associated with files will be resident in memory, and attempting to remove them all could be wasted CPU time. Rather than a simple on/off switch, `uncache_aggressiveness` basically tells RocksDB how much CPU you're willing to burn trying to purge obsolete block cache entries. When such efforts are not sufficiently productive for a file, we stop and move on.

The option is in ColumnFamilyOptions so that it is dynamically changeable for already-open files, and customizeable by CF.

Note that this block cache removal happens as part of the process of purging obsolete files, which is often in a background thread (depending on `background_purge_on_iterator_cleanup` and `avoid_unnecessary_blocking_io` options) rather than along CPU critical paths.

Notable auxiliary code details:
* Possibly fixing some issues with trivial moves with `only_delete_metadata`: unnecessary TableCache::Evict in that case and missing from the ObsoleteFileInfo move operator. (Not able to reproduce an current failure.)
* Remove suspicious TableCache::Erase() from VersionSet::AddObsoleteBlobFile() (TODO follow-up item)

Marked EXPERIMENTAL until more thorough validation is complete.

Direct stats of this functionality are omitted because they could be misleading. Block cache hit rate is a better indicator of benefit, and CPU profiling a better indicator of cost.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12694

Test Plan:
* Unit tests added, including refactoring an existing test to make better use of parameterized tests.
* Added to crash test.
* Performance, sample command:
```
for I in `seq 1 10`; do for UA in 300; do for CT in lru_cache fixed_hyper_clock_cache auto_hyper_clock_cache; do rm -rf /dev/shm/test3; TEST_TMPDIR=/dev/shm/test3 /usr/bin/time ./db_bench -benchmarks=readwhilewriting -num=13000000 -read_random_exp_range=6 -write_buffer_size=10000000 -bloom_bits=10 -cache_type=$CT -cache_size=390000000 -cache_index_and_filter_blocks=1 -disable_wal=1 -duration=60 -statistics -uncache_aggressiveness=$UA 2>&1 | grep -E 'micros/op|rocksdb.block.cache.data.(hit|miss)|rocksdb.number.keys.(read|written)|maxresident' | awk '/rocksdb.block.cache.data.miss/ { miss = $4 } /rocksdb.block.cache.data.hit/ { hit = $4 } { print } END { print "hit rate = " ((hit * 1.0) / (miss + hit)) }' | tee -a results-$CT-$UA; done; done; done
```

Averaging 10 runs each case, block cache data block hit rates

```
lru_cache
UA=0   -> hit rate = 0.327, ops/s = 87668, user CPU sec = 139.0
UA=300 -> hit rate = 0.336, ops/s = 87960, user CPU sec = 139.0

fixed_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 100069, user CPU sec = 139.9
UA=300 -> hit rate = 0.343, ops/s = 100104, user CPU sec = 140.2

auto_hyper_clock_cache
UA=0   -> hit rate = 0.336, ops/s = 97580, user CPU sec = 140.5
UA=300 -> hit rate = 0.345, ops/s = 97972, user CPU sec = 139.8
```

Conclusion: up to roughly 1 percentage point of improved block cache hit rate, likely leading to overall improved efficiency (because the foreground CPU cost of cache misses likely outweighs the background CPU cost of erasure, let alone I/O savings).

Reviewed By: ajkr

Differential Revision: D57932442

Pulled By: pdillinger

fbshipit-source-id: 84a243ca5f965f731f346a4853009780a904af6c
2024-06-07 08:57:11 -07:00
Valery Mironov a8a52e5b4d Fix AddressSanitizer container-overflow (#12722)
Summary:
```
ERROR: AddressSanitizer: container-overflow on address 0x506000682221 at pc 0x5583da569f76 bp 0x7f0ec8a9ffb0 sp 0x7f0ec8a9f780
WRITE of size 53 at 0x506000682221 thread T29
    #0 0x5583da569f75 in pread
    https://github.com/facebook/rocksdb/issues/1 0x5583e334fde4 in rocksdb::PosixRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const /rocksdb/env/io_posix.cc:580:9
    https://github.com/facebook/rocksdb/issues/2 0x5583e2cac42b in rocksdb::(anonymous namespace)::CompositeRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const /rocksdb/env/composite_env.cc:61:21
    https://github.com/facebook/rocksdb/issues/3 0x5583e2c8a8e4 in rocksdb::(anonymous namespace)::LegacyRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const /rocksdb/env/env.cc:152:41
    https://github.com/facebook/rocksdb/issues/4 0x5583e2d6cbfb in rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Slice*, char*, std::__2::unique_ptr<char [], std::__2::default_delete<char []>>*, rocksdb::Env::IOPriority) const /rocksdb/file/random_access_file_reader.cc:204:25
    https://github.com/facebook/rocksdb/issues/5 0x5583e307c614 in rocksdb::ReadFooterFromFile(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, unsigned long, rocksdb::Footer*, unsigned long) /rocksdb/table/format.cc:383:17
    https://github.com/facebook/rocksdb/issues/6 0x5583e2f88456 in rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::CacheReservationManager>, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, unsigned long) /rocksdb/table/block_based/block_based_table_reader.cc:610:9
    https://github.com/facebook/rocksdb/issues/7 0x5583e2ef7837 in rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, bool) const /rocksdb/table/block_based/block_based_table_factory.cc:599:10
    https://github.com/facebook/rocksdb/issues/8 0x5583e2ab873c in rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, bool, bool, rocksdb::HistogramImpl*, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:142:34
    https://github.com/facebook/rocksdb/issues/9 0x5583e2aba5f6 in rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Cache::Handle**, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:190:16
    https://github.com/facebook/rocksdb/issues/10 0x5583e2abb7e1 in rocksdb::TableCache::NewIterator(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::RangeDelAggregator*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, rocksdb::TableReader**, rocksdb::HistogramImpl*, rocksdb::TableReaderCaller, rocksdb::Arena*, bool, int, unsigned long, rocksdb::InternalKey const*, rocksdb::InternalKey const*, bool) /rocksdb/db/table_cache.cc:235:9
    https://github.com/facebook/rocksdb/issues/11 0x5583e28d14cf in rocksdb::BuildTable(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::__2::vector<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>, std::__2::allocator<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>>>, rocksdb::FileMetaData*, std::__2::vector<rocksdb::BlobFileAddition, std::__2::allocator<rocksdb::BlobFileAddition>>*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::__2::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const*, rocksdb::BlobFileCompletionCallback*, unsigned long*, unsigned long*, unsigned long*) /rocksdb/db/builder.cc:335:57
    https://github.com/facebook/rocksdb/issues/12 0x5583e29bf29d in rocksdb::FlushJob::WriteLevel0Table() /rocksdb/db/flush_job.cc:919:11
    https://github.com/facebook/rocksdb/issues/13 0x5583e29b33ac in rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*, bool*) /rocksdb/db/flush_job.cc:276:9
    https://github.com/facebook/rocksdb/issues/14 0x5583e27a4781 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:258:19
    https://github.com/facebook/rocksdb/issues/15 0x5583e27a7a96 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:377:14
    https://github.com/facebook/rocksdb/issues/16 0x5583e27d6777 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2778:14
    https://github.com/facebook/rocksdb/issues/17 0x5583e27d14e2 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2817:16
    https://github.com/facebook/rocksdb/issues/18 0x5583e323d353 in std::__2::__function::__policy_func<void ()>::operator()[abi:ne180100]() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:714:12
    https://github.com/facebook/rocksdb/issues/19 0x5583e323d353 in std::__2::function<void ()>::operator()() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:981:10
    https://github.com/facebook/rocksdb/issues/20 0x5583e323d353 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /rocksdb/util/threadpool_imp.cc:266:5
    https://github.com/facebook/rocksdb/issues/21 0x5583e3243d18 in decltype(std::declval<void (*)(void*)>()(std::declval<rocksdb::BGThreadMetadata*>())) std::__2::__invoke[abi:ne180100]<void (*)(void*), rocksdb::BGThreadMetadata*>(void (*&&)(void*), rocksdb::BGThreadMetadata*&&) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__type_traits/invoke.h:344:25
    https://github.com/facebook/rocksdb/issues/22 0x5583e3243d18 in void std::__2::__thread_execute[abi:ne180100]<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*, 2ul>(std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>&, std::__2::__tuple_indices<2ul>) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:193:3
    https://github.com/facebook/rocksdb/issues/23 0x5583e3243d18 in void* std::__2::__thread_proxy[abi:ne180100]<std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>>(void*) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:202:3
    https://github.com/facebook/rocksdb/issues/24 0x5583da5e819e in asan_thread_start(void*) crtstuff.c
    https://github.com/facebook/rocksdb/issues/25 0x7f0eda362a93 in start_thread nptl/pthread_create.c:447:8
    https://github.com/facebook/rocksdb/issues/26 0x7f0eda3efc3b in clone3 misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

0x506000682221 is located 1 bytes inside of 56-byte region [0x506000682220,0x506000682258)
allocated by thread T29 here:
    #0 0x5583da6281d1 in operator new(unsigned long)
    https://github.com/facebook/rocksdb/issues/1 0x5583da6c987d in __libcpp_operator_new<unsigned long> /root/build/3rdParty/llvm/runtimes/include/c++/v1/new:271:10
    https://github.com/facebook/rocksdb/issues/2 0x5583da6c987d in __libcpp_allocate /root/build/3rdParty/llvm/runtimes/include/c++/v1/new:295:10
    https://github.com/facebook/rocksdb/issues/3 0x5583da6c987d in allocate /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocator.h:125:32
    https://github.com/facebook/rocksdb/issues/4 0x5583da6c987d in allocate_at_least /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocator.h:131:13
    https://github.com/facebook/rocksdb/issues/5 0x5583da6c987d in allocate_at_least<std::__2::allocator<char> > /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocate_at_least.h:34:20
    https://github.com/facebook/rocksdb/issues/6 0x5583da6c987d in __allocate_at_least<std::__2::allocator<char> > /root/build/3rdParty/llvm/runtimes/include/c++/v1/__memory/allocate_at_least.h:42:10
    https://github.com/facebook/rocksdb/issues/7 0x5583da6c987d in std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>::__shrink_or_extend[abi:ne180100](unsigned long) /root/build/3rdParty/llvm/runtimes/include/c++/v1/string:3236:27
    https://github.com/facebook/rocksdb/issues/8 0x5583e307c5aa in std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>::reserve(unsigned long) /root/build/3rdParty/llvm/runtimes/include/c++/v1/string:3207:3
    https://github.com/facebook/rocksdb/issues/9 0x5583e307c5aa in rocksdb::ReadFooterFromFile(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, rocksdb::FilePrefetchBuffer*, unsigned long, rocksdb::Footer*, unsigned long) /rocksdb/table/format.cc:382:18
    https://github.com/facebook/rocksdb/issues/10 0x5583e2f88456 in rocksdb::BlockBasedTable::Open(rocksdb::ReadOptions const&, rocksdb::ImmutableOptions const&, rocksdb::EnvOptions const&, rocksdb::BlockBasedTableOptions const&, rocksdb::InternalKeyComparator const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::CacheReservationManager>, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, int, bool, unsigned long, bool, rocksdb::TailPrefetchStats*, rocksdb::BlockCacheTracer*, unsigned long, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, unsigned long) /rocksdb/table/block_based/block_based_table_reader.cc:610:9
    https://github.com/facebook/rocksdb/issues/11 0x5583e2ef7837 in rocksdb::BlockBasedTableFactory::NewTableReader(rocksdb::ReadOptions const&, rocksdb::TableReaderOptions const&, std::__2::unique_ptr<rocksdb::RandomAccessFileReader, std::__2::default_delete<rocksdb::RandomAccessFileReader>>&&, unsigned long, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, bool) const /rocksdb/table/block_based/block_based_table_factory.cc:599:10
    https://github.com/facebook/rocksdb/issues/12 0x5583e2ab873c in rocksdb::TableCache::GetTableReader(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, bool, bool, rocksdb::HistogramImpl*, std::__2::unique_ptr<rocksdb::TableReader, std::__2::default_delete<rocksdb::TableReader>>*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:142:34
    https://github.com/facebook/rocksdb/issues/13 0x5583e2aba5f6 in rocksdb::TableCache::FindTable(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileDescriptor const&, rocksdb::Cache::Handle**, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, bool, bool, rocksdb::HistogramImpl*, bool, int, bool, unsigned long, rocksdb::Temperature) /rocksdb/db/table_cache.cc:190:16
    https://github.com/facebook/rocksdb/issues/14 0x5583e2abb7e1 in rocksdb::TableCache::NewIterator(rocksdb::ReadOptions const&, rocksdb::FileOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::RangeDelAggregator*, std::__2::shared_ptr<rocksdb::SliceTransform const> const&, rocksdb::TableReader**, rocksdb::HistogramImpl*, rocksdb::TableReaderCaller, rocksdb::Arena*, bool, int, unsigned long, rocksdb::InternalKey const*, rocksdb::InternalKey const*, bool) /rocksdb/db/table_cache.cc:235:9
    https://github.com/facebook/rocksdb/issues/15 0x5583e28d14cf in rocksdb::BuildTable(std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const&, rocksdb::VersionSet*, rocksdb::ImmutableDBOptions const&, rocksdb::TableBuilderOptions const&, rocksdb::FileOptions const&, rocksdb::TableCache*, rocksdb::InternalIteratorBase<rocksdb::Slice>*, std::__2::vector<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>, std::__2::allocator<std::__2::unique_ptr<rocksdb::FragmentedRangeTombstoneIterator, std::__2::default_delete<rocksdb::FragmentedRangeTombstoneIterator>>>>, rocksdb::FileMetaData*, std::__2::vector<rocksdb::BlobFileAddition, std::__2::allocator<rocksdb::BlobFileAddition>>*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>, unsigned long, unsigned long, rocksdb::SnapshotChecker*, bool, rocksdb::InternalStats*, rocksdb::IOStatus*, std::__2::shared_ptr<rocksdb::IOTracer> const&, rocksdb::BlobFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, rocksdb::Env::WriteLifeTimeHint, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>> const*, rocksdb::BlobFileCompletionCallback*, unsigned long*, unsigned long*, unsigned long*) /rocksdb/db/builder.cc:335:57
    https://github.com/facebook/rocksdb/issues/16 0x5583e29bf29d in rocksdb::FlushJob::WriteLevel0Table() /rocksdb/db/flush_job.cc:919:11
    https://github.com/facebook/rocksdb/issues/17 0x5583e29b33ac in rocksdb::FlushJob::Run(rocksdb::LogsWithPrepTracker*, rocksdb::FileMetaData*, bool*) /rocksdb/db/flush_job.cc:276:9
    https://github.com/facebook/rocksdb/issues/18 0x5583e27a4781 in rocksdb::DBImpl::FlushMemTableToOutputFile(rocksdb::ColumnFamilyData*, rocksdb::MutableCFOptions const&, bool*, rocksdb::JobContext*, rocksdb::SuperVersionContext*, std::__2::vector<unsigned long, std::__2::allocator<unsigned long>>&, unsigned long, rocksdb::SnapshotChecker*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:258:19
    https://github.com/facebook/rocksdb/issues/19 0x5583e27a7a96 in rocksdb::DBImpl::FlushMemTablesToOutputFiles(rocksdb::autovector<rocksdb::DBImpl::BGFlushArg, 8ul> const&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:377:14
    https://github.com/facebook/rocksdb/issues/20 0x5583e27d6777 in rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2778:14
    https://github.com/facebook/rocksdb/issues/21 0x5583e27d14e2 in rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) /rocksdb/db/db_impl/db_impl_compaction_flush.cc:2817:16
    https://github.com/facebook/rocksdb/issues/22 0x5583e323d353 in std::__2::__function::__policy_func<void ()>::operator()[abi:ne180100]() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:714:12
    https://github.com/facebook/rocksdb/issues/23 0x5583e323d353 in std::__2::function<void ()>::operator()() const /root/build/3rdParty/llvm/runtimes/include/c++/v1/__functional/function.h:981:10
    https://github.com/facebook/rocksdb/issues/24 0x5583e323d353 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) /rocksdb/util/threadpool_imp.cc:266:5
    https://github.com/facebook/rocksdb/issues/25 0x5583e3243d18 in decltype(std::declval<void (*)(void*)>()(std::declval<rocksdb::BGThreadMetadata*>())) std::__2::__invoke[abi:ne180100]<void (*)(void*), rocksdb::BGThreadMetadata*>(void (*&&)(void*), rocksdb::BGThreadMetadata*&&) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__type_traits/invoke.h:344:25
    https://github.com/facebook/rocksdb/issues/26 0x5583e3243d18 in void std::__2::__thread_execute[abi:ne180100]<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*, 2ul>(std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>&, std::__2::__tuple_indices<2ul>) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:193:3
    https://github.com/facebook/rocksdb/issues/27 0x5583e3243d18 in void* std::__2::__thread_proxy[abi:ne180100]<std::__2::tuple<std::__2::unique_ptr<std::__2::__thread_struct, std::__2::default_delete<std::__2::__thread_struct>>, void (*)(void*), rocksdb::BGThreadMetadata*>>(void*) /root/build/3rdParty/llvm/runtimes/include/c++/v1/__thread/thread.h:202:3
    https://github.com/facebook/rocksdb/issues/28 0x5583da5e819e in asan_thread_start(void*) crtstuff.c

HINT: if you don't care about these errors you may set ASAN_OPTIONS=detect_container_overflow=0.
If you suspect a false positive see also: https://github.com/google/sanitizers/wiki/AddressSanitizerContainerOverflow.
 AddressSanitizer:container-overflow in pread
Shadow bytes around the buggy address:
  0x506000681f80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x506000682000: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x506000682080: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x506000682100: fa fa fa fa fa fa fa fa fa fa fa fa 00 00 00 00
  0x506000682180: 00 00 00 fa fa fa fa fa fa fa fa fa fa fa fa fa
=>0x506000682200: fa fa fa fa[01]fc fc fc fc fc fc fa fa fa fa fa
  0x506000682280: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x506000682300: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 01
  0x506000682380: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa
  0x506000682400: fd fd fd fd fd fd fd fa fa fa fa fa fd fd fd fd
  0x506000682480: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12722

Reviewed By: hx235

Differential Revision: D58118264

Pulled By: ajkr

fbshipit-source-id: 0dd914c886c022d82697b769d664ba52de0770de
2024-06-04 09:41:53 -07:00
Andrii Lysenko e4428b7eb9 More details for 'tail prefetch size is calculated based on' (#12667)
Summary:
These messages indicate that SST file was created by a pre-9.0.0 RocksDB. Eventually, `TailPrefetchStats` might be removed, so it would be more informative if log message also included name of the affected SST file.

Issue: https://github.com/facebook/rocksdb/issues/12664

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12667

Reviewed By: ajkr

Differential Revision: D57464025

Pulled By: hx235

fbshipit-source-id: 12f2f2635e3092f8c29362aa132462492b5c1417
2024-06-03 11:37:35 -07:00
anand76 0ae3d9f98d Fix stale memory access with FSBuffer and tiered sec cache (#12712)
Summary:
A `BlockBasedTable` with `TieredSecondaryCache` containing a NVM cache inserts blocks  into the compressed cache and the corresponding compressed block into the NVM cache.  The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e returning a FS allocated buffer rather than caller provided), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than local variable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712

Test Plan:
1. Add a unit test
2. Release validation stress test

Reviewed By: jaykorean

Differential Revision: D57974026

Pulled By: anand1976

fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd
2024-05-30 12:33:58 -07:00
Richard Barnes 1827f3f983 Remove extra semi colon from internal_repo_rocksdb/repo/table/sst_file_reader.cc
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D57632757

fbshipit-source-id: 1dbad2a2e185381e225df8b9027033e06aeaf01b
2024-05-22 07:14:52 -07:00
Andrew Kryczka 4eaf628120 Add `Iterator` property "rocksdb.iterator.is-value-pinned" (#12659)
Summary:
`ReadOptions::pin_data` already has the effect of pinning the `Slice` returned by `Iterator::value()` when the value is stored inline (e.g., `kTypeValue`). This PR adds a bit of visibility into that via a new `Iterator` property, "rocksdb.iterator.is-value-pinned", as well as some documentation and tests.

See also: https://github.com/facebook/rocksdb/issues/12658

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12659

Reviewed By: cbi42

Differential Revision: D57391200

Pulled By: ajkr

fbshipit-source-id: 0caa8db27ca1aba86ee2addc3dfd6f0e003d32e2
2024-05-15 19:11:52 -07:00
Yu Zhang c110091d36 Support read timestamp in ldb (#12641)
Summary:
As titled. Also updated sst_dump to print out user-defined timestamp separately and in human readable format.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12641

Test Plan:
manually tested:
Example success run:
./ldb --db=$TEST_DB multi_get_entity 0x00000000000000950000000000000120 --key_hex --column_family=7  --read_timestamp=1115613683797942 --value_hex
0x00000000000000950000000000000120 ==> :0x0E0000000A0B080906070405020300011E1F1C1D1A1B181916171415121310112E2F2C2D2A2B282926272425222320213E3F3C3D3A3B383936373435323330314E4F4C4D4A4B484946474445424340415E5F5C5D5A5B58595657545552535051
Example failed run:
Failed: GetEntity failed: Invalid argument: column family enables user-defined timestamp while --read_timestamp is not a valid uint64 value.

sst_dump print out:
'000000000000015D000000000000012B000000000000013B|timestamp:1113554493256671' seq:2330405, type:1 => 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E

Reviewed By: ltamasi

Differential Revision: D57297006

Pulled By: jowlyzhang

fbshipit-source-id: 8486d91468e4f6c0d42dca3c9629f1c45a92bf5a
2024-05-13 15:43:12 -07:00
Andrew Kryczka 7bf6d4c9d5 Lazily construct `BlockBasedTableIterator::block_handles_` (#12616)
Summary:
Our external benchmark attributed a CPU regression to https://github.com/facebook/rocksdb/issues/11860. Based on the CPU profile the new overhead is from `std::deque`. The deque is always empty for these scans so we do not need to construct it. This PR lazily constructs it only when it is needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12616

Test Plan:
- Command: `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,seekrandom[-X10] -compression_type=none -disable_auto_compactions=true -write_buffer_size=524288 -value_size=1024 -num=10000 -reads=100000`
- Results
  - Before this PR: `seekrandom [AVG    10 runs] : 47811 (± 431) ops/sec`
  - After this PR: `seekrandom [AVG 10 runs] : 51013 (± 632) ops/sec`

Reviewed By: jaykorean

Differential Revision: D56954136

Pulled By: ajkr

fbshipit-source-id: b4d34c9b6c6c2e83d4fff06deacb9f0df2ad042f
2024-05-03 17:18:13 -07:00
Hui Xiao abd6751aba Fix wrong padded bytes being used to generate file checksum (#12598)
Summary:
**Context/Summary:**

https://github.com/facebook/rocksdb/pull/12542 introduced a bug where wrong padded bytes used to generate file checksum if flush happens during padding. This PR fixed it along with an existing same bug for `perform_data_verification_=true`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598

Test Plan:
- New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption:  ...file checksum mismatch`) and passes after
- Benchmark
```
TEST_TMPDIR=/dev/shm  ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG    300 runs] : 421334 (± 4126) ops/sec;   46.6 (± 0.5) MB/sec
Post-PR: (no regression observed but a slight improvement)
fillseq [AVG    300 runs] : 425768 (± 4309) ops/sec;   47.1 (± 0.5) MB/sec

Reviewed By: ajkr, anand1976

Differential Revision: D56725688

Pulled By: hx235

fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5
2024-04-30 15:38:53 -07:00
Peter Dillinger 45c105104b Set optimize_filters_for_memory by default (#12377)
Summary:
This feature has been around for a couple of years and users haven't reported any problems with it.

Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377

Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in.

Reviewed By: jowlyzhang

Differential Revision: D54129517

Pulled By: pdillinger

fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
2024-04-30 08:33:31 -07:00
Andrew Kryczka 2ec25a3e54 Prevent data block compression with `BlockBasedTableOptions::block_align` (#12592)
Summary:
Made `BlockBasedTableOptions::block_align` incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`. Previously it was only incompatible with `ColumnFamilyOptions::compression`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12592

Reviewed By: hx235

Differential Revision: D56650862

Pulled By: ajkr

fbshipit-source-id: f5201602c2ce436e6d8d30893caa6a161a61f141
2024-04-26 20:05:30 -07:00
Hui Xiao 7d83b4e3e5 Fix file checksum mismatch due to padded bytes when block_align=true (#12542)
Summary:
**Context/Summary:**
When `BlockBasedTableOptions::block_align=true`, we pad bytes to align blocks d41e568b1c/table/block_based/block_based_table_builder.cc (L1415-L1421).
Those bytes are not included in generating the file checksum upon file creation. But `VerifyFileChecksums()` includes those bytes in generating the file check to compare against the checksum generating upon file creation. Therefore a file checksum mismatch is returned in `VerifyFileChecksums()`.

We decided to include those padded bytes in generating the checksum upon file creation.

Bonus: also fix surrounding code to use actual padded bytes for verification - see https://github.com/facebook/rocksdb/pull/12542#discussion_r1571429163

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12542

Test Plan:
- New UT
- Benchmark
```
TEST_TMPDIR=/dev/shm  ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG    300 runs] : 422857 (± 3942) ops/sec;   46.8 (± 0.4) MB/sec
Post-PR:
fillseq [AVG    300 runs] : 424707 (± 3799) ops/sec;   47.0 (± 0.4) MB/sec

Reviewed By: ajkr

Differential Revision: D56168447

Pulled By: hx235

fbshipit-source-id: 96209ef950d42943d336f11968ae3fcf9872fc2c
2024-04-22 14:07:34 -07:00
Yu Zhang 74d419be4d Add support in SstFileReader to get a raw table iterator (#12385)
Summary:
This PR adds support to programmatically iterate a raw table file with an iterator returned by `SstFileReader::NewTableIterator`. For third party tools to use to observe SST files created by RocksDB.

The original feature request was from this merge request: https://github.com/facebook/rocksdb/pull/12370

Since keys returned by raw table iterators are internal keys, this PR also adds a struct `ParsedEntryInfo` and util method `ParseEntry` to support user to parse internal key. `GetInternalKeyForSeek`, and `GetInternalKeyForSeekForPrev` to support users to create internal keys for seek operations with this raw table iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12385

Test Plan: Added unit tests

Reviewed By: cbi42

Differential Revision: D55662855

Pulled By: jowlyzhang

fbshipit-source-id: 0716a173ee95924fbd4e1f9b6cccf06525c40049
2024-04-02 21:23:06 -07:00
anand76 4e226c97b8 Don't swallow errors in BlockBasedTable::MultiGet (#12486)
Summary:
Errors were being swallowed in `BlockBasedTable::MultiGet` under some circumstances, such as error when parsing the internal key from the block, or IO error when reading the blob value. We need to set the status for the key to the observed error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12486

Test Plan: Run db_stress and verify the expected error failure before, and no failures after the change.

Reviewed By: jaykorean, ajkr

Differential Revision: D55483940

Pulled By: anand1976

fbshipit-source-id: 493e44db507d5db45e8d1ef2e67808d2c4046318
2024-03-28 13:56:28 -07:00
Peter Dillinger b515a5db3f Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470)
Summary:
ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom implemented pointer wrapper when std::unique_ptr can be instantiated with what we want.

So this adds ScopedArenaPtr<T> to replace those uses.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470

Test Plan: CI (including ASAN/UBSAN)

Reviewed By: jowlyzhang

Differential Revision: D55254362

Pulled By: pdillinger

fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18
2024-03-22 13:40:42 -07:00
anand76 3b736c4aa3 Fix heap use after free error on retry after checksum mismatch (#12464)
Summary:
Fix the heap use after free bug caused by freeing the file system IO buffer in `BlockFetcher::ReadBlock()` instead of the caller.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12464

Test Plan: Update the `DBIOCorruptionTest` tests

Reviewed By: akankshamahajan15

Differential Revision: D55206920

Pulled By: anand1976

fbshipit-source-id: fd6b608a61cd229b20c1e5f348ff3cc92328de0f
2024-03-21 16:19:09 -07:00