Commit Graph

1540 Commits

Author SHA1 Message Date
Yanqin Jin c93ba7db5d Revise LockWAL/UnlockWAL implementation (#11020)
Summary:
RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and
releases the internal `DBImpl::log_write_mutex_`.

According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288
> Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_.

This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application
should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing
the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()`
is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`.

There are two issues:
1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after
calling `LockWAL()` because log_write_mutex is not recursive.
2. Two threads may dead lock due to lock order inversion.

To fix these issues, we can modify the implementation of LockWAL so that it does not keep
`log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can
instead manually inject a write stall so that all future writes will be stopped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D41785203

Pulled By: riversand963

fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce
2022-12-13 21:45:00 -08:00
Hui Xiao 98d5db5c2e Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
-  File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
    - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`

Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number`  ordering check. This will result in file ordering s1 <  s2 <  s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.

**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file  is assigned with the minimum `epoch_number` among input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or  by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922

Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run 36a5686ec0 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 13:29:37 -08:00
Peter Dillinger 9b34c097a1 Fix bug updating latest backup on delete (#11029)
Summary:
Previously, the "latest" valid backup would not be updated on delete.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11029

Test Plan: unit test included (added to an existing test for efficiency)

Reviewed By: hx235

Differential Revision: D41967925

Pulled By: pdillinger

fbshipit-source-id: ca143354d281eb979557ea421902cd26803a1137
2022-12-13 09:42:34 -08:00
Hui Xiao 5f52dbc505 Move two history entries mistake out of 7.9 section (#11013)
Summary:
https://github.com/facebook/rocksdb/pull/10892 and https://github.com/facebook/rocksdb/pull/10955 mistakenly added two entries under sealed 7.9.history section. This PR fixes these two.

No need to update 7.9 branch (https://github.com/facebook/rocksdb/blob/7.9.fb/HISTORY.md) cuz it's cut before these two PRs landed

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11013

Reviewed By: cbi42

Differential Revision: D41666514

Pulled By: hx235

fbshipit-source-id: c4bc7a29ff663664bf0be1ba1c7eab4d00a61528
2022-12-06 18:33:35 -08:00
anand76 8ffabdc226 Fix table cache leak in MultiGet with async_io (#10997)
Summary:
When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997

Test Plan:
1. New unit test in db_basic_test
2. asan_crash

Reviewed By: pdillinger

Differential Revision: D41587244

Pulled By: anand1976

fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64
2022-12-04 22:58:25 -08:00
Hui Xiao 2f76ab150d Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892)
Summary:
**Context**
`Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown.

`DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest.
And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open.
- Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`.

As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown.

**Summary**
This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest.

**Test**
- Make check
- Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after
- [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears
- [Ongoing]Regular CI stress test to confirm such fix didn't harm anything

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892

Reviewed By: ajkr

Differential Revision: D40778965

Pulled By: hx235

fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f
2022-11-29 14:14:43 -08:00
Hui Xiao f1574a20ff Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999)
Summary:
**Context/Summary:**

This reverts commit fc74abb436 and related HISTORY record.

The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF.  This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004

Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D41572604

Pulled By: hx235

fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
2022-11-29 10:56:42 -08:00
Hui Xiao d8c043f7ad Trigger FIFO file deletion in non L0 only if exceeding max_table_files_size (#10955)
Summary:
**Context**

https://github.com/facebook/rocksdb/pull/10348 allows multi-level FIFO but accidentally made change to the logic of deleting files in `FIFOCompactionPicker::PickSizeCompaction`. With [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR156) and [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR235) together, it deletes one file in non-L0 even when `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`, which is incorrect.

As a consequence, FIFO exercises more file deletion in our crash testing, which is not able to verify correctly on deleted keys in the file deleted by compaction. This results in errors  `error : inconsistent values for key 000000000000239F000000000000012B000000000000028B: expected state has the key, Get() returns NotFound.
Verification failed :(` or `Expected state has key 00000000000023A90000000000000003787878, iterator is at key 00000000000023A9000000000000004178
Column family: default, op_logs: S 00000000000023A90000000000000003787878`

**Summary**:
- Delete file for non-L0 only if `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`
- Add some helpful log to LOG file

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10955

Test Plan:
- Errors repro-ed by
```
./db_stress --preserve_unverified_changes=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=10 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=1000000 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=0 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=65 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=20
```
is gone after this fix
- CI

Reviewed By: ajkr

Differential Revision: D41319441

Pulled By: hx235

fbshipit-source-id: 6939753767007f7449ea7055b1420aabd03d7709
2022-11-28 15:45:03 -08:00
Changyu Bi 534fb06dd3 Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966)
Summary:
Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966

Test Plan:
- added unit test
- stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100`
- ran different stress tests over sandcastle
- Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones.

Reviewed By: ajkr

Differential Revision: D41414172

Pulled By: cbi42

fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981
2022-11-23 14:27:14 -08:00
Yanqin Jin 3d0d6b8140 Make best-efforts recovery verify SST unique ID before Version construction (#10962)
Summary:
The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true).

With best_efforts_recovery being true, RocksDB will recover to the latest point in
MANIFEST such that all valid SST files included up to this point pass unique ID checks as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D41378241

Pulled By: riversand963

fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c
2022-11-22 22:53:31 -08:00
anand76 f4cfcfe824 Post 7.9.0 release branch cut updates (#10974)
Summary:
Update HISTORY.md, version.h, and check_format_compatible.sh

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10974

Reviewed By: akankshamahajan15

Differential Revision: D41455289

Pulled By: anand1976

fbshipit-source-id: 99888ebcb9109e5ced80584a66b20123f8783c0b
2022-11-21 19:24:42 -08:00
anand76 3ff6da6bd5 Update HISTORY.md for 7.9.0 (#10973)
Summary:
Update HISTORY.md for 7.9.0 release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10973

Reviewed By: pdillinger

Differential Revision: D41453720

Pulled By: anand1976

fbshipit-source-id: 47a23d4b6539ec6a9a09c9e69c026f7c8b10afa7
2022-11-21 17:00:01 -08:00
Peter Dillinger e079d562af Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945)
Summary:
Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945

Test Plan: existing tests for CacheDumper, added basic unit test

Reviewed By: anand1976

Differential Revision: D41231497

Pulled By: pdillinger

fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068
2022-11-21 16:17:36 -08:00
Andrew Kryczka 097f9f4425 Fix CompactionIterator flag for penultimate level output (#10967)
Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967

Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
  - Before this fix, Y was always zero
  - After this fix, Y gradually increased throughout the benchmark

Reviewed By: riversand963

Differential Revision: D41417726

Pulled By: ajkr

fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
2022-11-21 16:14:03 -08:00
Peter Dillinger 3182beeffc Observe and warn about misconfigured HyperClockCache (#10965)
Summary:
Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused.

Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965

Test Plan:
unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.

Also manual testing with db_bench. Create db with
```
./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
```
Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
```
./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
```
Seeing warnings / errors or not as expected.

Reviewed By: anand1976

Differential Revision: D41406932

Pulled By: pdillinger

fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7
2022-11-21 12:08:21 -08:00
Peter Dillinger 8c0f5b1fcf Mark HyperClockCache as production-ready (#10963)
Summary:
After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963

Test Plan: existing tests, comment changes only

Reviewed By: siying

Differential Revision: D41384083

Pulled By: pdillinger

fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d
2022-11-17 14:44:59 -08:00
Levi Tamasi 8fa8780932 Mention wide-column support in HISTORY.md (#10959)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10959

Reviewed By: akankshamahajan15

Differential Revision: D41348198

Pulled By: ltamasi

fbshipit-source-id: 51e89d03c1fe87f576a766f609a7f233a519c83d
2022-11-16 12:22:35 -08:00
anand76 ecba6a320e Add some async read stats (#10947)
Summary:
Add stats for time spent in the ReadAsync call, and async read errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947

Test Plan: Run db_bench and look at stats

Reviewed By: akankshamahajan15

Differential Revision: D41236637

Pulled By: anand1976

fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf
2022-11-13 21:38:35 -08:00
Peter Dillinger f321e8fc98 Don't attempt to use SecondaryCache on block_cache_compressed (#10944)
Summary:
Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!

This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.

Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.

This problem was discovered during refactoring but didn't want to combine bug fix with that refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944

Test Plan: test added that fails on base revision (at least with ASAN)

Reviewed By: akankshamahajan15

Differential Revision: D41205578

Pulled By: pdillinger

fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
2022-11-11 17:35:53 -08:00
akankshamahajan d1aca4a5ae Fix async_io regression in scans (#10939)
Summary:
Fix async_io regression in scans due to incorrect check which was causing the valid data in buffer to be cleared during seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10939

Test Plan:
- stress tests  export CRASH_TEST_EXT_ARGS="--async_io=1"
    make crash_test -j32
- Ran db_bench command which was caught the regression:
./db_bench --db=/rocksdb_async_io_testing/prefix_scan --disable_wal=1 --use_existing_db=true --benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=963 -duration=30 -ops_between_duration_checks=1 --async_io=true --compaction_readahead_size=4194304 --log_readahead_size=0 --blob_compaction_readahead_size=0 --initial_auto_readahead_size=65536 --num_file_reads_for_auto_readahead=0 --max_auto_readahead_size=524288

seekrandom   :    3777.415 micros/op 264 ops/sec 30.000 seconds 7942 operations;  132.3 MB/s (7942 of 7942 found)

Reviewed By: anand1976

Differential Revision: D41173899

Pulled By: akankshamahajan15

fbshipit-source-id: 2d75b06457d65b1851c92382565d9c3fac329dfe
2022-11-11 13:34:49 -08:00
Levi Tamasi fbd9077d66 Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925)
Summary:
The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41096453

Pulled By: ltamasi

fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
2022-11-07 15:42:10 -08:00
Andrew Kryczka aa0a11e1b9 Fix flush picking non-consecutive memtables (#10921)
Summary:
Prevents `MemTableList::PickMemtablesToFlush()` from picking non-consecutive memtables. It leads to wrong ordering in L0 if the files are committed, or an error like below if force_consistency_checks=true catches it:

```
Corruption: force_consistency_checks: VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/25 with seqno 320416 368066 vs. file https://github.com/facebook/rocksdb/issues/24 with seqno 336037 352068
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10921

Test Plan: fix the expectation in the existing test of this behavior

Reviewed By: riversand963

Differential Revision: D41046935

Pulled By: ajkr

fbshipit-source-id: 783696bff56115063d5dc5856dfaed6a9881d1ab
2022-11-04 15:55:54 -07:00
akankshamahajan ff9ad2c39b Fix async_io failures in case there is error in reading data (#10890)
Summary:
Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if data is overlapping between two buffers. If there is IOError while reading the data, it leads to empty buffer and other buffer already in progress of async read goes again for reading causing the error.
Fix: Added check to abort IO in second buffer if curr_ got empty.

This PR also fixes db_stress failures which happened when buffers are not aligned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10890

Test Plan:
- Ran make crash_test -j32 with async_io enabled.
-  Ran benchmarks to make sure there is no regression.

Reviewed By: anand1976

Differential Revision: D40881731

Pulled By: akankshamahajan15

fbshipit-source-id: 39fcf2134c7b1bbb08415ede3e1ef261ac2dbc58
2022-11-01 16:06:51 -07:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Hui Xiao 7f5e438aee Move move wrong history entry out of 7.8 release (#10898)
Summary:
**Context/Summary:**

https://github.com/facebook/rocksdb/pull/10777 mistakenly added a history entry under 7.8 release but the PR is not included in 7.8. This mistake was due to rebase and merge didn't realize it was a conflict when "## Unreleased" was changed to "## 7.8.0 (10/22/2022)".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10898

Test Plan: Make check

Reviewed By: akankshamahajan15

Differential Revision: D40861001

Pulled By: hx235

fbshipit-source-id: b2310c95490f6ebb90834a210c965a74c9560b51
2022-10-31 15:02:29 -07:00
sdong d989300ad1 Avoid repeat periodic stats printing when there is no change (#10891)
Summary:
When there is a column family that doesn't get any traffic, its stats are still dumped when options.options.stats_dump_period_sec triggers. This sometimes spam the information logs. With this change, we skip the printing if there is not change, until 8 periods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10891

Test Plan: Manually test the behavior with hacked db_bench setups.

Reviewed By: jay-zhuang

Differential Revision: D40777183

fbshipit-source-id: ef0b9a793e4f6282df099b464f01d1fb4c5a2cab
2022-10-31 09:51:38 -07:00
Changyu Bi 56715350d9 Reduce heap operations for range tombstone keys in iterator (#10877)
Summary:
Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877

Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmark:
```
# Set up: --writes_per_range_tombstone=1 means one point write and one delete range

TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64

Level Files Size(MB)
--------------------
  0        8      152
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5        0        0
  6        0        0

# Benchmark
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true

# Pre PR
readseq [AVG    5 runs] : 1432116 (± 59664) ops/sec;  224.0 (± 9.3) MB/sec
readseq [MEDIAN 5 runs] : 1454886 ops/sec;  227.5 MB/sec

# Post PR
readseq [AVG    5 runs] : 1944425 (± 29521) ops/sec;  304.1 (± 4.6) MB/sec
readseq [MEDIAN 5 runs] : 1959430 ops/sec;  306.5 MB/sec
```

Reviewed By: ajkr

Differential Revision: D40710936

Pulled By: cbi42

fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
2022-10-27 14:28:50 -07:00
Hui Xiao fc74abb436 Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777)
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress

./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35

put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```

**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
-  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
   - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
   - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
  - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
  - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
    - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930  in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.

Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777

Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761  and on FIFO compaction only

Reviewed By: ajkr

Differential Revision: D40090485

Pulled By: hx235

fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
2022-10-25 10:39:58 -07:00
akankshamahajan daceb85c51 Update version.h, HISTORY.md and add branches to compatibility check (#10846)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10846

Reviewed By: ajkr

Differential Revision: D40617997

Pulled By: akankshamahajan15

fbshipit-source-id: 4b2d6e85dbca7e73b930c4165869b693d3e4e137
2022-10-23 19:42:06 -07:00
akankshamahajan 9a55e5da17 Update HISTORY.md for 7.8 release (#10844)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10844

Reviewed By: ajkr

Differential Revision: D40592956

Pulled By: akankshamahajan15

fbshipit-source-id: 6656f4bc5faa30fa7882bf44155f7931895590e2
2022-10-22 10:09:07 -07:00
Jay Zhuang f726d29a82 Allow penultimate level output for the last level only compaction (#10822)
Summary:
Allow the last level only compaction able to output result to penultimate level if the penultimate level is empty. Which will also block the other compaction output to the penultimate level.
(it includes the PR https://github.com/facebook/rocksdb/issues/10829)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10822

Reviewed By: siying

Differential Revision: D40389180

Pulled By: jay-zhuang

fbshipit-source-id: 4e5dcdce307795b5e07b5dd1fa29dd75bb093bad
2022-10-22 08:57:38 -07:00
Peter Dillinger 27c9705ac4 Use kXXH3 as default checksum (CPU efficiency) (#10778)
Summary:
Since this has been supported for about a year, I think it's time to make it the default. This should improve CPU efficiency slightly on most hardware.

A current DB performance comparison using buck+clang build:
```
TEST_TMPDIR=/dev/shm ./db_bench -checksum_type={1,4} -benchmarks=fillseq[-X1000] -num=3000000 -disable_wal
```
kXXH3 (+0.2% DB write throughput):
`fillseq [AVG    1000 runs] : 822149 (± 1004) ops/sec;   91.0 (± 0.1) MB/sec`
kCRC32c:
`fillseq [AVG    1000 runs] : 820484 (± 1203) ops/sec;   90.8 (± 0.1) MB/sec`

Micro benchmark comparison:
```
./db_bench --benchmarks=xxh3[-X20],crc32c[-X20]
```
Machine 1, buck+clang build:
`xxh3 [AVG    20 runs] : 3358616 (± 19091) ops/sec; 13119.6 (± 74.6) MB/sec`
`crc32c [AVG    20 runs] : 2578725 (± 7742) ops/sec; 10073.1 (± 30.2) MB/sec`

Machine 2, make+gcc build, DEBUG_LEVEL=0 PORTABLE=0:
`xxh3 [AVG    20 runs] : 6182084 (± 137223) ops/sec; 24148.8 (± 536.0) MB/sec`
`crc32c [AVG    20 runs] : 5032465 (± 42454) ops/sec; 19658.1 (± 165.8) MB/sec`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10778

Test Plan: make check, unit tests updated

Reviewed By: ajkr

Differential Revision: D40112510

Pulled By: pdillinger

fbshipit-source-id: e59a8d50a60346137732f8668ba7cfac93be2b37
2022-10-21 18:09:12 -07:00
sdong 5d17297b76 Make UserComparatorWrapper not Customizable (#10837)
Summary:
Right now UserComparatorWrapper is a Customizable object, although it is not, which introduces some intialization overhead for the object. In some benchmarks, it shows up in CPU profiling. Make it not configurable by defining most functions needed by UserComparatorWrapper to an interface and implement the interface.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10837

Test Plan: Make sure existing tests pass

Reviewed By: pdillinger

Differential Revision: D40528511

fbshipit-source-id: 70eaac89ecd55401a26e8ed32abbc413a9617c62
2022-10-21 12:27:50 -07:00
akankshamahajan 0e7b27bfcf Refactor block cache tracing APIs (#10811)
Summary:
Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter.

`DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`.
New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer.

This same philosophy can be applied to KV and IO tracing as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811

Test Plan:
existing unit tests
Old API DB::StartBlockTrace checked with db_bench tool
create database
```
./db_bench --benchmarks="fillseq" \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000
```

To trace block cache accesses when running readrandom benchmark:
```
./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
--threads=16 \
-block_cache_trace_file="/tmp/binary_trace_test_example" \
-block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
-block_cache_trace_sampling_frequency=1

```

Reviewed By: anand1976

Differential Revision: D40435289

Pulled By: akankshamahajan15

fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282
2022-10-21 12:15:35 -07:00
Changyu Bi 333abe9c55 Ignore max_compaction_bytes for compaction input that are within output key-range (#10835)
Summary:
When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compaction should be fine as files from input level are usually 10X smaller than files from output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835

Test Plan:
- CI, a unit test on max_compaction_bytes fails before turning this flag off.
- Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432`
```
main:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   91.59 MB   0.8     70.9     0.0     70.9     200.8    129.9       0.0   1.5     25.2     71.2   2886.55           2463.45      9725    0.297   1093M   254K       0.0       0.0
  L1      9/0   248.03 MB   1.0    392.0   129.8    262.2     391.7    129.5       0.0   3.0     69.0     68.9   5821.71           5536.90       804    7.241   6029M  5814K       0.0       0.0
  L2     87/0    2.50 GB   1.0    537.0   128.5    408.5     533.8    125.2       0.7   4.2     69.5     69.1   7912.24           7323.70      4417    1.791   8299M    36M       0.0       0.0
  L3    836/0   24.99 GB   1.0    616.9   118.3    498.7     594.5     95.8       5.2   5.0     66.9     64.5   9442.38           8490.28      4204    2.246   9749M   306M       0.0       0.0
  L4   2355/0   62.95 GB   0.3     67.3    37.1     30.2      54.2     24.0      38.9   1.5     72.2     58.2    954.37            821.18       917    1.041   1076M   173M       0.0       0.0
 Sum   3290/0   90.77 GB   0.0   1684.2   413.7   1270.5    1775.0    504.5      44.9  13.7     63.8     67.3  27017.25          24635.52     20067    1.346     26G   522M       0.0       0.0

Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds

This PR:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   45.71 MB   0.8     72.9     0.0     72.9     202.8    129.9       0.0   1.6     25.4     70.7   2938.16           2510.36      9741    0.302   1124M   265K       0.0       0.0
  L1      8/0   234.54 MB   0.9    384.5   129.8    254.7     384.2    129.6       0.0   3.0     69.0     68.9   5708.08           5424.43       791    7.216   5913M  5753K       0.0       0.0
  L2     84/0    2.47 GB   1.0    543.1   128.6    414.5     539.9    125.4       0.7   4.2     69.6     69.2   7989.31           7403.13      4418    1.808   8393M    36M       0.0       0.0
  L3    839/0   24.96 GB   1.0    615.6   118.4    497.2     593.2     96.0       5.1   5.0     66.6     64.1   9471.23           8489.31      4193    2.259   9726M   306M       0.0       0.0
  L4   2360/0   63.04 GB   0.3     67.6    37.3     30.3      54.4     24.1      38.9   1.5     71.5     57.6    967.30            827.99       907    1.066   1080M   173M       0.0       0.0
 Sum   3294/0   90.75 GB   0.0   1683.8   414.2   1269.6    1774.5    504.9      44.8  13.7     63.7     67.1  27074.08          24655.22     20050    1.350     26G   522M       0.0       0.0

Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds
```

Reviewed By: ajkr

Differential Revision: D40518319

Pulled By: cbi42

fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60
2022-10-21 10:22:41 -07:00
Andrew Kryczka 33ceea9b76 Add DB property for fast block cache stats collection (#10832)
Summary:
This new property allows users to trigger the background block cache stats collection mode through the `GetProperty()` and `GetMapProperty()` APIs. The background mode has much lower overhead at the expense of returning stale values in more cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10832

Test Plan: updated unit test

Reviewed By: pdillinger

Differential Revision: D40497883

Pulled By: ajkr

fbshipit-source-id: bdcc93402f426463abb2153756aad9e295447343
2022-10-20 15:04:29 -07:00
Yueh-Hsuan Chiang e267909ecf Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348)
Summary:
FIFO compaction can theoretically open a DB with any compaction style.
However, the current code only allows FIFO compaction to open a DB with
a single level.

This PR relaxes the limitation of FIFO compaction and allows it to open a
DB with multiple levels.  Below is the read / write / compaction behavior:

* The read behavior is untouched, and it works like a regular rocksdb instance.
* The write behavior is untouched as well.  When a FIFO compacted DB
is opened with multiple levels, all new files will still be in level 0, and no files
will be moved to a different level.
* Compaction logic is extended.  It will first identify the bottom-most non-empty level.
Then, it will delete the oldest file in that level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348

Test Plan:
Added a new test to verify the migration from level to FIFO where the db has multiple levels.
Extended existing test cases in db_test and db_basic_test to also verify
all entries of a key after reopening the DB with FIFO compaction.

Reviewed By: jay-zhuang

Differential Revision: D40233744

fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
2022-10-18 14:38:13 -07:00
Jay Zhuang 5a5f21c489 Allow the last level data moving up to penultimate level (#10782)
Summary:
Lock the penultimate level for the whole compaction inputs range, so any
key in that compaction is safe to move up from the last level to
penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10782

Reviewed By: siying

Differential Revision: D40231540

Pulled By: siying

fbshipit-source-id: ca115cc8b4018b35d797329fa85a19b06cc8c13e
2022-10-10 22:50:34 -07:00
Peter Dillinger 2d0380adbe Allow manifest fix-up without requiring prior state (#10796)
Summary:
This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown), but ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`) where it would open some SST files during recovery and expect them to match the prior manifest state, even if not required by the intended new state.

Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796

Test Plan: tests added/updated, that fail before the change(s) and now pass

Reviewed By: jay-zhuang

Differential Revision: D40232645

Pulled By: jay-zhuang

fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be
2022-10-10 17:59:17 -07:00
akankshamahajan ebf8c454fd Provide support for async_io with tailing iterators (#10781)
Summary:
Provide support for async_io if ReadOptions.tailing is set true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10781

Test Plan:
- Update unit tests
- Ran db_bench: ./db_bench --benchmarks="readrandom" --use_existing_db --use_tailing_iterator=1 --async_io=1

Reviewed By: anand1976

Differential Revision: D40128882

Pulled By: anand1976

fbshipit-source-id: 55e17855536871a5c47e2de92d238ae005c32d01
2022-10-10 15:48:48 -07:00
Jay Zhuang c401f285c3 Add option `preserve_internal_time_seconds` to preserve the time info (#10747)
Summary:
Add option `preserve_internal_time_seconds` to preserve the internal
time information.
It's mostly for the migration of the existing data to tiered storage (
`preclude_last_level_data_seconds`). When the tiering feature is just
enabled, the existing data won't have the time information to decide if
it's hot or cold. Enabling this feature will start collect and preserve
the time information for the new data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747

Reviewed By: siying

Differential Revision: D39910141

Pulled By: siying

fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138
2022-10-07 18:49:40 -07:00
Peter Dillinger b205c6d029 Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768)
Summary:
We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)

This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768

Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)

Reviewed By: anand1976

Differential Revision: D40034747

Pulled By: anand1976

fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
2022-10-06 14:54:21 -07:00
Yanqin Jin 4d82b94896 Sanitize min_write_buffer_number_to_merge to 1 with atomic_flush (#10773)
Summary:
With current implementation, within the same RocksDB instance, all column families with non-empty memtables will be scheduled for flush if RocksDB determines that any column family needs to be flushed, e.g. memtable full, write buffer manager, etc., if atomic flush is enabled. Not doing so can lead to data loss and inconsistency when WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting a per-column-family knob, min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush, and should be sanitized during column family creation and db open.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773

Test Plan:
Reproduce: D39993203 has detailed steps.
Run the test with and without the fix.

Reviewed By: cbi42

Differential Revision: D40077955

Pulled By: cbi42

fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240
2022-10-05 12:24:39 -07:00
Changyu Bi eca47fb696 Ignore kBottommostFiles compaction logic when allow_ingest_behind (#10767)
Summary:
fix for https://github.com/facebook/rocksdb/issues/10752 where RocksDB could be in an infinite compaction loop (with compaction reason kBottommostFiles)  if allow_ingest_behind is enabled and the bottommost level is unfilled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10767

Test Plan: Added a unit test to reproduce the compaction loop.

Reviewed By: ajkr

Differential Revision: D40031861

Pulled By: ajkr

fbshipit-source-id: 71c4b02931fbe507a847632905404c9b8fa8c96b
2022-10-05 09:27:14 -07:00
Changyu Bi ffde463a5f Cleanup SuperVersion in Iterator::Refresh() (#10770)
Summary:
Fix a bug in Iterator::Refresh() where the local SV it obtained could be obsolete upon return, and should be cleaned up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10770

Test Plan: added a unit test to reproduce the issue.

Reviewed By: ajkr

Differential Revision: D40063809

Pulled By: ajkr

fbshipit-source-id: 619e728eb0f1ac9540b4d0ad38e43acc37a514b2
2022-10-04 22:23:24 -07:00
Yanqin Jin edda219fc3 Manual flush with `wait=false` should not stall when writes stopped (#10001)
Summary:
When `FlushOptions::wait` is set to false, manual flush should not stall forever.

If the database has already stopped writes, then the thread calling `DB::Flush()` with
`FlushOptions::wait=false` should not enter the `DBImpl::write_thread_`.

To prevent this, we should do a check at the beginning and return `TryAgain()`

Resolves: https://github.com/facebook/rocksdb/issues/9892

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10001

Reviewed By: siying

Differential Revision: D36422303

Pulled By: siying

fbshipit-source-id: 723bd3065e8edc4f17c82449d0d6b95a2381ac0a
2022-10-04 16:43:01 -07:00
Jay Zhuang f007ad8b4f RoundRobin TTL compaction (#10725)
Summary:
For RoundRobin compaction, the data should be mostly sorted per level and within level. Use normal compaction picker for RR until all expired data is compacted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10725

Reviewed By: ajkr

Differential Revision: D39771069

Pulled By: jay-zhuang

fbshipit-source-id: 7ccf88d7c093fad5673bda73a7b08cc4757780cd
2022-10-04 14:53:32 -07:00
akankshamahajan ae0f9c3339 Add new property in IOOptions to skip recursing through directories and list only files during GetChildren. (#10668)
Summary:
Add new property "do_not_recurse" in  IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files.
By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open.

Provided support in PosixFileSystem to use "do_not_recurse".

TestPlan:
- Existing tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668

Reviewed By: anand1976

Differential Revision: D39471683

Pulled By: akankshamahajan15

fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f
2022-10-03 10:59:45 -07:00
Changyu Bi 9f2363f4c4 User-defined timestamp support for `DeleteRange()` (#10661)
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661

Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`.  Did not see consistent regression in no timestamp case.

| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|

Reviewed By: riversand963

Differential Revision: D39441192

Pulled By: cbi42

fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
2022-09-30 16:13:03 -07:00
Jay Zhuang f3cc66632b Align compaction output file boundaries to the next level ones (#10655)
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.

In level compaction, there are "wasted" data at the beginning and end of the
output level files. Align the file boundary can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to its
next level ones. It may cut file when the file size is large enough (at least
50% of target_file_size) and not too large (2x target_file_size).

db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432

# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds

# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```

The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR, the SST file size can exceed the
target_file_size, but is capped at 2x target_file_size. And there will be
smaller files. Here are file size statistics when loading 100GB with the target
file size 32MB:
```
          baseline      this_PR
count  1.656000e+03  1.705000e+03
mean   3.116062e+07  3.028076e+07
std    7.145242e+06  8.046139e+06
```

The feature is enabled by default, to revert to the old behavior disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`

Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for
use case like user adding 2 or more non-overlapping data range at the same
time, it can reduce the overlapping of 2 datasets in the lower levels.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655

Reviewed By: cbi42

Differential Revision: D39552321

Pulled By: jay-zhuang

fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
2022-09-29 19:43:55 -07:00
Changyu Bi df492791b6 Fix segfault in Iterator::Refresh() (#10739)
Summary:
when a new internal iterator is constructed during iterator refresh, pointer to the previous memtable range tombstone iterator was not cleared. This could cause segfault for future `Refresh()` calls when they try to free the memtable range tombstones. This PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10739

Test Plan: added a unit test in db_range_del_test.cc to reproduce this issue.

Reviewed By: ajkr, riversand963

Differential Revision: D39825283

Pulled By: cbi42

fbshipit-source-id: 3b59a2b73865aed39e28cdd5c1b57eed7991b94c
2022-09-26 18:57:23 -07:00
Yanqin Jin 52f2411722 Update HISTORY to mention PR #10724 (#10737)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10737

Reviewed By: cbi42

Differential Revision: D39825386

Pulled By: riversand963

fbshipit-source-id: a3c55f2777e034d6ae6ff44ef0219d9fbbf1cc96
2022-09-26 15:59:30 -07:00
Changyu Bi 93f46da1fa Mention in HISTORY.md the fix in #10705 (#10720)
Summary:
Mention in HISTORY.md the fix in https://github.com/facebook/rocksdb/issues/10705.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10720

Reviewed By: riversand963

Differential Revision: D39709455

Pulled By: cbi42

fbshipit-source-id: 1a2c9dd8425c73f7095ddb1d0b1cca8ed35b7ef2
2022-09-21 15:08:43 -07:00
Akanksha Mahajan 0f4978d34f Fix sqe->addr passed in cancel request in io_uring (#10644)
Summary:
Update io_uring_prep_cancel as it is now backward compatible.
Also, io_uring_prep_cancel expects sqe->addr to match with read
request submitted. It's being set wrong which is now fixed in this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10644

Test Plan:
- Ran internally with lastest liburing package and on RocksDB
github repo with older version.
- Ran seekrandom regression to confirm there is no regression.

Reviewed By: anand1976

Differential Revision: D39284229

Pulled By: akankshamahajan15

fbshipit-source-id: fd52cdf23d676da114896163626b75c8ae09c980
2022-09-21 14:21:59 -07:00
Yanqin Jin 2b6f3510c2 Update version number and HISTORY in main branch (#10694)
Summary:
This PR bumps up version number from 7.7 to 7.8 in main branch, indicating that next release will be 7.8. We are going to release 7.7 soon. Since 7.7.fb branch has been created, we can land this to main.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10694

Reviewed By: gitbw95

Differential Revision: D39581577

Pulled By: riversand963

fbshipit-source-id: 84f3fecf25fd9ac96e46b4cd6d50ddb6edc89427
2022-09-19 12:25:44 -07:00
gitbw95 2cc5b39560 Add enable_split_merge option for CompressedSecondaryCache (#10690)
Summary:
`enable_custom_split_merge` is added for enabling the custom split and merge feature, which split the compressed value into chunks so that they may better fit jemalloc bins.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690

Test Plan:
Unit Tests
Stress Tests

Reviewed By: anand1976

Differential Revision: D39567604

Pulled By: gitbw95

fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7
2022-09-16 15:41:49 -07:00
Peter Dillinger 0f91c72adc Call experimental new clock cache HyperClockCache (#10684)
Summary:
This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
* We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
* We want to highlight the key feature: it's fast (especially under parallel load)
* Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
* We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)

Some API detail:
* To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
* Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.

For performance tests see https://github.com/facebook/rocksdb/pull/10626

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684

Test Plan: no interesting functional changes; tests updated

Reviewed By: anand1976

Differential Revision: D39547800

Pulled By: pdillinger

fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
2022-09-16 12:47:29 -07:00
anand76 37b75e1364 Fix some MultiGet stats (#10673)
Summary:
The stats were not accurate for the coroutine version of MultiGet. This PR fixes it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10673

Reviewed By: akankshamahajan15

Differential Revision: D39492615

Pulled By: anand1976

fbshipit-source-id: b46c04e15ea27e66f4c31f00c66497aa283bf9d3
2022-09-15 22:48:06 -07:00
anand76 7b11d48444 Change MultiGet multi-level optimization to default on (#10671)
Summary:
Change the ```ReadOptions.optimize_multiget_for_io``` flag to default on. It doesn't impact regular MultiGet users as its only applicable when ```ReadOptions.async_io``` is also set to true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10671

Reviewed By: akankshamahajan15

Differential Revision: D39477439

Pulled By: anand1976

fbshipit-source-id: 47abcdbfa69f9bc60422ab68a238b232e085d4ba
2022-09-14 08:51:16 -07:00
Changyu Bi f291eefb02 Cache fragmented range tombstone list for mutable memtables (#10547)
Summary:
Each read from memtable used to read and fragment all the range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved the inefficient here by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone can be constructed in either read (This PR) or write path (https://github.com/facebook/rocksdb/issues/10584). With both implementation, each `DeleteRange()` will invalidate the cache, and the difference is where the cache is re-constructed.`CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core will have a shared_ptr to a shared_ptr pointing to the current cache. Each read thread will only update the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables.

The choice between write path and read path is not an easy one: they are both improvement compared to no caching in the current implementation, but they favor different operations and could cause regression in the other operation (read vs write). The write path caching in (https://github.com/facebook/rocksdb/issues/10584) leads to a cleaner implementation, but I chose the read path caching here to avoid significant regression in write performance when there is a considerable amount of range tombstones in a single memtable (the number from the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, it could block other writes from proceeding, and hence affects overall write performance.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547

Test Plan:
- TestGet() in stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
```
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
```
Write perf regressed since the cost of constructing fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see performance impact on multi-thread reads if `CoreLocalArray` is not used.

micros/op averaged over 5 runs: first 4 columns are for fillrandom, last 4 columns are for readrandom.
|   |fillrandom main           | write path caching          | read path caching          |memtable V3 (https://github.com/facebook/rocksdb/issues/10308)     | readrandom main            | write path caching           | read path caching            |memtable V3      |
|---   |---  |---   |---   |---   | ---   |           ---   |  ---   |  ---   |
| 0                    |6.35                           |6.15                           |5.82                           |6.12                           |2.24                           |2.26                           |2.03                           |2.07                           |
| 1                    |5.99                           |5.88                           |5.77                           |6.28                           |2.65                           |2.27                           |2.24                           |2.5                            |
| 10                   |6.15                           |6.02                           |5.92                           |5.95                           |5.15                           |2.61                           |2.31                           |2.53                           |
| 100                  |5.95                           |5.78                           |5.88                           |6.23                           |28.31                          |2.34                           |2.45                           |2.94                           |
| 100 25 threads       |52.01                          |45.85                          |46.18                          |47.52                          |35.97                          |3.34                           |3.34                           |3.56                           |
| 1000                 |6.0                            |7.07                           |5.98                           |6.08                           |333.18                         |2.86                           |2.7                            |3.6                            |
| 1000 25 threads      |52.6                           |148.86                         |79.06                          |45.52                          |473.49                         |3.66                           |3.48                           |4.38                           |

  - Benchmark performance of`readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes`

readrandom micros/op:
|  |main            |write path caching           |read path caching            |memtable V3      |
|---|---|---|---|---|
| single thread        |48.28                          |1.55                           |1.52                           |1.96                           |
| 25 threads           |64.3                           |2.55                           |2.67                           |2.64                           |

Reviewed By: ajkr

Differential Revision: D38895410

Pulled By: cbi42

fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
2022-09-13 20:07:28 -07:00
Akanksha Mahajan 03fc43976d Async optimization in scan path (#10602)
Summary:
Optimizations
1. In FilePrefetchBuffer, when data is overlapping between two buffers, it copies the data from first to third buffer, then from
second to third buffer to return continuous buffer. This optimization will call ReadAsync on first buffer as soon as buffer is empty instead of getting blocked by second buffer to copy the data.
2. For fixed size readahead_size, FilePrefetchBuffer will issues two async read calls. One with length + readahead_size_/2 on first buffer(if buffer is empty) and readahead_size_/2 on second buffer during seek.

- Add readahead_size to db_stress for stress testing these changes in https://github.com/facebook/rocksdb/pull/10632

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10602

Test Plan:
- CircleCI tests
- stress_test completed successfully
export CRASH_TEST_EXT_ARGS="--async_io=1"
make crash_test -j32
- db_bench showed no regression
   With this PR:
```
 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
Set seed to 1661876074584472 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:14:34 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  270878.018 micros/op 3 ops/sec 30.068 seconds 111 operations;  618.7 MB/s (111 of 111 found)

 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=true -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
Set seed to 1661875332862922 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:02:12 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  358352.488 micros/op 2 ops/sec 30.102 seconds 84 operations;  474.4 MB/s (84 of 84 found)
```

Without PR in main:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
Set seed to 1661876425983045 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:20:26 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  280881.953 micros/op 3 ops/sec 30.054 seconds 107 operations;  605.2 MB/s (107 of 107 found)

 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=0
Set seed to 1661876475267771 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:21:15 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  363155.084 micros/op 2 ops/sec 30.142 seconds 83 operations;  468.1 MB/s (83 of 83 found)
```

Reviewed By: anand1976

Differential Revision: D39141328

Pulled By: akankshamahajan15

fbshipit-source-id: 560655922c1a437a8569c228abb31b8c0b413120
2022-09-12 17:42:01 -07:00
ltamasi dc7d155438 Mention some recent blob caching related changes in HISTORY.md (#10653)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10653

Reviewed By: riversand963

Differential Revision: D39368165

Pulled By: ltamasi

fbshipit-source-id: 06cfd3c87ca90b9d07c082d5e307c0dc6a16840c
2022-09-09 09:56:10 -07:00
gitbw95 0148c4934d Add PerfContext counters for CompressedSecondaryCache (#10650)
Summary:
Add PerfContext counters for CompressedSecondaryCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10650

Test Plan: Unit Tests.

Reviewed By: anand1976

Differential Revision: D39354712

Pulled By: gitbw95

fbshipit-source-id: 1b90d3df99d08ddecd351edfd48d1e3723fdbc15
2022-09-08 16:35:57 -07:00
Yanqin Jin 3d67d79154 Fix overlapping check by excluding timestamp (#10615)
Summary:
With user-defined timestamp, checking overlapping should exclude
timestamp part from key. This has already been done for range checking
for files in sstableKeyCompare(), but not yet done when checking with
concurrent compactions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615

Test Plan:
(Will add more tests)

make check
(Repro seems easier with this commit sha: git checkout 78bbdef530)
rm -rf /dev/shm/rocksdb/* &&
mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected &&
./db_stress
--allow_data_in_errors=True --clear_column_family_one_in=0
--continuous_verification_interval=0 --data_block_index_type=1
--db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5
--delrangepercent=0
--expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected
--iterpercent=0 --max_background_compactions=20
--max_bytes_for_level_base=10485760 --max_key=25000000
--max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1
--ops_per_thread=1000000 --paranoid_file_checks=1 --partition_filters=0
--prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0
--snapshot_hold_ops=100000 --subcompactions=1 --compaction_pri=3
--target_file_size_base=65536 --target_file_size_multiplier=2
--test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1
--user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1
--write_buffer_size=65536 --writepercent=60 -disable_wal=1

Reviewed By: akankshamahajan15

Differential Revision: D39146797

Pulled By: riversand963

fbshipit-source-id: 7fca800026ca6219220100b8b6cf84d907828163
2022-09-08 13:03:07 -07:00
Peter Dillinger 6de7081cf3 Always verify SST unique IDs on SST file open (#10532)
Summary:
Although we've been tracking SST unique IDs in the DB manifest
unconditionally, checking has been opt-in and with an extra pass at DB::Open
time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
check unique ID against manifest every time an SST file is opened through
table cache (normal DB operations), replacing the explicit pass over files
at DB::Open time. This change also enables the option by default and
removes the "EXPERIMENTAL" designation.

One possible criticism is that the option no longer ensures the integrity
of a DB at Open time. This is far from an all-or-nothing issue. Verifying
the IDs of all SST files hardly ensures all the data in the DB is readable.
(VerifyChecksum is supposed to do that.) Also, with
max_open_files=-1 (default, extremely common), all SST files are
opened at DB::Open time anyway.

Implementation details:
* `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
that is now removed.
* Unit tests that manipulate/corrupt table properties have to opt out of
this check, because that corrupts the "actual" unique id. (And even for
testing we don't currently have a mechanism to set "no unique id"
in the in-memory file metadata for new files.)
* A lot of other unit test churn relates to (a) default checking on, and
(b) checking on SST open even without DB::Open (e.g. on flush)
* Use `FileMetaData` for more `TableCache` operations (in place of
`FileDescriptor`) so that we have access to the unique_id whenever
we might need to open an SST file. **There is the possibility of
performance impact because we can no longer use the more
localized `fd` part of an `FdWithKeyRange` but instead follow the
`file_metadata` pointer. However, this change (possible regression)
is only done for `GetMemoryUsageByTableReaders`.**
* Removed a completely unnecessary constructor overload of
`TableReaderOptions`

Possible follow-up:
* Verification only happens when opening through table cache. Are there
more places where this should happen?
* Improve error message when there is a file size mismatch vs. manifest
(FIXME added in the appropriate place).
* I'm not sure there's a justification for `FileDescriptor` to be distinct from
`FileMetaData`.
* I'm skeptical that `FdWithKeyRange` really still makes sense for
optimizing some data locality by duplicating some data in memory, but I
could be wrong.
* An unnecessary overload of NewTableReader was recently added, in
the public API nonetheless (though unusable there). It should be cleaned
up to put most things under `TableReaderOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532

Test Plan:
updated unit tests

Performance test showing no significant difference (just noise I think):
`./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec

Reviewed By: jay-zhuang

Differential Revision: D38765551

Pulled By: pdillinger

fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
2022-09-07 22:52:42 -07:00
Bo Wang d490bfcdb6 Avoid recompressing cold block in CompressedSecondaryCache (#10527)
Summary:
**Summary:**
When a block is firstly `Lookup` from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don’t erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again, we erase it from the secondary cache and add it into the primary cache.

When a block is firstly evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.

**Implementation Details**
Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0.  (refs >= 1 && in_cache == false && IS_STANDALONE == true)

The behaviors of  `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache:
1. If a handle is found in primary cache:
  1.1. If the handle's value is not nullptr, it is returned immediately.
  1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
    - 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
    - 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
2. If a handle is not found in primary cache:
  2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
  2.2.  If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block)  and return a standalone handle.

The behaviors of `LRUCacheShard::Promote()` are updated as follows:
1. If `e->sec_handle` has value, one of the following steps can happen:
  1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle.
  1.2. Insert the item into the primary cache and return the handle to caller.
  1.3. Exception handling.
3. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released.

The behavior of  `CompressedSecondaryCache::Insert()` is updated:
1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
4. If a dummy item is found for a block, the block is inserted into the secondary cache.

The behavior of  `CompressedSecondaryCache:::Lookup()` is updated:
1. If a handle is not found or it is a dummy item, a nullptr is returned.
2. If `erase_handle` is true, the handle is erased.

The behaviors of  `LRUCacheShard::Release()` are adjusted for the standalone handles.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527

Test Plan:
1. stress tests.
5. unit tests.
6. CPU profiling for db_bench.

Reviewed By: siying

Differential Revision: D38747613

Pulled By: gitbw95

fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca
2022-09-07 19:00:27 -07:00
Changyu Bi 30bc495c03 Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449)
Summary:
Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and check with `RangeDelAggregator`.

With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in iterator:
- in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
- in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.

This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.

One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap is covered by the range tombstone's internal key range `[start, end)`.

Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file are iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advance the file iterator once the sentinel key is processed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449

Test Plan:
- Added many unit tests in db_range_del_test
- Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
- Additional iterator stress test is added to verify against iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.

```
python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
```

- Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written".  As expected, the performance with this PR depends on the range tombstone width.
```
# Setup:
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50

# Scan entire DB
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true

# Short range scan (10 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true

# Long range scan(1000 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
```
Avg over of 10 runs (some slower tests had fews runs):

For the first column (tombstone), 0 means no range tombstone, 100-10000 means width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The 1 tombstone case is to test regression when there's very few range tombstones in the DB, as no range tombstone is likely to take a different code path than with range tombstones.

- Scan entire DB

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- |  ------------- |
| 0 range tombstone    |2525600 (± 43564)    |2486917 (± 33698)    |-1.53%               |
| 100   |1853835 (± 24736)    |2073884 (± 32176)    |+11.87%              |
| 1000  |422415 (± 7466)      |1115801 (± 22781)    |+164.15%             |
| 10000 |22384 (± 227)        |227919 (± 6647)      |+918.22%             |
| 1 range tombstone      |2176540 (± 39050)    |2434954 (± 24563)    |+11.87%              |
- Short range scan

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- |  ------------- |
| 0  range tombstone   |35398 (± 533)        |35338 (± 569)        |-0.17%               |
| 100   |28276 (± 664)        |31684 (± 331)        |+12.05%              |
| 1000  |7637 (± 77)          |25422 (± 277)        |+232.88%             |
| 10000 |1367                 |28667                |+1997.07%            |
| 1 range tombstone      |32618 (± 581)        |32748 (± 506)        |+0.4%                |

- Long range scan

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- |  ------------- |
| 0 range tombstone     |2262 (± 33)          |2353 (± 20)          |+4.02%               |
| 100   |1696 (± 26)          |1926 (± 18)          |+13.56%              |
| 1000  |410 (± 6)            |1255 (± 29)          |+206.1%              |
| 10000 |25                   |414                  |+1556.0%             |
| 1 range tombstone   |1957 (± 30)          |2185 (± 44)          |+11.65%              |

- Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61

Reviewed By: ajkr

Differential Revision: D38450331

Pulled By: cbi42

fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
2022-09-02 09:51:19 -07:00
Akanksha Mahajan 4cd16d65ae Add new option num_file_reads_for_auto_readahead in BlockBasedTableOptions (#10556)
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file if user doesn't provide readahead_size and reads are sequential.
A new option num_file_reads_for_auto_readahead is added which can be
configured and indicates after how many sequential reads prefetching should
be start.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10556

Test Plan: Existing and new unit test

Reviewed By: anand1976

Differential Revision: D38947147

Pulled By: akankshamahajan15

fbshipit-source-id: c9eeab495f84a8df7f701c42f04894e46440ad97
2022-09-01 11:56:00 -07:00
anand76 5fbcc8c54d Update MULTIGET_IO_BATCH_SIZE for non-async MultiGet (#10622)
Summary:
This stat was only getting updated in the async (coroutine) version of MultiGet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10622

Reviewed By: akankshamahajan15

Differential Revision: D39188790

Pulled By: anand1976

fbshipit-source-id: 7e231507f65fc94c8a006c38f79dfba182a2c24a
2022-08-31 21:03:52 -07:00
sdong 9509003503 Option migration tool to break down files for FIFO compaction (#10600)
Summary:
Right now, when the option migration tool migrates to FIFO compaction, it compacts all the data into one single SST file and move to L0. Although it creates a valid LSM-tree for FIFO, for any data to be deleted for FIFO, the giant file will be deleted, which might make the DB almost empty. There is not good solution for it, because usually we don't have enough information to reconstruct the FIFO LSM-tree. This change changes to a solution that compromises the FIFO condition. We hope the solution is more useable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10600

Test Plan: Add unit tests for that.

Reviewed By: jay-zhuang

Differential Revision: D39106424

fbshipit-source-id: bdfd852c3b343373765b8d9716fefc08fd27145c
2022-08-31 12:08:23 -07:00
Hui Xiao e484b81eee Sync dir containing CURRENT after RenameFile on CURRENT as much as possible (#10573)
Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.

This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.

The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped  (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```

```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```

**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
    -  Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
    - `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573

Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr  "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.

Reviewed By: ajkr

Differential Revision: D39005886

Pulled By: hx235

fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
2022-08-29 17:35:21 -07:00
anand76 72a3fb3424 Update statistics for async scan readaheads (#10585)
Summary:
Imported a fix to "rocksdb.prefetched.bytes.discarded" stat from https://github.com/facebook/rocksdb/issues/10561, and added a new stat "rocksdb.async.prefetch.abort.micros" to measure time spent waiting for async reads to abort.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10585

Reviewed By: akankshamahajan15

Differential Revision: D39067000

Pulled By: anand1976

fbshipit-source-id: d7cda71abb48017239bd5fd832345a16c7024faf
2022-08-29 14:37:44 -07:00
Peter Dillinger c5afbbfe4b Don't wait for indirect flush in read-only DB (#10569)
Summary:
Some APIs for getting live files, which are used by Checkpoint
and BackupEngine, can optionally trigger and wait for a flush. These
would deadlock when used on a read-only DB. Here we fix that by assuming
the user wants the overall operation to succeed and is OK without
flushing (because the DB is read-only).

Follow-up work: the same or other issues can be hit by directly invoking
some DB functions that are clearly not appropriate for read-only
instance, but are not covered by overrides in DBImplReadOnly and
CompactedDBImpl. These should be fixed to avoid similar problems on
accidental misuse. (Long term, it would be nice to have a DBReadOnly
class without those members, like BackupEngineReadOnly.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10569

Test Plan: tests updated to catch regression (hang before the fix)

Reviewed By: riversand963

Differential Revision: D38995759

Pulled By: pdillinger

fbshipit-source-id: f5f8bc7123e13cb45bd393dd974d7d6eda20bc68
2022-08-29 13:36:23 -07:00
Levi Tamasi 64e74723f7 Use the default metadata charge policy when creating an LRU cache via the Java API (#10577)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10577

Reviewed By: akankshamahajan15

Differential Revision: D39035884

Pulled By: ltamasi

fbshipit-source-id: 48f116f8ca172b7eb5eb3651f39ddb891a7ffade
2022-08-29 09:42:04 -07:00
Jay Zhuang d9e71fb2c5 Fix periodic_task unable to re-register the same task type (#10379)
Summary:
Timer has a limitation that it cannot re-register a task with the same name,
because the cancel only mark the task as invalid and wait for the Timer thread
to clean it up later, before the task is cleaned up, the same task name cannot
be added. Which makes the task option update likely to fail, which basically
cancel and re-register the same task name. Change the periodic task name to a
random unique id and store it in periodic_task_scheduler.

Also refactor the `periodic_work` to `periodic_task` to make each job function
as a `task`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10379

Test Plan: unittests

Reviewed By: ajkr

Differential Revision: D38000615

Pulled By: jay-zhuang

fbshipit-source-id: e4135f9422e3b53aaec8eda54f4e18ce633a279e
2022-08-25 18:52:37 -07:00
Brendan MacDonell 418b36a9bc Support CompactionPri::kRoundRobin in RocksJava (#10572)
Summary:
Pretty trivial — this PR just adds the new compaction priority to the Java API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10572

Reviewed By: hx235

Differential Revision: D39006523

Pulled By: ajkr

fbshipit-source-id: ea8d665817e7b05826c397afa41c3abcda81484e
2022-08-25 13:32:03 -07:00
Andrew Kryczka 7ad4b38617 Ensure writes to WAL tail during `FlushWAL(true /* sync */)` will be synced (#10560)
Summary:
WAL append and switch can both happen between `FlushWAL(true /* sync */)`'s sync operations and its call to `MarkLogsSynced()`. We permit this since locks need to be released for the sync operations. Such an appended/switched WAL is both inactive and incompletely synced at the time `MarkLogsSynced()` processes it.

Prior to this PR, `MarkLogsSynced()` assumed all inactive WALs were fully synced and removed them from consideration for future syncs. That was wrong in the scenario described above and led to the latest append(s) never being synced. This PR changes `MarkLogsSynced()` to only remove inactive WALs from consideration for which all flushed data has been synced.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10560

Test Plan: repro unit test for the scenario described above. Without this PR, it fails on "key2" not found

Reviewed By: riversand963

Differential Revision: D38957391

Pulled By: ajkr

fbshipit-source-id: da77175eba97ff251a4219b227b3bb2d4843ed26
2022-08-25 12:53:46 -07:00
Changyu Bi d140fbfd7d Add Iterator test against expected state to stress test (#10538)
Summary:
As mentioned in https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913,
`db_stress` does not have much verification for iterator correctness.
It has a `TestIterate()` function, but that is mainly for comparing results
between two iterators, one with `total_order_seek` and the other optionally
sets auto_prefix, upper/lower bounds. Commit 49a0581ad2462e31aa3f768afa769e0d33390f33
added a new `TestIterateAgainstExpected()` function that compares iterator against
expected state. It locks a range of keys, creates an iterator, does
a random sequence of `Next/Prev` and compares against expected state.
This PR is based on that commit, the main changes include some logs
(for easier debugging if a test fails), a forward and backward scan to
cover the entire locked key range, and a flag for optionally turning on
this version of Iterator testing.

Added constraint that the checks against expected state in
`TestIterateAgainstExpected()` and in `TestGet()` are only turned on
when `--skip_verifydb` flag is not set.
Remove the change log introduced in https://github.com/facebook/rocksdb/issues/10553.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10538

Test Plan:
Run `db_stress` with `--verify_iterator_with_expected_state_one_in=1`,
and a large `--iterpercent` and `--num_iterations`. Checked `op_logs`
manually to ensure expected coverage. Tweaked part of the code in
https://github.com/facebook/rocksdb/issues/10449 and stress test was able to catch it.
- internally run various flavor of crash test

Reviewed By: ajkr

Differential Revision: D38847269

Pulled By: cbi42

fbshipit-source-id: 8b4402a9bba9f6cfa08051943cd672579d489599
2022-08-24 14:59:50 -07:00
muthukrishnan.s 79ed4be80f Add get_name, get_id for column family handle in C API (#10499)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10499

Reviewed By: hx235

Differential Revision: D38523859

Pulled By: ajkr

fbshipit-source-id: 268bba1fcce4a3e20c51e498a79d7b476f663aea
2022-08-24 13:49:02 -07:00
Changyu Bi 198e5d8ee9 Update `TestGet()` to verify against expected state (#10553)
Summary:
updated `TestGet()` in `no_batched_op_stress` to check the result of `Get()` operations against expected state (`expected_state_manager_`). More specifically, if `Get()` finds a key, expected state should not have `DELETION_SENTINEL` for the same key, and if `Get()` returns NotFound for a key, expected state should not have the key. One intention for this change it to verify correctness of code path change regarding range tombstones.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10553

Test Plan: run db_stress with nonzero readpercent: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`. When I initially used wrong column family in `thread->shared->Get`, the test reported inconsistencies.

Reviewed By: ajkr

Differential Revision: D38927007

Pulled By: cbi42

fbshipit-source-id: f9f61b312ad0b4c21a799329609ba8526169b048
2022-08-23 17:08:14 -07:00
Peter Dillinger db7606a41a Fix "Behavior Changes" in 7.6 HISTORY.md (#10557)
Summary:
see diff

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10557

Test Plan: no functional change

Reviewed By: gitbw95

Differential Revision: D38950531

Pulled By: pdillinger

fbshipit-source-id: af72e80a31d7df38f6e633fa7115984c2274ed60
2022-08-23 13:03:41 -07:00
gitbw95 a9c2c7778d Update HISTORY.md for the upcoming 7.6 release (#10543)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10543

Reviewed By: anand1976

Differential Revision: D38877168

Pulled By: gitbw95

fbshipit-source-id: d6888f7dbb1f2a5bef144ad2443429a61663c1e8
2022-08-20 20:56:40 -07:00
anand76 35cdd3e71e MultiGet async IO across multiple levels (#10535)
Summary:
This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535

Test Plan:
1. Ensure existing MultiGet unit tests pass, updating them as necessary
2. New unit tests - TODO
3. Run stress test - TODO

No noticeable regression (<1%) without async IO -
Without PR: `multireadrandom :       7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations;  571.6 MB/s (8168992 of 8168992 found)`
With PR: `multireadrandom :       7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations;  568.2 MB/s (8271992 of 8271992 found)`

For a fully cached DB, but with async IO option on, no regression observed (<1%) -
Without PR: `multireadrandom :       5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations;  797.9 MB/s (11540992 of 11540992 found) `
With PR: `multireadrandom :       5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations;  790.7 MB/s (11649992 of 11649992 found) `

Reviewed By: akankshamahajan15

Differential Revision: D38774009

Pulled By: anand1976

fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
2022-08-19 16:52:52 -07:00
anand76 2553d1efa1 Revert "Avoid dynamic memory allocation on read path (#10453)" (#10541)
Summary:
This reverts commit 0d885e80d4. The original commit causes a ASAN stack-use-after-return failure due to the `CreateCallback` being allocated on stack and then used in another thread when a secondary cache object is promoted to the primary cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10541

Reviewed By: gitbw95

Differential Revision: D38850039

Pulled By: anand1976

fbshipit-source-id: 810c592b7de2523693f5bb267159b23b0ee9132c
2022-08-19 11:02:54 -07:00
Bo Wang 13cb7a84b6 Fix the memory leak in db_stress tests that are caused by `FaultInjectionSecondaryCache` and add `CompressedSecondaryCache` into stress tests. (#10523)
Summary:
1. Fix the memory leak in db_stress tests that are caused by `FaultInjectionSecondaryCache`. To address the test requirements for both CompressedSecondaryCache and CachlibWrapper, a new class variable `base_is_compressed_sec_cache_` is added to determine the different behaviors in `Lookup()` and `WaitAll()`.
2. Add `CompressedSecondaryCache` into stress tests.

Before this PR, memory leak is reported during crash tests if  `CompressedSecondaryCache` is in stress tests. One example is shown as follows:
```
==70722==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 6648240 byte(s) in 83103 object(s) allocated from:
    #0 0x13de9d7 in operator new(unsigned long) (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dbgo/gen/aab7ed39/internal_repo_rocksdb/repo/db_stress+0x13de9d7)
    https://github.com/facebook/rocksdb/issues/1 0x9084c7 in rocksdb::BlocklikeTraits<rocksdb::Block>::Create(rocksdb::BlockContents&&, unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*) internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:128
    https://github.com/facebook/rocksdb/issues/2 0x9084c7 in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)::operator()(void const*, unsigned long, void**, unsigned long*) const internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:34
    https://github.com/facebook/rocksdb/issues/3 0x9082c9 in rocksdb::Block std::__invoke_impl<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::__invoke_other, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:61
    https://github.com/facebook/rocksdb/issues/4 0x90825d in std::enable_if<is_invocable_r_v<rocksdb::Block, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>, rocksdb::Block>::type std::__invoke_r<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:114
    https://github.com/facebook/rocksdb/issues/5 0x9081b0 in std::_Function_handler<rocksdb::Status (void const*, unsigned long, void**, unsigned long*), std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)>::_M_invoke(std::_Any_data const&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:291
    https://github.com/facebook/rocksdb/issues/6 0x991f2c in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)>::operator()(void const*, unsigned long, void**, unsigned long*) const third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:560
    https://github.com/facebook/rocksdb/issues/7 0x990277 in rocksdb::CompressedSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/cache/compressed_secondary_cache.cc:77
    https://github.com/facebook/rocksdb/issues/8 0xd3aa4d in rocksdb::FaultInjectionSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/utilities/fault_injection_secondary_cache.cc:92
    https://github.com/facebook/rocksdb/issues/9 0xeadaab in rocksdb::lru_cache::LRUCacheShard::Lookup(rocksdb::Slice const&, unsigned int, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/lru_cache.cc:445
    https://github.com/facebook/rocksdb/issues/10 0x1064573 in rocksdb::ShardedCache::Lookup(rocksdb::Slice const&, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/sharded_cache.cc:89
    https://github.com/facebook/rocksdb/issues/11 0x8be0df in rocksdb::BlockBasedTable::GetEntryFromCache(rocksdb::CacheTier const&, rocksdb::Cache*, rocksdb::Slice const&, rocksdb::BlockType, bool, rocksdb::GetContext*, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:389
    https://github.com/facebook/rocksdb/issues/12 0x905790 in rocksdb::Status rocksdb::BlockBasedTable::GetDataBlockFromCache<rocksdb::Block>(rocksdb::Slice const&, rocksdb::Cache*, rocksdb::Cache*, rocksdb::ReadOptions const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::UncompressionDict const&, rocksdb::BlockType, bool, rocksdb::GetContext*) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1263
    https://github.com/facebook/rocksdb/issues/13 0x8b9259 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1559
    https://github.com/facebook/rocksdb/issues/14 0x8b710c in rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool, bool, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1726
    https://github.com/facebook/rocksdb/issues/15 0x8c329f in rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::FilePrefetchBuffer*, bool, bool, rocksdb::Status&) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader_impl.h:58
    https://github.com/facebook/rocksdb/issues/16 0x920117 in rocksdb::BlockBasedTableIterator::InitDataBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:262
    https://github.com/facebook/rocksdb/issues/17 0x920d42 in rocksdb::BlockBasedTableIterator::MaterializeCurrentBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:332
    https://github.com/facebook/rocksdb/issues/18 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
    https://github.com/facebook/rocksdb/issues/19 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
    https://github.com/facebook/rocksdb/issues/20 0xef9f6c in rocksdb::MergingIterator::PrepareValue() internal_repo_rocksdb/repo/table/merging_iterator.cc:260
    https://github.com/facebook/rocksdb/issues/21 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
    https://github.com/facebook/rocksdb/issues/22 0xc67bcd in rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:326
    https://github.com/facebook/rocksdb/issues/23 0xc66d36 in rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:234
    https://github.com/facebook/rocksdb/issues/24 0xc7ab47 in rocksdb::DBIter::Next() internal_repo_rocksdb/repo/db/db_iter.cc:161
    https://github.com/facebook/rocksdb/issues/25 0x70d938 in rocksdb::BatchedOpsStressTest::TestPrefixScan(rocksdb::ThreadState*, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&) internal_repo_rocksdb/repo/db_stress_tool/batched_ops_stress.cc:320
    https://github.com/facebook/rocksdb/issues/26 0x6dc6a8 in rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:907
    https://github.com/facebook/rocksdb/issues/27 0x6867de in rocksdb::ThreadBody(void*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_driver.cc:33
    https://github.com/facebook/rocksdb/issues/28 0xce4cc2 in rocksdb::(anonymous namespace)::StartThreadWrapper(void*) internal_repo_rocksdb/repo/env/env_posix.cc:461
    https://github.com/facebook/rocksdb/issues/29 0x7f23f9068c0e in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434:8
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10523

Test Plan:
```
$COMPILE_WITH_ASAN=1  make -j 24
$db_stress J=40 crash_test_with_txn
```

Reviewed By: anand1976

Differential Revision: D38646839

Pulled By: gitbw95

fbshipit-source-id: 9452895c7dc95481a9d7afe83b15193cf5b1c43e
2022-08-18 21:53:27 -07:00
Andrew Kryczka 91166012c8 Prevent a case of WriteBufferManager flush thrashing (#6364)
Summary:
Previously, the flushes triggered by `WriteBufferManager` could affect
the same CF repeatedly if it happens to get consecutive writes. Such
flushes are not particularly useful for reducing memory usage since
they switch nearly-empty memtables to immutable while they've just begun
filling their first arena block. In fact they may not even reduce the
mutable memory count if they involve replacing one mutable memtable containing
one arena block with a new mutable memtable containing one arena block.
Further, if such switches happen even a few times before a flush finishes,
the immutable memtable limit will be reached and writes will stall.

This PR adds a heuristic to not switch memtables to immutable for CFs
that already have one or more immutable memtables awaiting flush. There
is a memory usage regression if the user continues writing to the same
CF, that DB does not have any CFs eligible for switching, flushes
are not finishing, and the `WriteBufferManager` was constructed with
`allow_stall=false`. Before, it would grow by switching nearly empty
memtables until writes stall. Now, it would grow by filling memtables
until writes stall. This feels like an acceptable behavior change because
users who prefer to stall over violate the memory limit should be using
`allow_stall=true`, which is unaffected by this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6364

Test Plan:
- Command:

`rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num_multi_db=8 -num_column_families=2 -write_buffer_size=4194304 -db_write_buffer_size=16777216 -compression_type=none -statistics=true -target_file_size_base=4194304 -max_bytes_for_level_base=16777216`

- `rocksdb.db.write.stall` count before this PR: 175
- `rocksdb.db.write.stall` count after this PR: 0

Reviewed By: jay-zhuang

Differential Revision: D20167197

Pulled By: ajkr

fbshipit-source-id: 4a64064e9bc33d57c0a35f15547542d0191d0cb7
2022-08-17 15:53:40 -07:00
Gang Liao 275cd80cdb Add a blob-specific cache priority (#10461)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461

Reviewed By: siying

Differential Revision: D38672823

Pulled By: ltamasi

fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
2022-08-12 17:59:06 -07:00
Changyu Bi fd165c869d Add memtable per key-value checksum (#10281)
Summary:
Append per key-value checksum to internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make DB become read-only upon memtable entry checksum verification failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281

Test Plan:
- Added new unit test cases: `make check`
- Benchmark on memtable insert
```
TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

# avg over 10 runs
Baseline: 1166936 ops/sec
memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%)
memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%)
write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%)
```
-  Benchmark on only memtable read: ops/sec dropped 31% for `readseq` due to time spend on verifying checksum.
ops/sec for `readrandom` dropped ~6.8%.
```
# Readseq
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

readseq [AVG    20 runs] : 7432840 (± 212005) ops/sec;  822.3 (± 23.5) MB/sec
readseq [MEDIAN 20 runs] : 7573878 ops/sec;  837.9 MB/sec

With -memtable_protection_bytes_per_key=2:

readseq [AVG    20 runs] : 5134607 (± 119596) ops/sec;  568.0 (± 13.2) MB/sec
readseq [MEDIAN 20 runs] : 5232946 ops/sec;  578.9 MB/sec

# Readrandom
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100
readrandom [AVG    10 runs] : 140236 (± 3938) ops/sec;    9.8 (± 0.3) MB/sec
readrandom [MEDIAN 10 runs] : 140545 ops/sec;    9.8 MB/sec

With -memtable_protection_bytes_per_key=2:
readrandom [AVG    10 runs] : 130632 (± 2738) ops/sec;    9.1 (± 0.2) MB/sec
readrandom [MEDIAN 10 runs] : 130341 ops/sec;    9.1 MB/sec
```

- Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800`

Reviewed By: ajkr

Differential Revision: D37607896

Pulled By: cbi42

fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113
2022-08-12 13:51:32 -07:00
Peter Dillinger 86a1e3e0e7 Derive cache keys from SST unique IDs (#10394)
Summary:
... so that cache keys can be derived from DB manifest data
before reading the file from storage--so that every part of the file
can potentially go in a persistent cache.

See updated comments in cache_key.cc for technical details. Importantly,
the new cache key encoding uses some fancy but efficient math to pack
data into the cache key without depending on the sizes of the various
pieces. This simplifies some existing code creating cache keys, like
cache warming before the file size is known.

This should provide us an essentially permanent mapping between SST
unique IDs and base cache keys, with the ability to "upgrade" SST
unique IDs (and thus cache keys) with new SST format_versions.

These cache keys are of similar, perhaps indistinguishable quality to
the previous generation. Before this change (see "corrected" days
between collision):

```
./cache_bench -stress_cache_key -sck_keep_bits=43
18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
```

After this change (keep 43 bits, up through 50, to validate "trajectory"
is ok on "corrected" days between collision):
```
19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
```

However, the change does modify (probably weaken) the "guaranteed unique" promise from this

> SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86

to this (see https://github.com/facebook/rocksdb/issues/10388)

> With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits

I don't think this is a practical concern, though.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394

Test Plan: unit tests updated, see simulation results above

Reviewed By: jay-zhuang

Differential Revision: D38667529

Pulled By: pdillinger

fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
2022-08-12 13:49:49 -07:00
Levi Tamasi f3ddbe66bd Mention PR 10391 in HISTORY.md (#10522)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10522

Reviewed By: riversand963

Differential Revision: D38639429

Pulled By: ltamasi

fbshipit-source-id: 14d7ed4df76a78ba6882e0474048a720afb907d4
2022-08-11 15:51:11 -07:00
sdong 911c0208b9 WritableFileWriter tries to skip operations after failure (#10489)
Summary:
A flag in WritableFileWriter is introduced to remember error has happened. Subsequent operations will fail with an assertion. Those operations, except Close() are not supposed to be called anyway. This change will help catch bug in tests and stress tests and limit damage of a potential bug of continue writing to a file after a failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489

Test Plan: Fix existing unit tests and watch crash tests for a while.

Reviewed By: anand1976

Differential Revision: D38473277

fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366
2022-08-10 10:19:20 -07:00
gitbw95 f060b47ee8 Fix the segdefault bug in CompressedSecondaryCache and its tests (#10507)
Summary:
This fix is to replace `AllocateBlock()` with `new`. Once I figure out why `AllocateBlock()` might cause the segfault, I will update the implementation.

Fix the bug that causes ./compressed_secondary_cache_test output following test failures:

```
Note: Google Test filter = CompressedSecondaryCacheTest.MergeChunksIntoValueTest
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CompressedSecondaryCacheTest
[ RUN      ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest
[       OK ] CompressedSecondaryCacheTest.MergeChunksIntoValueTest (1 ms)
[----------] 1 test from CompressedSecondaryCacheTest (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (9 ms total)
[  PASSED  ] 1 test.
t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.MergeChunksIntoValueTest: line 4: 1091086 Segmentation fault      (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.MergeChunksIntoValueTest
Note: Google Test filter = CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CompressedSecondaryCacheTest
[ RUN      ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression
[       OK ] CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression (1 ms)
[----------] 1 test from CompressedSecondaryCacheTest (1 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (2 ms total)
[  PASSED  ] 1 test.
t/run-compressed_secondary_cache_test-CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression: line 4: 1090883 Segmentation fault      (core dumped) TEST_TMPDIR=$d ./compressed_secondary_cache_test --gtest_filter=CompressedSecondaryCacheTest.BasicTestWithMemoryAllocatorAndCompression

```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10507

Test Plan:
Test 1:
```
$make -j 24
$./compressed_secondary_cache_test
```
Test 2:
```
$COMPILE_WITH_ASAN=1  make -j 24
$./compressed_secondary_cache_test
```
Test 3:
```
$COMPILE_WITH_TSAN=1 make -j 24
$./compressed_secondary_cache_test
```

Reviewed By: anand1976

Differential Revision: D38529885

Pulled By: gitbw95

fbshipit-source-id: d903fa3fadbd4d29f9528728c63a4f61c4396890
2022-08-09 15:34:50 -07:00
Levi Tamasi 06b04127a8 Reset blob value as soon as it's not needed in DBIter (#10490)
Summary:
We have recently added caching support to BlobDB, and separately,
implemented an optimization where reading blobs from the cache
results in the cache handle being transferred to the target `PinnableSlice`
(as opposed to the contents getting copied). With these changes,
it makes sense to reset the `PinnableSlice` storing the blob value in
`DBIter` as soon as we move to a different iterator position to prevent
us from holding on to the cache handle any longer than necessary.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10490

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D38473630

Pulled By: ltamasi

fbshipit-source-id: 84c045ffac76436c6152fd0f5775b007f4051386
2022-08-09 11:39:57 -07:00
Levi Tamasi a85443c001 Update HISTORY.md for PR 10492 (#10504)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10504

Reviewed By: akankshamahajan15

Differential Revision: D38514813

Pulled By: ltamasi

fbshipit-source-id: 3c0c157740a6680b6f91216adcc2553c3a327b94
2022-08-08 15:54:13 -07:00
Jay Zhuang 3f763763aa Change `bottommost_temperture` to `last_level_temperture` (#10471)
Summary:
Change tiered compaction feature from `bottommost_temperture` to
`last_level_temperture`. The old option is kept for migration purpose only,
which is behaving the same as `last_level_temperture` and it will be removed in
the next release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471

Test Plan: CI

Reviewed By: siying

Differential Revision: D38450621

Pulled By: jay-zhuang

fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
2022-08-08 14:36:34 -07:00
Jay Zhuang 375534752a Improve universal compaction picker for tiered compaction (#10467)
Summary:
Current universal compaction picker may cause extra size amplification
compaction if there're more hot data on penultimate level. Improve the picker
to skip the last level for size amp calculation if tiered compaction is
enabled, which can
1. avoid extra unnecessary size amp compaction;
2. typically cold tier (the last level) is not size constrained, so skip size
   amp for cold tier is intended;

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10467

Test Plan: CI and added unittest

Reviewed By: siying

Differential Revision: D38391350

Pulled By: jay-zhuang

fbshipit-source-id: 103c0731c05e0a7e8f267e9e829d022328be25d2
2022-08-08 14:34:36 -07:00
Jay Zhuang 0d885e80d4 Avoid dynamic memory allocation on read path (#10453)
Summary:
lambda function dynamicly allocates memory from heap if it needs to
capture multiple values, which could be expensive.
Switch to explictly use local functor from stack.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10453

Test Plan:
CI
db_bench shows ~2-3% read improvement:
```
# before the change
TEST_TMPDIR=/tmp/dbbench4 ./db_bench_main --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000
readrandom   :       8.528 micros/op 117265 ops/sec 85.277 seconds 10000000 operations;   13.0 MB/s (10000000 of 10000000 found)
# after the change
TEST_TMPDIR=/tmp/dbbench5 ./db_bench_new --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000
readrandom   :       8.263 micros/op 121015 ops/sec 82.634 seconds 10000000 operations;   13.4 MB/s (10000000 of 10000000 found)
```
details: https://gist.github.com/jay-zhuang/5ac0628db8fc9cbcb499e056d4cb5918

Micro-benchmark shows a similar improvement ~1-2%:
before the change:
https://gist.github.com/jay-zhuang/9dc0ebf51bbfbf4af82f6193d43cf75b
after the change:
https://gist.github.com/jay-zhuang/fc061f1813cd8f441109ad0b0fe7c185

Reviewed By: ajkr

Differential Revision: D38345056

Pulled By: jay-zhuang

fbshipit-source-id: f3597aeeee338a804d37bf2e81386d5a100665e0
2022-08-08 12:59:31 -07:00
Changyu Bi 9d77bf8f7b Fragment memtable range tombstone in the write path (#10380)
Summary:
- Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there is any.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380

Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such  an immutable memtable is created and range tombstones are in that memtable.

```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100

multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16464 is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.

Single thread result:
| Max # tombstones  | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op |  99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
| 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
| 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
| 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
| 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |

32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones  | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op |  99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
| 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
| 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
| 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |

Reviewed By: ajkr

Differential Revision: D37916564

Pulled By: cbi42

fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
2022-08-05 12:02:33 -07:00
Bo Wang f28d0c2020 Fix data race reported on SetIsInSecondaryCache in LRUCache (#10472)
Summary:
Currently, `SetIsInSecondaryCache` is after `Promote`. After `Promote`, a handle can be accessed and its flags can be set. This causes data race.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10472

Test Plan:
unit tests
stress tests

Reviewed By: pdillinger

Differential Revision: D38403991

Pulled By: gitbw95

fbshipit-source-id: 0aaa2d2edeaf5bc799fcce605648fe49eb7119c2
2022-08-04 13:52:11 -07:00
Bo Wang 87b82f28a1 Split cache to minimize internal fragmentation (#10287)
Summary:
### **Summary:**
To minimize the internal fragmentation caused by the variable size of the compressed blocks, the original block is split according to the jemalloc bin size in `Insert()` and then merged back in `Lookup()`.  Based on the analysis of the results of the following tests, from the overall internal fragmentation perspective, this PR does mitigate the internal fragmentation issue.

_Do more myshadow tests with the latest commit. I finished several myshadow AB Testing and the results are promising. For the config of 4GB primary cache and 3GB secondary cache, Jemalloc resident stats shows consistently ~0.15GB memory saving; the allocated and active stats show similar memory savings. The CPU usage is almost the same before and after this PR._

To evaluate the issue of memory fragmentations and the benefits of this PR, I conducted two sets of local tests as follows.

**T1**
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    90000000
RawSize:    9956.4 MB (estimated)
FileSize:   5664.8 MB (estimated)

| Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
| - | - | - |
| T1_3 | 4000 | 4000 |
| T1_4 | 2000 | 3000 |

Populate the DB:
./db_bench --benchmarks=fillrandom --num=90000000 -db=/mem_fragmentation/db_bench_1
Overwrite it to a stable state:
./db_bench --benchmarks=overwrite --num=90000000 -use_existing_db -db=/mem_fragmentation/db_bench_1

Run read tests with differnt cache setting:
T1_3:
MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_3_20220710 -duration=1800 &

T1_4:
MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom  --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_4_20220710 -duration=1800 &

For T1_3 and T1_4, I also conducted the tests before and after this PR. The following table show the important jemalloc stats.

| Test Name | T1_3 | T1_3 after mem defrag | T1_4 | T1_4 after mem defrag |
| - | - | - | - | - |
| allocated (MB)  | 8728 | 8076 | 5518 | 5043 |
| available (MB)  | 8753 | 8092 | 5536 | 5051 |
| external fragmentation rate  | 0.003 | 0.002 | 0.003 | 0.0016 |
| resident (MB)  | 8956 | 8365 | 5655 | 5235 |

**T2**
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     256 bytes each (128 bytes after compression)
Entries:    40000000
RawSize:    10986.3 MB (estimated)
FileSize:   6103.5 MB (estimated)

| Test Name | Primary Cache Size (MB) | Compressed Secondary Cache Size (MB) |
| - | - | - |
| T2_3 | 4000 | 4000 |
| T2_4 | 2000 | 3000 |

Create DB (10GB):
./db_bench -benchmarks=fillrandom -use_direct_reads=true -num=40000000 -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2
Overwrite it to a stable state:
./db_bench --benchmarks=overwrite --num=40000000 -use_existing_db -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2

Run read tests with differnt cache setting:
T2_3:
MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_3 -duration=1800  &

T2_4:
MALLOC_CONF="prof:true,prof_stats:true" ./db_bench  --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_4 -duration=1800  &

For T2_3 and T2_4, I also conducted the tests before and after this PR. The following table show the important jemalloc stats.

| Test Name |  T2_3 | T2_3 after mem defrag | T2_4 | T2_4 after mem defrag |
| -  | - | - | - | - |
| allocated (MB)  | 8425 | 8093 | 5426 | 5149 |
| available (MB)  | 8489 | 8138 | 5435 | 5158 |
| external fragmentation rate  | 0.008 | 0.0055 | 0.0017 | 0.0017 |
| resident (MB)  | 8676 | 8392 | 5541 | 5321 |

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10287

Test Plan: Unit tests.

Reviewed By: anand1976

Differential Revision: D37743362

Pulled By: gitbw95

fbshipit-source-id: 0010c5af08addeacc5ebbc4ffe5be882fb1d38ad
2022-08-02 15:28:11 -07:00
Peter Dillinger 27f3af5966 Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460)
Summary:
TL;DR: due to a recent change, if you drop a column family,
often that DB will no longer fsync after writing new SST files
to remaining or new column families, which could lead to data
loss on power loss.

More bug detail:
The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
DB::Close time rather than waiting for DB object destruction.
Unfortunately, it also closes shared FSDirectory objects on
DropColumnFamily (& destroy remaining handles), which can lead
to use-after-Close on FSDirectory shared with remaining column
families. Those "uses" are only Fsyncs (or redundant Closes). In
the default Posix filesystem, an Fsync on a closed FSDirectory is a
quiet no-op. Consequently (under most configurations), if you drop
a column family, that DB will no longer fsync after writing new SST
files to column families sharing the same directory (true under most
configurations).

More fix detail:
Basically, this removes unnecessary Close ops on destroying
ColumnFamilyData. We let `shared_ptr` take care of calling the
destructor at the right time. If the intent was to require Close be
called before destroying FSDirectory, that was not made clear by the
author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
timely destruction of FSDirectory to suffice as Close (in
CountedFileSystem). Added a TODO to revisit that.

Also in this PR:
* Added a TODO to share FSDirectory instances between DB and its column
families. (Already shared among column families.)
* Made DB::Close attempt to close all its open FSDirectory objects even
if there is a failure in closing one. Also code clean-up around this
logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460

Test Plan:
add an assert to check for use-after-Close. With that
existing tests can detect the misuse. With fix, tests pass (except noted
relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)

Reviewed By: ajkr

Differential Revision: D38357922

Pulled By: pdillinger

fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
2022-08-02 10:54:32 -07:00
Levi Tamasi cc8ded6152 Do not put blobs read during compaction into cache (#10457)
Summary:
During compaction, blobs are currently read using the default
`ReadOptions`, which has the `fill_cache` flag set to true. Earlier,
this didn't make any difference since we didn't have a blob cache;
however, now we have to explicitly set this flag to false to avoid
polluting the cache during compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D38333528

Pulled By: ltamasi

fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2
2022-08-01 19:49:05 -07:00
Yanqin Jin fbfcf5cbcd Remove unused fields from FileMetaData (temporarily) (#10443)
Summary:
FileMetaData::[min|max]_timestamp are not currently being used or
tracked by RocksDB, even when user-defined timestamp is enabled. Each of
them is a std::string which can occupy 32 bytes. Remove them for now.
They may be added back when we have a pressing need for them. When we do
add them back, consider store them in a more compact way, e.g. one
boolean flag and a byte array of size 16.

Per file min/max timestamp bounds are available as table properties.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D38292275

Pulled By: riversand963

fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94
2022-08-01 17:56:13 -07:00
sdong cc2099803a Use EnvLogger instead of PosixLogger (#10436)
Summary:
EnvLogger was built to replace PosixLogger that supports multiple Envs. Make FileSystem use EnvLogger by default, remove Posix FS specific implementation and remove PosixLogger code,
Some hacky changes are made to make sure iostats are not polluted by logging, in order to pass existing unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10436

Test Plan: Run db_bench and watch info log files.

Reviewed By: anand1976

Differential Revision: D38259855

fbshipit-source-id: 67d65874bfba7a33535b6d0dd0ed92cbbc9888b8
2022-08-01 14:37:18 -07:00
Akanksha Mahajan 56463d443d Provide support for subcompactions with user-defined timestamps (#10344)
Summary:
The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues.
Issue1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contains timestamps which results in assertion failure as BlockBasedTableIterator expects bounds to be without timestamps. As result, because of wrong comparison end key is returned as user_key resulting in assertion failure.
Issue2: Since it might result in two keys that only differ by user timestamp getting processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correction issues.

This PR provide support to reenable subcompactions with user-defined timestamps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344

Test Plan:
Added new unit test
- Without fix for Issue1 unit test MultipleSubCompactions fails with error:
```
db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterat│
or::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
Received signal 6 (Aborted)                                                                             │
#0   /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530] db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed.
Aborted (core dumped)
```
Ran stress test
`make crash_test_with_ts -j32`

Reviewed By: riversand963

Differential Revision: D38220841

Pulled By: akankshamahajan15

fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb
2022-07-31 11:39:16 -07:00
Peter Dillinger 65036e4217 Revert "Add a blob-specific cache priority (#10309)" (#10434)
Summary:
This reverts commit 8d178090be
because of a clear performance regression seen in internal dashboard
https://fburl.com/unidash/tpz75iee

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10434

Reviewed By: ltamasi

Differential Revision: D38256373

Pulled By: pdillinger

fbshipit-source-id: 134aa00f50dd7b1bbe037c227884a351342ec44b
2022-07-29 07:18:15 -07:00
Andrew Kryczka c7ccbb33a6 Allow manual compactions to run in parallel by default (#10317)
Summary:
This PR changes the default value of
`CompactRangeOptions::exclusive_manual_compaction` from true to false so
manual `CompactRange()`s can run in parallel with other compactions. I
believe no artificial parallelism restriction is the intuitive behavior
so feel the old default value is a trap, which I have fallen into
several times, including yesterday.
`CompactRangeOptions::exclusive_manual_compaction == false` has been
used in both our correctness test and in production for years so should
be reasonably safe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10317

Reviewed By: jay-zhuang

Differential Revision: D37659392

Pulled By: ajkr

fbshipit-source-id: 504915e978bbe300b79483d064070c75e93d91e5
2022-07-28 17:07:36 -07:00
Jay Zhuang 87649d3288 Best efforts recovery to skip empty MANIFEST (#10416)
Summary:
Skip empty MANIFEST fie during best_efforts_recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10416

Test Plan: make failed db_stress test pass

Reviewed By: riversand963

Differential Revision: D38126273

Pulled By: jay-zhuang

fbshipit-source-id: 4498d322b09eaa194dd2cbf9c683d62ab54bfb01
2022-07-27 20:16:38 -07:00
Gang Liao 8d178090be Add a blob-specific cache priority (#10309)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309

Reviewed By: ltamasi

Differential Revision: D38211655

Pulled By: gangliao

fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
2022-07-27 19:09:24 -07:00
Zichen Zhu 8b2d429251 Mention kRoundRobin in HISTORY.md (#10421)
Summary:
Update HISTORY.md for CompactionPri::kRoundRobin. Detailed implementation can be found in [PR10107](https://github.com/facebook/rocksdb/pull/10107), [PR10227](https://github.com/facebook/rocksdb/pull/10227), [PR10250](https://github.com/facebook/rocksdb/pull/10250), [PR10278](https://github.com/facebook/rocksdb/pull/10278), [PR10316](https://github.com/facebook/rocksdb/pull/10316), and [PR10341](https://github.com/facebook/rocksdb/pull/10341)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10421

Reviewed By: ajkr

Differential Revision: D38194070

Pulled By: littlepig2013

fbshipit-source-id: 4ce153dc0bf22cd865d09c5429955023dbc90f37
2022-07-27 10:38:26 -07:00
Changyu Bi 2fc6df37d6 Add checksum handshake for WAL fragment decompression (#10339)
Summary:
If WAL compression is enabled, WAL fragment decompression results are concatenated together in `log::Reader::ReadPhysicalRecord()`. This PR adds checksum handshake to protect memory corruption during the copying process.

`checksum` is renamed to `record_checksum` in `ReadRecord()` to differentiate it from `checksum_` flag that specifies whether CRC32C checksum is verified.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10339

Test Plan: added checksum verification in log_test.cc, `make check -j32`.

Reviewed By: ajkr

Differential Revision: D37763734

Pulled By: cbi42

fbshipit-source-id: c4faa7c76b9ff1df35026edf31adfe4b47ae3154
2022-07-25 16:27:26 -07:00
sdong 4e00748098 Fix a bug in hash linked list (#10401)
Summary:
In hash linked list, with a bucket of only one record, following sequence can cause users to temporarily miss a record:

Thread 1: Fetch the structure bucket x points too, which would be a Node n1 for a key, with next pointer to be null
Thread 2: Insert a key to bucket x that is larger than the existing key. This will make n1->next points to a new node n2, and update bucket x to point to n1.
Thread 1: see n1->next is not null, so it thinks it is a header of linked list and ignore the key of n1.

Fix it by refetch structure that bucket x points to when it sees n1->next is not null. This should work because if n1->next is not null, bucket x should already point to a linked list or skip list header.

A related change is to revert th order of testing for linked list and skip list. This is because after refetching the bucket, it might end up with a skip list, rather than linked list.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10401

Test Plan: Run existing tests and make sure at least it doesn't regress.

Reviewed By: jay-zhuang

Differential Revision: D38064471

fbshipit-source-id: 142bb85e1546c803f47e3357aef3e76debccd8df
2022-07-25 11:33:28 -07:00
sdong 252bea405e Improve SubCompaction Partitioning (#10393)
Summary:
Unit tests still haven't been fixed. Also need to add more tests. But I ran some simple fillrandom db_bench and the partitioning feels reasonable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393

Test Plan:
1. Make sure existing tests pass. This should cover some basic sub compaction logic to be correct and the partitioning result is reasonable;
2. Add a new unit test to ApproximateKeyAnchors()
3. Run some db_bench with max_subcompaction = 4 and watch the compaction is indeed partitioned evenly.

Reviewed By: jay-zhuang

Differential Revision: D38043783

fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f
2022-07-23 17:38:49 -07:00
LIU HU 8885b0537b Fix underflow in FIFOCompactionPicker (#10386)
Summary:
Fix https://github.com/facebook/rocksdb/issues/10133

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10386

Reviewed By: riversand963

Differential Revision: D38067265

Pulled By: ajkr

fbshipit-source-id: 3a99a98ac5d7ac37581b5b636fbfa7901563d834
2022-07-22 09:20:35 -07:00
Wallace 1e9bf25f61 Do not hold mutex when write keys if not necessary (#7516)
Summary:
## Problem Summary
RocksDB will acquire the global mutex of db instance for every time when user calls `Write`.  When RocksDB schedules a lot of compaction jobs,   it will compete the mutex with write thread and it will hurt the write performance.

## Problem Solution:
I want to use log_write_mutex to replace the global mutex in most case so that we do not acquire it in write-thread unless there is a write-stall event or a write-buffer-full event occur.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7516

Test Plan:
1. make check
2. CI
3. COMPILE_WITH_TSAN=1 make db_stress
make crash_test
make crash_test_with_multiops_wp_txn
make crash_test_with_multiops_wc_txn
make crash_test_with_atomic_flush

Reviewed By: siying

Differential Revision: D36908702

Pulled By: riversand963

fbshipit-source-id: 59b13881f4f5c0a58fd3ca79128a396d9cd98efe
2022-07-21 13:35:36 -07:00
Andrew Kryczka e576f2ab19 Fix race conditions in GenericRateLimiter (#10374)
Summary:
Made locking strict for all accesses of `GenericRateLimiter` internal state.

`SetBytesPerSecond()` was the main problem since it had no locking, while the two updates it makes need to be done as one atomic operation.

The test case, "ConfigOptionsTest.ConfiguringOptionsDoesNotRevertRateLimiterBandwidth", is for the issue fixed in https://github.com/facebook/rocksdb/issues/10378, but I forgot to include the test there.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10374

Reviewed By: pdillinger

Differential Revision: D37906367

Pulled By: ajkr

fbshipit-source-id: ccde620d2a7f96d1401bdafd2bdb685cbefbafa5
2022-07-19 09:31:14 -07:00
Gang Liao 0b6bc101ba Charge blob cache usage against the global memory limit (#10321)
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321

Reviewed By: ltamasi

Differential Revision: D37913590

Pulled By: gangliao

fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
2022-07-18 23:26:57 -07:00
Andrew Kryczka 25cc564ff7 Make RateLimiter not Customizable (#10378)
Summary:
(PR created for informational/testing purposes only.)

- Fixes lost dynamic updates to GenericRateLimiter bandwidth using `SetBytesPerSecond()`
- Benefit over #10374 is eliminating race conditions with Configurable framework.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10378

Reviewed By: pdillinger

Differential Revision: D37914865

fbshipit-source-id: d4f566d60ec9726d26932388c61671adf0ee0f30
2022-07-18 14:48:42 -07:00
Gang Liao ec4ebeff30 Support prepopulating/warming the blob cache (#10298)
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298

Reviewed By: ltamasi

Differential Revision: D37908743

Pulled By: gangliao

fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
2022-07-17 07:13:59 -07:00
Gang Liao 95ef007adc Support using secondary cache with the blob cache (#10349)
Summary:
RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used. The goals of this task are to add support for using a secondary cache with the blob cache and to measure the potential performance gains using `db_bench`.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10349

Reviewed By: ltamasi

Differential Revision: D37896773

Pulled By: gangliao

fbshipit-source-id: 7804619ce4a44b73d9e11ad606640f9385969c84
2022-07-16 03:54:37 -07:00
sdong 7506c1a4ca Update HISTORY.md for the upcoming 7.5 release (#10372)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10372

Reviewed By: riversand963

Differential Revision: D37894412

fbshipit-source-id: 77055a460d662f3b89921102a16b1a726d324d84
2022-07-15 15:35:14 -07:00
Jay Zhuang a3acf2ef87 Add seqno to time mapping (#10338)
Summary:
Which will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, adding seqno to time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time in certain cadence. When
memtable flush, the mapping informaiton is stored in sstable property.
During compaction, the mapping information are merged and get the
approximate time of sequence number, which is used to determine if a key
is recently inserted or not and preclude it from the last level if it's
recently inserted (within the `preclude_last_level_data_seconds`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338

Test Plan: CI

Reviewed By: siying

Differential Revision: D37810187

Pulled By: jay-zhuang

fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
2022-07-14 21:49:34 -07:00
Siying Dong 66685d6aa1 Fix HISTORY.md for misplaced items (#10362)
Summary:
Some items are misplaced to 7.4 but they are unreleased.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10362

Reviewed By: jay-zhuang

Differential Revision: D37859426

fbshipit-source-id: e2ad099227309ed2e0f3ca450a9a43986d681c7c
2022-07-14 11:33:47 -07:00
sdong c8b20d469d Make InternalKeyComparator not configurable (#10342)
Summary:
InternalKeyComparator is an internal class which is a simple wrapper of Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizeable. As a side effect, internal key comparator was made configurable too. This introduces overhead to this simple wrapper. For example, every InternalKeyComparator will have an std::vector attached to it, which consumes memory and possible allocation overhead too.
We remove InternalKeyComparator from being customizable by making InternalKeyComparator not a subclass of Comparator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342

Test Plan: Run existing CI tests and make sure it doesn't fail

Reviewed By: riversand963

Differential Revision: D37771351

fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee
2022-07-14 10:09:31 -07:00
Changyu Bi 5f9fe7f21e Added WAL compression checksum (#10319)
Summary:
Enabled zstd checksum flag in StreamingCompress so that WAL (de)compreression is protected by a checksum per compression frame.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10319

Test Plan:
- `make check`
- WAL perf: average ops/sec over 10 runs is 161226 pre PR and 159635 post PR (1% drop).
```
sudo TEST_TMPDIR=/dev/shm/memtable_write ./db_bench_checksum -benchmarks=fillseq -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=10 -wal_compression=zstd
```

Reviewed By: ajkr

Differential Revision: D37673311

Pulled By: cbi42

fbshipit-source-id: 9f34a3bfc2a82e5c80b1ec63bb339a7465108ec9
2022-07-13 15:29:20 -07:00
Guido Tagliavini Ponce 9645e66fc9 Temporarily return a LRUCache from NewClockCache (#10351)
Summary:
ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable.

The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify.

A new version of the NewClockCache function was created for our internal tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351

Test Plan: ``make -j24 check`` and re-run the pre-release tests.

Reviewed By: pdillinger

Differential Revision: D37802685

Pulled By: guidotag

fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972
2022-07-13 08:45:44 -07:00
Yanqin Jin b283f041f5 Stop tracking syncing live WAL for performance (#10330)
Summary:
With https://github.com/facebook/rocksdb/issues/10087, applications calling `SyncWAL()` or writing with `WriteOptions::sync=true` can suffer
from performance regression. This PR reverts to original behavior of tracking the syncing of closed WALs.
After we revert back to old behavior, recovery, whether kPointInTime or kAbsoluteConsistency, may fail to
detect corruption in synced WALs if the corruption is in the live WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10330

Test Plan:
make check

Before https://github.com/facebook/rocksdb/issues/10087
```bash
fillsync     :     750.269 micros/op 1332 ops/sec 75.027 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync     :     776.492 micros/op 1287 ops/sec 77.649 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 2 runs] : 1310 (± 44) ops/sec;    0.1 (± 0.0) MB/sec
fillsync     :     805.625 micros/op 1241 ops/sec 80.563 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [AVG    3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [MEDIAN 3 runs] : 1287 ops/sec;    0.1 MB/sec
```

Before this PR and after https://github.com/facebook/rocksdb/issues/10087
```bash
fillsync     :    1479.601 micros/op 675 ops/sec 147.960 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync     :    1626.080 micros/op 614 ops/sec 162.608 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 2 runs] : 645 (± 59) ops/sec;    0.1 (± 0.0) MB/sec
fillsync     :    1588.402 micros/op 629 ops/sec 158.840 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [AVG    3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [MEDIAN 3 runs] : 629 ops/sec;    0.1 MB/sec
```

After this PR
```bash
fillsync     :     749.621 micros/op 1334 ops/sec 74.962 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync     :     865.577 micros/op 1155 ops/sec 86.558 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 2 runs] : 1244 (± 175) ops/sec;    0.1 (± 0.0) MB/sec
fillsync     :     845.837 micros/op 1182 ops/sec 84.584 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [AVG    3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [MEDIAN 3 runs] : 1182 ops/sec;    0.1 MB/sec
```

Reviewed By: ajkr

Differential Revision: D37725212

Pulled By: riversand963

fbshipit-source-id: 8fa7d13b3c7662be5d56351c42caf3266af937ae
2022-07-12 17:16:57 -07:00
zczhu 96206531bc Support reservation in thread pool (#10278)
Summary:
Add `ReserveThreads` and `ReleaseThreads` functions in thread pool to support reservation in for a specific thread pool.  With this feature, a thread will be blocked if the number of waiting threads (noted by `num_waiting_threads_`) equals the number of reserved threads (noted by `reserved_threads_`), normally `reserved_threads_` is upper bounded by `num_waiting_threads_`; in rare cases (e.g. `SetBackgroundThreadsInternal` is called when some threads are already reserved), `num_waiting_threads_` can be less than `reserved_threads`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10278

Test Plan: Add `ReserveThreads` unit test in `env_test`. Update the unit test `SimpleColumnFamilyInfoTest` in `thread_list_test` with adding `ReserveThreads` related assertions.

Reviewed By: hx235

Differential Revision: D37640946

Pulled By: littlepig2013

fbshipit-source-id: 4d691f6b9a433569f96ab52d52c3defe5b065367
2022-07-08 19:48:09 -07:00
Gang Liao 28586be8ec Update HISTORY.md for blob cache (#10328)
Summary:
Update HISTORY.md for blob cache.  Implementation can be found from Github issue https://github.com/facebook/rocksdb/issues/10156 (or Github PRs https://github.com/facebook/rocksdb/issues/10155, https://github.com/facebook/rocksdb/issues/10178, https://github.com/facebook/rocksdb/issues/10225, https://github.com/facebook/rocksdb/issues/10198, and https://github.com/facebook/rocksdb/issues/10272).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10328

Reviewed By: riversand963

Differential Revision: D37732514

Pulled By: gangliao

fbshipit-source-id: 4c942a41c07914bfc8db56a0d3cf4d3e53d5963f
2022-07-08 18:35:52 -07:00
Peter Dillinger e6c5e0ab9a Have Cache use Status::MemoryLimit (#10262)
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)

I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.

HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262

Test Plan: updated unit tests

Reviewed By: hx235

Differential Revision: D37473155

Pulled By: pdillinger

fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
2022-07-06 14:41:46 -07:00
Akanksha Mahajan 2acbf386a3 Provide support for direct_reads with async_io (#10197)
Summary:
Provide support for use_direct_reads with async_io.

TestPlan:
-  Updated unit tests
-  db_bench: Results in https://github.com/facebook/rocksdb/pull/10197#issuecomment-1159239420
- db_stress
```
export CRASH_TEST_EXT_ARGS=" --async_io=1 --use_direct_reads=1"
make crash_test -j
```
- Ran db_bench on previous RocksDB version before any async_io implementation (as there have many changes in different PRs in this area) https://github.com/facebook/rocksdb/pull/10197#issuecomment-1160781563.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10197

Reviewed By: anand1976

Differential Revision: D37255646

Pulled By: akankshamahajan15

fbshipit-source-id: fec61ae15bf4d625f79dea56e4f86e0e307ba920
2022-07-06 11:42:59 -07:00
sdong a9565ccb26 Try to trivial move more than one files (#10190)
Summary:
In leveled compaction, try to trivial move more than one files if possible, up to 4 files or max_compaction_bytes. This is to allow higher write throughput for some use cases where data is loaded in sequential order, where appying compaction results is the bottleneck.

When pick up a file to compact and it doesn't have overlapping files in the next level, try to expand to the next file if there is still no overlapping.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190

Test Plan:
Add some unit tests.
For performance, Try to run
./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
Together with https://github.com/facebook/rocksdb/pull/10188 , stalling will be eliminated in this benchmark.

Reviewed By: jay-zhuang

Differential Revision: D37230647

fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549
2022-07-05 10:10:37 -07:00
Akanksha Mahajan 11215e0f3a Fix bug in Logger creation if dbname and db_log_dir are on different filesystem (#10292)
Summary:
If dbname and db_log_dir are at different filesystems (one
local and one remote), creation of dbname will fail because that path
doesn't exist wrt to db_log_dir.
This patch will ignore the error returned on creation of dbname. If they
are on same filesystem, db_log_dir creation will automatically return
the error in case there is any error in creation of dbname.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10292

Test Plan: Existing unit tests

Reviewed By: riversand963

Differential Revision: D37567773

Pulled By: akankshamahajan15

fbshipit-source-id: 005d28c536208d4c126c8cb8e196d1d85b881100
2022-06-30 19:04:25 -07:00
sdong 4428c76181 Multi-File Trivial Move in L0->L1 (#10188)
Summary:
In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify.
1. We always try to find L0->L1 trivial move from the oldest files. Keep including newer files, until adding a new file won't trigger a trivial move
2. Modify the trivial move condition so that this compaction would be tagged as trivial move.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188

Test Plan:
See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:

./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes

The throughput improved by about 50%. Stalling still happens though.

Reviewed By: jay-zhuang

Differential Revision: D37224743

fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
2022-06-30 18:04:23 -07:00
sdong b397dcd390 Change The Way Level Target And Compaction Score Are Calculated (#10057)
Summary:
The current level targets for dynamical leveling has a problem: the target level size will dramatically change after a L0->L1 compaction. When there are many L0 bytes, lower level compactions are delayed, but they will be resumed after the L0->L1 compaction finishes, so the expected write amplification benefits might not be realized. The proposal here is to revert the level targetting size, but instead relying on adjusting score for each level to prioritize levels that need to compact most.
Basic idea:
(1) target level size isn't adjusted, but score is adjusted. The reasoning is that with parallel compactions, holding compactions from happening might not be desirable, but we would like the compactions are scheduled from the level we feel most needed. For example, if we have a extra-large L2, we would like all compactions are scheduled for L2->L3 compactions, rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on. Should we compact L2->L3 or L4->L5. So the proposal for that is:
(2) the score is calculated by actual level size / (target size + estimated upper bytes coming down). The reasoning is that if we have a large amount of pending L0/L1 bytes coming down, compacting L2->L3 might be more expensive, as when the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, when the amount of bytes coming down to L5, the impacts to L5->L6 fanout are much less. So when calculating target score, we can adjust it by adding estimated downward bytes to the target level size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10057

Test Plan: Repurpose tests VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.

Reviewed By: ajkr

Differential Revision: D37539742

fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4
2022-06-30 13:32:47 -07:00
sdong 6115254416 Fix A Bug Where Concurrent Compactions Cause Further Slowing Down (#10270)
Summary:
Currently, when installing a new super version, when stalling condition triggers, we compare estimated compaction bytes to previously, and if the new value is larger or equal to the previous one, we reduce the slowdown write rate. However, if concurrent compactions happen, the same value might be used. The result is that, although some compactions reduce estimated compaction bytes, we treat them as a signal for further slowing down. In some cases, it causes slowdown rate drops all the way to the minimum, far lower than needed.

Fix the bug by not triggering a re-calculation if a new super version doesn't have Version or a memtable change. With this fix, number of compaction finishes are still undercounted in this algorithm, but it is still better than the current bug where they are negatively counted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270

Test Plan: Run a benchmark where the slowdown rate is dropped to minimal unnessarily and see it is back to a normal value.

Reviewed By: ajkr

Differential Revision: D37497327

fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4
2022-06-29 11:20:36 -07:00
Yanqin Jin b6cfda1283 Support `iter_start_ts` for backward iteration (#10200)
Summary:
Resolves https://github.com/facebook/rocksdb/issues/9761

With this PR, applications can create an iterator with the following
```cpp
ReadOptions read_opts;
read_opts.timestamp = &ts_ub;
read_opts.iter_start_ts = &ts_lb;
auto* it = db->NewIterator(read_opts);
it->SeekToLast();
// or it->SeekForPrev("foo");
it->Prev();
...
```
The application can access different versions of the same user key via `key()`, `value()`, and `timestamp()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10200

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D37258074

Pulled By: riversand963

fbshipit-source-id: 3f0b866ade50dcff7ef60d506397a9dd6ec91565
2022-06-28 19:51:05 -07:00
Gang Liao d7ebb58cb5 Add blob cache tickers, perf context statistics, and DB properties (#10203)
Summary:
In order to be able to monitor the performance of the new blob cache, we made the follow changes:
- Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics)
- Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context)
- Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10203

Reviewed By: ltamasi

Differential Revision: D37478658

Pulled By: gangliao

fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40
2022-06-28 13:52:35 -07:00
sdong 246d469750 Reduce overhead of SortFileByOverlappingRatio() (#10161)
Summary:
Currently SortFileByOverlappingRatio() is O(nlogn). It is usually OK but When there are a lot of files in an LSM-tree, SortFileByOverlappingRatio() can take non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where compaction is only trivial move and this operation becomes the bottleneck and limit the total throughput. This commit makes SortFileByOverlappingRatio() only find the top 50 files based on score. 50 files are usually enough for the parallel compactions needed for the level, and in case it is not enough, we would fall back to random, which should be acceptable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161

Test Plan:
Run a fillseq that generates a lot of files, and observe throughput improved (although stall is not yet eliminated). The command ran:

TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000

The throughput improved by 11%.

Reviewed By: jay-zhuang

Differential Revision: D37129469

fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d
2022-06-24 14:01:11 -07:00
Yanqin Jin 725df120e9 Fix race condition between file purge and backup/checkpoint (#10187)
Summary:
Resolves https://github.com/facebook/rocksdb/issues/10129

I extracted this fix from https://github.com/facebook/rocksdb/issues/7516 since it's also already a bug in main branch, and we want to
separate it from the main part of the PR.

There can be a race condition between two threads. Thread 1 executes
`DBImpl::FindObsoleteFiles()` while thread 2 executes `GetSortedWals()`.
```
Time   thread 1                                thread 2
  |  mutex_.lock
  |  read disable_delete_obsolete_files_
  |  ...
  |  wait on log_sync_cv and release mutex_
  |                                          mutex_.lock
  |                                          ++disable_delete_obsolete_files_
  |                                          mutex_.unlock
  |                                          mutex_.lock
  |                                          while (pending_purge_obsolete_files > 0) { bg_cv.wait;}
  |                                          wake up with mutex_ locked
  |                                          compute WALs tracked by MANIFEST
  |                                          mutex_.unlock
  |  wake up with mutex_ locked
  |  ++pending_purge_obsolete_files_
  |  mutex_.unlock
  |
  |  delete obsolete WAL
  |                                          WAL missing but tracked in MANIFEST.
  V
```

The fix proposed eliminates the possibility of the above by increasing
`pending_purge_obsolete_files_` before `FindObsoleteFiles()` can possibly release the mutex.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10187

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D37214235

Pulled By: riversand963

fbshipit-source-id: 556ab1b58ae6d19150169dfac4db08195c797184
2022-06-23 18:32:25 -07:00
Yueh-Hsuan Chiang 2a3792edfc Add suggest_compact_range() and suggest_compact_range_cf() to C API. (#10175)
Summary:
Add suggest_compact_range() and suggest_compact_range_cf() to C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175

Test Plan:
As verifying the result requires SyncPoint, which is not available in the c_test.c,
the test is currently done by invoking the functions and making sure it does not crash.

Reviewed By: jay-zhuang

Differential Revision: D37305191

Pulled By: ajkr

fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
2022-06-23 16:25:25 -07:00
Baptiste Lemaire 5879053fd0 Dynamically changeable `MemPurge` option (#10011)
Summary:
**Summary**
Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.

**Motivation**
RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible.
Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.

**Content of this PR**
This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.

**Benchmarking**
I will add numbers to prove that there is no performance impact within the next 12 hours.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011

Reviewed By: pdillinger

Differential Revision: D36462357

Pulled By: bjlemaire

fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
2022-06-23 09:42:18 -07:00
Yueh-Hsuan Chiang e103b87296 Add get_column_family_metadata() and related functions to C API (#10207)
Summary:
* Add metadata related structs and functions in C API, including
  - `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()`
     that returns `rocksdb_column_family_metadata_t`.
  - `rocksdb_column_family_metadata_t` and its get functions & destroy function.
  - `rocksdb_level_metadata_t` and its and its get functions & destroy function.
  - `rocksdb_file_metadata_t` and its and get functions & destroy functions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207

Test Plan:
Extend the existing c_test.c to include additional checks for column_family_metadata
inside CheckCompaction.

Reviewed By: riversand963

Differential Revision: D37305209

Pulled By: ajkr

fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e
2022-06-22 15:00:28 -07:00
Peter Dillinger 84210c9489 Add data block hash index to crash test, fix MultiGet issue (#10220)
Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues.

Fixes https://github.com/facebook/rocksdb/issues/10186

I intend to pick this into the 7.4.0 release candidate

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220

Test Plan:
Failure quickly reproduces in crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test I believe would be too tricky and fragile
to be worthwhile.

Reviewed By: anand1976

Differential Revision: D37315647

Pulled By: pdillinger

fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
2022-06-21 16:23:58 -07:00
Peter Dillinger 6358e1b967 Start release 7.5 development (#10204)
Summary:
Update HISTORY.md and version.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10204

Test Plan: version bump only

Reviewed By: ajkr

Differential Revision: D37271866

Pulled By: pdillinger

fbshipit-source-id: 0ccaa2af36648a5b6017c172a7826a244e1aec93
2022-06-20 07:12:39 -07:00
Peter Dillinger fac7a23685 Update HISTORY for 7.4.0 release freeze (#10196)
Summary:
Planned for Sunday 6/19

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10196

Test Plan: no code

Reviewed By: akankshamahajan15

Differential Revision: D37244857

Pulled By: pdillinger

fbshipit-source-id: afbf4aa201983b3c01c16b5f55c68f2325d17421
2022-06-19 16:31:16 -07:00
Changyu Bi 0e0a19832e Fix a bug in WriteBatchInternal::Append when write batch KV protection is turned on (#10201)
Summary:
This bug was discovered after write batch checksum verification before WAL is added (https://github.com/facebook/rocksdb/issues/10114) and stress test with write batch checksum protection is turned on (https://github.com/facebook/rocksdb/issues/10037). In this [line](d5d8920f2c/db/write_batch.cc (L2887)), the number of checksums may not be consistent with `batch->Count()`. This PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10201

Test Plan:
```
./db_stress --batch_protection_bytes_per_key=8 --destroy_db_initially=1 --max_key=100000 --use_txn=1
```

Reviewed By: ajkr

Differential Revision: D37260799

Pulled By: cbi42

fbshipit-source-id: ff8dce7dcce295d689333bc9d892d17a843bf0ea
2022-06-18 15:12:17 -07:00
Andrew Kryczka d5d8920f2c Fix race condition with WAL tracking and `FlushWAL(true /* sync */)` (#10185)
Summary:
`FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced.

The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`.

An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB.

```
Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185

Test Plan:
- repro unit test
- the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list"

Reviewed By: riversand963

Differential Revision: D37200993

Pulled By: ajkr

fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f
2022-06-17 16:45:28 -07:00
Hui Xiao a5d773e077 Add rate-limiting support to batched MultiGet() (#10159)
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.

However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159

Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of  `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
   - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work

Reviewed By: ajkr, anand1976

Differential Revision: D37135172

Pulled By: hx235

fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
2022-06-17 16:40:47 -07:00
Peter Dillinger 1aac814578 Use optimized folly DistributedMutex in LRUCache when available (#10179)
Summary:
folly DistributedMutex is faster than standard mutexes though
imposes some static obligations on usage. See
https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h
for details. Here we use this alternative for our Cache implementations
(especially LRUCache) for better locking performance, when RocksDB is
compiled with folly.

Also added information about which distributed mutex implementation is
being used to cache_bench output and to DB LOG.

Intended follow-up:
* Use DMutex in more places, perhaps improving API to support non-scoped
locking
* Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently)

Credit: Thanks Siying for reminding me about this line of work that was previously
left unfinished.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179

Test Plan:
for correctness, existing tests. CircleCI config updated.
Also Meta-internal buck build updated.

For performance, ran simultaneous before & after cache_bench. Out of three
comparison runs, the middle improvement to ops/sec was +21%:

Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode
compiler)

```
Complete in 20.201 s; Rough parallel ops/sec = 1584062
Thread ops/sec = 107176

Operation latency (ns):
Count: 32000000 Average: 9257.9421  StdDev: 122412.04
Min: 134  Median: 3623.0493  Max: 56918500
Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63
```

New: (add USE_FOLLY=1)

```
Complete in 16.674 s; Rough parallel ops/sec = 1919135  (+21%)
Thread ops/sec = 135487

Operation latency (ns):
Count: 32000000 Average: 7304.9294  StdDev: 108530.28
Min: 132  Median: 3777.6012  Max: 91030902
Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83
```

Reviewed By: anand1976

Differential Revision: D37182983

Pulled By: pdillinger

fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1
2022-06-17 13:08:45 -07:00