Commit Graph

300 Commits

Author SHA1 Message Date
anand76 6cc7ad15b6 Implement secondary cache admission policy to allow all evicted blocks (#12599)
Summary:
Add a secondary cache admission policy to admit all blocks evicted from the block cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12599

Reviewed By: pdillinger

Differential Revision: D56891760

Pulled By: anand1976

fbshipit-source-id: 193c98c055aa3477f4e3a78e5d3daef27a5eacf4
2024-05-02 11:23:35 -07:00
anand76 6349da612b Update HISTORY.md and version to 9.3.0 (#12601)
Summary:
Update HISTORY.md for 9.2 and version to 9.3.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12601

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D56845901

Pulled By: anand1976

fbshipit-source-id: 0d1137a6568e4712be2f8b705f4f7b438217dbed
2024-05-01 16:33:04 -07:00
Yu Zhang 241253053a Fix delete obsolete files on recovery not rate limited (#12590)
Summary:
This PR fix the issue that deletion of obsolete files during DB::Open are not rate limited.

The root cause is slow deletion is disabled if trash/db size ratio exceeds the configured `max_trash_db_ratio` d610e14f93/include/rocksdb/sst_file_manager.h (L126) however, the current handling in DB::Open starts with tracking nothing but the obsolete files. This will make the ratio always look like it's 1.

In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move these two places that attempts to delete file after the tracking are done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call.

There are some other aesthetic changes like refactoring collecting all the DB paths into a function, rename `DBImp::DeleteUnreferencedSstFiles` to `DBImpl:: MaybeUpdateNextFileNumber` as it doesn't actually delete the files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590

Test Plan: Added unit test and verified with manual testing

Reviewed By: anand1976

Differential Revision: D56830519

Pulled By: jowlyzhang

fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885
2024-05-01 12:26:54 -07:00
Hui Xiao abd6751aba Fix wrong padded bytes being used to generate file checksum (#12598)
Summary:
**Context/Summary:**

https://github.com/facebook/rocksdb/pull/12542 introduced a bug where wrong padded bytes used to generate file checksum if flush happens during padding. This PR fixed it along with an existing same bug for `perform_data_verification_=true`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598

Test Plan:
- New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption:  ...file checksum mismatch`) and passes after
- Benchmark
```
TEST_TMPDIR=/dev/shm  ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG    300 runs] : 421334 (± 4126) ops/sec;   46.6 (± 0.5) MB/sec
Post-PR: (no regression observed but a slight improvement)
fillseq [AVG    300 runs] : 425768 (± 4309) ops/sec;   47.1 (± 0.5) MB/sec

Reviewed By: ajkr, anand1976

Differential Revision: D56725688

Pulled By: hx235

fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5
2024-04-30 15:38:53 -07:00
Peter Dillinger 45c105104b Set optimize_filters_for_memory by default (#12377)
Summary:
This feature has been around for a couple of years and users haven't reported any problems with it.

Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377

Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in.

Reviewed By: jowlyzhang

Differential Revision: D54129517

Pulled By: pdillinger

fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
2024-04-30 08:33:31 -07:00
Changyu Bi 5c1334f763 DeleteRange() return NotSupported if row_cache is configured (#12512)
Summary:
...since this feature combination is not supported yet (https://github.com/facebook/rocksdb/issues/4122).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12512

Test Plan: new unit test.

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D55820323

Pulled By: cbi42

fbshipit-source-id: eeb5e97d15c9bdc388793a2fb8e52cfa47e34bcf
2024-04-29 16:33:13 -07:00
Andrew Kryczka b2931a5c53 Fixed `MultiGet()` error handling to not skip blob dereference (#12597)
Summary:
See comment at top of the test case and release note.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597

Reviewed By: jaykorean

Differential Revision: D56718786

Pulled By: ajkr

fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f
2024-04-29 14:18:42 -07:00
anand76 e36b0a2da4 Fix corruption bug when recycle_log_file_num changed from 0 (#12591)
Summary:
When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 || s >= first_seqno_` in `rocksdb::MemTable::Add`.

We could fix this by either 1) Writing a special record when reusing a log file, or 2) Implement more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) Not reuse files created by a previous DB session. We choose option 3 as its the simplest, and flipping `recycle_log_file_num` is expected to be a rare event.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591

Test Plan: 1. Add a unit test to verify the bug and fix

Reviewed By: jowlyzhang

Differential Revision: D56655812

Pulled By: anand1976

fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87
2024-04-29 12:25:00 -07:00
Andrew Kryczka 2ec25a3e54 Prevent data block compression with `BlockBasedTableOptions::block_align` (#12592)
Summary:
Made `BlockBasedTableOptions::block_align` incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`. Previously it was only incompatible with `ColumnFamilyOptions::compression`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12592

Reviewed By: hx235

Differential Revision: D56650862

Pulled By: ajkr

fbshipit-source-id: f5201602c2ce436e6d8d30893caa6a161a61f141
2024-04-26 20:05:30 -07:00
Andrew Kryczka b4c520cadc change default `CompactionOptions::compression` while deprecating it (#12587)
Summary:
I had a TODO to complete `CompactionOptions`'s compression API but never did it: d610e14f93/db/compaction/compaction_picker.cc (L371-L373)

Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`.

In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587

Reviewed By: hx235

Differential Revision: D56619273

Pulled By: ajkr

fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e
2024-04-26 13:03:21 -07:00
Jay Huh 1fca175eec MultiCFSnapshot for NewIterators() API (#12573)
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/12561 and https://github.com/facebook/rocksdb/issues/12566 , `NewIterators()` API has not been providing consistent view of the db across multiple column families. This PR addresses it by utilizing `MultiCFSnapshot()` function which has been used for `MultiGet()` APIs. To be able to obtain the thread-local super version with ref, `sv_exclusive_access` parameter has been added to `MultiCFSnapshot()` so that we could call `GetReferencedSuperVersion()` or `GetAndRefSuperVersion()` depending on the param and support `Refresh()` API for MultiCfIterators

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12573

Test Plan:
**Unit Tests Added**

```
./db_iterator_test --gtest_filter="*IteratorsConsistentView*"
```
```
./multi_cf_iterator_test -- --gtest_filter="*ConsistentView*"
```

**Performance Check**

Setup
```
make -j64 release
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none
```

Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000
```
Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom :       6.374 micros/op 156892 ops/sec 6.374 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom :       6.265 micros/op 159627 ops/sec 6.265 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: jowlyzhang

Differential Revision: D56444066

Pulled By: jaykorean

fbshipit-source-id: 327ce73c072da30c221e18d4f3389f49115b8f99
2024-04-24 15:28:55 -07:00
Andrew Kryczka 6807da0b44 Fix `DisableManualCompaction()` hang (#12578)
Summary:
Prior to this PR the following sequence could happen:

1. `RunManualCompaction()` A schedules compaction to thread pool and waits
2. `RunManualCompaction()` B waits without scheduling anything due to conflict
3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both
4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done
5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool
6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction
7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution
8. `DisableManualCompaction()` returns as there are no more manual compactions running

Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward:

5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done
6'. `DisableManualCompaction()` returns as there are no more manual compactions running

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578

Reviewed By: cbi42

Differential Revision: D56528144

Pulled By: ajkr

fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b
2024-04-24 12:40:36 -07:00
Andrew Kryczka 3f3045a405 fix DeleteRange+memtable_insert_with_hint_prefix_extractor interaction (#12558)
Summary:
Previously `insert_hints_` was used for both point key table (`table_`) and range deletion table (`range_del_table_`). Hints include pointers to table data, so mixing hints for different tables together without tracking which hint corresponds to which table was problematic. We can just make the hints dedicated to the point key table only.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12558

Reviewed By: hx235

Differential Revision: D56279019

Pulled By: ajkr

fbshipit-source-id: 00fe5ce72f9f11a1c1cba5f1977b908b2d518f29
2024-04-22 20:13:58 -07:00
Hui Xiao 7d83b4e3e5 Fix file checksum mismatch due to padded bytes when block_align=true (#12542)
Summary:
**Context/Summary:**
When `BlockBasedTableOptions::block_align=true`, we pad bytes to align blocks d41e568b1c/table/block_based/block_based_table_builder.cc (L1415-L1421).
Those bytes are not included in generating the file checksum upon file creation. But `VerifyFileChecksums()` includes those bytes in generating the file check to compare against the checksum generating upon file creation. Therefore a file checksum mismatch is returned in `VerifyFileChecksums()`.

We decided to include those padded bytes in generating the checksum upon file creation.

Bonus: also fix surrounding code to use actual padded bytes for verification - see https://github.com/facebook/rocksdb/pull/12542#discussion_r1571429163

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12542

Test Plan:
- New UT
- Benchmark
```
TEST_TMPDIR=/dev/shm  ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG    300 runs] : 422857 (± 3942) ops/sec;   46.8 (± 0.4) MB/sec
Post-PR:
fillseq [AVG    300 runs] : 424707 (± 3799) ops/sec;   47.0 (± 0.4) MB/sec

Reviewed By: ajkr

Differential Revision: D56168447

Pulled By: hx235

fbshipit-source-id: 96209ef950d42943d336f11968ae3fcf9872fc2c
2024-04-22 14:07:34 -07:00
anand76 97991960e9 Retry DB::Open upon a corruption detected while reading the MANIFEST (#12518)
Summary:
This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427 . On file systems that support storage level data checksum and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but its a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518

Reviewed By: ajkr

Differential Revision: D55932155

Pulled By: anand1976

fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9
2024-04-18 17:36:33 -07:00
Levi Tamasi e82fe7c0b7 Fix the move semantics of PinnableWideColumns (#12557)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12557

Unlike for other sequence containers, the C++ standard allows moving an `std::string` to invalidate pointers/iterators/references. In practice, this happens with short strings which are stored "inline" in the `std::string` object (small string optimization). Since `PinnableSlice` uses `std::string` as its internal buffer, and `PinnableWideColumns` in turn is implemented in terms of `PinnableSlice`, this means that the default compiler-generated move operations can invalidate the column index stored in `PinnableWideColumns::columns_`. The PR fixes this by providing custom move constructor/move assignment implementations for `PinnableWideColumns` that recreate the `columns_` index upon move.

Reviewed By: jaykorean

Differential Revision: D56275054

fbshipit-source-id: e8648c003dbcf1c39ec122ad229780c28138e730
2024-04-17 18:56:23 -07:00
Jay Huh 4f584652ab Add an option to wait for purge in WaitForCompact (#12520)
Summary:
Adding an option to wait for purge to complete in `WaitForCompact` API.

Internally, RocksDB has a way to wait for purge to complete (e.g. TEST_WaitForPurge() in db_impl_debug.cc), but there's no public API available for gracefully wait for purge to complete.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12520

Test Plan:
Unit Test Added - `WaitForCompactWithWaitForPurgeOptionTest`
```
./deletefile_test -- --gtest_filter="*WaitForCompactWithWaitForPurgeOptionTest*"
```

Existing Tests
```
./db_compaction_test -- --gtest_filter="*WaitForCompactWithOption*"
```

Reviewed By: ajkr

Differential Revision: D55888283

Pulled By: jaykorean

fbshipit-source-id: cfc6d6e8657deaefab8961890b36e390095c9f65
2024-04-17 17:33:27 -07:00
Andrew Kryczka 7027265417 Fix `max_successive_merges` counting CPU overhead regression (#12546)
Summary:
In https://github.com/facebook/rocksdb/issues/12365 we made `max_successive_merges` non-strict by default. Before https://github.com/facebook/rocksdb/issues/12365, `CountSuccessiveMergeEntries()`'s scan was implicitly limited to `max_successive_merges` entries for a given key, because after that the merge operator would be invoked and the merge chain would be collapsed. After https://github.com/facebook/rocksdb/issues/12365, the merge chain will not be collapsed no matter how long it is when the chain's operands are not all in memory. Since `CountSuccessiveMergeEntries()` scanned the whole merge chain, https://github.com/facebook/rocksdb/issues/12365 had a side effect that it would scan more memtable entries. This PR introduces a limit so it won't scan more entries than it could before.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12546

Reviewed By: jaykorean

Differential Revision: D56193693

Pulled By: ajkr

fbshipit-source-id: b070ba0703ef733e0ff230f89cd5cca5233b84da
2024-04-17 12:11:24 -07:00
Levi Tamasi c0aef2a28e Add MultiGetEntityFromBatchAndDB to WriteBatchWithIndex (#12539)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12539

As a follow-up to https://github.com/facebook/rocksdb/pull/12533, this PR extends `WriteBatchWithIndex` with a `MultiGetEntityFromBatchAndDB` API that enables users to perform batched wide-column point lookups with read-your-own-writes consistency. This API transparently combines data from the indexed write batch and the underlying database as needed and presents the results in the form of a wide-column entity.

Reviewed By: jaykorean

Differential Revision: D56153145

fbshipit-source-id: 537967051b7521bb41b04070ac1a78a1d8873c08
2024-04-16 08:58:04 -07:00
Jay Huh d34712e0ac MultiCfIterator - AttributeGroupIter Impl & CoalescingIter Optimization (#12534)
Summary:
Continuing from the previous MultiCfIterator Implementations - (https://github.com/facebook/rocksdb/issues/12422, https://github.com/facebook/rocksdb/issues/12480 #12465), this PR completes the `AttributeGroupIterator` by implementing `AttributeGroupIteratorImpl::AddToAttributeGroups()`. While implementing the `AttributeGroupIterator`, we had to make some changes in `MultiCfIteratorImpl` and found an opportunity to improve `Coalesce()` in `CoalescingIterator`.

Lifting `UNDER CONSTRUCTION - DO NOT USE` comment by replacing it with `EXPERIMENTAL`

Here are some implementation details:
- `IteratorAttributeGroups` is introduced to avoid having to copy all `WideColumn` objects during iteration.
- `PopulateIterator()` no longer advances non-top iterators that have the same key as the top iterator in the heap.
- `AdvanceIterator()` needs to advance the non-top iterators when they have the same key as the top iterator in the heap.
- Instead of populating one by one, `PopulateIterator()` now collects all items with the same key and calls `populate_func(items)` at once.
- This allowed optimization in `Coalesce()` such that we no longer do K-1 rounds of 2-way merge, but do one K-way merge instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12534

Test Plan:
Uncommented the assertions in `verifyAttributeGroupIterator()`

```
./multi_cf_iterator_test
```

Reviewed By: ltamasi

Differential Revision: D56089019

Pulled By: jaykorean

fbshipit-source-id: 6b0b4247e221f69b40b147d41492008cc9b15054
2024-04-16 08:45:38 -07:00
Levi Tamasi 491c4fb0ed Add GetEntityFromBatchAndDB to WriteBatchWithIndex (#12533)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12533

The PR extends `WriteBatchWithIndex` with a new wide-column point lookup API `GetEntityFromBatchAndDB`. Similarly to `GetFromBatchAndDB`, the new API can transparently combine data from the write batch with data from the underlying database as needed. Like `DB::GetEntity`, it returns any result in the form of a wide-column entity (i.e. plain key-values are wrapped into an entity with a single anonymous column).

Reviewed By: jaykorean

Differential Revision: D56069132

fbshipit-source-id: 4f19cdeea4ce136497ce79fc9d28c925de59e220
2024-04-15 09:20:47 -07:00
liuhu 6fbd02f258 Add dump all keys for cache dumper impl (#12500)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12501

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12500

Reviewed By: jowlyzhang

Differential Revision: D55922379

Pulled By: ajkr

fbshipit-source-id: 0759afcec148d256a2d1cd5ef76fd988fab4a9af
2024-04-12 10:47:13 -07:00
liuhu b7f1eeb0ca Cache dumper exit early due to deadline or max_dumped_size (#12491)
Summary:
In production, we need to control the duration time or max size of cache dumper to get better performance.
Fixes https://github.com/facebook/rocksdb/issues/12494

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12491

Reviewed By: hx235

Differential Revision: D55905826

Pulled By: ajkr

fbshipit-source-id: 9196a5e852c344d6783f7a8234e997c87215bd19
2024-04-11 21:56:45 -07:00
Yu Zhang 74d419be4d Add support in SstFileReader to get a raw table iterator (#12385)
Summary:
This PR adds support to programmatically iterate a raw table file with an iterator returned by `SstFileReader::NewTableIterator`. For third party tools to use to observe SST files created by RocksDB.

The original feature request was from this merge request: https://github.com/facebook/rocksdb/pull/12370

Since keys returned by raw table iterators are internal keys, this PR also adds a struct `ParsedEntryInfo` and util method `ParseEntry` to support user to parse internal key. `GetInternalKeyForSeek`, and `GetInternalKeyForSeekForPrev` to support users to create internal keys for seek operations with this raw table iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12385

Test Plan: Added unit tests

Reviewed By: cbi42

Differential Revision: D55662855

Pulled By: jowlyzhang

fbshipit-source-id: 0716a173ee95924fbd4e1f9b6cccf06525c40049
2024-04-02 21:23:06 -07:00
Hui Xiao d985902ef4 Disallow refitting more than 1 file from non-L0 to L0 (#12481)
Summary:
**Context/Summary:**
We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than 1 files to L0. This refitting can cause read performance regression as we need to go through every file in L0, corruption in some edge case and false positive corruption caught by force consistency check. We decided to explicitly disallow such behavior.

A related change to OptionChangeMigration():
- When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrate data into a couple L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data upon migration finishes when the migrated data is larger than max_table_files_size. This is achieved by first compacting all the data into a couple non-L0 files and refitting those files from non-L0 to L0 if needed. In that way, only some data instead of all data will be dropped immediately after migration to FIFO with a max_table_files_size.
- Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and explicitly state such risk in API comment.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481

Test Plan:
- New UT
- Modified UT

Reviewed By: cbi42

Differential Revision: D55351178

Pulled By: hx235

fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80
2024-03-29 10:52:36 -07:00
akankshamahajan 1856734821 Branch cut 9.1.fb (#12476)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12476

Reviewed By: jowlyzhang

Differential Revision: D55319508

Pulled By: akankshamahajan15

fbshipit-source-id: 2b6db671e027511282775c0fea155335d8e73cc2
2024-03-25 15:07:43 -07:00
Andrew Kryczka bf98dcf9a8 Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462)
Summary:
The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file.

There's also fixes for bugs encountered along the way where non-ok statuses were ignored/overwritten, and a bit of plumbing work for functions that had no capability to return a status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462

Test Plan:
A repro command

```
db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp
./db_stress \
        --clear_column_family_one_in=0 \
        --test_batches_snapshots=0 \
        --write_fault_one_in=0 \
        --use_put_entity_one_in=0 \
        --prefixpercent=0 \
        --read_fault_one_in=0 \
        --readpercent=0 \
        --reopen=0 \
        --set_options_one_in=10000 \
        --delpercent=0 \
        --delrangepercent=0 \
        --open_metadata_write_fault_one_in=0 \
        --open_read_fault_one_in=0 \
        --open_write_fault_one_in=0 \
        --destroy_db_initially=0 \
        --ingest_external_file_one_in=0 \
        --iterpercent=0 \
        --nooverwritepercent=0 \
        --db=$db \
        --enable_blob_files=1 \
        --expected_values_dir=$exp \
        --max_background_compactions=20 \
        --max_bytes_for_level_base=2097152 \
        --max_key=100000 \
        --min_blob_size=0 \
        --open_files=-1 \
        --ops_per_thread=100000000 \
        --prefix_size=-1 \
        --target_file_size_base=524288 \
        --use_merge=1 \
        --value_size_mult=32 \
        --write_buffer_size=524288 \
        --writepercent=100
```

It used to fail like:

```
...
frame https://github.com/facebook/rocksdb/issues/9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3
frame https://github.com/facebook/rocksdb/issues/10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, rocksdb::PinnedIteratorsManager*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33
frame https://github.com/facebook/rocksdb/issues/11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492
frame https://github.com/facebook/rocksdb/issues/12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408
frame https://github.com/facebook/rocksdb/issues/13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109
frame https://github.com/facebook/rocksdb/issues/14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656
frame https://github.com/facebook/rocksdb/issues/15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607
frame https://github.com/facebook/rocksdb/issues/16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505
frame https://github.com/facebook/rocksdb/issues/17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084
frame https://github.com/facebook/rocksdb/issues/18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807
frame https://github.com/facebook/rocksdb/issues/19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312
frame https://github.com/facebook/rocksdb/issues/20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157
frame https://github.com/facebook/rocksdb/issues/21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544
frame https://github.com/facebook/rocksdb/issues/22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72
frame https://github.com/facebook/rocksdb/issues/23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317
frame https://github.com/facebook/rocksdb/issues/24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148
...
```

Reviewed By: ltamasi

Differential Revision: D55157795

Pulled By: ajkr

fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764
2024-03-21 12:38:53 -07:00
anand76 63a105a481 Enable recycle_log_file_num option for point in time recovery (#12403)
Summary:
This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL.

This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403

Test Plan:
1. Add new unit tests to verify behavior upon corruption
2. Re-enable disabled tests for verifying recycling behavior

Reviewed By: ajkr

Differential Revision: D54544824

Pulled By: anand1976

fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a
2024-03-21 12:29:35 -07:00
anand76 98d8a85624 New PerfContext counters for block cache bytes read (#12459)
Summary:
Add PerfContext counters for measuring the cumulative size of blocks found in the block cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12459

Reviewed By: ajkr

Differential Revision: D55170694

Pulled By: anand1976

fbshipit-source-id: 8cbba76eece116cefce7f00e2fc9d74757661d25
2024-03-21 10:46:46 -07:00
Yu Zhang 13e1c32a18 Follow ups for TimedPut and write time property (#12455)
Summary:
This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including:

1) Handle a special case for `WriteBatch::TimedPut`. When the user specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error, but it instead creates and writes a regular `Put` entry.

2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries.

3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob currently copy the DB's seqno to time mapping while holding db mutex and only copies the part of interest, a.k.a, the part that only goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to preferred seqno as much as possible.

Test plans:
Added unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455

Reviewed By: pdillinger

Differential Revision: D55165422

Pulled By: jowlyzhang

fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47
2024-03-21 10:00:15 -07:00
anand76 4868c10b44 Retry block reads on checksum mismatch (#12427)
Summary:
On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.

Tests:
Add new unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427

Reviewed By: ajkr

Differential Revision: D55025941

Pulled By: anand1976

fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
2024-03-18 16:16:05 -07:00
Andrew Kryczka 4d5ebad971 Fix kBlockCacheTier read with table cache miss (#12443)
Summary:
Thanks ltamasi for pointing out this bug.

We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443

Reviewed By: cbi42

Differential Revision: D54930128

Pulled By: ajkr

fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d
2024-03-15 14:41:58 -07:00
Andrew Kryczka 3f5bd46a07 Add `ContinueCallback` to `GetMergeOperands()` (#12438)
Summary:
The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because:

- Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge
- The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me

If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438

Reviewed By: hx235

Differential Revision: D54914669

Pulled By: ajkr

fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a
2024-03-15 12:25:49 -07:00
Changyu Bi 096fb9b67d Fix data race in WalManager (#12439)
Summary:
Crash tests were failing due to data race in accessing `purge_wal_files_last_run_`. This PR changes it to atomic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439

Test Plan:
- existing UT
- not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests

Reviewed By: anand1976

Differential Revision: D54920817

Pulled By: cbi42

fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6
2024-03-14 21:24:06 -07:00
Levi Tamasi 7c290f72b8 Implement WriteBatchWithIndex::GetEntityFromBatch (#12424)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12424

The PR adds a wide-column point lookup API `GetEntityFromBatch` to `WriteBatchWithIndex`. Similarly to APIs like `DB::GetEntity`, this new API returns wide-column entities as-is, and wraps plain values in an entity with a single column (the anonymous default column). Also, similarly to `WriteBatchWithIndex::GetFromBatch`, it only reads data from the batch itself.

Reviewed By: jaykorean

Differential Revision: D54826535

fbshipit-source-id: 92604f3ebd90fe1afbd36f2d2194b7dee0011efa
2024-03-14 10:45:49 -07:00
Alan Paxton d9a441113e JNI get_helper code sharing / multiGet() use efficient batch C++ support (#12344)
Summary:
Implement RAII-based helpers for JNIGet() and multiGet()

Replace JNI C++ helpers `rocksdb_get_helper, rocksdb_get_helper_direct`, `multi_get_helper`, `multi_get_helper_direct`, `multi_get_helper_release_keys`, `txn_get_helper`, and `txn_multi_get_helper`.

The model is to entirely do away with a single helper, instead a number of utility methods allow each separate
JNI `Get()` and `MultiGet()` method to organise their parameters efficiently, then call the underlying C++ `db->Get()`,
`db->MultiGet()`, `txn->Get()`, or `txn->MultiGet()` method itself, and use further utilities to retrieve results.

Roughly speaking:

* get keys into C++ form
* Call C++ Get()
* get results and status into Java form

We achieve a useful performance gain as part of this work; by using the updated C++ multiGet we immediately pick up its performance gains (batch improvements to multiGet C++ were previously implemented, but not until now used by Java/JNI). multiGetBB already uses the batched C++ multiGet(), and all other benchmarks show consistent improvement after the changes:

## Before:
```
Benchmark (columnFamilyTestType) (keyCount) (keySize) (multiGetSize) (valueSize) Mode Cnt Score Error Units
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 256 thrpt 25 5315.459 ± 20.465 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 1024 thrpt 25 5673.115 ± 78.299 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 4096 thrpt 25 2616.860 ± 46.994 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 16384 thrpt 25 1700.058 ± 24.034 ops/s
MultiGetNewBenchmarks.multiGetBB200 no_column_family 10000 1024 100 65536 thrpt 25 791.171 ± 13.955 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 256 thrpt 25 6129.929 ± 94.200 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 1024 thrpt 25 7012.405 ± 97.886 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 4096 thrpt 25 2799.014 ± 39.352 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 16384 thrpt 25 1417.205 ± 22.272 ops/s
MultiGetNewBenchmarks.multiGetList10 no_column_family 10000 1024 100 65536 thrpt 25 655.594 ± 13.050 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 256 thrpt 25 6147.247 ± 82.711 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 1024 thrpt 25 7004.213 ± 79.251 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 4096 thrpt 25 2715.154 ± 110.017 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 16384 thrpt 25 1408.070 ± 31.714 ops/s
MultiGetNewBenchmarks.multiGetListExplicitCF20 no_column_family 10000 1024 100 65536 thrpt 25 623.829 ± 57.374 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 256 thrpt 25 6119.243 ± 116.313 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 1024 thrpt 25 6931.873 ± 128.094 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 4096 thrpt 25 2678.253 ± 39.113 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 16384 thrpt 25 1337.384 ± 19.500 ops/s
MultiGetNewBenchmarks.multiGetListRandomCF30 no_column_family 10000 1024 100 65536 thrpt 25 625.596 ± 14.525 ops/s
```

## After:
```
Benchmark                                    (columnFamilyTestType)  (keyCount)  (keySize)  (multiGetSize)  (valueSize)   Mode  Cnt     Score     Error  Units
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100          256  thrpt   25  5191.074 ±  78.250  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         1024  thrpt   25  5378.692 ± 260.682  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100         4096  thrpt   25  2590.183 ±  34.844  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        16384  thrpt   25  1634.793 ±  34.022  ops/s
MultiGetBenchmarks.multiGetBB200                   no_column_family       10000       1024             100        65536  thrpt   25   786.455 ±   8.462  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100          256  thrpt   25  5285.055 ±  11.676  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         1024  thrpt   25  5586.758 ± 213.008  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100         4096  thrpt   25  2527.172 ±  17.106  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        16384  thrpt   25  1819.547 ±  12.958  ops/s
MultiGetBenchmarks.multiGetBB200                    1_column_family       10000       1024             100        65536  thrpt   25   803.861 ±   9.963  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100          256  thrpt   25  5253.793 ±  28.020  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         1024  thrpt   25  5705.591 ±  20.556  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100         4096  thrpt   25  2523.377 ±  15.415  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        16384  thrpt   25  1815.344 ±  11.309  ops/s
MultiGetBenchmarks.multiGetBB200                 20_column_families       10000       1024             100        65536  thrpt   25   820.792 ±   3.192  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100          256  thrpt   25  5262.184 ±  20.477  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         1024  thrpt   25  5706.959 ±  23.123  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100         4096  thrpt   25  2520.362 ±   9.170  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        16384  thrpt   25  1789.185 ±  14.239  ops/s
MultiGetBenchmarks.multiGetBB200                100_column_families       10000       1024             100        65536  thrpt   25   818.401 ±  12.132  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100          256  thrpt   25  6978.310 ±  14.084  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         1024  thrpt   25  7664.242 ±  22.304  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100         4096  thrpt   25  2881.778 ±  81.054  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        16384  thrpt   25  1599.826 ±   7.190  ops/s
MultiGetBenchmarks.multiGetList10                  no_column_family       10000       1024             100        65536  thrpt   25   737.520 ±   6.809  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100          256  thrpt   25  6974.376 ±  10.716  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         1024  thrpt   25  7637.440 ±  45.877  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100         4096  thrpt   25  2820.472 ±  42.231  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        16384  thrpt   25  1716.663 ±   8.527  ops/s
MultiGetBenchmarks.multiGetList10                   1_column_family       10000       1024             100        65536  thrpt   25   755.848 ±   7.514  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100          256  thrpt   25  6943.651 ±  20.040  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         1024  thrpt   25  7679.415 ±   9.114  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100         4096  thrpt   25  2844.564 ±  13.388  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        16384  thrpt   25  1729.545 ±   5.983  ops/s
MultiGetBenchmarks.multiGetList10                20_column_families       10000       1024             100        65536  thrpt   25   783.218 ±   1.530  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100          256  thrpt   25  6944.276 ±  29.995  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         1024  thrpt   25  7670.301 ±   8.986  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100         4096  thrpt   25  2839.828 ±  12.421  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        16384  thrpt   25  1730.005 ±   9.209  ops/s
MultiGetBenchmarks.multiGetList10               100_column_families       10000       1024             100        65536  thrpt   25   787.096 ±   1.977  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100          256  thrpt   25  6896.944 ±  21.530  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         1024  thrpt   25  7622.407 ±  12.824  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100         4096  thrpt   25  2927.538 ±  19.792  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        16384  thrpt   25  1598.041 ±   4.312  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20        no_column_family       10000       1024             100        65536  thrpt   25   744.564 ±   9.236  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100          256  thrpt   25  6853.760 ±  78.041  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         1024  thrpt   25  7360.917 ± 355.365  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100         4096  thrpt   25  2848.774 ±  13.409  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        16384  thrpt   25  1727.688 ±   3.329  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20         1_column_family       10000       1024             100        65536  thrpt   25   776.088 ±   7.517  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100          256  thrpt   25  6910.339 ±  14.366  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         1024  thrpt   25  7633.660 ±  10.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100         4096  thrpt   25  2787.799 ±  81.775  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        16384  thrpt   25  1726.517 ±   6.830  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20      20_column_families       10000       1024             100        65536  thrpt   25   787.597 ±   3.362  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100          256  thrpt   25  6922.445 ±  10.493  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         1024  thrpt   25  7604.710 ±  48.043  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100         4096  thrpt   25  2848.788 ±  15.783  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        16384  thrpt   25  1730.837 ±   6.497  ops/s
MultiGetBenchmarks.multiGetListExplicitCF20     100_column_families       10000       1024             100        65536  thrpt   25   794.557 ±   1.869  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100          256  thrpt   25  6918.716 ±  15.766  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         1024  thrpt   25  7626.692 ±   9.394  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100         4096  thrpt   25  2871.382 ±  72.155  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        16384  thrpt   25  1598.786 ±   4.819  ops/s
MultiGetBenchmarks.multiGetListRandomCF30          no_column_family       10000       1024             100        65536  thrpt   25   748.469 ±   7.234  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100          256  thrpt   25  6922.666 ±  17.131  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         1024  thrpt   25  7623.890 ±   8.805  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100         4096  thrpt   25  2850.698 ±  18.004  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        16384  thrpt   25  1727.623 ±   4.868  ops/s
MultiGetBenchmarks.multiGetListRandomCF30           1_column_family       10000       1024             100        65536  thrpt   25   774.534 ±  10.025  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100          256  thrpt   25  5486.251 ±  13.582  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         1024  thrpt   25  4920.656 ±  44.557  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100         4096  thrpt   25  3922.913 ±  25.686  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        16384  thrpt   25  2873.106 ±   4.336  ops/s
MultiGetBenchmarks.multiGetListRandomCF30        20_column_families       10000       1024             100        65536  thrpt   25   802.404 ±   8.967  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100          256  thrpt   25  4817.996 ±  18.042  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         1024  thrpt   25  4243.922 ±  13.929  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100         4096  thrpt   25  3175.998 ±   7.773  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        16384  thrpt   25  2321.990 ±  12.501  ops/s
MultiGetBenchmarks.multiGetListRandomCF30       100_column_families       10000       1024             100        65536  thrpt   25  1753.028 ±   7.130  ops/s
```

Closes https://github.com/facebook/rocksdb/issues/11518

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12344

Reviewed By: cbi42

Differential Revision: D54809714

Pulled By: pdillinger

fbshipit-source-id: bee3b949720abac073bce043b59ce976a11e99eb
2024-03-12 12:42:08 -07:00
Yu Zhang 210c8df820 Use do_validate flag to control timestamp based validation in WriteCommittedTxn::GetForUpdate (#12369)
Summary:
When PR https://github.com/facebook/rocksdb/issues/9629 introduced user-defined timestamp support for `WriteCommittedTxn`, it adds this usage mandate for API `GetForUpdate` when UDT is enabled. The `do_validate` flag has to be true, and user should have already called `Transaction::SetReadTimestampForValidation` to set a read timestamp for validation. The rationale behind this mandate is this:
1) with do_vaildate = true, `GetForUpdate` could verify this relationships: let's denote the user-defined timestamp in db for the key as  `Ts_db` and the read timestamp user set via `Transaction::SetReadTimestampForValidation` as `Ts_read`. UDT based validation will only pass if `Ts_db <= Ts_read`.
5950907a82/utilities/transactions/transaction_util.cc (L141)

2)  Let's denote the committed timestamp set via `Transaction::SetCommitTimestamp` to be `Ts_cmt`. Later `WriteCommitedTxn::Commit` would only pass if this condition is met: `Ts_read < Ts_cmt`. 5950907a82/utilities/transactions/pessimistic_transaction.cc (L431)

Together these two checks can ensure `Ts_db < Ts_cmt` to meet the user-defined timestamp invariant that newer timestamp should have newer sequence number.

The `do_validate` flag was originally intended to make snapshot based validation optional. If it's true, `GetForUpdate` checks no entry is written after the snapshot. If it's false, it will skip this snapshot based validation. In this PR, we are making the UDT based validation configurable too based on this flag instead of mandating it for below reasons: 1) in some cases the users themselves can enforce aformentioned invariant on their side independently, without RocksDB help, for example, if they are managing a monotonically increasing timestamp, and their transactions are only committed in a single thread. So they don't need this UDT based validation and wants to skip it, 2) It also could be expensive or not practical for users to come up with such a read timestamp that is exactly in between their commit timestamp and the db's timestamp. For example, in aformentioned case where a monotonically increasing timestamp is managed, the users would need to access this timestamp both for setting the read timestamp and for setting the commit timestamp. So it's preferable to skip this check too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12369

Test Plan: added unit tests

Reviewed By: ltamasi

Differential Revision: D54268920

Pulled By: jowlyzhang

fbshipit-source-id: ca7693796f9bb11f376a2059d91841e51c89435a
2024-03-07 14:58:10 -08:00
Andrew Kryczka 27a2473668 Best-effort recovery support for atomic flush (#12406)
Summary:
This PR updates `VersionEditHandlerPointInTime` to recover all or none of the updates in an AtomicGroup. This makes best-effort recovery properly handle atomic flushes during recovery, so the features are now allowed to both be enabled at once.

The new logic requires that AtomicGroups do not contain column family additions or removals. AtomicGroups are currently written for atomic flush, which does not include such edits.

Column family additions or removals are recovered independently of AtomicGroups. The new logic needs to be aware of removal, though, so that a dropped CF does not prevent completion of an AtomicGroup recovery.

The new logic treats each AtomicGroup as if it contains updates for all existing column families, even though it is possible to create AtomicGroups that only affect a subset of column families. This simplifies the logic at the expense of recovering less data in certain edge case scenarios.

The usage of `MaybeCreateVersion()` is pretty tricky. The goal is to create a barrier at the start of an AtomicGroup such that all valid states up to that point will be applied to `versions_`. Here is a summary.

- `MaybeCreateVersion(..., false)` creates a `Version` on a negative edge trigger (transition from valid to invalid). It was  previously called when applying each update. Now, it is only called when applying non-AtomicGroup updates.
- `MaybeCreateVersion(..., true)` creates a `Version` on a positive level trigger (valid state). It was previously called only at the end of iteration. Now, it is additionally called before processing an AtomicGroup.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12406

Reviewed By: jaykorean, cbi42

Differential Revision: D54494904

Pulled By: ajkr

fbshipit-source-id: 0114a9fe1d04b471d086dcab5978ea8a3a56ad52
2024-03-06 14:40:40 -08:00
Peter Dillinger a53ed91691 Fix/improve temperature handling for file ingestion (#12402)
Summary:
Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388

In terms of public API:
* Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre.
* Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general).
* Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature.
* Note that "fail_if_not_bottommost_level" is obsolete/confusing use of "bottommost"

In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed.

Eventual suggested follow-up:
* Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare).
* More temperature handling for CreateColumnFamilyWithImport and Checkpoints.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402

Test Plan:
Deeply revamped
ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer.

Used macros instead of functions for better tracing to critical source location on test failures.

Some enhancements to FileTemperatureTestFS in the process of developing the revamped test.

Reviewed By: jowlyzhang

Differential Revision: D54442794

Pulled By: pdillinger

fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995
2024-03-05 16:56:08 -08:00
Jay Huh 5bcc184975 Update APIs to support generic unique identifier format (#12384)
Summary:
The current design proposes using a combination of `job_id`, `db_id`, and `db_session_id` to create a unique identifier for remote compaction jobs. However, this approach may not be suitable for users who prefer a different format for the unique identifier.

At Meta, we are utilizing generic compute offload to offload compaction tasks to remote workers. The compute offload client generates a UUID for each task, which requires an update to the current RocksDB API for onboarding purposes.

Users still have the option to create the unique identifier by combining `job_id`, `db_id`, and `db_session_id` if they prefer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12384

Test Plan:
```
$> ./compaction_service_test                                                                                                                             13:29:35
[==========] Running 14 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 14 tests from CompactionServiceTest
[ RUN      ] CompactionServiceTest.BasicCompactions
[       OK ] CompactionServiceTest.BasicCompactions (2642 ms)
[ RUN      ] CompactionServiceTest.ManualCompaction
[       OK ] CompactionServiceTest.ManualCompaction (454 ms)
[ RUN      ] CompactionServiceTest.CancelCompactionOnRemoteSide
[       OK ] CompactionServiceTest.CancelCompactionOnRemoteSide (1643 ms)
[ RUN      ] CompactionServiceTest.FailedToStart
[       OK ] CompactionServiceTest.FailedToStart (1332 ms)
[ RUN      ] CompactionServiceTest.InvalidResult
[       OK ] CompactionServiceTest.InvalidResult (1516 ms)
[ RUN      ] CompactionServiceTest.SubCompaction
[       OK ] CompactionServiceTest.SubCompaction (551 ms)
[ RUN      ] CompactionServiceTest.CompactionFilter
[       OK ] CompactionServiceTest.CompactionFilter (563 ms)
[ RUN      ] CompactionServiceTest.Snapshot
[       OK ] CompactionServiceTest.Snapshot (124 ms)
[ RUN      ] CompactionServiceTest.ConcurrentCompaction
[       OK ] CompactionServiceTest.ConcurrentCompaction (660 ms)
[ RUN      ] CompactionServiceTest.CompactionInfo
[       OK ] CompactionServiceTest.CompactionInfo (984 ms)
[ RUN      ] CompactionServiceTest.FallbackLocalAuto
[       OK ] CompactionServiceTest.FallbackLocalAuto (343 ms)
[ RUN      ] CompactionServiceTest.FallbackLocalManual
[       OK ] CompactionServiceTest.FallbackLocalManual (380 ms)
[ RUN      ] CompactionServiceTest.RemoteEventListener
[       OK ] CompactionServiceTest.RemoteEventListener (491 ms)
[ RUN      ] CompactionServiceTest.TablePropertiesCollector
[       OK ] CompactionServiceTest.TablePropertiesCollector (169 ms)
[----------] 14 tests from CompactionServiceTest (11854 ms total)

[----------] Global test environment tear-down
[==========] 14 tests from 1 test case ran. (11855 ms total)
[  PASSED  ] 14 tests.
```

Reviewed By: hx235

Differential Revision: D54220339

Pulled By: jaykorean

fbshipit-source-id: 5a9054f31933d1996adca02082eb37b6d5353224
2024-03-01 09:55:30 -08:00
Changyu Bi 4aed229fa7 Add `write_memtable_time` to perf level `kEnableWait` (#12394)
Summary:
.. so write time can be measured under the new perf level for single-threaded writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12394

Test Plan: * add a new UT `PerfContextTest.WriteMemtableTimePerfLevel`

Reviewed By: anand1976

Differential Revision: D54326263

Pulled By: cbi42

fbshipit-source-id: d0e334d9581851ba6cf53c776c0bd876365d1e00
2024-02-29 15:08:26 -08:00
Peter Dillinger d780e7a561 Remove `bottommost_temperature` (#12389)
Summary:
deprecated option already replaced by `last_level_temperature`. (Keeping recognition of the option in old options files.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12389

Test Plan: tests updated

Reviewed By: jowlyzhang, cbi42

Differential Revision: D54267946

Pulled By: pdillinger

fbshipit-source-id: 65c49b15e7394829c1f3b44edd4179d2daff6017
2024-02-27 14:48:00 -08:00
Andrew Kryczka a43481b3d0 Decouple `RateLimiter` burst size and refill period (#12379)
Summary:
When the rate limiter does not have any waiting requests, the first request to arrive may consume all of the available bandwidth, despite potentially having lower priority than requests that arrive later in the same refill interval. Then, those higher priority requests must wait for a refill. So even in scenarios in which we have an overall bandwidth surplus, the highest priority requests can be sporadically delayed up to a whole refill period.

Alone, this isn't necessarily problematic as the refill period is configurable via `refill_period_us` and can be tuned down as needed until the max sporadic delay is tolerable. However, tuning down `refill_period_us` had a side effect of reducing burst size. Some users require a certain burst size to issue optimal I/O sizes to the underlying storage system.

To satisfy those users, this PR decouples the refill period from the burst size. That way, the max sporadic delay can be limited without impacting I/O sizes issued to the underlying storage system.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12379

Test Plan:
The goal is to show we can now limit the max sporadic delay without impacting compaction's I/O size.

The benchmark runs compaction with a large I/O size, while user reads simultaneously run at a low rate that does not consume all of the available bandwidth. The max sporadic delay is measured using the P100 of rocksdb.file.read.get.micros. I just used strace to verify the compaction reads follow `rate_limiter_single_burst_bytes`

Setup: `./db_bench -benchmarks=fillrandom,flush -write_buffer_size=67108864 -disable_auto_compactions=true -value_size=256 -num=1048576`

Benchmark: `./db_bench -benchmarks=readrandom -use_existing_db=true -num=1048576 -duration=10 -benchmark_read_rate_limit=4096 -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -rate_limit_user_ops=true -statistics=true -cache_size=0 -stats_level=5 -compaction_readahead_size=16777216 -use_direct_reads=true`

Results:

refill_micros | rocksdb.file.read.get.micros (P100)
-- | --
10000 | 10802
100000 | 100240
1000000 | 922061

For verifying compaction read sizes: `strace -fye pread64 ./db_bench -benchmarks=compact -use_existing_db=true -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -compaction_readahead_size=16777216 -use_direct_reads=true`

Reviewed By: hx235

Differential Revision: D54165675

Pulled By: ajkr

fbshipit-source-id: c5968486316cbfb7ff8e5b7d75d3589883dd1105
2024-02-26 16:55:13 -08:00
anand76 d9c0d44dab Add a perf level for measuring user thread block time (#12368)
Summary:
Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration, point lookup breakdown etc. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e a RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delay or stalls etc. It does not include time spent waiting to acquire mutexes, or waiting for IO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368

Test Plan: Add a unit test for write_thread_wait_nanos

Reviewed By: ajkr

Differential Revision: D54021583

Pulled By: anand1976

fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48
2024-02-22 12:14:53 -08:00
Yu Zhang f1ca47b904 Add support to bulk load external files for UDT in memtable only feature (#12356)
Summary:
This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for external file's comparator name and user-defined timestamps related flag. With this, it now supports ingesting files to a column family that enables user-defined timestamps in Memtable only feature.

Two fields in the table properties are used for aformentioned check: 1) the comparator name, it records what comparator is used to create this external sst file, 2) the flag `user_defined_timestamps_persisted`.  We compare these two fields with the column family's settings. The details are in util function `ValidateUserDefinedTimestampsOptions`.

To optimize for the majority of the cases where sanity check should pass and the table properties read should not affect how `TableReader` is constructed, instead of read the table properties block separately and use it for sanity check before creating a `TableReader`. We continue using the current flow to first create a `TableReader`, use it for reading table properties and do sanity checks, and reset the`TableReader` for the case where the column family enables UDTs in memtable only feature, and the external file does not contain user-defined timestamps.

This PR also groups other table properties related sanity check in function `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356

Test Plan:
added unit test
existing unit test

Reviewed By: cbi42

Differential Revision: D54025116

Pulled By: jowlyzhang

fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554
2024-02-21 15:41:53 -08:00
Andrew Kryczka 8e29f243c9 No filesystem reads during `Merge()` writes (#12365)
Summary:
This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365

Test Plan:
"rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read.

```
$ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true |& grep 'block.cache.data.add COUNT' ; done
strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0
strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1
```

Reviewed By: hx235

Differential Revision: D53982520

Pulled By: ajkr

fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4
2024-02-21 13:15:27 -08:00
Yu Zhang 31dfc81e18 Start 9.1.0 release (#12360)
Summary:
with release notes for 9.0.fb, format_compatible test update, and version.h update.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360

Test Plan: CI

Reviewed By: cbi42

Differential Revision: D53879416

Pulled By: jowlyzhang

fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75
2024-02-16 18:26:48 -08:00
anand76 d227276147 Deprecate some variants of Get and MultiGet (#12327)
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327

Reviewed By: pdillinger

Differential Revision: D53828151

Pulled By: anand1976

fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
2024-02-16 09:21:06 -08:00
Peter Dillinger bfd00bba9c Use format_version=6 by default (#12352)
Summary:
It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352

Test Plan:
tests updated, including format capatibility test

table_test: ApproximateOffsetOfCompressed is affected because adding index block to metaindex adds about 13 bytes
to SST files in format_version 6. This test has historically been problematic and one reason is that, apparently, not only
could it pass/fail depending on snappy compression version, but also how long your host name is, because of db_host_id.
I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability.

Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match
the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow.

Reviewed By: anand1976

Differential Revision: D53786884

Pulled By: pdillinger

fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51
2024-02-15 11:23:48 -08:00
Changyu Bi 6e57135a65 Add a changelog entry for PR 12322 (#12353)
Summary:
.. for public api change related to sst_dump.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12353

Reviewed By: jaykorean

Differential Revision: D53791123

Pulled By: cbi42

fbshipit-source-id: 3fbe9c7a3eb0a30dc1a00d39bc8a46028baa3779
2024-02-15 09:53:20 -08:00
Yu Zhang f405e55cfa Add support in SstFileWriter to not persist user defined timestamps (#12348)
Summary:
This PR adds support in `SstFileWriter` to create SST files without persisting timestamps when the column family has enabled UDTs in Memtable only feature. The sst files created from flush and compaction do not contain timestamps, we want to make the sst files created by `SstFileWriter` to follow the same pattern and not persist timestamps. This is to prepare for ingesting external SST files for this type of column family.

There are timestamp-aware APIs and non timestamp-aware APIs in `SstFileWriter`. The former are exclusively used for when the column family's comparator is timestamp-aware, a.k.a `Comparator::timestamp_size() > 0`, while the latter are exclusively used for the column family's comparator is non timestamp-aware, a.k.a `Comparator::timestamp_size() == 0`.  There are sanity checks to make sure these APIs are correctly used.

In this PR, the APIs usage continue with above enforcement, where even though timestamps are not eventually persisted, users are still asked to use only the timestamp-aware APIs. But because data points will logically all have minimum timestamps, we don't allow multiple versions of the same user key (without timestamp) to be added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12348

Test Plan:
Added unit tests
Manual inspection of generated sst files with `sst_dump`

Reviewed By: ltamasi

Differential Revision: D53732667

Pulled By: jowlyzhang

fbshipit-source-id: e43beba0d3a1736b94ee5c617163a6280efd65b7
2024-02-13 20:30:07 -08:00
Yu Zhang 4bea83aa44 Remove the force mode for EnableFileDeletions API (#12337)
Summary:
There is no strong reason for user to need this mode while on the other hand, its behavior is destructive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337

Reviewed By: hx235

Differential Revision: D53630393

Pulled By: jowlyzhang

fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
2024-02-13 18:36:25 -08:00
Levi Tamasi de1e3ff6ea Fix a data race in DBImpl::RenameTempFileToOptionsFile (#12347)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347

`DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels).

Reviewed By: jowlyzhang

Differential Revision: D53675153

fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481
2024-02-12 13:26:09 -08:00
Peter Dillinger 58d55b7f4e Mark wal_compression feature as production-ready (#12336)
Summary:
(as title)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12336

Test Plan: in use at Meta for a large service; in crash test

Reviewed By: hx235

Differential Revision: D53537628

Pulled By: pdillinger

fbshipit-source-id: 69e7ac9ab7b59b928d1144105667a7fde8a55a5a
2024-02-07 15:06:04 -08:00
Yu Zhang e3e8fbb497 Add a separate range classes for internal usage (#12071)
Summary:
Introduce some different range classes `UserKeyRange` and `UserKeyRangePtr` to be used by internal implementation. The `Range` class is used in both public APIs like `DB::GetApproximateSizes`, `DB::GetApproximateMemTableStats`, `DB::GetPropertiesOfTablesInRange` etc and internal implementations like `ColumnFamilyData::RangesOverlapWithMemtables`, `VersionSet::GetPropertiesOfTablesInRange`.

These APIs have different expectations of what keys this range class contain.  Public API users are supposed to populate the range with the user keys without timestamp, in the same way that point lookup and range scan APIs' key input only expect the user key without timestamp. The internal APIs implementation expect a user key whose format is compatible with the user comparator, a.k.a a user key with the timestamp.

This PR contains:
1) introducing counterpart range class `UserKeyRange` `UserKeyRangePtr` for internal implementation while leave the existing `Range` and `RangePtr` class only for public APIs. Internal implementations are updated to use this new class instead.
2) add user-defined timestamp support for `DB::GetPropertiesOfTablesInRange` API and `DeleteFilesInRanges` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12071

Test Plan:
existing tests
Added test for `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs for when user-defined timestamp is enabled.
The change in external_file_ingestion_job doesn't have a user-defined timestamp enabled test case coverage, will add one in a follow up PR that adds file ingestion support for UDT.

Reviewed By: ltamasi

Differential Revision: D53292608

Pulled By: jowlyzhang

fbshipit-source-id: 9a9279e23c640a6d8f8232636501a95aef7638b8
2024-02-06 18:35:36 -08:00
Hui Xiao 1a885fe730 Remove deprecated Options::access_hint_on_compaction_start (#11654)
Summary:
**Context:**
`Options::access_hint_on_compaction_start ` is marked deprecated and now ready to be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11654

Test Plan:
Multiple db_stress runs with pre-PR and post-PR binary randomly to ensure forward/backward compatibility on options 36a5686ec0?fbclid=IwAR2IcdAUdTvw9O9V5GkHEYJRGMVR9p7Ei-LMa-9qiXlj3z80DxjkxlGnP1E
`python3 tools/db_crashtest.py --simple blackbox --interval=30`

Reviewed By: cbi42

Differential Revision: D47892459

Pulled By: hx235

fbshipit-source-id: a62f46a0377fe143be7638e218978d5431c15c56
2024-02-05 13:35:19 -08:00
Changyu Bi 5620efc794 Remove deprecated option `ignore_max_compaction_bytes_for_input` (#12323)
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10835 to allow disabling the new compaction behavior if it's not safe. The option is enabled by default and there has not been a need to disable it. So it should be safe to remove now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12323

Reviewed By: ajkr

Differential Revision: D53330336

Pulled By: cbi42

fbshipit-source-id: 36eef4664ac96b3a7ed627c48bd6610b0a7eafc5
2024-02-02 17:09:42 -08:00
Changyu Bi ace1721b28 Remove deprecated option `level_compaction_dynamic_file_size` (#12325)
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10655 to allow reverting to old behavior. The option is enabled by default and there has not been a need to disable it. Remove it for 9.0 release. Also fixed and improved a few unit tests that depended on setting this option to false.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12325

Test Plan: existing tests.

Reviewed By: hx235

Differential Revision: D53369430

Pulled By: cbi42

fbshipit-source-id: 0ec2440ca8d88db7f7211c581542c7581bd4d3de
2024-02-02 15:37:40 -08:00
anand76 95b41eec6d Fix potential incorrect result for duplicate key in MultiGet (#12295)
Summary:
The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met -
1. Duplicate keys in a MultiGet batch
2. Key matches the last key in a non-zero, non-bottommost level file
3. Final value is not in the file (merge operand, not snapshot visible etc)
4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block
5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates
6. Value or merge operand for the key is present in the very next level

The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under the  conditions https://github.com/facebook/rocksdb/issues/1 and https://github.com/facebook/rocksdb/issues/2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295

Test Plan: Add unit tests that fail an assertion or return wrong result without the fix

Reviewed By: hx235

Differential Revision: D53187634

Pulled By: anand1976

fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea
2024-02-02 11:48:35 -08:00
Andrew Kryczka f9d45358ca Removed `check_flush_compaction_key_order` (#12311)
Summary:
`check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311

Reviewed By: cbi42

Differential Revision: D53233379

Pulled By: ajkr

fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637
2024-01-31 16:30:26 -08:00
Yu Zhang b10c171e58 Remove WritableFile(FSWritableFile)::GetFileSize default implementation (#12303)
Summary:
As titled. This changes public API behavior, and subclasses of `WritableFile` and `FSWritableFile` need to explicitly provide an implementation for the `GetFileSize` method after this change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12303

Reviewed By: ajkr

Differential Revision: D53205769

Pulled By: jowlyzhang

fbshipit-source-id: 2e613ca3650302913821b33159b742bdf1d24bc7
2024-01-30 09:49:32 -08:00
Andrew Kryczka aacf60dda2 Speedup based on number of files marked for compaction (#12306)
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. This PR adds pressure detection based on the number of files marked for compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12306

Reviewed By: cbi42

Differential Revision: D53200559

Pulled By: ajkr

fbshipit-source-id: 63402ee336881a4539204d255960f04338ab7a0e
2024-01-29 17:29:04 -08:00
Yu Zhang 17042a3fb7 Remove misspelled tickers used in error handler (#12302)
Summary:
As titled, the replacement tickers have been introduced in https://github.com/facebook/rocksdb/issues/11509  and in use since release 8.4. This PR completely removes the misspelled ones.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12302

Test Plan: CI tests

Reviewed By: jaykorean

Differential Revision: D53196935

Pulled By: jowlyzhang

fbshipit-source-id: 9c9d0d321247690db5edfdc52b4fecb2f1218979
2024-01-29 15:28:37 -08:00
akankshamahajan b9cb7b9644 Provide support for FSBuffer for point lookups (#12266)
Summary:
Provide support for FSBuffer for point lookups

It also add support for compaction and scan reads that goes through BlockFetcher when readahead/prefetching is not enabled.

Some of the compaction/Scan reads goes through FilePrefetchBuffer and some through BlockFetcher. This PR add support to use underlying file system scratch buffer for reads that go through BlockFetcher as for FilePrefetch reads, design is complicated to support this feature.

Design - In order to use underlying FileSystem provided scratch for Reads, it uses MultiRead with 1 request instead of Read API which required API change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12266

Test Plan: Stress test using underlying file system  scratch buffer internally.

Reviewed By: anand1976

Differential Revision: D53019089

Pulled By: akankshamahajan15

fbshipit-source-id: 4fe3d090d77363320e4b67186fd4d51c005c0961
2024-01-29 15:08:20 -08:00
Chdy fc48af33f5 fix some perf statistic in write (#12285)
Summary:
### Summary:  perf context lack statistics in some write steps
```
rocksdb::get_perf_context()->write_wal_time);
rocksdb::get_perf_context()->write_memtable_time);
rocksdb::get_perf_context()->write_pre_and_post_process_time);
```

#### case 1:
when the unordered_write is true, the `write_memtable_time` is 0
```
write_wal_time : 13.7012
write_memtable_time : 0
write_pre_and_post_process_time : 142.037
```

Reason: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` during insert memtable,

```c++
Status DBImpl::UnorderedWriteMemtable(const WriteOptions& write_options,
                                      WriteBatch* my_batch,
                                      WriteCallback* callback, uint64_t log_ref,
                                      SequenceNumber seq,
                                      const size_t sub_batch_cnt) {
	...
  if (w.CheckCallback(this) && w.ShouldWriteToMemtable()) {

    // need calculate write_memtable_time
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        seq_per_batch_, sub_batch_cnt, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);
    if (write_options.disableWAL) {
      has_unpersisted_data_.store(true, std::memory_order_relaxed);
    }
  }
	...
}
```
Fix: add perf function
```
write_wal_time : 14.3991
write_memtable_time : 19.3367
write_pre_and_post_process_time : 130.441
```

#### case 2:
when the enable_pipelined_write is true, the `write_memtable_time` is small
```
write_wal_time : 11.2986
write_memtable_time : 1.0205
write_pre_and_post_process_time : 140.131
```

Fix: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` when `w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER`
```c++
Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
                                  WriteBatch* my_batch, WriteCallback* callback,
                                  uint64_t* log_used, uint64_t log_ref,
                                  bool disable_memtable, uint64_t* seq_used) {
  ...
  if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
    // need calculate write_memtable_time

    assert(w.ShouldWriteToMemtable());
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        false /*seq_per_batch*/, 0 /*batch_cnt*/, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);

    if (write_thread_.CompleteParallelMemTableWriter(&w)) {
      MemTableInsertStatusCheck(w.status);
      versions_->SetLastSequence(w.write_group->last_sequence);
      write_thread_.ExitAsMemTableWriter(&w, *w.write_group);
    }
  }
  if (seq_used != nullptr) {
    *seq_used = w.sequence;
  }

  assert(w.state == WriteThread::STATE_COMPLETED);
  return w.FinalStatus();
}
```
FIx: add perf function
```
write_wal_time : 10.5201
write_memtable_time : 17.1048
write_pre_and_post_process_time : 114.313
```

#### case3:
`DBImpl::WriteImplWALOnly` function has no statistical `write_delay_time`
```c++
Status DBImpl::WriteImplWALOnly(
    WriteThread* write_thread, const WriteOptions& write_options,
    WriteBatch* my_batch, WriteCallback* callback, uint64_t* log_used,
    const uint64_t log_ref, uint64_t* seq_used, const size_t sub_batch_cnt,
    PreReleaseCallback* pre_release_callback, const AssignOrder assign_order,
    const PublishLastSeq publish_last_seq, const bool disable_memtable) {
 ...
  if (publish_last_seq == kDoPublishLastSeq) {

  } else {
    // need calculate write_delay_time
    InstrumentedMutexLock lock(&mutex_);
    Status status =
        DelayWrite(/*num_bytes=*/0ull, *write_thread, write_options);
    if (!status.ok()) {
      WriteThread::WriteGroup write_group;
      write_thread->EnterAsBatchGroupLeader(&w, &write_group);
      write_thread->ExitAsBatchGroupLeader(write_group, status);
      return status;
    }
  }
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12285

Reviewed By: ajkr

Differential Revision: D53191765

Pulled By: cbi42

fbshipit-source-id: f78d5b280bea6a777f077c89c3e0b8fe98d3c860
2024-01-29 12:31:11 -08:00
zaidoon c3bff1c02d Allow setting Stderr Logger via C API (#12262)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12262

Reviewed By: pdillinger

Differential Revision: D53027616

Pulled By: ajkr

fbshipit-source-id: 2e88e53e0c02447c613439f5528161ea1340b323
2024-01-25 12:36:40 -08:00
Hui Xiao 1b2b16b38e Fix bug of newer ingested data assigned with an older seqno (#12257)
Summary:
**Context:**
We found an edge case where newer ingested data is assigned with an older seqno. This causes older data of that key to be returned for read.

Consider the following lsm shape:
![image](https://github.com/facebook/rocksdb/assets/83968999/973fd160-5065-49cd-8b7b-b6ab4badae23)
Then ingest a file to L5 containing new data of key_overlap. Because of [this](https://l.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fblob%2F5a26f392ca640818da0b8590be6119699e852b07%2Fdb%2Fexternal_sst_file_ingestion_job.cc%3Ffbclid%3DIwAR10clXxpUSrt6sYg12sUMeHfShS7XigFrsJHvZoUDroQpbj_Sb3dG_JZFc%23L951-L956&h=AT0m56P7O0ZML7jk1sdjgnZZyGPMXg9HkKvBEb8mE9ZM3fpJjPrArAMsaHWZQPt9Ki-Pn7lv7x-RT9NEd_202Y6D2juIVHOIt3EjCZptDKBLRBMG49F8iBUSM9ypiKe8XCfM-FNW2Hl4KbVq2e3nZRbMvUM), the file is assigned with seqno 2, older than the old data's seqno 4. After just another compaction, we will drop the new_v for key_overlap because of the seqno and cause older data to be returned.
![image](https://github.com/facebook/rocksdb/assets/83968999/a3ef95e4-e7ae-4c30-8d03-955cd4b5ed42)

**Summary:**
This PR removes the incorrect seqno assignment

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12257

Test Plan:
- New unit test failed before the fix but passes after
- python3 tools/db_crashtest.py --compaction_style=1 --ingest_external_file_one_in=10 --preclude_last_level_data_seconds=36000 --compact_files_one_in=10 --enable_blob_files=0 blackbox`
- Rehearsal stress test

Reviewed By: cbi42

Differential Revision: D52926092

Pulled By: hx235

fbshipit-source-id: 9e4dade0f6cc44e548db8fca27ccbc81a621cd6f
2024-01-24 11:21:05 -08:00
zaidoon e572ae9f57 expose mode option to Rate Limiter via C API (#12259)
Summary:
addresses https://github.com/facebook/rocksdb/issues/12220 to allow rate limiting compaction but not flushes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12259

Reviewed By: jaykorean

Differential Revision: D52965342

Pulled By: ajkr

fbshipit-source-id: 38566d9ac75c932c63e10cc53796fab0e46e3b2e
2024-01-22 11:45:53 -08:00
Changyu Bi 4b684e96b7 Allow more intra-L0 compaction when L0 is small (#12214)
Summary:
introduce a new option `intra_l0_compaction_size` to allow more intra-L0 compaction when total L0 size is under a threshold. This option applies only to leveled compaction. It is enabled by default and set to `max_bytes_for_level_base / max_bytes_for_level_multiplier` only for atomic_flush users. When atomic_flush=true, it is more likely that some CF's total L0 size is small when it's eligible for compaction. This option aims to reduce write amplification in this case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12214

Test Plan:
- new unit test
- benchmark:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --write_buffer_size=51200 --max_bytes_for_level_base=5242880 --level0_file_num_compaction_trigger=4 --statistics=1

main:
fillrandom   :     234.499 micros/op 4264 ops/sec 234.499 seconds 1000000 operations;    0.5 MB/s
rocksdb.compact.read.bytes COUNT : 1490756235
rocksdb.compact.write.bytes COUNT : 1469056734
rocksdb.flush.write.bytes COUNT : 71099011

branch:
fillrandom   :     128.494 micros/op 7782 ops/sec 128.494 seconds 1000000 operations;    0.9 MB/s
rocksdb.compact.read.bytes COUNT : 807474156
rocksdb.compact.write.bytes COUNT : 781977610
rocksdb.flush.write.bytes COUNT : 71098785
```

Reviewed By: ajkr

Differential Revision: D52637771

Pulled By: cbi42

fbshipit-source-id: 4f2c7925d0c3a718635c948ea0d4981ed9fabec3
2024-01-22 10:23:57 -08:00
Peter Dillinger 800cfae987 Start 9.0.0 release (#12256)
Summary:
with release notes for 8.11.fb, format_compatible test update, and version.h update.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256

Test Plan: CI

Reviewed By: cbi42

Differential Revision: D52926051

Pulled By: pdillinger

fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4
2024-01-20 08:38:20 -08:00
Peter Dillinger cb08a682d4 Fix/cleanup SeqnoToTimeMapping (#12253)
Summary:
The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.

* Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc.
  * Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
* Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
* Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
  * Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries.
  * Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.`
* Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
* Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.

Intended follow-up (me or others):
* Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
* The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253

Test Plan: unit tests updated

Reviewed By: jowlyzhang

Differential Revision: D52913733

Pulled By: pdillinger

fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593
2024-01-19 21:50:38 -08:00
Andrew Kryczka d69628e6ce Mark unsafe/outdated options as deprecated (#12249)
Summary:
These options were added for users to roll back a behavior change without downgrading. To our knowledge they were not needed so can now be removed.

- `level_compaction_dynamic_file_size`
- `ignore_max_compaction_bytes_for_input`

These options were added for users to disable an online validation in case it is expensive or has false positives. Those validations have shown to be cheap, correct, and are enabled by default, so these options can be removed.

- `check_flush_compaction_key_order`
- `flush_verify_memtable_count`
- `compaction_verify_record_count`
- `fail_if_options_file_error`

This option was added for users to violate API contracts or run old databases that used to violate API contracts. It appears to be set by MyRocks so it is unclear whether we can remove it. In any case we should discourage it until it can be removed.

- `enforce_single_del_contracts`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12249

Reviewed By: cbi42

Differential Revision: D52886651

Pulled By: ajkr

fbshipit-source-id: e0d5a35144ce048505899efb1ca68c3948050aa4
2024-01-19 10:44:49 -08:00
Levi Tamasi 7e4406a171 Add a changelog entry for PR 12235 (#12238)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12238

Reviewed By: jowlyzhang

Differential Revision: D52809593

fbshipit-source-id: 692852cdd3074275ef92bde83ff15a800d8ae3d5
2024-01-16 13:02:18 -08:00
anand76 b49f9cdd3c Add CompressionOptions to the compressed secondary cache (#12234)
Summary:
Add ```CompressionOptions``` to ```CompressedSecondaryCacheOptions``` to allow users to set options such as compression level. It allows performance to be fine tuned.

Tests -
Run db_bench and verify compression options in the LOG file

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12234

Reviewed By: ajkr

Differential Revision: D52758133

Pulled By: anand1976

fbshipit-source-id: af849fbffce6f84704387c195d8edba40d9548f6
2024-01-16 12:21:27 -08:00
akankshamahajan cad76a2e1e Fix bug in auto_readahead_size that returned wrong key (#12229)
Summary:
IndexType::kBinarySearchWithFirstKey +
BlockCacheLookupForReadAheadSize enabled => FindNextUserEntryInternal assertion fails or iterator lands at a wrong key because BlockCacheLookupForReadAheadSize moves the index_iter_ and in internal_wrapper.h, result_.key didn't update and pointed to wrong key. Also ikey_ was also pointing to iter_.key() instead of copying the key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12229

Test Plan:
```
 rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
mkdir /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
./db_stress -threads=1 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_setting_blob_options_dynamically=0 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=0 --backup_one_in=0 --batch_protection_bytes_per_key=0 --blob_cache_size=0 --blob_compaction_readahead_size=0 --blob_compression_type=lz4 --blob_file_size=0 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0 --blob_garbage_collection_force_threshold=0 --block_protection_bytes_per_key=0 --block_size=2048 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=0 --compaction_ttl=0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 --db_write_buffer_size=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=3 --ingest_external_file_one_in=10 --initial_auto_readahead_size=0 --iterpercent=55 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_blob_size=8 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=10000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=1 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --snapshot_hold_ops=0 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_blob_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=0 --verify_checksum_one_in=0 --verify_db_one_in=0 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=1 --verify_sst_unique_id_in_manifest=0 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=0 > repro.out
Verification failed. Expected state has key 0000000000000077000000000000004178, iterator is at key 0000000000000077000000000000008A78
Column family: default, op_logs: S 0000000000000077000000000000003D7878787878 NNNN
No writes or ops?
Verification failed :(
```

Reviewed By: ajkr

Differential Revision: D52710655

Pulled By: akankshamahajan15

fbshipit-source-id: 9d2e684e190fb0832bdce3337bce1c6548cd054d
2024-01-16 11:30:36 -08:00
Andrew Kryczka 2dda7a0dd2 Detect compaction pressure at lower debt ratios (#12236)
Summary:
This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236

Reviewed By: cbi42

Differential Revision: D52765685

Pulled By: ajkr

fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4
2024-01-15 22:41:18 -08:00
Peter Dillinger 5da900f28a Fix a case of ignored corruption in creating backups (#12200)
Summary:
We often need to read the table properties of an SST file when taking a backup. However, we currently do not check checksums for this step, and even with that enabled, we ignore failures. This change ensures we fail creating a backup if corruption is detected in that step of reading table properties.

To get this working properly (with existing unit tests), we also add some temperature handling logic like already exists in
BackupEngineImpl::ReadFileAndComputeChecksum and elsewhere in BackupEngine. Also, SstFileDumper needed a fix to its error handling logic.

This was originally intended to help diagnose some mysterious failures (apparent corruptions) seen in taking backups in the crash test, though that is now fixed in https://github.com/facebook/rocksdb/pull/12206

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12200

Test Plan: unit test added that corrupts table properties, along with existing tests

Reviewed By: ajkr

Differential Revision: D52520674

Pulled By: pdillinger

fbshipit-source-id: 032cfc0791428f3b8147d34c7d424ab128e28f42
2024-01-05 09:48:19 -08:00
Peter Dillinger ed46981bea Fix and defend against FilePrefetchBuffer combined with mmap reads (#12206)
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.

In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
  * Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.

Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206

Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200

Reviewed By: akankshamahajan15

Differential Revision: D52551701

Pulled By: pdillinger

fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
2024-01-04 18:39:05 -08:00
Hui Xiao 06e593376c Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.

For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS

Some related code refactory to make implementation cleaner:
- Blob stats
   - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS  include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
   - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
   - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
   - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see  https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
   - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority

## Test
### db bench

Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100

rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```

blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB

Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```

```
Stacked Blob DB

Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```

###  Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests

###  Performance

Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true

Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,

Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```

Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846

Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```

Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860

Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910

Reviewed By: ajkr

Differential Revision: D49788060

Pulled By: hx235

fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 15:29:23 -08:00
hulk b7ecbe309d Trigger compaction to the next level if the data age exceeds periodic_compaction_seconds (#12175)
Summary:
Currently, the data are always compacted to the same level if exceed periodic_compaction_seconds which may confuse users, so we change it to allow trigger compaction to the next level here. It's a behavior change to users, and may affect users
who have disabled their ttl or ttl > periodic_compaction_seconds.

Relate issue: https://github.com/facebook/rocksdb/issues/12165

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175

Reviewed By: ajkr

Differential Revision: D52446722

Pulled By: cbi42

fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f
2023-12-28 12:50:08 -08:00
Changyu Bi 3d81f175b4 Prioritize marked file in level compaction (#12187)
Summary:
When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187

Test Plan: * New unit tests

Reviewed By: ajkr

Differential Revision: D52437194

Pulled By: cbi42

fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3
2023-12-28 10:28:37 -08:00
Peter Dillinger a771a47a1b Fix leak or crash on failure in automatic atomic flush (#12176)
Summary:
Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test:

```
db_flush_test: db/column_family.cc:1648:
rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176

Test Plan: unit test added

Reviewed By: cbi42

Differential Revision: D52417071

Pulled By: pdillinger

fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5
2023-12-26 11:04:25 -08:00
zaidoon ad0362ac92 Expose Options::ttl through C API (#12170)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12170

Reviewed By: jaykorean

Differential Revision: D52378902

Pulled By: cbi42

fbshipit-source-id: 0bac94b8785d5149df86e7317e69c0e64beab887
2023-12-21 15:04:53 -08:00
Levi Tamasi 81765866c4 Update HISTORY/version/format compatibility script for the 8.10 release (#12154)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12154

Reviewed By: jaykorean, akankshamahajan15

Differential Revision: D52216271

Pulled By: ltamasi

fbshipit-source-id: 13bab72802eeec8f6e3544be9ebcd7f725a64d2e
2023-12-15 14:44:23 -08:00
Peter Dillinger 88bc91f3cc Cap eviction effort (CPU under stress) in HyperClockCache (#12141)
Summary:
HyperClockCache is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list.  All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem.

However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable (they are pinned). Unlike LRU which can find any evictable entries immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system.

To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141

Test Plan:
unit test included

 ## Performance test

We can use cache_bench to validate no regression (CPU and memory) in normal operation, and to measure change in behavior when cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well). Base command reproducing bad behavior:

```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```

```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident

Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```

Now let us sample a range of values in the solution space:

```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident

After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident

After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident

After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident

After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```

I looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so is chosen as the default.

```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident

Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```

The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress.

```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident

After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```

No measurable difference under normal circumstances.

Some tests repeated with FixedHCC, with similar results.

Reviewed By: anand1976

Differential Revision: D52174755

Pulled By: pdillinger

fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8
2023-12-14 22:13:32 -08:00
Akanksha Mahajan cd577f6059 Fix WRITE_STALL start_time (#12147)
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)
However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time);

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147

Test Plan: Existing CircleCI

Reviewed By: cbi42

Differential Revision: D52173481

Pulled By: akankshamahajan15

fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98
2023-12-14 13:45:06 -08:00
akankshamahajan e7c6259447 Make auto_readahead_size default true (#12080)
Summary:
Make auto_readahead_size option default true

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12080

Test Plan: benchmarks and exisiting tests

Reviewed By: anand1976

Differential Revision: D52152132

Pulled By: akankshamahajan15

fbshipit-source-id: f1515563564e77df457dff2e865e4ede8c3ddf44
2023-12-14 11:25:51 -08:00
akankshamahajan d926593df5 Fix stress tests failure for auto_readahead_size (#12131)
Summary:
When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter  so that
- BlockBasedTableIterator can point index_iter_ to the right block.
- disable readahead_cache_lookup.
However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This more like a corner case.

So to handle that case, removed SeekForPrev calling from db_iter and reseeking index_iter_ in Prev operation. block_iter_'s key already point to right block. So reseeking to index_iter_ solves the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131

Test Plan:
- Tested on db_stress command that was failing -
`./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35`

 - make crash_test -j32

Reviewed By: anand1976

Differential Revision: D51986326

Pulled By: akankshamahajan15

fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9
2023-12-13 12:15:04 -08:00
Andrew Kryczka d8e47620d7 Speedup based on pending compaction bytes relative to data size (#12130)
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years.

This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130

Reviewed By: hx235

Differential Revision: D52000746

Pulled By: ajkr

fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7
2023-12-13 10:37:27 -08:00
anand76 c1b84d0437 Fix false negative in TieredSecondaryCache nvm cache lookup (#12134)
Summary:
There is a bug in the `TieredSecondaryCache` that can result in a false negative. This can happen when a MultiGet does a cache lookup that gets a hit in the `TieredSecondaryCache` local nvm cache tier, and the result is available before MultiGet calls `WaitAll` (i.e the nvm cache `SecondaryCacheResultHandle` `IsReady` returns true).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12134

Test Plan: Add a new unit test in tiered_secondary_cache_test

Reviewed By: akankshamahajan15

Differential Revision: D52023309

Pulled By: anand1976

fbshipit-source-id: e5ae681226a0f12753fecb2f6acc7e5f254ae72b
2023-12-11 16:59:59 -08:00
Peter Dillinger c96d9a0fbb Allow TablePropertiesCollectorFactory to return null collector (#12129)
Summary:
As part of building another feature, I wanted this:
* Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases.
* Polished, clarified some related API comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12129

Test Plan: unit test added

Reviewed By: ltamasi

Differential Revision: D51966667

Pulled By: pdillinger

fbshipit-source-id: 2991c08fe6ce3a8c9f14c68f1495f5a17bca2770
2023-12-11 12:02:56 -08:00
Alan Paxton 5a063ecd34 Java API consistency between RocksDB.put() , .merge() and Transaction.put() , .merge() (#11019)
Summary:
### Implement new Java API get()/put()/merge() methods, and transactional variants.

The Java API methods are very inconsistent in terms of how they pass parameters (byte[], ByteBuffer), and what variants and defaulted parameters they support. We try to bring some consistency to this.
 * All APIs should support calls with ByteBuffer parameters.
 * Similar methods (RocksDB.get() vs Transaction.get()) should support as similar as possible sets of parameters for predictability.
 * get()-like methods should provide variants where the caller supplies the target buffer, for the sake of efficiency. Allocation costs in Java can be significant when large buffers are repeatedly allocated and freed.

### API Additions

 1. RockDB.get implement indirect ByteBuffers. Added indirect ByteBuffers and supporting native methods for get().
 2. RocksDB.Iterator implement missing (byte[], offset, length) variants for key() and value() parameters.
 3. Transaction.get() implement missing methods, based on RocksDB.get. Added ByteBuffer.get with and without column family. Added byte[]-as-target get.
 4. Transaction.iterator() implement a getIterator() which defaults ReadOptions; as per RocksDB.iterator(). Rationalize support API for this and RocksDB.iterator()
 5. RocksDB.merge implement ByteBuffer methods; both direct and indirect buffers. Shadow the methods of RocksDB.put; RocksDB.put only offers ByteBuffer API with explicit WriteOptions. Duplicated this with RocksDB.merge
 6. Transaction.merge implement methods as per RocksDB.merge methods. Transaction is already constructed with WriteOptions, so no explicit WriteOptions methods required.
 7. Transaction.mergeUntracked implement the same API methods as Transaction.merge except the ones that use assumeTracked, because that’s not a feature of merge untracked.

### Support Changes (C++)

The current JNI code in C++ supports multiple variants of methods through a number of helper functions. There are numerous TODO suggestions in the code proposing that the helpers be re-factored/shared.

We have taken a different approach for the new methods; we have created wrapper classes `JDirectBufferSlice`, `JDirectBufferPinnableSlice`, `JByteArraySlice` and `JByteArrayPinnableSlice` RAII classes which construct slices from JNI parameters and can then be passed directly to RocksDB methods. For instance, the `Java_org_rocksdb_Transaction_getDirect` method is implemented like this:

```
  try {
    ROCKSDB_NAMESPACE::JDirectBufferSlice key(env, jkey_bb, jkey_off,
                                              jkey_part_len);
    ROCKSDB_NAMESPACE::JDirectBufferPinnableSlice value(env, jval_bb, jval_off,
                                                        jval_part_len);
    ROCKSDB_NAMESPACE::KVException::ThrowOnError(
        env, txn->Get(*read_options, column_family_handle, key.slice(),
                      &value.pinnable_slice()));
    return value.Fetch();
  } catch (const ROCKSDB_NAMESPACE::KVException& e) {
    return e.Code();
  }
```
Notice the try/catch mechanism with the `KVException` class, which combined with RAII and the wrapper classes means that there is no ad-hoc cleanup necessary in the JNI methods.

We propose to extend this mechanism to existing JNI methods as further work.

### Support Changes (Java)

Where there are multiple parameter-variant versions of the same method, we use fewer or just one supporting native method for all of them. This makes maintenance a bit easier and reduces the opportunity for coding errors mixing up (untyped) object handles.

In  order to support this efficiently, some classes need to have default values for column families and read options added and cached so that they are not re-constructed on every method call.

This PR closes https://github.com/facebook/rocksdb/issues/9776

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11019

Reviewed By: ajkr

Differential Revision: D52039446

Pulled By: jowlyzhang

fbshipit-source-id: 45d0140a4887e42134d2e56520e9b8efbd349660
2023-12-11 11:03:17 -08:00
Kevin Mingtarja 44fd914128 Fix double counting of BYTES_WRITTEN ticker (#12111)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12061.

We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable.

The cause of this bug is previously during writes, we didn't take into account `writer->ShouldWriteToMemtable()` before adding to `total_byte_size`, so it is still added to during the `Prepare()` phase even though we're not writing to the Memtable, which was why we saw the value to be double of what's written to WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111

Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes.

Reviewed By: jaykorean

Differential Revision: D51954327

Pulled By: jowlyzhang

fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c
2023-12-08 17:12:11 -08:00
Levi Tamasi a143f93236 Turn the default Timer in PeriodicTaskScheduler into a leaky Meyers singleton (#12128)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12128

The patch turns the `Timer` Meyers singleton in `PeriodicTaskScheduler::Default()` into one of the leaky variety in order to prevent static destruction order issues.

Reviewed By: akankshamahajan15

Differential Revision: D51963950

fbshipit-source-id: 0fc34113ad03c51fdc83bdb8c2cfb6c9f6913948
2023-12-08 10:34:07 -08:00
akankshamahajan c77b50a4fd Add AsyncIO support for tuning readahead_size by block cache lookup (#11936)
Summary:
Add support for tuning of readahead_size by block cache lookup for async_io.

**Design/ Implementation** -

**BlockBasedTableIterator.cc** -

`BlockCacheLookupForReadAheadSize` callback API lookups in the block cache and tries to reduce the start
and end offset passed. This function looks into the block cache for the blocks between `start_offset`
and `end_offset` and add all the handles in the queue.

It then iterates from the end in the handles to find first miss block and update the end offset to that block.
It also iterates from the start and find first miss block and update the start offset to that block.

```
_read_curr_block_ argument : True if this call was due to miss in the cache and caller wants to read that block
                             synchronously.
                             False if current call is to prefetch additional data in extra buffers
                            (due to ReadAsync call in FilePrefetchBuffer)
```
In case there is no data to be read in that callback (because of upper_bound or all blocks are in cache),
it updates start and end offset to be equal and that `FilePrefetchBuffer` interprets that as 0 length to be read.

**FilePrefetchBuffer.cc** -

FilePrefetchBuffer calls the callback - `ReadAheadSizeTuning` and pass the start and end offset to that
callback to get updated start and end offset to read based on cache hits/misses.

1. In case of Read calls (when offset passed to FilePrefetchBuffer is on cache miss and that data needs to be read), _read_curr_block_ is passed true.
2. In case of ReadAsync calls, when buffer is all consumed and can go for additional prefetching,  the start offset passed is the initial end offset of prev buffer (without any updated offset based on cache hit/miss).

Foreg. if following are the data blocks with cache hit/miss and start offset
and Read API found miss on DB1 and based on readahead_size (50)  it passes end offset to be 50.
 [DB1 - miss- 0 ] [DB2 - hit -10] [DB3 - miss -20] [DB4 - miss-30] [DB5 - hit-40]
 [DB6 - hit-50] [DB7 - miss-60] [DB8 - miss - 70] [DB9 - hit - 80] [DB6 - hit 90]

- For Read call - updated start offset remains 0 but end offset updates to DB4, as DB5 is in cache.
- Read calls saves initial end offset 50 as that was meant to be prefetched.
- Now for next ReadAsync call - the start offset will be 50 (previous buffer initial end offset) and based on readahead_size, end offset will be 100
- On callback, because of cache hits - callback will update the start offset to 60 and end offset to 80 to read only 2 data blocks (DB7 and DB8).
- And for that ReadAsync call - initial end offset will be set to 100 which will again used by next ReadAsync call as start offset.
-  `initial_end_offset_` in `BufferInfo` is used to save the initial end offset of that buffer.

- If let's say DB5 and DB6 overlaps in 2 buffers (because of alignment), `prev_buf_end_offset` is passed to make sure already prefetched data is not prefetched again in second buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11936

Test Plan:
- Ran crash_test several times.
-  New unit tests added.

Reviewed By: anand1976

Differential Revision: D50906217

Pulled By: akankshamahajan15

fbshipit-source-id: 0d75d3c98274e98aa34901b201b8fb05232139cf
2023-12-06 13:48:15 -08:00
Levi Tamasi 2045fe4693 Mention PR 11892 in the changelog (#12118)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12118

Reviewed By: jaykorean

Differential Revision: D51820703

fbshipit-source-id: d2a86a4781618747c6b7c71971862d510a25e103
2023-12-04 13:20:28 -08:00
Jay Huh ddb7df10ef Update HISTORY.md and version.h for 8.9.fb release (#12074)
Summary:
Creating cut for 8.9 release

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12074

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D51435289

Pulled By: jaykorean

fbshipit-source-id: 3918a8250032839e5b71f67f26c8ba01cbc17a41
2023-11-21 18:07:19 -08:00
Yu Zhang 84a54e1e28 Fix some bugs in index builder and reader for the UDT in memtable only feature (#12062)
Summary:
These bugs surfaced while I was trying to add the stress test for the feature:

Bug 1) On the index building path: the optimization to use user key instead of internal key as separator needed a bit tweak for when user defined timestamps can be removed. Because even though the user key look different now and eligible to be used as separator, when their user-defined timestamps are removed, they could be equal and that invariant no longer stands.

Bug 2) On the index reading path: one path that builds the second level index iterator for `PartitionedIndexReader` are not passing the corresponding `user_defined_timestamps_persisted` flag. As a result, the default `true` value be used leading to no minimum timestamps padded when they should be.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12062

Test Plan:
For bug 1): added separate unit test `BlockBasedTableReaderTest::Get` to exercise the `Get` API. It's a different code path from `MultiGet` so worth having its own test. Also in order to cover the bug, the test is modified to generate key values with the same user provided key, different timestamps and different sequence numbers. The test reads back different versions of the same user provided key.  `MultiGet` takes one `ReadOptions` with one read timestamp so we cannot test retrieving different versions of the same key easily.

For bug 2): simply added options `BlockBasedTableOptions.metadata_cache_options.partition_pinning = PinningTier::kAll` to exercise all the index iterator creating paths.

Reviewed By: ltamasi

Differential Revision: D51508280

Pulled By: jowlyzhang

fbshipit-source-id: 8b174d3d70373c0599266ac1f467f2bd4d7ea6e5
2023-11-21 14:05:02 -08:00
Andrew Kryczka 9202db1867 Consider archived WALs for deletion more frequently (#12069)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11000.

That issue pointed out that RocksDB was slow to delete archived WALs in case time-based and size-based expiration were enabled, and the time-based threshold (`WAL_ttl_seconds`) was small. This PR prevents the delay by taking into account `WAL_ttl_seconds` when deciding the frequency to process archived WALs for deletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12069

Reviewed By: pdillinger

Differential Revision: D51262589

Pulled By: ajkr

fbshipit-source-id: e65431a06ee96f4c599ba84a27d1aedebecbb003
2023-11-15 15:42:28 -08:00
Yu Zhang c6c683a0ca Remove the default force behavior for `EnableFileDeletion` API (#12001)
Summary:
Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insisted on doing so knowing the consequence of what can be potentially disrupted.

This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them.  When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK.

This PR also makes unit test that do not need force behavior to do a regular enable file deletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001

Reviewed By: ajkr

Differential Revision: D51214683

Pulled By: jowlyzhang

fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4
2023-11-10 14:35:54 -08:00
Yueh-Hsuan Chiang 73d223c4e2 Add auto_tuned option to RateLimiter C API (#12058)
Summary:
#### Problem
While the RocksDB C API does have the RateLimiter API, it does not
expose the auto_tuned option.

#### Summary of Change
This PR exposes auto_tuned RateLimiter option in RocksDB C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12058

Test Plan: Augment the C API existing test to cover the new API.

Reviewed By: cbi42

Differential Revision: D51201933

Pulled By: ajkr

fbshipit-source-id: 5bc595a9cf9f88f50fee797b729ba96f09ed8266
2023-11-10 09:53:09 -08:00
Hui Xiao f337533b6f Ensure and clarify how RocksDB calls TablePropertiesCollector's functions (#12053)
Summary:
**Context/Summary:**
It's intuitive for users to assume `TablePropertiesCollector::Finish()` is called only once by RocksDB internal by the word "finish".

However, this is currently not true as RocksDB also calls this function in `BlockBased/PlainTableBuilder::GetTableProperties()` to populate user collected properties on demand.

This PR avoids that by moving that populating to where we first call `Finish()` (i.e, `NotifyCollectTableCollectorsOnFinish`)

Bonus: clarified in the API that `GetReadableProperties()` will be called after `Finish()` and added UT to ensure that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12053

Test Plan:
- Modified test `DBPropertiesTest.GetUserDefinedTableProperties` to ensure `Finish()` only called once.
- Existing test particularly `db_properties_test, table_properties_collector_test` verify the functionality  `NotifyCollectTableCollectorsOnFinish` and `GetReadableProperties()` are not broken by this change.

Reviewed By: ajkr

Differential Revision: D51095434

Pulled By: hx235

fbshipit-source-id: 1c6275258f9b99dedad313ee8427119126817973
2023-11-08 14:00:36 -08:00
Jay Huh 2adef5367a AttributeGroups - PutEntity Implementation (#11977)
Summary:
Write Path for AttributeGroup Support. The new `PutEntity()` API uses `WriteBatch` and atomically writes WideColumns entities in multiple Column Families.

Combined the release note from PR https://github.com/facebook/rocksdb/issues/11925

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11977

Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` updated
- `WriteBatchTest::AttributeGroupTest` added
- `WriteBatchTest::AttributeGroupSavePointTest` added

Reviewed By: ltamasi

Differential Revision: D50457122

Pulled By: jaykorean

fbshipit-source-id: 4997b265e415588ce077933082dcd1ac3eeae2cd
2023-11-06 16:52:51 -08:00
Peter Dillinger 92dc5f3e67 AutoHCC: fix a bug with "blind" Insert (#12046)
Summary:
I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions.

It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`).

The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have
* Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain.
* Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after is it marked for (de)construction). And after the purge of the chain, the entry is marked empty.
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain.

This is easily fixed by holding on to a reference until inserted onto the chain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046

Test Plan:
As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed.

Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000)

Reviewed By: jowlyzhang

Differential Revision: D51016979

Pulled By: pdillinger

fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b
2023-11-06 16:06:01 -08:00
Jay Huh 2dab137182 Mark more files for periodic compaction during offpeak (#12031)
Summary:
- The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice.
- It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization.
- Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start.
- We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031

Test Plan:
Unit Tests added
- `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled
- `DBTestUniversalCompaction2::PeriodicCompaction` for Universal

Reviewed By: cbi42

Differential Revision: D50900292

Pulled By: jaykorean

fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d
2023-11-06 11:43:59 -08:00
马越 8e1adab5ce add RocksDB#clipColumnFamily to Java API (#11868)
Summary:
### main change:

- add java clipColumnFamily api in Rocksdb.java
The method signature of the new API is
 ```
 public void clipColumnFamily(final ColumnFamilyHandle columnFamilyHandle, final byte[] beginKey,
      final byte[] endKey)
```
### Test
add unit test RocksDBTest#clipColumnFamily()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11868

Reviewed By: jaykorean

Differential Revision: D50889783

Pulled By: cbi42

fbshipit-source-id: 7f545171ad9adb9c20bdd92efae2e6bc55d5703f
2023-11-02 08:00:08 -07:00
Yu Zhang 60df39e530 Rate limiting stale sst files' deletion during recovery (#12016)
Summary:
As titled. If SstFileManager is available, deleting stale sst files will be delegated to it so it can be rate limited.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12016

Reviewed By: hx235

Differential Revision: D50670482

Pulled By: jowlyzhang

fbshipit-source-id: bde5b76ea1d98e67f6b4f08bfba3db48e46aab4e
2023-10-28 09:50:52 -07:00
anand76 52be8f54f2 Add APIs to query secondary cache capacity and usage for TieredCache (#12011)
Summary:
In `TieredCache`, the underlying compressed secondary cache is hidden from the user. So we need a way to query the capacity, as well as the portion of cache reservation charged to the compressed secondary cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12011

Test Plan: Update the unit tests

Reviewed By: akankshamahajan15

Differential Revision: D50651943

Pulled By: anand1976

fbshipit-source-id: 06d1cb5edb75a790c919bce718e2ff65f5908220
2023-10-25 16:54:50 -07:00
Hui Xiao 0f141352d8 Fix race between flush error recovery and db destruction (#12002)
Summary:
**Context:**
DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: 519f2a41fb/db/db_impl/db_impl.cc (L525) -> 519f2a41fb/db/error_handler.cc (L250) -> 519f2a41fb/db/error_handler.cc (L808-L823)

However, due to a race between flush error recovery and db destruction, recovery can actually start after such wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part the db destruction. It then crashes the program as below.

```
std::terminate()
std::default_delete<std::thread>::operator()(std::thread*) const
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
```

**Summary:**
This PR fixed it by considering whether EndAutoRecovery() has been called before creating such thread. This fix is similar to how we currently [handle](519f2a41fb/db/error_handler.cc (L688-L694)) such case inside the created recovery thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002

Test Plan: A new UT repro-ed the crash before this fix and and pass after.

Reviewed By: ajkr

Differential Revision: D50586191

Pulled By: hx235

fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093
2023-10-25 11:59:09 -07:00
Hui Xiao ab15d33566 Update history, version and format testing for 8.8 (#12004)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12004

Reviewed By: cbi42

Differential Revision: D50586984

Pulled By: hx235

fbshipit-source-id: 1480a8c2757340ebf83510557104aaa0e437b3ae
2023-10-24 12:03:07 -07:00
Hui Xiao 99b371b417 Skip subsequent trace writes after encountering trace write failure (#11996)
Summary:
**Context/Summary:**
We ignore trace writing status e.g, 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222)

If a write into the trace file fails, subsequent trace write will continue onto the same file.

This will trigger the assertion `assert(sync_without_flush_called_)` intended to catch write to a file that has previously seen error, added in https://github.com/facebook/rocksdb/pull/10489, https://github.com/facebook/rocksdb/pull/10555

Alternative (rejected) is to handle trace writing status at a higher level at e.g, 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222). However, it makes sense to ignore such status considering tracing is not a critical but assistant component to db operation. And this alternative requires more code change. So it's better to handle the failure at a lower level as this PR

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11996

Test Plan: Add new UT failed before this PR and pass after

Reviewed By: akankshamahajan15

Differential Revision: D50532467

Pulled By: hx235

fbshipit-source-id: f2032abafd94917adbf89a20841d15b448782a33
2023-10-24 09:58:02 -07:00
rogertyang 543191f2ea Add bounds checking to WBWIIteratorImpl and respect bounds of ReadOptions in Transaction (#11680)
Summary:
Fix https://github.com/facebook/rocksdb/issues/11607
Fix https://github.com/facebook/rocksdb/issues/11679
Fix https://github.com/facebook/rocksdb/issues/11606
Fix https://github.com/facebook/rocksdb/issues/2343

Add bounds checking to `WBWIIteratorImpl`, which will be reflected in `BaseDeltaIterator::delta_iterator_::Valid()`, just like `BaseDeltaIterator::base_iterator_::Valid()`. In this way, the two sub itertors become more aligned from `BaseDeltaIterator`'s perspective. Like `DBIter`, the added bounds checking caps in either bound when seeking and disvalidates the `WBWIIteratorImpl` iterator when the lower bound is past or the upper bound is reached.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11680

Test Plan:
- A simple test added to write_batch_with_index_test.cc to exercise the bounds checking in `WBWIIteratorImpl`.
- A sophisticated test added to transaction_test.cc to assert that `Transaction` with different write policies honor bounds in `ReadOptions`. It should be so as long as the  `BaseDeltaIterator` is correctly coordinating the two sub iterators to perform iterating and bounds checking.

Reviewed By: ajkr

Differential Revision: D48125229

Pulled By: cbi42

fbshipit-source-id: c9acea52595aed1471a63d7ca6ef15d2a2af1367
2023-10-20 13:28:28 -07:00
Hui Xiao 0836a2b26d New tickers on deletion compactions grouped by reasons (#11957)
Summary:
Context/Summary: as titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11957

Test Plan: piggyback on existing tests; fixed a failed test due to adding new stats

Reviewed By: ajkr, cbi42

Differential Revision: D50294310

Pulled By: hx235

fbshipit-source-id: d99b97ebac41efc1bdeaf9ca7a1debd2927d54cd
2023-10-18 18:00:07 -07:00
akankshamahajan 9135a61ec6 Fix corruption error in stress test for auto_readahead_size enabled (#11961)
Summary:
Fix corruption error - "Corruption: first key in index doesn't match first key in block". when auto_readahead_size is enabled. Error is because of bug when index_iter_ moves forward, first_internal_key of that index_iter_ is not copied. So the Slice points to a different key resulting in wrong comparison when doing comparison.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11961

Test Plan: Ran stress test which reproduced this error.

Reviewed By: anand1976

Differential Revision: D50310589

Pulled By: akankshamahajan15

fbshipit-source-id: 95d8320b8388f1e3822c32024f84754f3a20a631
2023-10-17 12:21:08 -07:00
Hui Xiao 25d4379cc8 Make rate limiter single burst bytes runtime changeable (#11923)
Summary:
Context/Summary: as titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11923

Test Plan: new UT

Reviewed By: ajkr

Differential Revision: D49941161

Pulled By: hx235

fbshipit-source-id: f75a4d07f3cdd86863ea22c57f2bcd3a621baaf3
2023-10-16 10:21:35 -07:00
Jay Huh c9d8e6a5bf AttributeGroups - MultiGetEntity Implementation (#11925)
Summary:
Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`.
An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families.

Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`.

Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf".

Prior to this PR, if the user wants to query for two keys, "key_1" and "key_2" and but only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and get the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo.

With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925

Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added

will enable this new API in the `db_stress` in a separate PR

Reviewed By: ltamasi

Differential Revision: D50017414

Pulled By: jaykorean

fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586
2023-10-13 15:58:03 -07:00
anand76 90e160733e Fix runtime error in UpdateTieredCache due to integer underflow (#11949)
Summary:
With the introduction of the `UpdateTieredCache` API, its possible to dynamically change the compressed secondary cache ratio of the total cache capacity. In order to optimize performance, we avoid using a mutex when inserting/releasing placeholder entries, which can result in some inaccuracy in the accounting during the dynamic update. This inaccuracy was causing a runtime error due to an integer underflow in `UpdateCacheReservationRatio`, causing ubsan crash tests to fail. This PR fixes it by explicitly checking for the underflow.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11949

Test Plan:
1. Added a unit test that fails without the fix
2. Run ubsan_crash

Reviewed By: akankshamahajan15

Differential Revision: D50240217

Pulled By: anand1976

fbshipit-source-id: d2f7b79da54eec8b61aec2cc1f2943da5d5847ac
2023-10-12 15:09:40 -07:00
Peter Dillinger d010b02e86 Fix race in options taking effect (#11929)
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect.

To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine.

I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race.

Intended follow-up to this:
* Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929

Test Plan:
Added two mini-stress style regression tests that fail with >1% probability before this change:
DBOptionsTest::SetStatsDumpPeriodSecRace
ColumnFamilyTest::CreateAndDropPeriodicRace

I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway:
DBOptionsTest::SetStatsDumpPeriodSecRace

Reviewed By: ajkr

Differential Revision: D50024506

Pulled By: pdillinger

fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4
2023-10-12 10:05:23 -07:00
Andrew Kryczka 8a9cfd5292 Make stopped writes block on recovery (#11879)
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.

This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879

Reviewed By: anand1976

Differential Revision: D49571196

Pulled By: ajkr

fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0
2023-10-10 06:29:01 -07:00
Peter Dillinger 141b872bd4 Improve efficiency of create_missing_column_families, light refactor (#11920)
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.

Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920

Test Plan: tests added / updated

Reviewed By: jaykorean

Differential Revision: D49919147

Pulled By: pdillinger

fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
2023-10-04 14:14:22 -07:00
Levi Tamasi b00fa5597e Fix the handling of wide-column base values in the max_successive_merges logic (#11913)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913

The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator.

Reviewed By: jaykorean

Differential Revision: D49795097

fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3
2023-10-02 16:25:25 -07:00
Andrew Kryczka 3c4cc6c2cc flip default `DBOptions::fail_if_options_file_error` (#11800)
Summary:
Changed `DBOptions::fail_if_options_file_error` default from `false` to
`true`. It is safer to fail an operation by default when it encounters
an error.

Also changed the API doc to list items in the conventional way for listing items in a sentence. The slashes weren't working well as one got dropped, probably because it looked like a typo.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11800

Test Plan: rely on CI

Reviewed By: jowlyzhang

Differential Revision: D49030532

Pulled By: ajkr

fbshipit-source-id: e606062aa25f9063d8c6fb0d03aebca5c2bc56d3
2023-09-29 15:15:32 -07:00
Jay Huh 63ed868840 Offpeak in db option (#11893)
Summary:
RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage.

In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893

Test Plan:
- New Unit Test Added - `DBOptionsTest::OffPeakTimes`
- Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest`

Reviewed By: pdillinger

Differential Revision: D49714553

Pulled By: jaykorean

fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752
2023-09-29 13:03:39 -07:00
Hui Xiao fce04587b8 Only fallback to RocksDB internal prefetching on unsupported FS prefetching (#11897)
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/11631 introduced an undesired fallback behavior to RocksDB internal prefetching even when FS prefetching return non-OK status other than "Unsupported". We only want to fall back when FS prefetching is not supported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11897

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D49667055

Pulled By: hx235

fbshipit-source-id: fa36e4e5d6dc9507080217035f9d6ff8e4abda28
2023-09-26 18:44:41 -07:00
Hui Xiao 719f5511f6 No file system prefetching when Options::compaction_readahead_size is 0 (#11887)
Summary:
**Context/Summary:**

https://github.com/facebook/rocksdb/pull/11631 introduced `readahead()` system call for compaction read under non direct IO. When `Options::compaction_readahead_size` is 0, the `readahead()` will issued with a small size (i.e, the block size, by default 4KB)

Benchmarks shows that such readahead() call regresses the compaction read compared with "no readahead()" case (see Test Plan for more).

Therefore we decided to not issue such `readhead() ` when `Options::compaction_readahead_size` is 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11887

Test Plan:
Settings: `compaction_readahead_size = 0, use_direct_reads=false`
Setup:
```
TEST_TMPDIR=../ ./db_bench -benchmarks=filluniquerandom -disable_auto_compactions=true -write_buffer_size=1048576 -compression_type=none -value_size=10240 && tar -cf ../dbbench.tar -C ../dbbench/ .
```
Run:
```
for i in $(seq 3); do rm -rf ../dbbench/ && mkdir -p ../dbbench/ && tar -xf ../dbbench.tar -C ../dbbench/ . && sudo bash -c 'sync && echo 3 > /proc/sys/vm/drop_caches' && TEST_TMPDIR=../ /usr/bin/time ./db_bench_{pre_PR11631|PR11631|PR11631_with_improvementPR11887} -benchmarks=compact -use_existing_db=true -db=../dbbench/ -disable_auto_compactions=true -compression_type=none ; done |& grep elapsed
```

pre-PR11631("no readahead()" case):

PR11631:

PR11631+this improvement:

Reviewed By: ajkr

Differential Revision: D49607266

Pulled By: hx235

fbshipit-source-id: 2efa0dc91bac3c11cc2be057c53d894645f683ef
2023-09-26 10:08:43 -07:00
Changyu Bi 49da91ec09 Update files for version 8.8 (#11878)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11878

Reviewed By: ajkr

Differential Revision: D49568389

Pulled By: cbi42

fbshipit-source-id: b2022735799be9b5e81e03dfb418f8b104632ecf
2023-09-23 11:02:19 -07:00
akankshamahajan 3d67b5e8e5 Lookup ahead in block cache ahead to tune Readaheadsize (#11860)
Summary:
Implement block cache lookup to determine readahead_size during scans. It's enabled if auto_readahead_size, block_cache and iterate_upper_bound - all three are set.

Design -
1. Whenever there is a cache miss and FilePrefetchBuffer is called, a callback is made to determine readahead_size for that prefetching.
2. The callback iterates over index and do block cache lookup for each data block handle until existing readahead_size is reached. Then It removes the cache hit data blocks from end to calculate optimized readahead_size.
3. Since index_iter_ is moved, it stores block handles in a queue, and use that queue to get block handle instead of doing index_iter_->Next().
4. This is for Sync scans. Async scans support is in progress.

NOTE:
The issue right now is after Seek and Next, if Prev is called, there is no way to do Prev operation. index_iter_ is already pointing to a different block. So it returns "Not supported" in that case with error message - "auto tuning of readahead size is not supported with Prev op"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11860

Test Plan:
- Added new unit test
- crash_tests
- Running scans locally to check for any regression

Reviewed By: anand1976

Differential Revision: D49548118

Pulled By: akankshamahajan15

fbshipit-source-id: f1aee409a71b4ad9e5bf3610f43edf30c6630c78
2023-09-22 18:12:08 -07:00
anand76 48589b961f Fix updating the capacity of a tiered cache (#11873)
Summary:
Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity.

With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.

Tests:
New unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873

Reviewed By: akankshamahajan15

Differential Revision: D49562250

Pulled By: anand1976

fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-22 18:07:46 -07:00
Changyu Bi 0086809601 Fix a bug with atomic_flush that causes DB to stuck after a flush failure (#11872)
Summary:
With atomic_flush=true, a flush job with younger memtables wait for older memtables to be installed before install its memtables. If the flush for older memtables failed, auto-recovery starts a resume thread which can becomes stuck waiting for all background work to finish (including the flush for younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to background error but never rollback its memtable: 269478ee46/db/db_impl/db_impl_compaction_flush.cc (L725) This prevents any future flush to pick old memtables.

A more detailed repro is in unit test.

This PR fixes this issue by
1. Ensure we rollback memtables if an atomic flush fails due to background error
2. When there is a background error, abort atomic flushes that are waiting for older memtables to be installed
3. Do not schedule non-recovery flushes when there is a background error that stops background work

There was another issue with atomic_flush=true where DB can hang during DB close, see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11872

Test Plan: new unit test.

Reviewed By: jowlyzhang

Differential Revision: D49556867

Pulled By: cbi42

fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f
2023-09-22 16:43:50 -07:00
Levi Tamasi 6afde14266 Add changelog entry for wide-column full merge (#11874)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11874

Add a changelog entry for https://github.com/facebook/rocksdb/pull/11858 .

Reviewed By: jaykorean

Differential Revision: D49557350

fbshipit-source-id: 44fcd08e9847407d9f18dd3d9363d233f4591c84
2023-09-22 14:33:47 -07:00
Levi Tamasi 12d9386a4f Return a special OK status when the number of merge operands exceeds a threshold (#11870)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11870

Having a large number of merge operands applied at query time can have a significant effect on performance; therefore, applications might want limit the number of deltas for any given key. However, there is currently no way to establish the number of operands for certain types of queries. The ticker `READ_NUM_MERGE_OPERANDS` only provides aggregate (not per-read) information. The `PerfContext` counters `internal_merge_count` and `internal_merge_point_lookup_count` can be used to get this information on a per-query basis for iterators and single point lookups; however, there is no per-key breakdown for `MultiGet` type APIs. The patch addresses this issue by introducing a special kind of OK status which signals that an application-defined threshold on the number of merge operands has been exceeded for a given key. The threshold can be specified on a per-query basis using a new field in `ReadOptions`.

Reviewed By: jaykorean

Differential Revision: D49522786

fbshipit-source-id: 4265b3848d1be5ff313a3e8fb604ddf56411dd2c
2023-09-22 13:49:19 -07:00
anand76 269478ee46 Support compressed and local flash secondary cache stacking (#11812)
Summary:
This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache.

The basic design is as follows -
1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache.
2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block.
3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache.
4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded.

We can easily implement additional admission policies if desired.

Todo (In a subsequent PR):
1. Add to db_bench and run benchmarks
2. Add to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812

Reviewed By: pdillinger

Differential Revision: D49461842

Pulled By: anand1976

fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-21 20:30:53 -07:00
Changyu Bi b927ba5936 Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an  error message like the following:

IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```

This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: f42e70bf56/db/db_impl/db_impl_compaction_flush.cc (L3161)

This PR fixes the issue by rollback these pending flushes.

There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.

There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865

Test Plan: * Added repro unit tests.

Reviewed By: anand1976

Differential Revision: D49484922

Pulled By: cbi42

fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 15:31:29 -07:00
anand76 548aabfe5f Disable compressed secondary cache if capacity is 0 (#11863)
Summary:
This PR makes disabling the compressed secondary cache by setting capacity to 0 a bit more efficient. Previously, inserts/lookups would go to the backing LRUCache before getting rejected due to 0 capacity. With this change, insert/lookup would return from ```CompressedSecondaryCache``` itself.

Tests:
Existing tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11863

Reviewed By: akankshamahajan15

Differential Revision: D49476248

Pulled By: anand1976

fbshipit-source-id: f0f17a5e3df7d8bfc06709f8f23c1302056ba590
2023-09-20 22:30:17 -07:00
akankshamahajan c1a97fe1f6 Fix Assertion `roundup_len2 >= alignment' failed in crash tests (#11852)
Summary:
When auto_readahead_size is enabled in async_io, during seek, first buffer will prefetch the data - (current block + readahead till upper_bound). There can be cases where
1.  first buffer prefetched all the data till upper bound, or
2.  first buffer already has the data from prev seek call
and second buffer prefetch further leading to alignment issues.

This PR fixes that assertion and second buffer won't go for prefetching if first buffer has already prefetched till upper_bound.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11852

Test Plan:
- Added new unit test that failed without this fix.
- crash tests passed locally

Reviewed By: pdillinger

Differential Revision: D49384138

Pulled By: akankshamahajan15

fbshipit-source-id: 54417e909e4d986f1e5a17dbaea059cd4962fd4d
2023-09-20 16:13:20 -07:00
Hui Xiao 089070cb36 Expose more info about input files in `CompactionFilter::Context` (#11857)
Summary:
**Context:**
As requested, lowest level as well as a map from input file to its table properties among all input files used in table creation  (if any) are exposed in `CompactionFilter::Context`.

**Summary:**
This PR contains two commits:
(1) [Refactory](0012777f0e) to make resonating/using what is in `Compaction:: table_properties_` easier
- Separate `Compaction:: table_properties_` into `Compaction:: input_table_properties_` and `Compaction:: output_table_properties_`
- Separate the "set input table properties" logic into `Compaction:: SetInputTableProperties()`) from `Compaction:: GetInputTableProperties`
- Call `Compaction:: SetInputTableProperties()` as soon as possible, which is right after `Compaction::SetInputVersion()`. Bundle these two functions into one `Compaction::FinalizeInputInfo()` to minimize missing one or the other

(2) [Expose more info about input files:](6093e7dfba) `CompactionFilter::Context::input_start_level/input_table_properties`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11857

Test Plan:
- Modify existing UT `
TEST_F(DBTestCompactionFilter, CompactionFilterContextManual)` to cover new logics

Reviewed By: ajkr

Differential Revision: D49402540

Pulled By: hx235

fbshipit-source-id: 469fff50fa0e5964ffa5ea8db0743f61438ea392
2023-09-20 13:34:39 -07:00
chuhao zeng 8acf17002a Fix row cache falsely return kNotFound when timestamp enabled (#11816)
Summary:
**Summary:**
When row cache hits and a timestamp is being set in read_options, even though ROW_CACHE entry is hit, the return status is kNotFound.

**Cause of error:**
If timestamp is provided in readoptions, a callback for sequence number checking is registered [here](8fc78a3a9e/db/db_impl/db_impl.cc (L2112)).

Hence the default value set at this [line](694e49cbb1/table/get_context.cc (L611)) prevents get_context from saving value found in cache. Causing the final status to be kNotFound even though the entry exist in both cache and SST file.

**Proposed Solution**
Row cache key contains a sequence number in it. If the key for row cache lookup matches the key in cache, this cache entry should be good to be exposed to user and hence we reuse the sequence number in cache key rather than passing kMaxSequenceNumber.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11816

Reviewed By: ajkr

Differential Revision: D49419029

Pulled By: jowlyzhang

fbshipit-source-id: 6c77e9e751628d7d8e6c389f299e29a11ea824c6
2023-09-20 11:34:38 -07:00
Peter Dillinger 44e4ffd60d Release note for #11738 AutoHCC (#11855)
Summary:
Neglected in original PR

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11855

Test Plan: no code change

Reviewed By: jowlyzhang

Differential Revision: D49424759

Pulled By: pdillinger

fbshipit-source-id: 9a2ef2c0992a65f2a8f04dce9cefe978f8087407
2023-09-19 11:35:43 -07:00
Changyu Bi cc254efea6 Release compaction files in manifest write callback (#11764)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply().  This is done by passing in a [callback](ba59751430/db/version_set.h (L1199)) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11764

Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released

Reviewed By: ajkr

Differential Revision: D48742152

Pulled By: cbi42

fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
2023-09-18 13:11:53 -07:00
akankshamahajan 5b5b011cdd Avoid double block cache lookup during Seek with async_io option (#11616)
Summary:
With the async_io option, the Seek happens in 2 phases. Phase 1 starts an asynchronous read on a block cache miss, and phase 2 waits for it to complete and finishes the seek. In both phases, BlockBasedTable::NewDataBlockIterator is called, which tries to lookup the block cache for the data block first before looking in the prefetch buffer. It's optimized by doing the block cache lookup only in the first phase and save some CPU.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11616

Test Plan: Added unit test

Reviewed By: jaykorean

Differential Revision: D47477887

Pulled By: akankshamahajan15

fbshipit-source-id: 0355e0a68fc0ea2eb92340ae42735afcdbcbfd79
2023-09-18 11:32:30 -07:00
Andrew Kryczka 4196ad81e3 LZ4 set acceleration parameter (#11844)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11844

Test Plan:
Command:
```
for level in 1 1234 32767 -1 -10 -100 -1000 -10000 -100000 -1000000; do echo -n "level=$level " && ./db_bench -benchmarks=compress -compression_type=lz4 -compression_level=$level |& awk '/^compress / {print $5, $6}' ; done
```

Output:
```
level=1 181340 ops/sec
level=1234 183197 ops/sec
level=32767 181480 ops/sec
level=-1 181053 ops/sec
level=-10 662858 ops/sec
level=-100 2611516 ops/sec
level=-1000 3043125 ops/sec
level=-10000 3001351 ops/sec
level=-100000 2861834 ops/sec
level=-1000000 2906413 ops/sec
```

Reviewed By: cbi42

Differential Revision: D49331443

Pulled By: ajkr

fbshipit-source-id: c8909708c3b2b9b83bf2bda2d3f24b8a92d4c2ea
2023-09-18 09:26:29 -07:00
Peter Dillinger 1c6faf3587 Make RibbonFilterPolicy::bloom_before_level mutable (SetOptions()) (#11838)
Summary:
An internal user wants to be able to dynamically switch between Bloom and Ribbon filters, without a custom FilterPolicy. Making `filter_policy` mutable would actually make issue https://github.com/facebook/rocksdb/issues/10079 worse, because it would be a race on a pointer field, not just on scalars.

As a reasonable compromise until that is fixed, I am enabling dynamic control over Bloom vs. Ribbon choice by making
RibbonFilterPolicy::bloom_before_level mutable, and doing that safely by using an atomic.

I've also slightly tweaked the interpretation of that field so that setting it to INT_MAX really means "always Bloom."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11838

Test Plan: unit tests added/extended. crash test updated for SetOptions call and tested under TSAN with amplified probability (lower set_options_one_in).

Reviewed By: ajkr

Differential Revision: D49296284

Pulled By: pdillinger

fbshipit-source-id: e4251c077510df9a9c719876f482448c0d15402a
2023-09-15 15:46:10 -07:00
leipeng 68ce5d84f6 Add new Iterator API Refresh(const snapshot*) (#10594)
Summary:
This PR resolves https://github.com/facebook/rocksdb/issues/10487 & https://github.com/facebook/rocksdb/issues/10536, user code needs to call Refresh() periodically.

The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is `db_iter_->set_sequence(read_seq)  -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno()`.

This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10594

Test Plan:
* New unit tests.
* Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case.

TODO in a following PR:
* Stress test Iterator::Refresh() to different snapshots or no snapshot.

Reviewed By: ajkr

Differential Revision: D48456896

Pulled By: cbi42

fbshipit-source-id: 2e642c04e91235cc9542ef4cd37b3c20823bd779
2023-09-15 10:44:43 -07:00
Hui Xiao ed913513bd Fix a bug of rocksdb.file.read.verify.file.checksums.micros not being populated (#11836)
Summary:
**Context/Summary:**
`rocksdb.file.read.verify.file.checksums.micros ` was added in https://github.com/facebook/rocksdb/pull/11444 but the related path was not populated with statistics and clock object correctly so the actual statistics collection didn't happen. This PR fixed it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11836

Test Plan:
Setup:
```
./db_bench --benchmarks="fillrandom" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb
```
Run:
```
./db_bench --use_existing_db=1  --benchmarks="verifyfilechecksums" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb --statistics=1 --stats_level=4
```
Post-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 9.000000 P95 : 9.000000 P99 : 9.000000 P100 : 9.000000 COUNT : 1 SUM : 9
```

Pre-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

Reviewed By: ajkr

Differential Revision: D49293378

Pulled By: hx235

fbshipit-source-id: 1acd8b828c28e088d0c5d63897f53cd180b82f42
2023-09-15 10:36:14 -07:00
akankshamahajan 1e2fd343bb Update upper_bound_offset when reseek changes iterate_upper_bound dynamically (#11775)
Summary:
Update the logic in FilePrefetchBuffer to update `upper_bound_offset_` during reseek. During Reseek, `iterate_upper_bound` can be changed dynamically. So added an API to update that in FilePrefetchBuffer.
Added unit test to confirm the behavior.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11775

Test Plan:
- Check stress tests in case there is any failure after this diff.
- make crash_test -j32 with auto_readahead_size=1 passed locally

Reviewed By: anand1976

Differential Revision: D48815177

Pulled By: akankshamahajan15

fbshipit-source-id: 5f44fbb3af06c86a1c38f139c5fa4543891837f4
2023-09-15 10:05:56 -07:00
Yu Zhang e1fd348b92 Fix a bug in multiget for cleaning up SuperVersion (#11830)
Summary:
When `MultiGet` acquires `SuperVersion` via locking the db mutex and get the current `ColumnFamilyData::super_version_`, its corresponding cleanup logic is not correctly done.

It's currently doing this:
`MultiGetColumnFamilyData::cfd->GetSuperVersion().Unref()`

This operates on the most recent `SuperVersion` without locking db mutex , which is not thread safe by itself. And this unref operation is intended for the originally acquired `SuperVersion` instead of the current one. Because a race condition could happen where a new `SuperVersion` is installed in between this `MultiGet`'s ref and unref. When this race condition does happen, it's not sufficient to just unref the `SuperVersion`, `DBImpl::CleanupSuperVersion` should be called instead to properly clean up the `SuperVersion` had this `MultiGet` call be its last reference holder.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11830

Test Plan:
`make all check`
Added a unit test that would originally fail

Reviewed By: ltamasi

Differential Revision: D49287715

Pulled By: jowlyzhang

fbshipit-source-id: 8353636ee11b2e90d85c677a96a92360072644b0
2023-09-15 09:50:39 -07:00
Jay Huh f2b623bcc1 GetEntity Support for ReadOnlyDB and SecondaryDB (#11799)
Summary:
`GetEntity` API support for ReadOnly DB and Secondary DB.
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_readonly` and refactored current `Get()` logic into `GetImpl()` so that look up logic can be reused for `GetEntity()` (Following the same pattern as `DBImpl::Get()` and `DBImpl::GetEntity()`)
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_secondary` and refactored current `GetImpl()` logic. This is to make `DBImplSecondary::Get/GetEntity` consistent with `DBImpl::Get/GetEntity` and `DBImplReadOnly::Get/GetEntity`
- `GetImpl()` in `db_impl` is now virtual. both `db_impl_readonly` and `db_impl_secondary`'s `Get()` override are no longer needed since all three dbs now have the same `Get()` which calls `GetImpl()` internally.
- `GetImpl()` in `DBImplReadOnly` and `DBImplSecondary` now pass in `columns` instead of `nullptr` in lookup functions like `memtable->get()`
- Introduced `GetEntity()` API in `DBImplReadOnly` and `DBImplSecondary` which simply calls `GetImpl()` with `columns` set in `GetImplOptions`.
- Introduced `Env::IOActivity::kGetEntity` and set read_options.io_activity to `Env::IOActivity::kGetEntity` for `GetEntity()` operations (in db_impl)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11799

Test Plan:
**Unit Tests**
- Added verification in `DBWideBasicTest::PutEntity` by Reopening DB as ReadOnly with the same setup.
- Added verification in `DBSecondaryTest::ReopenAsSecondary` by calling `PutEntity()` and `GetEntity()` on top of existing `Put()` and `Get()`
- `make -j64 check`

**Crash Tests**
- `python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`
- `python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 `
- `python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`

Reviewed By: ltamasi

Differential Revision: D49037040

Pulled By: jaykorean

fbshipit-source-id: a0648253ded6e91af7953de364ed3c6bf163626b
2023-09-15 08:30:44 -07:00
Hui Xiao ef3e289b2d Conditionally exclude some L0 input files in size amp compaction (#11749)
Summary:
**Context/Summary:**
A size amp compaction can select and prevent a large number of L0 files from being selected by other compaction. If such compaction is running long or being queued behind, these L0 files will exist for long. With a few more flushes, we can run into write stop triggered by # L0 files. We've seen this happen on a host with many DBs sharing same thread pool, each of these DBs submits a size amp compaction with (110-180)+ files to the pool upon reopen and with a few more flushes, they hit the 200 L0 write stop condition.

The idea is to exclude some L0 input files in size amp compaction that are harmless to size amp reduction but improve the situation described above.

The exclusion algorithm is in `MightExcludeNewL0sToReduceWriteStop()` with two elements:

1.  #L0 to exclude + (level0_stop_writes_trigger - num_l0_input_pre_exclusion) should be in the range of [min_merge_width, max_merge_width].
    -  This is to ensure we are excluding enough L0 input files but not too many to be qualified to picked for another compaction along with the incoming future L0 files before write stop.
2. Based on (1), further constrain #L0 to exclude based on the post-exclusion compaction score. The goal is to ensure our exclusion will not disqualify the size amp compaction from being a size amp compaction after exclusion.

**Tets plan:** New unit test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11749

Reviewed By: ajkr

Differential Revision: D48850631

Pulled By: hx235

fbshipit-source-id: 2c321036e164087c36319dd5645cbbf6b6152092
2023-09-12 15:53:15 -07:00
Changyu Bi 9d71682d1b Add statistics `COMPACTION_CPU_TOTAL_TIME` for total compaction time (#11741)
Summary:
Existing compaction statistics are `COMPACTION_TIME` and `COMPACTION_CPU_TIME` which are histogram and are logged at the end of a compaction. The new statistics `COMPACTION_CPU_TOTAL_TIME` is for cumulative total compaction time which is updated regularly during a compaction. This allows user to more closely track compaction cpu usage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11741

Test Plan: * new unit test `DBTestWithParam.CompactionTotalTimeTest`

Reviewed By: ajkr

Differential Revision: D48608094

Pulled By: cbi42

fbshipit-source-id: b597109f3e4bf2237fb5a216b6fd036e5363b4c0
2023-09-12 15:48:36 -07:00
akankshamahajan ed5b6c0d99 Avoid alignment in FilePrefetchBuffer during seek with async_io (#11793)
Summary:
During Seek, the iterator seeks every file on L0. In async_io, it submit the requests to seek on every file on L0 asynchronously using RocksDB FilePrefetchBuffer.
However, FilePrefetchBuffer does alignment and reads extra bytes then needed that can increase the throughput.
In case of non direct io, the alignment can be avoided.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11793

Test Plan:
- Added a unit test that fails without this PR.
- make crash_test -j32 completed successfully

Reviewed By: anand1976

Differential Revision: D48985051

Pulled By: akankshamahajan15

fbshipit-source-id: 2d130a9e7c3df9c4fcd0408406e6277ab75a4389
2023-09-11 11:41:44 -07:00