Summary:
This PR fixes the issue that deletion of obsolete files during DB::Open is not rate limited.
The root cause is that slow deletion is disabled if the trash/DB size ratio exceeds the configured `max_trash_db_ratio` (d610e14f93/include/rocksdb/sst_file_manager.h (L126)). However, the current handling in DB::Open starts with tracking nothing but the obsolete files, which makes the ratio always look like it's 1.
In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB. So the main fix is to move the two places that attempt to delete files to after the tracking is done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call.
There are some other cosmetic changes, like refactoring the collection of all the DB paths into a function and renaming `DBImpl::DeleteUnreferencedSstFiles` to `DBImpl::MaybeUpdateNextFileNumber`, as it doesn't actually delete the files.
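For illustration, here is a minimal sketch (not part of this PR) of wiring up an `SstFileManager` so obsolete file deletions are rate limited; the path and numbers are examples only:
```
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_manager.h"

using namespace ROCKSDB_NAMESPACE;

int main() {
  Options options;
  options.create_if_missing = true;

  // Delete obsolete/trash files at most 1 MB/s. Slow deletion is bypassed
  // once the trash/DB size ratio exceeds max_trash_db_ratio (0.25 here),
  // which is why tracking the whole DB before deleting matters.
  Status sfm_status;
  options.sst_file_manager.reset(NewSstFileManager(
      options.env, /*info_log=*/nullptr, /*trash_dir=*/"",
      /*rate_bytes_per_sec=*/1 << 20, /*delete_existing_trash=*/true,
      &sfm_status, /*max_trash_db_ratio=*/0.25));

  DB* db = nullptr;
  Status s = DB::Open(options, "/tmp/rate_limited_deletion_example", &db);
  delete db;
  return 0;
}
```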
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590
Test Plan: Added unit test and verified with manual testing
Reviewed By: anand1976
Differential Revision: D56830519
Pulled By: jowlyzhang
fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/12542 introduced a bug where the wrong padded bytes were used to generate the file checksum if a flush happens during padding. This PR fixes it, along with an existing instance of the same bug for `perform_data_verification_=true`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598
Test Plan:
- New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption: ...file checksum mismatch`) and passes after
- Benchmark
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG 300 runs] : 421334 (± 4126) ops/sec; 46.6 (± 0.5) MB/sec
Post-PR: (no regression observed but a slight improvement)
fillseq [AVG 300 runs] : 425768 (± 4309) ops/sec; 47.1 (± 0.5) MB/sec
Reviewed By: ajkr, anand1976
Differential Revision: D56725688
Pulled By: hx235
fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5
Summary:
This feature has been around for a couple of years and users haven't reported any problems with it.
Not quite related: fixed a technical ODR violation in a public header for info_log_level in case the DEBUG build status changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377
Test Plan: unit tests updated; already covered in the crash test. Some unit tests expect specific behaviors of optimize_filters_for_memory=false, and we now need to bake that in.
Reviewed By: jowlyzhang
Differential Revision: D54129517
Pulled By: pdillinger
fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
Summary:
See comment at top of the test case and release note.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597
Reviewed By: jaykorean
Differential Revision: D56718786
Pulled By: ajkr
fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f
Summary:
When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 || s >= first_seqno_` in `rocksdb::MemTable::Add`.
We could fix this by either 1) writing a special record when reusing a log file, 2) implementing more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) not reusing files created by a previous DB session. We choose option 3 as it's the simplest, and flipping `recycle_log_file_num` is expected to be a rare event.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591
Test Plan: 1. Add a unit test to verify the bug and fix
Reviewed By: jowlyzhang
Differential Revision: D56655812
Pulled By: anand1976
fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87
Summary:
Made `BlockBasedTableOptions::block_align` incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`. Previously it was only incompatible with `ColumnFamilyOptions::compression`.
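For illustration, a hedged sketch of a combination that is now rejected (the error surfaces when the table factory options are validated at `DB::Open`):
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

using namespace ROCKSDB_NAMESPACE;

int main() {
  Options options;
  options.create_if_missing = true;

  BlockBasedTableOptions table_options;
  table_options.block_align = true;  // requires uncompressed blocks
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));

  // Now incompatible with block_align, in addition to
  // ColumnFamilyOptions::compression which was already rejected.
  options.bottommost_compression = kLZ4Compression;

  DB* db = nullptr;
  Status s = DB::Open(options, "/tmp/block_align_example", &db);
  // Expected: s.IsInvalidArgument() == true
  delete db;
  return 0;
}
```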
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12592
Reviewed By: hx235
Differential Revision: D56650862
Pulled By: ajkr
fbshipit-source-id: f5201602c2ce436e6d8d30893caa6a161a61f141
Summary:
I had a TODO to complete `CompactionOptions`'s compression API but never did it: d610e14f93/db/compaction/compaction_picker.cc (L371-L373)
Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`.
In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587
Reviewed By: hx235
Differential Revision: D56619273
Pulled By: ajkr
fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/12561 and https://github.com/facebook/rocksdb/issues/12566, the `NewIterators()` API has not been providing a consistent view of the DB across multiple column families. This PR addresses that by utilizing the `MultiCFSnapshot()` function that has been used for the `MultiGet()` APIs. To be able to obtain the thread-local super version with a reference, an `sv_exclusive_access` parameter has been added to `MultiCFSnapshot()` so that we can call `GetReferencedSuperVersion()` or `GetAndRefSuperVersion()` depending on the parameter, and to support the `Refresh()` API for MultiCfIterators.
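For illustration, a sketch of the `NewIterators()` usage that now observes a single consistent view across column families (error handling trimmed):
```
#include <vector>

#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

void ScanAllColumnFamilies(DB* db, const std::vector<ColumnFamilyHandle*>& cfs) {
  std::vector<Iterator*> iters;
  Status s = db->NewIterators(ReadOptions(), cfs, &iters);
  if (!s.ok()) {
    return;
  }
  // All iterators are created from the same consistent view of the DB.
  for (Iterator* it : iters) {
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
      // process it->key() / it->value()
    }
    delete it;
  }
}
```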
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12573
Test Plan:
**Unit Tests Added**
```
./db_iterator_test --gtest_filter="*IteratorsConsistentView*"
```
```
./multi_cf_iterator_test -- --gtest_filter="*ConsistentView*"
```
**Performance Check**
Setup
```
make -j64 release
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000
```
Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom : 6.374 micros/op 156892 ops/sec 6.374 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom : 6.265 micros/op 159627 ops/sec 6.265 seconds 1000000 operations; (0 of 1000000 found)
```
Reviewed By: jowlyzhang
Differential Revision: D56444066
Pulled By: jaykorean
fbshipit-source-id: 327ce73c072da30c221e18d4f3389f49115b8f99
Summary:
Prior to this PR the following sequence could happen:
1. `RunManualCompaction()` A schedules compaction to thread pool and waits
2. `RunManualCompaction()` B waits without scheduling anything due to conflict
3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both
4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done
5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool
6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction
7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution
8. `DisableManualCompaction()` returns as there are no more manual compactions running
Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward:
5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done
6'. `DisableManualCompaction()` returns as there are no more manual compactions running
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578
Reviewed By: cbi42
Differential Revision: D56528144
Pulled By: ajkr
fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b
Summary:
Previously `insert_hints_` was used for both point key table (`table_`) and range deletion table (`range_del_table_`). Hints include pointers to table data, so mixing hints for different tables together without tracking which hint corresponds to which table was problematic. We can just make the hints dedicated to the point key table only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12558
Reviewed By: hx235
Differential Revision: D56279019
Pulled By: ajkr
fbshipit-source-id: 00fe5ce72f9f11a1c1cba5f1977b908b2d518f29
Summary:
**Context/Summary:**
When `BlockBasedTableOptions::block_align=true`, we pad bytes to align blocks d41e568b1c/table/block_based/block_based_table_builder.cc (L1415-L1421).
Those bytes are not included in generating the file checksum upon file creation. But `VerifyFileChecksums()` includes those bytes when generating the file checksum to compare against the checksum generated upon file creation. Therefore a file checksum mismatch is returned by `VerifyFileChecksums()`.
We decided to include those padded bytes in generating the checksum upon file creation.
Bonus: also fixed surrounding code to use the actual padded bytes for verification - see https://github.com/facebook/rocksdb/pull/12542#discussion_r1571429163
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12542
Test Plan:
- New UT
- Benchmark
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG 300 runs] : 422857 (± 3942) ops/sec; 46.8 (± 0.4) MB/sec
Post-PR:
fillseq [AVG 300 runs] : 424707 (± 3799) ops/sec; 47.0 (± 0.4) MB/sec
Reviewed By: ajkr
Differential Revision: D56168447
Pulled By: hx235
fbshipit-source-id: 96209ef950d42943d336f11968ae3fcf9872fc2c
Summary:
This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427. On file systems that support storage-level data checksum and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but it's a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in a checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518
Reviewed By: ajkr
Differential Revision: D55932155
Pulled By: anand1976
fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12557
Unlike for other sequence containers, the C++ standard allows moving an `std::string` to invalidate pointers/iterators/references. In practice, this happens with short strings which are stored "inline" in the `std::string` object (small string optimization). Since `PinnableSlice` uses `std::string` as its internal buffer, and `PinnableWideColumns` in turn is implemented in terms of `PinnableSlice`, this means that the default compiler-generated move operations can invalidate the column index stored in `PinnableWideColumns::columns_`. The PR fixes this by providing custom move constructor/move assignment implementations for `PinnableWideColumns` that recreate the `columns_` index upon move.
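For background, a standalone illustration (plain C++, not RocksDB code) of the small string optimization pitfall described above; whether a given string is stored inline is implementation-defined, so this is only a sketch:
```
#include <iostream>
#include <string>
#include <utility>

int main() {
  std::string src = "short";         // short enough to be stored inline (SSO)
  const char* p = src.data();        // points into src's inline buffer

  std::string dst = std::move(src);  // contents are copied into dst's buffer

  // p still refers to the moved-from object's storage, not dst's data, so
  // any index built from such pointers must be rebuilt after a move -- which
  // is what the custom move operations for PinnableWideColumns now do.
  std::cout << (p == dst.data() ? "same buffer" : "different buffer") << "\n";
  return 0;
}
```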
Reviewed By: jaykorean
Differential Revision: D56275054
fbshipit-source-id: e8648c003dbcf1c39ec122ad229780c28138e730
Summary:
Adds an option to wait for purge to complete in the `WaitForCompact` API.
Internally, RocksDB has a way to wait for purge to complete (e.g. TEST_WaitForPurge() in db_impl_debug.cc), but there was no public API to gracefully wait for purge to complete.
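For illustration, a minimal sketch of using the new option; the exact field name (`wait_for_purge` below) is assumed from this PR's description:
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

Status WaitUntilQuiescent(DB* db) {
  WaitForCompactOptions wait_opts;
  wait_opts.wait_for_purge = true;  // also wait for background purge to finish
  return db->WaitForCompact(wait_opts);
}
```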
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12520
Test Plan:
Unit Test Added - `WaitForCompactWithWaitForPurgeOptionTest`
```
./deletefile_test -- --gtest_filter="*WaitForCompactWithWaitForPurgeOptionTest*"
```
Existing Tests
```
./db_compaction_test -- --gtest_filter="*WaitForCompactWithOption*"
```
Reviewed By: ajkr
Differential Revision: D55888283
Pulled By: jaykorean
fbshipit-source-id: cfc6d6e8657deaefab8961890b36e390095c9f65
Summary:
In https://github.com/facebook/rocksdb/issues/12365 we made `max_successive_merges` non-strict by default. Before https://github.com/facebook/rocksdb/issues/12365, `CountSuccessiveMergeEntries()`'s scan was implicitly limited to `max_successive_merges` entries for a given key, because after that the merge operator would be invoked and the merge chain would be collapsed. After https://github.com/facebook/rocksdb/issues/12365, the merge chain will not be collapsed no matter how long it is when the chain's operands are not all in memory. Since `CountSuccessiveMergeEntries()` scanned the whole merge chain, https://github.com/facebook/rocksdb/issues/12365 had a side effect that it would scan more memtable entries. This PR introduces a limit so it won't scan more entries than it could before.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12546
Reviewed By: jaykorean
Differential Revision: D56193693
Pulled By: ajkr
fbshipit-source-id: b070ba0703ef733e0ff230f89cd5cca5233b84da
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12539
As a follow-up to https://github.com/facebook/rocksdb/pull/12533, this PR extends `WriteBatchWithIndex` with a `MultiGetEntityFromBatchAndDB` API that enables users to perform batched wide-column point lookups with read-your-own-writes consistency. This API transparently combines data from the indexed write batch and the underlying database as needed and presents the results in the form of a wide-column entity.
Reviewed By: jaykorean
Differential Revision: D56153145
fbshipit-source-id: 537967051b7521bb41b04070ac1a78a1d8873c08
Summary:
Continuing from the previous MultiCfIterator Implementations - (https://github.com/facebook/rocksdb/issues/12422, https://github.com/facebook/rocksdb/issues/12480#12465), this PR completes the `AttributeGroupIterator` by implementing `AttributeGroupIteratorImpl::AddToAttributeGroups()`. While implementing the `AttributeGroupIterator`, we had to make some changes in `MultiCfIteratorImpl` and found an opportunity to improve `Coalesce()` in `CoalescingIterator`.
Lifted the `UNDER CONSTRUCTION - DO NOT USE` comment by replacing it with `EXPERIMENTAL`.
Here are some implementation details:
- `IteratorAttributeGroups` is introduced to avoid having to copy all `WideColumn` objects during iteration.
- `PopulateIterator()` no longer advances non-top iterators that have the same key as the top iterator in the heap.
- `AdvanceIterator()` needs to advance the non-top iterators when they have the same key as the top iterator in the heap.
- Instead of populating one by one, `PopulateIterator()` now collects all items with the same key and calls `populate_func(items)` at once.
- This allowed optimization in `Coalesce()` such that we no longer do K-1 rounds of 2-way merge, but do one K-way merge instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12534
Test Plan:
Uncommented the assertions in `verifyAttributeGroupIterator()`
```
./multi_cf_iterator_test
```
Reviewed By: ltamasi
Differential Revision: D56089019
Pulled By: jaykorean
fbshipit-source-id: 6b0b4247e221f69b40b147d41492008cc9b15054
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12533
The PR extends `WriteBatchWithIndex` with a new wide-column point lookup API `GetEntityFromBatchAndDB`. Similarly to `GetFromBatchAndDB`, the new API can transparently combine data from the write batch with data from the underlying database as needed. Like `DB::GetEntity`, it returns any result in the form of a wide-column entity (i.e. plain key-values are wrapped into an entity with a single anonymous column).
Reviewed By: jaykorean
Differential Revision: D56069132
fbshipit-source-id: 4f19cdeea4ce136497ce79fc9d28c925de59e220
Summary:
This PR adds support for programmatically iterating a raw table file with an iterator returned by `SstFileReader::NewTableIterator`, so that third-party tools can observe SST files created by RocksDB.
The original feature request was from this merge request: https://github.com/facebook/rocksdb/pull/12370
Since keys returned by raw table iterators are internal keys, this PR also adds a struct `ParsedEntryInfo` and a util method `ParseEntry` to help users parse internal keys, as well as `GetInternalKeyForSeek` and `GetInternalKeyForSeekForPrev` to help users create internal keys for seek operations with this raw table iterator.
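For illustration, a sketch of iterating a raw table file with the new API (assuming `NewTableIterator` returns a plain `Iterator*`):
```
#include <memory>
#include <string>

#include "rocksdb/options.h"
#include "rocksdb/sst_file_reader.h"

using namespace ROCKSDB_NAMESPACE;

void DumpRawEntries(const std::string& sst_path) {
  Options options;  // must use a comparator compatible with the file
  SstFileReader reader(options);
  if (!reader.Open(sst_path).ok()) {
    return;
  }
  std::unique_ptr<Iterator> it(reader.NewTableIterator());
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // it->key() is an internal key (user key + sequence number + type);
    // the ParsedEntryInfo / ParseEntry utilities added here can decode it.
  }
}
```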
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12385
Test Plan: Added unit tests
Reviewed By: cbi42
Differential Revision: D55662855
Pulled By: jowlyzhang
fbshipit-source-id: 0716a173ee95924fbd4e1f9b6cccf06525c40049
Summary:
**Context/Summary:**
We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than one file to L0. This refitting can cause a read performance regression, as we need to go through every file in L0, corruption in some edge cases, and false positive corruption caught by the force consistency check. We decided to explicitly disallow such behavior.
A related change to OptionChangeMigration():
- When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrated data into a couple of L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data when migration finishes, in case the migrated data is larger than max_table_files_size. This is achieved by first compacting all the data into a couple of non-L0 files and refitting those files from non-L0 to L0 if needed. That way, only some data, instead of all data, will be dropped immediately after migration to FIFO with a max_table_files_size.
- Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and we explicitly state this risk in the API comment.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481
Test Plan:
- New UT
- Modified UT
Reviewed By: cbi42
Differential Revision: D55351178
Pulled By: hx235
fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80
Summary:
This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL.
This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403
Test Plan:
1. Add new unit tests to verify behavior upon corruption
2. Re-enable disabled tests for verifying recycling behavior
Reviewed By: ajkr
Differential Revision: D54544824
Pulled By: anand1976
fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a
Summary:
This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including:
1) Handle a special case for `WriteBatch::TimedPut`. When the user specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error, but it instead creates and writes a regular `Put` entry.
2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries.
3) FlushJob is updated to use the seqno-to-time mapping copy in `SuperVersion`. FlushJob currently copies the DB's seqno-to-time mapping while holding the DB mutex and only copies the part of interest, i.e. the part that goes back no further than the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to a preferred seqno as much as possible.
Test plans:
Added unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455
Reviewed By: pdillinger
Differential Revision: D55165422
Pulled By: jowlyzhang
fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47
Summary:
On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.
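For illustration, a sketch of how a `FileSystem` implementation could advertise this capability (the `SupportedOps` signature is assumed from file_system.h):
```
#include "rocksdb/file_system.h"

using namespace ROCKSDB_NAMESPACE;

class ReconstructingFS : public FileSystemWrapper {
 public:
  explicit ReconstructingFS(const std::shared_ptr<FileSystem>& target)
      : FileSystemWrapper(target) {}
  const char* Name() const override { return "ReconstructingFS"; }

  void SupportedOps(int64_t& supported_ops) override {
    target()->SupportedOps(supported_ops);
    // Opt in to RocksDB retrying SST block reads on checksum mismatch.
    supported_ops |= (1 << FSSupportedOps::kVerifyAndReconstructRead);
  }
};
```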
Tests:
Add new unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427
Reviewed By: ajkr
Differential Revision: D55025941
Pulled By: anand1976
fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
Summary:
Thanks ltamasi for pointing out this bug.
We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443
Reviewed By: cbi42
Differential Revision: D54930128
Pulled By: ajkr
fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d
Summary:
The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because:
- Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge
- The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me
If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with.
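For illustration, a sketch of using `continue_cb` to stop fetching operands early (the callback semantics are as assumed here: returning false stops reading further LSM components for the key):
```
#include <vector>

#include "rocksdb/db.h"

using namespace ROCKSDB_NAMESPACE;

Status ReadSomeOperands(DB* db, const Slice& key) {
  GetMergeOperandsOptions opts;
  opts.expected_max_number_of_operands = 32;
  int seen = 0;
  // Stop once four operands have been collected; deeper LSM components
  // are then not read for this key.
  opts.continue_cb = [&seen](Slice /*operand*/) { return ++seen < 4; };

  std::vector<PinnableSlice> operands(32);
  int num_operands = 0;
  return db->GetMergeOperands(ReadOptions(), db->DefaultColumnFamily(), key,
                              operands.data(), &opts, &num_operands);
}
```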
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438
Reviewed By: hx235
Differential Revision: D54914669
Pulled By: ajkr
fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a
Summary:
Crash tests were failing due to data race in accessing `purge_wal_files_last_run_`. This PR changes it to atomic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439
Test Plan:
- existing UT
- not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests
Reviewed By: anand1976
Differential Revision: D54920817
Pulled By: cbi42
fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12424
The PR adds a wide-column point lookup API `GetEntityFromBatch` to `WriteBatchWithIndex`. Similarly to APIs like `DB::GetEntity`, this new API returns wide-column entities as-is, and wraps plain values in an entity with a single column (the anonymous default column). Also, similarly to `WriteBatchWithIndex::GetFromBatch`, it only reads data from the batch itself.
Reviewed By: jaykorean
Differential Revision: D54826535
fbshipit-source-id: 92604f3ebd90fe1afbd36f2d2194b7dee0011efa
Summary:
When PR https://github.com/facebook/rocksdb/issues/9629 introduced user-defined timestamp support for `WriteCommittedTxn`, it added a usage mandate for the `GetForUpdate` API when UDT is enabled: the `do_validate` flag has to be true, and the user should have already called `Transaction::SetReadTimestampForValidation` to set a read timestamp for validation. The rationale behind this mandate is this:
1) With do_validate = true, `GetForUpdate` can verify this relationship: let's denote the user-defined timestamp in the DB for the key as `Ts_db` and the read timestamp the user sets via `Transaction::SetReadTimestampForValidation` as `Ts_read`. UDT-based validation will only pass if `Ts_db <= Ts_read`.
5950907a82/utilities/transactions/transaction_util.cc (L141)
2) Let's denote the commit timestamp set via `Transaction::SetCommitTimestamp` as `Ts_cmt`. Later `WriteCommittedTxn::Commit` will only pass if this condition is met: `Ts_read < Ts_cmt`. 5950907a82/utilities/transactions/pessimistic_transaction.cc (L431)
Together these two checks ensure `Ts_db < Ts_cmt`, meeting the user-defined timestamp invariant that a newer timestamp should have a newer sequence number.
The `do_validate` flag was originally intended to make snapshot-based validation optional. If it's true, `GetForUpdate` checks that no entry is written after the snapshot. If it's false, it will skip this snapshot-based validation. In this PR, we are making the UDT-based validation configurable based on this flag too, instead of mandating it, for the reasons below: 1) In some cases users can enforce the aforementioned invariant on their side independently, without RocksDB's help, for example if they manage a monotonically increasing timestamp and their transactions are only committed in a single thread; they don't need this UDT-based validation and want to skip it. 2) It could also be expensive or impractical for users to come up with a read timestamp that is exactly in between their commit timestamp and the DB's timestamp. For example, in the aforementioned case where a monotonically increasing timestamp is managed, the users would need to access this timestamp both for setting the read timestamp and for setting the commit timestamp, so it's preferable to skip this check too.
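For illustration, a sketch of the ordering these checks enforce when validation is on (`Ts_db <= Ts_read < Ts_cmt`); details such as column family and comparator setup are omitted:
```
#include <string>

#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"

using namespace ROCKSDB_NAMESPACE;

Status UpdateWithValidation(TransactionDB* txn_db, const Slice& key,
                            uint64_t read_ts, uint64_t commit_ts) {
  Transaction* txn = txn_db->BeginTransaction(WriteOptions());
  // Required before GetForUpdate when UDT-based validation is performed.
  txn->SetReadTimestampForValidation(read_ts);

  std::string value;
  Status s = txn->GetForUpdate(ReadOptions(), key, &value,
                               /*exclusive=*/true);  // do_validate defaults to true
  if (s.ok()) {
    s = txn->Put(key, "new_value");
  }
  if (s.ok()) {
    txn->SetCommitTimestamp(commit_ts);  // must satisfy Ts_read < Ts_cmt
    s = txn->Commit();
  }
  delete txn;
  return s;
}
```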
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12369
Test Plan: added unit tests
Reviewed By: ltamasi
Differential Revision: D54268920
Pulled By: jowlyzhang
fbshipit-source-id: ca7693796f9bb11f376a2059d91841e51c89435a
Summary:
This PR updates `VersionEditHandlerPointInTime` to recover all or none of the updates in an AtomicGroup. This makes best-effort recovery properly handle atomic flushes during recovery, so the features are now allowed to both be enabled at once.
The new logic requires that AtomicGroups do not contain column family additions or removals. AtomicGroups are currently written for atomic flush, which does not include such edits.
Column family additions or removals are recovered independently of AtomicGroups. The new logic needs to be aware of removal, though, so that a dropped CF does not prevent completion of an AtomicGroup recovery.
The new logic treats each AtomicGroup as if it contains updates for all existing column families, even though it is possible to create AtomicGroups that only affect a subset of column families. This simplifies the logic at the expense of recovering less data in certain edge case scenarios.
The usage of `MaybeCreateVersion()` is pretty tricky. The goal is to create a barrier at the start of an AtomicGroup such that all valid states up to that point will be applied to `versions_`. Here is a summary.
- `MaybeCreateVersion(..., false)` creates a `Version` on a negative edge trigger (transition from valid to invalid). It was previously called when applying each update. Now, it is only called when applying non-AtomicGroup updates.
- `MaybeCreateVersion(..., true)` creates a `Version` on a positive level trigger (valid state). It was previously called only at the end of iteration. Now, it is additionally called before processing an AtomicGroup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12406
Reviewed By: jaykorean, cbi42
Differential Revision: D54494904
Pulled By: ajkr
fbshipit-source-id: 0114a9fe1d04b471d086dcab5978ea8a3a56ad52
Summary:
Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388
In terms of public API:
* Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre.
* Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general).
* Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature.
* Note that "fail_if_not_bottommost_level" is an obsolete/confusing use of "bottommost"
In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed.
Eventual suggested follow-up:
* Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare).
* More temperature handling for CreateColumnFamilyWithImport and Checkpoints.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402
Test Plan:
Deeply revamped ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer.
Used macros instead of functions for better tracing to critical source location on test failures.
Some enhancements to FileTemperatureTestFS in the process of developing the revamped test.
Reviewed By: jowlyzhang
Differential Revision: D54442794
Pulled By: pdillinger
fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995
Summary:
The current design proposes using a combination of `job_id`, `db_id`, and `db_session_id` to create a unique identifier for remote compaction jobs. However, this approach may not be suitable for users who prefer a different format for the unique identifier.
At Meta, we are utilizing generic compute offload to offload compaction tasks to remote workers. The compute offload client generates a UUID for each task, which requires an update to the current RocksDB API for onboarding purposes.
Users still have the option to create the unique identifier by combining `job_id`, `db_id`, and `db_session_id` if they prefer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12384
Test Plan:
```
$> ./compaction_service_test 13:29:35
[==========] Running 14 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 14 tests from CompactionServiceTest
[ RUN ] CompactionServiceTest.BasicCompactions
[ OK ] CompactionServiceTest.BasicCompactions (2642 ms)
[ RUN ] CompactionServiceTest.ManualCompaction
[ OK ] CompactionServiceTest.ManualCompaction (454 ms)
[ RUN ] CompactionServiceTest.CancelCompactionOnRemoteSide
[ OK ] CompactionServiceTest.CancelCompactionOnRemoteSide (1643 ms)
[ RUN ] CompactionServiceTest.FailedToStart
[ OK ] CompactionServiceTest.FailedToStart (1332 ms)
[ RUN ] CompactionServiceTest.InvalidResult
[ OK ] CompactionServiceTest.InvalidResult (1516 ms)
[ RUN ] CompactionServiceTest.SubCompaction
[ OK ] CompactionServiceTest.SubCompaction (551 ms)
[ RUN ] CompactionServiceTest.CompactionFilter
[ OK ] CompactionServiceTest.CompactionFilter (563 ms)
[ RUN ] CompactionServiceTest.Snapshot
[ OK ] CompactionServiceTest.Snapshot (124 ms)
[ RUN ] CompactionServiceTest.ConcurrentCompaction
[ OK ] CompactionServiceTest.ConcurrentCompaction (660 ms)
[ RUN ] CompactionServiceTest.CompactionInfo
[ OK ] CompactionServiceTest.CompactionInfo (984 ms)
[ RUN ] CompactionServiceTest.FallbackLocalAuto
[ OK ] CompactionServiceTest.FallbackLocalAuto (343 ms)
[ RUN ] CompactionServiceTest.FallbackLocalManual
[ OK ] CompactionServiceTest.FallbackLocalManual (380 ms)
[ RUN ] CompactionServiceTest.RemoteEventListener
[ OK ] CompactionServiceTest.RemoteEventListener (491 ms)
[ RUN ] CompactionServiceTest.TablePropertiesCollector
[ OK ] CompactionServiceTest.TablePropertiesCollector (169 ms)
[----------] 14 tests from CompactionServiceTest (11854 ms total)
[----------] Global test environment tear-down
[==========] 14 tests from 1 test case ran. (11855 ms total)
[ PASSED ] 14 tests.
```
Reviewed By: hx235
Differential Revision: D54220339
Pulled By: jaykorean
fbshipit-source-id: 5a9054f31933d1996adca02082eb37b6d5353224
Summary:
.. so write time can be measured under the new perf level for single-threaded writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12394
Test Plan: * add a new UT `PerfContextTest.WriteMemtableTimePerfLevel`
Reviewed By: anand1976
Differential Revision: D54326263
Pulled By: cbi42
fbshipit-source-id: d0e334d9581851ba6cf53c776c0bd876365d1e00
Summary:
When the rate limiter does not have any waiting requests, the first request to arrive may consume all of the available bandwidth, despite potentially having lower priority than requests that arrive later in the same refill interval. Then, those higher priority requests must wait for a refill. So even in scenarios in which we have an overall bandwidth surplus, the highest priority requests can be sporadically delayed up to a whole refill period.
Alone, this isn't necessarily problematic as the refill period is configurable via `refill_period_us` and can be tuned down as needed until the max sporadic delay is tolerable. However, tuning down `refill_period_us` had a side effect of reducing burst size. Some users require a certain burst size to issue optimal I/O sizes to the underlying storage system.
To satisfy those users, this PR decouples the refill period from the burst size. That way, the max sporadic delay can be limited without impacting I/O sizes issued to the underlying storage system.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12379
Test Plan:
The goal is to show we can now limit the max sporadic delay without impacting compaction's I/O size.
The benchmark runs compaction with a large I/O size, while user reads simultaneously run at a low rate that does not consume all of the available bandwidth. The max sporadic delay is measured using the P100 of rocksdb.file.read.get.micros. I just used strace to verify the compaction reads follow `rate_limiter_single_burst_bytes`
Setup: `./db_bench -benchmarks=fillrandom,flush -write_buffer_size=67108864 -disable_auto_compactions=true -value_size=256 -num=1048576`
Benchmark: `./db_bench -benchmarks=readrandom -use_existing_db=true -num=1048576 -duration=10 -benchmark_read_rate_limit=4096 -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -rate_limit_user_ops=true -statistics=true -cache_size=0 -stats_level=5 -compaction_readahead_size=16777216 -use_direct_reads=true`
Results:
refill_micros | rocksdb.file.read.get.micros (P100)
-- | --
10000 | 10802
100000 | 100240
1000000 | 922061
For verifying compaction read sizes: `strace -fye pread64 ./db_bench -benchmarks=compact -use_existing_db=true -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -compaction_readahead_size=16777216 -use_direct_reads=true`
Reviewed By: hx235
Differential Revision: D54165675
Pulled By: ajkr
fbshipit-source-id: c5968486316cbfb7ff8e5b7d75d3589883dd1105
Summary:
Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration, point lookup breakdown etc. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e a RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delay or stalls etc. It does not include time spent waiting to acquire mutexes, or waiting for IO.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368
Test Plan: Add a unit test for write_thread_wait_nanos
Reviewed By: ajkr
Differential Revision: D54021583
Pulled By: anand1976
fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48
Summary:
This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for the external file's comparator name and user-defined timestamps related flag. With this, it now supports ingesting files into a column family that enables the user-defined timestamps in Memtable only feature.
Two fields in the table properties are used for the aforementioned check: 1) the comparator name, which records what comparator was used to create this external SST file, and 2) the flag `user_defined_timestamps_persisted`. We compare these two fields with the column family's settings. The details are in the util function `ValidateUserDefinedTimestampsOptions`.
To optimize for the majority of cases where the sanity check should pass and the table properties read should not affect how `TableReader` is constructed, instead of reading the table properties block separately and using it for the sanity check before creating a `TableReader`, we continue using the current flow: first create a `TableReader`, use it for reading table properties and doing sanity checks, and then reset the `TableReader` for the case where the column family enables the UDTs in memtable only feature and the external file does not contain user-defined timestamps.
This PR also groups other table properties related sanity check in function `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356
Test Plan:
added unit test
existing unit test
Reviewed By: cbi42
Differential Revision: D54025116
Pulled By: jowlyzhang
fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554
Summary:
This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365
Test Plan:
"rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read.
```
$ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true |& grep 'block.cache.data.add COUNT' ; done
strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0
strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1
```
Reviewed By: hx235
Differential Revision: D53982520
Pulled By: ajkr
fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4
Summary:
with release notes for 9.0.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D53879416
Pulled By: jowlyzhang
fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327
Reviewed By: pdillinger
Differential Revision: D53828151
Pulled By: anand1976
fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
Summary:
It's in production for a large storage service, and it was initially released 6 months ago (8.6.0). IMHO that's enough room for "easy downgrade" to most any user's previously integrated version, even if they only update a few times a year.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12352
Test Plan:
tests updated, including format compatibility test
table_test: ApproximateOffsetOfCompressed is affected because adding the index block to the metaindex adds about 13 bytes to SST files in format_version 6. This test has historically been problematic, and one reason is that, apparently, it could pass/fail depending not only on the snappy compression version, but also on how long your host name is, because of db_host_id. I've cleared that out for the test, which takes care of format_version=6 and hopefully improves long-term reliability.
Suggested follow-up: FinishImpl in table_test.cc takes a table_options that is ignored in some cases and might not match the ioptions.table_factory configuration unless the caller is very careful. This should be cleaned up somehow.
Reviewed By: anand1976
Differential Revision: D53786884
Pulled By: pdillinger
fbshipit-source-id: 1964cbd40d3ab0a821fdc01c458031df716fcf51
Summary:
.. for the public API change related to sst_dump.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12353
Reviewed By: jaykorean
Differential Revision: D53791123
Pulled By: cbi42
fbshipit-source-id: 3fbe9c7a3eb0a30dc1a00d39bc8a46028baa3779
Summary:
This PR adds support in `SstFileWriter` to create SST files without persisting timestamps when the column family has enabled the UDTs in Memtable only feature. The SST files created from flush and compaction do not contain timestamps; we want the SST files created by `SstFileWriter` to follow the same pattern and not persist timestamps. This is to prepare for ingesting external SST files for this type of column family.
There are timestamp-aware APIs and non timestamp-aware APIs in `SstFileWriter`. The former are exclusively used when the column family's comparator is timestamp-aware, i.e. `Comparator::timestamp_size() > 0`, while the latter are exclusively used when the column family's comparator is not timestamp-aware, i.e. `Comparator::timestamp_size() == 0`. There are sanity checks to make sure these APIs are used correctly.
In this PR, the API usage continues with the above enforcement: even though timestamps are not eventually persisted, users are still asked to use only the timestamp-aware APIs. But because data points will logically all have minimum timestamps, we don't allow multiple versions of the same user key (without timestamp) to be added.
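For illustration, a sketch of writing such a file with the timestamp-aware API while timestamps are not persisted (assumes the passed-in options use a timestamp-aware comparator with persist_user_defined_timestamps=false, and one version per user key):
```
#include <string>

#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

using namespace ROCKSDB_NAMESPACE;

Status WriteIngestionFile(const Options& cf_options, const std::string& path) {
  SstFileWriter writer(EnvOptions(), cf_options);
  Status s = writer.Open(path);
  if (!s.ok()) {
    return s;
  }
  // Data points logically carry the minimum timestamp.
  std::string min_ts(cf_options.comparator->timestamp_size(), '\0');
  s = writer.Put("key1", min_ts, "value1");
  if (s.ok()) s = writer.Put("key2", min_ts, "value2");
  if (s.ok()) s = writer.Finish();
  return s;
}
```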
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12348
Test Plan:
Added unit tests
Manual inspection of generated sst files with `sst_dump`
Reviewed By: ltamasi
Differential Revision: D53732667
Pulled By: jowlyzhang
fbshipit-source-id: e43beba0d3a1736b94ee5c617163a6280efd65b7
Summary:
There is no strong reason for users to need this mode, while on the other hand its behavior is destructive.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337
Reviewed By: hx235
Differential Revision: D53630393
Pulled By: jowlyzhang
fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347
`DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels).
Reviewed By: jowlyzhang
Differential Revision: D53675153
fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481
Summary:
(as title)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12336
Test Plan: in use at Meta for a large service; in crash test
Reviewed By: hx235
Differential Revision: D53537628
Pulled By: pdillinger
fbshipit-source-id: 69e7ac9ab7b59b928d1144105667a7fde8a55a5a
Summary:
Introduce different range classes, `UserKeyRange` and `UserKeyRangePtr`, to be used by the internal implementation. The `Range` class is used both in public APIs like `DB::GetApproximateSizes`, `DB::GetApproximateMemTableStats`, `DB::GetPropertiesOfTablesInRange` etc. and in internal implementations like `ColumnFamilyData::RangesOverlapWithMemtables` and `VersionSet::GetPropertiesOfTablesInRange`.
These APIs have different expectations of what keys this range class contains. Public API users are supposed to populate the range with user keys without timestamps, in the same way that point lookup and range scan APIs' key input only expects the user key without timestamp. The internal API implementations expect a user key whose format is compatible with the user comparator, i.e. a user key with the timestamp.
This PR contains:
1) introducing counterpart range classes `UserKeyRange` and `UserKeyRangePtr` for internal implementation, while leaving the existing `Range` and `RangePtr` classes only for public APIs. Internal implementations are updated to use these new classes instead.
2) adding user-defined timestamp support for the `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12071
Test Plan:
existing tests
Added test for `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs for when user-defined timestamp is enabled.
The change in external_file_ingestion_job doesn't have test coverage with user-defined timestamps enabled; one will be added in a follow-up PR that adds file ingestion support for UDT.
Reviewed By: ltamasi
Differential Revision: D53292608
Pulled By: jowlyzhang
fbshipit-source-id: 9a9279e23c640a6d8f8232636501a95aef7638b8
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10835 to allow disabling the new compaction behavior if it's not safe. The option is enabled by default and there has not been a need to disable it. So it should be safe to remove now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12323
Reviewed By: ajkr
Differential Revision: D53330336
Pulled By: cbi42
fbshipit-source-id: 36eef4664ac96b3a7ed627c48bd6610b0a7eafc5
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10655 to allow reverting to old behavior. The option is enabled by default and there has not been a need to disable it. Remove it for 9.0 release. Also fixed and improved a few unit tests that depended on setting this option to false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12325
Test Plan: existing tests.
Reviewed By: hx235
Differential Revision: D53369430
Pulled By: cbi42
fbshipit-source-id: 0ec2440ca8d88db7f7211c581542c7581bd4d3de
Summary:
The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met -
1. Duplicate keys in a MultiGet batch
2. Key matches the last key in a non-zero, non-bottommost level file
3. Final value is not in the file (merge operand, not snapshot visible etc)
4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block
5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates
6. Value or merge operand for the key is present in the very next level
The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under conditions 1 and 2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295
Test Plan: Add unit tests that fail an assertion or return wrong result without the fix
Reviewed By: hx235
Differential Revision: D53187634
Pulled By: anand1976
fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea
Summary:
`check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311
Reviewed By: cbi42
Differential Revision: D53233379
Pulled By: ajkr
fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637
Summary:
As titled. This changes public API behavior, and subclasses of `WritableFile` and `FSWritableFile` need to explicitly provide an implementation for the `GetFileSize` method after this change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12303
Reviewed By: ajkr
Differential Revision: D53205769
Pulled By: jowlyzhang
fbshipit-source-id: 2e613ca3650302913821b33159b742bdf1d24bc7
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. This PR adds pressure detection based on the number of files marked for compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12306
Reviewed By: cbi42
Differential Revision: D53200559
Pulled By: ajkr
fbshipit-source-id: 63402ee336881a4539204d255960f04338ab7a0e
Summary:
As titled, the replacement tickers have been introduced in https://github.com/facebook/rocksdb/issues/11509 and in use since release 8.4. This PR completely removes the misspelled ones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12302
Test Plan: CI tests
Reviewed By: jaykorean
Differential Revision: D53196935
Pulled By: jowlyzhang
fbshipit-source-id: 9c9d0d321247690db5edfdc52b4fecb2f1218979
Summary:
Provide support for FSBuffer for point lookups
It also adds support for compaction and scan reads that go through BlockFetcher when readahead/prefetching is not enabled.
Some of the compaction/scan reads go through FilePrefetchBuffer and some through BlockFetcher. This PR adds support for using the underlying file system's scratch buffer for reads that go through BlockFetcher; for FilePrefetchBuffer reads, the design to support this feature is more complicated.
Design - In order to use the underlying FileSystem-provided scratch for reads, it uses MultiRead with one request instead of the Read API, which required an API change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12266
Test Plan: Stress test using underlying file system scratch buffer internally.
Reviewed By: anand1976
Differential Revision: D53019089
Pulled By: akankshamahajan15
fbshipit-source-id: 4fe3d090d77363320e4b67186fd4d51c005c0961
Summary:
Introduce a new option `intra_l0_compaction_size` to allow more intra-L0 compaction when the total L0 size is under a threshold. This option applies only to leveled compaction. It is enabled by default and set to `max_bytes_for_level_base / max_bytes_for_level_multiplier` only for atomic_flush users. When atomic_flush=true, it is more likely that some CF's total L0 size is small when it's eligible for compaction. This option aims to reduce write amplification in this case.
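For illustration, a sketch of setting the option explicitly (assuming it is a regular column family option; the value mirrors the default this PR applies for atomic_flush users):
```
#include "rocksdb/options.h"

using namespace ROCKSDB_NAMESPACE;

void TuneIntraL0(Options& options) {
  options.atomic_flush = true;
  // Allow intra-L0 compaction while the total L0 size stays below this threshold.
  options.intra_l0_compaction_size = static_cast<uint64_t>(
      options.max_bytes_for_level_base / options.max_bytes_for_level_multiplier);
}
```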
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12214
Test Plan:
- new unit test
- benchmark:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --write_buffer_size=51200 --max_bytes_for_level_base=5242880 --level0_file_num_compaction_trigger=4 --statistics=1
main:
fillrandom : 234.499 micros/op 4264 ops/sec 234.499 seconds 1000000 operations; 0.5 MB/s
rocksdb.compact.read.bytes COUNT : 1490756235
rocksdb.compact.write.bytes COUNT : 1469056734
rocksdb.flush.write.bytes COUNT : 71099011
branch:
fillrandom : 128.494 micros/op 7782 ops/sec 128.494 seconds 1000000 operations; 0.9 MB/s
rocksdb.compact.read.bytes COUNT : 807474156
rocksdb.compact.write.bytes COUNT : 781977610
rocksdb.flush.write.bytes COUNT : 71098785
```
Reviewed By: ajkr
Differential Revision: D52637771
Pulled By: cbi42
fbshipit-source-id: 4f2c7925d0c3a718635c948ea0d4981ed9fabec3
Summary:
with release notes for 8.11.fb, format_compatible test update, and version.h update.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256
Test Plan: CI
Reviewed By: cbi42
Differential Revision: D52926051
Pulled By: pdillinger
fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4
Summary:
The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.
* Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc.
* Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
* Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
* Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
* Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries.
* Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.`
* Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
* Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.
Intended follow-up (me or others):
* Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
* The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.
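To make the query semantics concrete, here is a conceptual sketch (my illustration only, not the internal implementation) of what `GetProximalSeqnoBeforeTime` is answering: given sorted (seqno, time) pairs, find the largest seqno known to be written at or before the given time, so everything at or below that seqno is definitely at least that old.
```
#include <cstdint>
#include <utility>
#include <vector>

uint64_t ProximalSeqnoBeforeTime(
    const std::vector<std::pair<uint64_t, uint64_t>>& seqno_time_pairs
    /* sorted by seqno; times are non-decreasing */,
    uint64_t time) {
  uint64_t best = 0;  // 0 means "no data known to be at least this old"
  for (const auto& [seqno, write_time] : seqno_time_pairs) {
    if (write_time <= time) {
      best = seqno;  // advance the cut-off while times are old enough
    } else {
      break;  // sorted input: later entries are newer
    }
  }
  return best;
}
```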
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253
Test Plan: unit tests updated
Reviewed By: jowlyzhang
Differential Revision: D52913733
Pulled By: pdillinger
fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593
Summary:
These options were added for users to roll back a behavior change without downgrading. To our knowledge they were not needed so can now be removed.
- `level_compaction_dynamic_file_size`
- `ignore_max_compaction_bytes_for_input`
These options were added for users to disable an online validation in case it is expensive or has false positives. Those validations have shown to be cheap, correct, and are enabled by default, so these options can be removed.
- `check_flush_compaction_key_order`
- `flush_verify_memtable_count`
- `compaction_verify_record_count`
- `fail_if_options_file_error`
This option was added for users to violate API contracts or run old databases that used to violate API contracts. It appears to be set by MyRocks so it is unclear whether we can remove it. In any case we should discourage it until it can be removed.
- `enforce_single_del_contracts`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12249
Reviewed By: cbi42
Differential Revision: D52886651
Pulled By: ajkr
fbshipit-source-id: e0d5a35144ce048505899efb1ca68c3948050aa4
Summary:
Add ```CompressionOptions``` to ```CompressedSecondaryCacheOptions``` to allow users to set options such as compression level. This allows performance to be fine-tuned.
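Here is a minimal sketch (my illustration, not from this PR's tests) of wiring the compressed secondary cache up with explicit compression settings; the `compression_opts` field name is an assumption about the knob this PR adds:
```
#include <memory>
#include "rocksdb/cache.h"

std::shared_ptr<rocksdb::Cache> MakeCacheWithTunedCompression() {
  rocksdb::CompressedSecondaryCacheOptions sec_opts;
  sec_opts.capacity = 64 << 20;                // 64 MB for the compressed tier
  sec_opts.compression_type = rocksdb::kZSTD;  // codec used for cached blocks
  sec_opts.compression_opts.level = 3;         // assumed: the new CompressionOptions knob
  rocksdb::LRUCacheOptions pri_opts;
  pri_opts.capacity = 256 << 20;               // primary (uncompressed) block cache
  pri_opts.secondary_cache = rocksdb::NewCompressedSecondaryCache(sec_opts);
  return pri_opts.MakeSharedCache();
}
```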
Tests -
Run db_bench and verify compression options in the LOG file
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12234
Reviewed By: ajkr
Differential Revision: D52758133
Pulled By: anand1976
fbshipit-source-id: af849fbffce6f84704387c195d8edba40d9548f6
Summary:
This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236
Reviewed By: cbi42
Differential Revision: D52765685
Pulled By: ajkr
fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4
Summary:
We often need to read the table properties of an SST file when taking a backup. However, we currently do not check checksums for this step, and even with that enabled, we ignore failures. This change ensures we fail creating a backup if corruption is detected in that step of reading table properties.
To get this working properly (with existing unit tests), we also add some temperature handling logic like what already exists in BackupEngineImpl::ReadFileAndComputeChecksum and elsewhere in BackupEngine. Also, SstFileDumper needed a fix to its error handling logic.
This was originally intended to help diagnose some mysterious failures (apparent corruptions) seen in taking backups in the crash test, though that is now fixed in https://github.com/facebook/rocksdb/pull/12206
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12200
Test Plan: unit test added that corrupts table properties, along with existing tests
Reviewed By: ajkr
Differential Revision: D52520674
Pulled By: pdillinger
fbshipit-source-id: 032cfc0791428f3b8147d34c7d424ab128e28f42
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.
In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
* Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.
Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.
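A conceptual sketch of the check (not the literal RocksDB code), using RocksDB's `Slice`/`Status` types: after a read, confirm the result actually landed in the scratch buffer we supplied, and fail fast otherwise.
```
#include <cassert>
#include "rocksdb/slice.h"
#include "rocksdb/status.h"

rocksdb::Status CheckReadUsedScratch(const rocksdb::Slice& result,
                                     const char* scratch) {
  if (result.size() > 0 && result.data() != scratch) {
    assert(false);  // debug builds: stop at the point of the problem
    return rocksdb::Status::Corruption(
        "Read returned an alternate buffer; FilePrefetchBuffer requires "
        "reads into the provided scratch buffer");
  }
  return rocksdb::Status::OK();
}
```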
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206
Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200
Reviewed By: akankshamahajan15
Differential Revision: D52551701
Pulled By: pdillinger
fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file writes according to different IO activities allows more insight into the activity.
For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add a new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown histograms FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS
Some related code refactoring to make the implementation cleaner:
- Blob stats
- Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
- Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN included files that failed to sync and bytes that failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
- TableBuilderOptions now includes Read/WriteOptions so BuildTable() does not need to take these two variables
- Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOptions::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
- Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority
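For users, the new histograms are read like any other statistics entry. A small usage sketch (histogram enum names taken from the stats shown in the test plan below; treat the exact identifiers as assumptions):
```
#include <cstdio>
#include <memory>
#include "rocksdb/statistics.h"

void DumpWriteIoHistograms(const std::shared_ptr<rocksdb::Statistics>& stats) {
  rocksdb::HistogramData sst_write, flush_write;
  stats->histogramData(rocksdb::SST_WRITE_MICROS, &sst_write);
  stats->histogramData(rocksdb::FILE_WRITE_FLUSH_MICROS, &flush_write);
  std::printf("sst.write.micros p99 = %f\n", sst_write.percentile99);
  std::printf("file.write.flush.micros p99 = %f\n", flush_write.percentile99);
}
// Typically paired with: options.statistics = rocksdb::CreateDBStatistics();
```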
## Test
### db bench
Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100
rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```
compaction, db open
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```
blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB
Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```
```
Stacked Blob DB
Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench
pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445
post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164
rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```
### Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests
### Performance
Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true
Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,
Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```
Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846
Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```
Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark
Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860
Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910
Reviewed By: ajkr
Differential Revision: D49788060
Pulled By: hx235
fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
Summary:
Currently, data is always compacted to the same level if it exceeds periodic_compaction_seconds, which may confuse users, so we change it to allow triggering compaction to the next level here. This is a behavior change for users, and may affect users
who have disabled their ttl or set ttl > periodic_compaction_seconds.
Related issue: https://github.com/facebook/rocksdb/issues/12165
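A minimal configuration sketch for the affected setup (illustrative only): with ttl disabled, or ttl larger than periodic_compaction_seconds, data older than periodic_compaction_seconds can now be compacted to the next level rather than rewritten into the same level.
```
#include "rocksdb/options.h"

rocksdb::Options options;
options.ttl = 0;                                          // ttl disabled
options.periodic_compaction_seconds = 7 * 24 * 60 * 60;   // roughly one week
```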
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175
Reviewed By: ajkr
Differential Revision: D52446722
Pulled By: cbi42
fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f
Summary:
When ranking files by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to the default CompactPri kMinOverlappingRatio for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187
Test Plan: * New unit tests
Reviewed By: ajkr
Differential Revision: D52437194
Pulled By: cbi42
fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3
Summary:
Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() calls could be made on a cfd. This fixes that case, which would fail like this in the new unit test:
```
db_flush_test: db/column_family.cc:1648:
rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176
Test Plan: unit test added
Reviewed By: cbi42
Differential Revision: D52417071
Pulled By: pdillinger
fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5
Summary:
HyperClockCache is intended to mitigate performance problems under stress conditions (as well as optimizing average-case parallel performance). In LRUCache, the biggest such problem is lock contention when one or a small number of cache entries becomes particularly hot. Regardless of cache sharding, accesses to any particular cache entry are linearized against a single mutex, which is held while each access updates the LRU list. All HCC variants are fully lock/wait-free for accessing blocks already in the cache, which fully mitigates this contention problem.
However, HCC (and CLOCK in general) can exhibit extremely degraded performance under a different stress condition: when no (or almost no) entries in a cache shard are evictable (they are pinned). Unlike LRU which can find any evictable entries immediately (at the cost of more coordination / synchronization on each access), CLOCK has to search for evictable entries. Under the right conditions (almost exclusively MB-scale caches not GB-scale), the CPU cost of each cache miss could fall off a cliff and bog down the whole system.
To effectively mitigate this problem (IMHO), I'm introducing a new default behavior and tuning parameter for HCC, `eviction_effort_cap`. See the comments on the new config parameter in the public API.
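A sketch of opting into the new knob (the `eviction_effort_cap` field name is as introduced by this PR; see the public API comments for the authoritative semantics):
```
#include <memory>
#include "rocksdb/cache.h"

rocksdb::HyperClockCacheOptions hcc_opts(
    1 << 30 /* capacity */,
    0 /* estimated_entry_charge: 0 selects the automatic (AutoHCC) variant */);
hcc_opts.eviction_effort_cap = 30;  // bound the per-miss search for evictable entries
std::shared_ptr<rocksdb::Cache> block_cache = hcc_opts.MakeSharedCache();
```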
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12141
Test Plan:
unit test included
## Performance test
We can use cache_bench to validate no regression (CPU and memory) in normal operation, and to measure change in behavior when cache is almost entirely pinned. (TODO: I'm not sure why I had to get the pinned ratio parameter well over 1.0 to see truly bad performance, but the behavior is there.) Build with `make DEBUG_LEVEL=0 USE_CLANG=1 PORTABLE=0 cache_bench`. We also set MALLOC_CONF="narenas:1" for all these runs to essentially remove jemalloc variances from the results, so that the max RSS given by /usr/bin/time is essentially ideal (assuming the allocator minimizes fragmentation and other memory overheads well). Base command reproducing bad behavior:
```
./cache_bench -cache_type=auto_hyper_clock_cache -threads=12 -histograms=0 -pinned_ratio=1.7
```
```
Before, LRU (alternate baseline not exhibiting bad behavior):
Rough parallel ops/sec = 2290997
1088060 maxresident
Before, AutoHCC (bad behavior):
Rough parallel ops/sec = 141011 <- Yes, more than 10x slower
1083932 maxresident
```
Now let us sample a range of values in the solution space:
```
After, AutoHCC, eviction_effort_cap = 1:
Rough parallel ops/sec = 3212586
2402216 maxresident
After, AutoHCC, eviction_effort_cap = 10:
Rough parallel ops/sec = 2371639
1248884 maxresident
After, AutoHCC, eviction_effort_cap = 30:
Rough parallel ops/sec = 1981092
1131596 maxresident
After, AutoHCC, eviction_effort_cap = 100:
Rough parallel ops/sec = 1446188
1090976 maxresident
After, AutoHCC, eviction_effort_cap = 1000:
Rough parallel ops/sec = 549568
1084064 maxresident
```
It looks like `cap=30` is a sweet spot balancing acceptable CPU and memory overheads, so it is chosen as the default.
```
Change to -pinned_ratio=0.85
Before, LRU:
Rough parallel ops/sec = 2108373
1078232 maxresident
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2164910
1077312 maxresident
After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2145542
1077216 maxresident
```
The slight CPU improvement above is consistent with the cap, with no measurable memory overhead under moderate stress.
```
Change to -pinned_ratio=0.25 (low stress)
Before, AutoHCC, averaged over ~20 runs:
Rough parallel ops/sec = 2221149
1076540 maxresident
After, AutoHCC, eviction_effort_cap = 30, averaged over ~20 runs:
Rough parallel ops/sec = 2224521
1076664 maxresident
```
No measurable difference under normal circumstances.
Some tests repeated with FixedHCC, with similar results.
Reviewed By: anand1976
Differential Revision: D52174755
Pulled By: pdillinger
fbshipit-source-id: d278108031b1220c1fa4c89c5a9d34b7cf4ef1b8
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. The other is in the `while` loop - cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)
However, start_time is not initialized in the second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time);
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147
Test Plan: Existing CircleCI
Reviewed By: cbi42
Differential Revision: D52173481
Pulled By: akankshamahajan15
fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years.
This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130
Reviewed By: hx235
Differential Revision: D52000746
Pulled By: ajkr
fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7
Summary:
There is a bug in the `TieredSecondaryCache` that can result in a false negative. This can happen when a MultiGet does a cache lookup that gets a hit in the `TieredSecondaryCache` local nvm cache tier, and the result is available before MultiGet calls `WaitAll` (i.e the nvm cache `SecondaryCacheResultHandle` `IsReady` returns true).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12134
Test Plan: Add a new unit test in tiered_secondary_cache_test
Reviewed By: akankshamahajan15
Differential Revision: D52023309
Pulled By: anand1976
fbshipit-source-id: e5ae681226a0f12753fecb2f6acc7e5f254ae72b
Summary:
As part of building another feature, I wanted this:
* Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases.
* Polished, clarified some related API comments.
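A minimal sketch of a factory using the new capability (my illustration; the toy counting collector and the column-family-based policy are just examples, not part of this PR):
```
#include <string>
#include "rocksdb/table_properties.h"

class CountingCollector : public rocksdb::TablePropertiesCollector {
 public:
  rocksdb::Status AddUserKey(const rocksdb::Slice&, const rocksdb::Slice&,
                             rocksdb::EntryType, rocksdb::SequenceNumber,
                             uint64_t) override {
    ++count_;
    return rocksdb::Status::OK();
  }
  rocksdb::Status Finish(rocksdb::UserCollectedProperties* props) override {
    props->emplace("num_entries_seen", std::to_string(count_));
    return rocksdb::Status::OK();
  }
  rocksdb::UserCollectedProperties GetReadableProperties() const override {
    return {{"num_entries_seen", std::to_string(count_)}};
  }
  const char* Name() const override { return "CountingCollector"; }

 private:
  uint64_t count_ = 0;
};

class SelectiveCollectorFactory : public rocksdb::TablePropertiesCollectorFactory {
 public:
  rocksdb::TablePropertiesCollector* CreateTablePropertiesCollector(
      rocksdb::TablePropertiesCollectorFactory::Context context) override {
    // Example policy: decline processing for non-default column families,
    // which avoids per-key callbacks for those files entirely.
    if (context.column_family_id != 0) return nullptr;
    return new CountingCollector();
  }
  const char* Name() const override { return "SelectiveCollectorFactory"; }
};
```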
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12129
Test Plan: unit test added
Reviewed By: ltamasi
Differential Revision: D51966667
Pulled By: pdillinger
fbshipit-source-id: 2991c08fe6ce3a8c9f14c68f1495f5a17bca2770
Summary:
### Implement new Java API get()/put()/merge() methods, and transactional variants.
The Java API methods are very inconsistent in terms of how they pass parameters (byte[], ByteBuffer), and what variants and defaulted parameters they support. We try to bring some consistency to this.
* All APIs should support calls with ByteBuffer parameters.
* Similar methods (RocksDB.get() vs Transaction.get()) should support as similar as possible sets of parameters for predictability.
* get()-like methods should provide variants where the caller supplies the target buffer, for the sake of efficiency. Allocation costs in Java can be significant when large buffers are repeatedly allocated and freed.
### API Additions
1. RocksDB.get implement indirect ByteBuffers. Added indirect ByteBuffers and supporting native methods for get().
2. RocksDB.Iterator implement missing (byte[], offset, length) variants for key() and value() parameters.
3. Transaction.get() implement missing methods, based on RocksDB.get. Added ByteBuffer.get with and without column family. Added byte[]-as-target get.
4. Transaction.iterator() implement a getIterator() which defaults ReadOptions; as per RocksDB.iterator(). Rationalize support API for this and RocksDB.iterator()
5. RocksDB.merge implement ByteBuffer methods; both direct and indirect buffers. Shadow the methods of RocksDB.put; RocksDB.put only offers ByteBuffer API with explicit WriteOptions. Duplicated this with RocksDB.merge
6. Transaction.merge implement methods as per RocksDB.merge methods. Transaction is already constructed with WriteOptions, so no explicit WriteOptions methods required.
7. Transaction.mergeUntracked implement the same API methods as Transaction.merge except the ones that use assumeTracked, because that’s not a feature of merge untracked.
### Support Changes (C++)
The current JNI code in C++ supports multiple variants of methods through a number of helper functions. There are numerous TODO suggestions in the code proposing that the helpers be re-factored/shared.
We have taken a different approach for the new methods; we have created wrapper classes `JDirectBufferSlice`, `JDirectBufferPinnableSlice`, `JByteArraySlice` and `JByteArrayPinnableSlice` RAII classes which construct slices from JNI parameters and can then be passed directly to RocksDB methods. For instance, the `Java_org_rocksdb_Transaction_getDirect` method is implemented like this:
```
try {
ROCKSDB_NAMESPACE::JDirectBufferSlice key(env, jkey_bb, jkey_off,
jkey_part_len);
ROCKSDB_NAMESPACE::JDirectBufferPinnableSlice value(env, jval_bb, jval_off,
jval_part_len);
ROCKSDB_NAMESPACE::KVException::ThrowOnError(
env, txn->Get(*read_options, column_family_handle, key.slice(),
&value.pinnable_slice()));
return value.Fetch();
} catch (const ROCKSDB_NAMESPACE::KVException& e) {
return e.Code();
}
```
Notice the try/catch mechanism with the `KVException` class, which combined with RAII and the wrapper classes means that there is no ad-hoc cleanup necessary in the JNI methods.
We propose to extend this mechanism to existing JNI methods as further work.
### Support Changes (Java)
Where there are multiple parameter-variant versions of the same method, we use fewer or just one supporting native method for all of them. This makes maintenance a bit easier and reduces the opportunity for coding errors mixing up (untyped) object handles.
In order to support this efficiently, some classes need to have default values for column families and read options added and cached so that they are not re-constructed on every method call.
This PR closes https://github.com/facebook/rocksdb/issues/9776
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11019
Reviewed By: ajkr
Differential Revision: D52039446
Pulled By: jowlyzhang
fbshipit-source-id: 45d0140a4887e42134d2e56520e9b8efbd349660
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12061.
We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable.
The cause of this bug is that, during writes, we previously didn't take `writer->ShouldWriteToMemtable()` into account before adding to `total_byte_size`, so bytes were still counted during the `Prepare()` phase even though we were not writing to the Memtable. That is why the value was double what was written to the WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111
Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes.
Reviewed By: jaykorean
Differential Revision: D51954327
Pulled By: jowlyzhang
fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12128
The patch turns the `Timer` Meyers singleton in `PeriodicTaskScheduler::Default()` into one of the leaky variety in order to prevent static destruction order issues.
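For illustration, the difference between the two singleton flavors looks roughly like this (the `Timer` type here is a stand-in for the internal one, not the real class):
```
#include <string>

// Stand-in for the real (internal) timer type, just to show the pattern.
struct Timer {
  std::string name = "periodic_task_scheduler";
};

// Before: Meyers singleton, destroyed during static destruction in an
// unspecified order relative to other statics that may still use it.
Timer& MeyersTimer() {
  static Timer timer;
  return timer;
}

// After: leaky singleton, intentionally never destroyed, so it stays valid
// no matter when other static destructors run.
Timer& LeakyTimer() {
  static Timer* timer = new Timer();
  return *timer;
}
```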
Reviewed By: akankshamahajan15
Differential Revision: D51963950
fbshipit-source-id: 0fc34113ad03c51fdc83bdb8c2cfb6c9f6913948
Summary:
Add support for tuning of readahead_size by block cache lookup for async_io.
**Design/ Implementation** -
**BlockBasedTableIterator.cc** -
`BlockCacheLookupForReadAheadSize` callback API looks up in the block cache and tries to reduce the start
and end offset passed. This function looks into the block cache for the blocks between `start_offset`
and `end_offset` and adds all the handles to a queue.
It then iterates from the end of the handles to find the first miss block and updates the end offset to that block.
It also iterates from the start and finds the first miss block and updates the start offset to that block.
```
_read_curr_block_ argument : True if this call was due to miss in the cache and caller wants to read that block
synchronously.
False if current call is to prefetch additional data in extra buffers
(due to ReadAsync call in FilePrefetchBuffer)
```
In case there is no data to be read in that callback (because of upper_bound or all blocks being in cache),
it updates the start and end offsets to be equal, which `FilePrefetchBuffer` interprets as a zero-length read.
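A conceptual sketch of that trimming step (illustrative only, not the RocksDB implementation): drop cached blocks from both ends of the window so only the run that still needs IO is prefetched, collapsing to a zero-length read when everything is cached.
```
#include <cstddef>
#include <cstdint>
#include <vector>

struct BlockInfo {
  uint64_t offset;
  uint64_t size;
  bool in_cache;
};

void TrimReadaheadWindow(const std::vector<BlockInfo>& blocks,
                         uint64_t* start, uint64_t* end) {
  size_t lo = 0, hi = blocks.size();
  while (lo < hi && blocks[lo].in_cache) ++lo;      // skip cached prefix
  while (hi > lo && blocks[hi - 1].in_cache) --hi;  // skip cached suffix
  if (lo == hi) {
    *end = *start;  // everything cached: nothing left to prefetch
    return;
  }
  *start = blocks[lo].offset;
  *end = blocks[hi - 1].offset + blocks[hi - 1].size;  // end of last miss block
}
```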
**FilePrefetchBuffer.cc** -
FilePrefetchBuffer calls the callback - `ReadAheadSizeTuning` - and passes the start and end offset to that
callback to get updated start and end offsets to read, based on cache hits/misses.
1. In case of Read calls (when offset passed to FilePrefetchBuffer is on cache miss and that data needs to be read), _read_curr_block_ is passed true.
2. In case of ReadAsync calls, when buffer is all consumed and can go for additional prefetching, the start offset passed is the initial end offset of prev buffer (without any updated offset based on cache hit/miss).
For example, if the following are the data blocks with their cache hit/miss status and start offsets,
and the Read API found a miss on DB1 and, based on readahead_size (50), passed an end offset of 50:
[DB1 - miss- 0 ] [DB2 - hit -10] [DB3 - miss -20] [DB4 - miss-30] [DB5 - hit-40]
[DB6 - hit-50] [DB7 - miss-60] [DB8 - miss - 70] [DB9 - hit - 80] [DB6 - hit 90]
- For Read call - updated start offset remains 0 but end offset updates to DB4, as DB5 is in cache.
- Read calls saves initial end offset 50 as that was meant to be prefetched.
- Now for next ReadAsync call - the start offset will be 50 (previous buffer initial end offset) and based on readahead_size, end offset will be 100
- On callback, because of cache hits - callback will update the start offset to 60 and end offset to 80 to read only 2 data blocks (DB7 and DB8).
- And for that ReadAsync call - the initial end offset will be set to 100, which will again be used by the next ReadAsync call as the start offset.
- `initial_end_offset_` in `BufferInfo` is used to save the initial end offset of that buffer.
- If let's say DB5 and DB6 overlaps in 2 buffers (because of alignment), `prev_buf_end_offset` is passed to make sure already prefetched data is not prefetched again in second buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11936
Test Plan:
- Ran crash_test several times.
- New unit tests added.
Reviewed By: anand1976
Differential Revision: D50906217
Pulled By: akankshamahajan15
fbshipit-source-id: 0d75d3c98274e98aa34901b201b8fb05232139cf
Summary:
These bugs surfaced while I was trying to add the stress test for the feature:
Bug 1) On the index building path: the optimization to use user key instead of internal key as separator needed a bit tweak for when user defined timestamps can be removed. Because even though the user key look different now and eligible to be used as separator, when their user-defined timestamps are removed, they could be equal and that invariant no longer stands.
Bug 2) On the index reading path: one path that builds the second level index iterator for `PartitionedIndexReader` is not passing the corresponding `user_defined_timestamps_persisted` flag. As a result, the default `true` value is used, leading to no minimum timestamps being padded when they should be.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12062
Test Plan:
For bug 1): added separate unit test `BlockBasedTableReaderTest::Get` to exercise the `Get` API. It's a different code path from `MultiGet` so worth having its own test. Also in order to cover the bug, the test is modified to generate key values with the same user provided key, different timestamps and different sequence numbers. The test reads back different versions of the same user provided key. `MultiGet` takes one `ReadOptions` with one read timestamp so we cannot test retrieving different versions of the same key easily.
For bug 2): simply added options `BlockBasedTableOptions.metadata_cache_options.partition_pinning = PinningTier::kAll` to exercise all the index iterator creating paths.
Reviewed By: ltamasi
Differential Revision: D51508280
Pulled By: jowlyzhang
fbshipit-source-id: 8b174d3d70373c0599266ac1f467f2bd4d7ea6e5
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11000.
That issue pointed out that RocksDB was slow to delete archived WALs in case time-based and size-based expiration were enabled, and the time-based threshold (`WAL_ttl_seconds`) was small. This PR prevents the delay by taking into account `WAL_ttl_seconds` when deciding the frequency to process archived WALs for deletion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12069
Reviewed By: pdillinger
Differential Revision: D51262589
Pulled By: ajkr
fbshipit-source-id: e65431a06ee96f4c599ba84a27d1aedebecbb003
Summary:
Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally, as long as there is one caller requesting file deletion to be disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletions` API, and users need to explicitly pass the argument if they insist on doing so, knowing what can potentially be disrupted as a consequence.
This PR removes the API's default argument value, so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them. When fixing this breakage, it's good to check whether the forcing behavior is indeed needed and whether the potential disruption is OK.
This PR also makes unit tests that do not need the force behavior do a regular enable file deletion.
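A call-site sketch after this change (assuming an open `rocksdb::DB* db`); the argument now has to be spelled out, which makes the forcing variant a deliberate choice:
```
rocksdb::Status s = db->DisableFileDeletions();
// ... take a backup / copy live files ...
s = db->EnableFileDeletions(/*force=*/false);   // cooperative: respects other disablers
// s = db->EnableFileDeletions(/*force=*/true); // forcing: only if disruption is acceptable
```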
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001
Reviewed By: ajkr
Differential Revision: D51214683
Pulled By: jowlyzhang
fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4
Summary:
#### Problem
While the RocksDB C API does have the RateLimiter API, it does not
expose the auto_tuned option.
#### Summary of Change
This PR exposes auto_tuned RateLimiter option in RocksDB C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12058
Test Plan: Augment the C API existing test to cover the new API.
Reviewed By: cbi42
Differential Revision: D51201933
Pulled By: ajkr
fbshipit-source-id: 5bc595a9cf9f88f50fee797b729ba96f09ed8266
Summary:
**Context/Summary:**
Given the word "finish", it's intuitive for users to assume `TablePropertiesCollector::Finish()` is called only once by RocksDB internals.
However, this is currently not true as RocksDB also calls this function in `BlockBased/PlainTableBuilder::GetTableProperties()` to populate user collected properties on demand.
This PR avoids that by moving that populating to where we first call `Finish()` (i.e, `NotifyCollectTableCollectorsOnFinish`)
Bonus: clarified in the API that `GetReadableProperties()` will be called after `Finish()` and added UT to ensure that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12053
Test Plan:
- Modified test `DBPropertiesTest.GetUserDefinedTableProperties` to ensure `Finish()` only called once.
- Existing test particularly `db_properties_test, table_properties_collector_test` verify the functionality `NotifyCollectTableCollectorsOnFinish` and `GetReadableProperties()` are not broken by this change.
Reviewed By: ajkr
Differential Revision: D51095434
Pulled By: hx235
fbshipit-source-id: 1c6275258f9b99dedad313ee8427119126817973
Summary:
I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions.
It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`).
The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have
* Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain.
* Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after it is marked for (de)construction). And after the purge of the chain, the entry is marked empty.
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain.
This is easily fixed by holding on to a reference until inserted onto the chain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046
Test Plan:
As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed.
Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000)
Reviewed By: jowlyzhang
Differential Revision: D51016979
Pulled By: pdillinger
fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b
Summary:
- The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice.
- It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization.
- Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start.
- We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031
Test Plan:
Unit Tests added
- `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled
- `DBTestUniversalCompaction2::PeriodicCompaction` for Universal
Reviewed By: cbi42
Differential Revision: D50900292
Pulled By: jaykorean
fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d
Summary:
### main change:
- add java clipColumnFamily api in Rocksdb.java
The method signature of the new API is
```
public void clipColumnFamily(final ColumnFamilyHandle columnFamilyHandle, final byte[] beginKey,
final byte[] endKey)
```
### Test
add unit test RocksDBTest#clipColumnFamily()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11868
Reviewed By: jaykorean
Differential Revision: D50889783
Pulled By: cbi42
fbshipit-source-id: 7f545171ad9adb9c20bdd92efae2e6bc55d5703f
Summary:
As titled. If SstFileManager is available, deleting stale sst files will be delegated to it so it can be rate limited.
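A configuration sketch (illustrative only): with an `SstFileManager` created like this, stale SST file deletion is rate limited, here to roughly 64 MB/s.
```
#include "rocksdb/env.h"
#include "rocksdb/options.h"
#include "rocksdb/sst_file_manager.h"

rocksdb::Options options;
options.sst_file_manager.reset(rocksdb::NewSstFileManager(
    rocksdb::Env::Default(), nullptr /* info_log */, "" /* trash_dir */,
    64 << 20 /* rate_bytes_per_sec */));
```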
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12016
Reviewed By: hx235
Differential Revision: D50670482
Pulled By: jowlyzhang
fbshipit-source-id: bde5b76ea1d98e67f6b4f08bfba3db48e46aab4e
Summary:
In `TieredCache`, the underlying compressed secondary cache is hidden from the user. So we need a way to query the capacity, as well as the portion of cache reservation charged to the compressed secondary cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12011
Test Plan: Update the unit tests
Reviewed By: akankshamahajan15
Differential Revision: D50651943
Pulled By: anand1976
fbshipit-source-id: 06d1cb5edb75a790c919bce718e2ff65f5908220
Summary:
**Context:**
DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: 519f2a41fb/db/db_impl/db_impl.cc (L525) -> 519f2a41fb/db/error_handler.cc (L250) -> 519f2a41fb/db/error_handler.cc (L808-L823)
However, due to a race between flush error recovery and db destruction, recovery can actually start after such a wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part of the db destruction. It then crashes the program as below.
```
std::terminate()
std::default_delete<std::thread>::operator()(std::thread*) const
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
```
**Summary:**
This PR fixed it by considering whether EndAutoRecovery() has been called before creating such a thread. This fix is similar to how we currently [handle](519f2a41fb/db/error_handler.cc (L688-L694)) such a case inside the created recovery thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002
Test Plan: A new UT repro-ed the crash before this fix and and pass after.
Reviewed By: ajkr
Differential Revision: D50586191
Pulled By: hx235
fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093
Summary:
**Context/Summary:**
We ignore the trace writing status, e.g., 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222)
If a write into the trace file fails, subsequent trace writes will continue onto the same file.
This will trigger the assertion `assert(sync_without_flush_called_)` intended to catch writes to a file that has previously seen an error, added in https://github.com/facebook/rocksdb/pull/10489, https://github.com/facebook/rocksdb/pull/10555
An alternative (rejected) is to handle the trace writing status at a higher level, e.g., 543191f2ea/db/db_impl/db_impl_write.cc (L221-L222). However, it makes sense to ignore such status considering tracing is an assistant rather than a critical component of db operation, and this alternative requires more code change. So it's better to handle the failure at a lower level, as this PR does.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11996
Test Plan: Add new UT failed before this PR and pass after
Reviewed By: akankshamahajan15
Differential Revision: D50532467
Pulled By: hx235
fbshipit-source-id: f2032abafd94917adbf89a20841d15b448782a33
Summary:
Fix https://github.com/facebook/rocksdb/issues/11607
Fix https://github.com/facebook/rocksdb/issues/11679
Fix https://github.com/facebook/rocksdb/issues/11606
Fix https://github.com/facebook/rocksdb/issues/2343
Add bounds checking to `WBWIIteratorImpl`, which will be reflected in `BaseDeltaIterator::delta_iterator_::Valid()`, just like `BaseDeltaIterator::base_iterator_::Valid()`. In this way, the two sub iterators become more aligned from `BaseDeltaIterator`'s perspective. Like `DBIter`, the added bounds checking caps the iterator at either bound when seeking and invalidates the `WBWIIteratorImpl` iterator when the lower bound is passed or the upper bound is reached.
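A usage sketch of the behavior this enables (assuming an open `Transaction* txn`); iterator bounds in `ReadOptions` are now honored by the write-batch side of the `BaseDeltaIterator` as well:
```
#include <memory>
#include "rocksdb/utilities/transaction.h"

rocksdb::ReadOptions ro;
rocksdb::Slice lower("key000");
rocksdb::Slice upper("key100");
ro.iterate_lower_bound = &lower;
ro.iterate_upper_bound = &upper;
std::unique_ptr<rocksdb::Iterator> it(txn->GetIterator(ro));
for (it->SeekToFirst(); it->Valid(); it->Next()) {
  // Only keys in ["key000", "key100") are visible, including uncommitted
  // writes buffered in the transaction's write batch.
}
```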
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11680
Test Plan:
- A simple test added to write_batch_with_index_test.cc to exercise the bounds checking in `WBWIIteratorImpl`.
- A sophisticated test added to transaction_test.cc to assert that `Transaction` with different write policies honor bounds in `ReadOptions`. It should be so as long as the `BaseDeltaIterator` is correctly coordinating the two sub iterators to perform iterating and bounds checking.
Reviewed By: ajkr
Differential Revision: D48125229
Pulled By: cbi42
fbshipit-source-id: c9acea52595aed1471a63d7ca6ef15d2a2af1367
Summary:
Context/Summary: as titled
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11957
Test Plan: piggyback on existing tests; fixed a failed test due to adding new stats
Reviewed By: ajkr, cbi42
Differential Revision: D50294310
Pulled By: hx235
fbshipit-source-id: d99b97ebac41efc1bdeaf9ca7a1debd2927d54cd
Summary:
Fix corruption error - "Corruption: first key in index doesn't match first key in block" - when auto_readahead_size is enabled. The error is caused by a bug: when index_iter_ moves forward, the first_internal_key of that index_iter_ is not copied, so the Slice points to a different key, resulting in a wrong comparison.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11961
Test Plan: Ran stress test which reproduced this error.
Reviewed By: anand1976
Differential Revision: D50310589
Pulled By: akankshamahajan15
fbshipit-source-id: 95d8320b8388f1e3822c32024f84754f3a20a631
Summary:
Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`.
An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families.
Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`.
Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf".
Prior to this PR, if the user wants to query for two keys, "key_1" and "key_2", but is only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and get the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo.
With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925
Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added
will enable this new API in the `db_stress` in a separate PR
Reviewed By: ltamasi
Differential Revision: D50017414
Pulled By: jaykorean
fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586
Summary:
With the introduction of the `UpdateTieredCache` API, its possible to dynamically change the compressed secondary cache ratio of the total cache capacity. In order to optimize performance, we avoid using a mutex when inserting/releasing placeholder entries, which can result in some inaccuracy in the accounting during the dynamic update. This inaccuracy was causing a runtime error due to an integer underflow in `UpdateCacheReservationRatio`, causing ubsan crash tests to fail. This PR fixes it by explicitly checking for the underflow.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11949
Test Plan:
1. Added a unit test that fails without the fix
2. Run ubsan_crash
Reviewed By: akankshamahajan15
Differential Revision: D50240217
Pulled By: anand1976
fbshipit-source-id: d2f7b79da54eec8b61aec2cc1f2943da5d5847ac
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect.
To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine.
I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race.
Intended follow-up to this:
* Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929
Test Plan:
Added two mini-stress style regression tests that fail with >1% probability before this change:
DBOptionsTest::SetStatsDumpPeriodSecRace
ColumnFamilyTest::CreateAndDropPeriodicRace
I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway:
DBOptionsTest::SetStatsDumpPeriodSecRace
Reviewed By: ajkr
Differential Revision: D50024506
Pulled By: pdillinger
fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.
This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in many more foreground operations, as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879
Reviewed By: anand1976
Differential Revision: D49571196
Pulled By: ajkr
fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.
Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920
Test Plan: tests added / updated
Reviewed By: jaykorean
Differential Revision: D49919147
Pulled By: pdillinger
fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913
The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator.
Reviewed By: jaykorean
Differential Revision: D49795097
fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3
Summary:
Changed `DBOptions::fail_if_options_file_error` default from `false` to
`true`. It is safer to fail an operation by default when it encounters
an error.
Also changed the API doc to list items in the conventional way for listing items in a sentence. The slashes weren't working well as one got dropped, probably because it looked like a typo.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11800
Test Plan: rely on CI
Reviewed By: jowlyzhang
Differential Revision: D49030532
Pulled By: ajkr
fbshipit-source-id: e606062aa25f9063d8c6fb0d03aebca5c2bc56d3
Summary:
RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage.
In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs.
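A configuration sketch (illustrative only): declare 23:30 UTC to 05:15 UTC as the daily off-peak window; an empty string, the default, means no off-peak window is configured.
```
#include "rocksdb/options.h"

rocksdb::DBOptions db_options;
db_options.daily_offpeak_time_utc = "23:30-05:15";  // "HH:mm-HH:mm", end inclusive
```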
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893
Test Plan:
- New Unit Test Added - `DBOptionsTest::OffPeakTimes`
- Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest`
Reviewed By: pdillinger
Differential Revision: D49714553
Pulled By: jaykorean
fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/11631 introduced an undesired fallback behavior to RocksDB internal prefetching even when FS prefetching returns a non-OK status other than "Unsupported". We only want to fall back when FS prefetching is not supported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11897
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D49667055
Pulled By: hx235
fbshipit-source-id: fa36e4e5d6dc9507080217035f9d6ff8e4abda28
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/11631 introduced a `readahead()` system call for compaction reads under non-direct IO. When `Options::compaction_readahead_size` is 0, the `readahead()` will be issued with a small size (i.e., the block size, by default 4KB).
Benchmarks show that such a readahead() call regresses compaction reads compared with the "no readahead()" case (see Test Plan for more).
Therefore we decided to not issue such `readahead()` calls when `Options::compaction_readahead_size` is 0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11887
Test Plan:
Settings: `compaction_readahead_size = 0, use_direct_reads=false`
Setup:
```
TEST_TMPDIR=../ ./db_bench -benchmarks=filluniquerandom -disable_auto_compactions=true -write_buffer_size=1048576 -compression_type=none -value_size=10240 && tar -cf ../dbbench.tar -C ../dbbench/ .
```
Run:
```
for i in $(seq 3); do rm -rf ../dbbench/ && mkdir -p ../dbbench/ && tar -xf ../dbbench.tar -C ../dbbench/ . && sudo bash -c 'sync && echo 3 > /proc/sys/vm/drop_caches' && TEST_TMPDIR=../ /usr/bin/time ./db_bench_{pre_PR11631|PR11631|PR11631_with_improvementPR11887} -benchmarks=compact -use_existing_db=true -db=../dbbench/ -disable_auto_compactions=true -compression_type=none ; done |& grep elapsed
```
pre-PR11631("no readahead()" case):
PR11631:
PR11631+this improvement:
Reviewed By: ajkr
Differential Revision: D49607266
Pulled By: hx235
fbshipit-source-id: 2efa0dc91bac3c11cc2be057c53d894645f683ef
Summary:
Implement block cache lookup to determine readahead_size during scans. It's enabled if auto_readahead_size, block_cache and iterate_upper_bound - all three are set.
Design -
1. Whenever there is a cache miss and FilePrefetchBuffer is called, a callback is made to determine readahead_size for that prefetching.
2. The callback iterates over the index and does a block cache lookup for each data block handle until the existing readahead_size is reached. It then removes the cache-hit data blocks from the end to calculate the optimized readahead_size.
3. Since index_iter_ is moved, it stores block handles in a queue and uses that queue to get block handles instead of calling index_iter_->Next().
4. This is for sync scans. Async scan support is in progress.
NOTE:
The issue right now is that after Seek and Next, if Prev is called, there is no way to perform the Prev operation because index_iter_ is already pointing to a different block. So it returns "Not supported" in that case with the error message "auto tuning of readahead size is not supported with Prev op".
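To make steps 1-3 concrete, here is a rough sketch of the sizing idea using hypothetical types (this is not the actual FilePrefetchBuffer callback): accumulate queued block handles up to the existing readahead_size, then trim trailing blocks that are already in the block cache.
```
#include <cstdint>
#include <deque>

struct QueuedBlockHandle {
  uint64_t size = 0;            // on-disk size of the data block
  bool in_block_cache = false;  // result of the block cache lookup
};

uint64_t TuneReadaheadSize(const std::deque<QueuedBlockHandle>& handles,
                           uint64_t existing_readahead_size) {
  uint64_t covered = 0;
  std::deque<QueuedBlockHandle> selected;
  for (const auto& h : handles) {
    if (covered >= existing_readahead_size) break;
    covered += h.size;
    selected.push_back(h);
  }
  // Remove cache-hit data blocks from the end; they don't need prefetching.
  while (!selected.empty() && selected.back().in_block_cache) {
    covered -= selected.back().size;
    selected.pop_back();
  }
  return covered;  // the optimized readahead_size
}
```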
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11860
Test Plan:
- Added new unit test
- crash_tests
- Running scans locally to check for any regression
Reviewed By: anand1976
Differential Revision: D49548118
Pulled By: akankshamahajan15
fbshipit-source-id: f1aee409a71b4ad9e5bf3610f43edf30c6630c78
Summary:
Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity.
With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.
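As a sketch of the new capacity semantics (illustrative arithmetic only, not the actual implementation): the total budget is split by the compressed secondary ratio, and a later `SetCapacity`/`UpdateTieredCache` re-splits the new total by the same ratio.
```
#include <cstdint>
#include <utility>

// Returns {primary capacity, compressed secondary capacity} for a given total
// budget and ratio, mirroring the distribution described above.
std::pair<uint64_t, uint64_t> SplitTieredCapacity(
    uint64_t total_capacity, double compressed_secondary_ratio) {
  const uint64_t secondary =
      static_cast<uint64_t>(total_capacity * compressed_secondary_ratio);
  return {total_capacity - secondary, secondary};
}

// e.g. SplitTieredCapacity(8ULL << 30, 0.25) yields 6 GB primary and 2 GB
// compressed secondary; SetCapacity(new_total) re-splits by the same 0.25.
```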
Tests:
New unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873
Reviewed By: akankshamahajan15
Differential Revision: D49562250
Pulled By: anand1976
fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
Summary:
With atomic_flush=true, a flush job with younger memtables waits for older memtables to be installed before installing its own memtables. If the flush for older memtables failed, auto-recovery starts a resume thread which can become stuck waiting for all background work to finish (including the flush for younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to the background error but never roll back its memtable: 269478ee46/db/db_impl/db_impl_compaction_flush.cc (L725) This prevents any future flush from picking up the old memtables.
A more detailed repro is in unit test.
This PR fixes this issue by
1. Ensure we rollback memtables if an atomic flush fails due to background error
2. When there is a background error, abort atomic flushes that are waiting for older memtables to be installed
3. Do not schedule non-recovery flushes when there is a background error that stops background work
There was another issue with atomic_flush=true where DB can hang during DB close, see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11872
Test Plan: new unit test.
Reviewed By: jowlyzhang
Differential Revision: D49556867
Pulled By: cbi42
fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11870
Having a large number of merge operands applied at query time can have a significant effect on performance; therefore, applications might want to limit the number of deltas for any given key. However, there is currently no way to establish the number of operands for certain types of queries. The ticker `READ_NUM_MERGE_OPERANDS` only provides aggregate (not per-read) information. The `PerfContext` counters `internal_merge_count` and `internal_merge_point_lookup_count` can be used to get this information on a per-query basis for iterators and single point lookups; however, there is no per-key breakdown for `MultiGet` type APIs. The patch addresses this issue by introducing a special kind of OK status which signals that an application-defined threshold on the number of merge operands has been exceeded for a given key. The threshold can be specified on a per-query basis using a new field in `ReadOptions`.
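A sketch of how an application might consume this; the `ReadOptions` field name and the `Status` query below are assumptions based on the description, not verified declarations:
```
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Returns true if the per-key merge operand budget was exceeded, e.g. so the
// caller can schedule a manual compaction for this key range.
bool TooManyMergeOperands(rocksdb::DB* db, const rocksdb::Slice& key) {
  rocksdb::ReadOptions read_options;
  read_options.merge_operand_count_threshold = 100;  // assumed field name

  std::string value;
  rocksdb::Status s = db->Get(read_options, key, &value);
  // Assumed accessor for the special kind of OK status described above.
  return s.ok() && s.IsOkMergeOperandThresholdExceeded();
}
```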
Reviewed By: jaykorean
Differential Revision: D49522786
fbshipit-source-id: 4265b3848d1be5ff313a3e8fb604ddf56411dd2c
Summary:
This PR implements support for a three-tier cache - primary block cache, compressed secondary cache, and an nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache.
The basic design is as follows -
1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache.
2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block.
3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache.
4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e., admit on second access). Blocks/items evicted from any tier are simply discarded.
We can easily implement additional admission policies if desired.
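To illustrate the hit/miss flow in (3) and (4), here is a hypothetical sketch with made-up tier interfaces (these are not RocksDB classes):
```
#include <string>

struct Tier {
  virtual ~Tier() = default;
  virtual bool Lookup(const std::string& key, std::string* block) = 0;
  virtual void Insert(const std::string& key, const std::string& block) = 0;
};

// Probe primary, then compressed secondary, then nvm. On an nvm hit, try to
// promote the block into the compressed tier and the primary cache (subject
// to their admission policies); evicted items are simply discarded.
bool LookupThreeTier(Tier& primary, Tier& compressed, Tier& nvm,
                     const std::string& key, std::string* block) {
  if (primary.Lookup(key, block)) return true;
  if (compressed.Lookup(key, block)) {
    primary.Insert(key, *block);
    return true;
  }
  if (nvm.Lookup(key, block)) {
    compressed.Insert(key, *block);
    primary.Insert(key, *block);
    return true;
  }
  return false;  // full miss: read from storage, then InsertFull/InsertSaved
}
```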
Todo (In a subsequent PR):
1. Add to db_bench and run benchmarks
2. Add to db_stress
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812
Reviewed By: pdillinger
Differential Revision: D49461842
Pulled By: anand1976
fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
Summary:
When atomic_flush=false, there are certain cases where we try to install memtable results with already-deleted SST files. This can happen when the following sequence of events occurs:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush1 returns OK, but doesn't install to MANIFEST and lets whoever flushes M0 take care of it
Flush0 finishes with a retryable IOError; it rolls back M0, (incorrectly) does not roll back M1, and deletes SST0 and SST1
Start Flush2 for M0; it does not pick up M1 since it thinks M1 is already flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an error message like the following:
IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```
This happens since:
1. We currently only roll back the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished, regardless of flush status: f42e70bf56/db/db_impl/db_impl_compaction_flush.cc (L3161)
This PR fixes the issue by rolling back these pending flushes.
There is another issue where, if a new flush for a new memtable starts and finishes after Flush0 finishes, its output may also be deleted (see more in the unit test). It is fixed by checking the bg error status before installing a memtable result and rolling back if there is an error.
There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865
Test Plan: * Added repro unit tests.
Reviewed By: anand1976
Differential Revision: D49484922
Pulled By: cbi42
fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
Summary:
This PR makes disabling the compressed secondary cache by setting capacity to 0 a bit more efficient. Previously, inserts/lookups would go to the backing LRUCache before getting rejected due to 0 capacity. With this change, insert/lookup would return from ```CompressedSecondaryCache``` itself.
Tests:
Existing tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11863
Reviewed By: akankshamahajan15
Differential Revision: D49476248
Pulled By: anand1976
fbshipit-source-id: f0f17a5e3df7d8bfc06709f8f23c1302056ba590
Summary:
When auto_readahead_size is enabled with async_io, during a seek, the first buffer prefetches the data (current block + readahead till upper_bound). There can be cases where
1. the first buffer prefetched all the data till the upper bound, or
2. the first buffer already has the data from a previous seek call,
and the second buffer prefetches further, leading to alignment issues.
This PR fixes that assertion, and the second buffer won't go for prefetching if the first buffer has already prefetched till upper_bound.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11852
Test Plan:
- Added new unit test that failed without this fix.
- crash tests passed locally
Reviewed By: pdillinger
Differential Revision: D49384138
Pulled By: akankshamahajan15
fbshipit-source-id: 54417e909e4d986f1e5a17dbaea059cd4962fd4d
Summary:
**Context:**
As requested, the lowest input level as well as a map from input file to its table properties among all input files used in table creation (if any) are exposed in `CompactionFilter::Context`.
**Summary:**
This PR contains two commits:
(1) [Refactoring](0012777f0e) to make reasoning about/using what is in `Compaction::table_properties_` easier
- Separate `Compaction::table_properties_` into `Compaction::input_table_properties_` and `Compaction::output_table_properties_`
- Separate the "set input table properties" logic (`Compaction::SetInputTableProperties()`) from `Compaction::GetInputTableProperties()`
- Call `Compaction::SetInputTableProperties()` as soon as possible, which is right after `Compaction::SetInputVersion()`. Bundle these two functions into one `Compaction::FinalizeInputInfo()` to minimize missing one or the other
(2) [Expose more info about input files:](6093e7dfba) `CompactionFilter::Context::input_start_level/input_table_properties`
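A sketch of a factory consuming the newly exposed context fields; the exact types are assumptions based on the description above (the map is assumed to hold shared pointers to `TableProperties`):
```
#include <cstdint>
#include <memory>

#include "rocksdb/compaction_filter.h"
#include "rocksdb/table_properties.h"

// Trivial filter that keeps everything; only the factory below is of interest.
class NoopFilter : public rocksdb::CompactionFilter {
 public:
  const char* Name() const override { return "NoopFilter"; }
};

class MyFilterFactory : public rocksdb::CompactionFilterFactory {
 public:
  std::unique_ptr<rocksdb::CompactionFilter> CreateCompactionFilter(
      const rocksdb::CompactionFilter::Context& context) override {
    uint64_t input_entries = 0;
    for (const auto& kv : context.input_table_properties) {
      input_entries += kv.second->num_entries;  // assumes shared_ptr values
    }
    // A real filter could vary its behavior by input_start_level (e.g. be
    // cheaper for compactions starting at L0) or by the aggregated properties.
    (void)context.input_start_level;
    (void)input_entries;
    return std::make_unique<NoopFilter>();
  }
  const char* Name() const override { return "MyFilterFactory"; }
};
```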
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11857
Test Plan:
- Modify existing UT `TEST_F(DBTestCompactionFilter, CompactionFilterContextManual)` to cover the new logic
Reviewed By: ajkr
Differential Revision: D49402540
Pulled By: hx235
fbshipit-source-id: 469fff50fa0e5964ffa5ea8db0743f61438ea392
Summary:
**Summary:**
When the row cache is hit and a timestamp is set in read_options, even though the ROW_CACHE entry is hit, the return status is kNotFound.
**Cause of error:**
If a timestamp is provided in ReadOptions, a callback for sequence number checking is registered [here](8fc78a3a9e/db/db_impl/db_impl.cc (L2112)).
Hence the default value set at this [line](694e49cbb1/table/get_context.cc (L611)) prevents get_context from saving the value found in the cache, causing the final status to be kNotFound even though the entry exists in both the cache and the SST file.
**Proposed Solution**
The row cache key contains a sequence number. If the key for the row cache lookup matches the key in the cache, this cache entry should be good to expose to the user, and hence we reuse the sequence number in the cache key rather than passing kMaxSequenceNumber.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11816
Reviewed By: ajkr
Differential Revision: D49419029
Pulled By: jowlyzhang
fbshipit-source-id: 6c77e9e751628d7d8e6c389f299e29a11ea824c6
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply(). This is done by passing in a [callback](ba59751430/db/version_set.h (L1199)) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11764
Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released
Reviewed By: ajkr
Differential Revision: D48742152
Pulled By: cbi42
fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
Summary:
With the async_io option, the Seek happens in 2 phases. Phase 1 starts an asynchronous read on a block cache miss, and phase 2 waits for it to complete and finishes the seek. In both phases, BlockBasedTable::NewDataBlockIterator is called, which tries to look up the block cache for the data block first before looking in the prefetch buffer. It's optimized by doing the block cache lookup only in the first phase, saving some CPU.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11616
Test Plan: Added unit test
Reviewed By: jaykorean
Differential Revision: D47477887
Pulled By: akankshamahajan15
fbshipit-source-id: 0355e0a68fc0ea2eb92340ae42735afcdbcbfd79
Summary:
An internal user wants to be able to dynamically switch between Bloom and Ribbon filters, without a custom FilterPolicy. Making `filter_policy` mutable would actually make issue https://github.com/facebook/rocksdb/issues/10079 worse, because it would be a race on a pointer field, not just on scalars.
As a reasonable compromise until that is fixed, I am enabling dynamic control over Bloom vs. Ribbon choice by making
RibbonFilterPolicy::bloom_before_level mutable, and doing that safely by using an atomic.
I've also slightly tweaked the interpretation of that field so that setting it to INT_MAX really means "always Bloom."
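For illustration, a sketch of the creation-time choice plus a later runtime switch; the `SetOptions` string syntax below is an assumption, the point being that `bloom_before_level` can now be changed without reopening the DB:
```
#include "rocksdb/db.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/table.h"

void ConfigureFilters(rocksdb::Options& options, rocksdb::DB* db) {
  // At open time: Bloom for level 0, Ribbon for deeper levels.
  rocksdb::BlockBasedTableOptions bbto;
  bbto.filter_policy.reset(rocksdb::NewRibbonFilterPolicy(
      /*bloom_equivalent_bits_per_key=*/10.0, /*bloom_before_level=*/1));
  options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));

  // Later, switch to "always Bloom" (INT_MAX) at runtime; the exact option
  // string format here is assumed, not verified.
  rocksdb::Status s = db->SetOptions(
      {{"block_based_table_factory",
        "{filter_policy=ribbonfilter:10:2147483647;}"}});
  s.PermitUncheckedError();
}
```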
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11838
Test Plan: unit tests added/extended. crash test updated for SetOptions call and tested under TSAN with amplified probability (lower set_options_one_in).
Reviewed By: ajkr
Differential Revision: D49296284
Pulled By: pdillinger
fbshipit-source-id: e4251c077510df9a9c719876f482448c0d15402a
Summary:
This PR resolves https://github.com/facebook/rocksdb/issues/10487 & https://github.com/facebook/rocksdb/issues/10536, where user code needs to call Refresh() periodically.
The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is `db_iter_->set_sequence(read_seq) -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno()`.
This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator.
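A sketch of the long-lived iterator pattern this enables; the snapshot-taking variant `Refresh(const Snapshot*)` mentioned in the test plan follows the same shape:
```
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/iterator.h"

void ScanWithPeriodicRefresh(rocksdb::DB* db) {
  rocksdb::ReadOptions ro;
  std::unique_ptr<rocksdb::Iterator> iter(db->NewIterator(ro));
  for (int pass = 0; pass < 10; ++pass) {
    for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
      // ... process iter->key() / iter->value() ...
    }
    // Instead of re-creating the iterator, refresh it so that newer data and
    // range deletions become visible on the next pass.
    rocksdb::Status s = iter->Refresh();
    if (!s.ok()) break;
  }
}
```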
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10594
Test Plan:
* New unit tests.
* Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case.
TODO in a following PR:
* Stress test Iterator::Refresh() to different snapshots or no snapshot.
Reviewed By: ajkr
Differential Revision: D48456896
Pulled By: cbi42
fbshipit-source-id: 2e642c04e91235cc9542ef4cd37b3c20823bd779
Summary:
Update the logic in FilePrefetchBuffer to update `upper_bound_offset_` during reseek. During Reseek, `iterate_upper_bound` can be changed dynamically, so an API was added to update it in FilePrefetchBuffer.
Added unit test to confirm the behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11775
Test Plan:
- Check stress tests in case there is any failure after this diff.
- make crash_test -j32 with auto_readahead_size=1 passed locally
Reviewed By: anand1976
Differential Revision: D48815177
Pulled By: akankshamahajan15
fbshipit-source-id: 5f44fbb3af06c86a1c38f139c5fa4543891837f4
Summary:
When `MultiGet` acquires `SuperVersion` by locking the db mutex and getting the current `ColumnFamilyData::super_version_`, its corresponding cleanup logic is not correctly done.
It's currently doing this:
`MultiGetColumnFamilyData::cfd->GetSuperVersion().Unref()`
This operates on the most recent `SuperVersion` without locking the db mutex, which is not thread safe by itself. Also, this unref operation is intended for the originally acquired `SuperVersion` instead of the current one, because a race condition could happen where a new `SuperVersion` is installed in between this `MultiGet`'s ref and unref. When this race condition does happen, it's not sufficient to just unref the `SuperVersion`; `DBImpl::CleanupSuperVersion` should be called instead to properly clean up the `SuperVersion` if this `MultiGet` call was its last reference holder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11830
Test Plan:
`make all check`
Added a unit test that would originally fail
Reviewed By: ltamasi
Differential Revision: D49287715
Pulled By: jowlyzhang
fbshipit-source-id: 8353636ee11b2e90d85c677a96a92360072644b0
Summary:
`GetEntity` API support for ReadOnly DB and Secondary DB.
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_readonly` and refactored the current `Get()` logic into `GetImpl()` so that the lookup logic can be reused for `GetEntity()` (following the same pattern as `DBImpl::Get()` and `DBImpl::GetEntity()`)
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_secondary` and refactored current `GetImpl()` logic. This is to make `DBImplSecondary::Get/GetEntity` consistent with `DBImpl::Get/GetEntity` and `DBImplReadOnly::Get/GetEntity`
- `GetImpl()` in `db_impl` is now virtual. Both `db_impl_readonly`'s and `db_impl_secondary`'s `Get()` overrides are no longer needed since all three DBs now have the same `Get()`, which calls `GetImpl()` internally.
- `GetImpl()` in `DBImplReadOnly` and `DBImplSecondary` now pass in `columns` instead of `nullptr` in lookup functions like `memtable->get()`
- Introduced `GetEntity()` API in `DBImplReadOnly` and `DBImplSecondary` which simply calls `GetImpl()` with `columns` set in `GetImplOptions`.
- Introduced `Env::IOActivity::kGetEntity` and set read_options.io_activity to `Env::IOActivity::kGetEntity` for `GetEntity()` operations (in db_impl)
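A sketch of the newly supported call path (the path and key here are placeholders):
```
#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

void ReadEntityFromReadOnlyDb(const std::string& db_path) {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  rocksdb::Status s = rocksdb::DB::OpenForReadOnly(options, db_path, &db);
  assert(s.ok());

  rocksdb::PinnableWideColumns columns;
  s = db->GetEntity(rocksdb::ReadOptions(), db->DefaultColumnFamily(),
                    "my_key", &columns);
  if (s.ok()) {
    for (const auto& column : columns.columns()) {
      (void)column;  // column.name() / column.value()
    }
  }
  delete db;
}
```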
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11799
Test Plan:
**Unit Tests**
- Added verification in `DBWideBasicTest::PutEntity` by Reopening DB as ReadOnly with the same setup.
- Added verification in `DBSecondaryTest::ReopenAsSecondary` by calling `PutEntity()` and `GetEntity()` on top of existing `Put()` and `Get()`
- `make -j64 check`
**Crash Tests**
- `python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10`
- `python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1`
- `python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10`
Reviewed By: ltamasi
Differential Revision: D49037040
Pulled By: jaykorean
fbshipit-source-id: a0648253ded6e91af7953de364ed3c6bf163626b
Summary:
**Context/Summary:**
A size amp compaction can select a large number of L0 files and prevent them from being selected by other compactions. If such a compaction runs long or is queued behind other work, these L0 files will exist for a long time. With a few more flushes, we can run into a write stop triggered by the number of L0 files. We've seen this happen on a host with many DBs sharing the same thread pool: each of these DBs submits a size amp compaction with (110-180)+ files to the pool upon reopen, and with a few more flushes, they hit the 200 L0 write stop condition.
The idea is to exclude some L0 input files from the size amp compaction whose exclusion is harmless to size amp reduction but improves the situation described above.
The exclusion algorithm is in `MightExcludeNewL0sToReduceWriteStop()` with two elements:
1. #L0 to exclude + (level0_stop_writes_trigger - num_l0_input_pre_exclusion) should be in the range of [min_merge_width, max_merge_width].
- This is to ensure we exclude enough L0 input files, but not too many, so that the excluded files together with the incoming future L0 files can still qualify to be picked for another compaction before the write stop.
2. Based on (1), further constrain #L0 to exclude based on the post-exclusion compaction score. The goal is to ensure our exclusion will not disqualify the size amp compaction from being a size amp compaction after exclusion.
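Constraint (1) translates roughly to the following check (a hypothetical helper, not the actual picker code):
```
// Returns true if excluding `num_l0_to_exclude` files keeps the projected
// follow-up compaction (excluded L0s + incoming L0s before the write stop)
// within the configured merge width bounds.
bool ExclusionKeepsMergeWidth(int num_l0_to_exclude,
                              int level0_stop_writes_trigger,
                              int num_l0_input_pre_exclusion,
                              int min_merge_width, int max_merge_width) {
  const int projected = num_l0_to_exclude +
      (level0_stop_writes_trigger - num_l0_input_pre_exclusion);
  return projected >= min_merge_width && projected <= max_merge_width;
}
```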
**Test plan:** New unit test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11749
Reviewed By: ajkr
Differential Revision: D48850631
Pulled By: hx235
fbshipit-source-id: 2c321036e164087c36319dd5645cbbf6b6152092
Summary:
Existing compaction statistics are `COMPACTION_TIME` and `COMPACTION_CPU_TIME`, which are histograms and are logged at the end of a compaction. The new statistic `COMPACTION_CPU_TOTAL_TIME` is a cumulative total of compaction CPU time that is updated regularly during a compaction. This allows users to more closely track compaction CPU usage.
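A sketch of consuming the new ticker (enable statistics before opening the DB, then poll while compactions run):
```
#include <cstdint>

#include "rocksdb/options.h"
#include "rocksdb/statistics.h"

void EnableStats(rocksdb::Options& options) {
  options.statistics = rocksdb::CreateDBStatistics();
}

uint64_t CompactionCpuSoFar(const rocksdb::Options& options) {
  // Cumulative compaction CPU time so far; unlike the histograms, this value
  // moves while a compaction is still running, so periodic sampling works.
  return options.statistics->getTickerCount(rocksdb::COMPACTION_CPU_TOTAL_TIME);
}
```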
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11741
Test Plan: * new unit test `DBTestWithParam.CompactionTotalTimeTest`
Reviewed By: ajkr
Differential Revision: D48608094
Pulled By: cbi42
fbshipit-source-id: b597109f3e4bf2237fb5a216b6fd036e5363b4c0
Summary:
During Seek, the iterator seeks every file on L0. With async_io, it submits the requests to seek on every file on L0 asynchronously using RocksDB FilePrefetchBuffer.
However, FilePrefetchBuffer does alignment and reads more bytes than needed; avoiding that can increase the throughput.
In case of non-direct IO, the alignment can be avoided.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11793
Test Plan:
- Added a unit test that fails without this PR.
- make crash_test -j32 completed successfully
Reviewed By: anand1976
Differential Revision: D48985051
Pulled By: akankshamahajan15
fbshipit-source-id: 2d130a9e7c3df9c4fcd0408406e6277ab75a4389