Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12606
The patch extends optimistic transactions and WriteCommitted pessimistic transactions with support for the `PutEntity` API. Similarly to the other APIs, `PutEntity` is available via both the `Transaction` and `TransactionDB` interfaces, where using the latter executes the write in a single-operation transaction as usual. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added in separate PRs.
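A minimal sketch of how `PutEntity` might be used through the `Transaction` interface (optimistic transaction shown; error handling is trimmed and the key/column names are illustrative):
```
#include <cassert>

#include "rocksdb/utilities/optimistic_transaction_db.h"
#include "rocksdb/utilities/transaction.h"
#include "rocksdb/wide_columns.h"

// Sketch only: write a wide-column entity inside a transaction and commit.
void PutEntityInTxn(ROCKSDB_NAMESPACE::OptimisticTransactionDB* txn_db,
                    ROCKSDB_NAMESPACE::ColumnFamilyHandle* cfh) {
  using namespace ROCKSDB_NAMESPACE;
  WriteOptions write_opts;
  Transaction* txn = txn_db->BeginTransaction(write_opts);

  WideColumns columns{{"attr1", "value1"}, {"attr2", "value2"}};
  Status s = txn->PutEntity(cfh, "my_key", columns);
  assert(s.ok());

  s = txn->Commit();
  assert(s.ok());
  delete txn;
}
```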
Reviewed By: jaykorean
Differential Revision: D56911242
fbshipit-source-id: 57cf8bb6c6b1b40ba4a8a652831c13a617644289
Summary:
Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs".
This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622
Reviewed By: cbi42
Differential Revision: D56987303
Pulled By: ajkr
fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48
Summary:
When importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires the original CFs to have disjoint key ranges, after an intra-L0 compaction we can still end up with L0 files that have the same epoch number but overlapping key ranges. This PR attempts to fix this by reassigning epoch numbers when importing multiple CFs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602
Test Plan:
A new repro unit test. Before this PR, it fails with
```
[ RUN ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF
db/import_column_family_test.cc:1048: Failure
db_->WaitForCompact(o)
Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range #44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file #36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3
```
Reviewed By: hx235
Differential Revision: D56851808
Pulled By: cbi42
fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84
Summary:
This PR fixes the issue that deletion of obsolete files during DB::Open is not rate limited.
The root cause is that slow deletion is disabled if the trash/DB size ratio exceeds the configured `max_trash_db_ratio` d610e14f93/include/rocksdb/sst_file_manager.h (L126). However, the current handling in DB::Open starts out tracking nothing but the obsolete files, which makes the ratio always look like it's 1.
In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move the two places that attempt to delete files until after the tracking is done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call.
There are some other cosmetic changes, like refactoring the collection of all the DB paths into a function and renaming `DBImpl::DeleteUnreferencedSstFiles` to `DBImpl::MaybeUpdateNextFileNumber`, as it doesn't actually delete the files.
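For reference, a sketch of the kind of setup this fix affects: obsolete-file deletions go through an `SstFileManager` with a deletion rate limit (the rate value below is illustrative, not a recommendation):
```
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/sst_file_manager.h"

// Sketch: configure rate-limited (slow) deletion of obsolete files.
ROCKSDB_NAMESPACE::Status OpenWithSlowDeletion(const std::string& db_path,
                                               ROCKSDB_NAMESPACE::DB** db) {
  using namespace ROCKSDB_NAMESPACE;
  Options options;
  options.create_if_missing = true;
  // Delete obsolete/trash files at most ~64 MB/s; faster deletions are moved
  // to trash and trickled out by the DeleteScheduler.
  options.sst_file_manager.reset(NewSstFileManager(
      Env::Default(), /*info_log=*/nullptr, /*trash_dir=*/"",
      /*rate_bytes_per_sec=*/64 << 20));
  return DB::Open(options, db_path, db);
}
```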
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590
Test Plan: Added unit test and verified with manual testing
Reviewed By: anand1976
Differential Revision: D56830519
Pulled By: jowlyzhang
fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/12542 introduced a bug where the wrong padded bytes were used to generate the file checksum if a flush happens during padding. This PR fixes it, along with an existing instance of the same bug for `perform_data_verification_=true`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598
Test Plan:
- New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption: ...file checksum mismatch`) and passes after
- Benchmark
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG 300 runs] : 421334 (± 4126) ops/sec; 46.6 (± 0.5) MB/sec
Post-PR: (no regression observed but a slight improvement)
fillseq [AVG 300 runs] : 425768 (± 4309) ops/sec; 47.1 (± 0.5) MB/sec
Reviewed By: ajkr, anand1976
Differential Revision: D56725688
Pulled By: hx235
fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5
Summary:
This change enables `optimize_filters_for_memory` by default. The feature has been around for a couple of years and users haven't reported any problems with it.
Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377
Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in.
Reviewed By: jowlyzhang
Differential Revision: D54129517
Pulled By: pdillinger
fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c
Summary:
See comment at top of the test case and release note.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597
Reviewed By: jaykorean
Differential Revision: D56718786
Pulled By: ajkr
fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f
Summary:
When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 || s >= first_seqno_` in `rocksdb::MemTable::Add`.
We could fix this by either 1) writing a special record when reusing a log file, 2) implementing more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) not reusing files created by a previous DB session. We chose option 3 as it's the simplest, and flipping `recycle_log_file_num` is expected to be a rare event.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591
Test Plan: 1. Add a unit test to verify the bug and fix
Reviewed By: jowlyzhang
Differential Revision: D56655812
Pulled By: anand1976
fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87
Summary:
Made `BlockBasedTableOptions::block_align` incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`. Previously it was only incompatible with `ColumnFamilyOptions::compression`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12592
Reviewed By: hx235
Differential Revision: D56650862
Pulled By: ajkr
fbshipit-source-id: f5201602c2ce436e6d8d30893caa6a161a61f141
Summary:
I had a TODO to complete `CompactionOptions`'s compression API but never did it: d610e14f93/db/compaction/compaction_picker.cc (L371-L373)
Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`.
In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used.
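For illustration, a `CompactFiles()` call of this shape now inherits the column family's compression settings when `CompactionOptions` is left at its defaults (function and variable names below are illustrative):
```
#include <string>
#include <vector>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Sketch: compact a caller-chosen set of files into the given output level.
// With this change, the default CompactionOptions no longer forces Snappy;
// the column family's compression settings are used instead.
ROCKSDB_NAMESPACE::Status CompactSomeFiles(
    ROCKSDB_NAMESPACE::DB* db,
    const std::vector<std::string>& input_file_names, int output_level) {
  ROCKSDB_NAMESPACE::CompactionOptions compaction_opts;  // compression left at default
  return db->CompactFiles(compaction_opts, input_file_names, output_level);
}
```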
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587
Reviewed By: hx235
Differential Revision: D56619273
Pulled By: ajkr
fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/12561 and https://github.com/facebook/rocksdb/issues/12566 , the `NewIterators()` API has not been providing a consistent view of the db across multiple column families. This PR addresses that by utilizing the `MultiCFSnapshot()` function, which has been used for the `MultiGet()` APIs. To be able to obtain the thread-local super version with a ref, an `sv_exclusive_access` parameter has been added to `MultiCFSnapshot()` so that we can call `GetReferencedSuperVersion()` or `GetAndRefSuperVersion()` depending on the parameter, and support the `Refresh()` API for MultiCfIterators.
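For context, a sketch of the call pattern this affects (names are illustrative); with this PR the iterators returned below share a consistent view across column families:
```
#include <cassert>
#include <vector>

#include "rocksdb/db.h"

// Sketch: obtain iterators over multiple column families that now observe a
// consistent view of the DB taken at NewIterators() time.
void IterateConsistently(
    ROCKSDB_NAMESPACE::DB* db,
    const std::vector<ROCKSDB_NAMESPACE::ColumnFamilyHandle*>& cfhs) {
  using namespace ROCKSDB_NAMESPACE;
  ReadOptions read_opts;
  std::vector<Iterator*> iters;
  Status s = db->NewIterators(read_opts, cfhs, &iters);
  assert(s.ok());
  for (Iterator* it : iters) {
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
      // All iterators see the same point-in-time state across CFs.
    }
    delete it;
  }
}
```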
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12573
Test Plan:
**Unit Tests Added**
```
./db_iterator_test --gtest_filter="*IteratorsConsistentView*"
```
```
./multi_cf_iterator_test -- --gtest_filter="*ConsistentView*"
```
**Performance Check**
Setup
```
make -j64 release
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000
```
Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom : 6.374 micros/op 156892 ops/sec 6.374 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom : 6.265 micros/op 159627 ops/sec 6.265 seconds 1000000 operations; (0 of 1000000 found)
```
Reviewed By: jowlyzhang
Differential Revision: D56444066
Pulled By: jaykorean
fbshipit-source-id: 327ce73c072da30c221e18d4f3389f49115b8f99
Summary:
Prior to this PR the following sequence could happen:
1. `RunManualCompaction()` A schedules compaction to thread pool and waits
2. `RunManualCompaction()` B waits without scheduling anything due to conflict
3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both
4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done
5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool
6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction
7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution
8. `DisableManualCompaction()` returns as there are no more manual compactions running
Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward:
5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done
6'. `DisableManualCompaction()` returns as there are no more manual compactions running
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578
Reviewed By: cbi42
Differential Revision: D56528144
Pulled By: ajkr
fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b
Summary:
Previously `insert_hints_` was used for both point key table (`table_`) and range deletion table (`range_del_table_`). Hints include pointers to table data, so mixing hints for different tables together without tracking which hint corresponds to which table was problematic. We can just make the hints dedicated to the point key table only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12558
Reviewed By: hx235
Differential Revision: D56279019
Pulled By: ajkr
fbshipit-source-id: 00fe5ce72f9f11a1c1cba5f1977b908b2d518f29
Summary:
**Context/Summary:**
When `BlockBasedTableOptions::block_align=true`, we pad bytes to align blocks d41e568b1c/table/block_based/block_based_table_builder.cc (L1415-L1421).
Those bytes are not included in generating the file checksum upon file creation. But `VerifyFileChecksums()` includes those bytes in generating the file checksum to compare against the checksum generated upon file creation. Therefore a file checksum mismatch is returned in `VerifyFileChecksums()`.
We decided to include those padded bytes in generating the checksum upon file creation.
Bonus: also fix surrounding code to use actual padded bytes for verification - see https://github.com/facebook/rocksdb/pull/12542#discussion_r1571429163
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12542
Test Plan:
- New UT
- Benchmark
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG 300 runs] : 422857 (± 3942) ops/sec; 46.8 (± 0.4) MB/sec
Post-PR:
fillseq [AVG 300 runs] : 424707 (± 3799) ops/sec; 47.0 (± 0.4) MB/sec
Reviewed By: ajkr
Differential Revision: D56168447
Pulled By: hx235
fbshipit-source-id: 96209ef950d42943d336f11968ae3fcf9872fc2c
Summary:
This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427. On file systems that support storage level data checksum and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but it's a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518
Reviewed By: ajkr
Differential Revision: D55932155
Pulled By: anand1976
fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12557
Unlike for other sequence containers, the C++ standard allows moving an `std::string` to invalidate pointers/iterators/references. In practice, this happens with short strings which are stored "inline" in the `std::string` object (small string optimization). Since `PinnableSlice` uses `std::string` as its internal buffer, and `PinnableWideColumns` in turn is implemented in terms of `PinnableSlice`, this means that the default compiler-generated move operations can invalidate the column index stored in `PinnableWideColumns::columns_`. The PR fixes this by providing custom move constructor/move assignment implementations for `PinnableWideColumns` that recreate the `columns_` index upon move.
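To illustrate the hazard in isolation, here is a generic sketch (not the actual `PinnableWideColumns` code) of rebuilding a view index in custom move operations, since moving a short `std::string` may relocate its inline buffer:
```
#include <string>
#include <string_view>
#include <vector>

// Generic illustration of the fix pattern: views into a movable string buffer
// must be re-derived after the buffer is moved.
struct IndexedBuffer {
  std::string buf;                     // may use small string optimization
  std::vector<std::string_view> cols;  // views into buf

  IndexedBuffer() = default;
  IndexedBuffer(IndexedBuffer&& other) noexcept : buf(std::move(other.buf)) {
    // Moving a short std::string may copy its inline bytes, invalidating any
    // views into the old object, so rebuild the index against the new buffer.
    RebuildIndex(other.cols.size());
    other.cols.clear();
  }
  IndexedBuffer& operator=(IndexedBuffer&& other) noexcept {
    buf = std::move(other.buf);
    RebuildIndex(other.cols.size());
    other.cols.clear();
    return *this;
  }

 private:
  void RebuildIndex(size_t n) {
    // Illustrative only: treat the whole buffer as one column when non-empty.
    cols.clear();
    if (n > 0) cols.emplace_back(buf.data(), buf.size());
  }
};
```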
Reviewed By: jaykorean
Differential Revision: D56275054
fbshipit-source-id: e8648c003dbcf1c39ec122ad229780c28138e730
Summary:
Adding an option to wait for purge to complete in `WaitForCompact` API.
Internally, RocksDB has a way to wait for purge to complete (e.g. TEST_WaitForPurge() in db_impl_debug.cc), but there's no public API available to gracefully wait for purge to complete.
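A sketch of how the new option might be used; the `wait_for_purge` field name follows this summary and should be treated as illustrative:
```
#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Sketch: wait for background compactions and, with the new option, for the
// purge of obsolete files to finish as well.
ROCKSDB_NAMESPACE::Status WaitUntilQuiesced(ROCKSDB_NAMESPACE::DB* db) {
  ROCKSDB_NAMESPACE::WaitForCompactOptions wait_opts;
  wait_opts.wait_for_purge = true;  // new option per this PR (name illustrative)
  return db->WaitForCompact(wait_opts);
}
```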
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12520
Test Plan:
Unit Test Added - `WaitForCompactWithWaitForPurgeOptionTest`
```
./deletefile_test -- --gtest_filter="*WaitForCompactWithWaitForPurgeOptionTest*"
```
Existing Tests
```
./db_compaction_test -- --gtest_filter="*WaitForCompactWithOption*"
```
Reviewed By: ajkr
Differential Revision: D55888283
Pulled By: jaykorean
fbshipit-source-id: cfc6d6e8657deaefab8961890b36e390095c9f65
Summary:
In https://github.com/facebook/rocksdb/issues/12365 we made `max_successive_merges` non-strict by default. Before https://github.com/facebook/rocksdb/issues/12365, `CountSuccessiveMergeEntries()`'s scan was implicitly limited to `max_successive_merges` entries for a given key, because after that the merge operator would be invoked and the merge chain would be collapsed. After https://github.com/facebook/rocksdb/issues/12365, the merge chain will not be collapsed no matter how long it is when the chain's operands are not all in memory. Since `CountSuccessiveMergeEntries()` scanned the whole merge chain, https://github.com/facebook/rocksdb/issues/12365 had a side effect that it would scan more memtable entries. This PR introduces a limit so it won't scan more entries than it could before.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12546
Reviewed By: jaykorean
Differential Revision: D56193693
Pulled By: ajkr
fbshipit-source-id: b070ba0703ef733e0ff230f89cd5cca5233b84da
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12539
As a follow-up to https://github.com/facebook/rocksdb/pull/12533, this PR extends `WriteBatchWithIndex` with a `MultiGetEntityFromBatchAndDB` API that enables users to perform batched wide-column point lookups with read-your-own-writes consistency. This API transparently combines data from the indexed write batch and the underlying database as needed and presents the results in the form of a wide-column entity.
Reviewed By: jaykorean
Differential Revision: D56153145
fbshipit-source-id: 537967051b7521bb41b04070ac1a78a1d8873c08
Summary:
Continuing from the previous MultiCfIterator implementations (https://github.com/facebook/rocksdb/issues/12422, https://github.com/facebook/rocksdb/issues/12480, https://github.com/facebook/rocksdb/issues/12465), this PR completes the `AttributeGroupIterator` by implementing `AttributeGroupIteratorImpl::AddToAttributeGroups()`. While implementing the `AttributeGroupIterator`, we had to make some changes in `MultiCfIteratorImpl` and found an opportunity to improve `Coalesce()` in `CoalescingIterator`.
The `UNDER CONSTRUCTION - DO NOT USE` comment is lifted by replacing it with `EXPERIMENTAL`.
Here are some implementation details:
- `IteratorAttributeGroups` is introduced to avoid having to copy all `WideColumn` objects during iteration.
- `PopulateIterator()` no longer advances non-top iterators that have the same key as the top iterator in the heap.
- `AdvanceIterator()` needs to advance the non-top iterators when they have the same key as the top iterator in the heap.
- Instead of populating one by one, `PopulateIterator()` now collects all items with the same key and calls `populate_func(items)` at once.
- This allowed optimization in `Coalesce()` such that we no longer do K-1 rounds of 2-way merge, but do one K-way merge instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12534
Test Plan:
Uncommented the assertions in `verifyAttributeGroupIterator()`
```
./multi_cf_iterator_test
```
Reviewed By: ltamasi
Differential Revision: D56089019
Pulled By: jaykorean
fbshipit-source-id: 6b0b4247e221f69b40b147d41492008cc9b15054
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12533
The PR extends `WriteBatchWithIndex` with a new wide-column point lookup API `GetEntityFromBatchAndDB`. Similarly to `GetFromBatchAndDB`, the new API can transparently combine data from the write batch with data from the underlying database as needed. Like `DB::GetEntity`, it returns any result in the form of a wide-column entity (i.e. plain key-values are wrapped into an entity with a single anonymous column).
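A sketch of the intended usage; the exact parameter list is per this PR's description and should be treated as illustrative:
```
#include "rocksdb/db.h"
#include "rocksdb/utilities/write_batch_with_index.h"
#include "rocksdb/wide_columns.h"

// Sketch: read-your-own-writes wide-column point lookup that merges the
// indexed batch with the underlying DB.
void LookupEntity(ROCKSDB_NAMESPACE::WriteBatchWithIndex* wbwi,
                  ROCKSDB_NAMESPACE::DB* db,
                  ROCKSDB_NAMESPACE::ColumnFamilyHandle* cfh) {
  using namespace ROCKSDB_NAMESPACE;
  ReadOptions read_opts;
  PinnableWideColumns result;
  Status s = wbwi->GetEntityFromBatchAndDB(db, read_opts, cfh, "my_key", &result);
  if (s.ok()) {
    for (const WideColumn& column : result.columns()) {
      // column.name() / column.value(); plain values appear as a single
      // anonymous default column.
    }
  }
}
```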
Reviewed By: jaykorean
Differential Revision: D56069132
fbshipit-source-id: 4f19cdeea4ce136497ce79fc9d28c925de59e220
Summary:
This PR adds support to programmatically iterate a raw table file with an iterator returned by `SstFileReader::NewTableIterator`, so that third-party tools can use it to observe SST files created by RocksDB.
The original feature request was from this merge request: https://github.com/facebook/rocksdb/pull/12370
Since keys returned by raw table iterators are internal keys, this PR also adds a struct `ParsedEntryInfo` and a util method `ParseEntry` to help users parse internal keys, as well as `GetInternalKeyForSeek` and `GetInternalKeyForSeekForPrev` to help users create internal keys for seek operations with this raw table iterator.
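A sketch of the intended usage; the zero-argument `NewTableIterator()` call shown here follows this summary and should be treated as illustrative:
```
#include <cassert>
#include <memory>
#include <string>

#include "rocksdb/sst_file_reader.h"

// Sketch: iterate the raw entries of an SST file. Keys returned here are
// internal keys (user key + seqno + type); the helpers added in this PR can
// be used to decode them or to build internal keys for seeks.
void DumpRawEntries(const std::string& sst_path) {
  using namespace ROCKSDB_NAMESPACE;
  Options options;
  SstFileReader reader(options);
  Status s = reader.Open(sst_path);
  assert(s.ok());

  std::unique_ptr<Iterator> it(reader.NewTableIterator());
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // it->key() is an internal key; it->value() is the raw stored value.
  }
}
```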
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12385
Test Plan: Added unit tests
Reviewed By: cbi42
Differential Revision: D55662855
Pulled By: jowlyzhang
fbshipit-source-id: 0716a173ee95924fbd4e1f9b6cccf06525c40049
Summary:
**Context/Summary:**
We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than one file to L0. This refitting can cause a read performance regression (as we need to go through every file in L0), corruption in some edge cases, and false-positive corruption caught by the force consistency check. We decided to explicitly disallow such behavior.
A related change to OptionChangeMigration():
- When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrated data into a couple L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data when migration finishes, in case the migrated data is larger than `max_table_files_size`. This is achieved by first compacting all the data into a couple non-L0 files and refitting those files from non-L0 to L0 if needed. That way, only some data instead of all data will be dropped immediately after migration to FIFO with a `max_table_files_size`.
- Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and explicitly state such risk in API comment.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481
Test Plan:
- New UT
- Modified UT
Reviewed By: cbi42
Differential Revision: D55351178
Pulled By: hx235
fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80
Summary:
This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL.
This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403
Test Plan:
1. Add new unit tests to verify behavior upon corruption
2. Re-enable disabled tests for verifying recycling behavior
Reviewed By: ajkr
Differential Revision: D54544824
Pulled By: anand1976
fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a
Summary:
This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including:
1) Handle a special case for `WriteBatch::TimedPut`. When the user specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error, but it instead creates and writes a regular `Put` entry.
2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries.
3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob currently copies the DB's seqno to time mapping while holding the db mutex and only copies the part of interest, i.e., the part that only goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to a preferred seqno as much as possible.
Test plans:
Added unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455
Reviewed By: pdillinger
Differential Revision: D55165422
Pulled By: jowlyzhang
fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47
Summary:
On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.
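A sketch of how a file system might advertise the capability (wrapper class and names are illustrative):
```
#include <memory>

#include "rocksdb/file_system.h"

// Sketch: a FileSystem wrapper advertising that it can verify checksums and
// reconstruct data on read, which lets RocksDB retry block reads on checksum
// mismatch.
class ReconstructingFS : public ROCKSDB_NAMESPACE::FileSystemWrapper {
 public:
  explicit ReconstructingFS(
      const std::shared_ptr<ROCKSDB_NAMESPACE::FileSystem>& base)
      : FileSystemWrapper(base) {}

  const char* Name() const override { return "ReconstructingFS"; }

  void SupportedOps(int64_t& supported_ops) override {
    // Keep whatever the wrapped file system supports, then add the new bit.
    FileSystemWrapper::SupportedOps(supported_ops);
    supported_ops |=
        (1 << ROCKSDB_NAMESPACE::FSSupportedOps::kVerifyAndReconstructRead);
  }
};
```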
Tests:
Add new unit tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427
Reviewed By: ajkr
Differential Revision: D55025941
Pulled By: anand1976
fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
Summary:
Thanks ltamasi for pointing out this bug.
We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443
Reviewed By: cbi42
Differential Revision: D54930128
Pulled By: ajkr
fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d
Summary:
The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because:
- Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge
- The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me
If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with.
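A usage sketch; the `continue_cb` member and its return-value semantics follow this summary and should be treated as illustrative:
```
#include <vector>

#include "rocksdb/db.h"

// Sketch: stop gathering merge operands as soon as one of them is of no
// interest to the caller, avoiding reads into deeper LSM components.
void GetInterestingOperands(ROCKSDB_NAMESPACE::DB* db,
                            ROCKSDB_NAMESPACE::ColumnFamilyHandle* cfh,
                            const ROCKSDB_NAMESPACE::Slice& key) {
  using namespace ROCKSDB_NAMESPACE;
  constexpr int kMaxOperands = 16;
  std::vector<PinnableSlice> operands(kMaxOperands);
  int num_operands = 0;

  GetMergeOperandsOptions options;
  options.expected_max_number_of_operands = kMaxOperands;
  // Return true to keep scanning older operands, false to stop early
  // (semantics per this summary; treat as illustrative).
  options.continue_cb = [](Slice operand) {
    return operand != "operand_of_no_interest";
  };

  Status s = db->GetMergeOperands(ReadOptions(), cfh, key, operands.data(),
                                  &options, &num_operands);
  (void)s;
}
```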
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438
Reviewed By: hx235
Differential Revision: D54914669
Pulled By: ajkr
fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a
Summary:
Crash tests were failing due to data race in accessing `purge_wal_files_last_run_`. This PR changes it to atomic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439
Test Plan:
- existing UT
- not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests
Reviewed By: anand1976
Differential Revision: D54920817
Pulled By: cbi42
fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12424
The PR adds a wide-column point lookup API `GetEntityFromBatch` to `WriteBatchWithIndex`. Similarly to APIs like `DB::GetEntity`, this new API returns wide-column entities as-is, and wraps plain values in an entity with a single column (the anonymous default column). Also, similarly to `WriteBatchWithIndex::GetFromBatch`, it only reads data from the batch itself.
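A sketch of the intended usage; the exact `GetEntityFromBatch` parameter list follows this summary and should be treated as illustrative:
```
#include <cassert>

#include "rocksdb/utilities/write_batch_with_index.h"
#include "rocksdb/wide_columns.h"

// Sketch: write a wide-column entity into the indexed batch and read it back
// from the batch only, without touching the DB.
void ReadBackFromBatch(ROCKSDB_NAMESPACE::ColumnFamilyHandle* cfh) {
  using namespace ROCKSDB_NAMESPACE;
  WriteBatchWithIndex batch;
  WideColumns columns{{"attr", "value"}};
  Status s = batch.PutEntity(cfh, "my_key", columns);
  assert(s.ok());

  PinnableWideColumns result;
  s = batch.GetEntityFromBatch(cfh, "my_key", &result);
  assert(s.ok());  // a plain Put would come back wrapped in a single default column
}
```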
Reviewed By: jaykorean
Differential Revision: D54826535
fbshipit-source-id: 92604f3ebd90fe1afbd36f2d2194b7dee0011efa
Summary:
When PR https://github.com/facebook/rocksdb/issues/9629 introduced user-defined timestamp support for `WriteCommittedTxn`, it added this usage mandate for the `GetForUpdate` API when UDT is enabled: the `do_validate` flag has to be true, and the user should have already called `Transaction::SetReadTimestampForValidation` to set a read timestamp for validation. The rationale behind this mandate is this:
1) With do_validate = true, `GetForUpdate` can verify this relationship: let's denote the user-defined timestamp in the db for the key as `Ts_db` and the read timestamp the user set via `Transaction::SetReadTimestampForValidation` as `Ts_read`. UDT based validation will only pass if `Ts_db <= Ts_read`.
5950907a82/utilities/transactions/transaction_util.cc (L141)
2) Let's denote the commit timestamp set via `Transaction::SetCommitTimestamp` as `Ts_cmt`. Later, `WriteCommittedTxn::Commit` will only pass if this condition is met: `Ts_read < Ts_cmt`. 5950907a82/utilities/transactions/pessimistic_transaction.cc (L431)
Together these two checks can ensure `Ts_db < Ts_cmt` to meet the user-defined timestamp invariant that newer timestamp should have newer sequence number.
The `do_validate` flag was originally intended to make snapshot based validation optional. If it's true, `GetForUpdate` checks that no entry is written after the snapshot. If it's false, it will skip this snapshot based validation. In this PR, we are making the UDT based validation configurable too based on this flag, instead of mandating it, for the reasons below: 1) In some cases the users themselves can enforce the aforementioned invariant on their side independently, without RocksDB's help; for example, if they are managing a monotonically increasing timestamp, and their transactions are only committed in a single thread. They don't need this UDT based validation and want to skip it. 2) It also could be expensive or not practical for users to come up with such a read timestamp that is exactly in between their commit timestamp and the db's timestamp. For example, in the aforementioned case where a monotonically increasing timestamp is managed, the users would need to access this timestamp both for setting the read timestamp and for setting the commit timestamp. So it's preferable to skip this check too.
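A sketch of both paths (timestamps, key, and values are illustrative):
```
#include <string>

#include "rocksdb/utilities/transaction.h"

// Sketch: with this PR, UDT-based validation in GetForUpdate follows the
// do_validate flag instead of being mandatory when UDTs are enabled.
void ReadModifyWrite(ROCKSDB_NAMESPACE::Transaction* txn,
                     ROCKSDB_NAMESPACE::ColumnFamilyHandle* udt_cfh) {
  using namespace ROCKSDB_NAMESPACE;
  std::string value;

  // Validating path: requires a read timestamp, enforces Ts_db <= Ts_read.
  Status s = txn->SetReadTimestampForValidation(/*ts=*/100);
  s = txn->GetForUpdate(ReadOptions(), udt_cfh, "my_key", &value,
                        /*exclusive=*/true, /*do_validate=*/true);

  // Non-validating path (now allowed with UDTs): the caller takes
  // responsibility for the timestamp ordering invariant.
  s = txn->GetForUpdate(ReadOptions(), udt_cfh, "my_key", &value,
                        /*exclusive=*/true, /*do_validate=*/false);

  s = txn->SetCommitTimestamp(/*ts=*/200);
  s = txn->Commit();
  (void)s;
}
```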
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12369
Test Plan: added unit tests
Reviewed By: ltamasi
Differential Revision: D54268920
Pulled By: jowlyzhang
fbshipit-source-id: ca7693796f9bb11f376a2059d91841e51c89435a
Summary:
This PR updates `VersionEditHandlerPointInTime` to recover all or none of the updates in an AtomicGroup. This makes best-effort recovery properly handle atomic flushes during recovery, so the features are now allowed to both be enabled at once.
The new logic requires that AtomicGroups do not contain column family additions or removals. AtomicGroups are currently written for atomic flush, which does not include such edits.
Column family additions or removals are recovered independently of AtomicGroups. The new logic needs to be aware of removal, though, so that a dropped CF does not prevent completion of an AtomicGroup recovery.
The new logic treats each AtomicGroup as if it contains updates for all existing column families, even though it is possible to create AtomicGroups that only affect a subset of column families. This simplifies the logic at the expense of recovering less data in certain edge case scenarios.
The usage of `MaybeCreateVersion()` is pretty tricky. The goal is to create a barrier at the start of an AtomicGroup such that all valid states up to that point will be applied to `versions_`. Here is a summary.
- `MaybeCreateVersion(..., false)` creates a `Version` on a negative edge trigger (transition from valid to invalid). It was previously called when applying each update. Now, it is only called when applying non-AtomicGroup updates.
- `MaybeCreateVersion(..., true)` creates a `Version` on a positive level trigger (valid state). It was previously called only at the end of iteration. Now, it is additionally called before processing an AtomicGroup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12406
Reviewed By: jaykorean, cbi42
Differential Revision: D54494904
Pulled By: ajkr
fbshipit-source-id: 0114a9fe1d04b471d086dcab5978ea8a3a56ad52
Summary:
Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388
In terms of public API:
* Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre.
* Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general).
* Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature (see the sketch at the end of this summary).
* Note that "fail_if_not_bottommost_level" is obsolete/confusing use of "bottommost"
In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed.
Eventual suggested follow-up:
* Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare).
* More temperature handling for CreateColumnFamilyWithImport and Checkpoints.
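As referenced above, a sketch of writing an ingestion file directly at a target temperature; the temperature parameter to `SstFileWriter::Open` is per this summary and its exact signature should be treated as illustrative:
```
#include <string>

#include "rocksdb/options.h"
#include "rocksdb/sst_file_writer.h"

// Sketch: write a file intended for ingestion at the chosen temperature.
ROCKSDB_NAMESPACE::Status WriteColdFileForIngestion(const std::string& path) {
  using namespace ROCKSDB_NAMESPACE;
  Options options;
  SstFileWriter writer(EnvOptions(), options);

  Status s = writer.Open(path, Temperature::kCold);  // new parameter per this PR
  if (!s.ok()) {
    return s;
  }
  s = writer.Put("key1", "value1");
  if (!s.ok()) {
    return s;
  }
  return writer.Finish();
}
```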
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402
Test Plan:
Deeply revamped
ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer.
Used macros instead of functions for better tracing to critical source location on test failures.
Some enhancements to FileTemperatureTestFS in the process of developing the revamped test.
Reviewed By: jowlyzhang
Differential Revision: D54442794
Pulled By: pdillinger
fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995
Summary:
The current design proposes using a combination of `job_id`, `db_id`, and `db_session_id` to create a unique identifier for remote compaction jobs. However, this approach may not be suitable for users who prefer a different format for the unique identifier.
At Meta, we are utilizing generic compute offload to offload compaction tasks to remote workers. The compute offload client generates a UUID for each task, which requires an update to the current RocksDB API for onboarding purposes.
Users still have the option to create the unique identifier by combining `job_id`, `db_id`, and `db_session_id` if they prefer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12384
Test Plan:
```
$> ./compaction_service_test 13:29:35
[==========] Running 14 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 14 tests from CompactionServiceTest
[ RUN ] CompactionServiceTest.BasicCompactions
[ OK ] CompactionServiceTest.BasicCompactions (2642 ms)
[ RUN ] CompactionServiceTest.ManualCompaction
[ OK ] CompactionServiceTest.ManualCompaction (454 ms)
[ RUN ] CompactionServiceTest.CancelCompactionOnRemoteSide
[ OK ] CompactionServiceTest.CancelCompactionOnRemoteSide (1643 ms)
[ RUN ] CompactionServiceTest.FailedToStart
[ OK ] CompactionServiceTest.FailedToStart (1332 ms)
[ RUN ] CompactionServiceTest.InvalidResult
[ OK ] CompactionServiceTest.InvalidResult (1516 ms)
[ RUN ] CompactionServiceTest.SubCompaction
[ OK ] CompactionServiceTest.SubCompaction (551 ms)
[ RUN ] CompactionServiceTest.CompactionFilter
[ OK ] CompactionServiceTest.CompactionFilter (563 ms)
[ RUN ] CompactionServiceTest.Snapshot
[ OK ] CompactionServiceTest.Snapshot (124 ms)
[ RUN ] CompactionServiceTest.ConcurrentCompaction
[ OK ] CompactionServiceTest.ConcurrentCompaction (660 ms)
[ RUN ] CompactionServiceTest.CompactionInfo
[ OK ] CompactionServiceTest.CompactionInfo (984 ms)
[ RUN ] CompactionServiceTest.FallbackLocalAuto
[ OK ] CompactionServiceTest.FallbackLocalAuto (343 ms)
[ RUN ] CompactionServiceTest.FallbackLocalManual
[ OK ] CompactionServiceTest.FallbackLocalManual (380 ms)
[ RUN ] CompactionServiceTest.RemoteEventListener
[ OK ] CompactionServiceTest.RemoteEventListener (491 ms)
[ RUN ] CompactionServiceTest.TablePropertiesCollector
[ OK ] CompactionServiceTest.TablePropertiesCollector (169 ms)
[----------] 14 tests from CompactionServiceTest (11854 ms total)
[----------] Global test environment tear-down
[==========] 14 tests from 1 test case ran. (11855 ms total)
[ PASSED ] 14 tests.
```
Reviewed By: hx235
Differential Revision: D54220339
Pulled By: jaykorean
fbshipit-source-id: 5a9054f31933d1996adca02082eb37b6d5353224
Summary:
.. so write time can be measured under the new perf level for single-threaded writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12394
Test Plan: * add a new UT `PerfContextTest.WriteMemtableTimePerfLevel`
Reviewed By: anand1976
Differential Revision: D54326263
Pulled By: cbi42
fbshipit-source-id: d0e334d9581851ba6cf53c776c0bd876365d1e00
Summary:
When the rate limiter does not have any waiting requests, the first request to arrive may consume all of the available bandwidth, despite potentially having lower priority than requests that arrive later in the same refill interval. Then, those higher priority requests must wait for a refill. So even in scenarios in which we have an overall bandwidth surplus, the highest priority requests can be sporadically delayed up to a whole refill period.
Alone, this isn't necessarily problematic as the refill period is configurable via `refill_period_us` and can be tuned down as needed until the max sporadic delay is tolerable. However, tuning down `refill_period_us` had a side effect of reducing burst size. Some users require a certain burst size to issue optimal I/O sizes to the underlying storage system.
To satisfy those users, this PR decouples the refill period from the burst size. That way, the max sporadic delay can be limited without impacting I/O sizes issued to the underlying storage system.
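A sketch of constructing such a rate limiter; the trailing `single_burst_bytes` parameter follows this summary and the exact signature should be treated as illustrative:
```
#include <memory>

#include "rocksdb/rate_limiter.h"

// Sketch: limit sporadic delays with a short refill period while still
// allowing 16 MB bursts for large compaction I/Os. Values are illustrative.
std::shared_ptr<ROCKSDB_NAMESPACE::RateLimiter> MakeRateLimiter() {
  using namespace ROCKSDB_NAMESPACE;
  return std::shared_ptr<RateLimiter>(NewGenericRateLimiter(
      /*rate_bytes_per_sec=*/64 << 20,
      /*refill_period_us=*/10 * 1000,
      /*fairness=*/10,
      /*mode=*/RateLimiter::Mode::kWritesOnly,
      /*auto_tuned=*/false,
      /*single_burst_bytes=*/16 << 20));
}
```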
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12379
Test Plan:
The goal is to show we can now limit the max sporadic delay without impacting compaction's I/O size.
The benchmark runs compaction with a large I/O size, while user reads simultaneously run at a low rate that does not consume all of the available bandwidth. The max sporadic delay is measured using the P100 of rocksdb.file.read.get.micros. I just used strace to verify the compaction reads follow `rate_limiter_single_burst_bytes`
Setup: `./db_bench -benchmarks=fillrandom,flush -write_buffer_size=67108864 -disable_auto_compactions=true -value_size=256 -num=1048576`
Benchmark: `./db_bench -benchmarks=readrandom -use_existing_db=true -num=1048576 -duration=10 -benchmark_read_rate_limit=4096 -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -rate_limit_user_ops=true -statistics=true -cache_size=0 -stats_level=5 -compaction_readahead_size=16777216 -use_direct_reads=true`
Results:
refill_micros | rocksdb.file.read.get.micros (P100)
-- | --
10000 | 10802
100000 | 100240
1000000 | 922061
For verifying compaction read sizes: `strace -fye pread64 ./db_bench -benchmarks=compact -use_existing_db=true -rate_limiter_bytes_per_sec=67108864 -rate_limiter_refill_period_us=$refill_micros -rate_limiter_single_burst_bytes=16777216 -rate_limit_bg_reads=true -compaction_readahead_size=16777216 -use_direct_reads=true`
Reviewed By: hx235
Differential Revision: D54165675
Pulled By: ajkr
fbshipit-source-id: c5968486316cbfb7ff8e5b7d75d3589883dd1105
Summary:
Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration, point lookup breakdown, etc. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e., RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delay or stalls, etc. It does not include time spent waiting to acquire mutexes, or waiting for IO.
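A usage sketch; the `kEnableWait` level name is an assumption based on this summary (the counter name comes from the test plan below), so check `perf_level.h` for the actual enum value:
```
#include "rocksdb/db.h"
#include "rocksdb/perf_context.h"
#include "rocksdb/perf_level.h"

// Sketch: enable only the cheaper "wait" timing stats around a write.
void TimeBlockedWrite(ROCKSDB_NAMESPACE::DB* db) {
  using namespace ROCKSDB_NAMESPACE;
  SetPerfLevel(PerfLevel::kEnableWait);  // assumed name; see lead-in
  get_perf_context()->Reset();

  WriteOptions write_opts;
  Status s = db->Put(write_opts, "key", "value");
  (void)s;

  // Time this thread spent blocked on other RocksDB write threads/events.
  uint64_t waited_nanos = get_perf_context()->write_thread_wait_nanos;
  (void)waited_nanos;
  SetPerfLevel(PerfLevel::kDisable);
}
```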
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368
Test Plan: Add a unit test for write_thread_wait_nanos
Reviewed By: ajkr
Differential Revision: D54021583
Pulled By: anand1976
fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48
Summary:
This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for external file's comparator name and user-defined timestamps related flag. With this, it now supports ingesting files to a column family that enables user-defined timestamps in Memtable only feature.
Two fields in the table properties are used for the aforementioned check: 1) the comparator name, which records what comparator was used to create this external sst file, and 2) the flag `user_defined_timestamps_persisted`. We compare these two fields with the column family's settings. The details are in the util function `ValidateUserDefinedTimestampsOptions`.
To optimize for the majority of cases where the sanity check should pass and the table properties read should not affect how the `TableReader` is constructed, we do not read the table properties block separately and use it for the sanity check before creating a `TableReader`. Instead, we continue using the current flow: first create a `TableReader`, use it for reading table properties and doing sanity checks, and reset the `TableReader` only for the case where the column family enables the UDTs in memtable only feature and the external file does not contain user-defined timestamps.
This PR also groups other table properties related sanity check in function `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356
Test Plan:
added unit test
existing unit test
Reviewed By: cbi42
Differential Revision: D54025116
Pulled By: jowlyzhang
fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554