Summary:
See comment at top of the test case and release note.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597
Reviewed By: jaykorean
Differential Revision: D56718786
Pulled By: ajkr
fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f
Summary:
When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 || s >= first_seqno_` in `rocksdb::MemTable::Add`.
We could fix this by either 1) Writing a special record when reusing a log file, or 2) Implement more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) Not reuse files created by a previous DB session. We choose option 3 as its the simplest, and flipping `recycle_log_file_num` is expected to be a rare event.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591
Test Plan: 1. Add a unit test to verify the bug and fix
Reviewed By: jowlyzhang
Differential Revision: D56655812
Pulled By: anand1976
fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87
Summary:
Mixed code from `MultiGetCommand` and `GetEntityCommand` to introduce `MultiGetEntityCommand`. Some minor fixes for the related subcommands are included.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12593
Reviewed By: jaykorean
Differential Revision: D56687147
Pulled By: ajkr
fbshipit-source-id: 2ad7b7ba8e05e990b43f2d1eb4990f746ce5f1ea
Summary:
Made `BlockBasedTableOptions::block_align` incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`. Previously it was only incompatible with `ColumnFamilyOptions::compression`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12592
Reviewed By: hx235
Differential Revision: D56650862
Pulled By: ajkr
fbshipit-source-id: f5201602c2ce436e6d8d30893caa6a161a61f141
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.
This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.
Reviewed By: palmje
Differential Revision: D56650257
fbshipit-source-id: ce628fbf12ea5846bb7103455ab859c5ed7e3598
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.
This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.
Reviewed By: palmje
Differential Revision: D56650296
fbshipit-source-id: ee3491d30e6c1fdefb3010c8ae1104b3f45e70f6
Summary:
I had a TODO to complete `CompactionOptions`'s compression API but never did it: d610e14f93/db/compaction/compaction_picker.cc (L371-L373)
Without solving that TODO, the API remains incomplete and unsafe. Now, however, I don't think it's worthwhile to complete it. I think we should instead delete the API entirely. This PR deprecates it in preparation for deletion in a future major release. The `ColumnFamilyOptions` settings for compression should be good enough for `CompactFiles()` since they are apparently good enough for every other compaction, including `CompactRange()`.
In the meantime, I also changed the default `CompressionType`. Having callers of `CompactFiles()` use Snappy compression by default does not make sense when the default could be to simply use the same compression type that is used for every other compaction. As a bonus, this change makes the default `CompressionType` consistent with the `CompressionOptions` that will be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12587
Reviewed By: hx235
Differential Revision: D56619273
Pulled By: ajkr
fbshipit-source-id: 1477de49f14b06c72d6f0045616a8ce91d97e66e
Summary:
`MultiOpsTxnsStressTest` relies on snapshot which is incompatible with `inplace_update_support`. TransactionDB uses snapshot too so we don't expect it to be used with `inplace_update_support` either.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12586
Test Plan:
```
python3 tools/db_crashtest.py whitebox --[test_multiops_txn|txn] --txn_write_policy=1
```
Reviewed By: hx235
Differential Revision: D56602769
Pulled By: cbi42
fbshipit-source-id: 8778541295f0af71e8ce912c8f872ab5cc607fc1
Summary:
Example failure (cannot reproduce):
```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBWriteTestInstance/DBWriteTest
[ RUN ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
db/db_write_test.cc:809: Failure
dbfull()->SyncWAL()
Not implemented: SyncWAL() is not supported for this implementation of WAL file
[ FAILED ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0 (49 ms)
[----------] 1 test from DBWriteTestInstance/DBWriteTest (49 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (49 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0
```
I have no idea why `SyncWAL()` would not be supported from what is presumably a `SpecialEnv` so added more debug info in case it fails again in CI. The last failure was https://github.com/facebook/rocksdb/actions/runs/8731304938/job/23956487511?fbclid=IwAR2jyXgVQtCezri3axV5MwMdI7D6VIudMk1xkiN_FL9-x2dkBv4IqIjjgB4 and it only happened once ever AFAIK.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12580
Reviewed By: hx235
Differential Revision: D56541996
Pulled By: ajkr
fbshipit-source-id: 1eab17567db783c11054fa85dd8b8880eacd3a50
Summary:
**Context/Summary:**
Our crash test recently surfaced incompatibilities between DeleteRange and inplace_update_support. Incorrect read result will be returned after insertion into memtables already contain delete range data.
This PR is to clarify this in API and re-enable `inplace_update_support` in crash test with sanitization.
Ideally there should be a way to check memtable for delete range entry upon put under inplace_update_support = true
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12577
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D56492556
Pulled By: hx235
fbshipit-source-id: 9e80e5c69dd708716619a266f41580959680c83b
Summary:
Fixing the failure in IteratorsConsistentViewExplicitSnapshot as shown in https://github.com/facebook/rocksdb/actions/runs/8825927545/job/24230854140?pr=12581
The failure was due to the timing of the `flush()` for the later Column Family in the loop. If the flush for the later CFs installs the new super version before getting the SV for the iterator, assertion succeeds, but if the order flips, SV will be obsolete and assertion can fail.
This PR simplifies the test in a way that we do only one `flush()` so that `SYNC_POINT` can guarantee the order of operations. For ImplicitSnapshot test, it now just triggers flush for the second CF after obtaining SV for the first CF. For the ExplicitSnapshot test, it now triggers atomic flush() for all CFs after obtaining SV for the first CF.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12582
Test Plan:
```
./db_iterator_test --gtest_filter="*IteratorsConsistentView*"
./multi_cf_iterator_test -- --gtest_filter="*ConsistentView*
```
Reviewed By: ajkr, jowlyzhang
Differential Revision: D56557234
Pulled By: jaykorean
fbshipit-source-id: 7aa2f6d0e12a915b6e16cd240389bcfb5b4a5b62
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/12561 and https://github.com/facebook/rocksdb/issues/12566 , `NewIterators()` API has not been providing consistent view of the db across multiple column families. This PR addresses it by utilizing `MultiCFSnapshot()` function which has been used for `MultiGet()` APIs. To be able to obtain the thread-local super version with ref, `sv_exclusive_access` parameter has been added to `MultiCFSnapshot()` so that we could call `GetReferencedSuperVersion()` or `GetAndRefSuperVersion()` depending on the param and support `Refresh()` API for MultiCfIterators
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12573
Test Plan:
**Unit Tests Added**
```
./db_iterator_test --gtest_filter="*IteratorsConsistentView*"
```
```
./multi_cf_iterator_test -- --gtest_filter="*ConsistentView*"
```
**Performance Check**
Setup
```
make -j64 release
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000
```
Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom : 6.374 micros/op 156892 ops/sec 6.374 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom : 6.265 micros/op 159627 ops/sec 6.265 seconds 1000000 operations; (0 of 1000000 found)
```
Reviewed By: jowlyzhang
Differential Revision: D56444066
Pulled By: jaykorean
fbshipit-source-id: 327ce73c072da30c221e18d4f3389f49115b8f99
Summary:
Prior to this PR the following sequence could happen:
1. `RunManualCompaction()` A schedules compaction to thread pool and waits
2. `RunManualCompaction()` B waits without scheduling anything due to conflict
3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both
4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done
5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool
6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction
7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution
8. `DisableManualCompaction()` returns as there are no more manual compactions running
Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward:
5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done
6'. `DisableManualCompaction()` returns as there are no more manual compactions running
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578
Reviewed By: cbi42
Differential Revision: D56528144
Pulled By: ajkr
fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b
Summary:
**Context/Summary:**
Our recent crash test failures show inplace_update_support can cause DB to return value inconsistent with expected state upon crash recovery if delete range was used in the previous run AND inplace_update_support=true is used in either previous or the current verification run. Since it's a bit hard to keep track of whether previous run has used delete range or not, I decided to temporarily disable inplace_update_support in crash test to keep crash test stabilized before figuring why these two features are incompatible and how to prevent such combination in crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12574
Test Plan: Rehearsed many stress run with `inplace_update_support=0` and they passed
Reviewed By: jaykorean
Differential Revision: D56454951
Pulled By: hx235
fbshipit-source-id: 57f2ae6308bad7ed4077ddb9e658380742afa293
Summary:
Previously `insert_hints_` was used for both point key table (`table_`) and range deletion table (`range_del_table_`). Hints include pointers to table data, so mixing hints for different tables together without tracking which hint corresponds to which table was problematic. We can just make the hints dedicated to the point key table only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12558
Reviewed By: hx235
Differential Revision: D56279019
Pulled By: ajkr
fbshipit-source-id: 00fe5ce72f9f11a1c1cba5f1977b908b2d518f29
Summary:
**Context/Summary:**
When `BlockBasedTableOptions::block_align=true`, we pad bytes to align blocks d41e568b1c/table/block_based/block_based_table_builder.cc (L1415-L1421).
Those bytes are not included in generating the file checksum upon file creation. But `VerifyFileChecksums()` includes those bytes in generating the file check to compare against the checksum generating upon file creation. Therefore a file checksum mismatch is returned in `VerifyFileChecksums()`.
We decided to include those padded bytes in generating the checksum upon file creation.
Bonus: also fix surrounding code to use actual padded bytes for verification - see https://github.com/facebook/rocksdb/pull/12542#discussion_r1571429163
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12542
Test Plan:
- New UT
- Benchmark
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG 300 runs] : 422857 (± 3942) ops/sec; 46.8 (± 0.4) MB/sec
Post-PR:
fillseq [AVG 300 runs] : 424707 (± 3799) ops/sec; 47.0 (± 0.4) MB/sec
Reviewed By: ajkr
Differential Revision: D56168447
Pulled By: hx235
fbshipit-source-id: 96209ef950d42943d336f11968ae3fcf9872fc2c
Summary:
A basic implementation of RocksDB follower mode, which opens a remote database (referred to as leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but is different in some key ways -
1. It has its own directory with links to the leader's database
2. Periodically refreshes itself
3. (Future) Snapshot support
4. (Future) Garbage collection of obsolete links
5. (Long term) Memtable replication
There are two main classes implementing this functionality - `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Similar to `DBImplSecondary`, it implements recovery and catch up through MANIFEST tailing using the `ReactiveVersionSet`, but does not consider logs. In a future PR, we will implement memtable replication, which will eliminate the need to catch up using logs. In addition, the recovery and catch-up tries to avoid directory listing as repeated metadata operations are expensive.
The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and creates the illusion of the follower directory as a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created which prevent the underlying files from getting deallocated even if the leader deletes the files.
TODOs: Deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12540
Reviewed By: jowlyzhang
Differential Revision: D56315718
Pulled By: anand1976
fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b
Summary:
As title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12563
Test Plan:
```
make -j64 release
```
**Before the fix**
```
$DEBUG_LEVEL is 0, $LIB_MODE is static
Makefile:306: Warning: /mnt/gvfs/third-party2/llvm-fb/1f6edd1ff15c99c861afc8f3cd69054cd974dd64/15/platform010/72a2ff8/../../src/llvm/clang/tools/scan-build/bin/scan-build does not exist
...
```
**After the fix**
```
$DEBUG_LEVEL is 0, $LIB_MODE is static
...
```
Reviewed By: ajkr
Differential Revision: D56318047
Pulled By: jaykorean
fbshipit-source-id: 4a11ad8353fc94aa96676e57c67063d051de5fbc
Summary:
While implementing MultiCFIterators (CoalescingIterator and AttributeGroupIterator), we found that the existing `NewIterators()` API does not ensure a uniform view of the DB across all column families. The `NewIterators()` function is utilized to generate child iterators for the MultiCfIterators, and it's expected that all child iterators maintain a consistent view of the DB.
For example, within the loop where the super version for each CF is being obtained, if a CF undergoes compaction after the super versions for previous CFs have already been retrieved, we lose the consistency in the view of the CFs for the iterators due to the API not under a db mutex.
This preliminary refactoring of `MultiCFSnapshot` aims to address this issue in the `NewIterators()` API in the later PR. Currently, `MultiCFSnapshot` is used to achieve a consistent view across CFs in `MultiGet`. The `MultiGetColumnFamilyData` contains MultiGet-specific information that can be decoupled from the cfd and sv, allowing `MultiCFSnapshot` to be used in other places.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12561
Test Plan:
**Existing Unit Tests for `MultiCFSnapshot()`**
```
./db_basic_test -- --gtest_filter="*MultiGet*"
```
**Performance Test**
Setup
```
make -j64 release
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none
```
Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000
```
Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom : 4.760 micros/op 210072 ops/sec 4.760 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom : 4.593 micros/op 217727 ops/sec 4.593 seconds 1000000 operations; (0 of 1000000 found)
```
Reviewed By: anand1976
Differential Revision: D56309422
Pulled By: jaykorean
fbshipit-source-id: 7a9164d12c810b6c2d2db062827fcc4a36cbc77b
Summary:
This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427 . On file systems that support storage level data checksum and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but its a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518
Reviewed By: ajkr
Differential Revision: D55932155
Pulled By: anand1976
fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12564
Similarly to how `db`, `column_family`, and `results` are handled, bail out early from `WriteBatchWithIndex::MultiGetEntityFromBatchAndDB` if `keys` is `nullptr`. Note that these checks are best effort in the sense that with the current method signature, the callee has no way of reporting an error if `statuses` is `nullptr` or catching other types of invalid pointers (e.g. when `keys` and/or `results` is non-`nullptr` but do not point to a contiguous range of `num_keys` objects). We can improve this (and many similar RocksDB APIs) using `std::span` in a major release once we move to C++20.
Reviewed By: jaykorean
Differential Revision: D56318179
fbshipit-source-id: bc7a258eda82b5f6c839f212ab824130e773a4f0
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12562
The patch makes a small usability improvement by consistently resetting any user-facing wide-column structures (`DBIter::columns()`, `BaseDeltaIterator::columns()`, and any `PinnableWideColumns` objects) upon encountering any deserialization failures.
Reviewed By: jaykorean
Differential Revision: D56312764
fbshipit-source-id: 44efed0d1720cc06bf6facf928f73ce39a1bd2ca
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12557
Unlike for other sequence containers, the C++ standard allows moving an `std::string` to invalidate pointers/iterators/references. In practice, this happens with short strings which are stored "inline" in the `std::string` object (small string optimization). Since `PinnableSlice` uses `std::string` as its internal buffer, and `PinnableWideColumns` in turn is implemented in terms of `PinnableSlice`, this means that the default compiler-generated move operations can invalidate the column index stored in `PinnableWideColumns::columns_`. The PR fixes this by providing custom move constructor/move assignment implementations for `PinnableWideColumns` that recreate the `columns_` index upon move.
Reviewed By: jaykorean
Differential Revision: D56275054
fbshipit-source-id: e8648c003dbcf1c39ec122ad229780c28138e730
Summary:
Adding an option to wait for purge to complete in `WaitForCompact` API.
Internally, RocksDB has a way to wait for purge to complete (e.g. TEST_WaitForPurge() in db_impl_debug.cc), but there's no public API available for gracefully wait for purge to complete.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12520
Test Plan:
Unit Test Added - `WaitForCompactWithWaitForPurgeOptionTest`
```
./deletefile_test -- --gtest_filter="*WaitForCompactWithWaitForPurgeOptionTest*"
```
Existing Tests
```
./db_compaction_test -- --gtest_filter="*WaitForCompactWithOption*"
```
Reviewed By: ajkr
Differential Revision: D55888283
Pulled By: jaykorean
fbshipit-source-id: cfc6d6e8657deaefab8961890b36e390095c9f65
Summary:
In https://github.com/facebook/rocksdb/issues/12365 we made `max_successive_merges` non-strict by default. Before https://github.com/facebook/rocksdb/issues/12365, `CountSuccessiveMergeEntries()`'s scan was implicitly limited to `max_successive_merges` entries for a given key, because after that the merge operator would be invoked and the merge chain would be collapsed. After https://github.com/facebook/rocksdb/issues/12365, the merge chain will not be collapsed no matter how long it is when the chain's operands are not all in memory. Since `CountSuccessiveMergeEntries()` scanned the whole merge chain, https://github.com/facebook/rocksdb/issues/12365 had a side effect that it would scan more memtable entries. This PR introduces a limit so it won't scan more entries than it could before.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12546
Reviewed By: jaykorean
Differential Revision: D56193693
Pulled By: ajkr
fbshipit-source-id: b070ba0703ef733e0ff230f89cd5cca5233b84da
Summary:
**Context/Summary:**
In-place memtable updates (inplace_update_support) is not compatible with concurrent writes (allow_concurrent_memtable_write). So we disallow this combination in crash test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12550
Test Plan: CI
Reviewed By: jaykorean
Differential Revision: D56204269
Pulled By: hx235
fbshipit-source-id: 06608f2591db5e37470a1da6afcdfd2701781c2d
Summary:
**Context/Summary:**
This PR includes some public DB APIs not tested in crash/stress yet can be added in a straightforward way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12541
Test Plan:
- Locally run crash test heavily stressing on these new APIs
- CI
Reviewed By: jowlyzhang
Differential Revision: D56164892
Pulled By: hx235
fbshipit-source-id: 8bb568c3e65aec39d642987033f1d76c52f69bd8
Summary:
Thanks to how we are using `DBIter` as child iterators in MultiCfIterators (both `CoalescingIterator` and `AttributeGroupIterator`), we got the lower/upper bound feature for free. This PR simply adds unit test coverage to ensure that the lower/upper bounds are working as expected in the MultiCfIterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12548
Test Plan:
UnitTest Added
```
./multi_cf_iterator_test
```
Reviewed By: ltamasi
Differential Revision: D56197966
Pulled By: jaykorean
fbshipit-source-id: fa51cc70705dbc5efd836ac006a7c6a49d05707a
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12539
As a follow-up to https://github.com/facebook/rocksdb/pull/12533, this PR extends `WriteBatchWithIndex` with a `MultiGetEntityFromBatchAndDB` API that enables users to perform batched wide-column point lookups with read-your-own-writes consistency. This API transparently combines data from the indexed write batch and the underlying database as needed and presents the results in the form of a wide-column entity.
Reviewed By: jaykorean
Differential Revision: D56153145
fbshipit-source-id: 537967051b7521bb41b04070ac1a78a1d8873c08
Summary:
Continuing from the previous MultiCfIterator Implementations - (https://github.com/facebook/rocksdb/issues/12422, https://github.com/facebook/rocksdb/issues/12480#12465), this PR completes the `AttributeGroupIterator` by implementing `AttributeGroupIteratorImpl::AddToAttributeGroups()`. While implementing the `AttributeGroupIterator`, we had to make some changes in `MultiCfIteratorImpl` and found an opportunity to improve `Coalesce()` in `CoalescingIterator`.
Lifting `UNDER CONSTRUCTION - DO NOT USE` comment by replacing it with `EXPERIMENTAL`
Here are some implementation details:
- `IteratorAttributeGroups` is introduced to avoid having to copy all `WideColumn` objects during iteration.
- `PopulateIterator()` no longer advances non-top iterators that have the same key as the top iterator in the heap.
- `AdvanceIterator()` needs to advance the non-top iterators when they have the same key as the top iterator in the heap.
- Instead of populating one by one, `PopulateIterator()` now collects all items with the same key and calls `populate_func(items)` at once.
- This allowed optimization in `Coalesce()` such that we no longer do K-1 rounds of 2-way merge, but do one K-way merge instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12534
Test Plan:
Uncommented the assertions in `verifyAttributeGroupIterator()`
```
./multi_cf_iterator_test
```
Reviewed By: ltamasi
Differential Revision: D56089019
Pulled By: jaykorean
fbshipit-source-id: 6b0b4247e221f69b40b147d41492008cc9b15054
Summary:
**Context/Summary:**
`inplace_update_support=true` is not tested in crash/stress test. Since it's not compatible with snapshots like compaction_filter, we need to sanitize its value in presence of snapshots-related options. A minor refactoring is added to centralize such sanitization in db_crashtest.py - see `check_multiget_consistency` and `check_multiget_entity_consistency`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12535
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D56102978
Pulled By: hx235
fbshipit-source-id: 2e2ab6685a65123b14a321b99f45f60bc6509c6b
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12533
The PR extends `WriteBatchWithIndex` with a new wide-column point lookup API `GetEntityFromBatchAndDB`. Similarly to `GetFromBatchAndDB`, the new API can transparently combine data from the write batch with data from the underlying database as needed. Like `DB::GetEntity`, it returns any result in the form of a wide-column entity (i.e. plain key-values are wrapped into an entity with a single anonymous column).
Reviewed By: jaykorean
Differential Revision: D56069132
fbshipit-source-id: 4f19cdeea4ce136497ce79fc9d28c925de59e220
Summary:
Our `FileSystem` for simulating unsynced data loss should not sync during `Close()` because it masks bugs where we forgot to sync as long as we closed the file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12528
Test Plan:
Peeled back https://github.com/facebook/rocksdb/issues/10560 fix and verified it is caught much faster now (few seconds vs. ???) with command like
```
$ TEST_TMPDIR=./ python3 tools/db_crashtest.py blackbox --disable_wal=0 --max_key=1000 --write_buffer_size=131072 --max_bytes_for_level_base=524288 --target_file_size_base=131072 --interval=3 --sync_fault_injection=1 --enable_blob_files=0 --manual_wal_flush_one_in=10 --sync_wal_one_in=0 --get_live_files_one_in=0 --get_sorted_wal_files_one_in=0 --backup_one_in=0 --checkpoint_one_in=0 --write_fault_one_in=0 --read_fault_one_in=0 --open_write_fault_one_in=0 --compact_range_one_in=0 --compact_files_one_in=0 --open_read_fault_one_in=0 --get_property_one_in=0 --writepercent=100 -readpercent=0 -prefixpercent=0 -delpercent=0 -delrangepercent=0 -iterpercent=0
```
Reviewed By: anand1976
Differential Revision: D56033250
Pulled By: ajkr
fbshipit-source-id: 6bbf480d79a06c46f08f6214010937f6654af5ca
Summary:
This PR fixes error for CF smallest and largest keys computation in ImportColumnFamilyJob::Prepare.
Before this fix smallest and largest keys for CF were computed incorrectly, and ImportColumnFamilyJob::Prepare function might not have detect overlaps between CFs. I added test to detect this error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12526
Reviewed By: hx235
Differential Revision: D56046044
Pulled By: ajkr
fbshipit-source-id: d562fbfc9cc2d9624372d24d34a649198a960691
Summary:
Context/Summary:
We need a `nvm_sec_cache` when `kAdmPolicyThreeQueue` is used otherwise a nullptr cache will be accessed causing us segfault in https://github.com/facebook/rocksdb/pull/12521
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12524
Test Plan: - Re-enabled `kAdmPolicyThreeQueue` and rehearsed stress test that failed before this fix and pass after
Reviewed By: jowlyzhang
Differential Revision: D55997093
Pulled By: hx235
fbshipit-source-id: e1c6f1015091b4cff0ce6a3fff981d5dece52a62
Summary:
Previously when building with fbcode and having a system install of liburing, it would link liburing from fbcode statically as well as the system library dynamically. That led to the following error:
```
./db_stress: error while loading shared libraries: liburing.so.1: cannot open shared object file: No such file or directory
```
The fix is to skip the feature test for system liburing when `FBCODE_BUILD=true`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12525
Test Plan:
- `make clean && make ROCKSDB_NO_FBCODE=1 V=1 -j56 db_stress && ./db_stress`
- `make clean && make V=1 -j56 db_stress && ./db_stress`
Reviewed By: anand1976
Differential Revision: D55997335
Pulled By: ajkr
fbshipit-source-id: 17d8561100f41c6c9ae382a80c6cddc14f050bdc
Summary:
There are a couple of reasons to modify the current implementation of the MultiCfIterator, which implements the generic `Iterator` interface.
- The default behavior of `value()`/`columns()` returning data from different Column Families for different keys can be prone to errors, even though there might be valid use cases where users do not care about the origin of the value/columns.
- The `attribute_groups()` API, which is not yet implemented, will not be useful for a single-CF iterator.
In this PR, we are implementing the following changes:
- `IteratorBase` introduced, which includes all basic iterator functions except `value()` and `columns()`.
- `Iterator`, which now inherits from `IteratorBase`, includes `value()` and `columns()`.
- New public interface `AttributeGroupIterator` inherits from `IteratorBase` and additionally includes `attribute_groups()` (to be implemented).
- Renamed former `MultiCfIterator` to `CoalescingIterator` which inherits from `Iterator`
- Existing MultiCfIteratorTest has been split into two - `CoalescingIteratorTest` and `AttributeGroupIteratorTest`.
- Moved AttributeGroup related code from `wide_columns.h` to a new file, `attribute_groups.h`.
Some Implementation Details
- `MultiCfIteratorImpl` takes two functions - `populate_func` and `reset_func` and use them to populate `value_` and `columns_` in CoalescingIterator and `attribute_groups_` in AttributeGroupIterator. In CoalescingIterator, populate_func is `Coalesce()`, in AttributeGroupIterator populate_func is `AddToAttributeGroups()`. `reset_func` clears populated value_, columns_ and attribute_groups_ accordingly.
- `Coalesce()` merge sorts columns from multiple CFs when a key exists in more than on CFs. column that appears in later CF overwrites the prior ones.
For example, if CF1 has `"key_1" ==> {"col_1": "foo", "col_2", "baz"}` and CF2 has `"key_1" ==> {"col_2": "quux", "col_3", "bla"}`, and when the iterator is at `key_1`, `columns()` will return `{"col_1": "foo", "col_2", "quux", "col_3", "bla"}`
In this example, `value()` will be empty, because none of them have values for `kDefaultColumnName`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12480
Test Plan:
## Unit Test
```
./multi_cf_iterator_test
```
## Performance Test
To make sure this change does not impact existing `Iterator` performance
**Build**
```
$> make -j64 release
```
**Setup**
```
$> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none
```
**Run**
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000
```
**Before the change**
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator : 0.519 micros/op 1927904 ops/sec 0.519 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom : 5.302 micros/op 188589 ops/sec 5.303 seconds 1000000 operations; (0 of 1000000 found)
```
**After the change**
```
DB path: [/dev/shm/db_bench/dbbench]
newiterator : 0.497 micros/op 2011012 ops/sec 0.497 seconds 1000000 operations;
DB path: [/dev/shm/db_bench/dbbench]
seekrandom : 5.252 micros/op 190405 ops/sec 5.252 seconds 1000000 operations; (0 of 1000000 found)
```
Reviewed By: ltamasi
Differential Revision: D55353909
Pulled By: jaykorean
fbshipit-source-id: 8d7786ffee09e022261ce34aa60e8633685e1946