mirror of
https://github.com/facebook/rocksdb.git
synced 2024-11-27 02:44:18 +00:00
1495 commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Yu Zhang | 552bc01669 |
Surface timestamp from db to the transaction iterator (#11847)
Summary: Provide an override implementation of `Iterator::timestamp` API for `BaseDeltaIterator` so that timestamp read from DB can be surfaced by an iterator created from inside of a transaction. The behavior of the API follows this rule: 1) If the entry is read from within the transaction, an empty `Slice` is returned as the timestamp, regardless of whether `Transaction::SetCommitTimestamp` is called. 2) If the entry is read from the DB, the corresponding `DBIter::timestamp()` API's result is returned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11847 Test Plan: make all check add some unit test Reviewed By: ltamasi Differential Revision: D49377359 Pulled By: jowlyzhang fbshipit-source-id: 1511ead262ce3515ee6c6e0f829f1b69a10fe994 |
||
anand76 | 269478ee46 |
Support compressed and local flash secondary cache stacking (#11812)
Summary: This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache. The basic design is as follows - 1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache. 2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block. 3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache. 4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded. We can easily implement additional admission policies if desired. Todo (In a subsequent PR): 1. Add to db_bench and run benchmarks 2. Add to db_stress Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812 Reviewed By: pdillinger Differential Revision: D49461842 Pulled By: anand1976 fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b |
||
Peter (Stig) Edwards | bf488c3052 |
Use *next_sequence -1 here (#11861)
Summary:
To fix off-by-one error: Transaction could not check for conflicts for operation at SequenceNumber 500000 as the MemTable only contains changes newer than SequenceNumber 500001.
Fixes https://github.com/facebook/rocksdb/issues/11822
I think introduced in
|
||
Levi Tamasi | f42e70bf56 |
Integrate FullMergeV3 into the query and compaction paths (#11858)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11858 The patch builds on https://github.com/facebook/rocksdb/pull/11807 and integrates the `FullMergeV3` API into the read and compaction code paths by updating and extending the logic in `MergeHelper`. In particular, when it comes to merge inputs, the existing `TimedFullMergeWithEntity` is folded into `TimedFullMerge`, since wide-column base values are now handled the same way as plain base values (or no base values for that matter), e.g. they are passed directly to the `MergeOperator`. On the other hand, there is some new differentiation on the output side. Namely, there are now two sets of `TimedFullMerge` variants: one set for contexts where the complete merge result and its value type are needed (used by iterators and compactions), and another set where the merge result is needed in a form determined by the client (used by the point lookup APIs, where e.g. for `Get` we have to extract the value of the default column of any wide-column results). Implementation-wise, the two sets of overloads use different visitors to process the `std::variant` produced by `FullMergeV3`. This has the benefit of eliminating some repeated code e.g. in the point lookup paths, since `TimedFullMerge` now populates the application's result object (`PinnableSlice`/`string` or `PinnableWideColumns`) directly. Moreover, within each set of variants, there is a separate overload for the no base value/plain base value/wide-column base value cases, which eliminates some repeated branching w/r/t to the type of the base value if any. Reviewed By: jaykorean Differential Revision: D49352562 fbshipit-source-id: c2fb9853dba3fbbc6918665bde4195c4ea150a0c |
||
Changyu Bi | c90807d103 |
Inject retryable write IOError when writing to SST files in stress test (#11829)
Summary: * db_crashtest.py now may set `write_fault_one_in` to 500 for blackbox and whitebox simple test. * Error injection only applies to writing to SST files. Flush error will cause DB to pause background operations and auto-resume. Compaction error will just re-schedule later. * File ingestion and back up tests are updated to check if the result status is due to an injected error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11829 Test Plan: a full round of whitebox simple and blackbox simple crash test * `python3 ./tools/db_crashtest.py whitebox/blackbox --simple --write_fault_one_in=500` Reviewed By: ajkr Differential Revision: D49256962 Pulled By: cbi42 fbshipit-source-id: 68e0c9648d8e03bad39c7672b25d5500fc286d97 |
||
Changyu Bi | 051cad3867 |
Fix CI failure due to transaction_test (#11843)
Summary: Test ` build-linux-static_lib-alt_namespace-status_checked` has been failing in main branch. ``` utilities/transactions/transaction_test.cc:6777:3: error: 'rocksdb' has not been declared 6777 | rocksdb::GetMergeOperandsOptions mergeOperandOptions; | ^~~~~~~ ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11843 Test Plan: `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 transaction_test` Reviewed By: sarangbh Differential Revision: D49330210 Pulled By: cbi42 fbshipit-source-id: 85c99236eeca6a777af0101684fbab5a33cca1c9 |
||
Sarang Masti | c4a19ed399 |
Add Transaction::CollapseKey to collapse merge op chains ondemand (#11815)
Summary: Application using rocksdb today dont have much control over the cost of reads when merge-ops are enabled, expect for waiting for compactions to kick in or using max_successive_merges hint, which only applies to memtable. This change adds Transaction::CollapseKey api giving applications the ability to request merge chain collapse on-demand, when they detect high read costs due to merges. Currently, this only supported on PessimisticTransactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11815 Test Plan: Add a unit test Reviewed By: ajkr Differential Revision: D49309387 Pulled By: sarangbh fbshipit-source-id: a1eb5cc9e3bd4b3206a36150aacead770318e3e1 |
||
Yu Zhang | 39a4ff2cab |
Track full_history_ts_low per SuperVersion (#11784)
Summary: As discussed in https://github.com/facebook/rocksdb/issues/11730 , this PR tracks the effective `full_history_ts_low` per SuperVersion and update existing sanity checks for `ReadOptions.timestamp >= full_history_ts_low` to use this per SuperVersion `full_history_ts_low` instead. This also means the check is moved to happen after acquiring SuperVersion. There are two motivations for this: 1) Each time `full_history_ts_low` really come into effect to collapse history, a new SuperVersion is always installed, because it would involve either a Flush or Compaction, both of which change the LSM tree shape. We can take advantage of this to ensure that as long as this sanity check is passed, even if `full_history_ts_low` can be concurrently increased and collapse some history above the requested `ReadOptions.timestamp`, a read request won’t have visibility to that part of history through this SuperVersion that it already acquired. 2) the existing sanity check uses `ColumnFamilyData::GetFullHistoryTsLow` without locking the db mutex, which is the mutex all `IncreaseFullHistoryTsLow` operation is using when mutating this field. So there is a race condition. This also solve the race condition on the read path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11784 Test Plan: `make all check` // Checks success scenario really provide the read consistency attribute as mentioned above. `./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckPassReadIsConsistent*` // Checks failure scenario cleans up SuperVersion properly. `./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*` `./db_secondary_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*` `./db_readonly_with_timestamp_test --gtest_filter=*FullHistoryTsLowSanitchCheckFail*` Reviewed By: ltamasi Differential Revision: D48894795 Pulled By: jowlyzhang fbshipit-source-id: 1f801fe8e1bc8e63ca76c03cbdbd0974e5ff5bf6 |
||
Andrew Kryczka | 392d6957cd |
Added compaction read errors to db_stress (#11789)
Summary: - Fixed misspellings of "inject" - Made user read errors retryable when `FLAGS_inject_error_severity == 1` - Added compaction read errors when `FLAGS_read_fault_one_in > 0`. These are always retryable so that the DB will keep accepting writes - Reenabled setting `compaction_readahead_size` in crash test. The reason for disabling it was to "keep the test clean", which is not a good enough reason to skip testing it Pull Request resolved: https://github.com/facebook/rocksdb/pull/11789 Test Plan: With https://github.com/facebook/rocksdb/issues/11782 reverted, reproduced the bug: - Build: `make -j56 db_stress` - Command: `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --interval=10 --max_key=1000000` - Output: ``` stderr has error message: ***put or merge error: Corruption: Compaction number of input keys does not match number of keys processed.*** ``` Reviewed By: cbi42 Differential Revision: D48939994 Pulled By: ajkr fbshipit-source-id: a1efb799efecdfd5d9cfd185e4a6321db8fccfbb |
||
jsteemann | 0b8b17a9d1 |
avoid find() -> insert() sequence (#11743)
Summary: when a key is recorded for locking in a pessimistic transaction, the key is first looked up in a map, and then inserted into the map if it was not already contained. this can be simplified to an unconditional insert. in the ideal case that all keys are unique, this saves all the find() operations. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11743 Reviewed By: anand1976 Differential Revision: D48656798 Pulled By: ajkr fbshipit-source-id: d0150de2db757e0c05e1797cfc24380790c71276 |
||
Changyu Bi | c2aad555c3 |
Add CompressionOptions::checksum for enabling ZSTD checksum (#11666)
Summary:
Optionally enable zstd checksum flag (
|
||
anand76 | a1743e85be |
Implement a allow cache hits admission policy for the compressed secondary cache (#11713)
Summary: This PR implements a new admission policy for the compressed secondary cache, which includes the functionality of the existing policy, and also admits items evicted from the primary block cache with the hit bit set. Effectively, the new policy works as follows - 1. When an item is demoted from the primary cache without a hit, a placeholder is inserted in the compressed cache. A second demotion will insert the full entry. 2. When an item is promoted from the compressed cache to the primary cache for the first time, a placeholder is inserted in the primary. The second promotion inserts the full entry, while erasing it form the compressed cache. 3. If an item is demoted from the primary cache with the hit bit set, it is immediately inserted in the compressed secondary cache. The ```TieredVolatileCacheOptions``` has been updated with a new option, ```adm_policy```, which allows the policy to be selected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11713 Reviewed By: pdillinger Differential Revision: D48444512 Pulled By: anand1976 fbshipit-source-id: b4cbf8c169a88097dff08e36e8bc4b3088de1492 |
||
Changyu Bi | 76ed9a3990 |
Add missing status check when compiling with ASSERT_STATUS_CHECKED=1 (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by
|
||
Yu Zhang | c751583c03 |
Set default cf ts sz for a reused transaction (#11685)
Summary: Set up the default column family timestamp size for a reused write committed transaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11685 Test Plan: Added unit test. Reviewed By: ltamasi Differential Revision: D48195129 Pulled By: jowlyzhang fbshipit-source-id: 54faa900c123fc6daa412c01490e36c10a24a678 |
||
Hui Xiao | 9a034801ce |
Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
- For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
- New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.
- Misc
- More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
- Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444
Test Plan:
- CI fake db crash/stress test
- Microbenchmarking
**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version:
|
||
Jay Huh | 9a2a6db2a9 |
Use C++17 [[fallthrough]] in transaction_test.cc (#11663)
Summary: (Copied from https://www.internalfb.com/diff/D46606060) This diff makes its files safe for use with -Wimplicit-fallthrough. Now that we're using C+20 there's no reason not to use this C++17 feature to make our code safer. It's currently possible to write code like this: ``` switch(x){ case 1: foo1(); case 2: foo2(); break; case 3: foo3(); } ``` But that's scary because we don't know whether the fallthrough from case 1 was intentional or not. The -Wimplicit-fallthrough flag will make this an error. The solution is to either fix the bug by inserting break or indicating intention by using [[fallthrough]]; (from C++17). ``` switch(x){ case 1: foo1(); [[fallthrough]]; // Solution if we intended to fallthrough break; // Solution if we did not intend to fallthrough case 2: foo2(); break; case 3: foo3(); } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11663 Test Plan: Existing tests Reviewed By: jowlyzhang Differential Revision: D47961248 Pulled By: jaykorean fbshipit-source-id: 0d374c721bf1b328c14949dc5c17693da7311d03 |
||
Peter Dillinger | bb8fcc0044 |
db_stress: Reinstate Transaction::Rollback() calls before destruction (#11656)
Summary: https://github.com/facebook/rocksdb/issues/11653 broke some crash tests. Apparently these Rollbacks are needed for pessimistic transaction cases. (I'm still not sure if the API makes any sense with regard to safe usage. It's certainly not documented. Will consider in follow-up PRs.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11656 Test Plan: manual crash test runs, crash_test_with_multiops_wc_txn and crash_test_with_multiops_wp_txn Reviewed By: cbi42 Differential Revision: D47906280 Pulled By: pdillinger fbshipit-source-id: d058a01b6dbb47a4f08d199e335364168304f81b |
||
Peter Dillinger | b3c54186ab |
Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653)
Summary: In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers. Also, some of the APIs were not clear about Transaction semantics, so I have clarified: * (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction). * Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test. Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress. * Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function. * Use unique_ptr (nicely complements removing unnecessary Rollbacks) * Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".) Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653 Test Plan: this is the test :) Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances. Reviewed By: cbi42 Differential Revision: D47882995 Pulled By: pdillinger fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d |
||
Peter Dillinger | 05a1d52e77 |
Use FaultInjectionTestFS in transaction_test, clarify Close() APIs (#11499)
Summary: ... instead of race-condition-laden FaultInjectionTestEnv. See https://app.circleci.com/pipelines/github/facebook/rocksdb/27912/workflows/4c63e5a8-597e-439d-8c7e-82308056af02/jobs/609648 and similar PR https://github.com/facebook/rocksdb/issues/11271 Had to fix the semantics of FaultInjectionTestFS Close() operations to allow a non-OK Close() to fulfill the obligation to close before destruction. To me, this is the obvious choice of Close contract, because what is the caller supposed to do if Close() fails and they still have an obligation to successfully close before object destruction? Call Close() in an infinite loop? Leak the object? I have added API comments to the Env and Filesystem Close() functions to clarify the contracts. Note that `DB::Close()` has one exception to this kind of Close contract, but it is clearly described in API comments and it is really only for catching programming mistakes, not for dealing with exogenous errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11499 Test Plan: watch CI Reviewed By: jowlyzhang Differential Revision: D46375708 Pulled By: pdillinger fbshipit-source-id: 03d4d8251e5df50a82ecd139f7e83f613015fe40 |
||
Jay Huh | 87bc929db3 |
Flush option in WaitForCompact() (#11483)
Summary: Context: As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `flush` option in `WaitForCompactOptions` to flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 1. `bool flush = false` added to `WaitForCompactOptions` 2. `DBImpl::FlushAllColumnFamilies()` is introduced and `DBImpl::FlushForGetLiveFiles()` is refactored to call it. 3. `DBImpl::FlushAllColumnFamilies()` gets called before waiting in `WaitForCompact()` if `flush` option is `true` 4. Some previous WaitForCompact tests were parameterized to include both cases for `abort_on_pause_` being true/false as well as `flush_` being true/false Pull Request resolved: https://github.com/facebook/rocksdb/pull/11483 Test Plan: - `DBCompactionTest::WaitForCompactWithOptionToFlush` added - Changed existing DBCompactionTest::WaitForCompact tests to `DBCompactionWaitForCompactTest` to include params Reviewed By: pdillinger Differential Revision: D46289770 Pulled By: jaykorean fbshipit-source-id: 70d3f461d96a6e06390be60170dd7c4d0d38f8b0 |
||
Jay Huh | 81aeb15988 |
Add WaitForCompact with WaitForCompactOptions to public API (#11436)
Summary: Context: This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions` 1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing. 1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct. 2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h` 3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436 Test Plan: Following tests added - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish` - `DBCompactionTest::WaitForCompactAbortOnPauseAborted` - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted` - `DBCompactionTest::WaitForCompactShutdownWhileWaiting` - `TransactionTest::WaitForCompactAbortOnPause` NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place. Reviewed By: pdillinger Differential Revision: D45799659 Pulled By: jaykorean fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b |
||
Peter Dillinger | 17bc27741f |
Improve memory efficiency of many OptimisticTransactionDBs (#11439)
Summary: Currently it's easy to use a ton of memory with many small OptimisticTransactionDB instances, because each one by default allocates a million mutexes (40 bytes each on my compiler) for validating transactions. It even puts a lot of pressure on the allocator by allocating each one individually! In this change: * Create a new object and option that enables sharing these buckets of mutexes between instances. This is generally good for load balancing potential contention as various DBs become hotter or colder with txn writes. About the only cases where this sharing wouldn't make sense (e.g. each DB usually written by one thread) are cases that would be better off with OccValidationPolicy::kValidateSerial which doesn't use the buckets anyway. * Allocate the mutexes in a contiguous array, for efficiency * Add an option to ensure the mutexes are cache-aligned. In several other places we use cache-aligned mutexes but OptimisticTransactionDB historically does not. It should be a space-time trade-off the user can choose. * Provide some visibility into the memory used by the mutex buckets with an ApproximateMemoryUsage() function (also used in unit testing) * Share code with other users of "striped" mutexes, appropriate refactoring for customization & efficiency (e.g. using FastRange instead of modulus) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11439 Test Plan: unit tests added. Ran sized-up versions of stress test in unit test, including a before-and-after performance test showing no consistent difference. (NOTE: OptimisticTransactionDB not currently covered by db_stress!) Reviewed By: ltamasi Differential Revision: D45796393 Pulled By: pdillinger fbshipit-source-id: ae2b3a26ad91ceeec15debcdc63ff48df6736a54 |
||
Yu Zhang | ffb5f1f445 |
Refactor WriteUnpreparedStressTest to be a unit test (#11424)
Summary: This patch remove the "stress" aspect from the WriteUnpreparedStressTest and leave it to be a unit test for some correctness testing w.r.t. snapshot functionality. I added some read-your-write verification to the transaction test in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11424 Test Plan: `./write_unprepared_transaction_test` `./db_crashtest.py whitebox --txn` `./db_crashtest.py blackbox --txn` Reviewed By: hx235 Differential Revision: D45551521 Pulled By: jowlyzhang fbshipit-source-id: 20c3d510eb4255b08ddd7b6c85bdb4945436f6e8 |
||
Jay Huh | 586d78b31e |
Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary: Context: In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue. In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code. This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`. Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`. Furthermore, all tests that previously called `waitForCompact(true)` have been fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443 Test Plan: Existing tests that involve a shutdown in progress: - DBCompactionTest::CompactRangeShutdownWhileDelayed - DBTestWithParam::PreShutdownMultipleCompaction - DBTestWithParam::PreShutdownCompactionMiddle Reviewed By: pdillinger Differential Revision: D45923426 Pulled By: jaykorean fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae |
||
Peter Dillinger | 206fdea3d9 |
Change internal headers with duplicate names (#11408)
Summary: In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement. No public API changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408 Test Plan: existing tests Reviewed By: ltamasi Differential Revision: D45456696 Pulled By: pdillinger fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373 |
||
Levi Tamasi | d3ed796855 |
Deflake some old BlobDB test cases (#11417)
Summary: The old `StackableDB` based BlobDB implementation relies on a DB listener to track the total size of the SST files in the database and to trigger FIFO eviction. Some test cases in `BlobDBTest` assume that the listener is notified by the time `DB::Flush` returns, which is not guaranteed (side note: `TEST_WaitForFlushMemTable` would not guarantee this either). The patch fixes these tests by using `SyncPoint`s to make sure the listener is actually called before verifying the FIFO behavior. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11417 Test Plan: ``` make -j56 COERCE_CONTEXT_SWITCH=1 blob_db_test ./blob_db_test --gtest_filter=BlobDBTest.FIFOEviction_TriggerOnSSTSizeChange ./blob_db_test --gtest_filter=BlobDBTest.FilterForFIFOEviction ./blob_db_test --gtest_filter=BlobDBTest.FIFOEviction_NoEnoughBlobFilesToEvict ``` Reviewed By: ajkr Differential Revision: D45407135 Pulled By: ltamasi fbshipit-source-id: fcd63d76937d2c975f569a6635ce8730772a3d75 |
||
Hui Xiao | 151242ce46 |
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132 |
||
Yu Zhang | 647cd73674 |
Initial add UDT in memtable only option (#11362)
Summary: This option is immutable through the life time of the DB open. For now, updating its value between different DB open sessions is also a non compatible change. When I work on support for updating comparator, the type of updates accepted for this option will be supported then. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11362 Test Plan: `make check` Reviewed By: ltamasi Differential Revision: D44873870 Pulled By: jowlyzhang fbshipit-source-id: aa02094754b58d99abf9af4c9a8108c1350254cb |
||
Peter Dillinger | 204fcff751 |
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5 |
||
Peter Dillinger | 601efe3cf2 |
Misc cleanup of block cache code (#11291)
Summary: ... ahead of a larger change. * Rename confusingly named `is_in_sec_cache` to `kept_in_sec_cache` * Unify naming of "standalone" block cache entries (was "detached" in clock_cache) * Remove some unused definitions in clock_cache.h (leftover from a previous revision) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11291 Test Plan: usual tests and CI, no behavior changes Reviewed By: anand1976 Differential Revision: D43984642 Pulled By: pdillinger fbshipit-source-id: b8bf0c5b90a932a88bcbdb413b2f256834aedf97 |
||
Changyu Bi | 9aa3b6f9ae |
Support range deletion tombstones in CreateColumnFamilyWithImport (#11252)
Summary: CreateColumnFamilyWithImport() did not support range tombstones for two reasons: 1. it uses point keys of a input file to determine its boundary (smallest and largest internal key), which means range tombstones outside of the point key range will be effectively dropped. 2. it does not handle files with no point keys. Also included a fix in external_sst_file_ingestion_job.cc where the blocks read in `GetIngestedFileInfo()` can be added to block cache now (issue fixed in https://github.com/facebook/rocksdb/pull/6429). This PR adds support for exporting and importing column family with range tombstones. The main change is to add smallest internal key and largest internal key to `SstFileMetaData` that will be part of the output of `ExportColumnFamily()`. Then during `CreateColumnFamilyWithImport(...,const ExportImportFilesMetaData& metadata,...)`, file boundaries can be set from `metadata` directly. This is needed since when file boundaries are extended by range tombstones, sometimes they cannot be deduced from a file's content alone. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11252 Test Plan: - added unit tests that fails before this change Closes https://github.com/facebook/rocksdb/issues/11245 Reviewed By: ajkr Differential Revision: D43577443 Pulled By: cbi42 fbshipit-source-id: 6bff78e583cc50c44854994dea0a8dd519398f2f |
||
zhangliangkai1992 | 7a07afe82e |
DBWithTTLImpl::IsStale overflow when ttl is 15 years (#11279)
Summary: Fix DBWIthTTLImpl::IsStale overflow Pull Request resolved: https://github.com/facebook/rocksdb/pull/11279 Reviewed By: cbi42 Differential Revision: D43875039 Pulled By: ajkr fbshipit-source-id: 3e5feb8c4c4480bf1421b0763ade3d2e459ec028 |
||
mrambacher | b6640c3117 |
Remove FactoryFunc from LoadXXXObject (#11203)
Summary: The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available. With the removal of LITE mode, the function was no longer required. Note that the MergeOperator had some private classes defined in header files. To gain access to their constructors (and name methods), the class definitions were moved into header files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203 Reviewed By: cbi42 Differential Revision: D43160255 Pulled By: pdillinger fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c |
||
Levi Tamasi | 9794acb597 |
Add a new MultiGetEntity API (#11222)
Summary: The new `MultiGetEntity` API can be used to get a consistent view of a batch of keys, with the results presented as wide-column entities. Similarly to `GetEntity` and the iterator's `columns` API, if the entry corresponding to the key is a wide-column entity to start with, it is returned as-is, and if it is a plain key-value, it is wrapped into an entity with a single default column. Implementation-wise, the new API shares the logic of the batched `MultiGet` API (via the `MultiGetCommon` methods). Both single-CF and multi-CF `MultiGetEntity` APIs are provided, and blobs are also supported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11222 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D43256950 Pulled By: ltamasi fbshipit-source-id: 47fb2cb7e2d0470e3580f43fdb2fe9e51f0e7005 |
||
Peter Dillinger | 3cacd4b4ec |
Put Cache and CacheWrapper in new public header (#11192)
Summary: The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment. In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544 Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it. Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192 Test Plan: `make check` Reviewed By: anand1976 Differential Revision: D43055487 Pulled By: pdillinger fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110 |
||
Peter Dillinger | 390cc0b156 |
Ensure LockWAL() stall cleared for UnlockWAL() return (#11172)
Summary: Fixes https://github.com/facebook/rocksdb/issues/11160 By counting the number of stalls placed on a write queue, we can check in UnlockWAL() whether the stall present at the start of UnlockWAL() has been cleared by the end, or wait until it's cleared. More details in code comments and new unit test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11172 Test Plan: unit test added. Yes, it uses sleep to amplify failure on buggy behavior if present, but using a sync point to only allow new behavior would fail with the old code only because it doesn't contain the new sync point. Basically, using a sync point in UnlockWAL() could easily mask a regression by artificially limiting key behaviors. The test would only check that UnlockWAL() invokes code that *should* do the right thing, without checking that it *does* the right thing. Reviewed By: ajkr Differential Revision: D42894341 Pulled By: pdillinger fbshipit-source-id: 15c9da0ca383e6aec845b29f5447d76cecbf46c3 |
||
Peter Dillinger | 94e3beec77 |
Cleanup, improve, stress test LockWAL() (#11143)
Summary: The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change: * Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use. * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions) * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability) * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other. * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads. * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143 Test Plan: unit tests added / updated, added to stress/crash test Reviewed By: ajkr Differential Revision: D42848627 Pulled By: pdillinger fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4 |
||
sdong | 4720ba4391 |
Remove RocksDB LITE (#11147)
Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2 |
||
Yu Zhang | 6943ff6e50 |
Remove deprecated util functions in options_util.h (#11126)
Summary: Remove the util functions in options_util.h that have previously been marked deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126 Test Plan: `make check` Reviewed By: ltamasi Differential Revision: D42757496 Pulled By: jowlyzhang fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb |
||
Peter Dillinger | 546e213c4f |
Fix DelayWrite() calls for two_write_queues (#11130)
Summary: PR https://github.com/facebook/rocksdb/issues/11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been about to generate a unit test that reproduces this failure but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. https://github.com/facebook/rocksdb/issues/11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader. This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that dimishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue. HISTORY not updated because this is intended for the same release where the bug was introduced. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11130 Test Plan: Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case. Will also run crash_test_with_multiops_wc_txn for a while looking for issues. Reviewed By: ajkr Differential Revision: D42749330 Pulled By: pdillinger fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b |
||
sdong | 2800aa069a |
Remove compressed block cache (#11117)
Summary: Compressed block cache is replaced by compressed secondary cache. Remove the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117 Test Plan: See CI passes Reviewed By: pdillinger Differential Revision: D42700164 fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d |
||
Andrew Kryczka | b7fbcefda8 |
Add API to limit blast radius of merge operator failure (#11092)
Summary: Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage. To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092 Reviewed By: siying Differential Revision: D42525673 Pulled By: ajkr fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee |
||
Wenlong Zhang | 1cfe3528a2 |
support loongarch64 for rocksdb (#10036)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10036 Reviewed By: hx235 Differential Revision: D42424074 Pulled By: ajkr fbshipit-source-id: 004adb75005a26bd01c5d568d1ec6ac442cd59dd |
||
Peter Dillinger | 9f7801c5f1 |
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432 |
||
Changyu Bi | f24ef5d6ab |
Fix BackupEngineTest.ExcludeFiles memory leak (#11066)
Summary: Valgrind was complaining about the test BackupEngineTest.ExcludeFiles. The cause is backup_engine not being freed similar to https://github.com/facebook/rocksdb/issues/9610. ``` ==18228== Command: ./backup_engine_test --gtest_filter=BackupEngineTest.ExcludeFiles ==18228== Note: Google Test filter = BackupEngineTest.ExcludeFiles [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from BackupEngineTest [ RUN ] BackupEngineTest.ExcludeFiles [ OK ] BackupEngineTest.ExcludeFiles (16264 ms) [----------] 1 test from BackupEngineTest (16273 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (16306 ms total) [ PASSED ] 1 test. ==18228== ==18228== HEAP SUMMARY: ==18228== in use at exit: 14,099 bytes in 159 blocks ==18228== total heap usage: 255,328 allocs, 255,169 frees, 497,538,546 bytes allocated ==18228== ==18228== 19 bytes in 1 blocks are possibly lost in loss record 4 of 67 ==18228== at 0x483BE63: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so) ==18228== by 0x1E752D: void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) [clone .constprop.0] (basic_string.tcc:219) ==18228== by 0x1F1898: _M_construct_aux<char*> (basic_string.h:251) ==18228== by 0x1F1898: _M_construct<char*> (basic_string.h:270) ==18228== by 0x1F1898: basic_string (basic_string.h:455) ==18228== by 0x1F1898: construct<std::__cxx11::basic_string<char>, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (new_allocator.h:146) ==18228== by 0x1F1898: construct<std::__cxx11::basic_string<char>, const std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&> (alloc_traits.h:483) ==18228== by 0x1F1898: push_back (stl_vector.h:1189) ==18228== by 0x1F1898: rocksdb::(anonymous namespace)::TestFs::NewWritableFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::FileOptions const&, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::IODebugContext*) (backup_engine_test.cc:208) ==18228== by 0x4B3583: rocksdb::NewWritableFile(rocksdb::FileSystem*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::unique_ptr<rocksdb::FSWritableFile, std::default_delete<rocksdb::FSWritableFile> >*, rocksdb::FileOptions const&) (read_write_util.cc:23) ==18228== by 0x31C3A8: rocksdb::DBImpl::CreateWAL(unsigned long, unsigned long, unsigned long, rocksdb::log::Writer**) (db_impl_open.cc:1752) ==18228== by 0x321A8C: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1852) ==18228== by 0x322E7F: Open (db_impl_open.cc:1660) ==18228== by 0x322E7F: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1637) ==18228== by 0x1EE1CD: InitializeDBAndBackupEngine (backup_engine_test.cc:724) ==18228== by 0x1EE1CD: rocksdb::(anonymous namespace)::BackupEngineTest::OpenDBAndBackupEngine(bool, bool, rocksdb::(anonymous namespace)::BackupEngineTest::ShareOption) (backup_engine_test.cc:732) ==18228== by 0x217585: rocksdb::(anonymous namespace)::BackupEngineTest_ExcludeFiles_Test::TestBody() (backup_engine_test.cc:4232) ==18228== by 0x296143: HandleSehExceptionsInMethodIfSupported<testing::Test, void> (gtest-all.cc:3899) ==18228== by 0x296143: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (gtest-all.cc:3935) ==18228== by 0x28A0A5: testing::Test::Run() [clone .part.0] (gtest-all.cc:3973) ==18228== by 0x28A364: Run (gtest-all.cc:3965) ==18228== by 0x28A364: testing::TestInfo::Run() [clone .part.0] (gtest-all.cc:4149) ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11066 Test Plan: make -j24 J=24 ROCKSDBTESTS_SUBSET=backup_engine_test valgrind_check_some Reviewed By: ajkr Differential Revision: D42297791 Pulled By: cbi42 fbshipit-source-id: db67982b27b91cc78e1a9f4a96da0cba7c9785b7 |
||
Peter Dillinger | 02f2b20864 |
Add BackupEngine feature to exclude files (#11030)
Summary: We have a request for RocksDB to essentially support disconnected incremental backup. In other words, if there is limited or no connectivity to the primary backup dir, we should still be able to take an incremental backup relative to that primary backup dir, assuming some metadata about that primary dir is available (and obviously anticipating primary backup dir will be fully available if restore is needed). To support that, this feature allows the API user to "exclude" DB files from backup. This only applies to files that can be shared between backups (sst and blob files), and excluded files are tracked in the backup metadata sufficiently to ensure they are restored at restore time. At restore time, the user provides a set of alternate backup directories (as open BackupEngines, which can be read-only), and excluded files must be found in one of the backup directories ("included" in some backup). This feature depends on backup schema version 2 features, though schema version 2.0 support is not sufficient to read / restore a backup with exclusions. This change updates the schema version to 2.1 because of this feature, so that it's easy to recognize whether a RocksDB release supports this feature, while backups not using the feature are fully compatible with 2.0. Also in this PR: * Stacked on https://github.com/facebook/rocksdb/pull/11029 * Allow progress_callback to be empty, not just no-op function, and recover from exceptions thrown by BackupEngine callbacks. * The internal-only `AsBackupEngine()` function is working around the diamond hierarchy of `BackupEngineImplThreadSafe` to get to the internals, without using confusing features like virtual inheritance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11030 Test Plan: unit tests added / updated Reviewed By: ajkr Differential Revision: D42004388 Pulled By: pdillinger fbshipit-source-id: 31b6e533d308a5462e528d9012d650482d974077 |
||
anand76 | c3f720c60d |
Enable ReadAsync testing and fault injection in db_stress (#11037)
Summary: The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037 Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890. Reviewed By: akankshamahajan15 Differential Revision: D42061290 Pulled By: anand1976 fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074 |
||
Yanqin Jin | c93ba7db5d |
Revise LockWAL/UnlockWAL implementation (#11020)
Summary: RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and releases the internal `DBImpl::log_write_mutex_`. According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288 > Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_. This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()` is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`. There are two issues: 1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after calling `LockWAL()` because log_write_mutex is not recursive. 2. Two threads may dead lock due to lock order inversion. To fix these issues, we can modify the implementation of LockWAL so that it does not keep `log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can instead manually inject a write stall so that all future writes will be stopped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020 Test Plan: make check Reviewed By: ajkr Differential Revision: D41785203 Pulled By: riversand963 fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce |
||
Hui Xiao | 98d5db5c2e |
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run
|
||
Peter Dillinger | 9b34c097a1 |
Fix bug updating latest backup on delete (#11029)
Summary: Previously, the "latest" valid backup would not be updated on delete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11029 Test Plan: unit test included (added to an existing test for efficiency) Reviewed By: hx235 Differential Revision: D41967925 Pulled By: pdillinger fbshipit-source-id: ca143354d281eb979557ea421902cd26803a1137 |