mirror of https://github.com/facebook/rocksdb.git
258 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Hui Xiao | 20213d01a3 |
Fix crash in CompactFiles() of conflict range under `preclude_last_level_data_seconds > 0` (#12628)
Summary: **Context/Summary:** Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflict when sanitizing input files while later used `FilesRangeOverlapWithCompaction()` to assert for no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e, compaction can output to second-to-the-last level). So we ran into assertion violation in `CompactFiles()` like below ``` Assertion `output_level == 0 || !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed. ``` This PR make `CompactFiles()` used `FilesRangeOverlapWithCompaction()` and return Aborted status upon range conflict instead of crashing (during debug build) or proceed incorrectly (during non-debug build). To do so cleanly, I included a refactoring to make `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628 Test Plan: New UT crashed before the fix and return correct status after the fix. Reviewed By: cbi42 Differential Revision: D57123536 Pulled By: hx235 fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430 |
|
Changyu Bi | e2ef349f56 |
Deflake unit test `DBCompactionTest.CompactionLimiter` (#12596)
Summary: The test has been flaky for a long time. A recent [failure](https://github.com/facebook/rocksdb/actions/runs/8820808355/job/24215219590?pr=12578) shows that there is still flush running when the assertion fails. I think this is because `WaitForFlushMemTable()` may return before the a flush schedules the next compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12596 Test Plan: I could not repro the failure locally: `gtest-parallel --repeat=8000 --workers=100 ./db_compaction_test --gtest_filter="*CompactionLimiter*"` Reviewed By: ajkr Differential Revision: D56715874 Pulled By: cbi42 fbshipit-source-id: f5f64eb30fff7e115c19beedad2dc22afa06258d |
|
Andrew Kryczka | 6807da0b44 |
Fix `DisableManualCompaction()` hang (#12578)
Summary: Prior to this PR the following sequence could happen: 1. `RunManualCompaction()` A schedules compaction to thread pool and waits 2. `RunManualCompaction()` B waits without scheduling anything due to conflict 3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both 4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done 5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool 6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction 7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution 8. `DisableManualCompaction()` returns as there are no more manual compactions running Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward: 5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done 6'. `DisableManualCompaction()` returns as there are no more manual compactions running Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578 Reviewed By: cbi42 Differential Revision: D56528144 Pulled By: ajkr fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b |
|
Changyu Bi | a0aade7e62 |
Add some debug print for flaky test `DBCompactionTest.CompactionLimiter` (#12509)
Summary: The unit test fails occasionally can cannot be reproed locally. ``` [ RUN ] DBCompactionTest.CompactionLimiter db/db_compaction_test.cc:6139: Failure Expected equality of these values: cf_count Which is: 17 env_->GetThreadPoolQueueLen(Env::LOW) Which is: 15 [ FAILED ] DBCompactionTest.CompactionLimiter (512 ms) ``` Add some debug print to help triaging if it fails again. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12509 Reviewed By: jowlyzhang Differential Revision: D55770552 Pulled By: cbi42 fbshipit-source-id: 2a39b2199f80352fcf2c6cd2b9c8b81c727eee8c |
|
Hui Xiao | d985902ef4 |
Disallow refitting more than 1 file from non-L0 to L0 (#12481)
Summary: **Context/Summary:** We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than 1 files to L0. This refitting can cause read performance regression as we need to go through every file in L0, corruption in some edge case and false positive corruption caught by force consistency check. We decided to explicitly disallow such behavior. A related change to OptionChangeMigration(): - When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrate data into a couple L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data upon migration finishes when the migrated data is larger than max_table_files_size. This is achieved by first compacting all the data into a couple non-L0 files and refitting those files from non-L0 to L0 if needed. In that way, only some data instead of all data will be dropped immediately after migration to FIFO with a max_table_files_size. - Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and explicitly state such risk in API comment. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481 Test Plan: - New UT - Modified UT Reviewed By: cbi42 Differential Revision: D55351178 Pulled By: hx235 fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80 |
|
yuzhangyu@fb.com | 1cfdece85d |
Run internal cpp modernizer on RocksDB repo (#12398)
Summary: When internal cpp modernizer attempts to format rocksdb code, it will replace macro `ROCKSDB_NAMESPACE` with its default definition `rocksdb` when collapsing nested namespace. We filed a feedback for the tool T180254030 and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested us to run the modernizer tool ourselves so future auto codemod attempts will be smaller. This diff contains: Running `xplat/scripts/codemod_service/cpp_modernizer.sh` in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that has a non meta copyright comment) without swapping out the namespace macro `ROCKSDB_NAMESPACE` Followed by RocksDB's own `make format` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398 Test Plan: Auto tests Reviewed By: hx235 Differential Revision: D54382532 Pulled By: jowlyzhang fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa |
|
Peter Dillinger | 13ef21c22e |
default_write_temperature option (#12388)
Summary: Currently SST files that aren't applicable to last_level_temperature nor file_temperature_age_thresholds are written with temperature kUnknown, which is a little weird and doesn't support CF-based tiering. The default_temperature option only affects how kUnknown is interpreted for stats. This change adds a new per-CF option default_write_temperature that determines the temperature of new SST files when those other options do not apply. Also made a change to ignore last_level_temperature with FIFO compaction, because I found that could lead to an infinite loop in compaction. Needed follow-up: Fix temperature handling with external file ingestion Pull Request resolved: https://github.com/facebook/rocksdb/pull/12388 Test Plan: unit tests extended appropriately. (Ignore whitespace changes when reviewing.) Reviewed By: jowlyzhang Differential Revision: D54266574 Pulled By: pdillinger fbshipit-source-id: c9ec9a74dbf22be6e986f77f9689d05fea8ef0bb |
|
Peter Dillinger | 54cb9c77d9 |
Prefer static_cast in place of most reinterpret_cast (#12308)
Summary: The following are risks associated with pointer-to-pointer reinterpret_cast: * Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do. * Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally. I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement: * Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have `struct Derived: public Base1, public Base2`. If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic. * Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance. With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain. A couple of related interventions included here: * Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle. * Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse). Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work. I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308 Test Plan: existing tests, CI Reviewed By: ltamasi Differential Revision: D53204947 Pulled By: pdillinger fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2 |
|
Yu Zhang | e3e8fbb497 |
Add a separate range classes for internal usage (#12071)
Summary: Introduce some different range classes `UserKeyRange` and `UserKeyRangePtr` to be used by internal implementation. The `Range` class is used in both public APIs like `DB::GetApproximateSizes`, `DB::GetApproximateMemTableStats`, `DB::GetPropertiesOfTablesInRange` etc and internal implementations like `ColumnFamilyData::RangesOverlapWithMemtables`, `VersionSet::GetPropertiesOfTablesInRange`. These APIs have different expectations of what keys this range class contain. Public API users are supposed to populate the range with the user keys without timestamp, in the same way that point lookup and range scan APIs' key input only expect the user key without timestamp. The internal APIs implementation expect a user key whose format is compatible with the user comparator, a.k.a a user key with the timestamp. This PR contains: 1) introducing counterpart range class `UserKeyRange` `UserKeyRangePtr` for internal implementation while leave the existing `Range` and `RangePtr` class only for public APIs. Internal implementations are updated to use this new class instead. 2) add user-defined timestamp support for `DB::GetPropertiesOfTablesInRange` API and `DeleteFilesInRanges` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12071 Test Plan: existing tests Added test for `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs for when user-defined timestamp is enabled. The change in external_file_ingestion_job doesn't have a user-defined timestamp enabled test case coverage, will add one in a follow up PR that adds file ingestion support for UDT. Reviewed By: ltamasi Differential Revision: D53292608 Pulled By: jowlyzhang fbshipit-source-id: 9a9279e23c640a6d8f8232636501a95aef7638b8 |
|
Changyu Bi | ace1721b28 |
Remove deprecated option `level_compaction_dynamic_file_size` (#12325)
Summary: The option is introduced in https://github.com/facebook/rocksdb/issues/10655 to allow reverting to old behavior. The option is enabled by default and there has not been a need to disable it. Remove it for 9.0 release. Also fixed and improved a few unit tests that depended on setting this option to false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12325 Test Plan: existing tests. Reviewed By: hx235 Differential Revision: D53369430 Pulled By: cbi42 fbshipit-source-id: 0ec2440ca8d88db7f7211c581542c7581bd4d3de |
|
Changyu Bi | 3812a77771 |
Deflake `DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit` (#12289)
Summary: The test has been failing with ``` [ RUN ] DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit db/db_compaction_test.cc:9661: Failure Expected equality of these values: 0u Which is: 0 env_->GetThreadPoolQueueLen(Env::Priority::LOW) Which is: 1 ``` This can happen when thread pool queue len is checked before `test::SleepingBackgroundTask::DoSleepTask` is scheduled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12289 Reviewed By: ajkr Differential Revision: D53064300 Pulled By: cbi42 fbshipit-source-id: 9ed1b714243880f82bd1cc1584b402ac9cf57507 |
|
Andrew Kryczka | 2dda7a0dd2 |
Detect compaction pressure at lower debt ratios (#12236)
Summary: This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236 Reviewed By: cbi42 Differential Revision: D52765685 Pulled By: ajkr fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4 |
|
Jay Huh | 2dab137182 |
Mark more files for periodic compaction during offpeak (#12031)
Summary: - The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice. - It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization. - Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start. - We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031 Test Plan: Unit Tests added - `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled - `DBTestUniversalCompaction2::PeriodicCompaction` for Universal Reviewed By: cbi42 Differential Revision: D50900292 Pulled By: jaykorean fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d |
|
Changyu Bi | d5bc30befa |
Enforce status checking after Valid() returns false for IteratorWrapper (#11975)
Summary: ... when compiled with ASSERT_STATUS_CHECKED = 1. The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782. Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975 Test Plan: * `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check` * I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task. Reviewed By: ajkr Differential Revision: D50383790 Pulled By: cbi42 fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce |
|
Changyu Bi | cc254efea6 |
Release compaction files in manifest write callback (#11764)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply(). This is done by passing in a [callback](
|
|
Changyu Bi | 3285ba7a29 |
Fix unit test tsan failure (#11828)
Summary: The test DBCompactionWaitForCompactTest.WaitForCompactWithOptionToFlushAndCloseDB failed tsan in https://app.circleci.com/pipelines/github/facebook/rocksdb/32009/workflows/577e4e1f-a909-4e80-8ef4-af98b5ff7446/jobs/660989. I cannot repro locally, but this should be the fix. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11828 Test Plan: `./db_compaction_test --gtest_filter="*DBCompactionWaitForCompactTest/DBCompactionWaitForCompactTest.*"` Reviewed By: jaykorean Differential Revision: D49241904 Pulled By: cbi42 fbshipit-source-id: 68714c836d982dcb3946da104533d5c0594980de |
|
Changyu Bi | 458acf8169 |
Add some unit tests when file read returns error during compaction/scanning (#11788)
Summary: Some repro unit tests for the bug fixed in https://github.com/facebook/rocksdb/pull/11782. Ran on main without https://github.com/facebook/rocksdb/pull/11782: ``` ./db_compaction_test --gtest_filter='*ErrorWhenReadFileHead' Note: Google Test filter = *ErrorWhenReadFileHead [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBCompactionTest [ RUN ] DBCompactionTest.ErrorWhenReadFileHead db/db_compaction_test.cc:10105: Failure Value of: s.IsIOError() Actual: false Expected: true [ FAILED ] DBCompactionTest.ErrorWhenReadFileHead (3960 ms) ./db_iterator_test --gtest_filter="*ErrorWhenReadFile*" Note: Google Test filter = *ErrorWhenReadFile* [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBIteratorTest [ RUN ] DBIteratorTest.ErrorWhenReadFile db/db_iterator_test.cc:3399: Failure Value of: (iter->status()).ok() Actual: true Expected: false [ FAILED ] DBIteratorTest.ErrorWhenReadFile (280 ms) [----------] 1 test from DBIteratorTest (280 ms total) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11788 Reviewed By: ajkr Differential Revision: D48940284 Pulled By: cbi42 fbshipit-source-id: 06f3c5963f576db3f85d305ffb2745ee13d209bb |
|
Hui Xiao | 05daa12332 |
Change compaction_readahead_size default value to 2MB (#11762)
Summary: **Context/Summary:** After https://github.com/facebook/rocksdb/pull/11631, we rely on `compaction_readahead_size` for how much to read ahead for compaction read under non-direct IO case. https://github.com/facebook/rocksdb/pull/11658 therefore also sanitized 0 `compaction_readahead_size` to 2MB under non-direct IO, which is consistent with the existing sanitization with direct IO. However, this makes disabling compaction readahead impossible as well as add one more scenario to the inconsistent effects between `Options.compaction_readahead_size=0` during DB open and `SetDBOptions("compaction_readahead_size", "0")` . - `SetDBOptions("compaction_readahead_size", "0")` will disable compaction readahead as its logic never goes through sanitization above while `Options.compaction_readahead_size=0` will go through sanitization. Therefore we decided to do this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11762 Test Plan: Modified existing UTs to cover this PR Reviewed By: ajkr Differential Revision: D48759560 Pulled By: hx235 fbshipit-source-id: b3f85e58bda362a6fa1dc26bd8a87aa0e171af79 |
|
Jay Huh | 0fa0c97d3e |
Timeout in microsecond option in WaitForCompactOptions (#11711)
Summary: While it's rare, we may run into a scenario where `WaitForCompact()` waits for background jobs indefinitely. For example, not enough space error will add the job back to the queue while WaitForCompact() waits for _all jobs_ including the jobs that are in the queue to be completed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11711 Test Plan: `DBCompactionWaitForCompactTest::WaitForCompactToTimeout` added `timeout` option added to the variables for all of the existing DBCompactionWaitForCompactTests Reviewed By: pdillinger, jowlyzhang Differential Revision: D48416390 Pulled By: jaykorean fbshipit-source-id: 7b6a12f705ab6c6dfaf8ad736a484ca654a86106 |
|
Changyu Bi | d1ff401472 |
Delay bottommost level single file compactions (#11701)
Summary: For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions. * The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701 Test Plan: * Add two unit tests to test when bottommost_file_compaction_delay > 0. * Ran crash test with the new option. Reviewed By: jaykorean, ajkr Differential Revision: D48331564 Pulled By: cbi42 fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8 |
|
Jay Huh | 52816ff64d |
Close DB option in WaitForCompact() (#11497)
Summary: Context: As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `close_db` option in `WaitForCompactOptions` to close DB after waiting for compactions to finish. Must be set to true to close the DB upon compactions finishing. 1. `bool close_db = false` added to `WaitForCompactOptions` 2. Introduced `CancelPeriodicTaskSchedulers()` and moved unregistering PeriodicTaskSchedulers to it.`CancelAllBackgroundWork()` calls it now. 3. When close_db option is on, unpersisted data (data in memtable when WAL is disabled) will be flushed in `WaitForCompact()` if flush option is not on (and `mutable_db_options_.avoid_flush_during_shutdown` is not true). The unpersisted data flush in `CancelAllBackgroundWork()` will be skipped because `shutting_down_` flag will be set true before calling `Close()`. 4. Atomic boolean `reject_new_background_jobs_` is introduced to prevent new background jobs from being added during the short period of time after waiting is done and before `shutting_down_` is set by `Close()`. 5. `WaitForCompact()` now waits for recovery in progress to complete as well. (flush operations from WAL -> L0 files) 6. Added `close_db_` cases to all existing `WaitForCompactTests` 7. Added a scenario to `DBBasicTest::DBClose` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11497 Test Plan: - Existing DBCompactionTests - `WaitForCompactWithOptionToFlushAndCloseDB` added - Added a scenario to `DBBasicTest::DBClose` Reviewed By: pdillinger, jowlyzhang Differential Revision: D46337560 Pulled By: jaykorean fbshipit-source-id: 0f8c7ee09394847f2af5ea4bdd331b47bcdef0b0 |
|
Changyu Bi | 76ed9a3990 |
Add missing status check when compiling with `ASSERT_STATUS_CHECKED=1` (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by
|
|
Changyu Bi | 6a0f637633 |
Compare the number of input keys and processed keys for compactions (#11571)
Summary: ... to improve data integrity validation during compaction. A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed. For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code. For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571 Test Plan: * Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case. * All other unit tests for non-corrupted case. * Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple` Reviewed By: ajkr Differential Revision: D47131965 Pulled By: cbi42 fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92 |
|
Yu Zhang | 4ea7b796b7 |
Respect cutoff timestamp during flush (#11599)
Summary: Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval. This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`. For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`. Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it. Tet plan: ``` ./column_family_test --gtest_filter="*RetainUDT*" ./memtable_list_test --gtest_filter="*WithTimestamp*" ./flush_job_test --gtest_filter="*WithTimestamp*" ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599 Reviewed By: ajkr Differential Revision: D47561586 Pulled By: jowlyzhang fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89 |
|
Dan Wang | 8a7b9888d4 |
Fix the sync point SanitizeOptions::AfterChangeMaxOpenFiles which is not executed in db_compaction_test (#11583)
Summary: In [db_impl_open.cc](https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc), the sync point `SanitizeOptions::AfterChangeMaxOpenFiles` is used to set `max_open_files` with some specified "**invalid**" value even if it has been sanitized. However, in [db_compaction_test.cc](https://github.com/facebook/rocksdb/blob/main/db/db_compaction_test.cc), `SanitizeOptions::AfterChangeMaxOpenFiles` would not be executed since `SyncPoint::EnableProcessing()` is run after `DBTestBase::Reopen()`. To enable `SanitizeOptions::AfterChangeMaxOpenFiles`, `SyncPoint::EnableProcessing()` should be put ahead of `DBTestBase::Reopen()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11583 Test Plan: run unit tests locally as below: ``` make J=1 check [ RUN ] DBCompactionTest.LevelTtlCascadingCompactions [ OK ] DBCompactionTest.LevelTtlCascadingCompactions (85 ms) [ RUN ] DBCompactionTest.LevelPeriodicCompaction [ OK ] DBCompactionTest.LevelPeriodicCompaction (57 ms) ``` Reviewed By: jowlyzhang Differential Revision: D47311827 Pulled By: ajkr fbshipit-source-id: 99165e87a8129e404af06fdf9b4c96eca540fd23 |
|
Changyu Bi | bc04ec85db |
Make option `level_compaction_dynamic_level_bytes` true by default (#11525)
Summary: after https://github.com/facebook/rocksdb/issues/11321 and https://github.com/facebook/rocksdb/issues/11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is automatic by RocksDB and requires no manual compaction from user. Making the option true by default as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape). 2. compaction is more adaptive to write traffic. 3. automatic draining of unneeded levels. Wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size. The PR mostly contains fixes for unit tests as they assumed `level_compaction_dynamic_level_bytes=false`. Most notable change is commit |
|
Changyu Bi | 2e8cc98ab2 |
Fix subcompaction bug to allow running two subcompactions (#11501)
Summary: as reported in https://github.com/facebook/rocksdb/issues/11476, RocksDB currently does not execute compactions in two subcompactions even when they qualify. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11501 Test Plan: * Add a new unit test. * Run crash test with max_subcompactions=2: `python3 tools/db_crashtest.py blackbox --simple --subcompactions=2 --target_file_size_base=1048576 --compaction_style=0` * saw logs showing compactions being executed as 2 subcompactions ``` 2023/06/01-17:28:44.028470 3147486 (Original Log Time 2023/06/01-17:28:44.025972) EVENT_LOG_v1 {"time_micros": 1685665724025939, "job": 6, "event": "compaction_finished", "compaction_time_micros": 34539, "compaction_time_cpu_micros": 26404, "output_level": 6, "num_output_files": 2, "total_output_size": 1109796, "num_input_records": 13188, "num_output_records": 13021, "num_subcompactions": 2, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 13]} ``` Reviewed By: ajkr Differential Revision: D46411497 Pulled By: cbi42 fbshipit-source-id: 3ebfc02e19f78f782e114a9546dc3d481d496258 |
|
Changyu Bi | e95cc1217d |
`CompactRange()` always compacts to bottommost level for leveled compaction (#11468)
Summary:
currently for leveled compaction, the max output level of a call to `CompactRange()` is pre-computed before compacting each level. This max output level is the max level whose key range overlaps with the manual compaction key range. However, during manual compaction, files in the max output level may be compacted down further by some background compaction. When this background compaction is a trivial move, there is a race condition and the manual compaction may not be able to compact all keys in the specified key range. This PR updates `CompactRange()` to always compact to the bottommost level to make this race condition more unlikely (it can still happen, see more in comment here:
|
|
Jay Huh | 87bc929db3 |
Flush option in WaitForCompact() (#11483)
Summary: Context: As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `flush` option in `WaitForCompactOptions` to flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 1. `bool flush = false` added to `WaitForCompactOptions` 2. `DBImpl::FlushAllColumnFamilies()` is introduced and `DBImpl::FlushForGetLiveFiles()` is refactored to call it. 3. `DBImpl::FlushAllColumnFamilies()` gets called before waiting in `WaitForCompact()` if `flush` option is `true` 4. Some previous WaitForCompact tests were parameterized to include both cases for `abort_on_pause_` being true/false as well as `flush_` being true/false Pull Request resolved: https://github.com/facebook/rocksdb/pull/11483 Test Plan: - `DBCompactionTest::WaitForCompactWithOptionToFlush` added - Changed existing DBCompactionTest::WaitForCompact tests to `DBCompactionWaitForCompactTest` to include params Reviewed By: pdillinger Differential Revision: D46289770 Pulled By: jaykorean fbshipit-source-id: 70d3f461d96a6e06390be60170dd7c4d0d38f8b0 |
|
Changyu Bi | e1c7209beb |
Fix flaky test: `DBCompactionTest.WaitForCompactShutdownWhileWaiting` (#11488)
Summary: tsan complains with the following error message. This is likely due to DB object destroyed while WaitForCompact() is still running. ``` [ RUN ] DBCompactionTest.WaitForCompactShutdownWhileWaiting ================== WARNING: ThreadSanitizer: data race (pid=1128703) Atomic read of size 1 at 0x7b8c00000740 by thread T4: #0 pthread_cond_wait <null> (db_compaction_test+0x46970a) https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::Wait() /root/project/port/port_posix.cc:119:23 (librocksdb.so.8.4+0x7c4c60) https://github.com/facebook/rocksdb/issues/2 rocksdb::InstrumentedCondVar::WaitInternal() /root/project/monitoring/instrumented_mutex.cc:69:9 (librocksdb.so.8.4+0x75f697) https://github.com/facebook/rocksdb/issues/3 rocksdb::InstrumentedCondVar::Wait() /root/project/monitoring/instrumented_mutex.cc:62:3 (librocksdb.so.8.4+0x75f697) https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::WaitForCompact(rocksdb::WaitForCompactOptions const&) /root/project/db/db_impl/db_impl_compaction_flush.cc:3978:14 (librocksdb.so.8.4+0x494174) https://github.com/facebook/rocksdb/issues/5 rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30::operator()() const /root/project/db/db_compaction_test.cc:3479:26 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/6 void std::__invoke_impl<void, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>(std::__invoke_other, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/7 std::__invoke_result<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>::type std::__invoke<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>(rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/8 void std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/9 std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/10 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (db_compaction_test+0x5cdc90) https://github.com/facebook/rocksdb/issues/11 <null> <null> (libstdc++.so.6+0xda6b3) Previous write of size 1 at 0x7b8c00000740 by thread T5: #0 pthread_mutex_destroy <null> (db_compaction_test+0x46a4f8) https://github.com/facebook/rocksdb/issues/1 rocksdb::port::Mutex::~Mutex() /root/project/port/port_posix.cc:77:48 (librocksdb.so.8.4+0x7c480e) https://github.com/facebook/rocksdb/issues/2 rocksdb::InstrumentedMutex::~InstrumentedMutex() /root/project/./monitoring/instrumented_mutex.h:20:7 (librocksdb.so.8.4+0x41fda6) https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::~DBImpl() /root/project/db/db_impl/db_impl.cc:755:1 (librocksdb.so.8.4+0x41fda6) https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::~DBImpl() /root/project/db/db_impl/db_impl.cc:737:19 (librocksdb.so.8.4+0x4203d9) https://github.com/facebook/rocksdb/issues/5 rocksdb::DBTestBase::Close() /root/project/db/db_test_util.cc:670:3 (librocksdb_test_debug.so+0x57413) https://github.com/facebook/rocksdb/issues/6 rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31::operator()() const /root/project/db/db_compaction_test.cc:3485:49 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/7 void std::__invoke_impl<void, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>(std::__invoke_other, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/8 std::__invoke_result<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>::type std::__invoke<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>(rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/9 void std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/10 std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/11 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (db_compaction_test+0x5cdf03) https://github.com/facebook/rocksdb/issues/12 <null> <null> (libstdc++.so.6+0xda6b3) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11488 Test Plan: ``` COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 db_compaction_test gtest-parallel --repeat=10000 ./db_compaction_test --gtest_filter="*WaitForCompactShutdownWhileWaiting*" -w200 ``` Reviewed By: jaykorean Differential Revision: D46293891 Pulled By: cbi42 fbshipit-source-id: 8ca259cb1e09a9e4f4095b2d084f2ba92b710b97 |
|
Jay Huh | 81aeb15988 |
Add WaitForCompact with WaitForCompactOptions to public API (#11436)
Summary: Context: This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions` 1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing. 1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct. 2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h` 3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436 Test Plan: Following tests added - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish` - `DBCompactionTest::WaitForCompactAbortOnPauseAborted` - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted` - `DBCompactionTest::WaitForCompactShutdownWhileWaiting` - `TransactionTest::WaitForCompactAbortOnPause` NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place. Reviewed By: pdillinger Differential Revision: D45799659 Pulled By: jaykorean fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b |
|
Jay Huh | 586d78b31e |
Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary: Context: In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue. In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code. This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`. Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`. Furthermore, all tests that previously called `waitForCompact(true)` have been fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443 Test Plan: Existing tests that involve a shutdown in progress: - DBCompactionTest::CompactRangeShutdownWhileDelayed - DBTestWithParam::PreShutdownMultipleCompaction - DBTestWithParam::PreShutdownCompactionMiddle Reviewed By: pdillinger Differential Revision: D45923426 Pulled By: jaykorean fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae |
|
Changyu Bi | 8827cd0618 |
Support compacting files to different temperatures in FIFO compaction (#11428)
Summary: - Add a new option `CompactionOptionsFIFO::file_temperature_age_thresholds` that allows user to specify age thresholds for compacting files to different temperatures. File temperature can be used to store files in different storage media. The new options allows specifying multiple temperature-age pairs. The option uses struct for a temperature-age pair to use the existing parsing functionality to make the option dynamically settable. - Deprecate the old option `age_for_warm` that was added for a similar purpose. - Compaction score calculation logic is updated to check if a file needs to be compacted to change its temperature. - Some refactoring is done in `FIFOCompactionPicker::PickTemperatureChangeCompaction`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11428 Test Plan: adapted unit tests that were for `age_for_warm` to this new option. Reviewed By: ajkr Differential Revision: D45611412 Pulled By: cbi42 fbshipit-source-id: 2dc384841f61cc04abb9681e31aa2de0f0b06106 |
|
Changyu Bi | 43e9a60bb2 |
Always allow L0->L1 trivial move during manual compaction (#11375)
Summary: during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr), - before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move), - after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction. Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved. This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375 Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed. Reviewed By: ajkr Differential Revision: D44943518 Pulled By: cbi42 fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708 |
|
Changyu Bi | ba16e8eee7 |
Try to pick more files in `LevelCompactionBuilder::TryExtendNonL0TrivialMove()` (#11347)
Summary: Before this PR, in `LevelCompactionBuilder::TryExtendNonL0TrivialMove(index)`, we start from a file at index and expand the compaction input towards right to find files to trivial move. This PR adds the logic to also expand towards left. Another major change made in this PR is to not expand L0 files through `TryExtendNonL0TrivialMove()`. This happens currently when compacting L0 files to an empty output level. The condition for expanding files in `TryExtendNonL0TrivialMove()` is to check atomic boundary, which does not take into account that L0 files can overlap in key range and are not sorted in key order. So it may include more L0 files than needed and disallow a trivial move. This change is included in this PR so that we don't make it worse by always expanding L0 in both direction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11347 Test Plan: * new unit test * Benchmark does not show obvious improvement or regression: ``` Write sequentially ./db_bench --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216 Main: fillseq : 4.726 micros/op 211592 ops/sec 472.607 seconds 100000000 operations; 23.4 MB/s This PR: fillseq : 4.755 micros/op 210289 ops/sec 475.534 seconds 100000000 operations; 23.3 MB/s Write randomly ./db_bench --benchmarks=fillrandom --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216 Main: fillrandom : 16.351 micros/op 61159 ops/sec 1635.066 seconds 100000000 operations; 6.8 MB/s This PR: fillrandom : 15.798 micros/op 63298 ops/sec 1579.817 seconds 100000000 operations; 7.0 MB/s ``` Reviewed By: ajkr Differential Revision: D44645650 Pulled By: cbi42 fbshipit-source-id: 8631f3a6b3f01decbbf18c34f2b62833cb4f9733 |
|
Changyu Bi | b3c43a5b99 |
Drain unnecessary levels when `level_compaction_dynamic_level_bytes=true` (#11340)
Summary: When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340 Test Plan: - New unit tests - `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run. Reviewed By: ajkr Differential Revision: D44563884 Pulled By: cbi42 fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8 |
|
Changyu Bi | 601320164b |
Trivially move files down when opening db with level_compaction_dynamic_l… (#11321)
Summary: …evel_bytes During DB open, if a column family uses level compaction with level_compaction_dynamic_level_bytes=true, trivially move its files down in the LSM such that the bottommost files are in Lmax, the second from bottommost level files are in Lmax-1 and so on. This is aimed to make it easier to migrate level_compaction_dynamic_level_bytes from false to true. Before this change, a full manual compaction is suggested for such migration. After this change, user can just restart DB to turn on this option. db_crashtest.py is updated to randomly choose value for level_compaction_dynamic_level_bytes. Note that there may still be too many unnecessary levels if a user is migrating from universal compaction or level compaction with a smaller level multiplier. A full manual compaction may still be needed in that case before some PR that automatically drain unnecessary levels like https://github.com/facebook/rocksdb/issues/3921 lands. Eventually we may want to change the default value of option level_compaction_dynamic_level_bytes to true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11321 Test Plan: 1. Added unit tests. 2. Crash test: ran a variation of db_crashtest.py (like 32516507e77521ae887e45091b69139e32e8efb7) that turns level_compaction_dynamic_level_bytes on and off and switches between LC and UC for the same DB. TODO: Update `OptionChangeMigration`, either after this PR or https://github.com/facebook/rocksdb/issues/3921. Reviewed By: ajkr Differential Revision: D44341930 Pulled By: cbi42 fbshipit-source-id: 013de19a915c6a0502be569f07c4cc8f1c3c6be2 |
|
sdong | b92bc04ab0 |
Deflake DBCompactionTest.CancelCompactionWaitingOnConflict (#11318)
Summary: In DBCompactionTest::CancelCompactionWaitingOnConflict, when generating SST files to trigger a compaction, we don't wait after each file, which may cause multiple memtables going to the same SST file, causing insufficient files to trigger the compaction. We do the waiting instead, except the last one, which would trigger compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11318 Test Plan: Run DBCompactionTest.CancelCompactionWaitingOnConflict multiple times. Reviewed By: ajkr Differential Revision: D44267273 fbshipit-source-id: 86af49b05fc67ea3335312f0f5f3d22df1520bf8 |
|
Andrew Kryczka | 071c33846d |
Allow canceling manual compaction while waiting for conflicting compaction (#11165)
Summary: This PR adds logic to the `RunManualCompaction()` loop to check for cancellation before waiting on any conflicting compactions to finish. In case of cancellation, `RunManualCompaction()` no longer waits on conflicting compactions Pull Request resolved: https://github.com/facebook/rocksdb/pull/11165 Test Plan: repro test case Reviewed By: cbi42 Differential Revision: D42864058 Pulled By: ajkr fbshipit-source-id: ea4dd1a8f294abe212905495a8fbe8f07fca3f5a |
|
sdong | 4720ba4391 |
Remove RocksDB LITE (#11147)
Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2 |
|
Changyu Bi | 4d0f9a995c |
Consider TTL compaction file cutting earlier to prevent small output file (#11075)
Summary: in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075 Test Plan: - Added new unit test. Reviewed By: hx235 Differential Revision: D42398739 Pulled By: cbi42 fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7 |
|
Hui Xiao | 9502856edd |
Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary: **Context:** File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998. RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions. - Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason. - But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in |
|
Peter Dillinger | e6b6e74154 |
Make CompactRange() more aware of SstPartitionerFactory (#11032)
Summary: Some users are at least considering using SstPartitioner to support efficient physical migration of specific key ranges between RocksDB instances. One might expect manual `CompactRange()` over a narrow key range across some partition to enforce partitioning of any SST files crossing that partition boundary, but that currently only works if there are keys within that range. This change makes the overlap logic in CompactRange more aware of the partitioner to automatically select relevant files crossing a partition boundary, even when they otherwise would not be selected due to the compaction range falling in a gap between entries. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11032 Test Plan: unit test included Reviewed By: hx235 Differential Revision: D41981380 Pulled By: pdillinger fbshipit-source-id: 2fe445bdddc73c00276c20f295cc1fa33d15b05a |
|
Hui Xiao | 98d5db5c2e |
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run
|
|
Hui Xiao | f1574a20ff |
Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999)
Summary:
**Context/Summary:**
This reverts commit
|
|
Andrew Kryczka | 5cf6ab6f31 |
Ran clang-format on db/ directory (#10910)
Summary: Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910 Reviewed By: riversand963 Differential Revision: D40880683 Pulled By: ajkr fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174 |
|
Hui Xiao | fc74abb436 |
Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777)
Summary: **Context:** Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case Repro: ``` COERCE_CONTEXT_SWICH=1 make -j56 db_stress ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35 put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049 ``` **Summary:** FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving. - [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true` - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56 - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here. - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0` - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno` - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378) - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction. Some additional clean-up included in this PR: - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming - Added comment about `earliest_memtable_seqno` in related APIs - Made parameter `earliest_memtable_seqno` constant and required Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777 Test Plan: - make check - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix - Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only Reviewed By: ajkr Differential Revision: D40090485 Pulled By: hx235 fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f |
|
Changyu Bi | eca47fb696 |
Ignore kBottommostFiles compaction logic when allow_ingest_behind (#10767)
Summary: fix for https://github.com/facebook/rocksdb/issues/10752 where RocksDB could be in an infinite compaction loop (with compaction reason kBottommostFiles) if allow_ingest_behind is enabled and the bottommost level is unfilled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10767 Test Plan: Added a unit test to reproduce the compaction loop. Reviewed By: ajkr Differential Revision: D40031861 Pulled By: ajkr fbshipit-source-id: 71c4b02931fbe507a847632905404c9b8fa8c96b |
|
Jay Zhuang | f007ad8b4f |
RoundRobin TTL compaction (#10725)
Summary: For RoundRobin compaction, the data should be mostly sorted per level and within level. Use normal compaction picker for RR until all expired data is compacted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10725 Reviewed By: ajkr Differential Revision: D39771069 Pulled By: jay-zhuang fbshipit-source-id: 7ccf88d7c093fad5673bda73a7b08cc4757780cd |
|
Andrew Kryczka | c7ccbb33a6 |
Allow manual compactions to run in parallel by default (#10317)
Summary: This PR changes the default value of `CompactRangeOptions::exclusive_manual_compaction` from true to false so manual `CompactRange()`s can run in parallel with other compactions. I believe no artificial parallelism restriction is the intuitive behavior so feel the old default value is a trap, which I have fallen into several times, including yesterday. `CompactRangeOptions::exclusive_manual_compaction == false` has been used in both our correctness test and in production for years so should be reasonably safe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10317 Reviewed By: jay-zhuang Differential Revision: D37659392 Pulled By: ajkr fbshipit-source-id: 504915e978bbe300b79483d064070c75e93d91e5 |