rocksdb/db/db_impl
Andrew Kryczka f24c39ab3d Prevent corruption with parallel manual compactions and `change_level == true` (#9077)
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.

Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered  (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.

The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.

Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).

The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.

Thanks to siying for finding the bug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077

Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.

Reviewed By: siying

Differential Revision: D31921908

Pulled By: ajkr

fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
2021-10-27 23:08:56 -07:00
..
compacted_db_impl.cc Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
compacted_db_impl.h Move compacted_db_impl.[c|h] to db/db_impl (#8082) 2021-03-23 13:49:26 -07:00
db_impl.cc Experimental support for SST unique IDs (#8990) 2021-10-18 23:32:01 -07:00
db_impl.h Make `DB::Close()` thread-safe (#8970) 2021-10-18 20:32:35 -07:00
db_impl_compaction_flush.cc Prevent corruption with parallel manual compactions and `change_level == true` (#9077) 2021-10-27 23:08:56 -07:00
db_impl_debug.cc Eliminate compiler complaining, which the return type of the function… (#8498) 2021-07-08 10:09:05 -07:00
db_impl_experimental.cc Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
db_impl_files.cc Don't ignore deletion rate limit if WAL dir is different (#8967) 2021-09-30 13:26:31 -07:00
db_impl_open.cc Add (Live)FileStorageInfo API (#8968) 2021-10-16 10:04:32 -07:00
db_impl_readonly.cc Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
db_impl_readonly.h RocksJava - Add errorIfLogFileExists parameter to RocksDB.openReadOnly (#7046) 2020-09-17 15:41:25 -07:00
db_impl_secondary.cc Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
db_impl_secondary.h Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
db_impl_write.cc remove unused local obj and simpilify comple code (#9052) 2021-10-20 14:08:05 -07:00