mirror of https://github.com/facebook/rocksdb.git
22 Commits
Changyu Bi | f97e33454f |
Fix a bug with auto recovery on WAL write error (#12995)
Summary:
A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:
```
0. manual_wal_flush is true. There are multiple CFs in a DB.
1. Submit a write batch with updates to multiple CFs.
2. A FlushWAL or a memtable switch tries to write the buffered WAL data. Fail this write so that the buffered WAL data is dropped:
```
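The message above is cut off mid-sequence. As a rough illustration of the setup it describes, here is a minimal sketch using the real `manual_wal_flush` option, `WriteBatch`, and `DB::FlushWAL()` APIs; `ReproSetup` and its arguments are hypothetical, and the failure injection itself (done by the crash test) is not shown.

```cpp
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

// Sketch of the repro setup: manual WAL flush, one write batch spanning
// multiple column families, then FlushWAL(). If that WAL write fails and
// recovery drops the buffered data, the CFs can become inconsistent.
rocksdb::Status ReproSetup(rocksdb::DB* db,
                           rocksdb::ColumnFamilyHandle* cf1,
                           rocksdb::ColumnFamilyHandle* cf2) {
  // Assumes the DB was opened with options.manual_wal_flush = true.
  rocksdb::WriteBatch batch;
  batch.Put(cf1, "k1", "v1");
  batch.Put(cf2, "k2", "v2");
  rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
  if (!s.ok()) return s;
  // Step 2 of the sequence above: try to persist the buffered WAL data.
  // The crash test injects a failure here.
  return db->FlushWAL(/*sync=*/false);
}
```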
Changyu Bi | a31fe52173 |
Remove the return value of `SetBGError()` (#12792)
Summary: The return value of `ErrorHandler::SetBGError(error)` is not well-defined: it can be `bg_error_` (regardless of whether `bg_error_` was set to the input error), an OK status, or `recovery_error_`.
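A minimal sketch of the signature change the summary describes, using a heavily abbreviated stand-in for the real `ErrorHandler` class; the actual member and parameter lists differ.

```cpp
#include "rocksdb/listener.h"
#include "rocksdb/status.h"

// Abbreviated stand-in for ErrorHandler; real signatures differ (#12792).
class ErrorHandlerSketch {
 public:
  // Before: returned a Status whose meaning was ambiguous (bg_error_,
  // OK, or recovery_error_ depending on the path taken).
  // Status SetBGError(const rocksdb::Status& err,
  //                   rocksdb::BackgroundErrorReason reason);

  // After: no return value; callers query the stored error explicitly.
  void SetBGError(const rocksdb::Status& err,
                  rocksdb::BackgroundErrorReason reason) {
    (void)reason;
    if (bg_error_.ok()) bg_error_ = err;
  }
  const rocksdb::Status& GetBGError() const { return bg_error_; }

 private:
  rocksdb::Status bg_error_;
};
```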
Yu Zhang | ef342246dc |
Consolidate stats recording in error handler (#11992)
Summary: This is a non-functional refactor, mostly deduplicating the stats recording logic in the error handler, plus some documentation updates and simple code dedupe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11992
Test Plan: existing tests
Reviewed By: hx235
Differential Revision: D52967713
Pulled By: jowlyzhang
fbshipit-source-id: d584eae1a06410438f5a4c59c2cb67666ea7de1a
Yu Zhang | 509947ce2c |
Quarantine files in a limbo state after a manifest error (#12030)
Summary: Part of the procedure to handle manifest IO errors is to disable file deletion, in case some files in a limbo state get deleted prematurely. This is not ideal because:

1) Not all the VersionEdits whose commit encounters such an error contain updates for files, so disabling file deletion is sometimes unnecessary.
2) `EnableFileDeletion` has a force mode that could let other threads accidentally disrupt this procedure during recovery.
3) Disabling file deletion as a whole is also not as efficient as more precisely tracking the impacted files.

This PR replaces that mechanism with tracking such files and quarantining them from deletion in `ErrorHandler`. These are the types of files actively tracked in quarantine in this PR:

1) New table files and blob files from a background job.
2) The old manifest file when the CURRENT file creation for its immediately following new manifest file gets into an unclear state. The current handling is not sufficient to make sure the old manifest file is kept in case it's needed.

Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism: it is only updated after successful manifest commits, so it already prevents this premature-deletion issue. We track these files by file number because they share the same file number space.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030
Test Plan: Modified existing unit tests
Reviewed By: ajkr
Differential Revision: D51036774
Pulled By: jowlyzhang
fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716
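A minimal sketch of the quarantine idea, with invented names rather than the actual `ErrorHandler` members: files are tracked by file number, and the obsolete-file purge skips anything quarantined.

```cpp
#include <cstdint>
#include <mutex>
#include <unordered_set>

// Sketch only: names are hypothetical, not the real ErrorHandler members.
class FileQuarantine {
 public:
  // Called while handling a manifest IO error, e.g. for new table/blob
  // files from a background job whose VersionEdit commit failed.
  void Quarantine(uint64_t file_number) {
    std::lock_guard<std::mutex> lock(mu_);
    quarantined_.insert(file_number);
  }
  // Called once recovery has resolved the files' fate.
  void Release(uint64_t file_number) {
    std::lock_guard<std::mutex> lock(mu_);
    quarantined_.erase(file_number);
  }
  // The obsolete-file purge consults this before deleting anything;
  // table, blob, and manifest files share one file-number space.
  bool IsQuarantined(uint64_t file_number) const {
    std::lock_guard<std::mutex> lock(mu_);
    return quarantined_.count(file_number) > 0;
  }

 private:
  mutable std::mutex mu_;
  std::unordered_set<uint64_t> quarantined_;
};
```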
Yu Zhang | 933ee295f4 |
Fix a race condition between recovery and backup (#11955)
Summary: A race condition between recovery and backup can happen with error messages like this:

```
Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory
```

PR https://github.com/facebook/rocksdb/issues/6949 introduced disabling file deletion during error handling of manifest IO errors. The aforementioned race condition is caused by this chain of events:

1. [Backup engine] disable file deletion
2. [Recovery] disable file deletion <= this is optional for the race condition, it may or may not get called
3. [Backup engine] get list of files to copy/link
4. [Recovery] force enable file deletion
5. .... some files referred to by the backup engine get deleted
6. [Backup engine] copy/link file <= error, no file found

This PR fixes this with:

1) The recovery thread currently forces file deletion to be enabled whenever file deletion is disabled, regardless of whether the previous error handling was for a manifest IO error that disabled it in the first place. This means it could incorrectly enable file deletions against the intentions of other threads, like backup threads or file snapshotting threads. This PR does this check explicitly before making the call.

2) `disable_delete_obsolete_files_` is designed as a counter to allow different threads to enable and disable file deletion separately. The recovery thread currently does a force enable, because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error (details per PR https://github.com/facebook/rocksdb/issues/6949), resulting in `DBImpl::DisableFileDeletions` being called multiple times too. A force enable that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. However, as the race condition shows, it can incorrectly suppress another thread's (e.g. a backup thread's) intention to keep file deletion disabled. <strike>This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely</strike>. This PR tracks and caps file deletion enabling/disabling in the error handler.

3) For recovery, the section that finds obsolete files and purges them was moved to run after the attempt to enable file deletion. The actual finding and purging is more likely to happen if file deletion was previously disabled and gets re-enabled now.

An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support changes 2) and 3). Some useful logging was explicitly added to keep those log messages around.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11955
Test Plan: existing unit tests
Reviewed By: anand1976
Differential Revision: D50290592
Pulled By: jowlyzhang
fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b
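A sketch of the counter semantics behind the race, using the public `DisableFileDeletions`/`EnableFileDeletions` API; the `force` flag shown here existed at the time of this PR, and `Illustrate` is a hypothetical driver.

```cpp
#include "rocksdb/db.h"

// Illustration of the pre-fix race using the public file-deletion API.
void Illustrate(rocksdb::DB* db) {
  // Backup engine: bump the disable counter so its file list stays valid.
  db->DisableFileDeletions();  // disable_delete_obsolete_files_ = 1
  // Recovery thread (pre-fix): force-enable resets the counter to zero,
  // silently cancelling the backup engine's protection.
  db->EnableFileDeletions(/*force=*/true);  // counter = 0, files may vanish
  // Backup engine then copies/links files that may already be deleted.
  // The fix: recovery only enables if it disabled, and without force:
  // db->EnableFileDeletions(/*force=*/false);
}
```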
Andrew Kryczka | 77d160ef47 |
Consolidate `ErrorHandler`'s recovery status variables (#11937)
Summary: cbi42 pointed out a race condition in which `recovery_io_error_` and `recovery_error_` could be updated inconsistently due to releasing the DB mutex in `EventHelpers::NotifyOnBackgroundError()`. There doesn't seem to be a point to having two status objects, so this PR consolidates them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11937
Reviewed By: cbi42
Differential Revision: D50105793
Pulled By: ajkr
fbshipit-source-id: 3de95baccfa44351a49a5c2aa0986c9bc81baa8f
Changyu Bi | 1c871a4d86 |
Only flush after recovery for retryable IOError (#11880)
Summary: https://github.com/facebook/rocksdb/issues/11872 causes a unit test to start failing with the error message below. The cause is that the additional call to `FlushAllColumnFamilies()` in `DBImpl::ResumeImpl()` can run while the DB is closing. More detailed explanation: there are two places where we call `ResumeImpl()`:

1. in `ErrorHandler::RecoverFromBGError`, for manual resume or recovery from errors like OutOfSpace through the sst file manager, and
2. in `ErrorHandler::RecoverFromRetryableBGIOError`, for error recovery from errors like flush failure due to a retryable IOError. This is tracked by `ErrorHandler::recovery_thread_`.

Here is how DB close waits for error recovery:
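The summary is truncated at this point. As a purely hypothetical sketch of the close-versus-recovery interaction it is about to describe (names invented, not real DBImpl members):

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of "DB close waits for error recovery"; the real
// DBImpl/ErrorHandler members differ, this only shows the shape.
class RecoveryGate {
 public:
  void BeginRecovery() {
    std::lock_guard<std::mutex> lock(mu_);
    recovery_in_progress_ = true;
  }
  void EndRecovery() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      recovery_in_progress_ = false;
    }
    cv_.notify_all();
  }
  // DB close blocks here so ResumeImpl() cannot run on a closing DB.
  void WaitForRecoveryOnClose() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !recovery_in_progress_; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool recovery_in_progress_ = false;
};
```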
Andrew Kryczka | 5cf6ab6f31 |
Ran clang-format on db/ directory (#10910)
Summary: Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under the db/ directory. Included minor manual touchups mentioned in the PR commit history.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910
Reviewed By: riversand963
Differential Revision: D40880683
Pulled By: ajkr
fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
Wallace | 1e9bf25f61 |
Do not hold mutex when write keys if not necessary (#7516)
Summary:

## Problem Summary
RocksDB acquires the global mutex of the db instance every time the user calls `Write`. When RocksDB schedules a lot of compaction jobs, they compete for the mutex with the write thread, hurting write performance.

## Problem Solution:
Use log_write_mutex to replace the global mutex in most cases, so that we do not acquire it in the write thread unless a write-stall event or a write-buffer-full event occurs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7516
Test Plan:
1. make check
2. CI
3. COMPILE_WITH_TSAN=1 make db_stress
make crash_test
make crash_test_with_multiops_wp_txn
make crash_test_with_multiops_wc_txn
make crash_test_with_atomic_flush
Reviewed By: siying
Differential Revision: D36908702
Pulled By: riversand963
fbshipit-source-id: 59b13881f4f5c0a58fd3ca79128a396d9cd98efe
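A hypothetical sketch of the locking split described above; the member names are invented, and the real DBImpl write path is far more involved.

```cpp
#include <mutex>

// Hypothetical sketch: a dedicated log mutex keeps the common write path
// off the global DB mutex; names are invented, not DBImpl's.
class WritePathSketch {
 public:
  void Write(const char* key, const char* value) {
    {
      // Fast path: only the WAL/log mutex, so writers do not contend
      // with compaction scheduling on the global mutex.
      std::lock_guard<std::mutex> log_lock(log_write_mutex_);
      AppendToWAL(key, value);
      InsertIntoMemtable(key, value);
    }
    if (NeedStallOrSwitch()) {
      // Slow path: a write stall or a full write buffer still takes the
      // global DB mutex.
      std::lock_guard<std::mutex> db_lock(db_mutex_);
      HandleStallOrMemtableSwitch();
    }
  }

 private:
  void AppendToWAL(const char*, const char*) {}
  void InsertIntoMemtable(const char*, const char*) {}
  bool NeedStallOrSwitch() { return false; }
  void HandleStallOrMemtableSwitch() {}
  std::mutex log_write_mutex_;
  std::mutex db_mutex_;
};
```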
anand76 | a88d8795ec |
Expand auto recovery to background read errors (#9679)
Summary: Fix and enhance the background error recovery logic to handle the following situations:

1. Background read errors during flush/compaction (previously resulted in an unrecoverable state)
2. Fix auto recovery failure on read/write errors during atomic flush. It was failing due to a bug in setting the resuming_from_bg_err variable in AtomicFlushMemTablesToOutputFiles.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9679
Test Plan: Add new unit tests in error_handler_fs_test
Reviewed By: riversand963
Differential Revision: D34770097
Pulled By: anand1976
fbshipit-source-id: 136da973a28d684b9c74bdf668519b0cbbbe1742
Zhichao Cao | efaef9b40a |
cleanup error_handler related code (#9098)
Summary: Remove code not in use, add comments, remove redundant code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9098
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D32027219
Pulled By: zhichao-cao
fbshipit-source-id: 253aae926c87726268af6c027bf805dc9156c8a8
storagezhang | 711881bc25 |
Fix some typos in comments (#8066)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8066
Reviewed By: jay-zhuang
Differential Revision: D27280799
Pulled By: mrambacher
fbshipit-source-id: 68f91f5af4ffe0a84be581961bf9366887f47702
Zhichao Cao | 08ec5e7321 |
Add the statistics and info log for Error handler (#8050)
Summary: Add statistics and info log for the error handler: counters for bg error, bg io error, bg retryable io error, auto resume, auto resume total retry, and auto resume success; a histogram for the auto resume retry count in each recovery call.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8050
Test Plan: make check and add test to error_handler_fs_test
Reviewed By: anand1976
Differential Revision: D26990565
Pulled By: zhichao-cao
fbshipit-source-id: 49f71e8ea4e9db8b189943976404205b56ab883f
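These counters are readable through the ordinary Statistics interface. A sketch, assuming the ticker and histogram enum names match the commit's description (verify against statistics.h):

```cpp
#include <iostream>
#include "rocksdb/options.h"
#include "rocksdb/statistics.h"

// Sketch: reading the error-handler stats added in #8050; enum names
// follow the commit description and should be checked against statistics.h.
void DumpErrorHandlerStats(const rocksdb::Options& options) {
  std::shared_ptr<rocksdb::Statistics> stats = options.statistics;
  if (!stats) return;  // set options.statistics = CreateDBStatistics() at Open
  std::cout << "bg io errors: "
            << stats->getTickerCount(rocksdb::ERROR_HANDLER_BG_IO_ERROR_COUNT)
            << "\nauto resume attempts: "
            << stats->getTickerCount(rocksdb::ERROR_HANDLER_AUTORESUME_COUNT)
            << "\n";
  rocksdb::HistogramData hist;
  stats->histogramData(rocksdb::ERROR_HANDLER_AUTORESUME_RETRY_COUNT, &hist);
  std::cout << "avg retries per recovery: " << hist.average << "\n";
}
```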
mrambacher | db03172d08 |
Change ErrorHandler methods to return const Status& (#7539)
Summary: This change eliminates the need for a lot of the PermitUncheckedError calls on return from ErrorHandler methods. The calls are no longer needed as the status is returned as a reference rather than a copy. Additionally, this means that the originating status (recovery_error_, bg_error_) is not cleared implicitly as a result of calling one of these methods.

For this class, I do not know if the proper behavior should be to call PermitUncheckedError in the destructor or if the checked state should be cleared when the status is cleared. I tested both ways. Without the code in the destructor, the status will need to be cleared in at least some of the places where it is set to OK. When running tests, I found no instances where this class was destructed with a non-OK, non-checked Status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7539
Reviewed By: anand1976
Differential Revision: D25340565
Pulled By: pdillinger
fbshipit-source-id: 1730c035c81a475875ea745226112030ec25136c
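A minimal sketch of the pattern, using an invented stand-in class rather than the real ErrorHandler:

```cpp
#include "rocksdb/status.h"

// Invented stand-in, not the real ErrorHandler.
class ErrorTracker {
 public:
  // Returning a copy creates a fresh Status whose "checked" bit is
  // independent, so every caller would need PermitUncheckedError().
  // Returning const Status& hands back the stored object itself, and
  // the originating status is no longer cleared implicitly.
  const rocksdb::Status& GetBGError() const { return bg_error_; }

 private:
  rocksdb::Status bg_error_;
};
```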
Zhichao Cao | c268628c25 |
Map retryable IO error during Flush without WAL to soft error and no switch memtable during resume (#7310)
Summary: In the current implementation, any retryable IO error that happens during Flush is mapped to a hard error. In this case, the DB is stopped and writes are stalled unless the background error is cleared. In this PR, if the WAL is DISABLED, a retryable IO error during Flush is mapped to a soft error, such that the memtable can continue receiving writes. At the same time, if auto resume is triggered, SwitchMemtable will not be called during Flush when resuming the DB, to avoid too many small memtables. Testing cases are added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310
Test Plan: adding new unit test, pass make check.
Reviewed By: anand1976
Differential Revision: D23710892
Pulled By: zhichao-cao
fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9
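The WAL-disabled case corresponds to writes issued with the real `WriteOptions::disableWAL` flag; the helper below is illustrative.

```cpp
#include "rocksdb/db.h"

// Sketch: the "Flush without WAL" case arises when writes skip the WAL.
// With #7310, a retryable IO error while flushing such memtables becomes
// a soft error, so this write path keeps accepting traffic.
rocksdb::Status WriteWithoutWAL(rocksdb::DB* db) {
  rocksdb::WriteOptions wo;
  wo.disableWAL = true;  // durability comes only from memtable flushes
  return db->Put(wo, "key", "value");
}
```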
mrambacher | e9befdebbf |
Add EnvTestWithParam::OptionsTest to the ASSERT_STATUS_CHECKED passes (#7283)
Summary: This test uses database functionality and required more extensive work to get it to pass than the other tests. The DB functionality required for this test now passes the check. When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283
Reviewed By: akankshamahajan15
Differential Revision: D23251497
Pulled By: ajkr
fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
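For context, a small sketch of the status-checking discipline that ASSERT_STATUS_CHECKED builds enforce; `PermitUncheckedError()` is a real `Status` method, and the helper is invented.

```cpp
#include "rocksdb/status.h"

// Sketch: in ASSERT_STATUS_CHECKED builds, destroying a Status that was
// never inspected triggers an assertion.
void StatusDiscipline() {
  rocksdb::Status s = rocksdb::Status::OK();
  if (s.ok()) {
    // checked: calling ok() marks the status as inspected
  }
  rocksdb::Status ignored = rocksdb::Status::OK();
  // Deliberately unchecked: annotate it so the assertion build stays quiet.
  ignored.PermitUncheckedError();
}
```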
Zhichao Cao | a10f12eda1 |
Auto resume the DB from Retryable IO Error (#6765)
Summary: In the current codebase, on the write path, if a retryable IO error happens, SetBGError is called. The retryable IO error is converted to a hard error and the DB is in read-only mode; the user or application needs to resume it. In this PR, if a retryable IO error happens in a DB, SetBGError will create a new thread to call Resume (auto resume). options.max_bgerror_resume_count controls whether auto resume is enabled (if max_bgerror_resume_count <= 0, auto resume is not enabled). options.bgerror_resume_retry_interval controls the time interval before calling Resume again if the previous resume fails due to a retryable IO error. If a non-retryable error happens during resume, auto resume will terminate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
Reviewed By: anand1976
Differential Revision: D21916789
Pulled By: zhichao-cao
fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
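Both knobs are real DBOptions fields; a sketch of enabling auto resume at open time (the helper, path, and values are illustrative):

```cpp
#include <string>
#include "rocksdb/db.h"

// Sketch: enabling auto resume via the DBOptions knobs added in #6765.
rocksdb::Status OpenWithAutoResume(const std::string& path,
                                   rocksdb::DB** db) {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.max_bgerror_resume_count = 10;  // <= 0 disables auto resume
  options.bgerror_resume_retry_interval = 1000000;  // retry gap, microseconds
  return rocksdb::DB::Open(options, path, db);
}
```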
Zhichao Cao | 4246888101 |
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary: In the current code base, we use Status to get and store the returned status from a call. Specifically, for IO-related functions, the current Status cannot reflect IO error details such as error scope, the error's retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of the write path and converted to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and compaction). The second job is to identify a retryable IO error as a HardError and set bg_error_ accordingly. In this case, the DB instance becomes read-only; the user is informed of the Status and needs to take action to deal with it (e.g., call db->Resume()).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
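`IOStatus` and its retryable attribute live in io_status.h, typically produced by FileSystem implementations; a sketch of constructing and inspecting one (helper names invented):

```cpp
#include "rocksdb/io_status.h"

// Sketch: IOStatus carries retryability and scope that plain Status lost
// on the old write path.
rocksdb::IOStatus TransientWriteFailure() {
  rocksdb::IOStatus s = rocksdb::IOStatus::IOError("transient device error");
  s.SetRetryable(true);  // #6487 maps this to a hard error in BG jobs
  return s;
}

bool IsRetryable(const rocksdb::IOStatus& s) {
  return !s.ok() && s.GetRetryable();
}
```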
sdong | fdf882ded2 |
Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for users to solve the problem, the RocksDB namespace is changed to a macro that can be overridden at build time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE overridden to another value.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
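The mechanism is a preprocessor default that a build can override; a sketch of the idea, simplified from what rocksdb/rocksdb_namespace.h does:

```cpp
// Simplified sketch: default the macro when the build does not define it.
#ifndef ROCKSDB_NAMESPACE
#define ROCKSDB_NAMESPACE rocksdb
#endif

namespace ROCKSDB_NAMESPACE {
class DB;  // every library symbol lives under the chosen namespace
}  // namespace ROCKSDB_NAMESPACE

// A binary that must not collide with another linked-in RocksDB can build
// the library with, e.g.: -DROCKSDB_NAMESPACE=my_vendored_rocksdb
```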
Dmitry Alimov | e13d8dcbbb |
Fix typos in comments (#4456)
Summary: Fix some typos in the comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/4456
Differential Revision: D10209214
Pulled By: miasantreble
fbshipit-source-id: dff857ba60396bc95126e635db96d7dc8330d2cb
Anand Ananthabhotla | a27fce408e |
Auto recovery from out of space errors (#4164)
Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows:

1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accommodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to a persistent low disk space condition.
2. Errors during write callback and flush are treated as hard errors, i.e. the database is put in read-only mode and goes back to read-write only after certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows:
   a) On the first occurrence of an out-of-space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes.
   b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in-progress compactions when the first error occurred.
   c) If the out-of-space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it back in read-write mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM.
4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recovery in the start callback, and later initiate it manually by calling DB::Resume().

Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
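`NewSstFileManager` and `Options::sst_file_manager` are real APIs; a sketch of sharing one SFM across two DBs on the same storage container, per the design above (paths and helper invented):

```cpp
#include <memory>
#include "rocksdb/db.h"
#include "rocksdb/env.h"
#include "rocksdb/sst_file_manager.h"

// Sketch: one SstFileManager shared by all DBs on a storage container, so
// out-of-space recovery sees a single view of free space.
rocksdb::Status OpenTwoDBs(rocksdb::DB** db1, rocksdb::DB** db2) {
  std::shared_ptr<rocksdb::SstFileManager> sfm(
      rocksdb::NewSstFileManager(rocksdb::Env::Default()));
  rocksdb::Options options;
  options.create_if_missing = true;
  options.sst_file_manager = sfm;  // same SFM => coordinated recovery
  rocksdb::Status s = rocksdb::DB::Open(options, "/data/db1", db1);
  if (!s.ok()) return s;
  return rocksdb::DB::Open(options, "/data/db2", db2);
}
```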
Anand Ananthabhotla | 52d4c9b7f6 |
Allow DB resume after background errors (#3997)
Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts:

1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance

This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.

Closes https://github.com/facebook/rocksdb/pull/3997
Differential Revision: D8653831
Pulled By: anand1976
fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
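`Status::Severity` and `DB::Resume()` are the public pieces introduced here; the handling policy in this sketch is illustrative, not RocksDB's actual logic.

```cpp
#include "rocksdb/db.h"

// Sketch: reacting to a background error with the severity added in #3997
// and the resume API.
void HandleBackgroundError(rocksdb::DB* db,
                           const rocksdb::Status& bg_error) {
  if (bg_error.severity() <= rocksdb::Status::kHardError) {
    // Soft/hard errors (e.g. NoSpace during flush/compaction) may clear
    // once the underlying condition resolves; try to resume the DB.
    rocksdb::Status s = db->Resume();
    if (s.ok()) return;
    s.PermitUncheckedError();
  }
  // More severe errors: the DB must be closed and reopened.
}
```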