Changyu Bi b927ba5936 Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
When atomic_flush=false, there are certain cases where we try to install memtable flush results that reference already-deleted SST files. This can happen with the following sequence of events:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush1 returns OK, but does not install its result to MANIFEST and lets whoever flushes M0 take care of it
Flush0 finishes with a retryable IOError; it rolls back M0, (incorrectly) does not roll back M1, and deletes SST0 and SST1
Start Flush2 for M0; it does not pick up M1 since it considers M1 already flushed
Flush2 writes SST2 and finishes OK, then tries to install SST2 and SST1
Error opening SST1 since it is already deleted, with an error message like the following:

IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```

This happens since:
1. We currently only roll back the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted, since a pending file number is released whenever a flush job finishes, regardless of flush status: f42e70bf56/db/db_impl/db_impl_compaction_flush.cc (L3161)

This PR fixes the issue by rolling back these pending flushes.
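
The sequence above can be illustrated with a small standalone toy model. This is only a sketch under stated assumptions: `ToyDB`, `ToyMemtable`, `MemState`, and `RunScenario` are hypothetical names invented for illustration and do not correspond to RocksDB classes or to the code changed in this PR. The model assumes that a failed flush deletes every pending output file, and that a retry writes all pending memtables into one new file.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical, not RocksDB classes.
enum class MemState { kPending, kFlushInProgress, kFlushCompleted };

struct ToyMemtable {
  std::string name;
  MemState state;
  int sst;  // output file number, or -1 if the memtable has no output
};

struct ToyDB {
  std::vector<ToyMemtable> mems;  // oldest memtable first
  std::set<int> files_on_disk;
  int next_file = 0;
  bool rollback_all_pending;  // true models the behavior the fix describes

  explicit ToyDB(bool fix) : rollback_all_pending(fix) {}

  void AddMemtable(const std::string& name) {
    mems.push_back(ToyMemtable{name, MemState::kPending, -1});
  }

  // Begin writing an output file for memtable i.
  void StartFlush(std::size_t i) {
    mems[i].sst = next_file++;
    files_on_disk.insert(mems[i].sst);
    mems[i].state = MemState::kFlushInProgress;
  }

  // The file is fully written, but its installation is deferred.
  void CompleteFlush(std::size_t i) { mems[i].state = MemState::kFlushCompleted; }

  // The flush of memtable `failed` hits a retryable error.
  void FailFlush(std::size_t failed) {
    // All pending outputs are deleted, whichever flush wrote them.
    for (const auto& m : mems) {
      if (m.sst >= 0) files_on_disk.erase(m.sst);
    }
    for (std::size_t i = 0; i < mems.size(); ++i) {
      if (i == failed || rollback_all_pending) {
        mems[i].state = MemState::kPending;
        mems[i].sst = -1;
      }
    }
  }

  // Retry: write every pending memtable into one new file, then install all
  // results. Returns false if a result references a file that is gone.
  bool RetryAndInstall() {
    int file = -1;
    for (auto& m : mems) {
      if (m.state != MemState::kPending) continue;
      if (file < 0) {
        file = next_file++;
        files_on_disk.insert(file);
      }
      m.sst = file;
      m.state = MemState::kFlushCompleted;
    }
    for (const auto& m : mems) {
      if (files_on_disk.count(m.sst) == 0) return false;
    }
    return true;
  }
};

// Replays the Flush0/Flush1/Flush2 sequence; returns whether install succeeds.
bool RunScenario(bool fix) {
  ToyDB db(fix);
  db.AddMemtable("M0");
  db.AddMemtable("M1");
  db.StartFlush(0);     // Flush0 for M0
  db.StartFlush(1);     // Flush1 for M1
  db.CompleteFlush(1);  // Flush1 returns OK, installation deferred
  db.FailFlush(0);      // Flush0 fails with a retryable error
  return db.RetryAndInstall();  // Flush2
}
```

In this model, `RunScenario(false)` returns `false` because M1 still points at its deleted output file, while `RunScenario(true)` returns `true` because both memtables are rolled back and rewritten together. The single shared output file in the retry also hints at why the message later notes that many memtables flushed together can produce a large file.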

There is another issue: if a flush for a new memtable starts and finishes after Flush0 finishes, its output may also be deleted (see the unit test for details). This is fixed by checking the background error status before installing a memtable result, and rolling back if there is an error.

There is a more efficient fix where we just don't release the pending output file number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (it requires associating such pending file numbers with flush jobs/memtables) and is riskier since it changes the normal flush code path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865

Test Plan: * Added repro unit tests.

Reviewed By: anand1976

Differential Revision: D49484922

Pulled By: cbi42

fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 15:31:29 -07:00
.circleci Circleci macos sunset (#11633) 2023-08-21 11:53:40 -07:00
.github/workflows Fix failed CI job "Check buck targets and code format" (#11532) 2023-06-13 14:20:11 -07:00
buckifier Del `(object)` from 200 inc instagram-server/distillery/slipstream/thrift_models/StoryFeedMediaSticker/ttypes.py 2023-08-25 16:22:09 -07:00
build_tools Del `(object)` from 200 inc instagram-server/distillery/slipstream/thrift_models/StoryFeedMediaSticker/ttypes.py 2023-08-25 16:22:09 -07:00
cache Disable compressed secondary cache if capacity is 0 (#11863) 2023-09-20 22:30:17 -07:00
cmake gcc-11 and cmake related cleanup (#9286) 2021-12-17 17:04:35 -08:00
coverage Remove platform009 and default to platform010 (#11333) 2023-03-30 09:56:37 -07:00
db Rollback other pending memtable flushes when a flush fails (#11865) 2023-09-21 15:31:29 -07:00
db_stress_tool Initialize FaultInjectionTestFS DirectWritable field (#11862) 2023-09-19 12:23:38 -07:00
docs Fix typo in twitter link (#11529) 2023-06-12 15:26:13 -07:00
env Add SystemClock::TimedWait() function (#11753) 2023-08-29 18:39:10 -07:00
examples Remove RocksDB LITE (#11147) 2023-01-27 13:14:19 -08:00
file Fix Assertion `roundup_len2 >= alignment' failed in crash tests (#11852) 2023-09-20 16:13:20 -07:00
fuzz Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
include/rocksdb Expose more info about input files in `CompactionFilter::Context` (#11857) 2023-09-20 13:34:39 -07:00
java Implement trimming of readhead size when upper bound is specified (#11684) 2023-08-18 15:52:04 -07:00
logging Disabling some IO error assertion in EnvLogger (#11314) 2023-03-20 13:23:29 -07:00
memory cache_bench enhancements for jemalloc etc. (#11758) 2023-08-24 19:14:38 -07:00
memtable remove redundant move (#11418) 2023-05-03 09:37:21 -07:00
microbench Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666) 2023-08-18 15:01:59 -07:00
monitoring GetEntity Support for ReadOnlyDB and SecondaryDB (#11799) 2023-09-15 08:30:44 -07:00
options Make RibbonFilterPolicy::bloom_before_level mutable (SetOptions()) (#11838) 2023-09-15 15:46:10 -07:00
plugin Add initial CMake support to plugin (#9214) 2021-11-30 17:16:53 -08:00
port Suppress TSAN reports on AutoHyperClockTable::Lookup (#11806) 2023-09-08 10:50:47 -07:00
table Fix row cache falsely return kNotFound when timestamp enabled (#11816) 2023-09-20 11:34:38 -07:00
test_util Add a unit test for the fix in #11763 (#11810) 2023-09-11 12:54:50 -07:00
third-party fix optimization-disabled test builds with platform010 (#11361) 2023-04-10 13:59:44 -07:00
tools Fix stress test failure due to write fault injections and disable write fault injection (#11859) 2023-09-19 08:33:05 -07:00
trace_replay Fix error maybe-uninitialized #11100 (#11101) 2023-01-19 13:59:48 -08:00
unreleased_history Rollback other pending memtable flushes when a flush fails (#11865) 2023-09-21 15:31:29 -07:00
util LZ4 set acceleration parameter (#11844) 2023-09-18 09:26:29 -07:00
utilities Use *next_sequence -1 here (#11861) 2023-09-21 13:52:01 -07:00
.clang-format
.gitignore Add .arcconfig to .gitignore (fb internal use) (#11803) 2023-09-07 14:57:39 -07:00
.lgtm.yml Create lgtm.yml for LGTM.com C/C++ analysis (#4058) 2018-06-26 12:43:04 -07:00
.watchmanconfig Added .watchmanconfig file to rocksdb repo (#5593) 2019-07-19 15:00:33 -07:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt cmake: check PORTABLE for well-known boolean representations (#11724) 2023-09-18 12:11:15 -07:00
CODE_OF_CONDUCT.md Adopt Contributor Covenant 2019-08-29 23:21:01 -07:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363) 2022-01-18 17:31:03 -08:00
DUMP_FORMAT.md First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
HISTORY.md Remove "rocksdb.file.read.db.open.micros" typo from 8.6 HISTORY (#11839) 2023-09-14 16:07:59 -07:00
INSTALL.md Simplify detection of x86 CPU features (#11419) 2023-05-09 22:25:45 -07:00
LANGUAGE-BINDINGS.md Add grocksdb in Go language bindings (#10498) 2022-08-23 15:02:10 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Wide Column support in ldb (#11754) 2023-08-30 12:45:52 -07:00
PLUGINS.md Added encryption plugin based on Intel open-source ipp-crypto library (#11429) 2023-05-08 12:13:43 -07:00
README.md Remove deprecated integration tests from README.md (#11354) 2023-04-07 16:52:50 -07:00
TARGETS Wide Column support in ldb (#11754) 2023-08-30 12:45:52 -07:00
USERS.md Add Apache Kvrocks RocksDB use case in USERS.md (#11779) 2023-09-01 23:39:41 -07:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md Update branch name in WINDOWS_PORT.md (#8745) 2021-09-01 19:26:39 -07:00
common.mk Clean up variables for temporary directory (#9961) 2022-05-06 16:38:06 -07:00
crash_test.mk Stress/Crash Test for OptimisticTransactionDB (#11513) 2023-06-17 16:27:37 -07:00
issue_template.md Add Google Group to Issue Template 2020-01-28 14:40:37 -08:00
rocksdb.pc.in build: fix pkg-config file generation (#9953) 2022-05-30 12:46:40 -07:00
src.mk Wide Column support in ldb (#11754) 2023-08-30 12:45:52 -07:00
thirdparty.inc Fix build jemalloc api (#5470) 2019-06-24 17:40:32 -07:00

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.