Go to file
Little-Wallace ec3e3c3e02 Fix corruption with intra-L0 on ingested files (#5958)
Summary:
## Problem Description

Our process was abort when it call `CheckConsistency`. And the information in  `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ".  Here are the causes of the accident I investigated.

* RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested.
* When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number.
* `CheckConsistency`  determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal.
* If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst,  the `smallest_seqno` of this new file will be smaller than his `largest_seqno`.
    * If more than one ingested file was ingested before memtable schedule flush,  and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable.  So `CheckConsistency` will return a `Corruption`.
    * If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start,  but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2).  **But there was still some data in s1  written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1.** Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files:
    > s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno  < s1.largest_seqno

So I skip pick sst file which was ingested in function `FindIntraL0Compaction `

## Reason

Here is my bug report: https://github.com/facebook/rocksdb/issues/5913

There are two situations that can cause the check to fail.

### First situation:
- First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest.

- If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst.

- Then some data was put into memtable, and memtable was flushed to L0. We called this sst B.
- RocksDB check consistency , and find the `smallest_seqno` of B is  less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested.

### Secondary situaion

- First we have flushed many sst in L0,  we call them [s1, s2, s3].

- There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1.  And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1.

- m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5.

- [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction.  We call it s6.
  - compacted 4@0 files to L0
- When s6 is added into manifest,  the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5.  But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5.
   - s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958

Differential Revision: D18601316

fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
2019-11-19 15:09:11 -08:00
buckifier Abandon use of folly::Optional (#6036) 2019-11-14 14:04:15 -08:00
build_tools New Bloom filter implementation for full and partitioned filters (#6007) 2019-11-13 16:44:01 -08:00
cache Misc hashing updates / upgrades (#5909) 2019-10-24 17:16:46 -07:00
cmake cmake: s/SNAPPY_LIBRARIES/snappy_LIBRARIES/ (#5687) 2019-08-16 15:49:23 -07:00
coverage Fix interpreter lines for files with python2-only syntax. 2019-07-09 10:51:37 -07:00
db Fix corruption with intra-L0 on ingested files (#5958) 2019-11-19 15:09:11 -08:00
docs Blog post for write_unprepared (#5711) 2019-08-15 14:41:13 -07:00
env Add Env::SanitizeEnvOptions (#5885) 2019-10-14 12:25:00 -07:00
examples Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
file Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
hdfs Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
include/rocksdb More fixes to auto-GarbageCollect in BackupEngine (#6023) 2019-11-14 06:20:18 -08:00
java fix typo (#6025) 2019-11-13 11:02:28 -08:00
logging Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
memory Charge block cache for cache internal usage (#5797) 2019-09-16 15:26:21 -07:00
memtable Misc hashing updates / upgrades (#5909) 2019-10-24 17:16:46 -07:00
monitoring Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
options Apply formatter to recent 200+ commits. (#5830) 2019-09-20 12:04:26 -07:00
port Fix block cache ID uniqueness for Windows builds (#5844) 2019-10-11 18:19:31 -07:00
table Remove a few unnecessary includes 2019-11-19 08:20:42 -08:00
test_util New Bloom filter implementation for full and partitioned filters (#6007) 2019-11-13 16:44:01 -08:00
third-party Refactor/consolidate legacy Bloom implementation details (#5784) 2019-09-16 16:17:09 -07:00
tools db_stress sometimes generates keys close to SST file boundaries (#6037) 2019-11-19 13:17:03 -08:00
trace_replay Misc hashing updates / upgrades (#5909) 2019-10-24 17:16:46 -07:00
util New Bloom filter implementation for full and partitioned filters (#6007) 2019-11-13 16:44:01 -08:00
utilities Mark blob files not needed by any memtables/SSTs obsolete (#6032) 2019-11-18 16:30:06 -08:00
.clang-format
.gitignore Make buckifier python3 compatible (#5922) 2019-10-23 13:52:27 -07:00
.lgtm.yml Create lgtm.yml for LGTM.com C/C++ analysis (#4058) 2018-06-26 12:43:04 -07:00
.travis.yml Partial rebalance of TEST_GROUPs for Travis (#6010) 2019-11-07 09:50:59 -08:00
.watchmanconfig Added .watchmanconfig file to rocksdb repo (#5593) 2019-07-19 15:00:33 -07:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Abandon use of folly::Optional (#6036) 2019-11-14 14:04:15 -08:00
CODE_OF_CONDUCT.md Adopt Contributor Covenant 2019-08-29 23:21:01 -07:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING Add GPLv2 as an alternative license. 2017-04-27 18:06:12 -07:00
DEFAULT_OPTIONS_HISTORY.md options.delayed_write_rate use the rate of rate_limiter by default. 2017-05-24 09:58:24 -07:00
DUMP_FORMAT.md
HISTORY.md Fix corruption with intra-L0 on ingested files (#5958) 2019-11-19 15:09:11 -08:00
INSTALL.md Update the version of the dependencies used by the RocksJava static build (#4761) 2018-12-18 20:25:43 -08:00
LANGUAGE-BINDINGS.md LANGUAGE-BINDINGS.md: mention python-rocksdb 2019-03-20 11:10:48 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Abandon use of folly::Optional (#6036) 2019-11-14 14:04:15 -08:00
README.md Replaced some words (#5877) 2019-10-07 12:28:09 -07:00
ROCKSDB_LITE.md Fix some typos in comments and docs. 2018-03-08 10:27:25 -08:00
TARGETS Abandon use of folly::Optional (#6036) 2019-11-14 14:04:15 -08:00
USERS.md Add avrio to USERS.md (#5748) 2019-09-15 21:29:09 -07:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md #5145 , rename port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output (#5152) 2019-04-04 11:38:19 -07:00
appveyor.yml New API to get all merge operands for a Key (#5604) 2019-08-06 14:26:44 -07:00
defs.bzl Add clarifying/instructive header to TARGETS and defs.bzl 2019-11-05 20:20:33 -08:00
issue_template.md Add a template for issues 2017-09-29 11:41:28 -07:00
src.mk FilterPolicy consolidation, part 2/2 (#5966) 2019-10-24 15:44:51 -07:00
thirdparty.inc Fix build jemalloc api (#5470) 2019-06-24 17:40:32 -07:00

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

Linux/Mac Build Status Windows Build status PPC64le Build Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.