Go to file
Peter Dillinger 7a1b0207e6 format_version=6 and context-aware block checksums (#9058)
Summary:
## Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without `block_align=true` the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.

To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example
* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
* files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.

## Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.

## More detail: making space in footer
In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:
* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)

## Performance
In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).

Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).

```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```

Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness

Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058

Test Plan:
unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.

Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably *usually* true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.

The format compatibility test does not need updating because we have not updated the default
format_version.

Reviewed By: ajkr, mrambacher

Differential Revision: D33100915

Pulled By: pdillinger

fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
2023-07-30 16:40:01 -07:00
.circleci Revert enabling IO uring in db_stress (#11242) 2023-02-21 12:53:55 -08:00
.github/workflows Fix failed CI job "Check buck targets and code format" (#11532) 2023-06-13 14:20:11 -07:00
buckifier Fix duplicate symbols in linking with buck2 (#11421) 2023-05-01 10:45:36 -07:00
build_tools Repair/instate jemalloc build on M1 (#11257) 2023-05-22 11:06:41 -07:00
cache Even more HyperClockCache refactoring (#11630) 2023-07-24 09:36:09 -07:00
cmake gcc-11 and cmake related cleanup (#9286) 2021-12-17 17:04:35 -08:00
coverage Remove platform009 and default to platform010 (#11333) 2023-03-30 09:56:37 -07:00
db format_version=6 and context-aware block checksums (#9058) 2023-07-30 16:40:01 -07:00
db_stress_tool Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653) 2023-07-28 16:25:29 -07:00
docs Fix typo in twitter link (#11529) 2023-06-12 15:26:13 -07:00
env format_version=6 and context-aware block checksums (#9058) 2023-07-30 16:40:01 -07:00
examples Remove RocksDB LITE (#11147) 2023-01-27 13:14:19 -08:00
file Move prefetching responsibility to page cache for compaction read under non directIO usecase (#11631) 2023-07-21 14:52:52 -07:00
fuzz Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
include/rocksdb format_version=6 and context-aware block checksums (#9058) 2023-07-30 16:40:01 -07:00
java Support switching on / off UDT together with in-Memtable-only feature (#11623) 2023-07-26 20:16:32 -07:00
logging Disabling some IO error assertion in EnvLogger (#11314) 2023-03-20 13:23:29 -07:00
memory Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
memtable remove redundant move (#11418) 2023-05-03 09:37:21 -07:00
microbench Small improvements to DBGet microbenchmark (#11498) 2023-06-02 16:39:14 -07:00
monitoring Add a ticker to track number of trash files deleted in background thread (#11540) 2023-06-16 10:05:25 -07:00
options Compare the number of input keys and processed keys for compactions (#11571) 2023-07-28 09:47:31 -07:00
plugin Add initial CMake support to plugin (#9214) 2021-11-30 17:16:53 -08:00
port Add OpenBSD Support (#11255) 2023-06-27 11:58:29 -07:00
table format_version=6 and context-aware block checksums (#9058) 2023-07-30 16:40:01 -07:00
test_util format_version=6 and context-aware block checksums (#9058) 2023-07-30 16:40:01 -07:00
third-party fix optimization-disabled test builds with platform010 (#11361) 2023-04-10 13:59:44 -07:00
tools format_version=6 and context-aware block checksums (#9058) 2023-07-30 16:40:01 -07:00
trace_replay Fix error maybe-uninitialized #11100 (#11101) 2023-01-19 13:59:48 -08:00
unreleased_history format_version=6 and context-aware block checksums (#9058) 2023-07-30 16:40:01 -07:00
util Compare the number of input keys and processed keys for compactions (#11571) 2023-07-28 09:47:31 -07:00
utilities Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653) 2023-07-28 16:25:29 -07:00
.clang-format
.gitignore Tweak on IsTrivialMove() (#11467) 2023-05-26 16:40:50 -07:00
.lgtm.yml
.watchmanconfig
AUTHORS
CMakeLists.txt Allow rocksdb library to be usable with CMake's `FetchContent` API (#11575) 2023-07-19 10:39:30 -07:00
CODE_OF_CONDUCT.md
CONTRIBUTING.md
COPYING
DEFAULT_OPTIONS_HISTORY.md Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363) 2022-01-18 17:31:03 -08:00
DUMP_FORMAT.md
HISTORY.md Strip leading and trailing whitespace for unreleased_history entries (#11652) 2023-07-28 14:57:07 -07:00
INSTALL.md Simplify detection of x86 CPU features (#11419) 2023-05-09 22:25:45 -07:00
LANGUAGE-BINDINGS.md Add grocksdb in Go language bindings (#10498) 2022-08-23 15:02:10 -07:00
LICENSE.Apache
LICENSE.leveldb
Makefile Add utils to use for handling user defined timestamp size record in WAL (#11451) 2023-05-22 14:28:58 -07:00
PLUGINS.md Added encryption plugin based on Intel open-source ipp-crypto library (#11429) 2023-05-08 12:13:43 -07:00
README.md Remove deprecated integration tests from README.md (#11354) 2023-04-07 16:52:50 -07:00
TARGETS Add utils to use for handling user defined timestamp size record in WAL (#11451) 2023-05-22 14:28:58 -07:00
USERS.md Add more users (#11369) 2023-05-03 10:47:41 -07:00
Vagrantfile
WINDOWS_PORT.md Update branch name in WINDOWS_PORT.md (#8745) 2021-09-01 19:26:39 -07:00
common.mk Clean up variables for temporary directory (#9961) 2022-05-06 16:38:06 -07:00
crash_test.mk Stress/Crash Test for OptimisticTransactionDB (#11513) 2023-06-17 16:27:37 -07:00
issue_template.md
rocksdb.pc.in build: fix pkg-config file generation (#9953) 2022-05-30 12:46:40 -07:00
src.mk Add utils to use for handling user defined timestamp size record in WAL (#11451) 2023-05-22 14:28:58 -07:00
thirdparty.inc

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

CircleCI Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.