mirror of
https://github.com/facebook/rocksdb.git
synced 2024-11-29 18:33:58 +00:00
7a1b0207e6
Summary: ## Context checksum All RocksDB checksums currently use 32 bits of checking power, which should be 1 in 4 billion false negative (FN) probability (failing to detect corruption). This is true for random corruptions, and in some cases small corruptions are guaranteed to be detected. But some possible corruptions, such as in storage metadata rather than storage payload data, would have a much higher FN rate. For example: * Data larger than one SST block is replaced by data from elsewhere in the same or another SST file. Especially with block_align=true, the probability of exact block size match is probably around 1 in 100, making the FN probability around that same. Without `block_align=true` the probability of same block start location is probably around 1 in 10,000, for FN probability around 1 in a million. To solve this problem in new format_version=6, we add "context awareness" to block checksum checks. The stored and expected checksum value is modified based on the block's position in the file and which file it is in. The modifications are cleverly chosen so that, for example * blocks within about 4GB of each other are guaranteed to use different context * blocks that are offset by exactly some multiple of 4GiB are guaranteed to use different context * files generated by the same process are guaranteed to use different context for the same offsets, until wrap-around after 2^32 - 1 files Thus, with format_version=6, if a valid SST block and checksum is misplaced, its checksum FN probability should be essentially ideal, 1 in 4B. ## Footer checksum This change also adds checksum protection to the SST footer (with format_version=6), for the first time without relying on whole file checksum. To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to defeat the footer checksum, we change much of the footer data format including an "extended magic number" in format_version 6 that would be interpreted as empty index and metaindex block handles in older footer versions. We also change the encoding of handles to free up space for other new data in footer. ## More detail: making space in footer In order to keep footer the same size in format_version=6 (avoid change to IO patterns), we have to free up some space for new data. We do this two ways: * Metaindex block handle is encoded down to 4 bytes (from 10) by assuming it immediately precedes the footer, and by assuming it is < 4GB. * Index block handle is moved into metaindex. (I don't know why it was in footer to begin with.) ## Performance In case of small performance penalty, I've made a "pay as you go" optimization to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep with the only field used in that structure after construction: `prefix_extractor`. This makes the PR an overall performance improvement (results below). Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6, even including that improvement for both. That's based on extreme case table write performance testing, many files with many blocks. This is relatively checksum intensive (small blocks) and salt generation intensive (small files). ``` (for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out ``` Each value below is ops/s averaged over 100 runs, run simultaneously with competing configuration for load fairness Before -> after (both fv=5): 483530 -> 483673 (negligible) Re-run 1: 480733 -> 485427 (1.0% faster) Re-run 2: 483821 -> 484541 (0.1% faster) Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster) Re-run 1: 482212 -> 485075 (0.6% faster) Re-run 2: 483590 -> 484073 (0.1% faster) After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster) Re-run 1: 485331 -> 483385 (0.4% slower) Re-run 2: 485283 -> 483435 (0.4% slower) Re-run 3: 483647 -> 486109 (0.5% faster) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058 Test Plan: unit tests included (table_test, db_properties_test, salt in env_test). General DB tests and crash test updated to test new format_version. Also temporarily updated the default format version to 6 and saw some test failures. Almost all were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum, though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block checksum, just assuming it was verified in opening the index reader (probably *usually* true but probably not always true). Some other concerns about VerifyChecksum are left in FIXME comments. The only remaining test failure on change of default (in block_fetcher_test) now has a comment about how to upgrade the test. The format compatibility test does not need updating because we have not updated the default format_version. Reviewed By: ajkr, mrambacher Differential Revision: D33100915 Pulled By: pdillinger fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
174 lines
7.2 KiB
C++
174 lines
7.2 KiB
C++
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
#pragma once
|
|
|
|
#include <map>
|
|
#include <memory>
|
|
#include <string>
|
|
#include <vector>
|
|
|
|
#include "db/builder.h"
|
|
#include "db/table_properties_collector.h"
|
|
#include "rocksdb/comparator.h"
|
|
#include "rocksdb/memory_allocator.h"
|
|
#include "rocksdb/options.h"
|
|
#include "rocksdb/slice.h"
|
|
#include "table/block_based/block_builder.h"
|
|
#include "table/block_based/block_type.h"
|
|
#include "table/format.h"
|
|
#include "util/kv_map.h"
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
class BlockBuilder;
|
|
class BlockHandle;
|
|
class Env;
|
|
class Footer;
|
|
class Logger;
|
|
class RandomAccessFile;
|
|
struct TableProperties;
|
|
|
|
// Meta block names for metaindex
|
|
extern const std::string kPropertiesBlockName;
|
|
extern const std::string kIndexBlockName;
|
|
extern const std::string kPropertiesBlockOldName;
|
|
extern const std::string kCompressionDictBlockName;
|
|
extern const std::string kRangeDelBlockName;
|
|
|
|
class MetaIndexBuilder {
|
|
public:
|
|
MetaIndexBuilder(const MetaIndexBuilder&) = delete;
|
|
MetaIndexBuilder& operator=(const MetaIndexBuilder&) = delete;
|
|
|
|
MetaIndexBuilder();
|
|
void Add(const std::string& key, const BlockHandle& handle);
|
|
|
|
// Write all the added key/value pairs to the block and return the contents
|
|
// of the block.
|
|
Slice Finish();
|
|
|
|
private:
|
|
// store the sorted key/handle of the metablocks.
|
|
stl_wrappers::KVMap meta_block_handles_;
|
|
std::unique_ptr<BlockBuilder> meta_index_block_;
|
|
};
|
|
|
|
class PropertyBlockBuilder {
|
|
public:
|
|
PropertyBlockBuilder(const PropertyBlockBuilder&) = delete;
|
|
PropertyBlockBuilder& operator=(const PropertyBlockBuilder&) = delete;
|
|
|
|
PropertyBlockBuilder();
|
|
|
|
void AddTableProperty(const TableProperties& props);
|
|
void Add(const std::string& key, uint64_t value);
|
|
void Add(const std::string& key, const std::string& value);
|
|
void Add(const UserCollectedProperties& user_collected_properties);
|
|
|
|
// Write all the added entries to the block and return the block contents
|
|
Slice Finish();
|
|
|
|
private:
|
|
std::unique_ptr<BlockBuilder> properties_block_;
|
|
stl_wrappers::KVMap props_;
|
|
};
|
|
|
|
// Were we encounter any error occurs during user-defined statistics collection,
|
|
// we'll write the warning message to info log.
|
|
void LogPropertiesCollectionError(Logger* info_log, const std::string& method,
|
|
const std::string& name);
|
|
|
|
// Utility functions help table builder to trigger batch events for user
|
|
// defined property collectors.
|
|
// Return value indicates if there is any error occurred; if error occurred,
|
|
// the warning message will be logged.
|
|
// NotifyCollectTableCollectorsOnAdd() triggers the `Add` event for all
|
|
// property collectors.
|
|
bool NotifyCollectTableCollectorsOnAdd(
|
|
const Slice& key, const Slice& value, uint64_t file_size,
|
|
const std::vector<std::unique_ptr<IntTblPropCollector>>& collectors,
|
|
Logger* info_log);
|
|
|
|
void NotifyCollectTableCollectorsOnBlockAdd(
|
|
const std::vector<std::unique_ptr<IntTblPropCollector>>& collectors,
|
|
uint64_t block_uncomp_bytes, uint64_t block_compressed_bytes_fast,
|
|
uint64_t block_compressed_bytes_slow);
|
|
|
|
// NotifyCollectTableCollectorsOnFinish() triggers the `Finish` event for all
|
|
// property collectors. The collected properties will be added to `builder`.
|
|
bool NotifyCollectTableCollectorsOnFinish(
|
|
const std::vector<std::unique_ptr<IntTblPropCollector>>& collectors,
|
|
Logger* info_log, PropertyBlockBuilder* builder);
|
|
|
|
// Read table properties from a file using known BlockHandle.
|
|
// @returns a status to indicate if the operation succeeded. On success,
|
|
// *table_properties will point to a heap-allocated TableProperties
|
|
// object, otherwise value of `table_properties` will not be modified.
|
|
Status ReadTablePropertiesHelper(
|
|
const ReadOptions& ro, const BlockHandle& handle,
|
|
RandomAccessFileReader* file, FilePrefetchBuffer* prefetch_buffer,
|
|
const Footer& footer, const ImmutableOptions& ioptions,
|
|
std::unique_ptr<TableProperties>* table_properties,
|
|
MemoryAllocator* memory_allocator = nullptr);
|
|
|
|
// Read table properties from the properties block of a plain table.
|
|
// @returns a status to indicate if the operation succeeded. On success,
|
|
// *table_properties will point to a heap-allocated TableProperties
|
|
// object, otherwise value of `table_properties` will not be modified.
|
|
Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size,
|
|
uint64_t table_magic_number,
|
|
const ImmutableOptions& ioptions,
|
|
const ReadOptions& read_options,
|
|
std::unique_ptr<TableProperties>* properties,
|
|
MemoryAllocator* memory_allocator = nullptr,
|
|
FilePrefetchBuffer* prefetch_buffer = nullptr);
|
|
|
|
// Find the meta block from the meta index block. Returns OK and
|
|
// block_handle->IsNull() if not found.
|
|
Status FindOptionalMetaBlock(InternalIterator* meta_index_iter,
|
|
const std::string& meta_block_name,
|
|
BlockHandle* block_handle);
|
|
|
|
// Find the meta block from the meta index block. Returns Corruption if not
|
|
// found.
|
|
Status FindMetaBlock(InternalIterator* meta_index_iter,
|
|
const std::string& meta_block_name,
|
|
BlockHandle* block_handle);
|
|
|
|
// Find the meta block
|
|
Status FindMetaBlockInFile(RandomAccessFileReader* file, uint64_t file_size,
|
|
uint64_t table_magic_number,
|
|
const ImmutableOptions& ioptions,
|
|
const ReadOptions& read_options,
|
|
const std::string& meta_block_name,
|
|
BlockHandle* block_handle,
|
|
MemoryAllocator* memory_allocator = nullptr,
|
|
FilePrefetchBuffer* prefetch_buffer = nullptr,
|
|
Footer* footer_out = nullptr);
|
|
|
|
// Read meta block contents
|
|
Status ReadMetaIndexBlockInFile(RandomAccessFileReader* file,
|
|
uint64_t file_size, uint64_t table_magic_number,
|
|
const ImmutableOptions& ioptions,
|
|
const ReadOptions& read_options,
|
|
BlockContents* block_contents,
|
|
MemoryAllocator* memory_allocator = nullptr,
|
|
FilePrefetchBuffer* prefetch_buffer = nullptr,
|
|
Footer* footer_out = nullptr);
|
|
|
|
// Read the specified meta block with name meta_block_name
|
|
// from `file` and initialize `contents` with contents of this block.
|
|
// Return Status::OK in case of success.
|
|
Status ReadMetaBlock(RandomAccessFileReader* file,
|
|
FilePrefetchBuffer* prefetch_buffer, uint64_t file_size,
|
|
uint64_t table_magic_number,
|
|
const ImmutableOptions& ioptions,
|
|
const ReadOptions& read_options,
|
|
const std::string& meta_block_name, BlockType block_type,
|
|
BlockContents* contents,
|
|
MemoryAllocator* memory_allocator = nullptr);
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|