mirror of
https://github.com/facebook/rocksdb.git
synced 2024-12-04 20:02:50 +00:00
7a1b0207e6
Summary:

## Context checksum

All RocksDB checksums currently use 32 bits of checking power, which should be 1 in 4 billion false negative (FN) probability (failing to detect corruption). This is true for random corruptions, and in some cases small corruptions are guaranteed to be detected. But some possible corruptions, such as in storage metadata rather than storage payload data, would have a much higher FN rate. For example:

* Data larger than one SST block is replaced by data from elsewhere in the same or another SST file. Especially with `block_align=true`, the probability of exact block size match is probably around 1 in 100, making the FN probability around the same. Without `block_align=true` the probability of same block start location is probably around 1 in 10,000, for FN probability around 1 in a million.

To solve this problem in new format_version=6, we add "context awareness" to block checksum checks. The stored and expected checksum value is modified based on the block's position in the file and which file it is in. The modifications are cleverly chosen so that, for example

* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use different context
* files generated by the same process are guaranteed to use different context for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced, its checksum FN probability should be essentially ideal, 1 in 4B.

## Footer checksum

This change also adds checksum protection to the SST footer (with format_version=6), for the first time without relying on whole file checksum. To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) from defeating the footer checksum, we change much of the footer data format including an "extended magic number" in format_version 6 that would be interpreted as empty index and metaindex block handles in older footer versions. We also change the encoding of handles to free up space for other new data in footer.

## More detail: making space in footer

In order to keep footer the same size in format_version=6 (avoid change to IO patterns), we have to free up some space for new data. We do this two ways:

* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was in footer to begin with.)

## Performance

In case of small performance penalty, I've made a "pay as you go" optimization to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep with the only field used in that structure after construction: `prefix_extractor`. This makes the PR an overall performance improvement (results below).

Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6, even including that improvement for both. That's based on extreme case table write performance testing, many files with many blocks. This is relatively checksum intensive (small blocks) and salt generation intensive (small files).

```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```

Each value below is ops/s averaged over 100 runs, run simultaneously with competing configuration for load fairness.

Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)

Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)

After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058

Test Plan: unit tests included (table_test, db_properties_test, salt in env_test). General DB tests and crash test updated to test new format_version.

Also temporarily updated the default format version to 6 and saw some test failures. Almost all were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum, though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block checksum, just assuming it was verified in opening the index reader (probably *usually* true but probably not always true). Some other concerns about VerifyChecksum are left in FIXME comments. The only remaining test failure on change of default (in block_fetcher_test) now has a comment about how to upgrade the test.

The format compatibility test does not need updating because we have not updated the default format_version.

Reviewed By: ajkr, mrambacher

Differential Revision: D33100915

Pulled By: pdillinger

fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
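The "context awareness" described in the summary above can be illustrated with a small sketch. This is not taken from the RocksDB source: the function names (`ContextModifier`, `StoreChecksum`, `VerifyChecksum`), the per-file `file_salt` parameter, and the exact mixing formula are assumptions chosen only to show how a checksum can be tied to a file and an offset so that a misplaced block fails verification.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (illustrative only): derive a 32-bit modifier from a
// per-file salt and the block's offset. Adding the two 32-bit halves of the
// offset means offsets that differ only in the upper half (exact multiples
// of 4GiB apart) still produce different modifiers.
inline uint32_t ContextModifier(uint32_t file_salt, uint64_t offset) {
  const uint32_t lower = static_cast<uint32_t>(offset);
  const uint32_t upper = static_cast<uint32_t>(offset >> 32);
  return file_salt ^ (lower + upper);  // unsigned addition wraps mod 2^32
}

// What a writer would store next to the block: the raw checksum of the
// block contents, shifted by the context modifier.
inline uint32_t StoreChecksum(uint32_t raw_checksum, uint32_t file_salt,
                              uint64_t offset) {
  return raw_checksum + ContextModifier(file_salt, offset);
}

// What a reader would check: recompute the raw checksum from the bytes it
// read, apply the modifier for where it *expected* the block to be, and
// compare against the stored value.
inline bool VerifyChecksum(uint32_t stored, uint32_t recomputed_raw,
                           uint32_t file_salt, uint64_t expected_offset) {
  return stored == recomputed_raw + ContextModifier(file_salt, expected_offset);
}
```

Under these assumptions, a block written at offset 4096 of a file with salt 7 verifies at that position, but the same bytes and stored checksum read back from offset 8192, or from a file with a different salt, do not.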
341 lines
11 KiB
C++
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#include "table/plain/plain_table_builder.h"

#include <assert.h>

#include <limits>
#include <map>
#include <string>

#include "db/dbformat.h"
#include "file/writable_file_writer.h"
#include "logging/logging.h"
#include "rocksdb/comparator.h"
#include "rocksdb/env.h"
#include "rocksdb/filter_policy.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"
#include "table/block_based/block_builder.h"
#include "table/format.h"
#include "table/meta_blocks.h"
#include "table/plain/plain_table_bloom.h"
#include "table/plain/plain_table_factory.h"
#include "table/plain/plain_table_index.h"
#include "util/coding.h"
#include "util/crc32c.h"
#include "util/stop_watch.h"

namespace ROCKSDB_NAMESPACE {

namespace {

// A utility that writes block contents to the file.
//   @offset is advanced only if @block_contents was successfully written.
//   @block_handle receives the handle (offset and size) of this block.
IOStatus WriteBlock(const Slice& block_contents, WritableFileWriter* file,
                    uint64_t* offset, BlockHandle* block_handle) {
  block_handle->set_offset(*offset);
  block_handle->set_size(block_contents.size());
  IOStatus io_s = file->Append(block_contents);

  if (io_s.ok()) {
    *offset += block_contents.size();
  }
  return io_s;
}

}  // namespace

// kPlainTableMagicNumber was picked by running
//    echo rocksdb.table.plain | sha1sum
// and taking the leading 64 bits.
extern const uint64_t kPlainTableMagicNumber = 0x8242229663bf9564ull;
extern const uint64_t kLegacyPlainTableMagicNumber = 0x4f3418eb7a8f13b8ull;

PlainTableBuilder::PlainTableBuilder(
    const ImmutableOptions& ioptions, const MutableCFOptions& moptions,
    const IntTblPropCollectorFactories* int_tbl_prop_collector_factories,
    uint32_t column_family_id, int level_at_creation, WritableFileWriter* file,
    uint32_t user_key_len, EncodingType encoding_type, size_t index_sparseness,
    uint32_t bloom_bits_per_key, const std::string& column_family_name,
    uint32_t num_probes, size_t huge_page_tlb_size, double hash_table_ratio,
    bool store_index_in_file, const std::string& db_id,
    const std::string& db_session_id, uint64_t file_number)
    : ioptions_(ioptions),
      moptions_(moptions),
      bloom_block_(num_probes),
      file_(file),
      bloom_bits_per_key_(bloom_bits_per_key),
      huge_page_tlb_size_(huge_page_tlb_size),
      encoder_(encoding_type, user_key_len, moptions.prefix_extractor.get(),
               index_sparseness),
      store_index_in_file_(store_index_in_file),
      prefix_extractor_(moptions.prefix_extractor.get()) {
  // Build index block and save it in the file if hash_table_ratio > 0
  if (store_index_in_file_) {
    assert(hash_table_ratio > 0 || IsTotalOrderMode());
    index_builder_.reset(new PlainTableIndexBuilder(
        &arena_, ioptions, moptions.prefix_extractor.get(), index_sparseness,
        hash_table_ratio, huge_page_tlb_size_));
    properties_
        .user_collected_properties[PlainTablePropertyNames::kBloomVersion] =
        "1";  // For future use
  }

  properties_.fixed_key_len = user_key_len;

  // For plain table, we put all the data in one big chunk.
  properties_.num_data_blocks = 1;
  // Fill it later if store_index_in_file_ == true
  properties_.index_size = 0;
  properties_.filter_size = 0;
  // To support roll-back to a previous version, still use version 0 for
  // plain encoding.
  properties_.format_version = (encoding_type == kPlain) ? 0 : 1;
  properties_.column_family_id = column_family_id;
  properties_.column_family_name = column_family_name;
  properties_.db_id = db_id;
  properties_.db_session_id = db_session_id;
  properties_.db_host_id = ioptions.db_host_id;
  if (!ReifyDbHostIdProperty(ioptions_.env, &properties_.db_host_id).ok()) {
    ROCKS_LOG_INFO(ioptions_.logger, "db_host_id property will not be set");
  }
  properties_.orig_file_number = file_number;
  properties_.prefix_extractor_name =
      moptions_.prefix_extractor != nullptr
          ? moptions_.prefix_extractor->AsString()
          : "nullptr";

  std::string val;
  PutFixed32(&val, static_cast<uint32_t>(encoder_.GetEncodingType()));
  properties_
      .user_collected_properties[PlainTablePropertyNames::kEncodingType] = val;

  assert(int_tbl_prop_collector_factories);
  for (auto& factory : *int_tbl_prop_collector_factories) {
    assert(factory);

    table_properties_collectors_.emplace_back(
        factory->CreateIntTblPropCollector(column_family_id,
                                           level_at_creation));
  }
}

PlainTableBuilder::~PlainTableBuilder() {
  // They are supposed to have been passed to users through Finish()
  // if the file succeeds.
  status_.PermitUncheckedError();
  io_status_.PermitUncheckedError();
}

void PlainTableBuilder::Add(const Slice& key, const Slice& value) {
  // temp buffer for metadata bytes between key and value.
  char meta_bytes_buf[6];
  size_t meta_bytes_buf_size = 0;

  ParsedInternalKey internal_key;
  if (!ParseInternalKey(key, &internal_key, false /* log_err_key */)
           .ok()) {  // TODO
    assert(false);
    return;
  }
  if (internal_key.type == kTypeRangeDeletion) {
    status_ = Status::NotSupported("Range deletion unsupported");
    return;
  }

  // Store key hash
  if (store_index_in_file_) {
    if (moptions_.prefix_extractor == nullptr) {
      keys_or_prefixes_hashes_.push_back(GetSliceHash(internal_key.user_key));
    } else {
      Slice prefix =
          moptions_.prefix_extractor->Transform(internal_key.user_key);
      keys_or_prefixes_hashes_.push_back(GetSliceHash(prefix));
    }
  }

  // Remember the offset at which this entry starts (used by the index)
  assert(offset_ <= std::numeric_limits<uint32_t>::max());
  auto prev_offset = static_cast<uint32_t>(offset_);
  // Write out the key
  io_status_ = encoder_.AppendKey(key, file_, &offset_, meta_bytes_buf,
                                  &meta_bytes_buf_size);
  if (SaveIndexInFile()) {
    index_builder_->AddKeyPrefix(GetPrefix(internal_key), prev_offset);
  }

  // Write value length
  uint32_t value_size = static_cast<uint32_t>(value.size());
  if (io_status_.ok()) {
    char* end_ptr =
        EncodeVarint32(meta_bytes_buf + meta_bytes_buf_size, value_size);
    assert(end_ptr <= meta_bytes_buf + sizeof(meta_bytes_buf));
    meta_bytes_buf_size = end_ptr - meta_bytes_buf;
    io_status_ = file_->Append(Slice(meta_bytes_buf, meta_bytes_buf_size));
  }

  // Write value
  if (io_status_.ok()) {
    io_status_ = file_->Append(value);
    offset_ += value_size + meta_bytes_buf_size;
  }

  if (io_status_.ok()) {
    properties_.num_entries++;
    properties_.raw_key_size += key.size();
    properties_.raw_value_size += value.size();
    if (internal_key.type == kTypeDeletion ||
        internal_key.type == kTypeSingleDeletion) {
      properties_.num_deletions++;
    } else if (internal_key.type == kTypeMerge) {
      properties_.num_merge_operands++;
    }
  }

  // notify property collectors
  NotifyCollectTableCollectorsOnAdd(
      key, value, offset_, table_properties_collectors_, ioptions_.logger);
  status_ = io_status_;
}

Status PlainTableBuilder::Finish() {
  assert(!closed_);
  closed_ = true;

  properties_.data_size = offset_;

  // Write the following blocks
  // 1. [meta block: bloom] - optional
  // 2. [meta block: index] - optional
  // 3. [meta block: properties]
  // 4. [metaindex block]
  // 5. [footer]

  MetaIndexBuilder meta_index_builer;

  if (store_index_in_file_ && (properties_.num_entries > 0)) {
    assert(properties_.num_entries <= std::numeric_limits<uint32_t>::max());
    BlockHandle bloom_block_handle;
    if (bloom_bits_per_key_ > 0) {
      bloom_block_.SetTotalBits(
          &arena_,
          static_cast<uint32_t>(properties_.num_entries) * bloom_bits_per_key_,
          ioptions_.bloom_locality, huge_page_tlb_size_, ioptions_.logger);

      PutVarint32(&properties_.user_collected_properties
                       [PlainTablePropertyNames::kNumBloomBlocks],
                  bloom_block_.GetNumBlocks());

      bloom_block_.AddKeysHashes(keys_or_prefixes_hashes_);

      Slice bloom_finish_result = bloom_block_.Finish();

      properties_.filter_size = bloom_finish_result.size();
      io_status_ =
          WriteBlock(bloom_finish_result, file_, &offset_, &bloom_block_handle);

      if (!io_status_.ok()) {
        status_ = io_status_;
        return status_;
      }
      meta_index_builer.Add(BloomBlockBuilder::kBloomBlock, bloom_block_handle);
    }
    BlockHandle index_block_handle;
    Slice index_finish_result = index_builder_->Finish();

    properties_.index_size = index_finish_result.size();
    io_status_ =
        WriteBlock(index_finish_result, file_, &offset_, &index_block_handle);

    if (!io_status_.ok()) {
      status_ = io_status_;
      return status_;
    }

    meta_index_builer.Add(PlainTableIndexBuilder::kPlainTableIndexBlock,
                          index_block_handle);
  }

  // Calculate bloom block size and index block size
  PropertyBlockBuilder property_block_builder;
  // -- Add basic properties
  property_block_builder.AddTableProperty(properties_);

  property_block_builder.Add(properties_.user_collected_properties);

  // -- Add user collected properties
  NotifyCollectTableCollectorsOnFinish(
      table_properties_collectors_, ioptions_.logger, &property_block_builder);

  // -- Write property block
  BlockHandle property_block_handle;
  io_status_ = WriteBlock(property_block_builder.Finish(), file_, &offset_,
                          &property_block_handle);
  if (!io_status_.ok()) {
    status_ = io_status_;
    return status_;
  }
  meta_index_builer.Add(kPropertiesBlockName, property_block_handle);

  // -- write metaindex block
  BlockHandle metaindex_block_handle;
  io_status_ = WriteBlock(meta_index_builer.Finish(), file_, &offset_,
                          &metaindex_block_handle);
  if (!io_status_.ok()) {
    status_ = io_status_;
    return status_;
  }

  // Write Footer
  // no need to write out new footer if we're using default checksum
  FooterBuilder footer;
  Status s = footer.Build(kPlainTableMagicNumber, /* format_version */ 0,
                          offset_, kNoChecksum, metaindex_block_handle);
  if (!s.ok()) {
    status_ = s;
    return status_;
  }
  io_status_ = file_->Append(footer.GetSlice());
  if (io_status_.ok()) {
    offset_ += footer.GetSlice().size();
  }
  status_ = io_status_;
  return status_;
}

void PlainTableBuilder::Abandon() { closed_ = true; }

uint64_t PlainTableBuilder::NumEntries() const {
  return properties_.num_entries;
}

uint64_t PlainTableBuilder::FileSize() const { return offset_; }

std::string PlainTableBuilder::GetFileChecksum() const {
  if (file_ != nullptr) {
    return file_->GetFileChecksum();
  } else {
    return kUnknownFileChecksum;
  }
}

const char* PlainTableBuilder::GetFileChecksumFuncName() const {
  if (file_ != nullptr) {
    return file_->GetFileChecksumFuncName();
  } else {
    return kUnknownFileChecksumFuncName;
  }
}
void PlainTableBuilder::SetSeqnoTimeTableProperties(const std::string& string,
                                                    uint64_t uint_64) {
  // TODO: storing seqno to time mapping is not yet supported for plain table.
  TableBuilder::SetSeqnoTimeTableProperties(string, uint_64);
}

}  // namespace ROCKSDB_NAMESPACE