rocksdb/include/rocksdb/table_properties.h
Mikhail Antonov 7fe3b32896 Added support for differential snapshots
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).

This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.

From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".

This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.

For now, what's done here according to initial discussions:

Preserving deletes:
 - We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
 - I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
 - Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.

Iterator changes:
 - couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.

TableCache changes:
 - I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.

What's left:

 - Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999

Differential Revision: D6175602

Pulled By: mikhail-antonov

fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
2017-11-01 18:56:43 -07:00

214 lines
8.3 KiB
C++

// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#pragma once
#include <stdint.h>
#include <map>
#include <string>
#include "rocksdb/status.h"
#include "rocksdb/types.h"
namespace rocksdb {
// -- Table Properties
// Other than basic table properties, each table may also have the user
// collected properties.
// The value of the user-collected properties are encoded as raw bytes --
// users have to interprete these values by themselves.
// Note: To do prefix seek/scan in `UserCollectedProperties`, you can do
// something similar to:
//
// UserCollectedProperties props = ...;
// for (auto pos = props.lower_bound(prefix);
// pos != props.end() && pos->first.compare(0, prefix.size(), prefix) == 0;
// ++pos) {
// ...
// }
typedef std::map<std::string, std::string> UserCollectedProperties;
// table properties' human-readable names in the property block.
struct TablePropertiesNames {
static const std::string kDataSize;
static const std::string kIndexSize;
static const std::string kIndexPartitions;
static const std::string kTopLevelIndexSize;
static const std::string kFilterSize;
static const std::string kRawKeySize;
static const std::string kRawValueSize;
static const std::string kNumDataBlocks;
static const std::string kNumEntries;
static const std::string kFormatVersion;
static const std::string kFixedKeyLen;
static const std::string kFilterPolicy;
static const std::string kColumnFamilyName;
static const std::string kColumnFamilyId;
static const std::string kComparator;
static const std::string kMergeOperator;
static const std::string kPrefixExtractorName;
static const std::string kPropertyCollectors;
static const std::string kCompression;
static const std::string kCreationTime;
static const std::string kOldestKeyTime;
};
extern const std::string kPropertiesBlock;
extern const std::string kCompressionDictBlock;
extern const std::string kRangeDelBlock;
// `TablePropertiesCollector` provides the mechanism for users to collect
// their own properties that they are interested in. This class is essentially
// a collection of callback functions that will be invoked during table
// building. It is construced with TablePropertiesCollectorFactory. The methods
// don't need to be thread-safe, as we will create exactly one
// TablePropertiesCollector object per table and then call it sequentially
class TablePropertiesCollector {
public:
virtual ~TablePropertiesCollector() {}
// DEPRECATE User defined collector should implement AddUserKey(), though
// this old function still works for backward compatible reason.
// Add() will be called when a new key/value pair is inserted into the table.
// @params key the user key that is inserted into the table.
// @params value the value that is inserted into the table.
virtual Status Add(const Slice& /*key*/, const Slice& /*value*/) {
return Status::InvalidArgument(
"TablePropertiesCollector::Add() deprecated.");
}
// AddUserKey() will be called when a new key/value pair is inserted into the
// table.
// @params key the user key that is inserted into the table.
// @params value the value that is inserted into the table.
virtual Status AddUserKey(const Slice& key, const Slice& value,
EntryType /*type*/, SequenceNumber /*seq*/,
uint64_t /*file_size*/) {
// For backwards-compatibility.
return Add(key, value);
}
// Finish() will be called when a table has already been built and is ready
// for writing the properties block.
// @params properties User will add their collected statistics to
// `properties`.
virtual Status Finish(UserCollectedProperties* properties) = 0;
// Return the human-readable properties, where the key is property name and
// the value is the human-readable form of value.
virtual UserCollectedProperties GetReadableProperties() const = 0;
// The name of the properties collector can be used for debugging purpose.
virtual const char* Name() const = 0;
// EXPERIMENTAL Return whether the output file should be further compacted
virtual bool NeedCompact() const { return false; }
};
// Constructs TablePropertiesCollector. Internals create a new
// TablePropertiesCollector for each new table
class TablePropertiesCollectorFactory {
public:
struct Context {
uint32_t column_family_id;
static const uint32_t kUnknownColumnFamily;
};
virtual ~TablePropertiesCollectorFactory() {}
// has to be thread-safe
virtual TablePropertiesCollector* CreateTablePropertiesCollector(
TablePropertiesCollectorFactory::Context context) = 0;
// The name of the properties collector can be used for debugging purpose.
virtual const char* Name() const = 0;
};
// TableProperties contains a bunch of read-only properties of its associated
// table.
struct TableProperties {
public:
// the total size of all data blocks.
uint64_t data_size = 0;
// the size of index block.
uint64_t index_size = 0;
// Total number of index partitions if kTwoLevelIndexSearch is used
uint64_t index_partitions = 0;
// Size of the top-level index if kTwoLevelIndexSearch is used
uint64_t top_level_index_size = 0;
// the size of filter block.
uint64_t filter_size = 0;
// total raw key size
uint64_t raw_key_size = 0;
// total raw value size
uint64_t raw_value_size = 0;
// the number of blocks in this table
uint64_t num_data_blocks = 0;
// the number of entries in this table
uint64_t num_entries = 0;
// format version, reserved for backward compatibility
uint64_t format_version = 0;
// If 0, key is variable length. Otherwise number of bytes for each key.
uint64_t fixed_key_len = 0;
// ID of column family for this SST file, corresponding to the CF identified
// by column_family_name.
uint64_t column_family_id =
rocksdb::TablePropertiesCollectorFactory::Context::kUnknownColumnFamily;
// The time when the SST file was created.
// Since SST files are immutable, this is equivalent to last modified time.
uint64_t creation_time = 0;
// Timestamp of the earliest key. 0 means unknown.
uint64_t oldest_key_time = 0;
// Name of the column family with which this SST file is associated.
// If column family is unknown, `column_family_name` will be an empty string.
std::string column_family_name;
// The name of the filter policy used in this table.
// If no filter policy is used, `filter_policy_name` will be an empty string.
std::string filter_policy_name;
// The name of the comparator used in this table.
std::string comparator_name;
// The name of the merge operator used in this table.
// If no merge operator is used, `merge_operator_name` will be "nullptr".
std::string merge_operator_name;
// The name of the prefix extractor used in this table
// If no prefix extractor is used, `prefix_extractor_name` will be "nullptr".
std::string prefix_extractor_name;
// The names of the property collectors factories used in this table
// separated by commas
// {collector_name[1]},{collector_name[2]},{collector_name[3]} ..
std::string property_collectors_names;
// The compression algo used to compress the SST files.
std::string compression_name;
// user collected properties
UserCollectedProperties user_collected_properties;
UserCollectedProperties readable_properties;
// The offset of the value of each property in the file.
std::map<std::string, uint64_t> properties_offsets;
// convert this object to a human readable form
// @prop_delim: delimiter for each property.
std::string ToString(const std::string& prop_delim = "; ",
const std::string& kv_delim = "=") const;
// Aggregate the numerical member variables of the specified
// TableProperties.
void Add(const TableProperties& tp);
};
// Extra properties
// Below is a list of non-basic properties that are collected by database
// itself. Especially some properties regarding to the internal keys (which
// is unknown to `table`).
extern uint64_t GetDeletedKeys(const UserCollectedProperties& props);
extern uint64_t GetMergeOperands(const UserCollectedProperties& props,
bool* property_present);
} // namespace rocksdb