mirror of
https://github.com/facebook/rocksdb.git
synced 2024-11-30 22:41:48 +00:00
9f7801c5f1
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to the Cache class, and it becomes more of a low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).

The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits are listed below.

* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)

## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
  * Simpler for implementations to deal with just one Insert and one Lookup.
  * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters.
  * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
  * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
  * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
  * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.

## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).

The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.

These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)

Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.

## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.

## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.

The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.

## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)

## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975

Test Plan:
tests updated

Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):

34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83

Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.

Reviewed By: anand1976

Differential Revision: D42417818

Pulled By: pdillinger

fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
379 lines
15 KiB
C++
//  Copyright (c) 2011-present, Facebook, Inc.  All rights reserved.
//  This source code is licensed under both the GPLv2 (found in the
//  COPYING file in the root directory) and Apache 2.0 License
//  (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#pragma once

#include <array>
#include <cassert>  // assert
#include <cstdint>
#include <memory>  // std::unique_ptr
#include <string>
#include <utility>  // std::move

#include "file/file_prefetch_buffer.h"
#include "file/random_access_file_reader.h"
#include "memory/memory_allocator.h"
#include "options/cf_options.h"
#include "port/malloc.h"
#include "port/port.h"  // noexcept
#include "rocksdb/slice.h"
#include "rocksdb/status.h"
#include "rocksdb/table.h"
#include "util/hash.h"

namespace ROCKSDB_NAMESPACE {

class RandomAccessFile;
struct ReadOptions;

bool ShouldReportDetailedTime(Env* env, Statistics* stats);

// The length of the magic number in bytes.
constexpr uint32_t kMagicNumberLengthByte = 8;

// BlockHandle is a pointer to the extent of a file that stores a data
// block or a meta block.
class BlockHandle {
 public:
  // Creates a block handle with special values indicating "uninitialized,"
  // distinct from the "null" block handle.
  BlockHandle();
  BlockHandle(uint64_t offset, uint64_t size);

  // The offset of the block in the file.
  uint64_t offset() const { return offset_; }
  void set_offset(uint64_t _offset) { offset_ = _offset; }

  // The size of the stored block.
  uint64_t size() const { return size_; }
  void set_size(uint64_t _size) { size_ = _size; }

  void EncodeTo(std::string* dst) const;
  char* EncodeTo(char* dst) const;
  Status DecodeFrom(Slice* input);
  Status DecodeSizeFrom(uint64_t offset, Slice* input);

  // Return a string that contains a copy of the handle.
  std::string ToString(bool hex = true) const;

  // If the block handle's offset and size are both 0, it is treated as a
  // null block handle that points to nowhere.
  bool IsNull() const { return offset_ == 0 && size_ == 0; }

  static const BlockHandle& NullBlockHandle() { return kNullBlockHandle; }

  // Maximum encoding length of a BlockHandle
  static constexpr uint32_t kMaxEncodedLength = 2 * kMaxVarint64Length;

  inline bool operator==(const BlockHandle& rhs) const {
    return offset_ == rhs.offset_ && size_ == rhs.size_;
  }
  inline bool operator!=(const BlockHandle& rhs) const {
    return !(*this == rhs);
  }

 private:
  uint64_t offset_;
  uint64_t size_;

  static const BlockHandle kNullBlockHandle;
};

// Value in block-based table file index.
//
// The index entry for block n is: y -> h, [x],
// where: y is some key between the last key of block n (inclusive) and the
// first key of block n+1 (exclusive); h is BlockHandle pointing to block n;
// x, if present, is the first key of block n (unshortened).
// This struct represents the "h, [x]" part.
struct IndexValue {
  BlockHandle handle;
  // Empty means unknown.
  Slice first_internal_key;

  IndexValue() = default;
  IndexValue(BlockHandle _handle, Slice _first_internal_key)
      : handle(_handle), first_internal_key(_first_internal_key) {}

  // have_first_key indicates whether the `first_internal_key` is used.
  // If previous_handle is not null, delta encoding is used;
  // in this case, the two handles must point to consecutive blocks:
  // handle.offset() ==
  //     previous_handle->offset() + previous_handle->size() + kBlockTrailerSize
  void EncodeTo(std::string* dst, bool have_first_key,
                const BlockHandle* previous_handle) const;
  Status DecodeFrom(Slice* input, bool have_first_key,
                    const BlockHandle* previous_handle);

  std::string ToString(bool hex, bool have_first_key) const;
};

inline uint32_t GetCompressFormatForVersion(uint32_t format_version) {
  // As of format_version 2, we encode compressed block with
  // compress_format_version == 2. Before that, the version is 1.
  // DO NOT CHANGE THIS FUNCTION, it affects disk format
  return format_version >= 2 ? 2 : 1;
}

constexpr uint32_t kLatestFormatVersion = 5;

inline bool IsSupportedFormatVersion(uint32_t version) {
  return version <= kLatestFormatVersion;
}

// Footer encapsulates the fixed information stored at the tail end of every
// SST file. In general, it should only include things that cannot go
// elsewhere under the metaindex block. For example, checksum_type is
// required for verifying metaindex block checksum (when applicable), but
// index block handle can easily go in metaindex block (possible future).
// See also FooterBuilder below.
class Footer {
 public:
  // Create empty. Populate using DecodeFrom.
  Footer() {}

  // Deserialize a footer (populate fields) from `input` and check for various
  // corruptions. `input_offset` is the offset within the target file of
  // `input` buffer (future use).
  // If enforce_table_magic_number != 0, will return corruption if table magic
  // number is not equal to enforce_table_magic_number.
  Status DecodeFrom(Slice input, uint64_t input_offset,
                    uint64_t enforce_table_magic_number = 0);

  // Table magic number identifies file as RocksDB SST file and which kind of
  // SST format is in use.
  uint64_t table_magic_number() const { return table_magic_number_; }

  // A version (footer and more) within a kind of SST. (It would add more
  // unnecessary complexity to separate footer versions and
  // BBTO::format_version.)
  uint32_t format_version() const { return format_version_; }

  // Block handle for metaindex block.
  const BlockHandle& metaindex_handle() const { return metaindex_handle_; }

  // Block handle for (top-level) index block.
  const BlockHandle& index_handle() const { return index_handle_; }

  // Checksum type used in the file.
  ChecksumType checksum_type() const {
    return static_cast<ChecksumType>(checksum_type_);
  }

  // Block trailer size used by file with this footer (e.g. 5 for block-based
  // table and 0 for plain table). This is inferred from magic number so
  // not in the serialized form.
  inline size_t GetBlockTrailerSize() const { return block_trailer_size_; }

  // Convert this object to a human readable form
  std::string ToString() const;

  // Encoded lengths of Footers. Bytes for serialized Footer will always be
  // >= kMinEncodedLength and <= kMaxEncodedLength.
  //
  // Footer version 0 (legacy) will always occupy exactly this many bytes.
  // It consists of two block handles, padding, and a magic number.
  static constexpr uint32_t kVersion0EncodedLength =
      2 * BlockHandle::kMaxEncodedLength + kMagicNumberLengthByte;
  static constexpr uint32_t kMinEncodedLength = kVersion0EncodedLength;

  // Footer of versions 1 and higher will always occupy exactly this many
  // bytes. It originally consisted of the checksum type, two block handles,
  // padding (to maximum handle encoding size), a format version number, and a
  // magic number.
  static constexpr uint32_t kNewVersionsEncodedLength =
      1 + 2 * BlockHandle::kMaxEncodedLength + 4 + kMagicNumberLengthByte;
  static constexpr uint32_t kMaxEncodedLength = kNewVersionsEncodedLength;

  static constexpr uint64_t kNullTableMagicNumber = 0;

  static constexpr uint32_t kInvalidFormatVersion = 0xffffffffU;

 private:
  static constexpr int kInvalidChecksumType =
      (1 << (sizeof(ChecksumType) * 8)) | kNoChecksum;

  uint64_t table_magic_number_ = kNullTableMagicNumber;
  uint32_t format_version_ = kInvalidFormatVersion;
  BlockHandle metaindex_handle_;
  BlockHandle index_handle_;
  int checksum_type_ = kInvalidChecksumType;
  uint8_t block_trailer_size_ = 0;
};

// Builder for Footer
class FooterBuilder {
 public:
  // Run the builder on the inputs. This is a single step with lots of
  // parameters for efficiency (based on perf testing).
  // * table_magic_number identifies file as RocksDB SST file and which kind of
  // SST format is in use.
  // * format_version is a version for the footer and can also apply to other
  // aspects of the SST file (see BlockBasedTableOptions::format_version).
  // NOTE: To save complexity in the caller, when format_version == 0 and
  // there is a corresponding legacy magic number to the one specified, the
  // legacy magic number will be written for forward compatibility.
  // * footer_offset is the file offset where the footer will be written
  // (for future use).
  // * checksum_type is for formats using block checksums.
  // * index_handle is optional for some kinds of SST files.
  void Build(uint64_t table_magic_number, uint32_t format_version,
             uint64_t footer_offset, ChecksumType checksum_type,
             const BlockHandle& metaindex_handle,
             const BlockHandle& index_handle = BlockHandle::NullBlockHandle());

  // After Build, get a Slice for the serialized Footer, backed by this
  // FooterBuilder.
  const Slice& GetSlice() const {
    assert(slice_.size());
    return slice_;
  }

 private:
  Slice slice_;
  std::array<char, Footer::kMaxEncodedLength> data_;
};

// Read the footer from file.
// If enforce_table_magic_number != 0, ReadFooterFromFile() will return
// corruption if the table magic number is not equal to
// enforce_table_magic_number.
Status ReadFooterFromFile(const IOOptions& opts, RandomAccessFileReader* file,
                          FileSystem& fs, FilePrefetchBuffer* prefetch_buffer,
                          uint64_t file_size, Footer* footer,
                          uint64_t enforce_table_magic_number = 0);

// Computes a checksum using the given ChecksumType. Sometimes we need to
// include one more input byte logically at the end but not part of the main
// data buffer. If size >= 1, then
// ComputeBuiltinChecksum(type, data, size)
// ==
// ComputeBuiltinChecksumWithLastByte(type, data, size - 1, data[size - 1])
uint32_t ComputeBuiltinChecksum(ChecksumType type, const char* data,
                                size_t size);
uint32_t ComputeBuiltinChecksumWithLastByte(ChecksumType type, const char* data,
                                            size_t size, char last_byte);

// Represents the contents of a block read from an SST file. Depending on how
// it's created, it may or may not own the actual block bytes. As an example,
// BlockContents objects representing data read from mmapped files only point
// into the mmapped region. Depending on context, it might be a serialized
// (potentially compressed) block, including a trailer beyond `size`, or an
// uncompressed block.
//
// Please try to use this terminology when dealing with blocks:
// * "Serialized block" - bytes that go into storage. For block-based table
// (usually the case) this includes the block trailer. Here the `size` does
// not include the trailer, but other places in code might include the trailer
// in the size.
// * "Maybe compressed block" - like a serialized block, but without the
// trailer (or no promise of including a trailer). Must be accompanied by a
// CompressionType in some other variable or field.
// * "Uncompressed block" - "payload" bytes that are either stored with no
// compression, used as input to compression function, or result of
// decompression function.
// * "Parsed block" - an in-memory form of a block in block cache, as it is
// used by the table reader. Different C++ types are used depending on the
// block type (see block_cache.h). Only trivially parsable block types
// use BlockContents as the parsed form.
//
struct BlockContents {
  // Points to block payload (without trailer)
  Slice data;
  CacheAllocationPtr allocation;

#ifndef NDEBUG
  // Whether there is a known trailer after what is pointed to by `data`.
  // See BlockBasedTable::GetCompressionType.
  bool has_trailer = false;
#endif  // NDEBUG

  BlockContents() {}

  // Does not take ownership of the underlying data bytes.
  BlockContents(const Slice& _data) : data(_data) {}

  // Takes ownership of the underlying data bytes.
  BlockContents(CacheAllocationPtr&& _data, size_t _size)
      : data(_data.get(), _size), allocation(std::move(_data)) {}

  // Takes ownership of the underlying data bytes.
  BlockContents(std::unique_ptr<char[]>&& _data, size_t _size)
      : data(_data.get(), _size) {
    allocation.reset(_data.release());
  }

  // Returns whether the object has ownership of the underlying data bytes.
  bool own_bytes() const { return allocation.get() != nullptr; }

  // The additional memory space taken by the block data.
  size_t usable_size() const {
    if (allocation.get() != nullptr) {
      auto allocator = allocation.get_deleter().allocator;
      if (allocator) {
        return allocator->UsableSize(allocation.get(), data.size());
      }
#ifdef ROCKSDB_MALLOC_USABLE_SIZE
      return malloc_usable_size(allocation.get());
#else
      return data.size();
#endif  // ROCKSDB_MALLOC_USABLE_SIZE
    } else {
      return 0;  // no extra memory is occupied by the data
    }
  }

  size_t ApproximateMemoryUsage() const {
    return usable_size() + sizeof(*this);
  }

  BlockContents(BlockContents&& other) noexcept { *this = std::move(other); }

  BlockContents& operator=(BlockContents&& other) {
    data = std::move(other.data);
    allocation = std::move(other.allocation);
#ifndef NDEBUG
    has_trailer = other.has_trailer;
#endif  // NDEBUG
    return *this;
  }
};

// The `data` points to serialized block contents read in from file, which
// must be compressed and include a trailer beyond `size`. A new buffer is
// allocated with the given allocator (or default) and the uncompressed
// contents are returned in `out_contents`.
// format_version is as defined in include/rocksdb/table.h, which is
// used to determine compression format version.
Status UncompressSerializedBlock(const UncompressionInfo& info,
                                 const char* data, size_t size,
                                 BlockContents* out_contents,
                                 uint32_t format_version,
                                 const ImmutableOptions& ioptions,
                                 MemoryAllocator* allocator = nullptr);

// This is a variant of UncompressSerializedBlock that does not expect a
// block trailer beyond `size`. (CompressionType is taken from `info`.)
Status UncompressBlockData(const UncompressionInfo& info, const char* data,
                           size_t size, BlockContents* out_contents,
                           uint32_t format_version,
                           const ImmutableOptions& ioptions,
                           MemoryAllocator* allocator = nullptr);

// Replace db_host_id contents with the real hostname if necessary
Status ReifyDbHostIdProperty(Env* env, std::string* db_host_id);

// Implementation details follow. Clients should ignore.

// TODO(andrewkr): we should prefer one way of representing a null/uninitialized
// BlockHandle. Currently we use zeros for null and use negation-of-zeros for
// uninitialized.
inline BlockHandle::BlockHandle() : BlockHandle(~uint64_t{0}, ~uint64_t{0}) {}

inline BlockHandle::BlockHandle(uint64_t _offset, uint64_t _size)
    : offset_(_offset), size_(_size) {}

}  // namespace ROCKSDB_NAMESPACE