mirror of
https://github.com/facebook/rocksdb.git
synced 2024-11-27 20:43:57 +00:00
1777e5f7e9
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.

It would be useful to associate sequence numbers with their corresponding actual timestamps. One thing we can do is to support user-defined timestamp, which allows applications to specify the format of custom timestamps and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.

This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps. Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.

Until now, snapshots have usually been taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot` object is created with the last published sequence number of the super-version. The reader has no good idea of what timestamp to assign to this snapshot, because by the time `GetSnapshot()` is called, an arbitrarily long period of time may have already elapsed since the last write, which is when the last published sequence number was written.

This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is exposed only at the `TransactionDB` layer. An application can tell RocksDB to create a snapshot when a transaction commits, effectively associating the last sequence number with a timestamp. It is also assumed that the application will ensure that any two snapshots with timestamps satisfy the following:

```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```

If the application can guarantee that there are no active writes going on in the database when a reader takes a timestamped snapshot, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create a snapshot with an associated timestamp.

Code example

```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();

// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
    std::shared_ptr<TransactionNotifier> notifier,
    TxnTimestamp ts,
    std::shared_ptr<const Snapshot>* ret);

// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```

The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.

```cpp
// Return the timestamped snapshot corresponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Otherwise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```

We also provide two additional APIs for stats collection and reporting purposes.

```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
    TxnTimestamp ts_lb,
    TxnTimestamp ts_ub,
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```

To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release timestamped snapshots whose timestamps are older than or equal to a given threshold.

```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```

Before shutdown, RocksDB will release all timestamped snapshots.

Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction, thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in this snapshot from being dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.

Timestamped snapshots can also be used with user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879

Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```

Reviewed By: siying

Differential Revision: D35783919

Pulled By: riversand963

fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
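The lookup, range, and release semantics described in the summary can be tried out in isolation with a small toy model. The sketch below is not RocksDB code: `ToySnapshot` and `ToyTimestampedSnapshots` are hypothetical names, and the container choice (an ordered `std::map` keyed by timestamp) simply mirrors what the header in this commit does. Note that the release step here follows the header's `lower_bound` logic, which drops entries with timestamps strictly below the threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-in for a snapshot: a sequence number plus a timestamp.
struct ToySnapshot {
  uint64_t seq;
  uint64_t ts;
};

// Toy registry keyed by timestamp (illustration only, not the RocksDB class).
class ToyTimestampedSnapshots {
 public:
  using Ptr = std::shared_ptr<const ToySnapshot>;

  // Registers a snapshot; an existing entry for the same timestamp is kept.
  void Add(const Ptr& snap) {
    assert(snap);
    snapshots_.try_emplace(snap->ts, snap);
  }

  // The maximum uint64_t value means "latest"; otherwise the match must be
  // exact, and a null pointer is returned when nothing matches.
  Ptr Get(uint64_t ts) const {
    if (ts == std::numeric_limits<uint64_t>::max() && !snapshots_.empty()) {
      return snapshots_.rbegin()->second;
    }
    auto it = snapshots_.find(ts);
    if (it == snapshots_.end()) {
      return nullptr;
    }
    return it->second;
  }

  // Collects snapshots whose timestamps fall in [lb, ub).
  std::vector<Ptr> Range(uint64_t lb, uint64_t ub) const {
    std::vector<Ptr> out;
    auto end = snapshots_.lower_bound(ub);
    for (auto it = snapshots_.lower_bound(lb); it != end; ++it) {
      out.push_back(it->second);
    }
    return out;
  }

  // Drops entries with timestamps strictly below `ts`; returns the number
  // of entries dropped.
  std::size_t ReleaseOlderThan(uint64_t ts) {
    const std::size_t before = snapshots_.size();
    snapshots_.erase(snapshots_.begin(), snapshots_.lower_bound(ts));
    return before - snapshots_.size();
  }

 private:
  std::map<uint64_t, Ptr> snapshots_;
};
```

Because the map is ordered by timestamp, "latest" is the last element and both the range query and the release step reduce to `lower_bound` calls. This relies on the application-side invariant stated above, namely that sequence numbers and timestamps increase together.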
234 lines
7 KiB
C++
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.

#pragma once
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <memory>
#include <vector>

#include "db/dbformat.h"
#include "rocksdb/db.h"
#include "util/autovector.h"

namespace ROCKSDB_NAMESPACE {

class SnapshotList;

// Snapshots are kept in a doubly-linked list in the DB.
// Each SnapshotImpl corresponds to a particular sequence number.
class SnapshotImpl : public Snapshot {
 public:
  SequenceNumber number_;  // const after creation
  // It indicates the smallest uncommitted data at the time the snapshot was
  // taken. This is currently used by WritePrepared transactions to limit the
  // scope of queries to IsInSnapshot.
  SequenceNumber min_uncommitted_ = kMinUnCommittedSeq;

  SequenceNumber GetSequenceNumber() const override { return number_; }

  int64_t GetUnixTime() const override { return unix_time_; }

  uint64_t GetTimestamp() const override { return timestamp_; }

 private:
  friend class SnapshotList;

  // SnapshotImpl is kept in a doubly-linked circular list
  SnapshotImpl* prev_;
  SnapshotImpl* next_;

  SnapshotList* list_;  // just for sanity checks

  int64_t unix_time_;

  uint64_t timestamp_;

  // Will this snapshot be used by a Transaction to do write-conflict checking?
  bool is_write_conflict_boundary_;
};

class SnapshotList {
 public:
  SnapshotList() {
    list_.prev_ = &list_;
    list_.next_ = &list_;
    list_.number_ = 0xFFFFFFFFL;  // placeholder marker, for debugging
    // Set all the variables to make UBSAN happy.
    list_.list_ = nullptr;
    list_.unix_time_ = 0;
    list_.timestamp_ = 0;
    list_.is_write_conflict_boundary_ = false;
    count_ = 0;
  }

  // No copy-construct.
  SnapshotList(const SnapshotList&) = delete;

  bool empty() const {
    assert(list_.next_ != &list_ || 0 == count_);
    return list_.next_ == &list_;
  }
  SnapshotImpl* oldest() const { assert(!empty()); return list_.next_; }
  SnapshotImpl* newest() const { assert(!empty()); return list_.prev_; }

  SnapshotImpl* New(SnapshotImpl* s, SequenceNumber seq, uint64_t unix_time,
                    bool is_write_conflict_boundary,
                    uint64_t ts = std::numeric_limits<uint64_t>::max()) {
    s->number_ = seq;
    s->unix_time_ = unix_time;
    s->timestamp_ = ts;
    s->is_write_conflict_boundary_ = is_write_conflict_boundary;
    s->list_ = this;
    s->next_ = &list_;
    s->prev_ = list_.prev_;
    s->prev_->next_ = s;
    s->next_->prev_ = s;
    count_++;
    return s;
  }

  // Not responsible for freeing the object.
  void Delete(const SnapshotImpl* s) {
    assert(s->list_ == this);
    s->prev_->next_ = s->next_;
    s->next_->prev_ = s->prev_;
    count_--;
  }

  // retrieve all snapshot numbers up until max_seq. They are sorted in
  // ascending order (with no duplicates).
  std::vector<SequenceNumber> GetAll(
      SequenceNumber* oldest_write_conflict_snapshot = nullptr,
      const SequenceNumber& max_seq = kMaxSequenceNumber) const {
    std::vector<SequenceNumber> ret;
    GetAll(&ret, oldest_write_conflict_snapshot, max_seq);
    return ret;
  }

  void GetAll(std::vector<SequenceNumber>* snap_vector,
              SequenceNumber* oldest_write_conflict_snapshot = nullptr,
              const SequenceNumber& max_seq = kMaxSequenceNumber) const {
    std::vector<SequenceNumber>& ret = *snap_vector;
    // So far we have no use case that would pass a non-empty vector
    assert(ret.size() == 0);

    if (oldest_write_conflict_snapshot != nullptr) {
      *oldest_write_conflict_snapshot = kMaxSequenceNumber;
    }

    if (empty()) {
      return;
    }
    const SnapshotImpl* s = &list_;
    while (s->next_ != &list_) {
      if (s->next_->number_ > max_seq) {
        break;
      }
      // Avoid duplicates
      if (ret.empty() || ret.back() != s->next_->number_) {
        ret.push_back(s->next_->number_);
      }

      if (oldest_write_conflict_snapshot != nullptr &&
          *oldest_write_conflict_snapshot == kMaxSequenceNumber &&
          s->next_->is_write_conflict_boundary_) {
        // If this is the first write-conflict boundary snapshot in the list,
        // it is the oldest
        *oldest_write_conflict_snapshot = s->next_->number_;
      }

      s = s->next_;
    }
    return;
  }

  // get the sequence number of the most recent snapshot
  SequenceNumber GetNewest() {
    if (empty()) {
      return 0;
    }
    return newest()->number_;
  }

  int64_t GetOldestSnapshotTime() const {
    if (empty()) {
      return 0;
    } else {
      return oldest()->unix_time_;
    }
  }

  int64_t GetOldestSnapshotSequence() const {
    if (empty()) {
      return 0;
    } else {
      return oldest()->GetSequenceNumber();
    }
  }

  uint64_t count() const { return count_; }

 private:
  // Dummy head of doubly-linked list of snapshots
  SnapshotImpl list_;
  uint64_t count_;
};

// All operations on TimestampedSnapshotList must be protected by db mutex.
class TimestampedSnapshotList {
 public:
  explicit TimestampedSnapshotList() = default;

  std::shared_ptr<const SnapshotImpl> GetSnapshot(uint64_t ts) const {
    if (ts == std::numeric_limits<uint64_t>::max() && !snapshots_.empty()) {
      auto it = snapshots_.rbegin();
      assert(it != snapshots_.rend());
      return it->second;
    }
    auto it = snapshots_.find(ts);
    if (it == snapshots_.end()) {
      return std::shared_ptr<const SnapshotImpl>();
    }
    return it->second;
  }

  void GetSnapshots(
      uint64_t ts_lb, uint64_t ts_ub,
      std::vector<std::shared_ptr<const Snapshot>>& snapshots) const {
    assert(ts_lb < ts_ub);
    auto it_low = snapshots_.lower_bound(ts_lb);
    auto it_high = snapshots_.lower_bound(ts_ub);
    for (auto it = it_low; it != it_high; ++it) {
      snapshots.emplace_back(it->second);
    }
  }

  void AddSnapshot(const std::shared_ptr<const SnapshotImpl>& snapshot) {
    assert(snapshot);
    snapshots_.try_emplace(snapshot->GetTimestamp(), snapshot);
  }

  // snapshots_to_release: the container into which the timestamped snapshots
  // are moved so that it holds the last references to them. The snapshots are
  // therefore not actually released here (releasing requires the db mutex);
  // they will be released by the caller of ReleaseSnapshotsOlderThan().
  void ReleaseSnapshotsOlderThan(
      uint64_t ts,
      autovector<std::shared_ptr<const SnapshotImpl>>& snapshots_to_release) {
    auto ub = snapshots_.lower_bound(ts);
    for (auto it = snapshots_.begin(); it != ub; ++it) {
      snapshots_to_release.emplace_back(it->second);
    }
    snapshots_.erase(snapshots_.begin(), ub);
  }

 private:
  std::map<uint64_t, std::shared_ptr<const SnapshotImpl>> snapshots_;
};

}  // namespace ROCKSDB_NAMESPACE
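
`SnapshotList` in the header above keeps its nodes in a circular doubly-linked list with a dummy head, so insertion and removal need no null checks. A minimal standalone sketch of that technique follows. `ToyNode` and `ToyList` are hypothetical names for illustration and are not part of RocksDB; like the original, the list does not own its nodes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical node type; in the header this role is played by SnapshotImpl.
struct ToyNode {
  uint64_t number = 0;
  ToyNode* prev = nullptr;
  ToyNode* next = nullptr;
};

// Circular doubly-linked list with a dummy head (illustration only).
class ToyList {
 public:
  ToyList() {
    head_.prev = &head_;
    head_.next = &head_;
  }
  ToyList(const ToyList&) = delete;
  ToyList& operator=(const ToyList&) = delete;

  // The list is empty when the dummy head points at itself.
  bool empty() const { return head_.next == &head_; }

  // Links a caller-owned node in just before the dummy head (newest end).
  void PushBack(ToyNode* n, uint64_t number) {
    n->number = number;
    n->next = &head_;
    n->prev = head_.prev;
    n->prev->next = n;
    n->next->prev = n;
  }

  // Unlinks a node without freeing it.
  void Unlink(ToyNode* n) {
    assert(n != &head_);
    n->prev->next = n->next;
    n->next->prev = n->prev;
  }

  // Returns numbers in list order up to max_number, skipping adjacent
  // duplicates; this assumes nodes were appended in ascending order.
  std::vector<uint64_t> Numbers(uint64_t max_number) const {
    std::vector<uint64_t> out;
    for (const ToyNode* n = head_.next; n != &head_; n = n->next) {
      if (n->number > max_number) {
        break;
      }
      if (out.empty() || out.back() != n->number) {
        out.push_back(n->number);
      }
    }
    return out;
  }

 private:
  ToyNode head_;  // dummy head; head_.next is oldest, head_.prev is newest
};
```

Since new nodes are always appended at the tail, the oldest entry is `head_.next` and the newest is `head_.prev`, which is how `oldest()` and `newest()` in the header can run in constant time.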