rocksdb/utilities/transactions/pessimistic_transaction.h
Yanqin Jin 1777e5f7e9 Snapshots with user-specified timestamps (#9879)
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.

It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.

This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.

In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.

This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```

If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.

Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();

// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
    std::shared_ptr<TransactionNotifier> notifier,
    TxnTimestamp ts,
    std::shared_ptr<const Snapshot>* ret);

// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```

The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```

We also provide two additional APIs for stats collection and reporting purposes.

```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
    TxnTimestamp ts_lb,
    TxnTimestamp ts_ub,
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```

To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```

Before shutdown, RocksDB will release all timestamped snapshots.

Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.

Timestamped snapshots can also be used with user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879

Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```

Reviewed By: siying

Differential Revision: D35783919

Pulled By: riversand963

fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
2022-06-10 16:07:03 -07:00

314 lines
12 KiB
C++

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
#pragma once
#ifndef ROCKSDB_LITE
#include <algorithm>
#include <atomic>
#include <mutex>
#include <stack>
#include <string>
#include <unordered_map>
#include <vector>
#include "db/write_callback.h"
#include "rocksdb/db.h"
#include "rocksdb/slice.h"
#include "rocksdb/snapshot.h"
#include "rocksdb/status.h"
#include "rocksdb/types.h"
#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"
#include "rocksdb/utilities/write_batch_with_index.h"
#include "util/autovector.h"
#include "utilities/transactions/transaction_base.h"
#include "utilities/transactions/transaction_util.h"
namespace ROCKSDB_NAMESPACE {
class PessimisticTransactionDB;
// A transaction under pessimistic concurrency control. This class implements
// the locking API and interfaces with the lock manager as well as the
// pessimistic transactional db.
class PessimisticTransaction : public TransactionBaseImpl {
public:
PessimisticTransaction(TransactionDB* db, const WriteOptions& write_options,
const TransactionOptions& txn_options,
const bool init = true);
// No copying allowed
PessimisticTransaction(const PessimisticTransaction&) = delete;
void operator=(const PessimisticTransaction&) = delete;
~PessimisticTransaction() override;
void Reinitialize(TransactionDB* txn_db, const WriteOptions& write_options,
const TransactionOptions& txn_options);
Status Prepare() override;
Status Commit() override;
// It is basically Commit without going through Prepare phase. The write batch
// is also directly provided instead of expecting txn to gradually batch the
// transactions writes to an internal write batch.
Status CommitBatch(WriteBatch* batch);
Status Rollback() override;
Status RollbackToSavePoint() override;
Status SetName(const TransactionName& name) override;
// Generate a new unique transaction identifier
static TransactionID GenTxnID();
TransactionID GetID() const override { return txn_id_; }
std::vector<TransactionID> GetWaitingTxns(uint32_t* column_family_id,
std::string* key) const override {
std::lock_guard<std::mutex> lock(wait_mutex_);
std::vector<TransactionID> ids(waiting_txn_ids_.size());
if (key) *key = waiting_key_ ? *waiting_key_ : "";
if (column_family_id) *column_family_id = waiting_cf_id_;
std::copy(waiting_txn_ids_.begin(), waiting_txn_ids_.end(), ids.begin());
return ids;
}
void SetWaitingTxn(autovector<TransactionID> ids, uint32_t column_family_id,
const std::string* key) {
std::lock_guard<std::mutex> lock(wait_mutex_);
waiting_txn_ids_ = ids;
waiting_cf_id_ = column_family_id;
waiting_key_ = key;
}
void ClearWaitingTxn() {
std::lock_guard<std::mutex> lock(wait_mutex_);
waiting_txn_ids_.clear();
waiting_cf_id_ = 0;
waiting_key_ = nullptr;
}
// Returns the time (in microseconds according to Env->GetMicros())
// that this transaction will be expired. Returns 0 if this transaction does
// not expire.
uint64_t GetExpirationTime() const { return expiration_time_; }
// returns true if this transaction has an expiration_time and has expired.
bool IsExpired() const;
// Returns the number of microseconds a transaction can wait on acquiring a
// lock or -1 if there is no timeout.
int64_t GetLockTimeout() const { return lock_timeout_; }
void SetLockTimeout(int64_t timeout) override {
lock_timeout_ = timeout * 1000;
}
// Returns true if locks were stolen successfully, false otherwise.
bool TryStealingLocks();
bool IsDeadlockDetect() const override { return deadlock_detect_; }
int64_t GetDeadlockDetectDepth() const { return deadlock_detect_depth_; }
virtual Status GetRangeLock(ColumnFamilyHandle* column_family,
const Endpoint& start_key,
const Endpoint& end_key) override;
protected:
// Refer to
// TransactionOptions::use_only_the_last_commit_time_batch_for_recovery
bool use_only_the_last_commit_time_batch_for_recovery_ = false;
// Refer to
// TransactionOptions::skip_prepare
bool skip_prepare_ = false;
virtual Status PrepareInternal() = 0;
virtual Status CommitWithoutPrepareInternal() = 0;
// batch_cnt if non-zero is the number of sub-batches. A sub-batch is a batch
// with no duplicate keys. If zero, then the number of sub-batches is unknown.
virtual Status CommitBatchInternal(WriteBatch* batch,
size_t batch_cnt = 0) = 0;
virtual Status CommitInternal() = 0;
virtual Status RollbackInternal() = 0;
virtual void Initialize(const TransactionOptions& txn_options);
Status LockBatch(WriteBatch* batch, LockTracker* keys_to_unlock);
Status TryLock(ColumnFamilyHandle* column_family, const Slice& key,
bool read_only, bool exclusive, const bool do_validate = true,
const bool assume_tracked = false) override;
void Clear() override;
PessimisticTransactionDB* txn_db_impl_;
DBImpl* db_impl_;
// If non-zero, this transaction should not be committed after this time (in
// microseconds according to Env->NowMicros())
uint64_t expiration_time_;
// Timestamp used by the transaction to perform all GetForUpdate.
// Use this timestamp for conflict checking.
// read_timestamp_ == kMaxTxnTimestamp means this transaction has not
// performed any GetForUpdate. It is possible that the transaction has
// performed blind writes or Get, though.
TxnTimestamp read_timestamp_{kMaxTxnTimestamp};
TxnTimestamp commit_timestamp_{kMaxTxnTimestamp};
private:
friend class TransactionTest_ValidateSnapshotTest_Test;
// Used to create unique ids for transactions.
static std::atomic<TransactionID> txn_id_counter_;
// Unique ID for this transaction
TransactionID txn_id_;
// IDs for the transactions that are blocking the current transaction.
//
// empty if current transaction is not waiting.
autovector<TransactionID> waiting_txn_ids_;
// The following two represents the (cf, key) that a transaction is waiting
// on.
//
// If waiting_key_ is not null, then the pointer should always point to
// a valid string object. The reason is that it is only non-null when the
// transaction is blocked in the PointLockManager::AcquireWithTimeout
// function. At that point, the key string object is one of the function
// parameters.
uint32_t waiting_cf_id_;
const std::string* waiting_key_;
// Mutex protecting waiting_txn_ids_, waiting_cf_id_ and waiting_key_.
mutable std::mutex wait_mutex_;
// Timeout in microseconds when locking a key or -1 if there is no timeout.
int64_t lock_timeout_;
// Whether to perform deadlock detection or not.
bool deadlock_detect_;
// Whether to perform deadlock detection or not.
int64_t deadlock_detect_depth_;
// Refer to TransactionOptions::skip_concurrency_control
bool skip_concurrency_control_;
virtual Status ValidateSnapshot(ColumnFamilyHandle* column_family,
const Slice& key,
SequenceNumber* tracked_at_seq);
void UnlockGetForUpdate(ColumnFamilyHandle* column_family,
const Slice& key) override;
};
class WriteCommittedTxn : public PessimisticTransaction {
public:
WriteCommittedTxn(TransactionDB* db, const WriteOptions& write_options,
const TransactionOptions& txn_options);
// No copying allowed
WriteCommittedTxn(const WriteCommittedTxn&) = delete;
void operator=(const WriteCommittedTxn&) = delete;
~WriteCommittedTxn() override {}
using TransactionBaseImpl::GetForUpdate;
Status GetForUpdate(const ReadOptions& read_options,
ColumnFamilyHandle* column_family, const Slice& key,
std::string* value, bool exclusive,
const bool do_validate) override;
Status GetForUpdate(const ReadOptions& read_options,
ColumnFamilyHandle* column_family, const Slice& key,
PinnableSlice* pinnable_val, bool exclusive,
const bool do_validate) override;
using TransactionBaseImpl::Put;
// `key` does NOT include timestamp even when it's enabled.
Status Put(ColumnFamilyHandle* column_family, const Slice& key,
const Slice& value, const bool assume_tracked = false) override;
Status Put(ColumnFamilyHandle* column_family, const SliceParts& key,
const SliceParts& value,
const bool assume_tracked = false) override;
using TransactionBaseImpl::PutUntracked;
Status PutUntracked(ColumnFamilyHandle* column_family, const Slice& key,
const Slice& value) override;
Status PutUntracked(ColumnFamilyHandle* column_family, const SliceParts& key,
const SliceParts& value) override;
using TransactionBaseImpl::Delete;
// `key` does NOT include timestamp even when it's enabled.
Status Delete(ColumnFamilyHandle* column_family, const Slice& key,
const bool assume_tracked = false) override;
Status Delete(ColumnFamilyHandle* column_family, const SliceParts& key,
const bool assume_tracked = false) override;
using TransactionBaseImpl::DeleteUntracked;
Status DeleteUntracked(ColumnFamilyHandle* column_family,
const Slice& key) override;
Status DeleteUntracked(ColumnFamilyHandle* column_family,
const SliceParts& key) override;
using TransactionBaseImpl::SingleDelete;
// `key` does NOT include timestamp even when it's enabled.
Status SingleDelete(ColumnFamilyHandle* column_family, const Slice& key,
const bool assume_tracked = false) override;
Status SingleDelete(ColumnFamilyHandle* column_family, const SliceParts& key,
const bool assume_tracked = false) override;
using TransactionBaseImpl::SingleDeleteUntracked;
Status SingleDeleteUntracked(ColumnFamilyHandle* column_family,
const Slice& key) override;
using TransactionBaseImpl::Merge;
Status Merge(ColumnFamilyHandle* column_family, const Slice& key,
const Slice& value, const bool assume_tracked = false) override;
Status SetReadTimestampForValidation(TxnTimestamp ts) override;
Status SetCommitTimestamp(TxnTimestamp ts) override;
TxnTimestamp GetCommitTimestamp() const override { return commit_timestamp_; }
private:
template <typename TValue>
Status GetForUpdateImpl(const ReadOptions& read_options,
ColumnFamilyHandle* column_family, const Slice& key,
TValue* value, bool exclusive,
const bool do_validate);
template <typename TKey, typename TOperation>
Status Operate(ColumnFamilyHandle* column_family, const TKey& key,
const bool do_validate, const bool assume_tracked,
TOperation&& operation);
Status PrepareInternal() override;
Status CommitWithoutPrepareInternal() override;
Status CommitBatchInternal(WriteBatch* batch, size_t batch_cnt) override;
Status CommitInternal() override;
Status RollbackInternal() override;
// Column families that enable timestamps and whose data are written when
// indexing_enabled_ is false. If a key is written when indexing_enabled_ is
// true, then the corresponding column family is not added to cfs_with_ts
// even if it enables timestamp.
std::unordered_set<uint32_t> cfs_with_ts_tracked_when_indexing_disabled_;
};
} // namespace ROCKSDB_NAMESPACE
#endif // ROCKSDB_LITE