Go to file
Yanqin Jin 1777e5f7e9 Snapshots with user-specified timestamps (#9879)
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.

It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.

This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.

In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.

This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```

If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.

Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();

// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
    std::shared_ptr<TransactionNotifier> notifier,
    TxnTimestamp ts,
    std::shared_ptr<const Snapshot>* ret);

// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```

The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```

We also provide two additional APIs for stats collection and reporting purposes.

```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
    TxnTimestamp ts_lb,
    TxnTimestamp ts_ub,
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```

To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```

Before shutdown, RocksDB will release all timestamped snapshots.

Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.

Timestamped snapshots can also be used with user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879

Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```

Reviewed By: siying

Differential Revision: D35783919

Pulled By: riversand963

fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
2022-06-10 16:07:03 -07:00
.circleci Disable CI benchmark from #9723 (#10119) 2022-06-05 22:28:50 -07:00
.github/workflows Use released clang-format instead of the one from dev branch (#9646) 2022-03-01 10:51:38 -08:00
buckifier Multi file concurrency in MultiGet using coroutines and async IO (#9968) 2022-05-19 15:36:27 -07:00
build_tools Update jemalloc version for platform009 (#10143) 2022-06-09 17:13:13 -07:00
cache Enable SecondaryCache::CreateFromString to create sec cache based on the uri for CompressedSecondaryCache (#10132) 2022-06-10 12:23:10 -07:00
cmake gcc-11 and cmake related cleanup (#9286) 2021-12-17 17:04:35 -08:00
coverage Fix commit_prereq and other targets (#9797) 2022-04-04 09:58:18 -07:00
db Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
db_stress_tool Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
docs Bump nokogiri from 1.13.4 to 1.13.6 in /docs (#10019) 2022-05-20 11:00:15 -07:00
env Add comments/permit unchecked error to close_db_dir pull requests (#10093) 2022-06-02 21:52:35 -07:00
examples Add a simple example of backup and restore (#10054) 2022-06-03 23:25:31 -07:00
file Handle "NotSupported" status by default implementation of Close() in … (#10127) 2022-06-07 09:49:31 -07:00
fuzz Fix compilation errors and add fuzzers to CircleCI (#9420) 2022-02-01 10:32:15 -08:00
include/rocksdb Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
java Fix Java build (#10105) 2022-06-03 08:11:31 -07:00
logging Deflake unit test BackupEngineTest.Concurrency (#10069) 2022-05-31 09:36:32 -07:00
memory Fail DB::Open() if logger cannot be created (#9984) 2022-05-27 07:23:31 -07:00
memtable Rewrite memory-charging feature's option API (#9926) 2022-05-17 15:01:51 -07:00
microbench Fix flaky db_basic_bench caused by unreleased iterator (#10058) 2022-05-25 18:02:04 -07:00
monitoring Fix overflow bug in standard deviation computation. (#10100) 2022-06-06 13:53:47 -07:00
options Make it possible to enable blob files starting from a certain LSM tree level (#10077) 2022-06-02 20:04:33 -07:00
plugin Add initial CMake support to plugin (#9214) 2021-11-30 17:16:53 -08:00
port Explicitly closing all directory file descriptors (#10049) 2022-06-01 18:03:34 -07:00
table Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128) 2022-06-10 08:51:45 -07:00
test_util Fix a bug of not setting enforce_single_del_contracts (#10027) 2022-05-20 16:48:50 -07:00
third-party Meta-internal folly integration with F14FastMap (#9546) 2022-04-13 07:34:01 -07:00
tools Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
trace_replay Use std::numeric_limits<> (#9954) 2022-05-05 13:08:21 -07:00
util Add wide column serialization primitives (#9915) 2022-06-03 20:54:48 -07:00
utilities Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
.clang-format
.gitignore Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
.lgtm.yml Create lgtm.yml for LGTM.com C/C++ analysis (#4058) 2018-06-26 12:43:04 -07:00
.travis.yml Fix remaining uses of "backupable" (#9792) 2022-04-05 09:52:33 -07:00
.watchmanconfig Added .watchmanconfig file to rocksdb repo (#5593) 2019-07-19 15:00:33 -07:00
AUTHORS Update RocksDB Authors File 2017-10-18 14:42:10 -07:00
CMakeLists.txt Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
CODE_OF_CONDUCT.md Adopt Contributor Covenant 2019-08-29 23:21:01 -07:00
CONTRIBUTING.md Add Code of Conduct 2017-12-05 18:42:35 -08:00
COPYING
DEFAULT_OPTIONS_HISTORY.md Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363) 2022-01-18 17:31:03 -08:00
DUMP_FORMAT.md
HISTORY.md Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
INSTALL.md Update supported VS versions in INSTALL.md (#9823) 2022-04-13 13:03:40 -07:00
LANGUAGE-BINDINGS.md Update branch name to "main" in README/LANGUAGE_BINDINGS (#8727) 2021-09-01 15:26:34 -07:00
LICENSE.Apache Change RocksDB License 2017-07-15 16:11:23 -07:00
LICENSE.leveldb Add back the LevelDB license file 2017-07-16 18:42:18 -07:00
Makefile Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
PLUGINS.md Add pmem-rocksdb-plugin link in PLUGINs.md (#9934) 2022-05-12 22:02:28 -07:00
README.md README: De-list slack channel, list Google group (#9387) 2022-01-18 08:19:48 -08:00
ROCKSDB_LITE.md Fix remaining uses of "backupable" (#9792) 2022-04-05 09:52:33 -07:00
TARGETS Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
USERS.md Add Solana's RocksDB use case in USERS.md (#9558) 2022-02-16 09:23:01 -08:00
Vagrantfile Adding CentOS 7 Vagrantfile & build script 2018-02-26 15:27:17 -08:00
WINDOWS_PORT.md Update branch name in WINDOWS_PORT.md (#8745) 2021-09-01 19:26:39 -07:00
common.mk Clean up variables for temporary directory (#9961) 2022-05-06 16:38:06 -07:00
crash_test.mk Clean up variables for temporary directory (#9961) 2022-05-06 16:38:06 -07:00
issue_template.md Add Google Group to Issue Template 2020-01-28 14:40:37 -08:00
rocksdb.pc.in build: fix pkg-config file generation (#9953) 2022-05-30 12:46:40 -07:00
src.mk Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
thirdparty.inc Fix build jemalloc api (#5470) 2019-06-24 17:40:32 -07:00

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

CircleCI Status TravisCI Status Appveyor Build status PPC64le Build Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.