rocksdb/tools
Yanqin Jin 1777e5f7e9 Snapshots with user-specified timestamps (#9879)
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.

It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.

This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.

In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.

This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```

If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.

Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();

// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
    std::shared_ptr<TransactionNotifier> notifier,
    TxnTimestamp ts,
    std::shared_ptr<const Snapshot>* ret);

// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```

The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```

We also provide two additional APIs for stats collection and reporting purposes.

```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
    TxnTimestamp ts_lb,
    TxnTimestamp ts_ub,
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```

To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```

Before shutdown, RocksDB will release all timestamped snapshots.

Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.

Timestamped snapshots can also be used with user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879

Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```

Reviewed By: siying

Differential Revision: D35783919

Pulled By: riversand963

fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
2022-06-10 16:07:03 -07:00
..
advisor Update branch as "main" in tools/advisor/README.md (#8744) 2021-09-01 20:26:28 -07:00
block_cache_analyzer Use std::numeric_limits<> (#9954) 2022-05-05 13:08:21 -07:00
dump Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
CMakeLists.txt Mark dependencies as PRIVATE and fix missing dependencies in tools. (#6790) 2020-05-12 21:07:55 -07:00
Dockerfile adding docker build script and dockerfile 2015-05-22 16:03:39 -07:00
analyze_txn_stress_test.sh Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
auto_sanity_test.sh Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
backup_db.sh Revamp check_format_compatible.sh (#8012) 2021-03-02 11:42:27 -08:00
benchmark.sh Fix parsing of db_bench output (#10124) 2022-06-08 09:23:36 -07:00
benchmark_leveldb.sh Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
blob_dump.cc Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
check_all_python.py Allow missing "unversioned" python, as in CentOS 8 (#6883) 2020-05-29 11:29:23 -07:00
check_format_compatible.sh Update version on main to 7.4 and add 7.3 to the format compatibility checks (#10038) 2022-05-23 14:55:33 -07:00
db_bench.cc Add (& fix) some simple source code checks (#8821) 2021-09-07 21:19:27 -07:00
db_bench_tool.cc Add support for FastLRUCache in db_bench. (#10096) 2022-06-03 11:16:49 -07:00
db_bench_tool_test.cc Make it possible to enable blob files starting from a certain LSM tree level (#10077) 2022-06-02 20:04:33 -07:00
db_crashtest.py Snapshots with user-specified timestamps (#9879) 2022-06-10 16:07:03 -07:00
db_repl_stress.cc Remove using namespace (#9369) 2022-01-12 09:31:12 -08:00
db_sanity_test.cc Remove own ToString() (#9955) 2022-05-06 13:03:58 -07:00
dbench_monitor Fix /bin/bash shebangs 2017-08-03 15:56:46 -07:00
generate_random_db.sh Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
ingest_external_sst.sh Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
io_tracer_parser.cc Add IO Tracer Parser (#7333) 2020-09-23 15:50:26 -07:00
io_tracer_parser_test.cc Cleanup includes in dbformat.h (#8930) 2021-09-29 04:04:40 -07:00
io_tracer_parser_tool.cc Add request_id in IODebugContext. (#8045) 2021-04-01 13:14:51 -07:00
io_tracer_parser_tool.h Add IO Tracer Parser (#7333) 2020-09-23 15:50:26 -07:00
ldb.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
ldb_cmd.cc Make it possible to enable blob files starting from a certain LSM tree level (#10077) 2022-06-02 20:04:33 -07:00
ldb_cmd_impl.h Support single delete in ldb (#9469) 2022-05-10 16:37:19 -07:00
ldb_cmd_test.cc Remove own ToString() (#9955) 2022-05-06 13:03:58 -07:00
ldb_test.py Make it possible to enable blob files starting from a certain LSM tree level (#10077) 2022-06-02 20:04:33 -07:00
ldb_tool.cc Default `try_load_options` to true when DB is specified (#9937) 2022-05-04 08:49:46 -07:00
pflag Fix /bin/bash shebangs 2017-08-03 15:56:46 -07:00
reduce_levels_test.cc Remove own ToString() (#9955) 2022-05-06 13:03:58 -07:00
regression_test.sh Abort RocksDB performance regression test on failure in test setup (#10053) 2022-05-25 13:35:08 -07:00
restore_db.sh Revamp check_format_compatible.sh (#8012) 2021-03-02 11:42:27 -08:00
rocksdb_dump_test.sh Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
run_blob_bench.sh Make it possible to enable blob files starting from a certain LSM tree level (#10077) 2022-06-02 20:04:33 -07:00
run_flash_bench.sh Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
run_leveldb.sh Add copyright headers per FB open-source checkup tool. (#5199) 2019-04-18 10:55:01 -07:00
sample-dump.dmp First version of rocksdb_dump and rocksdb_undump. 2015-06-19 16:24:36 -07:00
simulated_hybrid_file_system.cc Improve SimulatedHybridFileSystem (#9301) 2021-12-29 11:14:42 -08:00
simulated_hybrid_file_system.h Improve SimulatedHybridFileSystem (#9301) 2021-12-29 11:14:42 -08:00
sst_dump.cc Implement a new subcommand "identify" for sst_dump (#6943) 2020-06-08 13:58:28 -07:00
sst_dump_test.cc Use the comparator from the sst file table properties in sst_dump_tool (#9491) 2022-02-08 12:15:35 -08:00
sst_dump_tool.cc Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857) 2022-05-20 12:09:09 -07:00
trace_analyzer.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
trace_analyzer_test.cc Support read rate-limiting in SequentialFileReader (#9973) 2022-05-24 10:28:57 -07:00
trace_analyzer_tool.cc Support read rate-limiting in SequentialFileReader (#9973) 2022-05-24 10:28:57 -07:00
trace_analyzer_tool.h Add commit marker with timestamp (#9266) 2021-12-10 11:05:35 -08:00
verify_random_db.sh Fix some bugs in verify_random_db.sh (#10112) 2022-06-03 16:35:13 -07:00
write_external_sst.sh Revamp check_format_compatible.sh (#8012) 2021-03-02 11:42:27 -08:00
write_stress.cc Add a SystemClock class to capture the time functions of an Env (#7858) 2021-01-25 22:09:11 -08:00
write_stress_runner.py Allow missing "unversioned" python, as in CentOS 8 (#6883) 2020-05-29 11:29:23 -07:00