Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.
This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```
If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.
Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();
// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
std::shared_ptr<TransactionNotifier> notifier,
TxnTimestamp ts,
std::shared_ptr<const Snapshot>* ret);
// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```
The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```
We also provide two additional APIs for stats collection and reporting purposes.
```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
TxnTimestamp ts_lb,
TxnTimestamp ts_ub,
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```
To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```
Before shutdown, RocksDB will release all timestamped snapshots.
Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.
Timestamped snapshots can also be used with user-defined timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879
Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```
Reviewed By: siying
Differential Revision: D35783919
Pulled By: riversand963
fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
Summary:
With this change, when a given read timestamp is smaller than the column-family's full_history_ts_low, Get(), MultiGet() and iterators APIs will return Status::InValidArgument().
Test plan
```
$COMPILE_WITH_ASAN=1 make -j24 all
$./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.UpdateFullHistoryTsLow
$ make -j24 check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10109
Reviewed By: riversand963
Differential Revision: D36901126
Pulled By: jowlyzhang
fbshipit-source-id: 255feb1a66195351f06c1d0e42acb1ff74527f86
Summary:
We have three related concepts:
* BlockType: an internal enum conceptually indicating a type of SST file
block
* CacheEntryRole: a user-facing enum for categorizing block cache entries,
which is also involved in associated cache entries with an appropriate
deleter. Can include categories for non-block cache entries (e.g. memory
reservations).
* TBlocklike: a C++ type for the actual type behind a void* cache entry.
We had some existing code ugliness because BlockType did not imply
TBlocklike, because of various kinds of "filter" block. This refactoring
fixes that with new BlockTypes.
More clean-up can come in later work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10098
Test Plan: existing tests
Reviewed By: akankshamahajan15
Differential Revision: D36897945
Pulled By: pdillinger
fbshipit-source-id: 3ae496b5caa81e0a0ed85e873eb5b525e2d9a295
Summary:
If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`,
RocksDB will return the timestamps of the point tombstones.
Note: DeleteRange is still unsupported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D36677956
Pulled By: riversand963
fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.
Reviewed By: ajkr, hx235
Differential Revision: D36722135
Pulled By: littlepig2013
fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
Summary:
TSAN test is slower, for `TransactionStressTest` and
`DeadlockStress`, they're reaching the timeout limit of 600 seconds.
Decreasing the transaction test number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10063
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D36711727
Pulled By: jay-zhuang
fbshipit-source-id: 600f82a6d32108f52fbe5572fcc7497607b7fe98
Summary:
For regular db instance and secondary instance, we return error and refuse to open DB if Logger creation fails.
Our current code allows it, but it is really difficult to debug because
there will be no LOG files. The same for OPTIONS file, which will be explored in another PR.
Furthermore, Arena::AllocateAligned(size_t bytes, size_t huge_page_size, Logger* logger) has an
assertion as the following:
```cpp
#ifdef MAP_HUGETLB
if (huge_page_size > 0 && bytes > 0) {
assert(logger != nullptr);
}
#endif
```
It can be removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9984
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D36347754
Pulled By: riversand963
fbshipit-source-id: 529798c0511d2eaa2f0fd40cf7e61c4cbc6bc57e
Summary:
There are currently some preprocessor checks that assume support for Visual Studio versions older than 2015 (i.e., 0 < _MSC_VER < 1900), although we don't support them any more.
We removed all code that only compiles on those older versions, except third-party/ files.
The ROCKSDB_NOEXCEPT symbol is now obsolete, since it now always gets replaced by noexcept. We removed it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10065
Reviewed By: pdillinger
Differential Revision: D36721901
Pulled By: guidotag
fbshipit-source-id: a2892d365ef53cce44a0a7d90dd6b72ee9b5e5f2
Summary:
There are some time-related POSIX APIs that are not available on Windows
(e.g. `localtime_r`), which we have worked around by providing our own
implementations in `port/sys_time.h`. This workaround actually relies on
some ambiguity: on Windows, a call to `localtime_r` calls
`ROCKSDB_NAMESPACE::port::localtime_r` (which is pulled into
`ROCKSDB_NAMESPACE` by a using-declaration), while on other platforms
it calls the global `localtime_r`. This works fine as long as there is only one
candidate function; however, it breaks down when there is more than one
`localtime_r` visible in a scope.
The patch fixes this by introducing `ROCKSDB_NAMESPACE::port::{TimeVal, GetTimeOfDay, LocalTimeR}`
to eliminate any ambiguity.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10045
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D36639372
Pulled By: ltamasi
fbshipit-source-id: fc13dbfa421b7c8918111a6d9e24ce77e91a7c50
Summary:
Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).
The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973
Test Plan:
- `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
- Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
- Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
- Benchmark:
```
strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
```
- db bench on backup and restore to ensure no performance regression.
- backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
- restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)
```
# Set up
./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000
# benchmark
TEST_TMPDIR=/tmp/test_rocksdb
NUM_RUN=50
for ((j=0;j<$NUM_RUN;j++))
do
./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
# Restore
#./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
```
Reviewed By: hx235
Differential Revision: D36327418
Pulled By: cbi42
fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
Summary:
Changed the static objects that had non-trivial destructors to use the STATIC_AVOID_DESTRUCTION construct.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9958
Reviewed By: pdillinger
Differential Revision: D36442982
Pulled By: mrambacher
fbshipit-source-id: 029d47b1374d30d198bfede369a4c0ae7a4eb519
Summary:
**Context:**
`BackupEngineRateLimitingTestWithParam.RateLimiting` and `BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup` involve creating backup and restoring of a big database with rate-limiting. Using the normal env with a normal clock requires real elapse of time (13702 - 19848 ms/per test). As suggested in https://github.com/facebook/rocksdb/pull/8722#discussion_r703698603, this PR is to speed it up with SpecialEnv (`time_elapse_only_sleep=true`) where its clock accepts fake elapse of time during rate-limiting (100 - 600 ms/per test)
**Summary:**
- Added TEST_ function to set clock of the default rate limiters in backup engine
- Shrunk testdb by 10 times while keeping it big enough for testing
- Renamed some test variables and reorganized some if-else branch for clarity without changing the test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9974
Test Plan:
- Run tests pre/post PR the same time to verify the tests are sped up by 90 - 95%
`BackupEngineRateLimitingTestWithParam.RateLimiting`
Pre:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0 (11123 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 (9441 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2 (11096 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3 (9339 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4 (11121 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5 (9413 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6 (11185 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7 (9511 ms)
[----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (82230 ms total)
```
Post:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0 (395 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 (564 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2 (358 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3 (567 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4 (173 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5 (176 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6 (191 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7 (177 ms)
[----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (2601 ms total)
```
`BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup`
Pre:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0 (7275 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1 (3961 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2 (7117 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3 (3921 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4 (19862 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5 (10231 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6 (19848 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7 (10372 ms)
[----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (82587 ms total)
```
Post:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0 (157 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1 (152 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2 (160 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3 (158 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4 (155 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5 (151 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6 (146 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7 (153 ms)
[----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (1232 ms total)
```
Reviewed By: pdillinger
Differential Revision: D36336345
Pulled By: hx235
fbshipit-source-id: 724c6ba745f95f56d4440a6d2f1e4512a2987589
Summary:
These methods allow for more thorough testing of the ObjectRegistry and Customizable infrastructure in a simpler manner. With this change, the Customizable tests can now check what factories are registered and attempt to create each of them in a systematic fashion.
With this change, I think all of the factories registered with the ObjectRegistry/CreateFromString are now tested via the customizable_test classes.
Note that there were a few other minor changes. There was a "posix://*" register with the ObjectRegistry which was missed during the PatternEntry conversion -- these changes found that. The nickname and default names for the FileSystem classes was also inverted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9358
Reviewed By: pdillinger
Differential Revision: D33433542
Pulled By: mrambacher
fbshipit-source-id: 9a32da74e6620745b4eeffb2712be70eeeadfa7e
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
`db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`.
Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947
Test Plan:
- regular blackbox, blackbox --simple
- various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16`
Reviewed By: riversand963
Differential Revision: D36152775
Pulled By: ajkr
fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
Enforce the contract of SingleDelete so that they are not mixed with
Delete for the same key. Otherwise, it will lead to undefined behavior.
See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes.
Also fix unit tests and write-unprepared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D35837817
Pulled By: riversand963
fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635
Summary:
Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions.
1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw.
2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true.
3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn
```
Reviewed By: pdillinger
Differential Revision: D35806678
Pulled By: riversand963
fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616
Summary:
The current locktree implementation stores the address of the
PessimisticTransactions object as the TXNID. However, when a transaction
is blocked on a lock, it records the list of waitees with conflicting
locks using the rocksdb assigned TransactionID. This is performed by
calling GetID() on PessimisticTransactions objects of the waitees,
and then recorded in the waiter's list.
However, there is no guarantee the objects are valid when recording the
waitee list during the conflict callbacks because the waitee
could have released the lock and freed the PessimisticTransactions
object.
The waitee/txnid values are only valid PessimisticTransaction objects
while the mutex for the root of the locktree is held.
The simplest fix for this problem is to use the address of the
PessimisticTransaction as the TransactionID so that it is consistent
with its usage in the locktree. The TXNID is only converted back to a
PessimisticTransaction for the report_wait callbacks. Since
these callbacks are now all made within the critical section where the
lock_request queue mutx is held, these conversions will be safe.
Otherwise, only the uint64_t TXNID of the waitee is registerd
with the waiter transaction. The PessimisitcTransaction object of the
waitee is never referenced.
The main downside of this approach is the TransactionID will not change
if the PessimisticTransaction object is reused for new transactions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9898
Test Plan:
Add a new test case and run unit tests.
Also verified with MyRocks workloads using range locks that the
crash no longer happens.
Reviewed By: riversand963
Differential Revision: D35950376
Pulled By: hermanlee
fbshipit-source-id: 8c9cae272e23e487fc139b6a8ed5b8f8f24b1570
Summary:
This PR does not affect write-committed.
Add a member, `rollback_deletion_type_callback` to TransactionDBOptions
so that a write-prepared transaction, when rolling back, can call this
callback to decide if a `Delete` or `SingleDelete` should be used to
cancel a prior `Put` written to the database during prepare phase.
The purpose of this PR is to prevent mixing `Delete` and `SingleDelete`
for the same key, causing undefined behaviors. Without this PR, the
following can happen:
```
// The application always issues SingleDelete when deleting keys.
txn1->Put('a');
txn1->Prepare(); // writes to memtable and potentially gets flushed/compacted to Lmax
txn1->Rollback(); // inserts DELETE('a')
txn2->Put('a');
txn2->Commit(); // writes to memtable and potentially gets flushed/compacted
```
In the database, we may have
```
L0: [PUT('a', s=100)]
L1: [DELETE('a', s=90)]
Lmax: [PUT('a', s=0)]
```
If a compaction compacts L0 and L1, then we have
```
L1: [PUT('a', s=100)]
Lmax: [PUT('a', s=0)]
```
If a future transaction issues a SingleDelete, we have
```
L0: [SD('a', s=110)]
L1: [PUT('a', s=100)]
Lmax: [PUT('a', s=0)]
```
Then, a compaction including L0, L1 and Lmax leads to
```
Lmax: [PUT('a', s=0)]
```
which is incorrect.
Similar bugs reported and addressed in
https://github.com/cockroachdb/pebble/issues/1255. Based on our team's
current priority, we have decided to take this approach for now. We may
come back and revisit in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9873
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D35762170
Pulled By: riversand963
fbshipit-source-id: b28d56eefc786b53c9844b9ef4a7807acdd82c8d
Summary:
Add a merge operator that allows users to register specific aggregation function so that they can does aggregation based per key using different aggregation types.
See comments of function CreateAggMergeOperator() for actual usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9780
Test Plan: Add a unit test to coverage various cases.
Reviewed By: ltamasi
Differential Revision: D35267444
fbshipit-source-id: 5b02f31c4f3e17e96dd4025cdc49fca8c2868628
Summary:
In `FileMetaData`, we keep track of the lowest-numbered blob file
referenced by the SST file in question for the purposes of BlobDB's
garbage collection in the `oldest_blob_file_number` field, which is
updated in `UpdateBoundaries`. However, with the current code,
`BlobIndex` decoding errors (or invalid blob file numbers) are swallowed
in this method. The patch changes this by propagating these errors
and failing the corresponding flush/compaction. (Note that since blob
references are generated by the BlobDB code and also parsed by
`CompactionIterator`, in reality this can only happen in the case of
memory corruption.)
This change necessitated updating some unit tests that involved
fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like
`"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es.
Some were relying on the earlier behavior to simulate corruption; these
were replaced with `SyncPoint`-based test code that corrupts a valid
blob reference at read time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D35683671
Pulled By: ltamasi
fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6
Summary:
Especially after updating to C++17, I don't see a compelling case for
*requiring* any folly components in RocksDB. I was able to purge the existing
hard dependencies, and it can be quite difficult to strip out non-trivial components
from folly for use in RocksDB. (The prospect of doing that on F14 has changed
my mind on the best approach here.)
But this change creates an optional integration where we can plug in
components from folly at compile time, starting here with F14FastMap to replace
std::unordered_map when possible (probably no public APIs for example). I have
replaced the biggest CPU users of std::unordered_map with compile-time
pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
that is in the Makefile for public CI testing. A full folly build is not needed, but
checking out the full folly repo is much simpler for getting the dependency,
and anything else we might want to optionally integrate in the future.
Some picky details:
* I don't think the distributed mutex stuff is actually used, so it was easy to remove.
* I implemented an alternative to `folly::constexpr_log2` (which is much easier
in C++17 than C++11) so that I could pull out the hard dependencies on
`ConstexprMath.h`
* I had to add noexcept move constructors/operators to some types to make
F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
macro to make that easier in some common cases.
* Updated Meta-internal buck build to use folly F14Map (always)
No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
production integration for open source users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
Test Plan:
CircleCI tests updated so that a couple of them use folly.
Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
(Note: they should probably use buck but they currently use Makefile.)
Example performance improvement: when filter partitions are pinned in cache,
they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
a test that exercises that heavily. Build DB with
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
```
and test with (simultaneous runs with & without folly, ~20 times each to see
convergence)
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
```
Average ops/s no folly: 26229.2
Average ops/s with folly: 26853.3 (+2.4%)
Reviewed By: ajkr
Differential Revision: D34181736
Pulled By: pdillinger
fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
Summary:
Added a Plugin class to the ObjectRegistry. Enabled compile-time and program-time addition of plugins to the Registry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7949
Reviewed By: mrambacher
Differential Revision: D33517674
Pulled By: pdillinger
fbshipit-source-id: c3e3270aab76a489bfa9e85d78cdfca951912557
Summary:
For write-prepared/write-unprepared transactions,
GetCommitTimeWriteBatch() can be used only if the transaction is started
with `TransactionOptions::use_only_the_last_commit_time_batch_for_recovery` set
to true. Otherwise, it is possible that multiple uncommitted versions of the
same key exist in the database. During bottommost compaction, RocksDB may
set the sequence numbers of both to zero once they become committed, causing
output SST file to have two identical internal keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9794
Test Plan:
make check
pay special attention to the following
```
transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/*
```
Reviewed By: lth
Differential Revision: D35327214
Pulled By: riversand963
fbshipit-source-id: 3bae00a28359c10e96e4c6f676d20de5610d8a0f
Summary:
Various renaming and fixes to get rid of remaining uses of
"backupable" which is terminology leftover from the original, flawed
design of BackupableDB. Now any DB can be backed up, using BackupEngine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D35334386
Pulled By: pdillinger
fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329
Summary:
In making `SstFileMetaData` inherit from `FileStorageInfo`, I
overlooked setting some `FileStorageInfo` fields when then default
`SstFileMetaData()` ctor is used. This affected `GetLiveFilesMetaData()`.
Also removed some buggy `static_cast<size_t>`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9769
Test Plan: Updated tests
Reviewed By: jay-zhuang
Differential Revision: D35220383
Pulled By: pdillinger
fbshipit-source-id: 05b4ee468258dbd3699517e1124838bf405fe7f8
Summary:
The param name force_erase may be misleading, since the handle is erased only if it has last reference even if the param is set true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9728
Reviewed By: pdillinger
Differential Revision: D35038673
Pulled By: gitbw95
fbshipit-source-id: 0d16d1e8fed17b97eba7fb53207119332f659a5f
Summary:
The primary goal of this change is to add support for backing up and
restoring (applying on restore) file temperature metadata, without
committing to either the DB manifest or the FS reported "current"
temperatures being exclusive "source of truth".
To achieve this goal, we need to add temperature information to backup
metadata, which requires updated backup meta schema. Fortunately I
prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
6.19.0 for this kind of schema update. (Previously, backup meta schema
was not extensible! Making this schema update public will allow some
other "nice to have" features like taking backups with hard links, and
avoiding crc32c checksum computation when another checksum is already
available.) While schema version 2 is newly public, the default schema
version is still 1. Until we change the default, users will need to set
to 2 to enable features like temperature data backup+restore. New
metadata like temperature information will be ignored with a warning
in versions before this change and since 6.19.0. The metadata is
considered ignorable because a functioning DB can be restored without
it.
Some detail:
* Some renaming because "future schema" is now just public schema 2.
* Initialize some atomics in TestFs (linter reported)
* Add temperature hint support to SstFileDumper (used by BackupEngine)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660
Test Plan:
related unit test majorly updated for the new functionality,
including some shared testing support for tracking temperatures in a FS.
Some other tests and testing hooks into production code also updated for
making the backup meta schema change public.
Reviewed By: ajkr
Differential Revision: D34686968
Pulled By: pdillinger
fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629
Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are
locked upon first operation that writes the key or has the intention of writing. For example,
`PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`,
`PessimisticTransaction::SingleDelete()` will write to or delete a key, while
`PessimisticTransaction::GetForUpdate()` is used by application to indicate
to RocksDB that the transaction has the intention of performing write operation later
in the same transaction.
Pessimistic transactions support two-phase commit (2PC). A transaction can be
`Prepared()`'ed and then `Commit()`. The prepare phase is similar to a promise: once
`Prepare()` succeeds, the transaction has acquired the necessary resources to commit.
The resources include locks, persistence of WAL, etc.
Write-committed transaction is the default pessimistic transaction implementation. In
RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare
section. `Commit()` will write a commit marker to the WAL and then write data to the
memtables. While writing to the memtables, different keys in the transaction's write batch
will be assigned different sequence numbers in ascending order.
Until commit/rollback, the transaction holds locks on the keys so that no other transaction
can write to the same keys. Furthermore, the keys' sequence numbers represent the order
in which they are committed and should be made visible. This is convenient for us to
implement support for user-defined timestamps.
Since column families with and without timestamps can co-exist in the same database,
a transaction may or may not involve timestamps. Based on this observation, we add two
optional members to each `PessimisticTransaction`, `read_timestamp_` and
`commit_timestamp_`. If no key in the transaction's write batch has timestamp, then
setting these two variables do not have any effect. For the rest of this commit, we discuss
only the cases when these two variables are meaningful.
read_timestamp_ is used mainly for validation, and should be set before first call to
`GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls
`TryLock()` that can verify if another transaction has written the same key since
`read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed
written the same key, then validation fails, and RocksDB allows this transaction to
refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()`
with a different timestamp to read, but the result of the read should not be used to
determine data that will be written later.
commit_timestamp_ must be set after finishing writing and before transaction commit.
This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after
prepare phase succeeds.
We currently require that the commit timestamp be chosen after all keys are locked. This
means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used
by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`,
`PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`,
etc. will return non-ok status because they specify timestamps before locking the keys.
Users are also prompted to use the `Transaction` APIs when they receive the non-ok status.
Reviewed By: ltamasi
Differential Revision: D31822445
fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43
Summary:
In preparation for more support for file Temperatures in BackupEngine,
this change does some test refactoring:
* Move DBTest2::BackupFileTemperature test to
BackupEngineTest::FileTemperatures, with some updates to make it work
in the new home. This test will soon be expanded for deeper backup work.
* Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
support sharing because of above moved test, but split off the "no link"
part to the test needing it.
* Use custom FileSystems in backupable_db_test rather than custom Envs,
because going through Env file interfaces doesn't support temperatures.
* Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name
parameter to FsyncWithDirOptions, which was required because this
limitation caused a crash only after moving to higher fidelity of
FileSystem interface (vs. LegacyDirectoryWrapper throwing away some
parameter details)
* `backupable_options_` -> `engine_options_` as part of the ongoing
work to get rid of the obsolete "backupable" naming.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655
Test Plan: test code updates only
Reviewed By: jay-zhuang
Differential Revision: D34622183
Pulled By: pdillinger
fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
Summary:
This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps
and reading the keys back. **The index does not have timestamps.**
Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it.
When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before
comparison.
When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be
provided to query the db. The assumption is that data in the batch must be newer than data from the db.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D34354849
Pulled By: riversand963
fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0
Summary:
Add Temperature hints information from RocksDB in API
`NewSequentialFile()`. backup and checkpoint operations need to open the
source files with `NewSequentialFile()`, which will have the temperature
hints. Other operations are not covered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499
Test Plan: Added unittest
Reviewed By: pdillinger
Differential Revision: D34006115
Pulled By: jay-zhuang
fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f
Summary:
Make FilterPolicy into a Customizable class. Allow new FilterPolicy to be discovered through the ObjectRegistry
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590
Reviewed By: pdillinger
Differential Revision: D34327367
Pulled By: mrambacher
fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).
The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
- setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
- benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time thread 1 bg flush
| db->Put("a")
| txn = NewTxn()
| txn->Put("b", "v")
| txn->Prepare() // writes only to 5.log
| db->SwitchMemtable() // memtable 1 has "a"
| // close 5.log,
| // creates 8.log
| trigger flush
| pick memtable 1
| unlock db mutex
| write new sst
| txn->ctwb->Put("gtid", "1") // writes 8.log
| txn->Commit() // writes to 8.log
| // writes to memtable 2
| compute min_log_number_to_keep_2pc, this
| will be 8 (incorrect).
|
| Purge obsolete wals, including 5.log
|
V
```
At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in
memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"},
only the commit marker of txn is in 8.log
The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.
The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571
Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```
Reviewed By: siying
Differential Revision: D34235236
Pulled By: riversand963
fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
Summary:
Add support for doubles to ObjectLibrary::PatternEntry. This support will allow patterns containing a non-integer number to be parsed correctly.
Added appropriate test cases to cover this new option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9577
Reviewed By: pdillinger
Differential Revision: D34269763
Pulled By: mrambacher
fbshipit-source-id: b5ce16cbd3665c2974ec0f3412ef2b403ef8b155
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9562
With per-transaction `read_timestamp_`, it is possible to perform transaction validation after
locking a key in addition to sequence-based validation. Specifically, if a transaction has a
read_timestamp, then we perform timestamp-based validation as well after the key is locked
via `GetForUpdate()`. This is to make sure that no other transaction has modified the key and
committed successfully since the read timestamp (but before the locking operation) which
represents a consistent view of the database.
Reviewed By: ltamasi
Differential Revision: D31822034
fbshipit-source-id: c6f1828b7fc23e4f85e2d1ed73ff51464a058d91
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9537
Add `Transaction::SetReadTimestampForValidation()` and
`Transaction::SetCommitTimestamp()` APIs with default implementation
returning `Status::NotSupported()`. Currently, calling these two APIs do not
have any effect.
Also add checks to `PessimisticTransactionDB`
to enforce that column families in the same db either
- disable user-defined timestamp
- enable 64-bit timestamp
Just to clarify, a `PessimisticTransactionDB` can have some column families without
timestamps as well as column families that enable timestamp.
Each `PessimisticTransaction` can have two optional timestamps, `read_timestamp_`
used for additional validation and `commit_timestamp_` which denotes when the transaction commits.
For now, we are going to support `WriteCommittedTxn` (in a series of subsequent PRs)
Once set, we do not allow decreasing `read_timestamp_`. The `commit_timestamp_` must be
greater than `read_timestamp_` for each transaction and must be set before commit, unless
the transaction does not involve any column family that enables user-defined timestamp.
TransactionDB builds on top of RocksDB core `DB` layer. Though `DB` layer assumes
that user-defined timestamps are byte arrays, `TransactionDB` uses uint64_t to store
timestamps. When they are passed down, they are still interpreted as
byte-arrays by `DB`.
Reviewed By: ltamasi
Differential Revision: D31567959
fbshipit-source-id: b0b6b69acab5d8e340cf174f33e8b09f1c3d3502
Summary:
This change should guarantee that the default ObjectLibrary/Registry are long-lived and not destroyed while the process is running. This will prevent some issues of them being referenced after they were destroyed via the static destruction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9464
Reviewed By: pdillinger
Differential Revision: D33849876
Pulled By: mrambacher
fbshipit-source-id: 7a69177d7c58c81be293fc7ef8e600d47ddbc14b
Summary:
This fix addresses https://github.com/facebook/rocksdb/issues/9299.
If attempting to create a new object via the ObjectRegistry and a factory is not found, the ObjectRegistry will return a "NotSupported" status. This is the same behavior as previously.
If the factory is found but could not successfully create the object, an "InvalidArgument" status is returned. If the factory returned a reason why (in the errmsg), this message will be in the returned status.
In practice, there are two options in the ConfigOptions that control how these errors are propagated:
- If "ignore_unknown_options=true", then both InvalidArgument and NotSupported status codes will be swallowed internally. Both cases will return success
- If "ignore_unsupported_options=true", then having no factory will return success but a failing factory will return an error
- If both options are false, both cases (no and failing factory) will return errors.
In practice this likely only changes Customizable that may be partially available. For example, the JEMallocMemoryAllocator is a built-in allocator that is registered with the system but may not be compiled in. In this case, the status code for this allocator changed from NotSupported("JEMalloc not available") to InvalidArgumen("JEMalloc not available"). Other Customizable builtins/plugins would have the same semantics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9333
Reviewed By: pdillinger
Differential Revision: D33517681
Pulled By: mrambacher
fbshipit-source-id: 8033052d4a4a7b88c2d9f90147b1b4467e51f6fd
Summary:
... seen only in internal clang-analyze runs after https://github.com/facebook/rocksdb/issues/9481
* Mostly, this works around falsely reported leaks by using
std::unique_ptr in some places where clang-analyze was getting
confused. (I didn't see any changes in C++17 that could make our Status
implementation leak memory.)
* Also fixed SetBGError returning address of a stack variable.
* Also fixed another false null deref report by adding an assert.
Also, use SKIP_LINK=1 to speed up `make analyze`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9515
Test Plan:
Was able to reproduce the reported errors locally and verify
they're fixed (except SetBGError). Otherwise, existing tests
Reviewed By: hx235
Differential Revision: D34054630
Pulled By: pdillinger
fbshipit-source-id: 38600ef3da75ddca307dff96b7a1a523c2885c2e
Summary:
Added a CountedFileSystem that tracks a number of file operations (opens, closes, deletes, renames, flushes, syncs, fsyncs, reads, writes). This class was based on the ReportFileOpEnv from db_bench.
This is a stepping stone PR to be able to change the SpecialEnv into a SpecialFileSystem, where several of the file varieties wish to do operation counting.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9283
Reviewed By: pdillinger
Differential Revision: D33062004
Pulled By: mrambacher
fbshipit-source-id: d0d297a7fb9c48c06cbf685e5fa755c27193b6f5
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.
In reference to https://github.com/facebook/rocksdb/issues/9389
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33780269
Pulled By: pdillinger
fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
Summary:
Regexes are considered potentially problematic for use in
registering RocksDB extensions, so we are removing
ObjectLibrary::Register() and the Regex public API it depended on (now
unused).
In reference to https://github.com/facebook/rocksdb/issues/9389
Why?
* The power of Regexes can make it hard to reason about which extension
will match what. (The replacement API isn't perfect, but we are at least
"holding the line" on patterns we have seen in practice.)
* It is easy to make regexes that don't quite mean what you think they
mean, such as forgetting that the `.` in `foo.bar` can match any character
or that matching is nondeterministic, as in `a🅱️42` matching `.*:[0-9]+`.
* Some regexes and implementations can have disastrously bad
performance. This might not be much practical concern for ObjectLibray
here, but we don't want to encourage potentially dangerous further use
in production code. (Testing code is fine. See TestRegex.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9439
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33792342
Pulled By: pdillinger
fbshipit-source-id: 4f64dcb04764e639162c8977a5fa196f67754cec
Summary:
This PR moves RADOS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host RADOS support. At this point,
people can start from the example repo and fork.
The goal is to include this commit in RocksDB 7.0 release.
Reference:
https://github.com/ajkr/dedupfs by ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9206
Test Plan:
Follow instructions in https://github.com/riversand963/rocksdb-rados-env/blob/main/README.md and build
test binary `env_librados_test` and run it.
Also, make check
Reviewed By: ajkr
Differential Revision: D33751690
Pulled By: riversand963
fbshipit-source-id: 30466c62afa9e4619847a48567ed158e62835e35
Summary:
Range Locking supports Lock Escalation. Lock Escalation is invoked when
lock memory is nearly exhausted and it reduced the amount of memory used
by joining adjacent locks.
Bridging the gap between certain locks has adverse effects. For example,
in MyRocks it is not a good idea to bridge the gap between locks in
different indexes, as that get the lock to cover large portions of
indexes, or even entire indexes.
Resolve this by introducing Escalation Barrier. The escalation process
will call the user-provided barrier callback function:
bool(const Endpoint& a, const Endpoint& b)
If the function returns true, there's a barrier between a and b and Lock
Escalation will not try to bridge the gap between a and b.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9290
Reviewed By: akankshamahajan15
Differential Revision: D33486753
Pulled By: riversand963
fbshipit-source-id: f97910b67aba0579ea1d35f523ca6863d3dd018e
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369
Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all
Reviewed By: pdillinger
Differential Revision: D33517260
Pulled By: riversand963
fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
Summary:
In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions. The PatternEntry methods were left in place but renamed to AddFactory. The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.
Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362
Reviewed By: pdillinger
Differential Revision: D33432562
Pulled By: mrambacher
fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293
Reviewed By: pdillinger, zhichao-cao
Differential Revision: D33181591
Pulled By: mrambacher
fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
Summary:
Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex. For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes. For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes. There is also a class for Customizable::IndividualId format matches.
Added tests for the new derivative classes and got all unit tests to pass.
Resolves https://github.com/facebook/rocksdb/issues/9225.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9264
Reviewed By: pdillinger
Differential Revision: D33062001
Pulled By: mrambacher
fbshipit-source-id: c2d2143bd2d38bdf522705c8280c35381b135c03
Summary:
The failure looked like this:
```
utilities/backupable/backupable_db_test.cc:3161: Failure
Value of: db_chroot_env_->FileExists(prev_manifest_path).IsNotFound()
Actual: false
Expected: true
```
The failure could be coerced consistently with the following patch:
```
diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 80410f671..637636791 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -2772,6 +2772,8 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) {
if (job_context.HaveSomethingToClean() ||
job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
mutex_.Unlock();
+ bg_cv_.SignalAll();
+ sleep(1);
TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FilesFound");
// Have to flush the info logs before bg_flush_scheduled_--
// because if bg_flush_scheduled_ becomes 0 and the lock is
```
The cause was a familiar problem, which is manual flush/compaction may
return before files they obsoleted are removed. The solution is just to
wait for "scheduled" work to complete, which includes all phases
including cleanup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9327
Test Plan:
after this PR, even the above patch to coerce the bug cannot
cause the test to fail.
Reviewed By: riversand963
Differential Revision: D33252208
Pulled By: ajkr
fbshipit-source-id: 720a7eaca58c7247d221911fffe3d5e1dbf581e9
Summary:
locktree is a module providing Range Locking. It has a counter for
the number of times a lock acquisition request was blocked by an
existing conflicting lock and had to wait for it to be released.
Expose this counter in RangeLockManagerHandle::Counters::lock_wait_count.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9289
Reviewed By: jay-zhuang
Differential Revision: D33079182
Pulled By: riversand963
fbshipit-source-id: 25b1a362d9da247536ab5007bd15900b319f139e
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980
Reviewed By: zhichao-cao
Differential Revision: D32990403
Pulled By: mrambacher
fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).
The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.
This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)
The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.
Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.
Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.
### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)
### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime
We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```
These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day
After enough data, we get a result at the end:
```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```
If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```
The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```
I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
Reviewed By: zhichao-cao
Differential Revision: D33171746
Pulled By: pdillinger
fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
Summary:
`db_stress` is a user of `FaultInjectionTestFS`. After injecting a write error, `db_stress` probabilistically determins
data drop (https://github.com/facebook/rocksdb/blob/6.27.fb/db_stress_tool/db_stress_test_base.cc#L2615:L2619).
In some of our recent runs of `db_stress`, we found duplicate trailing entries corresponding to file trivial move in
the MANIFEST, causing the recovery to fail, because the file move operation is not idempotent: you cannot delete a
file from a given level twice.
Investigation suggests that data buffering in both `WritableFileWriter` and `FaultInjectionTestFS` may be the root cause.
WritableFileWriter buffers data to write in a memory buffer, `WritableFileWriter::buf_`. After each
`WriteBuffered()`/`WriteBufferedWithChecksum()` succeeds, the `buf_` is cleared.
If the underlying file `WritableFileWriter::writable_file_` is opened in buffered IO mode, then `FaultInjectionTestFS`
buffers data written for each file until next file sync. After an injected error, user of `FaultInjectionFS` can
choose to drop some or none of previously buffered data. If `db_stress` does not drop any unsynced data, then
such data will still exist in the `FaultInjectionTestFS`'s buffer.
Existing implementation of `WritableileWriter::WriteBuffered()` does not clear `buf_` if there is an error. This may lead
to the data being buffered two copies: one in `WritableFileWriter`, and another in `FaultInjectionTestFS`.
We also know that the `WritableFileWriter` of MANIFEST file will close upon an error. During `Close()`, it will flush the
content in `buf_`. If no write error is injected to `FaultInjectionTestFS` this time, then we end up with two copies of the
data appended to the file.
To fix, we clear the `WritableFileWriter::buf_` upon failure as well. We focus this PR on files opened in non-direct mode.
This PR includes a unit test to reproduce a case when write error injection
to `WritableFile` can cause duplicate trailing entries.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9236
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D33033984
Pulled By: riversand963
fbshipit-source-id: ebfa5a0db8cbf1ed73100528b34fcba543c5db31
Summary:
Context:
[Rapid thread creation and deletion](https://github.com/facebook/rocksdb/blob/6.27.fb/utilities/transactions/write_prepared_transaction_test.cc#L439-L444) in `SnapshotConcurrentAccessTest.SnapshotConcurrentAcces` inside a [potentially big loop](https://github.com/facebook/rocksdb/blob/6.27.fb/utilities/transactions/write_prepared_transaction_test.cc#L1238-L1248) can lead to heavy-loading the system with many threads due to delay in actually cleaning up thread's resource in the kernel sometime. We ran into some [flaky failure](https://app.circleci.com/pipelines/github/facebook/rocksdb/10383/workflows/136f1005-80a9-4515-aee9-fe36ac6462a1/jobs/253289) in CI and reproduced it by below:
- Command
```
Added `ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();` like https://github.com/facebook/rocksdb/pull/9276
DEBUG_LEVEL=2 make -j56 write_prepared_transaction_test
GTEST_CATCH_EXCEPTIONS=0 ~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
```
- Stack, where `write_prepared_transaction_test.cc:442` in `https://github.com/facebook/rocksdb/issues/9` points to thread creation
```
[ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
....terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
Received signal 6 (Aborted)
#0 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38) [0x7fc114f39438]
...
https://github.com/facebook/rocksdb/issues/7 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8e73) [0x7fc1158a5e73] ?? ??:0
https://github.com/facebook/rocksdb/issues/8 ./write_prepared_transaction_test() [0x4ca86c] std:🧵:thread<rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> const&, rocksdb::WritePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda()https://github.com/facebook/rocksdb/issues/1}>(rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, s d::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, rocksdb::WritePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long)::{l mbda()https://github.com/facebook/rocksdb/issues/1}&&) /usr/include/c++/5/thread:137 (discriminator 4)
https://github.com/facebook/rocksdb/issues/9 ./write_prepared_transaction_test() [0x4bb80c] rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, rocksdb::W itePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long) /home/circleci/project/utilities/transactions/write_prepared_transaction_test.cc:442
https://github.com/facebook/rocksdb/issues/10 ./write_prepared_transaction_test() [0x4407b6] rocksdb::SnapshotConcurrentAccessTest_SnapshotConcurrentAccess_Test::TestBody() /home/circleci/project/utilities/transactions/write_prepared_transaction_test.cc:1244
...
[109/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 returned/aborted with exit code -6 (34462 ms)
```
- Move thread 2's work into current thread to avoid half of the thread creation cuz there is no difference in doing so. We expect this can make the thread-creation error less often, even though we can't gurantee it from happening again. Considering this is a trivial change with positive impact, it's still worth landing and monitor if it's enough to solve the problem in reality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9281
Test Plan:
Before the change, repeating the test 200 times with 200 workers failed
`~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1`
```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest
[ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
..unknown file: Failure
C++ exception with description "Resource temporarily unavailable" thrown in the test body.
[ FAILED ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20) (11882 ms)
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest (11882 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11882 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20)
```
After the change: repeating the test 200 times with 200 workers didn't fail, even with repeating the "repeating" for 10 times like below
`for i in {1..10}; do ~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1; done`
```
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
```
It does failed when repeating the test 400 times with 400 workers
`~/project$ ~/gtest-parallel/gtest-parallel -r 400 -w 400 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1`
```
[1/400] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 (2928 ms)
Note: Google Test filter = TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest
[ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[ FAILED ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20) (2597 ms)
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest (2597 ms total)
```
Reviewed By: ajkr
Differential Revision: D33026776
Pulled By: hx235
fbshipit-source-id: 509f57126392821e835e48396e5bf224f4f5dcac
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
This changes write_prepared_transaction_test under CircleCI to
print a stack trace on unhandled exception, so that we can debug rare
exceptions seen in CircleCI:
[ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/24
.......unknown file: Failure
C++ exception with description "Resource temporarily unavailable" thrown in the test body.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9276
Test Plan:
manual run test with seeded 'throw', with and without
CIRCLECI=true environment variable
Reviewed By: ajkr, hx235
Differential Revision: D32996993
Pulled By: pdillinger
fbshipit-source-id: e790408ce204b676d3d84a290e41be511b203bfa
Summary:
You could easily reproduce the failure by injecting sleep(11)
before `store.Flush()`. Fixed by setting TTL time to approximately test
timeout time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9226
Test Plan: manual
Reviewed By: akankshamahajan15
Differential Revision: D32698105
Pulled By: pdillinger
fbshipit-source-id: 40529af9d9f2389585988b7c81dffb120e2795a2
Summary:
**Context:**
Some existing internal calls of `GenericRateLimiter::Request()` in backupable_db.cc and newly added internal calls in https://github.com/facebook/rocksdb/pull/8722/ do not make sure `bytes <= GetSingleBurstBytes()` as required by rate_limiter https://github.com/facebook/rocksdb/blob/master/include/rocksdb/rate_limiter.h#L47.
**Impacts of this bug include:**
(1) In debug build, when `GenericRateLimiter::Request()` requests bytes greater than `GenericRateLimiter:: kMinRefillBytesPerPeriod = 100` byte, process will crash due to assertion failure. See https://github.com/facebook/rocksdb/pull/9063#discussion_r737034133 and for possible scenario
(2) In production build, although there will not be the above crash due to disabled assertion, the bug can lead to a request of small bytes being blocked for a long time by a request of same priority with insanely large bytes from a different thread. See updated https://github.com/facebook/rocksdb/wiki/Rate-Limiter ("Notice that although....the maximum bytes that can be granted in a single request have to be bounded...") for more info.
There is an on-going effort to move rate-limiting to file wrapper level so rate limiting in `BackupEngine` and this PR might be made obsolete in the future.
**Summary:**
- Implemented loop-calling `GenericRateLimiter::Request()` with `bytes <= GetSingleBurstBytes()` as a static private helper function `BackupEngineImpl::LoopRateLimitRequestHelper`
-- Considering make this a util function in `RateLimiter` later or do something with `RateLimiter::RequestToken()`
- Replaced buggy internal callers with this helper function wherever requested byte is not pre-limited by `GetSingleBurstBytes()`
- Removed the minimum refill bytes per period enforced by `GenericRateLimiter` since it is useless and prevents testing `GenericRateLimiter` for extreme case with small refill bytes per period.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9063
Test Plan:
- Added a new test that failed the assertion before this change and now passes
- It exposed bugs in [the write during creation in `CopyOrCreateFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2034-L2043)), [the read of table properties in `GetFileDbIdentities()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2372-L2378)), [some read of metadata in `BackupMeta::LoadFromFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2726))
- Passing Existing tests
Reviewed By: ajkr
Differential Revision: D31824535
Pulled By: hx235
fbshipit-source-id: d2b3dea7a64e2a4b1e6a59fca322f0800a4fcbcc
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162
Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
conflict checking. If user-defined timestamp is enabled, it should perform
conflict checking based on timestamps too.
Update TransactionUtil::CheckKey-related methods to verify the timestamp of the
latest version of a key is smaller than the read timestamp. Note that
CheckKeysForConflict() is not updated since it's used only by optimistic
transaction, and we do not plan to update it in this upcoming batch of diffs.
Existing GetLatestSequenceForKey() returns the sequence of the latest
version of a specific user key. Since we support user-defined timestamp, we
need to update this method to also return the timestamp (if enabled) of the
latest version of the key. This will be needed for snapshot validation.
Reviewed By: ltamasi
Differential Revision: D31567960
fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
Summary:
The individual commits in this PR should be self-explanatory.
All small and _very_ low-priority changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5896
Reviewed By: riversand963
Differential Revision: D18065108
Pulled By: mrambacher
fbshipit-source-id: 236b1a1d9d21f982cc08aa67027108dde5eaf280
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105
The user contract of SingleDelete is that: a SingleDelete can only be issued to
a key that exists and has NOT been updated. For example, application can insert
one key `key`, and uses a SingleDelete to delete it in the future. The `key`
cannot be updated or removed using Delete.
In reality, especially when write-prepared transaction is being used, things
can get tricky. For example, a prepared transaction already writes `key` to the
memtable after a successful Prepare(). Afterwards, should the transaction
rollback, it will insert a Delete into the memtable to cancel out the prior
Put. Consider the following sequence of operations.
```
// operation sequence 1
Begin txn
Put(key)
Prepare()
Flush()
Rollback txn
Flush()
```
There will be two SSTs resulting from above. One of the contains a PUT, while
the second one contains a Delete. It is also known that releasing a snapshot
can lead to an L0 containing only a SD for a particular key. Consider the
following operations following the above block.
```
// operation sequence 2
db->Put(key)
db->SingleDelete(key)
Flush()
```
The operation sequence 2 can result in an L0 with only the SD.
Should there be a snapshot for conflict checking created before operation
sequence 1, then an attempt to compact the db may hit the assertion failure
below, because ikey_.type is Delete (from a rollback).
```
else if (clear_and_output_next_key_) {
assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex);
}
```
To fix the assertion failure, we can skip the SingleDelete if we detect an
earlier Delete in the same snapshot interval.
Reviewed By: ltamasi
Differential Revision: D32056848
fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060
RocksDB bottommost level compaction may zero out an internal key's sequence if
the key's sequence is in the earliest_snapshot.
In write-prepared transaction, checking the visibility of a certain sequence in
a specific released snapshot may return a "snapshot released" result.
Therefore, it is possible, after a certain sequence of events, a PUT has its
sequence zeroed out, but a subsequent SingleDelete of the same key will still
be output with its original sequence. This violates the ascending order of
keys and leads to incorrect result.
The solution is to use an extra variable `last_key_seq_zeroed_` to track the
information about visibility in earliest snapshot. With this variable, we can
know for sure that a SingleDelete is in the earliest snapshot even if the said
snapshot is released during compaction before processing the SD.
Reviewed By: ltamasi
Differential Revision: D31813016
fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
`IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
forced, the same as above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
Test Plan: run tests on btrfs and non btrfs
Reviewed By: ajkr
Differential Revision: D30885059
Pulled By: jay-zhuang
fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061
In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:
```
earliest_snap < earliest_write_conflict_snap
```
If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.
Reviewed By: ltamasi
Differential Revision: D31813017
fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
Summary:
This feature was not part of any common or CI build, so no
surprise it broke. Now we can at least ensure compilation. I don't know
how to run the test successfully (missing config file) so it is bypassed
for now.
Fixes https://github.com/facebook/rocksdb/issues/9078
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9088
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D32009467
Pulled By: pdillinger
fbshipit-source-id: 3e0d1e5fde7f0ece703d48a81479e1cc7392c25c
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.
There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with?
1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image
Not sure if there is more required for travis. Happy to help in any way I can.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962
Reviewed By: mrambacher
Differential Revision: D31802198
Pulled By: pdillinger
fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.
This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.
Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.
Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)
Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968
Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).
Reviewed By: siying
Differential Revision: D31242045
Pulled By: pdillinger
fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.
The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995
Test Plan:
- Verified it fixes the following failure:
```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```
- `make check -j48`
Reviewed By: ltamasi
Differential Revision: D31495388
Pulled By: ajkr
fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
Summary:
Background: Cache warming up will cause potential read performance degradation due to reading blocks from storage to the block cache. Since in production, the workload and access pattern to a certain DB is stable, it is a potential solution to dump out the blocks belonging to a certain DB to persist storage (e.g., to a file) and bulk-load the blocks to Secondary cache before the DB is relaunched. For example, when migrating a DB form host A to host B, it will take a short period of time, the access pattern to blocks in the block cache will not change much. It is efficient to dump out the blocks of certain DB, migrate to the destination host and insert them to the Secondary cache before we relaunch the DB.
Design: we introduce the interface of CacheDumpWriter and CacheDumpRead for user to store the blocks dumped out from block cache. RocksDB will encode all the information and send the string to the writer. User can implement their own writer it they want. CacheDumper and CacheLoad are introduced to save the blocks and load the blocks respectively.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8912
Test Plan: add new tests to lru_cache_test and pass make check.
Reviewed By: pdillinger
Differential Revision: D31452871
Pulled By: zhichao-cao
fbshipit-source-id: 11ab4f5d03e383f476947116361d54188d36ec48
Summary:
This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
Hopefully, this sort of code hygiene cleanup will speed up the builds...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
Reviewed By: pdillinger
Differential Revision: D31142788
Pulled By: mrambacher
fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d
Summary:
Right now FaultInjectionTestFS::InjectThreadSpecificReadError() might try to corrupt return bytes, but these bytes might be from mmapped files, which would cause segfault. Instead FaultInjectionTestFS::InjectThreadSpecificReadError() should never corrupt data unless it is in caller's buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8952
Test Plan: See db_stress still runs and make sure in a test run failurs are still injected in non-mmap cases.
Reviewed By: ajkr, ltamasi
Differential Revision: D31147318
fbshipit-source-id: 9484a64ff2aaa36685557203f449286e694e65f9
Summary:
There is a corner case when using WriteUnprepared transactions when
`WriteUnpreparedTxn::Get` returns `Status::TryAgain` instead of
propagating the result of `GetFromBatchAndDB`. The patch adds
`PermitUncheckedError` to make the `ASSERT_STATUS_CHECKED` build pass in
this case as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8947
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D31125422
Pulled By: ltamasi
fbshipit-source-id: 42de51dcfa9384e032244c2b4d3f40e9a4111194
Summary:
Several improvements to MultiRead:
1. Fix a bug in stress test which causes false positive when both MultiRead() return and individual read request have failure injected.
2. Add two more types of fault that should be handled: empty read results and checksum mismatch
3. Add a message indicating which type of fault is injected
4. Increase the failure rate
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8937
Reviewed By: anand1976
Differential Revision: D31085930
fbshipit-source-id: 3a04994a3cadebf9a64d25e1fe12b14b7a272fba
Summary:
Updates a few remaining functions that should have been updated
from Status -> IOStatus, and adds to HISTORY for the overall change
including https://github.com/facebook/rocksdb/issues/8820.
This change is for inclusion in version 6.25.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8940
Test Plan: CI
Reviewed By: zhichao-cao
Differential Revision: D31085029
Pulled By: pdillinger
fbshipit-source-id: 91557c6a39ef1d90357d4f4dcd79af0645d87c7b
Summary:
Right now, the failure injection test for MultiGet() is not sufficient. Improve it with TestFSRandomAccessFile::MultiRead() injecting failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8925
Test Plan: Run crash test locally for a while.
Reviewed By: anand1976
Differential Revision: D31000529
fbshipit-source-id: 439c7e02cf7440ac5af82deb609e202abdca3e1f
Summary:
In order to populate the IOStatus up to the higher level, replace some of the Status to IOStatus.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8820
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D30967215
Pulled By: zhichao-cao
fbshipit-source-id: ccf9d5cfbd9d3de047c464aaa85f9fa43b474903
Summary:
This allows the wrapper classes to own the wrapped object and eliminates confusion as to ownership. Previously, many classes implemented their own ownership solutions. Fixes https://github.com/facebook/rocksdb/issues/8606
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8618
Reviewed By: pdillinger
Differential Revision: D30136064
Pulled By: mrambacher
fbshipit-source-id: d0bf471df8818dbb1770a86335fe98f761cca193
Summary:
It's always annoying to find a header does not include its own
dependencies and only works when included after other includes. This
change adds `make check-headers` which validates that each header can
be included at the top of a file. Some headers are excluded e.g. because
of platform or external dependencies.
rocksdb_namespace.h had to be re-worked slightly to enable checking for
failure to include it. (ROCKSDB_NAMESPACE is a valid namespace name.)
Fixes mostly involve adding and cleaning up #includes, but for
FileTraceWriter, a constructor was out-of-lined to make a forward
declaration sufficient.
This check is not currently run with `make check` but is added to
CircleCI build-linux-unity since that one is already relatively fast.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8893
Test Plan: existing tests and resolving issues detected by new check
Reviewed By: mrambacher
Differential Revision: D30823300
Pulled By: pdillinger
fbshipit-source-id: 9fff223944994c83c105e2e6496d24845dc8e572
Summary:
ManagedObjects are shared pointer objects where RocksDB wants to share a single object between multiple configurations. For example, the Cache may be shared between multiple column families/tables or the Statistics may be shared between multiple databases.
ManagedObjects are stored in the ObjectRegistry by Type (e.g. Cache) and ID. For a given type/ID name, a single object is stored.
APIs were added to get/set/create these objects.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8658
Reviewed By: pdillinger
Differential Revision: D30806273
Pulled By: mrambacher
fbshipit-source-id: 832ac4423b210c4c4b4a456b35897334775d3160
Summary:
Context:
While all the non-trivial write operations in BackupEngine go through the RateLimiter, reads currently do not. In general, this is not a huge issue because (especially since some I/O efficiency fixes) reads in BackupEngine are mostly limited by corresponding writes, for both backup and restore. But in principle we should charge the RateLimiter for reads as well.
- Charged read operations in `BackupEngineImpl::CopyOrCreateFile`, `BackupEngineImpl::ReadFileAndComputeChecksum`, `BackupEngineImpl::BackupMeta::LoadFromFile` and `BackupEngineImpl::GetFileDbIdentities`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8722
Test Plan:
- Passed existing tests
- Passed added unit tests
Reviewed By: pdillinger
Differential Revision: D30610464
Pulled By: hx235
fbshipit-source-id: 9b08c9387159a5385c8d390d6666377a0d0117e5
Summary:
A "LATEST_BACKUP" file was left in the backup directory by
"BackupEngineTest.NoDeleteWithReadOnly" test, affecting future test
runs. In particular, it caused "BackupEngineTest.IOStats" to fail since
it relies on backup directory containing only data written by its
`BackupEngine`.
The fix is to promote "LATEST_BACKUP" to an explicitly managed file so
it is deleted in `BackupEngineTest` constructor if it exists.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8887
Test Plan:
below command used to fail. Now it passes:
```
$ TEST_TMPDIR=/dev/shm ./backupable_db_test --gtest_filter='BackupEngineTest.NoDeleteWithReadOnly:BackupEngineTest.IOStats'
```
Reviewed By: pdillinger
Differential Revision: D30812336
Pulled By: ajkr
fbshipit-source-id: 32dfbe1368ebdab872e610764bfea5daf9a2af09
Summary:
Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8819
Reviewed By: pdillinger
Differential Revision: D30779238
Pulled By: ajkr
fbshipit-source-id: 75ccafc355f90906df5cf80367f7245b985772d8
Summary:
* Consolidate use of std::regex for testing to testharness.cc, to
minimize Facebook linters constantly flagging uses in non-production
code.
* Improve syntax and error messages for asserting some string matches a
regex in tests.
* Add a public Regex wrapper class to encapsulate existing usage in
ObjectRegistry.
* Remove unnecessary include <regex>
* Put warnings that use of Regex in production code could cause bad
performance or stack overflow.
Intended follow-up work:
* Replace std::regex with another underlying implementation like RE2
* Improve ObjectRegistry interface in terms of possibly confusing literal
string matching vs. regex and in terms of reporting invalid regex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740
Test Plan:
tests updated, basic unit test for public Regex, and some manual
testing of temporary changes to see example error messages:
utilities/backupable/backupable_db_test.cc:917: Failure
000010_1162373755_138626.blob (child.name)
does not match regex
[0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern)
db/db_basic_test.cc:74: Failure
R3SHSBA8C4U0CIMV2ZB0 (sid3)
does not match regex [0-9A-Z]{20}HAHAHA
Reviewed By: mrambacher
Differential Revision: D30706246
Pulled By: pdillinger
fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd
Summary:
Old typedef syntax is confusing
Most but not all changes with
perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files
make format
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751
Test Plan: existing
Reviewed By: zhichao-cao
Differential Revision: D30745277
Pulled By: pdillinger
fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9
Summary:
If RateLimiter burst bytes changes during concurrent Restore
operations
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8732
Test Plan: updated unit test fails with TSAN before change, passes after
Reviewed By: ajkr
Differential Revision: D30683879
Pulled By: pdillinger
fbshipit-source-id: d0ddb3587ade91ee2a4d926b475acf7781b03086
Summary:
Env::GenerateUniqueId() works fine on Windows and on POSIX
where /proc/sys/kernel/random/uuid exists. Our other implementation is
flawed and easily produces collision in a new multi-threaded test.
As we rely more heavily on DB session ID uniqueness, this becomes a
serious issue.
This change combines several individually suitable entropy sources
for reliable generation of random unique IDs, with goal of uniqueness
and portability, not cryptographic strength nor maximum speed.
Specifically:
* Moves code for getting UUIDs from the OS to port::GenerateRfcUuid
rather than in Env implementation details. Callers are now told whether
the operation fails or succeeds.
* Adds an internal API GenerateRawUniqueId for generating high-quality
128-bit unique identifiers, by combining entropy from three "tracks":
* Lots of info from default Env like time, process id, and hostname.
* std::random_device
* port::GenerateRfcUuid (when working)
* Built-in implementations of Env::GenerateUniqueId() will now always
produce an RFC 4122 UUID string, either from platform-specific API or
by converting the output of GenerateRawUniqueId.
DB session IDs now use GenerateRawUniqueId while DB IDs (not as
critical) try to use port::GenerateRfcUuid but fall back on
GenerateRawUniqueId with conversion to an RFC 4122 UUID.
GenerateRawUniqueId is declared and defined under env/ rather than util/
or even port/ because of the Env dependency.
Likely follow-up: enhance GenerateRawUniqueId to be faster after the
first call and to guarantee uniqueness within the lifetime of a single
process (imparting the same property onto DB session IDs).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8708
Test Plan:
A new mini-stress test in env_test checks the various public
and internal APIs for uniqueness, including each track of
GenerateRawUniqueId individually. We can't hope to verify anywhere close
to 128 bits of entropy, but it can at least detect flaws as bad as the
old code. Serial execution of the new tests takes about 350 ms on
my machine.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D30563780
Pulled By: pdillinger
fbshipit-source-id: de4c9ff4b2f581cf784fcedb5f39f16e5185c364
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.
New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".
`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657
Reviewed By: ajkr
Differential Revision: D30290216
Pulled By: autopear
fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
Summary:
In debug mode, we are seeing assertion failure as follows
```
db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \
Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed.
```
It is caused by releasing earliest snapshot during compaction between the execution of
`NextFromInput()` and `PrepareOutput()`.
In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`,
incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation
level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also
outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be
skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened.
Fix by taking a more conservative approach: make compaction zero out sequence number only
if key is in the earliest snapshot when the compaction starts.
Another assertion failure is
```
Assertion `current_user_key_snapshot_ == last_snapshot' failed.
```
It's caused by releasing the snapshot between the PUT and SingleDelete during compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D30145645
Pulled By: riversand963
fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81
Summary:
- Remove extra `;` in trace_record.h
- Remove some unnecessary `assert` in trace_record_handler.cc
- Initialize `env_` after` exec_handler_` in `ReplayerImpl` to let db be asserted in creating the handler before getting `db->GetEnv()`.
- Update history to include the new `TraceReader::Reset()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8652
Reviewed By: ajkr
Differential Revision: D30276872
Pulled By: autopear
fbshipit-source-id: 476ee162e0f241490c6209307448343a5b326b37
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.
User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
Reviewed By: ajkr
Differential Revision: D30266329
Pulled By: autopear
fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
Summary:
Guarantees that if a restore is interrupted, DB::Open will fail. This works by
restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT.
Also makes restore respect BackupEngineOptions::sync (default true). When set,
the restore is guaranteed persisted by the time it returns OK. Also makes the above
atomicity guarantee work in case the interruption is power loss or OS crash (not just
process interruption or crash).
Fixes https://github.com/facebook/rocksdb/issues/8500
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568
Test Plan:
added to backup mini-stress unit test. Passes with
gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp)
Reviewed By: akankshamahajan15
Differential Revision: D29812605
Pulled By: pdillinger
fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
- Added Options/Configurable/Object Registration for TTL and Cassandra variants
- Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option
- Added tests for new functionality
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481
Reviewed By: zhichao-cao
Differential Revision: D30136050
Pulled By: mrambacher
fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
Summary:
```FaultInjectionTestFS``` injects various types of read errors in ```FileSystem``` APIs. One type of error is corruption errors, where data is intentionally corrupted or truncated. There is corresponding validation in db_stress to verify that an injected error results in a user visible Get/MultiGet error. However, for corruption errors, its hard to know when a corruption is supposed to be detected by the user request, due to prefetching and, in case of direct IO, padding. This results in false positives. So remove that functionality.
Block checksum validation for Get/MultiGet is confined to ```BlockFetcher```, so we don't lose a lot by disabling this since its a small surface area to test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8616
Reviewed By: zhichao-cao
Differential Revision: D30074422
Pulled By: anand1976
fbshipit-source-id: 6a61fac18f95514c15364b75013799ddf83294df
Summary:
An arbitrary string can be used as a delimiter in StringAppend merge operator
flavor. In particular, it allows using an empty string, combining binary values for
the same key byte-to-byte one next to another.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8536
Reviewed By: mrambacher
Differential Revision: D29962120
Pulled By: zhichao-cao
fbshipit-source-id: 4ef5d846a47835cf428a11200409e30e2dbffc4f
Summary:
Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.
After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).
Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582
Reviewed By: pdillinger, autopear
Differential Revision: D29881122
Pulled By: mrambacher
fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
Summary:
This PR tries to remove some unnecessary checks as well as unreachable code blocks to
improve readability. An obvious non-public API method naming typo is also corrected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565
Test Plan: make check
Reviewed By: lth
Differential Revision: D29963984
Pulled By: riversand963
fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
Reviewed By: zhichao-cao
Differential Revision: D29901488
Pulled By: mrambacher
fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
Summary:
Originally the 2 options `db_log_dir` and `wal_dir` will be reused in a snapshot db since the options files are just copied. By default, if `wal_dir` was not set when a db was created, it is set to the db's dir. Therefore, the snapshot db will use the same WAL dir. If both the original db and the snapshot db write to or delete from the WAL dir, one may modify or delete files which belong to the other. The same applies to `db_log_dir` as well, but as info log files are not copied or linked, it is simpler for this option.
2 arguments are added to `Checkpoint::CreateCheckpoint()`, allowing to override these 2 options.
`wal_dir`: If the function argument `wal_dir` is empty, or set to the original db location, or the checkpoint location, the snapshot's `wal_dir` option will be updated to the checkpoint location. Otherwise, the absolute path specified in the argument will be used. During checkpointing, live WAL files will be copied or linked the new location, instead of the current WAL dir specified in the original db.
`db_log_dir`: Same as `wal_dir`, but no files will be copied or linked.
A new unit test was added: `CheckpointTest.CheckpointWithOptionsDirsTest`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8572
Test Plan:
New unit test
```
checkpoint_test --gtest_filter="CheckpointTest.CheckpointWithOptionsDirsTest"
```
Output
```
Note: Google Test filter = CheckpointTest.CheckpointWithOptionsDirsTest
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CheckpointTest
[ RUN ] CheckpointTest.CheckpointWithOptionsDirsTest
[ OK ] CheckpointTest.CheckpointWithOptionsDirsTest (11712 ms)
[----------] 1 test from CheckpointTest (11712 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11713 ms total)
[ PASSED ] 1 test.
```
This test will fail without this patch. Just modify the code to remove the 2 arguments introduced in this patch in `CreateCheckpoint()`.
Reviewed By: zhichao-cao
Differential Revision: D29832761
Pulled By: autopear
fbshipit-source-id: e6a639b4d674380df82998c0839e79cab695fe29
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555
Reviewed By: ajkr
Differential Revision: D29758399
Pulled By: jay-zhuang
fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
Summary:
ObjectLibrary is shared between multiple DB instances, the
Register() could have race condition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8574
Test Plan: pass the failed test
Reviewed By: ajkr
Differential Revision: D29855096
Pulled By: jay-zhuang
fbshipit-source-id: 541eed0bd495d2c963d858d81e7eabf1ba16153c
Summary:
If we want to check whether a Status s is NoSpace() or not, we should check the subcode instread of using s==Status::NoSpace(). Fix some of the incorrect check in the ErrorHandler.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8504
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D29601764
Pulled By: zhichao-cao
fbshipit-source-id: cdab56a827891c23746bba9cbb53f169fe35f086
Summary:
When DB Stress enables write failure in reopen, WAL files are also created with a wrapper writalbe file which buffers write until fsync. However, crash test currently expects all writes to WAL is persistent. This is at odd with the unsynced bytes dropped. To work it around temporarily, we disable WAL write failure for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8548
Test Plan: Run db_stress. Manual printf to make sure only WAL files are skipped.
Reviewed By: jay-zhuang
Differential Revision: D29745095
fbshipit-source-id: 1879dd2c01abad7879ca243ee94570ec47c347f3
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.
This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)
This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538
Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.
Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.
Before (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0)
rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0)
rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323
Notice the 5s P100 write time.
After (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0)
rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918
P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0)
rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832
No substantial difference.
Reviewed By: siying
Differential Revision: D29738847
Pulled By: pdillinger
fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
Summary:
… small overwritten files.
If a file is overwritten with renamed and the parent path is not synced, FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() will delete the file. However, RocksDB relies on file renaming to be atomic no matter whether the parent directory is synced or not, and the current behavior breaks the assumption and caused some false positive: https://github.com/facebook/rocksdb/pull/8489
Since the atomic renaming is used in CURRENT files, to fix the problem, in FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync(), we recover the state of overwritten file if the file is small.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8501
Test Plan: Run stress test for a while and see it doesn't break.
Reviewed By: anand1976
Differential Revision: D29594384
fbshipit-source-id: 589b5c2f0a9d2aca53752d7bdb0231efa5b3ae92
Summary:
Various tests had disabled valgrind due to it slowing down and timing
out (as is the case right now) the CI runs. Where a test was disabled with no comment,
I assumed slowness was the cause. For these tests that were slow under
valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them
behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475
Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly
Reviewed By: jay-zhuang
Differential Revision: D29504843
Pulled By: ajkr
fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7
Summary:
```TestFSWritableFile``` buffers data in ```Append``` in order to simulate unsynced data loss on crash. This is only required for buffered IO and should be disabled for direct IO. Otherwise, it causes crash tests to assert on the buffer address alignment - ```db_stress: env/io_posix.cc:1194: virtual rocksdb::IOStatus rocksdb::PosixWritableFile::Append(const rocksdb::Slice&, const rocksdb::IOOptions&, rocksdb::IODebugContext*): Assertion `IsSectorAligned(data.data(), GetRequiredBufferAlignment())' failed.```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8490
Reviewed By: zhichao-cao
Differential Revision: D29565080
Pulled By: anand1976
fbshipit-source-id: 682831fd66ed3b9597caa74fc453e22dfaf9b973
Summary:
Inject read failures in DB reopen, just as what we do for metadata writes and writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476
Test Plan: Some manual tests and make sure failures are triggered.
Reviewed By: anand1976
Differential Revision: D29507283
fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c
Summary:
Previously Stress can inject metadata write failures when reopening a DB. We extend it to file append too, in the same way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8474
Test Plan: manually run crash test with various setting and make sure the failures are triggered as expected.
Reviewed By: zhichao-cao
Differential Revision: D29503116
fbshipit-source-id: e73a446e80ccbd09301a579280e56ff949381fab
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When user enable checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be down by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in the memory and also it potentially introduces extra checksum calculation overhead.
In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller be able to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the pass-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only support Manifest and WAL which use log_writer in the current stage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412
Test Plan: make check, add new testing cases.
Reviewed By: anand1976
Differential Revision: D29151545
Pulled By: zhichao-cao
fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
Summary:
This test case has been failing occasionally due to automatic
compactions kicking in, resulting in GC generating additional
blob files that the test did not expect. Disabling automatic
compactions to get rid of this flakiness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8444
Test Plan: `gtest-parallel --repeat=1000 ./blob_db_test --gtest_filter="BlobDBTest.SnapshotAndGarbageCollection"`
Reviewed By: jay-zhuang
Differential Revision: D29316987
Pulled By: ltamasi
fbshipit-source-id: 9815d189ed7d63890622768675a01990e3680221
Summary:
This reverts commit 25be1ed66a.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8438
Test Plan: Run the impacted mysql test 40 times
Reviewed By: ajkr
Differential Revision: D29286247
Pulled By: jay-zhuang
fbshipit-source-id: d3bd056971a19a8b012d5d0295fa045c012b3c04
Summary:
This commit is for enabling `DBWithTTL` to use `DeteleRange` which it cannot before.
As (int32_t)Timestamp is suffixed to values in `DBWithTTL`, there is no reason that it
cannot use the common used api. I added `DeleteRangeCF` in `DBWithTTLImpl::Write`
so that we can use `DeteleRange` normally. When we run code like
`dbWithTtl->DeleteRange(start, end)`, it executes`WriteBatchInternal::DeleteRange`
internally. Intended to fix https://github.com/facebook/rocksdb/issues/7218
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8384
Test Plan: added corresponded testing logic to existing unit test
Reviewed By: jay-zhuang
Differential Revision: D29176734
fbshipit-source-id: 6874ed979fc08e1d138149d03653e43a75f0e0e6
Summary:
This reverts commit 9167ece586.
It was found to reliably trip a compaction picking conflict assertion in a MyRocks unit test. We don't understand why yet so reverting in the meantime.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8410
Test Plan: `make check -j48`
Reviewed By: jay-zhuang
Differential Revision: D29150300
Pulled By: ajkr
fbshipit-source-id: 2de8664f355d6da015e84e5fec2e3f90f49741c8
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360
Test Plan: add the test to lru_cache_test
Reviewed By: pdillinger
Differential Revision: D29006215
Pulled By: zhichao-cao
fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
Summary:
This is a duplicate of https://github.com/facebook/rocksdb/issues/4948 by mzhaom to fix tests after rebase.
This change is a follow-up to https://github.com/facebook/rocksdb/issues/4927, which made this possible by allowing tombstone dropping/seqnum zeroing optimizations on the last key in the compaction. Now the `largest_seqno != 0` condition suffices to prevent snapshot release triggered compaction from entering an infinite loop.
The issues caused by the extraneous condition `level_and_file.second->num_deletions > 1` are:
- files could have `largest_seqno > 0` forever making it impossible to tell they cannot contain any covering keys
- it doesn't trigger compaction when there are many overwritten keys. Some MyRocks use case actually doesn't use Delete but instead calls Put with empty value to "delete" keys, so we'd like to be able to trigger compaction in this case too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8357
Test Plan: - make check
Reviewed By: jay-zhuang
Differential Revision: D28855340
Pulled By: ajkr
fbshipit-source-id: a261b51eecafec492499e6d01e8e43112f801798
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.
The new information can be seen in info LOG, for example:
Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):
$ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
rocksdb.block-cache-entry-stats.bytes.data-block: 0
rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
rocksdb.block-cache-entry-stats.bytes.index-block: 178992
rocksdb.block-cache-entry-stats.bytes.misc: 0
rocksdb.block-cache-entry-stats.bytes.other-block: 0
rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
rocksdb.block-cache-entry-stats.capacity: 8388608
rocksdb.block-cache-entry-stats.count.data-block: 0
rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
rocksdb.block-cache-entry-stats.count.index-block: 215
rocksdb.block-cache-entry-stats.count.misc: 1
rocksdb.block-cache-entry-stats.count.other-block: 0
rocksdb.block-cache-entry-stats.count.write-buffer: 0
rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
rocksdb.block-cache-entry-stats.percent.misc: 0.000000
rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.
This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.
An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".
Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.
Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.
This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
Test Plan: manual testing with db_bench, and a couple of basic unit tests
Reviewed By: ltamasi
Differential Revision: D28488721
Pulled By: pdillinger
fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
Summary:
The MultiGetFromBatchAndDB would fail if the PinnableSlice value being returned was pinned. This could happen if the value was retrieved from the DB (not memtable) or potentially if the values were reused (and a previous iteration returned a slice that was pinned).
This change resets the pinnable value to clear it prior to attempting to use it, thereby eliminating the problem with the value already being pinned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8299
Reviewed By: jay-zhuang
Differential Revision: D28455426
Pulled By: mrambacher
fbshipit-source-id: a34d7d983ec9b6bb4c8a2b4892f72858d43e6972
Summary:
Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and evicted from memory, its inserted into the secondary cache. On a LRU cache miss and successful lookup in the secondary cache, the item is promoted to the LRU cache. Only support synchronous lookup currently. The secondary cache would be used to implement a persistent (flash cache) or compressed cache.
Tests:
Results from cache_bench and db_bench don't show any regression due to these changes.
cache_bench results before and after this change -
Command
```./cache_bench -ops_per_thread=10000000 -threads=1```
Before
```Complete in 40.688 s; QPS = 245774```
```Complete in 40.486 s; QPS = 246996```
```Complete in 42.019 s; QPS = 237989```
After
```Complete in 40.672 s; QPS = 245869```
```Complete in 44.622 s; QPS = 224107```
```Complete in 42.445 s; QPS = 235599```
db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
Commands
```./db_bench --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```
```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
Before
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 80.702 micros/op 198104 ops/sec; 54.4 MB/s (3708999 of 3708999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 87.124 micros/op 183625 ops/sec; 50.4 MB/s (3439999 of 3439999 found)
```
After
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 77.653 micros/op 206025 ops/sec; 56.6 MB/s (3866999 of 3866999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 84.962 micros/op 188299 ops/sec; 51.7 MB/s (3535999 of 3535999 found)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271
Reviewed By: zhichao-cao
Differential Revision: D28357511
Pulled By: anand1976
fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
Summary:
We saw the `Commit()` fail with "Operation expired" so apparently the
expiration time is too short. Increased the magnitude of the times in
this test to make flakiness less likely.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8258
Reviewed By: jay-zhuang
Differential Revision: D28177033
Pulled By: ajkr
fbshipit-source-id: 0357acee6cc14c104b6ccd39231a683a606ab130
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:
* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)
I have enhanced cache_bench to validate this approach:
* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.
To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225
Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.
The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.
Baseline typical output:
```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662
Operation latency (ns):
Count: 80000000 Average: 11223.9494 StdDev: 29.61
Min: 0 Median: 7759.3973 Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[ 0, 1 ] 68 0.000% 0.000%
( 2900, 4400 ] 89 0.000% 0.000%
( 4400, 6600 ] 33630240 42.038% 42.038% ########
( 6600, 9900 ] 18129842 22.662% 64.700% #####
( 9900, 14000 ] 7877533 9.847% 74.547% ##
( 14000, 22000 ] 15193238 18.992% 93.539% ####
( 22000, 33000 ] 3037061 3.796% 97.335% #
( 33000, 50000 ] 1626316 2.033% 99.368%
( 50000, 75000 ] 421532 0.527% 99.895%
( 75000, 110000 ] 56910 0.071% 99.966%
( 110000, 170000 ] 16134 0.020% 99.986%
( 170000, 250000 ] 5166 0.006% 99.993%
( 250000, 380000 ] 3017 0.004% 99.996%
( 380000, 570000 ] 1337 0.002% 99.998%
( 570000, 860000 ] 805 0.001% 99.999%
( 860000, 1200000 ] 319 0.000% 100.000%
( 1200000, 1900000 ] 231 0.000% 100.000%
( 1900000, 2900000 ] 100 0.000% 100.000%
( 2900000, 4300000 ] 39 0.000% 100.000%
( 4300000, 6500000 ] 16 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
```
New, gather_stats=false. Median thread ops/sec of 5 runs:
```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458
Operation latency (ns):
Count: 80000000 Average: 11298.1027 StdDev: 42.18
Min: 0 Median: 7722.0822 Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[ 0, 1 ] 109 0.000% 0.000%
( 2900, 4400 ] 793 0.001% 0.001%
( 4400, 6600 ] 34054563 42.568% 42.569% #########
( 6600, 9900 ] 17482646 21.853% 64.423% ####
( 9900, 14000 ] 7908180 9.885% 74.308% ##
( 14000, 22000 ] 15032072 18.790% 93.098% ####
( 22000, 33000 ] 3237834 4.047% 97.145% #
( 33000, 50000 ] 1736882 2.171% 99.316%
( 50000, 75000 ] 446851 0.559% 99.875%
( 75000, 110000 ] 68251 0.085% 99.960%
( 110000, 170000 ] 18592 0.023% 99.983%
( 170000, 250000 ] 7200 0.009% 99.992%
( 250000, 380000 ] 3334 0.004% 99.997%
( 380000, 570000 ] 1393 0.002% 99.998%
( 570000, 860000 ] 700 0.001% 99.999%
( 860000, 1200000 ] 293 0.000% 100.000%
( 1200000, 1900000 ] 196 0.000% 100.000%
( 1900000, 2900000 ] 69 0.000% 100.000%
( 2900000, 4300000 ] 32 0.000% 100.000%
( 4300000, 6500000 ] 10 0.000% 100.000%
```
New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:
```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551
Operation latency (ns):
Count: 80000000 Average: 11311.2629 StdDev: 45.28
Min: 0 Median: 7686.5458 Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[ 0, 1 ] 71 0.000% 0.000%
( 2900, 4400 ] 291 0.000% 0.000%
( 4400, 6600 ] 34492060 43.115% 43.116% #########
( 6600, 9900 ] 16727328 20.909% 64.025% ####
( 9900, 14000 ] 7845828 9.807% 73.832% ##
( 14000, 22000 ] 15510654 19.388% 93.220% ####
( 22000, 33000 ] 3216533 4.021% 97.241% #
( 33000, 50000 ] 1680859 2.101% 99.342%
( 50000, 75000 ] 439059 0.549% 99.891%
( 75000, 110000 ] 60540 0.076% 99.967%
( 110000, 170000 ] 14649 0.018% 99.985%
( 170000, 250000 ] 5242 0.007% 99.991%
( 250000, 380000 ] 3260 0.004% 99.995%
( 380000, 570000 ] 1599 0.002% 99.997%
( 570000, 860000 ] 1043 0.001% 99.999%
( 860000, 1200000 ] 471 0.001% 99.999%
( 1200000, 1900000 ] 275 0.000% 100.000%
( 1900000, 2900000 ] 143 0.000% 100.000%
( 2900000, 4300000 ] 60 0.000% 100.000%
( 4300000, 6500000 ] 27 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
( 9800000, 14000000 ] 1 0.000% 100.000%
Gather stats latency (us):
Count: 46 Average: 980387.5870 StdDev: 60911.18
Min: 879155 Median: 1033777.7778 Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
( 860000, 1200000 ] 45 97.826% 97.826% ####################
( 1200000, 1900000 ] 1 2.174% 100.000%
Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```
Reviewed By: mrambacher
Differential Revision: D28295742
Pulled By: pdillinger
fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used
The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166
Reviewed By: zhichao-cao
Differential Revision: D27657952
Pulled By: mrambacher
fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
Summary:
The WBWI has two differing modes of operation dependent on the value
of the constructor parameter `overwrite_key`.
Currently, regardless of the parameter, neither mode performs as
expected when using Merge. This PR remedies this by correctly invoking
the appropriate Merge Operator before returning results from the WBWI.
Examples of issues that exist which are solved by this PR:
## Example 1 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.
## Example 2 with o`verwrite_key=true`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.
## Example 3 with `overwrite_key=false`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`
## Example 4 with `overwrite_key=true`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v1')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.
## Example 5 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`
## Example 6 with `overwrite_key=true`
Currently, from an empty database, `('k1' -> 'v1')`, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135
Reviewed By: pdillinger
Differential Revision: D27657938
Pulled By: mrambacher
fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
Summary:
An early design of BackupEngine used stackable DB, so I guess a
DB had to opt-in to being backupable. Unfortunately the naming of that
obsolete design still infects our public API and implementation.
This change fixes the public API, with a deprecated
backward-compatibility header. `BackupableDBOptions` is renamed to
`BackupEngineOptions` (copy-replace in the public header) and
backup_engine.h replaces backupable_db.h (present for backward
compatibility). The only other change in backupable_db.h ->
backup_engine.h is cleaning up headers.
Later changes will fix the internal implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274
Test Plan:
The internal implementation of BackupEngine uses the name
BackupEngineOptions, while the unit tests use the old name
BackupableDBOptions. This gives me confidence that both still work.
Reviewed By: mrambacher
Differential Revision: D28259471
Pulled By: pdillinger
fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
Reviewed By: pdillinger
Differential Revision: D28226540
Pulled By: mrambacher
fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.
Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.
Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).
Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240
Test Plan: ASAN make check
Reviewed By: mrambacher
Differential Revision: D28075891
Pulled By: pdillinger
fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
Summary:
DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place.
If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully.
This option is enabled in crash test in half of the time.
Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8235
Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by https://github.com/facebook/rocksdb/pull/8192 and a similar one that fails when fsyncing parent directory.
Reviewed By: anand1976
Differential Revision: D28010944
fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
Summary:
This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively.
readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).
There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176
Reviewed By: pdillinger
Differential Revision: D27954339
Pulled By: mrambacher
fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8208
Make include of "file_system.h" use the same include path as everywhere
else.
Reviewed By: riversand963, akankshamahajan15
Differential Revision: D27881606
fbshipit-source-id: fc1e076229fde21041a813c655ce017b5070c8b3
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.
This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.
As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
new MANIFEST.) Therefore, we keep the new MANIFEST.
- Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
- If process reopens the db immediately after the failure, then the CURRENT file can point
to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
succeed and ignore the other.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D27804648
Pulled By: riversand963
fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
Summary:
Historically, the DB properties `rocksdb.cur-size-active-mem-table`,
`rocksdb.cur-size-all-mem-tables`, and `rocksdb.size-all-mem-tables` called
the method `MemTable::ApproximateMemoryUsage` for mutable memtables,
which is not safe without synchronization. This resulted in data races with
memtable inserts. The patch changes the code handling these properties
to use `MemTable::ApproximateMemoryUsageFast` instead, which returns a
cached value backed by an atomic variable. Two test cases had to be updated
for this change. `MemoryTest.MemTableAndTableReadersTotal` was fixed by
increasing the value size used so each value ends up in its own memtable,
which was the original intention (note: the test has been broken in the sense
that the test code didn't consider that memtable sizes below 64 KB get
increased to 64 KB by `SanitizeOptions`, and has been passing only by
accident). `DBTest.MemoryUsageWithMaxWriteBufferSizeToMaintain` relies on
completely up-to-date values and thus was changed to use `ApproximateMemoryUsage`
directly instead of going through the DB properties. Note: this should be safe in this case
since there's only a single thread involved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8206
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D27866811
Pulled By: ltamasi
fbshipit-source-id: 7bd754d0565e0a65f1f7f0e78ffc093beef79394
Summary:
Test was flaky because for kUseDbSessionId naming, blob files use
naming scheme kLegacyCrc32cAndFileSize. So expected number of files
because of collision can vary. So disabling blobdb for this test case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8197
Reviewed By: pdillinger
Differential Revision: D27836997
Pulled By: akankshamahajan15
fbshipit-source-id: 5eb21a5f4acae3d6b730a9e1b207264fbc18cb80
Summary:
Resolves https://github.com/facebook/rocksdb/issues/8014
- Add an assertion on `DB::Open` to ensure `db_options.max_open_files` is unlimited if FIFO Compaction is being used.
- This is to align with what the docs mention and to prevent premature data deletion.
- Update tests to work with this assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8172
Test Plan:
```bash
$ make check -j$(nproc)
Generated TARGETS Summary:
- 6 libs
- 0 binarys
- 180 tests
```
Reviewed By: ajkr
Differential Revision: D27768792
Pulled By: thejchap
fbshipit-source-id: cf6350535e3a3577fec72bcba75b3c094dc7a6f3
Summary:
* CreateNewBackup(WithMetadata) returning the BackupID of new backup
through optional new output param. This is especially useful with the
new mutithreading support, so that you can transactionally determine the
ID of a backup you create.
* GetBackupInfo / GetLatestBackupInfo for individual backups, so that
you don't have to comb through a vector of backups if you don't want to.
Updated HISTORY.md (including re: BlobDB support as new feature)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8170
Test Plan:
Added test logic to existing tests, to minimize increase in
cost of running tests
Reviewed By: zhichao-cao
Differential Revision: D27680410
Pulled By: pdillinger
fbshipit-source-id: 1fc45b73d81aae293ccd4a43d9583d7fd915d3eb
Summary:
Add support for blob files for backup/restore like table files.
Since DB session ID is currently not supported for blob files (there is no place to store it in
the header), so for blob files uses the
kLegacyCrc32cAndFileSize naming scheme even if
share_files_with_checksum_naming is set to kUseDbSessionId.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8129
Test Plan: Add new test units
Reviewed By: ltamasi
Differential Revision: D27408510
Pulled By: akankshamahajan15
fbshipit-source-id: b27434d189a639ef3e6ad165c61a143a2daaf06e
Summary:
Fixing another crash test failure in the case of
write_dbid_to_manifest=true and reading a backup as read-only DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8164
Test Plan:
enhanced unit test for backup as read-only DB, ran
blackbox_crash_test more with elevated backup_one_in
Reviewed By: zhichao-cao
Differential Revision: D27622237
Pulled By: pdillinger
fbshipit-source-id: 680d0f99ddb465a601737f2e3f2c80efd47384fb
Summary:
Forgot to re-test crash test after adding read-only filesystem
enforcement to https://github.com/facebook/rocksdb/issues/8142. The problem is ReadOnlyFileSystem would reject
CreateDirIfMissing whenever DBOptions::create_if_missing=true. The fix
that is better for users is to allow CreateDirIfMissing in
ReadOnlyFileSystem if the directory exists, so that they don't cause a
failure on using create_if_missing with opening backups as read-only
DBs. Added this option test to the unit test (in addition to being in the
crash test).
Also fixed a couple of lints.
And some better messaging from 'make format' so that when you run it
with uncommitted changes, it's clear that it's only checking the
uncommitted changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8161
Test Plan: local blackbox_crash_test with amplified backup_one_in
Reviewed By: ajkr
Differential Revision: D27614409
Pulled By: pdillinger
fbshipit-source-id: 63ccb626c7e34c200d61c6bca2a8f60da9015179
Summary:
A current limitation of backups is that you don't know the
exact database state of when the backup was taken. With this new
feature, you can at least inspect the backup's DB state without
restoring it by opening it as a read-only DB.
Rather than add something like OpenAsReadOnlyDB to the BackupEngine API,
which would inhibit opening stackable DB implementations read-only
(if/when their APIs support it), we instead provide a DB name and Env
that can be used to open as a read-only DB.
Possible follow-up work:
* Add a version of GetBackupInfo for a single backup.
* Let CreateNewBackup return the BackupID of the newly-created backup.
Implementation details:
Refactored ChrootFileSystem to split off new base class RemapFileSystem,
which allows more general remapping of files. We use this base class to
implement BackupEngineImpl::RemapSharedFileSystem.
To minimize API impact, I decided to just add these fields `name_for_open`
and `env_for_open` to those set by GetBackupInfo when
include_file_details=true. Creating the RemapSharedFileSystem adds a bit
to the memory consumption, perhaps unnecessarily in some cases, but this
has been mitigated by (a) only initialize the RemapSharedFileSystem
lazily when GetBackupInfo with include_file_details=true is called, and
(b) using the existing `shared_ptr<FileInfo>` objects to hold most of the
mapping data.
To enhance API safety, RemapSharedFileSystem is wrapped by new
ReadOnlyFileSystem which rejects any attempts to write. This uncovered a
couple of places in which DB::OpenForReadOnly would write to the
filesystem, so I fixed these. Added a release note because this affects
logging.
Additional minor refactoring in backupable_db.cc to support the new
functionality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142
Test Plan:
new test (run with ASAN and UBSAN), added to stress test and
ran it for a while with amplified backup_one_in
Reviewed By: ajkr
Differential Revision: D27535408
Pulled By: pdillinger
fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148
Summary:
Before corrupting a file in the DB and expecting corruption to
be detected, open DB read-only to ensure file is not made obsolete by
compaction. Also, to avoid obsolete files not yet deleted, only select
live files to corrupt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8151
Test Plan: watch CI
Reviewed By: akankshamahajan15
Differential Revision: D27568849
Pulled By: pdillinger
fbshipit-source-id: 39a69a2eafde0482b20a197949d24abe21952f27
Summary:
BackupEngine previously had unclear but strict concurrency
requirements that the API user must follow for safe use. Now we make
that clear, by separating operations into "Read," "Append," and "Write"
operations, and specifying which combinations are safe across threads on
the same BackupEngine object (previously none; now all, using a
read-write lock), and which are safe across different BackupEngine
instances open on the same backup_dir.
The changes to backupable_db.h should be backward compatible. It is
mostly about eliminating copies of what should be the same function and
(unsurprisingly) useful documentation comments were often placed on
only one of the two copies. With the re-organization, we are also
grouping different categories of operations. In the future we might add
BackupEngineReadAppendOnly, but that didn't seem necessary.
To mark API Read operations 'const', I had to mark some implementation
functions 'const' and some fields mutable.
Functional changes:
* Added RWMutex locking around public API functions to implement thread
safety on a single object. To avoid future bugs, this is another
internal class layered on top (removing many "override" in
BackupEngineImpl). It would be possible to allow more concurrency
between operations, rather than mutual exclusion, but IMHO not worth the
work.
* Fixed a race between Open() (Initialize()) and CreateNewBackup() for
different objects on the same backup_dir, where Initialize() could
delete the temporary meta file created during CreateNewBackup().
(This was found by the new test.)
Also cleaned up a couple of "status checked" TODOs, and improved a
checksum mismatch error message to include involved files.
Potential follow-up work:
* CreateNewBackup has an API wart because it doesn't tell you the
BackupID it just created, which makes it of limited use in a multithreaded
setting.
* We could also consider a Refresh() function to catch up to
changes made from another BackupEngine object to the same dir.
* Use a lock file to prevent multiple writer BackupEngines, but this
won't work on remote filesystems not supporting lock files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8115
Test Plan:
new mini-stress test in backup unit tests, run with gcc,
clang, ASC, TSAN, and UBSAN, 100 iterations each.
Reviewed By: ajkr
Differential Revision: D27347589
Pulled By: pdillinger
fbshipit-source-id: 28d82ed2ac672e44085a739ddb19d297dad14b15
Summary:
The implementation of TransactionDB::WrapDB() and
TransactionDB::WrapStackableDB() are almost identical, except for the
type of the first argument `db`. This PR adds a new template function in
anonymous namespace, and calls it in the above two functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8079
Test Plan: make check
Reviewed By: lth
Differential Revision: D27184575
Pulled By: riversand963
fbshipit-source-id: f2855a6db3a7e897d0d611f7050ca4b696c56a7a
Summary:
This does not add any new public APIs or published
functionality, but adds the ability to read and use (and in tests,
write) backups with a new meta file schema, based on the old schema
but not forward-compatible (before this change). The new schema enables
some capabilities not in the old:
* Explicit versioning, so that users get clean error messages the next
time we want to break forward compatibility.
* Ignoring unrecognized fields (with warning), so that new non-critical
features can be added without breaking forward compatibility.
* Rejecting future "non-ignorable" fields, so that new features critical
to some use-cases could potentially be added outside of linear schema
versions, with broken forward compatibility.
* Fields at the end of the meta file, such as for checksum of the meta
file's contents (up to that point)
* New optional 'size' field for each file, which is checked when present
* Optionally omitting 'crc32' field, so that we aren't required to have
a crc32c checksum for files to take a backup. (E.g. to support backup
via hard links and to better support file custom checksums.)
Because we do not have a JSON parser and to share code, the new schema
is simply derived from the old schema.
BackupEngine code is updated to allow missing checksums in some places,
and to make that easier, `has_checksum` and `verify_checksum_after_work`
are eliminated. Empty `checksum_hex` indicates checksum is unknown. I'm
not too afraid of regressing on data integrity, because
(a) we have pretty good test coverage of corruption detection in backups, and
(b) we are increasingly relying on the DB itself for data integrity rather than
it being an exclusive feature of backups.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8069
Test Plan:
new unit tests, added to crash test (some local run with
boosted backup probability)
Reviewed By: ajkr
Differential Revision: D27139824
Pulled By: pdillinger
fbshipit-source-id: 9e0e4decfb42bb84783d64d2d246456d97e8e8c5
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071
Test Plan: make check, added new unit test.
Reviewed By: anand1976
Differential Revision: D27177043
Pulled By: zhichao-cao
fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
Summary:
These classes were wraps of Env that provided only extensions to the FileSystem functionality. Changed the classes to be FileSystems and the wraps to be of the CompositeEnvWrapper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7968
Reviewed By: anand1976
Differential Revision: D26900253
Pulled By: mrambacher
fbshipit-source-id: 94001d8024a3c54a1c11adadca2bac66c3af2a77
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes.
For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved.
Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
(Note that the times for 6.18 are prior to revert of the SystemClock).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
Reviewed By: pdillinger
Differential Revision: D27014563
Pulled By: mrambacher
fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
Summary:
This API can be used for things like determining how much space
can be freed up by deleting a particular backup, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042
Test Plan:
validation of the API added to many existing backup unit
tests
Reviewed By: mrambacher
Differential Revision: D26936577
Pulled By: pdillinger
fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a
Summary:
Removed confusing, awkward, and undocumented internal API
ReadOneLine and replaced with very simple LineFileReader.
In refactoring backupable_db.cc, this has the side benefit of
removing the arbitrary cap on the size of backup metadata files.
Also added Status::MustCheck to make it easy to mark a Status as
"must check." Using this, I can ensure that after
LineFileReader::ReadLine returns false the caller checks GetStatus().
Also removed some excessive conditional compilation in status.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8026
Test Plan: added unit test, and running tests with ASSERT_STATUS_CHECKED
Reviewed By: mrambacher
Differential Revision: D26831687
Pulled By: pdillinger
fbshipit-source-id: ef749c265a7a26bb13cd44f6f0f97db2955f6f0f
Summary:
New comment for share_files_with_checksum:
// Only used if share_table_files is set to true. Setting to false is
// DEPRECATED and potentially dangerous because in that case BackupEngine
// can lose data if backing up databases with distinct or divergent
// history, for example if restoring from a backup other than the latest,
// writing to the DB, and creating another backup. Setting to true (default)
// prevents these issues by ensuring that different table files (SSTs) with
// the same number are treated as distinct. See
// share_files_with_checksum_naming and ShareFilesNaming.
I have also removed interim option kFlagMatchInterimNaming, which is no
longer needed and was never needed for correct+compatible operation
(just performance).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020
Test Plan:
tests updated. Backward+forward compatibility verified with
SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup
options, and I manually verified shared_checksum in
/tmp/rocksdb_format_compatible_peterd/bak/current/ after run.
Reviewed By: ajkr
Differential Revision: D26786331
Pulled By: pdillinger
fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130
Summary:
The checkpointing logic supports passing file level checksums
to the copy_file_cb callback function which is used by the backup code
for detecting corruption during file copies.
However, this is currently implemented only for table files.
This PR extends the checksum retrieval to blob files as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003
Test Plan: Add new test units
Reviewed By: ltamasi
Differential Revision: D26680701
Pulled By: akankshamahajan15
fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5
Summary:
Allow applications to implement a custom compaction filter and pass it to BlobDB.
The compaction filter's custom logic can operate on blobs.
To do so, application needs to subclass `CompactionFilter` abstract class and implement `FilterV2()` method.
Optionally, a method called `ShouldFilterBlobByKey()` can be implemented if application's custom logic rely solely
on the key to make a decision without reading the blob, thus saving extra IO. Examples can be found in
db/blob/db_blob_compaction_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7974
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D26509280
Pulled By: riversand963
fbshipit-source-id: 59f9ae5614c4359de32f4f2b16684193cc537b39
Summary:
Fix this scenario:
trx1> acquire shared lock on $key
trx2> acquire shared lock on the same $key
trx1> attempt to acquire a unique lock on $key.
Lock acquisition will fail, and deadlock detection will start.
It will call iterate_and_get_overlapping_row_locks() which will
produce a list with two locks (shared locks by trx1 and trx2).
However the code in lock_request::build_wait_graph() was not prepared
to find the lock by the same transaction in the list of conflicting
locks. Fix it to ignore it.
(One may suggest to fix iterate_and_get_overlapping_row_locks() to not
include locks by trx1. This is not a good idea, because that function
is also used to report all locks currently held)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7938
Reviewed By: zhichao-cao
Differential Revision: D26529374
Pulled By: ajkr
fbshipit-source-id: d89cbed008db1a97a8f2351b9bfb75310750d16a
Summary:
TransactionDB uses read callback to filter out un-committed data before
a snapshot. But `MultiGet()` API doesn't use that at all, which causes
returning unwanted data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7963
Test Plan: Added unittest to reproduce
Reviewed By: anand1976
Differential Revision: D26455851
Pulled By: jay-zhuang
fbshipit-source-id: 265276698cf9d8c4cd79e3250ef10d14375bac55
Summary:
The patch adds checkpoint support to BlobDB. Blob files are hard linked or
copied, depending on whether the checkpoint directory is on the same filesystem
or not, similarly to table files.
TODO: Add support for blob files to `ExportColumnFamily` and to the checksum
verification logic used by backup/restore.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7959
Test Plan: Ran `make check` and the crash test for a while.
Reviewed By: riversand963
Differential Revision: D26434768
Pulled By: ltamasi
fbshipit-source-id: 994be55a8dc08133028250760fca440d2c7c4dc5
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523
Test Plan: add new unit test, pass make check/ make asan_check
Reviewed By: pdillinger
Differential Revision: D24313271
Pulled By: zhichao-cao
fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
Summary:
Explicitly reject all range deletions on `TransactionDB` or `OptimisticTransactionDB`, except when the user provides sufficient promises that allow us to proceed safely. The necessary promises are described in the API doc for `TransactionDB::DeleteRange()`. There is currently no way to provide enough promises to make it safe in `OptimisticTransactionDB`.
Fixes https://github.com/facebook/rocksdb/issues/7913.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7929
Test Plan: unit tests covering the cases it's permitted/rejected
Reviewed By: ltamasi
Differential Revision: D26240254
Pulled By: ajkr
fbshipit-source-id: 2834a0ce64cc3e4c3799e35b885a5e79c2f4f6d9
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584
Test Plan: Run all existing tests.
Reviewed By: pdillinger
Differential Revision: D20626739
fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
Summary:
Removed the uses of the Legacy FileWrapper classes from the source code. The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem.
Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use the the CustomEnv class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851
Reviewed By: anand1976
Differential Revision: D26114816
Pulled By: mrambacher
fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
In the original stacked BlobDB implementation, which writes blobs to blob files
immediately and treats blob files as logs, it makes sense to flush the file after
writing each blob to protect against process crashes; however, in the integrated
implementation, which builds blob files in the background jobs, this unnecessarily
reduces performance. This patch fixes this by simply adding a `do_flush` flag to
`BlobLogWriter`, which is set to `true` by the stacked implementation and to `false`
by the new code. Note: the change itself is trivial but the tests needed some work;
since in the new implementation, blobs are now buffered, adding a blob to
`BlobFileBuilder` is no longer guaranteed to result in an actual I/O. Therefore, we can
no longer rely on `FaultInjectionTestEnv` when testing failure cases; instead, we
manipulate the return values of I/O methods directly using `SyncPoint`s.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7892
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D26022814
Pulled By: ltamasi
fbshipit-source-id: b3dce419f312137fa70d84cdd9b908fd5d60d8cd
Summary:
`CheckpointTest.CurrentFileModifiedWhileCheckpointing` could hang
because now create checkpoint triggers flush twice. The test should wait
both flush done.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7849
Test Plan: `gtest-parallel ./checkpoint_test --gtest_filter=CheckpointTest.CurrentFileModifiedWhileCheckpointing -r 100`
Reviewed By: ajkr
Differential Revision: D25860713
Pulled By: jay-zhuang
fbshipit-source-id: e1c2f23037dedc33e205519f4289a25e77816b41
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.
Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819
Reviewed By: ajkr
Differential Revision: D25837394
Pulled By: jay-zhuang
fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
Summary:
Currently, manifest size is determined before getting min_log_num.
But between getting manifest size and getting min_log_num, concurrently, a flush might succeed, which will write new records to manifest to make some WALs become outdated, then min_log_num will be correspondingly increased, but the new records in manifest will not be copied into the checkpoint because the manifest's size is determined before them, then the newly outdated WALs will still exist in the checkpoint's manifest, but they are not linked/copied to the checkpoint because their log number is < min_log_num, so a corruption of missing WAL will be reported when restoring from the checkpoint.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7836
Test Plan: make crash_test
Reviewed By: ajkr
Differential Revision: D25788204
Pulled By: cheng-chang
fbshipit-source-id: a4e5acf30f08270b3c0a95304ff559a9e655252f
Summary:
This PR does the following:
-> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.
With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system
- Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself.
With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction.
Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703
Reviewed By: zhichao-cao
Differential Revision: D25762190
Pulled By: anand1976
fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
Summary:
Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones. The StringSink and StringSource classes were changed to be based on the corresponding FS file classes.
Part of a cleanup to use the newer interfaces. This change also eliminates some of the casts/wrappers to LegacyFile classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786
Reviewed By: jay-zhuang
Differential Revision: D25761460
Pulled By: anand1976
fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f
Summary:
1. Made `WriteBatchWithIndexInternal` into a class that stores the `DB*` or `DBOptions*`.
2. Changed the `GetFromBatch` method to be non-static and use an instance of the class. Added `MergeKey` methods to perform the merge itself and return any status.
This change unifies the multiple calls to the `MergeHelper` under a single wrapped API.
Closes https://github.com/facebook/rocksdb/issues/6683
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6851
Reviewed By: ajkr
Differential Revision: D21706574
Pulled By: pdillinger
fbshipit-source-id: 6860bd64d62669aaa591846e914eed3b674e68b1
Summary:
BasicLockEscalation will cause false-positive warnings under TSAN (this is a known issue in TSAN, see details in https://gist.github.com/spetrunia/77274cf2d5848e0a7e090d622695ed4e), skip this test if TSAN is enabled, or if we are not sure whether TSAN is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7814
Test Plan: watch the tsan contrun test to pass.
Reviewed By: zhichao-cao
Differential Revision: D25708094
Pulled By: cheng-chang
fbshipit-source-id: 4fc813ff373301d033d086154cc7bb60a5e95889
Summary:
Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds. This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it. In this case, without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798
Reviewed By: jay-zhuang
Differential Revision: D25680451
Pulled By: pdillinger
fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917
Summary:
In a stress test failure, we observe that a WAL is skipped when creating checkpoint, although its log number >= MinLogNumberToKeep(). This might happen in the following case:
1. when creating the checkpoint, there are 2 column families: CF0 and CF1, and there are 2 WALs: 1, 2;
2. CF0's log number is 1, CF0's active memtable is empty, CF1's log number is 2, CF1's active memtable is not empty, WAL 2 is not empty, the sequence number points to WAL 2;
2. the checkpoint process flushes CF0, since CF0' active memtable is empty, there is no need to SwitchMemtable, thus no new WAL will be created, so CF0's log number is now 2, concurrently, some data is written to CF0 and WAL 2;
3. the checkpoint process flushes CF1, WAL 3 is created and CF1's log number is now 3, CF0's log number is still 2 because CF0 is not empty and WAL 2 contains its unflushed data concurrently written in step 2;
4. the checkpoint process determines that WAL 1 and 2 are no longer needed according to [live_wal_files[i]->StartSequence() >= *sequence_number](https://github.com/facebook/rocksdb/blob/master/utilities/checkpoint/checkpoint_impl.cc#L388), so it skips linking them to the checkpoint directory;
5. but according to `MinLogNumberToKeep()`, WAL 2 still needs to be kept because CF0's log number is 2.
If the checkpoint is reopened in read-only mode, and only read from the snapshot with the initial sequence number, then there will be no data loss or data inconsistency.
But if the checkpoint is reopened and read from the most recent sequence number, suppose in step 3, there are also data concurrently written to CF1 and WAL 3, then the most recent sequence number refers to the latest entry in WAL 3, so the data written in step 2 should also be visible, but since WAL 2 is discarded, those data are lost.
When tracking WAL in MANIFEST is enabled, when reopening the checkpoint, since WAL 2 is still tracked in MANIFEST as alive, but it's missing from the checkpoint directory, a corruption will be reported.
This PR makes the checkpoint process to only skip a WAL if its log number < `MinLogNumberToKeep`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7789
Test Plan: watch existing tests to pass.
Reviewed By: ajkr
Differential Revision: D25662346
Pulled By: cheng-chang
fbshipit-source-id: 136471095baa01886cf44809455cf855f24857a0
Summary:
Range Locking - an implementation based on the locktree library
- Add a RangeTreeLockManager and RangeTreeLockTracker which implement
range locking using the locktree library.
- Point locks are handled as locks on single-point ranges.
- Add a unit test: range_locking_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7506
Reviewed By: akankshamahajan15
Differential Revision: D25320703
Pulled By: cheng-chang
fbshipit-source-id: f86347384b42ba2b0257d67eca0f45f806b69da7
Summary:
This disables Linux/amd64 builds in Travis for PRs, and adds a
gcc-10+c++20 build in CircleCI, which should fill out sufficient coverage
vs. what we had in Travis
Fixed a use of std::is_pod, which is deprecated in c++20
Fixed ++ on a volatile in db_repl_stress.cc, with bigger refactoring.
Although ++ on this volatile was probably ok with one thread writer and
one thread reader, the code was still overly complex. There was a
deadcode check for error
`if (replThread.no_read < dataPump.no_records)` which can be proven
never to happen based on the structure of the code. It infinite loops
instead for the case intended to be checked. I just simplified the code
for what should be the same checking power.
Also most configurations seem to be using make parallelism = 2 * vcores,
so fixing / using that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7791
Test Plan:
CI
and `while ./db_repl_stress; do echo again; done` for a while
Reviewed By: siying
Differential Revision: D25669834
Pulled By: pdillinger
fbshipit-source-id: b2c688053d0b1d52c989903449d3cd27a04130d6
Summary:
Inject the random write error to stress test, it requires set reopen=0 and disable_wal=true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7653
Test Plan: pass db_stress and python3 db_crashtest.py blackbox
Reviewed By: ajkr
Differential Revision: D25354132
Pulled By: zhichao-cao
fbshipit-source-id: 44721104eecb416e27f65f854912c40e301dd669
Summary:
Some clients do not close their iterators until after the transaction finishes. To handle this case, we will invalidate any iterators on transaction clear.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7733
Reviewed By: cheng-chang
Differential Revision: D25261158
Pulled By: lth
fbshipit-source-id: b91320f00c54cbe0e6882b794b34f3bb5640dbc0
Summary:
To be used for implementing Range Locking.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7753
Reviewed By: zhichao-cao
Differential Revision: D25378980
Pulled By: cheng-chang
fbshipit-source-id: 801a9c5cd92a84654ca2586b73e8f69001e89320
Summary:
This PR has two commits:
1. Modify the code to allow different Lock Managers (of any kind) to be used. It is implied that a LockManager uses its own custom LockTracker.
2. Add definitions for Range Locking (class Endpoint and GetRangeLock() function.
cheng-chang, is this what you've had in mind (should the PR have both item 1 and item 2?)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7443
Reviewed By: zhichao-cao
Differential Revision: D24123172
Pulled By: cheng-chang
fbshipit-source-id: c6548ad6d4cc3c25f68d13b29147bc6fdf357185
Summary:
Handle misuse of snprintf return value to avoid Out of bound
read/write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7686
Test Plan: make check -j64
Reviewed By: riversand963
Differential Revision: D25030831
Pulled By: akankshamahajan15
fbshipit-source-id: 1a1d181c067c78b94d720323ae00b79566b57cfa
Summary:
The patch adds iterator support to the integrated BlobDB implementation.
Whenever a blob reference is encountered during iteration, the corresponding
blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
(formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
BlobDB implementation, which has its own blob file handling/blob retrieval logic.)
In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
message `"BlobDB does not support merge operator."` when encountering a
blob reference while performing a merge (instead of potentially returning a
message that implies the database should be opened using the stacked BlobDB's
`Open`.)
TODO: We can implement support for lazily retrieving the blob value (or in other
words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
supports lazy values).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D25256293
Pulled By: ltamasi
fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
Summary:
An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667
Reviewed By: riversand963
Differential Revision: D24933360
Pulled By: ajkr
fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f
Summary:
Consider the following sequence of events:
1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.
It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.
After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621
Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.
Reviewed By: riversand963
Differential Revision: D24668144
Pulled By: cheng-chang
fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
Summary:
The tests often times out in internal infra, skipping fsync should reduce test time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7641
Test Plan: watch existing tests to pass
Reviewed By: anand1976
Differential Revision: D24765098
Pulled By: cheng-chang
fbshipit-source-id: c62bf8110361aee901918d632cf4772435d05e8d
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D24579392
Pulled By: riversand963
fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497
When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`
Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.
Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test
Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515
Reviewed By: ajkr
Differential Revision: D24240264
Pulled By: ramvadiv
fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D24485420
Pulled By: zhichao-cao
fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
Summary:
When `ASSERT_STATUS_CHECKED` is enabled, `transaction_test` does not pass without this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7572
Test Plan: `ASSERT_STATUS_CHECKED=1 make -j32 transaction_test && ./transaction_test`
Reviewed By: zhichao-cao
Differential Revision: D24404319
Pulled By: cheng-chang
fbshipit-source-id: 13689035995366ab06d8eada3ea404e45fef8bc5
Summary:
Further refinement of the earlier PR. Now the Status is NotFound with a subcode of PathNotFound. Also the existing functions for options parsing/loading are reverted to return InvalidArgument no matter in which way the user-provided arguments are deemed invalid.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7563
Reviewed By: zhichao-cao
Differential Revision: D24422491
Pulled By: ajkr
fbshipit-source-id: ba6b237cd0584d3f925c5ba0d349aeb8c250af67
Summary:
In order to be able to introduce more locking protocols, we need to abstract out the locking subsystem in TransactionDB into a set of interfaces.
PR https://github.com/facebook/rocksdb/pull/7013 introduces interface `LockTracker`. This PR is a follow up to take the first step to abstract out a `LockManager` interface.
Further modifications to the interface may be needed when introducing the first implementation of range lock. But the idea here is to put the range lock implementation based on range tree under the `utilities/transactions/lock/range/range_tree`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7532
Test Plan: point_lock_manager_test
Reviewed By: ajkr
Differential Revision: D24238731
Pulled By: cheng-chang
fbshipit-source-id: 2a9458cd8b3fb008d9529dbc4d3b28c24631f463
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.
TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24260219
Pulled By: ltamasi
fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist. Added tests for the LoadOptions related methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554
Reviewed By: akankshamahajan15
Differential Revision: D24298985
Pulled By: zhichao-cao
fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
Summary:
The patch does some cleanup in and around the legacy `BlobLogReader` class:
* It renames the class to `BlobLogSequentialReader` to emphasize that it is for
sequentially iterating through blobs in a blob file, as opposed to doing random
point reads using `BlobIndex`es (which is `BlobFileReader`'s jurisdiction).
* It removes some dead code from the old BlobDB implementation that references
`BlobLogReader` (namely the method `BlobFile::OpenRandomAccessReader`).
* It cleans up some `#include`s and forward declarations.
* It fixes some incorrect/outdated comments related to the reader class.
* It adds a few assertions to the `Read` methods of the class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7517
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24172611
Pulled By: ltamasi
fbshipit-source-id: 43e2ae1eba5c3dd30c1070cb00f217edc45bd64f
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452
Test Plan: See CI tests pass.
Reviewed By: zhichao-cao
Differential Revision: D23979764
fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
Summary:
The patch adds support for injecting errors when reading from `RandomAccessFile`
using `FaultInjectionTestEnv`. (This functionality was curiously missing
w/r/t `RandomAccessFile`, even though it was implemented for `RandomRWFile`.)
The patch also fixes up a test case in `blob_db_test` which uses `FaultInjectionTestEnv`
but has so far relied on reads from `RandomAccessFile`s succeeding even after
deactivating the filesystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7447
Test Plan: `make check`
Reviewed By: zhichao-cao
Differential Revision: D23971740
Pulled By: ltamasi
fbshipit-source-id: 8492736cb64b1ee138c658822535f3ff4fe560c6
Summary:
A generic algorithm in progress depends on a templatized
version of fastrange, so this change generalizes it and renames
it to fit our style guidelines, FastRange32, FastRange64, and now
FastRangeGeneric.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436
Test Plan: added a few more test cases
Reviewed By: jay-zhuang
Differential Revision: D23958153
Pulled By: pdillinger
fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8
Summary:
Add new AppendWithVerify and PositionedAppendWithVerify APIs to Env and FileSystem to bring the data verification information (data checksum information) from upper layer (e.g., WritableFileWriter) to the storage layer. This PR only include the API definition, no functional codes are added to unblock other developers which depend on these APIs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7419
Test Plan: make -j32
Reviewed By: pdillinger
Differential Revision: D23883196
Pulled By: zhichao-cao
fbshipit-source-id: 94676c26bc56144cc32e3661f84f21eccd790411
Summary:
Two relatively simple functional changes to incremental backup
behavior, integrated with a minor refactoring to reduce code redundancy and
improve error/log message. There are nuances to the impact of these changes,
but I believe they are fundamentally good and generally safe. Those functional
changes:
* Incremental backups no longer read DB table files that are already saved to a
shared part of the backup directory, unless `share_files_with_checksum` is used
with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file
checksums are needed to determine file naming.
* Justification: incremental backups should not need to read the whole DB,
especially without rate limiting. (Although other BackupEngine reads are not
rate limited either, other non-trivial reads are generally limited by a
corresponding write, as in copying files.) Also, the fact that this is not
already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110.
* When considering whether a table file is already backed up in a shared part
of backup directory, BackupEngine would already query the sizes of source (DB)
and pre-existing destination (backup) files. BackupEngine now uses these file
sizes to detect corruption, as at least one of (a) old backup, (b) backup in
progress, or (c) current DB is corrupt if there's a size mismatch.
* Justification: a random related fix that also helps to cover a small hole
in corruption checking uncovered by the other functional change:
* For `share_table_files` without "checksum" (not recommended), the other
change regresses in detecting fundamentally unsafe use of this option
combination: when you might generate different versions of same SST file
number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this
regression is greatly mitigated by the new file size checking. Nevertheless,
almost no reason to use `share_files_with_checksum=false` should remain, and
comments are updated appropriately.
Also, this change renames internal function `CalculateChecksum` to
`ReadFileAndComputeChecksum` to make the performance impact of this function
clear in code reviews.
It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it
cannot be true for a DB with unique file names (like DBImpl). Nevertheless,
I've tried to keep its functionality intact when `true` to minimize risk for
now, despite having no unit tests for which it is true.
Select impact details (much more in unit tests): For
`share_files_with_checksum`, I am confident there is no regression (vs.
pre-6.12) in detecting DB or backup corruption at backup creation time, mostly
because the old design did not leverage this extra checksum computation for
detecting inconsistencies at backup creation time. (With computed checksums in
names, a recently corrupted file just looked like a different file vs. what was
already backed up.)
Even in the hypothetical case of DB session id collision (~100 bits entropy
collision), file size in name and/or our file size check add an extra layer of
protection against false success in creating an accurate new backup. (Unit test
included.)
`DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking
are still able to catch corruptions that `CreateNewBackup` does not. Note that
when custom file checksum support is added to BackupEngine, that will
essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`.
We could add options for `CreateNewBackup` to cover some of what would be
caught by `VerifyBackup` with checksum checking.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413
Test Plan:
Two new unit tests included, both of which fail without these
changes. Although we don't test the I/O improvement directly, we test it
indirectly in DB corruption detection power that was inadvertently unlocked
with new backup file naming PLUS computing current content checksums (now
removed). (I don't think that case of DB corruption detection justifies reading
the whole DB on incremental backup.)
Reviewed By: zhichao-cao
Differential Revision: D23818480
Pulled By: pdillinger
fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac
Summary:
This change reverts BackupEngine to 6.12 state to accommodate a
higher-priority fix that does not easily merge with this custom checksum
support. We intend to reinstate this support soon, by merging a revert
of this change.
For backupable_db_test, I've removed the tests depending on this
feature.
I've also removed relevant HISTORY.md entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411
Test Plan: unit tests
Reviewed By: ajkr
Differential Revision: D23793835
Pulled By: pdillinger
fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052
Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.
This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.
We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).
We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.
Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`
Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.
Actual file name patterns and priorities, as regexes:
kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
[0-9]+_[0-9]+_[0-9]+[.]sst
kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
[0-9]+_[0-9a-fA-F-]+[.]sst
kUseDbSessionId AND NOT kFlagIncludeFileSize ->
[0-9]+_s[0-9A-Z]{20}[.]sst
kUseDbSessionId AND kFlagIncludeFileSize (default) ->
[0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst
We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.
This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400
Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.
Reviewed By: ajkr
Differential Revision: D23759587
Pulled By: pdillinger
fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
The patch adds support for exposing the start of the expiration range
for TTL blob files through the `GetLiveFilesMetaData` API. This can be
used for monitoring purposes, i.e. to make sure TTL blob files are
deleted in a timely manner. The patch also fixes a couple of uninitialized
variable issues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7365
Test Plan: `make check`
Reviewed By: pdillinger
Differential Revision: D23605465
Pulled By: ltamasi
fbshipit-source-id: 97a9612bf5f4b058423debdd3f28f576bb23a70f
Summary:
(1) Skip check on specific key if restoring an old backup
(small minority of cases) because it can fail in those cases. (2) Remove
an old assertion about number of column families and number of keys
passed in, which is broken by atomic flush (cf_consistency) test. Like
other code (for better or worse) assume a single key and iterate over
column families. (3) Apply mock_direct_io to NewSequentialFile so that
db_stress backup works on /dev/shm.
Also add more context to output in case of backup/restore db_stress
failure.
Also a minor fix to BackupEngine to report first failure status in
creating new backup, and drop another clue about the potential
source of a "Backup failed" status.
Reverts "Disable backup/restore stress test (https://github.com/facebook/rocksdb/issues/7350)"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7357
Test Plan:
Using backup_one_in=10000,
"USE_CLANG=1 make crash_test_with_atomic_flush" for 30+ minutes
"USE_CLANG=1 make blackbox_crash_test" for 30+ minutes
And with use_direct_reads with TEST_TMPDIR=/dev/shm/rocksdb
Reviewed By: riversand963
Differential Revision: D23567244
Pulled By: pdillinger
fbshipit-source-id: e77171c2e8394d173917e36898c02dead1c40b77
Summary:
The patch adds a class called `BlobFileBuilder` that can be used to build
and cut blob files in background jobs (flushes/compactions). The class
enforces a value size threshold (`min_blob_size`; smaller blobs will be inlined
in the LSM tree itself), and supports specifying a blob file size limit (`blob_file_size`),
as well as compression (`blob_compression_type`) and checksums for blob files.
It also keeps track of the generated blob files and their associated `BlobFileAddition`
metadata, which can be applied as part of the background job's `VersionEdit`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7306
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23298817
Pulled By: ltamasi
fbshipit-source-id: 38f35d81dab1ba81f15236240612ec173d7f21b5
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.
This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.
Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309
Test Plan: updated unit test, before and after
Reviewed By: ajkr
Differential Revision: D23311994
Pulled By: pdillinger
fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5