Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)
I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.
HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262
Test Plan: updated unit tests
Reviewed By: hx235
Differential Revision: D37473155
Pulled By: pdillinger
fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
Summary:
In some case, GetFileSize would be failure in copy_file_cb.
If failure, we can return immediately, the subsequent code
is meaningless, and add a log info let user know that problem
happen here.
Singed-off-by: Yite Gu <ess_gyt@qq.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10176
Reviewed By: cbi42
Differential Revision: D37510888
Pulled By: ajkr
fbshipit-source-id: 044ad8c45852fd19b8cd564b11f65d40c39e296f
Summary:
We saw flakes with the following failure:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
utilities/backup/backup_engine_test.cc:2667: Failure
Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): utilities/backup/backup_engine_test.cc:2667: Failure
Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
Received signal 6 (Aborted)
t/run-backup_engine_test-RateLimiting-BackupEngineRateLimitingTestWithParam.RateLimiting-1: line 4: 1032887 Aborted (core dumped) TEST_TMPDIR=$d ./backup_engine_test --gtest_filter=RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
```
Investigation revealed we forgot to use the mock time `SystemClock` for
restore rate limiting. Then the test used wall clock time, which made
the execution of "GenericRateLimiter::Request:PostTimedWait"
non-deterministic as wall clock time might have advanced enough that
waiting was not needed.
This PR changes restore rate limiting to use
mock time, which guarantees we always execute
"GenericRateLimiter::Request:PostTimedWait". Then the assertions that
rely on times recorded inside that callback should be robust.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10271
Test Plan:
Applied the following patch which guaranteed repro before the fix.
Verified the test passes after this PR even with that patch applied.
```
diff --git a/util/rate_limiter.cc b/util/rate_limiter.cc
index f369e3220..6b3ed82fa 100644
--- a/util/rate_limiter.cc
+++ b/util/rate_limiter.cc
@@ -158,6 +158,7 @@ void GenericRateLimiter::SetBytesPerSecond(int64_t bytes_per_second) {
void GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri,
Statistics* stats) {
+ usleep(100000);
assert(bytes <= refill_bytes_per_period_.load(std::memory_order_relaxed));
bytes = std::max(static_cast<int64_t>(0), bytes);
TEST_SYNC_POINT("GenericRateLimiter::Request");
```
Reviewed By: hx235
Differential Revision: D37499848
Pulled By: ajkr
fbshipit-source-id: fd790d5a192996be8ba13b656751ccc7d8cb8f6e
Summary:
https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation failed during `SanitizeOptions()`,
`DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot
directly expose the error to caller without some additional work.
This is a first version proposal which:
- Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation
- Checks the error during `DB::Open()` and return it to caller if non-ok
This is not very ideal. We can alternatively move the logger creation logic out of the `SanitizeOptions()`.
Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything
in case other callers of `SanitizeOptions()` assumes that a logger should be created.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D37321717
Pulled By: riversand963
fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b
Summary:
`FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced.
The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`.
An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB.
```
Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185
Test Plan:
- repro unit test
- the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list"
Reviewed By: riversand963
Differential Revision: D37200993
Pulled By: ajkr
fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.
However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
- Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
Reviewed By: ajkr, anand1976
Differential Revision: D37135172
Pulled By: hx235
fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.
There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
Test Plan:
- Manual
- Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
- Enabled in MyShadow for 1+ week
- Automated
- Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
- Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
Reviewed By: cbi42
Differential Revision: D36614569
Pulled By: ajkr
fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).
This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).
Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB
Reviewed By: riversand963
Differential Revision: D37212647
Pulled By: pdillinger
fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.
This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```
If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.
Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();
// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
std::shared_ptr<TransactionNotifier> notifier,
TxnTimestamp ts,
std::shared_ptr<const Snapshot>* ret);
// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```
The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```
We also provide two additional APIs for stats collection and reporting purposes.
```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
TxnTimestamp ts_lb,
TxnTimestamp ts_ub,
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```
To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```
Before shutdown, RocksDB will release all timestamped snapshots.
Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.
Timestamped snapshots can also be used with user-defined timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879
Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```
Reviewed By: siying
Differential Revision: D35783919
Pulled By: riversand963
fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
Summary:
With this change, when a given read timestamp is smaller than the column-family's full_history_ts_low, Get(), MultiGet() and iterators APIs will return Status::InValidArgument().
Test plan
```
$COMPILE_WITH_ASAN=1 make -j24 all
$./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.UpdateFullHistoryTsLow
$ make -j24 check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10109
Reviewed By: riversand963
Differential Revision: D36901126
Pulled By: jowlyzhang
fbshipit-source-id: 255feb1a66195351f06c1d0e42acb1ff74527f86
Summary:
We have three related concepts:
* BlockType: an internal enum conceptually indicating a type of SST file
block
* CacheEntryRole: a user-facing enum for categorizing block cache entries,
which is also involved in associated cache entries with an appropriate
deleter. Can include categories for non-block cache entries (e.g. memory
reservations).
* TBlocklike: a C++ type for the actual type behind a void* cache entry.
We had some existing code ugliness because BlockType did not imply
TBlocklike, because of various kinds of "filter" block. This refactoring
fixes that with new BlockTypes.
More clean-up can come in later work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10098
Test Plan: existing tests
Reviewed By: akankshamahajan15
Differential Revision: D36897945
Pulled By: pdillinger
fbshipit-source-id: 3ae496b5caa81e0a0ed85e873eb5b525e2d9a295
Summary:
If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`,
RocksDB will return the timestamps of the point tombstones.
Note: DeleteRange is still unsupported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D36677956
Pulled By: riversand963
fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.
Reviewed By: ajkr, hx235
Differential Revision: D36722135
Pulled By: littlepig2013
fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
Summary:
TSAN test is slower, for `TransactionStressTest` and
`DeadlockStress`, they're reaching the timeout limit of 600 seconds.
Decreasing the transaction test number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10063
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D36711727
Pulled By: jay-zhuang
fbshipit-source-id: 600f82a6d32108f52fbe5572fcc7497607b7fe98
Summary:
For regular db instance and secondary instance, we return error and refuse to open DB if Logger creation fails.
Our current code allows it, but it is really difficult to debug because
there will be no LOG files. The same for OPTIONS file, which will be explored in another PR.
Furthermore, Arena::AllocateAligned(size_t bytes, size_t huge_page_size, Logger* logger) has an
assertion as the following:
```cpp
#ifdef MAP_HUGETLB
if (huge_page_size > 0 && bytes > 0) {
assert(logger != nullptr);
}
#endif
```
It can be removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9984
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D36347754
Pulled By: riversand963
fbshipit-source-id: 529798c0511d2eaa2f0fd40cf7e61c4cbc6bc57e
Summary:
There are currently some preprocessor checks that assume support for Visual Studio versions older than 2015 (i.e., 0 < _MSC_VER < 1900), although we don't support them any more.
We removed all code that only compiles on those older versions, except third-party/ files.
The ROCKSDB_NOEXCEPT symbol is now obsolete, since it now always gets replaced by noexcept. We removed it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10065
Reviewed By: pdillinger
Differential Revision: D36721901
Pulled By: guidotag
fbshipit-source-id: a2892d365ef53cce44a0a7d90dd6b72ee9b5e5f2
Summary:
There are some time-related POSIX APIs that are not available on Windows
(e.g. `localtime_r`), which we have worked around by providing our own
implementations in `port/sys_time.h`. This workaround actually relies on
some ambiguity: on Windows, a call to `localtime_r` calls
`ROCKSDB_NAMESPACE::port::localtime_r` (which is pulled into
`ROCKSDB_NAMESPACE` by a using-declaration), while on other platforms
it calls the global `localtime_r`. This works fine as long as there is only one
candidate function; however, it breaks down when there is more than one
`localtime_r` visible in a scope.
The patch fixes this by introducing `ROCKSDB_NAMESPACE::port::{TimeVal, GetTimeOfDay, LocalTimeR}`
to eliminate any ambiguity.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10045
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D36639372
Pulled By: ltamasi
fbshipit-source-id: fc13dbfa421b7c8918111a6d9e24ce77e91a7c50
Summary:
Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).
The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973
Test Plan:
- `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
- Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
- Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
- Benchmark:
```
strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
```
- db bench on backup and restore to ensure no performance regression.
- backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
- restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)
```
# Set up
./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000
# benchmark
TEST_TMPDIR=/tmp/test_rocksdb
NUM_RUN=50
for ((j=0;j<$NUM_RUN;j++))
do
./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
# Restore
#./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
```
Reviewed By: hx235
Differential Revision: D36327418
Pulled By: cbi42
fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
Summary:
Changed the static objects that had non-trivial destructors to use the STATIC_AVOID_DESTRUCTION construct.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9958
Reviewed By: pdillinger
Differential Revision: D36442982
Pulled By: mrambacher
fbshipit-source-id: 029d47b1374d30d198bfede369a4c0ae7a4eb519
Summary:
**Context:**
`BackupEngineRateLimitingTestWithParam.RateLimiting` and `BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup` involve creating backup and restoring of a big database with rate-limiting. Using the normal env with a normal clock requires real elapse of time (13702 - 19848 ms/per test). As suggested in https://github.com/facebook/rocksdb/pull/8722#discussion_r703698603, this PR is to speed it up with SpecialEnv (`time_elapse_only_sleep=true`) where its clock accepts fake elapse of time during rate-limiting (100 - 600 ms/per test)
**Summary:**
- Added TEST_ function to set clock of the default rate limiters in backup engine
- Shrunk testdb by 10 times while keeping it big enough for testing
- Renamed some test variables and reorganized some if-else branch for clarity without changing the test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9974
Test Plan:
- Run tests pre/post PR the same time to verify the tests are sped up by 90 - 95%
`BackupEngineRateLimitingTestWithParam.RateLimiting`
Pre:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0 (11123 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 (9441 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2 (11096 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3 (9339 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4 (11121 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5 (9413 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6 (11185 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7 (9511 ms)
[----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (82230 ms total)
```
Post:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/0 (395 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 (564 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/2 (358 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/3 (567 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/4 (173 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/5 (176 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/6 (191 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/7 (177 ms)
[----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (2601 ms total)
```
`BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup`
Pre:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0 (7275 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1 (3961 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2 (7117 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3 (3921 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4 (19862 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5 (10231 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6 (19848 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7 (10372 ms)
[----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (82587 ms total)
```
Post:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/0 (157 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/1 (152 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/2 (160 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/3 (158 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/4 (155 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/5 (151 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/6 (146 ms)
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7
[ OK ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimitingVerifyBackup/7 (153 ms)
[----------] 8 tests from RateLimiting/BackupEngineRateLimitingTestWithParam (1232 ms total)
```
Reviewed By: pdillinger
Differential Revision: D36336345
Pulled By: hx235
fbshipit-source-id: 724c6ba745f95f56d4440a6d2f1e4512a2987589
Summary:
These methods allow for more thorough testing of the ObjectRegistry and Customizable infrastructure in a simpler manner. With this change, the Customizable tests can now check what factories are registered and attempt to create each of them in a systematic fashion.
With this change, I think all of the factories registered with the ObjectRegistry/CreateFromString are now tested via the customizable_test classes.
Note that there were a few other minor changes. There was a "posix://*" register with the ObjectRegistry which was missed during the PatternEntry conversion -- these changes found that. The nickname and default names for the FileSystem classes was also inverted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9358
Reviewed By: pdillinger
Differential Revision: D33433542
Pulled By: mrambacher
fbshipit-source-id: 9a32da74e6620745b4eeffb2712be70eeeadfa7e
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
`db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`.
Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947
Test Plan:
- regular blackbox, blackbox --simple
- various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16`
Reviewed By: riversand963
Differential Revision: D36152775
Pulled By: ajkr
fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
Enforce the contract of SingleDelete so that they are not mixed with
Delete for the same key. Otherwise, it will lead to undefined behavior.
See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes.
Also fix unit tests and write-unprepared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D35837817
Pulled By: riversand963
fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635
Summary:
Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions.
1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw.
2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true.
3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn
```
Reviewed By: pdillinger
Differential Revision: D35806678
Pulled By: riversand963
fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616
Summary:
The current locktree implementation stores the address of the
PessimisticTransactions object as the TXNID. However, when a transaction
is blocked on a lock, it records the list of waitees with conflicting
locks using the rocksdb assigned TransactionID. This is performed by
calling GetID() on PessimisticTransactions objects of the waitees,
and then recorded in the waiter's list.
However, there is no guarantee the objects are valid when recording the
waitee list during the conflict callbacks because the waitee
could have released the lock and freed the PessimisticTransactions
object.
The waitee/txnid values are only valid PessimisticTransaction objects
while the mutex for the root of the locktree is held.
The simplest fix for this problem is to use the address of the
PessimisticTransaction as the TransactionID so that it is consistent
with its usage in the locktree. The TXNID is only converted back to a
PessimisticTransaction for the report_wait callbacks. Since
these callbacks are now all made within the critical section where the
lock_request queue mutx is held, these conversions will be safe.
Otherwise, only the uint64_t TXNID of the waitee is registerd
with the waiter transaction. The PessimisitcTransaction object of the
waitee is never referenced.
The main downside of this approach is the TransactionID will not change
if the PessimisticTransaction object is reused for new transactions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9898
Test Plan:
Add a new test case and run unit tests.
Also verified with MyRocks workloads using range locks that the
crash no longer happens.
Reviewed By: riversand963
Differential Revision: D35950376
Pulled By: hermanlee
fbshipit-source-id: 8c9cae272e23e487fc139b6a8ed5b8f8f24b1570
Summary:
This PR does not affect write-committed.
Add a member, `rollback_deletion_type_callback` to TransactionDBOptions
so that a write-prepared transaction, when rolling back, can call this
callback to decide if a `Delete` or `SingleDelete` should be used to
cancel a prior `Put` written to the database during prepare phase.
The purpose of this PR is to prevent mixing `Delete` and `SingleDelete`
for the same key, causing undefined behaviors. Without this PR, the
following can happen:
```
// The application always issues SingleDelete when deleting keys.
txn1->Put('a');
txn1->Prepare(); // writes to memtable and potentially gets flushed/compacted to Lmax
txn1->Rollback(); // inserts DELETE('a')
txn2->Put('a');
txn2->Commit(); // writes to memtable and potentially gets flushed/compacted
```
In the database, we may have
```
L0: [PUT('a', s=100)]
L1: [DELETE('a', s=90)]
Lmax: [PUT('a', s=0)]
```
If a compaction compacts L0 and L1, then we have
```
L1: [PUT('a', s=100)]
Lmax: [PUT('a', s=0)]
```
If a future transaction issues a SingleDelete, we have
```
L0: [SD('a', s=110)]
L1: [PUT('a', s=100)]
Lmax: [PUT('a', s=0)]
```
Then, a compaction including L0, L1 and Lmax leads to
```
Lmax: [PUT('a', s=0)]
```
which is incorrect.
Similar bugs reported and addressed in
https://github.com/cockroachdb/pebble/issues/1255. Based on our team's
current priority, we have decided to take this approach for now. We may
come back and revisit in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9873
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D35762170
Pulled By: riversand963
fbshipit-source-id: b28d56eefc786b53c9844b9ef4a7807acdd82c8d
Summary:
Add a merge operator that allows users to register specific aggregation function so that they can does aggregation based per key using different aggregation types.
See comments of function CreateAggMergeOperator() for actual usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9780
Test Plan: Add a unit test to coverage various cases.
Reviewed By: ltamasi
Differential Revision: D35267444
fbshipit-source-id: 5b02f31c4f3e17e96dd4025cdc49fca8c2868628
Summary:
In `FileMetaData`, we keep track of the lowest-numbered blob file
referenced by the SST file in question for the purposes of BlobDB's
garbage collection in the `oldest_blob_file_number` field, which is
updated in `UpdateBoundaries`. However, with the current code,
`BlobIndex` decoding errors (or invalid blob file numbers) are swallowed
in this method. The patch changes this by propagating these errors
and failing the corresponding flush/compaction. (Note that since blob
references are generated by the BlobDB code and also parsed by
`CompactionIterator`, in reality this can only happen in the case of
memory corruption.)
This change necessitated updating some unit tests that involved
fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like
`"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es.
Some were relying on the earlier behavior to simulate corruption; these
were replaced with `SyncPoint`-based test code that corrupts a valid
blob reference at read time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D35683671
Pulled By: ltamasi
fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6
Summary:
Especially after updating to C++17, I don't see a compelling case for
*requiring* any folly components in RocksDB. I was able to purge the existing
hard dependencies, and it can be quite difficult to strip out non-trivial components
from folly for use in RocksDB. (The prospect of doing that on F14 has changed
my mind on the best approach here.)
But this change creates an optional integration where we can plug in
components from folly at compile time, starting here with F14FastMap to replace
std::unordered_map when possible (probably no public APIs for example). I have
replaced the biggest CPU users of std::unordered_map with compile-time
pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
that is in the Makefile for public CI testing. A full folly build is not needed, but
checking out the full folly repo is much simpler for getting the dependency,
and anything else we might want to optionally integrate in the future.
Some picky details:
* I don't think the distributed mutex stuff is actually used, so it was easy to remove.
* I implemented an alternative to `folly::constexpr_log2` (which is much easier
in C++17 than C++11) so that I could pull out the hard dependencies on
`ConstexprMath.h`
* I had to add noexcept move constructors/operators to some types to make
F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
macro to make that easier in some common cases.
* Updated Meta-internal buck build to use folly F14Map (always)
No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
production integration for open source users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546
Test Plan:
CircleCI tests updated so that a couple of them use folly.
Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
(Note: they should probably use buck but they currently use Makefile.)
Example performance improvement: when filter partitions are pinned in cache,
they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
a test that exercises that heavily. Build DB with
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
```
and test with (simultaneous runs with & without folly, ~20 times each to see
convergence)
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
```
Average ops/s no folly: 26229.2
Average ops/s with folly: 26853.3 (+2.4%)
Reviewed By: ajkr
Differential Revision: D34181736
Pulled By: pdillinger
fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
Summary:
Added a Plugin class to the ObjectRegistry. Enabled compile-time and program-time addition of plugins to the Registry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7949
Reviewed By: mrambacher
Differential Revision: D33517674
Pulled By: pdillinger
fbshipit-source-id: c3e3270aab76a489bfa9e85d78cdfca951912557
Summary:
For write-prepared/write-unprepared transactions,
GetCommitTimeWriteBatch() can be used only if the transaction is started
with `TransactionOptions::use_only_the_last_commit_time_batch_for_recovery` set
to true. Otherwise, it is possible that multiple uncommitted versions of the
same key exist in the database. During bottommost compaction, RocksDB may
set the sequence numbers of both to zero once they become committed, causing
output SST file to have two identical internal keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9794
Test Plan:
make check
pay special attention to the following
```
transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/*
```
Reviewed By: lth
Differential Revision: D35327214
Pulled By: riversand963
fbshipit-source-id: 3bae00a28359c10e96e4c6f676d20de5610d8a0f
Summary:
Various renaming and fixes to get rid of remaining uses of
"backupable" which is terminology leftover from the original, flawed
design of BackupableDB. Now any DB can be backed up, using BackupEngine.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D35334386
Pulled By: pdillinger
fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329
Summary:
In making `SstFileMetaData` inherit from `FileStorageInfo`, I
overlooked setting some `FileStorageInfo` fields when then default
`SstFileMetaData()` ctor is used. This affected `GetLiveFilesMetaData()`.
Also removed some buggy `static_cast<size_t>`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9769
Test Plan: Updated tests
Reviewed By: jay-zhuang
Differential Revision: D35220383
Pulled By: pdillinger
fbshipit-source-id: 05b4ee468258dbd3699517e1124838bf405fe7f8
Summary:
The param name force_erase may be misleading, since the handle is erased only if it has last reference even if the param is set true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9728
Reviewed By: pdillinger
Differential Revision: D35038673
Pulled By: gitbw95
fbshipit-source-id: 0d16d1e8fed17b97eba7fb53207119332f659a5f
Summary:
The primary goal of this change is to add support for backing up and
restoring (applying on restore) file temperature metadata, without
committing to either the DB manifest or the FS reported "current"
temperatures being exclusive "source of truth".
To achieve this goal, we need to add temperature information to backup
metadata, which requires updated backup meta schema. Fortunately I
prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
6.19.0 for this kind of schema update. (Previously, backup meta schema
was not extensible! Making this schema update public will allow some
other "nice to have" features like taking backups with hard links, and
avoiding crc32c checksum computation when another checksum is already
available.) While schema version 2 is newly public, the default schema
version is still 1. Until we change the default, users will need to set
to 2 to enable features like temperature data backup+restore. New
metadata like temperature information will be ignored with a warning
in versions before this change and since 6.19.0. The metadata is
considered ignorable because a functioning DB can be restored without
it.
Some detail:
* Some renaming because "future schema" is now just public schema 2.
* Initialize some atomics in TestFs (linter reported)
* Add temperature hint support to SstFileDumper (used by BackupEngine)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660
Test Plan:
related unit test majorly updated for the new functionality,
including some shared testing support for tracking temperatures in a FS.
Some other tests and testing hooks into production code also updated for
making the backup meta schema change public.
Reviewed By: ajkr
Differential Revision: D34686968
Pulled By: pdillinger
fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629
Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are
locked upon first operation that writes the key or has the intention of writing. For example,
`PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`,
`PessimisticTransaction::SingleDelete()` will write to or delete a key, while
`PessimisticTransaction::GetForUpdate()` is used by application to indicate
to RocksDB that the transaction has the intention of performing write operation later
in the same transaction.
Pessimistic transactions support two-phase commit (2PC). A transaction can be
`Prepared()`'ed and then `Commit()`. The prepare phase is similar to a promise: once
`Prepare()` succeeds, the transaction has acquired the necessary resources to commit.
The resources include locks, persistence of WAL, etc.
Write-committed transaction is the default pessimistic transaction implementation. In
RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare
section. `Commit()` will write a commit marker to the WAL and then write data to the
memtables. While writing to the memtables, different keys in the transaction's write batch
will be assigned different sequence numbers in ascending order.
Until commit/rollback, the transaction holds locks on the keys so that no other transaction
can write to the same keys. Furthermore, the keys' sequence numbers represent the order
in which they are committed and should be made visible. This is convenient for us to
implement support for user-defined timestamps.
Since column families with and without timestamps can co-exist in the same database,
a transaction may or may not involve timestamps. Based on this observation, we add two
optional members to each `PessimisticTransaction`, `read_timestamp_` and
`commit_timestamp_`. If no key in the transaction's write batch has timestamp, then
setting these two variables do not have any effect. For the rest of this commit, we discuss
only the cases when these two variables are meaningful.
read_timestamp_ is used mainly for validation, and should be set before first call to
`GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls
`TryLock()` that can verify if another transaction has written the same key since
`read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed
written the same key, then validation fails, and RocksDB allows this transaction to
refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()`
with a different timestamp to read, but the result of the read should not be used to
determine data that will be written later.
commit_timestamp_ must be set after finishing writing and before transaction commit.
This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after
prepare phase succeeds.
We currently require that the commit timestamp be chosen after all keys are locked. This
means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used
by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`,
`PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`,
etc. will return non-ok status because they specify timestamps before locking the keys.
Users are also prompted to use the `Transaction` APIs when they receive the non-ok status.
Reviewed By: ltamasi
Differential Revision: D31822445
fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43
Summary:
In preparation for more support for file Temperatures in BackupEngine,
this change does some test refactoring:
* Move DBTest2::BackupFileTemperature test to
BackupEngineTest::FileTemperatures, with some updates to make it work
in the new home. This test will soon be expanded for deeper backup work.
* Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
support sharing because of above moved test, but split off the "no link"
part to the test needing it.
* Use custom FileSystems in backupable_db_test rather than custom Envs,
because going through Env file interfaces doesn't support temperatures.
* Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name
parameter to FsyncWithDirOptions, which was required because this
limitation caused a crash only after moving to higher fidelity of
FileSystem interface (vs. LegacyDirectoryWrapper throwing away some
parameter details)
* `backupable_options_` -> `engine_options_` as part of the ongoing
work to get rid of the obsolete "backupable" naming.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655
Test Plan: test code updates only
Reviewed By: jay-zhuang
Differential Revision: D34622183
Pulled By: pdillinger
fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
Summary:
This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps
and reading the keys back. **The index does not have timestamps.**
Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it.
When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before
comparison.
When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be
provided to query the db. The assumption is that data in the batch must be newer than data from the db.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D34354849
Pulled By: riversand963
fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0
Summary:
Add Temperature hints information from RocksDB in API
`NewSequentialFile()`. backup and checkpoint operations need to open the
source files with `NewSequentialFile()`, which will have the temperature
hints. Other operations are not covered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499
Test Plan: Added unittest
Reviewed By: pdillinger
Differential Revision: D34006115
Pulled By: jay-zhuang
fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f
Summary:
Make FilterPolicy into a Customizable class. Allow new FilterPolicy to be discovered through the ObjectRegistry
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590
Reviewed By: pdillinger
Differential Revision: D34327367
Pulled By: mrambacher
fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).
The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
- setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
- benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time thread 1 bg flush
| db->Put("a")
| txn = NewTxn()
| txn->Put("b", "v")
| txn->Prepare() // writes only to 5.log
| db->SwitchMemtable() // memtable 1 has "a"
| // close 5.log,
| // creates 8.log
| trigger flush
| pick memtable 1
| unlock db mutex
| write new sst
| txn->ctwb->Put("gtid", "1") // writes 8.log
| txn->Commit() // writes to 8.log
| // writes to memtable 2
| compute min_log_number_to_keep_2pc, this
| will be 8 (incorrect).
|
| Purge obsolete wals, including 5.log
|
V
```
At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in
memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"},
only the commit marker of txn is in 8.log
The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.
The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571
Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```
Reviewed By: siying
Differential Revision: D34235236
Pulled By: riversand963
fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
Summary:
Add support for doubles to ObjectLibrary::PatternEntry. This support will allow patterns containing a non-integer number to be parsed correctly.
Added appropriate test cases to cover this new option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9577
Reviewed By: pdillinger
Differential Revision: D34269763
Pulled By: mrambacher
fbshipit-source-id: b5ce16cbd3665c2974ec0f3412ef2b403ef8b155
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9562
With per-transaction `read_timestamp_`, it is possible to perform transaction validation after
locking a key in addition to sequence-based validation. Specifically, if a transaction has a
read_timestamp, then we perform timestamp-based validation as well after the key is locked
via `GetForUpdate()`. This is to make sure that no other transaction has modified the key and
committed successfully since the read timestamp (but before the locking operation) which
represents a consistent view of the database.
Reviewed By: ltamasi
Differential Revision: D31822034
fbshipit-source-id: c6f1828b7fc23e4f85e2d1ed73ff51464a058d91
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9537
Add `Transaction::SetReadTimestampForValidation()` and
`Transaction::SetCommitTimestamp()` APIs with default implementation
returning `Status::NotSupported()`. Currently, calling these two APIs do not
have any effect.
Also add checks to `PessimisticTransactionDB`
to enforce that column families in the same db either
- disable user-defined timestamp
- enable 64-bit timestamp
Just to clarify, a `PessimisticTransactionDB` can have some column families without
timestamps as well as column families that enable timestamp.
Each `PessimisticTransaction` can have two optional timestamps, `read_timestamp_`
used for additional validation and `commit_timestamp_` which denotes when the transaction commits.
For now, we are going to support `WriteCommittedTxn` (in a series of subsequent PRs)
Once set, we do not allow decreasing `read_timestamp_`. The `commit_timestamp_` must be
greater than `read_timestamp_` for each transaction and must be set before commit, unless
the transaction does not involve any column family that enables user-defined timestamp.
TransactionDB builds on top of RocksDB core `DB` layer. Though `DB` layer assumes
that user-defined timestamps are byte arrays, `TransactionDB` uses uint64_t to store
timestamps. When they are passed down, they are still interpreted as
byte-arrays by `DB`.
Reviewed By: ltamasi
Differential Revision: D31567959
fbshipit-source-id: b0b6b69acab5d8e340cf174f33e8b09f1c3d3502
Summary:
This change should guarantee that the default ObjectLibrary/Registry are long-lived and not destroyed while the process is running. This will prevent some issues of them being referenced after they were destroyed via the static destruction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9464
Reviewed By: pdillinger
Differential Revision: D33849876
Pulled By: mrambacher
fbshipit-source-id: 7a69177d7c58c81be293fc7ef8e600d47ddbc14b
Summary:
This fix addresses https://github.com/facebook/rocksdb/issues/9299.
If attempting to create a new object via the ObjectRegistry and a factory is not found, the ObjectRegistry will return a "NotSupported" status. This is the same behavior as previously.
If the factory is found but could not successfully create the object, an "InvalidArgument" status is returned. If the factory returned a reason why (in the errmsg), this message will be in the returned status.
In practice, there are two options in the ConfigOptions that control how these errors are propagated:
- If "ignore_unknown_options=true", then both InvalidArgument and NotSupported status codes will be swallowed internally. Both cases will return success
- If "ignore_unsupported_options=true", then having no factory will return success but a failing factory will return an error
- If both options are false, both cases (no and failing factory) will return errors.
In practice this likely only changes Customizable that may be partially available. For example, the JEMallocMemoryAllocator is a built-in allocator that is registered with the system but may not be compiled in. In this case, the status code for this allocator changed from NotSupported("JEMalloc not available") to InvalidArgumen("JEMalloc not available"). Other Customizable builtins/plugins would have the same semantics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9333
Reviewed By: pdillinger
Differential Revision: D33517681
Pulled By: mrambacher
fbshipit-source-id: 8033052d4a4a7b88c2d9f90147b1b4467e51f6fd
Summary:
... seen only in internal clang-analyze runs after https://github.com/facebook/rocksdb/issues/9481
* Mostly, this works around falsely reported leaks by using
std::unique_ptr in some places where clang-analyze was getting
confused. (I didn't see any changes in C++17 that could make our Status
implementation leak memory.)
* Also fixed SetBGError returning address of a stack variable.
* Also fixed another false null deref report by adding an assert.
Also, use SKIP_LINK=1 to speed up `make analyze`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9515
Test Plan:
Was able to reproduce the reported errors locally and verify
they're fixed (except SetBGError). Otherwise, existing tests
Reviewed By: hx235
Differential Revision: D34054630
Pulled By: pdillinger
fbshipit-source-id: 38600ef3da75ddca307dff96b7a1a523c2885c2e
Summary:
Added a CountedFileSystem that tracks a number of file operations (opens, closes, deletes, renames, flushes, syncs, fsyncs, reads, writes). This class was based on the ReportFileOpEnv from db_bench.
This is a stepping stone PR to be able to change the SpecialEnv into a SpecialFileSystem, where several of the file varieties wish to do operation counting.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9283
Reviewed By: pdillinger
Differential Revision: D33062004
Pulled By: mrambacher
fbshipit-source-id: d0d297a7fb9c48c06cbf685e5fa755c27193b6f5
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
This also removes the obsolete names BackupableDBOptions
and UtilityDB. API users must now use BackupEngineOptions and
DBWithTTL::Open. In C API, `rocksdb_backupable_db_*` is replaced
`rocksdb_backup_engine_*`. Similar renaming in Java API.
In reference to https://github.com/facebook/rocksdb/issues/9389
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33780269
Pulled By: pdillinger
fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac
Summary:
Regexes are considered potentially problematic for use in
registering RocksDB extensions, so we are removing
ObjectLibrary::Register() and the Regex public API it depended on (now
unused).
In reference to https://github.com/facebook/rocksdb/issues/9389
Why?
* The power of Regexes can make it hard to reason about which extension
will match what. (The replacement API isn't perfect, but we are at least
"holding the line" on patterns we have seen in practice.)
* It is easy to make regexes that don't quite mean what you think they
mean, such as forgetting that the `.` in `foo.bar` can match any character
or that matching is nondeterministic, as in `a🅱️42` matching `.*:[0-9]+`.
* Some regexes and implementations can have disastrously bad
performance. This might not be much practical concern for ObjectLibray
here, but we don't want to encourage potentially dangerous further use
in production code. (Testing code is fine. See TestRegex.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9439
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D33792342
Pulled By: pdillinger
fbshipit-source-id: 4f64dcb04764e639162c8977a5fa196f67754cec
Summary:
This PR moves RADOS support from RocksDB repo to a separate repo. The new (temporary?) repo
in this PR serves as an example before we finalize the decision on where and who to host RADOS support. At this point,
people can start from the example repo and fork.
The goal is to include this commit in RocksDB 7.0 release.
Reference:
https://github.com/ajkr/dedupfs by ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9206
Test Plan:
Follow instructions in https://github.com/riversand963/rocksdb-rados-env/blob/main/README.md and build
test binary `env_librados_test` and run it.
Also, make check
Reviewed By: ajkr
Differential Revision: D33751690
Pulled By: riversand963
fbshipit-source-id: 30466c62afa9e4619847a48567ed158e62835e35
Summary:
Range Locking supports Lock Escalation. Lock Escalation is invoked when
lock memory is nearly exhausted and it reduced the amount of memory used
by joining adjacent locks.
Bridging the gap between certain locks has adverse effects. For example,
in MyRocks it is not a good idea to bridge the gap between locks in
different indexes, as that get the lock to cover large portions of
indexes, or even entire indexes.
Resolve this by introducing Escalation Barrier. The escalation process
will call the user-provided barrier callback function:
bool(const Endpoint& a, const Endpoint& b)
If the function returns true, there's a barrier between a and b and Lock
Escalation will not try to bridge the gap between a and b.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9290
Reviewed By: akankshamahajan15
Differential Revision: D33486753
Pulled By: riversand963
fbshipit-source-id: f97910b67aba0579ea1d35f523ca6863d3dd018e
Summary:
As title.
This is part of an fb-internal task.
First, remove all `using namespace` statements if applicable.
Next, utilize multiple build platforms and see if anything is broken.
Should anything become broken, fix the compilation errors with as little extra change as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369
Test Plan:
internal build and make check
make clean && make static_lib && cd examples && make all
Reviewed By: pdillinger
Differential Revision: D33517260
Pulled By: riversand963
fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40
Summary:
In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions. The PatternEntry methods were left in place but renamed to AddFactory. The goal is to allow for the deprecation of the original regex Registry method in an upcoming release.
Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362
Reviewed By: pdillinger
Differential Revision: D33432562
Pulled By: mrambacher
fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02
Summary:
Allows the Env to have options (Configurable) and loads like other Customizable classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293
Reviewed By: pdillinger, zhichao-cao
Differential Revision: D33181591
Pulled By: mrambacher
fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7
Summary:
Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex. For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes. For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes. There is also a class for Customizable::IndividualId format matches.
Added tests for the new derivative classes and got all unit tests to pass.
Resolves https://github.com/facebook/rocksdb/issues/9225.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9264
Reviewed By: pdillinger
Differential Revision: D33062001
Pulled By: mrambacher
fbshipit-source-id: c2d2143bd2d38bdf522705c8280c35381b135c03
Summary:
The failure looked like this:
```
utilities/backupable/backupable_db_test.cc:3161: Failure
Value of: db_chroot_env_->FileExists(prev_manifest_path).IsNotFound()
Actual: false
Expected: true
```
The failure could be coerced consistently with the following patch:
```
diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 80410f671..637636791 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -2772,6 +2772,8 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) {
if (job_context.HaveSomethingToClean() ||
job_context.HaveSomethingToDelete() || !log_buffer.IsEmpty()) {
mutex_.Unlock();
+ bg_cv_.SignalAll();
+ sleep(1);
TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FilesFound");
// Have to flush the info logs before bg_flush_scheduled_--
// because if bg_flush_scheduled_ becomes 0 and the lock is
```
The cause was a familiar problem, which is manual flush/compaction may
return before files they obsoleted are removed. The solution is just to
wait for "scheduled" work to complete, which includes all phases
including cleanup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9327
Test Plan:
after this PR, even the above patch to coerce the bug cannot
cause the test to fail.
Reviewed By: riversand963
Differential Revision: D33252208
Pulled By: ajkr
fbshipit-source-id: 720a7eaca58c7247d221911fffe3d5e1dbf581e9
Summary:
locktree is a module providing Range Locking. It has a counter for
the number of times a lock acquisition request was blocked by an
existing conflicting lock and had to wait for it to be released.
Expose this counter in RangeLockManagerHandle::Counters::lock_wait_count.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9289
Reviewed By: jay-zhuang
Differential Revision: D33079182
Pulled By: riversand963
fbshipit-source-id: 25b1a362d9da247536ab5007bd15900b319f139e
Summary:
- Make MemoryAllocator and its implementations into a Customizable class.
- Added a "DefaultMemoryAllocator" which uses new and delete
- Added a "CountedMemoryAllocator" that counts the number of allocs and free
- Updated the existing tests to use these new allocators
- Changed the memkind allocator test into a generic test that can test the various allocators.
- Added tests for creating all of the allocators
- Added tests to verify/create the JemallocNodumpAllocator using its options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980
Reviewed By: zhichao-cao
Differential Revision: D32990403
Pulled By: mrambacher
fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).
The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.
This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)
The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.
Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.
Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126
Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.
### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)
### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime
We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:
```
Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```
These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key. With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.
More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day
After enough data, we get a result at the end:
```
(keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```
If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:
```
(keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```
The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:
```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```
I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.
Reviewed By: zhichao-cao
Differential Revision: D33171746
Pulled By: pdillinger
fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
Summary:
`db_stress` is a user of `FaultInjectionTestFS`. After injecting a write error, `db_stress` probabilistically determins
data drop (https://github.com/facebook/rocksdb/blob/6.27.fb/db_stress_tool/db_stress_test_base.cc#L2615:L2619).
In some of our recent runs of `db_stress`, we found duplicate trailing entries corresponding to file trivial move in
the MANIFEST, causing the recovery to fail, because the file move operation is not idempotent: you cannot delete a
file from a given level twice.
Investigation suggests that data buffering in both `WritableFileWriter` and `FaultInjectionTestFS` may be the root cause.
WritableFileWriter buffers data to write in a memory buffer, `WritableFileWriter::buf_`. After each
`WriteBuffered()`/`WriteBufferedWithChecksum()` succeeds, the `buf_` is cleared.
If the underlying file `WritableFileWriter::writable_file_` is opened in buffered IO mode, then `FaultInjectionTestFS`
buffers data written for each file until next file sync. After an injected error, user of `FaultInjectionFS` can
choose to drop some or none of previously buffered data. If `db_stress` does not drop any unsynced data, then
such data will still exist in the `FaultInjectionTestFS`'s buffer.
Existing implementation of `WritableileWriter::WriteBuffered()` does not clear `buf_` if there is an error. This may lead
to the data being buffered two copies: one in `WritableFileWriter`, and another in `FaultInjectionTestFS`.
We also know that the `WritableFileWriter` of MANIFEST file will close upon an error. During `Close()`, it will flush the
content in `buf_`. If no write error is injected to `FaultInjectionTestFS` this time, then we end up with two copies of the
data appended to the file.
To fix, we clear the `WritableFileWriter::buf_` upon failure as well. We focus this PR on files opened in non-direct mode.
This PR includes a unit test to reproduce a case when write error injection
to `WritableFile` can cause duplicate trailing entries.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9236
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D33033984
Pulled By: riversand963
fbshipit-source-id: ebfa5a0db8cbf1ed73100528b34fcba543c5db31
Summary:
Context:
[Rapid thread creation and deletion](https://github.com/facebook/rocksdb/blob/6.27.fb/utilities/transactions/write_prepared_transaction_test.cc#L439-L444) in `SnapshotConcurrentAccessTest.SnapshotConcurrentAcces` inside a [potentially big loop](https://github.com/facebook/rocksdb/blob/6.27.fb/utilities/transactions/write_prepared_transaction_test.cc#L1238-L1248) can lead to heavy-loading the system with many threads due to delay in actually cleaning up thread's resource in the kernel sometime. We ran into some [flaky failure](https://app.circleci.com/pipelines/github/facebook/rocksdb/10383/workflows/136f1005-80a9-4515-aee9-fe36ac6462a1/jobs/253289) in CI and reproduced it by below:
- Command
```
Added `ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();` like https://github.com/facebook/rocksdb/pull/9276
DEBUG_LEVEL=2 make -j56 write_prepared_transaction_test
GTEST_CATCH_EXCEPTIONS=0 ~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
```
- Stack, where `write_prepared_transaction_test.cc:442` in `https://github.com/facebook/rocksdb/issues/9` points to thread creation
```
[ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
....terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
Received signal 6 (Aborted)
#0 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38) [0x7fc114f39438]
...
https://github.com/facebook/rocksdb/issues/7 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8e73) [0x7fc1158a5e73] ?? ??:0
https://github.com/facebook/rocksdb/issues/8 ./write_prepared_transaction_test() [0x4ca86c] std:🧵:thread<rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> const&, rocksdb::WritePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda()https://github.com/facebook/rocksdb/issues/1}>(rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, s d::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, rocksdb::WritePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long)::{l mbda()https://github.com/facebook/rocksdb/issues/1}&&) /usr/include/c++/5/thread:137 (discriminator 4)
https://github.com/facebook/rocksdb/issues/9 ./write_prepared_transaction_test() [0x4bb80c] rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, rocksdb::W itePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long) /home/circleci/project/utilities/transactions/write_prepared_transaction_test.cc:442
https://github.com/facebook/rocksdb/issues/10 ./write_prepared_transaction_test() [0x4407b6] rocksdb::SnapshotConcurrentAccessTest_SnapshotConcurrentAccess_Test::TestBody() /home/circleci/project/utilities/transactions/write_prepared_transaction_test.cc:1244
...
[109/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 returned/aborted with exit code -6 (34462 ms)
```
- Move thread 2's work into current thread to avoid half of the thread creation cuz there is no difference in doing so. We expect this can make the thread-creation error less often, even though we can't gurantee it from happening again. Considering this is a trivial change with positive impact, it's still worth landing and monitor if it's enough to solve the problem in reality.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9281
Test Plan:
Before the change, repeating the test 200 times with 200 workers failed
`~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1`
```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest
[ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
..unknown file: Failure
C++ exception with description "Resource temporarily unavailable" thrown in the test body.
[ FAILED ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20) (11882 ms)
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest (11882 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11882 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20)
```
After the change: repeating the test 200 times with 200 workers didn't fail, even with repeating the "repeating" for 10 times like below
`for i in {1..10}; do ~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1; done`
```
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
```
It does failed when repeating the test 400 times with 400 workers
`~/project$ ~/gtest-parallel/gtest-parallel -r 400 -w 400 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1`
```
[1/400] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 (2928 ms)
Note: Google Test filter = TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest
[ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1
unknown file: Failure
C++ exception with description "std::bad_alloc" thrown in the test body.
[ FAILED ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20) (2597 ms)
[----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest (2597 ms total)
```
Reviewed By: ajkr
Differential Revision: D33026776
Pulled By: hx235
fbshipit-source-id: 509f57126392821e835e48396e5bf224f4f5dcac
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
This changes write_prepared_transaction_test under CircleCI to
print a stack trace on unhandled exception, so that we can debug rare
exceptions seen in CircleCI:
[ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/24
.......unknown file: Failure
C++ exception with description "Resource temporarily unavailable" thrown in the test body.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9276
Test Plan:
manual run test with seeded 'throw', with and without
CIRCLECI=true environment variable
Reviewed By: ajkr, hx235
Differential Revision: D32996993
Pulled By: pdillinger
fbshipit-source-id: e790408ce204b676d3d84a290e41be511b203bfa
Summary:
You could easily reproduce the failure by injecting sleep(11)
before `store.Flush()`. Fixed by setting TTL time to approximately test
timeout time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9226
Test Plan: manual
Reviewed By: akankshamahajan15
Differential Revision: D32698105
Pulled By: pdillinger
fbshipit-source-id: 40529af9d9f2389585988b7c81dffb120e2795a2
Summary:
**Context:**
Some existing internal calls of `GenericRateLimiter::Request()` in backupable_db.cc and newly added internal calls in https://github.com/facebook/rocksdb/pull/8722/ do not make sure `bytes <= GetSingleBurstBytes()` as required by rate_limiter https://github.com/facebook/rocksdb/blob/master/include/rocksdb/rate_limiter.h#L47.
**Impacts of this bug include:**
(1) In debug build, when `GenericRateLimiter::Request()` requests bytes greater than `GenericRateLimiter:: kMinRefillBytesPerPeriod = 100` byte, process will crash due to assertion failure. See https://github.com/facebook/rocksdb/pull/9063#discussion_r737034133 and for possible scenario
(2) In production build, although there will not be the above crash due to disabled assertion, the bug can lead to a request of small bytes being blocked for a long time by a request of same priority with insanely large bytes from a different thread. See updated https://github.com/facebook/rocksdb/wiki/Rate-Limiter ("Notice that although....the maximum bytes that can be granted in a single request have to be bounded...") for more info.
There is an on-going effort to move rate-limiting to file wrapper level so rate limiting in `BackupEngine` and this PR might be made obsolete in the future.
**Summary:**
- Implemented loop-calling `GenericRateLimiter::Request()` with `bytes <= GetSingleBurstBytes()` as a static private helper function `BackupEngineImpl::LoopRateLimitRequestHelper`
-- Considering make this a util function in `RateLimiter` later or do something with `RateLimiter::RequestToken()`
- Replaced buggy internal callers with this helper function wherever requested byte is not pre-limited by `GetSingleBurstBytes()`
- Removed the minimum refill bytes per period enforced by `GenericRateLimiter` since it is useless and prevents testing `GenericRateLimiter` for extreme case with small refill bytes per period.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9063
Test Plan:
- Added a new test that failed the assertion before this change and now passes
- It exposed bugs in [the write during creation in `CopyOrCreateFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2034-L2043)), [the read of table properties in `GetFileDbIdentities()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2372-L2378)), [some read of metadata in `BackupMeta::LoadFromFile()`](df7cc66e17/utilities/backupable/backupable_db.cc (L2726))
- Passing Existing tests
Reviewed By: ajkr
Differential Revision: D31824535
Pulled By: hx235
fbshipit-source-id: d2b3dea7a64e2a4b1e6a59fca322f0800a4fcbcc
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162
Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
conflict checking. If user-defined timestamp is enabled, it should perform
conflict checking based on timestamps too.
Update TransactionUtil::CheckKey-related methods to verify the timestamp of the
latest version of a key is smaller than the read timestamp. Note that
CheckKeysForConflict() is not updated since it's used only by optimistic
transaction, and we do not plan to update it in this upcoming batch of diffs.
Existing GetLatestSequenceForKey() returns the sequence of the latest
version of a specific user key. Since we support user-defined timestamp, we
need to update this method to also return the timestamp (if enabled) of the
latest version of the key. This will be needed for snapshot validation.
Reviewed By: ltamasi
Differential Revision: D31567960
fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
Summary:
The individual commits in this PR should be self-explanatory.
All small and _very_ low-priority changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5896
Reviewed By: riversand963
Differential Revision: D18065108
Pulled By: mrambacher
fbshipit-source-id: 236b1a1d9d21f982cc08aa67027108dde5eaf280
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105
The user contract of SingleDelete is that: a SingleDelete can only be issued to
a key that exists and has NOT been updated. For example, application can insert
one key `key`, and uses a SingleDelete to delete it in the future. The `key`
cannot be updated or removed using Delete.
In reality, especially when write-prepared transaction is being used, things
can get tricky. For example, a prepared transaction already writes `key` to the
memtable after a successful Prepare(). Afterwards, should the transaction
rollback, it will insert a Delete into the memtable to cancel out the prior
Put. Consider the following sequence of operations.
```
// operation sequence 1
Begin txn
Put(key)
Prepare()
Flush()
Rollback txn
Flush()
```
There will be two SSTs resulting from above. One of the contains a PUT, while
the second one contains a Delete. It is also known that releasing a snapshot
can lead to an L0 containing only a SD for a particular key. Consider the
following operations following the above block.
```
// operation sequence 2
db->Put(key)
db->SingleDelete(key)
Flush()
```
The operation sequence 2 can result in an L0 with only the SD.
Should there be a snapshot for conflict checking created before operation
sequence 1, then an attempt to compact the db may hit the assertion failure
below, because ikey_.type is Delete (from a rollback).
```
else if (clear_and_output_next_key_) {
assert(ikey_.type == kTypeValue || ikey_.type == kTypeBlobIndex);
}
```
To fix the assertion failure, we can skip the SingleDelete if we detect an
earlier Delete in the same snapshot interval.
Reviewed By: ltamasi
Differential Revision: D32056848
fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060
RocksDB bottommost level compaction may zero out an internal key's sequence if
the key's sequence is in the earliest_snapshot.
In write-prepared transaction, checking the visibility of a certain sequence in
a specific released snapshot may return a "snapshot released" result.
Therefore, it is possible, after a certain sequence of events, a PUT has its
sequence zeroed out, but a subsequent SingleDelete of the same key will still
be output with its original sequence. This violates the ascending order of
keys and leads to incorrect result.
The solution is to use an extra variable `last_key_seq_zeroed_` to track the
information about visibility in earliest snapshot. With this variable, we can
know for sure that a SingleDelete is in the earliest snapshot even if the said
snapshot is released during compaction before processing the SD.
Reviewed By: ltamasi
Differential Revision: D31813016
fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
synced. So an API `FsyncAfterFileRename(filename, ...)` is provided
to sync the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is forced by set
`IOOptions.force_dir_fsync = true`
4. renaming multiple files (like backup and checkpoint): dir-fsync is
forced, the same as above.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903
Test Plan: run tests on btrfs and non btrfs
Reviewed By: ajkr
Differential Revision: D30885059
Pulled By: jay-zhuang
fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061
In write-prepared txn, checking a sequence's visibility in a released (old)
snapshot may return "Snapshot released". Suppose we have two snapshots:
```
earliest_snap < earliest_write_conflict_snap
```
If we release `earliest_write_conflict_snap` but keep `earliest_snap` during
bottommost level compaction, then it is possible that certain sequence of
events can lead to a PUT being seq-zeroed followed by a SingleDelete of the
same key. This violates the ascending order of keys, and will cause data
inconsistency.
Reviewed By: ltamasi
Differential Revision: D31813017
fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285
Summary:
This feature was not part of any common or CI build, so no
surprise it broke. Now we can at least ensure compilation. I don't know
how to run the test successfully (missing config file) so it is bypassed
for now.
Fixes https://github.com/facebook/rocksdb/issues/9078
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9088
Test Plan: CI
Reviewed By: mrambacher
Differential Revision: D32009467
Pulled By: pdillinger
fbshipit-source-id: 3e0d1e5fde7f0ece703d48a81479e1cc7392c25c
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.
There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with?
1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image
Not sure if there is more required for travis. Happy to help in any way I can.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962
Reviewed By: mrambacher
Differential Revision: D31802198
Pulled By: pdillinger
fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.
This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.
Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.
Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)
Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968
Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).
Reviewed By: siying
Differential Revision: D31242045
Pulled By: pdillinger
fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.
The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995
Test Plan:
- Verified it fixes the following failure:
```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```
- `make check -j48`
Reviewed By: ltamasi
Differential Revision: D31495388
Pulled By: ajkr
fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
Summary:
Background: Cache warming up will cause potential read performance degradation due to reading blocks from storage to the block cache. Since in production, the workload and access pattern to a certain DB is stable, it is a potential solution to dump out the blocks belonging to a certain DB to persist storage (e.g., to a file) and bulk-load the blocks to Secondary cache before the DB is relaunched. For example, when migrating a DB form host A to host B, it will take a short period of time, the access pattern to blocks in the block cache will not change much. It is efficient to dump out the blocks of certain DB, migrate to the destination host and insert them to the Secondary cache before we relaunch the DB.
Design: we introduce the interface of CacheDumpWriter and CacheDumpRead for user to store the blocks dumped out from block cache. RocksDB will encode all the information and send the string to the writer. User can implement their own writer it they want. CacheDumper and CacheLoad are introduced to save the blocks and load the blocks respectively.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8912
Test Plan: add new tests to lru_cache_test and pass make check.
Reviewed By: pdillinger
Differential Revision: D31452871
Pulled By: zhichao-cao
fbshipit-source-id: 11ab4f5d03e383f476947116361d54188d36ec48
Summary:
This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead.
Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing.
Hopefully, this sort of code hygiene cleanup will speed up the builds...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930
Reviewed By: pdillinger
Differential Revision: D31142788
Pulled By: mrambacher
fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d