Commit graph

11143 commits

Author SHA1 Message Date
Hui Xiao d665afdbf3 Account memory of FileMetaData in global memory limit (#9924)
Summary:
**Context/Summary:**
As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924

Test Plan:
- Previous `make check` verified there are only 2 places where the memory of  the allocated `FileMetaData` can be released
- New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
- db bench (CPU cost of `charge_file_metadata` in write and compact)
   - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
   - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`

table 1 - write

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**

table 2 - compact

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**

- stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1  --cache_size=1` killed as normal

Reviewed By: ajkr

Differential Revision: D36055583

Pulled By: hx235

fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2022-06-14 13:06:40 -07:00
Akanksha Mahajan 40d19bc12c Fix the failure related to io_uring_prep_cancel (#10165)
Summary:
Fix for Internal jobs are failing with
```
 error: no matching function for call to 'io_uring_prep_cancel'
      io_uring_prep_cancel(sqe, posix_handle, 0);
      ^~~~~~~~~~~~~~~~~~~~
note: candidate function not viable: no known conversion from 'rocksdb::Posix_IOHandle *' to '__u64' (aka 'unsigned long long') for 2nd argument
static inline void io_uring_prep_cancel(struct io_uring_sqe *sqe,
```

User data is set using `io_uring_set_data` API so no need to pass posix_handle here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10165

Test Plan: CircleCI jobs

Reviewed By: jay-zhuang

Differential Revision: D37145233

Pulled By: akankshamahajan15

fbshipit-source-id: 05da650e1240e9c6fcc8aed5f0067308dccb164a
2022-06-14 12:35:11 -07:00
Guido Tagliavini Ponce f105e1a501 Make the per-shard hash table fixed-size. (#10154)
Summary:
We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.

Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154

Test Plan: `make -j24 check`

Reviewed By: pdillinger

Differential Revision: D37124451

Pulled By: guidotag

fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
2022-06-13 20:29:00 -07:00
Yanqin Jin bfaf8291c5 Fix a race condition in transaction stress test (#10157)
Summary:
Before this PR, there can be a race condition between the thread calling
`StressTest::Open()` and a background compaction thread calling
`MultiOpsTxnsStressTest::VerifyPkSkFast()`.

```
Time   thread1                             bg_compact_thr
 |     TransactionDB::Open(..., &txn_db_)
 |     db_ is still nullptr
 |                                         db_->GetSnapshot()  // segfault
 |     db_ = txn_db_
 V
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10157

Test Plan: CI

Reviewed By: akankshamahajan15

Differential Revision: D37121653

Pulled By: riversand963

fbshipit-source-id: 6a53117f958e9ee86f77297fdeb843e5160a9331
2022-06-13 18:54:38 -07:00
Akanksha Mahajan c0e0f30667 Implement AbortIO using io_uring (#10125)
Summary:
Implement AbortIO in posix using io_uring to cancel any pending read requests submitted. Its cancelled using io_uring_prep_cancel which sets the IORING_OP_ASYNC_CANCEL flag.

To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT.

Reference: https://kernel.dk/io_uring-whatsnew.pdf,
https://lore.kernel.org/io-uring/d9a8d76d23690842f666c326631ecc2d85b6c1bc.1615566409.git.asml.silence@gmail.com/

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10125

Test Plan: Existing Posix tests.

Reviewed By: anand1976

Differential Revision: D36946970

Pulled By: akankshamahajan15

fbshipit-source-id: 3bc1f1521b3151d01a348fc6431eb3fc85db3a14
2022-06-13 18:07:24 -07:00
Mark Callaghan 04bd347995 Increase num_levels for universal from 8 to 40 (#10158)
Summary:
See https://github.com/facebook/rocksdb/issues/10082 for more details. Trivial move
isn't done for universal when compaction is from L0 into L0. So a too small value for
num_levels with db_bench means there will be fewer trivial moves with universal and
that means that write-amp will increase.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10158

Test Plan: run it

Reviewed By: siying

Differential Revision: D37122519

Pulled By: mdcallag

fbshipit-source-id: 1cb39049676f68a6cc3ea8d105a9965f89d4d09e
2022-06-13 16:24:32 -07:00
Peter Dillinger ad135f3ffd Document design/specification bugs with auto_prefix_mode (#10144)
Summary:
auto_prefix_mode is designed to use prefix filtering in a
particular "safe" set of cases where the upper bound and the seek key
have different prefixes: where the upper bound is the "same length
immediate successor". These conditions are not sufficient to guarantee
the same iteration results as total_order_seek if the DB contains
"short" keys, less than the "full" (maximum) prefix length.

We are not simply disabling the optimization in these successor cases
because it is likely that users are essentially getting what they want
out of existing usage. Especially if users are constructing successor
bounds with the intention of doing a prefix-bounded seek, the existing
behavior is more expected than the total_order_seek behavior.
Consequently, for now we reconcile the bad specification of behavior by
documenting the existing mismatch with total_order_seek.

A closely related issue affects hypothetical comparators like
ReverseBytewiseComparator: if they "correctly" implement
IsSameLengthImmediateSuccessor, auto_prefix_mode could omit more
entries (other than "short" keys noted above). Luckily, the built-in
ReverseBytewiseComparator has an "incorrect" implementation of
IsSameLengthImmediateSuccessor that effectively prevents prefix
optimization and, thus, the bug. This is now documented as a new
constraint on IsSameLengthImmediateSuccessor, and the implementation
tweaked to be simply "safe" rather than "incorrect".

This change also includes unit test updates to demonstrate the above
issues. (Test was cleaned up for readability and simplicity.)

Intended follow-up:
* Tweak documented axioms for prefix_extractor (more details then)
* Consider some sort of fix for this case. I don't know what that would
look like without breaking the performance of existing code. Perhaps
if all keys in an SST file have prefixes that are "full length," we can track
that fact and use it to allow optimization with the "same length
immediate successor", but that would only apply to new files.
* Consider a better system of specifying prefix bounds

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10144

Test Plan: test updates included

Reviewed By: siying

Differential Revision: D37052710

Pulled By: pdillinger

fbshipit-source-id: 5f63b7d65f3f214e4b143e0f9aa1749527c587db
2022-06-13 11:08:50 -07:00
Akanksha Mahajan 8273435c22 Bypass tests instead of skipping to resolve internal failure (#10148)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10148

Reviewed By: hx235

Differential Revision: D37092202

Pulled By: akankshamahajan15

fbshipit-source-id: 12fae5641a1c4ab584e586db95f4044273aba23a
2022-06-12 12:05:11 -07:00
Guido Tagliavini Ponce 415200d792 Assume fixed size key (#10137)
Summary:
FastLRUCache now only supports 16B keys. The tests have changed to reflect this.

Because the unit tests were designed for caches that accept any string as keys, some tests are no longer compatible with FastLRUCache. We have disabled those for runs with FastLRUCache. (We could potentially change all tests to use 16B keys, but we don't because the cache public API does not require this.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10137

Test Plan: make -j24 check

Reviewed By: gitbw95

Differential Revision: D37083934

Pulled By: guidotag

fbshipit-source-id: be1719cf5f8364a9a32bc4555bce1a0de3833b0d
2022-06-10 19:12:18 -07:00
sdong 80afa77660 Run fadvise with mmap file (#10142)
Summary:
Right now with mmap file, we don't run fadvise following users' requests. There is no reason for that so this diff does that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10142

Test Plan:
A simple readrandom against files with page cache dropped shows latency improvement from 7.8 us to 2.8:

./db_bench -use_existing_db --benchmarks=readrandom --num=100

Reviewed By: anand1976

Differential Revision: D37074975

fbshipit-source-id: ccc72bcac1b5fd634eb8fa2b6a5d9afe332e0bf6
2022-06-10 16:34:01 -07:00
Yanqin Jin 1777e5f7e9 Snapshots with user-specified timestamps (#9879)
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.

It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.

This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.

In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.

This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```

If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.

Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();

// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
    std::shared_ptr<TransactionNotifier> notifier,
    TxnTimestamp ts,
    std::shared_ptr<const Snapshot>* ret);

// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```

The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```

We also provide two additional APIs for stats collection and reporting purposes.

```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
    TxnTimestamp ts_lb,
    TxnTimestamp ts_ub,
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```

To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```

Before shutdown, RocksDB will release all timestamped snapshots.

Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.

Timestamped snapshots can also be used with user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879

Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```

Reviewed By: siying

Differential Revision: D35783919

Pulled By: riversand963

fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
2022-06-10 16:07:03 -07:00
gitbw95 f4052d13b7 Enable SecondaryCache::CreateFromString to create sec cache based on the uri for CompressedSecondaryCache (#10132)
Summary:
Update SecondaryCache::CreateFromString and enable it to create sec cache based on the uri for CompressedSecondaryCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10132

Test Plan: Add unit tests.

Reviewed By: anand1976

Differential Revision: D36996997

Pulled By: gitbw95

fbshipit-source-id: 882ad563cff6d38b306a53426ad7e47273f34edc
2022-06-10 12:23:10 -07:00
Peter Dillinger d3a3b02134 Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128)
Summary:
When opening an SST file created using index_type=kHashSearch,
the *current* prefix_extractor would be saved, and used with hash index
if the *new current* prefix_extractor at query time is compatible with
the SST file. This is a problem if the prefix_extractor at SST open time
is not compatible but SetOptions later changes (back) to one that is
compatible.

This change fixes that by using the known compatible (or missing) prefix
extractor we save for use with prefix filtering. Detail: I have moved the
InternalKeySliceTransform wrapper to avoid some indirection and remove
unnecessary fields.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10128

Test Plan:
expanded unit test (using some logic from https://github.com/facebook/rocksdb/issues/10122) that fails
before fix and probably covers some other previously uncovered cases.

Reviewed By: siying

Differential Revision: D36955738

Pulled By: pdillinger

fbshipit-source-id: 0c78a6b0d24054ef2f3cb237bf010c1c5589fb10
2022-06-10 08:51:45 -07:00
Yu Zhang 693dffd8e8 Return try again when full_history_ts_low is higher than requested ts (#10126)
Summary:
This PR helps handle the race condition mentioned in this comment thread: https://github.com/facebook/rocksdb/pull/7884#discussion_r572402281 In case where actual full_history_ts_low is higher than the user's requested ts, return a try again message so they don't have the misconception that data between [ts, full_history_ts_low) is kept.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10126

Test Plan:
```
$COMPILE_WITH_ASAN=1 make -j24 all
$./db_with_timestamp_basic_test --gtest_filter=UpdateFullHistoryTsLowTest.ConcurrentUpdate
$ make -j24 check
```

Reviewed By: riversand963

Differential Revision: D37055368

Pulled By: jowlyzhang

fbshipit-source-id: 787fd0984a246540fa03ac227b1d232590d27828
2022-06-10 08:21:08 -07:00
Peter Dillinger 5fa6ef7f18 Fix fragile CacheTest::ApplyToAllEntriesDuringResize (#10145)
Summary:
As seen in https://github.com/facebook/rocksdb/issues/10137, simply churning the cache key hashes (e.g.
by changing the raw cache keys) could trigger failure in this test, due
to possibility of some cache shard exceeding its portion of capacity
and evicting entries. Updated the test to be less fragile by using
greater margins, and added a pre-check for evictions, which doesn't
manifest as a race condition, before the main check that can race.

Also added stack trace handler to cache_test for debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10145

Test Plan:
test thousands of iterations with gtest-parallel, including
with changes in https://github.com/facebook/rocksdb/issues/10137 that were surfacing the problem. Pre-check
without the fix would always fail with https://github.com/facebook/rocksdb/issues/10137

Reviewed By: guidotag

Differential Revision: D37058771

Pulled By: pdillinger

fbshipit-source-id: a7cf137967aef49c07ae9602d8523c63e7388fab
2022-06-09 19:43:19 -07:00
Bo Wang 1a3e23a251 Update jemalloc version for platform009 (#10143)
Summary:
Update jemalloc version for platform009. Current one is a bit old and the new one can bring some quick wins (e.g. new heap profiling features on devserver).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10143

Test Plan:
1. The building and testing on devserver should work.
2. `db_bench` with `--dump_malloc_stats`
`./db_bench --benchmarks=fillrandom --num=10000000 -db=/db_bench_1 `
`./db_bench --benchmarks=overwrite,stats --num=10000000 -use_existing_db -duration=10 --benchmark_write_rate_limit=2000000 -db=/db_bench_1 `
`./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000  --statistics -db=/db_bench_1 --dump_malloc_stats=true`

Before this PR: jemalloc Version: "5.2.1-1303-g73b8faa7149e452f93e52005c89459da08343570"
After this PR: jemalloc Version:

Reviewed By: anand1976

Differential Revision: D37049347

Pulled By: gitbw95

fbshipit-source-id: 3fcd82cca989047b4bbdfdebe5beba2c4c255ed8
2022-06-09 17:13:13 -07:00
Akanksha Mahajan ecfd4aef0c Enable wal_compression in crash_tests (#10141)
Summary:
Same as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10141

Test Plan:
```
export CRASH_TEST_EXT_ARGS=" --wal_compression=zstd"
 make crash_test -j
```

Reviewed By: riversand963

Differential Revision: D37042810

Pulled By: akankshamahajan15

fbshipit-source-id: 53f0793d78241f1b5c954dcc808cb4c0a3e9172a
2022-06-09 12:08:01 -07:00
Akanksha Mahajan f85b31a2e9 Fix bug for WalManager with compressed WAL (#10130)
Summary:
RocksDB uses WalManager to manage WAL files. In WalManager::ReadFirstLine(), the assumption is that reading the first record of a valid WAL file will return OK status and set the output sequence to non-zero value.
This assumption has been broken by WAL compression which writes a `kSetCompressionType` record which is not associated with any sequence number.
Consequently, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to caller, e.g. Checkpoint, Backup, causing the operations to fail.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10130

Test Plan: - Newly Added test

Reviewed By: riversand963

Differential Revision: D36985744

Pulled By: akankshamahajan15

fbshipit-source-id: dfde7b3be68b6a30b75b49479779748eedf29f7f
2022-06-08 14:16:43 -07:00
Mark Callaghan 9efae14428 Fix parsing of db_bench output (#10124)
Summary:
A recent diff add a few more fields to one of the db_bench output lines that gets parsed.
This diff updates tools/benchmark.sh to handle that.

overwrite    :       7.939 micros/op 125963 ops/sec;   50.5 MB/s

overwrite    :       7.854 micros/op 127320 ops/sec 1800.001 seconds 229176999 operations;   51.0 MB/s

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10124

Test Plan: Run it

Reviewed By: jay-zhuang

Differential Revision: D36945137

Pulled By: mdcallag

fbshipit-source-id: 9c96f79491411da997e369a3be9c6b921a21d0fa
2022-06-08 09:23:36 -07:00
Yanqin Jin f890527b16 Update test for secondary instance in stress test (#10121)
Summary:
This PR updates secondary instance testing in stress test by default.

A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary.

Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan
to read some random keys with only very basic verification to make sure no assertion failure is triggered.

Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled.

Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic
catch-up.

In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not
work well when the underlying file is renamed by primary, e.g. SstFileManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121

Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush
```

Reviewed By: ajkr

Differential Revision: D36939458

Pulled By: riversand963

fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e
2022-06-07 21:07:47 -07:00
Andrew Kryczka ff32346415 Set db_stress defaults for TSAN deadlock detector (#10131)
Summary:
After https://github.com/facebook/rocksdb/issues/9357 we began seeing the following error attempting to acquire
locks for file ingestion:

```
FATAL: ThreadSanitizer CHECK failed: /home/engshare/third-party2/llvm-fb/12/src/llvm/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40)
```

The command was using default values for `ingest_external_file_width`
(1000) and `log2_keys_per_lock` (2). The expected number of locks needed
to update those keys is then (1000 / 2^2) = 250, which is above the 0x40 (64)
limit. This PR reduces the default value of `ingest_external_file_width`
to 100 so the expected number of locks is 25, which is within the limit.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10131

Reviewed By: ltamasi

Differential Revision: D36986307

Pulled By: ajkr

fbshipit-source-id: e918cdb2fcc39517d585f1e5fd2539e185ada7c1
2022-06-07 15:15:09 -07:00
gitbw95 5cbee1f609 Add unit test to verify that the dynamic priority can be passed from compaction to FS (#10088)
Summary:
**Summary:**
Add unit tests to verify that the dynamic priority can be passed from compaction to FS. Compaction reads&writes and other DB reads&writes share the same read&write paths to FSRandomAccessFile or FSWritableFile, so a MockTestFileSystem is added to replace the default filesystem from Env to intercept and verify the io_priority. To prepare the compaction input files, use the default filesystem from Env. To test the io priority of the compaction reads and writes, db_options_.fs is set as MockTestFileSystem.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10088

Test Plan: Add unit tests.

Reviewed By: anand1976

Differential Revision: D36882528

Pulled By: gitbw95

fbshipit-source-id: 120adc15801966f2b8c9fc45285f590a3fff96d1
2022-06-07 11:57:12 -07:00
zczhu b6de139df5 Handle "NotSupported" status by default implementation of Close() in … (#10127)
Summary:
The default implementation of Close() function in Directory/FSDirectory classes returns `NotSupported` status. However, we don't want operations that worked in older versions to begin failing after upgrading when run on FileSystems that have not implemented Directory::Close() yet. So we require the upper level that calls Close() function should properly handle "NotSupported" status instead of treating it as an error status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10127

Reviewed By: ajkr

Differential Revision: D36971112

Pulled By: littlepig2013

fbshipit-source-id: 100f0e6ad1191e1acc1ba6458c566a11724cf466
2022-06-07 09:49:31 -07:00
zczhu 3ee6c9baec Consolidate manual_compaction_paused_ check (#10070)
Summary:
As pointed out by [https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422](https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422), check `manual_compaction_paused` and `manual_compaction_canceled` can be reduced by setting `*canceled` to be true in `DisableManualCompaction()` and `*canceled` to be false in the last time calling `EnableManualCompaction()`.

Changed Tests: The origin `DBTest2.PausingManualCompaction1` uses a callback function to increase `manual_compaction_paused` and the origin CompactionJob/CompactionIterator with `manual_compaction_paused` can detect this. I changed the callback function so that it sets `*canceled` as true if `canceled` is not `nullptr` (to notify CompactionJob/CompactionIterator the compaction has been canceled).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10070

Test Plan: This change does not introduce new features, but some slight difference in compaction implementation. Run the same manual compaction unit tests as before (e.g., PausingManualCompaction[1-4], CancelManualCompaction[1-2], CancelManualCompactionWithListener in db_test2, and db_compaction_test).

Reviewed By: ajkr

Differential Revision: D36949133

Pulled By: littlepig2013

fbshipit-source-id: c5dc4c956fbf8f624003a0f5ad2690240063a821
2022-06-06 18:32:26 -07:00
Yu Zhang a101c9de60 Return "invalid argument" when read timestamp is too old (#10109)
Summary:
With this change, when a given read timestamp is smaller than the column-family's full_history_ts_low, Get(), MultiGet() and iterators APIs will return Status::InValidArgument().
Test plan
```
$COMPILE_WITH_ASAN=1 make -j24 all
$./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.UpdateFullHistoryTsLow
$ make -j24 check
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10109

Reviewed By: riversand963

Differential Revision: D36901126

Pulled By: jowlyzhang

fbshipit-source-id: 255feb1a66195351f06c1d0e42acb1ff74527f86
2022-06-06 14:36:22 -07:00
zczhu 9f244b2119 Fix default implementaton of close() function for Directory/FSDirecto… (#10123)
Summary:
As pointed by anand1976 in his [comment](https://github.com/facebook/rocksdb/pull/10049#pullrequestreview-994255819), previous implementation (adding Close() function in Directory/FSDirectory class) is not backward-compatible. And we mistakenly added the default implementation `return Status::NotSupported("Close")` or `return IOStatus::NotSupported("Close")` in WritableFile class in this [pull request](https://github.com/facebook/rocksdb/pull/10101). This pull request fixes the above issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10123

Reviewed By: ajkr

Differential Revision: D36943661

Pulled By: littlepig2013

fbshipit-source-id: 9dc45f4d2ab3a9d51c30bdfde679f1d13c4d5509
2022-06-06 14:27:31 -07:00
Guido Tagliavini Ponce 2af132c341 Fix overflow bug in standard deviation computation. (#10100)
Summary:
There was an overflow bug when computing the variance in the HistogramStat class.

This manifests, for instance, when running cache_bench with default arguments. This executes 32M lookups/inserts/deletes in a block cache, and then computes (among other things) the variance of the latencies. The variance is computed as ``variance = (cur_sum_squares * cur_num - cur_sum * cur_sum) / (cur_num * cur_num)``, where ``cum_sum_squares`` is the sum of the squares of the samples, ``cur_num`` is the number of samples, and ``cur_sum`` is the sum of the samples. Because the median latency in a typical run is around 3800 nanoseconds, both the ``cur_sum_squares * cur_num`` and ``cur_sum * cur_sum`` terms overflow as uint64_t.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10100

Test Plan: Added a unit test. Run ``make -j24 histogram_test && ./histogram_test``.

Reviewed By: pdillinger

Differential Revision: D36942738

Pulled By: guidotag

fbshipit-source-id: 0af5fb9e2a297a284e8e74c24e604d302906006e
2022-06-06 13:53:47 -07:00
Peter Dillinger 4f78f9699b Refactor: Add BlockTypes to make them imply C++ type in block cache (#10098)
Summary:
We have three related concepts:
* BlockType: an internal enum conceptually indicating a type of SST file
block
* CacheEntryRole: a user-facing enum for categorizing block cache entries,
which is also involved in associated cache entries with an appropriate
deleter. Can include categories for non-block cache entries (e.g. memory
reservations).
* TBlocklike: a C++ type for the actual type behind a void* cache entry.

We had some existing code ugliness because BlockType did not imply
TBlocklike, because of various kinds of "filter" block. This refactoring
fixes that with new BlockTypes.

More clean-up can come in later work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10098

Test Plan: existing tests

Reviewed By: akankshamahajan15

Differential Revision: D36897945

Pulled By: pdillinger

fbshipit-source-id: 3ae496b5caa81e0a0ed85e873eb5b525e2d9a295
2022-06-06 11:16:12 -07:00
Jay Zhuang e36008d863 Disable CI benchmark from #9723 (#10119)
Summary:
The script is broken.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10119

Reviewed By: ltamasi

Differential Revision: D36928076

Pulled By: jay-zhuang

fbshipit-source-id: f325cfd00869c506c64573fe8192cb5b561825d6
2022-06-05 22:28:50 -07:00
Alan Paxton 2f4a0ffef8 CI Benchmarking with CircleCI Runner and OpenSearch Dashboard (EB 1088) (#9723)
Summary:
CircleCI runner based benchmarking. A runner is a dedicate machine configured for CircleCI to perform work on. Our work is a repeatable benchmark, the `benchmark-linux` job in `config.yml`

A runner, in CircleCI terminology, is a machine that is managed by the client (us) rather than running on CircleCI resources in the cloud. This means that we define and configure the iron, and that therefore the performance is repeatable and predictable. Which is what we need for performance regression benchmarking.

On a time schedule (or on commit, during branch development) benchmarks are set off on the runner, and then a script is run `benchmark_log_tool.py` which parses the benchmark output and pushes it into a pre-configured OpenSearch document connected to an OpenSearch dashboard. Members of the team can examine benchmark performance changes on the dashboard.

As time progresses we can add different benchmarks to the suite which gets run.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9723

Reviewed By: pdillinger

Differential Revision: D35555626

Pulled By: jay-zhuang

fbshipit-source-id: c6a905ca04494495c3784cfbb991f5ab90c807ee
2022-06-04 09:31:47 -07:00
yite.gu 560906ab33 Add a simple example of backup and restore (#10054)
Summary:
Add a simple example of backup and restore

Signed-off-by: YiteGu <ess_gyt@qq.com>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10054

Reviewed By: jay-zhuang

Differential Revision: D36678141

Pulled By: ajkr

fbshipit-source-id: 43545356baddb4c2c76c62cd63d7a3238d1f8a00
2022-06-03 23:25:31 -07:00
Levi Tamasi e9c74bc474 Add wide column serialization primitives (#9915)
Summary:
The patch adds some low-level logic that can be used to serialize/deserialize
a sorted vector of wide columns to/from a simple binary searchable string
representation. Currently, there is no user-facing API; this will be implemented in
subsequent stages.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9915

Test Plan: `make check`

Reviewed By: siying

Differential Revision: D35978076

Pulled By: ltamasi

fbshipit-source-id: 33f5f6628ec3bcd8c8beab363b1978ac047a8788
2022-06-03 20:54:48 -07:00
Yanqin Jin 3e02c6e05a Point-lookup returns timestamps of Delete and SingleDelete (#10056)
Summary:
If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`,
RocksDB will return the timestamps of the point tombstones.

Note: DeleteRange is still unsupported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D36677956

Pulled By: riversand963

fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae
2022-06-03 20:00:42 -07:00
Hui Xiao 4bdcc80192 Increase ChargeTableReaderTest/ChargeTableReaderTest.Basic error tolerance rate from 1% to 5% (#10113)
Summary:
**Context:**
https://github.com/facebook/rocksdb/pull/9748 added support to charge table reader memory to block cache. In the test `ChargeTableReaderTest/ChargeTableReaderTest.Basic`, it estimated the table reader memory, calculated the expected number of table reader opened based on this estimation and asserted this number with actual number. The expected number of table reader opened calculated based on estimated table reader memory will not be 100% accurate and should have tolerance for error. It was previously set to 1% and recently encountered an assertion failure that `(opened_table_reader_num) <= (max_table_reader_num_capped_upper_bound), actual: 375 or 376 vs 374` where `opened_table_reader_num` is the actual opened one and `max_table_reader_num_capped_upper_bound` is the estimated opened one (=371 * 1.01). I believe it's safe to increase error tolerance from 1% to 5% hence there is this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10113

Test Plan: - CI again succeeds.

Reviewed By: ajkr

Differential Revision: D36911556

Pulled By: hx235

fbshipit-source-id: 259687dd77b450fea0f5658a5b567a1d31d4b1f7
2022-06-03 19:42:22 -07:00
Zeyi (Rice) Fan c1018b7516 cmake: add an option to skip thirdparty.inc on Windows (#10110)
Summary:
When building RocksDB with getdeps on Windows, `thirdparty.inc` get in the way since `FindXXXX.cmake` are working properly now.

This PR adds an option to skip that file when building RocksDB so we can disable it.

FB: see [D36905191](https://www.internalfb.com/diff/D36905191).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10110

Reviewed By: siying

Differential Revision: D36913882

Pulled By: fanzeyi

fbshipit-source-id: 33d36841dc0d4fe87f51e1d9fd2b158a3adab88f
2022-06-03 19:20:34 -07:00
Levi Tamasi 7d36bc4273 Fix some bugs in verify_random_db.sh (#10112)
Summary:
The patch attempts to fix three bugs in `verify_random_db.sh`:
1) https://github.com/facebook/rocksdb/pull/9937 changed the default for
`--try_load_options` to true in the script's use case, so we have to
explicitly set it to false if the corresponding argument of the script
is 0. This should fix the issue we've been seeing with our forward
compatibility tests where 7.3 is unable to open a database created by
the version on main after adding a new configuration option.
2) The script seems to support two "extra parameters"; however,
in practice, if the second one was set, only that one was passed on to
`ldb`. Now both get forwarded.
3) When running the `diff` command, the base DB directory was passed as
the second argument instead of the file containing the `ldb` output
(this actually seems to work, probably accidentally though).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10112

Reviewed By: pdillinger

Differential Revision: D36911363

Pulled By: ltamasi

fbshipit-source-id: fe29db4e28d373cee51a12322c59050fc50e926d
2022-06-03 16:35:13 -07:00
Yanqin Jin d739de63e5 Fix a bug in WAL tracking (#10087)
Summary:
Closing https://github.com/facebook/rocksdb/issues/10080

When `SyncWAL()` calls `MarkLogsSynced()`, even if there is only one active WAL file,
this event should still be added to the MANIFEST.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10087

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D36797580

Pulled By: riversand963

fbshipit-source-id: 24184c9dd606b3939a454ed41de6e868d1519999
2022-06-03 16:33:00 -07:00
Guido Tagliavini Ponce eb99e08076 Add support for FastLRUCache in cache_bench (#10095)
Summary:
cache_bench can now run with FastLRUCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10095

Test Plan:
- Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 cache_bench``. Then test the appropriate code is used by running ``./cache_bench -cache_type=fast_lru_cache`` and checking that the assert is called. Repeat for LRUCache.
- Verify that FastLRUCache (currently a clone of LRUCache) has similar latency distribution than LRUCache, by comparing the outputs of ``./cache_bench -cache_type=fast_lru_cache`` and ``./cache_bench -cache_type=lru_cache``.

Reviewed By: pdillinger

Differential Revision: D36875834

Pulled By: guidotag

fbshipit-source-id: eb2ad0bb32c2717a258a6ac66ed736e06f826cd8
2022-06-03 13:40:09 -07:00
zczhu 21906d66f6 Add default impl to dir close (#10101)
Summary:
As pointed by anand1976 in his [comment](https://github.com/facebook/rocksdb/pull/10049#pullrequestreview-994255819), previous implementation is not backward-compatible. In this implementation, the default implementation `return Status::NotSupported("Close")` or `return IOStatus::NotSupported("Close")` is added for `Close()` function for `*Directory` classes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10101

Test Plan: DBBasicTest.DBCloseAllDirectoryFDs

Reviewed By: anand1976

Differential Revision: D36899346

Pulled By: littlepig2013

fbshipit-source-id: 430624793362f330cbb8837960f0e8712a944ab9
2022-06-03 12:53:28 -07:00
Guido Tagliavini Ponce cf85607795 Add support for FastLRUCache in db_bench. (#10096)
Summary:
db_bench can now run with FastLRUCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10096

Test Plan:
- Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 db_bench``. Then test the appropriate code is used by running ``./db_bench -cache_type=fast_lru_cache`` and checking that the assert is called. Repeat for LRUCache.
- Verify that FastLRUCache (currently a clone of LRUCache) produces similar benchmark data than LRUCache, by comparing the outputs of ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=fast_lru_cache`` and ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=lru_cache``.

Reviewed By: gitbw95

Differential Revision: D36898774

Pulled By: guidotag

fbshipit-source-id: f9f6b6f6da124f88b21b3c8dee742fbb04eff773
2022-06-03 11:16:49 -07:00
Yanqin Jin 2b3c50c429 Temporarily disable wal compression (#10108)
Summary:
Will re-enable after fixing the bug in https://github.com/facebook/rocksdb/issues/10099 and https://github.com/facebook/rocksdb/issues/10097.
Right now, the priority is https://github.com/facebook/rocksdb/issues/10087, but the bug in WAL compression prevents the mini crash test from passing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10108

Reviewed By: pdillinger

Differential Revision: D36897214

Pulled By: riversand963

fbshipit-source-id: d64dc52738222d5f66003f7731dc46eaeed812be
2022-06-03 10:22:52 -07:00
Mark Callaghan 5506954b1f Enhance to support more tuning options, and universal and integrated… (#9704)
Summary:
… BlobDB for all tests

This does two big things:
* provides more tuning options
* supports universal and integrated BlobDB for all of the benchmarks that are leveled-only

It does several smaller things, and I will list a few
* sets l0_slowdown_writes_trigger which wasn't set before this diff.
* improves readability in report.tsv by using smaller field names in the header
* adds more columns to report.tsv

report.tsv before this diff:
```
ops_sec mb_sec  total_size_gb   level0_size_gb  sum_gb  write_amplification     write_mbps      usec_op percentile_50   percentile_75   percentile_99   percentile_99.9 percentile_99.99        uptime  stall_time      stall_percent   test_name       test_date      rocksdb_version  job_id
823294  329.8   0.0     21.5    21.5    1.0     183.4   1.2     1.0     1.0     3       6       14      120     00:00:0.000     0.0     fillseq.wal_disabled.v400       2022-03-16T15:46:45.000-07:00   7.0
326520  130.8   0.0     0.0     0.0     0.0     0       12.2    139.8   155.1   170     234     250     60      00:00:0.000     0.0     multireadrandom.t4      2022-03-16T15:48:47.000-07:00   7.0
86313   345.7   0.0     0.0     0.0     0.0     0       46.3    44.8    50.6    75      84      108     60      00:00:0.000     0.0     revrangewhilewriting.t4 2022-03-16T15:50:48.000-07:00   7.0
101294  405.7   0.0     0.1     0.1     1.0     1.6     39.5    40.4    45.9    64      75      103     62      00:00:0.000     0.0     fwdrangewhilewriting.t4 2022-03-16T15:52:50.000-07:00   7.0
258141  103.4   0.0     0.1     1.2     18.2    19.8    15.5    14.3    18.1    28      34      48      62      00:00:0.000     0.0     readwhilewriting.t4     2022-03-16T15:54:51.000-07:00   7.0
334690  134.1   0.0     7.6     18.7    4.2     308.8   12.0    11.8    13.7    21      30      62      62      00:00:0.000     0.0     overwrite.t4.s0 2022-03-16T15:56:53.000-07:00   7.0
```
report.tsv with this diff:
```
ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
831144  332.9   22GB    0.0GB,  21.7    1.0     185.1   264     262     0       0       1.2     1.0     3       6       14      9198    120     0.0     0       0.4     0.0     0.7     fillseq.wal_disabled.v400       2022-03-16T16:21:23     7.0
325229  130.3   22GB    0.0GB,  0.0             0.0     0       0       0       0       12.3    139.8   170     237     249     572     60      0.0     0       0.4     0.1     1.2     multireadrandom.t4      2022-03-16T16:23:25     7.0
312920  125.3   26GB    0.0GB,  11.1    2.6     189.3   115     113     0       0       12.8    11.8    21      34      1255    6442    60      0.2     1       0.7     0.1     0.6     overwritesome.t4.s0     2022-03-16T16:25:27     7.0
81698   327.2   25GB    0.0GB,  0.0             0.0     0       0       0       0       48.9    46.2    79      246     369     9445    60      0.0     0       0.4     0.1     1.4     revrangewhilewriting.t4 2022-03-16T16:30:21     7.0
92484   370.4   25GB    0.0GB,  0.1     1.5     1.1     1       0       0       0       43.2    42.3    75      103     110     9512    62      0.0     0       0.4     0.1     1.4     fwdrangewhilewriting.t4 2022-03-16T16:32:24     7.0
241661  96.8    25GB    0.0GB,  0.1     1.5     1.1     1       0       0       0       16.5    17.1    30      34      49      9092    62      0.0     0       0.4     0.1     1.4     readwhilewriting.t4     2022-03-16T16:34:27     7.0
305234  122.3   30GB    0.0GB,  12.1    2.7     201.7   127     124     0       0       13.1    11.8    21      128     1934    6339    62      0.0     0       0.7     0.1     0.7     overwrite.t4.s0 2022-03-16T16:36:30     7.0
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9704

Test Plan: run it

Reviewed By: jay-zhuang

Differential Revision: D36864627

Pulled By: mdcallag

fbshipit-source-id: d5af1cfc258a16865210163fa6fd1b803ab1a7d3
2022-06-03 08:20:10 -07:00
Levi Tamasi 7b2c0140ba Fix Java build (#10105)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10105

Reviewed By: cbi42

Differential Revision: D36891073

Pulled By: ltamasi

fbshipit-source-id: 16487ec708fc96add2a1ebc2d98f6439dfc852ca
2022-06-03 08:11:31 -07:00
Levi Tamasi b8fe7df2e5 Fix LITE build (#10106)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10106

Reviewed By: cbi42

Differential Revision: D36891284

Pulled By: ltamasi

fbshipit-source-id: 304ffa84549201659feb0b74d6ba54a83f08906b
2022-06-02 23:42:41 -07:00
zczhu e88d8935ae Add comments/permit unchecked error to close_db_dir pull requests (#10093)
Summary:
In [close_db_dir](https://github.com/facebook/rocksdb/pull/10049) pull request, some merging conflicts occurred (some comments and one line `s.PermitUncheckedError()` are missing). This pull request aims to put them back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10093

Reviewed By: ajkr

Differential Revision: D36884117

Pulled By: littlepig2013

fbshipit-source-id: 8c0e2a8793fc52804067c511843bd1ff4912c1c3
2022-06-02 21:52:35 -07:00
Yanqin Jin ed50ccd19a Install zstd on CircleCI linux (#10102)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10102

Reviewed By: jay-zhuang

Differential Revision: D36885468

Pulled By: riversand963

fbshipit-source-id: 6ed5b62dda8fe0f4be4b66d09bdec0134cf4500c
2022-06-02 21:38:29 -07:00
Gang Liao e6432dfd4c Make it possible to enable blob files starting from a certain LSM tree level (#10077)
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.

In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077

Reviewed By: ltamasi

Differential Revision: D36884156

Pulled By: gangliao

fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
2022-06-02 20:04:33 -07:00
Jay Zhuang a020031552 Add kLastTemperature as temperature high bound (#10044)
Summary:
Only used as temperature high bound for current code, may
increase with more temperatures added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10044

Test Plan: ci

Reviewed By: siying

Differential Revision: D36633410

Pulled By: jay-zhuang

fbshipit-source-id: eecdfa7623c31778c31d789902eacf78aad7b482
2022-06-02 13:10:49 -07:00
Gang Liao 3dc6ebaf74 Support specifying blob garbage collection parameters when CompactRange() (#10073)
Summary:
Garbage collection is generally controlled by the BlobDB configuration options `enable_blob_garbage_collection` and `blob_garbage_collection_age_cutoff`. However, there might be use cases where we would want to temporarily override these options while performing a manual compaction. (One use case would be doing a full key-space manual compaction with full=100% garbage collection age cutoff in order to minimize the space occupied by the database.) Our goal here is to make it possible to override the configured GC parameters when using the `CompactRange` API to perform manual compactions. This PR would involve:

- Extending the `CompactRangeOptions` structure so clients can both force-enable and force-disable GC, as well as use a different cutoff than what's currently configured
- Storing whether blob GC should actually be enabled during a certain manual compaction and the cutoff to use in the `Compaction` object (considering the above overrides) and passing it to `CompactionIterator` via `CompactionProxy`
- Updating the BlobDB wiki to document the new options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10073

Test Plan: Adding unit tests and adding the new options to the stress test tool.

Reviewed By: ltamasi

Differential Revision: D36848700

Pulled By: gangliao

fbshipit-source-id: c878ef101d1c612429999f513453c319f75d78e9
2022-06-01 19:40:26 -07:00
Zichen Zhu 65893ad959 Explicitly closing all directory file descriptors (#10049)
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049

Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.

Reviewed By: ajkr, hx235

Differential Revision: D36722135

Pulled By: littlepig2013

fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
2022-06-01 18:03:34 -07:00