Commit Graph

11228 Commits

Author SHA1 Message Date
zczhu e716bda010 Add FLAGS_compaction_pri into crash_test (#10255)
Summary:
Add FLAGS_compaction_pri into correctness test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10255

Test Plan: run crash_test with FLAGS_compaction_pri

Reviewed By: ajkr

Differential Revision: D37510372

Pulled By: littlepig2013

fbshipit-source-id: 73d93a0a047d0c3993c8a512383dd6ee6acef641
2022-06-30 22:56:58 -07:00
Akanksha Mahajan 11215e0f3a Fix bug in Logger creation if dbname and db_log_dir are on different filesystem (#10292)
Summary:
If dbname and db_log_dir are at different filesystems (one
local and one remote), creation of dbname will fail because that path
doesn't exist wrt to db_log_dir.
This patch will ignore the error returned on creation of dbname. If they
are on same filesystem, db_log_dir creation will automatically return
the error in case there is any error in creation of dbname.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10292

Test Plan: Existing unit tests

Reviewed By: riversand963

Differential Revision: D37567773

Pulled By: akankshamahajan15

fbshipit-source-id: 005d28c536208d4c126c8cb8e196d1d85b881100
2022-06-30 19:04:25 -07:00
sdong 4428c76181 Multi-File Trivial Move in L0->L1 (#10188)
Summary:
In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify.
1. We always try to find L0->L1 trivial move from the oldest files. Keep including newer files, until adding a new file won't trigger a trivial move
2. Modify the trivial move condition so that this compaction would be tagged as trivial move.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188

Test Plan:
See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:

./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes

The throughput improved by about 50%. Stalling still happens though.

Reviewed By: jay-zhuang

Differential Revision: D37224743

fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
2022-06-30 18:04:23 -07:00
zczhu 4f51101d31 Remove compact cursor when split sub-compactions (#10289)
Summary:
In round-robin compaction priority, when splitting the compaction into sub-compactions, the earlier implementation takes into account the compact cursor to have full use of available sub-compactions. But this may result in unbalanced sub-compactions, so we remove this here.  The removal does not affect the cursor-based splitting mechanism within a sub-compaction, and thus the output files are still ensured to be split according to the cursor.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10289

Reviewed By: ajkr

Differential Revision: D37559091

Pulled By: littlepig2013

fbshipit-source-id: b8b45b99f63b09cf873f7f049bcb4ab13871fffc
2022-06-30 15:36:46 -07:00
Mark Callaghan 720ab355f9 Add undefok for BlobDB options not supported prior to 7.5 (#10276)
Summary:
This adds --undefok to support use of this script with BlobDB for db_bench versions prior
to 7.5 when the options land in a release.

While there is a limit to how far back this script can go WRT backwards compatiblity,
this is an easy change to support early 7.x releases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276

Test Plan: Run it with versions of db_bench that do not and then do support these options

Reviewed By: gangliao

Differential Revision: D37529299

Pulled By: mdcallag

fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8
2022-06-30 14:07:26 -07:00
sdong b397dcd390 Change The Way Level Target And Compaction Score Are Calculated (#10057)
Summary:
The current level targets for dynamical leveling has a problem: the target level size will dramatically change after a L0->L1 compaction. When there are many L0 bytes, lower level compactions are delayed, but they will be resumed after the L0->L1 compaction finishes, so the expected write amplification benefits might not be realized. The proposal here is to revert the level targetting size, but instead relying on adjusting score for each level to prioritize levels that need to compact most.
Basic idea:
(1) target level size isn't adjusted, but score is adjusted. The reasoning is that with parallel compactions, holding compactions from happening might not be desirable, but we would like the compactions are scheduled from the level we feel most needed. For example, if we have a extra-large L2, we would like all compactions are scheduled for L2->L3 compactions, rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on. Should we compact L2->L3 or L4->L5. So the proposal for that is:
(2) the score is calculated by actual level size / (target size + estimated upper bytes coming down). The reasoning is that if we have a large amount of pending L0/L1 bytes coming down, compacting L2->L3 might be more expensive, as when the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, when the amount of bytes coming down to L5, the impacts to L5->L6 fanout are much less. So when calculating target score, we can adjust it by adding estimated downward bytes to the target level size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10057

Test Plan: Repurpose tests VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.

Reviewed By: ajkr

Differential Revision: D37539742

fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4
2022-06-30 13:32:47 -07:00
Gang Liao 056e08d6c4 Enable blob caching for MultiGetBlob in RocksDB (#10272)
Summary:
- [x] Enabled blob caching for MultiGetBlob in RocksDB
- [x] Refactored MultiGetBlob logic and interface in RocksDB
- [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource
- [x] Add End-to-end test cases in db_blob_basic_test and also add unit tests in blob_source_test

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10272

Reviewed By: ltamasi

Differential Revision: D37558112

Pulled By: gangliao

fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db
2022-06-30 13:24:35 -07:00
Andrew Kryczka 20754b3654 include compaction cursors in VersionEdit debug string (#10288)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10288

Test Plan:
try it out -

```
$ ldb manifest_dump --db=/dev/shm/rocksdb.0uWV/rocksdb_crashtest_whitebox/ --hex --verbose | grep CompactCursor | head -3
  CompactCursor: 1 '00000000000011D9000000000000012B0000000000000266' seq:0, type:1
  CompactCursor: 1 '0000000000001F35000000000000012B0000000000000022' seq:0, type:1
  CompactCursor: 2 '00000000000011D9000000000000012B0000000000000266' seq:0, type:1
```

Reviewed By: littlepig2013

Differential Revision: D37557177

Pulled By: ajkr

fbshipit-source-id: 7b76b857d9e7a9f3d53398a61bb1d4b78873b91e
2022-06-30 12:46:45 -07:00
Yueh-Hsuan Chiang 17a6f7faaf Add load_latest_options() to C api (#10152)
Summary:
Add load_latest_options() to C api.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10152

Test Plan:
Extend the existing c_test by reopening db using the latest options file
at different parts of the test.

Reviewed By: akankshamahajan15

Differential Revision: D37305225

Pulled By: ajkr

fbshipit-source-id: 8b3bab73f56fa6fcbdba45aae393145d007b3962
2022-06-30 11:03:52 -07:00
Yanqin Jin b87c355772 Fix assertion error with read_opts.iter_start_ts (#10279)
Summary:
If the internal iterator is not valid, `SeekToLast` with iter_start_ts should have `valid_` is false without assertion failure.
Test plan
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10279

Reviewed By: ltamasi

Differential Revision: D37539393

Pulled By: riversand963

fbshipit-source-id: 8e94057838f8a05144fad5768f4d62f1893ec315
2022-06-30 10:16:03 -07:00
Guido Tagliavini Ponce 57a0e2f304 Clock cache (#10273)
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.

Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273

Test Plan: ``make -j24 check``

Reviewed By: pdillinger

Differential Revision: D37522461

Pulled By: guidotag

fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
2022-06-29 21:50:39 -07:00
Johnny Shaw c2dc4c0c52 Fix GetWindowsErrSz nullptr bug (#10282)
Summary:
`GetWindowsErrSz` may assign a `nullptr` to `std::string` in the event it cannot format the error code to a string. This will result in a crash when `std::string` attempts to calculate the length from `nullptr`.

The change here checks the output from `FormatMessageA` and only assigns to the otuput `std::string` if it is not null. Additionally, the call to free the buffer is only made if a non-null value is returned from `FormatMessageA`. In the event `FormatMessageA` does not output a string, an empty string is returned instead.

Fixes https://github.com/facebook/rocksdb/issues/10274

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10282

Reviewed By: riversand963

Differential Revision: D37542143

Pulled By: ajkr

fbshipit-source-id: c21f5119ddb451f76960acec94639d0f538052f2
2022-06-29 20:41:54 -07:00
leipeng 490fcac078 WriteBatch reorder fields to reduce padding (#10266)
Summary:
this reorder reduces sizeof(WriteBatch) by 16 bytes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10266

Reviewed By: akankshamahajan15

Differential Revision: D37505201

Pulled By: ajkr

fbshipit-source-id: 6cb6c3735073fcb63921f822d5e15670fecb1c26
2022-06-29 13:02:48 -07:00
sdong 6115254416 Fix A Bug Where Concurrent Compactions Cause Further Slowing Down (#10270)
Summary:
Currently, when installing a new super version, when stalling condition triggers, we compare estimated compaction bytes to previously, and if the new value is larger or equal to the previous one, we reduce the slowdown write rate. However, if concurrent compactions happen, the same value might be used. The result is that, although some compactions reduce estimated compaction bytes, we treat them as a signal for further slowing down. In some cases, it causes slowdown rate drops all the way to the minimum, far lower than needed.

Fix the bug by not triggering a re-calculation if a new super version doesn't have Version or a memtable change. With this fix, number of compaction finishes are still undercounted in this algorithm, but it is still better than the current bug where they are negatively counted.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270

Test Plan: Run a benchmark where the slowdown rate is dropped to minimal unnessarily and see it is back to a normal value.

Reviewed By: ajkr

Differential Revision: D37497327

fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4
2022-06-29 11:20:36 -07:00
Edvard Davtyan 12bfd519de Expose LRU cache num_shard_bits paramater in C api (#10222)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10222

Reviewed By: cbi42

Differential Revision: D37358171

Pulled By: ajkr

fbshipit-source-id: e86285fdceaec943415ee9d482090009b00cbc95
2022-06-29 11:12:25 -07:00
Mark Callaghan 28f2d3cca6 Benchmark fix write amplification computation (#10236)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10236

Reviewed By: ajkr

Differential Revision: D37489898

Pulled By: mdcallag

fbshipit-source-id: 4b4565973b1f2c47342b4d1b857c8f89e91da145
2022-06-29 07:22:22 -07:00
Yanqin Jin b6cfda1283 Support `iter_start_ts` for backward iteration (#10200)
Summary:
Resolves https://github.com/facebook/rocksdb/issues/9761

With this PR, applications can create an iterator with the following
```cpp
ReadOptions read_opts;
read_opts.timestamp = &ts_ub;
read_opts.iter_start_ts = &ts_lb;
auto* it = db->NewIterator(read_opts);
it->SeekToLast();
// or it->SeekForPrev("foo");
it->Prev();
...
```
The application can access different versions of the same user key via `key()`, `value()`, and `timestamp()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10200

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D37258074

Pulled By: riversand963

fbshipit-source-id: 3f0b866ade50dcff7ef60d506397a9dd6ec91565
2022-06-28 19:51:05 -07:00
Peter Dillinger d96febeeaa Update/clarify required properties for prefix extractors (#10245)
Summary:
Most of the properties listed as required for prefix extractors
are not really required but offer some conveniences. This updates API
comments to clarify actual requirements, and adds tests to demonstrate
how previously presumed requirements can be safely violated.

This might seem like a useless exercise, but this relaxing of requirements
would be needed if we generalize prefixes to group keys not just at the
byte level but also based on bits or arbitrary value ranges. For
applications without a "natural" prefix size, having only byte-level
granularity often means one prefix size to the next differs in magnitude
by a factor of 256.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10245

Test Plan: Tests added, also covering missing Iterator cases from https://github.com/facebook/rocksdb/issues/10244

Reviewed By: bjlemaire

Differential Revision: D37371559

Pulled By: pdillinger

fbshipit-source-id: ab2dd719992eea7656e9042cf8542393e02fa244
2022-06-28 16:08:30 -07:00
Andrew Kryczka ca81b80d83 Deflake RateLimiting/BackupEngineRateLimitingTestWithParam (#10271)
Summary:
We saw flakes with the following failure:

```
[ RUN      ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
utilities/backup/backup_engine_test.cc:2667: Failure
Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what():  utilities/backup/backup_engine_test.cc:2667: Failure
Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
Received signal 6 (Aborted)
t/run-backup_engine_test-RateLimiting-BackupEngineRateLimitingTestWithParam.RateLimiting-1: line 4: 1032887 Aborted                 (core dumped) TEST_TMPDIR=$d ./backup_engine_test --gtest_filter=RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
```

Investigation revealed we forgot to use the mock time `SystemClock` for
restore rate limiting. Then the test used wall clock time, which made
the execution of "GenericRateLimiter::Request:PostTimedWait"
non-deterministic as wall clock time might have advanced enough that
waiting was not needed.

This PR changes restore rate limiting to use
mock time, which guarantees we always execute
"GenericRateLimiter::Request:PostTimedWait". Then the assertions that
rely on times recorded inside that callback should be robust.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10271

Test Plan:
Applied the following patch which guaranteed repro before the fix.
Verified the test passes after this PR even with that patch applied.

```
 diff --git a/util/rate_limiter.cc b/util/rate_limiter.cc
index f369e3220..6b3ed82fa 100644
 --- a/util/rate_limiter.cc
+++ b/util/rate_limiter.cc
@@ -158,6 +158,7 @@ void GenericRateLimiter::SetBytesPerSecond(int64_t bytes_per_second) {

 void GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri,
                                  Statistics* stats) {
+  usleep(100000);
   assert(bytes <= refill_bytes_per_period_.load(std::memory_order_relaxed));
   bytes = std::max(static_cast<int64_t>(0), bytes);
   TEST_SYNC_POINT("GenericRateLimiter::Request");
```

Reviewed By: hx235

Differential Revision: D37499848

Pulled By: ajkr

fbshipit-source-id: fd790d5a192996be8ba13b656751ccc7d8cb8f6e
2022-06-28 14:27:49 -07:00
Gang Liao d7ebb58cb5 Add blob cache tickers, perf context statistics, and DB properties (#10203)
Summary:
In order to be able to monitor the performance of the new blob cache, we made the follow changes:
- Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics)
- Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context)
- Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10203

Reviewed By: ltamasi

Differential Revision: D37478658

Pulled By: gangliao

fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40
2022-06-28 13:52:35 -07:00
Guido Tagliavini Ponce c6055cba30 Calculate table size of FastLRUCache more accurately (#10235)
Summary:
Calculate the required size of the hash table in FastLRUCache more accurately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10235

Test Plan: ``make -j24 check``

Reviewed By: gitbw95

Differential Revision: D37460546

Pulled By: guidotag

fbshipit-source-id: 7945128d6f002832f8ed922ef0151919f4350854
2022-06-27 21:04:59 -07:00
Gang Liao a1eb02f089 Change the semantics of bytes_read in GetBlob/MultiGetBlob for consistency (#10248)
Summary:
The `bytes_read` returned by the current BlobSource interface is ambiguous. The uncompressed blob size is returned if the cache hits. The size of the blob read from disk, presumably the compressed version, is returned if the cache misses. Two differing semantics might cause ambiguity and consistency issues. For example, this inconsistency causes the assertion failure (T124246362 and its hot fix is https://github.com/facebook/rocksdb/issues/10249).

This goal is to require that the value of `byte read` always be an on-disk blob record size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10248

Reviewed By: ltamasi

Differential Revision: D37470292

Pulled By: gangliao

fbshipit-source-id: fbca521b2791d3674dbf2484cea5fcae2fdd94d2
2022-06-27 17:15:21 -07:00
Levi Tamasi 0d1e0722ef Fix in-place updates for value types other than kTypeValue (#10254)
Summary:
The patch fixes a couple of issues related to in-place updates: 1) the value type was not passed from
`MemTableInserter::PutCFImpl` to `MemTable::Update` and 2) `MemTable::UpdateCallback` was called
for any value type (with the callee's logic assuming `kTypeValue`) even though the callback mechanism
is only safe for plain values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10254

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D37463644

Pulled By: ltamasi

fbshipit-source-id: 33802477dac0691681f416ae84c4d9742c6fe41a
2022-06-27 16:37:09 -07:00
Yanqin Jin d3de59255a Enable compaction filter for db_stress with user-defined timestamp (#10259)
Summary:
Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter.

This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of
the presence of timestamps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259

Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts

Reviewed By: ajkr

Differential Revision: D37459692

Pulled By: riversand963

fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef
2022-06-27 11:53:09 -07:00
Levi Tamasi c73d2a9d18 Add API for writing wide-column entities (#10242)
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D37369748

Pulled By: ltamasi

fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
2022-06-25 15:30:47 -07:00
Andrew Kryczka f322f273b0 Temporarily disable mempurge in crash test (#10252)
Summary:
Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252

Reviewed By: riversand963

Differential Revision: D37432948

Pulled By: ajkr

fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
2022-06-24 17:11:27 -07:00
Bo Wang 8e63d90ff8 Pass rate_limiter_priority through filter block reader functions to FS (#10251)
Summary:
With https://github.com/facebook/rocksdb/pull/9996 , we can pass the rate_limiter_priority to FS for most cases. This PR is to update the code path for filter block reader.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251

Test Plan: Current unit tests should pass.

Reviewed By: pdillinger

Differential Revision: D37427667

Pulled By: gitbw95

fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433
2022-06-24 16:13:44 -07:00
zczhu 410ca2efd2 Fix the flaky cursor persist test (#10250)
Summary:
The 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test` may occasionally fail due to the inconsistent LSM state. The issue is fixed by adding `Flush()` and `WaitForFlushMemTable()` to produce a more predictable and stable LSM state.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10250

Test Plan: 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test`

Reviewed By: jay-zhuang, riversand963

Differential Revision: D37426091

Pulled By: littlepig2013

fbshipit-source-id: 56fbaab0384c380c1f279a16dc8732b139c9f611
2022-06-24 14:02:33 -07:00
sdong 246d469750 Reduce overhead of SortFileByOverlappingRatio() (#10161)
Summary:
Currently SortFileByOverlappingRatio() is O(nlogn). It is usually OK but When there are a lot of files in an LSM-tree, SortFileByOverlappingRatio() can take non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where compaction is only trivial move and this operation becomes the bottleneck and limit the total throughput. This commit makes SortFileByOverlappingRatio() only find the top 50 files based on score. 50 files are usually enough for the parallel compactions needed for the level, and in case it is not enough, we would fall back to random, which should be acceptable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161

Test Plan:
Run a fillseq that generates a lot of files, and observe throughput improved (although stall is not yet eliminated). The command ran:

TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000

The throughput improved by 11%.

Reviewed By: jay-zhuang

Differential Revision: D37129469

fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d
2022-06-24 14:01:11 -07:00
Gang Liao 052666aed5 BlobDB in crash test hitting assertion (#10249)
Summary:
This task is to fix assertion failures during the crash test runs. The cache entry size might not match value size because value size can include the on-disk (possibly compressed) size. Therefore, we removed the assertions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10249

Reviewed By: ltamasi

Differential Revision: D37407576

Pulled By: gangliao

fbshipit-source-id: 577559f267c5b2437bcd0631cd0efabb6dde3b69
2022-06-23 22:02:16 -07:00
Yanqin Jin 725df120e9 Fix race condition between file purge and backup/checkpoint (#10187)
Summary:
Resolves https://github.com/facebook/rocksdb/issues/10129

I extracted this fix from https://github.com/facebook/rocksdb/issues/7516 since it's also already a bug in main branch, and we want to
separate it from the main part of the PR.

There can be a race condition between two threads. Thread 1 executes
`DBImpl::FindObsoleteFiles()` while thread 2 executes `GetSortedWals()`.
```
Time   thread 1                                thread 2
  |  mutex_.lock
  |  read disable_delete_obsolete_files_
  |  ...
  |  wait on log_sync_cv and release mutex_
  |                                          mutex_.lock
  |                                          ++disable_delete_obsolete_files_
  |                                          mutex_.unlock
  |                                          mutex_.lock
  |                                          while (pending_purge_obsolete_files > 0) { bg_cv.wait;}
  |                                          wake up with mutex_ locked
  |                                          compute WALs tracked by MANIFEST
  |                                          mutex_.unlock
  |  wake up with mutex_ locked
  |  ++pending_purge_obsolete_files_
  |  mutex_.unlock
  |
  |  delete obsolete WAL
  |                                          WAL missing but tracked in MANIFEST.
  V
```

The fix proposed eliminates the possibility of the above by increasing
`pending_purge_obsolete_files_` before `FindObsoleteFiles()` can possibly release the mutex.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10187

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D37214235

Pulled By: riversand963

fbshipit-source-id: 556ab1b58ae6d19150169dfac4db08195c797184
2022-06-23 18:32:25 -07:00
Mark Callaghan 6061905790 Wrapper for benchmark.sh to run a sequence of db_bench tests (#10215)
Summary:
This provides two things:
1) Runs a sequence of db_bench tests. This sequence was chosen to provide
good coverage with less variance.
2) Makes it easier to do A/B testing for multiple binaries. This combines
the report.tsv files into summary.tsv to make it easier to compare results
across multiple binaries.

Example output for 2) is:

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
1115171 446.7   9GB             8.9     1.0     454.7   26      26      0       0       0.9     0.5     2       7       51      5547    20      0.0     0       0.1     0.1     0.2     fillseq.wal_disabled.v400       2022-04-12T08:53:51     6.0
1045726 418.9   8GB     0.0GB   8.4     1.0     432.4   27      26      0       0       1.0     0.5     2       6       102     5618    20      0.0     0       0.1     0.0     0.1     fillseq.wal_disabled.v400       2022-04-12T12:25:36     6.28

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
2969192 1189.3  16GB            0.0             0.0     0       0       0       0       10.8    9.3     25      33      49      13551   1781    0.0     0       48.2    6.8     16.8    readrandom.t32  2022-04-12T08:54:28     6.0
2692922 1078.6  16GB    0.0GB   0.0             0.0     0       0       0       0       11.9    10.2    30      38      56      49735   1781    0.0     0       47.8    6.7     16.8    readrandom.t32  2022-04-12T12:26:15     6.28

...

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
180227  72.2    38GB            1126.4  8.7     643.2   3286    3218    0       0       177.6   50.2    2687    4083    6148    854083  1793    68.4    7804    17.0    5.9     0.5     overwrite.t32.s0        2022-04-12T11:55:21     6.0
236512  94.7    31GB    0.0GB   1502.9  8.9     862.2   5242    5125    0       0       135.3   59.9    2537    3268    5404    18545   1785    49.7    5112    25.5    8.0     9.4     overwrite.t32.s0        2022-04-12T15:27:25     6.28

Example output with formatting preserved is here:
https://gist.github.com/mdcallag/4432e5bbaf91915c916d46bd6ce3c313

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10215

Test Plan: run it

Reviewed By: jay-zhuang

Differential Revision: D37299892

Pulled By: mdcallag

fbshipit-source-id: e6e0ed638fd7e8deeb869d700593fdc3eba899c8
2022-06-23 18:07:14 -07:00
Yueh-Hsuan Chiang 2a3792edfc Add suggest_compact_range() and suggest_compact_range_cf() to C API. (#10175)
Summary:
Add suggest_compact_range() and suggest_compact_range_cf() to C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175

Test Plan:
As verifying the result requires SyncPoint, which is not available in the c_test.c,
the test is currently done by invoking the functions and making sure it does not crash.

Reviewed By: jay-zhuang

Differential Revision: D37305191

Pulled By: ajkr

fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
2022-06-23 16:25:25 -07:00
zczhu 17a1d65e3a Cut output files at compaction cursors (#10227)
Summary:
The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level’s cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two – one entirely before the cursor and one entirely after the cursor. Note that, in rare cases, we **DO NOT** need to cut the file if it is a trivial move since the file will not be contaminated by older files. In such case, the compact cursor is not guaranteed to be the boundary of the file, but it does not hurt the round-robin selection process.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227

Test Plan:
Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test`

Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)

Reviewed By: jay-zhuang

Differential Revision: D37388088

Pulled By: littlepig2013

fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
2022-06-23 14:25:42 -07:00
Gang Liao ba1f62ddfb Read from blob cache first when MultiGetBlob() (#10225)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10225

Test Plan:
Add test cases for MultiGetBlob
In this task, we added the new API MultiGetBlob() for BlobSource.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Reviewed By: ltamasi

Differential Revision: D37358364

Pulled By: gangliao

fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6
2022-06-23 13:52:00 -07:00
Guido Tagliavini Ponce b52620ab0e Fix key size in cache_bench (#10234)
Summary:
cache_bench wasn't generating 16B keys, which are necessary for FastLRUCache. Also:
- Added asserts in cache_bench, which is assuming that inserts never fail. When they fail (for example, if we used keys of the wrong size), memory allocated to the values will becomes leaked, and eventually the program crashes.
- Move kCacheKeySize to the right spot.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10234

Test Plan:
``make -j24 check``. Also, run cache_bench with FastLRUCache and check that memory usage doesn't blow up:
``./cache_bench -cache_type=fast_lru_cache -num_shard_bits=6 -skewed=true \
                        -lookup_insert_percent=100 -lookup_percent=0 -insert_percent=0 -erase_percent=0 \
                        -populate_cache=true -cache_size=1073741824 -ops_per_thread=10000000 \
                        -value_bytes=8192 -resident_ratio=1 -threads=16``

Reviewed By: pdillinger

Differential Revision: D37382949

Pulled By: guidotag

fbshipit-source-id: b697a942ebb215de5d341f98dc8566763436ba9b
2022-06-23 11:26:50 -07:00
Peter Dillinger f81ea75df7 Don't count no prefix as Bloom hit (#10244)
Summary:
When a key is "out of domain" for the prefix_extractor (no
prefix assigned) then the Bloom filter is not queried. PerfContext
was counting this as a Bloom "hit" while Statistics doesn't count this
as a prefix Bloom checked. I think it's more accurate to call it neither
hit nor miss, so changing the counting to make it PerfContext coounting
more like Statistics.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10244

Test Plan:
tests updates and expanded (Get and MultiGet). Iterator test
coverage of the change will come in next PR

Reviewed By: bjlemaire

Differential Revision: D37371297

Pulled By: pdillinger

fbshipit-source-id: fed132fba6a92b2314ab898d449fce2d1586c157
2022-06-23 11:00:27 -07:00
Baptiste Lemaire 5879053fd0 Dynamically changeable `MemPurge` option (#10011)
Summary:
**Summary**
Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.

**Motivation**
RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible.
Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.

**Content of this PR**
This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.

**Benchmarking**
I will add numbers to prove that there is no performance impact within the next 12 hours.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011

Reviewed By: pdillinger

Differential Revision: D36462357

Pulled By: bjlemaire

fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
2022-06-23 09:42:18 -07:00
Gang Liao 2352e2dfda Add the blob cache to the stress tests and the benchmarking tool (#10202)
Summary:
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202

Reviewed By: ltamasi

Differential Revision: D37325739

Pulled By: gangliao

fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
2022-06-22 16:04:03 -07:00
Bo Wang c073ed7601 Fix typo in comments and code (#10233)
Summary:
Fix typo in comments and code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233

Test Plan: Existing unit tests should pass.

Reviewed By: jay-zhuang, anand1976

Differential Revision: D37356702

Pulled By: gitbw95

fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51
2022-06-22 15:45:21 -07:00
Yueh-Hsuan Chiang e103b87296 Add get_column_family_metadata() and related functions to C API (#10207)
Summary:
* Add metadata related structs and functions in C API, including
  - `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()`
     that returns `rocksdb_column_family_metadata_t`.
  - `rocksdb_column_family_metadata_t` and its get functions & destroy function.
  - `rocksdb_level_metadata_t` and its and its get functions & destroy function.
  - `rocksdb_file_metadata_t` and its and get functions & destroy functions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207

Test Plan:
Extend the existing c_test.c to include additional checks for column_family_metadata
inside CheckCompaction.

Reviewed By: riversand963

Differential Revision: D37305209

Pulled By: ajkr

fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e
2022-06-22 15:00:28 -07:00
Alan Paxton a16e2ff82a Adapt benchmark result script to new fields. (#10120)
Summary:
Recently merged CI benchmark scripts were failing.

There has clearly been a major revision of the fields of benchmark output. The upload script expects and sanity-checks the existence of some fields (changes date to conform to OpenSearch format)..., so the script needs to change.

Also add a bit more exception checking to make it more obvious when this happens again.

We have deleted the existing report.tsv from the benchmark machine. An existing report.tsv is appended to by default, so that if the fields change, later rows no longer match the header. This makes for an upload that dies half way through the report file, when the format no longer matches the header.

Re-instate the config.yml for running the benchmarks, so we can once again test it in situ.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10120

Reviewed By: pdillinger

Differential Revision: D37314908

Pulled By: jay-zhuang

fbshipit-source-id: 34f5243fee694b75c6838eb55d3398e4273254b2
2022-06-22 09:26:13 -07:00
Yanqin Jin 36fefd7e22 Continue to deflake BackupEngineTest.Concurrency (#10228)
Summary:
Even after https://github.com/facebook/rocksdb/issues/10069, `BackupEngineTest.Concurrency` is still flaky with decreased probability of failure.

Repro steps as follows
```bash
make backup_engine_test
gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
```

The first two commits of this PR demonstrate how the test is flaky. https://github.com/facebook/rocksdb/issues/10069 handles the case in which
`Rename()` file returns `IOError` with subcode `PathNotFound`, and `CreateLoggerFromOptions()`
allows the operation to succeed, as expected by the test. However, `BackupEngineTest` uses
`RemapFileSystem` on top of `ChrootFileSystem` which can return `NotFound` instead of `IOError`.

This behavior is different from `Env::Default()` which returns PathNotFound if the src of `rename()`
does not exist. We should make the behaviors of the test Env/FS match a real Env/FS.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10228

Test Plan:
```bash
make check
gtest-parallel -r 1000 -w 64 ./backup_engine_test --gtest_filter=BackupEngineTest.Concurrency
```

Reviewed By: pdillinger

Differential Revision: D37337241

Pulled By: riversand963

fbshipit-source-id: 07a53115e424467b55a731866e571f0ad4c6635d
2022-06-22 08:50:05 -07:00
Yanqin Jin 9586dcf1ce Expose the initial logger creation error (#10223)
Summary:
https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation failed during `SanitizeOptions()`,
`DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot
directly expose the error to caller without some additional work.
This is a first version proposal which:
- Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation
- Checks the error during `DB::Open()` and return it to caller if non-ok

This is not very ideal. We can alternatively move the logger creation logic out of the `SanitizeOptions()`.
Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything
in case other callers of `SanitizeOptions()` assumes that a logger should be created.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D37321717

Pulled By: riversand963

fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b
2022-06-22 08:26:38 -07:00
Yanqin Jin 42c631b339 Update API comment about Options::best_efforts_recovery (#10180)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10180

Reviewed By: pdillinger

Differential Revision: D37182037

Pulled By: riversand963

fbshipit-source-id: a8dc865b86e2249beb7a543c317e94a14781e910
2022-06-21 23:34:39 -07:00
Peter Dillinger 84210c9489 Add data block hash index to crash test, fix MultiGet issue (#10220)
Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues.

Fixes https://github.com/facebook/rocksdb/issues/10186

I intend to pick this into the 7.4.0 release candidate

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220

Test Plan:
Failure quickly reproduces in crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test I believe would be too tricky and fragile
to be worthwhile.

Reviewed By: anand1976

Differential Revision: D37315647

Pulled By: pdillinger

fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
2022-06-21 16:23:58 -07:00
Yanqin Jin d654888b8f Refactor wal filter processing during recovery (#10214)
Summary:
So that DBImpl::RecoverLogFiles do not have to deal with implementation
details of WalFilter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10214

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D37299122

Pulled By: riversand963

fbshipit-source-id: acf1a80f1ef75da393d375f55968b2f3ac189816
2022-06-21 14:51:56 -07:00
Bo Wang f7605ec655 Update LZ4 library for platform009 (#10224)
Summary:
Update LZ4 library for platform009.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10224

Test Plan: Current unit tests should pass.

Reviewed By: anand1976

Differential Revision: D37321801

Pulled By: gitbw95

fbshipit-source-id: 8a3d3019d9f7478ac737176f2d2f443c0159829e
2022-06-21 13:22:58 -07:00
zczhu 30141461f9 Add basic kRoundRobin compaction policy (#10107)
Summary:
Add `kRoundRobin` as a compaction priority. The implementation is as follows.

- Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`).
- After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function.
- Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107

Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test`

Reviewed By: ajkr

Differential Revision: D37316037

Pulled By: littlepig2013

fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2022-06-21 11:56:53 -07:00
Yanqin Jin b012d23557 Destroy iniital db dir for a test in DBWALTest (#10221)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10221

Reviewed By: hx235

Differential Revision: D37316280

Pulled By: riversand963

fbshipit-source-id: 062781acec2f36beebc62003bcc8ec280488d572
2022-06-21 11:27:10 -07:00