Commit graph

5287 commits

Author SHA1 Message Date
Jay Huh 87bc929db3 Flush option in WaitForCompact() (#11483)
Summary:
Context:

As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `flush` option in `WaitForCompactOptions` to flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
1. `bool flush = false` added to `WaitForCompactOptions`
2. `DBImpl::FlushAllColumnFamilies()` is introduced and `DBImpl::FlushForGetLiveFiles()` is refactored to call it.
3. `DBImpl::FlushAllColumnFamilies()` gets called before waiting in `WaitForCompact()` if `flush` option is `true`
4. Some previous WaitForCompact tests were parameterized to include both cases for `abort_on_pause_` being true/false as well as `flush_` being true/false

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11483

Test Plan:
- `DBCompactionTest::WaitForCompactWithOptionToFlush` added
- Changed existing DBCompactionTest::WaitForCompact tests to `DBCompactionWaitForCompactTest` to include params

Reviewed By: pdillinger

Differential Revision: D46289770

Pulled By: jaykorean

fbshipit-source-id: 70d3f461d96a6e06390be60170dd7c4d0d38f8b0
2023-05-31 12:53:51 -07:00
Yu Zhang 56ca9e3106 Logging timestamp size record in WAL and use it during recovery (#11471)
Summary:
Start logging the timestamp size record in WAL and use the record during recovery.  Currently, user comparator cannot be different from what was used to create a column family, so the timestamp size record is just used to confirm it's consistent with the timestamp size the running user comparator indicates.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11471

Test Plan:
```
make all check
./db_secondary_test
./db_wal_test --gtest_filter="*WithTimestamp*"
./repair_test --gtest_filter="*WithTimestamp*"
```

Reviewed By: ltamasi

Differential Revision: D46236769

Pulled By: jowlyzhang

fbshipit-source-id: f6c60b5c8defdb05021c63df302ccc0be1275ad0
2023-05-30 19:32:00 -07:00
Changyu Bi e1c7209beb Fix flaky test: DBCompactionTest.WaitForCompactShutdownWhileWaiting (#11488)
Summary:
tsan complains with the following error message. This is likely due to DB object destroyed while WaitForCompact() is still running.
```
[ RUN      ] DBCompactionTest.WaitForCompactShutdownWhileWaiting
==================
WARNING: ThreadSanitizer: data race (pid=1128703)
  Atomic read of size 1 at 0x7b8c00000740 by thread T4:
    #0 pthread_cond_wait <null> (db_compaction_test+0x46970a)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::Wait() /root/project/port/port_posix.cc:119:23 (librocksdb.so.8.4+0x7c4c60)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::InstrumentedCondVar::WaitInternal() /root/project/monitoring/instrumented_mutex.cc:69:9 (librocksdb.so.8.4+0x75f697)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::InstrumentedCondVar::Wait() /root/project/monitoring/instrumented_mutex.cc:62:3 (librocksdb.so.8.4+0x75f697)
    https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::WaitForCompact(rocksdb::WaitForCompactOptions const&) /root/project/db/db_impl/db_impl_compaction_flush.cc:3978:14 (librocksdb.so.8.4+0x494174)
    https://github.com/facebook/rocksdb/issues/5 rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30::operator()() const /root/project/db/db_compaction_test.cc:3479:26 (db_compaction_test+0x5cdc90)
    https://github.com/facebook/rocksdb/issues/6 void std::__invoke_impl<void, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>(std::__invoke_other, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (db_compaction_test+0x5cdc90)
    https://github.com/facebook/rocksdb/issues/7 std::__invoke_result<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>::type std::__invoke<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30>(rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (db_compaction_test+0x5cdc90)
    https://github.com/facebook/rocksdb/issues/8 void std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (db_compaction_test+0x5cdc90)
    https://github.com/facebook/rocksdb/issues/9 std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (db_compaction_test+0x5cdc90)
    https://github.com/facebook/rocksdb/issues/10 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_30> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (db_compaction_test+0x5cdc90)
    https://github.com/facebook/rocksdb/issues/11 <null> <null> (libstdc++.so.6+0xda6b3)

  Previous write of size 1 at 0x7b8c00000740 by thread T5:
    #0 pthread_mutex_destroy <null> (db_compaction_test+0x46a4f8)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::port::Mutex::~Mutex() /root/project/port/port_posix.cc:77:48 (librocksdb.so.8.4+0x7c480e)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::InstrumentedMutex::~InstrumentedMutex() /root/project/./monitoring/instrumented_mutex.h:20:7 (librocksdb.so.8.4+0x41fda6)
    https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::~DBImpl() /root/project/db/db_impl/db_impl.cc:755:1 (librocksdb.so.8.4+0x41fda6)
    https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::~DBImpl() /root/project/db/db_impl/db_impl.cc:737:19 (librocksdb.so.8.4+0x4203d9)
    https://github.com/facebook/rocksdb/issues/5 rocksdb::DBTestBase::Close() /root/project/db/db_test_util.cc:670:3 (librocksdb_test_debug.so+0x57413)
    https://github.com/facebook/rocksdb/issues/6 rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31::operator()() const /root/project/db/db_compaction_test.cc:3485:49 (db_compaction_test+0x5cdf03)
    https://github.com/facebook/rocksdb/issues/7 void std::__invoke_impl<void, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>(std::__invoke_other, rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14 (db_compaction_test+0x5cdf03)
    https://github.com/facebook/rocksdb/issues/8 std::__invoke_result<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>::type std::__invoke<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31>(rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14 (db_compaction_test+0x5cdf03)
    https://github.com/facebook/rocksdb/issues/9 void std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253:13 (db_compaction_test+0x5cdf03)
    https://github.com/facebook/rocksdb/issues/10 std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> >::operator()() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260:11 (db_compaction_test+0x5cdf03)
    https://github.com/facebook/rocksdb/issues/11 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<rocksdb::DBCompactionTest_WaitForCompactShutdownWhileWaiting_Test::TestBody()::$_31> > >::_M_run() /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211:13 (db_compaction_test+0x5cdf03)
    https://github.com/facebook/rocksdb/issues/12 <null> <null> (libstdc++.so.6+0xda6b3)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11488

Test Plan:
```
COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 db_compaction_test

gtest-parallel --repeat=10000 ./db_compaction_test --gtest_filter="*WaitForCompactShutdownWhileWaiting*" -w200
```

Reviewed By: jaykorean

Differential Revision: D46293891

Pulled By: cbi42

fbshipit-source-id: 8ca259cb1e09a9e4f4095b2d084f2ba92b710b97
2023-05-30 15:00:05 -07:00
Andrew Kryczka 3e7fc88167 add WriteBatch::Release() (#11482)
Summary:
Together with the existing constructor,
`explicit WriteBatch(std::string&& rep)`, this enables transferring `WriteBatch` via its `std::string` representation. Associated info like KV checksums are dropped but the caller can use `WriteBatch::VerifyChecksum()` before taking ownership if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11482

Reviewed By: cbi42

Differential Revision: D46233884

Pulled By: ajkr

fbshipit-source-id: 6bc64a6e75fb7bbf61d08c09520fc3705a7b44d8
2023-05-26 18:15:14 -07:00
Soli de1dd4ca19 Tweak on IsTrivialMove() (#11467)
Summary:
`output_level_` and `number_levels_` are not changing in iteration of `inputs_` files.

Moving the check out of `for` loop could slightly improve performance.

It is easier to review when ignore whitespace changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11467

Reviewed By: cbi42

Differential Revision: D46155962

Pulled By: ajkr

fbshipit-source-id: 45ec80b13152b3bed7305e6f707cb9b187d5f315
2023-05-26 16:40:50 -07:00
Jay Huh 81aeb15988 Add WaitForCompact with WaitForCompactOptions to public API (#11436)
Summary:
Context:

This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions`

1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing.

1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct.
2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h`
3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436

Test Plan:
Following tests added
- `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish`
- `DBCompactionTest::WaitForCompactAbortOnPauseAborted`
- `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted`
- `DBCompactionTest::WaitForCompactShutdownWhileWaiting`
- `TransactionTest::WaitForCompactAbortOnPause`

NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place.

Reviewed By: pdillinger

Differential Revision: D45799659

Pulled By: jaykorean

fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b
2023-05-25 17:25:51 -07:00
Yu Zhang d1ae7f6c41 Add support to strip / pad timestamp when writing / reading a block (#11472)
Summary:
This patch adds support in `BlockBuilder` to strip user-defined timestamp from the `key` added via `Add(key, value)` and its equivalent APIs. The stripping logic is different when the key is either a user key or an internal key, so the `BlockBuilder` is created with a flag to indicate that. This patch also add support on the read path to APIs `NewIndexIterator`, `NewDataIterator` to support pad a min timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11472

Test Plan:
Three test modes are added to parameterize existing tests:
UserDefinedTimestampTestMode::kNone -> UDT feature is not enabled
UserDefinedTimestampTestMode::kNormal -> UDT feature enabled, write / read with min timestamp
UserDefinedTimestampTestMode::kStripUserDefinedTimestamps -> UDT feature enabled, write / read with min timestamp, set `persist_user_defined_timestamps` where it applies to false.
The tests read/write with min timestamp so that point read and range scan can correctly read values in all three test modes.

`block_test` are parameterized to run with above three test modes and some additional parameteriazation

```
make all check
./block_test --gtest_filter="P/BlockTest*"
./block_test --gtest_filter="P/IndexBlockTest*"
```

Reviewed By: ltamasi

Differential Revision: D46200539

Pulled By: jowlyzhang

fbshipit-source-id: 59f5d6b584639976b69c2943eba723bd47d9b3c0
2023-05-25 15:41:32 -07:00
Hui Xiao dcc6fc99f9 Fix StopWatch bug; Remove setting record_read_stats (#11474)
Summary:
**Context/Summary:**
- StopWatch enable stats even when `StatsLevel::kExceptTimers` is set. It's a harmless bug though since `reportTimeToHistogram()` will not report it anyway according to https://github.com/facebook/rocksdb/blob/main/include/rocksdb/statistics.h#L705
-  https://github.com/facebook/rocksdb/pull/11288 should have removed logics of setting `record_read_stats = !for_compaction` as we don't differentiate `RandomAccessFileReader`'s stats behavior based on compaction or not (instead we now report stats of different IO activities including compaction to different stats). Fixing this should report more compaction related file read micros that aren't reported previously due to `for_compaction==true`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11474

Test Plan:
- DB bench pre vs post fix with small max_open_files

Setup command
`./db_ bench  -db=/dev/shm/testdb/ -statistics=true -benchmarks=fillseq -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3`

Run command
`./db_bench --open_files=1 -use_existing_db=true -db=/dev/shm/testdb2/ -statistics=true -benchmarks=compactall -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=true -compression_type=none -bloom_bits=3`

Pre-fix
```
rocksdb.sst.read.micros P50 : 2.056175 P95 : 4.647739 P99 : 8.948475 P100 : 25.000000 COUNT : 4451 SUM : 12827
rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.read.compaction.micros P50 : 2.057397 P95 : 4.625253 P99 : 8.749474 P100 : 25.000000 COUNT : 4382 SUM : 12608
rocksdb.file.read.db.open.micros P50 : 1.985294 P95 : 9.100000 P99 : 13.000000 P100 : 13.000000 COUNT : 69 SUM : 219
```

Post-fix (with a higher `rocksdb.file.read.compaction.micros` count)
```
rocksdb.sst.read.micros P50 : 1.858968 P95 : 3.653086 P99 : 5.968000 P100 : 21.000000 COUNT : 3548 SUM : 9119
rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.read.compaction.micros P50 : 1.857027 P95 : 3.627614 P99 : 5.738621 P100 : 21.000000 COUNT : 3479 SUM : 8904
rocksdb.file.read.db.open.micros P50 : 2.000000 P95 : 6.733333 P99 : 11.000000 P100 : 11.000000 COUNT : 69 SUM : 215
```
- CI

Reviewed By: ajkr

Differential Revision: D46137221

Pulled By: hx235

fbshipit-source-id: e5b4ee7001af26f2ee0377bc6334f43b2a527388
2023-05-25 10:16:58 -07:00
Peter Dillinger 17bc27741f Improve memory efficiency of many OptimisticTransactionDBs (#11439)
Summary:
Currently it's easy to use a ton of memory with many small OptimisticTransactionDB instances, because each one by default allocates a million mutexes (40 bytes each on my compiler) for validating transactions. It even puts a lot of pressure on the allocator by allocating each one individually!

In this change:
* Create a new object and option that enables sharing these buckets of mutexes between instances. This is generally good for load balancing potential contention as various DBs become hotter or colder with txn writes. About the only cases where this sharing wouldn't make sense (e.g. each DB usually written by one thread) are cases that would be better off with OccValidationPolicy::kValidateSerial which doesn't use the buckets anyway.
* Allocate the mutexes in a contiguous array, for efficiency
* Add an option to ensure the mutexes are cache-aligned. In several other places we use cache-aligned mutexes but OptimisticTransactionDB historically does not. It should be a space-time trade-off the user can choose.
* Provide some visibility into the memory used by the mutex buckets with an ApproximateMemoryUsage() function (also used in unit testing)
* Share code with other users of "striped" mutexes, appropriate refactoring for customization & efficiency (e.g. using FastRange instead of modulus)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11439

Test Plan: unit tests added. Ran sized-up versions of stress test in unit test, including a before-and-after performance test showing no consistent difference. (NOTE: OptimisticTransactionDB not currently covered by db_stress!)

Reviewed By: ltamasi

Differential Revision: D45796393

Pulled By: pdillinger

fbshipit-source-id: ae2b3a26ad91ceeec15debcdc63ff48df6736a54
2023-05-24 11:57:15 -07:00
rogertyang 28bf7ba77d remove unnecessary code in super version getter (#11452)
Summary:
Do not bother comparing the version of the local super version handle with the global one.

An inequality comparison result indicates nothing but a spurious obsoleteness. It only happens when the writer has increased the `ColumnFamilyData::super_version_number_`(5fc57eec2b/db/column_family.cc (L1309)) but has not yet called `ResetThreadLocalSuperVersions()`(5fc57eec2b/db/column_family.cc (L1328)) at the time when a reader reads the local handle(`void* ptr = local_sv_->Swap(SuperVersion::kSVInUse);`). In other words, the existence of a local handle is a sufficent evidence of its fressness.

With this PR, we save one or even two atomic instructions when getting a handle of super version.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11452

Reviewed By: ajkr

Differential Revision: D46059317

Pulled By: cbi42

fbshipit-source-id: 68b4b1ca8a9929a4aa470105c37a09e0625b014d
2023-05-23 12:16:24 -07:00
Yu Zhang 11ebddb1d4 Add utils to use for handling user defined timestamp size record in WAL (#11451)
Summary:
Add a util method `HandleWriteBatchTimestampSizeDifference` to handle a `WriteBatch` read from WAL log when user-defined timestamp size record is written and read. Two check modes are added: `kVerifyConsistency` that just verifies the recorded timestamp size are consistent with the running ones. This mode is to be used by `db_impl_secondary` for opening a DB as secondary instance. It will also be used by `db_impl_open` before the user comparator switch support is added to make a column switch between enabling/disable UDT feature. The other mode `kReconcileInconsistency` will be used by `db_impl_open` later when user comparator can be changed.

Another change is to extract a method `CollectColumnFamilyIdsFromWriteBatch` in db_secondary_impl.h into its standalone util file so it can be shared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11451

Test Plan:
```
make check
./udt_util_test
```

Reviewed By: ltamasi

Differential Revision: D45894386

Pulled By: jowlyzhang

fbshipit-source-id: b96790777f154cddab6d45d9ba2e5d20ebc6fe9d
2023-05-22 14:28:58 -07:00
Peter Dillinger 39f5846ec7 Much better stats for seeks and prefix filtering (#11460)
Summary:
We want to know more about opportunities for better range filters, and the effectiveness of our own range filters. Currently the stats are very limited, essentially logging just hits and misses against prefix filters for range scans in BLOOM_FILTER_PREFIX_* without tracking the false positive rate. Perhaps confusingly, when prefix filters are used for point queries, the stats are currently going into the non-PREFIX tickers.

This change does several things:
* Introduce new stat tickers for seeks and related filtering, \*LEVEL_SEEK\*
  * Most importantly, allows us to see opportunities for range filtering. Specifically, we can count how many times a seek in an SST file accesses at least one data block, and how many times at least one value() is then accessed. If a data block was accessed but no value(), we can generally assume that the key(s) seen was(were) not of interest so could have been filtered with the right kind of filter, avoiding the data block access.
  * We can get the same level of detail when a filter (for now, prefix Bloom/ribbon) is used, or not. Specifically, we can infer a false positive rate for prefix filters (not available before) from the seek "false positive" rate: when a data block is accessed but no value() is called. (There can be other explanations for a seek false positive, but in typical iterator usage it would indicate a filter false positive.)
  * For efficiency, I wanted to avoid making additional calls to the prefix extractor (or key comparisons, etc.), which would be required if we wanted to more precisely detect filter false positives. I believe that instrumenting value() is the best balance of efficiency vs. accurately measuring what we are often interested in.
  * The stats are divided between last level and non-last levels, to help understand potential tiered storage use cases.
* The old BLOOM_FILTER_PREFIX_* stats have a different meaning: no longer referring to iterators but to point queries using prefix filters. BLOOM_FILTER_PREFIX_TRUE_POSITIVE is added for computing the prefix false positive rate on point queries, which can be due to filter false positives as well as different keys with the same prefix.
* Similarly, the non-PREFIX BLOOM_FILTER stats are now for whole key filtering only.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11460

Test Plan:
unit tests updated, including updating many to pop the stat value since last read to improve test
readability and maintainability.

Performance test shows a consistent small improvement with these changes, both with clang and with gcc. CPU profile indicates that RecordTick is using less CPU, and this makes sense at least for a high filter miss rate. Before, we were recording two ticks per filter miss in iterators (CHECKED & USEFUL) and now recording just one (FILTERED).

Create DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
```
And run simultaneous before&after with
```
TEST_TMPDIR=/dev/shm ./db_bench -readonly -benchmarks=seekrandom[-X1000] -num=10000000 -bloom_bits=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -seek_nexts=1 -duration=20 -seed=43 -threads=8 -cache_size=1000000000 -statistics
```
Before: seekrandom [AVG 275 runs] : 189680 (± 222) ops/sec;   18.4 (± 0.0) MB/sec
After: seekrandom [AVG 275 runs] : 197110 (± 208) ops/sec;   19.1 (± 0.0) MB/sec

Reviewed By: ajkr

Differential Revision: D46029177

Pulled By: pdillinger

fbshipit-source-id: cdace79a2ea548d46c5900b068c5b7c3a02e5822
2023-05-19 15:25:49 -07:00
Peter Dillinger 4067acabca Compatibility step for separating BlockCache and GeneralCache APIs (#11450)
Summary:
Add two type aliases for Cache: BlockCache and GeneralCache, and add LRUCacheOptions::MakeSharedGeneralCache(). This will ease upgrade to an intended future change to separate the cache API between block cache and other (general) uses, including row cache. Separating the APIs will make it easier to expose more details of block caching for customization. For example, it would be nice to pass the file unique ID and offset as the logical cache key instead of using a Slice, which could facilitate some file-specific customizations in block cache. This would also make it clear that HyperClockCache is not usable as a general cache, because it can only deal with fixed-size block cache keys.

block_cache, row_cache, and blob_cache are the uses of Cache in the public API. blob_cache should be able to use BlockCache while row_cache is a GeneralCache user, as its keys are of arbitrary size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11450

Test Plan: see updated unit test.

Reviewed By: anand1976

Differential Revision: D45882067

Pulled By: pdillinger

fbshipit-source-id: ff5d9f0b644f87ae337a29a7728ce3ed07b2a4b2
2023-05-18 20:40:19 -07:00
mayue.fight 8d8eb0e77e Support Clip DB to KeyRange (#11379)
Summary:
This PR is part of the request https://github.com/facebook/rocksdb/issues/11317.
(Another part is https://github.com/facebook/rocksdb/pull/11378)

ClipDB() will clip the entries in the CF according to the range [begin_key, end_key). All the entries outside this range will be completely deleted (including tombstones).
 This feature is mainly used to ensure that there is no overlapping Key when calling CreateColumnFamilyWithImports() to import multiple CFs.

When Calling ClipDB [begin, end), there are the following steps

1.  Quickly and directly delete files without overlap
 DeleteFilesInRanges(nullptr, begin) + DeleteFilesInRanges(end, nullptr)
2. Delete the Key outside the range
Delete[smallest_key, begin) + Delete[end, largest_key]
3. Delete the tombstone through Manul Compact
CompactRange(option, nullptr, nullptr)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11379

Reviewed By: ajkr

Differential Revision: D45840358

Pulled By: cbi42

fbshipit-source-id: 54152e8a45fd8ede137f99787eb252f0b51440a4
2023-05-18 13:25:01 -07:00
Jay Huh 586d78b31e Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:

In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.

In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.

This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.

Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.

Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443

Test Plan:
Existing tests that involve a shutdown in progress:

- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle

Reviewed By: pdillinger

Differential Revision: D45923426

Pulled By: jaykorean

fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
2023-05-17 18:13:50 -07:00
Peter Dillinger 206fdea3d9 Change internal headers with duplicate names (#11408)
Summary:
In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.

No public API changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D45456696

Pulled By: pdillinger

fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
2023-05-17 11:27:09 -07:00
Andrew Kryczka fb636f2498 Fix write stall stats dump format (#11445)
Summary:
I noticed in https://github.com/facebook/rocksdb/issues/11426 there is a missing line break. Before this PR the output looked like

```
$ ./db_bench -stats_per_interval=1 -stats_interval=100000
...
Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0, Block cache LRUCache@0x7f8695831b50#2766536 capacity: 32.00 MB seed: 1155354975 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 0 last_secs: 9.3e-05 secs_since: 2
...
Write Stall (count): write-buffer-manager-limit-stops: 0, num-running-compactions: 0
...
```

After this PR it looks like

```
$ ./db_bench -stats_per_interval=1 -stats_interval=100000
...
Write Stall (count): cf-l0-file-count-limit-delays-with-ongoing-compaction: 0, cf-l0-file-count-limit-stops-with-ongoing-compaction: 0, l0-file-count-limit-delays: 0, l0-file-count-limit-stops: 0, memtable-limit-delays: 0, memtable-limit-stops: 0, pending-compaction-bytes-delays: 0, pending-compaction-bytes-stops: 0, total-delays: 0, total-stops: 0
Block cache LRUCache@0x7f8e0d231b50#2736585 capacity: 32.00 MB seed: 920433955 usage: 0.09 KB table_size: 1024 occupancy: 1 collections: 1 last_copies: 1 last_secs: 6.5e-05 secs_since: 4
...
Write Stall (count): write-buffer-manager-limit-stops: 0
num-running-compactions: 0
...
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11445

Reviewed By: hx235

Differential Revision: D45844752

Pulled By: ajkr

fbshipit-source-id: 1c708cb05b6e270922ac2fa95f5d011f273347eb
2023-05-15 11:47:17 -07:00
anand76 2084cdf237 Delete temp OPTIONS file on failure to write it (#11423)
Summary:
When the DB is opened, RocksDB creates a temp OPTIONS file, writes the current options to it, and renames it. In case of a failure, the temp file is left behind, and is not deleted by PurgeObsoleteFiles(). Fix this by explicitly deleting the temp file if writing to it or renaming it fails.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11423

Test Plan: Add a unit test

Reviewed By: akankshamahajan15

Differential Revision: D45540454

Pulled By: anand1976

fbshipit-source-id: 47facdc30d8cc5667036312d04b21d3fc253c92e
2023-05-12 22:39:39 -07:00
Andrew Kryczka 113f3250f1 Add block checksum mismatch ticker stat (#11438)
Summary:
Added a ticker stat, `BLOCK_CHECKSUM_MISMATCH_COUNT`, to count how many block checksum verifications detected a mismatch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11438

Test Plan: new unit test

Reviewed By: pdillinger

Differential Revision: D45788179

Pulled By: ajkr

fbshipit-source-id: e2b44eba7c23b3e110ebe69eaa78a710dec2590f
2023-05-12 18:16:11 -07:00
Yu Zhang 47235dda9e Add support in log writer and reader for a user-defined timestamp size record (#11433)
Summary:
This patch adds support to write and read a user-defined timestamp size record in log writer and log reader. It will be used by WAL logs to persist the user-defined timestamp format for subsequent WriteBatch records. Reading and writing UDT sizes for WAL logs are not included in this patch. It will be in a follow up.

The syntax for the record is: at write time, one such record is added when log writer encountered any non-zero UDT size it hasn't recorded so far. At read time, all such records read up to a point are accumulated and applicable to all subsequent WriteBatch records.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11433

Test Plan:
```
make clean && make -j32 all
./log_test --gtest_filter="*WithTimestampSize*"
```

Reviewed By: ltamasi

Differential Revision: D45678708

Pulled By: jowlyzhang

fbshipit-source-id: b770c8f45bb7b9383b14aac9f22af781304fb41d
2023-05-11 17:26:19 -07:00
Changyu Bi 8827cd0618 Support compacting files to different temperatures in FIFO compaction (#11428)
Summary:
- Add a new option `CompactionOptionsFIFO::file_temperature_age_thresholds` that allows user to specify age thresholds for compacting files to different temperatures. File temperature can be used to store files in different storage media. The new options allows specifying multiple temperature-age pairs. The option uses struct for a temperature-age pair to use the existing parsing functionality to make the option dynamically settable.
- Deprecate the old option `age_for_warm` that was added for a similar purpose.
- Compaction score calculation logic is updated to check if a file needs to be compacted to change its temperature.
- Some refactoring is done in `FIFOCompactionPicker::PickTemperatureChangeCompaction`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11428

Test Plan: adapted unit tests that were for `age_for_warm` to this new option.

Reviewed By: ajkr

Differential Revision: D45611412

Pulled By: cbi42

fbshipit-source-id: 2dc384841f61cc04abb9681e31aa2de0f0b06106
2023-05-11 16:40:59 -07:00
Peter Dillinger f4a02f2c52 Add hash_seed to Caches (#11391)
Summary:
See motivation and description in new ShardedCacheOptions::hash_seed option.

Updated db_bench so that its seed param is used for the cache hash seed.
Made its code more safe to ensure seed is set before use.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11391

Test Plan:
unit tests added / updated

**Performance** - no discernible difference seen running cache_bench repeatedly before & after. With lru_cache and hyper_clock_cache.

Reviewed By: hx235

Differential Revision: D45557797

Pulled By: pdillinger

fbshipit-source-id: 40bf4da6d66f9d41a8a0eb8e5cf4246a4aa07934
2023-05-09 22:24:26 -07:00
akankshamahajan 6ba4717f35 Fix build error: variable 'base_level' may be uninitialized (#11435)
Summary:
Fix build error: variable 'base_level' may be uninitialized
```
 db_impl_compaction_flush.cc:1195:21: error: variable 'base_level' may be uninitialized when used here [-Werror,-Wconditional-uninitialized]
            level = base_level;
```
                    ^~~~~~~~~~

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11435

Test Plan: CircleCI jobs

Reviewed By: cbi42

Differential Revision: D45708176

Pulled By: akankshamahajan15

fbshipit-source-id: 851b1205b22b63d728495e5735fa91b0ad8e012b
2023-05-09 15:43:43 -07:00
Hui Xiao 8f763bdeab Record and use the tail size to prefetch table tail (#11406)
Summary:
**Context:**
We prefetch the tail part of a SST file (i.e, the blocks after data blocks till the end of the file) during each SST file open in hope to prefetch all the stuff at once ahead of time for later read e.g, footer, meta index, filter/index etc. The existing approach to estimate the tail size to prefetch is through `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky case (e.g,  small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug.  Therefore we decide to record the exact tail size and use it directly  to prefetch tail of the SST instead of relying heuristics.

**Summary:**
- Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()`
   - For backward compatibility, we fall back to TailPrefetchStats and last to simple heuristics that the tail size is a linear portion of the file size - see PR conversation for more.
- Make`tail_start_offset` part of the table properties and deduct tail size to record in manifest for external files (e.g, file ingestion, import CF) and db repair (with no access to manifest).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406

Test Plan:
1. New UT
2. db bench
Note: db bench on /tmp/ where direct read is supported is too slow to finish and the default pinning setting in db bench is not helpful to profile # sst read of Get. Therefore I hacked the following to obtain the following comparison.
```
 diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
 --- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
                            &tail_prefetch_size);

   // Try file system prefetch
-  if (!file->use_direct_io() && !force_direct_prefetch) {
+  if (false && !file->use_direct_io() && !force_direct_prefetch) {
     if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
              .IsNotSupported()) {
       prefetch_buffer->reset(new FilePrefetchBuffer(
 diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
 --- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
           std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
     } else {
       BlockBasedTableOptions block_based_options;
+      block_based_options.metadata_cache_options.partition_pinning =
+      PinningTier::kAll;
       block_based_options.checksum =
           static_cast<ChecksumType>(FLAGS_checksum_type);
       if (FLAGS_use_hash_search) {
```
Create DB
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
ReadRandom
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
(a) Existing (Use TailPrefetchStats for tail size + use seperate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```

(b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```

As we can see, we increase the prefetch tail hit count and decrease SST read count with this PR

3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.

Reviewed By: pdillinger

Differential Revision: D45413346

Pulled By: hx235

fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
2023-05-08 13:14:28 -07:00
Changyu Bi a11f1e12ca Fix flaky test DBTestUniversalManualCompactionOutputPathId.ManualCompactionOutputPathId (#11412)
Summary:
the test is flaky when compiled with `make -j56 COERCE_CONTEXT_SWITCH=1 ./db_universal_compaction_test`. The cause is that a manual compaction `CompactRange()` can finish and return before obsolete files are deleted. One reason for this is that a manual compaction waits until `manual.done` is set here 62fc15f009/db/db_impl/db_impl_compaction_flush.cc (L1978)
and the compaction thread can set `manual.done`:
62fc15f009/db/db_impl/db_impl_compaction_flush.cc (L3672)
and then temporarily release mutex_:
62fc15f009/db/db_impl/db_impl_files.cc (L317)
before purging obsolete files:
62fc15f009/db/db_impl/db_impl_compaction_flush.cc (L3144)

With `COERCE_CONTEXT_SWITCH=1`, `bg_cv_.SignalAll()` is called during `mutex_.Lock()`, so the manual compaction thread can wake up and return before obsolete files are deleted. Updated the test to only count live SST files.

Also updated `FindObsoleteFiles()` to avoid locking a locked mutex.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11412

Test Plan: `make -j56 COERCE_CONTEXT_SWITCH=1 ./db_universal_compaction_test`

Reviewed By: ajkr

Differential Revision: D45342242

Pulled By: cbi42

fbshipit-source-id: 955c9796aa3f484e3557d300f97cffacb3ed9b0c
2023-05-03 11:12:20 -07:00
clundro 50b33ebb1b remove redundant move (#11418)
Summary:
when I use g++-13 to exec the `make all` command,  the output throws the warnings.
```
db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’:
db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move]
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1
```

and I also add some `(void)unused_variable` statements because of the cmake argument `-Wunused-but-set-variable -Wunused-but-set-variable`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418

Reviewed By: akankshamahajan15

Differential Revision: D45528223

Pulled By: ajkr

fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc
2023-05-03 09:37:21 -07:00
leipeng a475e9f746 DBIter::FindValueForCurrentKey: remove unused Status s (#11394)
Summary:
This PR remove a historical useless code

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11394

Reviewed By: ajkr

Differential Revision: D45506226

Pulled By: akankshamahajan15

fbshipit-source-id: 32c98627100c9ad131bf65c4a1fe97ab61502daf
2023-05-03 08:52:03 -07:00
anand76 03a892a9fb Delete empty WAL files on reopen (#11409)
Summary:
When a DB is opened, RocksDB creates an empty WAL file. When the DB is reopened and the WAL is empty, the min log number to keep is not advanced until a memtable flush happens. If a process crashes soon after reopening the DB, its likely that no memtable flush would have happened, which means the empty WAL file is not deleted. In a crash loop scenario, this leads to empty WAL files accumulating. Fix this by ensuring the min log number is advanced if the WAL is empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11409

Test Plan: Add a unit test

Reviewed By: ajkr

Differential Revision: D45281685

Pulled By: anand1976

fbshipit-source-id: 0225877c613e65ffb30972a0051db2830105423e
2023-05-02 15:54:29 -07:00
Peter Dillinger 41a7fbf758 Avoid long parameter lists configuring Caches (#11386)
Summary:
For better clarity, encouraging more options explicitly specified using fields rather than positionally via constructor parameter lists. Simplifies code maintenance as new fields are added. Deprecate some cases of the confusing pattern of NewWhatever() functions returning shared_ptr.

Net reduction of about 70 source code lines (including comments).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11386

Test Plan: existing tests

Reviewed By: ajkr

Differential Revision: D45059075

Pulled By: pdillinger

fbshipit-source-id: d53fa09b268024f9c55254bb973b6c69feebf41a
2023-05-01 14:52:01 -07:00
Changyu Bi 62fc15f009 Block per key-value checksum (#11287)
Summary:
add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are
1. checksum construction and verification in block.cc/h
2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
3. unit tests/crash test updates

Tests:
* Added unit tests
* Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`

Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.

Performance:
Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates):

```
SETUP
make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none

BENCHMARK
./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE

The readrandom ops/sec looks like the following:
Block cache size:  2GB        1.2GB * 0.9    1.2GB * 0.8     1.2GB * 0.5   8MB
Main              240805     223604         198176           161653       139040
PR prot_bytes=0   238691     226693         200127           161082       141153
PR prot_bytes=1   214983     193199         178532           137013       108211
prot_bytes=1 vs    -10%        -15%          -10.8%          -15%        -23%
prot_bytes=0
```

The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287

Reviewed By: ajkr

Differential Revision: D43970708

Pulled By: cbi42

fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
2023-04-25 12:08:23 -07:00
leipeng 40d69b59ad DBImpl::MultiGet: delete unused var superversions_to_delete (#11395)
Summary:
In db_impl.cc DBImpl::MultiGet: delete unused var `superversions_to_delete`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11395

Reviewed By: ajkr

Differential Revision: D45240896

Pulled By: cbi42

fbshipit-source-id: 0fff99b0d794b6f6d4aadee6036bddd6cb19eb31
2023-04-25 10:46:29 -07:00
Peter Dillinger fb63d9b4ee Fix compression tests when snappy not available (#11396)
Summary:
Tweak some bounds and things, and reduce risk of surprise results by running on all supported compressions (mostly).

Also improves the precise compressibility of CompressibleString by using RandomBinaryString.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11396

Test Plan: updated tests

Reviewed By: ltamasi

Differential Revision: D45211938

Pulled By: pdillinger

fbshipit-source-id: 9dc1dd8574a60a9364efe18558be66d31a35598b
2023-04-22 12:41:36 -07:00
Peter Dillinger d79be3dca2 Changes and enhancements to compression stats, thresholds (#11388)
Summary:
## Option API updates
* Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7.
* Remove unnecessary constructor for CompressionOptions.
* Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated).

 ## Stat API updates
* Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed.
* Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing.
* New or existing tickers relevant to compression:
  * BYTES_COMPRESSED_FROM
  * BYTES_COMPRESSED_TO
  * BYTES_COMPRESSION_BYPASSED
  * BYTES_COMPRESSION_REJECTED
  * COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing)
  * NUMBER_BLOCK_COMPRESSED (existing)
  * NUMBER_BLOCK_COMPRESSION_BYPASSED
  * NUMBER_BLOCK_COMPRESSION_REJECTED
  * BYTES_DECOMPRESSED_FROM
  * BYTES_DECOMPRESSED_TO

We can compute a number of things with these stats:
* "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO
* Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED)
* Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_REJECTED + NUMBER_BLOCK_COMPRESSION_REJECTED
* Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES

Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression.

 ## BlockBasedTableBuilder
Various small refactorings, optimizations, and name clean-ups.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388

Test Plan:
unit tests added

* `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field.

Reviewed By: ajkr

Differential Revision: D45128202

Pulled By: pdillinger

fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803
2023-04-21 21:57:40 -07:00
Hui Xiao 151242ce46 Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.

**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
   - Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288

Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
     - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689

// PlainTable
Does not apply
```
- **Db bench 2: performance**

**Read**

SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`

**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none

rocksdb.sst.read.micros  COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`

Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`

Reviewed By: ajkr

Differential Revision: D44007011

Pulled By: hx235

fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 09:07:18 -07:00
Andrew Kryczka 0a774a102f Clarify SstFileWriter::DeleteRange() ordering requirements (#11390)
Summary:
As titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11390

Reviewed By: cbi42

Differential Revision: D45148830

Pulled By: ajkr

fbshipit-source-id: 9a8dfd040514bae3d8ed9e97a79cae7683f2749a
2023-04-20 13:02:16 -07:00
Changyu Bi 43e9a60bb2 Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.

This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375

Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.

Reviewed By: ajkr

Differential Revision: D44943518

Pulled By: cbi42

fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 11:10:48 -07:00
Andrew Kryczka f3818948e8 Deflake DBWriteTest.LockWALInEffect (#11382)
Summary:
This test exhibited the following flaky failure:

```
db/db_write_test.cc:653: Failure
db_->Resume()
Corruption: Not active
```

I was able to repro it by applying the following patch to coerce a specific race condition:

```
 diff --git a/db/db_write_test.cc b/db/db_write_test.cc
index d82c57376..775ba3cde 100644
 --- a/db/db_write_test.cc
+++ b/db/db_write_test.cc
@@ -636,6 +636,10 @@ TEST_P(DBWriteTest, LockWALInEffect) {
   ASSERT_TRUE(dbfull()->WALBufferIsEmpty());
   ASSERT_OK(db_->UnlockWAL());

+  // Test thread: sleep interval: [0, 3)
+  // In this interval, the file system is active
+  sleep(3);
+
   // Fail the WAL flush if applicable
   fault_fs->SetFilesystemActive(false);
   Status s = Put("key2", "value");
@@ -649,6 +653,11 @@ TEST_P(DBWriteTest, LockWALInEffect) {
     ASSERT_OK(db_->LockWAL());
     ASSERT_OK(db_->UnlockWAL());
   }
+
+  // Test thread: sleep interval: [3, 6)
+  // In this interval, the file system is inactive
+  sleep(3);
+
   fault_fs->SetFilesystemActive(true);
   ASSERT_OK(db_->Resume());
   // Writes should work again
 diff --git a/db/flush_job.cc b/db/flush_job.cc
index 8193f594f..602ee2c9f 100644
 --- a/db/flush_job.cc
+++ b/db/flush_job.cc
@@ -979,6 +979,10 @@ Status FlushJob::WriteLevel0Table() {
           DirFsyncOptions(DirFsyncOptions::FsyncReason::kNewFileSynced));
     }
     TEST_SYNC_POINT_CALLBACK("FlushJob::WriteLevel0Table", &mems_);
+    // Flush thread: sleep interval: [0, 4)
+    // Upon awakening, the file system will be inactive. Then the MANIFEST
+    // update will fail.
+    sleep(4);
     db_mutex_->Lock();
   }
   base_->Unref();
```

The fix for this scenario is explained in the code change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11382

Reviewed By: cbi42

Differential Revision: D45027632

Pulled By: ajkr

fbshipit-source-id: 6bfa35a5781c0c080fb74e13f2b2c9f871f7effb
2023-04-17 11:00:08 -07:00
Andrew Kryczka b8555ba470 Deflake DBBloomFilterTest.OptimizeFiltersForHits (#11383)
Summary:
In CircleCI build-linux-arm-test-full job (https://app.circleci.com/pipelines/github/facebook/rocksdb/26462/workflows/a9d39d2c-c970-4b0f-9c10-7743beb9771b/jobs/591722), this test exhibited the following flaky failure:

```
db/db_bloom_filter_test.cc:2506: Failure
Expected: (TestGetTickerCount(options, BLOOM_FILTER_USEFUL)) > (65000 * 2), actual: 120558 vs 130000
```

I ssh'd to an instance and observed it cuts memtables at slightly different points across runs. Logging in `ConcurrentArena` pointed to `try_lock()` returning false at different points across runs.

This PR changes the approach to allow a fixed number of keys per memtable flush. I verified the bloom filter useful count is deterministic now even on the CircleCI ARM instance.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11383

Reviewed By: cbi42

Differential Revision: D45036829

Pulled By: ajkr

fbshipit-source-id: b602dacb63955f1af09bf0ed409cde0552805a08
2023-04-17 10:36:22 -07:00
Murali Vilayannur 226ee25d30 Block fetch CPU time counters in perf context (#11342)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11342

Reviewed By: ajkr

Differential Revision: D45026838

Pulled By: mnv104

fbshipit-source-id: 099ed9579922b8fa6e7d3332bbb829d50ec47d91
2023-04-15 11:09:44 -07:00
mayue.fight 4d72f48e57 Fix the wrong calculation of largest_key in import_column_family_job (#11381)
Summary:
When calculating the largest_key in ImportColumnFamilyJob::GetIngestedFileInfo, only the first element of range_del_iter is calculated. If range_del_iter has multiple elements, the largest_key will be wrong

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11381

Reviewed By: cbi42

Differential Revision: D44981450

Pulled By: ajkr

fbshipit-source-id: 584bc7da86295568a96984d2951644f289e578c7
2023-04-15 10:33:23 -07:00
Changyu Bi ba16e8eee7 Try to pick more files in LevelCompactionBuilder::TryExtendNonL0TrivialMove() (#11347)
Summary:
Before this PR, in `LevelCompactionBuilder::TryExtendNonL0TrivialMove(index)`, we start from a file at index and expand the compaction input towards right to find files to trivial move. This PR adds the logic to also expand towards left.

Another major change made in this PR is to not expand L0 files through `TryExtendNonL0TrivialMove()`. This happens currently when compacting L0 files to an empty output level. The condition for expanding files in `TryExtendNonL0TrivialMove()` is to check atomic boundary, which does not take into account that L0 files can overlap in key range and are not sorted in key order. So it may include more L0 files than needed and disallow a trivial move. This change is included in this PR so that we don't make it worse by always expanding L0 in both direction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11347

Test Plan:
* new unit test
* Benchmark does not show obvious improvement or regression:
```
Write sequentially
./db_bench --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216

Main:
fillseq      :       4.726 micros/op 211592 ops/sec 472.607 seconds 100000000 operations;   23.4 MB/s
This PR:
fillseq      :       4.755 micros/op 210289 ops/sec 475.534 seconds 100000000 operations;   23.3 MB/s

Write randomly
./db_bench --benchmarks=fillrandom --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216

Main:
fillrandom   :      16.351 micros/op 61159 ops/sec 1635.066 seconds 100000000 operations;    6.8 MB/s
This PR:
fillrandom   :      15.798 micros/op 63298 ops/sec 1579.817 seconds 100000000 operations;    7.0 MB/s
```

Reviewed By: ajkr

Differential Revision: D44645650

Pulled By: cbi42

fbshipit-source-id: 8631f3a6b3f01decbbf18c34f2b62833cb4f9733
2023-04-14 11:50:20 -07:00
mayue.fight 9500d90d1b Fix serval bugs in ImportColumnFamilyTest (#11372)
Summary:
**Context/Summary:**
ASSERT_EQ will only verify the code of Status, but will not check the state message of Status.

- Assert by checking Status state in `ImportColumnFamilyTest`
- Forgot to set db_comparator_name when creating ExportImportFilesMetaData in `ImportColumnFamilyNegativeTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11372

Reviewed By: ajkr

Differential Revision: D45004343

Pulled By: cbi42

fbshipit-source-id: a13d45521df17ead3d6d4c1c1fe1e4c95397ce8b
2023-04-14 10:44:42 -07:00
Niklas Fiekas d5a9c0c937 C-API: Constify cache functions where possible (#11243)
Summary:
Makes it easier to use generated Rust bindings. Constness of these is already part of the C++ API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11243

Reviewed By: hx235

Differential Revision: D44840394

Pulled By: ajkr

fbshipit-source-id: bcd1aeb8c959c304148d25b00043bb8c4cd3e0a4
2023-04-10 12:19:40 -07:00
Changyu Bi 64cead919f Initialize lowest_unnecessary_level_ in VersionStorageInfo constructor (#11359)
Summary:
valgrind complains "Conditional jump or move depends on uninitialised value(s)". A sample error message:

```
[ RUN      ] DBCompactionTest.DrainUnnecessaryLevelsAfterDBBecomesSmall
==3353864== Conditional jump or move depends on uninitialised value(s)
==3353864==    at 0x8647B4: rocksdb::VersionStorageInfo::ComputeCompactionScore(rocksdb::ImmutableOptions const&, rocksdb::MutableCFOptions const&) (version_set.cc:3414)
==3353864==    by 0x86B340: rocksdb::VersionSet::AppendVersion(rocksdb::ColumnFamilyData*, rocksdb::Version*) (version_set.cc:4946)
==3353864==    by 0x876B88: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6876)
==3353864==    by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483)
==3353864==    by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187)
==3353864==    by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31)
==3353864==    by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729)
==3353864==    by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522)
==3353864==    by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928)
==3353864==    by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743)
==3353864==    by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720)
==3353864==    by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710)
==3353864==  Uninitialised value was created by a heap allocation
==3353864==    at 0x4842F0F: operator new(unsigned long) (vg_replace_malloc.c:422)
==3353864==    by 0x876AF4: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6870)
==3353864==    by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483)
==3353864==    by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187)
==3353864==    by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31)
==3353864==    by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729)
==3353864==    by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522)
==3353864==    by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928)
==3353864==    by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743)
==3353864==    by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720)
==3353864==    by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710)
==3353864==    by 0x591F73: rocksdb::DBTestBase::Reopen(rocksdb::Options const&) (db_test_util.cc:662)
```

This is likely about `lowest_unnecessary_level_` even though it would be initialized in `CalculateBaseBytes()` before being used in `ComputeCompactionScore()`. Initialize it also in VersionStorageInfo constructor to prevent valgrind from  complaining.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11359

Test Plan: - ran a test with valgrind which gave the error message above before this PR: `valgrind --track-origins=yes ./db_compaction_test  --gtest_filter="*DrainUnnecessaryLevelsAfterDBBecomesSmall*"`

Reviewed By: hx235

Differential Revision: D44799112

Pulled By: cbi42

fbshipit-source-id: 557208a66f04a2163b418b2a651bdb7e777c4511
2023-04-07 15:17:18 -07:00
Changyu Bi f631138e1c Better support for merge operation with data block hash index (#11356)
Summary:
when data block hash index finds a key of op_type `kTypeMerge`, do not redo data block seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11356

Test Plan:
- added new unit test
- crash test: `python3 tools/db_crashtest.py whitebox --simple --use_merge=1 --data_block_index_type=1`
- benchmark see slight improvement in read throughput:
```
TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=mergerandom -use_existing_db=false -num=10000000 -compression_type=none -level_compaction_dynamic_level_bytes=1 -merge_operator=PutOperator -write_buffer_size=1000000 --use_data_block_hash_index=1

TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=readrandom[-X10] -use_existing_db=true -num=10000000 -merge_operator=PutOperator -readonly=1 -disable_auto_compactions=1 -reads=100000

Main: readrandom [AVG 10 runs] : 29526 (± 1118) ops/sec;    2.1 (± 0.1) MB/sec
Post-PR: readrandom [AVG 10 runs] : 31095 (± 662) ops/sec;    2.2 (± 0.0) MB/sec
```

Reviewed By: pdillinger

Differential Revision: D44759895

Pulled By: cbi42

fbshipit-source-id: 387f0c35938c7e0e96b810ca3babf1967fc68191
2023-04-07 10:06:03 -07:00
Wentian Guo 0578d9f951 Filter table files by timestamp: Get operator (#11332)
Summary:
If RocksDB enables user-defined timestamp, then RocksDB read path can filter table files by the min/max timestamps of each file. If application wants to lookup a key that is the most recent and visible to a certain timestamp ts, then we can compare ts with the min_ts of each file. If ts < min_ts, then we know all keys in the file is not visible at time ts, then we do not have to open the file. This can also save an IO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11332

Reviewed By: pdillinger

Differential Revision: D44763497

Pulled By: guowentian

fbshipit-source-id: abde346b9f18480fe03c04e4006e7d62aa9c22a8
2023-04-06 15:39:38 -07:00
Changyu Bi b3c43a5b99 Drain unnecessary levels when level_compaction_dynamic_level_bytes=true (#11340)
Summary:
When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340

Test Plan:
- New unit tests
- `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.

Reviewed By: ajkr

Differential Revision: D44563884

Pulled By: cbi42

fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
2023-04-06 11:20:43 -07:00
anand76 0623c5b903 Ensure VerifyFileChecksums reads don't exceed readahead_size (#11328)
Summary:
VerifyFileChecksums currently interprets the readahead_size as a payload of readahead_size for calculating the checksum, plus a prefetch of an additional readahead_size. Hence each read is readahead_size * 2. This change treats it as chunks of readahead_size for checksum calculation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328

Test Plan: Add a unit test

Reviewed By: pdillinger

Differential Revision: D44718781

Pulled By: anand1976

fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
2023-04-05 16:22:08 -07:00
Hui Xiao 7f5b9f40cb Fix initialization-order-fiasco in write_stall_stats.cc (#11355)
Summary:
**Context/Summary:**
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11355

Test Plan:
- Ran previously failed tests and they succeed
- Perf
`./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3`

Reviewed By: ajkr

Differential Revision: D44719333

Pulled By: hx235

fbshipit-source-id: 23d22f314144071d97f7106ff1241c31c0bdf08b
2023-04-05 14:42:31 -07:00
Niklas Fiekas e5a560ec98 Expose cache occupancy via C API (#11327)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11327

Reviewed By: cbi42

Differential Revision: D44422225

Pulled By: ajkr

fbshipit-source-id: 3bfcf47290b3133c151bdfdd181896ba2e6be520
2023-04-03 14:42:43 -07:00