Commit graph

5261 commits

Author SHA1 Message Date
leipeng a475e9f746 DBIter::FindValueForCurrentKey: remove unused Status s (#11394)
Summary:
This PR remove a historical useless code

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11394

Reviewed By: ajkr

Differential Revision: D45506226

Pulled By: akankshamahajan15

fbshipit-source-id: 32c98627100c9ad131bf65c4a1fe97ab61502daf
2023-05-03 08:52:03 -07:00
anand76 03a892a9fb Delete empty WAL files on reopen (#11409)
Summary:
When a DB is opened, RocksDB creates an empty WAL file. When the DB is reopened and the WAL is empty, the min log number to keep is not advanced until a memtable flush happens. If a process crashes soon after reopening the DB, its likely that no memtable flush would have happened, which means the empty WAL file is not deleted. In a crash loop scenario, this leads to empty WAL files accumulating. Fix this by ensuring the min log number is advanced if the WAL is empty.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11409

Test Plan: Add a unit test

Reviewed By: ajkr

Differential Revision: D45281685

Pulled By: anand1976

fbshipit-source-id: 0225877c613e65ffb30972a0051db2830105423e
2023-05-02 15:54:29 -07:00
Peter Dillinger 41a7fbf758 Avoid long parameter lists configuring Caches (#11386)
Summary:
For better clarity, encouraging more options explicitly specified using fields rather than positionally via constructor parameter lists. Simplifies code maintenance as new fields are added. Deprecate some cases of the confusing pattern of NewWhatever() functions returning shared_ptr.

Net reduction of about 70 source code lines (including comments).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11386

Test Plan: existing tests

Reviewed By: ajkr

Differential Revision: D45059075

Pulled By: pdillinger

fbshipit-source-id: d53fa09b268024f9c55254bb973b6c69feebf41a
2023-05-01 14:52:01 -07:00
Changyu Bi 62fc15f009 Block per key-value checksum (#11287)
Summary:
add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are
1. checksum construction and verification in block.cc/h
2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
3. unit tests/crash test updates

Tests:
* Added unit tests
* Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`

Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.

Performance:
Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates):

```
SETUP
make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none

BENCHMARK
./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE

The readrandom ops/sec looks like the following:
Block cache size:  2GB        1.2GB * 0.9    1.2GB * 0.8     1.2GB * 0.5   8MB
Main              240805     223604         198176           161653       139040
PR prot_bytes=0   238691     226693         200127           161082       141153
PR prot_bytes=1   214983     193199         178532           137013       108211
prot_bytes=1 vs    -10%        -15%          -10.8%          -15%        -23%
prot_bytes=0
```

The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287

Reviewed By: ajkr

Differential Revision: D43970708

Pulled By: cbi42

fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
2023-04-25 12:08:23 -07:00
leipeng 40d69b59ad DBImpl::MultiGet: delete unused var superversions_to_delete (#11395)
Summary:
In db_impl.cc DBImpl::MultiGet: delete unused var `superversions_to_delete`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11395

Reviewed By: ajkr

Differential Revision: D45240896

Pulled By: cbi42

fbshipit-source-id: 0fff99b0d794b6f6d4aadee6036bddd6cb19eb31
2023-04-25 10:46:29 -07:00
Peter Dillinger fb63d9b4ee Fix compression tests when snappy not available (#11396)
Summary:
Tweak some bounds and things, and reduce risk of surprise results by running on all supported compressions (mostly).

Also improves the precise compressibility of CompressibleString by using RandomBinaryString.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11396

Test Plan: updated tests

Reviewed By: ltamasi

Differential Revision: D45211938

Pulled By: pdillinger

fbshipit-source-id: 9dc1dd8574a60a9364efe18558be66d31a35598b
2023-04-22 12:41:36 -07:00
Peter Dillinger d79be3dca2 Changes and enhancements to compression stats, thresholds (#11388)
Summary:
## Option API updates
* Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7.
* Remove unnecessary constructor for CompressionOptions.
* Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated).

 ## Stat API updates
* Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed.
* Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing.
* New or existing tickers relevant to compression:
  * BYTES_COMPRESSED_FROM
  * BYTES_COMPRESSED_TO
  * BYTES_COMPRESSION_BYPASSED
  * BYTES_COMPRESSION_REJECTED
  * COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing)
  * NUMBER_BLOCK_COMPRESSED (existing)
  * NUMBER_BLOCK_COMPRESSION_BYPASSED
  * NUMBER_BLOCK_COMPRESSION_REJECTED
  * BYTES_DECOMPRESSED_FROM
  * BYTES_DECOMPRESSED_TO

We can compute a number of things with these stats:
* "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO
* Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED)
* Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_REJECTED + NUMBER_BLOCK_COMPRESSION_REJECTED
* Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES

Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression.

 ## BlockBasedTableBuilder
Various small refactorings, optimizations, and name clean-ups.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388

Test Plan:
unit tests added

* `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field.

Reviewed By: ajkr

Differential Revision: D45128202

Pulled By: pdillinger

fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803
2023-04-21 21:57:40 -07:00
Hui Xiao 151242ce46 Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.

**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
   - Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288

Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
     - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689

// PlainTable
Does not apply
```
- **Db bench 2: performance**

**Read**

SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`

**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none

rocksdb.sst.read.micros  COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`

Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`

Reviewed By: ajkr

Differential Revision: D44007011

Pulled By: hx235

fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 09:07:18 -07:00
Andrew Kryczka 0a774a102f Clarify SstFileWriter::DeleteRange() ordering requirements (#11390)
Summary:
As titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11390

Reviewed By: cbi42

Differential Revision: D45148830

Pulled By: ajkr

fbshipit-source-id: 9a8dfd040514bae3d8ed9e97a79cae7683f2749a
2023-04-20 13:02:16 -07:00
Changyu Bi 43e9a60bb2 Always allow L0->L1 trivial move during manual compaction (#11375)
Summary:
during manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with compacting key range (introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce kForce* contract). This can cause large memory usage due to compaction readahead when number of L0 files is large. This PR allows L0->L1 trivial move in this case, and will do a L1 -> L1 intra-level compaction when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 file, and user calls CompactRange(kForce, nullptr, nullptr),
- before this PR, RocksDB does a L0 -> L1 compaction (disallow trivial move),
- after this PR, RocksDB does a L0 -> L1 compaction (allow trivial move), and a L1 -> L1 compaction.
Users can use kForceOptimized to avoid this extra L1->L1 compaction overhead when L0s are overlapping and cannot be trivial moved.

This PR also fixed a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause incorrect level being moved when CompactRangeOptions::change_level is specified.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375

Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.

Reviewed By: ajkr

Differential Revision: D44943518

Pulled By: cbi42

fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
2023-04-20 11:10:48 -07:00
Andrew Kryczka f3818948e8 Deflake DBWriteTest.LockWALInEffect (#11382)
Summary:
This test exhibited the following flaky failure:

```
db/db_write_test.cc:653: Failure
db_->Resume()
Corruption: Not active
```

I was able to repro it by applying the following patch to coerce a specific race condition:

```
 diff --git a/db/db_write_test.cc b/db/db_write_test.cc
index d82c57376..775ba3cde 100644
 --- a/db/db_write_test.cc
+++ b/db/db_write_test.cc
@@ -636,6 +636,10 @@ TEST_P(DBWriteTest, LockWALInEffect) {
   ASSERT_TRUE(dbfull()->WALBufferIsEmpty());
   ASSERT_OK(db_->UnlockWAL());

+  // Test thread: sleep interval: [0, 3)
+  // In this interval, the file system is active
+  sleep(3);
+
   // Fail the WAL flush if applicable
   fault_fs->SetFilesystemActive(false);
   Status s = Put("key2", "value");
@@ -649,6 +653,11 @@ TEST_P(DBWriteTest, LockWALInEffect) {
     ASSERT_OK(db_->LockWAL());
     ASSERT_OK(db_->UnlockWAL());
   }
+
+  // Test thread: sleep interval: [3, 6)
+  // In this interval, the file system is inactive
+  sleep(3);
+
   fault_fs->SetFilesystemActive(true);
   ASSERT_OK(db_->Resume());
   // Writes should work again
 diff --git a/db/flush_job.cc b/db/flush_job.cc
index 8193f594f..602ee2c9f 100644
 --- a/db/flush_job.cc
+++ b/db/flush_job.cc
@@ -979,6 +979,10 @@ Status FlushJob::WriteLevel0Table() {
           DirFsyncOptions(DirFsyncOptions::FsyncReason::kNewFileSynced));
     }
     TEST_SYNC_POINT_CALLBACK("FlushJob::WriteLevel0Table", &mems_);
+    // Flush thread: sleep interval: [0, 4)
+    // Upon awakening, the file system will be inactive. Then the MANIFEST
+    // update will fail.
+    sleep(4);
     db_mutex_->Lock();
   }
   base_->Unref();
```

The fix for this scenario is explained in the code change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11382

Reviewed By: cbi42

Differential Revision: D45027632

Pulled By: ajkr

fbshipit-source-id: 6bfa35a5781c0c080fb74e13f2b2c9f871f7effb
2023-04-17 11:00:08 -07:00
Andrew Kryczka b8555ba470 Deflake DBBloomFilterTest.OptimizeFiltersForHits (#11383)
Summary:
In CircleCI build-linux-arm-test-full job (https://app.circleci.com/pipelines/github/facebook/rocksdb/26462/workflows/a9d39d2c-c970-4b0f-9c10-7743beb9771b/jobs/591722), this test exhibited the following flaky failure:

```
db/db_bloom_filter_test.cc:2506: Failure
Expected: (TestGetTickerCount(options, BLOOM_FILTER_USEFUL)) > (65000 * 2), actual: 120558 vs 130000
```

I ssh'd to an instance and observed it cuts memtables at slightly different points across runs. Logging in `ConcurrentArena` pointed to `try_lock()` returning false at different points across runs.

This PR changes the approach to allow a fixed number of keys per memtable flush. I verified the bloom filter useful count is deterministic now even on the CircleCI ARM instance.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11383

Reviewed By: cbi42

Differential Revision: D45036829

Pulled By: ajkr

fbshipit-source-id: b602dacb63955f1af09bf0ed409cde0552805a08
2023-04-17 10:36:22 -07:00
Murali Vilayannur 226ee25d30 Block fetch CPU time counters in perf context (#11342)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11342

Reviewed By: ajkr

Differential Revision: D45026838

Pulled By: mnv104

fbshipit-source-id: 099ed9579922b8fa6e7d3332bbb829d50ec47d91
2023-04-15 11:09:44 -07:00
mayue.fight 4d72f48e57 Fix the wrong calculation of largest_key in import_column_family_job (#11381)
Summary:
When calculating the largest_key in ImportColumnFamilyJob::GetIngestedFileInfo, only the first element of range_del_iter is calculated. If range_del_iter has multiple elements, the largest_key will be wrong

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11381

Reviewed By: cbi42

Differential Revision: D44981450

Pulled By: ajkr

fbshipit-source-id: 584bc7da86295568a96984d2951644f289e578c7
2023-04-15 10:33:23 -07:00
Changyu Bi ba16e8eee7 Try to pick more files in LevelCompactionBuilder::TryExtendNonL0TrivialMove() (#11347)
Summary:
Before this PR, in `LevelCompactionBuilder::TryExtendNonL0TrivialMove(index)`, we start from a file at index and expand the compaction input towards right to find files to trivial move. This PR adds the logic to also expand towards left.

Another major change made in this PR is to not expand L0 files through `TryExtendNonL0TrivialMove()`. This happens currently when compacting L0 files to an empty output level. The condition for expanding files in `TryExtendNonL0TrivialMove()` is to check atomic boundary, which does not take into account that L0 files can overlap in key range and are not sorted in key order. So it may include more L0 files than needed and disallow a trivial move. This change is included in this PR so that we don't make it worse by always expanding L0 in both direction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11347

Test Plan:
* new unit test
* Benchmark does not show obvious improvement or regression:
```
Write sequentially
./db_bench --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216

Main:
fillseq      :       4.726 micros/op 211592 ops/sec 472.607 seconds 100000000 operations;   23.4 MB/s
This PR:
fillseq      :       4.755 micros/op 210289 ops/sec 475.534 seconds 100000000 operations;   23.3 MB/s

Write randomly
./db_bench --benchmarks=fillrandom --compression_type=lz4 --write_buffer_size=1000000 --num=100000000 --value_size=100 -level_compaction_dynamic_level_bytes --target_file_size_base=7340032 --max_bytes_for_level_base=16777216

Main:
fillrandom   :      16.351 micros/op 61159 ops/sec 1635.066 seconds 100000000 operations;    6.8 MB/s
This PR:
fillrandom   :      15.798 micros/op 63298 ops/sec 1579.817 seconds 100000000 operations;    7.0 MB/s
```

Reviewed By: ajkr

Differential Revision: D44645650

Pulled By: cbi42

fbshipit-source-id: 8631f3a6b3f01decbbf18c34f2b62833cb4f9733
2023-04-14 11:50:20 -07:00
mayue.fight 9500d90d1b Fix serval bugs in ImportColumnFamilyTest (#11372)
Summary:
**Context/Summary:**
ASSERT_EQ will only verify the code of Status, but will not check the state message of Status.

- Assert by checking Status state in `ImportColumnFamilyTest`
- Forgot to set db_comparator_name when creating ExportImportFilesMetaData in `ImportColumnFamilyNegativeTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11372

Reviewed By: ajkr

Differential Revision: D45004343

Pulled By: cbi42

fbshipit-source-id: a13d45521df17ead3d6d4c1c1fe1e4c95397ce8b
2023-04-14 10:44:42 -07:00
Niklas Fiekas d5a9c0c937 C-API: Constify cache functions where possible (#11243)
Summary:
Makes it easier to use generated Rust bindings. Constness of these is already part of the C++ API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11243

Reviewed By: hx235

Differential Revision: D44840394

Pulled By: ajkr

fbshipit-source-id: bcd1aeb8c959c304148d25b00043bb8c4cd3e0a4
2023-04-10 12:19:40 -07:00
Changyu Bi 64cead919f Initialize lowest_unnecessary_level_ in VersionStorageInfo constructor (#11359)
Summary:
valgrind complains "Conditional jump or move depends on uninitialised value(s)". A sample error message:

```
[ RUN      ] DBCompactionTest.DrainUnnecessaryLevelsAfterDBBecomesSmall
==3353864== Conditional jump or move depends on uninitialised value(s)
==3353864==    at 0x8647B4: rocksdb::VersionStorageInfo::ComputeCompactionScore(rocksdb::ImmutableOptions const&, rocksdb::MutableCFOptions const&) (version_set.cc:3414)
==3353864==    by 0x86B340: rocksdb::VersionSet::AppendVersion(rocksdb::ColumnFamilyData*, rocksdb::Version*) (version_set.cc:4946)
==3353864==    by 0x876B88: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6876)
==3353864==    by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483)
==3353864==    by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187)
==3353864==    by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31)
==3353864==    by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729)
==3353864==    by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522)
==3353864==    by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928)
==3353864==    by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743)
==3353864==    by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720)
==3353864==    by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710)
==3353864==  Uninitialised value was created by a heap allocation
==3353864==    at 0x4842F0F: operator new(unsigned long) (vg_replace_malloc.c:422)
==3353864==    by 0x876AF4: rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const*) (version_set.cc:6870)
==3353864==    by 0xBA66FE: rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) (version_edit_handler.cc:483)
==3353864==    by 0xBA4A81: rocksdb::VersionEditHandler::Initialize() (version_edit_handler.cc:187)
==3353864==    by 0xBA3927: rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) (version_edit_handler.cc:31)
==3353864==    by 0x870173: rocksdb::VersionSet::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool) (version_set.cc:5729)
==3353864==    by 0x7538FA: rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*) (db_impl_open.cc:522)
==3353864==    by 0x75BA0F: rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) (db_impl_open.cc:1928)
==3353864==    by 0x75A735: rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**) (db_impl_open.cc:1743)
==3353864==    by 0x75A510: rocksdb::DB::Open(rocksdb::Options const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::DB**) (db_impl_open.cc:1720)
==3353864==    by 0x5925FD: rocksdb::DBTestBase::TryReopen(rocksdb::Options const&) (db_test_util.cc:710)
==3353864==    by 0x591F73: rocksdb::DBTestBase::Reopen(rocksdb::Options const&) (db_test_util.cc:662)
```

This is likely about `lowest_unnecessary_level_` even though it would be initialized in `CalculateBaseBytes()` before being used in `ComputeCompactionScore()`. Initialize it also in VersionStorageInfo constructor to prevent valgrind from  complaining.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11359

Test Plan: - ran a test with valgrind which gave the error message above before this PR: `valgrind --track-origins=yes ./db_compaction_test  --gtest_filter="*DrainUnnecessaryLevelsAfterDBBecomesSmall*"`

Reviewed By: hx235

Differential Revision: D44799112

Pulled By: cbi42

fbshipit-source-id: 557208a66f04a2163b418b2a651bdb7e777c4511
2023-04-07 15:17:18 -07:00
Changyu Bi f631138e1c Better support for merge operation with data block hash index (#11356)
Summary:
when data block hash index finds a key of op_type `kTypeMerge`, do not redo data block seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11356

Test Plan:
- added new unit test
- crash test: `python3 tools/db_crashtest.py whitebox --simple --use_merge=1 --data_block_index_type=1`
- benchmark see slight improvement in read throughput:
```
TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=mergerandom -use_existing_db=false -num=10000000 -compression_type=none -level_compaction_dynamic_level_bytes=1 -merge_operator=PutOperator -write_buffer_size=1000000 --use_data_block_hash_index=1

TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=readrandom[-X10] -use_existing_db=true -num=10000000 -merge_operator=PutOperator -readonly=1 -disable_auto_compactions=1 -reads=100000

Main: readrandom [AVG 10 runs] : 29526 (± 1118) ops/sec;    2.1 (± 0.1) MB/sec
Post-PR: readrandom [AVG 10 runs] : 31095 (± 662) ops/sec;    2.2 (± 0.0) MB/sec
```

Reviewed By: pdillinger

Differential Revision: D44759895

Pulled By: cbi42

fbshipit-source-id: 387f0c35938c7e0e96b810ca3babf1967fc68191
2023-04-07 10:06:03 -07:00
Wentian Guo 0578d9f951 Filter table files by timestamp: Get operator (#11332)
Summary:
If RocksDB enables user-defined timestamp, then RocksDB read path can filter table files by the min/max timestamps of each file. If application wants to lookup a key that is the most recent and visible to a certain timestamp ts, then we can compare ts with the min_ts of each file. If ts < min_ts, then we know all keys in the file is not visible at time ts, then we do not have to open the file. This can also save an IO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11332

Reviewed By: pdillinger

Differential Revision: D44763497

Pulled By: guowentian

fbshipit-source-id: abde346b9f18480fe03c04e4006e7d62aa9c22a8
2023-04-06 15:39:38 -07:00
Changyu Bi b3c43a5b99 Drain unnecessary levels when level_compaction_dynamic_level_bytes=true (#11340)
Summary:
When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340

Test Plan:
- New unit tests
- `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.

Reviewed By: ajkr

Differential Revision: D44563884

Pulled By: cbi42

fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
2023-04-06 11:20:43 -07:00
anand76 0623c5b903 Ensure VerifyFileChecksums reads don't exceed readahead_size (#11328)
Summary:
VerifyFileChecksums currently interprets the readahead_size as a payload of readahead_size for calculating the checksum, plus a prefetch of an additional readahead_size. Hence each read is readahead_size * 2. This change treats it as chunks of readahead_size for checksum calculation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328

Test Plan: Add a unit test

Reviewed By: pdillinger

Differential Revision: D44718781

Pulled By: anand1976

fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
2023-04-05 16:22:08 -07:00
Hui Xiao 7f5b9f40cb Fix initialization-order-fiasco in write_stall_stats.cc (#11355)
Summary:
**Context/Summary:**
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11355

Test Plan:
- Ran previously failed tests and they succeed
- Perf
`./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3`

Reviewed By: ajkr

Differential Revision: D44719333

Pulled By: hx235

fbshipit-source-id: 23d22f314144071d97f7106ff1241c31c0bdf08b
2023-04-05 14:42:31 -07:00
Niklas Fiekas e5a560ec98 Expose cache occupancy via C API (#11327)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11327

Reviewed By: cbi42

Differential Revision: D44422225

Pulled By: ajkr

fbshipit-source-id: 3bfcf47290b3133c151bdfdd181896ba2e6be520
2023-04-03 14:42:43 -07:00
Hui Xiao 39c29372bf Add SetAllowStall() (#11335)
Summary:
**Context/Summary:**
- Allow runtime changes to whether `WriteBufferManager` allows stall or not by calling `SetAllowStall()`
- Misc: some clean up - see PR conversation

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11335

Test Plan: - New UT

Reviewed By: akankshamahajan15

Differential Revision: D44502555

Pulled By: hx235

fbshipit-source-id: 24b5cc57df7734b11d42e4870c06c87b95312b5e
2023-03-30 09:43:33 -07:00
Hui Xiao c14eb134ed Add experimental PerfContext counters for db iterator Prev/Next/Seek* APIs (#11320)
Summary:
**Context/Summary:**
Motived by user need of investigating db iterator behavior during an interval of any time length of a certain thread, we decide to collect and expose related counters in `PerfContext` as an experimental feature, in addition to the existing db-scope ones (i.e, tickers)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11320

Test Plan:
- new UT
- db bench

Setup
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -compression_type=none -bloom_bits=3
```
Test till converges
```
./db_bench -seed=1679526311157283 -use_existing_db=1 -perf_level=2 -db=/dev/shm/testdb/ -benchmarks="seekrandom[-X60]"
```
pre-change
`seekrandom [AVG 33 runs] : 7545 (± 100) ops/sec`
post-change (no regression)
`seekrandom [AVG 33 runs] : 7688 (± 67) ops/sec`

Reviewed By: cbi42

Differential Revision: D44321931

Pulled By: hx235

fbshipit-source-id: f98a254ba3e3ced95eb5928884e33f1b99dca401
2023-03-28 10:23:12 -07:00
Changyu Bi 601320164b Trivially move files down when opening db with level_compaction_dynamic_l… (#11321)
Summary:
…evel_bytes

 During DB open, if a column family uses level compaction with level_compaction_dynamic_level_bytes=true, trivially move its files down in the LSM such that the bottommost files are in Lmax, the second from bottommost level files are in Lmax-1 and so on. This is aimed to make it easier to migrate level_compaction_dynamic_level_bytes from false to true.  Before this change, a full manual compaction is suggested for such migration. After this change, user can just restart DB to turn on this option. db_crashtest.py is updated to randomly choose value for level_compaction_dynamic_level_bytes.

Note that there may still be too many unnecessary levels if a user is migrating from universal compaction or level compaction with a smaller level multiplier. A full manual compaction may still be needed in that case before some PR that automatically drain unnecessary levels like https://github.com/facebook/rocksdb/issues/3921 lands. Eventually we may want to change the default value of option level_compaction_dynamic_level_bytes to true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11321

Test Plan:
1. Added unit tests.
2. Crash test: ran a variation of db_crashtest.py (like 32516507e77521ae887e45091b69139e32e8efb7) that turns level_compaction_dynamic_level_bytes on and off and switches between LC and UC for the same DB.

TODO: Update `OptionChangeMigration`, either after this PR or https://github.com/facebook/rocksdb/issues/3921.

Reviewed By: ajkr

Differential Revision: D44341930

Pulled By: cbi42

fbshipit-source-id: 013de19a915c6a0502be569f07c4cc8f1c3c6be2
2023-03-27 14:55:16 -07:00
karemta-orday 40c2ec6d08 Add in-transaction multi-get-for-update to the C interface (#11107)
Summary:
Hi, this is basically a part of https://github.com/facebook/rocksdb/pull/6488 that only adds `multi_get_for_update` functionality to C API (I'd like to call it from Rust), since `multi_get` was already added here https://github.com/facebook/rocksdb/pull/9252

https://github.com/facebook/rocksdb/pull/6488 has conflicts, so I guess it might be easier to get this one in

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11107

Reviewed By: pdillinger

Differential Revision: D42680764

Pulled By: ajkr

fbshipit-source-id: a50f96e1c7f3d470b4ab07e9ff5a283e5cf44865
2023-03-27 12:14:18 -07:00
Andrew Kryczka 9f8cdc8ad6 validate SstFileWriter range tombstones cover positive ranges (#11322)
Summary:
As titled. This is the same as https://github.com/facebook/rocksdb/issues/6788 but for range tombstones written through `SstFileWriter` rather than through `DB`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11322

Reviewed By: cbi42

Differential Revision: D44317733

Pulled By: ajkr

fbshipit-source-id: f6eb8791ae2c09c169b6bfe0d047449d924b377e
2023-03-22 21:03:13 -07:00
sdong b92bc04ab0 Deflake DBCompactionTest.CancelCompactionWaitingOnConflict (#11318)
Summary:
In DBCompactionTest::CancelCompactionWaitingOnConflict, when generating SST files to trigger a compaction, we don't wait after each file, which may cause multiple memtables going to the same SST file, causing insufficient files to trigger the compaction. We do the waiting instead, except the last one, which would trigger compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11318

Test Plan: Run DBCompactionTest.CancelCompactionWaitingOnConflict multiple times.

Reviewed By: ajkr

Differential Revision: D44267273

fbshipit-source-id: 86af49b05fc67ea3335312f0f5f3d22df1520bf8
2023-03-21 15:38:33 -07:00
Andrew Kryczka 8c445407b7 Specify precedence in SstFileWriter::DeleteRange() API contract (#11309)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11309

Reviewed By: cbi42

Differential Revision: D44198501

Pulled By: ajkr

fbshipit-source-id: d603aca37b56aac5df255833793a3300807d63cf
2023-03-18 17:37:17 -07:00
Hui Xiao cb58477185 New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300)
Summary:
**Context/Summary:**
Users are interested in figuring out what has caused write stall.
- Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`)
- Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager
- Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped`

- Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map
- Misc: rename/comment some write stall InternalStats for clarity

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300

Test Plan:
- New UT
- Stress test
`python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1`
- Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time.
```
./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3
```
pre-change:
```
fillseq [AVG 15 runs] : 1176 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations;    0.5 MB/s
fillseq [AVG 16 runs] : 1162 (± 685) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 17 runs] : 1136 (± 646) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 18 runs] : 1118 (± 610) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 19 runs] : 1100 (± 578) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations;    0.3 MB/s
fillseq [AVG 20 runs] : 1071 (± 551) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 21 runs] : 1059 (± 525) ops/sec;    0.5 (± 0.3) MB/sec
```
post-change:
```
fillseq [AVG 15 runs] : 1226 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 16 runs] : 1196 (± 687) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 17 runs] : 1174 (± 647) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 18 runs] : 1156 (± 611) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 19 runs] : 1134 (± 579) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 20 runs] : 1119 (± 550) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 21 runs] : 1106 (± 524) ops/sec;    0.6 (± 0.3) MB/sec
```

Reviewed By: ajkr

Differential Revision: D44159541

Pulled By: hx235

fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 09:51:58 -07:00
Peter Dillinger 204fcff751 HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.

This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.

It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.

## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.

## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.

These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.

* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)

I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.

## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.

## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.

## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.

## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.

## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.

## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.

## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.

Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).

Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.

## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301

Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.

## Crash/stress test
Updated to use the new combination.

## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.

```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```

**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127

**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)

This is an acceptable difference IMHO.

Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.

Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec

So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:

**Before**:
multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec

Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.

Let's also look at Get() in db_bench:

```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
**After**:
readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec

That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:

**Before**:
readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
**After**:
readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec

Reviewed By: ltamasi

Differential Revision: D44177044

Pulled By: pdillinger

fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-17 20:23:49 -07:00
anand76 eac6b6d0cd Ignore async_io ReadOption if FileSystem doesn't support it (#11296)
Summary:
In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296

Test Plan: Add new unit tests

Reviewed By: akankshamahajan15

Differential Revision: D44045776

Pulled By: anand1976

fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
2023-03-17 14:57:09 -07:00
hackingthekernel 291300ece8 add c-api for allowing FIFO compaction (#11156)
Summary:
Addressing issue https://github.com/facebook/rocksdb/issues/11079

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11156

Reviewed By: pdillinger

Differential Revision: D42869964

Pulled By: ajkr

fbshipit-source-id: 58214901d4e072c568d4c5cf0944a0b1c60de897
2023-03-16 16:57:03 -07:00
Peter Dillinger ccaa3225b0 Simplify tracking entries already in SecondaryCache (#11299)
Summary:
In preparation for factoring secondary cache support out of individual Cache implementations, we can get rid of the "in secondary cache" flag on entries through a workable hack: when an entry is promoted from secondary, it is inserted in primary using a helper that lacks secondary cache support, thus preventing re-insertion into secondary cache through existing logic.

This adds to the complexity of building CacheItemHelpers, because you always have to be able to get to an equivalent helper without secondary cache support, but that complexity is reasonably isolated within RocksDB typed_cache.h and test code.

gcc-7 seems to have problems with constexpr constructor referencing `this` so removed constexpr support on CacheItemHelper.

Also refactored some related test code to share common code / functionality.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11299

Test Plan: existing tests

Reviewed By: anand1976

Differential Revision: D44101453

Pulled By: pdillinger

fbshipit-source-id: 7a59d0a3938ee40159c90c3e65d7004f6a272345
2023-03-15 17:51:44 -07:00
Peter Dillinger 601efe3cf2 Misc cleanup of block cache code (#11291)
Summary:
... ahead of a larger change.
* Rename confusingly named `is_in_sec_cache` to `kept_in_sec_cache`
* Unify naming of "standalone" block cache entries (was "detached" in clock_cache)
* Remove some unused definitions in clock_cache.h (leftover from a previous revision)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11291

Test Plan: usual tests and CI, no behavior changes

Reviewed By: anand1976

Differential Revision: D43984642

Pulled By: pdillinger

fbshipit-source-id: b8bf0c5b90a932a88bcbdb413b2f256834aedf97
2023-03-15 12:08:17 -07:00
Hui Xiao 11cb6af6e5 Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
    pid=$!
    sleep 0.2
    sleep 10
    kill $pid
    sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
    pid=$!
    sleep 0.2
    sleep 40
    kill $pid
    sleep 0.2

Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```

The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.

**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148

Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test

Reviewed By: ajkr

Differential Revision: D42799871

Pulled By: hx235

fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 16:53:20 -07:00
Levi Tamasi 49881921cd Rename a recently added PerfContext counter (#11294)
Summary:
The patch renames the counter added in https://github.com/facebook/rocksdb/issues/11284 for better consistency with the existing naming scheme.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11294

Test Plan: `make check`

Reviewed By: jowlyzhang

Differential Revision: D44035964

Pulled By: ltamasi

fbshipit-source-id: 8b1a2a03ee728148365367e0ecc1fcf462f62191
2023-03-13 18:43:27 -07:00
Peter Dillinger 648e972f30 Document DB::Resume(), fix LockWALInEffect test (#11290)
Summary:
In rare cases seeing failures like this

```
[ RUN      ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/2
db/db_write_test.cc:653: Failure
Put("key3", "value")
Corruption: Not active
```

in a test with no explicit threading. This is likely because of the unpredictability of background auto-resume. I didn't really know this feature, in part because DB::Resume() was undocumented. So I believe I have fixed the test and documented the API function.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11290

Test Plan: 1000s of stress runs of the test with gtest-parallel

Reviewed By: anand1976

Differential Revision: D43984583

Pulled By: pdillinger

fbshipit-source-id: d30dec120b4864e193751b2e33ff16834d313db3
2023-03-13 14:19:59 -07:00
Changyu Bi 9aa3b6f9ae Support range deletion tombstones in CreateColumnFamilyWithImport (#11252)
Summary:
CreateColumnFamilyWithImport() did not support range tombstones for two reasons:
1. it uses point keys of a input file to determine its boundary (smallest and largest internal key), which means range tombstones outside of the point key range will be effectively dropped.
2. it does not handle files with no point keys.

Also included a fix in external_sst_file_ingestion_job.cc where the blocks read in `GetIngestedFileInfo()` can be added to block cache now (issue fixed in https://github.com/facebook/rocksdb/pull/6429).

This PR adds support for exporting and importing column family with range tombstones. The main change is to add smallest internal key and largest internal key to `SstFileMetaData` that will be part of the output of `ExportColumnFamily()`. Then during `CreateColumnFamilyWithImport(...,const ExportImportFilesMetaData& metadata,...)`, file boundaries can be set from `metadata` directly. This is needed since when file boundaries are extended by range tombstones, sometimes they cannot be deduced from a file's content alone.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11252

Test Plan:
- added unit tests that fails before this change

Closes https://github.com/facebook/rocksdb/issues/11245

Reviewed By: ajkr

Differential Revision: D43577443

Pulled By: cbi42

fbshipit-source-id: 6bff78e583cc50c44854994dea0a8dd519398f2f
2023-03-13 11:06:59 -07:00
Jaepil Jeong 969d4e1dd2 Fix compile errors in Clang due to unused variables depending on the build configuration (#11234)
Summary:
This PR fixes compilation errors in Clang due to unused variables like the below:
```
[109/329] Building CXX object CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o
FAILED: CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o
ccache /opt/homebrew/opt/llvm/bin/clang++ -DGFLAGS=1 -DGFLAGS_IS_A_DLL=0 -DHAVE_FULLFSYNC -DJEMALLOC_NO_DEMANGLE -DLZ4 -DOS_MACOSX -DROCKSDB_JEMALLOC -DROCKSDB_LIB_IO_POSIX -DROCKSDB_NO_DYNAMIC_EXTENSION -DROCKSDB_PLATFORM_POSIX -DSNAPPY -DTBB -DZLIB -DZSTD -I/Users/jaepil/work/deepsearch/deps/cpp/rocksdb -I/Users/jaepil/work/deepsearch/deps/cpp/rocksdb/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -I/Users/jaepil/app/include -I/opt/homebrew/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include/c++/v1 -W -Wextra -Wall -pthread -Wsign-compare -Wshadow -Wno-unused-parameter -Wno-unused-variable -Woverloaded-virtual -Wnon-virtual-dtor -Wno-missing-field-initializers -Wno-strict-aliasing -Wno-invalid-offsetof -fno-omit-frame-pointer -momit-leaf-frame-pointer -march=armv8-a+crc+crypto -Wno-unused-function -Werror -O2 -g -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk -std=gnu++20 -MD -MT CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o -MF CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o.d -o CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o -c /Users/jaepil/work/deepsearch/deps/cpp/rocksdb/db/version_edit_handler.cc
/Users/jaepil/work/deepsearch/deps/cpp/rocksdb/db/version_edit_handler.cc:30:10: error: variable 'recovered_edits' set but not used [-Werror,-Wunused-but-set-variable]
  size_t recovered_edits = 0;
         ^
1 error generated.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11234

Reviewed By: cbi42

Differential Revision: D43458604

Pulled By: ajkr

fbshipit-source-id: d8c50e1a108887b037a120cd9f19374ddaeee817
2023-03-09 16:42:57 -08:00
Levi Tamasi 1d52438504 Add a PerfContext counter for merge operands applied in point lookups (#11284)
Summary:
The existing PerfContext counter `internal_merge_count` only tracks the
Merge operands applied during range scans. The patch adds a new counter
called `internal_merge_count_point_lookups` to track the same metric
for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and
also fixes a couple of cases in the iterator where the existing counter wasn't
updated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284

Test Plan: `make check`

Reviewed By: jowlyzhang

Differential Revision: D43926082

Pulled By: ltamasi

fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9
2023-03-08 18:22:11 -08:00
Levi Tamasi a1a3b23346 Deflake/fix BlobSourceCacheReservationTest.IncreaseCacheReservationOnFullCache (#11273)
Summary:
`BlobSourceCacheReservationTest.IncreaseCacheReservationOnFullCache` is both flaky and also doesn't do what its name says. The patch changes this test so it actually tests increasing the cache reservation, hopefully also deflaking it in the process.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11273

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43800935

Pulled By: ltamasi

fbshipit-source-id: 5eb54130dfbe227285b0e14f2084aa4b89f0b107
2023-03-06 09:50:39 -08:00
Peter Dillinger e168c1b1a4 Use FaultInjectionTestFS in DBWriteTest.LockWALInEffect (#11271)
Summary:
Existing use of FaultInjectionTestEnv shows rare TSAN errors with parallel Sync and Flush. This appears to be fixed in FaultInjectionTestFS. (Sigh, code duplication and divergence.)

Example failure:
https://app.circleci.com/pipelines/github/facebook/rocksdb/24631/workflows/fc2a66f0-f21c-48d6-a944-3885bcff50a4/jobs/571928

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11271

Test Plan: wasn't able to reproduce locally but stress tested the updated test with gtest-parallel -r1000 and TSAN.

Reviewed By: ajkr

Differential Revision: D43779477

Pulled By: pdillinger

fbshipit-source-id: a019b0f1d4045a26a15ab08aab63828a398f6d3e
2023-03-05 08:21:16 -08:00
Igor Canadi ddde1e6af8 Avoid ColumnFamilyDescriptor copy (#10978)
Summary:
Hi. :) Noticed we are copying ColumnFamilyDescriptor here because my process crashed during copy constructor (cause unrelated)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10978

Reviewed By: cbi42

Differential Revision: D41473924

Pulled By: ajkr

fbshipit-source-id: 58a3473f2d7b24918f79d4b2726c20081c5e95b4
2023-03-03 20:55:31 -08:00
Changyu Bi d053926fa2 Improve documentation for MergingIterator (#11161)
Summary:
Add some comments to try to explain how/why MergingIterator works. Made some small refactoring, mostly in MergingIterator::SkipNextDeleted() and MergingIterator::SeekImpl().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11161

Test Plan:
crash test with small key range:
```
python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=6000 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10
```

Reviewed By: ajkr

Differential Revision: D42860994

Pulled By: cbi42

fbshipit-source-id: 3f0c1c9c6481a7f468bf79d823998907a8116e9e
2023-03-03 12:17:30 -08:00
Yu Zhang 8dfcfd4e90 Fix backward iteration issue when user defined timestamp is enabled in BlobDB (#11258)
Summary:
During backward iteration, blob verification would fail because the user key (ts included) in `saved_key_` doesn't match the blob. This happens because during`FindValueForCurrentKey`, `saved_key_` is not updated when the user key(ts not included) is the same for all cases except when `timestamp_lb_` is specified. This breaks the blob verification logic when user defined timestamp is enabled and `timestamp_lb_` is not specified. Fix this by always updating `saved_key_` when a smaller user key (ts included) is seen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11258

Test Plan:
`make check`
`./db_blob_basic_test --gtest_filter=DBBlobWithTimestampTest.IterateBlobs`

Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced with:

`./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5`

Baseline:

- seekrandom [AVG    6 runs] : 72188 (± 1481) ops/sec;   37.2 (± 0.8) MB/sec

With this PR:

- seekrandom [AVG    6 runs] : 74171 (± 1427) ops/sec;   38.2 (± 0.7) MB/sec

Reviewed By: ltamasi

Differential Revision: D43675642

Pulled By: jowlyzhang

fbshipit-source-id: 8022ae8522d1f66548821855e6eed63640c14e04
2023-03-01 13:28:54 -08:00
Levi Tamasi 3c9eed688e Enable moving a string or PinnableSlice into PinnableWideColumns (#11248)
Summary:
This makes it possible to eliminate some copies in `GetEntity` / `MultiGetEntity`,
in particular when `Merge`s or blobs are involved.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11248

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43544215

Pulled By: ltamasi

fbshipit-source-id: bc4c8955a24bbd8bc4ab098e72133ead757f9707
2023-02-24 10:33:00 -08:00
Yu Zhang f007b8fdea Support iter_start_ts in integrated BlobDB (#11244)
Summary:
Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244

Test Plan:
```make check
./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs"
tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1```

Reviewed By: ltamasi

Differential Revision: D43506726

Pulled By: jowlyzhang

fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0
2023-02-22 15:44:59 -08:00