Commit Graph

5511 Commits

Author SHA1 Message Date
anand76 95b41eec6d Fix potential incorrect result for duplicate key in MultiGet (#12295)
Summary:
The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met -
1. Duplicate keys in a MultiGet batch
2. Key matches the last key in a non-zero, non-bottommost level file
3. Final value is not in the file (merge operand, not snapshot visible etc)
4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block
5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates
6. Value or merge operand for the key is present in the very next level

The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under the  conditions https://github.com/facebook/rocksdb/issues/1 and https://github.com/facebook/rocksdb/issues/2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295

Test Plan: Add unit tests that fail an assertion or return wrong result without the fix

Reviewed By: hx235

Differential Revision: D53187634

Pulled By: anand1976

fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea
2024-02-02 11:48:35 -08:00
Andrew Kryczka f9d45358ca Removed `check_flush_compaction_key_order` (#12311)
Summary:
`check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311

Reviewed By: cbi42

Differential Revision: D53233379

Pulled By: ajkr

fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637
2024-01-31 16:30:26 -08:00
Peter Dillinger 76c834e441 Remove 'virtual' when implied by 'override' (#12319)
Summary:
... to follow modern C++ style / idioms.

Used this hack:
```
for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319

Test Plan: existing tests, CI

Reviewed By: jaykorean

Differential Revision: D53275303

Pulled By: pdillinger

fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377
2024-01-31 13:14:42 -08:00
Yu Zhang d11584e42e Be consistent in key range overlap check (#12315)
Summary:
We should be consistent in how we check key range overlap in memtables and in sst files. While all the sst file key range overlap check compares the user key without timestamp, for example:
377eee77f8/db/version_set.cc (L129-L130)

This key range overlap check for memtable is comparing the whole user key. Currently it happen to achieve the same effect because this function is only called by `ExternalSstFileIngestionJob` and `DBImpl::CompactRange`, which takes a user key without timestamp as the range end, pad a max or min timestamp to it depending on whether the end is exclusive. So use `Compartor::Compare` here is working too, but we should update it to `Comparator::CompareWithoutTimestamp` to be consistent with all the other file key range overlapping check functions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12315

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D53273456

Pulled By: jowlyzhang

fbshipit-source-id: c094ae1f0c195d52542124c4fb03fdca14241e85
2024-01-31 11:12:52 -08:00
Peter Dillinger 2b4245559c Don't warn on (recursive) disable file deletion (#12310)
Summary:
To stop spamming our warning logs with normal behavior.

Also fix comment on `DisableFileDeletions()`.

In response to https://github.com/facebook/rocksdb/issues/12001 I've indicated my objection to granting legitimacy to force=true, but I'm not addressing that here and now. In short, the user shouldn't be asked to think about whether they want to use the *wrong* behavior. ;)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12310

Test Plan: existing tests

Reviewed By: jowlyzhang

Differential Revision: D53233117

Pulled By: pdillinger

fbshipit-source-id: 5d2aedb76b02b30f8a5fa5b436fc57fde5d40d6e
2024-01-30 11:58:31 -08:00
Andrew Kryczka aacf60dda2 Speedup based on number of files marked for compaction (#12306)
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. This PR adds pressure detection based on the number of files marked for compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12306

Reviewed By: cbi42

Differential Revision: D53200559

Pulled By: ajkr

fbshipit-source-id: 63402ee336881a4539204d255960f04338ab7a0e
2024-01-29 17:29:04 -08:00
Peter Dillinger 61ed0de600 Add more detail to some statuses (#12307)
Summary:
and also fix comment/label on some MacOS CI jobs. Motivated by a crash test failure missing a definitive indicator of the genesis of the status:

```
file ingestion error: Operation failed. Try again.:
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12307

Test Plan: just cosmetic changes. These statuses should not arise frequently enough to be a performance issue (copying messages).

Reviewed By: jaykorean

Differential Revision: D53199529

Pulled By: pdillinger

fbshipit-source-id: ad83daaa5d80f75c9f81158e90fb6d9ecca33fe3
2024-01-29 16:31:09 -08:00
Yu Zhang 17042a3fb7 Remove misspelled tickers used in error handler (#12302)
Summary:
As titled, the replacement tickers have been introduced in https://github.com/facebook/rocksdb/issues/11509  and in use since release 8.4. This PR completely removes the misspelled ones.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12302

Test Plan: CI tests

Reviewed By: jaykorean

Differential Revision: D53196935

Pulled By: jowlyzhang

fbshipit-source-id: 9c9d0d321247690db5edfdc52b4fecb2f1218979
2024-01-29 15:28:37 -08:00
Chdy fc48af33f5 fix some perf statistic in write (#12285)
Summary:
### Summary:  perf context lack statistics in some write steps
```
rocksdb::get_perf_context()->write_wal_time);
rocksdb::get_perf_context()->write_memtable_time);
rocksdb::get_perf_context()->write_pre_and_post_process_time);
```

#### case 1:
when the unordered_write is true, the `write_memtable_time` is 0
```
write_wal_time : 13.7012
write_memtable_time : 0
write_pre_and_post_process_time : 142.037
```

Reason: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` during insert memtable,

```c++
Status DBImpl::UnorderedWriteMemtable(const WriteOptions& write_options,
                                      WriteBatch* my_batch,
                                      WriteCallback* callback, uint64_t log_ref,
                                      SequenceNumber seq,
                                      const size_t sub_batch_cnt) {
	...
  if (w.CheckCallback(this) && w.ShouldWriteToMemtable()) {

    // need calculate write_memtable_time
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        seq_per_batch_, sub_batch_cnt, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);
    if (write_options.disableWAL) {
      has_unpersisted_data_.store(true, std::memory_order_relaxed);
    }
  }
	...
}
```
Fix: add perf function
```
write_wal_time : 14.3991
write_memtable_time : 19.3367
write_pre_and_post_process_time : 130.441
```

#### case 2:
when the enable_pipelined_write is true, the `write_memtable_time` is small
```
write_wal_time : 11.2986
write_memtable_time : 1.0205
write_pre_and_post_process_time : 140.131
```

Fix: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` when `w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER`
```c++
Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
                                  WriteBatch* my_batch, WriteCallback* callback,
                                  uint64_t* log_used, uint64_t log_ref,
                                  bool disable_memtable, uint64_t* seq_used) {
  ...
  if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
    // need calculate write_memtable_time

    assert(w.ShouldWriteToMemtable());
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        false /*seq_per_batch*/, 0 /*batch_cnt*/, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);

    if (write_thread_.CompleteParallelMemTableWriter(&w)) {
      MemTableInsertStatusCheck(w.status);
      versions_->SetLastSequence(w.write_group->last_sequence);
      write_thread_.ExitAsMemTableWriter(&w, *w.write_group);
    }
  }
  if (seq_used != nullptr) {
    *seq_used = w.sequence;
  }

  assert(w.state == WriteThread::STATE_COMPLETED);
  return w.FinalStatus();
}
```
FIx: add perf function
```
write_wal_time : 10.5201
write_memtable_time : 17.1048
write_pre_and_post_process_time : 114.313
```

#### case3:
`DBImpl::WriteImplWALOnly` function has no statistical `write_delay_time`
```c++
Status DBImpl::WriteImplWALOnly(
    WriteThread* write_thread, const WriteOptions& write_options,
    WriteBatch* my_batch, WriteCallback* callback, uint64_t* log_used,
    const uint64_t log_ref, uint64_t* seq_used, const size_t sub_batch_cnt,
    PreReleaseCallback* pre_release_callback, const AssignOrder assign_order,
    const PublishLastSeq publish_last_seq, const bool disable_memtable) {
 ...
  if (publish_last_seq == kDoPublishLastSeq) {

  } else {
    // need calculate write_delay_time
    InstrumentedMutexLock lock(&mutex_);
    Status status =
        DelayWrite(/*num_bytes=*/0ull, *write_thread, write_options);
    if (!status.ok()) {
      WriteThread::WriteGroup write_group;
      write_thread->EnterAsBatchGroupLeader(&w, &write_group);
      write_thread->ExitAsBatchGroupLeader(write_group, status);
      return status;
    }
  }
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12285

Reviewed By: ajkr

Differential Revision: D53191765

Pulled By: cbi42

fbshipit-source-id: f78d5b280bea6a777f077c89c3e0b8fe98d3c860
2024-01-29 12:31:11 -08:00
Yu Zhang 071a146fa0 Add support for range deletion when user timestamps are not persisted (#12254)
Summary:
For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes:

1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from an end key of a range tombstone. There are places where it's logically removed (replaced with a min timestamp) because there is still logic with the running comparator that expects a user key that contains timestamp. And in the block based builder, it is eventually physically removed before persisted in a block.

2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`.

3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end. d69628e6ce/db/dbformat.h (L923-L935)
This max timestamp is removed when in memory `FileMetaData.largest` is persisted into Manifest, we pad it back when it's read from Manifest while handling related `VersionEdit` in `VersionEditHandler`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12254

Test Plan: Added unit test and enabled this feature combination's stress test.

Reviewed By: cbi42

Differential Revision: D52965527

Pulled By: jowlyzhang

fbshipit-source-id: e8315f8a2c5268e2ae0f7aec8012c266b86df985
2024-01-29 11:37:34 -08:00
Peter Dillinger 4e60663b31 Remove unnecessary, confusing 'extern' (#12300)
Summary:
In C++, `extern` is redundant in a number of cases:
* "Global" function declarations and definitions
* "Global" variable definitions when already declared `extern`

For consistency and simplicity, I've removed these in code that *we own*. In a couple of cases, I removed obsolete declarations, and for MagicNumber constants, I have consolidated the declarations into a header file (format.h)
as standard best practice would prescribe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12300

Test Plan: no functional changes, CI

Reviewed By: ajkr

Differential Revision: D53148629

Pulled By: pdillinger

fbshipit-source-id: fb8d927959892e03af09b0c0d542b0a3b38fd886
2024-01-29 10:38:08 -08:00
Changyu Bi 2233a2f4c0 Enhance corruption status message for record mismatch in compaction (#12297)
Summary:
... to include the actual numbers of processed and expected records, and the file number for input files. The purpose is to be able to find the offending files even when the relevant LOG file is gone.

Another change is to check the record count even when `compaction_verify_record_count` is false, and log a warning message without setting corruption status if there is a mismatch. This is consistent with how we check the record count for flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12297

Test Plan:
print the status message in `DBCompactionTest.VerifyRecordCount`
```
before
Corruption: Compaction number of input keys does not match number of keys processed.
after
Compaction number of input keys does not match number of keys processed. Expected 20 but processed 10. Compaction summary: Base version 4 Base level 0, inputs: [11(2156B) 9(2156B)]
```

Reviewed By: ajkr

Differential Revision: D53110130

Pulled By: cbi42

fbshipit-source-id: 6325cbfb8f71f25ce37f23f8277ebe9264863c3b
2024-01-26 09:12:07 -08:00
Peter Dillinger f046a8f617 Deflake ColumnFamilyTest.WriteStallSingleColumnFamily (#12294)
Summary:
https://github.com/facebook/rocksdb/issues/12267 apparently introduced a data race in test code where a background read of estimated_compaction_needed_bytes while holding the DB mutex could race with forground write for testing purposes. This change adds the DB mutex to those writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12294

Test Plan: 1000 TSAN runs of test (massively fails before change, passes after)

Reviewed By: ajkr

Differential Revision: D53095483

Pulled By: pdillinger

fbshipit-source-id: 13fcb383ebad313dabe39eb8f9085c34d370b54a
2024-01-25 14:40:18 -08:00
zaidoon c3bff1c02d Allow setting Stderr Logger via C API (#12262)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12262

Reviewed By: pdillinger

Differential Revision: D53027616

Pulled By: ajkr

fbshipit-source-id: 2e88e53e0c02447c613439f5528161ea1340b323
2024-01-25 12:36:40 -08:00
Hui Xiao 96fb7de3bc Rate-limit un-ratelimited flush/compaction code paths (#12290)
Summary:
**Context/Summary:**

We recently found out some code paths in flush and compaction aren't rate-limited when they should.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12290

Test Plan: existing UT**

Reviewed By: anand1976

Differential Revision: D53066103

Pulled By: hx235

fbshipit-source-id: 9dc4cab5f841230d18e5504dc480ac523e9d3950
2024-01-25 12:00:15 -08:00
Peter Dillinger d895eb08b3 Fix UB/crash in new SeqnoToTimeMapping::CopyFromSeqnoRange (#12293)
Summary:
After https://github.com/facebook/rocksdb/issues/12253 this function has crashed in the crash test, in its call to `std::copy`. I haven't reproduced the crash directly, but `std::copy` probably has undefined behavior if the starting iterator is after the ending iterator, which was possible. I've fixed the logic to deal with that case and to add an assertion to check that precondition of `std::copy` (which appears can be unchecked by `std::copy` itself even with UBSAN+ASAN).

Also added some unit tests etc. that were unfinished for https://github.com/facebook/rocksdb/issues/12253, and slightly tweak SeqnoToTimeMapping::EnforceMaxTimeSpan handling of zero time span case.

This is intended for patching 8.11.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12293

Test Plan: tests added. Will trigger ~20 runs of the crash test job that saw the crash. https://fburl.com/ci/5iiizvfa

Reviewed By: jowlyzhang

Differential Revision: D53090422

Pulled By: pdillinger

fbshipit-source-id: 69d60b1847d9c7e4ae62b153011c2040405db461
2024-01-25 11:27:15 -08:00
Changyu Bi 3812a77771 Deflake `DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit` (#12289)
Summary:
The test has been failing with
```
[ RUN      ] DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit
db/db_compaction_test.cc:9661: Failure
Expected equality of these values:
  0u
    Which is: 0
  env_->GetThreadPoolQueueLen(Env::Priority::LOW)
    Which is: 1
```
This can happen when thread pool queue len is checked before `test::SleepingBackgroundTask::DoSleepTask` is scheduled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12289

Reviewed By: ajkr

Differential Revision: D53064300

Pulled By: cbi42

fbshipit-source-id: 9ed1b714243880f82bd1cc1584b402ac9cf57507
2024-01-25 10:37:11 -08:00
Yu Zhang 928aca835f Skip searching through lsm tree for a target level when files overlap (#12284)
Summary:
While ingesting multiple external files with key range overlap, current flow go through the lsm tree to do a search for a target level and later discard that result by defaulting back to L0. This PR improves this by just skip the search altogether.

The other change is to remove default to L0 for the combination of universal compaction + force global sequence number, which was initially added to meet a pre https://github.com/facebook/rocksdb/issues/7421  invariant.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12284

Test Plan:
Added unit test:
./external_sst_file_test --gtest_filter="*IngestFileWithGlobalSeqnoAssignedUniversal*"

Reviewed By: ajkr

Differential Revision: D53072238

Pulled By: jowlyzhang

fbshipit-source-id: 30943e2e284a7f23b495c0ea4c80cb166a34a8ac
2024-01-24 23:30:08 -08:00
Hui Xiao 1b2b16b38e Fix bug of newer ingested data assigned with an older seqno (#12257)
Summary:
**Context:**
We found an edge case where newer ingested data is assigned with an older seqno. This causes older data of that key to be returned for read.

Consider the following lsm shape:
![image](https://github.com/facebook/rocksdb/assets/83968999/973fd160-5065-49cd-8b7b-b6ab4badae23)
Then ingest a file to L5 containing new data of key_overlap. Because of [this](https://l.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fblob%2F5a26f392ca640818da0b8590be6119699e852b07%2Fdb%2Fexternal_sst_file_ingestion_job.cc%3Ffbclid%3DIwAR10clXxpUSrt6sYg12sUMeHfShS7XigFrsJHvZoUDroQpbj_Sb3dG_JZFc%23L951-L956&h=AT0m56P7O0ZML7jk1sdjgnZZyGPMXg9HkKvBEb8mE9ZM3fpJjPrArAMsaHWZQPt9Ki-Pn7lv7x-RT9NEd_202Y6D2juIVHOIt3EjCZptDKBLRBMG49F8iBUSM9ypiKe8XCfM-FNW2Hl4KbVq2e3nZRbMvUM), the file is assigned with seqno 2, older than the old data's seqno 4. After just another compaction, we will drop the new_v for key_overlap because of the seqno and cause older data to be returned.
![image](https://github.com/facebook/rocksdb/assets/83968999/a3ef95e4-e7ae-4c30-8d03-955cd4b5ed42)

**Summary:**
This PR removes the incorrect seqno assignment

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12257

Test Plan:
- New unit test failed before the fix but passes after
- python3 tools/db_crashtest.py --compaction_style=1 --ingest_external_file_one_in=10 --preclude_last_level_data_seconds=36000 --compact_files_one_in=10 --enable_blob_files=0 blackbox`
- Rehearsal stress test

Reviewed By: cbi42

Differential Revision: D52926092

Pulled By: hx235

fbshipit-source-id: 9e4dade0f6cc44e548db8fca27ccbc81a621cd6f
2024-01-24 11:21:05 -08:00
Peter Dillinger b31f3245f1 Fix flaky test shutdown race in seqno_time_test (#12282)
Summary:
Seen in build-macos-cmake:

```
Received signal 11 (Segmentation fault: 11)
	https://github.com/facebook/rocksdb/issues/1   rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0::operator()(void*) const (in seqno_time_test) (mock_time_env.cc:29)
	https://github.com/facebook/rocksdb/issues/2   decltype(std::declval<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&>()(std::declval<void*>())) std::__1::__invoke[abi:v15006]<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&, void*>(rocksdb::MockSystemClock::InstallTimedWait	ixCallback()::$_0&, void*&&) (in seqno_time_test) (invoke.h:394)
...
```

This is presumably because the std::function from the lambda only saves a copy of the SeqnoTimeTest* this pointer, which doesn't prevent it from being reclaimed on parallel shutdown. If we instead save a copy of the `std::shared_ptr<MockSystemClock>` in the std::function, this should prevent the crash. (Note that in `SyncPoint::Data::Process()` copies the std::function before releasing the mutex for calling the callback.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12282

Test Plan: watch CI

Reviewed By: cbi42

Differential Revision: D53027136

Pulled By: pdillinger

fbshipit-source-id: 26cd9c0352541d806d42bb061dd349d3b47171a5
2024-01-24 10:14:22 -08:00
Richard Barnes 3079a7e7c2 Remove extra semi colon from internal_repo_rocksdb/repo/db/internal_stats.h (#12278)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12278

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: jaykorean

Differential Revision: D52969116

fbshipit-source-id: 8cb28dafdbede54e8cb59c2b8d461b1eddb3de68
2024-01-24 07:22:10 -08:00
Changyu Bi 3ef9092487 Print additional information when flaky test DBTestWithParam.ThreadStatusSingleCompaction fails (#12268)
Summary:
The test is [flaky](https://github.com/facebook/rocksdb/actions/runs/7616272304/job/20742657041?pr=12257&fbclid=IwAR1vNI1rSRVKnOsXs0WCPklqTkBXxlwS1GMJgWWe7D8dtAvh6e6wxk067FY) but I could not reproduce the test failure. Add some debug print to make the next failure more helpful

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12268

Test Plan:
```
check print works when test fails:
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 6134067200, thread status:
thread id: 6133493760, thread status: Compaction
db/db_test.cc:4680: Failure
Expected equality of these values:
  op_count
    Which is: 1
  expected_count
    Which is: 0
```

Reviewed By: hx235

Differential Revision: D52987503

Pulled By: cbi42

fbshipit-source-id: 33b369796f9b97155578b45167e722ddcde93594
2024-01-23 10:07:06 -08:00
Andrew Kryczka 7fe93162c5 Log pending compaction bytes in a couple places (#12267)
Summary:
This PR adds estimated pending compaction bytes in two places:

- The "Level summary", which is printed to the info LOG after every flush or compaction
- The "rocksdb.cfstats" property, which is printed to the info LOG periodically according to `stats_dump_period_sec`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12267

Test Plan:
Ran `./db_bench -benchmarks=filluniquerandom -stats_dump_period_sec=1 -statistics=true -write_buffer_size=524288` and looked at the LOG.

```
** Compaction Stats [default] **
...
Estimated pending compaction bytes: 12117691
...
2024/01/22-13:15:12.283563 1572872 (Original Log Time 2024/01/22-13:15:12.283540) [/db_impl/db_impl_compaction_flush.cc:371] [default] Level summary: files[10 1 0 0 0 0 0] max score 0.50, estimated pending compaction bytes 12359137
```

Reviewed By: cbi42

Differential Revision: D52973337

Pulled By: ajkr

fbshipit-source-id: c4e546bd9bdac387eebeeba303d04125212037b8
2024-01-23 09:14:59 -08:00
Yu Zhang ef342246dc Consolidate stats recording in error handler (#11992)
Summary:
This is a non functional refactor, mostly for deduplicating the stats recording logic in error handler. Plus some documentation update and simple code dedupe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11992

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D52967713

Pulled By: jowlyzhang

fbshipit-source-id: d584eae1a06410438f5a4c59c2cb67666ea7de1a
2024-01-22 14:57:30 -08:00
zaidoon e572ae9f57 expose mode option to Rate Limiter via C API (#12259)
Summary:
addresses https://github.com/facebook/rocksdb/issues/12220 to allow rate limiting compaction but not flushes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12259

Reviewed By: jaykorean

Differential Revision: D52965342

Pulled By: ajkr

fbshipit-source-id: 38566d9ac75c932c63e10cc53796fab0e46e3b2e
2024-01-22 11:45:53 -08:00
Changyu Bi 4b684e96b7 Allow more intra-L0 compaction when L0 is small (#12214)
Summary:
introduce a new option `intra_l0_compaction_size` to allow more intra-L0 compaction when total L0 size is under a threshold. This option applies only to leveled compaction. It is enabled by default and set to `max_bytes_for_level_base / max_bytes_for_level_multiplier` only for atomic_flush users. When atomic_flush=true, it is more likely that some CF's total L0 size is small when it's eligible for compaction. This option aims to reduce write amplification in this case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12214

Test Plan:
- new unit test
- benchmark:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --write_buffer_size=51200 --max_bytes_for_level_base=5242880 --level0_file_num_compaction_trigger=4 --statistics=1

main:
fillrandom   :     234.499 micros/op 4264 ops/sec 234.499 seconds 1000000 operations;    0.5 MB/s
rocksdb.compact.read.bytes COUNT : 1490756235
rocksdb.compact.write.bytes COUNT : 1469056734
rocksdb.flush.write.bytes COUNT : 71099011

branch:
fillrandom   :     128.494 micros/op 7782 ops/sec 128.494 seconds 1000000 operations;    0.9 MB/s
rocksdb.compact.read.bytes COUNT : 807474156
rocksdb.compact.write.bytes COUNT : 781977610
rocksdb.flush.write.bytes COUNT : 71098785
```

Reviewed By: ajkr

Differential Revision: D52637771

Pulled By: cbi42

fbshipit-source-id: 4f2c7925d0c3a718635c948ea0d4981ed9fabec3
2024-01-22 10:23:57 -08:00
Peter Dillinger cb08a682d4 Fix/cleanup SeqnoToTimeMapping (#12253)
Summary:
The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.

* Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc.
  * Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
* Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
* Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
  * Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries.
  * Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.`
* Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
* Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.

Intended follow-up (me or others):
* Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
* The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253

Test Plan: unit tests updated

Reviewed By: jowlyzhang

Differential Revision: D52913733

Pulled By: pdillinger

fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593
2024-01-19 21:50:38 -08:00
Changyu Bi ec5b1be18d Deflake `PerfContextTest.CPUTimer` (#12252)
Summary:
We saw failures like
```
db/perf_context_test.cc:952: Failure
Expected: (next_count) > (count), actual: 26699 vs 26699
```
I can repro by running the test repeatedly and the test fails with different seek keys. So
the cause is likely not with Seek() implementation. I found that
`clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);` can return the same time when
called repeatedly. However, I don't know if Seek() is fast enough that this happened during
continuous test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12252

Test Plan: `gtest_parallel.py --repeat=10000 --workers=1 ./perf_context_test --gtest_filter="PerfContextTest.CPUTimer"`

Reviewed By: ajkr

Differential Revision: D52912751

Pulled By: cbi42

fbshipit-source-id: 8985ae93baa99cdf4b9136ea38addd2e41f4b202
2024-01-19 10:13:52 -08:00
anand76 65e162bf09 Add some asserts in FilePickerMultiGet for debugging (#12241)
Summary:
Add asserts to help debug a crash test failure. The test fails as wollows -
```rocksdb::FilePickerMultiGet::PrepareNextLevel(): Assertion `fp_ctx.search_right_bound == -1 || fp_ctx.search_right_bound == FileIndexer::kLevelMaxIndex' failed```

Also add a unit test to verify an edge case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12241

Reviewed By: cbi42

Differential Revision: D52819029

Pulled By: anand1976

fbshipit-source-id: 33316985c8ace1aed9ecc2400da8b777aec488ff
2024-01-16 17:08:58 -08:00
akankshamahajan cad76a2e1e Fix bug in auto_readahead_size that returned wrong key (#12229)
Summary:
IndexType::kBinarySearchWithFirstKey +
BlockCacheLookupForReadAheadSize enabled => FindNextUserEntryInternal assertion fails or iterator lands at a wrong key because BlockCacheLookupForReadAheadSize moves the index_iter_ and in internal_wrapper.h, result_.key didn't update and pointed to wrong key. Also ikey_ was also pointing to iter_.key() instead of copying the key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12229

Test Plan:
```
 rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
mkdir /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
./db_stress -threads=1 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_setting_blob_options_dynamically=0 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=0 --backup_one_in=0 --batch_protection_bytes_per_key=0 --blob_cache_size=0 --blob_compaction_readahead_size=0 --blob_compression_type=lz4 --blob_file_size=0 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0 --blob_garbage_collection_force_threshold=0 --block_protection_bytes_per_key=0 --block_size=2048 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=0 --compaction_ttl=0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 --db_write_buffer_size=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=3 --ingest_external_file_one_in=10 --initial_auto_readahead_size=0 --iterpercent=55 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_blob_size=8 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=10000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=1 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --snapshot_hold_ops=0 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_blob_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=0 --verify_checksum_one_in=0 --verify_db_one_in=0 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=1 --verify_sst_unique_id_in_manifest=0 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=0 > repro.out
Verification failed. Expected state has key 0000000000000077000000000000004178, iterator is at key 0000000000000077000000000000008A78
Column family: default, op_logs: S 0000000000000077000000000000003D7878787878 NNNN
No writes or ops?
Verification failed :(
```

Reviewed By: ajkr

Differential Revision: D52710655

Pulled By: akankshamahajan15

fbshipit-source-id: 9d2e684e190fb0832bdce3337bce1c6548cd054d
2024-01-16 11:30:36 -08:00
Jonah Gao e28251ca72 Fix blob files not reclaimed after deleting all SSTs (#12235)
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/12208.

After all the SSTs have been deleted, all the blob files will become unreferenced.
These files should be considered obsolete and thus, should not be saved to the vstorage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12235

Reviewed By: jowlyzhang

Differential Revision: D52806441

Pulled By: ltamasi

fbshipit-source-id: 62f94d4f2544ed2822c764d8ace5bf7f57efe42d
2024-01-16 11:15:23 -08:00
Andrew Kryczka 2dda7a0dd2 Detect compaction pressure at lower debt ratios (#12236)
Summary:
This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236

Reviewed By: cbi42

Differential Revision: D52765685

Pulled By: ajkr

fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4
2024-01-15 22:41:18 -08:00
马越 1a1f9f1660 Fix the compactRange with wrong cf handle when ClipColumnFamily (#12219)
Summary:
- **Context**:

In ClipColumnFamily, the DeleteRange API will be used to delete data, and then CompactRange will be called for physical deletion. But now However, the ColumnFamilyHandle is not passed , so by default only the DefaultColumnFamily will be CompactRanged. Therefore, it may cause that the data in some sst files of CompactionRange cannot be physically deleted.

- **In this change**

Pass the ColumnFamilyHandle when call CompactRange

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12219

Reviewed By: ajkr

Differential Revision: D52665162

Pulled By: cbi42

fbshipit-source-id: e8e997aa25ec4ca40e347be89edc7e84a7a0edce
2024-01-10 14:34:12 -08:00
Andrew Kryczka 5a9ecf6614 Automated modernization (#12210)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12210

Reviewed By: hx235

Differential Revision: D52559771

Pulled By: ajkr

fbshipit-source-id: 1ccdd3a0180cc02bc0441f20b0e4a1db50841b03
2024-01-05 11:53:57 -08:00
akankshamahajan 5cb2d09d47 Refactor FilePrefetchBuffer code (#12097)
Summary:
Summary - Refactor FilePrefetchBuffer code
- Implementation:
FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_.

 If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data.
If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer.
If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed.
If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to
the caller.

- Merged Sync and Async code flow into one in FilePrefetchBuffer.

Test Plan -
- Crash test passed
- Unit tests
- Pending - Benchmarks

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097

Reviewed By: ajkr

Differential Revision: D51759552

Pulled By: akankshamahajan15

fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5
2024-01-05 09:29:01 -08:00
Hui Xiao 81b6296c7e Pass flush IO activity enum in FlushJob::MaybeIncreaseFullHistoryTsLowToAboveCutoffUDT...() (#12197)
Summary:
**Context/Summary:** as titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12197

Test Plan:
```
./db_stress --acquire_snapshot_one_in=100 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=4.393039399748979 --bottommost_compression_type=disable --bottommost_file_compaction_delay=86400 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_ttl=0 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --persist_user_defined_timestamps=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=55 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=3 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_txn=0 --use_write_buffer_manager=0 --user_timestamp_size=8 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=10000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=128 --writepercent=35
```

Before fix:
```
db_stress_tool/db_stress_env_wrapper.h:92: virtual rocksdb::IOStatus rocksdb::DbStressWritableFileWrapper::Append(const rocksdb::Slice &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `io_activity == Env::IOActivity::kUnknown || io_activity == options.io_activity' failed.
```

After fix:
Succeed

Reviewed By: ajkr

Differential Revision: D52492030

Pulled By: hx235

fbshipit-source-id: 842a0dcbdf135838b57ddb4a3a6f1effc8dd3e82
2024-01-02 17:33:00 -08:00
leipeng d411fc4dd6 column_family.cc: SanitizeOptions(dbo, cfo): WARN msg: add missing spaces (#12193)
Summary:
Fix for multi line strings missing spaces.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12193

Reviewed By: cbi42

Differential Revision: D52457430

Pulled By: ajkr

fbshipit-source-id: 4ca75a14e61c09819e5d821da6137f4536e9e76e
2024-01-02 11:18:11 -08:00
leipeng 906c6683ed InternalKey::Set: remove redundant assign (#12194)
Summary:
InternalKey::Set: remove redundant assign

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12194

Reviewed By: cbi42

Differential Revision: D52457542

Pulled By: ajkr

fbshipit-source-id: 329983a8734ff38ffd93018bbbe112b4a23b5c11
2024-01-02 11:17:39 -08:00
Hui Xiao 06e593376c Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.

For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS

Some related code refactory to make implementation cleaner:
- Blob stats
   - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS  include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
   - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
   - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
   - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see  https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
   - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority

## Test
### db bench

Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100

rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```

blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB

Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```

```
Stacked Blob DB

Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```

###  Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests

###  Performance

Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true

Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,

Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```

Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846

Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```

Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860

Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910

Reviewed By: ajkr

Differential Revision: D49788060

Pulled By: hx235

fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 15:29:23 -08:00
anand76 a036525809 Lightweight verification of MANIFEST file after close on shutdown (#12174)
Summary:
Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174

Test Plan: Unit test, and some manual verification

Reviewed By: ajkr

Differential Revision: D52451184

Pulled By: anand1976

fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83
2023-12-28 18:25:29 -08:00
hulk b7ecbe309d Trigger compaction to the next level if the data age exceeds periodic_compaction_seconds (#12175)
Summary:
Currently, the data are always compacted to the same level if exceed periodic_compaction_seconds which may confuse users, so we change it to allow trigger compaction to the next level here. It's a behavior change to users, and may affect users
who have disabled their ttl or ttl > periodic_compaction_seconds.

Relate issue: https://github.com/facebook/rocksdb/issues/12165

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175

Reviewed By: ajkr

Differential Revision: D52446722

Pulled By: cbi42

fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f
2023-12-28 12:50:08 -08:00
Changyu Bi 3d81f175b4 Prioritize marked file in level compaction (#12187)
Summary:
When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187

Test Plan: * New unit tests

Reviewed By: ajkr

Differential Revision: D52437194

Pulled By: cbi42

fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3
2023-12-28 10:28:37 -08:00
darionyaphet 01f2edd145 Replace push_back by emplace_back in wal manager (#10805)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10805

Reviewed By: ajkr

Differential Revision: D52424928

Pulled By: hx235

fbshipit-source-id: 548e3304ca721a3907be3696d12735929aca8490
2023-12-27 10:40:33 -08:00
Andrew Kryczka 4fefe1fed9 Downgrade warning for dynamic leveling with non-leveled compaction (#12186)
Summary:
Now that `level_compaction_dynamic_level_bytes`'s default value is true, users who do not touch that setting and use non-leveled compaction will also see this log message. It can be info level rather than warning since, in the case mentioned, there is nothing the user needs to be warned about.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12186

Reviewed By: cbi42

Differential Revision: D52422499

Pulled By: ajkr

fbshipit-source-id: 8dbfcd102aab671b881ba047fb4a0a555b3e0a78
2023-12-26 15:13:42 -08:00
Peter Dillinger a771a47a1b Fix leak or crash on failure in automatic atomic flush (#12176)
Summary:
Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test:

```
db_flush_test: db/column_family.cc:1648:
rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176

Test Plan: unit test added

Reviewed By: cbi42

Differential Revision: D52417071

Pulled By: pdillinger

fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5
2023-12-26 11:04:25 -08:00
zaidoon ad0362ac92 Expose Options::ttl through C API (#12170)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12170

Reviewed By: jaykorean

Differential Revision: D52378902

Pulled By: cbi42

fbshipit-source-id: 0bac94b8785d5149df86e7317e69c0e64beab887
2023-12-21 15:04:53 -08:00
anand76 cc069f25b3 Add some compressed and tiered secondary cache stats (#12150)
Summary:
Add statistics for more visibility.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12150

Reviewed By: akankshamahajan15

Differential Revision: D52184633

Pulled By: anand1976

fbshipit-source-id: 9969e05d65223811cd12627102b020bb6d229352
2023-12-15 11:34:08 -08:00
Akanksha Mahajan cd577f6059 Fix WRITE_STALL start_time (#12147)
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)
However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time);

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147

Test Plan: Existing CircleCI

Reviewed By: cbi42

Differential Revision: D52173481

Pulled By: akankshamahajan15

fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98
2023-12-14 13:45:06 -08:00
akankshamahajan d926593df5 Fix stress tests failure for auto_readahead_size (#12131)
Summary:
When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter  so that
- BlockBasedTableIterator can point index_iter_ to the right block.
- disable readahead_cache_lookup.
However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This more like a corner case.

So to handle that case, removed SeekForPrev calling from db_iter and reseeking index_iter_ in Prev operation. block_iter_'s key already point to right block. So reseeking to index_iter_ solves the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131

Test Plan:
- Tested on db_stress command that was failing -
`./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35`

 - make crash_test -j32

Reviewed By: anand1976

Differential Revision: D51986326

Pulled By: akankshamahajan15

fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9
2023-12-13 12:15:04 -08:00
Andrew Kryczka d8e47620d7 Speedup based on pending compaction bytes relative to data size (#12130)
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years.

This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130

Reviewed By: hx235

Differential Revision: D52000746

Pulled By: ajkr

fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7
2023-12-13 10:37:27 -08:00