Commit graph

5558 commits

Author SHA1 Message Date
Yu Zhang d11584e42e Be consistent in key range overlap check (#12315)
Summary:
We should be consistent in how we check key range overlap in memtables and in sst files. While all the sst file key range overlap check compares the user key without timestamp, for example:
377eee77f8/db/version_set.cc (L129-L130)

This key range overlap check for memtable is comparing the whole user key. Currently it happen to achieve the same effect because this function is only called by `ExternalSstFileIngestionJob` and `DBImpl::CompactRange`, which takes a user key without timestamp as the range end, pad a max or min timestamp to it depending on whether the end is exclusive. So use `Compartor::Compare` here is working too, but we should update it to `Comparator::CompareWithoutTimestamp` to be consistent with all the other file key range overlapping check functions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12315

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D53273456

Pulled By: jowlyzhang

fbshipit-source-id: c094ae1f0c195d52542124c4fb03fdca14241e85
2024-01-31 11:12:52 -08:00
Peter Dillinger 2b4245559c Don't warn on (recursive) disable file deletion (#12310)
Summary:
To stop spamming our warning logs with normal behavior.

Also fix comment on `DisableFileDeletions()`.

In response to https://github.com/facebook/rocksdb/issues/12001 I've indicated my objection to granting legitimacy to force=true, but I'm not addressing that here and now. In short, the user shouldn't be asked to think about whether they want to use the *wrong* behavior. ;)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12310

Test Plan: existing tests

Reviewed By: jowlyzhang

Differential Revision: D53233117

Pulled By: pdillinger

fbshipit-source-id: 5d2aedb76b02b30f8a5fa5b436fc57fde5d40d6e
2024-01-30 11:58:31 -08:00
Andrew Kryczka aacf60dda2 Speedup based on number of files marked for compaction (#12306)
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. This PR adds pressure detection based on the number of files marked for compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12306

Reviewed By: cbi42

Differential Revision: D53200559

Pulled By: ajkr

fbshipit-source-id: 63402ee336881a4539204d255960f04338ab7a0e
2024-01-29 17:29:04 -08:00
Peter Dillinger 61ed0de600 Add more detail to some statuses (#12307)
Summary:
and also fix comment/label on some MacOS CI jobs. Motivated by a crash test failure missing a definitive indicator of the genesis of the status:

```
file ingestion error: Operation failed. Try again.:
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12307

Test Plan: just cosmetic changes. These statuses should not arise frequently enough to be a performance issue (copying messages).

Reviewed By: jaykorean

Differential Revision: D53199529

Pulled By: pdillinger

fbshipit-source-id: ad83daaa5d80f75c9f81158e90fb6d9ecca33fe3
2024-01-29 16:31:09 -08:00
Yu Zhang 17042a3fb7 Remove misspelled tickers used in error handler (#12302)
Summary:
As titled, the replacement tickers have been introduced in https://github.com/facebook/rocksdb/issues/11509  and in use since release 8.4. This PR completely removes the misspelled ones.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12302

Test Plan: CI tests

Reviewed By: jaykorean

Differential Revision: D53196935

Pulled By: jowlyzhang

fbshipit-source-id: 9c9d0d321247690db5edfdc52b4fecb2f1218979
2024-01-29 15:28:37 -08:00
Chdy fc48af33f5 fix some perf statistic in write (#12285)
Summary:
### Summary:  perf context lack statistics in some write steps
```
rocksdb::get_perf_context()->write_wal_time);
rocksdb::get_perf_context()->write_memtable_time);
rocksdb::get_perf_context()->write_pre_and_post_process_time);
```

#### case 1:
when the unordered_write is true, the `write_memtable_time` is 0
```
write_wal_time : 13.7012
write_memtable_time : 0
write_pre_and_post_process_time : 142.037
```

Reason: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` during insert memtable,

```c++
Status DBImpl::UnorderedWriteMemtable(const WriteOptions& write_options,
                                      WriteBatch* my_batch,
                                      WriteCallback* callback, uint64_t log_ref,
                                      SequenceNumber seq,
                                      const size_t sub_batch_cnt) {
	...
  if (w.CheckCallback(this) && w.ShouldWriteToMemtable()) {

    // need calculate write_memtable_time
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        seq_per_batch_, sub_batch_cnt, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);
    if (write_options.disableWAL) {
      has_unpersisted_data_.store(true, std::memory_order_relaxed);
    }
  }
	...
}
```
Fix: add perf function
```
write_wal_time : 14.3991
write_memtable_time : 19.3367
write_pre_and_post_process_time : 130.441
```

#### case 2:
when the enable_pipelined_write is true, the `write_memtable_time` is small
```
write_wal_time : 11.2986
write_memtable_time : 1.0205
write_pre_and_post_process_time : 140.131
```

Fix: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` when `w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER`
```c++
Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
                                  WriteBatch* my_batch, WriteCallback* callback,
                                  uint64_t* log_used, uint64_t log_ref,
                                  bool disable_memtable, uint64_t* seq_used) {
  ...
  if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
    // need calculate write_memtable_time

    assert(w.ShouldWriteToMemtable());
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        false /*seq_per_batch*/, 0 /*batch_cnt*/, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);

    if (write_thread_.CompleteParallelMemTableWriter(&w)) {
      MemTableInsertStatusCheck(w.status);
      versions_->SetLastSequence(w.write_group->last_sequence);
      write_thread_.ExitAsMemTableWriter(&w, *w.write_group);
    }
  }
  if (seq_used != nullptr) {
    *seq_used = w.sequence;
  }

  assert(w.state == WriteThread::STATE_COMPLETED);
  return w.FinalStatus();
}
```
FIx: add perf function
```
write_wal_time : 10.5201
write_memtable_time : 17.1048
write_pre_and_post_process_time : 114.313
```

#### case3:
`DBImpl::WriteImplWALOnly` function has no statistical `write_delay_time`
```c++
Status DBImpl::WriteImplWALOnly(
    WriteThread* write_thread, const WriteOptions& write_options,
    WriteBatch* my_batch, WriteCallback* callback, uint64_t* log_used,
    const uint64_t log_ref, uint64_t* seq_used, const size_t sub_batch_cnt,
    PreReleaseCallback* pre_release_callback, const AssignOrder assign_order,
    const PublishLastSeq publish_last_seq, const bool disable_memtable) {
 ...
  if (publish_last_seq == kDoPublishLastSeq) {

  } else {
    // need calculate write_delay_time
    InstrumentedMutexLock lock(&mutex_);
    Status status =
        DelayWrite(/*num_bytes=*/0ull, *write_thread, write_options);
    if (!status.ok()) {
      WriteThread::WriteGroup write_group;
      write_thread->EnterAsBatchGroupLeader(&w, &write_group);
      write_thread->ExitAsBatchGroupLeader(write_group, status);
      return status;
    }
  }
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12285

Reviewed By: ajkr

Differential Revision: D53191765

Pulled By: cbi42

fbshipit-source-id: f78d5b280bea6a777f077c89c3e0b8fe98d3c860
2024-01-29 12:31:11 -08:00
Yu Zhang 071a146fa0 Add support for range deletion when user timestamps are not persisted (#12254)
Summary:
For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes:

1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from an end key of a range tombstone. There are places where it's logically removed (replaced with a min timestamp) because there is still logic with the running comparator that expects a user key that contains timestamp. And in the block based builder, it is eventually physically removed before persisted in a block.

2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`.

3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end. d69628e6ce/db/dbformat.h (L923-L935)
This max timestamp is removed when in memory `FileMetaData.largest` is persisted into Manifest, we pad it back when it's read from Manifest while handling related `VersionEdit` in `VersionEditHandler`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12254

Test Plan: Added unit test and enabled this feature combination's stress test.

Reviewed By: cbi42

Differential Revision: D52965527

Pulled By: jowlyzhang

fbshipit-source-id: e8315f8a2c5268e2ae0f7aec8012c266b86df985
2024-01-29 11:37:34 -08:00
Peter Dillinger 4e60663b31 Remove unnecessary, confusing 'extern' (#12300)
Summary:
In C++, `extern` is redundant in a number of cases:
* "Global" function declarations and definitions
* "Global" variable definitions when already declared `extern`

For consistency and simplicity, I've removed these in code that *we own*. In a couple of cases, I removed obsolete declarations, and for MagicNumber constants, I have consolidated the declarations into a header file (format.h)
as standard best practice would prescribe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12300

Test Plan: no functional changes, CI

Reviewed By: ajkr

Differential Revision: D53148629

Pulled By: pdillinger

fbshipit-source-id: fb8d927959892e03af09b0c0d542b0a3b38fd886
2024-01-29 10:38:08 -08:00
Changyu Bi 2233a2f4c0 Enhance corruption status message for record mismatch in compaction (#12297)
Summary:
... to include the actual numbers of processed and expected records, and the file number for input files. The purpose is to be able to find the offending files even when the relevant LOG file is gone.

Another change is to check the record count even when `compaction_verify_record_count` is false, and log a warning message without setting corruption status if there is a mismatch. This is consistent with how we check the record count for flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12297

Test Plan:
print the status message in `DBCompactionTest.VerifyRecordCount`
```
before
Corruption: Compaction number of input keys does not match number of keys processed.
after
Compaction number of input keys does not match number of keys processed. Expected 20 but processed 10. Compaction summary: Base version 4 Base level 0, inputs: [11(2156B) 9(2156B)]
```

Reviewed By: ajkr

Differential Revision: D53110130

Pulled By: cbi42

fbshipit-source-id: 6325cbfb8f71f25ce37f23f8277ebe9264863c3b
2024-01-26 09:12:07 -08:00
Peter Dillinger f046a8f617 Deflake ColumnFamilyTest.WriteStallSingleColumnFamily (#12294)
Summary:
https://github.com/facebook/rocksdb/issues/12267 apparently introduced a data race in test code where a background read of estimated_compaction_needed_bytes while holding the DB mutex could race with forground write for testing purposes. This change adds the DB mutex to those writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12294

Test Plan: 1000 TSAN runs of test (massively fails before change, passes after)

Reviewed By: ajkr

Differential Revision: D53095483

Pulled By: pdillinger

fbshipit-source-id: 13fcb383ebad313dabe39eb8f9085c34d370b54a
2024-01-25 14:40:18 -08:00
zaidoon c3bff1c02d Allow setting Stderr Logger via C API (#12262)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12262

Reviewed By: pdillinger

Differential Revision: D53027616

Pulled By: ajkr

fbshipit-source-id: 2e88e53e0c02447c613439f5528161ea1340b323
2024-01-25 12:36:40 -08:00
Hui Xiao 96fb7de3bc Rate-limit un-ratelimited flush/compaction code paths (#12290)
Summary:
**Context/Summary:**

We recently found out some code paths in flush and compaction aren't rate-limited when they should.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12290

Test Plan: existing UT**

Reviewed By: anand1976

Differential Revision: D53066103

Pulled By: hx235

fbshipit-source-id: 9dc4cab5f841230d18e5504dc480ac523e9d3950
2024-01-25 12:00:15 -08:00
Peter Dillinger d895eb08b3 Fix UB/crash in new SeqnoToTimeMapping::CopyFromSeqnoRange (#12293)
Summary:
After https://github.com/facebook/rocksdb/issues/12253 this function has crashed in the crash test, in its call to `std::copy`. I haven't reproduced the crash directly, but `std::copy` probably has undefined behavior if the starting iterator is after the ending iterator, which was possible. I've fixed the logic to deal with that case and to add an assertion to check that precondition of `std::copy` (which appears can be unchecked by `std::copy` itself even with UBSAN+ASAN).

Also added some unit tests etc. that were unfinished for https://github.com/facebook/rocksdb/issues/12253, and slightly tweak SeqnoToTimeMapping::EnforceMaxTimeSpan handling of zero time span case.

This is intended for patching 8.11.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12293

Test Plan: tests added. Will trigger ~20 runs of the crash test job that saw the crash. https://fburl.com/ci/5iiizvfa

Reviewed By: jowlyzhang

Differential Revision: D53090422

Pulled By: pdillinger

fbshipit-source-id: 69d60b1847d9c7e4ae62b153011c2040405db461
2024-01-25 11:27:15 -08:00
Changyu Bi 3812a77771 Deflake DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit (#12289)
Summary:
The test has been failing with
```
[ RUN      ] DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit
db/db_compaction_test.cc:9661: Failure
Expected equality of these values:
  0u
    Which is: 0
  env_->GetThreadPoolQueueLen(Env::Priority::LOW)
    Which is: 1
```
This can happen when thread pool queue len is checked before `test::SleepingBackgroundTask::DoSleepTask` is scheduled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12289

Reviewed By: ajkr

Differential Revision: D53064300

Pulled By: cbi42

fbshipit-source-id: 9ed1b714243880f82bd1cc1584b402ac9cf57507
2024-01-25 10:37:11 -08:00
Yu Zhang 928aca835f Skip searching through lsm tree for a target level when files overlap (#12284)
Summary:
While ingesting multiple external files with key range overlap, current flow go through the lsm tree to do a search for a target level and later discard that result by defaulting back to L0. This PR improves this by just skip the search altogether.

The other change is to remove default to L0 for the combination of universal compaction + force global sequence number, which was initially added to meet a pre https://github.com/facebook/rocksdb/issues/7421  invariant.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12284

Test Plan:
Added unit test:
./external_sst_file_test --gtest_filter="*IngestFileWithGlobalSeqnoAssignedUniversal*"

Reviewed By: ajkr

Differential Revision: D53072238

Pulled By: jowlyzhang

fbshipit-source-id: 30943e2e284a7f23b495c0ea4c80cb166a34a8ac
2024-01-24 23:30:08 -08:00
Hui Xiao 1b2b16b38e Fix bug of newer ingested data assigned with an older seqno (#12257)
Summary:
**Context:**
We found an edge case where newer ingested data is assigned with an older seqno. This causes older data of that key to be returned for read.

Consider the following lsm shape:
![image](https://github.com/facebook/rocksdb/assets/83968999/973fd160-5065-49cd-8b7b-b6ab4badae23)
Then ingest a file to L5 containing new data of key_overlap. Because of [this](https://l.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fblob%2F5a26f392ca640818da0b8590be6119699e852b07%2Fdb%2Fexternal_sst_file_ingestion_job.cc%3Ffbclid%3DIwAR10clXxpUSrt6sYg12sUMeHfShS7XigFrsJHvZoUDroQpbj_Sb3dG_JZFc%23L951-L956&h=AT0m56P7O0ZML7jk1sdjgnZZyGPMXg9HkKvBEb8mE9ZM3fpJjPrArAMsaHWZQPt9Ki-Pn7lv7x-RT9NEd_202Y6D2juIVHOIt3EjCZptDKBLRBMG49F8iBUSM9ypiKe8XCfM-FNW2Hl4KbVq2e3nZRbMvUM), the file is assigned with seqno 2, older than the old data's seqno 4. After just another compaction, we will drop the new_v for key_overlap because of the seqno and cause older data to be returned.
![image](https://github.com/facebook/rocksdb/assets/83968999/a3ef95e4-e7ae-4c30-8d03-955cd4b5ed42)

**Summary:**
This PR removes the incorrect seqno assignment

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12257

Test Plan:
- New unit test failed before the fix but passes after
- python3 tools/db_crashtest.py --compaction_style=1 --ingest_external_file_one_in=10 --preclude_last_level_data_seconds=36000 --compact_files_one_in=10 --enable_blob_files=0 blackbox`
- Rehearsal stress test

Reviewed By: cbi42

Differential Revision: D52926092

Pulled By: hx235

fbshipit-source-id: 9e4dade0f6cc44e548db8fca27ccbc81a621cd6f
2024-01-24 11:21:05 -08:00
Peter Dillinger b31f3245f1 Fix flaky test shutdown race in seqno_time_test (#12282)
Summary:
Seen in build-macos-cmake:

```
Received signal 11 (Segmentation fault: 11)
	https://github.com/facebook/rocksdb/issues/1   rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0::operator()(void*) const (in seqno_time_test) (mock_time_env.cc:29)
	https://github.com/facebook/rocksdb/issues/2   decltype(std::declval<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&>()(std::declval<void*>())) std::__1::__invoke[abi:v15006]<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&, void*>(rocksdb::MockSystemClock::InstallTimedWait	ixCallback()::$_0&, void*&&) (in seqno_time_test) (invoke.h:394)
...
```

This is presumably because the std::function from the lambda only saves a copy of the SeqnoTimeTest* this pointer, which doesn't prevent it from being reclaimed on parallel shutdown. If we instead save a copy of the `std::shared_ptr<MockSystemClock>` in the std::function, this should prevent the crash. (Note that in `SyncPoint::Data::Process()` copies the std::function before releasing the mutex for calling the callback.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12282

Test Plan: watch CI

Reviewed By: cbi42

Differential Revision: D53027136

Pulled By: pdillinger

fbshipit-source-id: 26cd9c0352541d806d42bb061dd349d3b47171a5
2024-01-24 10:14:22 -08:00
Richard Barnes 3079a7e7c2 Remove extra semi colon from internal_repo_rocksdb/repo/db/internal_stats.h (#12278)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12278

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: jaykorean

Differential Revision: D52969116

fbshipit-source-id: 8cb28dafdbede54e8cb59c2b8d461b1eddb3de68
2024-01-24 07:22:10 -08:00
Changyu Bi 3ef9092487 Print additional information when flaky test DBTestWithParam.ThreadStatusSingleCompaction fails (#12268)
Summary:
The test is [flaky](https://github.com/facebook/rocksdb/actions/runs/7616272304/job/20742657041?pr=12257&fbclid=IwAR1vNI1rSRVKnOsXs0WCPklqTkBXxlwS1GMJgWWe7D8dtAvh6e6wxk067FY) but I could not reproduce the test failure. Add some debug print to make the next failure more helpful

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12268

Test Plan:
```
check print works when test fails:
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 6134067200, thread status:
thread id: 6133493760, thread status: Compaction
db/db_test.cc:4680: Failure
Expected equality of these values:
  op_count
    Which is: 1
  expected_count
    Which is: 0
```

Reviewed By: hx235

Differential Revision: D52987503

Pulled By: cbi42

fbshipit-source-id: 33b369796f9b97155578b45167e722ddcde93594
2024-01-23 10:07:06 -08:00
Andrew Kryczka 7fe93162c5 Log pending compaction bytes in a couple places (#12267)
Summary:
This PR adds estimated pending compaction bytes in two places:

- The "Level summary", which is printed to the info LOG after every flush or compaction
- The "rocksdb.cfstats" property, which is printed to the info LOG periodically according to `stats_dump_period_sec`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12267

Test Plan:
Ran `./db_bench -benchmarks=filluniquerandom -stats_dump_period_sec=1 -statistics=true -write_buffer_size=524288` and looked at the LOG.

```
** Compaction Stats [default] **
...
Estimated pending compaction bytes: 12117691
...
2024/01/22-13:15:12.283563 1572872 (Original Log Time 2024/01/22-13:15:12.283540) [/db_impl/db_impl_compaction_flush.cc:371] [default] Level summary: files[10 1 0 0 0 0 0] max score 0.50, estimated pending compaction bytes 12359137
```

Reviewed By: cbi42

Differential Revision: D52973337

Pulled By: ajkr

fbshipit-source-id: c4e546bd9bdac387eebeeba303d04125212037b8
2024-01-23 09:14:59 -08:00
Yu Zhang ef342246dc Consolidate stats recording in error handler (#11992)
Summary:
This is a non functional refactor, mostly for deduplicating the stats recording logic in error handler. Plus some documentation update and simple code dedupe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11992

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D52967713

Pulled By: jowlyzhang

fbshipit-source-id: d584eae1a06410438f5a4c59c2cb67666ea7de1a
2024-01-22 14:57:30 -08:00
zaidoon e572ae9f57 expose mode option to Rate Limiter via C API (#12259)
Summary:
addresses https://github.com/facebook/rocksdb/issues/12220 to allow rate limiting compaction but not flushes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12259

Reviewed By: jaykorean

Differential Revision: D52965342

Pulled By: ajkr

fbshipit-source-id: 38566d9ac75c932c63e10cc53796fab0e46e3b2e
2024-01-22 11:45:53 -08:00
Changyu Bi 4b684e96b7 Allow more intra-L0 compaction when L0 is small (#12214)
Summary:
introduce a new option `intra_l0_compaction_size` to allow more intra-L0 compaction when total L0 size is under a threshold. This option applies only to leveled compaction. It is enabled by default and set to `max_bytes_for_level_base / max_bytes_for_level_multiplier` only for atomic_flush users. When atomic_flush=true, it is more likely that some CF's total L0 size is small when it's eligible for compaction. This option aims to reduce write amplification in this case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12214

Test Plan:
- new unit test
- benchmark:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --write_buffer_size=51200 --max_bytes_for_level_base=5242880 --level0_file_num_compaction_trigger=4 --statistics=1

main:
fillrandom   :     234.499 micros/op 4264 ops/sec 234.499 seconds 1000000 operations;    0.5 MB/s
rocksdb.compact.read.bytes COUNT : 1490756235
rocksdb.compact.write.bytes COUNT : 1469056734
rocksdb.flush.write.bytes COUNT : 71099011

branch:
fillrandom   :     128.494 micros/op 7782 ops/sec 128.494 seconds 1000000 operations;    0.9 MB/s
rocksdb.compact.read.bytes COUNT : 807474156
rocksdb.compact.write.bytes COUNT : 781977610
rocksdb.flush.write.bytes COUNT : 71098785
```

Reviewed By: ajkr

Differential Revision: D52637771

Pulled By: cbi42

fbshipit-source-id: 4f2c7925d0c3a718635c948ea0d4981ed9fabec3
2024-01-22 10:23:57 -08:00
Peter Dillinger cb08a682d4 Fix/cleanup SeqnoToTimeMapping (#12253)
Summary:
The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.

* Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc.
  * Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
* Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
* Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
  * Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries.
  * Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.`
* Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
* Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.

Intended follow-up (me or others):
* Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
* The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253

Test Plan: unit tests updated

Reviewed By: jowlyzhang

Differential Revision: D52913733

Pulled By: pdillinger

fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593
2024-01-19 21:50:38 -08:00
Changyu Bi ec5b1be18d Deflake PerfContextTest.CPUTimer (#12252)
Summary:
We saw failures like
```
db/perf_context_test.cc:952: Failure
Expected: (next_count) > (count), actual: 26699 vs 26699
```
I can repro by running the test repeatedly and the test fails with different seek keys. So
the cause is likely not with Seek() implementation. I found that
`clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);` can return the same time when
called repeatedly. However, I don't know if Seek() is fast enough that this happened during
continuous test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12252

Test Plan: `gtest_parallel.py --repeat=10000 --workers=1 ./perf_context_test --gtest_filter="PerfContextTest.CPUTimer"`

Reviewed By: ajkr

Differential Revision: D52912751

Pulled By: cbi42

fbshipit-source-id: 8985ae93baa99cdf4b9136ea38addd2e41f4b202
2024-01-19 10:13:52 -08:00
anand76 65e162bf09 Add some asserts in FilePickerMultiGet for debugging (#12241)
Summary:
Add asserts to help debug a crash test failure. The test fails as wollows -
```rocksdb::FilePickerMultiGet::PrepareNextLevel(): Assertion `fp_ctx.search_right_bound == -1 || fp_ctx.search_right_bound == FileIndexer::kLevelMaxIndex' failed```

Also add a unit test to verify an edge case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12241

Reviewed By: cbi42

Differential Revision: D52819029

Pulled By: anand1976

fbshipit-source-id: 33316985c8ace1aed9ecc2400da8b777aec488ff
2024-01-16 17:08:58 -08:00
akankshamahajan cad76a2e1e Fix bug in auto_readahead_size that returned wrong key (#12229)
Summary:
IndexType::kBinarySearchWithFirstKey +
BlockCacheLookupForReadAheadSize enabled => FindNextUserEntryInternal assertion fails or iterator lands at a wrong key because BlockCacheLookupForReadAheadSize moves the index_iter_ and in internal_wrapper.h, result_.key didn't update and pointed to wrong key. Also ikey_ was also pointing to iter_.key() instead of copying the key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12229

Test Plan:
```
 rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
mkdir /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
./db_stress -threads=1 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_setting_blob_options_dynamically=0 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=0 --backup_one_in=0 --batch_protection_bytes_per_key=0 --blob_cache_size=0 --blob_compaction_readahead_size=0 --blob_compression_type=lz4 --blob_file_size=0 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0 --blob_garbage_collection_force_threshold=0 --block_protection_bytes_per_key=0 --block_size=2048 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=0 --compaction_ttl=0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 --db_write_buffer_size=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=3 --ingest_external_file_one_in=10 --initial_auto_readahead_size=0 --iterpercent=55 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_blob_size=8 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=10000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=1 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --snapshot_hold_ops=0 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_blob_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=0 --verify_checksum_one_in=0 --verify_db_one_in=0 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=1 --verify_sst_unique_id_in_manifest=0 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=0 > repro.out
Verification failed. Expected state has key 0000000000000077000000000000004178, iterator is at key 0000000000000077000000000000008A78
Column family: default, op_logs: S 0000000000000077000000000000003D7878787878 NNNN
No writes or ops?
Verification failed :(
```

Reviewed By: ajkr

Differential Revision: D52710655

Pulled By: akankshamahajan15

fbshipit-source-id: 9d2e684e190fb0832bdce3337bce1c6548cd054d
2024-01-16 11:30:36 -08:00
Jonah Gao e28251ca72 Fix blob files not reclaimed after deleting all SSTs (#12235)
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/12208.

After all the SSTs have been deleted, all the blob files will become unreferenced.
These files should be considered obsolete and thus, should not be saved to the vstorage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12235

Reviewed By: jowlyzhang

Differential Revision: D52806441

Pulled By: ltamasi

fbshipit-source-id: 62f94d4f2544ed2822c764d8ace5bf7f57efe42d
2024-01-16 11:15:23 -08:00
Andrew Kryczka 2dda7a0dd2 Detect compaction pressure at lower debt ratios (#12236)
Summary:
This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236

Reviewed By: cbi42

Differential Revision: D52765685

Pulled By: ajkr

fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4
2024-01-15 22:41:18 -08:00
马越 1a1f9f1660 Fix the compactRange with wrong cf handle when ClipColumnFamily (#12219)
Summary:
- **Context**:

In ClipColumnFamily, the DeleteRange API will be used to delete data, and then CompactRange will be called for physical deletion. But now However, the ColumnFamilyHandle is not passed , so by default only the DefaultColumnFamily will be CompactRanged. Therefore, it may cause that the data in some sst files of CompactionRange cannot be physically deleted.

- **In this change**

Pass the ColumnFamilyHandle when call CompactRange

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12219

Reviewed By: ajkr

Differential Revision: D52665162

Pulled By: cbi42

fbshipit-source-id: e8e997aa25ec4ca40e347be89edc7e84a7a0edce
2024-01-10 14:34:12 -08:00
Andrew Kryczka 5a9ecf6614 Automated modernization (#12210)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12210

Reviewed By: hx235

Differential Revision: D52559771

Pulled By: ajkr

fbshipit-source-id: 1ccdd3a0180cc02bc0441f20b0e4a1db50841b03
2024-01-05 11:53:57 -08:00
akankshamahajan 5cb2d09d47 Refactor FilePrefetchBuffer code (#12097)
Summary:
Summary - Refactor FilePrefetchBuffer code
- Implementation:
FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_.

 If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data.
If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer.
If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed.
If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to
the caller.

- Merged Sync and Async code flow into one in FilePrefetchBuffer.

Test Plan -
- Crash test passed
- Unit tests
- Pending - Benchmarks

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097

Reviewed By: ajkr

Differential Revision: D51759552

Pulled By: akankshamahajan15

fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5
2024-01-05 09:29:01 -08:00
Hui Xiao 81b6296c7e Pass flush IO activity enum in FlushJob::MaybeIncreaseFullHistoryTsLowToAboveCutoffUDT...() (#12197)
Summary:
**Context/Summary:** as titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12197

Test Plan:
```
./db_stress --acquire_snapshot_one_in=100 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=4.393039399748979 --bottommost_compression_type=disable --bottommost_file_compaction_delay=86400 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_ttl=0 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --persist_user_defined_timestamps=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=55 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=3 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_txn=0 --use_write_buffer_manager=0 --user_timestamp_size=8 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=10000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=128 --writepercent=35
```

Before fix:
```
db_stress_tool/db_stress_env_wrapper.h:92: virtual rocksdb::IOStatus rocksdb::DbStressWritableFileWrapper::Append(const rocksdb::Slice &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `io_activity == Env::IOActivity::kUnknown || io_activity == options.io_activity' failed.
```

After fix:
Succeed

Reviewed By: ajkr

Differential Revision: D52492030

Pulled By: hx235

fbshipit-source-id: 842a0dcbdf135838b57ddb4a3a6f1effc8dd3e82
2024-01-02 17:33:00 -08:00
leipeng d411fc4dd6 column_family.cc: SanitizeOptions(dbo, cfo): WARN msg: add missing spaces (#12193)
Summary:
Fix for multi line strings missing spaces.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12193

Reviewed By: cbi42

Differential Revision: D52457430

Pulled By: ajkr

fbshipit-source-id: 4ca75a14e61c09819e5d821da6137f4536e9e76e
2024-01-02 11:18:11 -08:00
leipeng 906c6683ed InternalKey::Set: remove redundant assign (#12194)
Summary:
InternalKey::Set: remove redundant assign

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12194

Reviewed By: cbi42

Differential Revision: D52457542

Pulled By: ajkr

fbshipit-source-id: 329983a8734ff38ffd93018bbbe112b4a23b5c11
2024-01-02 11:17:39 -08:00
Hui Xiao 06e593376c Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.

For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS

Some related code refactory to make implementation cleaner:
- Blob stats
   - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS  include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
   - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
   - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
   - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see  https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
   - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority

## Test
### db bench

Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100

rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```

blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB

Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```

```
Stacked Blob DB

Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```

###  Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests

###  Performance

Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true

Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,

Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```

Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846

Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```

Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860

Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910

Reviewed By: ajkr

Differential Revision: D49788060

Pulled By: hx235

fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 15:29:23 -08:00
anand76 a036525809 Lightweight verification of MANIFEST file after close on shutdown (#12174)
Summary:
Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174

Test Plan: Unit test, and some manual verification

Reviewed By: ajkr

Differential Revision: D52451184

Pulled By: anand1976

fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83
2023-12-28 18:25:29 -08:00
hulk b7ecbe309d Trigger compaction to the next level if the data age exceeds periodic_compaction_seconds (#12175)
Summary:
Currently, the data are always compacted to the same level if exceed periodic_compaction_seconds which may confuse users, so we change it to allow trigger compaction to the next level here. It's a behavior change to users, and may affect users
who have disabled their ttl or ttl > periodic_compaction_seconds.

Relate issue: https://github.com/facebook/rocksdb/issues/12165

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175

Reviewed By: ajkr

Differential Revision: D52446722

Pulled By: cbi42

fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f
2023-12-28 12:50:08 -08:00
Changyu Bi 3d81f175b4 Prioritize marked file in level compaction (#12187)
Summary:
When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187

Test Plan: * New unit tests

Reviewed By: ajkr

Differential Revision: D52437194

Pulled By: cbi42

fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3
2023-12-28 10:28:37 -08:00
darionyaphet 01f2edd145 Replace push_back by emplace_back in wal manager (#10805)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10805

Reviewed By: ajkr

Differential Revision: D52424928

Pulled By: hx235

fbshipit-source-id: 548e3304ca721a3907be3696d12735929aca8490
2023-12-27 10:40:33 -08:00
Andrew Kryczka 4fefe1fed9 Downgrade warning for dynamic leveling with non-leveled compaction (#12186)
Summary:
Now that `level_compaction_dynamic_level_bytes`'s default value is true, users who do not touch that setting and use non-leveled compaction will also see this log message. It can be info level rather than warning since, in the case mentioned, there is nothing the user needs to be warned about.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12186

Reviewed By: cbi42

Differential Revision: D52422499

Pulled By: ajkr

fbshipit-source-id: 8dbfcd102aab671b881ba047fb4a0a555b3e0a78
2023-12-26 15:13:42 -08:00
Peter Dillinger a771a47a1b Fix leak or crash on failure in automatic atomic flush (#12176)
Summary:
Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test:

```
db_flush_test: db/column_family.cc:1648:
rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176

Test Plan: unit test added

Reviewed By: cbi42

Differential Revision: D52417071

Pulled By: pdillinger

fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5
2023-12-26 11:04:25 -08:00
zaidoon ad0362ac92 Expose Options::ttl through C API (#12170)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12170

Reviewed By: jaykorean

Differential Revision: D52378902

Pulled By: cbi42

fbshipit-source-id: 0bac94b8785d5149df86e7317e69c0e64beab887
2023-12-21 15:04:53 -08:00
anand76 cc069f25b3 Add some compressed and tiered secondary cache stats (#12150)
Summary:
Add statistics for more visibility.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12150

Reviewed By: akankshamahajan15

Differential Revision: D52184633

Pulled By: anand1976

fbshipit-source-id: 9969e05d65223811cd12627102b020bb6d229352
2023-12-15 11:34:08 -08:00
Akanksha Mahajan cd577f6059 Fix WRITE_STALL start_time (#12147)
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)
However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time);

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147

Test Plan: Existing CircleCI

Reviewed By: cbi42

Differential Revision: D52173481

Pulled By: akankshamahajan15

fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98
2023-12-14 13:45:06 -08:00
akankshamahajan d926593df5 Fix stress tests failure for auto_readahead_size (#12131)
Summary:
When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter  so that
- BlockBasedTableIterator can point index_iter_ to the right block.
- disable readahead_cache_lookup.
However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This more like a corner case.

So to handle that case, removed SeekForPrev calling from db_iter and reseeking index_iter_ in Prev operation. block_iter_'s key already point to right block. So reseeking to index_iter_ solves the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131

Test Plan:
- Tested on db_stress command that was failing -
`./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35`

 - make crash_test -j32

Reviewed By: anand1976

Differential Revision: D51986326

Pulled By: akankshamahajan15

fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9
2023-12-13 12:15:04 -08:00
Andrew Kryczka d8e47620d7 Speedup based on pending compaction bytes relative to data size (#12130)
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years.

This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130

Reviewed By: hx235

Differential Revision: D52000746

Pulled By: ajkr

fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7
2023-12-13 10:37:27 -08:00
Peter Dillinger c96d9a0fbb Allow TablePropertiesCollectorFactory to return null collector (#12129)
Summary:
As part of building another feature, I wanted this:
* Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases.
* Polished, clarified some related API comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12129

Test Plan: unit test added

Reviewed By: ltamasi

Differential Revision: D51966667

Pulled By: pdillinger

fbshipit-source-id: 2991c08fe6ce3a8c9f14c68f1495f5a17bca2770
2023-12-11 12:02:56 -08:00
Kevin Mingtarja 44fd914128 Fix double counting of BYTES_WRITTEN ticker (#12111)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12061.

We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable.

The cause of this bug is previously during writes, we didn't take into account `writer->ShouldWriteToMemtable()` before adding to `total_byte_size`, so it is still added to during the `Prepare()` phase even though we're not writing to the Memtable, which was why we saw the value to be double of what's written to WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111

Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes.

Reviewed By: jaykorean

Differential Revision: D51954327

Pulled By: jowlyzhang

fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c
2023-12-08 17:12:11 -08:00
Levi Tamasi a143f93236 Turn the default Timer in PeriodicTaskScheduler into a leaky Meyers singleton (#12128)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12128

The patch turns the `Timer` Meyers singleton in `PeriodicTaskScheduler::Default()` into one of the leaky variety in order to prevent static destruction order issues.

Reviewed By: akankshamahajan15

Differential Revision: D51963950

fbshipit-source-id: 0fc34113ad03c51fdc83bdb8c2cfb6c9f6913948
2023-12-08 10:34:07 -08:00
Levi Tamasi 0ebe1614cb Eliminate some code duplication in MergeHelper (#12121)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12121

The patch eliminates some code duplication by unifying the two sets of `MergeHelper::TimedFullMerge` overloads using variadic templates. It also brings the order of parameters into sync when it comes to the various `TimedFullMerge*` methods.

Reviewed By: jaykorean

Differential Revision: D51862483

fbshipit-source-id: e3f832a6ff89ba34591451655cf11025d0a0d018
2023-12-05 14:07:42 -08:00
Yu Zhang ba8fa0f546 internal_repo_rocksdb (4372117296613874540) (#12117)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12117

Reviewed By: ajkr

Differential Revision: D51745846

Pulled By: jowlyzhang

fbshipit-source-id: 51c806a484b3b43d174b06d2cfe9499191d09914
2023-12-04 11:17:32 -08:00
Yu Zhang d68f45e777 Flush buffered logs when FlushRequest is rescheduled (#12105)
Summary:
The optimization to not find and delete obsolete files when FlushRequest is re-scheduled also inadvertently skipped flushing the `LogBuffer`, resulting in missed logs. This PR fixes the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12105

Test Plan:
manually check this test has the correct info log after the fix
`./column_family_test --gtest_filter=ColumnFamilyRetainUDTTest.NotAllKeysExpiredFlushRescheduled`

Reviewed By: ajkr

Differential Revision: D51671079

Pulled By: jowlyzhang

fbshipit-source-id: da0640e07e35c69c08988772ed611ec9e67f2e92
2023-11-29 11:35:59 -08:00
cz2h 324453e579 Fix rowcache get returning incorrect timestamp (#11952)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7930.

When there is a timestamp associated with stored records, get from row cache will return the timestamp provided in query instead of the timestamp associated with the stored record.

## Cause of error:
Currently a row_handle is fetched using row_cache_key(contains a timestamp provided by user query) and the row_handle itself does not persist timestamp associated with the object. Hence the [GetContext::SaveValue()
](6e3429b8a6/table/get_context.cc (L257)) function will fetch the timestamp in row_cache_key and may return the incorrect timestamp value.

## Proposed Solution
If current cf enables ts, append a timestamp associated with stored records after the value in replay_log (equivalently the value of row cache entry).

When read, `replayGetContextLog()` will update parsed_key with the correct timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11952

Reviewed By: ajkr

Differential Revision: D51501176

Pulled By: jowlyzhang

fbshipit-source-id: 808fc943a8ae95de56ae0e82ec59a2573a031f28
2023-11-21 20:39:33 -08:00
Changyu Bi fb5c8c7ea3 Do not compare op_type in WithinPenultimateLevelOutputRange() (#12081)
Summary:
`WithinPenultimateLevelOutputRange()` is updated in https://github.com/facebook/rocksdb/issues/12063 to check internal key range. However, op_type of a key can change during compaction, e.g. MERGE -> PUT, which makes a key larger and becomes out of penultimate output range. This has caused stress test failures with error message "Unsafe to store Seq later than snapshot in the last level if per_key_placement is enabled". So update `WithinPenultimateLevelOutputRange()` to only check user key and sequence number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12081

Test Plan:
* This repro can produce the corruption within a few runs. Ran it a few times after the fix and did not see Corruption failure.
```
python3 ./tools/db_crashtest.py whitebox --test_tiered_storage --random_kill_odd=888887 --use_merge=1 --writepercent=100 --readpercent=0 --prefixpercent=0 --delpercent=0 --delrangepercent=0 --iterpercent=0 --write_buffer_size=419430 --column_families=1 --read_fault_one_in=0 --write_fault_one_in=0
```

Reviewed By: ajkr

Differential Revision: D51481202

Pulled By: cbi42

fbshipit-source-id: cad6b65099733e03071b496e752bbdb09cf4db82
2023-11-20 17:07:28 -08:00
Benoît Mériaux 7780e98268 add write_buffer_manager setter into options and tests in c bindings, (#12007)
Summary:
following https://github.com/facebook/rocksdb/pull/11710
 - add test on wbm c api
- add a setter for WBM in `DBOptions`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12007

Reviewed By: cbi42

Differential Revision: D51430042

Pulled By: ajkr

fbshipit-source-id: 608bc4d3ed35a84200459d0230b35be64b3475f7
2023-11-17 11:34:05 -08:00
Changyu Bi 4e58cc6437 Check internal key range when compacting from last level to penultimate level (#12063)
Summary:
The test failure in https://github.com/facebook/rocksdb/issues/11909 shows that we may compact keys outside of internal key range of penultimate level input files from last level to penultimate level, which can potentially cause overlapping files in the penultimate level. This PR updates the  `Compaction::WithinPenultimateLevelOutputRange()` to check internal key range instead of user key.

Other fixes:
* skip range del sentinels when deciding output level for tiered compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12063

Test Plan:
- existing unit tests
- apply the fix to https://github.com/facebook/rocksdb/issues/11905 and run `./tiered_compaction_test --gtest_filter="*RangeDelsCauseFileEndpointsToOverlap*"`

Reviewed By: ajkr

Differential Revision: D51288985

Pulled By: cbi42

fbshipit-source-id: 70085db5f5c3b15300bcbc39057d57b83fd9902a
2023-11-17 10:50:40 -08:00
Gus Wynn 6d10f8d690 add WriteBufferManager to c api (#11710)
Summary:
I want to use the `WriteBufferManager` in my rust project, which requires exposing it through the c api, just like `Cache` is.

Hopefully the changes are fairly straightfoward!

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11710

Reviewed By: cbi42

Differential Revision: D51166518

Pulled By: ajkr

fbshipit-source-id: cd266ff1e4a7ab145d05385cd125a8390f51f3fc
2023-11-16 10:34:00 -08:00
Andrew Kryczka 9202db1867 Consider archived WALs for deletion more frequently (#12069)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11000.

That issue pointed out that RocksDB was slow to delete archived WALs in case time-based and size-based expiration were enabled, and the time-based threshold (`WAL_ttl_seconds`) was small. This PR prevents the delay by taking into account `WAL_ttl_seconds` when deciding the frequency to process archived WALs for deletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12069

Reviewed By: pdillinger

Differential Revision: D51262589

Pulled By: ajkr

fbshipit-source-id: e65431a06ee96f4c599ba84a27d1aedebecbb003
2023-11-15 15:42:28 -08:00
Changyu Bi e7896f03ad Enable unit test PrecludeLastLevelTest.RangeDelsCauseFileEndpointsToOverlap (#12064)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11909. The test passes after the change in https://github.com/facebook/rocksdb/issues/11917 to start mock clock from a non-zero time.

The reason for test failing is a bit complicated:
- The Put here e4ad4a0ef1/db/compaction/tiered_compaction_test.cc (L2045) happens before mock clock advances beyond 0.
- This causes oldest_key_time_ to be 0 for memtable.
- oldest_ancester_time of the first L0 file becomes 0
- L0 -> L5/6 compaction output files sets `oldest_ancestoer_time` to the current time due to these lines: 509947ce2c/db/compaction/compaction_job.cc (L1898C34-L1904).
- This causes some small sequence number to be mapped to current time: 509947ce2c/db/compaction/compaction_job.cc (L301)
- Keys in L6 is being moved up to L5 due to the unexpected seqno_to_time mapping
- When compacting keys from last level to the penultimate level, we only check keys to be within user key range of penultimate level input files. If we compact the following file 3 with file 1 and output keys to L5, we can get the reported inconsistency bug.
```
L5: file 1 [K5@20, K10@kMaxSeqno], file 2 [K10@30, K14@34)
L6: file 3 [K6@5, K10@20]
```

https://github.com/facebook/rocksdb/issues/12063 will add fixes to check internal key range when compacting keys from last level up to the penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12064

Test Plan: the unit test passes

Reviewed By: ajkr

Differential Revision: D51281149

Pulled By: cbi42

fbshipit-source-id: 00b7f026c453454d9f3af5b2de441383a96f0c62
2023-11-13 15:26:52 -08:00
Jay Huh 8b8f6c63ef ColumnFamilyHandle Nullcheck in GetEntity and MultiGetEntity (#12057)
Summary:
- Add missing null check for ColumnFamilyHandle in `GetEntity()`
- `FailIfCfHasTs()` now returns `Status::InvalidArgument()` if `column_family` is null. `MultiGetEntity()` can rely on this for cfh null check.
- Added `DeleteRange` API using Default Column Family to be consistent with other major APIs (This was also causing Java Test failure after the `FailIfCfHasTs()` change)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12057

Test Plan:
- Updated `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` to include null CF case
- Updated `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` to include null CF case

Reviewed By: jowlyzhang

Differential Revision: D51167445

Pulled By: jaykorean

fbshipit-source-id: 1c1e44fd7b7df4d2dc3bb2d7d251da85bad7d664
2023-11-13 14:30:04 -08:00
leipeng b3ffca0e29 DBImpl::DelayWrite: Remove bad WRITE_STALL histogram (#12067)
Summary:
When delay didn't happen, histogram WRITE_STALL is still recorded, and ticker STALL_MICROS is not recorded.

This is a bug, neither WRITE_STALL or STALL_MICROS should not be recorded when delay did not happen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12067

Reviewed By: cbi42

Differential Revision: D51263133

Pulled By: ajkr

fbshipit-source-id: bd82d8328fe088d613991966e83854afdabc6a25
2023-11-13 12:48:44 -08:00
Yu Zhang 509947ce2c Quarantine files in a limbo state after a manifest error (#12030)
Summary:
Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery.  3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted.  This PR replaces this mechanism with tracking such files and quarantine them from being deleted in `ErrorHandler`.

These are the types of files being actively tracked in quarantine in this PR:
1) new table files and blob files from a background job
2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed.

Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening.

We track these files' file numbers because they share the same file number space.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030

Test Plan: Modified existing unit tests

Reviewed By: ajkr

Differential Revision: D51036774

Pulled By: jowlyzhang

fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716
2023-11-11 08:11:11 -08:00
Yu Zhang c6c683a0ca Remove the default force behavior for EnableFileDeletion API (#12001)
Summary:
Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insisted on doing so knowing the consequence of what can be potentially disrupted.

This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them.  When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK.

This PR also makes unit test that do not need force behavior to do a regular enable file deletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001

Reviewed By: ajkr

Differential Revision: D51214683

Pulled By: jowlyzhang

fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4
2023-11-10 14:35:54 -08:00
Yueh-Hsuan Chiang 5ef92b8ea4 Add rocksdb_options_set_cf_paths (#11151)
Summary:
This PR adds a missing set function for rocksdb_options in the C-API:
rocksdb_options_set_cf_paths().  Without this function, users cannot
specify different paths for different column families as it will fall back
to db_paths.

As a bonus, this PR also includes rocksdb_sst_file_metadata_get_directory()
to the C api -- a missing public function that will also make the test easier to write.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11151

Test Plan: Augment existing c_test to verify the specified cf_path.

Reviewed By: hx235

Differential Revision: D51201888

Pulled By: ajkr

fbshipit-source-id: 62a96451f26fab60ada2005ede3eea8e9b431f30
2023-11-10 11:36:11 -08:00
Yueh-Hsuan Chiang 73d223c4e2 Add auto_tuned option to RateLimiter C API (#12058)
Summary:
#### Problem
While the RocksDB C API does have the RateLimiter API, it does not
expose the auto_tuned option.

#### Summary of Change
This PR exposes auto_tuned RateLimiter option in RocksDB C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12058

Test Plan: Augment the C API existing test to cover the new API.

Reviewed By: cbi42

Differential Revision: D51201933

Pulled By: ajkr

fbshipit-source-id: 5bc595a9cf9f88f50fee797b729ba96f09ed8266
2023-11-10 09:53:09 -08:00
Yu Zhang dfaf4dc111 Stubs for piping write time (#12043)
Summary:
As titled. This PR contains the API and stubbed implementation for piping write time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12043

Reviewed By: pdillinger

Differential Revision: D51076575

Pulled By: jowlyzhang

fbshipit-source-id: 3b341263498351b9ccaff27cf35d5aeb5bdf0cf1
2023-11-09 15:58:07 -08:00
brodyhuang e90e9825b4 Drop wal record when sequence is illegal (#11985)
Summary:
- Our database is corrupted, causing some sequences of wal record to be invalid (but the `record_checksum` looks fine).
- When we RecoverLogFiles in WALRecoveryMode::kPointInTimeRecovery, `assert(seq <= kMaxSequenceNumber)` will be failed.
- When it is found that sequence is illegal, can we drop the file  to recover as much data as possible ?  Thx !

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11985

Reviewed By: anand1976

Differential Revision: D50698039

Pulled By: ajkr

fbshipit-source-id: 1e42113b58823088d7c0c3a92af5b3efbb5f5296
2023-11-09 10:43:16 -08:00
Hui Xiao f337533b6f Ensure and clarify how RocksDB calls TablePropertiesCollector's functions (#12053)
Summary:
**Context/Summary:**
It's intuitive for users to assume `TablePropertiesCollector::Finish()` is called only once by RocksDB internal by the word "finish".

However, this is currently not true as RocksDB also calls this function in `BlockBased/PlainTableBuilder::GetTableProperties()` to populate user collected properties on demand.

This PR avoids that by moving that populating to where we first call `Finish()` (i.e, `NotifyCollectTableCollectorsOnFinish`)

Bonus: clarified in the API that `GetReadableProperties()` will be called after `Finish()` and added UT to ensure that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12053

Test Plan:
- Modified test `DBPropertiesTest.GetUserDefinedTableProperties` to ensure `Finish()` only called once.
- Existing test particularly `db_properties_test, table_properties_collector_test` verify the functionality  `NotifyCollectTableCollectorsOnFinish` and `GetReadableProperties()` are not broken by this change.

Reviewed By: ajkr

Differential Revision: D51095434

Pulled By: hx235

fbshipit-source-id: 1c6275258f9b99dedad313ee8427119126817973
2023-11-08 14:00:36 -08:00
Zaidoon Abd Al Hadi 58f2a29fb4 Expose Options::periodic_compaction_seconds through C API (#12019)
Summary:
fixes [11090](https://github.com/facebook/rocksdb/issues/11090)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12019

Reviewed By: jaykorean

Differential Revision: D51076427

Pulled By: cbi42

fbshipit-source-id: de353ff66c7f73aba70ab3379e20d8c40f50d873
2023-11-07 12:46:50 -08:00
Jay Huh 2adef5367a AttributeGroups - PutEntity Implementation (#11977)
Summary:
Write Path for AttributeGroup Support. The new `PutEntity()` API uses `WriteBatch` and atomically writes WideColumns entities in multiple Column Families.

Combined the release note from PR https://github.com/facebook/rocksdb/issues/11925

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11977

Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` updated
- `WriteBatchTest::AttributeGroupTest` added
- `WriteBatchTest::AttributeGroupSavePointTest` added

Reviewed By: ltamasi

Differential Revision: D50457122

Pulled By: jaykorean

fbshipit-source-id: 4997b265e415588ce077933082dcd1ac3eeae2cd
2023-11-06 16:52:51 -08:00
Jay Huh 0ecfc4fbb4 AttributeGroups - GetEntity Implementation (#11943)
Summary:
Implementation of `GetEntity()` API that returns wide-column entities as AttributeGroups from multiple column families for a single key. Regarding the definition of Attribute groups, please see the detailed example description in PR https://github.com/facebook/rocksdb/issues/11925

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11943

Test Plan:
- `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` added

will enable the new API in the `db_stress` after merging

Reviewed By: ltamasi

Differential Revision: D50195794

Pulled By: jaykorean

fbshipit-source-id: 218d54841ac7e337de62e13b1233b0a99bd91af3
2023-11-06 15:04:41 -08:00
Jay Huh 2dab137182 Mark more files for periodic compaction during offpeak (#12031)
Summary:
- The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice.
- It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization.
- Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start.
- We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031

Test Plan:
Unit Tests added
- `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled
- `DBTestUniversalCompaction2::PeriodicCompaction` for Universal

Reviewed By: cbi42

Differential Revision: D50900292

Pulled By: jaykorean

fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d
2023-11-06 11:43:59 -08:00
Changyu Bi 520c64fd2e Add missing status check in ExternalSstFileIngestionJob and ImportColumnFamilyJob (#12042)
Summary:
.. and update some unit tests that failed with this change. See comment in ExternalSSTFileBasicTest.IngestFileWithCorruptedDataBlock for more explanation.

The missing status check is not caught by `ASSERT_STATUS_CHECKED=1` due to this line: 8505b26db1/table/block_based/block.h (L394). Will explore if we can remove it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12042

Test Plan: existing unit tests.

Reviewed By: ajkr

Differential Revision: D50994769

Pulled By: cbi42

fbshipit-source-id: c91615bccd6094a91634c50b98401d456cbb927b
2023-11-06 07:41:36 -08:00
914022466 2648e0a747 Fix a bug when ingest plaintable sst file (#11969)
Summary:
Plaintable doesn't support SeekToLast. And GetIngestedFileInfo is using SeekToLast without checking the validity.

We are using IngestExternalFile or CreateColumnFamilyWithImport with some sst file in PlainTable format . But after running for a while, compaction error often happens. Such as
![image](https://github.com/facebook/rocksdb/assets/13954644/b4fa49fc-73fc-49ce-96c6-f198a30800b8)

I simply add some std::cerr log to find why.
![image](https://github.com/facebook/rocksdb/assets/13954644/2cf1d5ff-48cc-4125-b917-87090f764fcd)
It shows that the smallest key is always equal to largest key.
![image](https://github.com/facebook/rocksdb/assets/13954644/6d43e978-0be0-4306-aae3-f9e4ae366395)
Then I found the root cause is that PlainTable do not support SeekToLast, so the smallest key is always the same with the largest

I try to write an unit test. But it's not easy to reproduce this error.
(This PR is similar to https://github.com/facebook/rocksdb/pull/11266. Sorry for open another PR)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11969

Reviewed By: ajkr

Differential Revision: D50933854

Pulled By: cbi42

fbshipit-source-id: 6c6af53c1388922cbabbe64ed3be1cdc58df5431
2023-11-02 13:45:37 -07:00
Yu Zhang 4b013dcbed Remove VersionEdit's friends pattern (#12024)
Summary:
Almost each of VersionEdit private member has its own getter and setter. Current code access them with a combination of directly accessing private members and via getter and setters. There is no obvious benefits to have this pattern except potential performance gains. I tried this simple benchmark for removing the friends pattern completely, and there is no obvious regression. So I think it would good to remove VersionEdit's friends completely.

```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num_column_families=10 -num=50000000```

With change:
fillseq      :       2.994 micros/op 333980 ops/sec 149.710 seconds 50000000 operations;   36.9 MB/s
fillseq      :       3.033 micros/op 329656 ops/sec 151.673 seconds 50000000 operations;   36.5 MB/s
fillseq      :       2.991 micros/op 334369 ops/sec 149.535 seconds 50000000 operations;   37.0 MB/s
Without change:
fillseq      :       3.015 micros/op 331715 ops/sec 150.732 seconds 50000000 operations;   36.7 MB/s
fillseq      :       3.044 micros/op 328553 ops/sec 152.182 seconds 50000000 operations;   36.3 MB/s
fillseq      :       3.091 micros/op 323520 ops/sec 154.550 seconds 50000000 operations;   35.8 MB/s

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12024

Reviewed By: pdillinger

Differential Revision: D50806066

Pulled By: jowlyzhang

fbshipit-source-id: 35d287ce638a38c30f243f85992e615b4c90eb27
2023-11-01 12:04:11 -07:00
Jay Huh 04225a2cfa Fix for RecoverFromRetryableBGIOError starting with recovery_in_prog_ false (#11991)
Summary:
cbi42 helped investigation and found a potential scenario where `RecoverFromRetryableBGIOError()` may start with `recovery_in_prog_ ` set as false. (and other booleans like `bg_error_` and `soft_error_no_bg_work_`)

**Thread 1**
- `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true`

**Thread 1's `recovery_thread_`**
- (waits for mutex and acquires it)
- `RecoverFromRetryableBGIOError()` -> `ResumeImpl()` -> `ClearBGError()`: sets `recovery_in_prog_ = false`
- `ClearBGError()` -> `NotifyOnErrorRecoveryEnd()`: releases `mutex`

**Thread 2**
- `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true`
- Waits for Thread 1 (`recovery_thread_`) to finish

**Thread 1's `recovery_thread_`**
- re-lock mutex in `NotifyOnErrorRecoveryEnd()`
- Still inside `RecoverFromRetryableBGIOError()`: sets `recovery_in_prog_ = false`
- Done

**Thread 2's `recovery_thread_`**
- recovery thread started with `recovery_in_prog_` set as `false`

# Fix
- Remove double-clearing `bg_error_`,  `recovery_in_prog_` and other fields after `ResumeImpl()` already returned `OK()`.
- Minor typo and linter fixes in `DBErrorHandlingFSTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11991

Test Plan:
- `DBErrorHandlingFSTest::MultipleRecoveryThreads` added to reproduce the scenario.
- Adding `assert(recovery_in_prog_);` at the start of `ErrorHandler::RecoverFromRetryableBGIOError()` fails the test without the fix and succeeds with the fix as expected.

Reviewed By: cbi42

Differential Revision: D50506113

Pulled By: jaykorean

fbshipit-source-id: 6dabe01e9ecd3fc50bbe9019587f2f4858bed9c6
2023-10-31 16:13:36 -07:00
Yu Zhang 60df39e530 Rate limiting stale sst files' deletion during recovery (#12016)
Summary:
As titled. If SstFileManager is available, deleting stale sst files will be delegated to it so it can be rate limited.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12016

Reviewed By: hx235

Differential Revision: D50670482

Pulled By: jowlyzhang

fbshipit-source-id: bde5b76ea1d98e67f6b4f08bfba3db48e46aab4e
2023-10-28 09:50:52 -07:00
Jay Huh e230e4d248 Make OffpeakTimeInfo available in VersionSet (#12018)
Summary:
As mentioned in  https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR.

There were two ways to achieve what we want.
1.  Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`.
2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`.

We chose the latter as it involves smaller changes.

This change includes the following
- Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions`
- `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()`
- During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018

Test Plan:
- `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`.
- `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes`

Reviewed By: pdillinger

Differential Revision: D50723881

Pulled By: jaykorean

fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6
2023-10-27 15:56:48 -07:00
Hui Xiao 0f141352d8 Fix race between flush error recovery and db destruction (#12002)
Summary:
**Context:**
DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: 519f2a41fb/db/db_impl/db_impl.cc (L525) -> 519f2a41fb/db/error_handler.cc (L250) -> 519f2a41fb/db/error_handler.cc (L808-L823)

However, due to a race between flush error recovery and db destruction, recovery can actually start after such wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part the db destruction. It then crashes the program as below.

```
std::terminate()
std::default_delete<std::thread>::operator()(std::thread*) const
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
```

**Summary:**
This PR fixed it by considering whether EndAutoRecovery() has been called before creating such thread. This fix is similar to how we currently [handle](519f2a41fb/db/error_handler.cc (L688-L694)) such case inside the created recovery thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002

Test Plan: A new UT repro-ed the crash before this fix and and pass after.

Reviewed By: ajkr

Differential Revision: D50586191

Pulled By: hx235

fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093
2023-10-25 11:59:09 -07:00
qiuchengxuan f2c9075d16 Fix dead loop with kSkipAnyCorruptedRecords mode selected in some cases (#11955) (#11979)
Summary:
With fragmented record span across multiple blocks, if any following blocks corrupted with arbitary data, and intepreted log number less than the current log number, program will fall into infinite loop due to
not skipping buffer leading bytes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11979

Test Plan: existing unit tests

Reviewed By: ajkr

Differential Revision: D50604408

Pulled By: jowlyzhang

fbshipit-source-id: e50a0c7e7c3d293fb9d5afec0a3eb4a1835b7a3b
2023-10-25 09:16:24 -07:00
Myth 0ff7665c95 Fix low priority write may cause crash when it is rate limited (#11932)
Summary:
Fixed https://github.com/facebook/rocksdb/issues/11902

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11932

Reviewed By: akankshamahajan15

Differential Revision: D50573356

Pulled By: hx235

fbshipit-source-id: adeb1abdc43b523b0357746055ce4a2eabde56a1
2023-10-24 14:41:46 -07:00
Peter Dillinger 4155087746 Use manifest to persist pre-allocated seqnos (#11995)
Summary:
... and other fixes for crash test after https://github.com/facebook/rocksdb/issues/11922.
* When pre-allocating sequence numbers for establishing a time history, record that last sequence number in the manifest so that it is (most likely) restored on recovery even if no user writes were made or were recovered (e.g. no WAL).
* When pre-allocating sequence numbers for establishing a time history, only do this for actually new DBs.
* Remove the feature that ensures non-zero sequence number on creating the first column family with preserve/preclude option after initial DB::Open. Until fixed in a way compatible with the crash test, this creates a gap where some data written with active preserve/preclude option won't have a known associated time.

Together, these ensure we don't upset the crash test by manipulating sequence numbers after initial DB creation (esp when re-opening with different options). (The crash test expects that the seqno after re-open corresponds to a known point in time from previous crash test operation, matching an expected DB state.)

Follow-up work:
* Re-fill the gap to ensure all data written under preserve/preclude settings have a known time estimate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11995

Test Plan:
Added to unit test SeqnoTimeTablePropTest.PrePopulateInDB

Verified fixes two crash test scenarios:
## 1st reproducer
First apply
```
 diff --git a/db_stress_tool/expected_state.cc b/db_stress_tool/expected_state.cc
index b483e154c..ef63b8d6c 100644
 --- a/db_stress_tool/expected_state.cc
+++ b/db_stress_tool/expected_state.cc
@@ -333,6 +333,7 @@ Status FileExpectedStateManager::SaveAtAndAfter(DB* db) {
     s = NewFileTraceWriter(Env::Default(), soptions, trace_file_path,
                            &trace_writer);
   }
+  if (getenv("CRASH")) assert(false);
   if (s.ok()) {
     TraceOptions trace_opts;
     trace_opts.filter |= kTraceFilterGet;
```

Then
```
mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_expected
mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox
rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/*
CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=36000
./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=0
```

Without the fix you get
```
...
DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox]
(Re-)verified 34 unique IDs
Error restoring historical expected values: Corruption: DB is older than any restorable expected state
```

## 2nd reproducer
First apply
```
 diff --git a/db_stress_tool/db_stress_test_base.cc b/db_stress_tool/db_stress_test_base.cc
index 62ddead7b..f2654980f 100644
 --- a/db_stress_tool/db_stress_test_base.cc
+++ b/db_stress_tool/db_stress_test_base.cc
@@ -1126,6 +1126,7 @@ void StressTest::OperateDb(ThreadState* thread) {
         // OPERATION write
         TestPut(thread, write_opts, read_opts, rand_column_families, rand_keys,
                 value);
+        if (getenv("CRASH")) assert(false);
       } else if (prob_op < del_bound) {
         assert(write_bound <= prob_op);
         // OPERATION delete
```

Then
```
rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/*
CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=0
./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=3600
```

Without the fix you get
```
DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox]
(Re-)verified 34 unique IDs
db_stress: db_stress_tool/expected_state.cc:380: virtual rocksdb::{anonymous}::ExpectedStateTraceRecordHandler::~
ExpectedStateTraceRecordHandler(): Assertion `IsDone()' failed.
```

Reviewed By: jowlyzhang

Differential Revision: D50533346

Pulled By: pdillinger

fbshipit-source-id: 1056be45c5b9e537c8c601b28c4b27431a782477
2023-10-23 09:20:59 -07:00
Hui Xiao 0836a2b26d New tickers on deletion compactions grouped by reasons (#11957)
Summary:
Context/Summary: as titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11957

Test Plan: piggyback on existing tests; fixed a failed test due to adding new stats

Reviewed By: ajkr, cbi42

Differential Revision: D50294310

Pulled By: hx235

fbshipit-source-id: d99b97ebac41efc1bdeaf9ca7a1debd2927d54cd
2023-10-18 18:00:07 -07:00
Changyu Bi d5bc30befa Enforce status checking after Valid() returns false for IteratorWrapper (#11975)
Summary:
... when compiled with ASSERT_STATUS_CHECKED = 1.

The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782.

Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975

Test Plan:
* `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check`
* I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task.

Reviewed By: ajkr

Differential Revision: D50383790

Pulled By: cbi42

fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce
2023-10-18 09:38:38 -07:00
Yu Zhang 933ee295f4 Fix a race condition between recovery and backup (#11955)
Summary:
A race condition between recovery and backup can happen with error messages like this:
```Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory```

PR https://github.com/facebook/rocksdb/issues/6949  introduced disabling file deletion during error handling of manifest IO errors. Aformentioned race condition is caused by this chain of event:

[Backup engine]     disable file deletion
[Recovery]              disable file deletion <= this is optional for the race condition, it may or may not get called
[Backup engine]     get list of file to copy/link
[Recovery]              force enable file deletion
....                            some files refered by backup engine get deleted
[Backup engine]    copy/link file <= error no file found

This PR fixes this with:
1) Recovery thread is currently forcing enabling file deletion as long as file deletion is disabled. Regardless of whether the previous error handling is for manifest IO error and that disabled it in the first place. This means it could incorrectly enabling file deletions intended by other threads like backup threads, file snapshotting threads. This PR does this check explicitly before making the call.

2) `disable_delete_obsolete_files_` is designed as a counter to allow different threads to enable and disable file deletion separately. The recovery thread currently does a force enable file deletion, because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error(details per PR https://github.com/facebook/rocksdb/issues/6949), resulting in `DBImpl::DisableFileDeletions` to be called multiple times too. Making a force enable file deletion call that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. However, as it shows in the race condition, it can incorrectly suppress other threads like a backup thread's intention to keep the file deletion disabled. <strike>This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely</strike>. This PR tracks and caps file deletion enabling/disabling in error handler.

3) for recovery, the section to find obsolete files and purge them was moved to be done after the attempt to enable file deletion. The actual finding and purging is more likely to happen if file deletion was previously disabled and get re-enabled now. An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support change 2) and 3). Some useful logging was explicitly added to keep those log messages around.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11955

Test Plan: existing unit tests

Reviewed By: anand1976

Differential Revision: D50290592

Pulled By: jowlyzhang

fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b
2023-10-17 13:18:04 -07:00
Peter Dillinger 2fd850c7eb Remove write queue synchronization from WriteOptionsFile (#11951)
Summary:
This has become obsolete with the new `options_mutex_` in https://github.com/facebook/rocksdb/pull/11929

* Remove now-unnecessary parameter from WriteOptionsFile
* Rename (and negate) other parameter for better clarity (the caller shouldn't tell the callee what the callee needs, just what the caller knows, provides, and requests)
* Move a ROCKS_LOG_WARN (I/O) in WriteOptionsFile to outside of holding DB mutex.
* Also *avoid* (but not always eliminate) write queue synchronization in SetDBOptions. Still needed if there was a change to WAL size limit or other configuration.
* Improve some comments

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11951

Test Plan: existing unit tests and TSAN crash test local run

Reviewed By: ajkr

Differential Revision: D50247904

Pulled By: pdillinger

fbshipit-source-id: 7dfe445c705ec013886a2adb7c50abe50d83af69
2023-10-16 08:58:47 -07:00
Changyu Bi f3aef8cad7 Add write operation to tracer only after successful callback (#11954)
Summary:
We saw optimistic transaction stress test failures like the following:
```
Verification failed for column family 0 key 000000000001E9AF000000000000012B00000000000000B5 (12535491): value_from_db: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, value_from_expected: , msg: Iterator verification: Unexpected value found```
```
With ajkr's repro (see test plan), I found that we record duplicated writes to tracer when an optimistic transaction conflict checking fails. This PR fixes it by checking callback status before record a write operation to tracer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11954

Test Plan:
this reproduces the failure consistently
```
#!/bin/bash
db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected
rm -rf $db $exp && mkdir -p $exp && while ./db_stress \
        --atomic_flush=1 \
        --clear_column_family_one_in=0 \
        --db=$db \
        --db_write_buffer_size=2097152 \
        --delpercent=0 \
        --delrangepercent=0 \
        --destroy_db_initially=0 \
        --disable_wal=1 \
        --expected_values_dir=$exp \
        --iterpercent=0 \
        --max_bytes_for_level_base=2097152 \
        --max_key=250000 \
        --memtable_prefix_bloom_size_ratio=0.5 \
        --memtable_whole_key_filtering=1 \
        --occ_lock_bucket_count=100 \
        --occ_validation_policy=0 \
        --ops_per_thread=10 \
        --prefixpercent=0 \
        --readpercent=0 \
        --reopen=0 \
        --target_file_size_base=524288 \
        --test_batches_snapshots=0 \
        --use_optimistic_txn=1 \
        --use_txn=1 \
        --value_size_mult=32 \
        --write_buffer_size=524288 \
        --writepercent=100 ; do : ; done
```

Reviewed By: akankshamahajan15

Differential Revision: D50284976

Pulled By: cbi42

fbshipit-source-id: 793e3cee186c8b4f406b29166efd8d9028695206
2023-10-14 12:00:31 -07:00
Jay Huh c9d8e6a5bf AttributeGroups - MultiGetEntity Implementation (#11925)
Summary:
Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`.
An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families.

Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`.

Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf".

Prior to this PR, if the user wants to query for two keys, "key_1" and "key_2" and but only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and get the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo.

With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925

Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added

will enable this new API in the `db_stress` in a separate PR

Reviewed By: ltamasi

Differential Revision: D50017414

Pulled By: jaykorean

fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586
2023-10-13 15:58:03 -07:00
Changyu Bi 6e3429b8a6 Fix data race in accessing recovery_in_prog_ (#11950)
Summary:
We saw the following TSAN stress test failure:
```
WARNING: ThreadSanitizer: data race (pid=17523)
  Write of size 1 at 0x7b8c000008b9 by thread T4243 (mutexes: write M0):
    #0 rocksdb::ErrorHandler::RecoverFromRetryableBGIOError() fbcode/internal_repo_rocksdb/repo/db/error_handler.cc:742 (db_stress+0x95f954) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
    https://github.com/facebook/rocksdb/issues/1 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (rocksdb::ErrorHandler::*)(), rocksdb::ErrorHandler*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74 (db_stress+0x95fc2b) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
    https://github.com/facebook/rocksdb/issues/2 execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18 (libstdc++.so.6+0xdf4e4) (BuildId: 452d1cdae868baeeb2fdf1ab140f1c219bf50c6e)

  Previous read of size 1 at 0x7b8c000008b9 by thread T22:
    #0 rocksdb::DBImpl::SyncClosedLogs(rocksdb::JobContext*, rocksdb::VersionEdit*) fbcode/internal_repo_rocksdb/repo/db/error_handler.h:76 (db_stress+0x84f69c) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
```

This is due to a data race in accessing `recovery_in_prog_`. This PR fixes it by accessing `recovery_in_prog_` under db mutex before calling `SyncClosedLogs()`. I think the original PR https://github.com/facebook/rocksdb/pull/10489 intended to clear the error if it's a recovery flush. So ideally we can also just check flush reason. I plan to keep a safer change in this PR and make that change in the future if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11950

Test Plan: check future TSAN stress test results.

Reviewed By: anand1976

Differential Revision: D50242255

Pulled By: cbi42

fbshipit-source-id: 0d487948ef9546b038a34460f3bb037f6e5bfc58
2023-10-12 16:55:25 -07:00
Changyu Bi 648fe25bc0 Always clear files marked for compaction in ComputeCompactionScore() (#11946)
Summary:
We were seeing the following stress test failures:
```LevelCompactionBuilder::PickFileToCompact(const rocksdb::autovector<std::pair<int, rocksdb::FileMetaData*> >&, bool): Assertion `!level_file.second->being_compacted' failed```

This can happen when we are picking a file to be compacted from some files marked for compaction, but that file is already being_compacted. We prevent this by always calling `ComputeCompactionScore()` after we pick a compaction and mark some files as being_compacted. However, if SetOptions() is called to disable marking certain files to be compacted, say `enable_blob_garbage_collection`, we currently just skip the relevant logic in `ComputeCompactionScore()` without clearing the existing files already marked for compaction. This PR fixes this issue by already clearing these files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11946

Test Plan: existing tests.

Reviewed By: akankshamahajan15

Differential Revision: D50232608

Pulled By: cbi42

fbshipit-source-id: 11e4fb5e9d48b0f946ad33b18f7c005f0161f496
2023-10-12 15:26:10 -07:00
Peter Dillinger d010b02e86 Fix race in options taking effect (#11929)
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect.

To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine.

I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race.

Intended follow-up to this:
* Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929

Test Plan:
Added two mini-stress style regression tests that fail with >1% probability before this change:
DBOptionsTest::SetStatsDumpPeriodSecRace
ColumnFamilyTest::CreateAndDropPeriodicRace

I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway:
DBOptionsTest::SetStatsDumpPeriodSecRace

Reviewed By: ajkr

Differential Revision: D50024506

Pulled By: pdillinger

fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4
2023-10-12 10:05:23 -07:00
Andrew Kryczka 4bd5aa4f55 Fix two ErrorHandler race conditions (#11939)
Summary:
1. Prevent a double join on a `port::Thread`
2. Ensure `recovery_in_prog_` and `bg_error_` are both set under same lock hold. This is useful for writers who see a non-OK `bg_error_` and are deciding whether to stall based on whether the error will be auto-recovered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11939

Reviewed By: cbi42

Differential Revision: D50155484

Pulled By: ajkr

fbshipit-source-id: fbc1f85c50e7eaee27ee0e376aee688d8a06c93b
2023-10-11 09:42:48 -07:00
Andrew Kryczka 77d160ef47 Consolidate ErrorHandler's recovery status variables (#11937)
Summary:
cbi42 pointed out a race condition in which `recovery_io_error_` and `recovery_error_` could be updated inconsistently due to releasing the DB mutex in `EventHelpers::NotifyOnBackgroundError()`. There doesn't seem to be a point to having two status objects, so this PR consolidates them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11937

Reviewed By: cbi42

Differential Revision: D50105793

Pulled By: ajkr

fbshipit-source-id: 3de95baccfa44351a49a5c2aa0986c9bc81baa8f
2023-10-10 06:31:45 -07:00
Andrew Kryczka 8a9cfd5292 Make stopped writes block on recovery (#11879)
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.

This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879

Reviewed By: anand1976

Differential Revision: D49571196

Pulled By: ajkr

fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0
2023-10-10 06:29:01 -07:00
darionyaphet ee0829ba76 fix typo snapshto (#11817)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11817

Reviewed By: jaykorean

Differential Revision: D50103497

Pulled By: ltamasi

fbshipit-source-id: 77c5cf86ff7eb5021fc91b03225882536163af7b
2023-10-09 19:10:06 -07:00
Levi Tamasi 51d7e6a49e Clean up WriteBatchWithIndexInternal a bit (#11930)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11930

The patch cleans up and refactors the logic in/around `WriteBatchWithIndexInternal` a bit as groundwork for further changes. Specifically, the class is turned back into a stateless collection of static helpers (which is the way it was before PR 6851). Note that there were two apparent reasons for introducing this instance state in PR 6851: a) encapsulating `MergeContext` and b) resolving objects like `Logger` and `Statistics` based on a variety of handles. However, neither reason seems justified at this point. Regarding a), the `MultiGetFromBatchAndDB` logic passes in its own `MergeContext` objects via a second set of methods that do not use the member `MergeContext`. As for b), `Logger` and friends are only needed for Merge, which is only supported if a column family handle is provided; in turn, the column family handle enables us to resolve all the necessary objects without the need for any other handles like `DB` or `DBOptions`. In addition to the above, the patch changes the type of `BaseDeltaIterator::merge_result_` to `std::string` from `PinnableSlice` (since no pinning is ever done) and makes some other small code quality improvements.

Reviewed By: jaykorean

Differential Revision: D50038302

fbshipit-source-id: 5f34abe2e808bdaea0f3a8033b5764ebd446b85d
2023-10-09 15:25:35 -07:00
Peter Dillinger 1d5bddbc58 Bootstrap, pre-populate seqno_to_time_mapping (#11922)
Summary:
This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920):
* Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests.
* For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time.

Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.)

Recommended follow-up:
* Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect.
* Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.)

No release note for the minor bug fix because this is still an experimental feature with limited usage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922

Test Plan: tests added / updated

Reviewed By: jowlyzhang

Differential Revision: D49956563

Pulled By: pdillinger

fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519
2023-10-06 08:21:21 -07:00
Hui Xiao 8e949116f7 Fix comments about creation_time/oldest_ancester_time/oldest_key_time (#11921)
Summary:
Code reference for the comments change:

40b618f234/table/block_based/block_based_table_builder.cc?fbclid=IwAR0JlfnG8wysclFP5wv0fSngFbi_j32BUCKbFayeGdr10tzDhyyk5QqpclA#L2093

40b618f234/db/flush_job.cc?fbclid=IwAR1ri6eTX3wyD_2fAEBRzFSwZItcbmDS8LaB11k1letDMQmB2L8nF6TfXDs#L945-L949

40b618f234/db/compaction/compaction_job.cc (L1882-L1904)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11921

Reviewed By: cbi42

Differential Revision: D49921304

Pulled By: hx235

fbshipit-source-id: 2ae17e43c0fd52044404d7b63fea254d2d1f3595
2023-10-04 14:42:35 -07:00
Peter Dillinger 141b872bd4 Improve efficiency of create_missing_column_families, light refactor (#11920)
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.

Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920

Test Plan: tests added / updated

Reviewed By: jaykorean

Differential Revision: D49919147

Pulled By: pdillinger

fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
2023-10-04 14:14:22 -07:00