Commit graph

117 commits

Author SHA1 Message Date
Jay Huh 1fca175eec MultiCFSnapshot for NewIterators() API (#12573)
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/12561 and https://github.com/facebook/rocksdb/issues/12566 , `NewIterators()` API has not been providing consistent view of the db across multiple column families. This PR addresses it by utilizing `MultiCFSnapshot()` function which has been used for `MultiGet()` APIs. To be able to obtain the thread-local super version with ref, `sv_exclusive_access` parameter has been added to `MultiCFSnapshot()` so that we could call `GetReferencedSuperVersion()` or `GetAndRefSuperVersion()` depending on the param and support `Refresh()` API for MultiCfIterators

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12573

Test Plan:
**Unit Tests Added**

```
./db_iterator_test --gtest_filter="*IteratorsConsistentView*"
```
```
./multi_cf_iterator_test -- --gtest_filter="*ConsistentView*"
```

**Performance Check**

Setup
```
make -j64 release
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none
```

Run
```
TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000
```
Before the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom :       6.374 micros/op 156892 ops/sec 6.374 seconds 1000000 operations; (0 of 1000000 found)
```
After the change
```
DB path: [/dev/shm/db_bench/dbbench]
multireadrandom :       6.265 micros/op 159627 ops/sec 6.265 seconds 1000000 operations; (0 of 1000000 found)
```

Reviewed By: jowlyzhang

Differential Revision: D56444066

Pulled By: jaykorean

fbshipit-source-id: 327ce73c072da30c221e18d4f3389f49115b8f99
2024-04-24 15:28:55 -07:00
Andrew Kryczka 6807da0b44 Fix DisableManualCompaction() hang (#12578)
Summary:
Prior to this PR the following sequence could happen:

1. `RunManualCompaction()` A schedules compaction to thread pool and waits
2. `RunManualCompaction()` B waits without scheduling anything due to conflict
3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both
4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done
5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool
6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction
7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution
8. `DisableManualCompaction()` returns as there are no more manual compactions running

Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward:

5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done
6'. `DisableManualCompaction()` returns as there are no more manual compactions running

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578

Reviewed By: cbi42

Differential Revision: D56528144

Pulled By: ajkr

fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b
2024-04-24 12:40:36 -07:00
Andrew Kryczka 3f3045a405 fix DeleteRange+memtable_insert_with_hint_prefix_extractor interaction (#12558)
Summary:
Previously `insert_hints_` was used for both point key table (`table_`) and range deletion table (`range_del_table_`). Hints include pointers to table data, so mixing hints for different tables together without tracking which hint corresponds to which table was problematic. We can just make the hints dedicated to the point key table only.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12558

Reviewed By: hx235

Differential Revision: D56279019

Pulled By: ajkr

fbshipit-source-id: 00fe5ce72f9f11a1c1cba5f1977b908b2d518f29
2024-04-22 20:13:58 -07:00
Hui Xiao 7d83b4e3e5 Fix file checksum mismatch due to padded bytes when block_align=true (#12542)
Summary:
**Context/Summary:**
When `BlockBasedTableOptions::block_align=true`, we pad bytes to align blocks d41e568b1c/table/block_based/block_based_table_builder.cc (L1415-L1421).
Those bytes are not included in generating the file checksum upon file creation. But `VerifyFileChecksums()` includes those bytes in generating the file check to compare against the checksum generating upon file creation. Therefore a file checksum mismatch is returned in `VerifyFileChecksums()`.

We decided to include those padded bytes in generating the checksum upon file creation.

Bonus: also fix surrounding code to use actual padded bytes for verification - see https://github.com/facebook/rocksdb/pull/12542#discussion_r1571429163

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12542

Test Plan:
- New UT
- Benchmark
```
TEST_TMPDIR=/dev/shm  ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG    300 runs] : 422857 (± 3942) ops/sec;   46.8 (± 0.4) MB/sec
Post-PR:
fillseq [AVG    300 runs] : 424707 (± 3799) ops/sec;   47.0 (± 0.4) MB/sec

Reviewed By: ajkr

Differential Revision: D56168447

Pulled By: hx235

fbshipit-source-id: 96209ef950d42943d336f11968ae3fcf9872fc2c
2024-04-22 14:07:34 -07:00
Levi Tamasi e82fe7c0b7 Fix the move semantics of PinnableWideColumns (#12557)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12557

Unlike for other sequence containers, the C++ standard allows moving an `std::string` to invalidate pointers/iterators/references. In practice, this happens with short strings which are stored "inline" in the `std::string` object (small string optimization). Since `PinnableSlice` uses `std::string` as its internal buffer, and `PinnableWideColumns` in turn is implemented in terms of `PinnableSlice`, this means that the default compiler-generated move operations can invalidate the column index stored in `PinnableWideColumns::columns_`. The PR fixes this by providing custom move constructor/move assignment implementations for `PinnableWideColumns` that recreate the `columns_` index upon move.

Reviewed By: jaykorean

Differential Revision: D56275054

fbshipit-source-id: e8648c003dbcf1c39ec122ad229780c28138e730
2024-04-17 18:56:23 -07:00
Andrew Kryczka 7027265417 Fix max_successive_merges counting CPU overhead regression (#12546)
Summary:
In https://github.com/facebook/rocksdb/issues/12365 we made `max_successive_merges` non-strict by default. Before https://github.com/facebook/rocksdb/issues/12365, `CountSuccessiveMergeEntries()`'s scan was implicitly limited to `max_successive_merges` entries for a given key, because after that the merge operator would be invoked and the merge chain would be collapsed. After https://github.com/facebook/rocksdb/issues/12365, the merge chain will not be collapsed no matter how long it is when the chain's operands are not all in memory. Since `CountSuccessiveMergeEntries()` scanned the whole merge chain, https://github.com/facebook/rocksdb/issues/12365 had a side effect that it would scan more memtable entries. This PR introduces a limit so it won't scan more entries than it could before.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12546

Reviewed By: jaykorean

Differential Revision: D56193693

Pulled By: ajkr

fbshipit-source-id: b070ba0703ef733e0ff230f89cd5cca5233b84da
2024-04-17 12:11:24 -07:00
akankshamahajan 1856734821 Branch cut 9.1.fb (#12476)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12476

Reviewed By: jowlyzhang

Differential Revision: D55319508

Pulled By: akankshamahajan15

fbshipit-source-id: 2b6db671e027511282775c0fea155335d8e73cc2
2024-03-25 15:07:43 -07:00
Andrew Kryczka bf98dcf9a8 Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462)
Summary:
The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file.

There's also fixes for bugs encountered along the way where non-ok statuses were ignored/overwritten, and a bit of plumbing work for functions that had no capability to return a status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462

Test Plan:
A repro command

```
db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp
./db_stress \
        --clear_column_family_one_in=0 \
        --test_batches_snapshots=0 \
        --write_fault_one_in=0 \
        --use_put_entity_one_in=0 \
        --prefixpercent=0 \
        --read_fault_one_in=0 \
        --readpercent=0 \
        --reopen=0 \
        --set_options_one_in=10000 \
        --delpercent=0 \
        --delrangepercent=0 \
        --open_metadata_write_fault_one_in=0 \
        --open_read_fault_one_in=0 \
        --open_write_fault_one_in=0 \
        --destroy_db_initially=0 \
        --ingest_external_file_one_in=0 \
        --iterpercent=0 \
        --nooverwritepercent=0 \
        --db=$db \
        --enable_blob_files=1 \
        --expected_values_dir=$exp \
        --max_background_compactions=20 \
        --max_bytes_for_level_base=2097152 \
        --max_key=100000 \
        --min_blob_size=0 \
        --open_files=-1 \
        --ops_per_thread=100000000 \
        --prefix_size=-1 \
        --target_file_size_base=524288 \
        --use_merge=1 \
        --value_size_mult=32 \
        --write_buffer_size=524288 \
        --writepercent=100
```

It used to fail like:

```
...
frame https://github.com/facebook/rocksdb/issues/9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3
frame https://github.com/facebook/rocksdb/issues/10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, rocksdb::PinnedIteratorsManager*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33
frame https://github.com/facebook/rocksdb/issues/11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492
frame https://github.com/facebook/rocksdb/issues/12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408
frame https://github.com/facebook/rocksdb/issues/13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109
frame https://github.com/facebook/rocksdb/issues/14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656
frame https://github.com/facebook/rocksdb/issues/15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607
frame https://github.com/facebook/rocksdb/issues/16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505
frame https://github.com/facebook/rocksdb/issues/17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084
frame https://github.com/facebook/rocksdb/issues/18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807
frame https://github.com/facebook/rocksdb/issues/19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312
frame https://github.com/facebook/rocksdb/issues/20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157
frame https://github.com/facebook/rocksdb/issues/21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544
frame https://github.com/facebook/rocksdb/issues/22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72
frame https://github.com/facebook/rocksdb/issues/23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317
frame https://github.com/facebook/rocksdb/issues/24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148
...
```

Reviewed By: ltamasi

Differential Revision: D55157795

Pulled By: ajkr

fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764
2024-03-21 12:38:53 -07:00
anand76 63a105a481 Enable recycle_log_file_num option for point in time recovery (#12403)
Summary:
This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL.

This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403

Test Plan:
1. Add new unit tests to verify behavior upon corruption
2. Re-enable disabled tests for verifying recycling behavior

Reviewed By: ajkr

Differential Revision: D54544824

Pulled By: anand1976

fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a
2024-03-21 12:29:35 -07:00
Andrew Kryczka 4d5ebad971 Fix kBlockCacheTier read with table cache miss (#12443)
Summary:
Thanks ltamasi for pointing out this bug.

We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443

Reviewed By: cbi42

Differential Revision: D54930128

Pulled By: ajkr

fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d
2024-03-15 14:41:58 -07:00
Changyu Bi 096fb9b67d Fix data race in WalManager (#12439)
Summary:
Crash tests were failing due to data race in accessing `purge_wal_files_last_run_`. This PR changes it to atomic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439

Test Plan:
- existing UT
- not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests

Reviewed By: anand1976

Differential Revision: D54920817

Pulled By: cbi42

fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6
2024-03-14 21:24:06 -07:00
Yu Zhang 31dfc81e18 Start 9.1.0 release (#12360)
Summary:
with release notes for 9.0.fb, format_compatible test update, and version.h update.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360

Test Plan: CI

Reviewed By: cbi42

Differential Revision: D53879416

Pulled By: jowlyzhang

fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75
2024-02-16 18:26:48 -08:00
Levi Tamasi de1e3ff6ea Fix a data race in DBImpl::RenameTempFileToOptionsFile (#12347)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347

`DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels).

Reviewed By: jowlyzhang

Differential Revision: D53675153

fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481
2024-02-12 13:26:09 -08:00
anand76 95b41eec6d Fix potential incorrect result for duplicate key in MultiGet (#12295)
Summary:
The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met -
1. Duplicate keys in a MultiGet batch
2. Key matches the last key in a non-zero, non-bottommost level file
3. Final value is not in the file (merge operand, not snapshot visible etc)
4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block
5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates
6. Value or merge operand for the key is present in the very next level

The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under the  conditions https://github.com/facebook/rocksdb/issues/1 and https://github.com/facebook/rocksdb/issues/2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295

Test Plan: Add unit tests that fail an assertion or return wrong result without the fix

Reviewed By: hx235

Differential Revision: D53187634

Pulled By: anand1976

fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea
2024-02-02 11:48:35 -08:00
Chdy fc48af33f5 fix some perf statistic in write (#12285)
Summary:
### Summary:  perf context lack statistics in some write steps
```
rocksdb::get_perf_context()->write_wal_time);
rocksdb::get_perf_context()->write_memtable_time);
rocksdb::get_perf_context()->write_pre_and_post_process_time);
```

#### case 1:
when the unordered_write is true, the `write_memtable_time` is 0
```
write_wal_time : 13.7012
write_memtable_time : 0
write_pre_and_post_process_time : 142.037
```

Reason: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` during insert memtable,

```c++
Status DBImpl::UnorderedWriteMemtable(const WriteOptions& write_options,
                                      WriteBatch* my_batch,
                                      WriteCallback* callback, uint64_t log_ref,
                                      SequenceNumber seq,
                                      const size_t sub_batch_cnt) {
	...
  if (w.CheckCallback(this) && w.ShouldWriteToMemtable()) {

    // need calculate write_memtable_time
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        seq_per_batch_, sub_batch_cnt, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);
    if (write_options.disableWAL) {
      has_unpersisted_data_.store(true, std::memory_order_relaxed);
    }
  }
	...
}
```
Fix: add perf function
```
write_wal_time : 14.3991
write_memtable_time : 19.3367
write_pre_and_post_process_time : 130.441
```

#### case 2:
when the enable_pipelined_write is true, the `write_memtable_time` is small
```
write_wal_time : 11.2986
write_memtable_time : 1.0205
write_pre_and_post_process_time : 140.131
```

Fix: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` when `w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER`
```c++
Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
                                  WriteBatch* my_batch, WriteCallback* callback,
                                  uint64_t* log_used, uint64_t log_ref,
                                  bool disable_memtable, uint64_t* seq_used) {
  ...
  if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
    // need calculate write_memtable_time

    assert(w.ShouldWriteToMemtable());
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        false /*seq_per_batch*/, 0 /*batch_cnt*/, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);

    if (write_thread_.CompleteParallelMemTableWriter(&w)) {
      MemTableInsertStatusCheck(w.status);
      versions_->SetLastSequence(w.write_group->last_sequence);
      write_thread_.ExitAsMemTableWriter(&w, *w.write_group);
    }
  }
  if (seq_used != nullptr) {
    *seq_used = w.sequence;
  }

  assert(w.state == WriteThread::STATE_COMPLETED);
  return w.FinalStatus();
}
```
FIx: add perf function
```
write_wal_time : 10.5201
write_memtable_time : 17.1048
write_pre_and_post_process_time : 114.313
```

#### case3:
`DBImpl::WriteImplWALOnly` function has no statistical `write_delay_time`
```c++
Status DBImpl::WriteImplWALOnly(
    WriteThread* write_thread, const WriteOptions& write_options,
    WriteBatch* my_batch, WriteCallback* callback, uint64_t* log_used,
    const uint64_t log_ref, uint64_t* seq_used, const size_t sub_batch_cnt,
    PreReleaseCallback* pre_release_callback, const AssignOrder assign_order,
    const PublishLastSeq publish_last_seq, const bool disable_memtable) {
 ...
  if (publish_last_seq == kDoPublishLastSeq) {

  } else {
    // need calculate write_delay_time
    InstrumentedMutexLock lock(&mutex_);
    Status status =
        DelayWrite(/*num_bytes=*/0ull, *write_thread, write_options);
    if (!status.ok()) {
      WriteThread::WriteGroup write_group;
      write_thread->EnterAsBatchGroupLeader(&w, &write_group);
      write_thread->ExitAsBatchGroupLeader(write_group, status);
      return status;
    }
  }
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12285

Reviewed By: ajkr

Differential Revision: D53191765

Pulled By: cbi42

fbshipit-source-id: f78d5b280bea6a777f077c89c3e0b8fe98d3c860
2024-01-29 12:31:11 -08:00
Hui Xiao 1b2b16b38e Fix bug of newer ingested data assigned with an older seqno (#12257)
Summary:
**Context:**
We found an edge case where newer ingested data is assigned with an older seqno. This causes older data of that key to be returned for read.

Consider the following lsm shape:
![image](https://github.com/facebook/rocksdb/assets/83968999/973fd160-5065-49cd-8b7b-b6ab4badae23)
Then ingest a file to L5 containing new data of key_overlap. Because of [this](https://l.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fblob%2F5a26f392ca640818da0b8590be6119699e852b07%2Fdb%2Fexternal_sst_file_ingestion_job.cc%3Ffbclid%3DIwAR10clXxpUSrt6sYg12sUMeHfShS7XigFrsJHvZoUDroQpbj_Sb3dG_JZFc%23L951-L956&h=AT0m56P7O0ZML7jk1sdjgnZZyGPMXg9HkKvBEb8mE9ZM3fpJjPrArAMsaHWZQPt9Ki-Pn7lv7x-RT9NEd_202Y6D2juIVHOIt3EjCZptDKBLRBMG49F8iBUSM9ypiKe8XCfM-FNW2Hl4KbVq2e3nZRbMvUM), the file is assigned with seqno 2, older than the old data's seqno 4. After just another compaction, we will drop the new_v for key_overlap because of the seqno and cause older data to be returned.
![image](https://github.com/facebook/rocksdb/assets/83968999/a3ef95e4-e7ae-4c30-8d03-955cd4b5ed42)

**Summary:**
This PR removes the incorrect seqno assignment

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12257

Test Plan:
- New unit test failed before the fix but passes after
- python3 tools/db_crashtest.py --compaction_style=1 --ingest_external_file_one_in=10 --preclude_last_level_data_seconds=36000 --compact_files_one_in=10 --enable_blob_files=0 blackbox`
- Rehearsal stress test

Reviewed By: cbi42

Differential Revision: D52926092

Pulled By: hx235

fbshipit-source-id: 9e4dade0f6cc44e548db8fca27ccbc81a621cd6f
2024-01-24 11:21:05 -08:00
Peter Dillinger 800cfae987 Start 9.0.0 release (#12256)
Summary:
with release notes for 8.11.fb, format_compatible test update, and version.h update.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256

Test Plan: CI

Reviewed By: cbi42

Differential Revision: D52926051

Pulled By: pdillinger

fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4
2024-01-20 08:38:20 -08:00
Peter Dillinger cb08a682d4 Fix/cleanup SeqnoToTimeMapping (#12253)
Summary:
The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.

* Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc.
  * Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
* Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
* Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
  * Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries.
  * Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.`
* Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
* Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.

Intended follow-up (me or others):
* Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
* The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253

Test Plan: unit tests updated

Reviewed By: jowlyzhang

Differential Revision: D52913733

Pulled By: pdillinger

fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593
2024-01-19 21:50:38 -08:00
Levi Tamasi 7e4406a171 Add a changelog entry for PR 12235 (#12238)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12238

Reviewed By: jowlyzhang

Differential Revision: D52809593

fbshipit-source-id: 692852cdd3074275ef92bde83ff15a800d8ae3d5
2024-01-16 13:02:18 -08:00
akankshamahajan cad76a2e1e Fix bug in auto_readahead_size that returned wrong key (#12229)
Summary:
IndexType::kBinarySearchWithFirstKey +
BlockCacheLookupForReadAheadSize enabled => FindNextUserEntryInternal assertion fails or iterator lands at a wrong key because BlockCacheLookupForReadAheadSize moves the index_iter_ and in internal_wrapper.h, result_.key didn't update and pointed to wrong key. Also ikey_ was also pointing to iter_.key() instead of copying the key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12229

Test Plan:
```
 rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
mkdir /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
./db_stress -threads=1 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_setting_blob_options_dynamically=0 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=0 --backup_one_in=0 --batch_protection_bytes_per_key=0 --blob_cache_size=0 --blob_compaction_readahead_size=0 --blob_compression_type=lz4 --blob_file_size=0 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0 --blob_garbage_collection_force_threshold=0 --block_protection_bytes_per_key=0 --block_size=2048 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=0 --compaction_ttl=0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 --db_write_buffer_size=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=3 --ingest_external_file_one_in=10 --initial_auto_readahead_size=0 --iterpercent=55 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_blob_size=8 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=10000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=1 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --snapshot_hold_ops=0 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_blob_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=0 --verify_checksum_one_in=0 --verify_db_one_in=0 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=1 --verify_sst_unique_id_in_manifest=0 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=0 > repro.out
Verification failed. Expected state has key 0000000000000077000000000000004178, iterator is at key 0000000000000077000000000000008A78
Column family: default, op_logs: S 0000000000000077000000000000003D7878787878 NNNN
No writes or ops?
Verification failed :(
```

Reviewed By: ajkr

Differential Revision: D52710655

Pulled By: akankshamahajan15

fbshipit-source-id: 9d2e684e190fb0832bdce3337bce1c6548cd054d
2024-01-16 11:30:36 -08:00
Peter Dillinger 5da900f28a Fix a case of ignored corruption in creating backups (#12200)
Summary:
We often need to read the table properties of an SST file when taking a backup. However, we currently do not check checksums for this step, and even with that enabled, we ignore failures. This change ensures we fail creating a backup if corruption is detected in that step of reading table properties.

To get this working properly (with existing unit tests), we also add some temperature handling logic like already exists in
BackupEngineImpl::ReadFileAndComputeChecksum and elsewhere in BackupEngine. Also, SstFileDumper needed a fix to its error handling logic.

This was originally intended to help diagnose some mysterious failures (apparent corruptions) seen in taking backups in the crash test, though that is now fixed in https://github.com/facebook/rocksdb/pull/12206

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12200

Test Plan: unit test added that corrupts table properties, along with existing tests

Reviewed By: ajkr

Differential Revision: D52520674

Pulled By: pdillinger

fbshipit-source-id: 032cfc0791428f3b8147d34c7d424ab128e28f42
2024-01-05 09:48:19 -08:00
Peter Dillinger ed46981bea Fix and defend against FilePrefetchBuffer combined with mmap reads (#12206)
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.

In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
  * Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.

Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206

Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200

Reviewed By: akankshamahajan15

Differential Revision: D52551701

Pulled By: pdillinger

fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
2024-01-04 18:39:05 -08:00
Hui Xiao 06e593376c Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.

For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS

Some related code refactory to make implementation cleaner:
- Blob stats
   - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS  include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
   - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
   - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
   - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see  https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
   - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority

## Test
### db bench

Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100

rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```

blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB

Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```

```
Stacked Blob DB

Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```

###  Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests

###  Performance

Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true

Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,

Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```

Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846

Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```

Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860

Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910

Reviewed By: ajkr

Differential Revision: D49788060

Pulled By: hx235

fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 15:29:23 -08:00
Peter Dillinger a771a47a1b Fix leak or crash on failure in automatic atomic flush (#12176)
Summary:
Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test:

```
db_flush_test: db/column_family.cc:1648:
rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176

Test Plan: unit test added

Reviewed By: cbi42

Differential Revision: D52417071

Pulled By: pdillinger

fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5
2023-12-26 11:04:25 -08:00
Levi Tamasi 81765866c4 Update HISTORY/version/format compatibility script for the 8.10 release (#12154)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12154

Reviewed By: jaykorean, akankshamahajan15

Differential Revision: D52216271

Pulled By: ltamasi

fbshipit-source-id: 13bab72802eeec8f6e3544be9ebcd7f725a64d2e
2023-12-15 14:44:23 -08:00
Akanksha Mahajan cd577f6059 Fix WRITE_STALL start_time (#12147)
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)
However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time);

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147

Test Plan: Existing CircleCI

Reviewed By: cbi42

Differential Revision: D52173481

Pulled By: akankshamahajan15

fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98
2023-12-14 13:45:06 -08:00
akankshamahajan d926593df5 Fix stress tests failure for auto_readahead_size (#12131)
Summary:
When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter  so that
- BlockBasedTableIterator can point index_iter_ to the right block.
- disable readahead_cache_lookup.
However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This more like a corner case.

So to handle that case, removed SeekForPrev calling from db_iter and reseeking index_iter_ in Prev operation. block_iter_'s key already point to right block. So reseeking to index_iter_ solves the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131

Test Plan:
- Tested on db_stress command that was failing -
`./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35`

 - make crash_test -j32

Reviewed By: anand1976

Differential Revision: D51986326

Pulled By: akankshamahajan15

fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9
2023-12-13 12:15:04 -08:00
anand76 c1b84d0437 Fix false negative in TieredSecondaryCache nvm cache lookup (#12134)
Summary:
There is a bug in the `TieredSecondaryCache` that can result in a false negative. This can happen when a MultiGet does a cache lookup that gets a hit in the `TieredSecondaryCache` local nvm cache tier, and the result is available before MultiGet calls `WaitAll` (i.e the nvm cache `SecondaryCacheResultHandle` `IsReady` returns true).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12134

Test Plan: Add a new unit test in tiered_secondary_cache_test

Reviewed By: akankshamahajan15

Differential Revision: D52023309

Pulled By: anand1976

fbshipit-source-id: e5ae681226a0f12753fecb2f6acc7e5f254ae72b
2023-12-11 16:59:59 -08:00
Kevin Mingtarja 44fd914128 Fix double counting of BYTES_WRITTEN ticker (#12111)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12061.

We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable.

The cause of this bug is previously during writes, we didn't take into account `writer->ShouldWriteToMemtable()` before adding to `total_byte_size`, so it is still added to during the `Prepare()` phase even though we're not writing to the Memtable, which was why we saw the value to be double of what's written to WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111

Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes.

Reviewed By: jaykorean

Differential Revision: D51954327

Pulled By: jowlyzhang

fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c
2023-12-08 17:12:11 -08:00
Levi Tamasi a143f93236 Turn the default Timer in PeriodicTaskScheduler into a leaky Meyers singleton (#12128)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12128

The patch turns the `Timer` Meyers singleton in `PeriodicTaskScheduler::Default()` into one of the leaky variety in order to prevent static destruction order issues.

Reviewed By: akankshamahajan15

Differential Revision: D51963950

fbshipit-source-id: 0fc34113ad03c51fdc83bdb8c2cfb6c9f6913948
2023-12-08 10:34:07 -08:00
Jay Huh ddb7df10ef Update HISTORY.md and version.h for 8.9.fb release (#12074)
Summary:
Creating cut for 8.9 release

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12074

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D51435289

Pulled By: jaykorean

fbshipit-source-id: 3918a8250032839e5b71f67f26c8ba01cbc17a41
2023-11-21 18:07:19 -08:00
Yu Zhang 84a54e1e28 Fix some bugs in index builder and reader for the UDT in memtable only feature (#12062)
Summary:
These bugs surfaced while I was trying to add the stress test for the feature:

Bug 1) On the index building path: the optimization to use user key instead of internal key as separator needed a bit tweak for when user defined timestamps can be removed. Because even though the user key look different now and eligible to be used as separator, when their user-defined timestamps are removed, they could be equal and that invariant no longer stands.

Bug 2) On the index reading path: one path that builds the second level index iterator for `PartitionedIndexReader` are not passing the corresponding `user_defined_timestamps_persisted` flag. As a result, the default `true` value be used leading to no minimum timestamps padded when they should be.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12062

Test Plan:
For bug 1): added separate unit test `BlockBasedTableReaderTest::Get` to exercise the `Get` API. It's a different code path from `MultiGet` so worth having its own test. Also in order to cover the bug, the test is modified to generate key values with the same user provided key, different timestamps and different sequence numbers. The test reads back different versions of the same user provided key.  `MultiGet` takes one `ReadOptions` with one read timestamp so we cannot test retrieving different versions of the same key easily.

For bug 2): simply added options `BlockBasedTableOptions.metadata_cache_options.partition_pinning = PinningTier::kAll` to exercise all the index iterator creating paths.

Reviewed By: ltamasi

Differential Revision: D51508280

Pulled By: jowlyzhang

fbshipit-source-id: 8b174d3d70373c0599266ac1f467f2bd4d7ea6e5
2023-11-21 14:05:02 -08:00
Peter Dillinger 92dc5f3e67 AutoHCC: fix a bug with "blind" Insert (#12046)
Summary:
I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions.

It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`).

The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have
* Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain.
* Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after is it marked for (de)construction). And after the purge of the chain, the entry is marked empty.
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain.

This is easily fixed by holding on to a reference until inserted onto the chain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046

Test Plan:
As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed.

Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000)

Reviewed By: jowlyzhang

Differential Revision: D51016979

Pulled By: pdillinger

fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b
2023-11-06 16:06:01 -08:00
Hui Xiao 0f141352d8 Fix race between flush error recovery and db destruction (#12002)
Summary:
**Context:**
DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: 519f2a41fb/db/db_impl/db_impl.cc (L525) -> 519f2a41fb/db/error_handler.cc (L250) -> 519f2a41fb/db/error_handler.cc (L808-L823)

However, due to a race between flush error recovery and db destruction, recovery can actually start after such wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part the db destruction. It then crashes the program as below.

```
std::terminate()
std::default_delete<std::thread>::operator()(std::thread*) const
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
```

**Summary:**
This PR fixed it by considering whether EndAutoRecovery() has been called before creating such thread. This fix is similar to how we currently [handle](519f2a41fb/db/error_handler.cc (L688-L694)) such case inside the created recovery thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002

Test Plan: A new UT repro-ed the crash before this fix and and pass after.

Reviewed By: ajkr

Differential Revision: D50586191

Pulled By: hx235

fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093
2023-10-25 11:59:09 -07:00
Hui Xiao ab15d33566 Update history, version and format testing for 8.8 (#12004)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12004

Reviewed By: cbi42

Differential Revision: D50586984

Pulled By: hx235

fbshipit-source-id: 1480a8c2757340ebf83510557104aaa0e437b3ae
2023-10-24 12:03:07 -07:00
rogertyang 543191f2ea Add bounds checking to WBWIIteratorImpl and respect bounds of ReadOptions in Transaction (#11680)
Summary:
Fix https://github.com/facebook/rocksdb/issues/11607
Fix https://github.com/facebook/rocksdb/issues/11679
Fix https://github.com/facebook/rocksdb/issues/11606
Fix https://github.com/facebook/rocksdb/issues/2343

Add bounds checking to `WBWIIteratorImpl`, which will be reflected in `BaseDeltaIterator::delta_iterator_::Valid()`, just like `BaseDeltaIterator::base_iterator_::Valid()`. In this way, the two sub itertors become more aligned from `BaseDeltaIterator`'s perspective. Like `DBIter`, the added bounds checking caps in either bound when seeking and disvalidates the `WBWIIteratorImpl` iterator when the lower bound is past or the upper bound is reached.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11680

Test Plan:
- A simple test added to write_batch_with_index_test.cc to exercise the bounds checking in `WBWIIteratorImpl`.
- A sophisticated test added to transaction_test.cc to assert that `Transaction` with different write policies honor bounds in `ReadOptions`. It should be so as long as the  `BaseDeltaIterator` is correctly coordinating the two sub iterators to perform iterating and bounds checking.

Reviewed By: ajkr

Differential Revision: D48125229

Pulled By: cbi42

fbshipit-source-id: c9acea52595aed1471a63d7ca6ef15d2a2af1367
2023-10-20 13:28:28 -07:00
akankshamahajan 9135a61ec6 Fix corruption error in stress test for auto_readahead_size enabled (#11961)
Summary:
Fix corruption error - "Corruption: first key in index doesn't match first key in block". when auto_readahead_size is enabled. Error is because of bug when index_iter_ moves forward, first_internal_key of that index_iter_ is not copied. So the Slice points to a different key resulting in wrong comparison when doing comparison.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11961

Test Plan: Ran stress test which reproduced this error.

Reviewed By: anand1976

Differential Revision: D50310589

Pulled By: akankshamahajan15

fbshipit-source-id: 95d8320b8388f1e3822c32024f84754f3a20a631
2023-10-17 12:21:08 -07:00
anand76 90e160733e Fix runtime error in UpdateTieredCache due to integer underflow (#11949)
Summary:
With the introduction of the `UpdateTieredCache` API, its possible to dynamically change the compressed secondary cache ratio of the total cache capacity. In order to optimize performance, we avoid using a mutex when inserting/releasing placeholder entries, which can result in some inaccuracy in the accounting during the dynamic update. This inaccuracy was causing a runtime error due to an integer underflow in `UpdateCacheReservationRatio`, causing ubsan crash tests to fail. This PR fixes it by explicitly checking for the underflow.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11949

Test Plan:
1. Added a unit test that fails without the fix
2. Run ubsan_crash

Reviewed By: akankshamahajan15

Differential Revision: D50240217

Pulled By: anand1976

fbshipit-source-id: d2f7b79da54eec8b61aec2cc1f2943da5d5847ac
2023-10-12 15:09:40 -07:00
Peter Dillinger d010b02e86 Fix race in options taking effect (#11929)
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect.

To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine.

I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race.

Intended follow-up to this:
* Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929

Test Plan:
Added two mini-stress style regression tests that fail with >1% probability before this change:
DBOptionsTest::SetStatsDumpPeriodSecRace
ColumnFamilyTest::CreateAndDropPeriodicRace

I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway:
DBOptionsTest::SetStatsDumpPeriodSecRace

Reviewed By: ajkr

Differential Revision: D50024506

Pulled By: pdillinger

fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4
2023-10-12 10:05:23 -07:00
Levi Tamasi b00fa5597e Fix the handling of wide-column base values in the max_successive_merges logic (#11913)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913

The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator.

Reviewed By: jaykorean

Differential Revision: D49795097

fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3
2023-10-02 16:25:25 -07:00
Hui Xiao fce04587b8 Only fallback to RocksDB internal prefetching on unsupported FS prefetching (#11897)
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/11631 introduced an undesired fallback behavior to RocksDB internal prefetching even when FS prefetching return non-OK status other than "Unsupported". We only want to fall back when FS prefetching is not supported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11897

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D49667055

Pulled By: hx235

fbshipit-source-id: fa36e4e5d6dc9507080217035f9d6ff8e4abda28
2023-09-26 18:44:41 -07:00
Changyu Bi 49da91ec09 Update files for version 8.8 (#11878)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11878

Reviewed By: ajkr

Differential Revision: D49568389

Pulled By: cbi42

fbshipit-source-id: b2022735799be9b5e81e03dfb418f8b104632ecf
2023-09-23 11:02:19 -07:00
anand76 48589b961f Fix updating the capacity of a tiered cache (#11873)
Summary:
Updating the tiered cache (cache allocated using ```NewTieredCache()```) by calling ```SetCapacity()``` on it was not working properly. The initial creation would set the primary cache capacity to the combined primary and compressed secondary cache capacity. But ```SetCapacity()``` would just set the primary cache capacity, with no way to change the secondary cache capacity. Additionally, the API was confusing, since the primary and compressed secondary capacities would be specified separately during creation, but ```SetCapacity``` took the combined capacity.

With this fix, the user always specifies the total budget and compressed secondary cache ratio on creation. Subsequently, `SetCapacity` will distribute the new capacity across the two caches by the same ratio. The `NewTieredCache` API has been changed to take the total cache capacity (inclusive of both the primary and the compressed secondary cache) and the ratio of total capacity to allocate to the compressed cache. These are specified in `TieredCacheOptions`. Any capacity specified in `LRUCacheOptions`, `HyperClockCacheOptions` and `CompressedSecondaryCacheOptions` is ignored. A new API, `UpdateTieredCache` is provided to dynamically update the total capacity, ratio of compressed cache, and admission policy.

Tests:
New unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11873

Reviewed By: akankshamahajan15

Differential Revision: D49562250

Pulled By: anand1976

fbshipit-source-id: 57033bc713b68d5da6292207765a6b3dbe539ddf
2023-09-22 18:07:46 -07:00
Changyu Bi 0086809601 Fix a bug with atomic_flush that causes DB to stuck after a flush failure (#11872)
Summary:
With atomic_flush=true, a flush job with younger memtables wait for older memtables to be installed before install its memtables. If the flush for older memtables failed, auto-recovery starts a resume thread which can becomes stuck waiting for all background work to finish (including the flush for younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to background error but never rollback its memtable: 269478ee46/db/db_impl/db_impl_compaction_flush.cc (L725) This prevents any future flush to pick old memtables.

A more detailed repro is in unit test.

This PR fixes this issue by
1. Ensure we rollback memtables if an atomic flush fails due to background error
2. When there is a background error, abort atomic flushes that are waiting for older memtables to be installed
3. Do not schedule non-recovery flushes when there is a background error that stops background work

There was another issue with atomic_flush=true where DB can hang during DB close, see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11872

Test Plan: new unit test.

Reviewed By: jowlyzhang

Differential Revision: D49556867

Pulled By: cbi42

fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f
2023-09-22 16:43:50 -07:00
Changyu Bi b927ba5936 Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an  error message like the following:

IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```

This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: f42e70bf56/db/db_impl/db_impl_compaction_flush.cc (L3161)

This PR fixes the issue by rollback these pending flushes.

There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.

There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865

Test Plan: * Added repro unit tests.

Reviewed By: anand1976

Differential Revision: D49484922

Pulled By: cbi42

fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 15:31:29 -07:00
anand76 548aabfe5f Disable compressed secondary cache if capacity is 0 (#11863)
Summary:
This PR makes disabling the compressed secondary cache by setting capacity to 0 a bit more efficient. Previously, inserts/lookups would go to the backing LRUCache before getting rejected due to 0 capacity. With this change, insert/lookup would return from ```CompressedSecondaryCache``` itself.

Tests:
Existing tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11863

Reviewed By: akankshamahajan15

Differential Revision: D49476248

Pulled By: anand1976

fbshipit-source-id: f0f17a5e3df7d8bfc06709f8f23c1302056ba590
2023-09-20 22:30:17 -07:00
akankshamahajan c1a97fe1f6 Fix Assertion `roundup_len2 >= alignment' failed in crash tests (#11852)
Summary:
When auto_readahead_size is enabled in async_io, during seek, first buffer will prefetch the data - (current block + readahead till upper_bound). There can be cases where
1.  first buffer prefetched all the data till upper bound, or
2.  first buffer already has the data from prev seek call
and second buffer prefetch further leading to alignment issues.

This PR fixes that assertion and second buffer won't go for prefetching if first buffer has already prefetched till upper_bound.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11852

Test Plan:
- Added new unit test that failed without this fix.
- crash tests passed locally

Reviewed By: pdillinger

Differential Revision: D49384138

Pulled By: akankshamahajan15

fbshipit-source-id: 54417e909e4d986f1e5a17dbaea059cd4962fd4d
2023-09-20 16:13:20 -07:00
chuhao zeng 8acf17002a Fix row cache falsely return kNotFound when timestamp enabled (#11816)
Summary:
**Summary:**
When row cache hits and a timestamp is being set in read_options, even though ROW_CACHE entry is hit, the return status is kNotFound.

**Cause of error:**
If timestamp is provided in readoptions, a callback for sequence number checking is registered [here](8fc78a3a9e/db/db_impl/db_impl.cc (L2112)).

Hence the default value set at this [line](694e49cbb1/table/get_context.cc (L611)) prevents get_context from saving value found in cache. Causing the final status to be kNotFound even though the entry exist in both cache and SST file.

**Proposed Solution**
Row cache key contains a sequence number in it. If the key for row cache lookup matches the key in cache, this cache entry should be good to be exposed to user and hence we reuse the sequence number in cache key rather than passing kMaxSequenceNumber.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11816

Reviewed By: ajkr

Differential Revision: D49419029

Pulled By: jowlyzhang

fbshipit-source-id: 6c77e9e751628d7d8e6c389f299e29a11ea824c6
2023-09-20 11:34:38 -07:00
Changyu Bi cc254efea6 Release compaction files in manifest write callback (#11764)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply().  This is done by passing in a [callback](ba59751430/db/version_set.h (L1199)) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11764

Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released

Reviewed By: ajkr

Differential Revision: D48742152

Pulled By: cbi42

fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
2023-09-18 13:11:53 -07:00
Hui Xiao ed913513bd Fix a bug of rocksdb.file.read.verify.file.checksums.micros not being populated (#11836)
Summary:
**Context/Summary:**
`rocksdb.file.read.verify.file.checksums.micros ` was added in https://github.com/facebook/rocksdb/pull/11444 but the related path was not populated with statistics and clock object correctly so the actual statistics collection didn't happen. This PR fixed it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11836

Test Plan:
Setup:
```
./db_bench --benchmarks="fillrandom" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb
```
Run:
```
./db_bench --use_existing_db=1  --benchmarks="verifyfilechecksums" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb --statistics=1 --stats_level=4
```
Post-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 9.000000 P95 : 9.000000 P99 : 9.000000 P100 : 9.000000 COUNT : 1 SUM : 9
```

Pre-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

Reviewed By: ajkr

Differential Revision: D49293378

Pulled By: hx235

fbshipit-source-id: 1acd8b828c28e088d0c5d63897f53cd180b82f42
2023-09-15 10:36:14 -07:00
akankshamahajan 1e2fd343bb Update upper_bound_offset when reseek changes iterate_upper_bound dynamically (#11775)
Summary:
Update the logic in FilePrefetchBuffer to update `upper_bound_offset_` during reseek. During Reseek, `iterate_upper_bound` can be changed dynamically. So added an API to update that in FilePrefetchBuffer.
Added unit test to confirm the behavior.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11775

Test Plan:
- Check stress tests in case there is any failure after this diff.
- make crash_test -j32 with auto_readahead_size=1 passed locally

Reviewed By: anand1976

Differential Revision: D48815177

Pulled By: akankshamahajan15

fbshipit-source-id: 5f44fbb3af06c86a1c38f139c5fa4543891837f4
2023-09-15 10:05:56 -07:00
Yu Zhang e1fd348b92 Fix a bug in multiget for cleaning up SuperVersion (#11830)
Summary:
When `MultiGet` acquires `SuperVersion` via locking the db mutex and get the current `ColumnFamilyData::super_version_`, its corresponding cleanup logic is not correctly done.

It's currently doing this:
`MultiGetColumnFamilyData::cfd->GetSuperVersion().Unref()`

This operates on the most recent `SuperVersion` without locking db mutex , which is not thread safe by itself. And this unref operation is intended for the originally acquired `SuperVersion` instead of the current one. Because a race condition could happen where a new `SuperVersion` is installed in between this `MultiGet`'s ref and unref. When this race condition does happen, it's not sufficient to just unref the `SuperVersion`, `DBImpl::CleanupSuperVersion` should be called instead to properly clean up the `SuperVersion` had this `MultiGet` call be its last reference holder.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11830

Test Plan:
`make all check`
Added a unit test that would originally fail

Reviewed By: ltamasi

Differential Revision: D49287715

Pulled By: jowlyzhang

fbshipit-source-id: 8353636ee11b2e90d85c677a96a92360072644b0
2023-09-15 09:50:39 -07:00
akankshamahajan ed5b6c0d99 Avoid alignment in FilePrefetchBuffer during seek with async_io (#11793)
Summary:
During Seek, the iterator seeks every file on L0. In async_io, it submit the requests to seek on every file on L0 asynchronously using RocksDB FilePrefetchBuffer.
However, FilePrefetchBuffer does alignment and reads extra bytes then needed that can increase the throughput.
In case of non direct io, the alignment can be avoided.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11793

Test Plan:
- Added a unit test that fails without this PR.
- make crash_test -j32 completed successfully

Reviewed By: anand1976

Differential Revision: D48985051

Pulled By: akankshamahajan15

fbshipit-source-id: 2d130a9e7c3df9c4fcd0408406e6277ab75a4389
2023-09-11 11:41:44 -07:00
Changyu Bi 9bd1a6fa29 Fix a bug where iterator can return incorrect data for DeleteRange() users (#11786)
Summary:
This should only affect iterator when
- user uses DeleteRange(),
- An iterator from level L has a non-ok status (such non-ok status may not be caught before the bug fix in https://github.com/facebook/rocksdb/pull/11783), and
- A range tombstone covers a key from level > L and triggers a reseek sets the status_ to OK in SeekImpl()/SeekPrevImpl() e.g. bd6a8340c3/table/merging_iterator.cc (L801)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11786

Differential Revision: D48908830

Pulled By: cbi42

fbshipit-source-id: eb564be375af4e33dc27542eff753260186e6d5d
2023-09-01 11:33:15 -07:00
Changyu Bi bd6a8340c3 Fix a bug where iterator status is not checked (#11782)
Summary:
This happens in (Compaction)MergingIterator layer, and can cause data loss during compaction or read/scan return incorrect result

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11782

Reviewed By: ajkr

Differential Revision: D48880575

Pulled By: cbi42

fbshipit-source-id: 2294ad284a6d653d3674bebe55380f12ee4b645b
2023-09-01 09:34:08 -07:00
Yu Zhang fc58c7c62a Add UDT support in SstFileDumper (#11757)
Summary:
For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11757

Test Plan:
make all check
for changes in db_impl.cc and db_impl_compaction_flush.cc

for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually.

Reviewed By: ltamasi

Differential Revision: D48668048

Pulled By: jowlyzhang

fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8
2023-08-30 13:42:04 -07:00
Andrew Kryczka 310a242c57 Fix GenericRateLimiter hanging bug (#11763)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11742

Even after performing duty (1) ("Waiting for the next refill time"), it is possible the remaining threads are all in `Wait()`. Waking up at least one thread is enough to ensure progress continues, even if no new requests arrive.

The repro unit test (https://github.com/facebook/rocksdb/commit/bb54245e6) is not included as it depends on an unlanded PR (https://github.com/facebook/rocksdb/issues/11753)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11763

Reviewed By: jaykorean

Differential Revision: D48710130

Pulled By: ajkr

fbshipit-source-id: 9d166bd577ea3a96ccd81dde85871fec5e85a4eb
2023-08-28 13:36:25 -07:00
anand76 4b53520709 Update HISTORY.md and version.h for 8.6 (#11728)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11728

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D48527100

Pulled By: anand1976

fbshipit-source-id: c48baa44e538fb6bfd3fe7f19046746d3540763f
2023-08-21 13:25:04 -07:00
Changyu Bi eca48bc166 Avoid shifting component too large error in FileTtlBooster (#11673)
Summary:
When `num_levels` > 65, we may be shifting more than 63 bits in FileTtlBooster. This can give errors like: `runtime error: shift exponent 98 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')`. This PR makes a quick fix for this issue by taking a min in the shifting component. This issue should be rare since it requires a user using a large `num_levels`. I'll follow up with a more complex fix if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11673

Test Plan: * Add a unit test that produce the above error before this PR. Need to compile it with ubsan: `COMPILE_WITH_UBSAN=1 OPT="-fsanitize-blacklist=.circleci/ubsan_suppression_list.txt" ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 compaction_picker_test`

Reviewed By: hx235

Differential Revision: D48074386

Pulled By: cbi42

fbshipit-source-id: 25e59df7e93f20e0793cffb941de70ac815d9392
2023-08-04 14:29:50 -07:00
Andrew Kryczka cf95821fb6 Update for 8.5.fb branch cut (#11642)
Summary:
Updated the main branch for the 8.5.fb branch cut. Also made unreleased_history/release.sh backdate to the last commit instead of the current date in case the release manager is a laggard like myself.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11642

Reviewed By: cbi42

Differential Revision: D47783574

Pulled By: ajkr

fbshipit-source-id: 4e2a80f5ccd542dc7dd0d22dfd7e59cb136325a1
2023-08-02 12:34:11 -07:00
akankshamahajan 63a5125a52 Fix use_after_free bug when underlying FS enables kFSBuffer (#11645)
Summary:
Fix use_after_free bug in async_io MultiReads when underlying FS enabled kFSBuffer. kFSBuffer is when underlying FS pass their own buffer instead of using RocksDB scratch in FSReadRequest
Since it's an experimental feature, added a hack for now to fix the bug.
Planning to make public API change to remove const from the callback as it doesn't make sense to use const.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11645

Test Plan: tested locally

Reviewed By: ltamasi

Differential Revision: D47819907

Pulled By: akankshamahajan15

fbshipit-source-id: 1faf5ef795bf27e2b3a60960374d91274931df8d
2023-07-27 12:02:03 -07:00
akankshamahajan 94c247bff8 Update HISTORY.md for branch cut for 8.4.fb (#11565)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11565

Reviewed By: jowlyzhang, cbi42

Differential Revision: D47027788

Pulled By: akankshamahajan15

fbshipit-source-id: e5e8db2eb21f8aa68fe072f0e1b63b83ba7beb9f
2023-06-26 13:26:15 -07:00
akankshamahajan ff1cc8a63e Fix extra prefetching when num_file_reads_for_auto_readahead is 1 in async_io (#11560)
Summary:
When num_file_reads_for_auto_readahead = 1, during seek, it would go for prefetchingextra data in second buffer along with seek data, that would lead to increase in read data and
discarded bytes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11560

Test Plan: Added unit test

Reviewed By: anand1976

Differential Revision: D47008102

Pulled By: akankshamahajan15

fbshipit-source-id: 566c6131cb5f968d5efb81fd0ab233ff7e534ab0
2023-06-26 10:39:44 -07:00
Peter Dillinger 70bf5ef093 Avoid destroying default PosixEnv, safely (#11538)
Summary:
Use another static object to join threads instead.

This change is motivated by a case in which some code using NewLRUCache() -> ShardedCacheBase -> SemiStructuredUniqueIdGen -> GenerateRawUniqueId() -> Env::Default() was happening
during static destruction.

I didn't see anything else in PosixEnv or base classes that would cause a problem by not
destroying. (WinEnv is already not destroyed; see env_default.cc)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11538UndefinedBehaviorSanitizer: undefined-behavior env/env_test.cc:3561:23 in
$
```

Test Plan:
test added, which would previously fail with UBSAN:

```
$ ./env_test --gtest_filter=*Destruct*
Note: Google Test filter = *Destruct*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from EnvTestMisc
[ RUN      ] EnvTestMisc.StaticDestruction
[       OK ] EnvTestMisc.StaticDestruction (0 ms)
[----------] 1 test from EnvTestMisc (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[  PASSED  ] 1 test.
env/env_test.cc:3561:23: runtime error: member call on address 0x7f7b96671ca8 which does not point to an object of type 'rocksdb::Env'
0x7f7b96671ca8: note: object is of type 'N7rocksdb12ConfigurableE'
 00 00 00 00  90 a7 f7 95 7b 7f 00 00  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  00 00 00 00
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'N7rocksdb12ConfigurableE'

Reviewed By: jowlyzhang

Differential Revision: D46737389

Pulled By: pdillinger

fbshipit-source-id: 0f80a443bf799ffc5641e898cf3a75f7d10a987b
2023-06-14 16:18:08 -07:00
Yu Zhang 77dda0d9d8 Fix use after move in data block hash index (#11505)
Summary:
Fix a use-after-move issue in block.cc and added some unit tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11505

Test Plan:
```
make all check
./block_test
```

Reviewed By: ltamasi

Differential Revision: D46506188

Pulled By: jowlyzhang

fbshipit-source-id: 316ed8ddd221c00b2bce2cf9fd47eea686cd74a5
2023-06-08 11:04:43 -07:00
Changyu Bi 2e8cc98ab2 Fix subcompaction bug to allow running two subcompactions (#11501)
Summary:
as reported in https://github.com/facebook/rocksdb/issues/11476, RocksDB currently does not execute compactions in two subcompactions even when they qualify. This PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11501

Test Plan:
* Add a new unit test.
* Run crash test with max_subcompactions=2: `python3 tools/db_crashtest.py blackbox --simple  --subcompactions=2 --target_file_size_base=1048576 --compaction_style=0`
  * saw logs showing compactions being executed as 2 subcompactions
```
2023/06/01-17:28:44.028470 3147486 (Original Log Time 2023/06/01-17:28:44.025972) EVENT_LOG_v1 {"time_micros": 1685665724025939, "job": 6, "event": "compaction_finished", "compaction_time_micros": 34539, "compaction_time_cpu_micros": 26404, "output_level": 6, "num_output_files": 2, "total_output_size": 1109796, "num_input_records": 13188, "num_output_records": 13021, "num_subcompactions": 2, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 13]}
```

Reviewed By: ajkr

Differential Revision: D46411497

Pulled By: cbi42

fbshipit-source-id: 3ebfc02e19f78f782e114a9546dc3d481d496258
2023-06-06 13:36:02 -07:00
Peter Dillinger 8848ec92dd Better management of unreleased HISTORY (#11481)
Summary:
See new NOTE in HISTORY.md and unreleased_history/README.txt

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11481

Test Plan: some manual testing on my CentOS 8 system

Reviewed By: jaykorean

Differential Revision: D46233342

Pulled By: pdillinger

fbshipit-source-id: daf59cf3dc907f450b469090dcc481a30a7d7c0d
2023-05-30 16:42:49 -07:00