Commit graph

12888 commits

Author SHA1 Message Date
Yu Zhang fab9dd9635 Temporary revert #12386 to unblock MyRocks build (#12523)
Summary:
MyRocks reports build failure with this change (build failures in this diff: https://www.internalfb.com/diff/D55924596) https://github.com/facebook/rocksdb/issues/12386, we haven't figured out how to fix it yet. So we are temporarily reverting it to unblock them.

This reverts commit 3104e55f29.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12523

Reviewed By: hx235

Differential Revision: D55981751

Pulled By: jowlyzhang

fbshipit-source-id: 1d7edd42b65ca847cec67549644a2b1e5775841e
2024-04-10 13:47:52 -07:00
Hui Xiao 447b7aa7ed Temporarily disable kAdmPolicyThreeQueue in crash test (#12521)
Summary:
**Context/Summary**

This policy leads to segfault in `CompressedCacheSetCapacityThread` with some build/compilation. Before figuring out the why, disable it for now.

**Test**
Rehearse stress test that failed before the fix but passes after

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12521

Reviewed By: jowlyzhang

Differential Revision: D55942399

Pulled By: hx235

fbshipit-source-id: 85f28e50d596dcfd4a316481570b78fdce58ed0b
2024-04-09 16:15:54 -07:00
Hui Xiao 72c1376fcf Fix "assertion failed - iter != ROCKSDB_NAMESPACE::OptionsHelper::temperature_to_string.end()" (#12519)
Summary:
Context/Summary: for unknown reason, calling a db stress common function in db stress flag file for temperature-related flags will cause some weird behavior in some compilation/build.
```
assertion failed - iter != ROCKSDB_NAMESPACE::OptionsHelper::temperature_to_string.end()
```
For now, we decide not to call such function by hard-coding their default stress test values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12519

Test Plan: - Run a rehearsal stress test with this fix and weird behavior is gone.

Reviewed By: jowlyzhang

Differential Revision: D55884693

Pulled By: hx235

fbshipit-source-id: ba5135f5b37a9fa686b3ccae8d3f77e62d6562c9
2024-04-08 13:45:41 -07:00
Hui Xiao ad423abbd1 Add more missing options in crash test (#12508)
Summary:
**Context/Summary:**
This is to improve our crash test coverage.

Bonus change:
- Added the missing Options string mapping for `CacheTier::kVolatileCompressedTier`
- Deprecated crash test options `enable_tiered_storage` mainly for setting `last_level_temperature` which is now covered in crash test by itself
- Intensified `verify_checksum_one_in\verify_file_checksums_one_in` as I found out these together with new coverage surface more issues

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12508

Test Plan: CI to look out for trivial failures

Reviewed By: jowlyzhang

Differential Revision: D55768594

Pulled By: hx235

fbshipit-source-id: 9b829da0309a7db3fcdb17992a524dd64498325c
2024-04-08 09:48:03 -07:00
Radek Hubner a8035ebc0b Fix exception on RocksDB.getColumnFamilyMetaData() (#12474)
Summary:
https://github.com/facebook/rocksdb/issues/12466 reported a bug when `RocksDB.getColumnFamilyMetaData()` is called on an existing database(With files stored on disk). As neilramaswamy mentioned, this was caused by https://github.com/facebook/rocksdb/issues/11770 where the signature of `SstFileMetaData` constructor was changed, but JNI code wasn't updated.

This PR fix JNI code, and also properly populate `fileChecksum` on `SstFileMetaData`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12474

Reviewed By: jowlyzhang

Differential Revision: D55811808

Pulled By: ajkr

fbshipit-source-id: 2ab156f41eaf4a4f30c49e6df421b61e8451230e
2024-04-05 13:55:18 -07:00
Andrew Kryczka 7dcd0bd833 Add GetLiveFilesStorageInfo to legacy BlobDB (#12468)
Summary:
It is an important function and should be correct on legacy BlobDB, even though using legacy BlobDB is not recommended

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12468

Reviewed By: cbi42

Differential Revision: D55231038

Pulled By: ajkr

fbshipit-source-id: 2ac18e4c149590b373eb79cd92c0ca5e7fce94f2
2024-04-05 13:50:27 -07:00
Changyu Bi 7c28dc8beb Enable parallel compression in crash test (#12506)
Summary:
Since some internal user might be interested in using this feature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12506

Test Plan:
The option was disabled in stress test due to causing failures.
I've ran a round of crash tests internally and there was no failure due to parallel compression. Will monitor if more runs cause failures. So we will know at least how it's broken and decide to fix them or reverse the change.

Reviewed By: jowlyzhang

Differential Revision: D55747552

Pulled By: cbi42

fbshipit-source-id: ae5cda78c338b8b58f651c557d9b70790362444d
2024-04-05 10:29:08 -07:00
Hui Xiao abdbeedba6 Miscellaneous improvement to info printing (#12504)
Summary:
**Context/Summary:**
Debugging crash test makes me realize there are a few places can use some improvement of logging more info

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12504

Test Plan:
Manual testing
Debug build
```
2024/04/04-16:12:12.289791 1636007 [/db_filesnapshot.cc:156] Number of log files 2 (0 required by manifest)
...
2024/04/04-16:12:12.289814 1636007 [/db_filesnapshot.cc:171] Log files : /000004.log /000008.log  .Log files required by manifest: .
```
Non-debug build
```
2024/04/04-16:19:23.222168 1685043 [/db_filesnapshot.cc:156] Number of log files 1 (0 required by manifest)
```
CI

Reviewed By: jaykorean

Differential Revision: D55710013

Pulled By: hx235

fbshipit-source-id: 9964d46cfb0a2074620f31571cf9fd29d0a88819
2024-04-05 10:23:31 -07:00
Yu Zhang 50dc3ec4d0 Provide an override ReuseWritableFile implementation for FaultInjectionTestFS (#12510)
Summary:
Without this override, `FaultInjectionTestFs` use the implementation from `FileSystemWrapper` that delegates to the base file system: 2207a66fe5/include/rocksdb/file_system.h (L1451-L1457)

That will create a regular `FSWritableFile` instead of a `TestFSWritableFile`:
2207a66fe5/env/file_system.cc (L98-L108)

We have seen verification failures with a WAL hole because the last log writer is a `FSWritableFile` created from recycling a previous log file, while the second to last log write is a `TestFSWritableFile`. The former can survive a process crash, while the latter cannot. It makes the WAL look like it has a hole.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12510

Reviewed By: hx235

Differential Revision: D55769158

Pulled By: jowlyzhang

fbshipit-source-id: ebeffee8255bfa155434e17afe5082908d41a0d1
2024-04-04 19:26:42 -07:00
Changyu Bi a0aade7e62 Add some debug print for flaky test DBCompactionTest.CompactionLimiter (#12509)
Summary:
The unit test fails occasionally can cannot be reproed locally.
```
[ RUN      ] DBCompactionTest.CompactionLimiter
db/db_compaction_test.cc:6139: Failure
Expected equality of these values:
  cf_count
    Which is: 17
  env_->GetThreadPoolQueueLen(Env::LOW)
    Which is: 15
[  FAILED  ] DBCompactionTest.CompactionLimiter (512 ms)
```

Add some debug print to help triaging if it fails again.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12509

Reviewed By: jowlyzhang

Differential Revision: D55770552

Pulled By: cbi42

fbshipit-source-id: 2a39b2199f80352fcf2c6cd2b9c8b81c727eee8c
2024-04-04 15:21:40 -07:00
Yu Zhang 2207a66fe5 Make autovector call default constructor explicitly before move/copy (#12499)
Summary:
Make `autovector` constructs the stack based element in place before move or copy another `autovector`'s stack based elements. This is already done in  the move/copy version of `autovector::push_back` when adding item to the stack based memory
8e6e8957fb/util/autovector.h (L269-L285)

The ` values_ = reinterpret_cast<pointer>(buf_);` statement is not sufficient to ensure the class's member variables are properly constructed. I'm able to reproduce this consistently in a unit test in this change: https://github.com/facebook/rocksdb/compare/main...jowlyzhang:fix_sv_install with unit test:
`./tiered_compaction_test --gtest_filter="\*FastTrack\*"

With below stack trace P1203997597 showing the `std::string` copy destination is invalid, which indicates the object in the destination `autovector` is not constructed properly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12499

Test Plan: Existing unit tests.

Reviewed By: anand1976

Differential Revision: D55662354

Pulled By: jowlyzhang

fbshipit-source-id: 581ceb11155d3dd711998607ec6950c0e327556a
2024-04-04 12:33:05 -07:00
Alexey Vinogradov c4df598b8e Implement PerfContex#toString for the Java API (#12473)
Summary:
I've implemented `PerfContext#toString` for the Java API.
See: https://groups.google.com/g/rocksdb/c/qbY_gNhbyAg

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12473

Reviewed By: jowlyzhang

Differential Revision: D55660871

Pulled By: cbi42

fbshipit-source-id: f0528fba31ac06e16495e4f49b0bafe0dbc1bc61
2024-04-03 14:33:31 -07:00
Radek Hubner db9eb10b5b Enable all Java test via CMake (#12446)
Summary:
This PR follows the work done in  https://github.com/facebook/rocksdb/issues/11756 and enable all Java test to be run via CMake/Ctest.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12446

Reviewed By: jowlyzhang

Differential Revision: D55661635

Pulled By: cbi42

fbshipit-source-id: 3ea49a121a3ba72089632ff43ee7fe4419b08a96
2024-04-03 11:03:11 -07:00
hasagi 5a86635e12 Fix CreateColumnFamilyWithImport for PessimisticTransactionDB (#12490)
Summary:
When we use the CreateColumnFamilyWithImport interface of PessimisticTransactionDB to create column family, the lack of related information may cause subsequent writes to be unable to find the Column Family ID.

The issue: (https://github.com/facebook/rocksdb/issues/12493)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12490

Reviewed By: jowlyzhang

Differential Revision: D55700343

Pulled By: cbi42

fbshipit-source-id: dc992a3eef433e1193d579cbf58b6ba940fa460d
2024-04-03 10:56:30 -07:00
Yu Zhang 74d419be4d Add support in SstFileReader to get a raw table iterator (#12385)
Summary:
This PR adds support to programmatically iterate a raw table file with an iterator returned by `SstFileReader::NewTableIterator`. For third party tools to use to observe SST files created by RocksDB.

The original feature request was from this merge request: https://github.com/facebook/rocksdb/pull/12370

Since keys returned by raw table iterators are internal keys, this PR also adds a struct `ParsedEntryInfo` and util method `ParseEntry` to support user to parse internal key. `GetInternalKeyForSeek`, and `GetInternalKeyForSeekForPrev` to support users to create internal keys for seek operations with this raw table iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12385

Test Plan: Added unit tests

Reviewed By: cbi42

Differential Revision: D55662855

Pulled By: jowlyzhang

fbshipit-source-id: 0716a173ee95924fbd4e1f9b6cccf06525c40049
2024-04-02 21:23:06 -07:00
Hui Xiao 8e6e8957fb Disable wal_bytes_per_sync at one more place (#12492)
Summary:
Summary/Context: supplement to https://github.com/facebook/rocksdb/pull/12489

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12492

Test Plan: CI

Reviewed By: jaykorean

Differential Revision: D55612747

Pulled By: hx235

fbshipit-source-id: 5c8fbda3e6c8482f2a3363a98a545f1c11e4ea27
2024-04-02 09:44:37 -07:00
Richard Barnes 90d61381bf Fix deprecated use of 0/NULL in internal_repo_rocksdb/repo/util/xxhash.h + 5
Summary:
`nullptr` is typesafe. `0` and `NULL` are not. In the future, only `nullptr` will be allowed.

This diff helps us embrace the future _now_ in service of enabling `-Wzero-as-null-pointer-constant`.

Reviewed By: dmm-fb

Differential Revision: D55559752

fbshipit-source-id: 9f1edc836ded919022c4b53722f6f86208fecf8d
2024-04-01 21:20:51 -07:00
Richard Barnes ee3159e7dd Remove extra semi colon from icsp/lib/logging/IcspLogRpcMessage.cpp
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D55534619

fbshipit-source-id: 26f3c35a51b38a3cbfa12a6f76a2bb783a7b4d8e
2024-03-31 10:26:34 -07:00
Richard Barnes 7e8003296a Remove extra semi colon from internal_repo_rocksdb/repo/util/coding_test.cc
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D55534622

fbshipit-source-id: dfff34924da6f2cdad34ed21f8f08a9bab9189a7
2024-03-30 07:17:52 -07:00
Hui Xiao 21d11de761 Temporarily disable wal_bytes_per_sync in crash test (#12489)
Summary:
**Context/Summary:**

`wal_bytes_per_sync > 0` can sync newer WAL but not an older WAL by its nature. This creates a hole in synced WAL data. By our crash test, we recently discovered that our DB can recover past that hole. This resulted in crash-recovery-verification error. Before we fix that recovery behavior, we will temporarily disable `wal_bytes_per_sync` in crash test

Bonus: updated the API to make the nature of this option more explicitly documented

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12489

Test Plan: More stabilized crash test

Reviewed By: ajkr

Differential Revision: D55531589

Pulled By: hx235

fbshipit-source-id: 6dea6486420dc0f50550d488c15652f93972a0ea
2024-03-29 13:01:15 -07:00
Changyu Bi 796011e5ad Limit compaction input files expansion (#12484)
Summary:
We removed the limit in https://github.com/facebook/rocksdb/issues/10835 and the option in https://github.com/facebook/rocksdb/issues/12323. Usually input level is much smaller than output level, which is likely why we have not seen issues with not applying a limit. It should be safer to add a safe guard as suggested in https://github.com/facebook/rocksdb/pull/12323#issuecomment-2016687321.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12484

Test Plan: * new and existing UT

Reviewed By: ajkr

Differential Revision: D55438870

Pulled By: cbi42

fbshipit-source-id: 0511d0465a70398c36230ed7cced5291ff1a6c19
2024-03-29 11:34:29 -07:00
Hui Xiao d985902ef4 Disallow refitting more than 1 file from non-L0 to L0 (#12481)
Summary:
**Context/Summary:**
We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than 1 files to L0. This refitting can cause read performance regression as we need to go through every file in L0, corruption in some edge case and false positive corruption caught by force consistency check. We decided to explicitly disallow such behavior.

A related change to OptionChangeMigration():
- When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrate data into a couple L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data upon migration finishes when the migrated data is larger than max_table_files_size. This is achieved by first compacting all the data into a couple non-L0 files and refitting those files from non-L0 to L0 if needed. In that way, only some data instead of all data will be dropped immediately after migration to FIFO with a max_table_files_size.
- Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and explicitly state such risk in API comment.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481

Test Plan:
- New UT
- Modified UT

Reviewed By: cbi42

Differential Revision: D55351178

Pulled By: hx235

fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80
2024-03-29 10:52:36 -07:00
anand76 4e226c97b8 Don't swallow errors in BlockBasedTable::MultiGet (#12486)
Summary:
Errors were being swallowed in `BlockBasedTable::MultiGet` under some circumstances, such as error when parsing the internal key from the block, or IO error when reading the blob value. We need to set the status for the key to the observed error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12486

Test Plan: Run db_stress and verify the expected error failure before, and no failures after the change.

Reviewed By: jaykorean, ajkr

Differential Revision: D55483940

Pulled By: anand1976

fbshipit-source-id: 493e44db507d5db45e8d1ef2e67808d2c4046318
2024-03-28 13:56:28 -07:00
Andrew Kryczka 3d4e78937a Initialize FaultInjectionTestFS::checksum_handoff_func_type_ to kCRC32c (#12485)
Summary:
Previously it was uninitialized. Setting `checksum_handoff_file_types` will cause `kCRC32c` checksums to be passed down in the `DataVerificationInfo`, so it makes sense for `kCRC32c` to be the default.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12485

Test Plan:
ran `db_stress` in a way that failed before. Building with ASAN was needed to ensure the uninitialized bytes are nonzero according to `malloc_fill_byte` (default 0xbe)

```
$ COMPILE_WITH_ASAN=1 make -j28 db_stress
...
$ ./db_stress -sync_fault_injection=1 -enable_checksum_handoff=true
```

Reviewed By: jaykorean

Differential Revision: D55450587

Pulled By: ajkr

fbshipit-source-id: 53dc829b86e49b3fa80570032e83af0bb12adaad
2024-03-27 18:37:58 -07:00
akankshamahajan 1856734821 Branch cut 9.1.fb (#12476)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12476

Reviewed By: jowlyzhang

Differential Revision: D55319508

Pulled By: akankshamahajan15

fbshipit-source-id: 2b6db671e027511282775c0fea155335d8e73cc2
2024-03-25 15:07:43 -07:00
Jay Huh c449867236 MultiCfIterator Impl Follow up (#12465)
Summary:
As a follow up for https://github.com/facebook/rocksdb/issues/12422 , this PR includes the following two changes.
- Removal of `direction_` in the MultiCfIterator
- Use of Member Func Template instead of `std::function`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12465

Test Plan:
```
./multi_cf_iterator_test
```

Reviewed By: pdillinger, ltamasi

Differential Revision: D55208448

Pulled By: jaykorean

fbshipit-source-id: 8b3167c1d59839d076afc29097b5ad21a453460a
2024-03-22 14:51:16 -07:00
Peter Dillinger b515a5db3f Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470)
Summary:
ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom implemented pointer wrapper when std::unique_ptr can be instantiated with what we want.

So this adds ScopedArenaPtr<T> to replace those uses.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470

Test Plan: CI (including ASAN/UBSAN)

Reviewed By: jowlyzhang

Differential Revision: D55254362

Pulled By: pdillinger

fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18
2024-03-22 13:40:42 -07:00
anand76 3b736c4aa3 Fix heap use after free error on retry after checksum mismatch (#12464)
Summary:
Fix the heap use after free bug caused by freeing the file system IO buffer in `BlockFetcher::ReadBlock()` instead of the caller.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12464

Test Plan: Update the `DBIOCorruptionTest` tests

Reviewed By: akankshamahajan15

Differential Revision: D55206920

Pulled By: anand1976

fbshipit-source-id: fd6b608a61cd229b20c1e5f348ff3cc92328de0f
2024-03-21 16:19:09 -07:00
Radek Hubner 088dc7283b Remove unused MSVC compiler warning supressions (#12205)
Summary:
Remove unused compiler warning supressions as was suggested in https://github.com/facebook/rocksdb/issues/10745.

Fixes https://github.com/facebook/rocksdb/issues/10745

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12205

Reviewed By: hx235

Differential Revision: D52547905

Pulled By: ajkr

fbshipit-source-id: 6bf3abfe51a0f3e5f01e1563a8ef7d855726b04d
2024-03-21 13:22:39 -07:00
Andrew Kryczka bf98dcf9a8 Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462)
Summary:
The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file.

There's also fixes for bugs encountered along the way where non-ok statuses were ignored/overwritten, and a bit of plumbing work for functions that had no capability to return a status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462

Test Plan:
A repro command

```
db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp
./db_stress \
        --clear_column_family_one_in=0 \
        --test_batches_snapshots=0 \
        --write_fault_one_in=0 \
        --use_put_entity_one_in=0 \
        --prefixpercent=0 \
        --read_fault_one_in=0 \
        --readpercent=0 \
        --reopen=0 \
        --set_options_one_in=10000 \
        --delpercent=0 \
        --delrangepercent=0 \
        --open_metadata_write_fault_one_in=0 \
        --open_read_fault_one_in=0 \
        --open_write_fault_one_in=0 \
        --destroy_db_initially=0 \
        --ingest_external_file_one_in=0 \
        --iterpercent=0 \
        --nooverwritepercent=0 \
        --db=$db \
        --enable_blob_files=1 \
        --expected_values_dir=$exp \
        --max_background_compactions=20 \
        --max_bytes_for_level_base=2097152 \
        --max_key=100000 \
        --min_blob_size=0 \
        --open_files=-1 \
        --ops_per_thread=100000000 \
        --prefix_size=-1 \
        --target_file_size_base=524288 \
        --use_merge=1 \
        --value_size_mult=32 \
        --write_buffer_size=524288 \
        --writepercent=100
```

It used to fail like:

```
...
frame https://github.com/facebook/rocksdb/issues/9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3
frame https://github.com/facebook/rocksdb/issues/10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, rocksdb::PinnedIteratorsManager*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33
frame https://github.com/facebook/rocksdb/issues/11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492
frame https://github.com/facebook/rocksdb/issues/12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408
frame https://github.com/facebook/rocksdb/issues/13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109
frame https://github.com/facebook/rocksdb/issues/14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656
frame https://github.com/facebook/rocksdb/issues/15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607
frame https://github.com/facebook/rocksdb/issues/16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, rocksdb::TrimHistoryScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505
frame https://github.com/facebook/rocksdb/issues/17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084
frame https://github.com/facebook/rocksdb/issues/18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807
frame https://github.com/facebook/rocksdb/issues/19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312
frame https://github.com/facebook/rocksdb/issues/20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157
frame https://github.com/facebook/rocksdb/issues/21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544
frame https://github.com/facebook/rocksdb/issues/22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72
frame https://github.com/facebook/rocksdb/issues/23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317
frame https://github.com/facebook/rocksdb/issues/24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148
...
```

Reviewed By: ltamasi

Differential Revision: D55157795

Pulled By: ajkr

fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764
2024-03-21 12:38:53 -07:00
anand76 63a105a481 Enable recycle_log_file_num option for point in time recovery (#12403)
Summary:
This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL.

This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403

Test Plan:
1. Add new unit tests to verify behavior upon corruption
2. Re-enable disabled tests for verifying recycling behavior

Reviewed By: ajkr

Differential Revision: D54544824

Pulled By: anand1976

fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a
2024-03-21 12:29:35 -07:00
anand76 98d8a85624 New PerfContext counters for block cache bytes read (#12459)
Summary:
Add PerfContext counters for measuring the cumulative size of blocks found in the block cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12459

Reviewed By: ajkr

Differential Revision: D55170694

Pulled By: anand1976

fbshipit-source-id: 8cbba76eece116cefce7f00e2fc9d74757661d25
2024-03-21 10:46:46 -07:00
Yu Zhang 13e1c32a18 Follow ups for TimedPut and write time property (#12455)
Summary:
This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including:

1) Handle a special case for `WriteBatch::TimedPut`. When the user specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error, but it instead creates and writes a regular `Put` entry.

2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries.

3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob currently copy the DB's seqno to time mapping while holding db mutex and only copies the part of interest, a.k.a, the part that only goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to preferred seqno as much as possible.

Test plans:
Added unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455

Reviewed By: pdillinger

Differential Revision: D55165422

Pulled By: jowlyzhang

fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47
2024-03-21 10:00:15 -07:00
Richard Barnes 6a1c2abe9d Remove extra semi colon from hbt/src/tagstack/tests/SlicerTest.cpp (#12461)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12461

X-link: https://github.com/facebookincubator/dynolog/pull/233

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: rahku

Differential Revision: D55087324

fbshipit-source-id: e8a03d33cad72a7d378e58f85eb550a03f6c2897
2024-03-20 12:44:50 -07:00
Kshitij Wadhwa 4ce1dc930c don't run ZSTD_TrainDictionary in BlockBasedTableBuilder if there isn't compression needed (#12453)
Summary:
fixes https://github.com/facebook/rocksdb/issues/12409

### Issue

ZSTD_TrainDictionary [[link](a53ed91691/table/block_based/block_based_table_builder.cc (L1894))] runs for SSTFileWriter::Finish even when bottommost_compression option is set to kNoCompression. This reduces throughput for SstFileWriter::Finish

We construct rocksdb options using ZSTD compression for levels including 2 and above. For levels 0 and 1, we set it to kNoCompression. We also set zstd_max_train_bytes to a non-zero positive value (which is applicable for levels with ZSTD compression enabled). These options are used for the database and also passed to SstFileWriter for creating sst files to be later added to that database. Since the BlockBasedTableBuilder::Finish [[link](a53ed91691/table/block_based/block_based_table_builder.cc (L1892))] only checks for zstd_max_train_bytes to be non-zero positive value, it runs ZSTD_TrainDictionary even when it shouldn't since SSTFileWriter is operating at bottommost level

### Fix

If compression_type is set to kNoCompression, then don't run ZSTD_TrainDictionary and dictionary building

### Testing

I see we have tests for sst file writer with compression type set/unset. Let me know if it isn't covered and I can extend

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12453

Reviewed By: cbi42

Differential Revision: D55030484

Pulled By: ajkr

fbshipit-source-id: 834de2174c2b087d61bf045ca1ae29f337b821a7
2024-03-20 11:07:32 -07:00
Jay Huh 3f3f4660bd wal_read_status check in RecoverLogFiles (#12460)
Summary:
Fixing the not-checked status failure as in https://github.com/facebook/rocksdb/actions/runs/8334988399/job/22809612148.

When the status is not ok() for any reason, we do not check the `wal_read_status` because it's not necessary. It's causing the test failure when running with `ASSERT_STATUS_CHECKED=1`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12460

Test Plan: Existing tests

Reviewed By: ajkr

Differential Revision: D55104844

Pulled By: jaykorean

fbshipit-source-id: 919b1fddca835494f9087c51c4da6eabc9e8df2b
2024-03-20 08:09:09 -07:00
Richard Barnes 6ddfa5f061 Remove extra semi colon from internal_repo_rocksdb/repo/util/filelock_test.cc
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D55087322

fbshipit-source-id: ca4db7285444306d6c91545cd2c33483dfe05385
2024-03-19 16:17:57 -07:00
Richard Barnes fc40165614 Remove extra semi colon from internal_repo_rocksdb/repo/tools/ldb_cmd.cc
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: palmje

Differential Revision: D54362227

fbshipit-source-id: ac634ba34f9351ba559c4ed96448f51d6ef33175
2024-03-18 18:51:50 -07:00
anand76 4868c10b44 Retry block reads on checksum mismatch (#12427)
Summary:
On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`.

Tests:
Add new unit tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427

Reviewed By: ajkr

Differential Revision: D55025941

Pulled By: anand1976

fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e
2024-03-18 16:16:05 -07:00
Jay Huh b4e9f5a400 Update Remote Compaction Tests to include more than one CF (#12430)
Summary:
Update `compaction_service_test` to make sure remote compaction works with multiple column family set up. Minor refactor to get rid of duplicate code

Fixing one quick bug in the existing test util: Test util's `FilesPerLevel` didn't honor `cf_id` properly)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12430

Test Plan:
```
./compaction_service_test
```

Reviewed By: ajkr

Differential Revision: D54883035

Pulled By: jaykorean

fbshipit-source-id: 83b4f6f566fed5c4824bfef7de01074354a72b44
2024-03-18 15:40:48 -07:00
Hui Xiao 2443ebf810 Don't write to WAL after previous WAL write error (#12448)
Summary:
**Context/Summary:**
WAL write can continue onto the the WAL file that has encountered error and thus crash at 3f5bd46a07/file/writable_file_writer.cc (L67) in below scenario:
<img width="698" alt="Screenshot 2024-03-15 at 1 52 45 PM" src="https://github.com/facebook/rocksdb/assets/83968999/cd631ef2-c87c-4926-91ab-a0dc10f1eb14">

Note that GetLiveFilesStorageInfo() can happen concurrently with PUT() for the non-WAL-write part where db lock isn't held

This PR added an error check in LogWriter layer to prevent thread 2 from starting to write WAL after thread 1's write error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12448

Test Plan:
Step 1 Apply the patch below to simulate frequent WAL write error for the purpose of repro
```
 diff --git a/db_stress_tool/db_stress_driver.cc b/db_stress_tool/db_stress_driver.cc
index b47fa89e6..31930e976 100644
 --- a/db_stress_tool/db_stress_driver.cc
+++ b/db_stress_tool/db_stress_driver.cc
@@ -98,7 +98,7 @@ bool RunStressTestImpl(SharedState* shared) {
     //  MANIFEST, CURRENT, and WAL files.
     fault_fs_guard->SetRandomWriteError(
         shared->GetSeed(), FLAGS_write_fault_one_in, error_msg,
-        /*inject_for_all_file_types=*/false, {FileType::kTableFile});
+        /*inject_for_all_file_types=*/false, {FileType::kWalFile});
     fault_fs_guard->SetFilesystemDirectWritable(false);
     fault_fs_guard->EnableWriteErrorInjection();
   }

 diff --git a/utilities/fault_injection_fs.cc b/utilities/fault_injection_fs.cc
index 0ffb43ea6..589912cf4 100644
 --- a/utilities/fault_injection_fs.cc
+++ b/utilities/fault_injection_fs.cc
@@ -1042,7 +1042,7 @@ IOStatus FaultInjectionTestFS::InjectWriteError(const std::string& file_name) {
   }

   if (allowed_type) {
-    if (write_error_rand_.OneIn(write_error_one_in_)) {
+    if (write_error_rand_.OneIn(1)) {
       return GetError();
     }
   }
```
Step 2 Run below
```
./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=41.19540459544058 --bottommost_compression_type=disable --bottommost_file_compaction_delay=3600 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=0 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=1 --compression_max_dict_buffer_bytes=68719476735 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_index_compression=1 --enable_pipelined_write=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=10000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=15 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=1048576 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=500000 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --report_bg_io_stats=0 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=0 --strict_bytes_per_sync=0 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=35
```
Pre-PR:
```
db_stress: ./file/writable_file_writer.h:309: rocksdb::IOStatus rocksdb::WritableFileWriter::AssertFalseAndGetStatusForPrevError(): Assertion `sync_without_flush_called_' failed.
```
Post-PR
```
2024/03/15-13:44:08  Starting database operations
put or merge error: IO error: Retryable injected write error
```

Note: The patch is NOT included in the PR as we first need to figure out how to handle this type of failed write in stress test (planned for the near future). It's sufficient to show the stress test does not crash as pre-PR for the purpose of this PR.

Reviewed By: ajkr

Differential Revision: D54969287

Pulled By: hx235

fbshipit-source-id: 0ba4eabfec44ea7656d4d7117836f388897562f2
2024-03-18 12:27:49 -07:00
Levi Tamasi a93edea7e5 Deduplicate WriteBatchWithIndex::{Get,GetEntity}FromBatch (#12442)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12442

The patch deduplicates and unifies the logic of `WriteBatchWithIndex::{Get,GetEntity}FromBatch` using templates and makes some small code hygiene improvements, including consistently clearing the output value in the various non-success cases.

Reviewed By: jaykorean

Differential Revision: D54922935

fbshipit-source-id: c92e89f905a3c80cef57c2c840f49f806629238f
2024-03-18 12:04:54 -07:00
Jay Huh db1dea22b1 MultiCfIterator Implementations (#12422)
Summary:
This PR continues https://github.com/facebook/rocksdb/issues/12153 by implementing the missing `Iterator` APIs - `Seek()`, `SeekForPrev()`, `SeekToLast()`, and `Prev`. A MaxHeap Implementation has been added to handle the reverse direction.

The current implementation does not include upper/lower bounds yet. These will be added in subsequent PRs. The API is still marked as under construction and will be lifted after being added to the stress test.

Please note that changing the iterator direction in the middle of iteration is expensive, as it requires seeking the element in each iterator again in the opposite direction and rebuilding the heap along the way. The first `Next()` after `SeekForPrev()` requires changing the direction under the current implementation. We may optimize this in later PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12422

Test Plan: The `multi_cf_iterator_test` has been extended to cover the API implementations.

Reviewed By: pdillinger

Differential Revision: D54820754

Pulled By: jaykorean

fbshipit-source-id: 9eb741508df0f7bad598fb8e6bd5cdffc39e81d1
2024-03-18 09:05:30 -07:00
Changyu Bi 3d5be596a5 Fix a bug in iterator with UDT + ReadOptions::pin_data (#12451)
Summary:
with https://github.com/facebook/rocksdb/issues/12414 enabling `ReadOptions::pin_data`, this bug surfaced as corrupted per key-value checksum during crash test. `saved_key_.GetUserKey()` could be pinned user key, so DBIter should not overwrite it.

In one case, it only surfaces when iterator skips many keys of the same user key. To stress that code path, this PR also added `max_sequential_skip_in_iterations` to crash test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12451

Test Plan:
- Set ReadOptions::pin_data to true, the bug can be reproed quickly with `./db_stress --persist_user_defined_timestamps=1 --user_timestamp_size=8 --writepercent=35 --delpercent=4 --delrangepercent=1 --iterpercent=20 --nooverwritepercent=1 --prefix_size=8 --prefixpercent=10 --readpercent=30 --memtable_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --clear_column_family_one_in=0`.
    - Set max_sequential_skip_in_iterations to 1 for the other occurrence of the bug.

Reviewed By: jowlyzhang

Differential Revision: D55003766

Pulled By: cbi42

fbshipit-source-id: 23e1049129456684dafb028b6132b70e0afc07fb
2024-03-18 09:05:11 -07:00
Yu Zhang f2546b6623 Support returning write unix time in iterator property (#12428)
Summary:
This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is:
1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned.

2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in  https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown.

Needed follow up:
1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it.

2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion.

3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428

Test Plan: Added unit test

Reviewed By: pdillinger

Differential Revision: D54967067

Pulled By: jowlyzhang

fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a
2024-03-15 15:37:37 -07:00
Andrew Kryczka 4d5ebad971 Fix kBlockCacheTier read with table cache miss (#12443)
Summary:
Thanks ltamasi for pointing out this bug.

We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443

Reviewed By: cbi42

Differential Revision: D54930128

Pulled By: ajkr

fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d
2024-03-15 14:41:58 -07:00
Andrew Kryczka 3f5bd46a07 Add ContinueCallback to GetMergeOperands() (#12438)
Summary:
The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because:

- Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge
- The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me

If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438

Reviewed By: hx235

Differential Revision: D54914669

Pulled By: ajkr

fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a
2024-03-15 12:25:49 -07:00
Peter Dillinger c3c0cfc3a8 Create an UnownedPtr type (#12447)
Summary:
... that is more hygienic as an "optional reference" than a raw pointer, and likely more efficient than
std::optional<std::reference_wrapper<T>>.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12447

Test Plan: unit test included (with manual verification that "must not compile" sections currently do not)

Reviewed By: jowlyzhang

Differential Revision: D54957917

Pulled By: pdillinger

fbshipit-source-id: bbd89218df803617b1a170ebddc9e56c9b52bf93
2024-03-15 11:43:28 -07:00
Changyu Bi 096fb9b67d Fix data race in WalManager (#12439)
Summary:
Crash tests were failing due to data race in accessing `purge_wal_files_last_run_`. This PR changes it to atomic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439

Test Plan:
- existing UT
- not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests

Reviewed By: anand1976

Differential Revision: D54920817

Pulled By: cbi42

fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6
2024-03-14 21:24:06 -07:00
Yu Zhang 1104eaa35e Add initial support for TimedPut API (#12419)
Summary:
This PR adds support for `TimedPut` API. We introduced a new type `kTypeValuePreferredSeqno` for entries added to the DB via the `TimedPut` API.

The life cycle of such an entry on the write/flush/compaction paths are:

1) It is initially added to memtable as:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, write_unix_time}`

2) When it's flushed to L0 sst files, it's converted to:
`<user_key, seq, kTypeValuePreferredSeqno>: {value, preferred_seqno}`
 when we have easy access to the seqno to time mapping.

3) During compaction, if certain conditions are met, we swap in the `preferred_seqno` and the entry will become:
`<user_key, preferred_seqno, kTypeValue>: value`. This step helps fast track these entries to the cold tier if they are eligible after the sequence number swap.

On the read path:
A `kTypeValuePreferredSeqno` entry acts the same as a `kTypeValue` entry, the unix_write_time/preferred seqno part packed in value is completely ignored.

Needed follow ups:
1) The seqno to time mapping accessible in flush needs to be extended to cover the `write_unix_time` for possible `kTypeValuePreferredSeqno` entries. This also means we need to track these `write_unix_time` in memtable.

2) Compaction filter support for the new `kTypeValuePreferredSeqno` type for feature parity with other `kTypeValue` and equivalent types.

3) Stress test coverage for the feature

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12419

Test Plan: Added unit tests

Reviewed By: pdillinger

Differential Revision: D54920296

Pulled By: jowlyzhang

fbshipit-source-id: c8b43f7a7c465e569141770e93c748371ff1da9e
2024-03-14 15:44:55 -07:00