Summary:
the value of `inplace_update_support` option need to be fixed across runs of db_stress on the same DB (https://github.com/facebook/rocksdb/issues/12577). My recent fix (https://github.com/facebook/rocksdb/issues/12673) regressed this behavior. Also fix some existing places where this does not hold.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12675
Test Plan: monitor crash tests related to `inplace_update_support`.
Reviewed By: hx235
Differential Revision: D57576375
Pulled By: cbi42
fbshipit-source-id: 75b1bd233f03e5657984f5d5234dbbb1ffc35c27
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12668
The patch adds a new `GetEntityForUpdate` API to optimistic and WriteCommitted pessimistic transactions, which provides transactional wide-column point lookup functionality with concurrency control. For WriteCommitted transactions, user-defined timestamps are also supported similarly to the `GetForUpdate` API.
Reviewed By: jaykorean
Differential Revision: D57458304
fbshipit-source-id: 7eadbac531ca5446353e494abbd0635d63f62d24
Summary:
gcc 14.1 reports some warnings about dangling-reference occured in backup_engine_test.
```c++
/data/rocksdb/utilities/backup/backup_engine_test.cc: In member function 'virtual void rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test::TestBody()':
/data/rocksdb/utilities/backup/backup_engine_test.cc:4411:64: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
4411 | std::make_pair(alt_backup_engine, backup_engine_.get())}) {
| ^
/data/rocksdb/utilities/backup/backup_engine_test.cc:4410:23: note: the temporary was destroyed at the end of the full expression 'std::make_pair<rocksdb::BackupEngine*, rocksdb::BackupEngine*&>(((rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test*)this)->rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test::rocksdb::{anonymous}::BackupEngineTest.rocksdb::{anonymous}::BackupEngineTest::backup_engine_.std::unique_ptr<rocksdb::BackupEngine>::get(), alt_backup_engine)'
4410 | {std::make_pair(backup_engine_.get(), alt_backup_engine),
| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/data/rocksdb/utilities/backup/backup_engine_test.cc:4411:64: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
4411 | std::make_pair(alt_backup_engine, backup_engine_.get())}) {
| ^
/data/rocksdb/utilities/backup/backup_engine_test.cc:4411:23: note: the temporary was destroyed at the end of the full expression 'std::make_pair<rocksdb::BackupEngine*&, rocksdb::BackupEngine*>(alt_backup_engine, ((rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test*)this)->rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test::rocksdb::{anonymous}::BackupEngineTest.rocksdb::{anonymous}::BackupEngineTest::backup_engine_.std::unique_ptr<rocksdb::BackupEngine>::get())'
4411 | std::make_pair(alt_backup_engine, backup_engine_.get())}) {
| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
It seems to be related to this update in gcc:
https://gcc.gnu.org/gcc-14/changes.html#:~:text=%2DWdangling%2Dreference%20false%20positives%20have%20been%20reduced.%20The%20warning%20does%20not%20warn%20about%20std%3A%3Aspan%2Dlike%20classes%3B%20there%20is%20also%20a%20new%20attribute%20gnu%3A%3Ano_dangling%20to%20suppress%20the%20warning.%20See%20the%20manual%20for%20more%20info.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12637
Reviewed By: cbi42
Differential Revision: D57263996
Pulled By: ajkr
fbshipit-source-id: 1e416c38240d3d1adda787fc484c0392e28bb7f1
Summary:
With unsynced data loss, we replay traces to recover expected state to DB's latest sequence number. With `inplace_update_support`, the largest sequence number of memtable may not reflect the latest update. This is because inplace updates in memtable do not update sequence number. So we disable `inplace_update_support` where traces need to be replayed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12673
Reviewed By: ltamasi
Differential Revision: D57512548
Pulled By: cbi42
fbshipit-source-id: 69278fe2e935874faf744d0ac4fd85263773c3ec
Summary:
This PR implements deletion of obsolete files in a follower RocksDB instance. The follower tails the leader's MANIFEST and creates links to newly added SST files. These links need to be deleted once those files become obsolete in order to reclaim space. There are three cases to be considered -
1. New files added and links created, but the Version could not be installed due to some missing files. Those links need to be preserved so a subsequent catch up attempt can succeed. We insert the next file number in the `VersionSet` to `pending_outputs_` to prevent their deletion.
2. Files deleted from the previous successfully installed `Version`. These are deleted as usual in `PurgeObsoleteFiles`.
3. New files added by a `VersionEdit` and deleted by a subsequent `VersionEdit`, both processed in the same catchup attempt. Links will be created for the new files when verifying a candidate `Version`. Those need to be deleted explicitly as they're never added to `VersionStorageInfo`, and thus not deleted by `PurgeObsoleteFiles`.
Test plan -
New unit tests in `db_follower_test`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12657
Reviewed By: jowlyzhang
Differential Revision: D57462697
Pulled By: anand1976
fbshipit-source-id: 898f15570638dd4930f839ffd31c560f9cb73916
Summary:
This test is flaky and a recent failure prints the following:
```
[ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 1842811, thread status:
thread id: 1842803, thread status:
db/db_test.cc:4697: Failure
Expected equality of these values:
op_count
Which is: 0
expected_count
Which is: 1
[ FAILED ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (307 ms)
```
Empty thread status implies that operation_type of the threads are all OP_UNKNOWN. From 3ed46e0668/monitoring/thread_status_updater.cc (L197), this can be due to thread_data->operation_type being OP_UNKNOWN or that thread_data->cf_key it not in `cf_info_map_`, potentially due to how cf_key_ is accessed with relaxed memory order. This PR adds some debug print to print the cf_name to check this.
This PR also prints num_running_compaction and lsm state to check if a compaction is indeed running, and removes some not needed options and ensures that exactly 4 L0 files are created.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12661
Test Plan:
- Cannot repro the failure locally: `gtest-parallel --repeat=10000 --workers=200 ./db_test --gtest_filter="*ThreadStatusSingleCompaction*"`
- New failure message will look like:
```
[ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
op_count: 1, expected_count 2
thread id: 6104100864, thread status: , cf_name
thread id: 6103527424, thread status: Compaction, cf_name default
running compaction: 1 lsm state: 4
db/db_test.cc:4885: Failure
Value of: match
Actual: false
Expected: true
[ FAILED ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (115 ms)
```
Reviewed By: hx235
Differential Revision: D57422755
Pulled By: cbi42
fbshipit-source-id: 635663f26052b20e485dfa06a7c0f1f318ac1099
Summary:
Represent internal kTypeValuePreferredSeqno in the public API as kEntryTimedPut (because it is created by TimedPut, until the entry can be safely converted to a regular value entry in compaction)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12669
Test Plan: for follow-up work actually using it. But putting this in place in the public API gives us more flexibility in rolling out that follow-up work (e.g. as a user extension or patch if needed).
Reviewed By: jowlyzhang
Differential Revision: D57459637
Pulled By: pdillinger
fbshipit-source-id: 160ccf7c4e524ee479558846b2a207d51b8b3d9c
Summary:
`ReadOptions::pin_data` already has the effect of pinning the `Slice` returned by `Iterator::value()` when the value is stored inline (e.g., `kTypeValue`). This PR adds a bit of visibility into that via a new `Iterator` property, "rocksdb.iterator.is-value-pinned", as well as some documentation and tests.
See also: https://github.com/facebook/rocksdb/issues/12658
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12659
Reviewed By: cbi42
Differential Revision: D57391200
Pulled By: ajkr
fbshipit-source-id: 0caa8db27ca1aba86ee2addc3dfd6f0e003d32e2
Summary:
To avoid use-after-free on custom env on ASSERT_WHATEVER failure.
This is motivated by a rare crash seen in DBErrorHandlingFSTest.WALWriteError (VersionSet::GetObsoleteFiles in a SstFileManagerImpl::ClearError thread) and wanting to rule out this being related to that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12655
Test Plan: manually seeing ASSERT_WHATEVER failures, especially under ASAN
Reviewed By: cbi42
Differential Revision: D57358202
Pulled By: pdillinger
fbshipit-source-id: 4da2a0d73a54380b257e5cc1ab6c666e26b83973
Summary:
If timestamp size record doesn't fit into a block, without padding `Writer::EmitPhysicalRecord` fails on assert (either `assert(block_offset_ + kHeaderSize + n <= kBlockSize);` or `assert(block_offset_ + kRecyclableHeaderSize + n <= kBlockSize)`, depending on whether recycling log files is enabled) in debug build. In release, current block grows beyond 32K, `block_offset_` gets reset on next `AddRecord` and all the subsequent blocks are no longer aligned by block size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12614
Reviewed By: ltamasi
Differential Revision: D57302140
Pulled By: jowlyzhang
fbshipit-source-id: cacb5cefb7586885e52a8137ae23a95e1aefca2d
Summary:
As titled. A proper fix should probably be failing file ingestion if the DB is in a lock wal state as it promises to "Freezes the logical state of the DB".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12642
Reviewed By: pdillinger
Differential Revision: D57235869
Pulled By: jowlyzhang
fbshipit-source-id: c70031463842220f865621eb6f53424df27d29e9
Summary:
**Context/Summary:**
Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflict when sanitizing input files while later used `FilesRangeOverlapWithCompaction()` to assert for no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e, compaction can output to second-to-the-last level). So we ran into assertion violation in `CompactFiles()` like below
```
Assertion `output_level == 0 || !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed.
```
This PR make `CompactFiles()` used `FilesRangeOverlapWithCompaction()` and return Aborted status upon range conflict instead of crashing (during debug build) or proceed incorrectly (during non-debug build). To do so cleanly, I included a refactoring to make `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628
Test Plan: New UT crashed before the fix and return correct status after the fix.
Reviewed By: cbi42
Differential Revision: D57123536
Pulled By: hx235
fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430
Summary:
Seeing way too many errors likely related to PromoteL0 from https://github.com/facebook/rocksdb/issues/12617, containing
```
Cannot delete table file #N from level 0 since it is on level X
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12651
Test Plan: watch crash test results
Reviewed By: hx235
Differential Revision: D57286208
Pulled By: pdillinger
fbshipit-source-id: f7f0560cc0804ca297373c8d20ebc34986cc19d0
Summary:
Follow-up from https://github.com/facebook/rocksdb/issues/12403
The crash test was periodically failing with the
"disableWAL option is not supported if recycle_log_file_num > 0" failure, despite not setting the disableWAL from the user side.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12639
Test Plan: db_stress reproducer now passes. Added WAL recycling to txn DB unit tests, which is generally more difficult for correctness. Many tests now cover this change and pass.
Reviewed By: anand1976
Differential Revision: D57227617
Pulled By: pdillinger
fbshipit-source-id: db9abefeb505bce624b45bc64009694d2a5baed9
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12634
The patch implements support for the `MultiGetEntity` API in optimistic transactions and pessimistic transactions with the WriteCommitted policy. Similarly to the other wide-column transaction APIs, the implementation leverages the `WriteBatchWithIndex` layer.
Reviewed By: jaykorean
Differential Revision: D57177638
fbshipit-source-id: 2d9f9f287fc97e7c126830b48d21457c7c35db3f
Summary:
Adding AttributeGroup APIs in stress test. This contains the following changes only. More PRs to follow.
- Introduce `use_attribute_group` flag
- AttributeGroup `PutEntity()` and `GetEntity()` are now used per `use_attribute_group` flag in BatchOps, NonBatchOps and CfConsistency tests
In the next PRs I plan to add
- AttributeGroup `MultiGetEntity()` in Stress Test
- AttributeGroupIterator in Stress Test (along with CoalescingIterator)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12605
Test Plan:
NonBatchOps
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1
```
BatchOps
```
python3 tools/db_crashtest.py blackbox --test_batches_snapshots=1 --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1
```
CfConsistency Test
```
python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1
```
Reviewed By: ltamasi
Differential Revision: D56916768
Pulled By: jaykorean
fbshipit-source-id: 8555d9e0d05927740a10e4e8301e44beec59a6f5
Summary:
**Context/Summary:**
As titled. Bonus: found that PromoteL0 called with other concurrent PromoteL0 will return non-okay error so clarify the API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12617
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D56954428
Pulled By: hx235
fbshipit-source-id: 0e056153c515003fd241ffec59b0d8a27529db4c
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12630
The patch cleans up, improves, and brings into sync (to the extent possible without API signature changes) the sanity checks around the `GetEntity` / `MultiGetEntity` family of APIs, including the read-your-own-writes (`WriteBatchWithIndex`) and transaction layers. The checks are centralized in two main sets of entry points, namely in `DB(Impl)` and the "main" `GetEntityFromBatchAndDB` / `MultiGetEntityFromBatchAndDB` overloads in `WriteBatchWithIndex`. This eliminates the need to duplicate the checks in the transaction classes.
Reviewed By: jaykorean
Differential Revision: D57125741
fbshipit-source-id: 4dd059ef644a9b173fbba767538943397e4cc6cd
Summary:
The state of `saved_seq_for_penul_check_` is not correctly maintained with the current flow. It's supposed to store the original sequence number for a `kTypeValuePreferredSeqno` entry for use in the `DecideOutputLevel` function. However, it's not always properly cleared.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12626
Test Plan:
Added unit test that would fail before the fix
./tiered_compaction_test --gtest_filter="*InterleavedTimedPutAndPut*"
Reviewed By: pdillinger
Differential Revision: D57123469
Pulled By: jowlyzhang
fbshipit-source-id: 8d73214b3b6dc152daf19b6bd6ee9063581dc277
Summary:
After https://github.com/facebook/rocksdb/issues/12624 seeing db_stress failures due to db_crashtest.py calling it with --prefixpercent=5 --enable_compaction_filter=1
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12627
Test Plan: watch crash test
Reviewed By: ajkr
Differential Revision: D57121592
Pulled By: pdillinger
fbshipit-source-id: 55727355a7662e67efcd22d7e353153e78e24f59
Summary:
In `SaveValue()`, the read lock needs to be obtained before `VerifyEntryChecksum()` because the KV checksum verification reads the entire value metadata+data, which is all mutable when `ColumnFamilyOptions::inplace_update_support == true`.
In `MemTable::Update()`, the write lock needs to be obtained before mutating the value metadata (changing the value size) because it can be read concurrently.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12624
Test Plan:
```
$ make COMPILE_WITH_TSAN=1 -j56 db_stress
...
$ python3 tools/db_crashtest.py blackbox --simple --max_key=10 --inplace_update_support=1 --interval=10 --allow_concurrent_memtable_write=0
```
Reviewed By: cbi42
Differential Revision: D57034571
Pulled By: ajkr
fbshipit-source-id: 3dddf881ad87923143acdf6bfec12ce47bb13a48
Summary:
For manual compaction, FIFO compaction will always skip key range overlapping checking with SST files. If CompactRange() is called with CompactionRangeOptions::change_level=true, a CF with FIFO compaction will now return Status::NotSupported.
For file ingestion, we will always ingest into L0. Previously, it's possible to ingest files into non-L0 levels with FIFO compaction.
These changes also help to fix [this](a178d15baf/db/db_impl/db_impl_compaction_flush.cc (L1269)) assertion failure in crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12618
Test Plan: added unit tests to verify the new behavior.
Reviewed By: hx235
Differential Revision: D56962401
Pulled By: cbi42
fbshipit-source-id: 19812a1509650b4162b379ca5bee02f2e9d9569d
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12623
The PR adds support for the `GetEntity` API to optimistic and WriteCommitted pessimistic transactions. `MultiGetEntity` support and the `ForUpdate` variants of these read APIs will be implemented in subsequent PRs.
Reviewed By: jaykorean
Differential Revision: D57030879
fbshipit-source-id: 1f0aed6418782975fe537b6b3d437fad31fcbd43
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12606
The patch extends optimistic transactions and WriteCommitted pessimistic transactions with support for the `PutEntity` API. Similarly to the other APIs, `PutEntity` is available via both the `Transaction` and `TransactionDB` interfaces, where using the latter executes the write in a single-operation transaction as usual. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added in separate PRs.
Reviewed By: jaykorean
Differential Revision: D56911242
fbshipit-source-id: 57cf8bb6c6b1b40ba4a8a652831c13a617644289
Summary:
Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs".
This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622
Reviewed By: cbi42
Differential Revision: D56987303
Pulled By: ajkr
fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48
Summary:
when importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires each original CF to have disjoint key range, after an intra-l0 compaction, we still can end up with L0 files with the same epoch number but overlapping key range. This PR attempt to fix this by reassigning epoch numbers when importing multiple CFs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602
Test Plan:
a new repro unit test. Before this PR, it fails with
```
[ RUN ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF
db/import_column_family_test.cc:1048: Failure
db_->WaitForCompact(o)
Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range https://github.com/facebook/rocksdb/issues/44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file https://github.com/facebook/rocksdb/issues/36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3
```
Reviewed By: hx235
Differential Revision: D56851808
Pulled By: cbi42
fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84
Summary:
This PR fixes a bug in the StderrLogger that truncated the last character in the logline. The problem was that we provided an incorrect max size parameter into the vsnprintf function. The size didn't take into account the null byte that the function automatically adds.
Before fix
```
** File Read Latency Histogram By Level [default] **
2024/05/04-18:50:24.209304 4788 [/db_impl/db_impl.cc:498] Shutdown: canceling all background wor
2024/05/04-18:50:24.209598 4788 [/db_impl/db_impl.cc:692] Shutdown complet
```
After fix
```
** File Read Latency Histogram By Level [default] **
2024/05/04-18:51:19.814584 4d4d [/db_impl/db_impl.cc:498] Shutdown: canceling all background work
2024/05/04-18:51:19.815528 4d4d [/db_impl/db_impl.cc:692] Shutdown complete
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12620
Test Plan:
tested on examples/simple_example.cc with StderrLogger
Fixes: https://github.com/facebook/rocksdb/issues/12576
Reviewed By: jaykorean
Differential Revision: D56972332
Pulled By: ajkr
fbshipit-source-id: 70405e8231ae6e90d24fe0b351bc8e749176bd15
Summary:
Our external benchmark attributed a CPU regression to https://github.com/facebook/rocksdb/issues/11860. Based on the CPU profile the new overhead is from `std::deque`. The deque is always empty for these scans so we do not need to construct it. This PR lazily constructs it only when it is needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12616
Test Plan:
- Command: `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,seekrandom[-X10] -compression_type=none -disable_auto_compactions=true -write_buffer_size=524288 -value_size=1024 -num=10000 -reads=100000`
- Results
- Before this PR: `seekrandom [AVG 10 runs] : 47811 (± 431) ops/sec`
- After this PR: `seekrandom [AVG 10 runs] : 51013 (± 632) ops/sec`
Reviewed By: jaykorean
Differential Revision: D56954136
Pulled By: ajkr
fbshipit-source-id: b4d34c9b6c6c2e83d4fff06deacb9f0df2ad042f
Summary:
We've seen an internal crash test+sanitizer failure seemingly caused by underflow on `current_num_non_deletions_` which would happen if num_entries < num_deletions. (T186407810)
This change adds an additional check (fail earlier?) and coerces read table properties to satisfy the invariant that is supposed to be provided by https://github.com/facebook/rocksdb/pull/4841 but could be violated by older files, due to
https://github.com/facebook/rocksdb/pull/4016.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12600
Test Plan: existing tests
Reviewed By: ajkr
Differential Revision: D56796191
Pulled By: pdillinger
fbshipit-source-id: 6d22cc40eb74974c42b311293ee2775c6af95afc
Summary:
**Context/Summary:**
As titled. There were two flags serving the same purpose so removed one of them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12610
Test Plan: CI
Reviewed By: jaykorean, ajkr
Differential Revision: D56916119
Pulled By: hx235
fbshipit-source-id: 011140a7945782cc613ca86d4b542db0cf7fb444
Summary:
The test has been flaky for a long time. A recent [failure](https://github.com/facebook/rocksdb/actions/runs/8820808355/job/24215219590?pr=12578) shows that there is still flush running when the assertion fails. I think this is because `WaitForFlushMemTable()` may return before the a flush schedules the next compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12596
Test Plan: I could not repro the failure locally: `gtest-parallel --repeat=8000 --workers=100 ./db_compaction_test --gtest_filter="*CompactionLimiter*"`
Reviewed By: ajkr
Differential Revision: D56715874
Pulled By: cbi42
fbshipit-source-id: f5f64eb30fff7e115c19beedad2dc22afa06258d
Summary:
This PR fixes the following compile errors with Clang:
```
.../rocksdb/env/fs_on_demand.cc:184:5: error: no member named 'for_each' in namespace 'std'; did you mean 'std::ranges::for_each'?
184 | std::for_each(rchildren.begin(), rchildren.end(), [&](std::string& name) {
| ^~~~~~~~~~~~~
| std::ranges::for_each
/opt/homebrew/opt/llvm@17/bin/../include/c++/v1/__algorithm/ranges_for_each.h:68:23: note: 'std::ranges::for_each' declared here
68 | inline constexpr auto for_each = __for_each::__fn{};
| ^
.../rocksdb/env/fs_on_demand.cc:188:10: error: no member named 'sort' in namespace 'std'
188 | std::sort(result->begin(), result->end());
| ~~~~~^
.../rocksdb/env/fs_on_demand.cc:189:10: error: no member named 'sort' in namespace 'std'
189 | std::sort(rchildren.begin(), rchildren.end());
| ~~~~~^
.../rocksdb/env/fs_on_demand.cc:193:10: error: no member named 'set_union' in namespace 'std'
193 | std::set_union(result->begin(), result->end(), rchildren.begin(),
| ~~~~~^
.../rocksdb/env/fs_on_demand.cc:221:5: error: no member named 'for_each' in namespace 'std'; did you mean 'std::ranges::for_each'?
221 | std::for_each(
| ^~~~~~~~~~~~~
| std::ranges::for_each
/opt/homebrew/opt/llvm@17/bin/../include/c++/v1/__algorithm/ranges_for_each.h:68:23: note: 'std::ranges::for_each' declared here
68 | inline constexpr auto for_each = __for_each::__fn{};
| ^
.../rocksdb/env/fs_on_demand.cc:226:10: error: no member named 'sort' in namespace 'std'
226 | std::sort(result->begin(), result->end(), file_attr_sorter);
| ~~~~~^
.../rocksdb/env/fs_on_demand.cc:227:10: error: no member named 'sort' in namespace 'std'
227 | std::sort(rchildren.begin(), rchildren.end(), file_attr_sorter);
| ~~~~~^
.../rocksdb/env/fs_on_demand.cc:231:10: error: no member named 'set_union' in namespace 'std'
231 | std::set_union(rchildren.begin(), rchildren.end(), result->begin(),
| ~~~~~^
8 errors generated.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12588
Reviewed By: jaykorean
Differential Revision: D56656222
Pulled By: ajkr
fbshipit-source-id: 7e94b6250fc9edfe597a61b7622f09d6b6cd9cbd
Summary:
This PR fix the issue that deletion of obsolete files during DB::Open are not rate limited.
The root cause is slow deletion is disabled if trash/db size ratio exceeds the configured `max_trash_db_ratio` d610e14f93/include/rocksdb/sst_file_manager.h (L126) however, the current handling in DB::Open starts with tracking nothing but the obsolete files. This will make the ratio always look like it's 1.
In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move these two places that attempts to delete file after the tracking are done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call.
There are some other aesthetic changes like refactoring collecting all the DB paths into a function, rename `DBImp::DeleteUnreferencedSstFiles` to `DBImpl:: MaybeUpdateNextFileNumber` as it doesn't actually delete the files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590
Test Plan: Added unit test and verified with manual testing
Reviewed By: anand1976
Differential Revision: D56830519
Pulled By: jowlyzhang
fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885
Summary:
This also updates WriteBatch's protection info to include write time since there are several places in memtable that by default protects the whole value slice.
This PR is stacked on https://github.com/facebook/rocksdb/issues/12543
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12559
Reviewed By: pdillinger
Differential Revision: D56308285
Pulled By: jowlyzhang
fbshipit-source-id: 5524339fe0dd6c918dc940ca2f0657b5f2111c56
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/12542 introduced a bug where wrong padded bytes used to generate file checksum if flush happens during padding. This PR fixed it along with an existing same bug for `perform_data_verification_=true`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598
Test Plan:
- New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption: ...file checksum mismatch`) and passes after
- Benchmark
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none
```
Pre-PR:
fillseq [AVG 300 runs] : 421334 (± 4126) ops/sec; 46.6 (± 0.5) MB/sec
Post-PR: (no regression observed but a slight improvement)
fillseq [AVG 300 runs] : 425768 (± 4309) ops/sec; 47.1 (± 0.5) MB/sec
Reviewed By: ajkr, anand1976
Differential Revision: D56725688
Pulled By: hx235
fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5
Summary:
To make sure `TimedPut` are placed on proper tier before and when it becomes eligible for cold tier
1) flush and compaction need to keep relevant seqno to time mapping for not just the sequence number contained in internal keys, but also preferred sequence number for `TimedPut` entries.
This PR also fix some bugs in for handling `TimedPut` during compaction:
1) dealing with an edge case when a `TimedPut` entry's internal key is the right bound for penultimate level, the internal key after swapping in its preferred sequence number will fall outside of the penultimate range because preferred sequence number is smaller than its original sequence number. The entry however is still safe to be placed on penultimate level, so we keep track of `TimedPut` entry's original sequence number for this check. The idea behind this is that as long as it's safe for the original key to be placed on penultimate level, it's safe for the entry with swapped preferred sequence number to be placed on penultimate level too. Because we only swap in preferred sequence number when that entry is visible to the earliest snapshot and there is no other data points with the same user key in lower levels. On the other hand, as long as it's not safe for the original key to be placed on penultimate level, we will not place the entry after swapping the preferred seqno on penultimate level either.
2) the assertion that preferred seqno is always bigger than original sequence number may fail if this logic is only exercised after sequence number is zeroed out. We adjust the assertion to handle that case too. In this case, we don't swap in the preferred seqno but will adjust the its type to `kTypeValue`.
3) there was a special case handling for when range deletion may end up incorrectly covering an entry if preferred seqno is swapped in. But it missed the case that if the original entry is already covered by range deletion. The original handling will mistakenly output the entry instead of omitting it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12543
Test Plan:
./tiered_compaction_test --gtest_filter="PrecludeLastLevelTest.PreserveTimedPutOnPenultimateLevel"
./compaction_iterator_test --gtest_filter="*TimedPut*"
Reviewed By: pdillinger
Differential Revision: D56195096
Pulled By: jowlyzhang
fbshipit-source-id: 37ebb09d2513abbd9e90cda0217e26874584b8f3