Summary:
Fixes https://github.com/facebook/rocksdb/issues/11000.
That issue pointed out that RocksDB was slow to delete archived WALs in case time-based and size-based expiration were enabled, and the time-based threshold (`WAL_ttl_seconds`) was small. This PR prevents the delay by taking into account `WAL_ttl_seconds` when deciding the frequency to process archived WALs for deletion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12069
Reviewed By: pdillinger
Differential Revision: D51262589
Pulled By: ajkr
fbshipit-source-id: e65431a06ee96f4c599ba84a27d1aedebecbb003
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11909. The test passes after the change in https://github.com/facebook/rocksdb/issues/11917 to start mock clock from a non-zero time.
The reason for test failing is a bit complicated:
- The Put here e4ad4a0ef1/db/compaction/tiered_compaction_test.cc (L2045) happens before mock clock advances beyond 0.
- This causes oldest_key_time_ to be 0 for memtable.
- oldest_ancester_time of the first L0 file becomes 0
- L0 -> L5/6 compaction output files sets `oldest_ancestoer_time` to the current time due to these lines: 509947ce2c/db/compaction/compaction_job.cc (L1898C34-L1904).
- This causes some small sequence number to be mapped to current time: 509947ce2c/db/compaction/compaction_job.cc (L301)
- Keys in L6 is being moved up to L5 due to the unexpected seqno_to_time mapping
- When compacting keys from last level to the penultimate level, we only check keys to be within user key range of penultimate level input files. If we compact the following file 3 with file 1 and output keys to L5, we can get the reported inconsistency bug.
```
L5: file 1 [K5@20, K10@kMaxSeqno], file 2 [K10@30, K14@34)
L6: file 3 [K6@5, K10@20]
```
https://github.com/facebook/rocksdb/issues/12063 will add fixes to check internal key range when compacting keys from last level up to the penultimate level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12064
Test Plan: the unit test passes
Reviewed By: ajkr
Differential Revision: D51281149
Pulled By: cbi42
fbshipit-source-id: 00b7f026c453454d9f3af5b2de441383a96f0c62
Summary:
- Add missing null check for ColumnFamilyHandle in `GetEntity()`
- `FailIfCfHasTs()` now returns `Status::InvalidArgument()` if `column_family` is null. `MultiGetEntity()` can rely on this for cfh null check.
- Added `DeleteRange` API using Default Column Family to be consistent with other major APIs (This was also causing Java Test failure after the `FailIfCfHasTs()` change)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12057
Test Plan:
- Updated `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` to include null CF case
- Updated `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` to include null CF case
Reviewed By: jowlyzhang
Differential Revision: D51167445
Pulled By: jaykorean
fbshipit-source-id: 1c1e44fd7b7df4d2dc3bb2d7d251da85bad7d664
Summary:
When delay didn't happen, histogram WRITE_STALL is still recorded, and ticker STALL_MICROS is not recorded.
This is a bug, neither WRITE_STALL or STALL_MICROS should not be recorded when delay did not happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12067
Reviewed By: cbi42
Differential Revision: D51263133
Pulled By: ajkr
fbshipit-source-id: bd82d8328fe088d613991966e83854afdabc6a25
Summary:
Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantine them from being deleted in `ErrorHandler`.
These are the types of files being actively tracked in quarantine in this PR:
1) new table files and blob files from a background job
2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed.
Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening.
We track these files' file numbers because they share the same file number space.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030
Test Plan: Modified existing unit tests
Reviewed By: ajkr
Differential Revision: D51036774
Pulled By: jowlyzhang
fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716
Summary:
Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insisted on doing so knowing the consequence of what can be potentially disrupted.
This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them. When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK.
This PR also makes unit test that do not need force behavior to do a regular enable file deletion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001
Reviewed By: ajkr
Differential Revision: D51214683
Pulled By: jowlyzhang
fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4
Summary:
This PR adds a missing set function for rocksdb_options in the C-API:
rocksdb_options_set_cf_paths(). Without this function, users cannot
specify different paths for different column families as it will fall back
to db_paths.
As a bonus, this PR also includes rocksdb_sst_file_metadata_get_directory()
to the C api -- a missing public function that will also make the test easier to write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11151
Test Plan: Augment existing c_test to verify the specified cf_path.
Reviewed By: hx235
Differential Revision: D51201888
Pulled By: ajkr
fbshipit-source-id: 62a96451f26fab60ada2005ede3eea8e9b431f30
Summary:
#### Problem
While the RocksDB C API does have the RateLimiter API, it does not
expose the auto_tuned option.
#### Summary of Change
This PR exposes auto_tuned RateLimiter option in RocksDB C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12058
Test Plan: Augment the C API existing test to cover the new API.
Reviewed By: cbi42
Differential Revision: D51201933
Pulled By: ajkr
fbshipit-source-id: 5bc595a9cf9f88f50fee797b729ba96f09ed8266
Summary:
As titled. This PR contains the API and stubbed implementation for piping write time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12043
Reviewed By: pdillinger
Differential Revision: D51076575
Pulled By: jowlyzhang
fbshipit-source-id: 3b341263498351b9ccaff27cf35d5aeb5bdf0cf1
Summary:
- Our database is corrupted, causing some sequences of wal record to be invalid (but the `record_checksum` looks fine).
- When we RecoverLogFiles in WALRecoveryMode::kPointInTimeRecovery, `assert(seq <= kMaxSequenceNumber)` will be failed.
- When it is found that sequence is illegal, can we drop the file to recover as much data as possible ? Thx !
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11985
Reviewed By: anand1976
Differential Revision: D50698039
Pulled By: ajkr
fbshipit-source-id: 1e42113b58823088d7c0c3a92af5b3efbb5f5296
Summary:
**Context/Summary:**
It's intuitive for users to assume `TablePropertiesCollector::Finish()` is called only once by RocksDB internal by the word "finish".
However, this is currently not true as RocksDB also calls this function in `BlockBased/PlainTableBuilder::GetTableProperties()` to populate user collected properties on demand.
This PR avoids that by moving that populating to where we first call `Finish()` (i.e, `NotifyCollectTableCollectorsOnFinish`)
Bonus: clarified in the API that `GetReadableProperties()` will be called after `Finish()` and added UT to ensure that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12053
Test Plan:
- Modified test `DBPropertiesTest.GetUserDefinedTableProperties` to ensure `Finish()` only called once.
- Existing test particularly `db_properties_test, table_properties_collector_test` verify the functionality `NotifyCollectTableCollectorsOnFinish` and `GetReadableProperties()` are not broken by this change.
Reviewed By: ajkr
Differential Revision: D51095434
Pulled By: hx235
fbshipit-source-id: 1c6275258f9b99dedad313ee8427119126817973
Summary:
Implementation of `GetEntity()` API that returns wide-column entities as AttributeGroups from multiple column families for a single key. Regarding the definition of Attribute groups, please see the detailed example description in PR https://github.com/facebook/rocksdb/issues/11925
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11943
Test Plan:
- `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` added
will enable the new API in the `db_stress` after merging
Reviewed By: ltamasi
Differential Revision: D50195794
Pulled By: jaykorean
fbshipit-source-id: 218d54841ac7e337de62e13b1233b0a99bd91af3
Summary:
- The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice.
- It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization.
- Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start.
- We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031
Test Plan:
Unit Tests added
- `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled
- `DBTestUniversalCompaction2::PeriodicCompaction` for Universal
Reviewed By: cbi42
Differential Revision: D50900292
Pulled By: jaykorean
fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d
Summary:
.. and update some unit tests that failed with this change. See comment in ExternalSSTFileBasicTest.IngestFileWithCorruptedDataBlock for more explanation.
The missing status check is not caught by `ASSERT_STATUS_CHECKED=1` due to this line: 8505b26db1/table/block_based/block.h (L394). Will explore if we can remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12042
Test Plan: existing unit tests.
Reviewed By: ajkr
Differential Revision: D50994769
Pulled By: cbi42
fbshipit-source-id: c91615bccd6094a91634c50b98401d456cbb927b
Summary:
Almost each of VersionEdit private member has its own getter and setter. Current code access them with a combination of directly accessing private members and via getter and setters. There is no obvious benefits to have this pattern except potential performance gains. I tried this simple benchmark for removing the friends pattern completely, and there is no obvious regression. So I think it would good to remove VersionEdit's friends completely.
```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num_column_families=10 -num=50000000```
With change:
fillseq : 2.994 micros/op 333980 ops/sec 149.710 seconds 50000000 operations; 36.9 MB/s
fillseq : 3.033 micros/op 329656 ops/sec 151.673 seconds 50000000 operations; 36.5 MB/s
fillseq : 2.991 micros/op 334369 ops/sec 149.535 seconds 50000000 operations; 37.0 MB/s
Without change:
fillseq : 3.015 micros/op 331715 ops/sec 150.732 seconds 50000000 operations; 36.7 MB/s
fillseq : 3.044 micros/op 328553 ops/sec 152.182 seconds 50000000 operations; 36.3 MB/s
fillseq : 3.091 micros/op 323520 ops/sec 154.550 seconds 50000000 operations; 35.8 MB/s
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12024
Reviewed By: pdillinger
Differential Revision: D50806066
Pulled By: jowlyzhang
fbshipit-source-id: 35d287ce638a38c30f243f85992e615b4c90eb27
Summary:
cbi42 helped investigation and found a potential scenario where `RecoverFromRetryableBGIOError()` may start with `recovery_in_prog_ ` set as false. (and other booleans like `bg_error_` and `soft_error_no_bg_work_`)
**Thread 1**
- `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true`
**Thread 1's `recovery_thread_`**
- (waits for mutex and acquires it)
- `RecoverFromRetryableBGIOError()` -> `ResumeImpl()` -> `ClearBGError()`: sets `recovery_in_prog_ = false`
- `ClearBGError()` -> `NotifyOnErrorRecoveryEnd()`: releases `mutex`
**Thread 2**
- `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true`
- Waits for Thread 1 (`recovery_thread_`) to finish
**Thread 1's `recovery_thread_`**
- re-lock mutex in `NotifyOnErrorRecoveryEnd()`
- Still inside `RecoverFromRetryableBGIOError()`: sets `recovery_in_prog_ = false`
- Done
**Thread 2's `recovery_thread_`**
- recovery thread started with `recovery_in_prog_` set as `false`
# Fix
- Remove double-clearing `bg_error_`, `recovery_in_prog_` and other fields after `ResumeImpl()` already returned `OK()`.
- Minor typo and linter fixes in `DBErrorHandlingFSTest`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11991
Test Plan:
- `DBErrorHandlingFSTest::MultipleRecoveryThreads` added to reproduce the scenario.
- Adding `assert(recovery_in_prog_);` at the start of `ErrorHandler::RecoverFromRetryableBGIOError()` fails the test without the fix and succeeds with the fix as expected.
Reviewed By: cbi42
Differential Revision: D50506113
Pulled By: jaykorean
fbshipit-source-id: 6dabe01e9ecd3fc50bbe9019587f2f4858bed9c6
Summary:
As titled. If SstFileManager is available, deleting stale sst files will be delegated to it so it can be rate limited.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12016
Reviewed By: hx235
Differential Revision: D50670482
Pulled By: jowlyzhang
fbshipit-source-id: bde5b76ea1d98e67f6b4f08bfba3db48e46aab4e
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR.
There were two ways to achieve what we want.
1. Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`.
2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`.
We chose the latter as it involves smaller changes.
This change includes the following
- Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions`
- `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()`
- During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018
Test Plan:
- `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`.
- `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes`
Reviewed By: pdillinger
Differential Revision: D50723881
Pulled By: jaykorean
fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6
Summary:
**Context:**
DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: 519f2a41fb/db/db_impl/db_impl.cc (L525) -> 519f2a41fb/db/error_handler.cc (L250) -> 519f2a41fb/db/error_handler.cc (L808-L823)
However, due to a race between flush error recovery and db destruction, recovery can actually start after such wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part the db destruction. It then crashes the program as below.
```
std::terminate()
std::default_delete<std::thread>::operator()(std::thread*) const
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
```
**Summary:**
This PR fixed it by considering whether EndAutoRecovery() has been called before creating such thread. This fix is similar to how we currently [handle](519f2a41fb/db/error_handler.cc (L688-L694)) such case inside the created recovery thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002
Test Plan: A new UT repro-ed the crash before this fix and and pass after.
Reviewed By: ajkr
Differential Revision: D50586191
Pulled By: hx235
fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093
Summary:
With fragmented record span across multiple blocks, if any following blocks corrupted with arbitary data, and intepreted log number less than the current log number, program will fall into infinite loop due to
not skipping buffer leading bytes
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11979
Test Plan: existing unit tests
Reviewed By: ajkr
Differential Revision: D50604408
Pulled By: jowlyzhang
fbshipit-source-id: e50a0c7e7c3d293fb9d5afec0a3eb4a1835b7a3b
Summary:
... and other fixes for crash test after https://github.com/facebook/rocksdb/issues/11922.
* When pre-allocating sequence numbers for establishing a time history, record that last sequence number in the manifest so that it is (most likely) restored on recovery even if no user writes were made or were recovered (e.g. no WAL).
* When pre-allocating sequence numbers for establishing a time history, only do this for actually new DBs.
* Remove the feature that ensures non-zero sequence number on creating the first column family with preserve/preclude option after initial DB::Open. Until fixed in a way compatible with the crash test, this creates a gap where some data written with active preserve/preclude option won't have a known associated time.
Together, these ensure we don't upset the crash test by manipulating sequence numbers after initial DB creation (esp when re-opening with different options). (The crash test expects that the seqno after re-open corresponds to a known point in time from previous crash test operation, matching an expected DB state.)
Follow-up work:
* Re-fill the gap to ensure all data written under preserve/preclude settings have a known time estimate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11995
Test Plan:
Added to unit test SeqnoTimeTablePropTest.PrePopulateInDB
Verified fixes two crash test scenarios:
## 1st reproducer
First apply
```
diff --git a/db_stress_tool/expected_state.cc b/db_stress_tool/expected_state.cc
index b483e154c..ef63b8d6c 100644
--- a/db_stress_tool/expected_state.cc
+++ b/db_stress_tool/expected_state.cc
@@ -333,6 +333,7 @@ Status FileExpectedStateManager::SaveAtAndAfter(DB* db) {
s = NewFileTraceWriter(Env::Default(), soptions, trace_file_path,
&trace_writer);
}
+ if (getenv("CRASH")) assert(false);
if (s.ok()) {
TraceOptions trace_opts;
trace_opts.filter |= kTraceFilterGet;
```
Then
```
mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_expected
mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox
rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/*
CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=36000
./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=0
```
Without the fix you get
```
...
DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox]
(Re-)verified 34 unique IDs
Error restoring historical expected values: Corruption: DB is older than any restorable expected state
```
## 2nd reproducer
First apply
```
diff --git a/db_stress_tool/db_stress_test_base.cc b/db_stress_tool/db_stress_test_base.cc
index 62ddead7b..f2654980f 100644
--- a/db_stress_tool/db_stress_test_base.cc
+++ b/db_stress_tool/db_stress_test_base.cc
@@ -1126,6 +1126,7 @@ void StressTest::OperateDb(ThreadState* thread) {
// OPERATION write
TestPut(thread, write_opts, read_opts, rand_column_families, rand_keys,
value);
+ if (getenv("CRASH")) assert(false);
} else if (prob_op < del_bound) {
assert(write_bound <= prob_op);
// OPERATION delete
```
Then
```
rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/*
CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=0
./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=3600
```
Without the fix you get
```
DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox]
(Re-)verified 34 unique IDs
db_stress: db_stress_tool/expected_state.cc:380: virtual rocksdb::{anonymous}::ExpectedStateTraceRecordHandler::~
ExpectedStateTraceRecordHandler(): Assertion `IsDone()' failed.
```
Reviewed By: jowlyzhang
Differential Revision: D50533346
Pulled By: pdillinger
fbshipit-source-id: 1056be45c5b9e537c8c601b28c4b27431a782477
Summary:
Context/Summary: as titled
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11957
Test Plan: piggyback on existing tests; fixed a failed test due to adding new stats
Reviewed By: ajkr, cbi42
Differential Revision: D50294310
Pulled By: hx235
fbshipit-source-id: d99b97ebac41efc1bdeaf9ca7a1debd2927d54cd
Summary:
... when compiled with ASSERT_STATUS_CHECKED = 1.
The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782.
Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975
Test Plan:
* `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check`
* I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task.
Reviewed By: ajkr
Differential Revision: D50383790
Pulled By: cbi42
fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce
Summary:
A race condition between recovery and backup can happen with error messages like this:
```Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory```
PR https://github.com/facebook/rocksdb/issues/6949 introduced disabling file deletion during error handling of manifest IO errors. Aformentioned race condition is caused by this chain of event:
[Backup engine] disable file deletion
[Recovery] disable file deletion <= this is optional for the race condition, it may or may not get called
[Backup engine] get list of file to copy/link
[Recovery] force enable file deletion
.... some files refered by backup engine get deleted
[Backup engine] copy/link file <= error no file found
This PR fixes this with:
1) Recovery thread is currently forcing enabling file deletion as long as file deletion is disabled. Regardless of whether the previous error handling is for manifest IO error and that disabled it in the first place. This means it could incorrectly enabling file deletions intended by other threads like backup threads, file snapshotting threads. This PR does this check explicitly before making the call.
2) `disable_delete_obsolete_files_` is designed as a counter to allow different threads to enable and disable file deletion separately. The recovery thread currently does a force enable file deletion, because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error(details per PR https://github.com/facebook/rocksdb/issues/6949), resulting in `DBImpl::DisableFileDeletions` to be called multiple times too. Making a force enable file deletion call that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. However, as it shows in the race condition, it can incorrectly suppress other threads like a backup thread's intention to keep the file deletion disabled. <strike>This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely</strike>. This PR tracks and caps file deletion enabling/disabling in error handler.
3) for recovery, the section to find obsolete files and purge them was moved to be done after the attempt to enable file deletion. The actual finding and purging is more likely to happen if file deletion was previously disabled and get re-enabled now. An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support change 2) and 3). Some useful logging was explicitly added to keep those log messages around.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11955
Test Plan: existing unit tests
Reviewed By: anand1976
Differential Revision: D50290592
Pulled By: jowlyzhang
fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b
Summary:
This has become obsolete with the new `options_mutex_` in https://github.com/facebook/rocksdb/pull/11929
* Remove now-unnecessary parameter from WriteOptionsFile
* Rename (and negate) other parameter for better clarity (the caller shouldn't tell the callee what the callee needs, just what the caller knows, provides, and requests)
* Move a ROCKS_LOG_WARN (I/O) in WriteOptionsFile to outside of holding DB mutex.
* Also *avoid* (but not always eliminate) write queue synchronization in SetDBOptions. Still needed if there was a change to WAL size limit or other configuration.
* Improve some comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11951
Test Plan: existing unit tests and TSAN crash test local run
Reviewed By: ajkr
Differential Revision: D50247904
Pulled By: pdillinger
fbshipit-source-id: 7dfe445c705ec013886a2adb7c50abe50d83af69
Summary:
Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`.
An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families.
Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`.
Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf".
Prior to this PR, if the user wants to query for two keys, "key_1" and "key_2" and but only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and get the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo.
With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925
Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added
will enable this new API in the `db_stress` in a separate PR
Reviewed By: ltamasi
Differential Revision: D50017414
Pulled By: jaykorean
fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586
Summary:
We saw the following TSAN stress test failure:
```
WARNING: ThreadSanitizer: data race (pid=17523)
Write of size 1 at 0x7b8c000008b9 by thread T4243 (mutexes: write M0):
#0 rocksdb::ErrorHandler::RecoverFromRetryableBGIOError() fbcode/internal_repo_rocksdb/repo/db/error_handler.cc:742 (db_stress+0x95f954) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
https://github.com/facebook/rocksdb/issues/1 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (rocksdb::ErrorHandler::*)(), rocksdb::ErrorHandler*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74 (db_stress+0x95fc2b) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
https://github.com/facebook/rocksdb/issues/2 execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18 (libstdc++.so.6+0xdf4e4) (BuildId: 452d1cdae868baeeb2fdf1ab140f1c219bf50c6e)
Previous read of size 1 at 0x7b8c000008b9 by thread T22:
#0 rocksdb::DBImpl::SyncClosedLogs(rocksdb::JobContext*, rocksdb::VersionEdit*) fbcode/internal_repo_rocksdb/repo/db/error_handler.h:76 (db_stress+0x84f69c) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
```
This is due to a data race in accessing `recovery_in_prog_`. This PR fixes it by accessing `recovery_in_prog_` under db mutex before calling `SyncClosedLogs()`. I think the original PR https://github.com/facebook/rocksdb/pull/10489 intended to clear the error if it's a recovery flush. So ideally we can also just check flush reason. I plan to keep a safer change in this PR and make that change in the future if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11950
Test Plan: check future TSAN stress test results.
Reviewed By: anand1976
Differential Revision: D50242255
Pulled By: cbi42
fbshipit-source-id: 0d487948ef9546b038a34460f3bb037f6e5bfc58
Summary:
We were seeing the following stress test failures:
```LevelCompactionBuilder::PickFileToCompact(const rocksdb::autovector<std::pair<int, rocksdb::FileMetaData*> >&, bool): Assertion `!level_file.second->being_compacted' failed```
This can happen when we are picking a file to be compacted from some files marked for compaction, but that file is already being_compacted. We prevent this by always calling `ComputeCompactionScore()` after we pick a compaction and mark some files as being_compacted. However, if SetOptions() is called to disable marking certain files to be compacted, say `enable_blob_garbage_collection`, we currently just skip the relevant logic in `ComputeCompactionScore()` without clearing the existing files already marked for compaction. This PR fixes this issue by already clearing these files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11946
Test Plan: existing tests.
Reviewed By: akankshamahajan15
Differential Revision: D50232608
Pulled By: cbi42
fbshipit-source-id: 11e4fb5e9d48b0f946ad33b18f7c005f0161f496
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect.
To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine.
I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race.
Intended follow-up to this:
* Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929
Test Plan:
Added two mini-stress style regression tests that fail with >1% probability before this change:
DBOptionsTest::SetStatsDumpPeriodSecRace
ColumnFamilyTest::CreateAndDropPeriodicRace
I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway:
DBOptionsTest::SetStatsDumpPeriodSecRace
Reviewed By: ajkr
Differential Revision: D50024506
Pulled By: pdillinger
fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4
Summary:
1. Prevent a double join on a `port::Thread`
2. Ensure `recovery_in_prog_` and `bg_error_` are both set under same lock hold. This is useful for writers who see a non-OK `bg_error_` and are deciding whether to stall based on whether the error will be auto-recovered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11939
Reviewed By: cbi42
Differential Revision: D50155484
Pulled By: ajkr
fbshipit-source-id: fbc1f85c50e7eaee27ee0e376aee688d8a06c93b
Summary:
cbi42 pointed out a race condition in which `recovery_io_error_` and `recovery_error_` could be updated inconsistently due to releasing the DB mutex in `EventHelpers::NotifyOnBackgroundError()`. There doesn't seem to be a point to having two status objects, so this PR consolidates them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11937
Reviewed By: cbi42
Differential Revision: D50105793
Pulled By: ajkr
fbshipit-source-id: 3de95baccfa44351a49a5c2aa0986c9bc81baa8f
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.
This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879
Reviewed By: anand1976
Differential Revision: D49571196
Pulled By: ajkr
fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11930
The patch cleans up and refactors the logic in/around `WriteBatchWithIndexInternal` a bit as groundwork for further changes. Specifically, the class is turned back into a stateless collection of static helpers (which is the way it was before PR 6851). Note that there were two apparent reasons for introducing this instance state in PR 6851: a) encapsulating `MergeContext` and b) resolving objects like `Logger` and `Statistics` based on a variety of handles. However, neither reason seems justified at this point. Regarding a), the `MultiGetFromBatchAndDB` logic passes in its own `MergeContext` objects via a second set of methods that do not use the member `MergeContext`. As for b), `Logger` and friends are only needed for Merge, which is only supported if a column family handle is provided; in turn, the column family handle enables us to resolve all the necessary objects without the need for any other handles like `DB` or `DBOptions`. In addition to the above, the patch changes the type of `BaseDeltaIterator::merge_result_` to `std::string` from `PinnableSlice` (since no pinning is ever done) and makes some other small code quality improvements.
Reviewed By: jaykorean
Differential Revision: D50038302
fbshipit-source-id: 5f34abe2e808bdaea0f3a8033b5764ebd446b85d
Summary:
This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920):
* Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests.
* For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time.
Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.)
Recommended follow-up:
* Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect.
* Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.)
No release note for the minor bug fix because this is still an experimental feature with limited usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922
Test Plan: tests added / updated
Reviewed By: jowlyzhang
Differential Revision: D49956563
Pulled By: pdillinger
fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.
Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920
Test Plan: tests added / updated
Reviewed By: jaykorean
Differential Revision: D49919147
Pulled By: pdillinger
fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
Summary:
1. **Error** in TestIterateAgainstExpected API - `Assertion index < pre_read_expected_values.size() && index < post_read_expected_values.size() failed.`
**Fix** - `Prev` op is not supported with `auto_readahead_size`. So added support to Reseek in db_iter, if Prev is called. In BlockBasedTableIterator, index_iter_ already moves forward. So there is no way to do Prev from BlockBasedTableIterator.
2. **Error** - `void rocksdb::BlockBasedTableIterator::BlockCacheLookupForReadAheadSize(uint64_t, size_t, size_t&): Assertion index_iter_->value().handle.offset() == offset`
**Fix** - Remove prefetch_buffer to be used when uncompressed dict is read.
3. ** Error in TestPrefixScan API - `db_stress: db/db_iter.cc:369: bool rocksdb::DBIter::FindNextUserEntryInternal(bool, const rocksdb::Slice*): Assertion !skipping_saved_key || CompareKeyForSkip(ikey_.user_key, saved_key_.GetUserKey()) > 0 failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
db_stress: table/merging_iterator.cc:1036: bool rocksdb::MergingIterator::SkipNextDeleted(): Assertion comparator_->Compare(range_tombstone_iters_[i]->start_key(), pik) <= 0 failed`
**Fix** - SeekPrev also calls 1) SeekPrev , 2)Seek and then 3)Prev in some cases in db_iter.cc leading to failure of Prev operation. These backward operations also call Seek. Added direction to disable lookup once direction is backwards in BlockBasedTableIterator.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11884
Test Plan: Ran various flavors of crash tests locally for the whole duration
Reviewed By: anand1976
Differential Revision: D49834201
Pulled By: akankshamahajan15
fbshipit-source-id: 9a007b4d46a48002c43dc4623a400ecf47d997fe
Summary:
Since allowing 24hr peak by setting start_time = end_time is not so intuitive, we are not going to allow it (e.g. `00:00-00:00` doesn't looks like a value that would cover 24hr.). Instead, we are going to compare at minute level (i.e. dropping the seconds to the nearest minute) so that `00:00-23:59` will cover 24hrs. The entire minute from 23:59:00 23:59:59 will be covered with this change.
Minor fixes from previous PR
- release build error
- fixed random seed in test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11911
Test Plan:
`DBOptionsTest::OffPeakTimes`
`make -j64 static_lib` to test release build issue that was fixed
Reviewed By: pdillinger
Differential Revision: D49787795
Pulled By: jaykorean
fbshipit-source-id: e8d045b95f54f61d5dd5f1bb473579f8d55c18b3
Summary:
Recovery triggers flushes for very different scenarios:
(1) `FlushReason::kErrorRecoveryRetryFlush`: a flush failed
(2) `FlushReason::kErrorRecovery`: a WAL may be corrupted
(3) `FlushReason::kCatchUpAfterErrorRecovery`: immutable memtables may have accumulated
The old code called called `FlushAllColumnFamilies()` in all cases, which uses manual flush functions: `AtomicFlushMemTables()` and `FlushMemTable()`. Forcing flushing the latest data on all CFs was useful for (2) because it ensures all CFs move past the corrupted WAL.
However, those code paths were overkill for (1) and (3), where only already-immutable memtables need to be flushed. There were conditionals to exclude some of the extraneous logic but I found there was still too much happening. For example, both of the manual flush functions enter the write thread. Entering the write thread is inconvenient because then we can't allow stalled writes to wait on a retrying flush to finish.
Instead of continuing down the path of adding more conditionals to the manual flush functions, this PR introduces a dedicated function for cases (1) and (3): `RetryFlushesForErrorRecovery()`. Also I cleaned up the manual flush functions to remove existing conditionals for these cases as they're no longer needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11903
Reviewed By: cbi42
Differential Revision: D49693812
Pulled By: ajkr
fbshipit-source-id: 7630ac539b9d6c92052c13a3cdce53256134d990
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913
The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator.
Reviewed By: jaykorean
Differential Revision: D49795097
fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3
Summary:
After https://github.com/facebook/rocksdb/issues/11905, I am preparing a DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping. **Intended follow-up**
However, there are a number of test changes I want to make prior to that to make it clear that I am not regressing the tests and production behavior at the same time.
* Start mock time in the tests well beyond epoch (time 0) so that we aren't normally reaching into pre-history for current time minus the preserve/preclude duration.
* Majorly clean up BasicSeqnoToTimeMapping to avoid confusing hard-coded bounds on GetProximalTimeBeforeSeqno() results.
* There is an unresolved/unexplained issue marked with FIXME that should be investigated when GetProximalTimeBeforeSeqno() is put into production.
* MultiCFs test was strangely generating 5 L0 files, four of which would be compacted into an L1, and then letting TTL compaction compact 1@L0+1@L1. Changing the starting time of the tests seemed to mess up the TTL compaction. But I suspect the TTL compaction was unintentional, so I've cut it down to just 4 L0 files, which compacts predictably.
* Unrelated: allow ROCKSDB_NO_STACK=1 to skip printing a stack trace on assertion failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11917
Test Plan: no changes to production code
Reviewed By: jowlyzhang
Differential Revision: D49841436
Pulled By: pdillinger
fbshipit-source-id: 753348ace9c548e82bcb77fcc8b2ffb7a6beeb0a
Summary:
Missed `GetFileSize()` forwarding , this PR fix this issue, and mark `WritableFile::GetFileSize()` as pure virtual to detect such issue in compile time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11726
Reviewed By: ajkr
Differential Revision: D49791240
Pulled By: jowlyzhang
fbshipit-source-id: ef219508d6b15c9a24df9b706a9fdc33cc6a286e
Summary:
RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage.
In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893
Test Plan:
- New Unit Test Added - `DBOptionsTest::OffPeakTimes`
- Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest`
Reviewed By: pdillinger
Differential Revision: D49714553
Pulled By: jaykorean
fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752
Summary:
This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). **Intended follow-up**
However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping.
Functional fixes / changes:
* This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc
* That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.)
* Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting best results for GetProximalSeqnoBeforeTime. And this makes it easier to make clear the contract of each API function.
* Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples.
Some part of these changes has required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work.
An apparent compaction bug is revealed in
PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909
Cosmetic / code safety things (not exhaustive):
* Fix some confusing names.
* `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name.
* Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing.
* Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions.
* Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind.
* (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.)
* A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`).
* Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11905
Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others.
Reviewed By: jowlyzhang
Differential Revision: D49755592
Pulled By: pdillinger
fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51