Commit graph

607 commits

Author SHA1 Message Date
Andrew Kryczka 8a9cfd5292 Make stopped writes block on recovery (#11879)
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.

This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879

Reviewed By: anand1976

Differential Revision: D49571196

Pulled By: ajkr

fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0
2023-10-10 06:29:01 -07:00
darionyaphet ee0829ba76 fix typo snapshto (#11817)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11817

Reviewed By: jaykorean

Differential Revision: D50103497

Pulled By: ltamasi

fbshipit-source-id: 77c5cf86ff7eb5021fc91b03225882536163af7b
2023-10-09 19:10:06 -07:00
Peter Dillinger 1d5bddbc58 Bootstrap, pre-populate seqno_to_time_mapping (#11922)
Summary:
This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920):
* Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests.
* For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time.

Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.)

Recommended follow-up:
* Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect.
* Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.)

No release note for the minor bug fix because this is still an experimental feature with limited usage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922

Test Plan: tests added / updated

Reviewed By: jowlyzhang

Differential Revision: D49956563

Pulled By: pdillinger

fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519
2023-10-06 08:21:21 -07:00
Peter Dillinger 141b872bd4 Improve efficiency of create_missing_column_families, light refactor (#11920)
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.

Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920

Test Plan: tests added / updated

Reviewed By: jaykorean

Differential Revision: D49919147

Pulled By: pdillinger

fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
2023-10-04 14:14:22 -07:00
Jay Huh 5fbea87859 Disallow start_time == end_time in offpeak time and compare at minute level to allow 24hr offpeak (#11911)
Summary:
Since allowing 24hr peak by setting start_time = end_time is not so intuitive, we are not going to allow it (e.g. `00:00-00:00` doesn't looks like a value that would cover 24hr.). Instead, we are going to compare at minute level (i.e. dropping the seconds to the nearest minute) so that `00:00-23:59` will cover 24hrs. The entire minute from 23:59:00 23:59:59 will be covered with this change.

Minor fixes from previous PR
- release build error
- fixed random seed in test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11911

Test Plan:
`DBOptionsTest::OffPeakTimes`
`make -j64 static_lib` to test release build issue that was fixed

Reviewed By: pdillinger

Differential Revision: D49787795

Pulled By: jaykorean

fbshipit-source-id: e8d045b95f54f61d5dd5f1bb473579f8d55c18b3
2023-10-02 16:52:39 -07:00
Andrew Kryczka 10fd05e394 Give retry flushes their own functions (#11903)
Summary:
Recovery triggers flushes for very different scenarios:

(1) `FlushReason::kErrorRecoveryRetryFlush`: a flush failed
(2) `FlushReason::kErrorRecovery`: a WAL may be corrupted
(3) `FlushReason::kCatchUpAfterErrorRecovery`: immutable memtables may have accumulated

The old code called called `FlushAllColumnFamilies()` in all cases, which uses manual flush functions: `AtomicFlushMemTables()` and `FlushMemTable()`. Forcing flushing the latest data on all CFs was useful for (2) because it ensures all CFs move past the corrupted WAL.

However, those code paths were overkill for (1) and (3), where only already-immutable memtables need to be flushed. There were conditionals to exclude some of the extraneous logic but I found there was still too much happening. For example, both of the manual flush functions enter the write thread. Entering the write thread is inconvenient because then we can't allow stalled writes to wait on a retrying flush to finish.

Instead of continuing down the path of adding more conditionals to the manual flush functions, this PR introduces a dedicated function for cases (1) and (3): `RetryFlushesForErrorRecovery()`. Also I cleaned up the manual flush functions to remove existing conditionals for these cases as they're no longer needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11903

Reviewed By: cbi42

Differential Revision: D49693812

Pulled By: ajkr

fbshipit-source-id: 7630ac539b9d6c92052c13a3cdce53256134d990
2023-10-02 16:26:24 -07:00
Jay Huh 63ed868840 Offpeak in db option (#11893)
Summary:
RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage.

In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893

Test Plan:
- New Unit Test Added - `DBOptionsTest::OffPeakTimes`
- Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest`

Reviewed By: pdillinger

Differential Revision: D49714553

Pulled By: jaykorean

fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752
2023-09-29 13:03:39 -07:00
Peter Dillinger 02443dd93f Refactor, clean up, fixes, and more testing for SeqnoToTimeMapping (#11905)
Summary:
This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). **Intended follow-up**

However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping.

Functional fixes / changes:
* This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc
* That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.)
* Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting best results for GetProximalSeqnoBeforeTime. And this makes it easier to make clear the contract of each API function.
  * Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples.

Some part of these changes has required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work.

An apparent compaction bug is revealed in
PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909

Cosmetic / code safety things (not exhaustive):
* Fix some confusing names.
  * `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name.
  * Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing.
  * Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions.
  * Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind.
  * (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.)
  * A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`).
  * Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11905

Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others.

Reviewed By: jowlyzhang

Differential Revision: D49755592

Pulled By: pdillinger

fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51
2023-09-29 11:21:59 -07:00
Changyu Bi 1c871a4d86 Only flush after recovery for retryable IOError (#11880)
Summary:
https://github.com/facebook/rocksdb/issues/11872 causes a unit test to start failing with the error message below. The cause is that the additional call to `FlushAllColumnFamilies()` in `DBImpl::ResumeImpl()` can run while DB is closing. More detailed explanation: there are two places where we call `ResumeImpl()`:

1. in `ErrorHandler::RecoverFromBGError`, for manual resume or recovery from errors like OutOfSpace through sst file manager, and
2. in `Errorhandler::RecoverFromRetryableBGIOError`, for error recovery from errors like flush failure due to retryable IOError. This is tracked by `ErrorHandler::recovery_thread_`.

Here is how DB close waits for error recovery: 49da91ec09/db/db_impl/db_impl.cc (L540-L543)

`CancelErrorRecovery()` waits until `recovery_thread_` finishes and `IsRecoveryInProgress()` checks the `recovery_in_prog_` flag. The additional call to `FlushAllColumnFamilies()` in `ResumeImpl()` happens after it clears bg error and the `recovery_in_prog_` flag: 49da91ec09/db/db_impl/db_impl.cc (L436-L463). So if `ResumeImpl()` is called in `RecoverFromBGError()`, we can have a thread running `FlushAllColumnFamilies()` while DB is closing and thought that recovery is done.

The fix is to only do the additional call to `FlushAllColumnFamilies()` when doing error recovery through `Errorhandler::RecoverFromRetryableBGIOError` by setting flags in `DBRecoverContext`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11880

Test Plan:
`gtest-parallel --repeat=100 --workers=4 ./error_handler_fs_test --gtest_filter="*AutoRecoverFlushError*"` reproduces the error pretty reliably.

```[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.AutoRecoverFlushError
error_handler_fs_test: db/column_family.cc:1618: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
Received signal 6 (Aborted)
...
https://github.com/facebook/rocksdb/issues/10 0x00007fac4409efd6 in __GI___assert_fail (assertion=0x7fac452c0afa "last_ref", file=0x7fac452c9fb5 "db/column_family.cc", line=1618, function=0x7fac452cb950 "rocksdb::ColumnFamilySet::~ColumnFamilySet()") at assert.c:101
101     in assert.c
https://github.com/facebook/rocksdb/issues/11 0x00007fac44b5324f in rocksdb::ColumnFamilySet::~ColumnFamilySet (this=0x7b5400000000) at db/column_family.cc:1618
1618        assert(last_ref);
https://github.com/facebook/rocksdb/issues/12 0x00007fac44e0f047 in std::default_delete<rocksdb::ColumnFamilySet>::operator() (this=0x7b5800000940, __ptr=0x7b5400000000) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85              delete __ptr;
https://github.com/facebook/rocksdb/issues/13 std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182               _M_deleter()(__old_p);
https://github.com/facebook/rocksdb/issues/14 std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456             _M_t.reset(std::move(__p));
https://github.com/facebook/rocksdb/issues/15 rocksdb::VersionSet::~VersionSet (this=this@entry=0x7b5800000900) at db/version_set.cc:5081
5081      column_family_set_.reset();
https://github.com/facebook/rocksdb/issues/16 0x00007fac44e0f97a in rocksdb::VersionSet::~VersionSet (this=0x7b5800000900) at db/version_set.cc:5078
5078    VersionSet::~VersionSet() {
https://github.com/facebook/rocksdb/issues/17 0x00007fac44bf0b2f in std::default_delete<rocksdb::VersionSet>::operator() (this=0x7b8c00000068, __ptr=0x7b5800000900) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85              delete __ptr;
https://github.com/facebook/rocksdb/issues/18 std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182               _M_deleter()(__old_p);
https://github.com/facebook/rocksdb/issues/19 std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456             _M_t.reset(std::move(__p));
https://github.com/facebook/rocksdb/issues/20 rocksdb::DBImpl::CloseHelper (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:676
676       versions_.reset();
https://github.com/facebook/rocksdb/issues/21 0x00007fac44bf1346 in rocksdb::DBImpl::CloseImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:720
720     Status DBImpl::CloseImpl() { return CloseHelper(); }
https://github.com/facebook/rocksdb/issues/22 rocksdb::DBImpl::~DBImpl (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:738
738       closing_status_ = CloseImpl();
https://github.com/facebook/rocksdb/issues/23 0x00007fac44bf2bba in rocksdb::DBImpl::~DBImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:722
722     DBImpl::~DBImpl() {
https://github.com/facebook/rocksdb/issues/24 0x00007fac455444d4 in rocksdb::DBTestBase::Close (this=this@entry=0x7b6c00000000) at db/db_test_util.cc:678
678       delete db_;
https://github.com/facebook/rocksdb/issues/25 0x00007fac455455fb in rocksdb::DBTestBase::TryReopen (this=this@entry=0x7b6c00000000, options=...) at db/db_test_util.cc:707
707       Close();
https://github.com/facebook/rocksdb/issues/26 0x00007fac45543459 in rocksdb::DBTestBase::Reopen (this=0x7ffed74b79a0, options=...) at db/db_test_util.cc:670
670       ASSERT_OK(TryReopen(options));
https://github.com/facebook/rocksdb/issues/27 0x00000000004f2522 in rocksdb::DBErrorHandlingFSTest_AutoRecoverFlushError_Test::TestBody (this=this@entry=0x7b6c00000000) at db/error_handler_fs_test.cc:1224
1224      Reopen(options);
```

Reviewed By: jowlyzhang

Differential Revision: D49579701

Pulled By: cbi42

fbshipit-source-id: 3fc8325e6dde7e7faa8bcad95060cb4e26eda638
2023-09-25 09:34:39 -07:00
Changyu Bi 0086809601 Fix a bug with atomic_flush that causes DB to stuck after a flush failure (#11872)
Summary:
With atomic_flush=true, a flush job with younger memtables wait for older memtables to be installed before install its memtables. If the flush for older memtables failed, auto-recovery starts a resume thread which can becomes stuck waiting for all background work to finish (including the flush for younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to background error but never rollback its memtable: 269478ee46/db/db_impl/db_impl_compaction_flush.cc (L725) This prevents any future flush to pick old memtables.

A more detailed repro is in unit test.

This PR fixes this issue by
1. Ensure we rollback memtables if an atomic flush fails due to background error
2. When there is a background error, abort atomic flushes that are waiting for older memtables to be installed
3. Do not schedule non-recovery flushes when there is a background error that stops background work

There was another issue with atomic_flush=true where DB can hang during DB close, see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11872

Test Plan: new unit test.

Reviewed By: jowlyzhang

Differential Revision: D49556867

Pulled By: cbi42

fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f
2023-09-22 16:43:50 -07:00
Levi Tamasi 12d9386a4f Return a special OK status when the number of merge operands exceeds a threshold (#11870)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11870

Having a large number of merge operands applied at query time can have a significant effect on performance; therefore, applications might want limit the number of deltas for any given key. However, there is currently no way to establish the number of operands for certain types of queries. The ticker `READ_NUM_MERGE_OPERANDS` only provides aggregate (not per-read) information. The `PerfContext` counters `internal_merge_count` and `internal_merge_point_lookup_count` can be used to get this information on a per-query basis for iterators and single point lookups; however, there is no per-key breakdown for `MultiGet` type APIs. The patch addresses this issue by introducing a special kind of OK status which signals that an application-defined threshold on the number of merge operands has been exceeded for a given key. The threshold can be specified on a per-query basis using a new field in `ReadOptions`.

Reviewed By: jaykorean

Differential Revision: D49522786

fbshipit-source-id: 4265b3848d1be5ff313a3e8fb604ddf56411dd2c
2023-09-22 13:49:19 -07:00
Changyu Bi b927ba5936 Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an  error message like the following:

IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```

This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: f42e70bf56/db/db_impl/db_impl_compaction_flush.cc (L3161)

This PR fixes the issue by rollback these pending flushes.

There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.

There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865

Test Plan: * Added repro unit tests.

Reviewed By: anand1976

Differential Revision: D49484922

Pulled By: cbi42

fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 15:31:29 -07:00
Peter (Stig) Edwards bf488c3052 Use *next_sequence -1 here (#11861)
Summary:
To fix off-by-one error:   Transaction could not check for conflicts for operation at SequenceNumber 500000 as the MemTable only contains changes newer than SequenceNumber 500001.

Fixes https://github.com/facebook/rocksdb/issues/11822

I think introduced in a657ee9a9c

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11861

Reviewed By: pdillinger

Differential Revision: D49457273

Pulled By: ajkr

fbshipit-source-id: b527cbae4ecc7874633a11f07027adee62940d74
2023-09-21 13:52:01 -07:00
Hui Xiao 089070cb36 Expose more info about input files in CompactionFilter::Context (#11857)
Summary:
**Context:**
As requested, lowest level as well as a map from input file to its table properties among all input files used in table creation  (if any) are exposed in `CompactionFilter::Context`.

**Summary:**
This PR contains two commits:
(1) [Refactory](0012777f0e) to make resonating/using what is in `Compaction:: table_properties_` easier
- Separate `Compaction:: table_properties_` into `Compaction:: input_table_properties_` and `Compaction:: output_table_properties_`
- Separate the "set input table properties" logic into `Compaction:: SetInputTableProperties()`) from `Compaction:: GetInputTableProperties`
- Call `Compaction:: SetInputTableProperties()` as soon as possible, which is right after `Compaction::SetInputVersion()`. Bundle these two functions into one `Compaction::FinalizeInputInfo()` to minimize missing one or the other

(2) [Expose more info about input files:](6093e7dfba) `CompactionFilter::Context::input_start_level/input_table_properties`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11857

Test Plan:
- Modify existing UT `
TEST_F(DBTestCompactionFilter, CompactionFilterContextManual)` to cover new logics

Reviewed By: ajkr

Differential Revision: D49402540

Pulled By: hx235

fbshipit-source-id: 469fff50fa0e5964ffa5ea8db0743f61438ea392
2023-09-20 13:34:39 -07:00
Changyu Bi cc254efea6 Release compaction files in manifest write callback (#11764)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply().  This is done by passing in a [callback](ba59751430/db/version_set.h (L1199)) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11764

Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released

Reviewed By: ajkr

Differential Revision: D48742152

Pulled By: cbi42

fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
2023-09-18 13:11:53 -07:00
dengyan 0dac75d542 Fix a bug in MultiGet when skip_memtable is true (#11700)
Summary:
When skip_memtable is true in MultiGetImpl, The lookup_current is always false, Causes data to be unable to be queried in super_version->current。

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11700

Reviewed By: anand1976

Differential Revision: D49342877

Pulled By: jowlyzhang

fbshipit-source-id: 270a36d049b4cb7fd151a1fa3080300310111271
2023-09-18 12:06:58 -07:00
Jay Huh cff6490bc4 Add IOActivity.kMultiGetEntity (#11842)
Summary:
- As a follow up from https://github.com/facebook/rocksdb/issues/11799, adding `Env::IOActivity::kMultiGetEntity` support to `DBImpl::MultiGetEntity()`.

## Minor Refactor
- Because both `DBImpl::MultiGet()` and `DBImpl::MultiGetEntity()` call `DBImpl::MultiGetCommon()` which later calls `DBImpl::MultiGetWithCallback()` where we check `Env::IOActivity::kMultiGet`, minor refactor was needed so that we don't check `Env::IOActivity::kMultiGet` for  `DBImpl::MultiGetEntity()`.
- I still see more areas for refactoring to avoid duplicate code of checking IOActivity and setting it when Unknown, but this will be addressed separately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11842

Test Plan:
- Added the `ThreadStatus::OperationType::OP_MULTIGETENTITY` in `db_stress` to verify the pass-down IOActivity in a thread aligns with the actual activity the thread is doing.
```
python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
```

Reviewed By: ltamasi

Differential Revision: D49329575

Pulled By: jaykorean

fbshipit-source-id: 05198f1d3f92e6be3d42a3d184bacb3ab2ce6923
2023-09-15 15:34:04 -07:00
leipeng 68ce5d84f6 Add new Iterator API Refresh(const snapshot*) (#10594)
Summary:
This PR resolves https://github.com/facebook/rocksdb/issues/10487 & https://github.com/facebook/rocksdb/issues/10536, user code needs to call Refresh() periodically.

The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is `db_iter_->set_sequence(read_seq)  -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno()`.

This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10594

Test Plan:
* New unit tests.
* Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case.

TODO in a following PR:
* Stress test Iterator::Refresh() to different snapshots or no snapshot.

Reviewed By: ajkr

Differential Revision: D48456896

Pulled By: cbi42

fbshipit-source-id: 2e642c04e91235cc9542ef4cd37b3c20823bd779
2023-09-15 10:44:43 -07:00
Hui Xiao ed913513bd Fix a bug of rocksdb.file.read.verify.file.checksums.micros not being populated (#11836)
Summary:
**Context/Summary:**
`rocksdb.file.read.verify.file.checksums.micros ` was added in https://github.com/facebook/rocksdb/pull/11444 but the related path was not populated with statistics and clock object correctly so the actual statistics collection didn't happen. This PR fixed it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11836

Test Plan:
Setup:
```
./db_bench --benchmarks="fillrandom" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb
```
Run:
```
./db_bench --use_existing_db=1  --benchmarks="verifyfilechecksums" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb --statistics=1 --stats_level=4
```
Post-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 9.000000 P95 : 9.000000 P99 : 9.000000 P100 : 9.000000 COUNT : 1 SUM : 9
```

Pre-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

Reviewed By: ajkr

Differential Revision: D49293378

Pulled By: hx235

fbshipit-source-id: 1acd8b828c28e088d0c5d63897f53cd180b82f42
2023-09-15 10:36:14 -07:00
Yu Zhang e1fd348b92 Fix a bug in multiget for cleaning up SuperVersion (#11830)
Summary:
When `MultiGet` acquires `SuperVersion` via locking the db mutex and get the current `ColumnFamilyData::super_version_`, its corresponding cleanup logic is not correctly done.

It's currently doing this:
`MultiGetColumnFamilyData::cfd->GetSuperVersion().Unref()`

This operates on the most recent `SuperVersion` without locking db mutex , which is not thread safe by itself. And this unref operation is intended for the originally acquired `SuperVersion` instead of the current one. Because a race condition could happen where a new `SuperVersion` is installed in between this `MultiGet`'s ref and unref. When this race condition does happen, it's not sufficient to just unref the `SuperVersion`, `DBImpl::CleanupSuperVersion` should be called instead to properly clean up the `SuperVersion` had this `MultiGet` call be its last reference holder.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11830

Test Plan:
`make all check`
Added a unit test that would originally fail

Reviewed By: ltamasi

Differential Revision: D49287715

Pulled By: jowlyzhang

fbshipit-source-id: 8353636ee11b2e90d85c677a96a92360072644b0
2023-09-15 09:50:39 -07:00
Jay Huh f2b623bcc1 GetEntity Support for ReadOnlyDB and SecondaryDB (#11799)
Summary:
`GetEntity` API support for ReadOnly DB and Secondary DB.
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_readonly` and refactored current `Get()` logic into `GetImpl()` so that look up logic can be reused for `GetEntity()` (Following the same pattern as `DBImpl::Get()` and `DBImpl::GetEntity()`)
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_secondary` and refactored current `GetImpl()` logic. This is to make `DBImplSecondary::Get/GetEntity` consistent with `DBImpl::Get/GetEntity` and `DBImplReadOnly::Get/GetEntity`
- `GetImpl()` in `db_impl` is now virtual. both `db_impl_readonly` and `db_impl_secondary`'s `Get()` override are no longer needed since all three dbs now have the same `Get()` which calls `GetImpl()` internally.
- `GetImpl()` in `DBImplReadOnly` and `DBImplSecondary` now pass in `columns` instead of `nullptr` in lookup functions like `memtable->get()`
- Introduced `GetEntity()` API in `DBImplReadOnly` and `DBImplSecondary` which simply calls `GetImpl()` with `columns` set in `GetImplOptions`.
- Introduced `Env::IOActivity::kGetEntity` and set read_options.io_activity to `Env::IOActivity::kGetEntity` for `GetEntity()` operations (in db_impl)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11799

Test Plan:
**Unit Tests**
- Added verification in `DBWideBasicTest::PutEntity` by Reopening DB as ReadOnly with the same setup.
- Added verification in `DBSecondaryTest::ReopenAsSecondary` by calling `PutEntity()` and `GetEntity()` on top of existing `Put()` and `Get()`
- `make -j64 check`

**Crash Tests**
- `python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`
- `python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 `
- `python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`

Reviewed By: ltamasi

Differential Revision: D49037040

Pulled By: jaykorean

fbshipit-source-id: a0648253ded6e91af7953de364ed3c6bf163626b
2023-09-15 08:30:44 -07:00
马越 3c27f56d0b Fix the problem that some keys of ClipColumnFamily may not be deleted (#11811)
Summary:
When executing ClipColumnFamily, if end_key is equal to largest_user_key in a file, this key will not be deleted. So we need to change less than to less than or equal to

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11811

Reviewed By: ajkr

Differential Revision: D49206936

Pulled By: cbi42

fbshipit-source-id: 3e8bcb7b52040a9b4d1176de727616cc298d3445
2023-09-14 13:36:39 -07:00
Jonah Gao 84d335b619 Remove an unused variable: last_stats_dump_time_microsec_ (#11824)
Summary:
`last_stats_dump_time_microsec_`  is not used after initialization.

I guess that it was previously used to implement periodically dumping stats,
but this functionality has now been delegated to the `PeriodicTaskScheduler`.
4b79e8c003/db/db_impl/db_impl.cc (L770-L778)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11824

Reviewed By: cbi42

Differential Revision: D49278311

Pulled By: jowlyzhang

fbshipit-source-id: 5856245580afc026e6b490755a45c5436a2375c9
2023-09-14 11:25:33 -07:00
Yu Zhang 39a4ff2cab Track full_history_ts_low per SuperVersion (#11784)
Summary:
As discussed in https://github.com/facebook/rocksdb/issues/11730 , this PR tracks the effective `full_history_ts_low` per SuperVersion and update existing sanity checks for `ReadOptions.timestamp >= full_history_ts_low` to use this per SuperVersion `full_history_ts_low` instead. This also means the check is moved to happen after acquiring SuperVersion.

There are two motivations for this: 1) Each time `full_history_ts_low` really come into effect to collapse history, a new SuperVersion is always installed, because it would involve either a Flush or Compaction, both of which change the LSM tree shape. We can take advantage of this to ensure that as long as this sanity check is passed, even if `full_history_ts_low` can be concurrently increased and collapse some history above the requested `ReadOptions.timestamp`, a read request won’t have visibility to that part of history through this SuperVersion that it already acquired.  2) the existing sanity check uses `ColumnFamilyData::GetFullHistoryTsLow` without locking the db mutex, which is the mutex all `IncreaseFullHistoryTsLow` operation is using when mutating this field. So there is a race condition. This also solve the race condition on the read path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11784

Test Plan:
`make all check`

// Checks success scenario really provide the read consistency attribute as mentioned above.
`./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckPassReadIsConsistent*`

// Checks failure scenario cleans up SuperVersion properly.
`./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*`
`./db_secondary_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*`
`./db_readonly_with_timestamp_test --gtest_filter=*FullHistoryTsLowSanitchCheckFail*`

Reviewed By: ltamasi

Differential Revision: D48894795

Pulled By: jowlyzhang

fbshipit-source-id: 1f801fe8e1bc8e63ca76c03cbdbd0974e5ff5bf6
2023-09-13 16:34:18 -07:00
Hui Xiao 05daa12332 Change compaction_readahead_size default value to 2MB (#11762)
Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, we rely on `compaction_readahead_size` for how much to read ahead for compaction read under non-direct IO case. https://github.com/facebook/rocksdb/pull/11658 therefore also sanitized 0 `compaction_readahead_size` to 2MB under non-direct IO, which is consistent with the existing sanitization with direct IO.

However, this makes disabling compaction readahead impossible as well as add one more scenario to the inconsistent effects between `Options.compaction_readahead_size=0` during DB open and `SetDBOptions("compaction_readahead_size", "0")` .
- `SetDBOptions("compaction_readahead_size", "0")` will disable compaction readahead as its logic never goes through sanitization above while `Options.compaction_readahead_size=0` will go through sanitization.

Therefore we decided to do this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11762

Test Plan: Modified existing UTs to cover this PR

Reviewed By: ajkr

Differential Revision: D48759560

Pulled By: hx235

fbshipit-source-id: b3f85e58bda362a6fa1dc26bd8a87aa0e171af79
2023-08-30 14:57:08 -07:00
Yu Zhang fc58c7c62a Add UDT support in SstFileDumper (#11757)
Summary:
For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11757

Test Plan:
make all check
for changes in db_impl.cc and db_impl_compaction_flush.cc

for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually.

Reviewed By: ltamasi

Differential Revision: D48668048

Pulled By: jowlyzhang

fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8
2023-08-30 13:42:04 -07:00
Yu Zhang 4234a6a301 Increase full_history_ts_low when flush happens during recovery (#11774)
Summary:
This PR adds a missing piece for the UDT in memtable only feature, which is to automatically increase `full_history_ts_low` when flush happens during recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11774

Test Plan:
Added unit test
make all check

Reviewed By: ltamasi

Differential Revision: D48799109

Pulled By: jowlyzhang

fbshipit-source-id: fd681ed66d9d40904ca2c919b2618eb692686035
2023-08-30 09:34:31 -07:00
chuhao zeng 1303573589 Reverse sort order in dedup to enable iter checking in callback (#11725)
Summary:
Fix https://github.com/facebook/rocksdb/issues/6470

Ensure TransactionLogIter being initialized correctly with SYNC_POINT API when calling `GetSortedWALFiles`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11725

Reviewed By: akankshamahajan15

Differential Revision: D48529411

Pulled By: ajkr

fbshipit-source-id: 970ca1a6259ed996c6d87f7fcd40f95acf441517
2023-08-22 11:22:35 -07:00
Jay Huh 0fa0c97d3e Timeout in microsecond option in WaitForCompactOptions (#11711)
Summary:
While it's rare, we may run into a scenario where `WaitForCompact()` waits for background jobs indefinitely. For example, not enough space error will add the job back to the queue while WaitForCompact() waits for _all jobs_ including the jobs that are in the queue to be completed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11711

Test Plan:
`DBCompactionWaitForCompactTest::WaitForCompactToTimeout` added
`timeout` option added to the variables for all of the existing DBCompactionWaitForCompactTests

Reviewed By: pdillinger, jowlyzhang

Differential Revision: D48416390

Pulled By: jaykorean

fbshipit-source-id: 7b6a12f705ab6c6dfaf8ad736a484ca654a86106
2023-08-18 11:21:45 -07:00
Changyu Bi d1ff401472 Delay bottommost level single file compactions (#11701)
Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions.

* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701

Test Plan:
* Add two unit tests to test when bottommost_file_compaction_delay > 0.
* Ran crash test with the new option.

Reviewed By: jaykorean, ajkr

Differential Revision: D48331564

Pulled By: cbi42

fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
2023-08-16 17:45:44 -07:00
Jay Huh 793a786fa3 Fix for unchecked status in CancelAllBackgroundWork (#11699)
Summary:
## Summary
PR https://github.com/facebook/rocksdb/issues/11497 introduced this. Status from `CancelPeriodicTaskScheduler()` is unchecked and causing test failure like https://app.circleci.com/pipelines/github/facebook/rocksdb/30743/workflows/24443a9b-6fc3-41e6-86c1-992d766eb1ec/jobs/642419

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11699

Test Plan: Existing tests

Reviewed By: cbi42

Differential Revision: D48287188

Pulled By: jaykorean

fbshipit-source-id: b6bcf6e3c3c47f126c34c24a3dfed2649635cc8c
2023-08-11 19:59:56 -07:00
Jay Huh 52816ff64d Close DB option in WaitForCompact() (#11497)
Summary:
Context:

As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `close_db` option in `WaitForCompactOptions` to close DB after waiting for compactions to finish. Must be set to true to close the DB upon compactions finishing.
1. `bool close_db = false` added to `WaitForCompactOptions`
2. Introduced `CancelPeriodicTaskSchedulers()` and moved unregistering PeriodicTaskSchedulers to it.`CancelAllBackgroundWork()` calls it now.
3. When close_db option is on, unpersisted data (data in memtable when WAL is disabled) will be flushed in `WaitForCompact()` if flush option is not on (and `mutable_db_options_.avoid_flush_during_shutdown` is not true). The unpersisted data flush in `CancelAllBackgroundWork()` will be skipped because `shutting_down_` flag will be set true before calling `Close()`.
4. Atomic boolean `reject_new_background_jobs_` is introduced to prevent new background jobs from being added during the short period of time after waiting is done and before `shutting_down_` is set by `Close()`.
5. `WaitForCompact()` now waits for recovery in progress to complete as well. (flush operations from WAL -> L0 files)
6. Added `close_db_` cases to all existing `WaitForCompactTests`
7. Added a scenario to `DBBasicTest::DBClose`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11497

Test Plan:
- Existing DBCompactionTests
- `WaitForCompactWithOptionToFlushAndCloseDB` added
- Added a scenario to `DBBasicTest::DBClose`

Reviewed By: pdillinger, jowlyzhang

Differential Revision: D46337560

Pulled By: jaykorean

fbshipit-source-id: 0f8c7ee09394847f2af5ea4bdd331b47bcdef0b0
2023-08-11 12:30:48 -07:00
Yu Zhang 7cdbce4564 Add UDT support in API DB::GetApproximateMemTableStats (#11689)
Summary:
This API should consider the case when user-defined timestamp is enabled. Also added some documentation to some related API to clarify the usage in the case when user-defined timestamp is enabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11689

Test Plan:
Unit test added
```
make check
./db_with_timestamp_basic_test --gtest_filter=*GetApproximateSizes*
```

Reviewed By: ltamasi

Differential Revision: D48208568

Pulled By: jowlyzhang

fbshipit-source-id: c5baa4a2923441f8ea3a3672c98223a43a3428dc
2023-08-11 11:26:38 -07:00
Changyu Bi 76ed9a3990 Add missing status check when compiling with ASSERT_STATUS_CHECKED=1 (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by 9c2ebcc2c3/Makefile (L243)
Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks.

Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686

Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check`

Reviewed By: hx235

Differential Revision: D48176132

Pulled By: cbi42

fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd
2023-08-09 15:46:44 -07:00
Hui Xiao 9a034801ce Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
2023-08-08 17:26:50 -07:00
Xinye Tao d2b0652b32 compute compaction score once for a batch of range file deletes (#10744)
Summary:
Only re-calculate compaction score once for a batch of deletions. Fix performance regression brought by https://github.com/facebook/rocksdb/pull/8434.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10744

Test Plan:
In one of our production cluster that recently upgraded to RocksDB 6.29, it takes more than 10 minutes to delete files in 30,000 ranges. The RocksDB instance contains approximately 80,000 files. After this patch, the duration reduces to 100+ ms, which is on par with RocksDB 6.4.

Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/316

Signed-off-by: tabokie <xy.tao@outlook.com>

Reviewed By: cbi42

Differential Revision: D48002581

Pulled By: ajkr

fbshipit-source-id: 7245607ee3ad79c53b648a6396c9159f166b9437
2023-08-07 12:29:31 -07:00
Andrew Kryczka 4500a0d6ec Avoid an std::map copy in persistent stats (#11681)
Summary:
An internal user reported this copy showing up in a CPU profile. We can use move instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11681

Differential Revision: D48103170

Pulled By: ajkr

fbshipit-source-id: 083d6470181a0041bb5275b657aa61bee23a3729
2023-08-06 18:01:08 -07:00
Hui Xiao 09882a52d6 Prepare for deprecation of Options::access_hint_on_compaction_start (#11658)
Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, file hint is not longer needed for compaction read. Therefore we can deprecate `Options::access_hint_on_compaction_start`. As this is a public API change, we should first mark the relevant APIs (including the Java's) deprecated and remove it in next major release 9.0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11658

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D47997856

Pulled By: hx235

fbshipit-source-id: 16e015ae7728c224b1caef73143aa9915668f4ac
2023-08-03 17:23:02 -07:00
Peter Dillinger 7a1b0207e6 format_version=6 and context-aware block checksums (#9058)
Summary:
## Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without `block_align=true` the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.

To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example
* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
* files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.

## Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.

## More detail: making space in footer
In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:
* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)

## Performance
In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).

Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).

```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```

Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness

Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058

Test Plan:
unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.

Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably *usually* true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.

The format compatibility test does not need updating because we have not updated the default
format_version.

Reviewed By: ajkr, mrambacher

Differential Revision: D33100915

Pulled By: pdillinger

fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
2023-07-30 16:40:01 -07:00
Changyu Bi 6a0f637633 Compare the number of input keys and processed keys for compactions (#11571)
Summary:
... to improve data integrity validation during compaction.

A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed.

For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code.

For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571

Test Plan:
* Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case.
* All other unit tests for non-corrupted case.
* Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple`

Reviewed By: ajkr

Differential Revision: D47131965

Pulled By: cbi42

fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92
2023-07-28 09:47:31 -07:00
Yu Zhang c24ef26ca7 Support switching on / off UDT together with in-Memtable-only feature (#11623)
Summary:
Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.

To do this, this PR includes:
1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry.

2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid:
     a) no user comparator change and no effective `persist_user_defined_timestamp` flag change.
     b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag
         is false.
     c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is
         already false.
3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp.

4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623

Test Plan:
```
make all check
./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*"
./db_wal_test --gtest_filter="*EnableDisableUDT*"
```

Reviewed By: ltamasi

Differential Revision: D47636862

Pulled By: jowlyzhang

fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322
2023-07-26 20:16:32 -07:00
Yu Zhang 4ea7b796b7 Respect cutoff timestamp during flush (#11599)
Summary:
Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval.

This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`.

For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`.

Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it.

Tet plan:
```
./column_family_test --gtest_filter="*RetainUDT*"
./memtable_list_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599

Reviewed By: ajkr

Differential Revision: D47561586

Pulled By: jowlyzhang

fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89
2023-07-26 16:25:06 -07:00
shuzz 2f712235ab optimized code (#11614)
Summary:
improvement code by std::move and c++17

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11614

Reviewed By: ajkr

Differential Revision: D47599519

Pulled By: jowlyzhang

fbshipit-source-id: 6b897876f4e87e94a74c53d8db2a01303d500bff
2023-07-19 12:52:39 -07:00
Changyu Bi 662a1c99f6 Verify number of keys flushed during DB open (#11611)
Summary:
Extend the coverage for option `flush_verify_memtable_count`. The verification code is similar to the ones for regular flush: c3c84b3397/db/flush_job.cc (L956-L965)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11611

Test Plan: existing tests.

Reviewed By: ajkr

Differential Revision: D47478893

Pulled By: cbi42

fbshipit-source-id: ca580c9dbcd6e91facf2e49210661336a79a248e
2023-07-18 10:39:11 -07:00
Changyu Bi ca50ccc71a Add CreateColumnFamilyWithImport to StackableDB and DBImplReadOnly (#11556)
Summary:
https://github.com/facebook/rocksdb/issues/11378 added a new overloaded `CreateColumnFamilyWithImport` API and updated the virtual function in `StackableDB` and `DBImplReadOnly` to the newly overloaded one. This caused internal error when there is a derived class that tries to override the original `CreateColumnFamilyWithImport` function. This PR adds the original `CreateColumnFamilyWithImport` function back to `StackableDB` and `DBImplReadOnly`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11556

Test Plan: check if this fixes an internal build

Reviewed By: akankshamahajan15

Differential Revision: D46980506

Pulled By: cbi42

fbshipit-source-id: 975a6c5748bf9481499a62ee5997ca59e542e3bc
2023-06-23 13:56:26 -07:00
Yu Zhang 7521478b43 Record the persist_user_defined_timestamps flag in manifest (#11515)
Summary:
Start to record the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the Manifest and table properties for a SST file when it is created. And use the recorded flag when creating a table reader for the SST file. This flag's default value is true, it is only explicitly recorded if it's false.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515

Test Plan:
```
make all check
./version_edit_test
```

Reviewed By: ltamasi

Differential Revision: D46920386

Pulled By: jowlyzhang

fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26
2023-06-21 21:49:01 -07:00
mayue.fight fa878a0107 Support to create a CF by importing multiple non-overlapping CFs (#11378)
Summary:
The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317).
Flink uses rocksdb as the state backend,  all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances.
In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one.

This PR is mainly used to quickly merge multiple DBs into one.

We hope to extend the function of `CreateColumnFamilyWithImports` to support creating ColumnFamily by importing multiple ColumnFamily with no overlapping keys.

The import logic is almost the same as `CreateColumnFamilyWithImport`, but it will check whether there is key overlap between CF when importing. The import will fail if there are key overlaps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378

Reviewed By: ajkr

Differential Revision: D46413709

Pulled By: cbi42

fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb
2023-06-15 12:25:04 -07:00
Changyu Bi 15e8a843d9 Do not include last level in compaction when allow_ingest_behind=true (#11489)
Summary:
when a DB is configured with `allow_ingest_behind = true`, the last level should be reserved for ingested files and these files should not be included in any compaction. Currently, a major compaction can compact these files to smaller levels. This can cause future files to be rejected for ingest behind (see `ExternalSstFileIngestionJob::CheckLevelForIngestedBehindFile()`). This PR fixes the issue such that files in the last level is not included in any compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11489

Test Plan: * Updated unit test `ExternalSSTFileTest.IngestBehind` to test that last level is not included in manual and auto-compaction.

Reviewed By: ajkr

Differential Revision: D46455711

Pulled By: cbi42

fbshipit-source-id: 5e2142c2a709ef932ad797897795021c06c4ac8c
2023-06-14 11:28:56 -07:00
Andrew Kryczka cac3240cbf add property "rocksdb.obsolete-sst-files-size" (#11533)
Summary:
See "unreleased_history/new_features/obsolete_sst_files_size.md" for description

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11533

Test Plan: updated unit test

Reviewed By: jowlyzhang

Differential Revision: D46703152

Pulled By: ajkr

fbshipit-source-id: ea5e31cd6293eccc154130c13e66b5271f57c102
2023-06-13 15:52:45 -07:00
Ignat Loskutov 05fcacdb42 Add missing stopwatch and perf timer to DBImplReadOnly (#11521)
Summary:
The `rocksdb.db.get.micros` histogram is never updated if the DB is open in ReadOnly mode, as well as the `get_cpu_nanos` perf counter. An earlier PR (https://github.com/facebook/rocksdb/issues/4260) for some reason has only added the TODO line, not the accounting itself, so this one is intended to fix it, adding two lines to match [DBImplSecondary](4dafa5b220/db/db_impl/db_impl_secondary.cc (L366-L367)).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11521

Reviewed By: cbi42

Differential Revision: D46577330

Pulled By: jowlyzhang

fbshipit-source-id: be147923e763af32bbc18fd6bdf3aff8ebf08aee
2023-06-08 17:40:56 -07:00