Commit Graph

5680 Commits

Author SHA1 Message Date
anand76 d9c0d44dab Add a perf level for measuring user thread block time (#12368)
Summary:
Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration, point lookup breakdown etc. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e a RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delay or stalls etc. It does not include time spent waiting to acquire mutexes, or waiting for IO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368

Test Plan: Add a unit test for write_thread_wait_nanos

Reviewed By: ajkr

Differential Revision: D54021583

Pulled By: anand1976

fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48
2024-02-22 12:14:53 -08:00
Yu Zhang f1ca47b904 Add support to bulk load external files for UDT in memtable only feature (#12356)
Summary:
This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for external file's comparator name and user-defined timestamps related flag. With this, it now supports ingesting files to a column family that enables user-defined timestamps in Memtable only feature.

Two fields in the table properties are used for aformentioned check: 1) the comparator name, it records what comparator is used to create this external sst file, 2) the flag `user_defined_timestamps_persisted`.  We compare these two fields with the column family's settings. The details are in util function `ValidateUserDefinedTimestampsOptions`.

To optimize for the majority of the cases where sanity check should pass and the table properties read should not affect how `TableReader` is constructed, instead of read the table properties block separately and use it for sanity check before creating a `TableReader`. We continue using the current flow to first create a `TableReader`, use it for reading table properties and do sanity checks, and reset the`TableReader` for the case where the column family enables UDTs in memtable only feature, and the external file does not contain user-defined timestamps.

This PR also groups other table properties related sanity check in function `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356

Test Plan:
added unit test
existing unit test

Reviewed By: cbi42

Differential Revision: D54025116

Pulled By: jowlyzhang

fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554
2024-02-21 15:41:53 -08:00
Andrew Kryczka 8e29f243c9 No filesystem reads during `Merge()` writes (#12365)
Summary:
This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365

Test Plan:
"rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read.

```
$ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true |& grep 'block.cache.data.add COUNT' ; done
strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0
strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1
```

Reviewed By: hx235

Differential Revision: D53982520

Pulled By: ajkr

fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4
2024-02-21 13:15:27 -08:00
Alex Wied f2732d0586 Export GetSequenceNumber functionality for Snapshots (#12354)
Summary:
This PR adds `Snapshot->GetSequenceNumber()` functionality to the C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12354

Reviewed By: akankshamahajan15

Differential Revision: D53836085

Pulled By: cbi42

fbshipit-source-id: 4a14daeba9210a69bcb74e4c1c0666deff1b4837
2024-02-16 10:28:41 -08:00
anand76 d227276147 Deprecate some variants of Get and MultiGet (#12327)
Summary:
A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327

Reviewed By: pdillinger

Differential Revision: D53828151

Pulled By: anand1976

fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050
2024-02-16 09:21:06 -08:00
Akanksha Mahajan 956f1dfde3 Change ReadAsync callback API to remove const from FSReadRequest (#11649)
Summary:
Modify ReadAsync callback API to remove const from FSReadRequest as const doesn't let to fs_scratch to move the ownership.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11649

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D53585309

Pulled By: akankshamahajan15

fbshipit-source-id: 3bff9035db0e6fbbe34721a5963443355807420d
2024-02-16 09:14:55 -08:00
anand76 28c1c15c29 Sync tickers and histograms across C++ and Java (#12355)
Summary:
The RocksDB ticker and histogram statistics were out of sync between the C++ and Java code, with a number of newer stats missing in TickerType.java and HistogramType.java. Also, there were gaps in numbering in portal.h, which could soon become an issue due to the number of tickers and the fact that we're limited to 1 byte in Java. This PR adds the missing stats, and re-numbers all of them. It also moves some stats around to try to group related stats together. Since this will go into a major release, compatibility shouldn't be an issue.

This should be automated at some point, since the current process is somewhat error prone.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12355

Reviewed By: jaykorean

Differential Revision: D53825324

Pulled By: anand1976

fbshipit-source-id: 298c180872f4b9f1ee54b8bb22f4e280458e7e09
2024-02-15 17:22:03 -08:00
Peter Dillinger 12018136d8 KeySegmentsExtractor and prototype higher-dimensional filtering (#12075)
Summary:
This change contains a prototype new API for "higher dimensional" filtering of read queries. Existing filters treat keys as one-dimensional, either as distinct points (whole key) or as contiguous ranges in comparator order (prefix filters). The proposed KeySegmentsExtractor allows treating keys as multi-dimensional for filtering purposes even though they still have a single total order across dimensions. For example, consider these keys in different LSM levels:

L0:
abc_0123
abc_0150
def_0114
ghi_0134

L1:
abc_0045
bcd_0091
def_0077
xyz_0080

If we get a range query for [def_0100, def_0200), a prefix filter (up to the underscore) will tell us that both levels are potentially relevant. However, if each SST file stores a simple range of the values for the second segment of the key, we would see that L1 only has [0045, 0091] which (under certain required assumptions) we are sure does not overlap with the given range query. Thus, we can filter out processing or reading any index or data blocks from L1 for the query.

This kind of case shows up with time-ordered data but is more general than filtering based on user timestamp. See https://github.com/facebook/rocksdb/issues/11332 . Here the "time" segments of the keys are meaningfully ordered with respect to each other even when the previous segment is different, so summarizing data along an alternate dimension of the key like this can work well for filtering.

This prototype implementation simply leverages existing APIs for user table properties and table filtering, which is not very CPU efficient. Eventually, we expect to create a native implementation. However, I have put some significant
thought and engineering into the new APIs overall, which I expect to be close to refined enough for production.

For details, see new public APIs in experimental.h. For a detailed example, see the new unit test in db_bloom_filter_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12075

Test Plan: Unit test included

Reviewed By: jowlyzhang

Differential Revision: D53619406

Pulled By: pdillinger

fbshipit-source-id: 9e6e7b82b4db8d815db76a6ab340e90db2c191f2
2024-02-15 15:39:55 -08:00
Yu Zhang 4bea83aa44 Remove the force mode for EnableFileDeletions API (#12337)
Summary:
There is no strong reason for user to need this mode while on the other hand, its behavior is destructive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337

Reviewed By: hx235

Differential Revision: D53630393

Pulled By: jowlyzhang

fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78
2024-02-13 18:36:25 -08:00
Yu Zhang 10d02456b6 Add support to bulk load external files with user-defined timestamps (#12343)
Summary:
This PR adds initial support to bulk loading external sst files with user-defined timestamps.

To ensure this invariant is met while ingesting external files:
     assume there are two internal keys: <K, ts1, seq1> and <K, ts2, seq2>, the following should hold:
     ts1 < ts2 iff. seq1 < seq2

These extra requirements are added for ingesting external files with user-defined timestamps:
1) A file with overlapping user key (without timestamp) range with the db cannot be ingested. This is because we cannot ensure above invariant is met without checking each overlapped key's timestamp and compare it with the timestamp from the db. This is an expensive step. This bulk loading feature will be used by MyRocks and currently their usage can guarantee ingested file's key range doesn't overlap with db.
4f3a57a13f/storage/rocksdb/ha_rocksdb.cc (L3312)
We can consider loose this requirement by doing this check in the future, this initial support just disallow this.

2) Files with overlapping user key (without timestamp) range are not allowed to be ingested. For similar reasons, it's hard to ensure above invariant is met. For example, if we have two files where user keys are interleaved like this:
file1: [c10, c8, f10, f5]
file2: [b5, c11, f4]
Either file1 gets a bigger global seqno than file2, or the other way around, above invariant cannot be met.
So we disallow this.

2) When a column family enables user-defined timestamps, it doesn't support ingestion behind mode. Ingestion behind currently simply puts the file at the bottommost level, and assign a global seqno 0 to the file. We need to do similar search though the LSM tree for key range overlap checks to make sure aformentioned invariant is met. So this initial support disallow this mode. We can consider adding it in the future.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12343

Test Plan: Add unit tests

Reviewed By: cbi42

Differential Revision: D53686182

Pulled By: jowlyzhang

fbshipit-source-id: f05e3fb27967f7974ed40179d78634c40ecfb136
2024-02-13 11:15:28 -08:00
Levi Tamasi de1e3ff6ea Fix a data race in DBImpl::RenameTempFileToOptionsFile (#12347)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347

`DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels).

Reviewed By: jowlyzhang

Differential Revision: D53675153

fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481
2024-02-12 13:26:09 -08:00
Peter Dillinger 54cb9c77d9 Prefer static_cast in place of most reinterpret_cast (#12308)
Summary:
The following are risks associated with pointer-to-pointer reinterpret_cast:
* Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do.
* Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally.

I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement:
* Going through `void*` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have
`struct Derived: public Base1, public Base2`.  If we have `Derived*` -> `void*` -> `Base2*` -> `Derived*` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2*` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic.
* Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived*` -> `Base2*` -> `Derived*` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance.

With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain.

A couple of related interventions included here:
* Previously Cache::Handle was not actually derived from in the implementations and just used as a `void*` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle.
* Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse).

Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work.

I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308

Test Plan: existing tests, CI

Reviewed By: ltamasi

Differential Revision: D53204947

Pulled By: pdillinger

fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2
2024-02-07 10:44:11 -08:00
Yu Zhang e3e8fbb497 Add a separate range classes for internal usage (#12071)
Summary:
Introduce some different range classes `UserKeyRange` and `UserKeyRangePtr` to be used by internal implementation. The `Range` class is used in both public APIs like `DB::GetApproximateSizes`, `DB::GetApproximateMemTableStats`, `DB::GetPropertiesOfTablesInRange` etc and internal implementations like `ColumnFamilyData::RangesOverlapWithMemtables`, `VersionSet::GetPropertiesOfTablesInRange`.

These APIs have different expectations of what keys this range class contain.  Public API users are supposed to populate the range with the user keys without timestamp, in the same way that point lookup and range scan APIs' key input only expect the user key without timestamp. The internal APIs implementation expect a user key whose format is compatible with the user comparator, a.k.a a user key with the timestamp.

This PR contains:
1) introducing counterpart range class `UserKeyRange` `UserKeyRangePtr` for internal implementation while leave the existing `Range` and `RangePtr` class only for public APIs. Internal implementations are updated to use this new class instead.
2) add user-defined timestamp support for `DB::GetPropertiesOfTablesInRange` API and `DeleteFilesInRanges` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12071

Test Plan:
existing tests
Added test for `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs for when user-defined timestamp is enabled.
The change in external_file_ingestion_job doesn't have a user-defined timestamp enabled test case coverage, will add one in a follow up PR that adds file ingestion support for UDT.

Reviewed By: ltamasi

Differential Revision: D53292608

Pulled By: jowlyzhang

fbshipit-source-id: 9a9279e23c640a6d8f8232636501a95aef7638b8
2024-02-06 18:35:36 -08:00
Hui Xiao 1a885fe730 Remove deprecated Options::access_hint_on_compaction_start (#11654)
Summary:
**Context:**
`Options::access_hint_on_compaction_start ` is marked deprecated and now ready to be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11654

Test Plan:
Multiple db_stress runs with pre-PR and post-PR binary randomly to ensure forward/backward compatibility on options 36a5686ec0?fbclid=IwAR2IcdAUdTvw9O9V5GkHEYJRGMVR9p7Ei-LMa-9qiXlj3z80DxjkxlGnP1E
`python3 tools/db_crashtest.py --simple blackbox --interval=30`

Reviewed By: cbi42

Differential Revision: D47892459

Pulled By: hx235

fbshipit-source-id: a62f46a0377fe143be7638e218978d5431c15c56
2024-02-05 13:35:19 -08:00
Peter Dillinger 6e88126dd3 Don't log an error when an auxiliary dir is missing (#12326)
Summary:
info_log gets an error logged when wal_dir or a db_path/cf_path is missing. Under this condition, the directory is created later (in DBImpl::Recover -> Directories::SetDirectories) with no error status returned.

To avoid error spam in logs, change these to a descriptive "header" log entry.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12326

Test Plan: manual with DBBasicTest.DBCloseAllDirectoryFDs which exercises this code

Reviewed By: jowlyzhang

Differential Revision: D53374743

Pulled By: pdillinger

fbshipit-source-id: 32d1ce18809da13a25bdd6183d661f66a3b6a111
2024-02-05 10:26:41 -08:00
Yu Zhang 4eaa771c01 Refactor external sst file ingestion job (#12305)
Summary:
Updates some documentations and invariant assertions after https://github.com/facebook/rocksdb/issues/12257 and https://github.com/facebook/rocksdb/issues/12284. Also refactored some duplicate code and improved some error message and preconditions for errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12305

Test Plan: Existing unit tests

Reviewed By: hx235

Differential Revision: D53371325

Pulled By: jowlyzhang

fbshipit-source-id: fb0edcb3a3602cdf0a292ef437cfdfe897fc6c99
2024-02-02 18:07:57 -08:00
Changyu Bi 5620efc794 Remove deprecated option `ignore_max_compaction_bytes_for_input` (#12323)
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10835 to allow disabling the new compaction behavior if it's not safe. The option is enabled by default and there has not been a need to disable it. So it should be safe to remove now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12323

Reviewed By: ajkr

Differential Revision: D53330336

Pulled By: cbi42

fbshipit-source-id: 36eef4664ac96b3a7ed627c48bd6610b0a7eafc5
2024-02-02 17:09:42 -08:00
Changyu Bi ace1721b28 Remove deprecated option `level_compaction_dynamic_file_size` (#12325)
Summary:
The option is introduced in https://github.com/facebook/rocksdb/issues/10655 to allow reverting to old behavior. The option is enabled by default and there has not been a need to disable it. Remove it for 9.0 release. Also fixed and improved a few unit tests that depended on setting this option to false.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12325

Test Plan: existing tests.

Reviewed By: hx235

Differential Revision: D53369430

Pulled By: cbi42

fbshipit-source-id: 0ec2440ca8d88db7f7211c581542c7581bd4d3de
2024-02-02 15:37:40 -08:00
Peter Dillinger 1d6dbfb8b7 Rename IntTblPropCollector -> InternalTblPropColl (#12320)
Summary:
I've always found this name difficult to read, because it sounds like it's for collecting int(eger)
table properties.

I'm fixing this now to set up for a change that I have stubbed out in the public API (table_properties.h):
a new adapter function `TablePropertiesCollector::AsInternal()` that allows RocksDB-provided
TablePropertiesCollectors (such as CompactOnDeletionCollector) to implement the easier-to-upgrade
internal interface while still (superficially) implementing the public interface. In addition to added flexibility,
this should be a performance improvement as the adapter class UserKeyTablePropertiesCollector can be
avoided for such cases where a RocksDB-provided collector is used (AsInternal() returns non-nullptr).

table_properties.h is the only file with changes that aren't simple find-replace renaming.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12320

Test Plan: existing tests, CI

Reviewed By: ajkr

Differential Revision: D53336945

Pulled By: pdillinger

fbshipit-source-id: 02535bcb30bbfb00e29e8478af62e5dad50a63b8
2024-02-02 14:14:43 -08:00
anand76 95b41eec6d Fix potential incorrect result for duplicate key in MultiGet (#12295)
Summary:
The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met -
1. Duplicate keys in a MultiGet batch
2. Key matches the last key in a non-zero, non-bottommost level file
3. Final value is not in the file (merge operand, not snapshot visible etc)
4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block
5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates
6. Value or merge operand for the key is present in the very next level

The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under the  conditions https://github.com/facebook/rocksdb/issues/1 and https://github.com/facebook/rocksdb/issues/2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295

Test Plan: Add unit tests that fail an assertion or return wrong result without the fix

Reviewed By: hx235

Differential Revision: D53187634

Pulled By: anand1976

fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea
2024-02-02 11:48:35 -08:00
Andrew Kryczka f9d45358ca Removed `check_flush_compaction_key_order` (#12311)
Summary:
`check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311

Reviewed By: cbi42

Differential Revision: D53233379

Pulled By: ajkr

fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637
2024-01-31 16:30:26 -08:00
Peter Dillinger 76c834e441 Remove 'virtual' when implied by 'override' (#12319)
Summary:
... to follow modern C++ style / idioms.

Used this hack:
```
for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319

Test Plan: existing tests, CI

Reviewed By: jaykorean

Differential Revision: D53275303

Pulled By: pdillinger

fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377
2024-01-31 13:14:42 -08:00
Yu Zhang d11584e42e Be consistent in key range overlap check (#12315)
Summary:
We should be consistent in how we check key range overlap in memtables and in sst files. While all the sst file key range overlap check compares the user key without timestamp, for example:
377eee77f8/db/version_set.cc (L129-L130)

This key range overlap check for memtable is comparing the whole user key. Currently it happen to achieve the same effect because this function is only called by `ExternalSstFileIngestionJob` and `DBImpl::CompactRange`, which takes a user key without timestamp as the range end, pad a max or min timestamp to it depending on whether the end is exclusive. So use `Compartor::Compare` here is working too, but we should update it to `Comparator::CompareWithoutTimestamp` to be consistent with all the other file key range overlapping check functions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12315

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D53273456

Pulled By: jowlyzhang

fbshipit-source-id: c094ae1f0c195d52542124c4fb03fdca14241e85
2024-01-31 11:12:52 -08:00
Peter Dillinger 2b4245559c Don't warn on (recursive) disable file deletion (#12310)
Summary:
To stop spamming our warning logs with normal behavior.

Also fix comment on `DisableFileDeletions()`.

In response to https://github.com/facebook/rocksdb/issues/12001 I've indicated my objection to granting legitimacy to force=true, but I'm not addressing that here and now. In short, the user shouldn't be asked to think about whether they want to use the *wrong* behavior. ;)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12310

Test Plan: existing tests

Reviewed By: jowlyzhang

Differential Revision: D53233117

Pulled By: pdillinger

fbshipit-source-id: 5d2aedb76b02b30f8a5fa5b436fc57fde5d40d6e
2024-01-30 11:58:31 -08:00
Andrew Kryczka aacf60dda2 Speedup based on number of files marked for compaction (#12306)
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. This PR adds pressure detection based on the number of files marked for compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12306

Reviewed By: cbi42

Differential Revision: D53200559

Pulled By: ajkr

fbshipit-source-id: 63402ee336881a4539204d255960f04338ab7a0e
2024-01-29 17:29:04 -08:00
Peter Dillinger 61ed0de600 Add more detail to some statuses (#12307)
Summary:
and also fix comment/label on some MacOS CI jobs. Motivated by a crash test failure missing a definitive indicator of the genesis of the status:

```
file ingestion error: Operation failed. Try again.:
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12307

Test Plan: just cosmetic changes. These statuses should not arise frequently enough to be a performance issue (copying messages).

Reviewed By: jaykorean

Differential Revision: D53199529

Pulled By: pdillinger

fbshipit-source-id: ad83daaa5d80f75c9f81158e90fb6d9ecca33fe3
2024-01-29 16:31:09 -08:00
Yu Zhang 17042a3fb7 Remove misspelled tickers used in error handler (#12302)
Summary:
As titled, the replacement tickers have been introduced in https://github.com/facebook/rocksdb/issues/11509  and in use since release 8.4. This PR completely removes the misspelled ones.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12302

Test Plan: CI tests

Reviewed By: jaykorean

Differential Revision: D53196935

Pulled By: jowlyzhang

fbshipit-source-id: 9c9d0d321247690db5edfdc52b4fecb2f1218979
2024-01-29 15:28:37 -08:00
Chdy fc48af33f5 fix some perf statistic in write (#12285)
Summary:
### Summary:  perf context lack statistics in some write steps
```
rocksdb::get_perf_context()->write_wal_time);
rocksdb::get_perf_context()->write_memtable_time);
rocksdb::get_perf_context()->write_pre_and_post_process_time);
```

#### case 1:
when the unordered_write is true, the `write_memtable_time` is 0
```
write_wal_time : 13.7012
write_memtable_time : 0
write_pre_and_post_process_time : 142.037
```

Reason: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` during insert memtable,

```c++
Status DBImpl::UnorderedWriteMemtable(const WriteOptions& write_options,
                                      WriteBatch* my_batch,
                                      WriteCallback* callback, uint64_t log_ref,
                                      SequenceNumber seq,
                                      const size_t sub_batch_cnt) {
	...
  if (w.CheckCallback(this) && w.ShouldWriteToMemtable()) {

    // need calculate write_memtable_time
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        seq_per_batch_, sub_batch_cnt, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);
    if (write_options.disableWAL) {
      has_unpersisted_data_.store(true, std::memory_order_relaxed);
    }
  }
	...
}
```
Fix: add perf function
```
write_wal_time : 14.3991
write_memtable_time : 19.3367
write_pre_and_post_process_time : 130.441
```

#### case 2:
when the enable_pipelined_write is true, the `write_memtable_time` is small
```
write_wal_time : 11.2986
write_memtable_time : 1.0205
write_pre_and_post_process_time : 140.131
```

Fix: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` when `w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER`
```c++
Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options,
                                  WriteBatch* my_batch, WriteCallback* callback,
                                  uint64_t* log_used, uint64_t log_ref,
                                  bool disable_memtable, uint64_t* seq_used) {
  ...
  if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) {
    // need calculate write_memtable_time

    assert(w.ShouldWriteToMemtable());
    ColumnFamilyMemTablesImpl column_family_memtables(
        versions_->GetColumnFamilySet());
    w.status = WriteBatchInternal::InsertInto(
        &w, w.sequence, &column_family_memtables, &flush_scheduler_,
        &trim_history_scheduler_, write_options.ignore_missing_column_families,
        0 /*log_number*/, this, true /*concurrent_memtable_writes*/,
        false /*seq_per_batch*/, 0 /*batch_cnt*/, true /*batch_per_txn*/,
        write_options.memtable_insert_hint_per_batch);

    if (write_thread_.CompleteParallelMemTableWriter(&w)) {
      MemTableInsertStatusCheck(w.status);
      versions_->SetLastSequence(w.write_group->last_sequence);
      write_thread_.ExitAsMemTableWriter(&w, *w.write_group);
    }
  }
  if (seq_used != nullptr) {
    *seq_used = w.sequence;
  }

  assert(w.state == WriteThread::STATE_COMPLETED);
  return w.FinalStatus();
}
```
FIx: add perf function
```
write_wal_time : 10.5201
write_memtable_time : 17.1048
write_pre_and_post_process_time : 114.313
```

#### case3:
`DBImpl::WriteImplWALOnly` function has no statistical `write_delay_time`
```c++
Status DBImpl::WriteImplWALOnly(
    WriteThread* write_thread, const WriteOptions& write_options,
    WriteBatch* my_batch, WriteCallback* callback, uint64_t* log_used,
    const uint64_t log_ref, uint64_t* seq_used, const size_t sub_batch_cnt,
    PreReleaseCallback* pre_release_callback, const AssignOrder assign_order,
    const PublishLastSeq publish_last_seq, const bool disable_memtable) {
 ...
  if (publish_last_seq == kDoPublishLastSeq) {

  } else {
    // need calculate write_delay_time
    InstrumentedMutexLock lock(&mutex_);
    Status status =
        DelayWrite(/*num_bytes=*/0ull, *write_thread, write_options);
    if (!status.ok()) {
      WriteThread::WriteGroup write_group;
      write_thread->EnterAsBatchGroupLeader(&w, &write_group);
      write_thread->ExitAsBatchGroupLeader(write_group, status);
      return status;
    }
  }
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12285

Reviewed By: ajkr

Differential Revision: D53191765

Pulled By: cbi42

fbshipit-source-id: f78d5b280bea6a777f077c89c3e0b8fe98d3c860
2024-01-29 12:31:11 -08:00
Yu Zhang 071a146fa0 Add support for range deletion when user timestamps are not persisted (#12254)
Summary:
For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes:

1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from an end key of a range tombstone. There are places where it's logically removed (replaced with a min timestamp) because there is still logic with the running comparator that expects a user key that contains timestamp. And in the block based builder, it is eventually physically removed before persisted in a block.

2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`.

3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end. d69628e6ce/db/dbformat.h (L923-L935)
This max timestamp is removed when in memory `FileMetaData.largest` is persisted into Manifest, we pad it back when it's read from Manifest while handling related `VersionEdit` in `VersionEditHandler`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12254

Test Plan: Added unit test and enabled this feature combination's stress test.

Reviewed By: cbi42

Differential Revision: D52965527

Pulled By: jowlyzhang

fbshipit-source-id: e8315f8a2c5268e2ae0f7aec8012c266b86df985
2024-01-29 11:37:34 -08:00
Peter Dillinger 4e60663b31 Remove unnecessary, confusing 'extern' (#12300)
Summary:
In C++, `extern` is redundant in a number of cases:
* "Global" function declarations and definitions
* "Global" variable definitions when already declared `extern`

For consistency and simplicity, I've removed these in code that *we own*. In a couple of cases, I removed obsolete declarations, and for MagicNumber constants, I have consolidated the declarations into a header file (format.h)
as standard best practice would prescribe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12300

Test Plan: no functional changes, CI

Reviewed By: ajkr

Differential Revision: D53148629

Pulled By: pdillinger

fbshipit-source-id: fb8d927959892e03af09b0c0d542b0a3b38fd886
2024-01-29 10:38:08 -08:00
Changyu Bi 2233a2f4c0 Enhance corruption status message for record mismatch in compaction (#12297)
Summary:
... to include the actual numbers of processed and expected records, and the file number for input files. The purpose is to be able to find the offending files even when the relevant LOG file is gone.

Another change is to check the record count even when `compaction_verify_record_count` is false, and log a warning message without setting corruption status if there is a mismatch. This is consistent with how we check the record count for flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12297

Test Plan:
print the status message in `DBCompactionTest.VerifyRecordCount`
```
before
Corruption: Compaction number of input keys does not match number of keys processed.
after
Compaction number of input keys does not match number of keys processed. Expected 20 but processed 10. Compaction summary: Base version 4 Base level 0, inputs: [11(2156B) 9(2156B)]
```

Reviewed By: ajkr

Differential Revision: D53110130

Pulled By: cbi42

fbshipit-source-id: 6325cbfb8f71f25ce37f23f8277ebe9264863c3b
2024-01-26 09:12:07 -08:00
Peter Dillinger f046a8f617 Deflake ColumnFamilyTest.WriteStallSingleColumnFamily (#12294)
Summary:
https://github.com/facebook/rocksdb/issues/12267 apparently introduced a data race in test code where a background read of estimated_compaction_needed_bytes while holding the DB mutex could race with forground write for testing purposes. This change adds the DB mutex to those writes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12294

Test Plan: 1000 TSAN runs of test (massively fails before change, passes after)

Reviewed By: ajkr

Differential Revision: D53095483

Pulled By: pdillinger

fbshipit-source-id: 13fcb383ebad313dabe39eb8f9085c34d370b54a
2024-01-25 14:40:18 -08:00
zaidoon c3bff1c02d Allow setting Stderr Logger via C API (#12262)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12262

Reviewed By: pdillinger

Differential Revision: D53027616

Pulled By: ajkr

fbshipit-source-id: 2e88e53e0c02447c613439f5528161ea1340b323
2024-01-25 12:36:40 -08:00
Hui Xiao 96fb7de3bc Rate-limit un-ratelimited flush/compaction code paths (#12290)
Summary:
**Context/Summary:**

We recently found out some code paths in flush and compaction aren't rate-limited when they should.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12290

Test Plan: existing UT**

Reviewed By: anand1976

Differential Revision: D53066103

Pulled By: hx235

fbshipit-source-id: 9dc4cab5f841230d18e5504dc480ac523e9d3950
2024-01-25 12:00:15 -08:00
Peter Dillinger d895eb08b3 Fix UB/crash in new SeqnoToTimeMapping::CopyFromSeqnoRange (#12293)
Summary:
After https://github.com/facebook/rocksdb/issues/12253 this function has crashed in the crash test, in its call to `std::copy`. I haven't reproduced the crash directly, but `std::copy` probably has undefined behavior if the starting iterator is after the ending iterator, which was possible. I've fixed the logic to deal with that case and to add an assertion to check that precondition of `std::copy` (which appears can be unchecked by `std::copy` itself even with UBSAN+ASAN).

Also added some unit tests etc. that were unfinished for https://github.com/facebook/rocksdb/issues/12253, and slightly tweak SeqnoToTimeMapping::EnforceMaxTimeSpan handling of zero time span case.

This is intended for patching 8.11.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12293

Test Plan: tests added. Will trigger ~20 runs of the crash test job that saw the crash. https://fburl.com/ci/5iiizvfa

Reviewed By: jowlyzhang

Differential Revision: D53090422

Pulled By: pdillinger

fbshipit-source-id: 69d60b1847d9c7e4ae62b153011c2040405db461
2024-01-25 11:27:15 -08:00
Changyu Bi 3812a77771 Deflake `DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit` (#12289)
Summary:
The test has been failing with
```
[ RUN      ] DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit
db/db_compaction_test.cc:9661: Failure
Expected equality of these values:
  0u
    Which is: 0
  env_->GetThreadPoolQueueLen(Env::Priority::LOW)
    Which is: 1
```
This can happen when thread pool queue len is checked before `test::SleepingBackgroundTask::DoSleepTask` is scheduled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12289

Reviewed By: ajkr

Differential Revision: D53064300

Pulled By: cbi42

fbshipit-source-id: 9ed1b714243880f82bd1cc1584b402ac9cf57507
2024-01-25 10:37:11 -08:00
Yu Zhang 928aca835f Skip searching through lsm tree for a target level when files overlap (#12284)
Summary:
While ingesting multiple external files with key range overlap, current flow go through the lsm tree to do a search for a target level and later discard that result by defaulting back to L0. This PR improves this by just skip the search altogether.

The other change is to remove default to L0 for the combination of universal compaction + force global sequence number, which was initially added to meet a pre https://github.com/facebook/rocksdb/issues/7421  invariant.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12284

Test Plan:
Added unit test:
./external_sst_file_test --gtest_filter="*IngestFileWithGlobalSeqnoAssignedUniversal*"

Reviewed By: ajkr

Differential Revision: D53072238

Pulled By: jowlyzhang

fbshipit-source-id: 30943e2e284a7f23b495c0ea4c80cb166a34a8ac
2024-01-24 23:30:08 -08:00
Hui Xiao 1b2b16b38e Fix bug of newer ingested data assigned with an older seqno (#12257)
Summary:
**Context:**
We found an edge case where newer ingested data is assigned with an older seqno. This causes older data of that key to be returned for read.

Consider the following lsm shape:
![image](https://github.com/facebook/rocksdb/assets/83968999/973fd160-5065-49cd-8b7b-b6ab4badae23)
Then ingest a file to L5 containing new data of key_overlap. Because of [this](https://l.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fblob%2F5a26f392ca640818da0b8590be6119699e852b07%2Fdb%2Fexternal_sst_file_ingestion_job.cc%3Ffbclid%3DIwAR10clXxpUSrt6sYg12sUMeHfShS7XigFrsJHvZoUDroQpbj_Sb3dG_JZFc%23L951-L956&h=AT0m56P7O0ZML7jk1sdjgnZZyGPMXg9HkKvBEb8mE9ZM3fpJjPrArAMsaHWZQPt9Ki-Pn7lv7x-RT9NEd_202Y6D2juIVHOIt3EjCZptDKBLRBMG49F8iBUSM9ypiKe8XCfM-FNW2Hl4KbVq2e3nZRbMvUM), the file is assigned with seqno 2, older than the old data's seqno 4. After just another compaction, we will drop the new_v for key_overlap because of the seqno and cause older data to be returned.
![image](https://github.com/facebook/rocksdb/assets/83968999/a3ef95e4-e7ae-4c30-8d03-955cd4b5ed42)

**Summary:**
This PR removes the incorrect seqno assignment

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12257

Test Plan:
- New unit test failed before the fix but passes after
- python3 tools/db_crashtest.py --compaction_style=1 --ingest_external_file_one_in=10 --preclude_last_level_data_seconds=36000 --compact_files_one_in=10 --enable_blob_files=0 blackbox`
- Rehearsal stress test

Reviewed By: cbi42

Differential Revision: D52926092

Pulled By: hx235

fbshipit-source-id: 9e4dade0f6cc44e548db8fca27ccbc81a621cd6f
2024-01-24 11:21:05 -08:00
Peter Dillinger b31f3245f1 Fix flaky test shutdown race in seqno_time_test (#12282)
Summary:
Seen in build-macos-cmake:

```
Received signal 11 (Segmentation fault: 11)
	https://github.com/facebook/rocksdb/issues/1   rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0::operator()(void*) const (in seqno_time_test) (mock_time_env.cc:29)
	https://github.com/facebook/rocksdb/issues/2   decltype(std::declval<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&>()(std::declval<void*>())) std::__1::__invoke[abi:v15006]<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&, void*>(rocksdb::MockSystemClock::InstallTimedWait	ixCallback()::$_0&, void*&&) (in seqno_time_test) (invoke.h:394)
...
```

This is presumably because the std::function from the lambda only saves a copy of the SeqnoTimeTest* this pointer, which doesn't prevent it from being reclaimed on parallel shutdown. If we instead save a copy of the `std::shared_ptr<MockSystemClock>` in the std::function, this should prevent the crash. (Note that in `SyncPoint::Data::Process()` copies the std::function before releasing the mutex for calling the callback.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12282

Test Plan: watch CI

Reviewed By: cbi42

Differential Revision: D53027136

Pulled By: pdillinger

fbshipit-source-id: 26cd9c0352541d806d42bb061dd349d3b47171a5
2024-01-24 10:14:22 -08:00
Richard Barnes 3079a7e7c2 Remove extra semi colon from internal_repo_rocksdb/repo/db/internal_stats.h (#12278)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12278

`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Reviewed By: jaykorean

Differential Revision: D52969116

fbshipit-source-id: 8cb28dafdbede54e8cb59c2b8d461b1eddb3de68
2024-01-24 07:22:10 -08:00
Changyu Bi 3ef9092487 Print additional information when flaky test DBTestWithParam.ThreadStatusSingleCompaction fails (#12268)
Summary:
The test is [flaky](https://github.com/facebook/rocksdb/actions/runs/7616272304/job/20742657041?pr=12257&fbclid=IwAR1vNI1rSRVKnOsXs0WCPklqTkBXxlwS1GMJgWWe7D8dtAvh6e6wxk067FY) but I could not reproduce the test failure. Add some debug print to make the next failure more helpful

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12268

Test Plan:
```
check print works when test fails:
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 6134067200, thread status:
thread id: 6133493760, thread status: Compaction
db/db_test.cc:4680: Failure
Expected equality of these values:
  op_count
    Which is: 1
  expected_count
    Which is: 0
```

Reviewed By: hx235

Differential Revision: D52987503

Pulled By: cbi42

fbshipit-source-id: 33b369796f9b97155578b45167e722ddcde93594
2024-01-23 10:07:06 -08:00
Andrew Kryczka 7fe93162c5 Log pending compaction bytes in a couple places (#12267)
Summary:
This PR adds estimated pending compaction bytes in two places:

- The "Level summary", which is printed to the info LOG after every flush or compaction
- The "rocksdb.cfstats" property, which is printed to the info LOG periodically according to `stats_dump_period_sec`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12267

Test Plan:
Ran `./db_bench -benchmarks=filluniquerandom -stats_dump_period_sec=1 -statistics=true -write_buffer_size=524288` and looked at the LOG.

```
** Compaction Stats [default] **
...
Estimated pending compaction bytes: 12117691
...
2024/01/22-13:15:12.283563 1572872 (Original Log Time 2024/01/22-13:15:12.283540) [/db_impl/db_impl_compaction_flush.cc:371] [default] Level summary: files[10 1 0 0 0 0 0] max score 0.50, estimated pending compaction bytes 12359137
```

Reviewed By: cbi42

Differential Revision: D52973337

Pulled By: ajkr

fbshipit-source-id: c4e546bd9bdac387eebeeba303d04125212037b8
2024-01-23 09:14:59 -08:00
Yu Zhang ef342246dc Consolidate stats recording in error handler (#11992)
Summary:
This is a non functional refactor, mostly for deduplicating the stats recording logic in error handler. Plus some documentation update and simple code dedupe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11992

Test Plan: existing tests

Reviewed By: hx235

Differential Revision: D52967713

Pulled By: jowlyzhang

fbshipit-source-id: d584eae1a06410438f5a4c59c2cb67666ea7de1a
2024-01-22 14:57:30 -08:00
zaidoon e572ae9f57 expose mode option to Rate Limiter via C API (#12259)
Summary:
addresses https://github.com/facebook/rocksdb/issues/12220 to allow rate limiting compaction but not flushes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12259

Reviewed By: jaykorean

Differential Revision: D52965342

Pulled By: ajkr

fbshipit-source-id: 38566d9ac75c932c63e10cc53796fab0e46e3b2e
2024-01-22 11:45:53 -08:00
Changyu Bi 4b684e96b7 Allow more intra-L0 compaction when L0 is small (#12214)
Summary:
introduce a new option `intra_l0_compaction_size` to allow more intra-L0 compaction when total L0 size is under a threshold. This option applies only to leveled compaction. It is enabled by default and set to `max_bytes_for_level_base / max_bytes_for_level_multiplier` only for atomic_flush users. When atomic_flush=true, it is more likely that some CF's total L0 size is small when it's eligible for compaction. This option aims to reduce write amplification in this case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12214

Test Plan:
- new unit test
- benchmark:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --write_buffer_size=51200 --max_bytes_for_level_base=5242880 --level0_file_num_compaction_trigger=4 --statistics=1

main:
fillrandom   :     234.499 micros/op 4264 ops/sec 234.499 seconds 1000000 operations;    0.5 MB/s
rocksdb.compact.read.bytes COUNT : 1490756235
rocksdb.compact.write.bytes COUNT : 1469056734
rocksdb.flush.write.bytes COUNT : 71099011

branch:
fillrandom   :     128.494 micros/op 7782 ops/sec 128.494 seconds 1000000 operations;    0.9 MB/s
rocksdb.compact.read.bytes COUNT : 807474156
rocksdb.compact.write.bytes COUNT : 781977610
rocksdb.flush.write.bytes COUNT : 71098785
```

Reviewed By: ajkr

Differential Revision: D52637771

Pulled By: cbi42

fbshipit-source-id: 4f2c7925d0c3a718635c948ea0d4981ed9fabec3
2024-01-22 10:23:57 -08:00
Peter Dillinger cb08a682d4 Fix/cleanup SeqnoToTimeMapping (#12253)
Summary:
The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios.

* Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc.
  * Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.)
* Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior.
* Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.)
  * Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries.
  * Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.`
* Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior.
* Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file.

Intended follow-up (me or others):
* Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense.
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL.
* The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253

Test Plan: unit tests updated

Reviewed By: jowlyzhang

Differential Revision: D52913733

Pulled By: pdillinger

fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593
2024-01-19 21:50:38 -08:00
Changyu Bi ec5b1be18d Deflake `PerfContextTest.CPUTimer` (#12252)
Summary:
We saw failures like
```
db/perf_context_test.cc:952: Failure
Expected: (next_count) > (count), actual: 26699 vs 26699
```
I can repro by running the test repeatedly and the test fails with different seek keys. So
the cause is likely not with Seek() implementation. I found that
`clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);` can return the same time when
called repeatedly. However, I don't know if Seek() is fast enough that this happened during
continuous test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12252

Test Plan: `gtest_parallel.py --repeat=10000 --workers=1 ./perf_context_test --gtest_filter="PerfContextTest.CPUTimer"`

Reviewed By: ajkr

Differential Revision: D52912751

Pulled By: cbi42

fbshipit-source-id: 8985ae93baa99cdf4b9136ea38addd2e41f4b202
2024-01-19 10:13:52 -08:00
anand76 65e162bf09 Add some asserts in FilePickerMultiGet for debugging (#12241)
Summary:
Add asserts to help debug a crash test failure. The test fails as wollows -
```rocksdb::FilePickerMultiGet::PrepareNextLevel(): Assertion `fp_ctx.search_right_bound == -1 || fp_ctx.search_right_bound == FileIndexer::kLevelMaxIndex' failed```

Also add a unit test to verify an edge case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12241

Reviewed By: cbi42

Differential Revision: D52819029

Pulled By: anand1976

fbshipit-source-id: 33316985c8ace1aed9ecc2400da8b777aec488ff
2024-01-16 17:08:58 -08:00
akankshamahajan cad76a2e1e Fix bug in auto_readahead_size that returned wrong key (#12229)
Summary:
IndexType::kBinarySearchWithFirstKey +
BlockCacheLookupForReadAheadSize enabled => FindNextUserEntryInternal assertion fails or iterator lands at a wrong key because BlockCacheLookupForReadAheadSize moves the index_iter_ and in internal_wrapper.h, result_.key didn't update and pointed to wrong key. Also ikey_ was also pointing to iter_.key() instead of copying the key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12229

Test Plan:
```
 rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
mkdir /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3
./db_stress -threads=1 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_setting_blob_options_dynamically=0 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=0 --backup_one_in=0 --batch_protection_bytes_per_key=0 --blob_cache_size=0 --blob_compaction_readahead_size=0 --blob_compression_type=lz4 --blob_file_size=0 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0 --blob_garbage_collection_force_threshold=0 --block_protection_bytes_per_key=0 --block_size=2048 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=0 --compaction_ttl=0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 --db_write_buffer_size=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=3 --ingest_external_file_one_in=10 --initial_auto_readahead_size=0 --iterpercent=55 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_blob_size=8 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=10000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=1 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --snapshot_hold_ops=0 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_blob_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=0 --verify_checksum_one_in=0 --verify_db_one_in=0 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=1 --verify_sst_unique_id_in_manifest=0 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=0 > repro.out
Verification failed. Expected state has key 0000000000000077000000000000004178, iterator is at key 0000000000000077000000000000008A78
Column family: default, op_logs: S 0000000000000077000000000000003D7878787878 NNNN
No writes or ops?
Verification failed :(
```

Reviewed By: ajkr

Differential Revision: D52710655

Pulled By: akankshamahajan15

fbshipit-source-id: 9d2e684e190fb0832bdce3337bce1c6548cd054d
2024-01-16 11:30:36 -08:00
Jonah Gao e28251ca72 Fix blob files not reclaimed after deleting all SSTs (#12235)
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/12208.

After all the SSTs have been deleted, all the blob files will become unreferenced.
These files should be considered obsolete and thus, should not be saved to the vstorage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12235

Reviewed By: jowlyzhang

Differential Revision: D52806441

Pulled By: ltamasi

fbshipit-source-id: 62f94d4f2544ed2822c764d8ace5bf7f57efe42d
2024-01-16 11:15:23 -08:00
Andrew Kryczka 2dda7a0dd2 Detect compaction pressure at lower debt ratios (#12236)
Summary:
This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236

Reviewed By: cbi42

Differential Revision: D52765685

Pulled By: ajkr

fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4
2024-01-15 22:41:18 -08:00
马越 1a1f9f1660 Fix the compactRange with wrong cf handle when ClipColumnFamily (#12219)
Summary:
- **Context**:

In ClipColumnFamily, the DeleteRange API will be used to delete data, and then CompactRange will be called for physical deletion. But now However, the ColumnFamilyHandle is not passed , so by default only the DefaultColumnFamily will be CompactRanged. Therefore, it may cause that the data in some sst files of CompactionRange cannot be physically deleted.

- **In this change**

Pass the ColumnFamilyHandle when call CompactRange

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12219

Reviewed By: ajkr

Differential Revision: D52665162

Pulled By: cbi42

fbshipit-source-id: e8e997aa25ec4ca40e347be89edc7e84a7a0edce
2024-01-10 14:34:12 -08:00
Andrew Kryczka 5a9ecf6614 Automated modernization (#12210)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12210

Reviewed By: hx235

Differential Revision: D52559771

Pulled By: ajkr

fbshipit-source-id: 1ccdd3a0180cc02bc0441f20b0e4a1db50841b03
2024-01-05 11:53:57 -08:00
akankshamahajan 5cb2d09d47 Refactor FilePrefetchBuffer code (#12097)
Summary:
Summary - Refactor FilePrefetchBuffer code
- Implementation:
FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_.

 If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data.
If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer.
If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed.
If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to
the caller.

- Merged Sync and Async code flow into one in FilePrefetchBuffer.

Test Plan -
- Crash test passed
- Unit tests
- Pending - Benchmarks

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097

Reviewed By: ajkr

Differential Revision: D51759552

Pulled By: akankshamahajan15

fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5
2024-01-05 09:29:01 -08:00
Hui Xiao 81b6296c7e Pass flush IO activity enum in FlushJob::MaybeIncreaseFullHistoryTsLowToAboveCutoffUDT...() (#12197)
Summary:
**Context/Summary:** as titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12197

Test Plan:
```
./db_stress --acquire_snapshot_one_in=100 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=4.393039399748979 --bottommost_compression_type=disable --bottommost_file_compaction_delay=86400 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_ttl=0 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --persist_user_defined_timestamps=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=55 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=3 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_txn=0 --use_write_buffer_manager=0 --user_timestamp_size=8 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=10000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=128 --writepercent=35
```

Before fix:
```
db_stress_tool/db_stress_env_wrapper.h:92: virtual rocksdb::IOStatus rocksdb::DbStressWritableFileWrapper::Append(const rocksdb::Slice &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `io_activity == Env::IOActivity::kUnknown || io_activity == options.io_activity' failed.
```

After fix:
Succeed

Reviewed By: ajkr

Differential Revision: D52492030

Pulled By: hx235

fbshipit-source-id: 842a0dcbdf135838b57ddb4a3a6f1effc8dd3e82
2024-01-02 17:33:00 -08:00
leipeng d411fc4dd6 column_family.cc: SanitizeOptions(dbo, cfo): WARN msg: add missing spaces (#12193)
Summary:
Fix for multi line strings missing spaces.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12193

Reviewed By: cbi42

Differential Revision: D52457430

Pulled By: ajkr

fbshipit-source-id: 4ca75a14e61c09819e5d821da6137f4536e9e76e
2024-01-02 11:18:11 -08:00
leipeng 906c6683ed InternalKey::Set: remove redundant assign (#12194)
Summary:
InternalKey::Set: remove redundant assign

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12194

Reviewed By: cbi42

Differential Revision: D52457542

Pulled By: ajkr

fbshipit-source-id: 329983a8734ff38ffd93018bbbe112b4a23b5c11
2024-01-02 11:17:39 -08:00
Hui Xiao 06e593376c Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.

For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS

Some related code refactory to make implementation cleaner:
- Blob stats
   - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS  include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
   - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
   - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
   - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see  https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
   - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority

## Test
### db bench

Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100

rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```

blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB

Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```

```
Stacked Blob DB

Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```

###  Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests

###  Performance

Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true

Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,

Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```

Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846

Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```

Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860

Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910

Reviewed By: ajkr

Differential Revision: D49788060

Pulled By: hx235

fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 15:29:23 -08:00
anand76 a036525809 Lightweight verification of MANIFEST file after close on shutdown (#12174)
Summary:
Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174

Test Plan: Unit test, and some manual verification

Reviewed By: ajkr

Differential Revision: D52451184

Pulled By: anand1976

fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83
2023-12-28 18:25:29 -08:00
hulk b7ecbe309d Trigger compaction to the next level if the data age exceeds periodic_compaction_seconds (#12175)
Summary:
Currently, the data are always compacted to the same level if exceed periodic_compaction_seconds which may confuse users, so we change it to allow trigger compaction to the next level here. It's a behavior change to users, and may affect users
who have disabled their ttl or ttl > periodic_compaction_seconds.

Relate issue: https://github.com/facebook/rocksdb/issues/12165

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175

Reviewed By: ajkr

Differential Revision: D52446722

Pulled By: cbi42

fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f
2023-12-28 12:50:08 -08:00
Changyu Bi 3d81f175b4 Prioritize marked file in level compaction (#12187)
Summary:
When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187

Test Plan: * New unit tests

Reviewed By: ajkr

Differential Revision: D52437194

Pulled By: cbi42

fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3
2023-12-28 10:28:37 -08:00
darionyaphet 01f2edd145 Replace push_back by emplace_back in wal manager (#10805)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10805

Reviewed By: ajkr

Differential Revision: D52424928

Pulled By: hx235

fbshipit-source-id: 548e3304ca721a3907be3696d12735929aca8490
2023-12-27 10:40:33 -08:00
Andrew Kryczka 4fefe1fed9 Downgrade warning for dynamic leveling with non-leveled compaction (#12186)
Summary:
Now that `level_compaction_dynamic_level_bytes`'s default value is true, users who do not touch that setting and use non-leveled compaction will also see this log message. It can be info level rather than warning since, in the case mentioned, there is nothing the user needs to be warned about.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12186

Reviewed By: cbi42

Differential Revision: D52422499

Pulled By: ajkr

fbshipit-source-id: 8dbfcd102aab671b881ba047fb4a0a555b3e0a78
2023-12-26 15:13:42 -08:00
Peter Dillinger a771a47a1b Fix leak or crash on failure in automatic atomic flush (#12176)
Summary:
Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test:

```
db_flush_test: db/column_family.cc:1648:
rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176

Test Plan: unit test added

Reviewed By: cbi42

Differential Revision: D52417071

Pulled By: pdillinger

fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5
2023-12-26 11:04:25 -08:00
zaidoon ad0362ac92 Expose Options::ttl through C API (#12170)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12170

Reviewed By: jaykorean

Differential Revision: D52378902

Pulled By: cbi42

fbshipit-source-id: 0bac94b8785d5149df86e7317e69c0e64beab887
2023-12-21 15:04:53 -08:00
anand76 cc069f25b3 Add some compressed and tiered secondary cache stats (#12150)
Summary:
Add statistics for more visibility.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12150

Reviewed By: akankshamahajan15

Differential Revision: D52184633

Pulled By: anand1976

fbshipit-source-id: 9969e05d65223811cd12627102b020bb6d229352
2023-12-15 11:34:08 -08:00
Akanksha Mahajan cd577f6059 Fix WRITE_STALL start_time (#12147)
Summary:
`Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)
However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time);

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147

Test Plan: Existing CircleCI

Reviewed By: cbi42

Differential Revision: D52173481

Pulled By: akankshamahajan15

fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98
2023-12-14 13:45:06 -08:00
akankshamahajan d926593df5 Fix stress tests failure for auto_readahead_size (#12131)
Summary:
When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter  so that
- BlockBasedTableIterator can point index_iter_ to the right block.
- disable readahead_cache_lookup.
However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This more like a corner case.

So to handle that case, removed SeekForPrev calling from db_iter and reseeking index_iter_ in Prev operation. block_iter_'s key already point to right block. So reseeking to index_iter_ solves the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131

Test Plan:
- Tested on db_stress command that was failing -
`./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35`

 - make crash_test -j32

Reviewed By: anand1976

Differential Revision: D51986326

Pulled By: akankshamahajan15

fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9
2023-12-13 12:15:04 -08:00
Andrew Kryczka d8e47620d7 Speedup based on pending compaction bytes relative to data size (#12130)
Summary:
RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years.

This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130

Reviewed By: hx235

Differential Revision: D52000746

Pulled By: ajkr

fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7
2023-12-13 10:37:27 -08:00
Peter Dillinger c96d9a0fbb Allow TablePropertiesCollectorFactory to return null collector (#12129)
Summary:
As part of building another feature, I wanted this:
* Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases.
* Polished, clarified some related API comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12129

Test Plan: unit test added

Reviewed By: ltamasi

Differential Revision: D51966667

Pulled By: pdillinger

fbshipit-source-id: 2991c08fe6ce3a8c9f14c68f1495f5a17bca2770
2023-12-11 12:02:56 -08:00
Kevin Mingtarja 44fd914128 Fix double counting of BYTES_WRITTEN ticker (#12111)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12061.

We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable.

The cause of this bug is previously during writes, we didn't take into account `writer->ShouldWriteToMemtable()` before adding to `total_byte_size`, so it is still added to during the `Prepare()` phase even though we're not writing to the Memtable, which was why we saw the value to be double of what's written to WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111

Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes.

Reviewed By: jaykorean

Differential Revision: D51954327

Pulled By: jowlyzhang

fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c
2023-12-08 17:12:11 -08:00
Levi Tamasi a143f93236 Turn the default Timer in PeriodicTaskScheduler into a leaky Meyers singleton (#12128)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12128

The patch turns the `Timer` Meyers singleton in `PeriodicTaskScheduler::Default()` into one of the leaky variety in order to prevent static destruction order issues.

Reviewed By: akankshamahajan15

Differential Revision: D51963950

fbshipit-source-id: 0fc34113ad03c51fdc83bdb8c2cfb6c9f6913948
2023-12-08 10:34:07 -08:00
Levi Tamasi 0ebe1614cb Eliminate some code duplication in MergeHelper (#12121)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12121

The patch eliminates some code duplication by unifying the two sets of `MergeHelper::TimedFullMerge` overloads using variadic templates. It also brings the order of parameters into sync when it comes to the various `TimedFullMerge*` methods.

Reviewed By: jaykorean

Differential Revision: D51862483

fbshipit-source-id: e3f832a6ff89ba34591451655cf11025d0a0d018
2023-12-05 14:07:42 -08:00
Yu Zhang ba8fa0f546 internal_repo_rocksdb (4372117296613874540) (#12117)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12117

Reviewed By: ajkr

Differential Revision: D51745846

Pulled By: jowlyzhang

fbshipit-source-id: 51c806a484b3b43d174b06d2cfe9499191d09914
2023-12-04 11:17:32 -08:00
Yu Zhang d68f45e777 Flush buffered logs when FlushRequest is rescheduled (#12105)
Summary:
The optimization to not find and delete obsolete files when FlushRequest is re-scheduled also inadvertently skipped flushing the `LogBuffer`, resulting in missed logs. This PR fixes the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12105

Test Plan:
manually check this test has the correct info log after the fix
`./column_family_test --gtest_filter=ColumnFamilyRetainUDTTest.NotAllKeysExpiredFlushRescheduled`

Reviewed By: ajkr

Differential Revision: D51671079

Pulled By: jowlyzhang

fbshipit-source-id: da0640e07e35c69c08988772ed611ec9e67f2e92
2023-11-29 11:35:59 -08:00
cz2h 324453e579 Fix rowcache get returning incorrect timestamp (#11952)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/7930.

When there is a timestamp associated with stored records, get from row cache will return the timestamp provided in query instead of the timestamp associated with the stored record.

## Cause of error:
Currently a row_handle is fetched using row_cache_key(contains a timestamp provided by user query) and the row_handle itself does not persist timestamp associated with the object. Hence the [GetContext::SaveValue()
](6e3429b8a6/table/get_context.cc (L257)) function will fetch the timestamp in row_cache_key and may return the incorrect timestamp value.

## Proposed Solution
If current cf enables ts, append a timestamp associated with stored records after the value in replay_log (equivalently the value of row cache entry).

When read, `replayGetContextLog()` will update parsed_key with the correct timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11952

Reviewed By: ajkr

Differential Revision: D51501176

Pulled By: jowlyzhang

fbshipit-source-id: 808fc943a8ae95de56ae0e82ec59a2573a031f28
2023-11-21 20:39:33 -08:00
Changyu Bi fb5c8c7ea3 Do not compare op_type in `WithinPenultimateLevelOutputRange()` (#12081)
Summary:
`WithinPenultimateLevelOutputRange()` is updated in https://github.com/facebook/rocksdb/issues/12063 to check internal key range. However, op_type of a key can change during compaction, e.g. MERGE -> PUT, which makes a key larger and becomes out of penultimate output range. This has caused stress test failures with error message "Unsafe to store Seq later than snapshot in the last level if per_key_placement is enabled". So update `WithinPenultimateLevelOutputRange()` to only check user key and sequence number.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12081

Test Plan:
* This repro can produce the corruption within a few runs. Ran it a few times after the fix and did not see Corruption failure.
```
python3 ./tools/db_crashtest.py whitebox --test_tiered_storage --random_kill_odd=888887 --use_merge=1 --writepercent=100 --readpercent=0 --prefixpercent=0 --delpercent=0 --delrangepercent=0 --iterpercent=0 --write_buffer_size=419430 --column_families=1 --read_fault_one_in=0 --write_fault_one_in=0
```

Reviewed By: ajkr

Differential Revision: D51481202

Pulled By: cbi42

fbshipit-source-id: cad6b65099733e03071b496e752bbdb09cf4db82
2023-11-20 17:07:28 -08:00
Benoît Mériaux 7780e98268 add write_buffer_manager setter into options and tests in c bindings, (#12007)
Summary:
following https://github.com/facebook/rocksdb/pull/11710
 - add test on wbm c api
- add a setter for WBM in `DBOptions`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12007

Reviewed By: cbi42

Differential Revision: D51430042

Pulled By: ajkr

fbshipit-source-id: 608bc4d3ed35a84200459d0230b35be64b3475f7
2023-11-17 11:34:05 -08:00
Changyu Bi 4e58cc6437 Check internal key range when compacting from last level to penultimate level (#12063)
Summary:
The test failure in https://github.com/facebook/rocksdb/issues/11909 shows that we may compact keys outside of internal key range of penultimate level input files from last level to penultimate level, which can potentially cause overlapping files in the penultimate level. This PR updates the  `Compaction::WithinPenultimateLevelOutputRange()` to check internal key range instead of user key.

Other fixes:
* skip range del sentinels when deciding output level for tiered compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12063

Test Plan:
- existing unit tests
- apply the fix to https://github.com/facebook/rocksdb/issues/11905 and run `./tiered_compaction_test --gtest_filter="*RangeDelsCauseFileEndpointsToOverlap*"`

Reviewed By: ajkr

Differential Revision: D51288985

Pulled By: cbi42

fbshipit-source-id: 70085db5f5c3b15300bcbc39057d57b83fd9902a
2023-11-17 10:50:40 -08:00
Gus Wynn 6d10f8d690 add WriteBufferManager to c api (#11710)
Summary:
I want to use the `WriteBufferManager` in my rust project, which requires exposing it through the c api, just like `Cache` is.

Hopefully the changes are fairly straightfoward!

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11710

Reviewed By: cbi42

Differential Revision: D51166518

Pulled By: ajkr

fbshipit-source-id: cd266ff1e4a7ab145d05385cd125a8390f51f3fc
2023-11-16 10:34:00 -08:00
Andrew Kryczka 9202db1867 Consider archived WALs for deletion more frequently (#12069)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11000.

That issue pointed out that RocksDB was slow to delete archived WALs in case time-based and size-based expiration were enabled, and the time-based threshold (`WAL_ttl_seconds`) was small. This PR prevents the delay by taking into account `WAL_ttl_seconds` when deciding the frequency to process archived WALs for deletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12069

Reviewed By: pdillinger

Differential Revision: D51262589

Pulled By: ajkr

fbshipit-source-id: e65431a06ee96f4c599ba84a27d1aedebecbb003
2023-11-15 15:42:28 -08:00
Changyu Bi e7896f03ad Enable unit test `PrecludeLastLevelTest.RangeDelsCauseFileEndpointsToOverlap` (#12064)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11909. The test passes after the change in https://github.com/facebook/rocksdb/issues/11917 to start mock clock from a non-zero time.

The reason for test failing is a bit complicated:
- The Put here e4ad4a0ef1/db/compaction/tiered_compaction_test.cc (L2045) happens before mock clock advances beyond 0.
- This causes oldest_key_time_ to be 0 for memtable.
- oldest_ancester_time of the first L0 file becomes 0
- L0 -> L5/6 compaction output files sets `oldest_ancestoer_time` to the current time due to these lines: 509947ce2c/db/compaction/compaction_job.cc (L1898C34-L1904).
- This causes some small sequence number to be mapped to current time: 509947ce2c/db/compaction/compaction_job.cc (L301)
- Keys in L6 is being moved up to L5 due to the unexpected seqno_to_time mapping
- When compacting keys from last level to the penultimate level, we only check keys to be within user key range of penultimate level input files. If we compact the following file 3 with file 1 and output keys to L5, we can get the reported inconsistency bug.
```
L5: file 1 [K5@20, K10@kMaxSeqno], file 2 [K10@30, K14@34)
L6: file 3 [K6@5, K10@20]
```

https://github.com/facebook/rocksdb/issues/12063 will add fixes to check internal key range when compacting keys from last level up to the penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12064

Test Plan: the unit test passes

Reviewed By: ajkr

Differential Revision: D51281149

Pulled By: cbi42

fbshipit-source-id: 00b7f026c453454d9f3af5b2de441383a96f0c62
2023-11-13 15:26:52 -08:00
Jay Huh 8b8f6c63ef ColumnFamilyHandle Nullcheck in GetEntity and MultiGetEntity (#12057)
Summary:
- Add missing null check for ColumnFamilyHandle in `GetEntity()`
- `FailIfCfHasTs()` now returns `Status::InvalidArgument()` if `column_family` is null. `MultiGetEntity()` can rely on this for cfh null check.
- Added `DeleteRange` API using Default Column Family to be consistent with other major APIs (This was also causing Java Test failure after the `FailIfCfHasTs()` change)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12057

Test Plan:
- Updated `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` to include null CF case
- Updated `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` to include null CF case

Reviewed By: jowlyzhang

Differential Revision: D51167445

Pulled By: jaykorean

fbshipit-source-id: 1c1e44fd7b7df4d2dc3bb2d7d251da85bad7d664
2023-11-13 14:30:04 -08:00
leipeng b3ffca0e29 DBImpl::DelayWrite: Remove bad WRITE_STALL histogram (#12067)
Summary:
When delay didn't happen, histogram WRITE_STALL is still recorded, and ticker STALL_MICROS is not recorded.

This is a bug, neither WRITE_STALL or STALL_MICROS should not be recorded when delay did not happen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12067

Reviewed By: cbi42

Differential Revision: D51263133

Pulled By: ajkr

fbshipit-source-id: bd82d8328fe088d613991966e83854afdabc6a25
2023-11-13 12:48:44 -08:00
Yu Zhang 509947ce2c Quarantine files in a limbo state after a manifest error (#12030)
Summary:
Part of the procedures to handle manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounter such an error contain updates for files, disabling file deletion sometimes are not necessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery.  3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files from being prematurely deleted.  This PR replaces this mechanism with tracking such files and quarantine them from being deleted in `ErrorHandler`.

These are the types of files being actively tracked in quarantine in this PR:
1) new table files and blob files from a background job
2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed.

Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening.

We track these files' file numbers because they share the same file number space.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030

Test Plan: Modified existing unit tests

Reviewed By: ajkr

Differential Revision: D51036774

Pulled By: jowlyzhang

fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716
2023-11-11 08:11:11 -08:00
Yu Zhang c6c683a0ca Remove the default force behavior for `EnableFileDeletion` API (#12001)
Summary:
Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insisted on doing so knowing the consequence of what can be potentially disrupted.

This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them.  When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK.

This PR also makes unit test that do not need force behavior to do a regular enable file deletion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001

Reviewed By: ajkr

Differential Revision: D51214683

Pulled By: jowlyzhang

fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4
2023-11-10 14:35:54 -08:00
Yueh-Hsuan Chiang 5ef92b8ea4 Add rocksdb_options_set_cf_paths (#11151)
Summary:
This PR adds a missing set function for rocksdb_options in the C-API:
rocksdb_options_set_cf_paths().  Without this function, users cannot
specify different paths for different column families as it will fall back
to db_paths.

As a bonus, this PR also includes rocksdb_sst_file_metadata_get_directory()
to the C api -- a missing public function that will also make the test easier to write.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11151

Test Plan: Augment existing c_test to verify the specified cf_path.

Reviewed By: hx235

Differential Revision: D51201888

Pulled By: ajkr

fbshipit-source-id: 62a96451f26fab60ada2005ede3eea8e9b431f30
2023-11-10 11:36:11 -08:00
Yueh-Hsuan Chiang 73d223c4e2 Add auto_tuned option to RateLimiter C API (#12058)
Summary:
#### Problem
While the RocksDB C API does have the RateLimiter API, it does not
expose the auto_tuned option.

#### Summary of Change
This PR exposes auto_tuned RateLimiter option in RocksDB C API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12058

Test Plan: Augment the C API existing test to cover the new API.

Reviewed By: cbi42

Differential Revision: D51201933

Pulled By: ajkr

fbshipit-source-id: 5bc595a9cf9f88f50fee797b729ba96f09ed8266
2023-11-10 09:53:09 -08:00
Yu Zhang dfaf4dc111 Stubs for piping write time (#12043)
Summary:
As titled. This PR contains the API and stubbed implementation for piping write time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12043

Reviewed By: pdillinger

Differential Revision: D51076575

Pulled By: jowlyzhang

fbshipit-source-id: 3b341263498351b9ccaff27cf35d5aeb5bdf0cf1
2023-11-09 15:58:07 -08:00
brodyhuang e90e9825b4 Drop wal record when sequence is illegal (#11985)
Summary:
- Our database is corrupted, causing some sequences of wal record to be invalid (but the `record_checksum` looks fine).
- When we RecoverLogFiles in WALRecoveryMode::kPointInTimeRecovery, `assert(seq <= kMaxSequenceNumber)` will be failed.
- When it is found that sequence is illegal, can we drop the file  to recover as much data as possible ?  Thx !

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11985

Reviewed By: anand1976

Differential Revision: D50698039

Pulled By: ajkr

fbshipit-source-id: 1e42113b58823088d7c0c3a92af5b3efbb5f5296
2023-11-09 10:43:16 -08:00
Hui Xiao f337533b6f Ensure and clarify how RocksDB calls TablePropertiesCollector's functions (#12053)
Summary:
**Context/Summary:**
It's intuitive for users to assume `TablePropertiesCollector::Finish()` is called only once by RocksDB internal by the word "finish".

However, this is currently not true as RocksDB also calls this function in `BlockBased/PlainTableBuilder::GetTableProperties()` to populate user collected properties on demand.

This PR avoids that by moving that populating to where we first call `Finish()` (i.e, `NotifyCollectTableCollectorsOnFinish`)

Bonus: clarified in the API that `GetReadableProperties()` will be called after `Finish()` and added UT to ensure that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12053

Test Plan:
- Modified test `DBPropertiesTest.GetUserDefinedTableProperties` to ensure `Finish()` only called once.
- Existing test particularly `db_properties_test, table_properties_collector_test` verify the functionality  `NotifyCollectTableCollectorsOnFinish` and `GetReadableProperties()` are not broken by this change.

Reviewed By: ajkr

Differential Revision: D51095434

Pulled By: hx235

fbshipit-source-id: 1c6275258f9b99dedad313ee8427119126817973
2023-11-08 14:00:36 -08:00
Zaidoon Abd Al Hadi 58f2a29fb4 Expose Options::periodic_compaction_seconds through C API (#12019)
Summary:
fixes [11090](https://github.com/facebook/rocksdb/issues/11090)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12019

Reviewed By: jaykorean

Differential Revision: D51076427

Pulled By: cbi42

fbshipit-source-id: de353ff66c7f73aba70ab3379e20d8c40f50d873
2023-11-07 12:46:50 -08:00
Jay Huh 2adef5367a AttributeGroups - PutEntity Implementation (#11977)
Summary:
Write Path for AttributeGroup Support. The new `PutEntity()` API uses `WriteBatch` and atomically writes WideColumns entities in multiple Column Families.

Combined the release note from PR https://github.com/facebook/rocksdb/issues/11925

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11977

Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` updated
- `WriteBatchTest::AttributeGroupTest` added
- `WriteBatchTest::AttributeGroupSavePointTest` added

Reviewed By: ltamasi

Differential Revision: D50457122

Pulled By: jaykorean

fbshipit-source-id: 4997b265e415588ce077933082dcd1ac3eeae2cd
2023-11-06 16:52:51 -08:00
Jay Huh 0ecfc4fbb4 AttributeGroups - GetEntity Implementation (#11943)
Summary:
Implementation of `GetEntity()` API that returns wide-column entities as AttributeGroups from multiple column families for a single key. Regarding the definition of Attribute groups, please see the detailed example description in PR https://github.com/facebook/rocksdb/issues/11925

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11943

Test Plan:
- `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` added

will enable the new API in the `db_stress` after merging

Reviewed By: ltamasi

Differential Revision: D50195794

Pulled By: jaykorean

fbshipit-source-id: 218d54841ac7e337de62e13b1233b0a99bd91af3
2023-11-06 15:04:41 -08:00
Jay Huh 2dab137182 Mark more files for periodic compaction during offpeak (#12031)
Summary:
- The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice.
- It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization.
- Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start.
- We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031

Test Plan:
Unit Tests added
- `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled
- `DBTestUniversalCompaction2::PeriodicCompaction` for Universal

Reviewed By: cbi42

Differential Revision: D50900292

Pulled By: jaykorean

fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d
2023-11-06 11:43:59 -08:00
Changyu Bi 520c64fd2e Add missing status check in ExternalSstFileIngestionJob and ImportColumnFamilyJob (#12042)
Summary:
.. and update some unit tests that failed with this change. See comment in ExternalSSTFileBasicTest.IngestFileWithCorruptedDataBlock for more explanation.

The missing status check is not caught by `ASSERT_STATUS_CHECKED=1` due to this line: 8505b26db1/table/block_based/block.h (L394). Will explore if we can remove it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12042

Test Plan: existing unit tests.

Reviewed By: ajkr

Differential Revision: D50994769

Pulled By: cbi42

fbshipit-source-id: c91615bccd6094a91634c50b98401d456cbb927b
2023-11-06 07:41:36 -08:00
914022466 2648e0a747 Fix a bug when ingest plaintable sst file (#11969)
Summary:
Plaintable doesn't support SeekToLast. And GetIngestedFileInfo is using SeekToLast without checking the validity.

We are using IngestExternalFile or CreateColumnFamilyWithImport with some sst file in PlainTable format . But after running for a while, compaction error often happens. Such as
![image](https://github.com/facebook/rocksdb/assets/13954644/b4fa49fc-73fc-49ce-96c6-f198a30800b8)

I simply add some std::cerr log to find why.
![image](https://github.com/facebook/rocksdb/assets/13954644/2cf1d5ff-48cc-4125-b917-87090f764fcd)
It shows that the smallest key is always equal to largest key.
![image](https://github.com/facebook/rocksdb/assets/13954644/6d43e978-0be0-4306-aae3-f9e4ae366395)
Then I found the root cause is that PlainTable do not support SeekToLast, so the smallest key is always the same with the largest

I try to write an unit test. But it's not easy to reproduce this error.
(This PR is similar to https://github.com/facebook/rocksdb/pull/11266. Sorry for open another PR)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11969

Reviewed By: ajkr

Differential Revision: D50933854

Pulled By: cbi42

fbshipit-source-id: 6c6af53c1388922cbabbe64ed3be1cdc58df5431
2023-11-02 13:45:37 -07:00
Yu Zhang 4b013dcbed Remove VersionEdit's friends pattern (#12024)
Summary:
Almost each of VersionEdit private member has its own getter and setter. Current code access them with a combination of directly accessing private members and via getter and setters. There is no obvious benefits to have this pattern except potential performance gains. I tried this simple benchmark for removing the friends pattern completely, and there is no obvious regression. So I think it would good to remove VersionEdit's friends completely.

```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num_column_families=10 -num=50000000```

With change:
fillseq      :       2.994 micros/op 333980 ops/sec 149.710 seconds 50000000 operations;   36.9 MB/s
fillseq      :       3.033 micros/op 329656 ops/sec 151.673 seconds 50000000 operations;   36.5 MB/s
fillseq      :       2.991 micros/op 334369 ops/sec 149.535 seconds 50000000 operations;   37.0 MB/s
Without change:
fillseq      :       3.015 micros/op 331715 ops/sec 150.732 seconds 50000000 operations;   36.7 MB/s
fillseq      :       3.044 micros/op 328553 ops/sec 152.182 seconds 50000000 operations;   36.3 MB/s
fillseq      :       3.091 micros/op 323520 ops/sec 154.550 seconds 50000000 operations;   35.8 MB/s

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12024

Reviewed By: pdillinger

Differential Revision: D50806066

Pulled By: jowlyzhang

fbshipit-source-id: 35d287ce638a38c30f243f85992e615b4c90eb27
2023-11-01 12:04:11 -07:00
Jay Huh 04225a2cfa Fix for RecoverFromRetryableBGIOError starting with recovery_in_prog_ false (#11991)
Summary:
cbi42 helped investigation and found a potential scenario where `RecoverFromRetryableBGIOError()` may start with `recovery_in_prog_ ` set as false. (and other booleans like `bg_error_` and `soft_error_no_bg_work_`)

**Thread 1**
- `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true`

**Thread 1's `recovery_thread_`**
- (waits for mutex and acquires it)
- `RecoverFromRetryableBGIOError()` -> `ResumeImpl()` -> `ClearBGError()`: sets `recovery_in_prog_ = false`
- `ClearBGError()` -> `NotifyOnErrorRecoveryEnd()`: releases `mutex`

**Thread 2**
- `StartRecoverFromRetryableBGIOError()`): (mutex held) sets `recovery_in_prog_ = true`
- Waits for Thread 1 (`recovery_thread_`) to finish

**Thread 1's `recovery_thread_`**
- re-lock mutex in `NotifyOnErrorRecoveryEnd()`
- Still inside `RecoverFromRetryableBGIOError()`: sets `recovery_in_prog_ = false`
- Done

**Thread 2's `recovery_thread_`**
- recovery thread started with `recovery_in_prog_` set as `false`

# Fix
- Remove double-clearing `bg_error_`,  `recovery_in_prog_` and other fields after `ResumeImpl()` already returned `OK()`.
- Minor typo and linter fixes in `DBErrorHandlingFSTest`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11991

Test Plan:
- `DBErrorHandlingFSTest::MultipleRecoveryThreads` added to reproduce the scenario.
- Adding `assert(recovery_in_prog_);` at the start of `ErrorHandler::RecoverFromRetryableBGIOError()` fails the test without the fix and succeeds with the fix as expected.

Reviewed By: cbi42

Differential Revision: D50506113

Pulled By: jaykorean

fbshipit-source-id: 6dabe01e9ecd3fc50bbe9019587f2f4858bed9c6
2023-10-31 16:13:36 -07:00
Yu Zhang 60df39e530 Rate limiting stale sst files' deletion during recovery (#12016)
Summary:
As titled. If SstFileManager is available, deleting stale sst files will be delegated to it so it can be rate limited.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12016

Reviewed By: hx235

Differential Revision: D50670482

Pulled By: jowlyzhang

fbshipit-source-id: bde5b76ea1d98e67f6b4f08bfba3db48e46aab4e
2023-10-28 09:50:52 -07:00
Jay Huh e230e4d248 Make OffpeakTimeInfo available in VersionSet (#12018)
Summary:
As mentioned in  https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR.

There were two ways to achieve what we want.
1.  Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`.
2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`.

We chose the latter as it involves smaller changes.

This change includes the following
- Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions`
- `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()`
- During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018

Test Plan:
- `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`.
- `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes`

Reviewed By: pdillinger

Differential Revision: D50723881

Pulled By: jaykorean

fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6
2023-10-27 15:56:48 -07:00
Hui Xiao 0f141352d8 Fix race between flush error recovery and db destruction (#12002)
Summary:
**Context:**
DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: 519f2a41fb/db/db_impl/db_impl.cc (L525) -> 519f2a41fb/db/error_handler.cc (L250) -> 519f2a41fb/db/error_handler.cc (L808-L823)

However, due to a race between flush error recovery and db destruction, recovery can actually start after such wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part the db destruction. It then crashes the program as below.

```
std::terminate()
std::default_delete<std::thread>::operator()(std::thread*) const
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
```

**Summary:**
This PR fixed it by considering whether EndAutoRecovery() has been called before creating such thread. This fix is similar to how we currently [handle](519f2a41fb/db/error_handler.cc (L688-L694)) such case inside the created recovery thread.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002

Test Plan: A new UT repro-ed the crash before this fix and and pass after.

Reviewed By: ajkr

Differential Revision: D50586191

Pulled By: hx235

fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093
2023-10-25 11:59:09 -07:00
qiuchengxuan f2c9075d16 Fix dead loop with kSkipAnyCorruptedRecords mode selected in some cases (#11955) (#11979)
Summary:
With fragmented record span across multiple blocks, if any following blocks corrupted with arbitary data, and intepreted log number less than the current log number, program will fall into infinite loop due to
not skipping buffer leading bytes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11979

Test Plan: existing unit tests

Reviewed By: ajkr

Differential Revision: D50604408

Pulled By: jowlyzhang

fbshipit-source-id: e50a0c7e7c3d293fb9d5afec0a3eb4a1835b7a3b
2023-10-25 09:16:24 -07:00
Myth 0ff7665c95 Fix low priority write may cause crash when it is rate limited (#11932)
Summary:
Fixed https://github.com/facebook/rocksdb/issues/11902

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11932

Reviewed By: akankshamahajan15

Differential Revision: D50573356

Pulled By: hx235

fbshipit-source-id: adeb1abdc43b523b0357746055ce4a2eabde56a1
2023-10-24 14:41:46 -07:00
Peter Dillinger 4155087746 Use manifest to persist pre-allocated seqnos (#11995)
Summary:
... and other fixes for crash test after https://github.com/facebook/rocksdb/issues/11922.
* When pre-allocating sequence numbers for establishing a time history, record that last sequence number in the manifest so that it is (most likely) restored on recovery even if no user writes were made or were recovered (e.g. no WAL).
* When pre-allocating sequence numbers for establishing a time history, only do this for actually new DBs.
* Remove the feature that ensures non-zero sequence number on creating the first column family with preserve/preclude option after initial DB::Open. Until fixed in a way compatible with the crash test, this creates a gap where some data written with active preserve/preclude option won't have a known associated time.

Together, these ensure we don't upset the crash test by manipulating sequence numbers after initial DB creation (esp when re-opening with different options). (The crash test expects that the seqno after re-open corresponds to a known point in time from previous crash test operation, matching an expected DB state.)

Follow-up work:
* Re-fill the gap to ensure all data written under preserve/preclude settings have a known time estimate.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11995

Test Plan:
Added to unit test SeqnoTimeTablePropTest.PrePopulateInDB

Verified fixes two crash test scenarios:
## 1st reproducer
First apply
```
 diff --git a/db_stress_tool/expected_state.cc b/db_stress_tool/expected_state.cc
index b483e154c..ef63b8d6c 100644
 --- a/db_stress_tool/expected_state.cc
+++ b/db_stress_tool/expected_state.cc
@@ -333,6 +333,7 @@ Status FileExpectedStateManager::SaveAtAndAfter(DB* db) {
     s = NewFileTraceWriter(Env::Default(), soptions, trace_file_path,
                            &trace_writer);
   }
+  if (getenv("CRASH")) assert(false);
   if (s.ok()) {
     TraceOptions trace_opts;
     trace_opts.filter |= kTraceFilterGet;
```

Then
```
mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_expected
mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox
rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/*
CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=36000
./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=0
```

Without the fix you get
```
...
DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox]
(Re-)verified 34 unique IDs
Error restoring historical expected values: Corruption: DB is older than any restorable expected state
```

## 2nd reproducer
First apply
```
 diff --git a/db_stress_tool/db_stress_test_base.cc b/db_stress_tool/db_stress_test_base.cc
index 62ddead7b..f2654980f 100644
 --- a/db_stress_tool/db_stress_test_base.cc
+++ b/db_stress_tool/db_stress_test_base.cc
@@ -1126,6 +1126,7 @@ void StressTest::OperateDb(ThreadState* thread) {
         // OPERATION write
         TestPut(thread, write_opts, read_opts, rand_column_families, rand_keys,
                 value);
+        if (getenv("CRASH")) assert(false);
       } else if (prob_op < del_bound) {
         assert(write_bound <= prob_op);
         // OPERATION delete
```

Then
```
rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_*/*
CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=0
./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=3600
```

Without the fix you get
```
DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox]
(Re-)verified 34 unique IDs
db_stress: db_stress_tool/expected_state.cc:380: virtual rocksdb::{anonymous}::ExpectedStateTraceRecordHandler::~
ExpectedStateTraceRecordHandler(): Assertion `IsDone()' failed.
```

Reviewed By: jowlyzhang

Differential Revision: D50533346

Pulled By: pdillinger

fbshipit-source-id: 1056be45c5b9e537c8c601b28c4b27431a782477
2023-10-23 09:20:59 -07:00
Hui Xiao 0836a2b26d New tickers on deletion compactions grouped by reasons (#11957)
Summary:
Context/Summary: as titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11957

Test Plan: piggyback on existing tests; fixed a failed test due to adding new stats

Reviewed By: ajkr, cbi42

Differential Revision: D50294310

Pulled By: hx235

fbshipit-source-id: d99b97ebac41efc1bdeaf9ca7a1debd2927d54cd
2023-10-18 18:00:07 -07:00
Changyu Bi d5bc30befa Enforce status checking after Valid() returns false for IteratorWrapper (#11975)
Summary:
... when compiled with ASSERT_STATUS_CHECKED = 1.

The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782.

Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975

Test Plan:
* `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check`
* I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task.

Reviewed By: ajkr

Differential Revision: D50383790

Pulled By: cbi42

fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce
2023-10-18 09:38:38 -07:00
Yu Zhang 933ee295f4 Fix a race condition between recovery and backup (#11955)
Summary:
A race condition between recovery and backup can happen with error messages like this:
```Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory```

PR https://github.com/facebook/rocksdb/issues/6949  introduced disabling file deletion during error handling of manifest IO errors. Aformentioned race condition is caused by this chain of event:

[Backup engine]     disable file deletion
[Recovery]              disable file deletion <= this is optional for the race condition, it may or may not get called
[Backup engine]     get list of file to copy/link
[Recovery]              force enable file deletion
....                            some files refered by backup engine get deleted
[Backup engine]    copy/link file <= error no file found

This PR fixes this with:
1) Recovery thread is currently forcing enabling file deletion as long as file deletion is disabled. Regardless of whether the previous error handling is for manifest IO error and that disabled it in the first place. This means it could incorrectly enabling file deletions intended by other threads like backup threads, file snapshotting threads. This PR does this check explicitly before making the call.

2) `disable_delete_obsolete_files_` is designed as a counter to allow different threads to enable and disable file deletion separately. The recovery thread currently does a force enable file deletion, because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error(details per PR https://github.com/facebook/rocksdb/issues/6949), resulting in `DBImpl::DisableFileDeletions` to be called multiple times too. Making a force enable file deletion call that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. However, as it shows in the race condition, it can incorrectly suppress other threads like a backup thread's intention to keep the file deletion disabled. <strike>This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely</strike>. This PR tracks and caps file deletion enabling/disabling in error handler.

3) for recovery, the section to find obsolete files and purge them was moved to be done after the attempt to enable file deletion. The actual finding and purging is more likely to happen if file deletion was previously disabled and get re-enabled now. An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support change 2) and 3). Some useful logging was explicitly added to keep those log messages around.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11955

Test Plan: existing unit tests

Reviewed By: anand1976

Differential Revision: D50290592

Pulled By: jowlyzhang

fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b
2023-10-17 13:18:04 -07:00
Peter Dillinger 2fd850c7eb Remove write queue synchronization from WriteOptionsFile (#11951)
Summary:
This has become obsolete with the new `options_mutex_` in https://github.com/facebook/rocksdb/pull/11929

* Remove now-unnecessary parameter from WriteOptionsFile
* Rename (and negate) other parameter for better clarity (the caller shouldn't tell the callee what the callee needs, just what the caller knows, provides, and requests)
* Move a ROCKS_LOG_WARN (I/O) in WriteOptionsFile to outside of holding DB mutex.
* Also *avoid* (but not always eliminate) write queue synchronization in SetDBOptions. Still needed if there was a change to WAL size limit or other configuration.
* Improve some comments

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11951

Test Plan: existing unit tests and TSAN crash test local run

Reviewed By: ajkr

Differential Revision: D50247904

Pulled By: pdillinger

fbshipit-source-id: 7dfe445c705ec013886a2adb7c50abe50d83af69
2023-10-16 08:58:47 -07:00
Changyu Bi f3aef8cad7 Add write operation to tracer only after successful callback (#11954)
Summary:
We saw optimistic transaction stress test failures like the following:
```
Verification failed for column family 0 key 000000000001E9AF000000000000012B00000000000000B5 (12535491): value_from_db: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, value_from_expected: , msg: Iterator verification: Unexpected value found```
```
With ajkr's repro (see test plan), I found that we record duplicated writes to tracer when an optimistic transaction conflict checking fails. This PR fixes it by checking callback status before record a write operation to tracer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11954

Test Plan:
this reproduces the failure consistently
```
#!/bin/bash
db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected
rm -rf $db $exp && mkdir -p $exp && while ./db_stress \
        --atomic_flush=1 \
        --clear_column_family_one_in=0 \
        --db=$db \
        --db_write_buffer_size=2097152 \
        --delpercent=0 \
        --delrangepercent=0 \
        --destroy_db_initially=0 \
        --disable_wal=1 \
        --expected_values_dir=$exp \
        --iterpercent=0 \
        --max_bytes_for_level_base=2097152 \
        --max_key=250000 \
        --memtable_prefix_bloom_size_ratio=0.5 \
        --memtable_whole_key_filtering=1 \
        --occ_lock_bucket_count=100 \
        --occ_validation_policy=0 \
        --ops_per_thread=10 \
        --prefixpercent=0 \
        --readpercent=0 \
        --reopen=0 \
        --target_file_size_base=524288 \
        --test_batches_snapshots=0 \
        --use_optimistic_txn=1 \
        --use_txn=1 \
        --value_size_mult=32 \
        --write_buffer_size=524288 \
        --writepercent=100 ; do : ; done
```

Reviewed By: akankshamahajan15

Differential Revision: D50284976

Pulled By: cbi42

fbshipit-source-id: 793e3cee186c8b4f406b29166efd8d9028695206
2023-10-14 12:00:31 -07:00
Jay Huh c9d8e6a5bf AttributeGroups - MultiGetEntity Implementation (#11925)
Summary:
Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`.
An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families.

Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`.

Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf".

Prior to this PR, if the user wants to query for two keys, "key_1" and "key_2" and but only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and get the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo.

With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925

Test Plan:
- `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added

will enable this new API in the `db_stress` in a separate PR

Reviewed By: ltamasi

Differential Revision: D50017414

Pulled By: jaykorean

fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586
2023-10-13 15:58:03 -07:00
Changyu Bi 6e3429b8a6 Fix data race in accessing `recovery_in_prog_` (#11950)
Summary:
We saw the following TSAN stress test failure:
```
WARNING: ThreadSanitizer: data race (pid=17523)
  Write of size 1 at 0x7b8c000008b9 by thread T4243 (mutexes: write M0):
    #0 rocksdb::ErrorHandler::RecoverFromRetryableBGIOError() fbcode/internal_repo_rocksdb/repo/db/error_handler.cc:742 (db_stress+0x95f954) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
    https://github.com/facebook/rocksdb/issues/1 std:🧵:_State_impl<std:🧵:_Invoker<std::tuple<void (rocksdb::ErrorHandler::*)(), rocksdb::ErrorHandler*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74 (db_stress+0x95fc2b) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
    https://github.com/facebook/rocksdb/issues/2 execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18 (libstdc++.so.6+0xdf4e4) (BuildId: 452d1cdae868baeeb2fdf1ab140f1c219bf50c6e)

  Previous read of size 1 at 0x7b8c000008b9 by thread T22:
    #0 rocksdb::DBImpl::SyncClosedLogs(rocksdb::JobContext*, rocksdb::VersionEdit*) fbcode/internal_repo_rocksdb/repo/db/error_handler.h:76 (db_stress+0x84f69c) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d)
```

This is due to a data race in accessing `recovery_in_prog_`. This PR fixes it by accessing `recovery_in_prog_` under db mutex before calling `SyncClosedLogs()`. I think the original PR https://github.com/facebook/rocksdb/pull/10489 intended to clear the error if it's a recovery flush. So ideally we can also just check flush reason. I plan to keep a safer change in this PR and make that change in the future if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11950

Test Plan: check future TSAN stress test results.

Reviewed By: anand1976

Differential Revision: D50242255

Pulled By: cbi42

fbshipit-source-id: 0d487948ef9546b038a34460f3bb037f6e5bfc58
2023-10-12 16:55:25 -07:00
Changyu Bi 648fe25bc0 Always clear files marked for compaction in `ComputeCompactionScore()` (#11946)
Summary:
We were seeing the following stress test failures:
```LevelCompactionBuilder::PickFileToCompact(const rocksdb::autovector<std::pair<int, rocksdb::FileMetaData*> >&, bool): Assertion `!level_file.second->being_compacted' failed```

This can happen when we are picking a file to be compacted from some files marked for compaction, but that file is already being_compacted. We prevent this by always calling `ComputeCompactionScore()` after we pick a compaction and mark some files as being_compacted. However, if SetOptions() is called to disable marking certain files to be compacted, say `enable_blob_garbage_collection`, we currently just skip the relevant logic in `ComputeCompactionScore()` without clearing the existing files already marked for compaction. This PR fixes this issue by already clearing these files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11946

Test Plan: existing tests.

Reviewed By: akankshamahajan15

Differential Revision: D50232608

Pulled By: cbi42

fbshipit-source-id: 11e4fb5e9d48b0f946ad33b18f7c005f0161f496
2023-10-12 15:26:10 -07:00
Peter Dillinger d010b02e86 Fix race in options taking effect (#11929)
Summary:
In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect.

To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine.

I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race.

Intended follow-up to this:
* Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929

Test Plan:
Added two mini-stress style regression tests that fail with >1% probability before this change:
DBOptionsTest::SetStatsDumpPeriodSecRace
ColumnFamilyTest::CreateAndDropPeriodicRace

I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway:
DBOptionsTest::SetStatsDumpPeriodSecRace

Reviewed By: ajkr

Differential Revision: D50024506

Pulled By: pdillinger

fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4
2023-10-12 10:05:23 -07:00
Andrew Kryczka 4bd5aa4f55 Fix two `ErrorHandler` race conditions (#11939)
Summary:
1. Prevent a double join on a `port::Thread`
2. Ensure `recovery_in_prog_` and `bg_error_` are both set under same lock hold. This is useful for writers who see a non-OK `bg_error_` and are deciding whether to stall based on whether the error will be auto-recovered.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11939

Reviewed By: cbi42

Differential Revision: D50155484

Pulled By: ajkr

fbshipit-source-id: fbc1f85c50e7eaee27ee0e376aee688d8a06c93b
2023-10-11 09:42:48 -07:00
Andrew Kryczka 77d160ef47 Consolidate `ErrorHandler`'s recovery status variables (#11937)
Summary:
cbi42 pointed out a race condition in which `recovery_io_error_` and `recovery_error_` could be updated inconsistently due to releasing the DB mutex in `EventHelpers::NotifyOnBackgroundError()`. There doesn't seem to be a point to having two status objects, so this PR consolidates them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11937

Reviewed By: cbi42

Differential Revision: D50105793

Pulled By: ajkr

fbshipit-source-id: 3de95baccfa44351a49a5c2aa0986c9bc81baa8f
2023-10-10 06:31:45 -07:00
Andrew Kryczka 8a9cfd5292 Make stopped writes block on recovery (#11879)
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.

This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879

Reviewed By: anand1976

Differential Revision: D49571196

Pulled By: ajkr

fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0
2023-10-10 06:29:01 -07:00
darionyaphet ee0829ba76 fix typo snapshto (#11817)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11817

Reviewed By: jaykorean

Differential Revision: D50103497

Pulled By: ltamasi

fbshipit-source-id: 77c5cf86ff7eb5021fc91b03225882536163af7b
2023-10-09 19:10:06 -07:00
Levi Tamasi 51d7e6a49e Clean up WriteBatchWithIndexInternal a bit (#11930)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11930

The patch cleans up and refactors the logic in/around `WriteBatchWithIndexInternal` a bit as groundwork for further changes. Specifically, the class is turned back into a stateless collection of static helpers (which is the way it was before PR 6851). Note that there were two apparent reasons for introducing this instance state in PR 6851: a) encapsulating `MergeContext` and b) resolving objects like `Logger` and `Statistics` based on a variety of handles. However, neither reason seems justified at this point. Regarding a), the `MultiGetFromBatchAndDB` logic passes in its own `MergeContext` objects via a second set of methods that do not use the member `MergeContext`. As for b), `Logger` and friends are only needed for Merge, which is only supported if a column family handle is provided; in turn, the column family handle enables us to resolve all the necessary objects without the need for any other handles like `DB` or `DBOptions`. In addition to the above, the patch changes the type of `BaseDeltaIterator::merge_result_` to `std::string` from `PinnableSlice` (since no pinning is ever done) and makes some other small code quality improvements.

Reviewed By: jaykorean

Differential Revision: D50038302

fbshipit-source-id: 5f34abe2e808bdaea0f3a8033b5764ebd446b85d
2023-10-09 15:25:35 -07:00
Peter Dillinger 1d5bddbc58 Bootstrap, pre-populate seqno_to_time_mapping (#11922)
Summary:
This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920):
* Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests.
* For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time.

Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.)

Recommended follow-up:
* Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect.
* Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.)

No release note for the minor bug fix because this is still an experimental feature with limited usage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922

Test Plan: tests added / updated

Reviewed By: jowlyzhang

Differential Revision: D49956563

Pulled By: pdillinger

fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519
2023-10-06 08:21:21 -07:00
Hui Xiao 8e949116f7 Fix comments about creation_time/oldest_ancester_time/oldest_key_time (#11921)
Summary:
Code reference for the comments change:

40b618f234/table/block_based/block_based_table_builder.cc?fbclid=IwAR0JlfnG8wysclFP5wv0fSngFbi_j32BUCKbFayeGdr10tzDhyyk5QqpclA#L2093

40b618f234/db/flush_job.cc?fbclid=IwAR1ri6eTX3wyD_2fAEBRzFSwZItcbmDS8LaB11k1letDMQmB2L8nF6TfXDs#L945-L949

40b618f234/db/compaction/compaction_job.cc (L1882-L1904)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11921

Reviewed By: cbi42

Differential Revision: D49921304

Pulled By: hx235

fbshipit-source-id: 2ae17e43c0fd52044404d7b63fea254d2d1f3595
2023-10-04 14:42:35 -07:00
Peter Dillinger 141b872bd4 Improve efficiency of create_missing_column_families, light refactor (#11920)
Summary:
In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF.

Also in this change:
* Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern
* Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible)
* Improved comments in several places

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920

Test Plan: tests added / updated

Reviewed By: jaykorean

Differential Revision: D49919147

Pulled By: pdillinger

fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1
2023-10-04 14:14:22 -07:00
akankshamahajan 97f6f475bc Fix various failures in auto_readahead_size (#11884)
Summary:
1. **Error** in TestIterateAgainstExpected API - `Assertion index < pre_read_expected_values.size() && index < post_read_expected_values.size() failed.`
 **Fix** - `Prev` op is not supported with `auto_readahead_size`. So added support to Reseek in db_iter, if Prev is called. In BlockBasedTableIterator, index_iter_ already moves forward. So there is no way to do Prev from BlockBasedTableIterator.

2. **Error** - `void rocksdb::BlockBasedTableIterator::BlockCacheLookupForReadAheadSize(uint64_t, size_t, size_t&): Assertion index_iter_->value().handle.offset() == offset`
   **Fix** - Remove prefetch_buffer to be used when uncompressed dict is read.

3. ** Error in TestPrefixScan API - `db_stress: db/db_iter.cc:369: bool rocksdb::DBIter::FindNextUserEntryInternal(bool, const rocksdb::Slice*): Assertion !skipping_saved_key || CompareKeyForSkip(ikey_.user_key, saved_key_.GetUserKey()) > 0 failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
db_stress: table/merging_iterator.cc:1036: bool rocksdb::MergingIterator::SkipNextDeleted(): Assertion comparator_->Compare(range_tombstone_iters_[i]->start_key(), pik) <= 0 failed`
  **Fix** -  SeekPrev also calls 1) SeekPrev , 2)Seek and then 3)Prev in some cases in db_iter.cc leading to failure of Prev operation. These backward operations also call Seek.  Added direction to disable lookup once direction is backwards in BlockBasedTableIterator.cc

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11884

Test Plan: Ran various flavors of crash tests locally for the whole duration

Reviewed By: anand1976

Differential Revision: D49834201

Pulled By: akankshamahajan15

fbshipit-source-id: 9a007b4d46a48002c43dc4623a400ecf47d997fe
2023-10-02 17:47:24 -07:00
Jay Huh 5fbea87859 Disallow start_time == end_time in offpeak time and compare at minute level to allow 24hr offpeak (#11911)
Summary:
Since allowing 24hr peak by setting start_time = end_time is not so intuitive, we are not going to allow it (e.g. `00:00-00:00` doesn't looks like a value that would cover 24hr.). Instead, we are going to compare at minute level (i.e. dropping the seconds to the nearest minute) so that `00:00-23:59` will cover 24hrs. The entire minute from 23:59:00 23:59:59 will be covered with this change.

Minor fixes from previous PR
- release build error
- fixed random seed in test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11911

Test Plan:
`DBOptionsTest::OffPeakTimes`
`make -j64 static_lib` to test release build issue that was fixed

Reviewed By: pdillinger

Differential Revision: D49787795

Pulled By: jaykorean

fbshipit-source-id: e8d045b95f54f61d5dd5f1bb473579f8d55c18b3
2023-10-02 16:52:39 -07:00
Andrew Kryczka 10fd05e394 Give retry flushes their own functions (#11903)
Summary:
Recovery triggers flushes for very different scenarios:

(1) `FlushReason::kErrorRecoveryRetryFlush`: a flush failed
(2) `FlushReason::kErrorRecovery`: a WAL may be corrupted
(3) `FlushReason::kCatchUpAfterErrorRecovery`: immutable memtables may have accumulated

The old code called called `FlushAllColumnFamilies()` in all cases, which uses manual flush functions: `AtomicFlushMemTables()` and `FlushMemTable()`. Forcing flushing the latest data on all CFs was useful for (2) because it ensures all CFs move past the corrupted WAL.

However, those code paths were overkill for (1) and (3), where only already-immutable memtables need to be flushed. There were conditionals to exclude some of the extraneous logic but I found there was still too much happening. For example, both of the manual flush functions enter the write thread. Entering the write thread is inconvenient because then we can't allow stalled writes to wait on a retrying flush to finish.

Instead of continuing down the path of adding more conditionals to the manual flush functions, this PR introduces a dedicated function for cases (1) and (3): `RetryFlushesForErrorRecovery()`. Also I cleaned up the manual flush functions to remove existing conditionals for these cases as they're no longer needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11903

Reviewed By: cbi42

Differential Revision: D49693812

Pulled By: ajkr

fbshipit-source-id: 7630ac539b9d6c92052c13a3cdce53256134d990
2023-10-02 16:26:24 -07:00
Levi Tamasi b00fa5597e Fix the handling of wide-column base values in the max_successive_merges logic (#11913)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913

The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator.

Reviewed By: jaykorean

Differential Revision: D49795097

fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3
2023-10-02 16:25:25 -07:00
Peter Dillinger 7bebd3036d Update tiered storage tests (ahead of next change) (#11917)
Summary:
After https://github.com/facebook/rocksdb/issues/11905, I am preparing a DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping. **Intended follow-up**

However, there are a number of test changes I want to make prior to that to make it clear that I am not regressing the tests and production behavior at the same time.

* Start mock time in the tests well beyond epoch (time 0) so that we aren't normally reaching into pre-history for current time minus the preserve/preclude duration.
* Majorly clean up BasicSeqnoToTimeMapping to avoid confusing hard-coded bounds on GetProximalTimeBeforeSeqno() results.
  * There is an unresolved/unexplained issue marked with FIXME that should be investigated when GetProximalTimeBeforeSeqno() is put into production.
* MultiCFs test was strangely generating 5 L0 files, four of which would be compacted into an L1, and then letting TTL compaction compact 1@L0+1@L1. Changing the starting time of the tests seemed to mess up the TTL compaction. But I suspect the TTL compaction was unintentional, so I've cut it down to just 4 L0 files, which compacts predictably.
* Unrelated: allow ROCKSDB_NO_STACK=1 to skip printing a stack trace on assertion failures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11917

Test Plan: no changes to production code

Reviewed By: jowlyzhang

Differential Revision: D49841436

Pulled By: pdillinger

fbshipit-source-id: 753348ace9c548e82bcb77fcc8b2ffb7a6beeb0a
2023-10-02 16:19:05 -07:00
leipeng d98a9cfb27 test: WritableFile derived class: add missing GetFileSize() override (#11726)
Summary:
Missed `GetFileSize()` forwarding , this PR fix this issue, and mark `WritableFile::GetFileSize()` as pure virtual to detect such issue in compile time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11726

Reviewed By: ajkr

Differential Revision: D49791240

Pulled By: jowlyzhang

fbshipit-source-id: ef219508d6b15c9a24df9b706a9fdc33cc6a286e
2023-09-29 15:58:08 -07:00
Jay Huh 63ed868840 Offpeak in db option (#11893)
Summary:
RocksDB's primary function is to facilitate read and write operations. Compactions, while essential for minimizing read amplifications and optimizing storage, can sometimes compete with these primary tasks. Especially during periods of high read/write traffic, it's vital to ensure that primary operations receive priority, avoiding any potential disruptions or slowdowns. Conversely, during off-peak times when traffic is minimal, it's an opportune moment to tackle low-priority tasks like TTL based compactions, optimizing resource usage.

In this PR, we are incorporating the concept of off-peak time into RocksDB by introducing `daily_offpeak_time_utc` within the DBOptions. This setting is formatted as "HH:mm-HH:mm" where the first one before "-" is the start time and the second one is the end time, inclusive. It will be later used for resource optimization in subsequent PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11893

Test Plan:
- New Unit Test Added - `DBOptionsTest::OffPeakTimes`
- Existing Unit Test Updated - `OptionsTest`, `OptionsSettableTest`

Reviewed By: pdillinger

Differential Revision: D49714553

Pulled By: jaykorean

fbshipit-source-id: fef51ea7c0fede6431c715bff116ddbb567c8752
2023-09-29 13:03:39 -07:00
Peter Dillinger 02443dd93f Refactor, clean up, fixes, and more testing for SeqnoToTimeMapping (#11905)
Summary:
This change is before a planned DBImpl change to ensure all sufficiently recent sequence numbers since Open are covered by SeqnoToTimeMapping (bug fix with existing test work-arounds). **Intended follow-up**

However, I found enough issues with SeqnoToTimeMapping to warrant this PR first, including very small fixes in DB implementation related to API contract of SeqnoToTimeMapping.

Functional fixes / changes:
* This fixes some mishandling of boundary cases. For example, if the user decides to stop writing to DB, the last written sequence number would perpetually have its write time updated to "now" and would always be ineligible for migration to cold tier. Part of the problem is that the SeqnoToTimeMapping would return a seqno known to have been written before (immediately or otherwise) the requested time, but compaction_job.cc would include that seqno in the preserve/exclude set. That is fixed (in part) by adding one in compaction_job.cc
* That problem was worse because a whole range of seqnos could be updated perpetually with new times in SeqnoToTimeMapping::Append (if no writes to DB). That logic was apparently optimized for GetOldestApproximateTime (now GetProximalTimeBeforeSeqno), which is not used in production, to the detriment of GetOldestSequenceNum (now GetProximalSeqnoBeforeTime), which is used in production. (Perhaps plans changed during development?) This is fixed in Append to optimize for accuracy of GetProximalSeqnoBeforeTime. (Unit tests added and updated.)
* Related: SeqnoToTimeMapping did not have a clear contract about the relationships between seqnos and times, just the idea of a rough correspondence. Now the class description makes it clear that the write time of each recorded seqno comes before or at the associated time, to support getting best results for GetProximalSeqnoBeforeTime. And this makes it easier to make clear the contract of each API function.
  * Update `DBImpl::RecordSeqnoToTimeMapping()` to follow this ordering in gathering samples.

Some part of these changes has required an expanded test work-around for the problem (see intended follow-up above) that the DB does not immediately ensure recent seqnos are covered by its mapping. These work-arounds will be removed with that planned work.

An apparent compaction bug is revealed in
PrecludeLastLevelTest::RangeDelsCauseFileEndpointsToOverlap, so that test is disabled. Filed GitHub issue #11909

Cosmetic / code safety things (not exhaustive):
* Fix some confusing names.
  * `seqno_time_mapping` was used inconsistently in places. Now just `seqno_to_time_mapping` to correspond to class name.
  * Rename confusing `GetOldestSequenceNum` -> `GetProximalSeqnoBeforeTime` and `GetOldestApproximateTime` -> `GetProximalTimeBeforeSeqno`. Part of the motivation is that our times and seqnos here have the same underlying type, so we want to be clear about which is expected where to avoid mixing.
  * Rename `kUnknownSeqnoTime` to `kUnknownTimeBeforeAll` because the value is a bad choice for unknown if we ever add ProximalAfterBlah functions.
  * Arithmetic on SeqnoTimePair doesn't make sense except for delta encoding, so use better names / APIs with that in mind.
  * (OMG) Don't allow direct comparison between SeqnoTimePair and SequenceNumber. (There is no checking that it isn't compared against time by accident.)
  * A field name essentially matching the containing class name is a confusing pattern (`seqno_time_mapping_`).
  * Wrap calls to confusing (but useful) upper_bound and lower_bound functions to have clearer names and more code reuse.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11905

Test Plan: GetOldestSequenceNum (now GetProximalSeqnoBeforeTime) and TruncateOldEntries were lacking unit tests, despite both being used in production (experimental feature). Added those and expanded others.

Reviewed By: jowlyzhang

Differential Revision: D49755592

Pulled By: pdillinger

fbshipit-source-id: f72a3baac74d24b963c77e538bba89a7fc8dce51
2023-09-29 11:21:59 -07:00
Levi Tamasi 6b4315ee8b Extend the test coverage of FullMergeV3 (#11896)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11896

The patch extends the test coverage of the wide column aware merge logic by adding two new tests that perform general transformations during merge by implementing the `FullMergeV3` interface. The first one uses a merge operator that produces a wide-column entity as result in all cases (i.e. even if the base value is a plain key-value, or if there is no base value). The second one uses a merge operator that results in a plain key-value in all cases.

Reviewed By: jaykorean

Differential Revision: D49665946

fbshipit-source-id: 419b9e557c064525b659685eb8c09ae446656439
2023-09-27 14:53:25 -07:00
Yu Zhang 7ea6e724fa Mark recovery_in_prog_ to false whenever recovery thread joins (#11890)
Summary:
Make the `RecoverFromRetryableBGIOError` function always mark `recovery_in_prog_` to false when it returns.
Otherwise, in below code snippet, when db closes and the `error_handler_.CancelErrorRecovery()` call successfully joined the recovery thread, the immediately following while loop will incorrectly think the error recovery is still in progress and loops in `bg_cv_.Wait()`.

1c871a4d86/db/db_impl/db_impl.cc (L542-L545)

This is the issue https://github.com/facebook/rocksdb/issues/11440

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11890

Reviewed By: anand1976

Differential Revision: D49624216

Pulled By: jowlyzhang

fbshipit-source-id: ee10cf6527d95b8dd4705a326eb6208d741fe002
2023-09-25 20:15:40 -07:00
Changyu Bi 1c871a4d86 Only flush after recovery for retryable IOError (#11880)
Summary:
https://github.com/facebook/rocksdb/issues/11872 causes a unit test to start failing with the error message below. The cause is that the additional call to `FlushAllColumnFamilies()` in `DBImpl::ResumeImpl()` can run while DB is closing. More detailed explanation: there are two places where we call `ResumeImpl()`:

1. in `ErrorHandler::RecoverFromBGError`, for manual resume or recovery from errors like OutOfSpace through sst file manager, and
2. in `Errorhandler::RecoverFromRetryableBGIOError`, for error recovery from errors like flush failure due to retryable IOError. This is tracked by `ErrorHandler::recovery_thread_`.

Here is how DB close waits for error recovery: 49da91ec09/db/db_impl/db_impl.cc (L540-L543)

`CancelErrorRecovery()` waits until `recovery_thread_` finishes and `IsRecoveryInProgress()` checks the `recovery_in_prog_` flag. The additional call to `FlushAllColumnFamilies()` in `ResumeImpl()` happens after it clears bg error and the `recovery_in_prog_` flag: 49da91ec09/db/db_impl/db_impl.cc (L436-L463). So if `ResumeImpl()` is called in `RecoverFromBGError()`, we can have a thread running `FlushAllColumnFamilies()` while DB is closing and thought that recovery is done.

The fix is to only do the additional call to `FlushAllColumnFamilies()` when doing error recovery through `Errorhandler::RecoverFromRetryableBGIOError` by setting flags in `DBRecoverContext`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11880

Test Plan:
`gtest-parallel --repeat=100 --workers=4 ./error_handler_fs_test --gtest_filter="*AutoRecoverFlushError*"` reproduces the error pretty reliably.

```[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.AutoRecoverFlushError
error_handler_fs_test: db/column_family.cc:1618: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
Received signal 6 (Aborted)
...
https://github.com/facebook/rocksdb/issues/10 0x00007fac4409efd6 in __GI___assert_fail (assertion=0x7fac452c0afa "last_ref", file=0x7fac452c9fb5 "db/column_family.cc", line=1618, function=0x7fac452cb950 "rocksdb::ColumnFamilySet::~ColumnFamilySet()") at assert.c:101
101     in assert.c
https://github.com/facebook/rocksdb/issues/11 0x00007fac44b5324f in rocksdb::ColumnFamilySet::~ColumnFamilySet (this=0x7b5400000000) at db/column_family.cc:1618
1618        assert(last_ref);
https://github.com/facebook/rocksdb/issues/12 0x00007fac44e0f047 in std::default_delete<rocksdb::ColumnFamilySet>::operator() (this=0x7b5800000940, __ptr=0x7b5400000000) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85              delete __ptr;
https://github.com/facebook/rocksdb/issues/13 std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182               _M_deleter()(__old_p);
https://github.com/facebook/rocksdb/issues/14 std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456             _M_t.reset(std::move(__p));
https://github.com/facebook/rocksdb/issues/15 rocksdb::VersionSet::~VersionSet (this=this@entry=0x7b5800000900) at db/version_set.cc:5081
5081      column_family_set_.reset();
https://github.com/facebook/rocksdb/issues/16 0x00007fac44e0f97a in rocksdb::VersionSet::~VersionSet (this=0x7b5800000900) at db/version_set.cc:5078
5078    VersionSet::~VersionSet() {
https://github.com/facebook/rocksdb/issues/17 0x00007fac44bf0b2f in std::default_delete<rocksdb::VersionSet>::operator() (this=0x7b8c00000068, __ptr=0x7b5800000900) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85              delete __ptr;
https://github.com/facebook/rocksdb/issues/18 std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182               _M_deleter()(__old_p);
https://github.com/facebook/rocksdb/issues/19 std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456             _M_t.reset(std::move(__p));
https://github.com/facebook/rocksdb/issues/20 rocksdb::DBImpl::CloseHelper (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:676
676       versions_.reset();
https://github.com/facebook/rocksdb/issues/21 0x00007fac44bf1346 in rocksdb::DBImpl::CloseImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:720
720     Status DBImpl::CloseImpl() { return CloseHelper(); }
https://github.com/facebook/rocksdb/issues/22 rocksdb::DBImpl::~DBImpl (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:738
738       closing_status_ = CloseImpl();
https://github.com/facebook/rocksdb/issues/23 0x00007fac44bf2bba in rocksdb::DBImpl::~DBImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:722
722     DBImpl::~DBImpl() {
https://github.com/facebook/rocksdb/issues/24 0x00007fac455444d4 in rocksdb::DBTestBase::Close (this=this@entry=0x7b6c00000000) at db/db_test_util.cc:678
678       delete db_;
https://github.com/facebook/rocksdb/issues/25 0x00007fac455455fb in rocksdb::DBTestBase::TryReopen (this=this@entry=0x7b6c00000000, options=...) at db/db_test_util.cc:707
707       Close();
https://github.com/facebook/rocksdb/issues/26 0x00007fac45543459 in rocksdb::DBTestBase::Reopen (this=0x7ffed74b79a0, options=...) at db/db_test_util.cc:670
670       ASSERT_OK(TryReopen(options));
https://github.com/facebook/rocksdb/issues/27 0x00000000004f2522 in rocksdb::DBErrorHandlingFSTest_AutoRecoverFlushError_Test::TestBody (this=this@entry=0x7b6c00000000) at db/error_handler_fs_test.cc:1224
1224      Reopen(options);
```

Reviewed By: jowlyzhang

Differential Revision: D49579701

Pulled By: cbi42

fbshipit-source-id: 3fc8325e6dde7e7faa8bcad95060cb4e26eda638
2023-09-25 09:34:39 -07:00
Changyu Bi 0086809601 Fix a bug with atomic_flush that causes DB to stuck after a flush failure (#11872)
Summary:
With atomic_flush=true, a flush job with younger memtables wait for older memtables to be installed before install its memtables. If the flush for older memtables failed, auto-recovery starts a resume thread which can becomes stuck waiting for all background work to finish (including the flush for younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to background error but never rollback its memtable: 269478ee46/db/db_impl/db_impl_compaction_flush.cc (L725) This prevents any future flush to pick old memtables.

A more detailed repro is in unit test.

This PR fixes this issue by
1. Ensure we rollback memtables if an atomic flush fails due to background error
2. When there is a background error, abort atomic flushes that are waiting for older memtables to be installed
3. Do not schedule non-recovery flushes when there is a background error that stops background work

There was another issue with atomic_flush=true where DB can hang during DB close, see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11872

Test Plan: new unit test.

Reviewed By: jowlyzhang

Differential Revision: D49556867

Pulled By: cbi42

fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f
2023-09-22 16:43:50 -07:00
Levi Tamasi 12d9386a4f Return a special OK status when the number of merge operands exceeds a threshold (#11870)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11870

Having a large number of merge operands applied at query time can have a significant effect on performance; therefore, applications might want limit the number of deltas for any given key. However, there is currently no way to establish the number of operands for certain types of queries. The ticker `READ_NUM_MERGE_OPERANDS` only provides aggregate (not per-read) information. The `PerfContext` counters `internal_merge_count` and `internal_merge_point_lookup_count` can be used to get this information on a per-query basis for iterators and single point lookups; however, there is no per-key breakdown for `MultiGet` type APIs. The patch addresses this issue by introducing a special kind of OK status which signals that an application-defined threshold on the number of merge operands has been exceeded for a given key. The threshold can be specified on a per-query basis using a new field in `ReadOptions`.

Reviewed By: jaykorean

Differential Revision: D49522786

fbshipit-source-id: 4265b3848d1be5ff313a3e8fb604ddf56411dd2c
2023-09-22 13:49:19 -07:00
anand76 269478ee46 Support compressed and local flash secondary cache stacking (#11812)
Summary:
This PR implements support for a three tier cache - primary block cache, compressed secondary cache, and a nvm (local flash) secondary cache. This allows more effective utilization of the nvm cache, and minimizes the number of reads from local flash by caching compressed blocks in the compressed secondary cache.

The basic design is as follows -
1. A new secondary cache implementation, ```TieredSecondaryCache```, is introduced. It keeps the compressed and nvm secondary caches and manages the movement of blocks between them and the primary block cache. To setup a three tier cache, we allocate a ```CacheWithSecondaryAdapter```, with a ```TieredSecondaryCache``` instance as the secondary cache.
2. The table reader passes both the uncompressed and compressed block to ```FullTypedCacheInterface::InsertFull```, allowing the block cache to optionally store the compressed block.
3. When there's a miss, the block object is constructed and inserted in the primary cache, and the compressed block is inserted into the nvm cache by calling ```InsertSaved```. This avoids the overhead of recompressing the block, as well as avoiding putting more memory pressure on the compressed secondary cache.
4. When there's a hit in the nvm cache, we attempt to insert the block in the compressed secondary cache and the primary cache, subject to the admission policy of those caches (i.e admit on second access). Blocks/items evicted from any tier are simply discarded.

We can easily implement additional admission policies if desired.

Todo (In a subsequent PR):
1. Add to db_bench and run benchmarks
2. Add to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11812

Reviewed By: pdillinger

Differential Revision: D49461842

Pulled By: anand1976

fbshipit-source-id: b40ac1330ef7cd8c12efa0a3ca75128e602e3a0b
2023-09-21 20:30:53 -07:00
Changyu Bi b927ba5936 Rollback other pending memtable flushes when a flush fails (#11865)
Summary:
when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence events happen:
```
Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush 1 returns OK, but don't install to MANIFEST and let whoever flushes M0 to take care of it
Flush0 finishes with a retryable IOError, it rollbacks M0, (incorrectly) does not rollback M1, and deletes SST0 and SST1
Starts Flush2 for M0, it does not pick up M1 since it thought M1 is flushed
Flush2 writes SST2 and finishes OK, tries to install SST2 and SST1
Error opening SST1 since it's already deleted with an  error message like the following:

IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory
```

This happens since:
1. We currently only rollback the memtables that we are flushing in a flush job when atomic_flush=false.
2. Pending output SSTs from previous flushes are deleted since a pending file number is released whenever a flush job is finished no matter of flush status: f42e70bf56/db/db_impl/db_impl_compaction_flush.cc (L3161)

This PR fixes the issue by rollback these pending flushes.

There is another issue where if a new flush for new memtable starts and finishes after Flush0 finishes. Its output may also be deleted (see more in unit test). It is fixed by checking bg error status before installing a memtable result, and rollback if there is an error.

There is a more efficient fix where we just don't release the pending file output number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file. With the fix in this PR, we can end up with a giant file if a lot of memtables are being flushed together. However, the more efficient fix is a bit more complicated to implement (requires associating such pending file numbers with flush job/memtables) and is more risky since it changes normal flush code path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11865

Test Plan: * Added repro unit tests.

Reviewed By: anand1976

Differential Revision: D49484922

Pulled By: cbi42

fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
2023-09-21 15:31:29 -07:00
Yu Zhang 32fc1e6cdc Add unit test for the multiget fix when ReadOptions.read_tier == kPersistedTier and disableWAL == true (#11854)
Summary:
Add unit tests for the fix in https://github.com/facebook/rocksdb/pull/11700

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11854

Reviewed By: anand1976

Differential Revision: D49392462

Pulled By: jowlyzhang

fbshipit-source-id: bd6978e4888074fa5417f3ccda7a78a2c7eee9c6
2023-09-21 14:59:58 -07:00
Peter (Stig) Edwards bf488c3052 Use *next_sequence -1 here (#11861)
Summary:
To fix off-by-one error:   Transaction could not check for conflicts for operation at SequenceNumber 500000 as the MemTable only contains changes newer than SequenceNumber 500001.

Fixes https://github.com/facebook/rocksdb/issues/11822

I think introduced in a657ee9a9c

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11861

Reviewed By: pdillinger

Differential Revision: D49457273

Pulled By: ajkr

fbshipit-source-id: b527cbae4ecc7874633a11f07027adee62940d74
2023-09-21 13:52:01 -07:00
Hui Xiao 089070cb36 Expose more info about input files in `CompactionFilter::Context` (#11857)
Summary:
**Context:**
As requested, lowest level as well as a map from input file to its table properties among all input files used in table creation  (if any) are exposed in `CompactionFilter::Context`.

**Summary:**
This PR contains two commits:
(1) [Refactory](0012777f0e) to make resonating/using what is in `Compaction:: table_properties_` easier
- Separate `Compaction:: table_properties_` into `Compaction:: input_table_properties_` and `Compaction:: output_table_properties_`
- Separate the "set input table properties" logic into `Compaction:: SetInputTableProperties()`) from `Compaction:: GetInputTableProperties`
- Call `Compaction:: SetInputTableProperties()` as soon as possible, which is right after `Compaction::SetInputVersion()`. Bundle these two functions into one `Compaction::FinalizeInputInfo()` to minimize missing one or the other

(2) [Expose more info about input files:](6093e7dfba) `CompactionFilter::Context::input_start_level/input_table_properties`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11857

Test Plan:
- Modify existing UT `
TEST_F(DBTestCompactionFilter, CompactionFilterContextManual)` to cover new logics

Reviewed By: ajkr

Differential Revision: D49402540

Pulled By: hx235

fbshipit-source-id: 469fff50fa0e5964ffa5ea8db0743f61438ea392
2023-09-20 13:34:39 -07:00
chuhao zeng 8acf17002a Fix row cache falsely return kNotFound when timestamp enabled (#11816)
Summary:
**Summary:**
When row cache hits and a timestamp is being set in read_options, even though ROW_CACHE entry is hit, the return status is kNotFound.

**Cause of error:**
If timestamp is provided in readoptions, a callback for sequence number checking is registered [here](8fc78a3a9e/db/db_impl/db_impl.cc (L2112)).

Hence the default value set at this [line](694e49cbb1/table/get_context.cc (L611)) prevents get_context from saving value found in cache. Causing the final status to be kNotFound even though the entry exist in both cache and SST file.

**Proposed Solution**
Row cache key contains a sequence number in it. If the key for row cache lookup matches the key in cache, this cache entry should be good to be exposed to user and hence we reuse the sequence number in cache key rather than passing kMaxSequenceNumber.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11816

Reviewed By: ajkr

Differential Revision: D49419029

Pulled By: jowlyzhang

fbshipit-source-id: 6c77e9e751628d7d8e6c389f299e29a11ea824c6
2023-09-20 11:34:38 -07:00
Levi Tamasi 51b3b7e08c Remove a now-unnecessary WideColumnSerialization::Serialize variant (#11864)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11864

Reviewed By: jaykorean

Differential Revision: D49445163

fbshipit-source-id: a1275112b9f87a3652df5f155bd215db5819326c
2023-09-20 08:04:35 -07:00
Levi Tamasi f42e70bf56 Integrate FullMergeV3 into the query and compaction paths (#11858)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11858

The patch builds on https://github.com/facebook/rocksdb/pull/11807 and integrates the `FullMergeV3` API into the read and compaction code paths by updating and extending the logic in `MergeHelper`.

In particular, when it comes to merge inputs, the existing `TimedFullMergeWithEntity` is folded into `TimedFullMerge`, since wide-column base values are now handled the same way as plain base values (or no base values for that matter), e.g. they are passed directly to the `MergeOperator`. On the other hand, there is some new differentiation on the output side. Namely, there are now two sets of `TimedFullMerge` variants: one set for contexts where the complete merge result and its value type are needed (used by iterators and compactions), and another set where the merge result is needed in a form determined by the client (used by the point lookup APIs, where e.g. for `Get` we have to extract the value of the default column of any wide-column results).

Implementation-wise, the two sets of overloads use different visitors to process the `std::variant` produced by `FullMergeV3`. This has the benefit of eliminating some repeated code e.g. in the point lookup paths, since `TimedFullMerge` now populates the application's result object (`PinnableSlice`/`string` or `PinnableWideColumns`) directly. Moreover, within each set of variants, there is a separate overload for the no base value/plain base value/wide-column base value cases, which eliminates some repeated branching w/r/t to the type of the base value if any.

Reviewed By: jaykorean

Differential Revision: D49352562

fbshipit-source-id: c2fb9853dba3fbbc6918665bde4195c4ea150a0c
2023-09-19 17:27:04 -07:00
Changyu Bi cc254efea6 Release compaction files in manifest write callback (#11764)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/10257 (also see [here](https://github.com/facebook/rocksdb/pull/10355#issuecomment-1684308556)) by releasing compaction files earlier when writing to manifest in LogAndApply().  This is done by passing in a [callback](ba59751430/db/version_set.h (L1199)) to LogAndApply(). The new Version is created in the same critical section where compaction files are released. When compaction picker is picking compaction based on the new version, these compaction files will already be released.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11764

Test Plan:
* Existing unit tests
* A repro unit test to validate that compaction files are released: `./db_compaction_test --gtest_filter=DBCompactionTest.ReleaseCompactionDuringManifestWrite`
* `python3 ./tools/db_crashtest.py --simple whitebox` with some assertions to check compaction files are released

Reviewed By: ajkr

Differential Revision: D48742152

Pulled By: cbi42

fbshipit-source-id: 7560fd0e723a63fe692234015d2b96850f8b5d77
2023-09-18 13:11:53 -07:00
dengyan 0dac75d542 Fix a bug in MultiGet when skip_memtable is true (#11700)
Summary:
When skip_memtable is true in MultiGetImpl, The lookup_current is always false, Causes data to be unable to be queried in super_version->current。

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11700

Reviewed By: anand1976

Differential Revision: D49342877

Pulled By: jowlyzhang

fbshipit-source-id: 270a36d049b4cb7fd151a1fa3080300310111271
2023-09-18 12:06:58 -07:00
Changyu Bi 6997a06c63 Invalidate threadlocal SV before incrementing `super_version_number_` (#11848)
Summary:
CI has been hitting assertion error like
```
https://github.com/facebook/rocksdb/issues/8  0x00007fafd9294fd6 in __GI___assert_fail (assertion=assertion@entry=0x7fafda270300 "!*memtable_range_tombstone_iter_ || sv_number_ != cfd_->GetSuperVersionNumber()", file=file@entry=0x7fafda270350 "db/arena_wrapped_db_iter.cc", line=line@entry=124, function=function@entry=0x7fafda270288 "virtual rocksdb::Status rocksdb::ArenaWrappedDBIter::Refresh(const rocksdb::Snapshot*)") at assert.c:101
```

This is due to
* Iterator::Refresh() passing in `cur_sv_number` instead of `sv->version_number` here: 1c6faf3587/db/arena_wrapped_db_iter.cc (L94-L96)

* `super_version_number_` can be incremented before thread local SV is installed: https://github.com/facebook/rocksdb/blob/main/db/column_family.cc#L1287-L1306

* The optimization in https://github.com/facebook/rocksdb/issues/11452 removed the check for SV number, such that `cur_sv_number > sv.version_number` is possible in the following code.
```
uint64_t cur_sv_number = cfd_->GetSuperVersionNumber();
SuperVersion* sv = cfd_->GetReferencedSuperVersion(db_impl_);
```

Not sure why assertion only started failing after https://github.com/facebook/rocksdb/issues/10594, maybe it's because Refresh() is called more often in stress test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11848

Test Plan:
* This repros hits the assertion pretty consistently before this change:
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_one_in=0 --block_size=16384 --bloom_bits=0.7161318870366848 --cache_index_and_filter_blocks=0 --cache_size=8388608 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=none --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=30 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=500000 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=30 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35 --use_io_uring=0 --db=/tmp/rocksdb_crashtest_blackboxnf3pyv_0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_6opy9nqg
```

Reviewed By: ajkr

Differential Revision: D49344066

Pulled By: cbi42

fbshipit-source-id: d5373ddb48d933acb42a5dd8fae3f3019b0241e5
2023-09-18 09:37:40 -07:00
Niklas Fiekas 1a9b42bbdd Add C API for ReadOptions::auto_readahead_size (#11837)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11837

Reviewed By: cbi42

Differential Revision: D49303447

Pulled By: ajkr

fbshipit-source-id: 7debf722339f4fd551760ef8d6801b7a41498565
2023-09-17 19:51:28 -07:00
Peter Dillinger 1c6faf3587 Make RibbonFilterPolicy::bloom_before_level mutable (SetOptions()) (#11838)
Summary:
An internal user wants to be able to dynamically switch between Bloom and Ribbon filters, without a custom FilterPolicy. Making `filter_policy` mutable would actually make issue https://github.com/facebook/rocksdb/issues/10079 worse, because it would be a race on a pointer field, not just on scalars.

As a reasonable compromise until that is fixed, I am enabling dynamic control over Bloom vs. Ribbon choice by making
RibbonFilterPolicy::bloom_before_level mutable, and doing that safely by using an atomic.

I've also slightly tweaked the interpretation of that field so that setting it to INT_MAX really means "always Bloom."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11838

Test Plan: unit tests added/extended. crash test updated for SetOptions call and tested under TSAN with amplified probability (lower set_options_one_in).

Reviewed By: ajkr

Differential Revision: D49296284

Pulled By: pdillinger

fbshipit-source-id: e4251c077510df9a9c719876f482448c0d15402a
2023-09-15 15:46:10 -07:00
Jay Huh cff6490bc4 Add IOActivity.kMultiGetEntity (#11842)
Summary:
- As a follow up from https://github.com/facebook/rocksdb/issues/11799, adding `Env::IOActivity::kMultiGetEntity` support to `DBImpl::MultiGetEntity()`.

## Minor Refactor
- Because both `DBImpl::MultiGet()` and `DBImpl::MultiGetEntity()` call `DBImpl::MultiGetCommon()` which later calls `DBImpl::MultiGetWithCallback()` where we check `Env::IOActivity::kMultiGet`, minor refactor was needed so that we don't check `Env::IOActivity::kMultiGet` for  `DBImpl::MultiGetEntity()`.
- I still see more areas for refactoring to avoid duplicate code of checking IOActivity and setting it when Unknown, but this will be addressed separately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11842

Test Plan:
- Added the `ThreadStatus::OperationType::OP_MULTIGETENTITY` in `db_stress` to verify the pass-down IOActivity in a thread aligns with the actual activity the thread is doing.
```
python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
```

Reviewed By: ltamasi

Differential Revision: D49329575

Pulled By: jaykorean

fbshipit-source-id: 05198f1d3f92e6be3d42a3d184bacb3ab2ce6923
2023-09-15 15:34:04 -07:00
leipeng 68ce5d84f6 Add new Iterator API Refresh(const snapshot*) (#10594)
Summary:
This PR resolves https://github.com/facebook/rocksdb/issues/10487 & https://github.com/facebook/rocksdb/issues/10536, user code needs to call Refresh() periodically.

The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is `db_iter_->set_sequence(read_seq)  -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno()`.

This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10594

Test Plan:
* New unit tests.
* Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case.

TODO in a following PR:
* Stress test Iterator::Refresh() to different snapshots or no snapshot.

Reviewed By: ajkr

Differential Revision: D48456896

Pulled By: cbi42

fbshipit-source-id: 2e642c04e91235cc9542ef4cd37b3c20823bd779
2023-09-15 10:44:43 -07:00
Hui Xiao ed913513bd Fix a bug of rocksdb.file.read.verify.file.checksums.micros not being populated (#11836)
Summary:
**Context/Summary:**
`rocksdb.file.read.verify.file.checksums.micros ` was added in https://github.com/facebook/rocksdb/pull/11444 but the related path was not populated with statistics and clock object correctly so the actual statistics collection didn't happen. This PR fixed it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11836

Test Plan:
Setup:
```
./db_bench --benchmarks="fillrandom" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb
```
Run:
```
./db_bench --use_existing_db=1  --benchmarks="verifyfilechecksums" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb --statistics=1 --stats_level=4
```
Post-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 9.000000 P95 : 9.000000 P99 : 9.000000 P100 : 9.000000 COUNT : 1 SUM : 9
```

Pre-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

Reviewed By: ajkr

Differential Revision: D49293378

Pulled By: hx235

fbshipit-source-id: 1acd8b828c28e088d0c5d63897f53cd180b82f42
2023-09-15 10:36:14 -07:00
Yu Zhang e1fd348b92 Fix a bug in multiget for cleaning up SuperVersion (#11830)
Summary:
When `MultiGet` acquires `SuperVersion` via locking the db mutex and get the current `ColumnFamilyData::super_version_`, its corresponding cleanup logic is not correctly done.

It's currently doing this:
`MultiGetColumnFamilyData::cfd->GetSuperVersion().Unref()`

This operates on the most recent `SuperVersion` without locking db mutex , which is not thread safe by itself. And this unref operation is intended for the originally acquired `SuperVersion` instead of the current one. Because a race condition could happen where a new `SuperVersion` is installed in between this `MultiGet`'s ref and unref. When this race condition does happen, it's not sufficient to just unref the `SuperVersion`, `DBImpl::CleanupSuperVersion` should be called instead to properly clean up the `SuperVersion` had this `MultiGet` call be its last reference holder.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11830

Test Plan:
`make all check`
Added a unit test that would originally fail

Reviewed By: ltamasi

Differential Revision: D49287715

Pulled By: jowlyzhang

fbshipit-source-id: 8353636ee11b2e90d85c677a96a92360072644b0
2023-09-15 09:50:39 -07:00
Jay Huh f2b623bcc1 GetEntity Support for ReadOnlyDB and SecondaryDB (#11799)
Summary:
`GetEntity` API support for ReadOnly DB and Secondary DB.
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_readonly` and refactored current `Get()` logic into `GetImpl()` so that look up logic can be reused for `GetEntity()` (Following the same pattern as `DBImpl::Get()` and `DBImpl::GetEntity()`)
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_secondary` and refactored current `GetImpl()` logic. This is to make `DBImplSecondary::Get/GetEntity` consistent with `DBImpl::Get/GetEntity` and `DBImplReadOnly::Get/GetEntity`
- `GetImpl()` in `db_impl` is now virtual. both `db_impl_readonly` and `db_impl_secondary`'s `Get()` override are no longer needed since all three dbs now have the same `Get()` which calls `GetImpl()` internally.
- `GetImpl()` in `DBImplReadOnly` and `DBImplSecondary` now pass in `columns` instead of `nullptr` in lookup functions like `memtable->get()`
- Introduced `GetEntity()` API in `DBImplReadOnly` and `DBImplSecondary` which simply calls `GetImpl()` with `columns` set in `GetImplOptions`.
- Introduced `Env::IOActivity::kGetEntity` and set read_options.io_activity to `Env::IOActivity::kGetEntity` for `GetEntity()` operations (in db_impl)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11799

Test Plan:
**Unit Tests**
- Added verification in `DBWideBasicTest::PutEntity` by Reopening DB as ReadOnly with the same setup.
- Added verification in `DBSecondaryTest::ReopenAsSecondary` by calling `PutEntity()` and `GetEntity()` on top of existing `Put()` and `Get()`
- `make -j64 check`

**Crash Tests**
- `python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`
- `python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 `
- `python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`

Reviewed By: ltamasi

Differential Revision: D49037040

Pulled By: jaykorean

fbshipit-source-id: a0648253ded6e91af7953de364ed3c6bf163626b
2023-09-15 08:30:44 -07:00
马越 3c27f56d0b Fix the problem that some keys of ClipColumnFamily may not be deleted (#11811)
Summary:
When executing ClipColumnFamily, if end_key is equal to largest_user_key in a file, this key will not be deleted. So we need to change less than to less than or equal to

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11811

Reviewed By: ajkr

Differential Revision: D49206936

Pulled By: cbi42

fbshipit-source-id: 3e8bcb7b52040a9b4d1176de727616cc298d3445
2023-09-14 13:36:39 -07:00
Jonah Gao 84d335b619 Remove an unused variable: `last_stats_dump_time_microsec_` (#11824)
Summary:
`last_stats_dump_time_microsec_`  is not used after initialization.

I guess that it was previously used to implement periodically dumping stats,
but this functionality has now been delegated to the `PeriodicTaskScheduler`.
4b79e8c003/db/db_impl/db_impl.cc (L770-L778)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11824

Reviewed By: cbi42

Differential Revision: D49278311

Pulled By: jowlyzhang

fbshipit-source-id: 5856245580afc026e6b490755a45c5436a2375c9
2023-09-14 11:25:33 -07:00
Yu Zhang 39a4ff2cab Track full_history_ts_low per SuperVersion (#11784)
Summary:
As discussed in https://github.com/facebook/rocksdb/issues/11730 , this PR tracks the effective `full_history_ts_low` per SuperVersion and update existing sanity checks for `ReadOptions.timestamp >= full_history_ts_low` to use this per SuperVersion `full_history_ts_low` instead. This also means the check is moved to happen after acquiring SuperVersion.

There are two motivations for this: 1) Each time `full_history_ts_low` really come into effect to collapse history, a new SuperVersion is always installed, because it would involve either a Flush or Compaction, both of which change the LSM tree shape. We can take advantage of this to ensure that as long as this sanity check is passed, even if `full_history_ts_low` can be concurrently increased and collapse some history above the requested `ReadOptions.timestamp`, a read request won’t have visibility to that part of history through this SuperVersion that it already acquired.  2) the existing sanity check uses `ColumnFamilyData::GetFullHistoryTsLow` without locking the db mutex, which is the mutex all `IncreaseFullHistoryTsLow` operation is using when mutating this field. So there is a race condition. This also solve the race condition on the read path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11784

Test Plan:
`make all check`

// Checks success scenario really provide the read consistency attribute as mentioned above.
`./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckPassReadIsConsistent*`

// Checks failure scenario cleans up SuperVersion properly.
`./db_with_timestamp_basic_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*`
`./db_secondary_test --gtest_filter=*FullHistoryTsLowSanityCheckFail*`
`./db_readonly_with_timestamp_test --gtest_filter=*FullHistoryTsLowSanitchCheckFail*`

Reviewed By: ltamasi

Differential Revision: D48894795

Pulled By: jowlyzhang

fbshipit-source-id: 1f801fe8e1bc8e63ca76c03cbdbd0974e5ff5bf6
2023-09-13 16:34:18 -07:00
Changyu Bi 3285ba7a29 Fix unit test tsan failure (#11828)
Summary:
The test DBCompactionWaitForCompactTest.WaitForCompactWithOptionToFlushAndCloseDB failed tsan in https://app.circleci.com/pipelines/github/facebook/rocksdb/32009/workflows/577e4e1f-a909-4e80-8ef4-af98b5ff7446/jobs/660989. I cannot repro locally, but this should be the fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11828

Test Plan: `./db_compaction_test --gtest_filter="*DBCompactionWaitForCompactTest/DBCompactionWaitForCompactTest.*"`

Reviewed By: jaykorean

Differential Revision: D49241904

Pulled By: cbi42

fbshipit-source-id: 68714c836d982dcb3946da104533d5c0594980de
2023-09-13 15:53:05 -07:00
Hui Xiao ef3e289b2d Conditionally exclude some L0 input files in size amp compaction (#11749)
Summary:
**Context/Summary:**
A size amp compaction can select and prevent a large number of L0 files from being selected by other compaction. If such compaction is running long or being queued behind, these L0 files will exist for long. With a few more flushes, we can run into write stop triggered by # L0 files. We've seen this happen on a host with many DBs sharing same thread pool, each of these DBs submits a size amp compaction with (110-180)+ files to the pool upon reopen and with a few more flushes, they hit the 200 L0 write stop condition.

The idea is to exclude some L0 input files in size amp compaction that are harmless to size amp reduction but improve the situation described above.

The exclusion algorithm is in `MightExcludeNewL0sToReduceWriteStop()` with two elements:

1.  #L0 to exclude + (level0_stop_writes_trigger - num_l0_input_pre_exclusion) should be in the range of [min_merge_width, max_merge_width].
    -  This is to ensure we are excluding enough L0 input files but not too many to be qualified to picked for another compaction along with the incoming future L0 files before write stop.
2. Based on (1), further constrain #L0 to exclude based on the post-exclusion compaction score. The goal is to ensure our exclusion will not disqualify the size amp compaction from being a size amp compaction after exclusion.

**Tets plan:** New unit test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11749

Reviewed By: ajkr

Differential Revision: D48850631

Pulled By: hx235

fbshipit-source-id: 2c321036e164087c36319dd5645cbbf6b6152092
2023-09-12 15:53:15 -07:00
Changyu Bi 9d71682d1b Add statistics `COMPACTION_CPU_TOTAL_TIME` for total compaction time (#11741)
Summary:
Existing compaction statistics are `COMPACTION_TIME` and `COMPACTION_CPU_TIME` which are histogram and are logged at the end of a compaction. The new statistics `COMPACTION_CPU_TOTAL_TIME` is for cumulative total compaction time which is updated regularly during a compaction. This allows user to more closely track compaction cpu usage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11741

Test Plan: * new unit test `DBTestWithParam.CompactionTotalTimeTest`

Reviewed By: ajkr

Differential Revision: D48608094

Pulled By: cbi42

fbshipit-source-id: b597109f3e4bf2237fb5a216b6fd036e5363b4c0
2023-09-12 15:48:36 -07:00
Levi Tamasi 1e63fc9925 Add a helper method WideColumnsHelper::SortColumns (#11823)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11823

Similarly to https://github.com/facebook/rocksdb/pull/11813, the patch is a small refactoring that eliminates some copy-paste around sorting the columns of entities by column name.

Reviewed By: jaykorean

Differential Revision: D49195504

fbshipit-source-id: d48c9f290e3203f838cc5949856c469ecf730008
2023-09-12 12:36:07 -07:00
Hui Xiao 4b123f3a54 Change file size related variables type to uint64_t in PickCompactionToReduceSizeAmp() (#11814)
Summary:
**Context/Summary:**
size_t is not most likely not needed as SortedRun::size/compensated_file_size is uint64_t. This is a pre-requisite to addressing https://github.com/facebook/rocksdb/pull/11749/files#r1321828933.  Other places already uses uint64_t e.g, https://github.com/facebook/rocksdb/blob/8.6.fb/db/compaction/compaction_picker_universal.cc#L349-L353

**Test**
CI

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11814

Reviewed By: ajkr

Differential Revision: D49169155

Pulled By: hx235

fbshipit-source-id: 2b3ad70e6f18aa360e94ed8907c8534ad2797e62
2023-09-12 10:00:57 -07:00
Levi Tamasi 8fc78a3a9e Add helper methods WideColumnsHelper::{Has,Get}DefaultColumn (#11813)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11813

The patch adds a couple of helper methods `WideColumnsHelper::{Has,Get}DefaultColumn` to eliminate some code duplication.

Reviewed By: jaykorean

Differential Revision: D49166682

fbshipit-source-id: f229ca5b94599f7445a0112b10f8317292505c82
2023-09-11 16:32:32 -07:00
Levi Tamasi 760ea373a8 Introduce a wide column aware MergeOperator API (#11807)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11807

For now, RocksDB has limited support for using merge with wide columns: when a bunch of merge operands have to be applied to a wide-column base value, RocksDB currently passes only the value of the default column to the application's `MergeOperator`, which means there is no way to update any other columns during a merge. As a first step in making this more general, the patch adds a new API `FullMergeV3` to `MergeOperator`.

`FullMergeV3`'s interface enables applications to receive a plain, wide-column, or non-existent base value as merge input, and to produce a new plain value, a new wide-column value, or an existing operand as merge output. Note that there are no limitations on the column names and values if the merge result is a wide-column entity. Also, the interface is general in the sense that it makes it possible e.g. for a merge that takes a plain base value and some deltas to produce a wide-column entity as a result.

For backward compatibility, the default implementation of `FullMergeV3` falls back to `FullMergeV2` and implements the current logic where merge operands are applied to the default column of the base entity and any other columns are unchanged. (Note that with `FullMergeV3` in the `MergeOperator` interface, this behavior will become customizable.)

This patch just introduces the new API and the default backward compatible implementation. I plan to integrate `FullMergeV3` into the query and compaction logic in subsequent diffs.

Reviewed By: jaykorean

Differential Revision: D49117253

fbshipit-source-id: 109e016f25cd130fc504790818d927bae7fec6bd
2023-09-11 12:13:58 -07:00
Changyu Bi 195f35c08b Add a unit test for the fix in #11786 (#11790)
Summary:
Tests a scenario where range tombstone reseek used to cause MergingIterator to discard non-ok status.

Ran on main without https://github.com/facebook/rocksdb/issues/11786:
```
./db_range_del_test --gtest_filter="*RangeDelReseekAfterFileReadError*"
Note: Google Test filter = *RangeDelReseekAfterFileReadError*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBRangeDelTest
[ RUN      ] DBRangeDelTest.RangeDelReseekAfterFileReadError
db/db_range_del_test.cc:3577: Failure
Value of: iter->Valid()
  Actual: true
Expected: false
[  FAILED  ] DBRangeDelTest.RangeDelReseekAfterFileReadError (64 ms)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11790

Reviewed By: ajkr

Differential Revision: D48972869

Pulled By: cbi42

fbshipit-source-id: b1a71867533b0fb60af86f8ce8a9e391ba84dd57
2023-09-06 15:22:39 -07:00
Changyu Bi 458acf8169 Add some unit tests when file read returns error during compaction/scanning (#11788)
Summary:
Some repro unit tests for the bug fixed in https://github.com/facebook/rocksdb/pull/11782.

Ran on main without https://github.com/facebook/rocksdb/pull/11782:
```
./db_compaction_test --gtest_filter='*ErrorWhenReadFileHead'
Note: Google Test filter = *ErrorWhenReadFileHead
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN      ] DBCompactionTest.ErrorWhenReadFileHead
db/db_compaction_test.cc:10105: Failure
Value of: s.IsIOError()
  Actual: false
Expected: true
[  FAILED  ] DBCompactionTest.ErrorWhenReadFileHead (3960 ms)

./db_iterator_test --gtest_filter="*ErrorWhenReadFile*"
Note: Google Test filter = *ErrorWhenReadFile*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBIteratorTest
[ RUN      ] DBIteratorTest.ErrorWhenReadFile
db/db_iterator_test.cc:3399: Failure
Value of: (iter->status()).ok()
  Actual: true
Expected: false
[  FAILED  ] DBIteratorTest.ErrorWhenReadFile (280 ms)
[----------] 1 test from DBIteratorTest (280 ms total)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11788

Reviewed By: ajkr

Differential Revision: D48940284

Pulled By: cbi42

fbshipit-source-id: 06f3c5963f576db3f85d305ffb2745ee13d209bb
2023-09-06 10:23:41 -07:00
Jay Huh 47be3ffffb Minor refactor on LDB command for wide column support and release note (#11777)
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/11754 , refactor to clean up some nearly identical logic. This PR changes the existing debugging string format of Scan command as the following.

```
❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex
```

Before
```
0x6669727374 : :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 : 0x62617A
```
After
```
0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 ==> 0x62617A
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11777

Test Plan:
```
❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz
Keys in range: 3

❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz

❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump --hex
0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 ==> 0x62617A
Keys in range: 3

❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex
0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 ==> 0x62617A
```

Reviewed By: jowlyzhang

Differential Revision: D48876755

Pulled By: jaykorean

fbshipit-source-id: b1c608a810fe038999ac528b690d398abf5f21d7
2023-08-31 16:17:03 -07:00
Peter Dillinger 83eb7b8c2c Log host name (#11776)
Summary:
... in info_log. Becoming more important with disaggregated storage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11776

Test Plan: manual

Reviewed By: jaykorean

Differential Revision: D48849471

Pulled By: pdillinger

fbshipit-source-id: 9a8fd8b2564a4f133526ecd7c1414cb667e4ba54
2023-08-31 08:39:09 -07:00
Hui Xiao 05daa12332 Change compaction_readahead_size default value to 2MB (#11762)
Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, we rely on `compaction_readahead_size` for how much to read ahead for compaction read under non-direct IO case. https://github.com/facebook/rocksdb/pull/11658 therefore also sanitized 0 `compaction_readahead_size` to 2MB under non-direct IO, which is consistent with the existing sanitization with direct IO.

However, this makes disabling compaction readahead impossible as well as add one more scenario to the inconsistent effects between `Options.compaction_readahead_size=0` during DB open and `SetDBOptions("compaction_readahead_size", "0")` .
- `SetDBOptions("compaction_readahead_size", "0")` will disable compaction readahead as its logic never goes through sanitization above while `Options.compaction_readahead_size=0` will go through sanitization.

Therefore we decided to do this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11762

Test Plan: Modified existing UTs to cover this PR

Reviewed By: ajkr

Differential Revision: D48759560

Pulled By: hx235

fbshipit-source-id: b3f85e58bda362a6fa1dc26bd8a87aa0e171af79
2023-08-30 14:57:08 -07:00
Yu Zhang fc58c7c62a Add UDT support in SstFileDumper (#11757)
Summary:
For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11757

Test Plan:
make all check
for changes in db_impl.cc and db_impl_compaction_flush.cc

for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually.

Reviewed By: ltamasi

Differential Revision: D48668048

Pulled By: jowlyzhang

fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8
2023-08-30 13:42:04 -07:00
Jay Huh ea9a5b2914 Wide Column support in ldb (#11754)
Summary:
wide_columns can now be pretty-printed in the following commands
- `./ldb dump_wal`
- `./ldb dump`
- `./ldb idump`
- `./ldb dump_live_files`
- `./ldb scan`
- `./sst_dump --command=scan`

There are opportunities to refactor to reduce some nearly identical code. This PR is initial change to add wide column support in `ldb` and `sst_dump` tool. More PRs to come for the refactor.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11754

Test Plan:
**New Tests added**
- `WideColumnsHelperTest::DumpWideColumns`
- `WideColumnsHelperTest::DumpSliceAsWideColumns`

**Changes added to existing tests**
- `ExternalSSTFileTest::BasicMixed` added to cover mixed case (This test should have been added in https://github.com/facebook/rocksdb/issues/11688). This test does not verify the ldb or sst_dump output. This test was used to create test SST files having some rows with wide columns and some without and the generated SST files were used to manually test sst_dump_tool.
- `createSST()` in `sst_dump_test` now takes `wide_column_one_in` to add wide column value in SST

**dump_wal**
```
./ldb dump_wal --walfile=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/000004.log --print_value --header
```
```
Sequence,Count,ByteSize,Physical Offset,Key(s) : value
1,1,59,0,PUT_ENTITY(0) : 0x:0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
2,1,34,42,PUT_ENTITY(0) : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
3,1,17,7d,PUT(0) : 0x7468697264 : 0x62617A
```

**idump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ idump
```
```
'first' seq:1, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:2, type:22 => attr_one:two attr_three:four
'third' seq:3, type:1 => baz
Internal keys in range: 3
```

**SST Dump from dump_live_files**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ compact
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump_live_files
```
```
...
==============================
SST Files
==============================
/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst level:1
------------------------------
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
...
```

**dump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump
```
```
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz
Keys in range: 3
```

**scan**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ scan
```
```
first : :hello attr_name1:foo attr_name2:bar
second : attr_one:two attr_three:four
third : baz
```

**sst_dump**
```
./sst_dump --file=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst --command=scan
```
```
options.env is 0x7ff54b296000
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
from [] to []
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
```

Reviewed By: ltamasi

Differential Revision: D48837999

Pulled By: jaykorean

fbshipit-source-id: b0280f0589d2b9716bb9b50530ffcabb397d140f
2023-08-30 12:45:52 -07:00
Yu Zhang 4234a6a301 Increase full_history_ts_low when flush happens during recovery (#11774)
Summary:
This PR adds a missing piece for the UDT in memtable only feature, which is to automatically increase `full_history_ts_low` when flush happens during recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11774

Test Plan:
Added unit test
make all check

Reviewed By: ltamasi

Differential Revision: D48799109

Pulled By: jowlyzhang

fbshipit-source-id: fd681ed66d9d40904ca2c919b2618eb692686035
2023-08-30 09:34:31 -07:00
jsteemann c1e6ffc40a remove a sub-condition that is always true (#11746)
Summary:
the value of `done` is always false here, so the sub-condition `!done` will always be true and the check can be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11746

Reviewed By: anand1976

Differential Revision: D48656845

Pulled By: ajkr

fbshipit-source-id: 523ba3d07b3af7880c8c8ccb20442fd7c0f49417
2023-08-29 18:40:13 -07:00
Yu Zhang ecbeb305a0 Removing some checks for UDT in memtable only feature (#11732)
Summary:
The user-defined timestamps feature only enforces that for the same key, user-defined timestamps should be non-decreasing. For the user-defined timestamps in memtable only feature, during flush, we check the user-defined timestamps in each memtable to examine if the data is considered expired with regard to `full_history_ts_low`. In this process, it's assuming that a newer memtable should not have smaller user-defined timestamps than an older memtable. This check however is enforcing ordering of user-defined timestamps across keys, as apposed to the vanilla UDT feature, that only enforce ordering of user-defined timestamps for the same key.

This more strict user-defined timestamp ordering requirement could be an issue for secondary instances where commits can be out of order. And after thinking more about it, this requirement is really an overkill to keep the invariants of `full_history_ts_low` which are:

1) users cannot read below `full_history_ts_low`
2) users cannot write at or below `full_history_ts_low`
3) `full_history_ts_low` can only be increasing

As long as RocksDB enforces these 3 checks, we can prohibit inconsistent read that returns a different value. And these three checks are covered in existing APIs.

So this PR removes the extra checks in the UDT in memtable only feature that requires user-defined timestamps to be non decreasing across keys.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11732

Reviewed By: ltamasi

Differential Revision: D48541466

Pulled By: jowlyzhang

fbshipit-source-id: 95453c6e391cbd511c0feab05f0b11c312d17186
2023-08-29 16:51:48 -07:00
Jan ba59751430 remove an unused typedef (#11286)
Summary:
`VersionBuilderMap` type alias definition seem unused.
If this PR can be compiled fine then the alias is probably not needed anymore.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11286

Reviewed By: jaykorean

Differential Revision: D48656747

Pulled By: ajkr

fbshipit-source-id: ac8554922aead7dc3d24fe7e6544a4622578c514
2023-08-25 18:01:14 -07:00
Hui Xiao f833ca3878 Pick files from the last sorted run in size amp compaction picker (#11740)
Summary:
**Context/Summary:**
Same intention as https://github.com/facebook/rocksdb/pull/2693 - basically we now pick from the last sorted run and expand forward till we can't

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11740

Test Plan:
Existing UT
Stress test

Reviewed By: ajkr

Differential Revision: D48586475

Pulled By: hx235

fbshipit-source-id: 3eb3c3ee1d5f7e0b0d6d649baaeb8c6990fee398
2023-08-23 11:27:48 -07:00
Alexander Bulimov 2b6bcfe590 Add C API for WaitForCompact (#11737)
Summary:
Add a bunch of C API functions to expose new `WaitForCompact` function and related options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11737

Test Plan: unit tests

Reviewed By: jaykorean

Differential Revision: D48568239

Pulled By: abulimov

fbshipit-source-id: 1ff35972d7abacd7e1e17fe2ada1e20cdc88d8de
2023-08-22 14:32:35 -07:00
chuhao zeng 1303573589 Reverse sort order in dedup to enable iter checking in callback (#11725)
Summary:
Fix https://github.com/facebook/rocksdb/issues/6470

Ensure TransactionLogIter being initialized correctly with SYNC_POINT API when calling `GetSortedWALFiles`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11725

Reviewed By: akankshamahajan15

Differential Revision: D48529411

Pulled By: ajkr

fbshipit-source-id: 970ca1a6259ed996c6d87f7fcd40f95acf441517
2023-08-22 11:22:35 -07:00
Yu Zhang 03a74411c0 Add unit test for default temperature (#11722)
Summary:
This piggy back the existing last level file temperature statistics test to test the default temperature becoming effective.

While adding this unit test, I found that the approach to swap out and use default temperature in `VersionBuilder::LoadTableHandlers` will miss the L0 files created from flush, and only work for existing SST files, SST files created by compaction. So this PR moves that logic to `TableCache::GetTableReader`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11722

Test Plan:
```
./db_test2 --gtest_filter="*LastLevelStatistics*"
make all check
```

Reviewed By: pdillinger

Differential Revision: D48489171

Pulled By: jowlyzhang

fbshipit-source-id: ac29f7d484916f3218729594c5bb35c4f2979ac2
2023-08-21 12:14:03 -07:00
Changyu Bi c2aad555c3 Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666)
Summary:
Optionally enable zstd checksum flag (d857369028/lib/zstd.h (L428)) to detect corruption during decompression. Main changes are in compression.h:
* User can set CompressionOptions::checksum to true to enable this feature.
* We enable this feature in ZSTD by setting the checksum flag in ZSTD compression context: `ZSTD_CCtx`.
* Uses `ZSTD_compress2()` to do compression since it supports frame parameter like the checksum flag. Compression level is also set in compression context as a flag.
* Error handling during decompression to propagate error message from ZSTD.
* Updated microbench to test read performance impact.

About compatibility, the current compression decoders should continue to work with the data created by the new compression API `ZSTD_compress2()`: https://github.com/facebook/zstd/issues/3711.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11666

Test Plan:
* Existing unit tests for zstd compression
* Add unit test `DBTest2.ZSTDChecksum` to test the corruption case
* Manually tested that compression levels, parallel compression, dictionary compression, index compression all work with the new ZSTD_compress2() API.
* Manually tested with `sst_dump --command=recompress` that different compression levels and dictionary compression settings all work.
* Manually tested compiling with older versions of ZSTD: v1.3.8, v1.1.0, v0.6.2.
* Perf impact: from public benchmark data: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for checksum and https://github.com/facebook/zstd#benchmarks, if decompression is 1700MB/s and checksum computation is 70000MB/s, checksum computation is an additional ~2.4% time for decompression. Compression is slower and checksumming should be less noticeable.
* Microbench:
```
TEST_TMPDIR=/dev/shm ./branch_db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:1048576/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:0/compression_type:7/compression_checksum:1/no_blockcache:1/iterations:10000/threads:1 --benchmark_repetitions=100

Min out of 100 runs:
Main:
10390 10436 10456 10484 10499 10535 10544 10545 10565 10568

After this PR, checksum=false
10285 10397 10503 10508 10515 10557 10562 10635 10640 10660

After this PR, checksum=true
10827 10876 10925 10949 10971 11052 11061 11063 11100 11109
```
* db_bench:
```
Write perf
TEST_TMPDIR=/dev/shm/ ./db_bench_ichecksum --benchmarks=fillseq[-X10] --compression_type=zstd --num=10000000 --compression_checksum=..

[FillSeq checksum=0]
fillseq [AVG    10 runs] : 281635 (± 31711) ops/sec;   31.2 (± 3.5) MB/sec
fillseq [MEDIAN 10 runs] : 294027 ops/sec;   32.5 MB/sec

[FillSeq checksum=1]
fillseq [AVG    10 runs] : 286961 (± 34700) ops/sec;   31.7 (± 3.8) MB/sec
fillseq [MEDIAN 10 runs] : 283278 ops/sec;   31.3 MB/sec

Read perf
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=readrandom[-X20] --num=100000000 --reads=1000000 --use_existing_db=true --readonly=1

[Readrandom checksum=1]
readrandom [AVG    20 runs] : 360928 (± 3579) ops/sec;    4.0 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 362468 ops/sec;    4.0 MB/sec

[Readrandom checksum=0]
readrandom [AVG    20 runs] : 380365 (± 2384) ops/sec;    4.2 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 379800 ops/sec;    4.2 MB/sec

Compression
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=compress[-X20] --compression_type=zstd --num=100000000 --compression_checksum=1

checksum=1
compress [AVG    20 runs] : 54074 (± 634) ops/sec;  211.2 (± 2.5) MB/sec
compress [MEDIAN 20 runs] : 54396 ops/sec;  212.5 MB/sec

checksum=0
compress [AVG    20 runs] : 54598 (± 393) ops/sec;  213.3 (± 1.5) MB/sec
compress [MEDIAN 20 runs] : 54592 ops/sec;  213.3 MB/sec

Decompression:
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=uncompress[-X20] --compression_type=zstd --compression_checksum=1

checksum = 0
uncompress [AVG    20 runs] : 167499 (± 962) ops/sec;  654.3 (± 3.8) MB/sec
uncompress [MEDIAN 20 runs] : 167210 ops/sec;  653.2 MB/sec
checksum = 1
uncompress [AVG    20 runs] : 167980 (± 924) ops/sec;  656.2 (± 3.6) MB/sec
uncompress [MEDIAN 20 runs] : 168465 ops/sec;  658.1 MB/sec
```

Reviewed By: ajkr

Differential Revision: D48019378

Pulled By: cbi42

fbshipit-source-id: 674120c6e1853c2ced1436ac8138559d0204feba
2023-08-18 15:01:59 -07:00
Jay Huh 0fa0c97d3e Timeout in microsecond option in WaitForCompactOptions (#11711)
Summary:
While it's rare, we may run into a scenario where `WaitForCompact()` waits for background jobs indefinitely. For example, not enough space error will add the job back to the queue while WaitForCompact() waits for _all jobs_ including the jobs that are in the queue to be completed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11711

Test Plan:
`DBCompactionWaitForCompactTest::WaitForCompactToTimeout` added
`timeout` option added to the variables for all of the existing DBCompactionWaitForCompactTests

Reviewed By: pdillinger, jowlyzhang

Differential Revision: D48416390

Pulled By: jaykorean

fbshipit-source-id: 7b6a12f705ab6c6dfaf8ad736a484ca654a86106
2023-08-18 11:21:45 -07:00
Yu Zhang 1e77e35d26 Add a per column family default temperature option for accounting (#11708)
Summary:
Add a column family option `default_temperature` that will be used for file reading accounting purpose, such as io statistics, for files that don't have an explicitly set temperature.

This options is not a mutable one, changing its value would require a DB restart. This is to avoid the confusion that had the option being a mutable one, the users may expect it to take effect on all files immediately, while in reality, it would only become effective for SST files opened in the future.

This `default_temperature` also just affect accounting during one DB session. It won't be recorded in manifest as the file's temperature and can be different across different DB sessions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11708

Test Plan:
```
make all check
```

Reviewed By: pdillinger

Differential Revision: D48375763

Pulled By: jowlyzhang

fbshipit-source-id: eb756696c14a694c6e2a93d2bb6f040563194981
2023-08-17 17:06:57 -07:00
Changyu Bi d1ff401472 Delay bottommost level single file compactions (#11701)
Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions.

* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701

Test Plan:
* Add two unit tests to test when bottommost_file_compaction_delay > 0.
* Ran crash test with the new option.

Reviewed By: jaykorean, ajkr

Differential Revision: D48331564

Pulled By: cbi42

fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
2023-08-16 17:45:44 -07:00
Yu Zhang 6a3da5635e Add documentation to some formatting util functions (#11674)
Summary:
As titled, mostly adding documentation. While updating one usage of these util functions in the external file ingestion job based on code inspection.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11674

Test Plan:
```
make check
```

Note that no unit test was added or updated to check the change in the external file ingestion flow works. This is because user-defined timestamp doesn't support bulk loading yet. There could be other missing pieces that are needed to make this flow functional and testable. That work is separately tracked and unit tests will be added then.

Reviewed By: cbi42

Differential Revision: D48271338

Pulled By: jowlyzhang

fbshipit-source-id: c05c3440f1c08632dd0de51b563a30b44b4eb8b5
2023-08-14 22:04:18 -07:00
Jay Huh 793a786fa3 Fix for unchecked status in CancelAllBackgroundWork (#11699)
Summary:
## Summary
PR https://github.com/facebook/rocksdb/issues/11497 introduced this. Status from `CancelPeriodicTaskScheduler()` is unchecked and causing test failure like https://app.circleci.com/pipelines/github/facebook/rocksdb/30743/workflows/24443a9b-6fc3-41e6-86c1-992d766eb1ec/jobs/642419

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11699

Test Plan: Existing tests

Reviewed By: cbi42

Differential Revision: D48287188

Pulled By: jaykorean

fbshipit-source-id: b6bcf6e3c3c47f126c34c24a3dfed2649635cc8c
2023-08-11 19:59:56 -07:00
Peter Dillinger ef6f025563 Placeholder for AutoHyperClockCache, more (#11692)
Summary:
* The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation.
* Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache).
* Simplify some existing tests not to depend directly on cache type.
* Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation.
* Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test.
* Add another templated class BaseHyperClockCache to reduce future copy-paste
* Added ReportProblems support to cache_bench
* Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC:
```
2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17
```

In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692

Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in.

Reviewed By: jowlyzhang

Differential Revision: D48247413

Pulled By: pdillinger

fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da
2023-08-11 16:27:38 -07:00
Jay Huh 52816ff64d Close DB option in WaitForCompact() (#11497)
Summary:
Context:

As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `close_db` option in `WaitForCompactOptions` to close DB after waiting for compactions to finish. Must be set to true to close the DB upon compactions finishing.
1. `bool close_db = false` added to `WaitForCompactOptions`
2. Introduced `CancelPeriodicTaskSchedulers()` and moved unregistering PeriodicTaskSchedulers to it.`CancelAllBackgroundWork()` calls it now.
3. When close_db option is on, unpersisted data (data in memtable when WAL is disabled) will be flushed in `WaitForCompact()` if flush option is not on (and `mutable_db_options_.avoid_flush_during_shutdown` is not true). The unpersisted data flush in `CancelAllBackgroundWork()` will be skipped because `shutting_down_` flag will be set true before calling `Close()`.
4. Atomic boolean `reject_new_background_jobs_` is introduced to prevent new background jobs from being added during the short period of time after waiting is done and before `shutting_down_` is set by `Close()`.
5. `WaitForCompact()` now waits for recovery in progress to complete as well. (flush operations from WAL -> L0 files)
6. Added `close_db_` cases to all existing `WaitForCompactTests`
7. Added a scenario to `DBBasicTest::DBClose`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11497

Test Plan:
- Existing DBCompactionTests
- `WaitForCompactWithOptionToFlushAndCloseDB` added
- Added a scenario to `DBBasicTest::DBClose`

Reviewed By: pdillinger, jowlyzhang

Differential Revision: D46337560

Pulled By: jaykorean

fbshipit-source-id: 0f8c7ee09394847f2af5ea4bdd331b47bcdef0b0
2023-08-11 12:30:48 -07:00
Yu Zhang 7cdbce4564 Add UDT support in API DB::GetApproximateMemTableStats (#11689)
Summary:
This API should consider the case when user-defined timestamp is enabled. Also added some documentation to some related API to clarify the usage in the case when user-defined timestamp is enabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11689

Test Plan:
Unit test added
```
make check
./db_with_timestamp_basic_test --gtest_filter=*GetApproximateSizes*
```

Reviewed By: ltamasi

Differential Revision: D48208568

Pulled By: jowlyzhang

fbshipit-source-id: c5baa4a2923441f8ea3a3672c98223a43a3428dc
2023-08-11 11:26:38 -07:00
Jay Huh 66643b8106 PutEntity Support in SST File Writer (#11688)
Summary:
RocksDB provides APIs that enable creating SST files offline and then bulk loading them into the LSM tree quickly using metadata operations. Namely, clients can use the `SstFileWriter` class for the offline data preparation and then the IngestExternalFile family of APIs to perform the bulk loading. However, `SstFileWriter` currently does not support creating files with wide-column data in them. This PR adds `PutEntity` API implementation to `SstFileWriter` to support creating files with wide-column data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11688

Test Plan: - `BasicWideColumn` test added in external_sst_file_test

Reviewed By: ltamasi

Differential Revision: D48243779

Pulled By: jaykorean

fbshipit-source-id: 1697e5bd67121a648c03946f867416a94be0cadf
2023-08-10 18:16:10 -07:00
Changyu Bi 76ed9a3990 Add missing status check when compiling with `ASSERT_STATUS_CHECKED=1` (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by 9c2ebcc2c3/Makefile (L243)
Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks.

Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686

Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check`

Reviewed By: hx235

Differential Revision: D48176132

Pulled By: cbi42

fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd
2023-08-09 15:46:44 -07:00
Yu Zhang c751583c03 Set default cf ts sz for a reused transaction (#11685)
Summary:
Set up the default column family timestamp size for a reused write committed transaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11685

Test Plan: Added unit test.

Reviewed By: ltamasi

Differential Revision: D48195129

Pulled By: jowlyzhang

fbshipit-source-id: 54faa900c123fc6daa412c01490e36c10a24a678
2023-08-09 13:49:42 -07:00
Hui Xiao 9a034801ce Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
2023-08-08 17:26:50 -07:00
Yu Zhang 9c2ebcc2c3 Log user_defined_timestamps_persisted flag in event logger (#11683)
Summary:
As titled, and also removed an undefined and unused member function in for ColumnFamilyData

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683

Reviewed By: ajkr

Differential Revision: D48156290

Pulled By: jowlyzhang

fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7
2023-08-08 12:25:21 -07:00
Peter Dillinger e214964f40 Fix a potential memory leak on row_cache insertion failure (#11682)
Summary:
Although the built-in Cache implementations never return failure on Insert without keeping a reference (Handle), a custom implementation could. The code for inserting into row_cache does not keep a reference but does not clean up appropriately on non-OK. This is a fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11682

Test Plan: unit test added that previously fails under ASAN

Reviewed By: ajkr

Differential Revision: D48153831

Pulled By: pdillinger

fbshipit-source-id: 86eb7387915c5b38b6ff5dd8deb4e1e223b7d020
2023-08-08 11:34:41 -07:00
tabokie 6d1effaf01 exclude uninitialized files when estimating compression ratio (#11664)
Summary:
Exclude files with uninitialized table properties when estimating compression ratio.

Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/335

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11664

Reviewed By: cbi42

Differential Revision: D48002518

Pulled By: ajkr

fbshipit-source-id: 931fac8a06b4ed7b7b605cf79903302f1b8babfd
2023-08-07 12:35:42 -07:00
Xinye Tao d2b0652b32 compute compaction score once for a batch of range file deletes (#10744)
Summary:
Only re-calculate compaction score once for a batch of deletions. Fix performance regression brought by https://github.com/facebook/rocksdb/pull/8434.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10744

Test Plan:
In one of our production cluster that recently upgraded to RocksDB 6.29, it takes more than 10 minutes to delete files in 30,000 ranges. The RocksDB instance contains approximately 80,000 files. After this patch, the duration reduces to 100+ ms, which is on par with RocksDB 6.4.

Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/316

Signed-off-by: tabokie <xy.tao@outlook.com>

Reviewed By: cbi42

Differential Revision: D48002581

Pulled By: ajkr

fbshipit-source-id: 7245607ee3ad79c53b648a6396c9159f166b9437
2023-08-07 12:29:31 -07:00
Andrew Kryczka 4500a0d6ec Avoid an std::map copy in persistent stats (#11681)
Summary:
An internal user reported this copy showing up in a CPU profile. We can use move instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11681

Differential Revision: D48103170

Pulled By: ajkr

fbshipit-source-id: 083d6470181a0041bb5275b657aa61bee23a3729
2023-08-06 18:01:08 -07:00
Changyu Bi eca48bc166 Avoid shifting component too large error in FileTtlBooster (#11673)
Summary:
When `num_levels` > 65, we may be shifting more than 63 bits in FileTtlBooster. This can give errors like: `runtime error: shift exponent 98 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')`. This PR makes a quick fix for this issue by taking a min in the shifting component. This issue should be rare since it requires a user using a large `num_levels`. I'll follow up with a more complex fix if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11673

Test Plan: * Add a unit test that produce the above error before this PR. Need to compile it with ubsan: `COMPILE_WITH_UBSAN=1 OPT="-fsanitize-blacklist=.circleci/ubsan_suppression_list.txt" ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 compaction_picker_test`

Reviewed By: hx235

Differential Revision: D48074386

Pulled By: cbi42

fbshipit-source-id: 25e59df7e93f20e0793cffb941de70ac815d9392
2023-08-04 14:29:50 -07:00
Hui Xiao 09882a52d6 Prepare for deprecation of Options::access_hint_on_compaction_start (#11658)
Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, file hint is not longer needed for compaction read. Therefore we can deprecate `Options::access_hint_on_compaction_start`. As this is a public API change, we should first mark the relevant APIs (including the Java's) deprecated and remove it in next major release 9.0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11658

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D47997856

Pulled By: hx235

fbshipit-source-id: 16e015ae7728c224b1caef73143aa9915668f4ac
2023-08-03 17:23:02 -07:00
Vardhan 87a21d08fe Add an option to trigger flush when the number of range deletions reach a threshold (#11358)
Summary:
Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358

Test Plan:
* New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
* Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.

Reviewed By: ajkr

Differential Revision: D46582680

Pulled By: cbi42

fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
2023-08-02 19:58:56 -07:00
amatveev-cf 946d1009bc Expand Statistics support in the C API (#11263)
Summary:
Adds a few missing features to the C API:
1) Statistics level
2) Getting individual values instead of a serialized string

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11263

Test Plan: unit tests

Reviewed By: ajkr

Differential Revision: D47309963

Pulled By: hx235

fbshipit-source-id: 84df59db4045fc0fb3ea4aec451bc5c2afd2a248
2023-08-02 10:53:40 -07:00