Commit Graph

5751 Commits

Author SHA1 Message Date
Jay Huh f22557886e Fix Compaction Stats (#13071)
Summary:
Compaction stats code is not so straightforward to understand. Here's a bit of context for this PR and why this change was made.

- **CompactionStats (compaction_stats_.stats):** Internal stats about the compaction used for logging and public metrics.
- **CompactionJobStats (compaction_job_stats_)**: The public stats at job level. It's part of Compaction event listener and included in the CompactionResult.
- **CompactionOutputsStats**: output stats only. resides in CompactionOutputs. It gets aggregated toward the CompactionStats (internal stats).

The internal stats, `compaction_stats_.stats`, has the output information recorded from the compaction iterator, but it does not have any input information (input records, input output files) until `UpdateCompactionStats()` gets called. We cannot simply call `UpdateCompactionStats()` to fill in the input information in the remote compaction (which is a subcompaction of the primary host's compaction) because the `compaction->inputs()` have the full list of input files and `UpdateCompactionStats()` takes the entire list of records in all files. `num_input_records` gets double-counted if multiple sub-compactions are submitted to the remote worker.

The job level stats (in the case of remote compaction, it's subcompaction level stat), `compaction_job_stats_`, has the correct input records, but has no output information. We can use `UpdateCompactionJobStats(compaction_stats_.stats)` to set the output information (num_output_records, num_output_files, etc.) from the `compaction_stats_.stats`, but it also sets all other fields including the input information which sets all back to 0.

Therefore, we are overriding `UpdateCompactionJobStats()` in remote worker only to update job level stats, `compaction_job_stats_`, with output information of the internal stats.

Baiscally, we are merging the aggregated output info from the internal stats and aggregated input info from the compaction job stats.

In this PR we are also fixing how we are setting `is_remote_compaction` in CompactionJobStats.
- OnCompactionBegin event, if options.compaction_service is set, `is_remote_compaction=true` for all compactions except for trivial moves
- OnCompactionCompleted event, if any of the sub_compactions were done remotely, compaction level stats's `is_remote_compaction` will be true

Other minor changes
- num_output_records is already available in CompactionJobStats. No need to store separately in CompactionResult.
- total_bytes is not needed.
- Renamed `SubcompactionState::AggregateCompactionStats()` to `SubcompactionState::AggregateCompactionOutputStats()` to make it clear that it's only aggregating output stats.
- Renamed `SetTotalBytes()` to `AddBytesWritten()` to make it more clear that it's adding total written bytes from the compaction output.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13071

Test Plan:
Unit Tests added and updated
```
./compaction_service_test
```

Reviewed By: anand1976

Differential Revision: D64479657

Pulled By: jaykorean

fbshipit-source-id: a7a776a00dc718abae95d856b661bcbafd3b0ed5
2024-10-16 19:20:37 -07:00
Changyu Bi 8ad4c7efc4 Add an API to check if an SST file is generated by SstFileWriter (#13072)
Summary:
Some users want to check if a file in their DB was created by SstFileWriter and ingested into hte DB. Files created by SstFileWriter records additional table properties cbebbad7d9/table/sst_file_writer_collectors.h (L17-L21). These are not exposed so I'm adding an API to SstFileWriter to check for them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13072

Test Plan: added some assertions.

Reviewed By: jowlyzhang

Differential Revision: D64411518

Pulled By: cbi42

fbshipit-source-id: 279bfef48615aa6f78287b643d8445a1471e7f07
2024-10-16 16:57:05 -07:00
Changyu Bi 787730c859 Add an ingestion option to not fill block cache (#13067)
Summary:
add `IngestExternalFileOptions::fill_cache` to allow users to ingest files without loading index/filter/data and other blocks into block cache during file ingestion. This can be useful when users are ingesting files into a CF that is not available to readers yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13067

Test Plan:
* unit test: `ExternalSSTFileTest.NoBlockCache`
* ran one round of crash test with fill_cache disabled: `python3 ./tools/db_crashtest.py --simple blackbox --ops_per_thread=1000000 --interval=30 --ingest_external_file_one_in=200 --level0_stop_writes_trigger=200 --level0_slowdown_writes_trigger=100 --sync_fault_injection=0 --disable_wal=0 --manual_wal_flush_one_in=0`

Reviewed By: jowlyzhang

Differential Revision: D64356424

Pulled By: cbi42

fbshipit-source-id: b380c26f5987238e1ed7d42ceef0390cfaa0b8e2
2024-10-16 14:11:22 -07:00
Levi Tamasi ecc084d301 Support the on-demand loading of blobs during iteration (#13069)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13069

Currently, when using range scans with BlobDB, the iterator logic eagerly loads values from blob files when landing on a new entry. This can be wasteful in use cases where the values associated with some keys in the range are not used by the application. The patch introduces a new read option `allow_unprepared_value`; when specified, this option results in the above eager loading getting bypassed. Values needed by the application can be then loaded on an on-demand basis by calling the new iterator API `PrepareValue`. Note that currently, only regular single-CF iterators are supported; multi-CF iterators and transactions will be extended in later PRs.

Reviewed By: jowlyzhang

Differential Revision: D64360723

fbshipit-source-id: ee55502fa15dcb307a984922b9afc9d9da15d6e1
2024-10-16 12:34:57 -07:00
Jay Huh da5e11310b Preserve Options File (#13074)
Summary:
In https://github.com/facebook/rocksdb/issues/13025 , we made a change to load the latest options file in the remote worker instead of serializing the entire set of options.

That was done under assumption that OPTIONS file do not get purged often. While testing, we learned that this happens more often than we want it to be, so we want to prevent the OPTIONS file from getting purged anytime between when the remote compaction is scheduled and the option is loaded in the remote worker.

Like how we are protecting new SST files from getting purged using `min_pending_output`, we are doing the same by keeping track of `min_options_file_number`. Any OPTIONS file with number greater than `min_options_file_number` will be protected from getting purged. Just like `min_pending_output`, `min_options_file_number` gets bumped when the compaction is done. This is only applicable when `options.compaction_service` is set.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13074

Test Plan:
```
./compaction_service_test --gtest_filter="*PreservedOptionsLocalCompaction*"
./compaction_service_test --gtest_filter="*PreservedOptionsRemoteCompaction*"
```

Reviewed By: anand1976

Differential Revision: D64433795

Pulled By: jaykorean

fbshipit-source-id: 0d902773f0909d9481dec40abf0b4c54ce5e86b2
2024-10-16 09:22:51 -07:00
Levi Tamasi 55de26580a Small improvement to MultiCFIteratorImpl (#13075)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13075

The patch simplifies the iteration logic in `MultiCFIteratorImpl::{Advance,Populate}Iterator` a bit and adds some assertions to uniformly enforce the invariant that any iterators currently on the heap should be valid and have an OK status.

Reviewed By: jaykorean

Differential Revision: D64429566

fbshipit-source-id: 36bc22465285b670f859692a048e10f21df7da7a
2024-10-15 17:32:07 -07:00
Yu Zhang 2cb00c6921 Ingest files in separate batches if they overlap (#13064)
Summary:
This PR assigns levels to files in separate batches if they overlap. This approach can potentially assign external files to lower levels.

In the prepare stage, if the input files' key range overlaps themselves, we divide them up in the user specified order into multiple batches. Where the files in the same batch do not overlap with each other, but key range could overlap between batches. If the input files' key range don't overlap, they always just make one default batch.

During the level assignment stage, we assign levels to files one batch after another.  It's guaranteed that files within one batch are not overlapping, we assign level to each file one after another. If the previous batch's uppermost level is specified, all files in this batch will be assigned to levels that are higher than that level. The uppermost level used by this batch of files is also tracked, so that it can be used by the next batch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13064

Test Plan:
Updated test and added new test
Manually stress tested

Reviewed By: cbi42

Differential Revision: D64428373

Pulled By: jowlyzhang

fbshipit-source-id: 5aeff125c14094c87cc50088505010dfd2da3d6e
2024-10-15 17:22:01 -07:00
Jay Huh dd76862b00 Add file_checksum from FileChecksumGenFactory and Tests for corrupted output (#13060)
Summary:
- When `FileChecksumGenFactory` is set, include the `file_checksum` and `file_checksum_func_name` in the output file metadata
- ~~In Remote Compaction, try opening the output files in the temporary directory to do a quick sanity check before returning the result with status.~~
- After offline discussion, we decided to rely on Primary's existing Compaction flow to sanity check the output files. If the output file is corrupted, we will still be able to catch it and not installing it even after renaming them to cf_paths. The corrupted file in the cf_path won't be added to the MANIFEST and will be purged as part of the next `PurgeObsoleteFiles()` call.
- Unit Test has been added to validate above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13060

Test Plan:
Unit test added
```
./compaction_service_test --gtest_filter="*CorruptedOutput*"
./compaction_service_test --gtest_filter="*TruncatedOutput*"
./compaction_service_test --gtest_filter="*CustomFileChecksum*"
./compaction_job_test --gtest_filter="*ResultSerialization*"
```

Reviewed By: cbi42

Differential Revision: D64189645

Pulled By: jaykorean

fbshipit-source-id: 6cf28720169c960c80df257806bfee3c0d177159
2024-10-14 18:26:17 -07:00
Peter Dillinger 351d2fd2b6 Make simple BlockBasedTableOptions mutable (#10021)
Summary:
In theory, there should be no danger in mutability, as table
builders and readers work from copies of BlockBasedTableOptions.
However, there is currently an unresolved read-write race that
affecting SetOptions on BBTO fields. This should be generally
acceptable for non-pointer options of 64 bits or less, but a fix
is needed to make it mutability general here. See
https://github.com/facebook/rocksdb/issues/10079

This change systematically sets all of those "simple" options (and future
such options) as mutable. (Resurrecting this PR perhaps preferable to
proposed https://github.com/facebook/rocksdb/issues/13063)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10021

Test Plan: Some unit test updates. XXX comment added to stress test code

Reviewed By: cbi42

Differential Revision: D64360967

Pulled By: pdillinger

fbshipit-source-id: ff220fa778331852fe331b42b76ac4adfcd2d760
2024-10-14 17:49:26 -07:00
Yu Zhang 8592517c89 Remove stale entries from L0 files when UDT is not persisted (#13035)
Summary:
When user-defined timestamps are not persisted, currently we replace the actual timestamp with min timestamp after an entry is output from compaction iterator. Compaction iterator won't be able to help with removing stale entries this way. This PR adds a wrapper iterator `TimestampStrippingIterator` for `MemTableIterator` that does the min timestamp replacement at the memtable iteration step. It is used by flush and can help remove stale entries from landing in L0 files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13035

Test Plan: Added unit test

Reviewed By: pdillinger, cbi42

Differential Revision: D63423682

Pulled By: jowlyzhang

fbshipit-source-id: 087dcc9cee97b9ea51b8d2b88dc91c2984d54e55
2024-10-14 12:28:35 -07:00
Yu Zhang a571cbed17 Use same logic to assign level for non-overlapping files in universal compaction (#13059)
Summary:
When the input files are not overlapping, a.k.a `files_overlap_=false`, it's best to assign them to non L0 levels so that they are not one sorted run each.  This can be done regardless of compaction style being leveled or universal without any side effects.

Just my guessing, this special handling may be there because universal compaction used to have an invariant that sequence number on higher levels should not be smaller than sequence number in lower levels. File ingestion used to try to keep up to that promise by doing "sequence number stealing" from the to be assigned level. However, that invariant is no longer true after deletion triggered compaction is added for universal compaction, and we also removed the sequence stealing logic from file ingestion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13059

Test Plan: Updated existing tests

Reviewed By: cbi42

Differential Revision: D64220100

Pulled By: jowlyzhang

fbshipit-source-id: 70a83afba7f4c52d502c393844e6b3273d5cf628
2024-10-11 11:18:45 -07:00
Levi Tamasi 2c9aa69a93 Refactor the BlobDB-related parts of DBIter (#13061)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13061

As groundwork for further changes, the patch refactors the BlobDB-related parts of `DBIter` by 1) introducing a new internal helper class `DBIter::BlobReader` that encapsulates all members needed to retrieve a blob value (namely, `Version` and the `ReadOptions` fields) and 2) factoring out and cleaning up some duplicate logic related to resolving blob references in the non-Merge (see `SetValueAndColumnsFromBlob`) and Merge (see `MergeWithBlobBaseValue`) cases.

Reviewed By: jowlyzhang

Differential Revision: D64078099

fbshipit-source-id: 22d5bd93e6e5be5cc9ecf6c4ee6954f2eb016aff
2024-10-10 15:38:59 -07:00
Jay Huh fe6c8cb1d6 Print unknown writebatch tag (#13062)
Summary:
Add additional info for debugging purpose by doing the same as what WBWI does

632746bb5b/utilities/write_batch_with_index/write_batch_with_index.cc (L274-L276)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13062

Test Plan: CI

Reviewed By: hx235

Differential Revision: D64202297

Pulled By: jaykorean

fbshipit-source-id: 65164fd88420fc72b6db26d1436afe548a653269
2024-10-10 15:34:35 -07:00
Hui Xiao 632746bb5b Improve DBTest.DynamicLevelCompressionPerLevel (#13044)
Summary:
**Context/Summary:**

A part of this test is to verify compression conditionally happens depending on the shape of the LSM when `options.level_compaction_dynamic_level_bytes = true;`. It uses the total file size to determine whether compression has happened or not. This involves some hard-coded math hard to understand. This PR replaces those with statistics that directly shows whether compression has happened or not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13044

Test Plan: Existing test

Reviewed By: jaykorean

Differential Revision: D63666361

Pulled By: hx235

fbshipit-source-id: 8c9b1bea9b06ff1e3ed95c576aec6705159af137
2024-10-09 12:51:19 -07:00
Yu Zhang 8181dfb1c4 Fix a bug for surfacing write unix time (#13057)
Summary:
The write unix time from non L0 files are not surfaced properly because the level's wrapper iterator doesn't have a `write_unix_time` implementation that delegates to the corresponding file. The unit test didn't catch this because it incorrectly destroy the old db and reopen to check write time, instead of just reopen and check. This fix also include a change to support ldb's scan command to get write time for easier debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13057

Test Plan: Updated unit tests

Reviewed By: pdillinger

Differential Revision: D64015107

Pulled By: jowlyzhang

fbshipit-source-id: 244474f78a034f80c9235eea2aa8a0f4e54dff59
2024-10-08 11:31:51 -07:00
Yu Zhang 263fa15b44 Handle a possible overflow (#13046)
Summary:
Stress test detects this variable could potentially overflow, so added some runtime handling to avoid it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13046

Test Plan: Existing tests

Reviewed By: hx235

Differential Revision: D63911396

Pulled By: jowlyzhang

fbshipit-source-id: 7c9abcd74ac9937b211c0ea4bb683677390837c5
2024-10-04 16:48:12 -07:00
Changyu Bi bceb2dfe6a Introduce minimum compaction debt requirement for parallel compaction (#13054)
Summary:
a small CF can trigger parallel compaction that applies to the entire DB. This is because the bottommost file size of a small CF can be too small compared to l0 files when a l0->lbase compaction happens. We prevent this by requiring some minimum on the compaction debt.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13054

Test Plan: updated unit test.

Reviewed By: hx235

Differential Revision: D63861042

Pulled By: cbi42

fbshipit-source-id: 43bbf327988ef0ef912cd2fc700e3d096a8d2c18
2024-10-04 15:01:54 -07:00
Yu Zhang 32dd657bad Add some per key optimization for UDT in memtable only feature (#13031)
Summary:
This PR added some optimizations for the per key handling for SST file for the user-defined timestamps in Memtable only feature. CPU profiling shows this part is a big culprit for regression. This optimization saves some string construction/destruction/appending/copying. vector operations like reserve/emplace_back.

When iterating keys in a block, we need to copy some shared bytes from previous key, put it together with the non shared bytes and find a right location to pad the min timestamp. Previously, we create a tmp local string buffer to first construct the key from its pieces, and then copying this local string's content into `IterKey`'s buffer. To avoid having this local string and to avoid this extra copy. Instead of piecing together the key in a local string first, we just track all the pieces that make this key in a reused Slice array. And then copy the pieces in order into `IterKey`'s buffer. Since the previous key should be kept intact while we are copying some shared bytes from it,  we added a secondary buffer in `IterKey` and alternate between primary buffer and secondary buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13031

Test Plan: Existing tests.

Reviewed By: ltamasi

Differential Revision: D63416531

Pulled By: jowlyzhang

fbshipit-source-id: 9819b0e02301a2dbc90621b2fe4f651bc912113c
2024-10-03 17:57:50 -07:00
Levi Tamasi 917e98ff9e Templatize MultiCfIteratorImpl to avoid std::function's overhead (#13052)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13052

Currently, `MultiCfIteratorImpl` uses `std::function`s for `reset_func_` and `populate_func_`, which uses type erasure and has a performance overhead. The patch turns `MultiCfIteratorImpl` into a template that takes the two function object types as template parameters, and changes `AttributeGroupIteratorImpl` and `CoalescingIterator` so they pass in function objects of named types (as opposed to lambdas).

Reviewed By: jaykorean

Differential Revision: D63802598

fbshipit-source-id: e202f6d80c9054335e5b2571051a67a9e012c2d0
2024-10-02 19:18:15 -07:00
Yu Zhang 9375c3b635 Fix `needs_flush` assertion in file ingestion (#13045)
Summary:
This PR makes file ingestion job's flush wait a bit further until the SuperVersion is also updated. This is necessary since follow up operations will use the current SuperVersion to do range overlapping check and level assignment.

In debug mode, file ingestion job's second `NeedsFlush` call could have been invoked when the memtables are flushed but the SuperVersion hasn't been updated yet, triggering the assertion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13045

Test Plan:
Existing tests
Manually stress tested

Reviewed By: cbi42

Differential Revision: D63671151

Pulled By: jowlyzhang

fbshipit-source-id: 95a169e58a7e59f6dd4125e7296e9060fe4c63a7
2024-10-02 17:19:18 -07:00
Peter Dillinger dd23e84cad Re-implement GetApproximateMemTableStats for skip lists (#13047)
Summary:
GetApproximateMemTableStats() could return some bad results with the standard skip list memtable. See this new db_bench test showing the dismal distribution of results when the actual number of entries in range is 1000:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.391 micros/op 718915 ops/sec 1.391 seconds 1000000 operations;   11.7 MB/s
approximatememtablestats :       3.711 micros/op 269492 ops/sec 3.711 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 2344.1611  StdDev: 26587.27
Min: 0  Median: 965.8555  Max: 835273
Percentiles: P50: 965.86 P75: 1610.77 P99: 12618.01 P99.9: 74991.58 P99.99: 830970.97
------------------------------------------------------
[       0,       1 ]   131344  13.134%  13.134% ###
(       1,       2 ]      115   0.011%  13.146%
(       2,       3 ]      106   0.011%  13.157%
(       3,       4 ]      190   0.019%  13.176%
(       4,       6 ]      214   0.021%  13.197%
(       6,      10 ]      522   0.052%  13.249%
(      10,      15 ]      748   0.075%  13.324%
(      15,      22 ]     1002   0.100%  13.424%
(      22,      34 ]     1948   0.195%  13.619%
(      34,      51 ]     3067   0.307%  13.926%
(      51,      76 ]     4213   0.421%  14.347%
(      76,     110 ]     5721   0.572%  14.919%
(     110,     170 ]    11375   1.137%  16.056%
(     170,     250 ]    17928   1.793%  17.849%
(     250,     380 ]    36597   3.660%  21.509% #
(     380,     580 ]    77882   7.788%  29.297% ##
(     580,     870 ]   160193  16.019%  45.317% ###
(     870,    1300 ]   210098  21.010%  66.326% ####
(    1300,    1900 ]   167461  16.746%  83.072% ###
(    1900,    2900 ]    78678   7.868%  90.940% ##
(    2900,    4400 ]    47743   4.774%  95.715% #
(    4400,    6600 ]    17650   1.765%  97.480%
(    6600,    9900 ]    11895   1.190%  98.669%
(    9900,   14000 ]     4993   0.499%  99.168%
(   14000,   22000 ]     2384   0.238%  99.407%
(   22000,   33000 ]     1966   0.197%  99.603%
(   50000,   75000 ]     2968   0.297%  99.900%
(  570000,  860000 ]      999   0.100% 100.000%

readrandom   :       1.967 micros/op 508487 ops/sec 1.967 seconds 1000000 operations;    8.2 MB/s (1000000 of 1000000 found)
```

Perhaps the only good thing to say about the old implementation was that it was fast, though apparently not that fast.

I've implemented a much more robust and reasonably fast new version of the function. It's still logarithmic but with some larger constant factors. The standard deviation from true count is around 20% or less, and roughly the CPU cost of two memtable point look-ups. See code comments for detail.

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=1000
...
filluniquerandom :       1.478 micros/op 676434 ops/sec 1.478 seconds 1000000 operations;   11.0 MB/s
approximatememtablestats :       2.694 micros/op 371157 ops/sec 2.694 seconds 1000000 operations;
Reported entry count stats (expected 1000):
Count: 1000000 Average: 1073.5158  StdDev: 197.80
Min: 608  Median: 1079.9506  Max: 2176
Percentiles: P50: 1079.95 P75: 1223.69 P99: 1852.36 P99.9: 1898.70 P99.99: 2176.00
------------------------------------------------------
(     580,     870 ]   134848  13.485%  13.485% ###
(     870,    1300 ]   747868  74.787%  88.272% ###############
(    1300,    1900 ]   116536  11.654%  99.925% ##
(    1900,    2900 ]      748   0.075% 100.000%

readrandom   :       1.997 micros/op 500654 ops/sec 1.997 seconds 1000000 operations;    8.1 MB/s (1000000 of 1000000 found)
```

We can already see that the distribution of results is dramatically better and wonderfully normal-looking, with relative standard deviation around 20%. The function is also FASTER, at least with these parameters. Let's look how this behavior generalizes, first *much* larger range:

```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=30000
filluniquerandom :       1.390 micros/op 719654 ops/sec 1.376 seconds 990000 operations;   11.7 MB/s
approximatememtablestats :       1.129 micros/op 885649 ops/sec 1.129 seconds 1000000 operations;
Reported entry count stats (expected 30000):
Count: 1000000 Average: 31098.8795  StdDev: 3601.47
Min: 21504  Median: 29333.9303  Max: 43008
Percentiles: P50: 29333.93 P75: 33018.00 P99: 43008.00 P99.9: 43008.00 P99.99: 43008.00
------------------------------------------------------
(   14000,   22000 ]      408   0.041%   0.041%
(   22000,   33000 ]   749327  74.933%  74.974% ###############
(   33000,   50000 ]   250265  25.027% 100.000% #####

readrandom   :       1.894 micros/op 528083 ops/sec 1.894 seconds 1000000 operations;    8.5 MB/s (989989 of 1000000 found)
```

This is *even faster* and relatively *more accurate*, with relative standard deviation closer to 10%. Code comments explain why. Now let's look at smaller ranges. Implementation quirks or conveniences:
* When actual number in range is >= 40, the minimum return value is 40.
* When the actual is <= 10, it is guaranteed to return that actual number.
```
$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=75
...
filluniquerandom :       1.417 micros/op 705668 ops/sec 1.417 seconds 999975 operations;   11.4 MB/s
approximatememtablestats :       3.342 micros/op 299197 ops/sec 3.342 seconds 1000000 operations;
Reported entry count stats (expected 75):
Count: 1000000 Average: 75.1210  StdDev: 15.02
Min: 40  Median: 71.9395  Max: 256
Percentiles: P50: 71.94 P75: 89.69 P99: 119.12 P99.9: 166.68 P99.99: 229.78
------------------------------------------------------
(      34,      51 ]    38867   3.887%   3.887% #
(      51,      76 ]   550554  55.055%  58.942% ###########
(      76,     110 ]   398854  39.885%  98.828% ########
(     110,     170 ]    11353   1.135%  99.963%
(     170,     250 ]      364   0.036%  99.999%
(     250,     380 ]        8   0.001% 100.000%

readrandom   :       1.861 micros/op 537224 ops/sec 1.861 seconds 1000000 operations;    8.7 MB/s (999974 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=25
...
filluniquerandom :       1.501 micros/op 666283 ops/sec 1.501 seconds 1000000 operations;   10.8 MB/s
approximatememtablestats :       5.118 micros/op 195401 ops/sec 5.118 seconds 1000000 operations;
Reported entry count stats (expected 25):
Count: 1000000 Average: 26.2392  StdDev: 4.58
Min: 25  Median: 28.4590  Max: 72
Percentiles: P50: 28.46 P75: 31.69 P99: 49.27 P99.9: 67.95 P99.99: 72.00
------------------------------------------------------
(      22,      34 ]   928936  92.894%  92.894% ###################
(      34,      51 ]    67960   6.796%  99.690% #
(      51,      76 ]     3104   0.310% 100.000%

readrandom   :       1.892 micros/op 528595 ops/sec 1.892 seconds 1000000 operations;    8.6 MB/s (1000000 of 1000000 found)

$ ./db_bench --benchmarks=filluniquerandom,approximatememtablestats,readrandom --value_size=1 --num=1000000 --batch_size=10
...
filluniquerandom :       1.642 micros/op 608916 ops/sec 1.642 seconds 1000000 operations;    9.9 MB/s
approximatememtablestats :       3.042 micros/op 328721 ops/sec 3.042 seconds 1000000 operations;
Reported entry count stats (expected 10):
Count: 1000000 Average: 10.0000  StdDev: 0.00
Min: 10  Median: 10.0000  Max: 10
Percentiles: P50: 10.00 P75: 10.00 P99: 10.00 P99.9: 10.00 P99.99: 10.00
------------------------------------------------------
(       6,      10 ]  1000000 100.000% 100.000% ####################

readrandom   :       1.805 micros/op 554126 ops/sec 1.805 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
```

Remarkably consistent.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13047

Test Plan: new db_bench test for both performance and accuracy (see above); added to crash test; unit test updated.

Reviewed By: cbi42

Differential Revision: D63722003

Pulled By: pdillinger

fbshipit-source-id: cfc8613c085e87c17ecec22d82601aac2a5a1b26
2024-10-02 14:25:50 -07:00
Jay Huh 2a5ff78c12 More info in CompactionServiceJobInfo and CompactionJobStats (#13029)
Summary:
Add the following to the `CompactionServiceJobInfo`
- compaction_reason
- is_full_compaction
- is_manual_compaction
- bottommost_level

Added `is_remote_compaction` to the `CompactionJobStats` and set initial values to avoid UB for uninitialized values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13029

Test Plan:
```
./compaction_service_test --gtest_filter="*CompactionInfo*"
```

Reviewed By: anand1976

Differential Revision: D63322878

Pulled By: jaykorean

fbshipit-source-id: f02a66ca45e660b9d354a43837d8ec6beb7621fb
2024-09-25 10:26:15 -07:00
Peter Dillinger a1a102ffce Steps toward deprecating implicit prefix seek, related fixes (#13026)
Summary:
With some new use cases onboarding to prefix extractors/seek/filters, one of the risks is existing iterator code, e.g. for maintenance tasks, being unintentionally subject to prefix seek semantics. This is a longstanding known design flaw with prefix seek, and `prefix_same_as_start` and `auto_prefix_mode` were steps in the direction of making that obsolete. However, we can't just immediately set `total_order_seek` to true by default, because that would impact so much code instantly.

Here we add a new DB option, `prefix_seek_opt_in_only` that basically allows users to transition to the future behavior when they are ready. When set to true, all iterators will be treated as if `total_order_seek=true` and then the only ways to get prefix seek semantics are with `prefix_same_as_start` or `auto_prefix_mode`.

Related fixes / changes:
* Make sure that `prefix_same_as_start` and `auto_prefix_mode` are compatible with (or override) `total_order_seek` (depending on your interpretation).
* Fix a bug in which a new iterator after dynamically changing the prefix extractor might mix different prefix semantics between memtable and SSTs. Both should use the latest extractor semantics, which means iterators ignoring memtable prefix filters with an old extractor. And that means passing the latest prefix extractor to new memtable iterators that might use prefix seek. (Without the fix, the test added for this fails in many ways.)

Suggested follow-up:
* Investigate a FIXME where a MergeIteratorBuilder is created in db_impl.cc. No unit test detects a change in value that should impact correctness.
* Make memtable prefix bloom compatible with `auto_prefix_mode`, which might require involving the memtablereps because we don't know at iterator creation time (only seek time) whether an auto_prefix_mode seek will be a prefix seek.
* Add `prefix_same_as_start` testing to db_stress

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13026

Test Plan:
tests updated, added. Add combination of `total_order_seek=true` and `auto_prefix_mode=true` to stress test. Ran `make blackbox_crash_test` for a long while.

Manually ran tests with `prefix_seek_opt_in_only=true` as default, looking for unexpected issues. I inspected most of the results and migrated many tests to be ready for such a change (but not all).

Reviewed By: ltamasi

Differential Revision: D63147378

Pulled By: pdillinger

fbshipit-source-id: 1f4477b730683d43b4be7e933338583702d3c25e
2024-09-20 15:54:19 -07:00
Jay Huh 5f4a8c3da4 Load latest options from OPTIONS file in Remote host (#13025)
Summary:
We've been serializing and deserializing DBOptions and CFOptions (and other CF into) as part of `CompactionServiceInput`. These are all readily available in the OPTIONS file and the remote worker can read the OPTIONS file to obtain the same information. This helps reducing the size of payload significantly.

In a very rare scenario if the OPTIONS file is purged due to options change by primary host at the same time while the remote host is loading the latest options, it may fail. In this case, we just retry once.

This also solves the problem where we had to open the default CF with the CFOption from another CF if the remote compaction is for a non-default column family. (TODO comment in /db_impl_secondary.cc)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13025

Test Plan:
Unit Tests
```
./compaction_service_test
```
```
./compaction_job_test
```

Also tested with Meta's internal Offload Infra

Reviewed By: anand1976, cbi42

Differential Revision: D63100109

Pulled By: jaykorean

fbshipit-source-id: b7162695e31e2c5a920daa7f432842163a5b156d
2024-09-20 13:26:02 -07:00
Changyu Bi 71e38dbe25 Compact one file at a time for FIFO temperature change compactions (#13018)
Summary:
Per customer request, we should not merge multiple SST files together during temperature change compaction, since this can cause FIFO TTL compactions to be delayed. This PR changes the compaction picking logic to pick one file at a time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13018

Test Plan: * updated some existing unit tests to test this new behavior.

Reviewed By: jowlyzhang

Differential Revision: D62883292

Pulled By: cbi42

fbshipit-source-id: 6a9fc8c296b5d9b17168ef6645f25153241c8b93
2024-09-19 15:50:41 -07:00
Levi Tamasi 54ace7f340 Change the semantics of blob_garbage_collection_force_threshold to provide better control over space amp (#13022)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13022

Currently, `blob_garbage_collection_force_threshold` applies to the oldest batch of blob files, which is typically only a small subset of the blob files currently eligible for garbage collection. This can result in a form of head-of-line blocking: no GC-triggered compactions will be scheduled if the oldest batch does not currently exceed the threshold, even if a lot of higher-numbered blob files do. This can in turn lead to high space amplification that exceeds the soft bound implicit in the force threshold (e.g. 50% would suggest a space amp of <2 and 75% would imply a space amp of <4). The patch changes the semantics of this configuration threshold to apply to the entire set of blob files that are eligible for garbage collection based on `blob_garbage_collection_age_cutoff`. This provides more intuitive semantics for the option and can provide a better write amp/space amp trade-off. (Note that GC-triggered compactions still pick the same SST files as before, so triggered GC still targets the oldest the blob files.)

Reviewed By: jowlyzhang

Differential Revision: D62977860

fbshipit-source-id: a999f31fe9cdda313de513f0e7a6fc707424d4a3
2024-09-19 15:47:13 -07:00
Peter Dillinger 98c33cb8e3 Steps toward making IDENTITY file obsolete (#13019)
Summary:
* Set write_dbid_to_manifest=true by default
* Add new option write_identity_file (default true) that allows us to opt-in to future behavior without identity file
* Refactor related DB open code to minimize code duplication

_Recommend hiding whitespace changes for review_

Intended follow-up: add support to ldb for reading and even replacing the DB identity in the manifest. Could be a variant of `update_manifest` command or based on it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13019

Test Plan: unit tests and stress test updated for new functionality

Reviewed By: anand1976

Differential Revision: D62898229

Pulled By: pdillinger

fbshipit-source-id: c08b25cf790610b034e51a9de0dc78b921abbcf0
2024-09-19 14:05:21 -07:00
Peter Dillinger 10984e8c26 Fix and generalize framework for filtering range queries, etc. (#13005)
Summary:
There was a subtle design/contract bug in the previous version of range filtering in experimental.h  If someone implemented a key segments extractor with "all or nothing" fixed size segments, that could result in unsafe range filtering. For example, with two segments of width 3:
```
x = 0x|12 34 56|78 9A 00|
y = 0x|12 34 56||78 9B
z = 0x|12 34 56|78 9C 00|
```
Segment 1 of y (empty) is out of order with segment 1 of x and z.

I have re-worked the contract to make it clear what does work, and implemented a standard extractor for fixed-size segments, CappedKeySegmentsExtractor. The safe approach for filtering is to consume as much as is available for a segment in the case of a short key.

I have also added support for min-max filtering with reverse byte-wise comparator, which is probably the 2nd most common comparator for RocksDB users (because of MySQL). It might seem that a min-max filter doesn't care about forward or reverse ordering, but it does when trying to determine whether in input range from segment values v1 to v2, where it so happens that v2 is byte-wise less than v1, is an empty forward interval or a non-empty reverse interval. At least in the current setup, we don't have that context.

A new unit test (with some refactoring) tests CappedKeySegmentsExtractor, reverse byte-wise comparator, and the corresponding min-max filter.

I have also (contractually / mathematically) generalized the framework to comparators other than the byte-wise comparator, and made other generalizations to make the extractor limitations more explicitly connected to the particular filters and filtering used--at least in description.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13005

Test Plan: added unit tests as described

Reviewed By: jowlyzhang

Differential Revision: D62769784

Pulled By: pdillinger

fbshipit-source-id: 0d41f0d0273586bdad55e4aa30381ebc861f7044
2024-09-18 15:26:37 -07:00
Nick Brekhus 0611eb5b9d Fix orphaned files in SstFileManager (#13015)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13015

`Close()`ing a database now releases tracked files in `SstFileManager`. Previously this space would be leaked until the database was later reopened.

Reviewed By: jowlyzhang

Differential Revision: D62590773

fbshipit-source-id: 5461bd253d974ac4967ad52fee92e2650f8a9a28
2024-09-18 13:27:44 -07:00
Changyu Bi f97e33454f Fix a bug with auto recovery on WAL write error (#12995)
Summary:
A recent crash test failure shows that auto recovery from WAL write failure can cause CFs to be inconsistent. A unit test repro in P1569398553. The following is an example sequence of events:

```
0. manual_wal_flush is true. There are multiple CFs in a DB.
1. Submit a write batch with updates to multiple CF
2. A FlushWAL or a memtable swtich that will try to write the buffered WAL data. Fail this write so that buffered WAL data is dropped: 4b1d595306/file/writable_file_writer.cc (L624)
The error needs to be retryable to start background auto recovery.
3. One CF successfully flushes its memtable during auto recovery.
4. Crash the process.
5. Reopen the DB, one CF will have the update as a result of successful flush. Other CFs will miss all the updates in the write batch since WAL does not have them.
```

This can happen if a users configures manual_wal_flush, uses more than one CF, and can hit retryable error for WAL writes. This PR is a short-term fix that upgrades WAL related errors to fatal and not trigger auto recovery.

A long-term fix may be not drop buffered WAL data by checking how much data is actually written, or require atomically flushing all column families during error recovery from this kind of errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12995

Test Plan:
added unit test to check error severity and if recovery is triggered. A crash test repro command that fails in a few runs before this PR:
```
python3 ./tools/db_crashtest.py blackbox --interval=60 --metadata_write_fault_one_in=1000 --column_families=10 --exclude_wal_from_write_fault_injection=0 --manual_wal_flush_one_in=1000 --WAL_size_limit_MB=10240 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=0 --auto_readahead_size=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_size=8388608 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0  --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944  --index_block_restart_interval=4 --index_shortening=1 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=16 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=2097152 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=10000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=bar --sqfc_version=1 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=2 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --uncache_aggressiveness=8 --universal_max_read_amp=-1 --unpartitioned_pinning=2 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=0 --use_attribute_group=1 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_sqfc_for_range_queries=0 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=50 --writepercent=35 --ops_per_thread=100000 --preserve_unverified_changes=1
```

Reviewed By: hx235

Differential Revision: D62888510

Pulled By: cbi42

fbshipit-source-id: 308bdbbb8d897cc8eba950155cd0e37cf7eb76fe
2024-09-17 14:10:33 -07:00
Nicholas Ormrod 0e04ef1a96 Deshim coro in fbcode/internal_repo_rocksdb
Summary:
The following rules were deshimmed:
```
//folly/experimental/coro:accumulate -> //folly/coro:accumulate
//folly/experimental/coro:async_generator -> //folly/coro:async_generator
//folly/experimental/coro:async_pipe -> //folly/coro:async_pipe
//folly/experimental/coro:async_scope -> //folly/coro:async_scope
//folly/experimental/coro:async_stack -> //folly/coro:async_stack
//folly/experimental/coro:baton -> //folly/coro:baton
//folly/experimental/coro:blocking_wait -> //folly/coro:blocking_wait
//folly/experimental/coro:collect -> //folly/coro:collect
//folly/experimental/coro:concat -> //folly/coro:concat
//folly/experimental/coro:coroutine -> //folly/coro:coroutine
//folly/experimental/coro:current_executor -> //folly/coro:current_executor
//folly/experimental/coro:detach_on_cancel -> //folly/coro:detach_on_cancel
//folly/experimental/coro:detail_barrier -> //folly/coro:detail_barrier
//folly/experimental/coro:detail_barrier_task -> //folly/coro:detail_barrier_task
//folly/experimental/coro:detail_current_async_frame -> //folly/coro:detail_current_async_frame
//folly/experimental/coro:detail_helpers -> //folly/coro:detail_helpers
//folly/experimental/coro:detail_malloc -> //folly/coro:detail_malloc
//folly/experimental/coro:detail_manual_lifetime -> //folly/coro:detail_manual_lifetime
//folly/experimental/coro:detail_traits -> //folly/coro:detail_traits
//folly/experimental/coro:filter -> //folly/coro:filter
//folly/experimental/coro:future_util -> //folly/coro:future_util
//folly/experimental/coro:generator -> //folly/coro:generator
//folly/experimental/coro:gmock_helpers -> //folly/coro:gmock_helpers
//folly/experimental/coro:gtest_helpers -> //folly/coro:gtest_helpers
//folly/experimental/coro:inline_task -> //folly/coro:inline_task
//folly/experimental/coro:invoke -> //folly/coro:invoke
//folly/experimental/coro:merge -> //folly/coro:merge
//folly/experimental/coro:mutex -> //folly/coro:mutex
//folly/experimental/coro:promise -> //folly/coro:promise
//folly/experimental/coro:result -> //folly/coro:result
//folly/experimental/coro:retry -> //folly/coro:retry
//folly/experimental/coro:rust_adaptors -> //folly/coro:rust_adaptors
//folly/experimental/coro:scope_exit -> //folly/coro:scope_exit
//folly/experimental/coro:shared_lock -> //folly/coro:shared_lock
//folly/experimental/coro:shared_mutex -> //folly/coro:shared_mutex
//folly/experimental/coro:sleep -> //folly/coro:sleep
//folly/experimental/coro:small_unbounded_queue -> //folly/coro:small_unbounded_queue
//folly/experimental/coro:task -> //folly/coro:task
//folly/experimental/coro:timed_wait -> //folly/coro:timed_wait
//folly/experimental/coro:timeout -> //folly/coro:timeout
//folly/experimental/coro:traits -> //folly/coro:traits
//folly/experimental/coro:transform -> //folly/coro:transform
//folly/experimental/coro:unbounded_queue -> //folly/coro:unbounded_queue
//folly/experimental/coro:via_if_async -> //folly/coro:via_if_async
//folly/experimental/coro:with_async_stack -> //folly/coro:with_async_stack
//folly/experimental/coro:with_cancellation -> //folly/coro:with_cancellation
//folly/experimental/coro:bounded_queue -> //folly/coro:bounded_queue
//folly/experimental/coro:shared_promise -> //folly/coro:shared_promise
//folly/experimental/coro:cleanup -> //folly/coro:cleanup
//folly/experimental/coro:auto_cleanup_fwd -> //folly/coro:auto_cleanup_fwd
//folly/experimental/coro:auto_cleanup -> //folly/coro:auto_cleanup
```

The following headers were deshimmed:
```
folly/experimental/coro/Accumulate.h -> folly/coro/Accumulate.h
folly/experimental/coro/Accumulate-inl.h -> folly/coro/Accumulate-inl.h
folly/experimental/coro/AsyncGenerator.h -> folly/coro/AsyncGenerator.h
folly/experimental/coro/AsyncPipe.h -> folly/coro/AsyncPipe.h
folly/experimental/coro/AsyncScope.h -> folly/coro/AsyncScope.h
folly/experimental/coro/AsyncStack.h -> folly/coro/AsyncStack.h
folly/experimental/coro/Baton.h -> folly/coro/Baton.h
folly/experimental/coro/BlockingWait.h -> folly/coro/BlockingWait.h
folly/experimental/coro/Collect.h -> folly/coro/Collect.h
folly/experimental/coro/Collect-inl.h -> folly/coro/Collect-inl.h
folly/experimental/coro/Concat.h -> folly/coro/Concat.h
folly/experimental/coro/Concat-inl.h -> folly/coro/Concat-inl.h
folly/experimental/coro/Coroutine.h -> folly/coro/Coroutine.h
folly/experimental/coro/CurrentExecutor.h -> folly/coro/CurrentExecutor.h
folly/experimental/coro/DetachOnCancel.h -> folly/coro/DetachOnCancel.h
folly/experimental/coro/detail/Barrier.h -> folly/coro/detail/Barrier.h
folly/experimental/coro/detail/BarrierTask.h -> folly/coro/detail/BarrierTask.h
folly/experimental/coro/detail/CurrentAsyncFrame.h -> folly/coro/detail/CurrentAsyncFrame.h
folly/experimental/coro/detail/Helpers.h -> folly/coro/detail/Helpers.h
folly/experimental/coro/detail/Malloc.h -> folly/coro/detail/Malloc.h
folly/experimental/coro/detail/ManualLifetime.h -> folly/coro/detail/ManualLifetime.h
folly/experimental/coro/detail/Traits.h -> folly/coro/detail/Traits.h
folly/experimental/coro/Filter.h -> folly/coro/Filter.h
folly/experimental/coro/Filter-inl.h -> folly/coro/Filter-inl.h
folly/experimental/coro/FutureUtil.h -> folly/coro/FutureUtil.h
folly/experimental/coro/Generator.h -> folly/coro/Generator.h
folly/experimental/coro/GmockHelpers.h -> folly/coro/GmockHelpers.h
folly/experimental/coro/GtestHelpers.h -> folly/coro/GtestHelpers.h
folly/experimental/coro/detail/InlineTask.h -> folly/coro/detail/InlineTask.h
folly/experimental/coro/Invoke.h -> folly/coro/Invoke.h
folly/experimental/coro/Merge.h -> folly/coro/Merge.h
folly/experimental/coro/Merge-inl.h -> folly/coro/Merge-inl.h
folly/experimental/coro/Mutex.h -> folly/coro/Mutex.h
folly/experimental/coro/Promise.h -> folly/coro/Promise.h
folly/experimental/coro/Result.h -> folly/coro/Result.h
folly/experimental/coro/Retry.h -> folly/coro/Retry.h
folly/experimental/coro/RustAdaptors.h -> folly/coro/RustAdaptors.h
folly/experimental/coro/ScopeExit.h -> folly/coro/ScopeExit.h
folly/experimental/coro/SharedLock.h -> folly/coro/SharedLock.h
folly/experimental/coro/SharedMutex.h -> folly/coro/SharedMutex.h
folly/experimental/coro/Sleep.h -> folly/coro/Sleep.h
folly/experimental/coro/Sleep-inl.h -> folly/coro/Sleep-inl.h
folly/experimental/coro/SmallUnboundedQueue.h -> folly/coro/SmallUnboundedQueue.h
folly/experimental/coro/Task.h -> folly/coro/Task.h
folly/experimental/coro/TimedWait.h -> folly/coro/TimedWait.h
folly/experimental/coro/Timeout.h -> folly/coro/Timeout.h
folly/experimental/coro/Timeout-inl.h -> folly/coro/Timeout-inl.h
folly/experimental/coro/Traits.h -> folly/coro/Traits.h
folly/experimental/coro/Transform.h -> folly/coro/Transform.h
folly/experimental/coro/Transform-inl.h -> folly/coro/Transform-inl.h
folly/experimental/coro/UnboundedQueue.h -> folly/coro/UnboundedQueue.h
folly/experimental/coro/ViaIfAsync.h -> folly/coro/ViaIfAsync.h
folly/experimental/coro/WithAsyncStack.h -> folly/coro/WithAsyncStack.h
folly/experimental/coro/WithCancellation.h -> folly/coro/WithCancellation.h
folly/experimental/coro/BoundedQueue.h -> folly/coro/BoundedQueue.h
folly/experimental/coro/SharedPromise.h -> folly/coro/SharedPromise.h
folly/experimental/coro/Cleanup.h -> folly/coro/Cleanup.h
folly/experimental/coro/AutoCleanup-fwd.h -> folly/coro/AutoCleanup-fwd.h
folly/experimental/coro/AutoCleanup.h -> folly/coro/AutoCleanup.h
```

This is a codemod. It was automatically generated and will be landed once it is approved and tests are passing in sandcastle.
You have been added as a reviewer by Sentinel or Butterfly.

Autodiff project: dcoro
Autodiff partition: fbcode.internal_repo_rocksdb
Autodiff bookmark: ad.dcoro.fbcode.internal_repo_rocksdb

Reviewed By: dtolnay

Differential Revision: D62684411

fbshipit-source-id: 8dbd31ab64fcdd99435d322035b9668e3200e0a3
2024-09-14 09:48:21 -07:00
anand76 cabd2d8718 Fix a couple of missing cases of retry on corruption (#13007)
Summary:
For SST checksum mismatch corruptions in the read path, RocksDB retries the read if the underlying file system supports verification and reconstruction of data (`FSSupportedOps::kVerifyAndReconstructRead`). There were a couple of places where the retry was missing - reading the SST footer and the properties block. This PR fixes the retry in those cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13007

Test Plan: Add new unit tests

Reviewed By: jaykorean

Differential Revision: D62519186

Pulled By: anand1976

fbshipit-source-id: 50aa38f18f2a53531a9fc8d4ccdf34fbf034ed59
2024-09-13 13:56:49 -07:00
Changyu Bi e490f2b051 Fix a bug in ReFitLevel() where `FileMetaData::being_compacted` is not cleared (#13009)
Summary:
in ReFitLevel(), we were not setting being_compacted to false after ReFitLevel() is done. This is not a issue if refit level is successful, since new FileMetaData is created for files at the target level. However, if there's an error during RefitLevel(), e.g., Manifest write failure, we should clear the being_compacted field for these files. Otherwise, these files will not be picked for compaction until db reopen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13009

Test Plan:
existing test.
- stress test failure in T200339331 should not happen anymore.

Reviewed By: hx235

Differential Revision: D62597169

Pulled By: cbi42

fbshipit-source-id: 0ba659806da6d6d4b42384fc95268b2d7bad720e
2024-09-12 15:19:14 -07:00
Yu Zhang 43bc71fef6 Add an internal API MemTableList::GetEditForDroppingCurrentVersion (#13001)
Summary:
Prepare this internal API to be used by atomic data replacement. The main purpose of this API is to get a `VersionEdit` to mark the entire current `MemTableListVersion` as dropped.  Flush needs the similar functionality when installing results, so that logic is refactored into a util function `GetDBRecoveryEditForObsoletingMemTables` to be shared by flush and this internal API.

To test this internal API, flush's result installation is redirected to use this API when it is flushing all the immutable MemTables in debug mode. It should achieve the exact same results, just with a duplicated `VersionEdit::log_number` field that doesn't upsets the recovery logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13001

Test Plan: Existing tests

Reviewed By: pdillinger

Differential Revision: D62309591

Pulled By: jowlyzhang

fbshipit-source-id: e25914d9a2e281c25ab7ee31a66eaf6adfae4b88
2024-09-10 13:23:13 -07:00
Yu Zhang 0c6e9c036a Make compaction always use the input version with extra ref protection (#12992)
Summary:
`Compaction` is already creating its own ref for the input Version: 4b1d595306/db/compaction/compaction.cc (L73)

And properly Unref it during destruction:
4b1d595306/db/compaction/compaction.cc (L450)

This PR redirects compaction's access of `cfd->current()` to this input `Version`, to prepare for when a column family's data can be replaced all together, and `cfd->current()` is not safe to access for a compaction job. Because a new `Version` with just some other external files could be installed as `cfd->current()`. The compaction job's expectation of the current `Version` and the corresponding storage info to always have its input files will no longer be guaranteed.

My next follow up is to do a similar thing for flush, also to prepare it for when a column family's data can be replaced. I will make it create its own reference of the current `MemTableListVersion` and use it as input, all flush job's access of memtables will be wired to that input `MemTableListVersion`. Similarly this reference will be unreffed during a flush job's destruction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12992

Test Plan: Existing tests

Reviewed By: pdillinger

Differential Revision: D62212625

Pulled By: jowlyzhang

fbshipit-source-id: 9a781213469cf366857a128d50a702af683a046a
2024-09-06 14:07:33 -07:00
Yu Zhang a24574e80a Add documentation for background job's state transition (#12994)
Summary:
The `SchedulePending*` API is a bit confusing since it doesn't immediately schedule the work and can be confused with the actual scheduling. So I have changed these to be `EnqueuePending*` and added some documentation for the corresponding state transitions of these background work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12994

Test Plan: existing tests

Reviewed By: cbi42

Differential Revision: D62252746

Pulled By: jowlyzhang

fbshipit-source-id: ee68be6ed33070cad9a5004b7b3e16f5bcb041bf
2024-09-06 13:08:34 -07:00
Changyu Bi cd6f802ccb Add a new file ingestion option `link_files` (#12980)
Summary:
Add option `IngestExternalFileOptions::link_files` that hard links input files and preserves original file links after ingestion, unlike `move_files` which will unlink input files after ingestion. This can be useful when being used together with `allow_db_generated_files` to ingest files from another DB. Also reverted the change to `move_files` in https://github.com/facebook/rocksdb/issues/12959 to simplify the contract so that it will always unlink input files without exception.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12980

Test Plan: updated unit test `ExternSSTFileLinkFailFallbackTest.LinkFailFallBackExternalSst` to test that input files will not be unlinked.

Reviewed By: pdillinger

Differential Revision: D61925111

Pulled By: cbi42

fbshipit-source-id: eadaca72e1ae5288bdd195d57158466e5656fa62
2024-09-03 13:06:25 -07:00
Peter Dillinger d96e67c2bf Fix flaky test DBTest2.VariousFileTemperatures (#12974)
Summary:
... apparently due to potentially not purging obsolete files after CompactRange

Example: https://github.com/facebook/rocksdb/actions/runs/10564621261/job/29267393711?pr=12959

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12974

Test Plan: reproduced failure with USE_CLANG=1 COERCE_CONTEXT_SWITCH=1, now fixed

Reviewed By: cbi42

Differential Revision: D61812600

Pulled By: pdillinger

fbshipit-source-id: d4b23e1a179bb8ec39875ed7a8ce1649fa3344bd
2024-08-26 14:08:21 -07:00
Changyu Bi 4eb5878ab2 Support ingesting db generated files using hard link (#12959)
Summary:
so `IngestExternalFileOptions::move_files` and `IngestExternalFileOptions::allow_db_generated_files` are now compatible. The original file links won't be removed if `allow_db_generated_files` is true. This is to prevent deleting files from another DB.

There was a [comment](https://github.com/facebook/rocksdb/pull/12750#discussion_r1684509620) in https://github.com/facebook/rocksdb/issues/12750 about how exactly-once ingestion would work with `move_files`. I've discussed with customer and decided that it can be done by reading the target DB to see if it contains any ingested key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12959

Test Plan: updated unit tests `IngestDBGeneratedFileTest*` to enable `move_files`.

Reviewed By: jowlyzhang

Differential Revision: D61703480

Pulled By: cbi42

fbshipit-source-id: 6b4294369767f989a2f36bbace4ca3c0257aeaf7
2024-08-26 12:25:16 -07:00
Peter Dillinger 96340dbce2 Options for file temperature for more files (#12957)
Summary:
We have a request to use the cold tier as primary source of truth for the DB, and to best support such use cases and to complement the existing options controlling SST file temperatures, we add two new DB options:
* `metadata_write_temperature` for DB "small" files that don't contain much user data
* `wal_write_temperature` for WALs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12957

Test Plan: Unit test included, though it's hard to be sure we've covered all the places

Reviewed By: jowlyzhang

Differential Revision: D61664815

Pulled By: pdillinger

fbshipit-source-id: 8e19c9dd8fd2db059bb15f74938d6bc12002e82b
2024-08-23 19:49:25 -07:00
eniac1024 f5c5f881d2 Fix MultiGet with timestamps (#12943)
Summary:
Issue: MultiGet(PinnableSlice) can't read out all timestamps.
Fixed the impl, and added an UT as well. In the original impl, if MultiGet reads multiple column families, a later column family would clean up timestamps of previous column family.
Fix: https://github.com/facebook/rocksdb/issues/12950#issue-2476996580

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12943

Reviewed By: anand1976

Differential Revision: D61729257

Pulled By: pdillinger

fbshipit-source-id: 55267c26076c8a59acedd27e14714711729a40df
2024-08-23 13:58:04 -07:00
Changyu Bi c62de54c7c Record largest seqno in table properties and verify in file ingestion (#12951)
Summary:
this helps to avoid scanning input files when ingesting db generated files: ecb844babd/db/external_sst_file_ingestion_job.cc (L917-L935)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12951

Test Plan:
* `IngestDBGeneratedFileTest.FailureCase` is updated to verify that this table property is verified during ingestion
* existing unit tests for other ingestion use cases.

Reviewed By: jowlyzhang

Differential Revision: D61608285

Pulled By: cbi42

fbshipit-source-id: b5b7aae9741531349ab247be6ffaa3f3628b76ca
2024-08-21 16:24:18 -07:00
Yu Zhang 945f60b157 Add some documentation for version edit handlers (#12948)
Summary:
As titled. No functional change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12948

Reviewed By: hx235

Differential Revision: D61551254

Pulled By: jowlyzhang

fbshipit-source-id: ccf53d78bd2f18f174d7e61972e5de467c96ce76
2024-08-21 10:09:10 -07:00
Yu Zhang 81d52bdc1a Fix UDT in memtable only assertions (#12946)
Summary:
Empty memtables can be legitimately created and flushed, for example by error recovery flush attempts:

273b3eadf0/db/db_impl/db_impl_compaction_flush.cc (L2309-L2312)

This check is updated to be considerate of this.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12946

Reviewed By: hx235

Differential Revision: D61492477

Pulled By: jowlyzhang

fbshipit-source-id: 7d16fcaea457948546072f85b3650fd1cc24f9db
2024-08-20 09:19:52 -07:00
Changyu Bi defd97bc9d Add an option to verify memtable key order during reads (#12889)
Summary:
add a new CF option `paranoid_memory_checks` that allows additional data integrity validations during read/scan. Currently, skiplist-based memtable will validate the order of keys visited. Further data validation can be added in different layers. The option will be opt-in due to performance overhead.

The motivation for this feature is for services where data correctness is critical and want to detect in-memory corruption earlier. For a corrupted memtable key, this feature can help to detect it during during reads instead of during flush with existing protections (OutputValidator that verifies key order or per kv checksum). See internally linked task for more context.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12889

Test Plan:
* new unit test added for paranoid_memory_checks=true.
* existing unit test for paranoid_memory_checks=false.
* enable in stress test.

Performance Benchmark: we check for performance regression in read path where data is in memtable only. For each benchmark, the script was run at the same time for main and this PR:
* Memtable-only randomread ops/sec:
```
(for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000  --seed=1723056275 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }';

Main: 608146
PR with paranoid_memory_checks=false: 607727 (- %0.07)
PR with paranoid_memory_checks=true: 521889 (-%14.2)
```

* Memtable-only sequential scan ops/sec:
```
(for I in $(seq 1 50); do ./db_bench--benchmarks=fillseq,readseq[-X10] --write_buffer_size=268435456 --num=1000000  --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';

Main: 9180077
PR with paranoid_memory_checks=false: 9536241 (+%3.8)
PR with paranoid_memory_checks=true: 7653934 (-%16.6)
```

* Memtable-only reverse scan ops/sec:
```
(for I in $(seq 1 20); do ./db_bench --benchmarks=fillseq,readreverse[-X10] --write_buffer_size=268435456 --num=1000000  --seed=1723056275 2>1 | grep "\[AVG 10 runs\]"; done;) | awk '{ t += $6; c++; print; } END { printf "%.0f\n", 1.0 * t / c }';

 Main: 1285719
 PR with integrity_checks=false: 1431626 (+%11.3)
 PR with integrity_checks=true: 811031 (-%36.9)
```

The `readrandom` benchmark shows no regression. The scanning benchmarks show improvement that I can't explain.

Reviewed By: pdillinger

Differential Revision: D60414267

Pulled By: cbi42

fbshipit-source-id: a70b0cbeea131f1a249a5f78f9dc3a62dacfaa91
2024-08-19 13:53:25 -07:00
Jay Huh 273b3eadf0 Add Remote Compaction Installation Callback Function (#12940)
Summary:
Add an optional callback function upon remote compaction temp output installation. This will be internally used for setting the final status in the Offload Infra.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12940

Test Plan:
Unit Test added
```
./compaction_service_test
```

_Also internally tested by manually merging into internal code base_

Reviewed By: anand1976

Differential Revision: D61419157

Pulled By: jaykorean

fbshipit-source-id: 66831685bc403949c26bfc65840dd1900d2a5a67
2024-08-19 11:22:43 -07:00
Yu Zhang 295326b6ee Best efforts recovery recover seqno prefix (#12938)
Summary:
This PR make best efforts recovery more permissive by allowing it to recover incomplete Version that presents a valid point in time view from the user's perspective. Currently, a Version is only valid and saved if all files consisting that Version can be found. With this change, if only a suffix of L0 files (and their associated blob files) are missing,  a valid Version is also available to be saved and recover to. Note that we don't do this if the column family was atomically flushed. Because atomic flush also need a consistent view across the column families, we cannot guarantee that if we are recovering to incomplete version.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12938

Test Plan: Existing tests and added unit tests.

Reviewed By: anand1976

Differential Revision: D61414381

Pulled By: jowlyzhang

fbshipit-source-id: f9b73deb34d35ad696ab42315928b656d586262a
2024-08-16 17:18:54 -07:00
Peter Dillinger 4d3518951a Option to decouple index and filter partitions (#12939)
Summary:
Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating tail latencies and thrash conditions that can arise with large metadata blocks (sometimes megabytes each) that can arise with large SST files. In general, the cost to partitioned metadata is more CPU in accesses (especially for filters where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads.

However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed.

Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair because of its despite its proportionally smaller cost of keeping in block cache, and most of the cost of a miss to re-load it (random IO) is not proportional to the size (similar latency etc. up to ~32KB).

 ## This change

Adds a new table option decouple_partitioned_filters allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder.

This change includes refactoring to pass that previous key to the filter builder when available, with the filter building caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still overall improvement especially with https://github.com/facebook/rocksdb/issues/12931.

Suggested follow-up:
* Update confusing use of "last key" to refer to "previous key"
* Expand unit test coverage with parallel compression and dictionary training
* Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939

Test Plan:
unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature.

 ## SST write performance (CPU)

Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931 but with -decouple_partitioned_filters=1 in the "after" configuration, which benchmarking shows makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR
```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
923691 vs. 924851 (-0.13%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
921398 vs. 922973 (-0.17%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
902259 vs. 908756 (-0.71%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
917932 vs. 916901 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
912755 vs. 907298 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
899754 vs. 892433 (+0.82%)
```
I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations.

 ## Read performance

Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with larger metadata block size setting (8k instead of 4k). Basic setup:

```
(for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 | tee results-$CS; done)
```

And read ops/s results:

```CSV
Cache size MB,After/decoupled/4k,Before/4k,Before/8k
3,15593,15158,12826
6,16295,16693,14134
10,20427,20813,18459
20,27035,26836,27384
30,33250,31810,33846
60,35518,32585,35329
100,36612,31805,35292
300,35780,31492,35481
1000,34145,31551,35411
1100,35219,31380,34302
1200,35060,31037,34322
```

If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration.

Reviewed By: jowlyzhang

Differential Revision: D61376772

Pulled By: pdillinger

fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f
2024-08-16 15:34:31 -07:00
Peter Dillinger f63428bcc7 Optimize, simplify filter block building (fix regression) (#12931)
Summary:
This is in part a refactoring / simplification to set up for "decoupled" partitioned filters and in part to fix an intentional regression for a correctness fix in https://github.com/facebook/rocksdb/issues/12872. Basically, we are taking out some complexity of the filter block builders, and pushing part of it (simultaneous de-duplication of prefixes and whole keys) into the filter bits builders, where it is more efficient by operating on hashes (rather than copied keys).

Previously, the FullFilterBlockBuilder had a somewhat fragile and confusing set of conditions under which it would keep a copy of the most recent prefix and most recent whole key, along with some other state that is essentially redundant. Now we just track (always) the previous prefix in the PartitionedFilterBlockBuilder, to deal with the boundary prefix Seek filtering problem. (Btw, the next PR will optimize this away since BlockBasedTableReader already tracks the previous key.) And to deal with the problem of de-duplicating both whole keys and prefixes going into a single filter, we add a new function to FilterBitsBuilder that has that extra de-duplication capabilty, which is relatively efficient because we only have to cache an extra 64-bit hash, not a copied key or prefix. (The API of this new function is somewhat awkward to avoid a small CPU regression in some cases.)

Also previously, there was awkward logic split between FullFilterBlockBuilder and PartitionedFilterBlockBuilder to deal with some things specific to partitioning. And confusing names like Add vs. AddKey. FullFilterBlockBuilder is much cleaner and simplified now.

The splitting of PartitionedFilterBlockBuilder::MaybeCutAFilterBlock into DecideCutAFilterBlock and CutAFilterBlock is to address what would have been a slight performance regression in some cases. The split allows for more intruction-level parallelism by reducing unnecessary control dependencies.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12931

Test Plan:
existing tests (with some minor updates)

Also manually ported over the pre-broken regression test described in
 https://github.com/facebook/rocksdb/issues/12870 and ran it (passed).

Performance:
Here we validate that an entire series of recent related PRs are a net improvement in aggregate. "Before" is with these PRs reverted: https://github.com/facebook/rocksdb/issues/12872 #12911 https://github.com/facebook/rocksdb/issues/12874 #12867 https://github.com/facebook/rocksdb/issues/12903 #12904. "After" includes this PR (and all
of those, with base revision 16c21af). Simultaneous test script designed to maximally depend on SST construction efficiency:

```
for PF in 0 1; do for PS in 0 8; do for WK in 0 1; do [ "$PS" == "$WK" ] || (for I in `seq 1 20`; do TEST_TMPDIR=/dev/shm/rocksdb2 ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -memtablerep=vector -allow_concurrent_memtable_write=0 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK 2>&1 | grep micros/op; done) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; echo "Was -partition_index_and_filters=$PF -prefix_size=$PS -whole_key_filtering=$WK"; done; done; done) | tee results
```

Showing average ops/sec of "after" vs. "before"

```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
935586 vs. 928176 (+0.79%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
930171 vs. 926801 (+0.36%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
910727 vs. 894397 (+1.8%)
-partition_index_and_filters=1 -prefix_size=0 -whole_key_filtering=1
929795 vs. 922007 (+0.84%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
921924 vs. 917285 (+0.51%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
903393 vs. 887340 (+1.8%)
```

As one would predict, the most improvement is seen in cases where we have optimized away copying the whole key.

Reviewed By: jowlyzhang

Differential Revision: D61138271

Pulled By: pdillinger

fbshipit-source-id: 427cef0b1465017b45d0a507bfa7720fa20af043
2024-08-14 15:13:16 -07:00
Yu Zhang d458331ee9 Move file tracking in VersionEditHandlerPointInTime to VersionBuilder (#12928)
Summary:
`VersionEditHandlerPointInTime` is tracking found files, missing files, intermediate files in order to decide to build a `Version` on negative edge trigger (transition from valid to invalid) without applying  the current `VersionEdit`.  However, applying `VersionEdit` and check completeness of a `Version` are specialization of `VersionBuilder`.  More importantly, when we augment best efforts recovery to recover not just complete point in time Version but also a prefix of seqno for a point in time Version, such checks need to be duplicated in `VersionEditHandlerPointInTime` and `VersionBuilder`.

To avoid this, this refactor move all the file tracking functionality in `VersionEditHandlerPointInTime` into `VersionBuilder`.  To continue to let `VersionEditHandlerPIT` do the edge trigger check and  build a `Version` before applying the current `VersionEdit`, a suite of APIs to supporting creating a save point and its associated functions are added in `VersionBuilder` to achieve this.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12928

Test Plan: Existing tests

Reviewed By: anand1976

Differential Revision: D61171320

Pulled By: jowlyzhang

fbshipit-source-id: 604f66f8b1e3a3e13da59d8ba357c74e8a366dbc
2024-08-12 21:09:37 -07:00