Commit Graph

12987 Commits

Author SHA1 Message Date
Peter Dillinger 485ee4f45c Fix and test for leaks of open SST files (#13117)
Summary:
Follow-up to https://github.com/facebook/rocksdb/issues/13106 which revealed that some SST file readers (in addition to blob files) were being essentially leaked in TableCache (until DB::Close() time). Patched sources of leaks:
* Flush that is not committed (builder.cc)
* Various obsolete SST files picked up by directory scan but not caught by SubcompactionState::Cleanup() cleaning up from some failed compactions. Dozens of unit tests fail without the "backstop" TableCache::Evict() call in PurgeObsoleteFiles().

We also needed to adjust the check for leaks as follows:
* Ok if DB::Open never finished (see comment)
* Ok if deletions are disabled (see comment)
* Allow "quarantined" files to be in table_cache because (presumably) they might become live again.
* Get live files from all live Versions.

Suggested follow-up:
* Potentially delete more obsolete files sooner with a FIXME in db_impl_files.cc. This could potentially be high value because it seems to gate deletion of any/all newer obsolete files on all older compactions finishing.
* Try to catch obsolete files in more places using the VersionSet::obsolete_files_ pipeline rather than relying on them being picked up with directory scan, or deleting them outside of normal mechanisms.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13117

Test Plan: updated check used in most all unit tests in ASAN build

Reviewed By: hx235

Differential Revision: D65502988

Pulled By: pdillinger

fbshipit-source-id: aa0795a8a09d9ec578d25183fe43e2a35849209c
2024-11-08 10:54:43 -08:00
Levi Tamasi ba164ac373 Add allow_unprepared_value+PrepareValue() to the stress tests (#13125)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13125

The patch adds the new read option `allow_unprepared_value` and the new `Iterator` / `CoalescingIterator` / `AttributeGroupIterator` API `PrepareValue()` to the stress/crash tests. The change affects the batched, non-batched, and CF consistency stress test flavors and the `TestIterate`, `TestPrefixScan`, and `TestIterateAgainstExpected` operations.

Reviewed By: hx235

Differential Revision: D65636380

fbshipit-source-id: fd0caa0e87d03b6206667f07499b0c11847d1bbe
2024-11-07 21:24:21 -08:00
Yu Zhang 282f5a463b Fix write committed transactions replay when UDT setting toggles (#13121)
Summary:
This PR adds some missing pieces in order to handle UDT setting toggles while replay WALs for WriteCommitted transactions DB. Specifically, all the transaction markers for no op, prepare, commit, rollback are currently not carried over from the original WriteBatch to the new WriteBatch when there is a timestamp setting difference detected. This PR fills that gap.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13121

Test Plan: Added unit tests

Reviewed By: ltamasi

Differential Revision: D65558801

Pulled By: jowlyzhang

fbshipit-source-id: 8176882637b95f6dc0dad10d7fe21056fa5173d1
2024-11-06 17:32:03 -08:00
Levi Tamasi 2ba4dceb4c Add a new API Transaction::GetAttributeGroupIterator (#13119)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13119

The patch adds a new API `Transaction::GetAttributeGroupIterator` that can be used to create a multi-column-family attribute group iterator over the specified column families, including the data from both the transaction and the underlying database. This API is currently supported for optimistic and write-committed pessimistic transactions.

Reviewed By: jowlyzhang

Differential Revision: D65548324

fbshipit-source-id: 0fb8a22129494770fdba3d6024eef72b3e051136
2024-11-06 15:18:03 -08:00
Yu Zhang dc34a0ff1e Add some checks for the file ingestion flow (#13100)
Summary:
This PR does a few misc things for file ingestion flow:

- Add an invalid argument status return for the combination of `allow_global_seqno = false` and external files' key range overlap in `Prepare` stage.
- Add a MemTables status check for when column family is flushed before `Run`.
- Replace the column family dropped check with an assertion after thread enters the write queue and before it exits the write queue, since dropping column family can only happen in the single threaded write queue too and we already checked once after enter write queue.
- Add an `ExternalSstFileIngestionJob::GetColumnFamilyData` API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13100

Test Plan: Added unit tests, and stress tested the ingestion path

Reviewed By: hx235

Differential Revision: D65180472

Pulled By: jowlyzhang

fbshipit-source-id: 180145dd248a7507a13a543481b135e5a31ebe2d
2024-11-05 15:44:56 -08:00
Yu Zhang 8089eae240 Fix assertion that compaction input files are freeed (#13109)
Summary:
This assertion could fail if the compaction input files were successfully trivially moved. On re-locking db mutex after successful `LogAndApply`, those files could have been picked up again by some other compactions. And the assertion will fail.

Example failure: P1669529213

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13109

Reviewed By: cbi42

Differential Revision: D65308574

Pulled By: jowlyzhang

fbshipit-source-id: 32413bdc8e28e67a0386c3fe6327bf0b302b9d1d
2024-11-05 09:39:54 -08:00
Andrew Chang a7ecbfd590 Remove early return when scanning files for temperature change compaction (#13112)
Summary:
This is a small follow-up to https://github.com/facebook/rocksdb/pull/13083.

When we check the `newest_key_time` of files for temperature change compaction, we currently return early if we ever find a file with an unknown `est_newest_key_time`.

However, it is possible for a younger file to have a populated value for `newest_key_time`, since this is a new table property.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13112

Test Plan: The existing unit tests are sufficient.

Reviewed By: cbi42

Differential Revision: D65451797

Pulled By: archang19

fbshipit-source-id: 28e67c2d35a6315f912471f2848de87dd7088d99
2024-11-05 09:12:39 -08:00
Levi Tamasi 3becc9409e Some small improvements around allow_unprepared_value and multi-CF iterators (#13113)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13113

The patch makes some small improvements related to `allow_unprepared_value` and multi-CF iterators as groundwork for further changes:

1) Similarly to `BaseDeltaIterator`'s base iterator, `MultiCfIteratorImpl` gets passed its child iterators by the client. Even though they are currently guaranteed to have been created using the same read options as each other and the multi-CF iterator, it is safer to not assume this and call `PrepareValue` unconditionally before using any child iterator's `value()` or `columns()`.
2) Again similarly to `BaseDeltaIterator`, it makes sense to pass the entire `ReadOptions` structure to `MultiCfIteratorImpl` in case it turns out to require other read options in the future.
3) The constructors of the various multi-CF iterator classes now take an rvalue reference to a vector of column family handle + `unique_ptr` to child iterator pairs and use move semantics to take ownership of this vector (instead of taking two separate vectors of column family handles and raw iterator pointers).
4) Constructor arguments and the members of `MultiCfIteratorImpl` are reordered for consistency.

Reviewed By: jowlyzhang

Differential Revision: D65407521

fbshipit-source-id: 66c2c689ec8b036740bd98641b7b5c0ff7e777f2
2024-11-04 18:06:07 -08:00
Peter Dillinger e7ffca9493 Refactoring toward making preserve/preclude options mutable (#13114)
Summary:
Move them to MutableCFOptions and perform appropriate refactorings to make that work. I didn't want to mix up refactoring with interesting functional changes. Potentially non-trivial bits here:
* During DB Open or RegisterRecordSeqnoTimeWorker we use `GetLatestMutableCFOptions()` because either (a) there might not be a current version, or (b) we are in the process of applying the desired next options.
* Upgrade some test infrastructure to allow some options in MutableCFOptions to be mutable (should be a temporary state)
* Fix a warning that showed up about uninitialized `paranoid_memory_checks`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13114

Test Plan: existing tests, manually check options are still not settable with SetOptions

Reviewed By: jowlyzhang

Differential Revision: D65429031

Pulled By: pdillinger

fbshipit-source-id: 6e0906d08dd8ddf62731cefffe9b8d94149942b9
2024-11-04 16:15:10 -08:00
changyubi 2ce6902cf5 Introduce an interface `ReadOnlyMemTable` for immutable memtables (#13107)
Summary:
This PR sets up follow-up changes for large transaction support. It introduces an interface that allows custom implementations of immutable memtables. Since transactions use a WriteBatchWithIndex to index their operations, I plan to add a ReadOnlyMemTable implementation backed by WriteBatchWithIndex. This will enable direct ingestion of WriteBatchWithIndex into the DB as an immutable memtable, bypassing memtable writes for transactions.

The changes mostly involve moving required methods for immutable memtables into the ReadOnlyMemTable class.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13107

Test Plan:
* Existing unit test and stress test.
* Performance: I do not expect this change to cause noticeable performance regressions with LTO and devirtualization. The memtable-only readrandom benchmark shows no consistent performance difference:
```
USE_LTO=1 OPTIMIZE_LEVEL="-O3"  DEBUG_LEVEL=0 make -j160 db_bench

(for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000  --seed=1723056275 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }';

3 runs:
main: 760728, 752727, 739600
PR:   763036, 750696, 739022
```

Reviewed By: jowlyzhang

Differential Revision: D65365062

Pulled By: cbi42

fbshipit-source-id: 40c673ab856b91c65001ef6d6ac04b65286f2882
2024-11-04 16:09:34 -08:00
Yu Zhang 24045549a6 Add a flag for testing standalone range deletion file (#13101)
Summary:
As titled. This flag controls how frequent standalone range deletion file is tested in the file ingestion flow, for better debuggability.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13101

Test Plan: Manually tested in stress test

Reviewed By: hx235

Differential Revision: D65361004

Pulled By: jowlyzhang

fbshipit-source-id: 21882e7cc5918aff45449acaeb33b696ab1e37f0
2024-11-01 17:07:34 -07:00
Levi Tamasi 1006eddd63 Make BaseDeltaIterator honor allow_unprepared_value (#13111)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13111

As a follow-up to https://github.com/facebook/rocksdb/pull/13105, the patch changes `BaseDeltaIterator` so that it honors the read option `allow_unprepared_value`. When the option is set and the `BaseDeltaIterator` lands on the base iterator, it defers calling `PrepareValue` on the base iterator and setting `value()` and `columns()` until `PrepareValue` is called on the `BaseDeltaIterator` itself.

Reviewed By: jowlyzhang

Differential Revision: D65344764

fbshipit-source-id: d79c77b5de7c690bf2deeff435e9b0a9065f6c5c
2024-11-01 14:10:18 -07:00
Andrew Ryan Chang 7c98a2d130 Update MultiGet to respect the strict_capacity_limit block cache option (#13104)
Summary:
There is a `strict_capacity_limit` option which imposes a hard memory limit on the block cache. When the block cache is enabled, every read request is serviced from the block cache. If the required block is missing, it is first inserted into the cache. If `strict_capacity_limit` is `true` and the limit has been reached, the `Get` and `MultiGet` requests should fail. However, currently this is not happening for `MultiGet`.

I updated `MultiGet` to explicitly check the returned status of `MaybeReadBlockAndLoadToCache`, so the status does not get overwritten later.

Thank you anand1976 for the problem explanation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13104

Test Plan:
Added unit test for both `Get` and `MultiGet` with a `strict_capacity_limit` set.

Before the change, half of my unit test cases failed https://github.com/facebook/rocksdb/actions/runs/11604597524/job/32313608085?pr=13104. After I added the check for the status returned by `MaybeReadBlockAndLoadToCache`, they all pass.

I also ran these tests manually (I had to run `make clean` before):

```
make -j64 block_based_table_reader_test COMPILE_WITH_ASAN=1 ASSERT_STATUS_CHECKED=1

 ./block_based_table_reader_test --gtest_filter="*StrictCapacityLimitReaderTest.Get*"
 ./block_based_table_reader_test --gtest_filter="*StrictCapacityLimitReaderTest.MultiGet*"

```

Reviewed By: anand1976

Differential Revision: D65302470

Pulled By: archang19

fbshipit-source-id: 28dcc381e67e05a89fa9fc9607b4709976d6d90e
2024-11-01 13:22:27 -07:00
Andrew Ryan Chang af2a36d2c7 Record newest_key_time as a table property (#13083)
Summary:
This PR does two things:
1. Adds a new table property `newest_key_time`
2. Uses this property to improve TTL and temperature change compaction.

### Context

The current `creation_time` table property should really be named `oldest_ancestor_time`. For flush output files, this is the oldest key time in the file. For compaction output files, this is the minimum among all oldest key times in the input files.

The problem with using the oldest ancestor time for TTL compaction is that we may end up dropping files earlier than we should. What we really want is the newest (i.e. "youngest") key time. Right now we take a roundabout way to estimate this value -- we take the value of the _oldest_ key time for the _next_ (newer) SST file. This is also why the current code has checks for `index >= 1`.

Our new property `newest_key_time` is set to the file creation time during flushes, and the max over all input files for compactions.

There were some additional smaller changes that I had to make for testing purposes:
- Refactoring the mock table reader to support specifying my own table properties
- Refactoring out a test utility method `GetLevelFileMetadatas`  that would otherwise be copy/pasted in 3 places

Credit to cbi42 for the problem explanation and proposed solution

### Testing

- Added a dedicated unit test to my `newest_key_time` logic in isolation (i.e. are we populating the property on flush and compaction)
- Updated the existing unit tests (for TTL/temperate change compaction), which were comprehensive enough to break when I first made my code changes. I removed the test setup code which set the file metadata `oldest_ancestor_time`, so we know we are actually only using the new table property instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13083

Reviewed By: cbi42

Differential Revision: D65298604

Pulled By: archang19

fbshipit-source-id: 898ef91b692ab33f5129a2a16b64ecadd4c32432
2024-11-01 10:08:35 -07:00
Peter Dillinger a28cc4a38c Fix a leak of open Blob files (#13106)
Summary:
An earlier change (b34cef57b7) removed apparently unused functionality where an obsolete blob file number is passed for removal from TableCache, which manages SST files. This was actually relying on broken/fragile abstractions wherein TableCache and BlobFileCache share the same Cache and using the TableCache interface to manipulate blob file caching. No unit test was actually checking for removal of obsolete blob files from the cache (which is somewhat tricky to check and a second order correctness requirement).

Here we fix the leak and add a DEBUG+ASAN-only check in DB::Close() that no obsolete files are lingering in the table/blob file cache.

Fixes https://github.com/facebook/rocksdb/issues/13066

Important follow-up (FIXME): The added check discovered some apparent cases of leaked (into table_cache) SST file readers that would stick around until DB::Close(). Need to enable that check, diagnose, and fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13106

Test Plan:
added a check that is called during DB::Close in ASAN builds (to minimize paying the cost in all unit tests). Without the fix, the check failed in at least these tests:

```
db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge
db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase
db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache
db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection
```

Reviewed By: ltamasi

Differential Revision: D65296123

Pulled By: pdillinger

fbshipit-source-id: 2276d76482beb2c75c9010bc1bec070bb23a24c0
2024-10-31 15:29:30 -07:00
Levi Tamasi ef535039f3 Call PrepareValue on the base iterator in BaseDeltaIterator (#13105)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13105

The `WriteBatchWithIndex::NewIteratorWithBase` interface enables creating a `BaseDeltaIterator` with an arbitrary base iterator passed in by the client, which has potentially been created with the `allow_unprepared_value` read option set. Because of this, `BaseDeltaIterator` has to call `PrepareValue` before using the `value()` or `columns()` from the base iterator. This includes both the case when `BaseDeltaIterator` exposes the `value()` and `columns()` of the base iterator as is and the case when the final `value()` / `columns()` is a result of merging key-values across the base and delta iterators. Note that `BaseDeltaIterator` itself does not support `allow_unprepared_value` yet; this will be implemented in an upcoming patch.

Reviewed By: jowlyzhang

Differential Revision: D65249643

fbshipit-source-id: b0a1ccc0dfd31105b2eef167b463ed15a8bb83b7
2024-10-31 14:20:33 -07:00
Jay Huh 1987313a94 TableProperties Serialization Follow Ups (#13095)
Summary:
Follow ups from https://github.com/facebook/rocksdb/issues/13089
- Take `TableProperties` as `const &` instead of `std::shared_ptr<const TableProperties>`
- Move TableProperties OptionsTypeMap definition to another place for other use outside of Remote Compaction
- Add a test verify that the set of field serializations of TableProperties is complete

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13095

Test Plan:
```
./options_settable_test --gtest_filter="*TablePropertiesAllFieldsSettable*"
```
I also intentionally tried adding a new field to `TableProperties`. If it's missed in the OptionsType map, the test detects the missing bytes set and successfully fails.

Reviewed By: pdillinger

Differential Revision: D65077398

Pulled By: jaykorean

fbshipit-source-id: cf10560eb4a467ca523b11fd64945dbc86ac378f
2024-10-31 11:13:53 -07:00
Peter Dillinger e34087c524 Add a temporary hook for custom yielding in long-running op (#13103)
Summary:
This is a simplified version of https://github.com/facebook/rocksdb/issues/13096, which called for a way to hook into long-running loops completely within RocksDB to change their thread priority (or similar). The current prime hook point is `DBIter::FindNextUserEntryInternal` likely because of iterating over tombstones.

This is implemented using the weak symbol hack for ease of back-porting/patching, and while we get to know potential future requirements better for integration into the public API. (Consider potential relationships to `Env::GetThreadStatusUpdater()` and `TransactionDBMutexFactory`.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13103

Test Plan:
Performance validated with db_bench and DEBUG_LEVEL=0: `./db_bench --benchmarks=fillseq,deleterandom,readseq[-X100] --value_size=1 --num=1000000`

No consistent difference seen; variances likely in how DB / executable / memory were laid out.

```
With an empty hook:
readseq [AVG    100 runs] : 1753018 (± 8850) ops/sec;   28.4 (± 0.1) MB/sec
readseq [MEDIAN 100 runs] : 1763746 ops/sec;   28.6 MB/sec
(recompile)
readseq [AVG    100 runs] : 1789019 (± 10260) ops/sec;   29.0 (± 0.2) MB/sec
readseq [MEDIAN 100 runs] : 1801849 ops/sec;   29.2 MB/sec

Base:
readseq [AVG    100 runs] : 1772196 (± 8240) ops/sec;   28.7 (± 0.1) MB/sec
readseq [MEDIAN 100 runs] : 1780453 ops/sec;   28.9 MB/sec
(recompile)
readseq [AVG    100 runs] : 1777637 (± 7613) ops/sec;   28.8 (± 0.1) MB/sec
readseq [MEDIAN 100 runs] : 1786657 ops/sec;   29.0 MB/sec

With a functional hook (count number of calls into it):
readseq [AVG    100 runs] : 1796733 (± 8854) ops/sec;   29.1 (± 0.1) MB/sec
readseq [MEDIAN 100 runs] : 1804690 ops/sec;   29.3 MB/sec
RocksDbThreadYield: 126915800
(recompile)
readseq [AVG    100 runs] : 1775371 (± 10529) ops/sec;   28.8 (± 0.2) MB/sec
readseq [MEDIAN 100 runs] : 1789046 ops/sec;   29.0 MB/sec
RocksDbThreadYield: 126977000

Base:
readseq [AVG    100 runs] : 1773071 (± 10657) ops/sec;   28.7 (± 0.2) MB/sec
readseq [MEDIAN 100 runs] : 1783414 ops/sec;   28.9 MB/sec
(recompile)
readseq [AVG    100 runs] : 1750852 (± 10184) ops/sec;   28.4 (± 0.2) MB/sec
readseq [MEDIAN 100 runs] : 1763587 ops/sec;   28.6 MB/sec
```

Reviewed By: george-reynya

Differential Revision: D65235379

Pulled By: pdillinger

fbshipit-source-id: 7829e4cc25a56d4c1801b8adf9c7f7aa49ab7aca
2024-10-30 20:37:28 -07:00
leipeng 8109046222 secondary instance: remove unnessisary cfds_changed->count() (#13086)
Summary:
`cfds_changed->count(cfd)` is not needed, just blind insert.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13086

Reviewed By: hx235

Differential Revision: D64712400

Pulled By: cbi42

fbshipit-source-id: 4ef62aaa724c8397baa4ff350c16a7a8d04d7067
2024-10-29 11:04:20 -07:00
Peter Dillinger ddafba870d Fix a HISTORY entry for 9.8.0 (#13097)
Summary:
Forgot to update after generalizing mutability of BBTO

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13097

Test Plan: no functional change here

Reviewed By: jaykorean

Differential Revision: D65095618

Pulled By: pdillinger

fbshipit-source-id: 6c37cd0e68756c6b56af1c8e15273fae0ca9224d
2024-10-28 21:27:42 -07:00
Peter Dillinger 9ad772e652 Start version 9.9.0 (#13093)
Summary:
Pull in HISTORY for 9.8.0, update version.h for next version, update check_format_compatible.sh

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13093

Test Plan: CI

Reviewed By: jowlyzhang

Differential Revision: D64987257

Pulled By: pdillinger

fbshipit-source-id: a7cec329e3d245e63767760aa0298c08c3281695
2024-10-25 13:47:29 -07:00
Jay Huh 57a8e69d4e Include TableProperties in the CompactionServiceResult (#13089)
Summary:
In Remote Compactions, the primary host receives the serialized compaction result from the remote worker and deserializes it to build the output. Unlike Local Compactions, where table properties are built by TableBuilder, in Remote Compactions, these properties were not included in the serialized compaction result. This was likely done intentionally since the table properties are already available in the SST files.

Because TableProperties are not populated as part of CompactionOutputs for remote compactions, we were unable to log the table properties in OnCompactionComplete and use them for verification. We are adding the TableProperties as part of the CompactionServiceOutputFile in this PR. By including the TableProperties in the serialized compaction result, the primary host will be able to access them and verify that they match the values read from the actual SST files.

We are also adding the populating `format_version` in table_properties of in TableBuilder.  This has not been a big issue because the `format_version` is written to the SST files directly from `TableOptions.format_version`. When loaded from the SST files, it's populated directly by reading from the MetaBlock. This info has only been missing in the TableBuilder's Rep.props.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13089

Test Plan:
```
./compaction_job_test
```
```
./compaction_service_test
```

Reviewed By: pdillinger

Differential Revision: D64878740

Pulled By: jaykorean

fbshipit-source-id: b6f2fdce851e6477ecb4dd5a87cdc62e176b746b
2024-10-25 13:13:12 -07:00
Peter Dillinger 3fd1f11d35 Fix race to make BlockBasedTableOptions effectively mutable (#13082)
Summary:
Fix a longstanding race condition in SetOptions for `block_based_table_factory` options. The fix is mostly described in new, unified `TableFactoryParseFn()` in `cf_options.cc`. Also in this PR:
* Adds a virtual `Clone()` function to TableFactory
* To avoid behavioral hiccups with `SetOptions`, make the "hidden state" of `BlockBasedTableFactory` shared between an original and a clone. For example, `TailPrefetchStats`
* `Configurable` was allowed to be copied but was not safe to do so, because the copy would have and use pointers into object it was copied from (!!!). This has been fixed using relative instead of absolute pointers, though it's still technically relying on undefined behavior (consistent object layout for non-standard-layout types).

For future follow-up:
* Deny SetOptions on block cache options (dubious and not yet made safe with proper shared_ptr handling)

Fixes https://github.com/facebook/rocksdb/issues/10079

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13082

Test Plan:
added to unit tests and crash test

Ran TSAN blackbox crashtest for hours with options to amplify potential race (see https://github.com/facebook/rocksdb/issues/10079)

Reviewed By: cbi42

Differential Revision: D64947243

Pulled By: pdillinger

fbshipit-source-id: 8390299149f50e2a2b39a5247680f2637edb23c8
2024-10-25 10:24:54 -07:00
Yu Zhang 9c94559de7 Optimize compaction for standalone range deletion files (#13078)
Summary:
This PR adds some optimization for compacting standalone range deletion files. A standalone range deletion file is one with just a single range deletion. Currently, such a file is used in bulk loading to achieve something like atomically delete old version of all data with one big range deletion and adding new version of data. These are the changes included in the PR:

1) When a standalone range deletion file is ingested via bulk loading, it's marked for compaction.
2) When picking input files during compaction picking, we attempt to only pick a standalone range deletion file when oldest snapshot is at or above the file's seqno. To do this, `PickCompaction` API is updated to take existing snapshots as an input. This is only done for the universal compaction + UDT disabled combination, we save querying for existing snapshots and not pass it for all other cases.
3) At `Compaction` construction time, the input files will be filtered to examine if any of them can be skipped for compaction iterator. For example, if all the data of the file is deleted by a standalone range tombstone, and the oldest snapshot is at or above such range tombstone, this file will be filtered out.
4) Every time a snapshot is released, we examine if any column family has standalone range deletion files that becomes eligible to be scheduled for compaction. And schedule one for it.

Potential future improvements:
- Add some dedicated statistics for the filtered files.
- Extend this input filtering to L0 files' compactions cases when a newer L0 file could shadow an older L0 file

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13078

Test Plan: Added unit tests and stress tested a few rounds

Reviewed By: cbi42

Differential Revision: D64879415

Pulled By: jowlyzhang

fbshipit-source-id: 02b8683fddbe11f093bcaa0a38406deb39f44d9e
2024-10-25 09:32:14 -07:00
Changyu Bi 8b38d4b400 Fix write tracing to check callback status (#13088)
Summary:
we currently record write operations to tracer before checking callback in PipelinedWriteImpl and WriteImplWALOnly. For optimistic transaction DB, this means that an operation can be recorded to tracer even when it's not written to DB or WAL. I suspect this is the reason some of our optimistic txn crash test is failing. The evidence is that the trace contains some duplicated entry and has more entries compared to the corresponding entry in WAL. This PR moves the tracer logic to be after checking callback status.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13088

Test Plan: monitor crash test.

Reviewed By: hx235

Differential Revision: D64711753

Pulled By: cbi42

fbshipit-source-id: 55fd1223538ec6294ce84a957c306d3d9d91df5f
2024-10-21 21:02:03 -07:00
Levi Tamasi c0be6a4b90 Support allow_unprepared_value for multi-CF iterators (#13079)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13079

The patch adds support for the new read option `allow_unprepared_value` to the multi-column-family iterators `CoalescingIterator` and `AttributeGroupIterator`. When this option is set, these iterators populate their value (`value()` + `columns()` or `attribute_groups()`) in an on-demand fashion when `PrepareValue()` is called. Calling `PrepareValue()` on the child iterators is similarly deferred until `PrepareValue()` is called on the main iterator.

Reviewed By: jowlyzhang

Differential Revision: D64570587

fbshipit-source-id: 783c8d408ad10074417dabca7b82c5e1fe5cab36
2024-10-20 20:53:08 -07:00
Jay Huh 0ca691654f Fix Unit Test failing from uninit values in CompactionServiceInput (#13080)
Summary:
# Summary

There was a [test failure](https://github.com/facebook/rocksdb/actions/runs/11381731053/job/31663774089?fbclid=IwZXh0bgNhZW0CMTEAAR0YJVdnkKUhN15RJQrLsvicxqzReS6y4A14VFQbWu-81XJsSsyNepXAr2c_aem_JyQqNdtpeKFSA6CjlD-pDg) from uninit value in the CompactionServiceInput

```
[ RUN      ] CompactionJobTest.InputSerialization
==79945== Use of uninitialised value of size 8
==79945==    at 0x58EA69B: _itoa_word (_itoa.c:179)
==79945==    by 0x5906574: __vfprintf_internal (vfprintf-internal.c:1687)
==79945==    by 0x591AF99: __vsnprintf_internal (vsnprintf.c:114)
==79945==    by 0x1654AE: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > __gnu_cxx::__to_xstring<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>(int (*)(char*, unsigned long, char const*, __va_list_tag*), unsigned long, char const*, ...) (string_conversions.h:111)
==79945==    by 0x5126C65: to_string (basic_string.h:6568)
==79945==    by 0x5126C65: rocksdb::SerializeSingleOptionHelper(void const*, rocksdb::OptionType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (options_helper.cc:541)
==79945==    by 0x512718B: rocksdb::OptionTypeInfo::Serialize(rocksdb::ConfigOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const (options_helper.cc:1084)
```

This was due to `options_file_number` value not set in the unit test. However, this value is guaranteed to be set in the normal path. It was just missing in the test path. Setting the 0 as the default value for uninitialized fields in the `CompactionServiceInput` and `CompactionServiceResult` for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13080

Test Plan: Existing tests should be sufficient

Reviewed By: cbi42

Differential Revision: D64573567

Pulled By: jaykorean

fbshipit-source-id: 7843a951770c74445620623d069a52ba93ad94d5
2024-10-18 07:31:54 -07:00
Hui Xiao 58fc9d61b7 Trim readahead size based on prefix during prefix scan (#13040)
Summary:
**Context/Summary:**
During prefix scan, prefetched data blocks containing keys not in the same prefix as the `Seek()`'s key will be wasted when `ReadOptions::prefix_same_as_start = true` since they won't be returned to the user. This PR is to exclude those data blocks from being prefetched in a similar manner like trimming according to `ReadOptions::iterate_upper_bound`.

Bonus: refactoring to some existing prefetch test so they are easier to extend and read

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13040

Test Plan:
- New UT, integration to existing UTs
- Benchmark to ensure no regression from CPU due to more trimming logic
```
// Build DB with one sorted run under the same prefix
./db_bench --benchmarks=fillrandom --prefix_size=3 --keys_per_prefix=5000000 --num=5000000 --db=/dev/shm/db_bench --disable_auto_compactions=1
```
```
// Augment the existing db bench to call `Seek()` instead of `SeekToFirst()` in `void ReadSequential(){..}` to trigger the logic in this PR

+++ b/tools/db_bench_tool.cc
@@ -5900,7 +5900,12 @@ class Benchmark {
     Iterator* iter = db->NewIterator(options);
     int64_t i = 0;
     int64_t bytes = 0;
-    for (iter->SeekToFirst(); i < reads_ && iter->Valid(); iter->Next()) {
+
+    iter->SeekToFirst();
+    assert(iter->status().ok() && iter->Valid());
+    auto prefix = prefix_extractor_->Transform(iter->key());
+
+    for (iter->Seek(prefix); i < reads_ && iter->Valid(); iter->Next()) {
       bytes += iter->key().size() + iter->value().size();
       thread->stats.FinishedOps(nullptr, db, 1, kRead);
       ++i;
:
```
```
// Compare prefix scan performance
./db_bench --benchmarks=readseq[-X20] --prefix_size=3  --prefix_same_as_start=1 --auto_readahead_size=1 --cache_size=1 --use_existing_db=1 --db=/dev/shm/db_bench --disable_auto_compactions=1

// Before PR
readseq [AVG    20 runs] : 2449011 (± 50238) ops/sec;  270.9 (± 5.6) MB/sec
readseq [MEDIAN 20 runs] : 2499167 ops/sec;  276.5 MB/sec

// After PR  (regress 0.4 %)
readseq [AVG    20 runs] : 2439098 (± 42931) ops/sec;  269.8 (± 4.7) MB/sec
readseq [MEDIAN 20 runs] : 2460859 ops/sec;  272.2 MB/sec

```

- Stress test: randomly set `prefix_same_as_start` in `TestPrefixScan()`. Run below for a while
```
python3 tools/db_crashtest.py --simple blackbox --prefix_size=5 --prefixpercent=65 --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=-1 --bloom_bits=3 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=0 --compaction_style=2 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0  --db_write_buffer_size=8388608 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=0 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=15 --index_shortening=2 --index_type=2 --ingest_external_file_one_in=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32  --readpercent=10 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --uncache_aggressiveness=1 --universal_max_read_amp=10 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=10
```

Reviewed By: anand1976

Differential Revision: D64367065

Pulled By: hx235

fbshipit-source-id: 5750c05ccc835c3e9dc81c961b76deaf30bd23c2
2024-10-17 15:52:55 -07:00
Peter Dillinger ac24f152a1 Refactor `table_factory` into MutableCFOptions (#13077)
Summary:
This is setting up for a fix to a data race in SetOptions on BlockBasedTableOptions (BBTO), https://github.com/facebook/rocksdb/issues/10079
The race will be fixed by replacing `table_factory` with a modified copy whenever we want to modify a BBTO field.

An argument could be made that this change creates more entaglement between features (e.g. BlobSource <-> MutableCFOptions), rather than (conceptually) minimizing the dependencies of each feature, but
* Most of these things already depended on ImmutableOptions
* Historically there has been a lot of plumbing (and possible small CPU overhead) involved in adding features that need to reach a lot of places, like `block_protection_bytes_per_key`. Keeping those wrapped up in options simplifies that.
* SuperVersion management generally takes care of lifetime management of MutableCFOptions, so is not that difficult. (Crash test agrees so far.)

There are some FIXME places where it is known to be unsafe to replace `block_cache` unless/until we handle shared_ptr tracking properly. HOWEVER, replacing `block_cache` is generally dubious, at least while existing users of the old block cache (e.g. table readers) can continue indefinitely.

The change to cf_options.cc is essentially just moving code (not changing).

I'm not concerned about the performance of copying another shared_ptr with MutableCFOptions, but I left a note about considering an improvement if more shared_ptr are added to it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13077

Test Plan:
existing tests, crash test.

Unit test DBOptionsTest.GetLatestCFOptions updated with some temporary logic. MemoryTest required some refactoring (simplification) for the change.

Reviewed By: cbi42

Differential Revision: D64546903

Pulled By: pdillinger

fbshipit-source-id: 69ae97ce5cf4c01b58edc4c5d4687eb1e5bf5855
2024-10-17 14:13:20 -07:00
Levi Tamasi a44f4b8653 Couple of small improvements for (Iterator)AttributeGroup (#13076)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13076

The patch makes it possible to construct an `IteratorAttributeGroup` using an `AttributeGroup` instance, and implements `operator==` / `operator!=` for these two classes consistently. It also makes some minor improvements in the related test suites `CoalescingIteratorTest` and `AttributeGroupIteratorTest`.

Reviewed By: jaykorean

Differential Revision: D64510653

fbshipit-source-id: 95d3340168fa3b34e7ef534587b19131f0a27fb7
2024-10-16 20:58:26 -07:00
Jay Huh f22557886e Fix Compaction Stats (#13071)
Summary:
Compaction stats code is not so straightforward to understand. Here's a bit of context for this PR and why this change was made.

- **CompactionStats (compaction_stats_.stats):** Internal stats about the compaction used for logging and public metrics.
- **CompactionJobStats (compaction_job_stats_)**: The public stats at job level. It's part of Compaction event listener and included in the CompactionResult.
- **CompactionOutputsStats**: output stats only. resides in CompactionOutputs. It gets aggregated toward the CompactionStats (internal stats).

The internal stats, `compaction_stats_.stats`, has the output information recorded from the compaction iterator, but it does not have any input information (input records, input output files) until `UpdateCompactionStats()` gets called. We cannot simply call `UpdateCompactionStats()` to fill in the input information in the remote compaction (which is a subcompaction of the primary host's compaction) because the `compaction->inputs()` have the full list of input files and `UpdateCompactionStats()` takes the entire list of records in all files. `num_input_records` gets double-counted if multiple sub-compactions are submitted to the remote worker.

The job level stats (in the case of remote compaction, it's subcompaction level stat), `compaction_job_stats_`, has the correct input records, but has no output information. We can use `UpdateCompactionJobStats(compaction_stats_.stats)` to set the output information (num_output_records, num_output_files, etc.) from the `compaction_stats_.stats`, but it also sets all other fields including the input information which sets all back to 0.

Therefore, we are overriding `UpdateCompactionJobStats()` in remote worker only to update job level stats, `compaction_job_stats_`, with output information of the internal stats.

Baiscally, we are merging the aggregated output info from the internal stats and aggregated input info from the compaction job stats.

In this PR we are also fixing how we are setting `is_remote_compaction` in CompactionJobStats.
- OnCompactionBegin event, if options.compaction_service is set, `is_remote_compaction=true` for all compactions except for trivial moves
- OnCompactionCompleted event, if any of the sub_compactions were done remotely, compaction level stats's `is_remote_compaction` will be true

Other minor changes
- num_output_records is already available in CompactionJobStats. No need to store separately in CompactionResult.
- total_bytes is not needed.
- Renamed `SubcompactionState::AggregateCompactionStats()` to `SubcompactionState::AggregateCompactionOutputStats()` to make it clear that it's only aggregating output stats.
- Renamed `SetTotalBytes()` to `AddBytesWritten()` to make it more clear that it's adding total written bytes from the compaction output.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13071

Test Plan:
Unit Tests added and updated
```
./compaction_service_test
```

Reviewed By: anand1976

Differential Revision: D64479657

Pulled By: jaykorean

fbshipit-source-id: a7a776a00dc718abae95d856b661bcbafd3b0ed5
2024-10-16 19:20:37 -07:00
Changyu Bi 8ad4c7efc4 Add an API to check if an SST file is generated by SstFileWriter (#13072)
Summary:
Some users want to check if a file in their DB was created by SstFileWriter and ingested into hte DB. Files created by SstFileWriter records additional table properties cbebbad7d9/table/sst_file_writer_collectors.h (L17-L21). These are not exposed so I'm adding an API to SstFileWriter to check for them.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13072

Test Plan: added some assertions.

Reviewed By: jowlyzhang

Differential Revision: D64411518

Pulled By: cbi42

fbshipit-source-id: 279bfef48615aa6f78287b643d8445a1471e7f07
2024-10-16 16:57:05 -07:00
Changyu Bi 787730c859 Add an ingestion option to not fill block cache (#13067)
Summary:
add `IngestExternalFileOptions::fill_cache` to allow users to ingest files without loading index/filter/data and other blocks into block cache during file ingestion. This can be useful when users are ingesting files into a CF that is not available to readers yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13067

Test Plan:
* unit test: `ExternalSSTFileTest.NoBlockCache`
* ran one round of crash test with fill_cache disabled: `python3 ./tools/db_crashtest.py --simple blackbox --ops_per_thread=1000000 --interval=30 --ingest_external_file_one_in=200 --level0_stop_writes_trigger=200 --level0_slowdown_writes_trigger=100 --sync_fault_injection=0 --disable_wal=0 --manual_wal_flush_one_in=0`

Reviewed By: jowlyzhang

Differential Revision: D64356424

Pulled By: cbi42

fbshipit-source-id: b380c26f5987238e1ed7d42ceef0390cfaa0b8e2
2024-10-16 14:11:22 -07:00
Levi Tamasi ecc084d301 Support the on-demand loading of blobs during iteration (#13069)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13069

Currently, when using range scans with BlobDB, the iterator logic eagerly loads values from blob files when landing on a new entry. This can be wasteful in use cases where the values associated with some keys in the range are not used by the application. The patch introduces a new read option `allow_unprepared_value`; when specified, this option results in the above eager loading getting bypassed. Values needed by the application can be then loaded on an on-demand basis by calling the new iterator API `PrepareValue`. Note that currently, only regular single-CF iterators are supported; multi-CF iterators and transactions will be extended in later PRs.

Reviewed By: jowlyzhang

Differential Revision: D64360723

fbshipit-source-id: ee55502fa15dcb307a984922b9afc9d9da15d6e1
2024-10-16 12:34:57 -07:00
Jay Huh da5e11310b Preserve Options File (#13074)
Summary:
In https://github.com/facebook/rocksdb/issues/13025 , we made a change to load the latest options file in the remote worker instead of serializing the entire set of options.

That was done under assumption that OPTIONS file do not get purged often. While testing, we learned that this happens more often than we want it to be, so we want to prevent the OPTIONS file from getting purged anytime between when the remote compaction is scheduled and the option is loaded in the remote worker.

Like how we are protecting new SST files from getting purged using `min_pending_output`, we are doing the same by keeping track of `min_options_file_number`. Any OPTIONS file with number greater than `min_options_file_number` will be protected from getting purged. Just like `min_pending_output`, `min_options_file_number` gets bumped when the compaction is done. This is only applicable when `options.compaction_service` is set.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13074

Test Plan:
```
./compaction_service_test --gtest_filter="*PreservedOptionsLocalCompaction*"
./compaction_service_test --gtest_filter="*PreservedOptionsRemoteCompaction*"
```

Reviewed By: anand1976

Differential Revision: D64433795

Pulled By: jaykorean

fbshipit-source-id: 0d902773f0909d9481dec40abf0b4c54ce5e86b2
2024-10-16 09:22:51 -07:00
Levi Tamasi 55de26580a Small improvement to MultiCFIteratorImpl (#13075)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13075

The patch simplifies the iteration logic in `MultiCFIteratorImpl::{Advance,Populate}Iterator` a bit and adds some assertions to uniformly enforce the invariant that any iterators currently on the heap should be valid and have an OK status.

Reviewed By: jaykorean

Differential Revision: D64429566

fbshipit-source-id: 36bc22465285b670f859692a048e10f21df7da7a
2024-10-15 17:32:07 -07:00
Yu Zhang 2cb00c6921 Ingest files in separate batches if they overlap (#13064)
Summary:
This PR assigns levels to files in separate batches if they overlap. This approach can potentially assign external files to lower levels.

In the prepare stage, if the input files' key range overlaps themselves, we divide them up in the user specified order into multiple batches. Where the files in the same batch do not overlap with each other, but key range could overlap between batches. If the input files' key range don't overlap, they always just make one default batch.

During the level assignment stage, we assign levels to files one batch after another.  It's guaranteed that files within one batch are not overlapping, we assign level to each file one after another. If the previous batch's uppermost level is specified, all files in this batch will be assigned to levels that are higher than that level. The uppermost level used by this batch of files is also tracked, so that it can be used by the next batch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13064

Test Plan:
Updated test and added new test
Manually stress tested

Reviewed By: cbi42

Differential Revision: D64428373

Pulled By: jowlyzhang

fbshipit-source-id: 5aeff125c14094c87cc50088505010dfd2da3d6e
2024-10-15 17:22:01 -07:00
anand76 2abbb02d14 Troubleshoot blackbox crash test final verification hang (#13070)
Summary:
Add a timeout for the blackbox crash test final verification step, and print the db_stress stack trace on a timeout. The crash test occasionally hangs in the verification step and this will help debug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13070

Reviewed By: hx235

Differential Revision: D64414461

Pulled By: anand1976

fbshipit-source-id: 4629aac01fbe6c788665beddc66280ba446aadbe
2024-10-15 13:39:24 -07:00
anand76 cbebbad7d9 Sanitize checkpoint_one_in and lock_wal_one_in db_stress params (#13068)
Summary:
Checkpoint creation skips flushing the memtable, even if explicitly requested, when the WAL is locked. This can happen if the user calls `LockWAL()`. In this case, db_stress checkpoint verification fails as the checkpoint will not contain keys present in the primary DB's memtable. Sanitize `checkpoint_one_in` and `lock_wal_one_in` so they're mutually exclusive.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13068

Reviewed By: hx235

Differential Revision: D64353998

Pulled By: anand1976

fbshipit-source-id: 7c93563347f033b6008a47a7d71471e59747e143
2024-10-14 21:11:37 -07:00
Jay Huh dd76862b00 Add file_checksum from FileChecksumGenFactory and Tests for corrupted output (#13060)
Summary:
- When `FileChecksumGenFactory` is set, include the `file_checksum` and `file_checksum_func_name` in the output file metadata
- ~~In Remote Compaction, try opening the output files in the temporary directory to do a quick sanity check before returning the result with status.~~
- After offline discussion, we decided to rely on Primary's existing Compaction flow to sanity check the output files. If the output file is corrupted, we will still be able to catch it and not installing it even after renaming them to cf_paths. The corrupted file in the cf_path won't be added to the MANIFEST and will be purged as part of the next `PurgeObsoleteFiles()` call.
- Unit Test has been added to validate above.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13060

Test Plan:
Unit test added
```
./compaction_service_test --gtest_filter="*CorruptedOutput*"
./compaction_service_test --gtest_filter="*TruncatedOutput*"
./compaction_service_test --gtest_filter="*CustomFileChecksum*"
./compaction_job_test --gtest_filter="*ResultSerialization*"
```

Reviewed By: cbi42

Differential Revision: D64189645

Pulled By: jaykorean

fbshipit-source-id: 6cf28720169c960c80df257806bfee3c0d177159
2024-10-14 18:26:17 -07:00
Peter Dillinger 351d2fd2b6 Make simple BlockBasedTableOptions mutable (#10021)
Summary:
In theory, there should be no danger in mutability, as table
builders and readers work from copies of BlockBasedTableOptions.
However, there is currently an unresolved read-write race that
affecting SetOptions on BBTO fields. This should be generally
acceptable for non-pointer options of 64 bits or less, but a fix
is needed to make it mutability general here. See
https://github.com/facebook/rocksdb/issues/10079

This change systematically sets all of those "simple" options (and future
such options) as mutable. (Resurrecting this PR perhaps preferable to
proposed https://github.com/facebook/rocksdb/issues/13063)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10021

Test Plan: Some unit test updates. XXX comment added to stress test code

Reviewed By: cbi42

Differential Revision: D64360967

Pulled By: pdillinger

fbshipit-source-id: ff220fa778331852fe331b42b76ac4adfcd2d760
2024-10-14 17:49:26 -07:00
Yu Zhang 8592517c89 Remove stale entries from L0 files when UDT is not persisted (#13035)
Summary:
When user-defined timestamps are not persisted, currently we replace the actual timestamp with min timestamp after an entry is output from compaction iterator. Compaction iterator won't be able to help with removing stale entries this way. This PR adds a wrapper iterator `TimestampStrippingIterator` for `MemTableIterator` that does the min timestamp replacement at the memtable iteration step. It is used by flush and can help remove stale entries from landing in L0 files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13035

Test Plan: Added unit test

Reviewed By: pdillinger, cbi42

Differential Revision: D63423682

Pulled By: jowlyzhang

fbshipit-source-id: 087dcc9cee97b9ea51b8d2b88dc91c2984d54e55
2024-10-14 12:28:35 -07:00
generatedunixname89002005287564 f7237e3395 internal_repo_rocksdb
Reviewed By: jermenkoo

Differential Revision: D64318168

fbshipit-source-id: 62bddd81424f1c5d4f50ce3512a9a8fe57a19ec3
2024-10-14 03:01:20 -07:00
Yu Zhang a571cbed17 Use same logic to assign level for non-overlapping files in universal compaction (#13059)
Summary:
When the input files are not overlapping, a.k.a `files_overlap_=false`, it's best to assign them to non L0 levels so that they are not one sorted run each.  This can be done regardless of compaction style being leveled or universal without any side effects.

Just my guessing, this special handling may be there because universal compaction used to have an invariant that sequence number on higher levels should not be smaller than sequence number in lower levels. File ingestion used to try to keep up to that promise by doing "sequence number stealing" from the to be assigned level. However, that invariant is no longer true after deletion triggered compaction is added for universal compaction, and we also removed the sequence stealing logic from file ingestion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13059

Test Plan: Updated existing tests

Reviewed By: cbi42

Differential Revision: D64220100

Pulled By: jowlyzhang

fbshipit-source-id: 70a83afba7f4c52d502c393844e6b3273d5cf628
2024-10-11 11:18:45 -07:00
Levi Tamasi 2c9aa69a93 Refactor the BlobDB-related parts of DBIter (#13061)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13061

As groundwork for further changes, the patch refactors the BlobDB-related parts of `DBIter` by 1) introducing a new internal helper class `DBIter::BlobReader` that encapsulates all members needed to retrieve a blob value (namely, `Version` and the `ReadOptions` fields) and 2) factoring out and cleaning up some duplicate logic related to resolving blob references in the non-Merge (see `SetValueAndColumnsFromBlob`) and Merge (see `MergeWithBlobBaseValue`) cases.

Reviewed By: jowlyzhang

Differential Revision: D64078099

fbshipit-source-id: 22d5bd93e6e5be5cc9ecf6c4ee6954f2eb016aff
2024-10-10 15:38:59 -07:00
Jay Huh fe6c8cb1d6 Print unknown writebatch tag (#13062)
Summary:
Add additional info for debugging purpose by doing the same as what WBWI does

632746bb5b/utilities/write_batch_with_index/write_batch_with_index.cc (L274-L276)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13062

Test Plan: CI

Reviewed By: hx235

Differential Revision: D64202297

Pulled By: jaykorean

fbshipit-source-id: 65164fd88420fc72b6db26d1436afe548a653269
2024-10-10 15:34:35 -07:00
Hui Xiao 632746bb5b Improve DBTest.DynamicLevelCompressionPerLevel (#13044)
Summary:
**Context/Summary:**

A part of this test is to verify compression conditionally happens depending on the shape of the LSM when `options.level_compaction_dynamic_level_bytes = true;`. It uses the total file size to determine whether compression has happened or not. This involves some hard-coded math hard to understand. This PR replaces those with statistics that directly shows whether compression has happened or not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13044

Test Plan: Existing test

Reviewed By: jaykorean

Differential Revision: D63666361

Pulled By: hx235

fbshipit-source-id: 8c9b1bea9b06ff1e3ed95c576aec6705159af137
2024-10-09 12:51:19 -07:00
Yu Zhang 8181dfb1c4 Fix a bug for surfacing write unix time (#13057)
Summary:
The write unix time from non L0 files are not surfaced properly because the level's wrapper iterator doesn't have a `write_unix_time` implementation that delegates to the corresponding file. The unit test didn't catch this because it incorrectly destroy the old db and reopen to check write time, instead of just reopen and check. This fix also include a change to support ldb's scan command to get write time for easier debugging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13057

Test Plan: Updated unit tests

Reviewed By: pdillinger

Differential Revision: D64015107

Pulled By: jowlyzhang

fbshipit-source-id: 244474f78a034f80c9235eea2aa8a0f4e54dff59
2024-10-08 11:31:51 -07:00
Yang Zhang d963661d6f Correct CMake minimum required version (#13056)
Summary:
263fa15b44/CMakeLists.txt (L44)

`HOMEPAGE_URL` is introduced into CMake since 3.12. Compiling RocksDB with CMake ver < 3.12 triggers `CMake Error: Could not find cmake module file: CMakeDetermineHOMEPAGE_URLCompiler.cmake` error.

2 options to fix it:
* Remove `HOMEPAGE_URL`, since it appears to have no practical effect.
* Update RocksDB's minimum required CMake version to 3.12.

This PR chose the second option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13056

Reviewed By: jaykorean

Differential Revision: D63993577

Pulled By: cbi42

fbshipit-source-id: a6278af6916fcdace19a6c9baaf7986037bff720
2024-10-07 14:43:20 -07:00
Yu Zhang 263fa15b44 Handle a possible overflow (#13046)
Summary:
Stress test detects this variable could potentially overflow, so added some runtime handling to avoid it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13046

Test Plan: Existing tests

Reviewed By: hx235

Differential Revision: D63911396

Pulled By: jowlyzhang

fbshipit-source-id: 7c9abcd74ac9937b211c0ea4bb683677390837c5
2024-10-04 16:48:12 -07:00