Summary:
This change aims at increasing general memory safety in scope of selected `/db` files (`db_impl/db_impl.cc`, `dbformat.cc`, `log_reader.cc` and `transaction_log_impl.cc`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13154
Test Plan:
Verify logging structure & formatting parity by manually running the `/db` related tests exercising respective code paths pre and post change.
Note: As per request, we'll address the `internal_stats.cc` in the followup PR.
Reviewed By: pdillinger
Differential Revision: D66392729
Pulled By: mszeszko-meta
fbshipit-source-id: 107fd11221554721d9c1669a24031be3049afd01
Summary:
`OptionTypeInfo::ParseStruct()` was not honoring `config_options.ignore_unknown_options` when unknown properties are found in the serialized string. This caused a compatibility issue in Remote Compaction. When the worker was updated with RocksDB 9.9, the remote worker started including a new table property, `newest_key_time` (added in PR https://github.com/facebook/rocksdb/issues/13083), in the compaction output files. However, parsing that table property in the serialized compaction result from the primary (running with `9.8`) was returning a non-ok status, even though `config_options.ignore_unknown_options` was `true`.
In this fix, we will ignore unused properties if `config_options.ignore_unknown_options` is set to true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13152
Test Plan: Unit Test Added
Reviewed By: archang19
Differential Revision: D66374541
Pulled By: jaykorean
fbshipit-source-id: 78fd8309909279390438c247c4d390bbee4fa914
Summary:
This PR adds support for reusing the file system provided buffer to avoid an extra `memcpy` into RockDB's buffer. This optimization has already been implemented for point lookups, as well as compaction and scan reads _when prefetching is disabled_.
This PR extends this optimization to work with synchronous prefetching (`num_buffers == 1`). Asynchronous prefetching can be addressed in a future PR (and probably should be to keep this PR from growing too large).
Remarks
- To handle the case where the main buffer only has part of the requested data, I used the existing `overlap_buf_` (currently used in the async prefetching case) instead of defining a separate buffer. This was discussed in https://github.com/facebook/rocksdb/pull/13118#discussion_r1842839360.
- We use `MultiRead` with a single request to take advantage of the file system buffer. This is consistent with previous work (e.g. https://github.com/facebook/rocksdb/pull/12266).
- Even without the tests I added, there was some code coverage inside in at least `DBIOCorruptionTest.IterReadCorruptionRetry`, since those tests were failing before I addressed a bug in my code for this PR. [Run with failed test](https://github.com/facebook/rocksdb/actions/runs/11708830448/job/32611508818?pr=13118).
- This prefetching code is not too easy to follow, so I added quite a bit of comments to both the code and test case to try to make it easier to understand the exact internal state of the prefetch buffer at every point in time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13118
Test Plan:
I wrote pretty thorough unit tests that cover synchronous prefetching with file system buffer reuse. The flows for partial hits, complete hits, and complete misses are tested. I also parametrized the test to make sure the async prefetching (without file system buffer reuse) still work as expected.
Once we agree on the changes, I will run a long stress test before merging.
Reviewed By: anand1976
Differential Revision: D65559101
Pulled By: archang19
fbshipit-source-id: 1a56d846e918c20a009b83f1371c1791f69849ae
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/13117 added check for obsolete SST files that are not cleaned up timely. It caused a infrequent stress test failure `assertion="live_and_quar_files.find(file_number) != live_and_quar_files.end()"` that I haven't repro-ed yet.
This PR prints the file number so we can find out what happens to that file through info logs when encountering the same failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13145
Test Plan:
Manually fail the assertion and observe the stderr printing
```
[ RUN ] DBBasicTest.UniqueSession
File 12 is not live nor quarantined
db_basic_test: db/db_impl/db_impl_debug.cc:384: rocksdb::DBImpl::TEST_VerifyNoObsoleteFilesCached(bool) const::<lambda(const rocksdb::Slice&, rocksdb::Cache::ObjectPtr, size_t, const rocksdb::Cache::CacheItemHelper*)>: Assertion `false' failed.
```
Reviewed By: pdillinger
Differential Revision: D66134154
Pulled By: hx235
fbshipit-source-id: 353164c373d3d674cee676b24468dfc79a1d4563
Summary:
Pull in HISTORY for 9.9.0, update version.h for next version, update check_format_compatible.sh, update git hash for folly
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13146
Test Plan: CI
Reviewed By: ltamasi
Differential Revision: D66142259
Pulled By: jowlyzhang
fbshipit-source-id: 90216b2d7cff2e0befb4f56567e3bd074f97c484
Summary:
Follow-up to https://github.com/facebook/rocksdb/issues/13114
This change makes the options mutable in testing only through some internal hooks, so that we can keep the easier mechanics and testing of making the options mutable separate from a more interesting and critical fix needed for the options to be *safely* mutable. See https://github.com/facebook/rocksdb/pull/9964/files#r1024449523 for some background on the interesting remaining problem, which we've added a test for here, with the failing piece commented out (because it puts the DB in a failure state): PrecludeLastLevelTest.RangeTombstoneSnapshotMigrateFromLast.
The mechanics of making the options mutable turned out to be smaller than expected because `RegisterRecordSeqnoTimeWorker()` and `RecordSeqnoToTimeMapping()` are already robust to things like frequently switching between preserve/preclude durations e.g. with new and dropped column families, based on work from
https://github.com/facebook/rocksdb/issues/11920, https://github.com/facebook/rocksdb/issues/11929, and https://github.com/facebook/rocksdb/issues/12253. Mostly, `options_mutex_` prevents races
in applying the options changes, and smart capacity enforcement in `SeqnoToTimeMapping` means it doesn't really matter if the periodic task wakes up too often by being re-scheduled repeatedly.
Functional changes needed other than marking mutable:
* Update periodic task registration (as needed) from SetOptions, with a mapping recorded then also in case it's needed.
* Install SuperVersion(s) with updated mapping when the registration function itself updates the mapping.
Possible follow-up (aside from already mentioned):
* Some FIXME code in RangeTombstoneSnapshotMigrateFromLast is present because Flush does not automatically include a seqno to time mapping entry that puts an upper bound on how new the flushed data is. This has the potential to be a measurable CPU impact so needs to be done carefully.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13124
Test Plan:
updated/refactored tests in tiered_compaction_test to parametrically use dynamic configuration changes (or DB restarts) when changing operating parameters such as these.
CheckInternalKeyRange test got some heavier refactoring in preparation for follow-up, and manually verified that the test still fails when relevant `if (!safe_to_penultimate_level) ...` code is disabled.
Reviewed By: jowlyzhang
Differential Revision: D65634146
Pulled By: pdillinger
fbshipit-source-id: 25c9d00fd5b7fd1b408b5f36d58dc48647970528
Summary:
In PR https://github.com/facebook/rocksdb/issues/13074 , we added a logic to prevent stale OPTIONS file from getting deleted by `PurgeObsoleteFiles()` if the OPTIONS file is being referenced by any of the scheduled the remote compactions.
`PurgeObsoleteFiles()` was not the only place that we were cleaning up the old OPTIONS file. We've been also directly cleaning up the old OPTIONS file as part of `SetOptions()`: `RenameTempFileToOptionsFile()` -> `DeleteObsoleteOptionsFiles()` unless FileDeletion is disabled.
This was not caught by the UnitTest because we always preserve the last two OPTIONS file. A single call of `SetOptions()` was not enough to surface this issue in the previous PR.
To keep things simple, we are just skipping the old OPTIONS file clean up in `RenameTempFileToOptionsFile()` if remote compaction is enabled. We let `PurgeObsoleteFiles()` clean up the old options file later after the compaction is done.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13139
Test Plan:
Updated UnitTest to reproduce the scenario. It's now passing with the fix.
```
./compaction_service_test --gtest_filter="*PreservedOptionsRemoteCompaction*"
```
Reviewed By: cbi42
Differential Revision: D65974726
Pulled By: jaykorean
fbshipit-source-id: 1907e8450d2ccbb42a93084f275e666648ef5b8c
Summary:
I've seen some release notes talking about implementation detail classes, and starting with attempted markdown italics syntax instead of list item syntax. Patched HISTORY.md for existing oddities.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13135
Test Plan:
manual, look at
https://github.com/facebook/rocksdb/blob/main/HISTORY.md
Reviewed By: jowlyzhang
Differential Revision: D65802777
Pulled By: pdillinger
fbshipit-source-id: a1dc2b17709d633352d7e8a275304092dd7be746
Summary:
introduce the class WBWIMemTable that implements ReadOnlyMemTable interface with data stored in a WriteBatchWithIndex object.
This PR implements the main read path: Get, MultiGet and Iterator. It only supports Put, Delete and SingleDelete operations for now. All the keys in the WBWIMemTable will be assigned a global sequence number through WBWIMemTable::SetGlobalSequenceNumber().
Planned follow up PRs:
- Create WBWIMemTable with a transaction's WBWI and ingest it into a DB during Transaction::Commit()
- Support for Merge. This will be more complicated since we can have multiple updates with the same user key for Merge.
- Support for other operations like WideColumn and other ReadOnlyMemTable methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13123
Test Plan: * A mini-stress test for the read path is added as a new unit test
Reviewed By: jowlyzhang
Differential Revision: D65633419
Pulled By: cbi42
fbshipit-source-id: 0684fe47260b41f51ca39c300eb72ca5bc9c5a3b
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13134
Even though `Transaction` does not currently support the attribute group variants of `PutEntity` / `GetEntity` / `MultiGetEntity`, we can still test the corresponding APIs of the underlying `TransactionDB` or `OptimisticTransactionDB` instance. Note: the multi-operation transaction stress test will be handled separately.
Reviewed By: jaykorean
Differential Revision: D65780384
fbshipit-source-id: e4ef3d0c25bcbde9d6d8410af0b7d9381c6b501a
Summary:
The bug only happens for transaction db with 2pc. The main change is in `MemTableList::TryInstallMemtableFlushResults`. Before this fix, `memtables_to_flush` may not include all flushed memtables, and it causes the min_log_number for the flush to be incorrect. The code path for calculating min_log_number is `MemTableList::TryInstallMemtableFlushResults() -> GetDBRecoveryEditForObsoletingMemTables() -> PrecomputeMinLogNumberToKeep2PC() -> FindMinPrepLogReferencedByMemTable()`. Inside `FindMinPrepLogReferencedByMemTable()`, we need to exclude all memtables being flushed.
The PR also includes some documentation changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13127
Test Plan: added a new unit that fails before this change.
Reviewed By: ltamasi
Differential Revision: D65679270
Pulled By: cbi42
fbshipit-source-id: 611f34bd6ef4cba51f8b54cb1be416887b5a9c5e
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13131
The earlier stress test code did not consider that `PrepareValue()` could fail because of read fault injection, leading to false positives. The patch shuffles the `PrepareValue()` calls around a bit in `TestIterate` / `TestIterateAgainstExpected` in order to prevent this by leveraging the existing code paths that intercept injected faults.
Reviewed By: cbi42
Differential Revision: D65731543
fbshipit-source-id: b21c6584ebaa2ff41cd4569098680b91ff7991d1
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13130
The patch changes the stress test code so it always logs the error status to aid debugging when a `PrepareValue` call fails.
Reviewed By: hx235
Differential Revision: D65712502
fbshipit-source-id: da81566a358777b691178f0d0a1b680453d03e7d
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13129
The `PrepareValue()` call on an iterator can fail, for example due to our stress tests' read fault injection. Such a failure invalidates the iterator, which makes it illegal to call methods like `key()` on it and leads to assertion violations. The patch fixes this by saving the key before calling `PrepareValue()`, so we can still print it for debugging purposes in case the call fails.
Reviewed By: jowlyzhang
Differential Revision: D65689225
fbshipit-source-id: c2bf298366def0ba3b3c089ee58e28609ecdfab4
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13128
Similarly to https://github.com/facebook/rocksdb/pull/13119, the patch adds a new API `Transaction::GetCoalescingIterator` that can be used to create a multi-column-family coalescing iterator over the specified column families, including the data from both the transaction and the underlying database. This API is currently supported for optimistic and write-committed pessimistic transactions.
Reviewed By: jowlyzhang
Differential Revision: D65682389
fbshipit-source-id: faf5dd1de9bce9d403fc34246ecab4c55572a228
Summary:
This PR fixes a few cases where RocksDB was not retrying checksum failure/corruption of file reads with the `verify_and_reconstruct_read` IO option. After fixing these cases, we can almost always successfully open the DB and execute reads even if we see transient corruptions, provided the `FileSystem` supports the `verify_and_reconstruct_read` option. The specific cases fixed in this PR are -
1. CURRENT file
2. IDENTITY file
3. OPTIONS file
4. SST footer
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13122
Test Plan: Unit test in `db_io_failure_test.cc` that injects corruption at various stages of DB open and reads
Reviewed By: jaykorean
Differential Revision: D65617982
Pulled By: anand1976
fbshipit-source-id: 4324b88cc7eee5501ab5df20ef7a95bb12ed3ea7
Summary:
Follow-up to https://github.com/facebook/rocksdb/issues/13106 which revealed that some SST file readers (in addition to blob files) were being essentially leaked in TableCache (until DB::Close() time). Patched sources of leaks:
* Flush that is not committed (builder.cc)
* Various obsolete SST files picked up by directory scan but not caught by SubcompactionState::Cleanup() cleaning up from some failed compactions. Dozens of unit tests fail without the "backstop" TableCache::Evict() call in PurgeObsoleteFiles().
We also needed to adjust the check for leaks as follows:
* Ok if DB::Open never finished (see comment)
* Ok if deletions are disabled (see comment)
* Allow "quarantined" files to be in table_cache because (presumably) they might become live again.
* Get live files from all live Versions.
Suggested follow-up:
* Potentially delete more obsolete files sooner with a FIXME in db_impl_files.cc. This could potentially be high value because it seems to gate deletion of any/all newer obsolete files on all older compactions finishing.
* Try to catch obsolete files in more places using the VersionSet::obsolete_files_ pipeline rather than relying on them being picked up with directory scan, or deleting them outside of normal mechanisms.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13117
Test Plan: updated check used in most all unit tests in ASAN build
Reviewed By: hx235
Differential Revision: D65502988
Pulled By: pdillinger
fbshipit-source-id: aa0795a8a09d9ec578d25183fe43e2a35849209c
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13125
The patch adds the new read option `allow_unprepared_value` and the new `Iterator` / `CoalescingIterator` / `AttributeGroupIterator` API `PrepareValue()` to the stress/crash tests. The change affects the batched, non-batched, and CF consistency stress test flavors and the `TestIterate`, `TestPrefixScan`, and `TestIterateAgainstExpected` operations.
Reviewed By: hx235
Differential Revision: D65636380
fbshipit-source-id: fd0caa0e87d03b6206667f07499b0c11847d1bbe
Summary:
This PR adds some missing pieces in order to handle UDT setting toggles while replay WALs for WriteCommitted transactions DB. Specifically, all the transaction markers for no op, prepare, commit, rollback are currently not carried over from the original WriteBatch to the new WriteBatch when there is a timestamp setting difference detected. This PR fills that gap.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13121
Test Plan: Added unit tests
Reviewed By: ltamasi
Differential Revision: D65558801
Pulled By: jowlyzhang
fbshipit-source-id: 8176882637b95f6dc0dad10d7fe21056fa5173d1
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13119
The patch adds a new API `Transaction::GetAttributeGroupIterator` that can be used to create a multi-column-family attribute group iterator over the specified column families, including the data from both the transaction and the underlying database. This API is currently supported for optimistic and write-committed pessimistic transactions.
Reviewed By: jowlyzhang
Differential Revision: D65548324
fbshipit-source-id: 0fb8a22129494770fdba3d6024eef72b3e051136
Summary:
This PR does a few misc things for file ingestion flow:
- Add an invalid argument status return for the combination of `allow_global_seqno = false` and external files' key range overlap in `Prepare` stage.
- Add a MemTables status check for when column family is flushed before `Run`.
- Replace the column family dropped check with an assertion after thread enters the write queue and before it exits the write queue, since dropping column family can only happen in the single threaded write queue too and we already checked once after enter write queue.
- Add an `ExternalSstFileIngestionJob::GetColumnFamilyData` API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13100
Test Plan: Added unit tests, and stress tested the ingestion path
Reviewed By: hx235
Differential Revision: D65180472
Pulled By: jowlyzhang
fbshipit-source-id: 180145dd248a7507a13a543481b135e5a31ebe2d
Summary:
This assertion could fail if the compaction input files were successfully trivially moved. On re-locking db mutex after successful `LogAndApply`, those files could have been picked up again by some other compactions. And the assertion will fail.
Example failure: P1669529213
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13109
Reviewed By: cbi42
Differential Revision: D65308574
Pulled By: jowlyzhang
fbshipit-source-id: 32413bdc8e28e67a0386c3fe6327bf0b302b9d1d
Summary:
This is a small follow-up to https://github.com/facebook/rocksdb/pull/13083.
When we check the `newest_key_time` of files for temperature change compaction, we currently return early if we ever find a file with an unknown `est_newest_key_time`.
However, it is possible for a younger file to have a populated value for `newest_key_time`, since this is a new table property.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13112
Test Plan: The existing unit tests are sufficient.
Reviewed By: cbi42
Differential Revision: D65451797
Pulled By: archang19
fbshipit-source-id: 28e67c2d35a6315f912471f2848de87dd7088d99
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13113
The patch makes some small improvements related to `allow_unprepared_value` and multi-CF iterators as groundwork for further changes:
1) Similarly to `BaseDeltaIterator`'s base iterator, `MultiCfIteratorImpl` gets passed its child iterators by the client. Even though they are currently guaranteed to have been created using the same read options as each other and the multi-CF iterator, it is safer to not assume this and call `PrepareValue` unconditionally before using any child iterator's `value()` or `columns()`.
2) Again similarly to `BaseDeltaIterator`, it makes sense to pass the entire `ReadOptions` structure to `MultiCfIteratorImpl` in case it turns out to require other read options in the future.
3) The constructors of the various multi-CF iterator classes now take an rvalue reference to a vector of column family handle + `unique_ptr` to child iterator pairs and use move semantics to take ownership of this vector (instead of taking two separate vectors of column family handles and raw iterator pointers).
4) Constructor arguments and the members of `MultiCfIteratorImpl` are reordered for consistency.
Reviewed By: jowlyzhang
Differential Revision: D65407521
fbshipit-source-id: 66c2c689ec8b036740bd98641b7b5c0ff7e777f2
Summary:
Move them to MutableCFOptions and perform appropriate refactorings to make that work. I didn't want to mix up refactoring with interesting functional changes. Potentially non-trivial bits here:
* During DB Open or RegisterRecordSeqnoTimeWorker we use `GetLatestMutableCFOptions()` because either (a) there might not be a current version, or (b) we are in the process of applying the desired next options.
* Upgrade some test infrastructure to allow some options in MutableCFOptions to be mutable (should be a temporary state)
* Fix a warning that showed up about uninitialized `paranoid_memory_checks`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13114
Test Plan: existing tests, manually check options are still not settable with SetOptions
Reviewed By: jowlyzhang
Differential Revision: D65429031
Pulled By: pdillinger
fbshipit-source-id: 6e0906d08dd8ddf62731cefffe9b8d94149942b9
Summary:
This PR sets up follow-up changes for large transaction support. It introduces an interface that allows custom implementations of immutable memtables. Since transactions use a WriteBatchWithIndex to index their operations, I plan to add a ReadOnlyMemTable implementation backed by WriteBatchWithIndex. This will enable direct ingestion of WriteBatchWithIndex into the DB as an immutable memtable, bypassing memtable writes for transactions.
The changes mostly involve moving required methods for immutable memtables into the ReadOnlyMemTable class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13107
Test Plan:
* Existing unit test and stress test.
* Performance: I do not expect this change to cause noticeable performance regressions with LTO and devirtualization. The memtable-only readrandom benchmark shows no consistent performance difference:
```
USE_LTO=1 OPTIMIZE_LEVEL="-O3" DEBUG_LEVEL=0 make -j160 db_bench
(for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000 --seed=1723056275 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }';
3 runs:
main: 760728, 752727, 739600
PR: 763036, 750696, 739022
```
Reviewed By: jowlyzhang
Differential Revision: D65365062
Pulled By: cbi42
fbshipit-source-id: 40c673ab856b91c65001ef6d6ac04b65286f2882
Summary:
As titled. This flag controls how frequent standalone range deletion file is tested in the file ingestion flow, for better debuggability.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13101
Test Plan: Manually tested in stress test
Reviewed By: hx235
Differential Revision: D65361004
Pulled By: jowlyzhang
fbshipit-source-id: 21882e7cc5918aff45449acaeb33b696ab1e37f0
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13111
As a follow-up to https://github.com/facebook/rocksdb/pull/13105, the patch changes `BaseDeltaIterator` so that it honors the read option `allow_unprepared_value`. When the option is set and the `BaseDeltaIterator` lands on the base iterator, it defers calling `PrepareValue` on the base iterator and setting `value()` and `columns()` until `PrepareValue` is called on the `BaseDeltaIterator` itself.
Reviewed By: jowlyzhang
Differential Revision: D65344764
fbshipit-source-id: d79c77b5de7c690bf2deeff435e9b0a9065f6c5c
Summary:
There is a `strict_capacity_limit` option which imposes a hard memory limit on the block cache. When the block cache is enabled, every read request is serviced from the block cache. If the required block is missing, it is first inserted into the cache. If `strict_capacity_limit` is `true` and the limit has been reached, the `Get` and `MultiGet` requests should fail. However, currently this is not happening for `MultiGet`.
I updated `MultiGet` to explicitly check the returned status of `MaybeReadBlockAndLoadToCache`, so the status does not get overwritten later.
Thank you anand1976 for the problem explanation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13104
Test Plan:
Added unit test for both `Get` and `MultiGet` with a `strict_capacity_limit` set.
Before the change, half of my unit test cases failed https://github.com/facebook/rocksdb/actions/runs/11604597524/job/32313608085?pr=13104. After I added the check for the status returned by `MaybeReadBlockAndLoadToCache`, they all pass.
I also ran these tests manually (I had to run `make clean` before):
```
make -j64 block_based_table_reader_test COMPILE_WITH_ASAN=1 ASSERT_STATUS_CHECKED=1
./block_based_table_reader_test --gtest_filter="*StrictCapacityLimitReaderTest.Get*"
./block_based_table_reader_test --gtest_filter="*StrictCapacityLimitReaderTest.MultiGet*"
```
Reviewed By: anand1976
Differential Revision: D65302470
Pulled By: archang19
fbshipit-source-id: 28dcc381e67e05a89fa9fc9607b4709976d6d90e
Summary:
This PR does two things:
1. Adds a new table property `newest_key_time`
2. Uses this property to improve TTL and temperature change compaction.
### Context
The current `creation_time` table property should really be named `oldest_ancestor_time`. For flush output files, this is the oldest key time in the file. For compaction output files, this is the minimum among all oldest key times in the input files.
The problem with using the oldest ancestor time for TTL compaction is that we may end up dropping files earlier than we should. What we really want is the newest (i.e. "youngest") key time. Right now we take a roundabout way to estimate this value -- we take the value of the _oldest_ key time for the _next_ (newer) SST file. This is also why the current code has checks for `index >= 1`.
Our new property `newest_key_time` is set to the file creation time during flushes, and the max over all input files for compactions.
There were some additional smaller changes that I had to make for testing purposes:
- Refactoring the mock table reader to support specifying my own table properties
- Refactoring out a test utility method `GetLevelFileMetadatas` that would otherwise be copy/pasted in 3 places
Credit to cbi42 for the problem explanation and proposed solution
### Testing
- Added a dedicated unit test to my `newest_key_time` logic in isolation (i.e. are we populating the property on flush and compaction)
- Updated the existing unit tests (for TTL/temperate change compaction), which were comprehensive enough to break when I first made my code changes. I removed the test setup code which set the file metadata `oldest_ancestor_time`, so we know we are actually only using the new table property instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13083
Reviewed By: cbi42
Differential Revision: D65298604
Pulled By: archang19
fbshipit-source-id: 898ef91b692ab33f5129a2a16b64ecadd4c32432
Summary:
An earlier change (b34cef57b7) removed apparently unused functionality where an obsolete blob file number is passed for removal from TableCache, which manages SST files. This was actually relying on broken/fragile abstractions wherein TableCache and BlobFileCache share the same Cache and using the TableCache interface to manipulate blob file caching. No unit test was actually checking for removal of obsolete blob files from the cache (which is somewhat tricky to check and a second order correctness requirement).
Here we fix the leak and add a DEBUG+ASAN-only check in DB::Close() that no obsolete files are lingering in the table/blob file cache.
Fixes https://github.com/facebook/rocksdb/issues/13066
Important follow-up (FIXME): The added check discovered some apparent cases of leaked (into table_cache) SST file readers that would stick around until DB::Close(). Need to enable that check, diagnose, and fix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13106
Test Plan:
added a check that is called during DB::Close in ASAN builds (to minimize paying the cost in all unit tests). Without the fix, the check failed in at least these tests:
```
db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge
db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase
db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache
db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter
db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection
```
Reviewed By: ltamasi
Differential Revision: D65296123
Pulled By: pdillinger
fbshipit-source-id: 2276d76482beb2c75c9010bc1bec070bb23a24c0
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13105
The `WriteBatchWithIndex::NewIteratorWithBase` interface enables creating a `BaseDeltaIterator` with an arbitrary base iterator passed in by the client, which has potentially been created with the `allow_unprepared_value` read option set. Because of this, `BaseDeltaIterator` has to call `PrepareValue` before using the `value()` or `columns()` from the base iterator. This includes both the case when `BaseDeltaIterator` exposes the `value()` and `columns()` of the base iterator as is and the case when the final `value()` / `columns()` is a result of merging key-values across the base and delta iterators. Note that `BaseDeltaIterator` itself does not support `allow_unprepared_value` yet; this will be implemented in an upcoming patch.
Reviewed By: jowlyzhang
Differential Revision: D65249643
fbshipit-source-id: b0a1ccc0dfd31105b2eef167b463ed15a8bb83b7
Summary:
Follow ups from https://github.com/facebook/rocksdb/issues/13089
- Take `TableProperties` as `const &` instead of `std::shared_ptr<const TableProperties>`
- Move TableProperties OptionsTypeMap definition to another place for other use outside of Remote Compaction
- Add a test verify that the set of field serializations of TableProperties is complete
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13095
Test Plan:
```
./options_settable_test --gtest_filter="*TablePropertiesAllFieldsSettable*"
```
I also intentionally tried adding a new field to `TableProperties`. If it's missed in the OptionsType map, the test detects the missing bytes set and successfully fails.
Reviewed By: pdillinger
Differential Revision: D65077398
Pulled By: jaykorean
fbshipit-source-id: cf10560eb4a467ca523b11fd64945dbc86ac378f
Summary:
Forgot to update after generalizing mutability of BBTO
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13097
Test Plan: no functional change here
Reviewed By: jaykorean
Differential Revision: D65095618
Pulled By: pdillinger
fbshipit-source-id: 6c37cd0e68756c6b56af1c8e15273fae0ca9224d
Summary:
Pull in HISTORY for 9.8.0, update version.h for next version, update check_format_compatible.sh
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13093
Test Plan: CI
Reviewed By: jowlyzhang
Differential Revision: D64987257
Pulled By: pdillinger
fbshipit-source-id: a7cec329e3d245e63767760aa0298c08c3281695
Summary:
In Remote Compactions, the primary host receives the serialized compaction result from the remote worker and deserializes it to build the output. Unlike Local Compactions, where table properties are built by TableBuilder, in Remote Compactions, these properties were not included in the serialized compaction result. This was likely done intentionally since the table properties are already available in the SST files.
Because TableProperties are not populated as part of CompactionOutputs for remote compactions, we were unable to log the table properties in OnCompactionComplete and use them for verification. We are adding the TableProperties as part of the CompactionServiceOutputFile in this PR. By including the TableProperties in the serialized compaction result, the primary host will be able to access them and verify that they match the values read from the actual SST files.
We are also adding the populating `format_version` in table_properties of in TableBuilder. This has not been a big issue because the `format_version` is written to the SST files directly from `TableOptions.format_version`. When loaded from the SST files, it's populated directly by reading from the MetaBlock. This info has only been missing in the TableBuilder's Rep.props.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13089
Test Plan:
```
./compaction_job_test
```
```
./compaction_service_test
```
Reviewed By: pdillinger
Differential Revision: D64878740
Pulled By: jaykorean
fbshipit-source-id: b6f2fdce851e6477ecb4dd5a87cdc62e176b746b
Summary:
Fix a longstanding race condition in SetOptions for `block_based_table_factory` options. The fix is mostly described in new, unified `TableFactoryParseFn()` in `cf_options.cc`. Also in this PR:
* Adds a virtual `Clone()` function to TableFactory
* To avoid behavioral hiccups with `SetOptions`, make the "hidden state" of `BlockBasedTableFactory` shared between an original and a clone. For example, `TailPrefetchStats`
* `Configurable` was allowed to be copied but was not safe to do so, because the copy would have and use pointers into object it was copied from (!!!). This has been fixed using relative instead of absolute pointers, though it's still technically relying on undefined behavior (consistent object layout for non-standard-layout types).
For future follow-up:
* Deny SetOptions on block cache options (dubious and not yet made safe with proper shared_ptr handling)
Fixes https://github.com/facebook/rocksdb/issues/10079
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13082
Test Plan:
added to unit tests and crash test
Ran TSAN blackbox crashtest for hours with options to amplify potential race (see https://github.com/facebook/rocksdb/issues/10079)
Reviewed By: cbi42
Differential Revision: D64947243
Pulled By: pdillinger
fbshipit-source-id: 8390299149f50e2a2b39a5247680f2637edb23c8
Summary:
This PR adds some optimization for compacting standalone range deletion files. A standalone range deletion file is one with just a single range deletion. Currently, such a file is used in bulk loading to achieve something like atomically delete old version of all data with one big range deletion and adding new version of data. These are the changes included in the PR:
1) When a standalone range deletion file is ingested via bulk loading, it's marked for compaction.
2) When picking input files during compaction picking, we attempt to only pick a standalone range deletion file when oldest snapshot is at or above the file's seqno. To do this, `PickCompaction` API is updated to take existing snapshots as an input. This is only done for the universal compaction + UDT disabled combination, we save querying for existing snapshots and not pass it for all other cases.
3) At `Compaction` construction time, the input files will be filtered to examine if any of them can be skipped for compaction iterator. For example, if all the data of the file is deleted by a standalone range tombstone, and the oldest snapshot is at or above such range tombstone, this file will be filtered out.
4) Every time a snapshot is released, we examine if any column family has standalone range deletion files that becomes eligible to be scheduled for compaction. And schedule one for it.
Potential future improvements:
- Add some dedicated statistics for the filtered files.
- Extend this input filtering to L0 files' compactions cases when a newer L0 file could shadow an older L0 file
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13078
Test Plan: Added unit tests and stress tested a few rounds
Reviewed By: cbi42
Differential Revision: D64879415
Pulled By: jowlyzhang
fbshipit-source-id: 02b8683fddbe11f093bcaa0a38406deb39f44d9e
Summary:
we currently record write operations to tracer before checking callback in PipelinedWriteImpl and WriteImplWALOnly. For optimistic transaction DB, this means that an operation can be recorded to tracer even when it's not written to DB or WAL. I suspect this is the reason some of our optimistic txn crash test is failing. The evidence is that the trace contains some duplicated entry and has more entries compared to the corresponding entry in WAL. This PR moves the tracer logic to be after checking callback status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13088
Test Plan: monitor crash test.
Reviewed By: hx235
Differential Revision: D64711753
Pulled By: cbi42
fbshipit-source-id: 55fd1223538ec6294ce84a957c306d3d9d91df5f
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13079
The patch adds support for the new read option `allow_unprepared_value` to the multi-column-family iterators `CoalescingIterator` and `AttributeGroupIterator`. When this option is set, these iterators populate their value (`value()` + `columns()` or `attribute_groups()`) in an on-demand fashion when `PrepareValue()` is called. Calling `PrepareValue()` on the child iterators is similarly deferred until `PrepareValue()` is called on the main iterator.
Reviewed By: jowlyzhang
Differential Revision: D64570587
fbshipit-source-id: 783c8d408ad10074417dabca7b82c5e1fe5cab36
Summary:
# Summary
There was a [test failure](https://github.com/facebook/rocksdb/actions/runs/11381731053/job/31663774089?fbclid=IwZXh0bgNhZW0CMTEAAR0YJVdnkKUhN15RJQrLsvicxqzReS6y4A14VFQbWu-81XJsSsyNepXAr2c_aem_JyQqNdtpeKFSA6CjlD-pDg) from uninit value in the CompactionServiceInput
```
[ RUN ] CompactionJobTest.InputSerialization
==79945== Use of uninitialised value of size 8
==79945== at 0x58EA69B: _itoa_word (_itoa.c:179)
==79945== by 0x5906574: __vfprintf_internal (vfprintf-internal.c:1687)
==79945== by 0x591AF99: __vsnprintf_internal (vsnprintf.c:114)
==79945== by 0x1654AE: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > __gnu_cxx::__to_xstring<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>(int (*)(char*, unsigned long, char const*, __va_list_tag*), unsigned long, char const*, ...) (string_conversions.h:111)
==79945== by 0x5126C65: to_string (basic_string.h:6568)
==79945== by 0x5126C65: rocksdb::SerializeSingleOptionHelper(void const*, rocksdb::OptionType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) (options_helper.cc:541)
==79945== by 0x512718B: rocksdb::OptionTypeInfo::Serialize(rocksdb::ConfigOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const (options_helper.cc:1084)
```
This was due to `options_file_number` value not set in the unit test. However, this value is guaranteed to be set in the normal path. It was just missing in the test path. Setting the 0 as the default value for uninitialized fields in the `CompactionServiceInput` and `CompactionServiceResult` for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13080
Test Plan: Existing tests should be sufficient
Reviewed By: cbi42
Differential Revision: D64573567
Pulled By: jaykorean
fbshipit-source-id: 7843a951770c74445620623d069a52ba93ad94d5