rocksdb

Commit Graph

Author	SHA1	Message	Date
Hui Xiao	0f35db55d8	Print file number when TEST_VerifyNoObsoleteFilesCached fails (#13145 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/13117 added check for obsolete SST files that are not cleaned up timely. It caused a infrequent stress test failure `assertion="live_and_quar_files.find(file_number) != live_and_quar_files.end()"` that I haven't repro-ed yet. This PR prints the file number so we can find out what happens to that file through info logs when encountering the same failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13145 Test Plan: Manually fail the assertion and observe the stderr printing ``` [ RUN ] DBBasicTest.UniqueSession File 12 is not live nor quarantined db_basic_test: db/db_impl/db_impl_debug.cc:384: rocksdb::DBImpl::TEST_VerifyNoObsoleteFilesCached(bool) const::<lambda(const rocksdb::Slice&, rocksdb::Cache::ObjectPtr, size_t, const rocksdb::Cache::CacheItemHelper*)>: Assertion `false' failed. ``` Reviewed By: pdillinger Differential Revision: D66134154 Pulled By: hx235 fbshipit-source-id: 353164c373d3d674cee676b24468dfc79a1d4563	2024-11-20 12:22:52 -08:00
Yu Zhang	bee8d5560e	Start version 9.10.0 (#13146 ) Summary: Pull in HISTORY for 9.9.0, update version.h for next version, update check_format_compatible.sh, update git hash for folly Pull Request resolved: https://github.com/facebook/rocksdb/pull/13146 Test Plan: CI Reviewed By: ltamasi Differential Revision: D66142259 Pulled By: jowlyzhang fbshipit-source-id: 90216b2d7cff2e0befb4f56567e3bd074f97c484	2024-11-18 21:49:47 -08:00
Peter Dillinger	8a36543326	Steps toward preserve/preclude options mutable (#13124 ) Summary: Follow-up to https://github.com/facebook/rocksdb/issues/13114 This change makes the options mutable in testing only through some internal hooks, so that we can keep the easier mechanics and testing of making the options mutable separate from a more interesting and critical fix needed for the options to be safely mutable. See https://github.com/facebook/rocksdb/pull/9964/files#r1024449523 for some background on the interesting remaining problem, which we've added a test for here, with the failing piece commented out (because it puts the DB in a failure state): PrecludeLastLevelTest.RangeTombstoneSnapshotMigrateFromLast. The mechanics of making the options mutable turned out to be smaller than expected because `RegisterRecordSeqnoTimeWorker()` and `RecordSeqnoToTimeMapping()` are already robust to things like frequently switching between preserve/preclude durations e.g. with new and dropped column families, based on work from https://github.com/facebook/rocksdb/issues/11920, https://github.com/facebook/rocksdb/issues/11929, and https://github.com/facebook/rocksdb/issues/12253. Mostly, `options_mutex_` prevents races in applying the options changes, and smart capacity enforcement in `SeqnoToTimeMapping` means it doesn't really matter if the periodic task wakes up too often by being re-scheduled repeatedly. Functional changes needed other than marking mutable: * Update periodic task registration (as needed) from SetOptions, with a mapping recorded then also in case it's needed. * Install SuperVersion(s) with updated mapping when the registration function itself updates the mapping. Possible follow-up (aside from already mentioned): * Some FIXME code in RangeTombstoneSnapshotMigrateFromLast is present because Flush does not automatically include a seqno to time mapping entry that puts an upper bound on how new the flushed data is. This has the potential to be a measurable CPU impact so needs to be done carefully. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13124 Test Plan: updated/refactored tests in tiered_compaction_test to parametrically use dynamic configuration changes (or DB restarts) when changing operating parameters such as these. CheckInternalKeyRange test got some heavier refactoring in preparation for follow-up, and manually verified that the test still fails when relevant `if (!safe_to_penultimate_level) ...` code is disabled. Reviewed By: jowlyzhang Differential Revision: D65634146 Pulled By: pdillinger fbshipit-source-id: 25c9d00fd5b7fd1b408b5f36d58dc48647970528	2024-11-18 19:34:01 -08:00
Hui Xiao	f69a5fe8ee	Cap compaction_readahead_size by max_sectors_kb (#12937 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/issues/12038 reported a regression where compaction read ahead does not work when `compaction_readahead_size ` is greater than `max_sectors_kb` defined in linux (i.e, largest I/O size that the OS issues to a block device, see https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt for more). This PR fixes it by capping the `compaction_readahead_size` by `max_sectors_kb` if any. A refactoring of reading queue sys file is also included to reuse existing code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12937 Test Plan: https://github.com/facebook/rocksdb/issues/12038#issuecomment-2327618031 verified the regression was fixed Reviewed By: anand1976 Differential Revision: D61350188 Pulled By: hx235 fbshipit-source-id: e10677f2f5854c22ebf6318b052557db94b98abe	2024-11-18 15:08:21 -08:00
Jay Huh	d3296260c2	Remove EXPERIMENTAL tag for MultiCfIterators (#13142 ) Summary: As title Pull Request resolved: https://github.com/facebook/rocksdb/pull/13142 Test Plan: N/A Reviewed By: jowlyzhang Differential Revision: D66107254 Pulled By: jaykorean fbshipit-source-id: 3927a411f62ba965017ac726ed818cc9f8d24f2d	2024-11-18 11:23:17 -08:00
Jay Huh	3495c94761	Rely on PurgeObsoleteFiles Only for Options file clean up when remote compaction is enabled (#13139 ) Summary: In PR https://github.com/facebook/rocksdb/issues/13074 , we added a logic to prevent stale OPTIONS file from getting deleted by `PurgeObsoleteFiles()` if the OPTIONS file is being referenced by any of the scheduled the remote compactions. `PurgeObsoleteFiles()` was not the only place that we were cleaning up the old OPTIONS file. We've been also directly cleaning up the old OPTIONS file as part of `SetOptions()`: `RenameTempFileToOptionsFile()` -> `DeleteObsoleteOptionsFiles()` unless FileDeletion is disabled. This was not caught by the UnitTest because we always preserve the last two OPTIONS file. A single call of `SetOptions()` was not enough to surface this issue in the previous PR. To keep things simple, we are just skipping the old OPTIONS file clean up in `RenameTempFileToOptionsFile()` if remote compaction is enabled. We let `PurgeObsoleteFiles()` clean up the old options file later after the compaction is done. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13139 Test Plan: Updated UnitTest to reproduce the scenario. It's now passing with the fix. ``` ./compaction_service_test --gtest_filter="PreservedOptionsRemoteCompaction" ``` Reviewed By: cbi42 Differential Revision: D65974726 Pulled By: jaykorean fbshipit-source-id: 1907e8450d2ccbb42a93084f275e666648ef5b8c	2024-11-15 14:21:32 -08:00
Yu Zhang	ef119c9811	Add compaction stats for filtered files (#13136 ) Summary: As titled. This PR adds some compaction job stats, internal stats and some logging for filtered files. Example logging: [default] compacted to: files[0 0 0 0 2 0 0] max score 0.25, estimated pending compaction bytes 0, MB/sec: 0.3 rd, 0.2 wr, level 6, files in(1, 0) filtered(0, 2) out(1 +0 blob) MB in(0.0, 0.0 +0.0 blob) filtered(0.0, 0.0) out(0.0 +0.0 blob), read-write-amplify(2.0) write-amplify(1.0) OK, records in: 1, records dropped: 1 output_compression: Snappy Pull Request resolved: https://github.com/facebook/rocksdb/pull/13136 Test Plan: Added unit tests Reviewed By: cbi42 Differential Revision: D65855380 Pulled By: jowlyzhang fbshipit-source-id: a4d8eef66f8d999ca5c3d9472aeeae98d7bb03ab	2024-11-14 10:10:38 -08:00
Changyu Bi	9a136e18b3	Fix a valgrind unit test failure (#13137 ) Summary: fix the valgrind failure from https://github.com/facebook/rocksdb/actions/runs/11813904728/job/32911902535?fbclid=IwZXh0bgNhZW0CMTEAAR2GJs1U6mNwNv3zwPzU8rpCmBHqfStV3dupj2o_-686RneLKXADaSZH5-U_aem_ADUQy7bzknoseVpjrOc5SQ ``` [ RUN ] WBWIMemTableTest.ReadFromWBWIMemtable ==1150870== Conditional jump or move depends on uninitialised value(s) ==1150870== at 0x50FE67A: rocksdb::WBWIMemTable::Get(rocksdb::LookupKey const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, rocksdb::PinnableWideColumns, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, rocksdb::Status, rocksdb::MergeContext, unsigned long, unsigned long, rocksdb::ReadOptions const&, bool, rocksdb::ReadCallback, bool, bool) (wbwi_memtable.cc:60) ==1150870== by 0x50FF92A: rocksdb::WBWIMemTable::MultiGet(rocksdb::ReadOptions const&, rocksdb::MultiGetContext::Range, rocksdb::ReadCallback, bool) (wbwi_memtable.cc:120) ==1150870== by 0x1879EF: rocksdb::WBWIMemTableTest_ReadFromWBWIMemtable_Test::TestBody() (write_batch_with_index_test.cc:3580) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/13137 Test Plan: `valgrind ./write_batch_with_index_test --gtest_filter="ReadFromWBWIMemtable*"` Reviewed By: ltamasi Differential Revision: D65892657 Pulled By: cbi42 fbshipit-source-id: 0b44a5a06b8cc64173ad36966339877e2f508d52	2024-11-13 12:41:56 -08:00
Peter Dillinger	4adf691e39	Output some advice with unreleased_history/add.sh (#13135 ) Summary: I've seen some release notes talking about implementation detail classes, and starting with attempted markdown italics syntax instead of list item syntax. Patched HISTORY.md for existing oddities. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13135 Test Plan: manual, look at https://github.com/facebook/rocksdb/blob/main/HISTORY.md Reviewed By: jowlyzhang Differential Revision: D65802777 Pulled By: pdillinger fbshipit-source-id: a1dc2b17709d633352d7e8a275304092dd7be746	2024-11-13 11:23:30 -08:00
Changyu Bi	1c7652fcef	Introduce a WriteBatchWithIndex-based implementation of ReadOnlyMemTable (#13123 ) Summary: introduce the class WBWIMemTable that implements ReadOnlyMemTable interface with data stored in a WriteBatchWithIndex object. This PR implements the main read path: Get, MultiGet and Iterator. It only supports Put, Delete and SingleDelete operations for now. All the keys in the WBWIMemTable will be assigned a global sequence number through WBWIMemTable::SetGlobalSequenceNumber(). Planned follow up PRs: - Create WBWIMemTable with a transaction's WBWI and ingest it into a DB during Transaction::Commit() - Support for Merge. This will be more complicated since we can have multiple updates with the same user key for Merge. - Support for other operations like WideColumn and other ReadOnlyMemTable methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13123 Test Plan: * A mini-stress test for the read path is added as a new unit test Reviewed By: jowlyzhang Differential Revision: D65633419 Pulled By: cbi42 fbshipit-source-id: 0684fe47260b41f51ca39c300eb72ca5bc9c5a3b	2024-11-12 09:27:11 -08:00
Levi Tamasi	7cb6b93eee	Enable attribute group APIs in the transaction stress tests (#13134 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13134 Even though `Transaction` does not currently support the attribute group variants of `PutEntity` / `GetEntity` / `MultiGetEntity`, we can still test the corresponding APIs of the underlying `TransactionDB` or `OptimisticTransactionDB` instance. Note: the multi-operation transaction stress test will be handled separately. Reviewed By: jaykorean Differential Revision: D65780384 fbshipit-source-id: e4ef3d0c25bcbde9d6d8410af0b7d9381c6b501a	2024-11-11 15:25:26 -08:00
Changyu Bi	925435bbd9	Fix a bug that can retain old WAL longer than needed (#13127 ) Summary: The bug only happens for transaction db with 2pc. The main change is in `MemTableList::TryInstallMemtableFlushResults`. Before this fix, `memtables_to_flush` may not include all flushed memtables, and it causes the min_log_number for the flush to be incorrect. The code path for calculating min_log_number is `MemTableList::TryInstallMemtableFlushResults() -> GetDBRecoveryEditForObsoletingMemTables() -> PrecomputeMinLogNumberToKeep2PC() -> FindMinPrepLogReferencedByMemTable()`. Inside `FindMinPrepLogReferencedByMemTable()`, we need to exclude all memtables being flushed. The PR also includes some documentation changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13127 Test Plan: added a new unit that fails before this change. Reviewed By: ltamasi Differential Revision: D65679270 Pulled By: cbi42 fbshipit-source-id: 611f34bd6ef4cba51f8b54cb1be416887b5a9c5e	2024-11-11 14:19:45 -08:00
Levi Tamasi	1f0ccd9a15	Fix the handling of PrepareValue failures due to fault injection (#13131 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13131 The earlier stress test code did not consider that `PrepareValue()` could fail because of read fault injection, leading to false positives. The patch shuffles the `PrepareValue()` calls around a bit in `TestIterate` / `TestIterateAgainstExpected` in order to prevent this by leveraging the existing code paths that intercept injected faults. Reviewed By: cbi42 Differential Revision: D65731543 fbshipit-source-id: b21c6584ebaa2ff41cd4569098680b91ff7991d1	2024-11-10 19:21:35 -08:00
Levi Tamasi	aa889eb5ed	Print iterator status in stress tests when PrepareValue() fails (#13130 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13130 The patch changes the stress test code so it always logs the error status to aid debugging when a `PrepareValue` call fails. Reviewed By: hx235 Differential Revision: D65712502 fbshipit-source-id: da81566a358777b691178f0d0a1b680453d03e7d	2024-11-09 15:04:17 -08:00
Levi Tamasi	a6ee297ac9	Save the key before calling PrepareValue() in the stress tests (#13129 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13129 The `PrepareValue()` call on an iterator can fail, for example due to our stress tests' read fault injection. Such a failure invalidates the iterator, which makes it illegal to call methods like `key()` on it and leads to assertion violations. The patch fixes this by saving the key before calling `PrepareValue()`, so we can still print it for debugging purposes in case the call fails. Reviewed By: jowlyzhang Differential Revision: D65689225 fbshipit-source-id: c2bf298366def0ba3b3c089ee58e28609ecdfab4	2024-11-08 15:37:19 -08:00
Yutian Li	87b4043a67	Remove undefined function GetColumnFamilyDataByName (#13126 ) Summary: function is undefined and unused Pull Request resolved: https://github.com/facebook/rocksdb/pull/13126 Reviewed By: ltamasi Differential Revision: D65675223 Pulled By: cbi42 fbshipit-source-id: d63d2d361dc40226223840ebe74c0f8934ab18e7	2024-11-08 15:35:40 -08:00
Levi Tamasi	9b95fbbf24	Add a new API Transaction::GetCoalescingIterator (#13128 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13128 Similarly to https://github.com/facebook/rocksdb/pull/13119, the patch adds a new API `Transaction::GetCoalescingIterator` that can be used to create a multi-column-family coalescing iterator over the specified column families, including the data from both the transaction and the underlying database. This API is currently supported for optimistic and write-committed pessimistic transactions. Reviewed By: jowlyzhang Differential Revision: D65682389 fbshipit-source-id: faf5dd1de9bce9d403fc34246ecab4c55572a228	2024-11-08 14:14:39 -08:00
anand76	ee258619be	Fix missing cases of corruption retries (#13122 ) Summary: This PR fixes a few cases where RocksDB was not retrying checksum failure/corruption of file reads with the `verify_and_reconstruct_read` IO option. After fixing these cases, we can almost always successfully open the DB and execute reads even if we see transient corruptions, provided the `FileSystem` supports the `verify_and_reconstruct_read` option. The specific cases fixed in this PR are - 1. CURRENT file 2. IDENTITY file 3. OPTIONS file 4. SST footer Pull Request resolved: https://github.com/facebook/rocksdb/pull/13122 Test Plan: Unit test in `db_io_failure_test.cc` that injects corruption at various stages of DB open and reads Reviewed By: jaykorean Differential Revision: D65617982 Pulled By: anand1976 fbshipit-source-id: 4324b88cc7eee5501ab5df20ef7a95bb12ed3ea7	2024-11-08 12:43:21 -08:00
Peter Dillinger	485ee4f45c	Fix and test for leaks of open SST files (#13117 ) Summary: Follow-up to https://github.com/facebook/rocksdb/issues/13106 which revealed that some SST file readers (in addition to blob files) were being essentially leaked in TableCache (until DB::Close() time). Patched sources of leaks: * Flush that is not committed (builder.cc) * Various obsolete SST files picked up by directory scan but not caught by SubcompactionState::Cleanup() cleaning up from some failed compactions. Dozens of unit tests fail without the "backstop" TableCache::Evict() call in PurgeObsoleteFiles(). We also needed to adjust the check for leaks as follows: * Ok if DB::Open never finished (see comment) * Ok if deletions are disabled (see comment) * Allow "quarantined" files to be in table_cache because (presumably) they might become live again. * Get live files from all live Versions. Suggested follow-up: * Potentially delete more obsolete files sooner with a FIXME in db_impl_files.cc. This could potentially be high value because it seems to gate deletion of any/all newer obsolete files on all older compactions finishing. * Try to catch obsolete files in more places using the VersionSet::obsolete_files_ pipeline rather than relying on them being picked up with directory scan, or deleting them outside of normal mechanisms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13117 Test Plan: updated check used in most all unit tests in ASAN build Reviewed By: hx235 Differential Revision: D65502988 Pulled By: pdillinger fbshipit-source-id: aa0795a8a09d9ec578d25183fe43e2a35849209c	2024-11-08 10:54:43 -08:00
Levi Tamasi	ba164ac373	Add allow_unprepared_value+PrepareValue() to the stress tests (#13125 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13125 The patch adds the new read option `allow_unprepared_value` and the new `Iterator` / `CoalescingIterator` / `AttributeGroupIterator` API `PrepareValue()` to the stress/crash tests. The change affects the batched, non-batched, and CF consistency stress test flavors and the `TestIterate`, `TestPrefixScan`, and `TestIterateAgainstExpected` operations. Reviewed By: hx235 Differential Revision: D65636380 fbshipit-source-id: fd0caa0e87d03b6206667f07499b0c11847d1bbe	2024-11-07 21:24:21 -08:00
Yu Zhang	282f5a463b	Fix write committed transactions replay when UDT setting toggles (#13121 ) Summary: This PR adds some missing pieces in order to handle UDT setting toggles while replay WALs for WriteCommitted transactions DB. Specifically, all the transaction markers for no op, prepare, commit, rollback are currently not carried over from the original WriteBatch to the new WriteBatch when there is a timestamp setting difference detected. This PR fills that gap. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13121 Test Plan: Added unit tests Reviewed By: ltamasi Differential Revision: D65558801 Pulled By: jowlyzhang fbshipit-source-id: 8176882637b95f6dc0dad10d7fe21056fa5173d1	2024-11-06 17:32:03 -08:00
Levi Tamasi	2ba4dceb4c	Add a new API Transaction::GetAttributeGroupIterator (#13119 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13119 The patch adds a new API `Transaction::GetAttributeGroupIterator` that can be used to create a multi-column-family attribute group iterator over the specified column families, including the data from both the transaction and the underlying database. This API is currently supported for optimistic and write-committed pessimistic transactions. Reviewed By: jowlyzhang Differential Revision: D65548324 fbshipit-source-id: 0fb8a22129494770fdba3d6024eef72b3e051136	2024-11-06 15:18:03 -08:00
Yu Zhang	dc34a0ff1e	Add some checks for the file ingestion flow (#13100 ) Summary: This PR does a few misc things for file ingestion flow: - Add an invalid argument status return for the combination of `allow_global_seqno = false` and external files' key range overlap in `Prepare` stage. - Add a MemTables status check for when column family is flushed before `Run`. - Replace the column family dropped check with an assertion after thread enters the write queue and before it exits the write queue, since dropping column family can only happen in the single threaded write queue too and we already checked once after enter write queue. - Add an `ExternalSstFileIngestionJob::GetColumnFamilyData` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13100 Test Plan: Added unit tests, and stress tested the ingestion path Reviewed By: hx235 Differential Revision: D65180472 Pulled By: jowlyzhang fbshipit-source-id: 180145dd248a7507a13a543481b135e5a31ebe2d	2024-11-05 15:44:56 -08:00
Yu Zhang	8089eae240	Fix assertion that compaction input files are freeed (#13109 ) Summary: This assertion could fail if the compaction input files were successfully trivially moved. On re-locking db mutex after successful `LogAndApply`, those files could have been picked up again by some other compactions. And the assertion will fail. Example failure: P1669529213 Pull Request resolved: https://github.com/facebook/rocksdb/pull/13109 Reviewed By: cbi42 Differential Revision: D65308574 Pulled By: jowlyzhang fbshipit-source-id: 32413bdc8e28e67a0386c3fe6327bf0b302b9d1d	2024-11-05 09:39:54 -08:00
Andrew Chang	a7ecbfd590	Remove early return when scanning files for temperature change compaction (#13112 ) Summary: This is a small follow-up to https://github.com/facebook/rocksdb/pull/13083. When we check the `newest_key_time` of files for temperature change compaction, we currently return early if we ever find a file with an unknown `est_newest_key_time`. However, it is possible for a younger file to have a populated value for `newest_key_time`, since this is a new table property. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13112 Test Plan: The existing unit tests are sufficient. Reviewed By: cbi42 Differential Revision: D65451797 Pulled By: archang19 fbshipit-source-id: 28e67c2d35a6315f912471f2848de87dd7088d99	2024-11-05 09:12:39 -08:00
Levi Tamasi	3becc9409e	Some small improvements around allow_unprepared_value and multi-CF iterators (#13113 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13113 The patch makes some small improvements related to `allow_unprepared_value` and multi-CF iterators as groundwork for further changes: 1) Similarly to `BaseDeltaIterator`'s base iterator, `MultiCfIteratorImpl` gets passed its child iterators by the client. Even though they are currently guaranteed to have been created using the same read options as each other and the multi-CF iterator, it is safer to not assume this and call `PrepareValue` unconditionally before using any child iterator's `value()` or `columns()`. 2) Again similarly to `BaseDeltaIterator`, it makes sense to pass the entire `ReadOptions` structure to `MultiCfIteratorImpl` in case it turns out to require other read options in the future. 3) The constructors of the various multi-CF iterator classes now take an rvalue reference to a vector of column family handle + `unique_ptr` to child iterator pairs and use move semantics to take ownership of this vector (instead of taking two separate vectors of column family handles and raw iterator pointers). 4) Constructor arguments and the members of `MultiCfIteratorImpl` are reordered for consistency. Reviewed By: jowlyzhang Differential Revision: D65407521 fbshipit-source-id: 66c2c689ec8b036740bd98641b7b5c0ff7e777f2	2024-11-04 18:06:07 -08:00
Peter Dillinger	e7ffca9493	Refactoring toward making preserve/preclude options mutable (#13114 ) Summary: Move them to MutableCFOptions and perform appropriate refactorings to make that work. I didn't want to mix up refactoring with interesting functional changes. Potentially non-trivial bits here: * During DB Open or RegisterRecordSeqnoTimeWorker we use `GetLatestMutableCFOptions()` because either (a) there might not be a current version, or (b) we are in the process of applying the desired next options. * Upgrade some test infrastructure to allow some options in MutableCFOptions to be mutable (should be a temporary state) * Fix a warning that showed up about uninitialized `paranoid_memory_checks` Pull Request resolved: https://github.com/facebook/rocksdb/pull/13114 Test Plan: existing tests, manually check options are still not settable with SetOptions Reviewed By: jowlyzhang Differential Revision: D65429031 Pulled By: pdillinger fbshipit-source-id: 6e0906d08dd8ddf62731cefffe9b8d94149942b9	2024-11-04 16:15:10 -08:00
changyubi	2ce6902cf5	Introduce an interface `ReadOnlyMemTable` for immutable memtables (#13107 ) Summary: This PR sets up follow-up changes for large transaction support. It introduces an interface that allows custom implementations of immutable memtables. Since transactions use a WriteBatchWithIndex to index their operations, I plan to add a ReadOnlyMemTable implementation backed by WriteBatchWithIndex. This will enable direct ingestion of WriteBatchWithIndex into the DB as an immutable memtable, bypassing memtable writes for transactions. The changes mostly involve moving required methods for immutable memtables into the ReadOnlyMemTable class. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13107 Test Plan: * Existing unit test and stress test. * Performance: I do not expect this change to cause noticeable performance regressions with LTO and devirtualization. The memtable-only readrandom benchmark shows no consistent performance difference: ``` USE_LTO=1 OPTIMIZE_LEVEL="-O3" DEBUG_LEVEL=0 make -j160 db_bench (for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --num=250000 --reads=500000 --seed=1723056275 2>&1 \| grep "readrandom"; done;) \| awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; 3 runs: main: 760728, 752727, 739600 PR: 763036, 750696, 739022 ``` Reviewed By: jowlyzhang Differential Revision: D65365062 Pulled By: cbi42 fbshipit-source-id: 40c673ab856b91c65001ef6d6ac04b65286f2882	2024-11-04 16:09:34 -08:00
Yu Zhang	24045549a6	Add a flag for testing standalone range deletion file (#13101 ) Summary: As titled. This flag controls how frequent standalone range deletion file is tested in the file ingestion flow, for better debuggability. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13101 Test Plan: Manually tested in stress test Reviewed By: hx235 Differential Revision: D65361004 Pulled By: jowlyzhang fbshipit-source-id: 21882e7cc5918aff45449acaeb33b696ab1e37f0	2024-11-01 17:07:34 -07:00
Levi Tamasi	1006eddd63	Make BaseDeltaIterator honor allow_unprepared_value (#13111 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13111 As a follow-up to https://github.com/facebook/rocksdb/pull/13105, the patch changes `BaseDeltaIterator` so that it honors the read option `allow_unprepared_value`. When the option is set and the `BaseDeltaIterator` lands on the base iterator, it defers calling `PrepareValue` on the base iterator and setting `value()` and `columns()` until `PrepareValue` is called on the `BaseDeltaIterator` itself. Reviewed By: jowlyzhang Differential Revision: D65344764 fbshipit-source-id: d79c77b5de7c690bf2deeff435e9b0a9065f6c5c	2024-11-01 14:10:18 -07:00
Andrew Ryan Chang	7c98a2d130	Update MultiGet to respect the strict_capacity_limit block cache option (#13104 ) Summary: There is a `strict_capacity_limit` option which imposes a hard memory limit on the block cache. When the block cache is enabled, every read request is serviced from the block cache. If the required block is missing, it is first inserted into the cache. If `strict_capacity_limit` is `true` and the limit has been reached, the `Get` and `MultiGet` requests should fail. However, currently this is not happening for `MultiGet`. I updated `MultiGet` to explicitly check the returned status of `MaybeReadBlockAndLoadToCache`, so the status does not get overwritten later. Thank you anand1976 for the problem explanation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13104 Test Plan: Added unit test for both `Get` and `MultiGet` with a `strict_capacity_limit` set. Before the change, half of my unit test cases failed https://github.com/facebook/rocksdb/actions/runs/11604597524/job/32313608085?pr=13104. After I added the check for the status returned by `MaybeReadBlockAndLoadToCache`, they all pass. I also ran these tests manually (I had to run `make clean` before): ``` make -j64 block_based_table_reader_test COMPILE_WITH_ASAN=1 ASSERT_STATUS_CHECKED=1 ./block_based_table_reader_test --gtest_filter="StrictCapacityLimitReaderTest.Get" ./block_based_table_reader_test --gtest_filter="StrictCapacityLimitReaderTest.MultiGet" ``` Reviewed By: anand1976 Differential Revision: D65302470 Pulled By: archang19 fbshipit-source-id: 28dcc381e67e05a89fa9fc9607b4709976d6d90e	2024-11-01 13:22:27 -07:00
Andrew Ryan Chang	af2a36d2c7	Record newest_key_time as a table property (#13083 ) Summary: This PR does two things: 1. Adds a new table property `newest_key_time` 2. Uses this property to improve TTL and temperature change compaction. ### Context The current `creation_time` table property should really be named `oldest_ancestor_time`. For flush output files, this is the oldest key time in the file. For compaction output files, this is the minimum among all oldest key times in the input files. The problem with using the oldest ancestor time for TTL compaction is that we may end up dropping files earlier than we should. What we really want is the newest (i.e. "youngest") key time. Right now we take a roundabout way to estimate this value -- we take the value of the _oldest_ key time for the _next_ (newer) SST file. This is also why the current code has checks for `index >= 1`. Our new property `newest_key_time` is set to the file creation time during flushes, and the max over all input files for compactions. There were some additional smaller changes that I had to make for testing purposes: - Refactoring the mock table reader to support specifying my own table properties - Refactoring out a test utility method `GetLevelFileMetadatas` that would otherwise be copy/pasted in 3 places Credit to cbi42 for the problem explanation and proposed solution ### Testing - Added a dedicated unit test to my `newest_key_time` logic in isolation (i.e. are we populating the property on flush and compaction) - Updated the existing unit tests (for TTL/temperate change compaction), which were comprehensive enough to break when I first made my code changes. I removed the test setup code which set the file metadata `oldest_ancestor_time`, so we know we are actually only using the new table property instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13083 Reviewed By: cbi42 Differential Revision: D65298604 Pulled By: archang19 fbshipit-source-id: 898ef91b692ab33f5129a2a16b64ecadd4c32432	2024-11-01 10:08:35 -07:00
Peter Dillinger	a28cc4a38c	Fix a leak of open Blob files (#13106 ) Summary: An earlier change (`b34cef57b7`) removed apparently unused functionality where an obsolete blob file number is passed for removal from TableCache, which manages SST files. This was actually relying on broken/fragile abstractions wherein TableCache and BlobFileCache share the same Cache and using the TableCache interface to manipulate blob file caching. No unit test was actually checking for removal of obsolete blob files from the cache (which is somewhat tricky to check and a second order correctness requirement). Here we fix the leak and add a DEBUG+ASAN-only check in DB::Close() that no obsolete files are lingering in the table/blob file cache. Fixes https://github.com/facebook/rocksdb/issues/13066 Important follow-up (FIXME): The added check discovered some apparent cases of leaked (into table_cache) SST file readers that would stick around until DB::Close(). Need to enable that check, diagnose, and fix. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13106 Test Plan: added a check that is called during DB::Close in ASAN builds (to minimize paying the cost in all unit tests). Without the fix, the check failed in at least these tests: ``` db_blob_basic_test DBBlobBasicTest.DynamicallyWarmCacheDuringFlush db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadMerge db_blob_compaction_test DBBlobCompactionTest.MergeBlobWithBase db_blob_compaction_test DBBlobCompactionTest.CompactionDoNotFillCache db_blob_compaction_test DBBlobCompactionTest.SkipUntilFilter db_blob_compaction_test DBBlobCompactionTest.CompactionFilter db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadFilter db_blob_compaction_test DBBlobCompactionTest.CompactionReadaheadGarbageCollection ``` Reviewed By: ltamasi Differential Revision: D65296123 Pulled By: pdillinger fbshipit-source-id: 2276d76482beb2c75c9010bc1bec070bb23a24c0	2024-10-31 15:29:30 -07:00
Levi Tamasi	ef535039f3	Call PrepareValue on the base iterator in BaseDeltaIterator (#13105 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13105 The `WriteBatchWithIndex::NewIteratorWithBase` interface enables creating a `BaseDeltaIterator` with an arbitrary base iterator passed in by the client, which has potentially been created with the `allow_unprepared_value` read option set. Because of this, `BaseDeltaIterator` has to call `PrepareValue` before using the `value()` or `columns()` from the base iterator. This includes both the case when `BaseDeltaIterator` exposes the `value()` and `columns()` of the base iterator as is and the case when the final `value()` / `columns()` is a result of merging key-values across the base and delta iterators. Note that `BaseDeltaIterator` itself does not support `allow_unprepared_value` yet; this will be implemented in an upcoming patch. Reviewed By: jowlyzhang Differential Revision: D65249643 fbshipit-source-id: b0a1ccc0dfd31105b2eef167b463ed15a8bb83b7	2024-10-31 14:20:33 -07:00
Jay Huh	1987313a94	TableProperties Serialization Follow Ups (#13095 ) Summary: Follow ups from https://github.com/facebook/rocksdb/issues/13089 - Take `TableProperties` as `const &` instead of `std::shared_ptr<const TableProperties>` - Move TableProperties OptionsTypeMap definition to another place for other use outside of Remote Compaction - Add a test verify that the set of field serializations of TableProperties is complete Pull Request resolved: https://github.com/facebook/rocksdb/pull/13095 Test Plan: ``` ./options_settable_test --gtest_filter="TablePropertiesAllFieldsSettable" ``` I also intentionally tried adding a new field to `TableProperties`. If it's missed in the OptionsType map, the test detects the missing bytes set and successfully fails. Reviewed By: pdillinger Differential Revision: D65077398 Pulled By: jaykorean fbshipit-source-id: cf10560eb4a467ca523b11fd64945dbc86ac378f	2024-10-31 11:13:53 -07:00
Peter Dillinger	e34087c524	Add a temporary hook for custom yielding in long-running op (#13103 ) Summary: This is a simplified version of https://github.com/facebook/rocksdb/issues/13096, which called for a way to hook into long-running loops completely within RocksDB to change their thread priority (or similar). The current prime hook point is `DBIter::FindNextUserEntryInternal` likely because of iterating over tombstones. This is implemented using the weak symbol hack for ease of back-porting/patching, and while we get to know potential future requirements better for integration into the public API. (Consider potential relationships to `Env::GetThreadStatusUpdater()` and `TransactionDBMutexFactory`.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/13103 Test Plan: Performance validated with db_bench and DEBUG_LEVEL=0: `./db_bench --benchmarks=fillseq,deleterandom,readseq[-X100] --value_size=1 --num=1000000` No consistent difference seen; variances likely in how DB / executable / memory were laid out. ``` With an empty hook: readseq [AVG 100 runs] : 1753018 (± 8850) ops/sec; 28.4 (± 0.1) MB/sec readseq [MEDIAN 100 runs] : 1763746 ops/sec; 28.6 MB/sec (recompile) readseq [AVG 100 runs] : 1789019 (± 10260) ops/sec; 29.0 (± 0.2) MB/sec readseq [MEDIAN 100 runs] : 1801849 ops/sec; 29.2 MB/sec Base: readseq [AVG 100 runs] : 1772196 (± 8240) ops/sec; 28.7 (± 0.1) MB/sec readseq [MEDIAN 100 runs] : 1780453 ops/sec; 28.9 MB/sec (recompile) readseq [AVG 100 runs] : 1777637 (± 7613) ops/sec; 28.8 (± 0.1) MB/sec readseq [MEDIAN 100 runs] : 1786657 ops/sec; 29.0 MB/sec With a functional hook (count number of calls into it): readseq [AVG 100 runs] : 1796733 (± 8854) ops/sec; 29.1 (± 0.1) MB/sec readseq [MEDIAN 100 runs] : 1804690 ops/sec; 29.3 MB/sec RocksDbThreadYield: 126915800 (recompile) readseq [AVG 100 runs] : 1775371 (± 10529) ops/sec; 28.8 (± 0.2) MB/sec readseq [MEDIAN 100 runs] : 1789046 ops/sec; 29.0 MB/sec RocksDbThreadYield: 126977000 Base: readseq [AVG 100 runs] : 1773071 (± 10657) ops/sec; 28.7 (± 0.2) MB/sec readseq [MEDIAN 100 runs] : 1783414 ops/sec; 28.9 MB/sec (recompile) readseq [AVG 100 runs] : 1750852 (± 10184) ops/sec; 28.4 (± 0.2) MB/sec readseq [MEDIAN 100 runs] : 1763587 ops/sec; 28.6 MB/sec ``` Reviewed By: george-reynya Differential Revision: D65235379 Pulled By: pdillinger fbshipit-source-id: 7829e4cc25a56d4c1801b8adf9c7f7aa49ab7aca	2024-10-30 20:37:28 -07:00
leipeng	8109046222	secondary instance: remove unnessisary cfds_changed->count() (#13086 ) Summary: `cfds_changed->count(cfd)` is not needed, just blind insert. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13086 Reviewed By: hx235 Differential Revision: D64712400 Pulled By: cbi42 fbshipit-source-id: 4ef62aaa724c8397baa4ff350c16a7a8d04d7067	2024-10-29 11:04:20 -07:00
Peter Dillinger	ddafba870d	Fix a HISTORY entry for 9.8.0 (#13097 ) Summary: Forgot to update after generalizing mutability of BBTO Pull Request resolved: https://github.com/facebook/rocksdb/pull/13097 Test Plan: no functional change here Reviewed By: jaykorean Differential Revision: D65095618 Pulled By: pdillinger fbshipit-source-id: 6c37cd0e68756c6b56af1c8e15273fae0ca9224d	2024-10-28 21:27:42 -07:00
Peter Dillinger	9ad772e652	Start version 9.9.0 (#13093 ) Summary: Pull in HISTORY for 9.8.0, update version.h for next version, update check_format_compatible.sh Pull Request resolved: https://github.com/facebook/rocksdb/pull/13093 Test Plan: CI Reviewed By: jowlyzhang Differential Revision: D64987257 Pulled By: pdillinger fbshipit-source-id: a7cec329e3d245e63767760aa0298c08c3281695	2024-10-25 13:47:29 -07:00
Jay Huh	57a8e69d4e	Include TableProperties in the CompactionServiceResult (#13089 ) Summary: In Remote Compactions, the primary host receives the serialized compaction result from the remote worker and deserializes it to build the output. Unlike Local Compactions, where table properties are built by TableBuilder, in Remote Compactions, these properties were not included in the serialized compaction result. This was likely done intentionally since the table properties are already available in the SST files. Because TableProperties are not populated as part of CompactionOutputs for remote compactions, we were unable to log the table properties in OnCompactionComplete and use them for verification. We are adding the TableProperties as part of the CompactionServiceOutputFile in this PR. By including the TableProperties in the serialized compaction result, the primary host will be able to access them and verify that they match the values read from the actual SST files. We are also adding the populating `format_version` in table_properties of in TableBuilder. This has not been a big issue because the `format_version` is written to the SST files directly from `TableOptions.format_version`. When loaded from the SST files, it's populated directly by reading from the MetaBlock. This info has only been missing in the TableBuilder's Rep.props. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13089 Test Plan: ``` ./compaction_job_test ``` ``` ./compaction_service_test ``` Reviewed By: pdillinger Differential Revision: D64878740 Pulled By: jaykorean fbshipit-source-id: b6f2fdce851e6477ecb4dd5a87cdc62e176b746b	2024-10-25 13:13:12 -07:00
Peter Dillinger	3fd1f11d35	Fix race to make BlockBasedTableOptions effectively mutable (#13082 ) Summary: Fix a longstanding race condition in SetOptions for `block_based_table_factory` options. The fix is mostly described in new, unified `TableFactoryParseFn()` in `cf_options.cc`. Also in this PR: * Adds a virtual `Clone()` function to TableFactory * To avoid behavioral hiccups with `SetOptions`, make the "hidden state" of `BlockBasedTableFactory` shared between an original and a clone. For example, `TailPrefetchStats` * `Configurable` was allowed to be copied but was not safe to do so, because the copy would have and use pointers into object it was copied from (!!!). This has been fixed using relative instead of absolute pointers, though it's still technically relying on undefined behavior (consistent object layout for non-standard-layout types). For future follow-up: * Deny SetOptions on block cache options (dubious and not yet made safe with proper shared_ptr handling) Fixes https://github.com/facebook/rocksdb/issues/10079 Pull Request resolved: https://github.com/facebook/rocksdb/pull/13082 Test Plan: added to unit tests and crash test Ran TSAN blackbox crashtest for hours with options to amplify potential race (see https://github.com/facebook/rocksdb/issues/10079) Reviewed By: cbi42 Differential Revision: D64947243 Pulled By: pdillinger fbshipit-source-id: 8390299149f50e2a2b39a5247680f2637edb23c8	2024-10-25 10:24:54 -07:00
Yu Zhang	9c94559de7	Optimize compaction for standalone range deletion files (#13078 ) Summary: This PR adds some optimization for compacting standalone range deletion files. A standalone range deletion file is one with just a single range deletion. Currently, such a file is used in bulk loading to achieve something like atomically delete old version of all data with one big range deletion and adding new version of data. These are the changes included in the PR: 1) When a standalone range deletion file is ingested via bulk loading, it's marked for compaction. 2) When picking input files during compaction picking, we attempt to only pick a standalone range deletion file when oldest snapshot is at or above the file's seqno. To do this, `PickCompaction` API is updated to take existing snapshots as an input. This is only done for the universal compaction + UDT disabled combination, we save querying for existing snapshots and not pass it for all other cases. 3) At `Compaction` construction time, the input files will be filtered to examine if any of them can be skipped for compaction iterator. For example, if all the data of the file is deleted by a standalone range tombstone, and the oldest snapshot is at or above such range tombstone, this file will be filtered out. 4) Every time a snapshot is released, we examine if any column family has standalone range deletion files that becomes eligible to be scheduled for compaction. And schedule one for it. Potential future improvements: - Add some dedicated statistics for the filtered files. - Extend this input filtering to L0 files' compactions cases when a newer L0 file could shadow an older L0 file Pull Request resolved: https://github.com/facebook/rocksdb/pull/13078 Test Plan: Added unit tests and stress tested a few rounds Reviewed By: cbi42 Differential Revision: D64879415 Pulled By: jowlyzhang fbshipit-source-id: 02b8683fddbe11f093bcaa0a38406deb39f44d9e	2024-10-25 09:32:14 -07:00
Changyu Bi	8b38d4b400	Fix write tracing to check callback status (#13088 ) Summary: we currently record write operations to tracer before checking callback in PipelinedWriteImpl and WriteImplWALOnly. For optimistic transaction DB, this means that an operation can be recorded to tracer even when it's not written to DB or WAL. I suspect this is the reason some of our optimistic txn crash test is failing. The evidence is that the trace contains some duplicated entry and has more entries compared to the corresponding entry in WAL. This PR moves the tracer logic to be after checking callback status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13088 Test Plan: monitor crash test. Reviewed By: hx235 Differential Revision: D64711753 Pulled By: cbi42 fbshipit-source-id: 55fd1223538ec6294ce84a957c306d3d9d91df5f	2024-10-21 21:02:03 -07:00
Levi Tamasi	c0be6a4b90	Support allow_unprepared_value for multi-CF iterators (#13079 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13079 The patch adds support for the new read option `allow_unprepared_value` to the multi-column-family iterators `CoalescingIterator` and `AttributeGroupIterator`. When this option is set, these iterators populate their value (`value()` + `columns()` or `attribute_groups()`) in an on-demand fashion when `PrepareValue()` is called. Calling `PrepareValue()` on the child iterators is similarly deferred until `PrepareValue()` is called on the main iterator. Reviewed By: jowlyzhang Differential Revision: D64570587 fbshipit-source-id: 783c8d408ad10074417dabca7b82c5e1fe5cab36	2024-10-20 20:53:08 -07:00
Jay Huh	0ca691654f	Fix Unit Test failing from uninit values in CompactionServiceInput (#13080 ) Summary: # Summary There was a [test failure](https://github.com/facebook/rocksdb/actions/runs/11381731053/job/31663774089?fbclid=IwZXh0bgNhZW0CMTEAAR0YJVdnkKUhN15RJQrLsvicxqzReS6y4A14VFQbWu-81XJsSsyNepXAr2c_aem_JyQqNdtpeKFSA6CjlD-pDg) from uninit value in the CompactionServiceInput ``` [ RUN ] CompactionJobTest.InputSerialization ==79945== Use of uninitialised value of size 8 ==79945== at 0x58EA69B: _itoa_word (_itoa.c:179) ==79945== by 0x5906574: __vfprintf_internal (vfprintf-internal.c:1687) ==79945== by 0x591AF99: __vsnprintf_internal (vsnprintf.c:114) ==79945== by 0x1654AE: std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > __gnu_cxx::__to_xstring<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>(int ()(char, unsigned long, char const, __va_list_tag), unsigned long, char const, ...) (string_conversions.h:111) ==79945== by 0x5126C65: to_string (basic_string.h:6568) ==79945== by 0x5126C65: rocksdb::SerializeSingleOptionHelper(void const, rocksdb::OptionType, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (options_helper.cc:541) ==79945== by 0x512718B: rocksdb::OptionTypeInfo::Serialize(rocksdb::ConfigOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*) const (options_helper.cc:1084) ``` This was due to `options_file_number` value not set in the unit test. However, this value is guaranteed to be set in the normal path. It was just missing in the test path. Setting the 0 as the default value for uninitialized fields in the `CompactionServiceInput` and `CompactionServiceResult` for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13080 Test Plan: Existing tests should be sufficient Reviewed By: cbi42 Differential Revision: D64573567 Pulled By: jaykorean fbshipit-source-id: 7843a951770c74445620623d069a52ba93ad94d5	2024-10-18 07:31:54 -07:00
Hui Xiao	58fc9d61b7	Trim readahead size based on prefix during prefix scan (#13040 ) Summary: Context/Summary: During prefix scan, prefetched data blocks containing keys not in the same prefix as the `Seek()`'s key will be wasted when `ReadOptions::prefix_same_as_start = true` since they won't be returned to the user. This PR is to exclude those data blocks from being prefetched in a similar manner like trimming according to `ReadOptions::iterate_upper_bound`. Bonus: refactoring to some existing prefetch test so they are easier to extend and read Pull Request resolved: https://github.com/facebook/rocksdb/pull/13040 Test Plan: - New UT, integration to existing UTs - Benchmark to ensure no regression from CPU due to more trimming logic ``` // Build DB with one sorted run under the same prefix ./db_bench --benchmarks=fillrandom --prefix_size=3 --keys_per_prefix=5000000 --num=5000000 --db=/dev/shm/db_bench --disable_auto_compactions=1 ``` ``` // Augment the existing db bench to call `Seek()` instead of `SeekToFirst()` in `void ReadSequential(){..}` to trigger the logic in this PR +++ b/tools/db_bench_tool.cc @@ -5900,7 +5900,12 @@ class Benchmark { Iterator* iter = db->NewIterator(options); int64_t i = 0; int64_t bytes = 0; - for (iter->SeekToFirst(); i < reads_ && iter->Valid(); iter->Next()) { + + iter->SeekToFirst(); + assert(iter->status().ok() && iter->Valid()); + auto prefix = prefix_extractor_->Transform(iter->key()); + + for (iter->Seek(prefix); i < reads_ && iter->Valid(); iter->Next()) { bytes += iter->key().size() + iter->value().size(); thread->stats.FinishedOps(nullptr, db, 1, kRead); ++i; : ``` ``` // Compare prefix scan performance ./db_bench --benchmarks=readseq[-X20] --prefix_size=3 --prefix_same_as_start=1 --auto_readahead_size=1 --cache_size=1 --use_existing_db=1 --db=/dev/shm/db_bench --disable_auto_compactions=1 // Before PR readseq [AVG 20 runs] : 2449011 (± 50238) ops/sec; 270.9 (± 5.6) MB/sec readseq [MEDIAN 20 runs] : 2499167 ops/sec; 276.5 MB/sec // After PR (regress 0.4 %) readseq [AVG 20 runs] : 2439098 (± 42931) ops/sec; 269.8 (± 4.7) MB/sec readseq [MEDIAN 20 runs] : 2460859 ops/sec; 272.2 MB/sec ``` - Stress test: randomly set `prefix_same_as_start` in `TestPrefixScan()`. Run below for a while ``` python3 tools/db_crashtest.py --simple blackbox --prefix_size=5 --prefixpercent=65 --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adm_policy=1 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --async_io=0 --avoid_flush_during_recovery=1 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=100 --block_align=1 --block_protection_bytes_per_key=4 --block_size=16384 --bloom_before_level=-1 --bloom_bits=3 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=10000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=0 --compaction_style=2 --compaction_ttl=0 --compress_format_version=2 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db_write_buffer_size=8388608 --decouple_partitioned_filters=1 --default_temperature=kCold --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=1 --enable_sst_partitioner_factory=0 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=0 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=15 --index_shortening=2 --index_type=2 --ingest_external_file_one_in=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=0 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=1000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32 --readpercent=10 --recycle_log_file_num=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=1 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --uncache_aggressiveness=1 --universal_max_read_amp=10 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 --use_get_entity=0 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=10 ``` Reviewed By: anand1976 Differential Revision: D64367065 Pulled By: hx235 fbshipit-source-id: 5750c05ccc835c3e9dc81c961b76deaf30bd23c2	2024-10-17 15:52:55 -07:00
Peter Dillinger	ac24f152a1	Refactor `table_factory` into MutableCFOptions (#13077 ) Summary: This is setting up for a fix to a data race in SetOptions on BlockBasedTableOptions (BBTO), https://github.com/facebook/rocksdb/issues/10079 The race will be fixed by replacing `table_factory` with a modified copy whenever we want to modify a BBTO field. An argument could be made that this change creates more entaglement between features (e.g. BlobSource <-> MutableCFOptions), rather than (conceptually) minimizing the dependencies of each feature, but * Most of these things already depended on ImmutableOptions * Historically there has been a lot of plumbing (and possible small CPU overhead) involved in adding features that need to reach a lot of places, like `block_protection_bytes_per_key`. Keeping those wrapped up in options simplifies that. * SuperVersion management generally takes care of lifetime management of MutableCFOptions, so is not that difficult. (Crash test agrees so far.) There are some FIXME places where it is known to be unsafe to replace `block_cache` unless/until we handle shared_ptr tracking properly. HOWEVER, replacing `block_cache` is generally dubious, at least while existing users of the old block cache (e.g. table readers) can continue indefinitely. The change to cf_options.cc is essentially just moving code (not changing). I'm not concerned about the performance of copying another shared_ptr with MutableCFOptions, but I left a note about considering an improvement if more shared_ptr are added to it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13077 Test Plan: existing tests, crash test. Unit test DBOptionsTest.GetLatestCFOptions updated with some temporary logic. MemoryTest required some refactoring (simplification) for the change. Reviewed By: cbi42 Differential Revision: D64546903 Pulled By: pdillinger fbshipit-source-id: 69ae97ce5cf4c01b58edc4c5d4687eb1e5bf5855	2024-10-17 14:13:20 -07:00
Levi Tamasi	a44f4b8653	Couple of small improvements for (Iterator)AttributeGroup (#13076 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13076 The patch makes it possible to construct an `IteratorAttributeGroup` using an `AttributeGroup` instance, and implements `operator==` / `operator!=` for these two classes consistently. It also makes some minor improvements in the related test suites `CoalescingIteratorTest` and `AttributeGroupIteratorTest`. Reviewed By: jaykorean Differential Revision: D64510653 fbshipit-source-id: 95d3340168fa3b34e7ef534587b19131f0a27fb7	2024-10-16 20:58:26 -07:00
Jay Huh	f22557886e	Fix Compaction Stats (#13071 ) Summary: Compaction stats code is not so straightforward to understand. Here's a bit of context for this PR and why this change was made. - CompactionStats (compaction_stats_.stats): Internal stats about the compaction used for logging and public metrics. - CompactionJobStats (compaction_job_stats_): The public stats at job level. It's part of Compaction event listener and included in the CompactionResult. - CompactionOutputsStats: output stats only. resides in CompactionOutputs. It gets aggregated toward the CompactionStats (internal stats). The internal stats, `compaction_stats_.stats`, has the output information recorded from the compaction iterator, but it does not have any input information (input records, input output files) until `UpdateCompactionStats()` gets called. We cannot simply call `UpdateCompactionStats()` to fill in the input information in the remote compaction (which is a subcompaction of the primary host's compaction) because the `compaction->inputs()` have the full list of input files and `UpdateCompactionStats()` takes the entire list of records in all files. `num_input_records` gets double-counted if multiple sub-compactions are submitted to the remote worker. The job level stats (in the case of remote compaction, it's subcompaction level stat), `compaction_job_stats_`, has the correct input records, but has no output information. We can use `UpdateCompactionJobStats(compaction_stats_.stats)` to set the output information (num_output_records, num_output_files, etc.) from the `compaction_stats_.stats`, but it also sets all other fields including the input information which sets all back to 0. Therefore, we are overriding `UpdateCompactionJobStats()` in remote worker only to update job level stats, `compaction_job_stats_`, with output information of the internal stats. Baiscally, we are merging the aggregated output info from the internal stats and aggregated input info from the compaction job stats. In this PR we are also fixing how we are setting `is_remote_compaction` in CompactionJobStats. - OnCompactionBegin event, if options.compaction_service is set, `is_remote_compaction=true` for all compactions except for trivial moves - OnCompactionCompleted event, if any of the sub_compactions were done remotely, compaction level stats's `is_remote_compaction` will be true Other minor changes - num_output_records is already available in CompactionJobStats. No need to store separately in CompactionResult. - total_bytes is not needed. - Renamed `SubcompactionState::AggregateCompactionStats()` to `SubcompactionState::AggregateCompactionOutputStats()` to make it clear that it's only aggregating output stats. - Renamed `SetTotalBytes()` to `AddBytesWritten()` to make it more clear that it's adding total written bytes from the compaction output. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13071 Test Plan: Unit Tests added and updated ``` ./compaction_service_test ``` Reviewed By: anand1976 Differential Revision: D64479657 Pulled By: jaykorean fbshipit-source-id: a7a776a00dc718abae95d856b661bcbafd3b0ed5	2024-10-16 19:20:37 -07:00
Changyu Bi	8ad4c7efc4	Add an API to check if an SST file is generated by SstFileWriter (#13072 ) Summary: Some users want to check if a file in their DB was created by SstFileWriter and ingested into hte DB. Files created by SstFileWriter records additional table properties `cbebbad7d9/table/sst_file_writer_collectors.h (L17-L21)`. These are not exposed so I'm adding an API to SstFileWriter to check for them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/13072 Test Plan: added some assertions. Reviewed By: jowlyzhang Differential Revision: D64411518 Pulled By: cbi42 fbshipit-source-id: 279bfef48615aa6f78287b643d8445a1471e7f07	2024-10-16 16:57:05 -07:00

1 2 3 4 5 ...

13005 Commits All Branches Search

13005 Commits

All Branches