rocksdb

Commit Graph

Author	SHA1	Message	Date
Peter Dillinger	a178d15baf	More checks around num_entries vs. num_deletions (#12600 ) Summary: We've seen an internal crash test+sanitizer failure seemingly caused by underflow on `current_num_non_deletions_` which would happen if num_entries < num_deletions. (T186407810) This change adds an additional check (fail earlier?) and coerces read table properties to satisfy the invariant that is supposed to be provided by https://github.com/facebook/rocksdb/pull/4841 but could be violated by older files, due to https://github.com/facebook/rocksdb/pull/4016. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12600 Test Plan: existing tests Reviewed By: ajkr Differential Revision: D56796191 Pulled By: pdillinger fbshipit-source-id: 6d22cc40eb74974c42b311293ee2775c6af95afc	2024-05-03 16:40:07 -07:00
Zaidoon Abd Al Hadi	ed01babd07	Expose compaction pri through C API (#12604 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12604 Reviewed By: cbi42 Differential Revision: D56914066 Pulled By: ajkr fbshipit-source-id: 64b51ab2b7b5ec0b5fde5a5f61d076bac1c3a8ad	2024-05-02 18:39:24 -07:00
Changyu Bi	e2ef349f56	Deflake unit test `DBCompactionTest.CompactionLimiter` (#12596 ) Summary: The test has been flaky for a long time. A recent [failure](https://github.com/facebook/rocksdb/actions/runs/8820808355/job/24215219590?pr=12578) shows that there is still a flush running when the assertion fails. I think this is because `WaitForFlushMemTable()` may return before a flush schedules the next compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12596 Test Plan: I could not repro the failure locally: `gtest-parallel --repeat=8000 --workers=100 ./db_compaction_test --gtest_filter="CompactionLimiter"` Reviewed By: ajkr Differential Revision: D56715874 Pulled By: cbi42 fbshipit-source-id: f5f64eb30fff7e115c19beedad2dc22afa06258d	2024-05-02 17:10:06 -07:00
Yu Zhang	241253053a	Fix delete obsolete files on recovery not rate limited (#12590 ) Summary: This PR fixes the issue that deletion of obsolete files during DB::Open is not rate limited. The root cause is that slow deletion is disabled if the trash/db size ratio exceeds the configured `max_trash_db_ratio` `d610e14f93/include/rocksdb/sst_file_manager.h (L126)`; however, the current handling in DB::Open starts with tracking nothing but the obsolete files. This will make the ratio always look like it's 1. In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move these two places that attempt to delete files to after the tracking is done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call. There are some other aesthetic changes, like refactoring the collection of all the DB paths into a function and renaming `DBImpl::DeleteUnreferencedSstFiles` to `DBImpl::MaybeUpdateNextFileNumber`, as it doesn't actually delete the files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590 Test Plan: Added unit test and verified with manual testing Reviewed By: anand1976 Differential Revision: D56830519 Pulled By: jowlyzhang fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885	2024-05-01 12:26:54 -07:00
Yu Zhang	8b3d9e6bfe	Add TimedPut to stress test (#12559 ) Summary: This also updates WriteBatch's protection info to include write time since there are several places in memtable that by default protects the whole value slice. This PR is stacked on https://github.com/facebook/rocksdb/issues/12543 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12559 Reviewed By: pdillinger Differential Revision: D56308285 Pulled By: jowlyzhang fbshipit-source-id: 5524339fe0dd6c918dc940ca2f0657b5f2111c56	2024-04-30 15:40:35 -07:00
Yu Zhang	2c02a9b76f	Preserve TimedPut on penultimate level until it actually expires (#12543 ) Summary: To make sure `TimedPut` entries are placed on the proper tier before and when they become eligible for the cold tier, 1) flush and compaction need to keep the relevant seqno to time mapping for not just the sequence number contained in internal keys, but also the preferred sequence number for `TimedPut` entries. This PR also fixes some bugs in handling `TimedPut` during compaction: 1) dealing with an edge case when a `TimedPut` entry's internal key is the right bound for penultimate level, the internal key after swapping in its preferred sequence number will fall outside of the penultimate range because preferred sequence number is smaller than its original sequence number. The entry however is still safe to be placed on penultimate level, so we keep track of `TimedPut` entry's original sequence number for this check. The idea behind this is that as long as it's safe for the original key to be placed on penultimate level, it's safe for the entry with swapped preferred sequence number to be placed on penultimate level too. Because we only swap in preferred sequence number when that entry is visible to the earliest snapshot and there are no other data points with the same user key in lower levels. On the other hand, as long as it's not safe for the original key to be placed on penultimate level, we will not place the entry after swapping the preferred seqno on penultimate level either. 2) the assertion that preferred seqno is always bigger than original sequence number may fail if this logic is only exercised after sequence number is zeroed out. We adjust the assertion to handle that case too. In this case, we don't swap in the preferred seqno but will adjust its type to `kTypeValue`. 3) there was a special case handling for when range deletion may end up incorrectly covering an entry if preferred seqno is swapped in. 
But it missed the case that if the original entry is already covered by range deletion. The original handling will mistakenly output the entry instead of omitting it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12543 Test Plan: ./tiered_compaction_test --gtest_filter="PrecludeLastLevelTest.PreserveTimedPutOnPenultimateLevel" ./compaction_iterator_test --gtest_filter="TimedPut" Reviewed By: pdillinger Differential Revision: D56195096 Pulled By: jowlyzhang fbshipit-source-id: 37ebb09d2513abbd9e90cda0217e26874584b8f3	2024-04-30 11:16:02 -07:00
Peter Dillinger	45c105104b	Set optimize_filters_for_memory by default (#12377 ) Summary: This feature has been around for a couple of years and users haven't reported any problems with it. Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377 Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in. Reviewed By: jowlyzhang Differential Revision: D54129517 Pulled By: pdillinger fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c	2024-04-30 08:33:31 -07:00
Changyu Bi	5c1334f763	DeleteRange() return NotSupported if row_cache is configured (#12512 ) Summary: ...since this feature combination is not supported yet (https://github.com/facebook/rocksdb/issues/4122). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12512 Test Plan: new unit test. Reviewed By: jaykorean, jowlyzhang Differential Revision: D55820323 Pulled By: cbi42 fbshipit-source-id: eeb5e97d15c9bdc388793a2fb8e52cfa47e34bcf	2024-04-29 16:33:13 -07:00
Andrew Kryczka	b2931a5c53	Fixed `MultiGet()` error handling to not skip blob dereference (#12597 ) Summary: See comment at top of the test case and release note. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597 Reviewed By: jaykorean Differential Revision: D56718786 Pulled By: ajkr fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f	2024-04-29 14:18:42 -07:00
anand76	e36b0a2da4	Fix corruption bug when recycle_log_file_num changed from 0 (#12591 ) Summary: When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 || s >= first_seqno_` in `rocksdb::MemTable::Add`. We could fix this by either 1) Writing a special record when reusing a log file, or 2) Implementing more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) Not reusing files created by a previous DB session. We choose option 3 as it's the simplest, and flipping `recycle_log_file_num` is expected to be a rare event. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591 Test Plan: 1. Add a unit test to verify the bug and fix Reviewed By: jowlyzhang Differential Revision: D56655812 Pulled By: anand1976 fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87	2024-04-29 12:25:00 -07:00
Andrew Kryczka	2ec25a3e54	Prevent data block compression with `BlockBasedTableOptions::block_align` (#12592 ) Summary: Made `BlockBasedTableOptions::block_align` incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`. Previously it was only incompatible with `ColumnFamilyOptions::compression`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12592 Reviewed By: hx235 Differential Revision: D56650862 Pulled By: ajkr fbshipit-source-id: f5201602c2ce436e6d8d30893caa6a161a61f141	2024-04-26 20:05:30 -07:00
Andrew Kryczka	177ccd3904	Print more debug info in test when `SyncWAL()` fails (#12580 ) Summary: Example failure (cannot reproduce): ``` [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBWriteTestInstance/DBWriteTest [ RUN ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0 db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file [ FAILED ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0 (49 ms) [----------] 1 test from DBWriteTestInstance/DBWriteTest (49 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (49 ms total) [ PASSED ] 0 tests. 
[ FAILED ] 1 test, listed below: [ FAILED ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0 ``` I have no idea why `SyncWAL()` would not be supported from what is presumably a `SpecialEnv` so added more debug info in case it fails again in CI. The last failure was https://github.com/facebook/rocksdb/actions/runs/8731304938/job/23956487511?fbclid=IwAR2jyXgVQtCezri3axV5MwMdI7D6VIudMk1xkiN_FL9-x2dkBv4IqIjjgB4 and it only happened once ever AFAIK. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12580 Reviewed By: hx235 Differential Revision: D56541996 Pulled By: ajkr fbshipit-source-id: 1eab17567db783c11054fa85dd8b8880eacd3a50	2024-04-25 14:34:11 -07:00
Jay Huh	f16ba42116	Fix IteratorsConsistentView tests (#12582 ) Summary: Fixing the failure in IteratorsConsistentViewExplicitSnapshot as shown in https://github.com/facebook/rocksdb/actions/runs/8825927545/job/24230854140?pr=12581 The failure was due to the timing of the `flush()` for the later Column Family in the loop. If the flush for the later CFs installs the new super version before getting the SV for the iterator, assertion succeeds, but if the order flips, SV will be obsolete and assertion can fail. This PR simplifies the test in a way that we do only one `flush()` so that `SYNC_POINT` can guarantee the order of operations. For ImplicitSnapshot test, it now just triggers flush for the second CF after obtaining SV for the first CF. For the ExplicitSnapshot test, it now triggers atomic flush() for all CFs after obtaining SV for the first CF. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12582 Test Plan: ``` ./db_iterator_test --gtest_filter="IteratorsConsistentView" ./multi_cf_iterator_test -- --gtest_filter="ConsistentView" ``` Reviewed By: ajkr, jowlyzhang Differential Revision: D56557234 Pulled By: jaykorean fbshipit-source-id: 7aa2f6d0e12a915b6e16cd240389bcfb5b4a5b62	2024-04-25 14:06:46 -07:00
Jay Huh	1fca175eec	MultiCFSnapshot for NewIterators() API (#12573 ) Summary: As mentioned in https://github.com/facebook/rocksdb/issues/12561 and https://github.com/facebook/rocksdb/issues/12566 , `NewIterators()` API has not been providing consistent view of the db across multiple column families. This PR addresses it by utilizing `MultiCFSnapshot()` function which has been used for `MultiGet()` APIs. To be able to obtain the thread-local super version with ref, `sv_exclusive_access` parameter has been added to `MultiCFSnapshot()` so that we could call `GetReferencedSuperVersion()` or `GetAndRefSuperVersion()` depending on the param and support `Refresh()` API for MultiCfIterators Pull Request resolved: https://github.com/facebook/rocksdb/pull/12573 Test Plan: Unit Tests Added ``` ./db_iterator_test --gtest_filter="IteratorsConsistentView" ``` ``` ./multi_cf_iterator_test -- --gtest_filter="ConsistentView" ``` Performance Check Setup ``` make -j64 release TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 6.374 micros/op 156892 ops/sec 6.374 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 6.265 micros/op 159627 ops/sec 6.265 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: jowlyzhang Differential Revision: D56444066 Pulled By: jaykorean fbshipit-source-id: 327ce73c072da30c221e18d4f3389f49115b8f99	2024-04-24 15:28:55 -07:00
Andrew Kryczka	6807da0b44	Fix `DisableManualCompaction()` hang (#12578 ) Summary: Prior to this PR the following sequence could happen: 1. `RunManualCompaction()` A schedules compaction to thread pool and waits 2. `RunManualCompaction()` B waits without scheduling anything due to conflict 3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both 4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done 5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool 6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction 7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution 8. `DisableManualCompaction()` returns as there are no more manual compactions running Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward: 5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done 6'. `DisableManualCompaction()` returns as there are no more manual compactions running Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578 Reviewed By: cbi42 Differential Revision: D56528144 Pulled By: ajkr fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b	2024-04-24 12:40:36 -07:00
Andrew Kryczka	3f3045a405	fix DeleteRange+memtable_insert_with_hint_prefix_extractor interaction (#12558 ) Summary: Previously `insert_hints_` was used for both point key table (`table_`) and range deletion table (`range_del_table_`). Hints include pointers to table data, so mixing hints for different tables together without tracking which hint corresponds to which table was problematic. We can just make the hints dedicated to the point key table only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12558 Reviewed By: hx235 Differential Revision: D56279019 Pulled By: ajkr fbshipit-source-id: 00fe5ce72f9f11a1c1cba5f1977b908b2d518f29	2024-04-22 20:13:58 -07:00
Levi Tamasi	bcfe4a0dcf	Make sure DBImplFollower::stop_requested_ is initialized (#12572 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12572 Reviewed By: jowlyzhang, anand1976 Differential Revision: D56426800 fbshipit-source-id: a31f86d8869148092325924db4e7fbfad28777a4	2024-04-22 12:02:28 -07:00
anand76	d8fb849b7e	Basic RocksDB follower implementation (#12540 ) Summary: A basic implementation of RocksDB follower mode, which opens a remote database (referred to as leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but is different in some key ways - 1. It has its own directory with links to the leader's database 2. Periodically refreshes itself 3. (Future) Snapshot support 4. (Future) Garbage collection of obsolete links 5. (Long term) Memtable replication There are two main classes implementing this functionality - `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Similar to `DBImplSecondary`, it implements recovery and catch up through MANIFEST tailing using the `ReactiveVersionSet`, but does not consider logs. In a future PR, we will implement memtable replication, which will eliminate the need to catch up using logs. In addition, the recovery and catch-up tries to avoid directory listing as repeated metadata operations are expensive. The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and creates the illusion of the follower directory as a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created which prevent the underlying files from getting deallocated even if the leader deletes the files. TODOs: Deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12540 Reviewed By: jowlyzhang Differential Revision: D56315718 Pulled By: anand1976 fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b	2024-04-19 19:13:31 -07:00
Jay Huh	909ff2c208	MultiCFSnapshot Refactor - separate multiget key range info from CFD & superversion info (#12561 ) Summary: While implementing MultiCFIterators (CoalescingIterator and AttributeGroupIterator), we found that the existing `NewIterators()` API does not ensure a uniform view of the DB across all column families. The `NewIterators()` function is utilized to generate child iterators for the MultiCfIterators, and it's expected that all child iterators maintain a consistent view of the DB. For example, within the loop where the super version for each CF is being obtained, if a CF undergoes compaction after the super versions for previous CFs have already been retrieved, we lose the consistency in the view of the CFs for the iterators because the API is not under a db mutex. This preliminary refactoring of `MultiCFSnapshot` aims to address this issue in the `NewIterators()` API in a later PR. Currently, `MultiCFSnapshot` is used to achieve a consistent view across CFs in `MultiGet`. The `MultiGetColumnFamilyData` contains MultiGet-specific information that can be decoupled from the cfd and sv, allowing `MultiCFSnapshot` to be used in other places. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12561 Test Plan: Existing Unit Tests for `MultiCFSnapshot()` ``` ./db_basic_test -- --gtest_filter="MultiGet" ``` Performance Test Setup ``` make -j64 release TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 4.760 micros/op 210072 ops/sec 4.760 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 4.593 micros/op 217727 ops/sec 4.593 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: anand1976 Differential Revision: D56309422 Pulled By: jaykorean fbshipit-source-id: 7a9164d12c810b6c2d2db062827fcc4a36cbc77b	2024-04-18 20:11:01 -07:00
anand76	97991960e9	Retry DB::Open upon a corruption detected while reading the MANIFEST (#12518 ) Summary: This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427 . On file systems that support storage level data checksum and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but it's a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518 Reviewed By: ajkr Differential Revision: D55932155 Pulled By: anand1976 fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9	2024-04-18 17:36:33 -07:00
Levi Tamasi	0df601ab07	Reset user-facing wide-column structures upon deserialization failures (#12562 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12562 The patch makes a small usability improvement by consistently resetting any user-facing wide-column structures (`DBIter::columns()`, `BaseDeltaIterator::columns()`, and any `PinnableWideColumns` objects) upon encountering any deserialization failures. Reviewed By: jaykorean Differential Revision: D56312764 fbshipit-source-id: 44efed0d1720cc06bf6facf928f73ce39a1bd2ca	2024-04-18 13:08:34 -07:00
Levi Tamasi	e82fe7c0b7	Fix the move semantics of PinnableWideColumns (#12557 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12557 Unlike for other sequence containers, the C++ standard allows moving an `std::string` to invalidate pointers/iterators/references. In practice, this happens with short strings which are stored "inline" in the `std::string` object (small string optimization). Since `PinnableSlice` uses `std::string` as its internal buffer, and `PinnableWideColumns` in turn is implemented in terms of `PinnableSlice`, this means that the default compiler-generated move operations can invalidate the column index stored in `PinnableWideColumns::columns_`. The PR fixes this by providing custom move constructor/move assignment implementations for `PinnableWideColumns` that recreate the `columns_` index upon move. Reviewed By: jaykorean Differential Revision: D56275054 fbshipit-source-id: e8648c003dbcf1c39ec122ad229780c28138e730	2024-04-17 18:56:23 -07:00
Jay Huh	4f584652ab	Add an option to wait for purge in WaitForCompact (#12520 ) Summary: Adding an option to wait for purge to complete in `WaitForCompact` API. Internally, RocksDB has a way to wait for purge to complete (e.g. TEST_WaitForPurge() in db_impl_debug.cc), but there's no public API available to gracefully wait for purge to complete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12520 Test Plan: Unit Test Added - `WaitForCompactWithWaitForPurgeOptionTest` ``` ./deletefile_test -- --gtest_filter="WaitForCompactWithWaitForPurgeOptionTest" ``` Existing Tests ``` ./db_compaction_test -- --gtest_filter="WaitForCompactWithOption" ``` Reviewed By: ajkr Differential Revision: D55888283 Pulled By: jaykorean fbshipit-source-id: cfc6d6e8657deaefab8961890b36e390095c9f65	2024-04-17 17:33:27 -07:00
Andrew Kryczka	7027265417	Fix `max_successive_merges` counting CPU overhead regression (#12546 ) Summary: In https://github.com/facebook/rocksdb/issues/12365 we made `max_successive_merges` non-strict by default. Before https://github.com/facebook/rocksdb/issues/12365, `CountSuccessiveMergeEntries()`'s scan was implicitly limited to `max_successive_merges` entries for a given key, because after that the merge operator would be invoked and the merge chain would be collapsed. After https://github.com/facebook/rocksdb/issues/12365, the merge chain will not be collapsed no matter how long it is when the chain's operands are not all in memory. Since `CountSuccessiveMergeEntries()` scanned the whole merge chain, https://github.com/facebook/rocksdb/issues/12365 had a side effect that it would scan more memtable entries. This PR introduces a limit so it won't scan more entries than it could before. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12546 Reviewed By: jaykorean Differential Revision: D56193693 Pulled By: ajkr fbshipit-source-id: b070ba0703ef733e0ff230f89cd5cca5233b84da	2024-04-17 12:11:24 -07:00
Jay Huh	02ea0d6367	Reserve vector in advance to avoid resizing in GetLiveFilesMetaData (#12554 ) Summary: As title Pull Request resolved: https://github.com/facebook/rocksdb/pull/12554 Test Plan: Existing CI Reviewed By: ajkr Differential Revision: D56252201 Pulled By: jaykorean fbshipit-source-id: 06211555a54ce5e6bf656b81109022494e6787ea	2024-04-17 11:01:06 -07:00
Jay Huh	b7319d8a10	MultiCfIterator - Tests for lower/upper bounds (#12548 ) Summary: Thanks to how we are using `DBIter` as child iterators in MultiCfIterators (both `CoalescingIterator` and `AttributeGroupIterator`), we got the lower/upper bound feature for free. This PR simply adds unit test coverage to ensure that the lower/upper bounds are working as expected in the MultiCfIterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12548 Test Plan: UnitTest Added ``` ./multi_cf_iterator_test ``` Reviewed By: ltamasi Differential Revision: D56197966 Pulled By: jaykorean fbshipit-source-id: fa51cc70705dbc5efd836ac006a7c6a49d05707a	2024-04-16 14:20:13 -07:00
Jay Huh	d34712e0ac	MultiCfIterator - AttributeGroupIter Impl & CoalescingIter Optimization (#12534 ) Summary: Continuing from the previous MultiCfIterator Implementations - (https://github.com/facebook/rocksdb/issues/12422, https://github.com/facebook/rocksdb/issues/12480 #12465), this PR completes the `AttributeGroupIterator` by implementing `AttributeGroupIteratorImpl::AddToAttributeGroups()`. While implementing the `AttributeGroupIterator`, we had to make some changes in `MultiCfIteratorImpl` and found an opportunity to improve `Coalesce()` in `CoalescingIterator`. Lifting `UNDER CONSTRUCTION - DO NOT USE` comment by replacing it with `EXPERIMENTAL` Here are some implementation details: - `IteratorAttributeGroups` is introduced to avoid having to copy all `WideColumn` objects during iteration. - `PopulateIterator()` no longer advances non-top iterators that have the same key as the top iterator in the heap. - `AdvanceIterator()` needs to advance the non-top iterators when they have the same key as the top iterator in the heap. - Instead of populating one by one, `PopulateIterator()` now collects all items with the same key and calls `populate_func(items)` at once. - This allowed optimization in `Coalesce()` such that we no longer do K-1 rounds of 2-way merge, but do one K-way merge instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12534 Test Plan: Uncommented the assertions in `verifyAttributeGroupIterator()` ``` ./multi_cf_iterator_test ``` Reviewed By: ltamasi Differential Revision: D56089019 Pulled By: jaykorean fbshipit-source-id: 6b0b4247e221f69b40b147d41492008cc9b15054	2024-04-16 08:45:38 -07:00
Yu Zhang	b166ca8b74	Second attempt #12386 (#12529 ) Summary: Check https://github.com/facebook/rocksdb/issues/12386 back in now that we have figured out MyRocks build's failure and unblocked it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12529 Reviewed By: ajkr Differential Revision: D56047495 Pulled By: jowlyzhang fbshipit-source-id: f90664b9e72c085e068f174720f126b80ad4e8ea	2024-04-12 10:14:44 -07:00
Andrew Kryczka	8897bf2d04	Drop unsynced data in `TestFSWritableFile::Close()` (#12528 ) Summary: Our `FileSystem` for simulating unsynced data loss should not sync during `Close()` because it masks bugs where we forgot to sync as long as we closed the file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12528 Test Plan: Peeled back https://github.com/facebook/rocksdb/issues/10560 fix and verified it is caught much faster now (few seconds vs. ???) with command like ``` $ TEST_TMPDIR=./ python3 tools/db_crashtest.py blackbox --disable_wal=0 --max_key=1000 --write_buffer_size=131072 --max_bytes_for_level_base=524288 --target_file_size_base=131072 --interval=3 --sync_fault_injection=1 --enable_blob_files=0 --manual_wal_flush_one_in=10 --sync_wal_one_in=0 --get_live_files_one_in=0 --get_sorted_wal_files_one_in=0 --backup_one_in=0 --checkpoint_one_in=0 --write_fault_one_in=0 --read_fault_one_in=0 --open_write_fault_one_in=0 --compact_range_one_in=0 --compact_files_one_in=0 --open_read_fault_one_in=0 --get_property_one_in=0 --writepercent=100 -readpercent=0 -prefixpercent=0 -delpercent=0 -delrangepercent=0 -iterpercent=0 ``` Reviewed By: anand1976 Differential Revision: D56033250 Pulled By: ajkr fbshipit-source-id: 6bbf480d79a06c46f08f6214010937f6654af5ca	2024-04-12 09:57:56 -07:00
Vershinin Maxim 00873208	70d3fc3b6f	Fix error for CF smallest and largest keys computation in ImportColumnFamilyJob::Prepare (#12526 ) Summary: This PR fixes an error in the computation of the CF smallest and largest keys in ImportColumnFamilyJob::Prepare. Before this fix, the smallest and largest keys for the CF were computed incorrectly, and the ImportColumnFamilyJob::Prepare function might not have detected overlaps between CFs. I added a test to detect this error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12526 Reviewed By: hx235 Differential Revision: D56046044 Pulled By: ajkr fbshipit-source-id: d562fbfc9cc2d9624372d24d34a649198a960691	2024-04-11 21:54:51 -07:00
Jay Huh	58a98bded9	MultiCFIterator Refactor - CoalescingIterator & AttributeGroupIterator (#12480 ) Summary: There are a couple of reasons to modify the current implementation of the MultiCfIterator, which implements the generic `Iterator` interface. - The default behavior of `value()`/`columns()` returning data from different Column Families for different keys can be prone to errors, even though there might be valid use cases where users do not care about the origin of the value/columns. - The `attribute_groups()` API, which is not yet implemented, will not be useful for a single-CF iterator. In this PR, we are implementing the following changes: - `IteratorBase` introduced, which includes all basic iterator functions except `value()` and `columns()`. - `Iterator`, which now inherits from `IteratorBase`, includes `value()` and `columns()`. - New public interface `AttributeGroupIterator` inherits from `IteratorBase` and additionally includes `attribute_groups()` (to be implemented). - Renamed former `MultiCfIterator` to `CoalescingIterator` which inherits from `Iterator` - Existing MultiCfIteratorTest has been split into two - `CoalescingIteratorTest` and `AttributeGroupIteratorTest`. - Moved AttributeGroup related code from `wide_columns.h` to a new file, `attribute_groups.h`. Some Implementation Details - `MultiCfIteratorImpl` takes two functions - `populate_func` and `reset_func` - and uses them to populate `value_` and `columns_` in CoalescingIterator and `attribute_groups_` in AttributeGroupIterator. In CoalescingIterator, populate_func is `Coalesce()`; in AttributeGroupIterator, populate_func is `AddToAttributeGroups()`. `reset_func` clears populated value_, columns_ and attribute_groups_ accordingly. - `Coalesce()` merge sorts columns from multiple CFs when a key exists in more than one CF. A column that appears in a later CF overwrites the prior ones. 
For example, if CF1 has `"key_1" ==> {"col_1": "foo", "col_2": "baz"}` and CF2 has `"key_1" ==> {"col_2": "quux", "col_3": "bla"}`, then when the iterator is at `key_1`, `columns()` will return `{"col_1": "foo", "col_2": "quux", "col_3": "bla"}`. In this example, `value()` will be empty, because neither CF has a value for `kDefaultColumnName`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12480 Test Plan: ## Unit Test ``` ./multi_cf_iterator_test ``` ## Performance Test To make sure this change does not impact existing `Iterator` performance Build ``` $> make -j64 release ``` Setup ``` $> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.519 micros/op 1927904 ops/sec 0.519 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 5.302 micros/op 188589 ops/sec 5.303 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.497 micros/op 2011012 ops/sec 0.497 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 5.252 micros/op 190405 ops/sec 5.252 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: ltamasi Differential Revision: D55353909 Pulled By: jaykorean fbshipit-source-id: 8d7786ffee09e022261ce34aa60e8633685e1946	2024-04-11 11:34:04 -07:00
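The coalescing rule described in commit 58a98bded9 (a column from a later CF overwrites the same column from an earlier CF) can be illustrated with a minimal sketch. This is not the RocksDB implementation: `Columns` and `CoalesceSketch` are hypothetical names, and a `std::map` stands in for the merge sort over already-sorted wide columns that the commit message describes.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a wide-column entity: column name -> value,
// kept sorted by column name (as std::map does).
using Columns = std::map<std::string, std::string>;

// Combine the columns found for one key across several column families.
// `per_cf` is ordered by CF; a column from a later CF overwrites the same
// column from an earlier CF.
Columns CoalesceSketch(const std::vector<Columns>& per_cf) {
  Columns result;
  for (const Columns& cf_columns : per_cf) {
    for (const auto& kv : cf_columns) {
      result[kv.first] = kv.second;  // later CF wins on a name collision
    }
  }
  return result;
}
```

With the CF1 and CF2 contents from the commit message as input, the sketch yields the three columns `col_1`, `col_2`, `col_3`, with `col_2` taken from CF2.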
Yu Zhang	fab9dd9635	Temporary revert #12386 to unblock MyRocks build (#12523 ) Summary: MyRocks reports build failure with this change (build failures in this diff: https://www.internalfb.com/diff/D55924596) https://github.com/facebook/rocksdb/issues/12386, we haven't figured out how to fix it yet. So we are temporarily reverting it to unblock them. This reverts commit `3104e55f29`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12523 Reviewed By: hx235 Differential Revision: D55981751 Pulled By: jowlyzhang fbshipit-source-id: 1d7edd42b65ca847cec67549644a2b1e5775841e	2024-04-10 13:47:52 -07:00
Hui Xiao	abdbeedba6	Miscellaneous improvement to info printing (#12504 ) Summary: Context/Summary: Debugging the crash test made me realize there are a few places that could log more info. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12504 Test Plan: Manual testing Debug build ``` 2024/04/04-16:12:12.289791 1636007 [/db_filesnapshot.cc:156] Number of log files 2 (0 required by manifest) ... 2024/04/04-16:12:12.289814 1636007 [/db_filesnapshot.cc:171] Log files : /000004.log /000008.log .Log files required by manifest: . ``` Non-debug build ``` 2024/04/04-16:19:23.222168 1685043 [/db_filesnapshot.cc:156] Number of log files 1 (0 required by manifest) ``` CI Reviewed By: jaykorean Differential Revision: D55710013 Pulled By: hx235 fbshipit-source-id: 9964d46cfb0a2074620f31571cf9fd29d0a88819	2024-04-05 10:23:31 -07:00
Changyu Bi	a0aade7e62	Add some debug print for flaky test `DBCompactionTest.CompactionLimiter` (#12509 ) Summary: The unit test fails occasionally and cannot be reproed locally. ``` [ RUN ] DBCompactionTest.CompactionLimiter db/db_compaction_test.cc:6139: Failure Expected equality of these values: cf_count Which is: 17 env_->GetThreadPoolQueueLen(Env::LOW) Which is: 15 [ FAILED ] DBCompactionTest.CompactionLimiter (512 ms) ``` Add some debug print to help triaging if it fails again. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12509 Reviewed By: jowlyzhang Differential Revision: D55770552 Pulled By: cbi42 fbshipit-source-id: 2a39b2199f80352fcf2c6cd2b9c8b81c727eee8c	2024-04-04 15:21:40 -07:00
Changyu Bi	796011e5ad	Limit compaction input files expansion (#12484 ) Summary: We removed the limit in https://github.com/facebook/rocksdb/issues/10835 and the option in https://github.com/facebook/rocksdb/issues/12323. Usually input level is much smaller than output level, which is likely why we have not seen issues with not applying a limit. It should be safer to add a safeguard as suggested in https://github.com/facebook/rocksdb/pull/12323#issuecomment-2016687321. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12484 Test Plan: * new and existing UT Reviewed By: ajkr Differential Revision: D55438870 Pulled By: cbi42 fbshipit-source-id: 0511d0465a70398c36230ed7cced5291ff1a6c19	2024-03-29 11:34:29 -07:00
Hui Xiao	d985902ef4	Disallow refitting more than 1 file from non-L0 to L0 (#12481 ) Summary: Context/Summary: We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than 1 file to L0. This refitting can cause a read performance regression as we need to go through every file in L0, corruption in some edge cases, and false positive corruption reports from the force consistency check. We decided to explicitly disallow such behavior. A related change to OptionChangeMigration(): - When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrated data into a couple L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data when migration finishes if the migrated data is larger than max_table_files_size. This is achieved by first compacting all the data into a couple non-L0 files and refitting those files from non-L0 to L0 if needed. In that way, only some data instead of all data will be dropped immediately after migration to FIFO with a max_table_files_size. - Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and explicitly state such risk in API comment. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481 Test Plan: - New UT - Modified UT Reviewed By: cbi42 Differential Revision: D55351178 Pulled By: hx235 fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80	2024-03-29 10:52:36 -07:00
Jay Huh	c449867236	MultiCfIterator Impl Follow up (#12465 ) Summary: As a follow up for https://github.com/facebook/rocksdb/issues/12422 , this PR includes the following two changes. - Removal of `direction_` in the MultiCfIterator - Use of Member Func Template instead of `std::function` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12465 Test Plan: ``` ./multi_cf_iterator_test ``` Reviewed By: pdillinger, ltamasi Differential Revision: D55208448 Pulled By: jaykorean fbshipit-source-id: 8b3167c1d59839d076afc29097b5ad21a453460a	2024-03-22 14:51:16 -07:00
Peter Dillinger	b515a5db3f	Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470 ) Summary: ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom implemented pointer wrapper when std::unique_ptr can be instantiated with what we want. So this adds ScopedArenaPtr<T> to replace those uses. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470 Test Plan: CI (including ASAN/UBSAN) Reviewed By: jowlyzhang Differential Revision: D55254362 Pulled By: pdillinger fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18	2024-03-22 13:40:42 -07:00
anand76	3b736c4aa3	Fix heap use after free error on retry after checksum mismatch (#12464 ) Summary: Fix the heap use after free bug caused by freeing the file system IO buffer in `BlockFetcher::ReadBlock()` instead of the caller. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12464 Test Plan: Update the `DBIOCorruptionTest` tests Reviewed By: akankshamahajan15 Differential Revision: D55206920 Pulled By: anand1976 fbshipit-source-id: fd6b608a61cd229b20c1e5f348ff3cc92328de0f	2024-03-21 16:19:09 -07:00
Andrew Kryczka	bf98dcf9a8	Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462 ) Summary: The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file. There's also fixes for bugs encountered along the way where non-ok statuses were ignored/overwritten, and a bit of plumbing work for functions that had no capability to return a status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462 Test Plan: A repro command ``` db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp ./db_stress \ --clear_column_family_one_in=0 \ --test_batches_snapshots=0 \ --write_fault_one_in=0 \ --use_put_entity_one_in=0 \ --prefixpercent=0 \ --read_fault_one_in=0 \ --readpercent=0 \ --reopen=0 \ --set_options_one_in=10000 \ --delpercent=0 \ --delrangepercent=0 \ --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 \ --open_write_fault_one_in=0 \ --destroy_db_initially=0 \ --ingest_external_file_one_in=0 \ --iterpercent=0 \ --nooverwritepercent=0 \ --db=$db \ --enable_blob_files=1 \ --expected_values_dir=$exp \ --max_background_compactions=20 \ --max_bytes_for_level_base=2097152 \ --max_key=100000 \ --min_blob_size=0 \ --open_files=-1 \ --ops_per_thread=100000000 \ --prefix_size=-1 \ --target_file_size_base=524288 \ --use_merge=1 \ --value_size_mult=32 \ --write_buffer_size=524288 \ --writepercent=100 ``` It used to fail like: ``` ... 
frame #9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3 frame #10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice, rocksdb::PinnableWideColumns, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, rocksdb::Status, rocksdb::MergeContext, unsigned long, rocksdb::PinnedIteratorsManager, bool, bool, unsigned long, rocksdb::ReadCallback, bool, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33 frame #11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492 frame #12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408 frame #13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109 frame 
#14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656 frame #15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607 frame #16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables, rocksdb::FlushScheduler, rocksdb::TrimHistoryScheduler, bool, unsigned long, rocksdb::DB, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505 frame #17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084 frame #18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807 frame #19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, 
disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312 frame #20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157 frame #21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544 frame #22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72 frame #23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317 frame #24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148 ... ``` Reviewed By: ltamasi Differential Revision: D55157795 Pulled By: ajkr fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764	2024-03-21 12:38:53 -07:00
anand76	63a105a481	Enable recycle_log_file_num option for point in time recovery (#12403 ) Summary: This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL. This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403 Test Plan: 1. Add new unit tests to verify behavior upon corruption 2. Re-enable disabled tests for verifying recycling behavior Reviewed By: ajkr Differential Revision: D54544824 Pulled By: anand1976 fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a	2024-03-21 12:29:35 -07:00
Yu Zhang	13e1c32a18	Follow ups for TimedPut and write time property (#12455 ) Summary: This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including: 1) Handle a special case for `WriteBatch::TimedPut`. When the user specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error, but it instead creates and writes a regular `Put` entry. 2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries. 3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob currently copies the DB's seqno to time mapping while holding db mutex and only copies the part of interest, i.e., the part that only goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to preferred seqno as much as possible. Test plans: Added unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455 Reviewed By: pdillinger Differential Revision: D55165422 Pulled By: jowlyzhang fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47	2024-03-21 10:00:15 -07:00
Richard Barnes	6a1c2abe9d	Remove extra semi colon from hbt/src/tagstack/tests/SlicerTest.cpp (#12461 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12461 X-link: https://github.com/facebookincubator/dynolog/pull/233 `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: rahku Differential Revision: D55087324 fbshipit-source-id: e8a03d33cad72a7d378e58f85eb550a03f6c2897	2024-03-20 12:44:50 -07:00
Kshitij Wadhwa	4ce1dc930c	don't run ZSTD_TrainDictionary in BlockBasedTableBuilder if there isn't compression needed (#12453 ) Summary: fixes https://github.com/facebook/rocksdb/issues/12409 ### Issue ZSTD_TrainDictionary [[link](`a53ed91691/table/block_based/block_based_table_builder.cc (L1894)`)] runs for SSTFileWriter::Finish even when bottommost_compression option is set to kNoCompression. This reduces throughput for SstFileWriter::Finish We construct rocksdb options using ZSTD compression for levels including 2 and above. For levels 0 and 1, we set it to kNoCompression. We also set zstd_max_train_bytes to a non-zero positive value (which is applicable for levels with ZSTD compression enabled). These options are used for the database and also passed to SstFileWriter for creating sst files to be later added to that database. Since the BlockBasedTableBuilder::Finish [[link](`a53ed91691/table/block_based/block_based_table_builder.cc (L1892)`)] only checks for zstd_max_train_bytes to be non-zero positive value, it runs ZSTD_TrainDictionary even when it shouldn't since SSTFileWriter is operating at bottommost level ### Fix If compression_type is set to kNoCompression, then don't run ZSTD_TrainDictionary and dictionary building ### Testing I see we have tests for sst file writer with compression type set/unset. Let me know if it isn't covered and I can extend Pull Request resolved: https://github.com/facebook/rocksdb/pull/12453 Reviewed By: cbi42 Differential Revision: D55030484 Pulled By: ajkr fbshipit-source-id: 834de2174c2b087d61bf045ca1ae29f337b821a7	2024-03-20 11:07:32 -07:00
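The fix in commit 4ce1dc930c amounts to an extra precondition before dictionary training. The following is a minimal sketch of that precondition only, under simplified assumptions: `CompressionKind`, `BuilderOptionsSketch`, and `ShouldTrainDictionary` are hypothetical names for illustration and are not the RocksDB types or the actual code in `BlockBasedTableBuilder`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins; not the RocksDB option types.
enum class CompressionKind { kNone, kZstd };

struct BuilderOptionsSketch {
  CompressionKind compression;    // compression chosen for the file being built
  uint32_t zstd_max_train_bytes;  // 0 means dictionary training is disabled
};

// Before the fix (as described in the commit message), only the
// training-bytes setting was consulted. The sketch adds the missing check:
// with no compression there is nothing for a dictionary to improve.
bool ShouldTrainDictionary(const BuilderOptionsSketch& opts) {
  if (opts.compression == CompressionKind::kNone) {
    return false;
  }
  return opts.zstd_max_train_bytes > 0;
}
```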
Jay Huh	3f3f4660bd	wal_read_status check in RecoverLogFiles (#12460 ) Summary: Fixing the not-checked status failure as in https://github.com/facebook/rocksdb/actions/runs/8334988399/job/22809612148. When the status is not ok() for any reason, we do not check the `wal_read_status` because it's not necessary. It's causing the test failure when running with `ASSERT_STATUS_CHECKED=1` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12460 Test Plan: Existing tests Reviewed By: ajkr Differential Revision: D55104844 Pulled By: jaykorean fbshipit-source-id: 919b1fddca835494f9087c51c4da6eabc9e8df2b	2024-03-20 08:09:09 -07:00
anand76	4868c10b44	Retry block reads on checksum mismatch (#12427 ) Summary: On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`. Tests: Add new unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427 Reviewed By: ajkr Differential Revision: D55025941 Pulled By: anand1976 fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e	2024-03-18 16:16:05 -07:00
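The retry policy in commit 4868c10b44 can be pictured as "read once, and on a checksum mismatch read again only if the file system says it can verify and reconstruct". The sketch below shows just that control flow. `ReadResult`, `ReadWithRetry`, and the `read_block` callback are hypothetical; the real code consults `FSSupportedOps::kVerifyAndReconstructRead` and lives in the block-fetching path.

```cpp
#include <cassert>
#include <functional>

// Hypothetical outcome of a single block read.
enum class ReadResult { kOk, kChecksumMismatch };

// `read_block(reconstruct)` performs one read; when `reconstruct` is true the
// (hypothetical) file system is asked to verify and reconstruct the data.
ReadResult ReadWithRetry(bool fs_supports_reconstruct,
                         const std::function<ReadResult(bool)>& read_block) {
  ReadResult result = read_block(/*reconstruct=*/false);
  if (result == ReadResult::kChecksumMismatch && fs_supports_reconstruct) {
    // Retry once, this time requesting storage-level reconstruction.
    result = read_block(/*reconstruct=*/true);
  }
  return result;
}
```

On a file system without reconstruction support the mismatch is returned after a single read, which matches the behavior before the change.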
Jay Huh	b4e9f5a400	Update Remote Compaction Tests to include more than one CF (#12430 ) Summary: Update `compaction_service_test` to make sure remote compaction works with a multiple column family setup. Minor refactor to get rid of duplicate code. Fixing one quick bug in the existing test util (the test util's `FilesPerLevel` didn't honor `cf_id` properly). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12430 Test Plan: ``` ./compaction_service_test ``` Reviewed By: ajkr Differential Revision: D54883035 Pulled By: jaykorean fbshipit-source-id: 83b4f6f566fed5c4824bfef7de01074354a72b44	2024-03-18 15:40:48 -07:00
Hui Xiao	2443ebf810	Don't write to WAL after previous WAL write error (#12448 ) Summary: Context/Summary: WAL write can continue onto the WAL file that has encountered an error and thus crash at `3f5bd46a07/file/writable_file_writer.cc (L67)` in the scenario below: [image: Screenshot 2024-03-15 at 1 52 45 PM] Note that GetLiveFilesStorageInfo() can happen concurrently with PUT() for the non-WAL-write part where the db lock isn't held. This PR added an error check in the LogWriter layer to prevent thread 2 from starting to write WAL after thread 1's write error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12448 Test Plan: Step 1 Apply the patch below to simulate frequent WAL write error for the purpose of repro ``` diff --git a/db_stress_tool/db_stress_driver.cc b/db_stress_tool/db_stress_driver.cc index b47fa89e6..31930e976 100644 --- a/db_stress_tool/db_stress_driver.cc +++ b/db_stress_tool/db_stress_driver.cc @@ -98,7 +98,7 @@ bool RunStressTestImpl(SharedState* shared) { // MANIFEST, CURRENT, and WAL files. 
fault_fs_guard->SetRandomWriteError( shared->GetSeed(), FLAGS_write_fault_one_in, error_msg, - /*inject_for_all_file_types=*/false, {FileType::kTableFile}); + /*inject_for_all_file_types=*/false, {FileType::kWalFile}); fault_fs_guard->SetFilesystemDirectWritable(false); fault_fs_guard->EnableWriteErrorInjection(); } diff --git a/utilities/fault_injection_fs.cc b/utilities/fault_injection_fs.cc index 0ffb43ea6..589912cf4 100644 --- a/utilities/fault_injection_fs.cc +++ b/utilities/fault_injection_fs.cc @@ -1042,7 +1042,7 @@ IOStatus FaultInjectionTestFS::InjectWriteError(const std::string& file_name) { } if (allowed_type) { - if (write_error_rand_.OneIn(write_error_one_in_)) { + if (write_error_rand_.OneIn(1)) { return GetError(); } } ``` Step 2 Run below ``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=41.19540459544058 --bottommost_compression_type=disable --bottommost_file_compaction_delay=3600 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=0 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 
--compressed_secondary_cache_size=8388608 --compression_checksum=1 --compression_max_dict_buffer_bytes=68719476735 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_index_compression=1 --enable_pipelined_write=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=10000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=15 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 
--max_write_buffer_size_to_maintain=1048576 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=500000 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --report_bg_io_stats=0 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=0 --strict_bytes_per_sync=0 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 
--verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=35 ``` Pre-PR: ``` db_stress: ./file/writable_file_writer.h:309: rocksdb::IOStatus rocksdb::WritableFileWriter::AssertFalseAndGetStatusForPrevError(): Assertion `sync_without_flush_called_' failed. ``` Post-PR ``` 2024/03/15-13:44:08 Starting database operations put or merge error: IO error: Retryable injected write error ``` Note: The patch is NOT included in the PR as we first need to figure out how to handle this type of failed write in stress test (planned for the near future). It's sufficient to show the stress test does not crash as pre-PR for the purpose of this PR. Reviewed By: ajkr Differential Revision: D54969287 Pulled By: hx235 fbshipit-source-id: 0ba4eabfec44ea7656d4d7117836f388897562f2	2024-03-18 12:27:49 -07:00
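The idea behind commit 2443ebf810 (refuse further WAL writes once a write has failed) can be sketched as a sticky error flag in the writer. This is an illustration under assumptions, not the RocksDB `LogWriter`: `LogWriterSketch`, `AddRecord`, and the `sink` callback are hypothetical, and the sketch is single-threaded, whereas the real fix has to hold up with two threads involved.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical writer that remembers the first write failure and rejects
// every later write instead of touching the failed file again.
class LogWriterSketch {
 public:
  // `sink` performs the actual write and returns false on an I/O error.
  explicit LogWriterSketch(std::function<bool(const std::string&)> sink)
      : sink_(std::move(sink)) {}

  bool AddRecord(const std::string& record) {
    if (seen_error_) {
      return false;  // a previous write failed; do not write again
    }
    if (!sink_(record)) {
      seen_error_ = true;  // make the error sticky
      return false;
    }
    return true;
  }

 private:
  std::function<bool(const std::string&)> sink_;
  bool seen_error_ = false;
};
```

After the first failed `AddRecord`, later calls return an error without invoking the sink, which corresponds to the post-PR behavior of reporting an IO error rather than hitting the assertion.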
Jay Huh	db1dea22b1	MultiCfIterator Implementations (#12422 ) Summary: This PR continues https://github.com/facebook/rocksdb/issues/12153 by implementing the missing `Iterator` APIs - `Seek()`, `SeekForPrev()`, `SeekToLast()`, and `Prev`. A MaxHeap Implementation has been added to handle the reverse direction. The current implementation does not include upper/lower bounds yet. These will be added in subsequent PRs. The API is still marked as under construction and will be lifted after being added to the stress test. Please note that changing the iterator direction in the middle of iteration is expensive, as it requires seeking the element in each iterator again in the opposite direction and rebuilding the heap along the way. The first `Next()` after `SeekForPrev()` requires changing the direction under the current implementation. We may optimize this in later PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12422 Test Plan: The `multi_cf_iterator_test` has been extended to cover the API implementations. Reviewed By: pdillinger Differential Revision: D54820754 Pulled By: jaykorean fbshipit-source-id: 9eb741508df0f7bad598fb8e6bd5cdffc39e81d1	2024-03-18 09:05:30 -07:00
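Commit db1dea22b1 mentions a heap over the per-CF iterators (a max-heap was added for the reverse direction). A minimal forward-only sketch of the heap idea follows. It assumes each column family is represented by a sorted vector of unique keys, and `MergeForward` is a hypothetical name; the real iterator works on child iterators, not vectors, and also handles values and direction changes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Visit keys from several sorted per-CF key lists in ascending order,
// emitting each distinct key once (forward direction only).
std::vector<std::string> MergeForward(
    const std::vector<std::vector<std::string>>& cfs) {
  using Item = std::pair<std::string, std::size_t>;  // (key, CF index)
  // std::greater turns std::priority_queue into a min-heap.
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
  std::vector<std::size_t> pos(cfs.size(), 0);
  for (std::size_t i = 0; i < cfs.size(); ++i) {
    if (!cfs[i].empty()) {
      heap.push(Item(cfs[i][0], i));
    }
  }
  std::vector<std::string> out;
  while (!heap.empty()) {
    Item top = heap.top();
    heap.pop();
    if (out.empty() || out.back() != top.first) {
      out.push_back(top.first);  // first time this key is seen
    }
    const std::size_t cf = top.second;
    if (++pos[cf] < cfs[cf].size()) {
      heap.push(Item(cfs[cf][pos[cf]], cf));  // advance that CF's cursor
    }
  }
  return out;
}
```

Reversing direction mid-iteration would require re-seeking every cursor and rebuilding the heap with the opposite ordering, which is the cost the commit message warns about.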
Changyu Bi	3d5be596a5	Fix a bug in iterator with UDT + `ReadOptions::pin_data` (#12451 ) Summary: with https://github.com/facebook/rocksdb/issues/12414 enabling `ReadOptions::pin_data`, this bug surfaced as corrupted per key-value checksum during crash test. `saved_key_.GetUserKey()` could be pinned user key, so DBIter should not overwrite it. In one case, it only surfaces when iterator skips many keys of the same user key. To stress that code path, this PR also added `max_sequential_skip_in_iterations` to crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12451 Test Plan: - Set ReadOptions::pin_data to true, the bug can be reproed quickly with `./db_stress --persist_user_defined_timestamps=1 --user_timestamp_size=8 --writepercent=35 --delpercent=4 --delrangepercent=1 --iterpercent=20 --nooverwritepercent=1 --prefix_size=8 --prefixpercent=10 --readpercent=30 --memtable_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --clear_column_family_one_in=0`. - Set max_sequential_skip_in_iterations to 1 for the other occurrence of the bug. Reviewed By: jowlyzhang Differential Revision: D55003766 Pulled By: cbi42 fbshipit-source-id: 23e1049129456684dafb028b6132b70e0afc07fb	2024-03-18 09:05:11 -07:00


5601 Commits