rocksdb

Commit Graph

Author	SHA1	Message	Date
Changyu Bi	5bf2c00a35	Clarify manual compaction and file ingestion behavior with FIFO compaction (#12618 ) Summary: For manual compaction, FIFO compaction will always skip key range overlapping checking with SST files. If CompactRange() is called with CompactionRangeOptions::change_level=true, a CF with FIFO compaction will now return Status::NotSupported. For file ingestion, we will always ingest into L0. Previously, it's possible to ingest files into non-L0 levels with FIFO compaction. These changes also help to fix [this](`a178d15baf/db/db_impl/db_impl_compaction_flush.cc (L1269)`) assertion failure in crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12618 Test Plan: added unit tests to verify the new behavior. Reviewed By: hx235 Differential Revision: D56962401 Pulled By: cbi42 fbshipit-source-id: 19812a1509650b4162b379ca5bee02f2e9d9569d	2024-05-07 12:00:15 -07:00
Zaidoon Abd Al Hadi	36ab251c07	Expose block based metadata cache options via C API (#12611 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12611 Reviewed By: jaykorean Differential Revision: D56961823 Pulled By: ajkr fbshipit-source-id: aa062cdb49a0bb2c1148a81d4c882a4733c7790e	2024-05-06 16:49:11 -07:00
Andrew Kryczka	0fef690bd5	Sync non-latest WALs during flush for 2PC, single-CF DBs (#12622 ) Summary: Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs". This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622 Reviewed By: cbi42 Differential Revision: D56987303 Pulled By: ajkr fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48	2024-05-06 11:56:16 -07:00
Changyu Bi	6fdc4c5282	Fix a corruption bug in `CreateColumnFamilyWithImport()` (#12602 ) Summary: when importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires each original CF to have a disjoint key range, after an intra-L0 compaction, we still can end up with L0 files with the same epoch number but overlapping key ranges. This PR attempts to fix this by reassigning epoch numbers when importing multiple CFs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602 Test Plan: a new repro unit test. Before this PR, it fails with ``` [ RUN ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF db/import_column_family_test.cc:1048: Failure db_->WaitForCompact(o) Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range https://github.com/facebook/rocksdb/issues/44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file https://github.com/facebook/rocksdb/issues/36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3 ``` Reviewed By: hx235 Differential Revision: D56851808 Pulled By: cbi42 fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84	2024-05-06 11:01:38 -07:00
Peter Dillinger	a178d15baf	More checks around num_entries vs. num_deletions (#12600 ) Summary: We've seen an internal crash test+sanitizer failure seemingly caused by underflow on `current_num_non_deletions_` which would happen if num_entries < num_deletions. (T186407810) This change adds an additional check (fail earlier?) and coerces read table properties to satisfy the invariant that is supposed to be provided by https://github.com/facebook/rocksdb/pull/4841 but could be violated by older files, due to https://github.com/facebook/rocksdb/pull/4016. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12600 Test Plan: existing tests Reviewed By: ajkr Differential Revision: D56796191 Pulled By: pdillinger fbshipit-source-id: 6d22cc40eb74974c42b311293ee2775c6af95afc	2024-05-03 16:40:07 -07:00
Zaidoon Abd Al Hadi	ed01babd07	Expose compaction pri through C API (#12604 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12604 Reviewed By: cbi42 Differential Revision: D56914066 Pulled By: ajkr fbshipit-source-id: 64b51ab2b7b5ec0b5fde5a5f61d076bac1c3a8ad	2024-05-02 18:39:24 -07:00
Changyu Bi	e2ef349f56	Deflake unit test `DBCompactionTest.CompactionLimiter` (#12596 ) Summary: The test has been flaky for a long time. A recent [failure](https://github.com/facebook/rocksdb/actions/runs/8820808355/job/24215219590?pr=12578) shows that there is still a flush running when the assertion fails. I think this is because `WaitForFlushMemTable()` may return before a flush schedules the next compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12596 Test Plan: I could not repro the failure locally: `gtest-parallel --repeat=8000 --workers=100 ./db_compaction_test --gtest_filter="CompactionLimiter"` Reviewed By: ajkr Differential Revision: D56715874 Pulled By: cbi42 fbshipit-source-id: f5f64eb30fff7e115c19beedad2dc22afa06258d	2024-05-02 17:10:06 -07:00
Yu Zhang	241253053a	Fix delete obsolete files on recovery not rate limited (#12590 ) Summary: This PR fixes the issue that deletion of obsolete files during DB::Open is not rate limited. The root cause is that slow deletion is disabled if the trash/DB size ratio exceeds the configured `max_trash_db_ratio` `d610e14f93/include/rocksdb/sst_file_manager.h (L126)`; however, the current handling in DB::Open starts with tracking nothing but the obsolete files. This makes the ratio always look like it is 1. For the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move the two places that attempt to delete files until after the tracking is done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call. There are some other aesthetic changes, like refactoring the collection of all the DB paths into a function and renaming `DBImpl::DeleteUnreferencedSstFiles` to `DBImpl::MaybeUpdateNextFileNumber`, as it doesn't actually delete the files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590 Test Plan: Added unit test and verified with manual testing Reviewed By: anand1976 Differential Revision: D56830519 Pulled By: jowlyzhang fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885	2024-05-01 12:26:54 -07:00
Yu Zhang	8b3d9e6bfe	Add TimedPut to stress test (#12559 ) Summary: This also updates WriteBatch's protection info to include write time since there are several places in memtable that by default protects the whole value slice. This PR is stacked on https://github.com/facebook/rocksdb/issues/12543 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12559 Reviewed By: pdillinger Differential Revision: D56308285 Pulled By: jowlyzhang fbshipit-source-id: 5524339fe0dd6c918dc940ca2f0657b5f2111c56	2024-04-30 15:40:35 -07:00
Yu Zhang	2c02a9b76f	Preserve TimedPut on penultimate level until it actually expires (#12543 ) Summary: To make sure `TimedPut` entries are placed on the proper tier before and when they become eligible for the cold tier, 1) flush and compaction need to keep the relevant seqno to time mapping for not just the sequence numbers contained in internal keys, but also the preferred sequence numbers for `TimedPut` entries. This PR also fixes some bugs in handling `TimedPut` during compaction: 1) dealing with an edge case when a `TimedPut` entry's internal key is the right bound for the penultimate level, the internal key after swapping in its preferred sequence number will fall outside of the penultimate range because the preferred sequence number is smaller than its original sequence number. The entry however is still safe to be placed on the penultimate level, so we keep track of the `TimedPut` entry's original sequence number for this check. The idea behind this is that as long as it's safe for the original key to be placed on the penultimate level, it's safe for the entry with the swapped preferred sequence number to be placed on the penultimate level too, because we only swap in the preferred sequence number when that entry is visible to the earliest snapshot and there are no other data points with the same user key in lower levels. On the other hand, as long as it's not safe for the original key to be placed on the penultimate level, we will not place the entry after swapping the preferred seqno on the penultimate level either. 2) the assertion that the preferred seqno is always bigger than the original sequence number may fail if this logic is only exercised after the sequence number is zeroed out. We adjust the assertion to handle that case too. In this case, we don't swap in the preferred seqno but will adjust its type to `kTypeValue`. 3) there was a special case handling for when a range deletion may end up incorrectly covering an entry if the preferred seqno is swapped in. But it missed the case where the original entry is already covered by a range deletion. The original handling would mistakenly output the entry instead of omitting it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12543 Test Plan: ./tiered_compaction_test --gtest_filter="PrecludeLastLevelTest.PreserveTimedPutOnPenultimateLevel" ./compaction_iterator_test --gtest_filter="TimedPut" Reviewed By: pdillinger Differential Revision: D56195096 Pulled By: jowlyzhang fbshipit-source-id: 37ebb09d2513abbd9e90cda0217e26874584b8f3	2024-04-30 11:16:02 -07:00
Peter Dillinger	45c105104b	Set optimize_filters_for_memory by default (#12377 ) Summary: This feature has been around for a couple of years and users haven't reported any problems with it. Not quite related: fixed a technical ODR violation in public header for info_log_level in case DEBUG build status changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12377 Test Plan: unit tests updated, already in crash test. Some unit tests are expecting specific behaviors of optimize_filters_for_memory=false and we now need to bake that in. Reviewed By: jowlyzhang Differential Revision: D54129517 Pulled By: pdillinger fbshipit-source-id: a64b614840eadd18b892624187b3e122bab6719c	2024-04-30 08:33:31 -07:00
Changyu Bi	5c1334f763	DeleteRange() return NotSupported if row_cache is configured (#12512 ) Summary: ...since this feature combination is not supported yet (https://github.com/facebook/rocksdb/issues/4122). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12512 Test Plan: new unit test. Reviewed By: jaykorean, jowlyzhang Differential Revision: D55820323 Pulled By: cbi42 fbshipit-source-id: eeb5e97d15c9bdc388793a2fb8e52cfa47e34bcf	2024-04-29 16:33:13 -07:00
Andrew Kryczka	b2931a5c53	Fixed `MultiGet()` error handling to not skip blob dereference (#12597 ) Summary: See comment at top of the test case and release note. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597 Reviewed By: jaykorean Differential Revision: D56718786 Pulled By: ajkr fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f	2024-04-29 14:18:42 -07:00
anand76	e36b0a2da4	Fix corruption bug when recycle_log_file_num changed from 0 (#12591 ) Summary: When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 || s >= first_seqno_` in `rocksdb::MemTable::Add`. We could fix this by either 1) writing a special record when reusing a log file, or 2) implementing more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) not reusing files created by a previous DB session. We choose option 3 as it's the simplest, and flipping `recycle_log_file_num` is expected to be a rare event. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591 Test Plan: 1. Add a unit test to verify the bug and fix Reviewed By: jowlyzhang Differential Revision: D56655812 Pulled By: anand1976 fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87	2024-04-29 12:25:00 -07:00
Andrew Kryczka	2ec25a3e54	Prevent data block compression with `BlockBasedTableOptions::block_align` (#12592 ) Summary: Made `BlockBasedTableOptions::block_align` incompatible (i.e., APIs will return `Status::InvalidArgument`) with more ways of enabling compression: `CompactionOptions::compression`, `ColumnFamilyOptions::compression_per_level`, and `ColumnFamilyOptions::bottommost_compression`. Previously it was only incompatible with `ColumnFamilyOptions::compression`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12592 Reviewed By: hx235 Differential Revision: D56650862 Pulled By: ajkr fbshipit-source-id: f5201602c2ce436e6d8d30893caa6a161a61f141	2024-04-26 20:05:30 -07:00
Andrew Kryczka	177ccd3904	Print more debug info in test when `SyncWAL()` fails (#12580 ) Summary: Example failure (cannot reproduce): ``` [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBWriteTestInstance/DBWriteTest [ RUN ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0 db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file db/db_write_test.cc:809: Failure dbfull()->SyncWAL() Not implemented: SyncWAL() is not supported for this implementation of WAL file [ FAILED ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0 (49 ms) [----------] 1 test from DBWriteTestInstance/DBWriteTest (49 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (49 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] DBWriteTestInstance/DBWriteTest.ConcurrentlyDisabledWAL/0, where GetParam() = 0 ``` I have no idea why `SyncWAL()` would not be supported from what is presumably a `SpecialEnv` so added more debug info in case it fails again in CI. The last failure was https://github.com/facebook/rocksdb/actions/runs/8731304938/job/23956487511?fbclid=IwAR2jyXgVQtCezri3axV5MwMdI7D6VIudMk1xkiN_FL9-x2dkBv4IqIjjgB4 and it only happened once ever AFAIK. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12580 Reviewed By: hx235 Differential Revision: D56541996 Pulled By: ajkr fbshipit-source-id: 1eab17567db783c11054fa85dd8b8880eacd3a50	2024-04-25 14:34:11 -07:00
Jay Huh	f16ba42116	Fix IteratorsConsistentView tests (#12582 ) Summary: Fixing the failure in IteratorsConsistentViewExplicitSnapshot as shown in https://github.com/facebook/rocksdb/actions/runs/8825927545/job/24230854140?pr=12581 The failure was due to the timing of the `flush()` for the later Column Family in the loop. If the flush for the later CFs installs the new super version before getting the SV for the iterator, the assertion succeeds, but if the order flips, the SV will be obsolete and the assertion can fail. This PR simplifies the test so that we do only one `flush()` and `SYNC_POINT` can guarantee the order of operations. For the ImplicitSnapshot test, it now just triggers a flush for the second CF after obtaining the SV for the first CF. For the ExplicitSnapshot test, it now triggers an atomic flush() for all CFs after obtaining the SV for the first CF. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12582 Test Plan: ``` ./db_iterator_test --gtest_filter="IteratorsConsistentView" ./multi_cf_iterator_test -- --gtest_filter="ConsistentView" ``` Reviewed By: ajkr, jowlyzhang Differential Revision: D56557234 Pulled By: jaykorean fbshipit-source-id: 7aa2f6d0e12a915b6e16cd240389bcfb5b4a5b62	2024-04-25 14:06:46 -07:00
Jay Huh	1fca175eec	MultiCFSnapshot for NewIterators() API (#12573 ) Summary: As mentioned in https://github.com/facebook/rocksdb/issues/12561 and https://github.com/facebook/rocksdb/issues/12566 , `NewIterators()` API has not been providing consistent view of the db across multiple column families. This PR addresses it by utilizing `MultiCFSnapshot()` function which has been used for `MultiGet()` APIs. To be able to obtain the thread-local super version with ref, `sv_exclusive_access` parameter has been added to `MultiCFSnapshot()` so that we could call `GetReferencedSuperVersion()` or `GetAndRefSuperVersion()` depending on the param and support `Refresh()` API for MultiCfIterators Pull Request resolved: https://github.com/facebook/rocksdb/pull/12573 Test Plan: Unit Tests Added ``` ./db_iterator_test --gtest_filter="IteratorsConsistentView" ``` ``` ./multi_cf_iterator_test -- --gtest_filter="ConsistentView" ``` Performance Check Setup ``` make -j64 release TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 6.374 micros/op 156892 ops/sec 6.374 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 6.265 micros/op 159627 ops/sec 6.265 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: jowlyzhang Differential Revision: D56444066 Pulled By: jaykorean fbshipit-source-id: 327ce73c072da30c221e18d4f3389f49115b8f99	2024-04-24 15:28:55 -07:00
Andrew Kryczka	6807da0b44	Fix `DisableManualCompaction()` hang (#12578 ) Summary: Prior to this PR the following sequence could happen: 1. `RunManualCompaction()` A schedules compaction to thread pool and waits 2. `RunManualCompaction()` B waits without scheduling anything due to conflict 3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both 4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done 5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool 6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction 7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution 8. `DisableManualCompaction()` returns as there are no more manual compactions running Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward: 5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done 6'. `DisableManualCompaction()` returns as there are no more manual compactions running Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578 Reviewed By: cbi42 Differential Revision: D56528144 Pulled By: ajkr fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b	2024-04-24 12:40:36 -07:00
Andrew Kryczka	3f3045a405	fix DeleteRange+memtable_insert_with_hint_prefix_extractor interaction (#12558 ) Summary: Previously `insert_hints_` was used for both point key table (`table_`) and range deletion table (`range_del_table_`). Hints include pointers to table data, so mixing hints for different tables together without tracking which hint corresponds to which table was problematic. We can just make the hints dedicated to the point key table only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12558 Reviewed By: hx235 Differential Revision: D56279019 Pulled By: ajkr fbshipit-source-id: 00fe5ce72f9f11a1c1cba5f1977b908b2d518f29	2024-04-22 20:13:58 -07:00
Levi Tamasi	bcfe4a0dcf	Make sure DBImplFollower::stop_requested_ is initialized (#12572 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12572 Reviewed By: jowlyzhang, anand1976 Differential Revision: D56426800 fbshipit-source-id: a31f86d8869148092325924db4e7fbfad28777a4	2024-04-22 12:02:28 -07:00
anand76	d8fb849b7e	Basic RocksDB follower implementation (#12540 ) Summary: A basic implementation of RocksDB follower mode, which opens a remote database (referred to as leader) on a distributed file system by tailing its MANIFEST. It leverages the secondary instance mode, but is different in some key ways - 1. It has its own directory with links to the leader's database 2. Periodically refreshes itself 3. (Future) Snapshot support 4. (Future) Garbage collection of obsolete links 5. (Long term) Memtable replication There are two main classes implementing this functionality - `DBImplFollower` and `OnDemandFileSystem`. The former is derived from `DBImplSecondary`. Similar to `DBImplSecondary`, it implements recovery and catch up through MANIFEST tailing using the `ReactiveVersionSet`, but does not consider logs. In a future PR, we will implement memtable replication, which will eliminate the need to catch up using logs. In addition, the recovery and catch-up tries to avoid directory listing as repeated metadata operations are expensive. The second main piece is the `OnDemandFileSystem`, which plugs in as an `Env` for the follower instance and creates the illusion of the follower directory as a clone of the leader directory. It creates links to SSTs on first reference. When the follower tails the MANIFEST and attempts to create a new `Version`, it calls `VerifyFileMetadata` to verify the size of the file, and optionally the unique ID of the file. During this process, links are created which prevent the underlying files from getting deallocated even if the leader deletes the files. TODOs: Deletion of obsolete links, snapshots, robust checking against misconfigurations, better observability etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12540 Reviewed By: jowlyzhang Differential Revision: D56315718 Pulled By: anand1976 fbshipit-source-id: d19e1aca43a6af4000cb8622a718031b69ebd97b	2024-04-19 19:13:31 -07:00
Jay Huh	909ff2c208	MultiCFSnapshot Refactor - separate multiget key range info from CFD & superversion info (#12561 ) Summary: While implementing MultiCFIterators (CoalescingIterator and AttributeGroupIterator), we found that the existing `NewIterators()` API does not ensure a uniform view of the DB across all column families. The `NewIterators()` function is utilized to generate child iterators for the MultiCfIterators, and it's expected that all child iterators maintain a consistent view of the DB. For example, within the loop where the super version for each CF is being obtained, if a CF undergoes compaction after the super versions for previous CFs have already been retrieved, we lose the consistency in the view of the CFs for the iterators due to the API not being under a DB mutex. This preliminary refactoring of `MultiCFSnapshot` aims to address this issue in the `NewIterators()` API in a later PR. Currently, `MultiCFSnapshot` is used to achieve a consistent view across CFs in `MultiGet`. The `MultiGetColumnFamilyData` contains MultiGet-specific information that can be decoupled from the cfd and sv, allowing `MultiCFSnapshot` to be used in other places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12561 Test Plan: Existing Unit Tests for `MultiCFSnapshot()` ``` ./db_basic_test -- --gtest_filter="MultiGet" ``` Performance Test Setup ``` make -j64 release TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 4.760 micros/op 210072 ops/sec 4.760 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 4.593 micros/op 217727 ops/sec 4.593 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: anand1976 Differential Revision: D56309422 Pulled By: jaykorean fbshipit-source-id: 7a9164d12c810b6c2d2db062827fcc4a36cbc77b	2024-04-18 20:11:01 -07:00
anand76	97991960e9	Retry DB::Open upon a corruption detected while reading the MANIFEST (#12518 ) Summary: This PR is a counterpart of https://github.com/facebook/rocksdb/issues/12427 . On file systems that support storage level data checksum and reconstruction, retry opening the DB if a corruption is detected when reading the MANIFEST. This could be done in `log::Reader`, but it's a little complicated since the sequential file would have to be reopened in order to re-read the same data, and we may miss some subtle corruptions that don't result in a checksum mismatch. The approach chosen here instead is to make the decision to retry in `DBImpl::Recover`, based on either an explicit corruption in the MANIFEST file, or missing SST files due to bad data in the MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12518 Reviewed By: ajkr Differential Revision: D55932155 Pulled By: anand1976 fbshipit-source-id: 51755a29b3eb14b9d8e98534adb2e7d54b12ced9	2024-04-18 17:36:33 -07:00
Levi Tamasi	0df601ab07	Reset user-facing wide-column structures upon deserialization failures (#12562 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12562 The patch makes a small usability improvement by consistently resetting any user-facing wide-column structures (`DBIter::columns()`, `BaseDeltaIterator::columns()`, and any `PinnableWideColumns` objects) upon encountering any deserialization failures. Reviewed By: jaykorean Differential Revision: D56312764 fbshipit-source-id: 44efed0d1720cc06bf6facf928f73ce39a1bd2ca	2024-04-18 13:08:34 -07:00
Levi Tamasi	e82fe7c0b7	Fix the move semantics of PinnableWideColumns (#12557 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12557 Unlike for other sequence containers, the C++ standard allows moving an `std::string` to invalidate pointers/iterators/references. In practice, this happens with short strings which are stored "inline" in the `std::string` object (small string optimization). Since `PinnableSlice` uses `std::string` as its internal buffer, and `PinnableWideColumns` in turn is implemented in terms of `PinnableSlice`, this means that the default compiler-generated move operations can invalidate the column index stored in `PinnableWideColumns::columns_`. The PR fixes this by providing custom move constructor/move assignment implementations for `PinnableWideColumns` that recreate the `columns_` index upon move. Reviewed By: jaykorean Differential Revision: D56275054 fbshipit-source-id: e8648c003dbcf1c39ec122ad229780c28138e730	2024-04-17 18:56:23 -07:00
Jay Huh	4f584652ab	Add an option to wait for purge in WaitForCompact (#12520 ) Summary: Adding an option to wait for purge to complete in the `WaitForCompact` API. Internally, RocksDB has a way to wait for purge to complete (e.g. TEST_WaitForPurge() in db_impl_debug.cc), but there's no public API available to gracefully wait for purge to complete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12520 Test Plan: Unit Test Added - `WaitForCompactWithWaitForPurgeOptionTest` ``` ./deletefile_test -- --gtest_filter="WaitForCompactWithWaitForPurgeOptionTest" ``` Existing Tests ``` ./db_compaction_test -- --gtest_filter="WaitForCompactWithOption" ``` Reviewed By: ajkr Differential Revision: D55888283 Pulled By: jaykorean fbshipit-source-id: cfc6d6e8657deaefab8961890b36e390095c9f65	2024-04-17 17:33:27 -07:00
Andrew Kryczka	7027265417	Fix `max_successive_merges` counting CPU overhead regression (#12546 ) Summary: In https://github.com/facebook/rocksdb/issues/12365 we made `max_successive_merges` non-strict by default. Before https://github.com/facebook/rocksdb/issues/12365, `CountSuccessiveMergeEntries()`'s scan was implicitly limited to `max_successive_merges` entries for a given key, because after that the merge operator would be invoked and the merge chain would be collapsed. After https://github.com/facebook/rocksdb/issues/12365, the merge chain will not be collapsed no matter how long it is when the chain's operands are not all in memory. Since `CountSuccessiveMergeEntries()` scanned the whole merge chain, https://github.com/facebook/rocksdb/issues/12365 had a side effect that it would scan more memtable entries. This PR introduces a limit so it won't scan more entries than it could before. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12546 Reviewed By: jaykorean Differential Revision: D56193693 Pulled By: ajkr fbshipit-source-id: b070ba0703ef733e0ff230f89cd5cca5233b84da	2024-04-17 12:11:24 -07:00
Jay Huh	02ea0d6367	Reserve vector in advance to avoid resizing in GetLiveFilesMetaData (#12554 ) Summary: As title Pull Request resolved: https://github.com/facebook/rocksdb/pull/12554 Test Plan: Existing CI Reviewed By: ajkr Differential Revision: D56252201 Pulled By: jaykorean fbshipit-source-id: 06211555a54ce5e6bf656b81109022494e6787ea	2024-04-17 11:01:06 -07:00
Jay Huh	b7319d8a10	MultiCfIterator - Tests for lower/upper bounds (#12548 ) Summary: Thanks to how we are using `DBIter` as child iterators in MultiCfIterators (both `CoalescingIterator` and `AttributeGroupIterator`), we got the lower/upper bound feature for free. This PR simply adds unit test coverage to ensure that the lower/upper bounds are working as expected in the MultiCfIterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12548 Test Plan: UnitTest Added ``` ./multi_cf_iterator_test ``` Reviewed By: ltamasi Differential Revision: D56197966 Pulled By: jaykorean fbshipit-source-id: fa51cc70705dbc5efd836ac006a7c6a49d05707a	2024-04-16 14:20:13 -07:00
Jay Huh	d34712e0ac	MultiCfIterator - AttributeGroupIter Impl & CoalescingIter Optimization (#12534 ) Summary: Continuing from the previous MultiCfIterator Implementations - (https://github.com/facebook/rocksdb/issues/12422, https://github.com/facebook/rocksdb/issues/12480 #12465), this PR completes the `AttributeGroupIterator` by implementing `AttributeGroupIteratorImpl::AddToAttributeGroups()`. While implementing the `AttributeGroupIterator`, we had to make some changes in `MultiCfIteratorImpl` and found an opportunity to improve `Coalesce()` in `CoalescingIterator`. Lifting `UNDER CONSTRUCTION - DO NOT USE` comment by replacing it with `EXPERIMENTAL` Here are some implementation details: - `IteratorAttributeGroups` is introduced to avoid having to copy all `WideColumn` objects during iteration. - `PopulateIterator()` no longer advances non-top iterators that have the same key as the top iterator in the heap. - `AdvanceIterator()` needs to advance the non-top iterators when they have the same key as the top iterator in the heap. - Instead of populating one by one, `PopulateIterator()` now collects all items with the same key and calls `populate_func(items)` at once. - This allowed optimization in `Coalesce()` such that we no longer do K-1 rounds of 2-way merge, but do one K-way merge instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12534 Test Plan: Uncommented the assertions in `verifyAttributeGroupIterator()` ``` ./multi_cf_iterator_test ``` Reviewed By: ltamasi Differential Revision: D56089019 Pulled By: jaykorean fbshipit-source-id: 6b0b4247e221f69b40b147d41492008cc9b15054	2024-04-16 08:45:38 -07:00
Yu Zhang	b166ca8b74	Second attempt #12386 (#12529 ) Summary: Check https://github.com/facebook/rocksdb/issues/12386 back in now that we have figured out MyRocks build's failure and unblocked it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12529 Reviewed By: ajkr Differential Revision: D56047495 Pulled By: jowlyzhang fbshipit-source-id: f90664b9e72c085e068f174720f126b80ad4e8ea	2024-04-12 10:14:44 -07:00
Andrew Kryczka	8897bf2d04	Drop unsynced data in `TestFSWritableFile::Close()` (#12528 ) Summary: Our `FileSystem` for simulating unsynced data loss should not sync during `Close()` because it masks bugs where we forgot to sync as long as we closed the file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12528 Test Plan: Peeled back https://github.com/facebook/rocksdb/issues/10560 fix and verified it is caught much faster now (few seconds vs. ???) with command like ``` $ TEST_TMPDIR=./ python3 tools/db_crashtest.py blackbox --disable_wal=0 --max_key=1000 --write_buffer_size=131072 --max_bytes_for_level_base=524288 --target_file_size_base=131072 --interval=3 --sync_fault_injection=1 --enable_blob_files=0 --manual_wal_flush_one_in=10 --sync_wal_one_in=0 --get_live_files_one_in=0 --get_sorted_wal_files_one_in=0 --backup_one_in=0 --checkpoint_one_in=0 --write_fault_one_in=0 --read_fault_one_in=0 --open_write_fault_one_in=0 --compact_range_one_in=0 --compact_files_one_in=0 --open_read_fault_one_in=0 --get_property_one_in=0 --writepercent=100 -readpercent=0 -prefixpercent=0 -delpercent=0 -delrangepercent=0 -iterpercent=0 ``` Reviewed By: anand1976 Differential Revision: D56033250 Pulled By: ajkr fbshipit-source-id: 6bbf480d79a06c46f08f6214010937f6654af5ca	2024-04-12 09:57:56 -07:00
Vershinin Maxim 00873208	70d3fc3b6f	Fix error for CF smallest and largest keys computation in ImportColumnFamilyJob::Prepare (#12526 ) Summary: This PR fixes an error in the computation of the CF smallest and largest keys in ImportColumnFamilyJob::Prepare. Before this fix, the smallest and largest keys for a CF were computed incorrectly, and ImportColumnFamilyJob::Prepare might not have detected overlaps between CFs. I added a test to detect this error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12526 Reviewed By: hx235 Differential Revision: D56046044 Pulled By: ajkr fbshipit-source-id: d562fbfc9cc2d9624372d24d34a649198a960691	2024-04-11 21:54:51 -07:00
Jay Huh	58a98bded9	MultiCFIterator Refactor - CoalescingIterator & AttributeGroupIterator (#12480 ) Summary: There are a couple of reasons to modify the current implementation of the MultiCfIterator, which implements the generic `Iterator` interface. - The default behavior of `value()`/`columns()` returning data from different Column Families for different keys can be prone to errors, even though there might be valid use cases where users do not care about the origin of the value/columns. - The `attribute_groups()` API, which is not yet implemented, will not be useful for a single-CF iterator. In this PR, we are implementing the following changes: - `IteratorBase` introduced, which includes all basic iterator functions except `value()` and `columns()`. - `Iterator`, which now inherits from `IteratorBase`, includes `value()` and `columns()`. - New public interface `AttributeGroupIterator` inherits from `IteratorBase` and additionally includes `attribute_groups()` (to be implemented). - Renamed former `MultiCfIterator` to `CoalescingIterator` which inherits from `Iterator` - Existing MultiCfIteratorTest has been split into two - `CoalescingIteratorTest` and `AttributeGroupIteratorTest`. - Moved AttributeGroup related code from `wide_columns.h` to a new file, `attribute_groups.h`. Some Implementation Details - `MultiCfIteratorImpl` takes two functions - `populate_func` and `reset_func` - and uses them to populate `value_` and `columns_` in CoalescingIterator and `attribute_groups_` in AttributeGroupIterator. In CoalescingIterator, populate_func is `Coalesce()`; in AttributeGroupIterator, populate_func is `AddToAttributeGroups()`. `reset_func` clears the populated value_, columns_ and attribute_groups_ accordingly. - `Coalesce()` merge sorts columns from multiple CFs when a key exists in more than one CF. A column that appears in a later CF overwrites the prior ones. 
For example, if CF1 has `"key_1" ==> {"col_1": "foo", "col_2", "baz"}` and CF2 has `"key_1" ==> {"col_2": "quux", "col_3", "bla"}`, and when the iterator is at `key_1`, `columns()` will return `{"col_1": "foo", "col_2", "quux", "col_3", "bla"}` In this example, `value()` will be empty, because none of them have values for `kDefaultColumnName` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12480 Test Plan: ## Unit Test ``` ./multi_cf_iterator_test ``` ## Performance Test To make sure this change does not impact existing `Iterator` performance Build ``` $> make -j64 release ``` Setup ``` $> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.519 micros/op 1927904 ops/sec 0.519 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 5.302 micros/op 188589 ops/sec 5.303 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.497 micros/op 2011012 ops/sec 0.497 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 5.252 micros/op 190405 ops/sec 5.252 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: ltamasi Differential Revision: D55353909 Pulled By: jaykorean fbshipit-source-id: 8d7786ffee09e022261ce34aa60e8633685e1946	2024-04-11 11:34:04 -07:00
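The overwrite rule in the `key_1` example above can be sketched in self-contained C++. This is only an illustration of the "later CF wins" semantics, not RocksDB's `Coalesce()`: `Columns` and `CoalesceColumns` are hypothetical names, and a `std::map` stands in for a sorted list of wide columns.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the wide columns one CF holds for a key:
// column name -> column value, kept sorted by column name.
using Columns = std::map<std::string, std::string>;

// Combines the columns that several CFs hold for the same key, in CF order.
// When a column name appears in more than one CF, the later CF's value wins.
Columns CoalesceColumns(const std::vector<Columns>& per_cf_columns) {
  Columns merged;
  for (const Columns& cf_columns : per_cf_columns) {
    for (const auto& [name, value] : cf_columns) {
      merged[name] = value;  // overwrites any value from an earlier CF
    }
  }
  return merged;
}
```

Feeding it CF1 = `{col_1: foo, col_2: baz}` and CF2 = `{col_2: quux, col_3: bla}` yields the three columns listed in the commit message, with `col_2` taken from CF2.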
Yu Zhang	fab9dd9635	Temporary revert #12386 to unblock MyRocks build (#12523 ) Summary: MyRocks reports build failure with this change (build failures in this diff: https://www.internalfb.com/diff/D55924596) https://github.com/facebook/rocksdb/issues/12386, we haven't figured out how to fix it yet. So we are temporarily reverting it to unblock them. This reverts commit `3104e55f29`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12523 Reviewed By: hx235 Differential Revision: D55981751 Pulled By: jowlyzhang fbshipit-source-id: 1d7edd42b65ca847cec67549644a2b1e5775841e	2024-04-10 13:47:52 -07:00
Hui Xiao	abdbeedba6	Miscellaneous improvement to info printing (#12504 ) Summary: Context/Summary: Debugging the crash test made me realize that a few places could log more information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12504 Test Plan: Manual testing Debug build ``` 2024/04/04-16:12:12.289791 1636007 [/db_filesnapshot.cc:156] Number of log files 2 (0 required by manifest) ... 2024/04/04-16:12:12.289814 1636007 [/db_filesnapshot.cc:171] Log files : /000004.log /000008.log .Log files required by manifest: . ``` Non-debug build ``` 2024/04/04-16:19:23.222168 1685043 [/db_filesnapshot.cc:156] Number of log files 1 (0 required by manifest) ``` CI Reviewed By: jaykorean Differential Revision: D55710013 Pulled By: hx235 fbshipit-source-id: 9964d46cfb0a2074620f31571cf9fd29d0a88819	2024-04-05 10:23:31 -07:00
Changyu Bi	a0aade7e62	Add some debug print for flaky test `DBCompactionTest.CompactionLimiter` (#12509 ) Summary: The unit test fails occasionally and cannot be reproduced locally. ``` [ RUN ] DBCompactionTest.CompactionLimiter db/db_compaction_test.cc:6139: Failure Expected equality of these values: cf_count Which is: 17 env_->GetThreadPoolQueueLen(Env::LOW) Which is: 15 [ FAILED ] DBCompactionTest.CompactionLimiter (512 ms) ``` Add some debug prints to help triage if it fails again. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12509 Reviewed By: jowlyzhang Differential Revision: D55770552 Pulled By: cbi42 fbshipit-source-id: 2a39b2199f80352fcf2c6cd2b9c8b81c727eee8c	2024-04-04 15:21:40 -07:00
Changyu Bi	796011e5ad	Limit compaction input files expansion (#12484 ) Summary: We removed the limit in https://github.com/facebook/rocksdb/issues/10835 and the option in https://github.com/facebook/rocksdb/issues/12323. Usually the input level is much smaller than the output level, which is likely why we have not seen issues with not applying a limit. It should be safer to add a safeguard as suggested in https://github.com/facebook/rocksdb/pull/12323#issuecomment-2016687321. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12484 Test Plan: * new and existing UT Reviewed By: ajkr Differential Revision: D55438870 Pulled By: cbi42 fbshipit-source-id: 0511d0465a70398c36230ed7cced5291ff1a6c19	2024-03-29 11:34:29 -07:00
Hui Xiao	d985902ef4	Disallow refitting more than 1 file from non-L0 to L0 (#12481 ) Summary: Context/Summary: We recently discovered that `CompactRange(change_level=true, target_level=0)` can possibly refit more than 1 file to L0. This refitting can cause a read performance regression, as we need to go through every file in L0, corruption in some edge cases, and false-positive corruption caught by the force consistency check. We decided to explicitly disallow such behavior. A related change to OptionChangeMigration(): - When migrating to FIFO with `compaction_options_fifo.max_table_files_size > 0`, RocksDB will [CompactRange() all the to-be-migrated data into a couple of L0 files](https://github.com/facebook/rocksdb/blob/main/utilities/option_change_migration/option_change_migration.cc#L164-L169) to avoid dropping all the data when migration finishes and the migrated data is larger than max_table_files_size. This is achieved by first compacting all the data into a couple of non-L0 files and refitting those files from non-L0 to L0 if needed. In that way, only some data instead of all data will be dropped immediately after migration to FIFO with a max_table_files_size. - Since this type of refitting behavior is disallowed from now on, we won't do this trick anymore and explicitly state such risk in the API comment. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12481 Test Plan: - New UT - Modified UT Reviewed By: cbi42 Differential Revision: D55351178 Pulled By: hx235 fbshipit-source-id: 9d8854f2f81d7e8aff859c3a4e53b7d688048e80	2024-03-29 10:52:36 -07:00
Jay Huh	c449867236	MultiCfIterator Impl Follow up (#12465 ) Summary: As a follow up for https://github.com/facebook/rocksdb/issues/12422 , this PR includes the following two changes. - Removal of `direction_` in the MultiCfIterator - Use of Member Func Template instead of `std::function` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12465 Test Plan: ``` ./multi_cf_iterator_test ``` Reviewed By: pdillinger, ltamasi Differential Revision: D55208448 Pulled By: jaykorean fbshipit-source-id: 8b3167c1d59839d076afc29097b5ad21a453460a	2024-03-22 14:51:16 -07:00
Peter Dillinger	b515a5db3f	Replace ScopedArenaIterator with ScopedArenaPtr<InternalIterator> (#12470 ) Summary: ScopedArenaIterator is not an iterator. It is a pointer wrapper. And we don't need a custom implemented pointer wrapper when std::unique_ptr can be instantiated with what we want. So this adds ScopedArenaPtr<T> to replace those uses. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12470 Test Plan: CI (including ASAN/UBSAN) Reviewed By: jowlyzhang Differential Revision: D55254362 Pulled By: pdillinger fbshipit-source-id: cc96a0b9840df99aa807f417725e120802c0ae18	2024-03-22 13:40:42 -07:00
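The idea of instantiating `std::unique_ptr` with a custom deleter instead of writing a bespoke pointer wrapper can be sketched as follows. This is a minimal illustration under stated assumptions, not the RocksDB definition: `DestroyOnly`, `ArenaOwned`, `Tracked`, and `DestroyCountAfterScope` are hypothetical names, and a stack buffer stands in for arena memory.

```cpp
#include <memory>
#include <new>

// Deleter that runs the destructor but does not free the memory, on the
// assumption that the storage belongs to an arena released elsewhere.
template <typename T>
struct DestroyOnly {
  void operator()(T* ptr) const { ptr->~T(); }
};

// Hypothetical alias in the spirit of ScopedArenaPtr<T>.
template <typename T>
using ArenaOwned = std::unique_ptr<T, DestroyOnly<T>>;

// Small type that records when its destructor runs.
struct Tracked {
  explicit Tracked(int* counter) : destroyed(counter) {}
  ~Tracked() { ++*destroyed; }
  int* destroyed;
};

// Constructs a Tracked in caller-owned storage and lets ArenaOwned destroy it.
int DestroyCountAfterScope() {
  int destroyed = 0;
  alignas(Tracked) unsigned char arena[sizeof(Tracked)];  // stand-in for an arena
  {
    ArenaOwned<Tracked> p(new (arena) Tracked(&destroyed));
  }  // destructor runs here; the buffer itself is not freed
  return destroyed;
}
```

The alias keeps the familiar `unique_ptr` interface (`get()`, `reset()`, move semantics) without any hand-written wrapper code.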
anand76	3b736c4aa3	Fix heap use after free error on retry after checksum mismatch (#12464 ) Summary: Fix the heap use after free bug caused by freeing the file system IO buffer in `BlockFetcher::ReadBlock()` instead of the caller. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12464 Test Plan: Update the `DBIOCorruptionTest` tests Reviewed By: akankshamahajan15 Differential Revision: D55206920 Pulled By: anand1976 fbshipit-source-id: fd6b608a61cd229b20c1e5f348ff3cc92328de0f	2024-03-21 16:19:09 -07:00
Andrew Kryczka	bf98dcf9a8	Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462 ) Summary: The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file. There's also fixes for bugs encountered along the way where non-ok statuses were ignored/overwritten, and a bit of plumbing work for functions that had no capability to return a status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462 Test Plan: A repro command ``` db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp ./db_stress \ --clear_column_family_one_in=0 \ --test_batches_snapshots=0 \ --write_fault_one_in=0 \ --use_put_entity_one_in=0 \ --prefixpercent=0 \ --read_fault_one_in=0 \ --readpercent=0 \ --reopen=0 \ --set_options_one_in=10000 \ --delpercent=0 \ --delrangepercent=0 \ --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 \ --open_write_fault_one_in=0 \ --destroy_db_initially=0 \ --ingest_external_file_one_in=0 \ --iterpercent=0 \ --nooverwritepercent=0 \ --db=$db \ --enable_blob_files=1 \ --expected_values_dir=$exp \ --max_background_compactions=20 \ --max_bytes_for_level_base=2097152 \ --max_key=100000 \ --min_blob_size=0 \ --open_files=-1 \ --ops_per_thread=100000000 \ --prefix_size=-1 \ --target_file_size_base=524288 \ --use_merge=1 \ --value_size_mult=32 \ --write_buffer_size=524288 \ --writepercent=100 ``` It used to fail like: ``` ... 
frame https://github.com/facebook/rocksdb/issues/9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3 frame https://github.com/facebook/rocksdb/issues/10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice, rocksdb::PinnableWideColumns, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, rocksdb::Status, rocksdb::MergeContext, unsigned long, rocksdb::PinnedIteratorsManager, bool, bool, unsigned long, rocksdb::ReadCallback, bool, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33 frame https://github.com/facebook/rocksdb/issues/11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492 frame https://github.com/facebook/rocksdb/issues/12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408 frame https://github.com/facebook/rocksdb/issues/13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109 frame 
https://github.com/facebook/rocksdb/issues/14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656 frame https://github.com/facebook/rocksdb/issues/15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607 frame https://github.com/facebook/rocksdb/issues/16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables, rocksdb::FlushScheduler, rocksdb::TrimHistoryScheduler, bool, unsigned long, rocksdb::DB, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505 frame https://github.com/facebook/rocksdb/issues/17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084 frame https://github.com/facebook/rocksdb/issues/18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807 frame https://github.com/facebook/rocksdb/issues/19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, 
disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312 frame https://github.com/facebook/rocksdb/issues/20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157 frame https://github.com/facebook/rocksdb/issues/21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544 frame https://github.com/facebook/rocksdb/issues/22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72 frame https://github.com/facebook/rocksdb/issues/23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317 frame https://github.com/facebook/rocksdb/issues/24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148 ... ``` Reviewed By: ltamasi Differential Revision: D55157795 Pulled By: ajkr fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764	2024-03-21 12:38:53 -07:00
anand76	63a105a481	Enable recycle_log_file_num option for point in time recovery (#12403 ) Summary: This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL. This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403 Test Plan: 1. Add new unit tests to verify behavior upon corruption 2. Re-enable disabled tests for verifying recycling behavior Reviewed By: ajkr Differential Revision: D54544824 Pulled By: anand1976 fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a	2024-03-21 12:29:35 -07:00
Yu Zhang	13e1c32a18	Follow ups for TimedPut and write time property (#12455 ) Summary: This PR contains a few follow ups from https://github.com/facebook/rocksdb/issues/12419 and https://github.com/facebook/rocksdb/issues/12428 including: 1) Handle a special case for `WriteBatch::TimedPut`. When the user specified write time is `std::numeric_limits<uint64_t>::max()`, it's not treated as an error, but it instead creates and writes a regular `Put` entry. 2) Update the `InternalIterator::write_unix_time` APIs to handle `kTypeValuePreferredSeqno` entries. 3) FlushJob is updated to use the seqno to time mapping copy in `SuperVersion`. FlushJob currently copies the DB's seqno to time mapping while holding the db mutex and only copies the part of interest, i.e., the part that only goes back to the earliest sequence number of the to-be-flushed memtables. While updating FlushJob to use the mapping copy in `SuperVersion`, it's given access to the full mapping to help cover the need to convert `kTypeValuePreferredSeqno`'s write time to preferred seqno as much as possible. Test plans: Added unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12455 Reviewed By: pdillinger Differential Revision: D55165422 Pulled By: jowlyzhang fbshipit-source-id: dc022653077f678c24661de5743146a74cce4b47	2024-03-21 10:00:15 -07:00
Richard Barnes	6a1c2abe9d	Remove extra semi colon from hbt/src/tagstack/tests/SlicerTest.cpp (#12461 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12461 X-link: https://github.com/facebookincubator/dynolog/pull/233 `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: rahku Differential Revision: D55087324 fbshipit-source-id: e8a03d33cad72a7d378e58f85eb550a03f6c2897	2024-03-20 12:44:50 -07:00
Kshitij Wadhwa	4ce1dc930c	don't run ZSTD_TrainDictionary in BlockBasedTableBuilder if there isn't compression needed (#12453 ) Summary: fixes https://github.com/facebook/rocksdb/issues/12409 ### Issue ZSTD_TrainDictionary [[link](`a53ed91691/table/block_based/block_based_table_builder.cc (L1894)`)] runs for SSTFileWriter::Finish even when bottommost_compression option is set to kNoCompression. This reduces throughput for SstFileWriter::Finish We construct rocksdb options using ZSTD compression for levels including 2 and above. For levels 0 and 1, we set it to kNoCompression. We also set zstd_max_train_bytes to a non-zero positive value (which is applicable for levels with ZSTD compression enabled). These options are used for the database and also passed to SstFileWriter for creating sst files to be later added to that database. Since the BlockBasedTableBuilder::Finish [[link](`a53ed91691/table/block_based/block_based_table_builder.cc (L1892)`)] only checks for zstd_max_train_bytes to be non-zero positive value, it runs ZSTD_TrainDictionary even when it shouldn't since SSTFileWriter is operating at bottommost level ### Fix If compression_type is set to kNoCompression, then don't run ZSTD_TrainDictionary and dictionary building ### Testing I see we have tests for sst file writer with compression type set/unset. Let me know if it isn't covered and I can extend Pull Request resolved: https://github.com/facebook/rocksdb/pull/12453 Reviewed By: cbi42 Differential Revision: D55030484 Pulled By: ajkr fbshipit-source-id: 834de2174c2b087d61bf045ca1ae29f337b821a7	2024-03-20 11:07:32 -07:00
Jay Huh	3f3f4660bd	wal_read_status check in RecoverLogFiles (#12460 ) Summary: Fixing the not-checked status failure as in https://github.com/facebook/rocksdb/actions/runs/8334988399/job/22809612148. When the status is not ok() for any reason, we do not check the `wal_read_status` because it's not necessary. It's causing the test failure when running with `ASSERT_STATUS_CHECKED=1` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12460 Test Plan: Existing tests Reviewed By: ajkr Differential Revision: D55104844 Pulled By: jaykorean fbshipit-source-id: 919b1fddca835494f9087c51c4da6eabc9e8df2b	2024-03-20 08:09:09 -07:00
anand76	4868c10b44	Retry block reads on checksum mismatch (#12427 ) Summary: On file systems that support storage level data checksum and reconstruction, retry SST block reads for point lookups, scans, and flush and compaction if there's a checksum mismatch on the initial read. A file system can indicate its support by setting the `FSSupportedOps::kVerifyAndReconstructRead` bit in `SupportedOps`. Tests: Add new unit tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/12427 Reviewed By: ajkr Differential Revision: D55025941 Pulled By: anand1976 fbshipit-source-id: dbd990cb75e03f756c8a66d42956f645c0b6d55e	2024-03-18 16:16:05 -07:00
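The retry-on-mismatch flow described in the commit above can be sketched in self-contained C++. Everything here is a hypothetical model rather than RocksDB code: `ReadFn`, `ReadBlockWithRetry`, and `ToyChecksum` are made-up names, the checksum is a toy hash rather than a real block checksum, and the `fs_supports_reconstruct` flag merely stands in for a file system advertising `FSSupportedOps::kVerifyAndReconstructRead`.

```cpp
#include <cstdint>
#include <functional>
#include <string>

enum class ReadResult { kOk, kChecksumMismatch };

// Toy checksum used only for illustration.
uint32_t ToyChecksum(const std::string& data) {
  uint32_t sum = 0;
  for (unsigned char c : data) sum = sum * 31 + c;
  return sum;
}

// Reads a block into `out`. The bool asks the (hypothetical) file system to
// verify the data at the storage level and reconstruct it if needed.
using ReadFn =
    std::function<void(bool verify_and_reconstruct, std::string* out)>;

ReadResult ReadBlockWithRetry(const ReadFn& read, uint32_t expected_checksum,
                              bool fs_supports_reconstruct, std::string* out,
                              int* attempts) {
  *attempts = 1;
  read(/*verify_and_reconstruct=*/false, out);
  if (ToyChecksum(*out) == expected_checksum) return ReadResult::kOk;
  // Without file system support a retry cannot help, so report the mismatch.
  if (!fs_supports_reconstruct) return ReadResult::kChecksumMismatch;
  *attempts = 2;
  read(/*verify_and_reconstruct=*/true, out);
  return ToyChecksum(*out) == expected_checksum ? ReadResult::kOk
                                                : ReadResult::kChecksumMismatch;
}
```

The sketch retries at most once and only when the file system claims it can reconstruct data; whether the real code path retries more than once is not stated in the commit message.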
Jay Huh	b4e9f5a400	Update Remote Compaction Tests to include more than one CF (#12430 ) Summary: Update `compaction_service_test` to make sure remote compaction works with a multiple column family setup. Minor refactor to get rid of duplicate code. Fixing one quick bug in the existing test util: `FilesPerLevel` didn't honor `cf_id` properly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12430 Test Plan: ``` ./compaction_service_test ``` Reviewed By: ajkr Differential Revision: D54883035 Pulled By: jaykorean fbshipit-source-id: 83b4f6f566fed5c4824bfef7de01074354a72b44	2024-03-18 15:40:48 -07:00
Hui Xiao	2443ebf810	Don't write to WAL after previous WAL write error (#12448 ) Summary: Context/Summary: A WAL write can continue onto the WAL file that has already encountered an error, and thus crash at `3f5bd46a07/file/writable_file_writer.cc (L67)` in the scenario below: <img width="698" alt="Screenshot 2024-03-15 at 1 52 45 PM" src="https://github.com/facebook/rocksdb/assets/83968999/cd631ef2-c87c-4926-91ab-a0dc10f1eb14"> Note that GetLiveFilesStorageInfo() can happen concurrently with PUT() for the non-WAL-write part, where the db lock isn't held. This PR adds an error check in the LogWriter layer to prevent thread 2 from starting to write to the WAL after thread 1's write error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12448 Test Plan: Step 1 Apply the patch below to simulate frequent WAL write errors for the purpose of repro ``` diff --git a/db_stress_tool/db_stress_driver.cc b/db_stress_tool/db_stress_driver.cc index b47fa89e6..31930e976 100644 --- a/db_stress_tool/db_stress_driver.cc +++ b/db_stress_tool/db_stress_driver.cc @@ -98,7 +98,7 @@ bool RunStressTestImpl(SharedState* shared) { // MANIFEST, CURRENT, and WAL files. 
fault_fs_guard->SetRandomWriteError( shared->GetSeed(), FLAGS_write_fault_one_in, error_msg, - /inject_for_all_file_types=/false, {FileType::kTableFile}); + /inject_for_all_file_types=/false, {FileType::kWalFile}); fault_fs_guard->SetFilesystemDirectWritable(false); fault_fs_guard->EnableWriteErrorInjection(); } diff --git a/utilities/fault_injection_fs.cc b/utilities/fault_injection_fs.cc index 0ffb43ea6..589912cf4 100644 --- a/utilities/fault_injection_fs.cc +++ b/utilities/fault_injection_fs.cc @@ -1042,7 +1042,7 @@ IOStatus FaultInjectionTestFS::InjectWriteError(const std::string& file_name) { } if (allowed_type) { - if (write_error_rand_.OneIn(write_error_one_in_)) { + if (write_error_rand_.OneIn(1)) { return GetError(); } } ``` Step 2 Run below ``` ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=100 --adaptive_readahead=1 --advise_random_on_open=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --allow_fallocate=1 --async_io=1 --auto_readahead_size=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=8 --bgerror_resume_retry_interval=1000000 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=41.19540459544058 --bottommost_compression_type=disable --bottommost_file_compaction_delay=3600 --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=0 --compaction_readahead_size=1048576 --compaction_ttl=0 --compress_format_version=1 
--compressed_secondary_cache_size=8388608 --compression_checksum=1 --compression_max_dict_buffer_bytes=68719476735 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --dump_malloc_stats=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_index_compression=1 --enable_pipelined_write=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --fill_cache=1 --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=10000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=15 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --log2_keys_per_lock=10 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=0 --low_pri_pool_ratio=0.5 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_total_wal_size=0 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 
--max_write_buffer_size_to_maintain=1048576 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=0 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=500000 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=1000 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --report_bg_io_stats=0 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --stats_history_buffer_size=0 --strict_bytes_per_sync=0 --subcompactions=4 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=1 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 
--verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=1 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --write_fault_one_in=1000 --writepercent=35 ``` Pre-PR: ``` db_stress: ./file/writable_file_writer.h:309: rocksdb::IOStatus rocksdb::WritableFileWriter::AssertFalseAndGetStatusForPrevError(): Assertion `sync_without_flush_called_' failed. ``` Post-PR ``` 2024/03/15-13:44:08 Starting database operations put or merge error: IO error: Retryable injected write error ``` Note: The patch is NOT included in the PR as we first need to figure out how to handle this type of failed write in stress test (planned for the near future). It's sufficient to show the stress test does not crash as pre-PR for the purpose of this PR. Reviewed By: ajkr Differential Revision: D54969287 Pulled By: hx235 fbshipit-source-id: 0ba4eabfec44ea7656d4d7117836f388897562f2	2024-03-18 12:27:49 -07:00
Jay Huh	db1dea22b1	MultiCfIterator Implementations (#12422 ) Summary: This PR continues https://github.com/facebook/rocksdb/issues/12153 by implementing the missing `Iterator` APIs - `Seek()`, `SeekForPrev()`, `SeekToLast()`, and `Prev`. A MaxHeap Implementation has been added to handle the reverse direction. The current implementation does not include upper/lower bounds yet. These will be added in subsequent PRs. The API is still marked as under construction and will be lifted after being added to the stress test. Please note that changing the iterator direction in the middle of iteration is expensive, as it requires seeking the element in each iterator again in the opposite direction and rebuilding the heap along the way. The first `Next()` after `SeekForPrev()` requires changing the direction under the current implementation. We may optimize this in later PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12422 Test Plan: The `multi_cf_iterator_test` has been extended to cover the API implementations. Reviewed By: pdillinger Differential Revision: D54820754 Pulled By: jaykorean fbshipit-source-id: 9eb741508df0f7bad598fb8e6bd5cdffc39e81d1	2024-03-18 09:05:30 -07:00
Changyu Bi	3d5be596a5	Fix a bug in iterator with UDT + `ReadOptions::pin_data` (#12451 ) Summary: with https://github.com/facebook/rocksdb/issues/12414 enabling `ReadOptions::pin_data`, this bug surfaced as corrupted per key-value checksum during crash test. `saved_key_.GetUserKey()` could be pinned user key, so DBIter should not overwrite it. In one case, it only surfaces when iterator skips many keys of the same user key. To stress that code path, this PR also added `max_sequential_skip_in_iterations` to crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12451 Test Plan: - Set ReadOptions::pin_data to true, the bug can be reproed quickly with `./db_stress --persist_user_defined_timestamps=1 --user_timestamp_size=8 --writepercent=35 --delpercent=4 --delrangepercent=1 --iterpercent=20 --nooverwritepercent=1 --prefix_size=8 --prefixpercent=10 --readpercent=30 --memtable_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --clear_column_family_one_in=0`. - Set max_sequential_skip_in_iterations to 1 for the other occurrence of the bug. Reviewed By: jowlyzhang Differential Revision: D55003766 Pulled By: cbi42 fbshipit-source-id: 23e1049129456684dafb028b6132b70e0afc07fb	2024-03-18 09:05:11 -07:00
Yu Zhang	f2546b6623	Support returning write unix time in iterator property (#12428 ) Summary: This PR adds support to return data's approximate unix write time in the iterator property API. The general implementation is: 1) If the entry comes from a SST file, the sequence number to time mapping recorded in that file's table properties will be used to deduce the entry's write time from its sequence number. If no such recording is available, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown except if the entry's sequence number is zero, in which case, 0 is returned. This also means that even if `preclude_last_level_data_seconds` and `preserve_internal_time_seconds` can be toggled off between DB reopens, as long as the SST file's table property has the mapping available, the entry's write time can be deduced and returned. 2) If the entry comes from memtable, we will use the DB's sequence number to write time mapping to do similar things. A copy of the DB's seqno to write time mapping is kept in SuperVersion to allow iterators to have lock free access. This also means a new `SuperVersion` is installed each time DB's seqno to time mapping updates, which is originally proposed by Peter in https://github.com/facebook/rocksdb/issues/11928 . Similarly, if the feature is not enabled, `std::numeric_limits<uint64_t>::max()` is returned to indicate the write time is unknown. Needed follow up: 1) The write time for `kTypeValuePreferredSeqno` should be special cased, where it's already specified by the user, so we can directly return it. 2) Flush job can be updated to use DB's seqno to time mapping copy in the SuperVersion. 3) Handle the case when `TimedPut` is called with a write time that is `std::numeric_limits<uint64_t>::max()`. We can make it a regular `Put`. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12428 Test Plan: Added unit test Reviewed By: pdillinger Differential Revision: D54967067 Pulled By: jowlyzhang fbshipit-source-id: c795b1b7ec142e09e53f2ed3461cf719833cb37a	2024-03-15 15:37:37 -07:00
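The write-time lookup described in the commit above can be illustrated with a minimal sketch. It is not RocksDB's implementation: the function name `ApproximateWriteTime`, the constant `kUnknownWriteTime`, and the representation of the mapping as a sorted vector of (sequence number, unix time) pairs are all assumptions made for illustration. Only the fallback rules (return `std::numeric_limits<uint64_t>::max()` when no mapping is recorded, except 0 for sequence number zero) come from the commit message; returning "unknown" for a sequence number older than every recorded pair is a further assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <limits>
#include <utility>
#include <vector>

// Sentinel for "write time unknown", as described in the commit message.
constexpr uint64_t kUnknownWriteTime = std::numeric_limits<uint64_t>::max();

// Hypothetical mapping type: (seqno, unix_time) pairs sorted by seqno.
using SeqnoTimePairs = std::vector<std::pair<uint64_t, uint64_t>>;

// Returns an approximate write time for `seqno` (illustrative only).
uint64_t ApproximateWriteTime(const SeqnoTimePairs& mapping, uint64_t seqno) {
  if (mapping.empty()) {
    // No recording available: unknown, except that seqno zero maps to 0.
    return seqno == 0 ? 0 : kUnknownWriteTime;
  }
  // Find the first pair whose seqno is strictly greater than `seqno`.
  auto it = std::upper_bound(
      mapping.begin(), mapping.end(), seqno,
      [](uint64_t s, const std::pair<uint64_t, uint64_t>& p) {
        return s < p.first;
      });
  if (it == mapping.begin()) {
    // Assumption of this sketch: older than every recorded pair is unknown.
    return kUnknownWriteTime;
  }
  // The latest recorded pair at or before `seqno` gives the approximation.
  return std::prev(it)->second;
}
```

With a mapping of `{10, 1000}` and `{20, 2000}`, a sequence number of 15 resolves to the time recorded for sequence number 10, which is why the result is only approximate.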
Andrew Kryczka	4d5ebad971	Fix kBlockCacheTier read with table cache miss (#12443 ) Summary: Thanks ltamasi for pointing out this bug. We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443 Reviewed By: cbi42 Differential Revision: D54930128 Pulled By: ajkr fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d	2024-03-15 14:41:58 -07:00
Andrew Kryczka	3f5bd46a07	Add `ContinueCallback` to `GetMergeOperands()` (#12438 ) Summary: The use case is similar to `MergeOperator::ShouldMerge()` for `Get()`: preventing reads into LSM components for merge operands that are of no interest to the user. `MergeOperator::ShouldMerge()` cannot be reused here because: - Its name does not make sense in the context of `GetMergeOperands()` since `GetMergeOperands()` never invokes merge - The callback is part of the `MergeOperator`, but an option specific to the read operation makes more sense to me If there are any ideas for an API design that covers both `MergeOperator::ShouldMerge()`'s use cases and `GetMergeOperandsOptions::continue_cb`'s use cases, that would be ideal, but for now this is what I came up with. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12438 Reviewed By: hx235 Differential Revision: D54914669 Pulled By: ajkr fbshipit-source-id: 5f3ff78d3890adc0b1b74bedf3921221930ce63a	2024-03-15 12:25:49 -07:00
Changyu Bi	096fb9b67d	Fix data race in WalManager (#12439 ) Summary: Crash tests were failing due to data race in accessing `purge_wal_files_last_run_`. This PR changes it to atomic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439 Test Plan: - existing UT - not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests Reviewed By: anand1976 Differential Revision: D54920817 Pulled By: cbi42 fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6	2024-03-14 21:24:06 -07:00
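As a rough illustration of the kind of fix described in the commit above, the sketch below keeps a "last run" timestamp in a `std::atomic` so that concurrent readers and writers do not race. The class `PurgeScheduler`, its method `ShouldPurge`, and the interval logic are hypothetical and are not taken from `WalManager`; the compare-and-swap loop is an extra design choice of this sketch (so that only one caller claims a given run), not something the commit states.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a component that purges files periodically.
class PurgeScheduler {
 public:
  explicit PurgeScheduler(uint64_t interval_secs)
      : interval_secs_(interval_secs) {}

  // Returns true if the caller should run a purge at `now_secs`.
  // Safe to call from several threads: the timestamp is atomic, and the
  // compare-exchange ensures only one caller claims each purge window.
  bool ShouldPurge(uint64_t now_secs) {
    uint64_t last = last_run_secs_.load(std::memory_order_relaxed);
    while (true) {
      if (last + interval_secs_ > now_secs) {
        return false;  // Too soon since the last claimed run.
      }
      // On failure, `last` is reloaded with the current value and we retry.
      if (last_run_secs_.compare_exchange_strong(last, now_secs,
                                                 std::memory_order_relaxed)) {
        return true;
      }
    }
  }

 private:
  const uint64_t interval_secs_;
  std::atomic<uint64_t> last_run_secs_{0};
};
```

A plain `uint64_t` member read and written from multiple threads without synchronization would be a data race under the C++ memory model, which is the class of issue TSAN reports.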
Yu Zhang	1104eaa35e	Add initial support for TimedPut API (#12419 ) Summary: This PR adds support for `TimedPut` API. We introduced a new type `kTypeValuePreferredSeqno` for entries added to the DB via the `TimedPut` API. The life cycle of such an entry on the write/flush/compaction paths is: 1) It is initially added to memtable as: `<user_key, seq, kTypeValuePreferredSeqno>: {value, write_unix_time}` 2) When it's flushed to L0 sst files, it's converted to: `<user_key, seq, kTypeValuePreferredSeqno>: {value, preferred_seqno}` when we have easy access to the seqno to time mapping. 3) During compaction, if certain conditions are met, we swap in the `preferred_seqno` and the entry will become: `<user_key, preferred_seqno, kTypeValue>: value`. This step helps fast track these entries to the cold tier if they are eligible after the sequence number swap. On the read path: A `kTypeValuePreferredSeqno` entry acts the same as a `kTypeValue` entry; the unix_write_time/preferred seqno part packed in value is completely ignored. Needed follow ups: 1) The seqno to time mapping accessible in flush needs to be extended to cover the `write_unix_time` for possible `kTypeValuePreferredSeqno` entries. This also means we need to track these `write_unix_time` in memtable. 2) Compaction filter support for the new `kTypeValuePreferredSeqno` type for feature parity with other `kTypeValue` and equivalent types. 3) Stress test coverage for the feature Pull Request resolved: https://github.com/facebook/rocksdb/pull/12419 Test Plan: Added unit tests Reviewed By: pdillinger Differential Revision: D54920296 Pulled By: jowlyzhang fbshipit-source-id: c8b43f7a7c465e569141770e93c748371ff1da9e	2024-03-14 15:44:55 -07:00
Peter Dillinger	dd24bda137	Fix windows build and CI (#12426 ) Summary: Issue https://github.com/facebook/rocksdb/issues/12421 describes a regression in the migration from CircleCI to GitHub Actions in which failing build steps no longer fail Windows CI jobs. In GHA with pwsh (new preferred powershell command), only the last non-builtin command (or something like that) affects the overall success/failure result, and failures in external commands do not exit the script, even with `$ErrorActionPreference = 'Stop'` and `$PSNativeCommandErrorActionPreference = $true`. Switching to `powershell` causes some obscure failure (not seen in CircleCI) about the `-Lo` option to `curl`. Here we work around this using the only reasonable-but-ugly way known: explicitly check the result after every non-trivial build step. This leaves us highly susceptible to future regressions with unchecked build steps, but a clean solution is not known. This change also fixes the build errors that were allowed to creep in because of the CI regression. Also decreased the unnecessarily long running time of DBWriteTest.WriteThreadWaitNanosCounter. For background, this problem explicitly contradicts GitHub's documentation, and GitHub has known about the problem for more than a year, with no evidence of caring or intending to fix. https://github.com/actions/runner-images/issues/6668 Somehow CircleCI doesn't have this problem. And even though cmd.exe and powershell have been perpetuating DOS-isms for decades, they still seem to be a somewhat active "hot mess" when it comes to sensible, consistent, and documented behavior. 
Fixes https://github.com/facebook/rocksdb/issues/12421 A history of some things I tried in development is here: https://github.com/facebook/rocksdb/compare/main...pdillinger:rocksdb:debug_windows_ci_orig Pull Request resolved: https://github.com/facebook/rocksdb/pull/12426 Test Plan: CI, including https://github.com/facebook/rocksdb/issues/12434 where I have temporarily enabled other Windows builds on PR with this change Reviewed By: cbi42 Differential Revision: D54903698 Pulled By: pdillinger fbshipit-source-id: 116bcbebbbf98f347c7cf7dfdeebeaaed7f76827	2024-03-14 12:04:41 -07:00
Peter Dillinger	c0ae5be934	Disable flaky part of TransactionLogIteratorCheckWhenArchive (#12423 ) Summary: https://github.com/facebook/rocksdb/issues/12397 attempted to make the test more honest about its failures, and they're really showing up in CI now (but not locally). Disable pending investigation Pull Request resolved: https://github.com/facebook/rocksdb/pull/12423 Test Plan: watch CI Reviewed By: ltamasi Differential Revision: D54817705 Pulled By: pdillinger fbshipit-source-id: 4721834c49b225ac52d1a28ecb06b9d05de977b3	2024-03-12 12:54:53 -07:00
Peter Dillinger	7622029101	Fix flaky TransactionLogIteratorCheckWhenArchive (#12397 ) Summary: Seen in https://github.com/facebook/rocksdb/actions/runs/8086592802/job/22096691572?pr=12388 ``` [ RUN ] DBTestXactLogIterator.TransactionLogIteratorCheckWhenArchive db/db_log_iter_test.cc:173:23: runtime error: member call on address 0x0000023956f0 which does not point to an object of type 'rocksdb::DBTestXactLogIterator' 0x0000023956f0: note: object is of type 'rocksdb::DBTestBase' 00 00 00 00 98 ae f7 da 75 7f 00 00 a0 5d 39 02 00 00 00 00 80 ff 39 02 00 00 00 00 95 00 00 00 ^~~~~~~~~~~~~~~~~~~~~~~ vptr for 'rocksdb::DBTestBase' UndefinedBehaviorSanitizer: undefined-behavior db/db_log_iter_test.cc:173:23 in ``` This is almost certainly caused by the sync point callback happening on asynchronous file deletion in the DB while the end of the test is reached and the destruction of the `DBTestXactLogIterator` has reached `DBTestBase::~DBTestBase()`. Either closing the DB or disabling sync points before the end of the test should suffice to fix, and we'll do both. And assert that the sync point callback is actually hit each time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12397 Test Plan: unable to reproduce, but ran 1000 iterations of the test with UBSAN Reviewed By: ltamasi Differential Revision: D54326687 Pulled By: pdillinger fbshipit-source-id: cc09a4dcd2f237d5b45d910364d6aa56bbd46d50	2024-03-12 08:43:47 -07:00
Andrew Kryczka	27a2473668	Best-effort recovery support for atomic flush (#12406 ) Summary: This PR updates `VersionEditHandlerPointInTime` to recover all or none of the updates in an AtomicGroup. This makes best-effort recovery properly handle atomic flushes during recovery, so the features are now allowed to both be enabled at once. The new logic requires that AtomicGroups do not contain column family additions or removals. AtomicGroups are currently written for atomic flush, which does not include such edits. Column family additions or removals are recovered independently of AtomicGroups. The new logic needs to be aware of removal, though, so that a dropped CF does not prevent completion of an AtomicGroup recovery. The new logic treats each AtomicGroup as if it contains updates for all existing column families, even though it is possible to create AtomicGroups that only affect a subset of column families. This simplifies the logic at the expense of recovering less data in certain edge case scenarios. The usage of `MaybeCreateVersion()` is pretty tricky. The goal is to create a barrier at the start of an AtomicGroup such that all valid states up to that point will be applied to `versions_`. Here is a summary. - `MaybeCreateVersion(..., false)` creates a `Version` on a negative edge trigger (transition from valid to invalid). It was previously called when applying each update. Now, it is only called when applying non-AtomicGroup updates. - `MaybeCreateVersion(..., true)` creates a `Version` on a positive level trigger (valid state). It was previously called only at the end of iteration. Now, it is additionally called before processing an AtomicGroup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12406 Reviewed By: jaykorean, cbi42 Differential Revision: D54494904 Pulled By: ajkr fbshipit-source-id: 0114a9fe1d04b471d086dcab5978ea8a3a56ad52	2024-03-06 14:40:40 -08:00
Peter Dillinger	a53ed91691	Fix/improve temperature handling for file ingestion (#12402 ) Summary: Partly following up on leftovers from https://github.com/facebook/rocksdb/issues/12388 In terms of public API: * Make it clear that IngestExternalFileArg::file_temperature is just a hint for opening the existing file, though it was previously used for both copy-from temp hint and copy-to temp, which was bizarre. * Specify how IngestExternalFile assigns temperature to file ingested into DB. (See details in comments.) This approach is not perfect in terms of matching how the DB assigns temperatures, but was the simplest way to get close. The key complication for matching DB temperature assignments is that ingestion files are copied (to a destination temp) before their target level is determined (in general). * Add a temperature option to SstFileWriter::Open so that files intended for ingestion can be initially written to a chosen temperature. * Note that "fail_if_not_bottommost_level" is obsolete/confusing use of "bottommost" In terms of the implementation, there was a similar bit of oddness with the internal CopyFile API, which only took one temperature, ambiguously applicable to the source, destination, or both. This is also fixed. Eventual suggested follow-up: * Before copying files for ingestion, determine a tentative level assignment to use for destination temperature, and keep that even if final level assignment happens to be different at commit time (rare). * More temperature handling for CreateColumnFamilyWithImport and Checkpoints. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12402 Test Plan: Deeply revamped ExternalSSTFileBasicTest.IngestWithTemperature to test the new changes. Previously this test was insufficient because it was only looking at temperatures according to the DB manifest. Incorporating FileTemperatureTestFS allows us to also test the temperatures in the storage layer. 
Used macros instead of functions for better tracing to critical source location on test failures. Some enhancements to FileTemperatureTestFS in the process of developing the revamped test. Reviewed By: jowlyzhang Differential Revision: D54442794 Pulled By: pdillinger fbshipit-source-id: 41d9d0afdc073e6a983304c10bbc07c70cc7e995	2024-03-05 16:56:08 -08:00
Jay Huh	3412195367	Introduce MultiCfIterator (#12153 ) Summary: This PR introduces a new implementation of `Iterator` via a new public API called `NewMultiCfIterator()`. The new API takes a vector of column family handles to build a cross-column-family iterator, which internally maintains multiple `DBIter`s as child iterators from a consistent database state. When a key exists in multiple column families, the iterator selects the value (and wide columns) from the first column family containing the key, following the order provided in the `column_families` parameter. Similar to the merging iterator, a min heap is used to iterate across the child iterators. Backward iteration and direction change functionalities will be implemented in future PRs. The comparator used to compare keys across different column families will be derived from the iterator of the first column family specified in `column_families`. This comparator will be checked against the comparators from all other column families that the iterator will traverse. If there's a mismatch with any of the comparators, the initialization of the iterator will fail. Please note that this PR is not enough for users to start using `MultiCfIterator`. The `MultiCfIterator` and related APIs are still marked as "DO NOT USE - UNDER CONSTRUCTION". This PR is just the first of many PRs that will follow soon. This PR includes the following: - Introduction and partial implementation of the `MultiCfIterator`, which implements the generic `Iterator` interface. The implementation includes the construction of the iterator, `SeekToFirst()`, `Next()`, `Valid()`, `key()`, `value()`, and `columns()`. - Unit tests to verify iteration across multiple column families in two distinct scenarios: (1) keys are unique across all column families, and (2) the same keys exist in multiple column families. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12153 Reviewed By: pdillinger Differential Revision: D52308697 Pulled By: jaykorean fbshipit-source-id: b03e69f13b40af5a8f0598d0f43a0bec01ef8294	2024-03-05 10:22:43 -08:00
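The min-heap merge with "first column family wins" semantics described in the commit above can be sketched in a few lines. This is a simplified illustration, not the `MultiCfIterator` implementation: the `SortedCf` type, the function `MergeFirstCfWins`, and the use of in-memory sorted vectors in place of child `DBIter`s are assumptions, and the sketch compares keys with plain `std::string` ordering rather than a column family comparator.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one column family: (key, value) sorted by key.
using SortedCf = std::vector<std::pair<std::string, std::string>>;

// Merges the column families into one key-ordered stream. When a key exists
// in several column families, the value from the earliest one in `cfs` wins.
std::vector<std::pair<std::string, std::string>> MergeFirstCfWins(
    const std::vector<SortedCf>& cfs) {
  struct HeapItem {
    std::string key;
    size_t cf_index;  // Position of the column family in `cfs`.
    size_t pos;       // Position of the entry inside that column family.
  };
  // std::priority_queue is a max heap, so the comparison is inverted to pop
  // the smallest key first; ties go to the lowest column family index.
  auto key_then_cf_greater = [](const HeapItem& a, const HeapItem& b) {
    if (a.key != b.key) return a.key > b.key;
    return a.cf_index > b.cf_index;
  };
  std::priority_queue<HeapItem, std::vector<HeapItem>,
                      decltype(key_then_cf_greater)>
      heap(key_then_cf_greater);

  // Seed the heap with the first entry of every non-empty column family.
  for (size_t i = 0; i < cfs.size(); ++i) {
    if (!cfs[i].empty()) heap.push(HeapItem{cfs[i][0].first, i, 0});
  }

  std::vector<std::pair<std::string, std::string>> out;
  while (!heap.empty()) {
    HeapItem top = heap.top();
    heap.pop();
    // Emit the key only the first time it is seen; later duplicates come
    // from column families listed further back and are skipped.
    if (out.empty() || out.back().first != top.key) {
      out.emplace_back(top.key, cfs[top.cf_index][top.pos].second);
    }
    // Advance the child this entry came from.
    size_t next = top.pos + 1;
    if (next < cfs[top.cf_index].size()) {
      heap.push(HeapItem{cfs[top.cf_index][next].first, top.cf_index, next});
    }
  }
  return out;
}
```

Breaking ties on the column family index is what makes the choice of value deterministic and dependent on the order of `cfs`, mirroring the ordering rule stated for the `column_families` parameter.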
yuzhangyu@fb.com	1cfdece85d	Run internal cpp modernizer on RocksDB repo (#12398 ) Summary: When internal cpp modernizer attempts to format rocksdb code, it will replace macro `ROCKSDB_NAMESPACE` with its default definition `rocksdb` when collapsing nested namespace. We filed feedback for the tool (T180254030) and the team filed a bug for this: https://github.com/llvm/llvm-project/issues/83452. At the same time, they suggested that we run the modernizer tool ourselves so future auto codemod attempts will be smaller. This diff contains: Running `xplat/scripts/codemod_service/cpp_modernizer.sh` in fbcode/internal_repo_rocksdb/repo (excluding some directories in utilities/transactions/lock/range/range_tree/lib that have a non-Meta copyright comment) without swapping out the namespace macro `ROCKSDB_NAMESPACE` Followed by RocksDB's own `make format` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12398 Test Plan: Auto tests Reviewed By: hx235 Differential Revision: D54382532 Pulled By: jowlyzhang fbshipit-source-id: e7d5b40f9b113b60e5a503558c181f080b9d02fa	2024-03-04 10:08:32 -08:00
Richard Barnes	d7b8756976	Remove extra semi colon from internal_repo_rocksdb/repo/db/table_cache_sync_and_async.h Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D54362208 fbshipit-source-id: a47acd4c794c899fccb65285b116b50d9566ea12	2024-03-04 06:34:44 -08:00
Richard Barnes	ced333ee45	Remove extra semi colon from instagram/ranking/mezql/shots/parser/fast/Token.cpp Summary: `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: palmje Differential Revision: D54362213 fbshipit-source-id: 0bbc9e5fce917fc4f72423f0a4c8cb2c2b1759dd	2024-03-04 06:32:50 -08:00
Jay Huh	c00c16855d	Access DBImpl* and CFD* by CFHImpl* in Iterators (#12395 ) Summary: In the current implementation of iterators, `DBImpl` and `ColumnFamilyData` are held in `DBIter` and `ArenaWrappedDBIter` for two purposes: tracing and Refresh() API. With the introduction of a new iterator called MultiCfIterator in PR https://github.com/facebook/rocksdb/issues/12153 , which is a cross-column-family iterator that maintains multiple DBIters as child iterators from a consistent database state, we need to make some changes to the existing implementation. The new iterator will still be exposed through the generic Iterator interface with an additional capability to return AttributeGroups (via `attribute_groups()`) which is a list of wide columns grouped by column family. For more information about AttributeGroup, please refer to previous PRs: https://github.com/facebook/rocksdb/issues/11925 #11943, and https://github.com/facebook/rocksdb/issues/11977. To be able to return AttributeGroup in the default single CF iterator created, access to `ColumnFamilyHandle` within `DBIter` is necessary. However, this is not currently available in `DBIter`. Since `DBImpl` and `ColumnFamilyData` can be easily accessed via `ColumnFamilyHandleImpl`, we have decided to replace the pointers to `ColumnFamilyData` and `DBImpl` in `DBIter` with a pointer to `ColumnFamilyHandleImpl`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12395 Test Plan: There should be no behavior changes. Existing tests and CI for the correctness tests. Test for Perf Regression Build ``` $> make -j64 release ``` Setup ``` $> TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=1000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="newiterator,seekrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.552 micros/op 1810157 ops/sec 0.552 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 4.502 micros/op 222143 ops/sec 4.502 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] newiterator : 0.520 micros/op 1924401 ops/sec 0.520 seconds 1000000 operations; DB path: [/dev/shm/db_bench/dbbench] seekrandom : 4.532 micros/op 220657 ops/sec 4.532 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: pdillinger Differential Revision: D54332713 Pulled By: jaykorean fbshipit-source-id: b28d897ad519e58b1ca82eb068a6319544a4fae5	2024-03-01 10:28:20 -08:00
Jay Huh	5bcc184975	Update APIs to support generic unique identifier format (#12384 ) Summary: The current design proposes using a combination of `job_id`, `db_id`, and `db_session_id` to create a unique identifier for remote compaction jobs. However, this approach may not be suitable for users who prefer a different format for the unique identifier. At Meta, we are utilizing generic compute offload to offload compaction tasks to remote workers. The compute offload client generates a UUID for each task, which requires an update to the current RocksDB API for onboarding purposes. Users still have the option to create the unique identifier by combining `job_id`, `db_id`, and `db_session_id` if they prefer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12384 Test Plan: ``` $> ./compaction_service_test 13:29:35 [==========] Running 14 tests from 1 test case. [----------] Global test environment set-up. [----------] 14 tests from CompactionServiceTest [ RUN ] CompactionServiceTest.BasicCompactions [ OK ] CompactionServiceTest.BasicCompactions (2642 ms) [ RUN ] CompactionServiceTest.ManualCompaction [ OK ] CompactionServiceTest.ManualCompaction (454 ms) [ RUN ] CompactionServiceTest.CancelCompactionOnRemoteSide [ OK ] CompactionServiceTest.CancelCompactionOnRemoteSide (1643 ms) [ RUN ] CompactionServiceTest.FailedToStart [ OK ] CompactionServiceTest.FailedToStart (1332 ms) [ RUN ] CompactionServiceTest.InvalidResult [ OK ] CompactionServiceTest.InvalidResult (1516 ms) [ RUN ] CompactionServiceTest.SubCompaction [ OK ] CompactionServiceTest.SubCompaction (551 ms) [ RUN ] CompactionServiceTest.CompactionFilter [ OK ] CompactionServiceTest.CompactionFilter (563 ms) [ RUN ] CompactionServiceTest.Snapshot [ OK ] CompactionServiceTest.Snapshot (124 ms) [ RUN ] CompactionServiceTest.ConcurrentCompaction [ OK ] CompactionServiceTest.ConcurrentCompaction (660 ms) [ RUN ] CompactionServiceTest.CompactionInfo [ OK ] CompactionServiceTest.CompactionInfo (984 ms) [ 
RUN ] CompactionServiceTest.FallbackLocalAuto [ OK ] CompactionServiceTest.FallbackLocalAuto (343 ms) [ RUN ] CompactionServiceTest.FallbackLocalManual [ OK ] CompactionServiceTest.FallbackLocalManual (380 ms) [ RUN ] CompactionServiceTest.RemoteEventListener [ OK ] CompactionServiceTest.RemoteEventListener (491 ms) [ RUN ] CompactionServiceTest.TablePropertiesCollector [ OK ] CompactionServiceTest.TablePropertiesCollector (169 ms) [----------] 14 tests from CompactionServiceTest (11854 ms total) [----------] Global test environment tear-down [==========] 14 tests from 1 test case ran. (11855 ms total) [ PASSED ] 14 tests. ``` Reviewed By: hx235 Differential Revision: D54220339 Pulled By: jaykorean fbshipit-source-id: 5a9054f31933d1996adca02082eb37b6d5353224	2024-03-01 09:55:30 -08:00
Changyu Bi	4aed229fa7	Add `write_memtable_time` to perf level `kEnableWait` (#12394 ) Summary: .. so write time can be measured under the new perf level for single-threaded writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12394 Test Plan: * add a new UT `PerfContextTest.WriteMemtableTimePerfLevel` Reviewed By: anand1976 Differential Revision: D54326263 Pulled By: cbi42 fbshipit-source-id: d0e334d9581851ba6cf53c776c0bd876365d1e00	2024-02-29 15:08:26 -08:00
Peter Dillinger	13ef21c22e	default_write_temperature option (#12388 ) Summary: Currently SST files that aren't applicable to last_level_temperature nor file_temperature_age_thresholds are written with temperature kUnknown, which is a little weird and doesn't support CF-based tiering. The default_temperature option only affects how kUnknown is interpreted for stats. This change adds a new per-CF option default_write_temperature that determines the temperature of new SST files when those other options do not apply. Also made a change to ignore last_level_temperature with FIFO compaction, because I found that could lead to an infinite loop in compaction. Needed follow-up: Fix temperature handling with external file ingestion Pull Request resolved: https://github.com/facebook/rocksdb/pull/12388 Test Plan: unit tests extended appropriately. (Ignore whitespace changes when reviewing.) Reviewed By: jowlyzhang Differential Revision: D54266574 Pulled By: pdillinger fbshipit-source-id: c9ec9a74dbf22be6e986f77f9689d05fea8ef0bb	2024-02-28 14:36:13 -08:00
奏之章	1fa5dff7d1	WriteThread::EnterAsBatchGroupLeader reorder writers (#12138 ) Summary: Reorder the writers list so that a leader can take as many commits as possible, maximizing the throughput of the system and reducing IOPS. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12138 Reviewed By: hx235 Differential Revision: D53955592 Pulled By: ajkr fbshipit-source-id: 4d899d038faef691b63801d9d85f5cc079b7bbb5	2024-02-27 15:23:54 -08:00
zaidoon	3104e55f29	update DB::DumpSupportInfo to log whether jemalloc is supported or not (#12386 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12386 Reviewed By: cbi42 Differential Revision: D54231896 Pulled By: ajkr fbshipit-source-id: 6b3357b2e97d3599955e303810088bb5d5896199	2024-02-27 15:07:00 -08:00
Peter Dillinger	d780e7a561	Remove `bottommost_temperature` (#12389 ) Summary: deprecated option already replaced by `last_level_temperature`. (Keeping recognition of the option in old options files.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12389 Test Plan: tests updated Reviewed By: jowlyzhang, cbi42 Differential Revision: D54267946 Pulled By: pdillinger fbshipit-source-id: 65c49b15e7394829c1f3b44edd4179d2daff6017	2024-02-27 14:48:00 -08:00
anand76	d9c0d44dab	Add a perf level for measuring user thread block time (#12368 ) Summary: Enabling time PerfCounter stats in RocksDB is currently very expensive, as it enables all sorts of relatively uninteresting stats, such as iteration, point lookup breakdown, etc. This PR adds a new perf level between `kEnableCount` and `kEnableTimeExceptForMutex` to enable stats for time spent by user (i.e., a RocksDB user) threads blocked by other RocksDB threads or events, such as a write group leader, write delays, or stalls. It does not include time spent waiting to acquire mutexes, or waiting for IO. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12368 Test Plan: Add a unit test for write_thread_wait_nanos Reviewed By: ajkr Differential Revision: D54021583 Pulled By: anand1976 fbshipit-source-id: 3f6fcf71010132ffffca0391a5565f3b59fddd48	2024-02-22 12:14:53 -08:00
Yu Zhang	f1ca47b904	Add support to bulk load external files for UDT in memtable only feature (#12356 ) Summary: This PR expands on the capabilities added in https://github.com/facebook/rocksdb/issues/12343. It adds sanity checks for the external file's comparator name and user-defined timestamps related flag. With this, it now supports ingesting files to a column family that enables the user-defined timestamps in Memtable only feature. Two fields in the table properties are used for the aforementioned check: 1) the comparator name, which records what comparator was used to create this external sst file, 2) the flag `user_defined_timestamps_persisted`. We compare these two fields with the column family's settings. The details are in the util function `ValidateUserDefinedTimestampsOptions`. To optimize for the majority of cases, where the sanity check should pass and the table properties read should not affect how `TableReader` is constructed, we do not read the table properties block separately and use it for the sanity check before creating a `TableReader`. Instead, we continue using the current flow: first create a `TableReader`, use it to read the table properties and do the sanity checks, and reset the `TableReader` for the case where the column family enables the UDTs in memtable only feature and the external file does not contain user-defined timestamps. This PR also groups other table properties related sanity checks in function `GetIngestedFileInfo` into the newly added `SanityCheckTableProperties` function. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12356 Test Plan: added unit test; existing unit tests Reviewed By: cbi42 Differential Revision: D54025116 Pulled By: jowlyzhang fbshipit-source-id: a918276c15f9908bd9df8513ce667638882e1554	2024-02-21 15:41:53 -08:00
Andrew Kryczka	8e29f243c9	No filesystem reads during `Merge()` writes (#12365 ) Summary: This occasional filesystem read in the write path has caused user pain. It doesn't seem very useful considering it only limits one component's merge chain length, and only helps merge uncached (i.e., infrequently read) values. This PR proposes allowing `max_successive_merges` to be exceeded when the value cannot be read from in-memory components. I included a rollback flag (`strict_max_successive_merges`) just in case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12365 Test Plan: "rocksdb.block.cache.data.add" is number of data blocks read from filesystem. Since the benchmark is write-only, compaction is disabled, and flush doesn't read data blocks, any nonzero value means the user write issued the read. ``` $ for s in false true; do echo -n "strict_max_successive_merges=$s: " && ./db_bench -value_size=64 -write_buffer_size=131072 -writes=128 -num=1 -benchmarks=mergerandom,flush,mergerandom -merge_operator=stringappend -disable_auto_compactions=true -compression_type=none -strict_max_successive_merges=$s -max_successive_merges=100 -statistics=true \|& grep 'block.cache.data.add COUNT' ; done strict_max_successive_merges=false: rocksdb.block.cache.data.add COUNT : 0 strict_max_successive_merges=true: rocksdb.block.cache.data.add COUNT : 1 ``` Reviewed By: hx235 Differential Revision: D53982520 Pulled By: ajkr fbshipit-source-id: e40f761a60bd601f232417ac0058e4a33ee9c0f4	2024-02-21 13:15:27 -08:00
Alex Wied	f2732d0586	Export GetSequenceNumber functionality for Snapshots (#12354 ) Summary: This PR adds `Snapshot->GetSequenceNumber()` functionality to the C API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12354 Reviewed By: akankshamahajan15 Differential Revision: D53836085 Pulled By: cbi42 fbshipit-source-id: 4a14daeba9210a69bcb74e4c1c0666deff1b4837	2024-02-16 10:28:41 -08:00
anand76	d227276147	Deprecate some variants of Get and MultiGet (#12327 ) Summary: A lot of variants of Get and MultiGet have been added to `include/rocksdb/db.h` over the years. Try to consolidate them by marking variants that don't return timestamps as deprecated. The underlying DB implementation will check and return Status::NotSupported() if it doesn't support returning timestamps and the caller asks for it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12327 Reviewed By: pdillinger Differential Revision: D53828151 Pulled By: anand1976 fbshipit-source-id: e0b5ca42d32daa2739d5f439a729815a2d4ff050	2024-02-16 09:21:06 -08:00
Akanksha Mahajan	956f1dfde3	Change ReadAsync callback API to remove const from FSReadRequest (#11649 ) Summary: Modify the ReadAsync callback API to remove const from FSReadRequest, because const prevents fs_scratch from transferring ownership. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11649 Test Plan: CircleCI jobs Reviewed By: anand1976 Differential Revision: D53585309 Pulled By: akankshamahajan15 fbshipit-source-id: 3bff9035db0e6fbbe34721a5963443355807420d	2024-02-16 09:14:55 -08:00
anand76	28c1c15c29	Sync tickers and histograms across C++ and Java (#12355 ) Summary: The RocksDB ticker and histogram statistics were out of sync between the C++ and Java code, with a number of newer stats missing in TickerType.java and HistogramType.java. Also, there were gaps in numbering in portal.h, which could soon become an issue due to the number of tickers and the fact that we're limited to 1 byte in Java. This PR adds the missing stats, and re-numbers all of them. It also moves some stats around to try to group related stats together. Since this will go into a major release, compatibility shouldn't be an issue. This should be automated at some point, since the current process is somewhat error prone. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12355 Reviewed By: jaykorean Differential Revision: D53825324 Pulled By: anand1976 fbshipit-source-id: 298c180872f4b9f1ee54b8bb22f4e280458e7e09	2024-02-15 17:22:03 -08:00
Peter Dillinger	12018136d8	KeySegmentsExtractor and prototype higher-dimensional filtering (#12075 ) Summary: This change contains a prototype new API for "higher dimensional" filtering of read queries. Existing filters treat keys as one-dimensional, either as distinct points (whole key) or as contiguous ranges in comparator order (prefix filters). The proposed KeySegmentsExtractor allows treating keys as multi-dimensional for filtering purposes even though they still have a single total order across dimensions. For example, consider these keys in different LSM levels: L0: abc_0123 abc_0150 def_0114 ghi_0134 L1: abc_0045 bcd_0091 def_0077 xyz_0080 If we get a range query for [def_0100, def_0200), a prefix filter (up to the underscore) will tell us that both levels are potentially relevant. However, if each SST file stores a simple range of the values for the second segment of the key, we would see that L1 only has [0045, 0091] which (under certain required assumptions) we are sure does not overlap with the given range query. Thus, we can filter out processing or reading any index or data blocks from L1 for the query. This kind of case shows up with time-ordered data but is more general than filtering based on user timestamp. See https://github.com/facebook/rocksdb/issues/11332 . Here the "time" segments of the keys are meaningfully ordered with respect to each other even when the previous segment is different, so summarizing data along an alternate dimension of the key like this can work well for filtering. This prototype implementation simply leverages existing APIs for user table properties and table filtering, which is not very CPU efficient. Eventually, we expect to create a native implementation. However, I have put some significant thought and engineering into the new APIs overall, which I expect to be close to refined enough for production. For details, see new public APIs in experimental.h. 
For a detailed example, see the new unit test in db_bloom_filter_test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12075 Test Plan: Unit test included Reviewed By: jowlyzhang Differential Revision: D53619406 Pulled By: pdillinger fbshipit-source-id: 9e6e7b82b4db8d815db76a6ab340e90db2c191f2	2024-02-15 15:39:55 -08:00
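The L0/L1 example in the commit above can be sketched in plain C++ without any RocksDB dependency. This is only an illustration of the filtering idea, not the `KeySegmentsExtractor` API from experimental.h: the names `SecondSegment`, `SegmentRange`, `Summarize` and `MayOverlap` are hypothetical, keys are assumed to have the form `prefix_digits`, and the query bounds are assumed to share a prefix so that only the second segment needs comparing.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: returns the part of the key after the first '_'.
std::string SecondSegment(const std::string& key) {
  const auto pos = key.find('_');
  assert(pos != std::string::npos);
  return key.substr(pos + 1);
}

// Summary a file (here: a whole level) would store for the second segment.
struct SegmentRange {
  std::string min;
  std::string max;
};

// Computes the inclusive [min, max] range of second segments over all keys.
SegmentRange Summarize(const std::vector<std::string>& keys) {
  assert(!keys.empty());
  SegmentRange r{SecondSegment(keys.front()), SecondSegment(keys.front())};
  for (const auto& k : keys) {
    const std::string seg = SecondSegment(k);
    r.min = std::min(r.min, seg);
    r.max = std::max(r.max, seg);
  }
  return r;
}

// True if the stored inclusive range [min, max] may overlap the
// half-open query range [lo, hi) on the second segment.
bool MayOverlap(const SegmentRange& r, const std::string& lo,
                const std::string& hi) {
  return r.max >= lo && r.min < hi;
}
```

With the keys from the commit message, `Summarize` yields [0114, 0150] for L0 and [0045, 0091] for L1, so a query for second segments in [0100, 0200) keeps L0 and skips L1.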
Yu Zhang	4bea83aa44	Remove the force mode for EnableFileDeletions API (#12337 ) Summary: There is no strong reason for user to need this mode while on the other hand, its behavior is destructive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12337 Reviewed By: hx235 Differential Revision: D53630393 Pulled By: jowlyzhang fbshipit-source-id: ce94b537258102cd98f89aa4090025663664dd78	2024-02-13 18:36:25 -08:00
Yu Zhang	10d02456b6	Add support to bulk load external files with user-defined timestamps (#12343 ) Summary: This PR adds initial support for bulk loading external sst files with user-defined timestamps. The following invariant must hold while ingesting external files: for two internal keys <K, ts1, seq1> and <K, ts2, seq2>, ts1 < ts2 iff seq1 < seq2. These extra requirements are added for ingesting external files with user-defined timestamps: 1) A file whose user key (without timestamp) range overlaps with the db cannot be ingested. This is because we cannot ensure the above invariant is met without checking each overlapped key's timestamp and comparing it with the timestamp from the db. This is an expensive step. This bulk loading feature will be used by MyRocks, and currently their usage can guarantee that an ingested file's key range doesn't overlap with the db. `4f3a57a13f/storage/rocksdb/ha_rocksdb.cc (L3312)` We can consider loosening this requirement by doing this check in the future; this initial support just disallows it. 2) Files with overlapping user key (without timestamp) ranges are not allowed to be ingested. For similar reasons, it's hard to ensure the above invariant is met. For example, if we have two files where user keys are interleaved like this: file1: [c10, c8, f10, f5] file2: [b5, c11, f4] Whether file1 gets a bigger global seqno than file2 or the other way around, the above invariant cannot be met. So we disallow this. 3) When a column family enables user-defined timestamps, it doesn't support ingestion behind mode. Ingestion behind currently simply puts the file at the bottommost level and assigns a global seqno 0 to the file. We would need to do a similar search through the LSM tree for key range overlap checks to make sure the aforementioned invariant is met. So this initial support disallows this mode. We can consider adding it in the future. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12343 Test Plan: Add unit tests Reviewed By: cbi42 Differential Revision: D53686182 Pulled By: jowlyzhang fbshipit-source-id: f05e3fb27967f7974ed40179d78634c40ecfb136	2024-02-13 11:15:28 -08:00
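The interleaved-files example in the commit above can be checked mechanically. The following is a minimal standalone sketch, not RocksDB code: `Entry` and `OrderIsValid` are hypothetical names, timestamps are modeled as plain integers, and each file is assumed to receive a single global sequence number, so one file is entirely "newer" than the other.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, uint64_t>;  // (user key, timestamp)

// Returns true if giving every entry in `newer` a larger sequence number
// than every entry in `older` preserves "ts1 < ts2 iff seq1 < seq2" for
// each user key that appears in both files.
bool OrderIsValid(const std::vector<Entry>& newer,
                  const std::vector<Entry>& older) {
  // Largest timestamp per user key in the older file.
  std::map<std::string, uint64_t> max_ts_in_older;
  for (const auto& [key, ts] : older) {
    auto [it, inserted] = max_ts_in_older.emplace(key, ts);
    if (!inserted && ts > it->second) it->second = ts;
  }
  // Every shared key in the newer file must carry a strictly larger timestamp.
  for (const auto& [key, ts] : newer) {
    const auto it = max_ts_in_older.find(key);
    if (it != max_ts_in_older.end() && ts <= it->second) return false;
  }
  return true;
}
```

For file1 = [c10, c8, f10, f5] and file2 = [b5, c11, f4], both `OrderIsValid(file1, file2)` and `OrderIsValid(file2, file1)` return false: key `c` requires file2 to be newer, while key `f` requires file1 to be newer.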
Levi Tamasi	de1e3ff6ea	Fix a data race in DBImpl::RenameTempFileToOptionsFile (#12347 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347 `DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels). Reviewed By: jowlyzhang Differential Revision: D53675153 fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481	2024-02-12 13:26:09 -08:00
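The pattern described in the commit above (read the counter under the mutex, then do the slow work outside the critical section) can be sketched as follows. This is a simplified model, not the `DBImpl` code: the class `FileDeleter` and its members are hypothetical stand-ins for the DB mutex and `disable_delete_obsolete_files_`.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical model of a component whose "deletions disabled" counter
// must only be read or written while holding the mutex.
class FileDeleter {
 public:
  void DisableDeletions() {
    std::lock_guard<std::mutex> lock(mu_);
    ++disable_count_;
  }

  void EnableDeletions() {
    std::lock_guard<std::mutex> lock(mu_);
    if (disable_count_ > 0) --disable_count_;
  }

  // Returns true if deletion was allowed to run.
  bool MaybeDeleteObsoleteFiles() {
    int snapshot = 0;
    {
      // Critical section: only copy the counter value.
      std::lock_guard<std::mutex> lock(mu_);
      snapshot = disable_count_;
    }
    if (snapshot > 0) return false;
    // The slow file I/O would happen here, without holding the mutex.
    return true;
  }

 private:
  std::mutex mu_;
  int disable_count_ = 0;  // guarded by mu_
};
```

The saved value can be stale by the time the deletion runs; the sketch only shows how the unsynchronized read is avoided, which is the scope the commit message gives for the fix.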
Peter Dillinger	54cb9c77d9	Prefer static_cast in place of most reinterpret_cast (#12308 ) Summary: The following are risks associated with pointer-to-pointer reinterpret_cast: * Can produce the "wrong result" (crash or memory corruption). IIRC, in theory this can happen for any up-cast or down-cast for a non-standard-layout type, though in practice would only happen for multiple inheritance cases (where the base class pointer might be "inside" the derived object). We don't use multiple inheritance a lot, but we do. * Can mask useful compiler errors upon code change, including converting between unrelated pointer types that you are expecting to be related, and converting between pointer and scalar types unintentionally. I can only think of some obscure cases where static_cast could be troublesome when it compiles as a replacement: * Going through `void` could plausibly cause unnecessary or broken pointer arithmetic. Suppose we have `struct Derived: public Base1, public Base2`. If we have `Derived` -> `void` -> `Base2` -> `Derived` through reinterpret casts, this could plausibly work (though technical UB) assuming the `Base2` is not dereferenced. Changing to static cast could introduce breaking pointer arithmetic. * Unnecessary (but safe) pointer arithmetic could arise in a case like `Derived` -> `Base2` -> `Derived` where before the Base2 pointer might not have been dereferenced. This could potentially affect performance. With some light scripting, I tried replacing pointer-to-pointer reinterpret_casts with static_cast and kept the cases that still compile. Most occurrences of reinterpret_cast have successfully been changed (except for java/ and third-party/). 294 changed, 257 remain. A couple of related interventions included here: Previously Cache::Handle was not actually derived from in the implementations and just used as a `void` stand-in with reinterpret_cast. Now there is a relationship to allow static_cast. 
In theory, this could introduce pointer arithmetic (as described above) but is unlikely without multiple inheritance AND non-empty Cache::Handle. Remove some unnecessary casts to void* as this is allowed to be implicit (for better or worse). Most of the remaining reinterpret_casts are for converting to/from raw bytes of objects. We could consider better idioms for these patterns in follow-up work. I wish there were a way to implement a template variant of static_cast that would only compile if no pointer arithmetic is generated, but best I can tell, this is not possible. AFAIK the best you could do is a dynamic check that the void* conversion after the static cast is unchanged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12308 Test Plan: existing tests, CI Reviewed By: ltamasi Differential Revision: D53204947 Pulled By: pdillinger fbshipit-source-id: 9de23e618263b0d5b9820f4e15966876888a16e2	2024-02-07 10:44:11 -08:00
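The pointer-arithmetic concern in the commit above, and the dynamic check it mentions at the end, can be illustrated with a small standalone sketch. The types `Base1`, `Base2`, `Derived` and the helper `StaticCastKeepsAddress` are hypothetical. The exact object layout is not fixed by the C++ standard for this non-standard-layout type; the comments describe what common ABIs do.

```cpp
#include <cassert>

struct Base1 { int a = 1; };
struct Base2 { int b = 2; };
// Multiple inheritance: on common ABIs the Base2 subobject sits at a
// non-zero offset inside Derived.
struct Derived : public Base1, public Base2 {};

// Dynamic check: returns true if static_cast<To*> left the address
// unchanged, i.e. the conversion generated no pointer adjustment.
template <typename To, typename From>
bool StaticCastKeepsAddress(From* from) {
  To* to = static_cast<To*>(from);
  return static_cast<void*>(to) == static_cast<void*>(from);
}
```

On typical compilers `StaticCastKeepsAddress<Base1>(&d)` is true and `StaticCastKeepsAddress<Base2>(&d)` is false for a `Derived d`. A `reinterpret_cast<Base2*>(&d)` would keep the original address, so dereferencing its result would read the wrong subobject.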
Yu Zhang	e3e8fbb497	Add a separate range classes for internal usage (#12071 ) Summary: Introduce some different range classes `UserKeyRange` and `UserKeyRangePtr` to be used by internal implementation. The `Range` class is used in both public APIs like `DB::GetApproximateSizes`, `DB::GetApproximateMemTableStats`, `DB::GetPropertiesOfTablesInRange` etc and internal implementations like `ColumnFamilyData::RangesOverlapWithMemtables`, `VersionSet::GetPropertiesOfTablesInRange`. These APIs have different expectations of what keys this range class contain. Public API users are supposed to populate the range with the user keys without timestamp, in the same way that point lookup and range scan APIs' key input only expect the user key without timestamp. The internal APIs implementation expect a user key whose format is compatible with the user comparator, a.k.a a user key with the timestamp. This PR contains: 1) introducing counterpart range class `UserKeyRange` `UserKeyRangePtr` for internal implementation while leave the existing `Range` and `RangePtr` class only for public APIs. Internal implementations are updated to use this new class instead. 2) add user-defined timestamp support for `DB::GetPropertiesOfTablesInRange` API and `DeleteFilesInRanges` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12071 Test Plan: existing tests Added test for `DB::GetPropertiesOfTablesInRange` and `DeleteFilesInRanges` APIs for when user-defined timestamp is enabled. The change in external_file_ingestion_job doesn't have a user-defined timestamp enabled test case coverage, will add one in a follow up PR that adds file ingestion support for UDT. Reviewed By: ltamasi Differential Revision: D53292608 Pulled By: jowlyzhang fbshipit-source-id: 9a9279e23c640a6d8f8232636501a95aef7638b8	2024-02-06 18:35:36 -08:00
Hui Xiao	1a885fe730	Remove deprecated Options::access_hint_on_compaction_start (#11654 ) Summary: Context: `Options::access_hint_on_compaction_start ` is marked deprecated and now ready to be removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11654 Test Plan: Multiple db_stress runs with pre-PR and post-PR binary randomly to ensure forward/backward compatibility on options `36a5686ec0`?fbclid=IwAR2IcdAUdTvw9O9V5GkHEYJRGMVR9p7Ei-LMa-9qiXlj3z80DxjkxlGnP1E `python3 tools/db_crashtest.py --simple blackbox --interval=30` Reviewed By: cbi42 Differential Revision: D47892459 Pulled By: hx235 fbshipit-source-id: a62f46a0377fe143be7638e218978d5431c15c56	2024-02-05 13:35:19 -08:00
Peter Dillinger	6e88126dd3	Don't log an error when an auxiliary dir is missing (#12326 ) Summary: info_log gets an error logged when wal_dir or a db_path/cf_path is missing. Under this condition, the directory is created later (in DBImpl::Recover -> Directories::SetDirectories) with no error status returned. To avoid error spam in logs, change these to a descriptive "header" log entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12326 Test Plan: manual with DBBasicTest.DBCloseAllDirectoryFDs which exercises this code Reviewed By: jowlyzhang Differential Revision: D53374743 Pulled By: pdillinger fbshipit-source-id: 32d1ce18809da13a25bdd6183d661f66a3b6a111	2024-02-05 10:26:41 -08:00
Yu Zhang	4eaa771c01	Refactor external sst file ingestion job (#12305 ) Summary: Updates some documentation and invariant assertions after https://github.com/facebook/rocksdb/issues/12257 and https://github.com/facebook/rocksdb/issues/12284. Also refactored some duplicate code and improved some error messages and preconditions for errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12305 Test Plan: Existing unit tests Reviewed By: hx235 Differential Revision: D53371325 Pulled By: jowlyzhang fbshipit-source-id: fb0edcb3a3602cdf0a292ef437cfdfe897fc6c99	2024-02-02 18:07:57 -08:00
Changyu Bi	5620efc794	Remove deprecated option `ignore_max_compaction_bytes_for_input` (#12323 ) Summary: The option is introduced in https://github.com/facebook/rocksdb/issues/10835 to allow disabling the new compaction behavior if it's not safe. The option is enabled by default and there has not been a need to disable it. So it should be safe to remove now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12323 Reviewed By: ajkr Differential Revision: D53330336 Pulled By: cbi42 fbshipit-source-id: 36eef4664ac96b3a7ed627c48bd6610b0a7eafc5	2024-02-02 17:09:42 -08:00
Changyu Bi	ace1721b28	Remove deprecated option `level_compaction_dynamic_file_size` (#12325 ) Summary: The option is introduced in https://github.com/facebook/rocksdb/issues/10655 to allow reverting to old behavior. The option is enabled by default and there has not been a need to disable it. Remove it for 9.0 release. Also fixed and improved a few unit tests that depended on setting this option to false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12325 Test Plan: existing tests. Reviewed By: hx235 Differential Revision: D53369430 Pulled By: cbi42 fbshipit-source-id: 0ec2440ca8d88db7f7211c581542c7581bd4d3de	2024-02-02 15:37:40 -08:00
Peter Dillinger	1d6dbfb8b7	Rename IntTblPropCollector -> InternalTblPropColl (#12320 ) Summary: I've always found this name difficult to read, because it sounds like it's for collecting int(eger) table properties. I'm fixing this now to set up for a change that I have stubbed out in the public API (table_properties.h): a new adapter function `TablePropertiesCollector::AsInternal()` that allows RocksDB-provided TablePropertiesCollectors (such as CompactOnDeletionCollector) to implement the easier-to-upgrade internal interface while still (superficially) implementing the public interface. In addition to added flexibility, this should be a performance improvement as the adapter class UserKeyTablePropertiesCollector can be avoided for such cases where a RocksDB-provided collector is used (AsInternal() returns non-nullptr). table_properties.h is the only file with changes that aren't simple find-replace renaming. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12320 Test Plan: existing tests, CI Reviewed By: ajkr Differential Revision: D53336945 Pulled By: pdillinger fbshipit-source-id: 02535bcb30bbfb00e29e8478af62e5dad50a63b8	2024-02-02 14:14:43 -08:00
anand76	95b41eec6d	Fix potential incorrect result for duplicate key in MultiGet (#12295 ) Summary: The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met - 1. Duplicate keys in a MultiGet batch 2. Key matches the last key in a non-zero, non-bottommost level file 3. Final value is not in the file (merge operand, not snapshot visible etc) 4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block 5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates 6. Value or merge operand for the key is present in the very next level The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under conditions 1 and 2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295 Test Plan: Add unit tests that fail an assertion or return wrong result without the fix Reviewed By: hx235 Differential Revision: D53187634 Pulled By: anand1976 fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea	2024-02-02 11:48:35 -08:00
Andrew Kryczka	f9d45358ca	Removed `check_flush_compaction_key_order` (#12311 ) Summary: `check_flush_compaction_key_order` option was introduced for the key order checking online validation. It gave users the ability to disable the validation without downgrade in case the validation caused inefficiencies or false positives. Over time this validation has shown to be cheap and correct, so the option to disable it can now be removed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12311 Reviewed By: cbi42 Differential Revision: D53233379 Pulled By: ajkr fbshipit-source-id: 1384361104021d6e3e580dce2ec123f9f99ce637	2024-01-31 16:30:26 -08:00
Peter Dillinger	76c834e441	Remove 'virtual' when implied by 'override' (#12319 ) Summary: ... to follow modern C++ style / idioms. Used this hack: ``` for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319 Test Plan: existing tests, CI Reviewed By: jaykorean Differential Revision: D53275303 Pulled By: pdillinger fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377	2024-01-31 13:14:42 -08:00
Yu Zhang	d11584e42e	Be consistent in key range overlap check (#12315 ) Summary: We should be consistent in how we check key range overlap in memtables and in sst files. All the sst file key range overlap checks compare the user key without timestamp, for example: `377eee77f8/db/version_set.cc (L129-L130)` The key range overlap check for memtables, however, compares the whole user key. Currently it happens to achieve the same effect because this function is only called by `ExternalSstFileIngestionJob` and `DBImpl::CompactRange`, which take a user key without timestamp as the range end and pad a max or min timestamp to it depending on whether the end is exclusive. So using `Comparator::Compare` here works too, but we should update it to `Comparator::CompareWithoutTimestamp` to be consistent with all the other file key range overlap check functions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12315 Test Plan: existing tests Reviewed By: ltamasi Differential Revision: D53273456 Pulled By: jowlyzhang fbshipit-source-id: c094ae1f0c195d52542124c4fb03fdca14241e85	2024-01-31 11:12:52 -08:00
Peter Dillinger	2b4245559c	Don't warn on (recursive) disable file deletion (#12310 ) Summary: To stop spamming our warning logs with normal behavior. Also fix comment on `DisableFileDeletions()`. In response to https://github.com/facebook/rocksdb/issues/12001 I've indicated my objection to granting legitimacy to force=true, but I'm not addressing that here and now. In short, the user shouldn't be asked to think about whether they want to use the wrong behavior. ;) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12310 Test Plan: existing tests Reviewed By: jowlyzhang Differential Revision: D53233117 Pulled By: pdillinger fbshipit-source-id: 5d2aedb76b02b30f8a5fa5b436fc57fde5d40d6e	2024-01-30 11:58:31 -08:00
Andrew Kryczka	aacf60dda2	Speedup based on number of files marked for compaction (#12306 ) Summary: RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. This PR adds pressure detection based on the number of files marked for compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12306 Reviewed By: cbi42 Differential Revision: D53200559 Pulled By: ajkr fbshipit-source-id: 63402ee336881a4539204d255960f04338ab7a0e	2024-01-29 17:29:04 -08:00
Peter Dillinger	61ed0de600	Add more detail to some statuses (#12307 ) Summary: and also fix comment/label on some MacOS CI jobs. Motivated by a crash test failure missing a definitive indicator of the genesis of the status: ``` file ingestion error: Operation failed. Try again.: ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12307 Test Plan: just cosmetic changes. These statuses should not arise frequently enough to be a performance issue (copying messages). Reviewed By: jaykorean Differential Revision: D53199529 Pulled By: pdillinger fbshipit-source-id: ad83daaa5d80f75c9f81158e90fb6d9ecca33fe3	2024-01-29 16:31:09 -08:00
Yu Zhang	17042a3fb7	Remove misspelled tickers used in error handler (#12302 ) Summary: As titled, the replacement tickers have been introduced in https://github.com/facebook/rocksdb/issues/11509 and in use since release 8.4. This PR completely removes the misspelled ones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12302 Test Plan: CI tests Reviewed By: jaykorean Differential Revision: D53196935 Pulled By: jowlyzhang fbshipit-source-id: 9c9d0d321247690db5edfdc52b4fecb2f1218979	2024-01-29 15:28:37 -08:00
Chdy	fc48af33f5	fix some perf statistic in write (#12285 ) Summary: ### Summary: perf context lacks statistics in some write steps ``` rocksdb::get_perf_context()->write_wal_time); rocksdb::get_perf_context()->write_memtable_time); rocksdb::get_perf_context()->write_pre_and_post_process_time); ``` #### case 1: when the unordered_write is true, the `write_memtable_time` is 0 ``` write_wal_time : 13.7012 write_memtable_time : 0 write_pre_and_post_process_time : 142.037 ``` Reason: the `DBImpl::UnorderedWriteMemtable` function does not record `write_memtable_time` while inserting into the memtable, ```c++ Status DBImpl::UnorderedWriteMemtable(const WriteOptions& write_options, WriteBatch* my_batch, WriteCallback* callback, uint64_t log_ref, SequenceNumber seq, const size_t sub_batch_cnt) { ... if (w.CheckCallback(this) && w.ShouldWriteToMemtable()) { /* need to calculate write_memtable_time */ ColumnFamilyMemTablesImpl column_family_memtables( versions_->GetColumnFamilySet()); w.status = WriteBatchInternal::InsertInto( &w, w.sequence, &column_family_memtables, &flush_scheduler_, &trim_history_scheduler_, write_options.ignore_missing_column_families, 0 /*log_number*/, this, true /*concurrent_memtable_writes*/, seq_per_batch_, sub_batch_cnt, true /*batch_per_txn*/, write_options.memtable_insert_hint_per_batch); if (write_options.disableWAL) { has_unpersisted_data_.store(true, std::memory_order_relaxed); } } ... 
} ``` Fix: add perf function ``` write_wal_time : 14.3991 write_memtable_time : 19.3367 write_pre_and_post_process_time : 130.441 ``` #### case 2: when the enable_pipelined_write is true, the `write_memtable_time` is small ``` write_wal_time : 11.2986 write_memtable_time : 1.0205 write_pre_and_post_process_time : 140.131 ``` Reason: the `DBImpl::PipelinedWriteImpl` function does not record `write_memtable_time` when `w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER` ```c++ Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options, WriteBatch* my_batch, WriteCallback* callback, uint64_t* log_used, uint64_t log_ref, bool disable_memtable, uint64_t* seq_used) { ... if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) { /* need to calculate write_memtable_time */ assert(w.ShouldWriteToMemtable()); ColumnFamilyMemTablesImpl column_family_memtables( versions_->GetColumnFamilySet()); w.status = WriteBatchInternal::InsertInto( &w, w.sequence, &column_family_memtables, &flush_scheduler_, &trim_history_scheduler_, write_options.ignore_missing_column_families, 0 /*log_number*/, this, true /*concurrent_memtable_writes*/, false /*seq_per_batch*/, 0 /*batch_cnt*/, true /*batch_per_txn*/, write_options.memtable_insert_hint_per_batch); if (write_thread_.CompleteParallelMemTableWriter(&w)) { MemTableInsertStatusCheck(w.status); versions_->SetLastSequence(w.write_group->last_sequence); write_thread_.ExitAsMemTableWriter(&w, w.write_group); } } if (seq_used != nullptr) { *seq_used = w.sequence; } assert(w.state == WriteThread::STATE_COMPLETED); return w.FinalStatus(); } ``` Fix: add perf function ``` write_wal_time : 10.5201 write_memtable_time : 17.1048 write_pre_and_post_process_time : 114.313 ``` #### case 3: the `DBImpl::WriteImplWALOnly` function does not record `write_delay_time` ```c++ Status DBImpl::WriteImplWALOnly( WriteThread* write_thread, const WriteOptions& write_options, WriteBatch* my_batch, WriteCallback* callback, uint64_t* log_used, const uint64_t log_ref, 
uint64_t* seq_used, const size_t sub_batch_cnt, PreReleaseCallback* pre_release_callback, const AssignOrder assign_order, const PublishLastSeq publish_last_seq, const bool disable_memtable) { ... if (publish_last_seq == kDoPublishLastSeq) { } else { /* need to calculate write_delay_time */ InstrumentedMutexLock lock(&mutex_); Status status = DelayWrite(/*num_bytes=*/0ull, *write_thread, write_options); if (!status.ok()) { WriteThread::WriteGroup write_group; write_thread->EnterAsBatchGroupLeader(&w, &write_group); write_thread->ExitAsBatchGroupLeader(write_group, status); return status; } } } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12285 Reviewed By: ajkr Differential Revision: D53191765 Pulled By: cbi42 fbshipit-source-id: f78d5b280bea6a777f077c89c3e0b8fe98d3c860	2024-01-29 12:31:11 -08:00
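The "add perf function" fixes in the commit above amount to wrapping a code region in a scoped timer that adds elapsed time to a counter. The sketch below shows that general RAII technique in standalone C++. It is not RocksDB's perf context implementation: `PerfCounters`, `ScopedTimer` and `InsertIntoMemtable` are hypothetical names, and the sleep is a stand-in for the memtable insert.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical counter struct standing in for a perf context.
struct PerfCounters {
  uint64_t write_memtable_nanos = 0;
};

// Adds the time between construction and destruction to *counter.
class ScopedTimer {
 public:
  explicit ScopedTimer(uint64_t* counter)
      : counter_(counter), start_(std::chrono::steady_clock::now()) {
    assert(counter_ != nullptr);
  }
  ~ScopedTimer() {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    *counter_ += static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count());
  }
  ScopedTimer(const ScopedTimer&) = delete;
  ScopedTimer& operator=(const ScopedTimer&) = delete;

 private:
  uint64_t* counter_;
  std::chrono::steady_clock::time_point start_;
};

// Without the timer object, this code path would leave the counter at 0,
// which is the symptom reported in case 1.
void InsertIntoMemtable(PerfCounters* perf) {
  ScopedTimer timer(&perf->write_memtable_nanos);
  std::this_thread::sleep_for(std::chrono::milliseconds(1));  // stand-in work
}
```

Because the destructor does the accounting, every exit path of the timed scope is covered, including early returns.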
Yu Zhang	071a146fa0	Add support for range deletion when user timestamps are not persisted (#12254 ) Summary: For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes: 1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from an end key of a range tombstone. There are places where it's logically removed (replaced with a min timestamp) because there is still logic with the running comparator that expects a user key that contains timestamp. And in the block based builder, it is eventually physically removed before persisted in a block. 2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`. 3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end. `d69628e6ce/db/dbformat.h (L923-L935)` This max timestamp is removed when in memory `FileMetaData.largest` is persisted into Manifest, we pad it back when it's read from Manifest while handling related `VersionEdit` in `VersionEditHandler`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12254 Test Plan: Added unit test and enabled this feature combination's stress test. Reviewed By: cbi42 Differential Revision: D52965527 Pulled By: jowlyzhang fbshipit-source-id: e8315f8a2c5268e2ae0f7aec8012c266b86df985	2024-01-29 11:37:34 -08:00
Peter Dillinger	4e60663b31	Remove unnecessary, confusing 'extern' (#12300 ) Summary: In C++, `extern` is redundant in a number of cases: * "Global" function declarations and definitions * "Global" variable definitions when already declared `extern` For consistency and simplicity, I've removed these in code that we own. In a couple of cases, I removed obsolete declarations, and for MagicNumber constants, I have consolidated the declarations into a header file (format.h) as standard best practice would prescribe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12300 Test Plan: no functional changes, CI Reviewed By: ajkr Differential Revision: D53148629 Pulled By: pdillinger fbshipit-source-id: fb8d927959892e03af09b0c0d542b0a3b38fd886	2024-01-29 10:38:08 -08:00
Changyu Bi	2233a2f4c0	Enhance corruption status message for record mismatch in compaction (#12297 ) Summary: ... to include the actual numbers of processed and expected records, and the file number for input files. The purpose is to be able to find the offending files even when the relevant LOG file is gone. Another change is to check the record count even when `compaction_verify_record_count` is false, and log a warning message without setting corruption status if there is a mismatch. This is consistent with how we check the record count for flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12297 Test Plan: print the status message in `DBCompactionTest.VerifyRecordCount` ``` before Corruption: Compaction number of input keys does not match number of keys processed. after Compaction number of input keys does not match number of keys processed. Expected 20 but processed 10. Compaction summary: Base version 4 Base level 0, inputs: [11(2156B) 9(2156B)] ``` Reviewed By: ajkr Differential Revision: D53110130 Pulled By: cbi42 fbshipit-source-id: 6325cbfb8f71f25ce37f23f8277ebe9264863c3b	2024-01-26 09:12:07 -08:00
Peter Dillinger	f046a8f617	Deflake ColumnFamilyTest.WriteStallSingleColumnFamily (#12294 ) Summary: https://github.com/facebook/rocksdb/issues/12267 apparently introduced a data race in test code where a background read of estimated_compaction_needed_bytes while holding the DB mutex could race with a foreground write made for testing purposes. This change takes the DB mutex for those writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12294 Test Plan: 1000 TSAN runs of test (massively fails before change, passes after) Reviewed By: ajkr Differential Revision: D53095483 Pulled By: pdillinger fbshipit-source-id: 13fcb383ebad313dabe39eb8f9085c34d370b54a	2024-01-25 14:40:18 -08:00
zaidoon	c3bff1c02d	Allow setting Stderr Logger via C API (#12262 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12262 Reviewed By: pdillinger Differential Revision: D53027616 Pulled By: ajkr fbshipit-source-id: 2e88e53e0c02447c613439f5528161ea1340b323	2024-01-25 12:36:40 -08:00
Hui Xiao	96fb7de3bc	Rate-limit un-ratelimited flush/compaction code paths (#12290 ) Summary: Context/Summary: We recently found out some code paths in flush and compaction aren't rate-limited when they should be. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12290 Test Plan: existing UT Reviewed By: anand1976 Differential Revision: D53066103 Pulled By: hx235 fbshipit-source-id: 9dc4cab5f841230d18e5504dc480ac523e9d3950	2024-01-25 12:00:15 -08:00
Peter Dillinger	d895eb08b3	Fix UB/crash in new SeqnoToTimeMapping::CopyFromSeqnoRange (#12293 ) Summary: After https://github.com/facebook/rocksdb/issues/12253 this function has crashed in the crash test, in its call to `std::copy`. I haven't reproduced the crash directly, but `std::copy` probably has undefined behavior if the starting iterator is after the ending iterator, which was possible. I've fixed the logic to deal with that case and added an assertion to check that precondition of `std::copy` (which apparently can go unchecked by `std::copy` itself even with UBSAN+ASAN). Also added some unit tests etc. that were unfinished for https://github.com/facebook/rocksdb/issues/12253, and slightly tweaked SeqnoToTimeMapping::EnforceMaxTimeSpan handling of the zero time span case. This is intended for patching 8.11. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12293 Test Plan: tests added. Will trigger ~20 runs of the crash test job that saw the crash. https://fburl.com/ci/5iiizvfa Reviewed By: jowlyzhang Differential Revision: D53090422 Pulled By: pdillinger fbshipit-source-id: 69d60b1847d9c7e4ae62b153011c2040405db461	2024-01-25 11:27:15 -08:00
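The `std::copy` precondition described in the commit above (the first iterator must not be past the last) can be illustrated with a small standalone sketch. This is not RocksDB's code: `SeqnoTimePairs` and `CopyFromSeqnoRangeSketch` are hypothetical names, and the sorted vector of pairs is only an assumed stand-in for the real mapping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical stand-in for a list of (seqno, time) pairs sorted by seqno.
using SeqnoTimePairs = std::vector<std::pair<uint64_t, uint64_t>>;

// Copies the entries whose seqno lies in [from_seqno, to_seqno].
// An inverted request returns an empty result instead of handing
// std::copy a first iterator that is past the last iterator.
SeqnoTimePairs CopyFromSeqnoRangeSketch(const SeqnoTimePairs& src,
                                        uint64_t from_seqno,
                                        uint64_t to_seqno) {
  SeqnoTimePairs out;
  if (from_seqno > to_seqno) {
    return out;  // inverted range: nothing to copy
  }
  auto first = std::lower_bound(
      src.begin(), src.end(), from_seqno,
      [](const std::pair<uint64_t, uint64_t>& p, uint64_t s) {
        return p.first < s;
      });
  auto last = std::upper_bound(
      src.begin(), src.end(), to_seqno,
      [](uint64_t s, const std::pair<uint64_t, uint64_t>& p) {
        return s < p.first;
      });
  // Make the std::copy precondition explicit in debug builds.
  assert(first <= last);
  std::copy(first, last, std::back_inserter(out));
  return out;
}
```

The early return handles the inverted case and the assertion documents the precondition, since `std::copy` does not check it itself.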
Changyu Bi	3812a77771	Deflake `DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit` (#12289 ) Summary: The test has been failing with ``` [ RUN ] DBCompactionTest.BottomPriCompactionCountsTowardConcurrencyLimit db/db_compaction_test.cc:9661: Failure Expected equality of these values: 0u Which is: 0 env_->GetThreadPoolQueueLen(Env::Priority::LOW) Which is: 1 ``` This can happen when thread pool queue len is checked before `test::SleepingBackgroundTask::DoSleepTask` is scheduled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12289 Reviewed By: ajkr Differential Revision: D53064300 Pulled By: cbi42 fbshipit-source-id: 9ed1b714243880f82bd1cc1584b402ac9cf57507	2024-01-25 10:37:11 -08:00
Yu Zhang	928aca835f	Skip searching through lsm tree for a target level when files overlap (#12284 ) Summary: When ingesting multiple external files with overlapping key ranges, the current flow searches through the LSM tree for a target level and later discards that result by defaulting back to L0. This PR improves this by skipping the search altogether. The other change is to remove the default to L0 for the combination of universal compaction + force global sequence number, which was initially added to meet a pre https://github.com/facebook/rocksdb/issues/7421 invariant. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12284 Test Plan: Added unit test: ./external_sst_file_test --gtest_filter="IngestFileWithGlobalSeqnoAssignedUniversal" Reviewed By: ajkr Differential Revision: D53072238 Pulled By: jowlyzhang fbshipit-source-id: 30943e2e284a7f23b495c0ea4c80cb166a34a8ac	2024-01-24 23:30:08 -08:00
Hui Xiao	1b2b16b38e	Fix bug of newer ingested data assigned with an older seqno (#12257 ) Summary: Context: We found an edge case where newer ingested data is assigned with an older seqno. This causes older data of that key to be returned for read. Consider the following lsm shape: ![image](https://github.com/facebook/rocksdb/assets/83968999/973fd160-5065-49cd-8b7b-b6ab4badae23) Then ingest a file to L5 containing new data of key_overlap. Because of [this](https://github.com/facebook/rocksdb/blob/5a26f392ca640818da0b8590be6119699e852b07/db/external_sst_file_ingestion_job.cc#L951-L956), the file is assigned with seqno 2, older than the old data's seqno 4. After just another compaction, we will drop the new_v for key_overlap because of the seqno and cause older data to be returned. ![image](https://github.com/facebook/rocksdb/assets/83968999/a3ef95e4-e7ae-4c30-8d03-955cd4b5ed42) Summary: This PR removes the incorrect seqno assignment Pull Request resolved: https://github.com/facebook/rocksdb/pull/12257 Test Plan: - New unit test failed before the fix but passes after - `python3 tools/db_crashtest.py --compaction_style=1 --ingest_external_file_one_in=10 --preclude_last_level_data_seconds=36000 --compact_files_one_in=10 --enable_blob_files=0 blackbox` - Rehearsal stress test Reviewed By: cbi42 Differential Revision: D52926092 Pulled By: hx235 fbshipit-source-id: 9e4dade0f6cc44e548db8fca27ccbc81a621cd6f	2024-01-24 11:21:05 -08:00
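The effect of the seqno mix-up in the commit above can be shown with a toy model. This is only a sketch assuming that, for a given key, the version with the highest seqno wins. `Version` and `VisibleVersion` are hypothetical names and not part of RocksDB.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of one stored version of a key.
struct Version {
  uint64_t seqno;
  std::string value;
};

// Returns the version a read would see under the assumed rule
// "highest seqno wins". Requires a non-empty list.
const Version& VisibleVersion(const std::vector<Version>& versions) {
  assert(!versions.empty());
  return *std::max_element(
      versions.begin(), versions.end(),
      [](const Version& a, const Version& b) { return a.seqno < b.seqno; });
}
```

With the numbers from the commit message, the existing data for key_overlap has seqno 4 and the newly ingested new_v was given seqno 2, so `VisibleVersion({{4, "old_v"}, {2, "new_v"}})` selects the old value even though new_v was written later.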
Peter Dillinger	b31f3245f1	Fix flaky test shutdown race in seqno_time_test (#12282 ) Summary: Seen in build-macos-cmake: ``` Received signal 11 (Segmentation fault: 11) #1 rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0::operator()(void) const (in seqno_time_test) (mock_time_env.cc:29) #2 decltype(std::declval<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&>()(std::declval<void>())) std::__1::__invoke[abi:v15006]<rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&, void>(rocksdb::MockSystemClock::InstallTimedWaitFixCallback()::$_0&, void&&) (in seqno_time_test) (invoke.h:394) ... ``` This is presumably because the std::function from the lambda only saves a copy of the SeqnoTimeTest* this pointer, which doesn't prevent it from being reclaimed on parallel shutdown. If we instead save a copy of the `std::shared_ptr<MockSystemClock>` in the std::function, this should prevent the crash. (Note that `SyncPoint::Data::Process()` copies the std::function before releasing the mutex for calling the callback.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12282 Test Plan: watch CI Reviewed By: cbi42 Differential Revision: D53027136 Pulled By: pdillinger fbshipit-source-id: 26cd9c0352541d806d42bb061dd349d3b47171a5	2024-01-24 10:14:22 -08:00
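The lifetime argument in the commit above (capturing a `std::shared_ptr` copy instead of a raw pointer) can be sketched in isolation. `FakeClock` and `MakeAdvanceCallback` are hypothetical names used only for illustration; they are not RocksDB's `MockSystemClock` or its callback.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Hypothetical stand-in for a mock clock.
struct FakeClock {
  int now = 0;
  void Advance(int delta) { now += delta; }
};

// Returns a callback that owns a shared_ptr copy of the clock, so the
// clock stays alive as long as the callback (or any copy of it) exists.
std::function<void(int)> MakeAdvanceCallback(std::shared_ptr<FakeClock> clock) {
  return [clock](int delta) { clock->Advance(delta); };
}
```

Had the lambda captured a raw `FakeClock*` or `this`, releasing the last owning `shared_ptr` would leave the callback with a dangling pointer. Copying the `std::function` also copies the captured `shared_ptr`, which is why a copy taken before a mutex is released still keeps the object alive.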
Richard Barnes	3079a7e7c2	Remove extra semi colon from internal_repo_rocksdb/repo/db/internal_stats.h (#12278 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12278 `-Wextra-semi` or `-Wextra-semi-stmt` If the code compiles, this is safe to land. Reviewed By: jaykorean Differential Revision: D52969116 fbshipit-source-id: 8cb28dafdbede54e8cb59c2b8d461b1eddb3de68	2024-01-24 07:22:10 -08:00
Changyu Bi	3ef9092487	Print additional information when flaky test DBTestWithParam.ThreadStatusSingleCompaction fails (#12268 ) Summary: The test is [flaky](https://github.com/facebook/rocksdb/actions/runs/7616272304/job/20742657041?pr=12257) but I could not reproduce the test failure. Add some debug prints to make the next failure more helpful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12268 Test Plan: ``` check print works when test fails: [ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0 thread id: 6134067200, thread status: thread id: 6133493760, thread status: Compaction db/db_test.cc:4680: Failure Expected equality of these values: op_count Which is: 1 expected_count Which is: 0 ``` Reviewed By: hx235 Differential Revision: D52987503 Pulled By: cbi42 fbshipit-source-id: 33b369796f9b97155578b45167e722ddcde93594	2024-01-23 10:07:06 -08:00
Andrew Kryczka	7fe93162c5	Log pending compaction bytes in a couple places (#12267 ) Summary: This PR adds estimated pending compaction bytes in two places: - The "Level summary", which is printed to the info LOG after every flush or compaction - The "rocksdb.cfstats" property, which is printed to the info LOG periodically according to `stats_dump_period_sec` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12267 Test Plan: Ran `./db_bench -benchmarks=filluniquerandom -stats_dump_period_sec=1 -statistics=true -write_buffer_size=524288` and looked at the LOG. ``` Compaction Stats [default] ... Estimated pending compaction bytes: 12117691 ... 2024/01/22-13:15:12.283563 1572872 (Original Log Time 2024/01/22-13:15:12.283540) [/db_impl/db_impl_compaction_flush.cc:371] [default] Level summary: files[10 1 0 0 0 0 0] max score 0.50, estimated pending compaction bytes 12359137 ``` Reviewed By: cbi42 Differential Revision: D52973337 Pulled By: ajkr fbshipit-source-id: c4e546bd9bdac387eebeeba303d04125212037b8	2024-01-23 09:14:59 -08:00
Yu Zhang	ef342246dc	Consolidate stats recording in error handler (#11992 ) Summary: This is a non functional refactor, mostly for deduplicating the stats recording logic in error handler. Plus some documentation update and simple code dedupe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11992 Test Plan: existing tests Reviewed By: hx235 Differential Revision: D52967713 Pulled By: jowlyzhang fbshipit-source-id: d584eae1a06410438f5a4c59c2cb67666ea7de1a	2024-01-22 14:57:30 -08:00
zaidoon	e572ae9f57	expose mode option to Rate Limiter via C API (#12259 ) Summary: addresses https://github.com/facebook/rocksdb/issues/12220 to allow rate limiting compaction but not flushes Pull Request resolved: https://github.com/facebook/rocksdb/pull/12259 Reviewed By: jaykorean Differential Revision: D52965342 Pulled By: ajkr fbshipit-source-id: 38566d9ac75c932c63e10cc53796fab0e46e3b2e	2024-01-22 11:45:53 -08:00
Changyu Bi	4b684e96b7	Allow more intra-L0 compaction when L0 is small (#12214 ) Summary: introduce a new option `intra_l0_compaction_size` to allow more intra-L0 compaction when total L0 size is under a threshold. This option applies only to leveled compaction. It is enabled by default and set to `max_bytes_for_level_base / max_bytes_for_level_multiplier` only for atomic_flush users. When atomic_flush=true, it is more likely that some CF's total L0 size is small when it's eligible for compaction. This option aims to reduce write amplification in this case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12214 Test Plan: - new unit test - benchmark: ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom --write_buffer_size=51200 --max_bytes_for_level_base=5242880 --level0_file_num_compaction_trigger=4 --statistics=1 main: fillrandom : 234.499 micros/op 4264 ops/sec 234.499 seconds 1000000 operations; 0.5 MB/s rocksdb.compact.read.bytes COUNT : 1490756235 rocksdb.compact.write.bytes COUNT : 1469056734 rocksdb.flush.write.bytes COUNT : 71099011 branch: fillrandom : 128.494 micros/op 7782 ops/sec 128.494 seconds 1000000 operations; 0.9 MB/s rocksdb.compact.read.bytes COUNT : 807474156 rocksdb.compact.write.bytes COUNT : 781977610 rocksdb.flush.write.bytes COUNT : 71098785 ``` Reviewed By: ajkr Differential Revision: D52637771 Pulled By: cbi42 fbshipit-source-id: 4f2c7925d0c3a718635c948ea0d4981ed9fabec3	2024-01-22 10:23:57 -08:00
Peter Dillinger	cb08a682d4	Fix/cleanup SeqnoToTimeMapping (#12253 ) Summary: The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios. * Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc. * Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.) * Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior. * Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. 
This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.) * Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries. * Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.` * Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior. * Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file. Intended follow-up (me or others): * Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense. 
* There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL. * The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253 Test Plan: unit tests updated Reviewed By: jowlyzhang Differential Revision: D52913733 Pulled By: pdillinger fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593	2024-01-19 21:50:38 -08:00
Changyu Bi	ec5b1be18d	Deflake `PerfContextTest.CPUTimer` (#12252 ) Summary: We saw failures like ``` db/perf_context_test.cc:952: Failure Expected: (next_count) > (count), actual: 26699 vs 26699 ``` I can repro by running the test repeatedly and the test fails with different seek keys. So the cause is likely not with Seek() implementation. I found that `clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);` can return the same time when called repeatedly. However, I don't know if Seek() is fast enough that this happened during continuous test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12252 Test Plan: `gtest_parallel.py --repeat=10000 --workers=1 ./perf_context_test --gtest_filter="PerfContextTest.CPUTimer"` Reviewed By: ajkr Differential Revision: D52912751 Pulled By: cbi42 fbshipit-source-id: 8985ae93baa99cdf4b9136ea38addd2e41f4b202	2024-01-19 10:13:52 -08:00
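The observation in the commit above, that `clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...)` can return the same value on back-to-back calls, suggests that a test should only expect a strict increase after doing enough work. The following POSIX-only sketch shows one way to express that. `ThreadCpuNanos` and `BurnUntilCpuClockAdvances` are hypothetical helpers, and this is not necessarily how the commit changed the test.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Reads the calling thread's CPU time in nanoseconds (POSIX only).
uint64_t ThreadCpuNanos() {
  timespec ts;
  int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
  assert(rc == 0);
  (void)rc;
  return static_cast<uint64_t>(ts.tv_sec) * 1000000000ull +
         static_cast<uint64_t>(ts.tv_nsec);
}

// Spins on the CPU until the thread CPU clock reads strictly more than
// `start`, then returns the new reading.
uint64_t BurnUntilCpuClockAdvances(uint64_t start) {
  volatile uint64_t sink = 0;
  uint64_t now = start;
  while (now <= start) {
    for (uint64_t i = 0; i < 10000; ++i) {
      sink = sink + i;  // busy work that consumes thread CPU time
    }
    now = ThreadCpuNanos();
  }
  return now;
}
```

Two consecutive `ThreadCpuNanos()` calls are only guaranteed to be non-decreasing, so an assertion of the form `next > prev` is safe only after some measurable amount of CPU work.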
anand76	65e162bf09	Add some asserts in FilePickerMultiGet for debugging (#12241 ) Summary: Add asserts to help debug a crash test failure. The test fails as follows - ```rocksdb::FilePickerMultiGet::PrepareNextLevel(): Assertion `fp_ctx.search_right_bound == -1 || fp_ctx.search_right_bound == FileIndexer::kLevelMaxIndex' failed``` Also add a unit test to verify an edge case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12241 Reviewed By: cbi42 Differential Revision: D52819029 Pulled By: anand1976 fbshipit-source-id: 33316985c8ace1aed9ecc2400da8b777aec488ff	2024-01-16 17:08:58 -08:00
akankshamahajan	cad76a2e1e	Fix bug in auto_readahead_size that returned wrong key (#12229 ) Summary: IndexType::kBinarySearchWithFirstKey + BlockCacheLookupForReadAheadSize enabled => FindNextUserEntryInternal assertion fails or iterator lands at a wrong key because BlockCacheLookupForReadAheadSize moves the index_iter_ and in internal_wrapper.h, result_.key didn't update and pointed to wrong key. Also ikey_ was also pointing to iter_.key() instead of copying the key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12229 Test Plan: ``` rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 mkdir /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 ./db_stress -threads=1 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_setting_blob_options_dynamically=0 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=0 --backup_one_in=0 --batch_protection_bytes_per_key=0 --blob_cache_size=0 --blob_compaction_readahead_size=0 --blob_compression_type=lz4 --blob_file_size=0 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0 --blob_garbage_collection_force_threshold=0 --block_protection_bytes_per_key=0 --block_size=2048 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=0 --compaction_ttl=0 
--compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 --db_write_buffer_size=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=3 --ingest_external_file_one_in=10 --initial_auto_readahead_size=0 --iterpercent=55 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_blob_size=8 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 
--open_write_fault_one_in=0 --ops_per_thread=10000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=1 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --snapshot_hold_ops=0 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_blob_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=0 --verify_checksum_one_in=0 --verify_db_one_in=0 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=1 --verify_sst_unique_id_in_manifest=0 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=0 > repro.out Verification failed. Expected state has key 0000000000000077000000000000004178, iterator is at key 0000000000000077000000000000008A78 Column family: default, op_logs: S 0000000000000077000000000000003D7878787878 NNNN No writes or ops? Verification failed :( ``` Reviewed By: ajkr Differential Revision: D52710655 Pulled By: akankshamahajan15 fbshipit-source-id: 9d2e684e190fb0832bdce3337bce1c6548cd054d	2024-01-16 11:30:36 -08:00
Jonah Gao	e28251ca72	Fix blob files not reclaimed after deleting all SSTs (#12235 ) Summary: Fix issue https://github.com/facebook/rocksdb/issues/12208. After all the SSTs have been deleted, all the blob files will become unreferenced. These files should be considered obsolete and thus, should not be saved to the vstorage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12235 Reviewed By: jowlyzhang Differential Revision: D52806441 Pulled By: ltamasi fbshipit-source-id: 62f94d4f2544ed2822c764d8ace5bf7f57efe42d	2024-01-16 11:15:23 -08:00
Andrew Kryczka	2dda7a0dd2	Detect compaction pressure at lower debt ratios (#12236 ) Summary: This PR significantly reduces the compaction pressure threshold introduced in https://github.com/facebook/rocksdb/issues/12130 by a factor of 64x. The original number was too high to trigger in scenarios where compaction parallelism was needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12236 Reviewed By: cbi42 Differential Revision: D52765685 Pulled By: ajkr fbshipit-source-id: 8298e966933b485de24f63165a00e672cb9db6c4	2024-01-15 22:41:18 -08:00
马越	1a1f9f1660	Fix the compactRange with wrong cf handle when ClipColumnFamily (#12219 ) Summary: - Context: In ClipColumnFamily, the DeleteRange API is used to delete data, and then CompactRange is called for physical deletion. However, the ColumnFamilyHandle is currently not passed, so by default only the DefaultColumnFamily is compacted. As a result, data in some SST files covered by the CompactRange may not be physically deleted. - In this change: pass the ColumnFamilyHandle when calling CompactRange. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12219 Reviewed By: ajkr Differential Revision: D52665162 Pulled By: cbi42 fbshipit-source-id: e8e997aa25ec4ca40e347be89edc7e84a7a0edce	2024-01-10 14:34:12 -08:00
Andrew Kryczka	5a9ecf6614	Automated modernization (#12210 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12210 Reviewed By: hx235 Differential Revision: D52559771 Pulled By: ajkr fbshipit-source-id: 1ccdd3a0180cc02bc0441f20b0e4a1db50841b03	2024-01-05 11:53:57 -08:00
akankshamahajan	5cb2d09d47	Refactor FilePrefetchBuffer code (#12097 ) Summary: - Refactor FilePrefetchBuffer code - Implementation: FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and a deque of buffers (bufs_) that contain the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. the requested offset), that buffer is cleared and returned to free_bufs_. If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers containing prefetched data are maintained. If num_buffers_ == 1, it's a sequential read flow. The Read API will be called on that one buffer whenever the data is requested and is not in the buffer. If num_buffers_ > 1, then the data is prefetched asynchronously into the buffers whenever data is consumed from the buffers and a buffer is freed. If num_buffers_ > 1, then the requested data can overlap between 2 buffers. To return a contiguous buffer, overlap_bufs_ is used. The requested data is copied from the 2 buffers to overlap_bufs_ and overlap_bufs_ is returned to the caller. - Merged Sync and Async code flow into one in FilePrefetchBuffer. Test Plan - - Crash test passed - Unit tests - Pending - Benchmarks Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097 Reviewed By: ajkr Differential Revision: D51759552 Pulled By: akankshamahajan15 fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5	2024-01-05 09:29:01 -08:00
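The free/in-use buffer rotation described in the commit above can be modeled with two deques. This is a simplified sketch under assumptions: `Buf` and `PrefetchBufferPoolSketch` are hypothetical, the "prefetch" is just a string assignment, and the overlap buffer and asynchronous reads are left out. Only the member names `free_bufs_` and `bufs_` come from the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical buffer: a file offset plus the bytes prefetched there.
struct Buf {
  size_t offset = 0;
  std::string data;
};

class PrefetchBufferPoolSketch {
 public:
  explicit PrefetchBufferPoolSketch(size_t num_buffers)
      : free_bufs_(num_buffers) {}

  // Moves one free buffer into the in-use queue and fills it.
  // Returns false when no free buffer is available.
  bool Prefetch(size_t offset, std::string data) {
    if (free_bufs_.empty()) {
      return false;
    }
    Buf b = std::move(free_bufs_.front());
    free_bufs_.pop_front();
    b.offset = offset;
    b.data = std::move(data);
    bufs_.push_back(std::move(b));
    return true;
  }

  // Clears in-use buffers that end at or before `offset` (outdated with
  // respect to the requested offset) and returns them to the free list.
  void FreeOutdated(size_t offset) {
    while (!bufs_.empty() &&
           bufs_.front().offset + bufs_.front().data.size() <= offset) {
      Buf b = std::move(bufs_.front());
      bufs_.pop_front();
      b.data.clear();
      free_bufs_.push_back(std::move(b));
    }
  }

  size_t NumFree() const { return free_bufs_.size(); }
  size_t NumInUse() const { return bufs_.size(); }

 private:
  std::deque<Buf> free_bufs_;  // cleared buffers ready for reuse
  std::deque<Buf> bufs_;       // buffers holding prefetched data
};
```

With two buffers, a third `Prefetch` call fails until `FreeOutdated` recycles a consumed buffer, which mirrors the rule that a buffer is sent for prefetching only when one is available in `free_bufs_`.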
Hui Xiao	81b6296c7e	Pass flush IO activity enum in FlushJob::MaybeIncreaseFullHistoryTsLowToAboveCutoffUDT...() (#12197 ) Summary: Context/Summary: as titled Pull Request resolved: https://github.com/facebook/rocksdb/pull/12197 Test Plan: ``` ./db_stress --acquire_snapshot_one_in=100 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=4.393039399748979 --bottommost_compression_type=disable --bottommost_file_compaction_delay=86400 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_ttl=0 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected 
--fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=1000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=10000 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=8388608 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --persist_user_defined_timestamps=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=55 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 
--target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=3 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_txn=0 --use_write_buffer_manager=0 --user_timestamp_size=8 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=10000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=128 --writepercent=35 ``` Before fix: ``` db_stress_tool/db_stress_env_wrapper.h:92: virtual rocksdb::IOStatus rocksdb::DbStressWritableFileWrapper::Append(const rocksdb::Slice &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `io_activity == Env::IOActivity::kUnknown || io_activity == options.io_activity' failed. ``` After fix: Succeed Reviewed By: ajkr Differential Revision: D52492030 Pulled By: hx235 fbshipit-source-id: 842a0dcbdf135838b57ddb4a3a6f1effc8dd3e82	2024-01-02 17:33:00 -08:00
leipeng	d411fc4dd6	column_family.cc: SanitizeOptions(dbo, cfo): WARN msg: add missing spaces (#12193 ) Summary: Fix for multi line strings missing spaces. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12193 Reviewed By: cbi42 Differential Revision: D52457430 Pulled By: ajkr fbshipit-source-id: 4ca75a14e61c09819e5d821da6137f4536e9e76e	2024-01-02 11:18:11 -08:00
leipeng	906c6683ed	InternalKey::Set: remove redundant assign (#12194 ) Summary: InternalKey::Set: remove redundant assign Pull Request resolved: https://github.com/facebook/rocksdb/pull/12194 Reviewed By: cbi42 Differential Revision: D52457542 Pulled By: ajkr fbshipit-source-id: 329983a8734ff38ffd93018bbbe112b4a23b5c11	2024-01-02 11:17:39 -08:00
Hui Xiao	06e593376c	Group SST write in flush, compaction and db open with new stats (#11910 ) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS Some related code refactoring to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include files that failed to sync and bytes that failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOptions so BuildTable() does not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOptions::rate_limiter_priority. Similar for BlobFileBuilder. 
This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db open ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: 
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. 
See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre\|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. 
"ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff	2023-12-29 15:29:23 -08:00
anand76	a036525809	Lightweight verification of MANIFEST file after close on shutdown (#12174 ) Summary: Do a size verification on the MANIFEST file during DB shutdown, after closing the file. If the verification fails, write a new MANIFEST file. In the future, we can do a more thorough verification if we want to. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12174 Test Plan: Unit test, and some manual verification Reviewed By: ajkr Differential Revision: D52451184 Pulled By: anand1976 fbshipit-source-id: fc3bc170e22f6c9a9c482ee5ff592abab889df83	2023-12-28 18:25:29 -08:00
hulk	b7ecbe309d	Trigger compaction to the next level if the data age exceeds periodic_compaction_seconds (#12175 ) Summary: Currently, data is always compacted to the same level when its age exceeds periodic_compaction_seconds, which may confuse users, so this change allows the compaction to be triggered to the next level. This is a behavior change and may affect users who have disabled their ttl or set ttl > periodic_compaction_seconds. Related issue: https://github.com/facebook/rocksdb/issues/12165 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12175 Reviewed By: ajkr Differential Revision: D52446722 Pulled By: cbi42 fbshipit-source-id: ccd3d2c6434ed77055735a03408d4a62d119342f	2023-12-28 12:50:08 -08:00
Changyu Bi	3d81f175b4	Prioritize marked file in level compaction (#12187 ) Summary: When ranking file by compaction priority in a level, prioritize files marked for compaction over files that are not marked. This only applies to default CompactPri kMinOverlappingRatio for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12187 Test Plan: * New unit tests Reviewed By: ajkr Differential Revision: D52437194 Pulled By: cbi42 fbshipit-source-id: 65ea9ce5bb421e598d539a55c8219b70844b82b3	2023-12-28 10:28:37 -08:00
darionyaphet	01f2edd145	Replace push_back by emplace_back in wal manager (#10805 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10805 Reviewed By: ajkr Differential Revision: D52424928 Pulled By: hx235 fbshipit-source-id: 548e3304ca721a3907be3696d12735929aca8490	2023-12-27 10:40:33 -08:00
Andrew Kryczka	4fefe1fed9	Downgrade warning for dynamic leveling with non-leveled compaction (#12186 ) Summary: Now that `level_compaction_dynamic_level_bytes`'s default value is true, users who do not touch that setting and use non-leveled compaction will also see this log message. It can be info level rather than warning since, in the case mentioned, there is nothing the user needs to be warned about. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12186 Reviewed By: cbi42 Differential Revision: D52422499 Pulled By: ajkr fbshipit-source-id: 8dbfcd102aab671b881ba047fb4a0a555b3e0a78	2023-12-26 15:13:42 -08:00
Peter Dillinger	a771a47a1b	Fix leak or crash on failure in automatic atomic flush (#12176 ) Summary: Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test: ``` db_flush_test: db/column_family.cc:1648: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176 Test Plan: unit test added Reviewed By: cbi42 Differential Revision: D52417071 Pulled By: pdillinger fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5	2023-12-26 11:04:25 -08:00
zaidoon	ad0362ac92	Expose Options::ttl through C API (#12170 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12170 Reviewed By: jaykorean Differential Revision: D52378902 Pulled By: cbi42 fbshipit-source-id: 0bac94b8785d5149df86e7317e69c0e64beab887	2023-12-21 15:04:53 -08:00
anand76	cc069f25b3	Add some compressed and tiered secondary cache stats (#12150 ) Summary: Add statistics for more visibility. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12150 Reviewed By: akankshamahajan15 Differential Revision: D52184633 Pulled By: anand1976 fbshipit-source-id: 9969e05d65223811cd12627102b020bb6d229352	2023-12-15 11:34:08 -08:00
Akanksha Mahajan	cd577f6059	Fix WRITE_STALL start_time (#12147 ) Summary: `Delayed` is set to true in two cases. One is when `delay` is specified. The other is in the `while` loop - `cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)`. However, start_time is not initialized in the second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time); Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147 Test Plan: Existing CircleCI Reviewed By: cbi42 Differential Revision: D52173481 Pulled By: akankshamahajan15 fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98	2023-12-14 13:45:06 -08:00
akankshamahajan	d926593df5	Fix stress tests failure for auto_readahead_size (#12131 ) Summary: When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter so that - BlockBasedTableIterator can point index_iter_ to the right block. - disable readahead_cache_lookup. However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This is more of a corner case. To handle it, this PR removes the SeekForPrev call from db_iter and reseeks index_iter_ in the Prev operation. block_iter_'s key already points to the right block, so reseeking index_iter_ solves the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131 Test Plan: - Tested on db_stress command that was failing - `./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 
--compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 
--partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35` - make crash_test -j32 Reviewed By: anand1976 Differential Revision: D51986326 Pulled By: akankshamahajan15 fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9	2023-12-13 12:15:04 -08:00
Andrew Kryczka	d8e47620d7	Speedup based on pending compaction bytes relative to data size (#12130 ) Summary: RocksDB self throttles per-DB compaction parallelism until it detects compaction pressure. The pressure detection based on pending compaction bytes was only comparing against the slowdown trigger (`soft_pending_compaction_bytes_limit`). Online services tend to set that extremely high to avoid stalling at all costs. Perhaps they should have set it to zero, but we never documented that zero disables stalling so I have been telling everyone to increase it for years. This PR adds pressure detection based on pending compaction bytes relative to the size of bottommost data. The size of bottommost data should be fairly stable and proportional to the logical data size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12130 Reviewed By: hx235 Differential Revision: D52000746 Pulled By: ajkr fbshipit-source-id: 7e1fd170901a74c2d4a69266285e3edf6e7631c7	2023-12-13 10:37:27 -08:00
Peter Dillinger	c96d9a0fbb	Allow TablePropertiesCollectorFactory to return null collector (#12129 ) Summary: As part of building another feature, I wanted this: * Custom implementations of `TablePropertiesCollectorFactory` may now return a `nullptr` collector to decline processing a file, reducing callback overheads in such cases. * Polished, clarified some related API comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12129 Test Plan: unit test added Reviewed By: ltamasi Differential Revision: D51966667 Pulled By: pdillinger fbshipit-source-id: 2991c08fe6ce3a8c9f14c68f1495f5a17bca2770	2023-12-11 12:02:56 -08:00
Kevin Mingtarja	44fd914128	Fix double counting of BYTES_WRITTEN ticker (#12111 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/12061. We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable. The cause of this bug is that previously, during writes, we didn't take into account `writer->ShouldWriteToMemtable()` before adding to `total_byte_size`, so the size was still added during the `Prepare()` phase even though we're not writing to the Memtable, which is why we saw the value to be double of what's written to WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111 Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes. Reviewed By: jaykorean Differential Revision: D51954327 Pulled By: jowlyzhang fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c	2023-12-08 17:12:11 -08:00
Levi Tamasi	a143f93236	Turn the default Timer in PeriodicTaskScheduler into a leaky Meyers singleton (#12128 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12128 The patch turns the `Timer` Meyers singleton in `PeriodicTaskScheduler::Default()` into one of the leaky variety in order to prevent static destruction order issues. Reviewed By: akankshamahajan15 Differential Revision: D51963950 fbshipit-source-id: 0fc34113ad03c51fdc83bdb8c2cfb6c9f6913948	2023-12-08 10:34:07 -08:00
Levi Tamasi	0ebe1614cb	Eliminate some code duplication in MergeHelper (#12121 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12121 The patch eliminates some code duplication by unifying the two sets of `MergeHelper::TimedFullMerge` overloads using variadic templates. It also brings the order of parameters into sync when it comes to the various `TimedFullMerge*` methods. Reviewed By: jaykorean Differential Revision: D51862483 fbshipit-source-id: e3f832a6ff89ba34591451655cf11025d0a0d018	2023-12-05 14:07:42 -08:00
Yu Zhang	ba8fa0f546	internal_repo_rocksdb (4372117296613874540) (#12117 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12117 Reviewed By: ajkr Differential Revision: D51745846 Pulled By: jowlyzhang fbshipit-source-id: 51c806a484b3b43d174b06d2cfe9499191d09914	2023-12-04 11:17:32 -08:00
Yu Zhang	d68f45e777	Flush buffered logs when FlushRequest is rescheduled (#12105 ) Summary: The optimization to not find and delete obsolete files when FlushRequest is re-scheduled also inadvertently skipped flushing the `LogBuffer`, resulting in missed logs. This PR fixes the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12105 Test Plan: manually check this test has the correct info log after the fix `./column_family_test --gtest_filter=ColumnFamilyRetainUDTTest.NotAllKeysExpiredFlushRescheduled` Reviewed By: ajkr Differential Revision: D51671079 Pulled By: jowlyzhang fbshipit-source-id: da0640e07e35c69c08988772ed611ec9e67f2e92	2023-11-29 11:35:59 -08:00
cz2h	324453e579	Fix rowcache get returning incorrect timestamp (#11952 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/7930. When there is a timestamp associated with stored records, get from row cache will return the timestamp provided in query instead of the timestamp associated with the stored record. ## Cause of error: Currently a row_handle is fetched using row_cache_key(contains a timestamp provided by user query) and the row_handle itself does not persist timestamp associated with the object. Hence the [GetContext::SaveValue() ](`6e3429b8a6/table/get_context.cc (L257)`) function will fetch the timestamp in row_cache_key and may return the incorrect timestamp value. ## Proposed Solution If current cf enables ts, append a timestamp associated with stored records after the value in replay_log (equivalently the value of row cache entry). When read, `replayGetContextLog()` will update parsed_key with the correct timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11952 Reviewed By: ajkr Differential Revision: D51501176 Pulled By: jowlyzhang fbshipit-source-id: 808fc943a8ae95de56ae0e82ec59a2573a031f28	2023-11-21 20:39:33 -08:00
Changyu Bi	fb5c8c7ea3	Do not compare op_type in `WithinPenultimateLevelOutputRange()` (#12081 ) Summary: `WithinPenultimateLevelOutputRange()` is updated in https://github.com/facebook/rocksdb/issues/12063 to check internal key range. However, op_type of a key can change during compaction, e.g. MERGE -> PUT, which makes the key larger and puts it outside the penultimate output range. This has caused stress test failures with error message "Unsafe to store Seq later than snapshot in the last level if per_key_placement is enabled". So update `WithinPenultimateLevelOutputRange()` to only check user key and sequence number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12081 Test Plan: * This repro can produce the corruption within a few runs. Ran it a few times after the fix and did not see Corruption failure. ``` python3 ./tools/db_crashtest.py whitebox --test_tiered_storage --random_kill_odd=888887 --use_merge=1 --writepercent=100 --readpercent=0 --prefixpercent=0 --delpercent=0 --delrangepercent=0 --iterpercent=0 --write_buffer_size=419430 --column_families=1 --read_fault_one_in=0 --write_fault_one_in=0 ``` Reviewed By: ajkr Differential Revision: D51481202 Pulled By: cbi42 fbshipit-source-id: cad6b65099733e03071b496e752bbdb09cf4db82	2023-11-20 17:07:28 -08:00
Benoît Mériaux	7780e98268	add write_buffer_manager setter into options and tests in c bindings, (#12007 ) Summary: following https://github.com/facebook/rocksdb/pull/11710 - add test on wbm c api - add a setter for WBM in `DBOptions` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12007 Reviewed By: cbi42 Differential Revision: D51430042 Pulled By: ajkr fbshipit-source-id: 608bc4d3ed35a84200459d0230b35be64b3475f7	2023-11-17 11:34:05 -08:00
Changyu Bi	4e58cc6437	Check internal key range when compacting from last level to penultimate level (#12063 ) Summary: The test failure in https://github.com/facebook/rocksdb/issues/11909 shows that we may compact keys outside of internal key range of penultimate level input files from last level to penultimate level, which can potentially cause overlapping files in the penultimate level. This PR updates the `Compaction::WithinPenultimateLevelOutputRange()` to check internal key range instead of user key. Other fixes: * skip range del sentinels when deciding output level for tiered compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/12063 Test Plan: - existing unit tests - apply the fix to https://github.com/facebook/rocksdb/issues/11905 and run `./tiered_compaction_test --gtest_filter="RangeDelsCauseFileEndpointsToOverlap"` Reviewed By: ajkr Differential Revision: D51288985 Pulled By: cbi42 fbshipit-source-id: 70085db5f5c3b15300bcbc39057d57b83fd9902a	2023-11-17 10:50:40 -08:00
Gus Wynn	6d10f8d690	add WriteBufferManager to c api (#11710 ) Summary: I want to use the `WriteBufferManager` in my Rust project, which requires exposing it through the C API, just like `Cache` is. Hopefully the changes are fairly straightforward! Pull Request resolved: https://github.com/facebook/rocksdb/pull/11710 Reviewed By: cbi42 Differential Revision: D51166518 Pulled By: ajkr fbshipit-source-id: cd266ff1e4a7ab145d05385cd125a8390f51f3fc	2023-11-16 10:34:00 -08:00
Andrew Kryczka	9202db1867	Consider archived WALs for deletion more frequently (#12069 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/11000. That issue pointed out that RocksDB was slow to delete archived WALs in case time-based and size-based expiration were enabled, and the time-based threshold (`WAL_ttl_seconds`) was small. This PR prevents the delay by taking into account `WAL_ttl_seconds` when deciding the frequency to process archived WALs for deletion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12069 Reviewed By: pdillinger Differential Revision: D51262589 Pulled By: ajkr fbshipit-source-id: e65431a06ee96f4c599ba84a27d1aedebecbb003	2023-11-15 15:42:28 -08:00
Changyu Bi	e7896f03ad	Enable unit test `PrecludeLastLevelTest.RangeDelsCauseFileEndpointsToOverlap` (#12064 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/11909. The test passes after the change in https://github.com/facebook/rocksdb/issues/11917 to start mock clock from a non-zero time. The reason for the test failure is a bit complicated: - The Put here `e4ad4a0ef1/db/compaction/tiered_compaction_test.cc (L2045)` happens before mock clock advances beyond 0. - This causes oldest_key_time_ to be 0 for memtable. - oldest_ancester_time of the first L0 file becomes 0 - L0 -> L5/6 compaction output files set `oldest_ancester_time` to the current time due to these lines: `509947ce2c/db/compaction/compaction_job.cc (L1898C34-L1904)`. - This causes some small sequence number to be mapped to current time: `509947ce2c/db/compaction/compaction_job.cc (L301)` - Keys in L6 are being moved up to L5 due to the unexpected seqno_to_time mapping - When compacting keys from last level to the penultimate level, we only check keys to be within user key range of penultimate level input files. If we compact the following file 3 with file 1 and output keys to L5, we can get the reported inconsistency bug. ``` L5: file 1 [K5@20, K10@kMaxSeqno], file 2 [K10@30, K14@34) L6: file 3 [K6@5, K10@20] ``` https://github.com/facebook/rocksdb/issues/12063 will add fixes to check internal key range when compacting keys from last level up to the penultimate level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12064 Test Plan: the unit test passes Reviewed By: ajkr Differential Revision: D51281149 Pulled By: cbi42 fbshipit-source-id: 00b7f026c453454d9f3af5b2de441383a96f0c62	2023-11-13 15:26:52 -08:00
Jay Huh	8b8f6c63ef	ColumnFamilyHandle Nullcheck in GetEntity and MultiGetEntity (#12057 ) Summary: - Add missing null check for ColumnFamilyHandle in `GetEntity()` - `FailIfCfHasTs()` now returns `Status::InvalidArgument()` if `column_family` is null. `MultiGetEntity()` can rely on this for cfh null check. - Added `DeleteRange` API using Default Column Family to be consistent with other major APIs (This was also causing Java Test failure after the `FailIfCfHasTs()` change) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12057 Test Plan: - Updated `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` to include null CF case - Updated `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` to include null CF case Reviewed By: jowlyzhang Differential Revision: D51167445 Pulled By: jaykorean fbshipit-source-id: 1c1e44fd7b7df4d2dc3bb2d7d251da85bad7d664	2023-11-13 14:30:04 -08:00
leipeng	b3ffca0e29	DBImpl::DelayWrite: Remove bad WRITE_STALL histogram (#12067 ) Summary: When delay didn't happen, histogram WRITE_STALL is still recorded, and ticker STALL_MICROS is not recorded. This is a bug: neither WRITE_STALL nor STALL_MICROS should be recorded when delay did not happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12067 Reviewed By: cbi42 Differential Revision: D51263133 Pulled By: ajkr fbshipit-source-id: bd82d8328fe088d613991966e83854afdabc6a25	2023-11-13 12:48:44 -08:00
Yu Zhang	509947ce2c	Quarantine files in a limbo state after a manifest error (#12030 ) Summary: Part of the procedure to handle a manifest IO error is to disable file deletion in case some files in limbo state get deleted prematurely. This is not ideal because: 1) not all the VersionEdits whose commit encounters such an error contain updates for files, so disabling file deletion is sometimes unnecessary. 2) `EnableFileDeletion` has a force mode that could make other threads accidentally disrupt this procedure in recovery. 3) Disabling file deletion as a whole is also not as efficient as more precisely tracking impacted files to keep them from being prematurely deleted. This PR replaces this mechanism with tracking such files and quarantining them from being deleted in `ErrorHandler`. These are the types of files being actively tracked in quarantine in this PR: 1) new table files and blob files from a background job 2) old manifest file whose immediately following new manifest file's CURRENT file creation gets into an unclear state. Current handling is not sufficient to make sure the old manifest file is kept in case it's needed. Note that WAL logs are not part of the quarantine because `min_log_number_to_keep` is a safe mechanism and it's only updated after successful manifest commits so it can prevent this premature deletion issue from happening. We track these files' file numbers because they share the same file number space. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12030 Test Plan: Modified existing unit tests Reviewed By: ajkr Differential Revision: D51036774 Pulled By: jowlyzhang fbshipit-source-id: 84ef26271fbbc888ef70da5c40fe843bd7038716	2023-11-11 08:11:11 -08:00
Yu Zhang	c6c683a0ca	Remove the default force behavior for `EnableFileDeletion` API (#12001 ) Summary: Disabling file deletion can be critical for operations like making a backup, recovery from manifest IO error (for now). Ideally as long as there is one caller requesting file deletion disabled, it should be kept disabled until all callers agree to re-enable it. So this PR removes the default forcing behavior for the `EnableFileDeletion` API, and users need to explicitly pass the argument if they insist on doing so, knowing the consequence of what can be potentially disrupted. This PR removes the API's default argument value so it will cause breakage for all users that are relying on the default value, regardless of whether the forcing behavior is critical for them. When fixing this breakage, it's good to check if the forcing behavior is indeed needed and potential disruption is OK. This PR also makes unit tests that do not need the force behavior do a regular enable file deletion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12001 Reviewed By: ajkr Differential Revision: D51214683 Pulled By: jowlyzhang fbshipit-source-id: ca7b1ebf15c09eed00f954da2f75c00d2c6a97e4	2023-11-10 14:35:54 -08:00
Yueh-Hsuan Chiang	5ef92b8ea4	Add rocksdb_options_set_cf_paths (#11151 ) Summary: This PR adds a missing set function for rocksdb_options in the C-API: rocksdb_options_set_cf_paths(). Without this function, users cannot specify different paths for different column families as it will fall back to db_paths. As a bonus, this PR also includes rocksdb_sst_file_metadata_get_directory() to the C api -- a missing public function that will also make the test easier to write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11151 Test Plan: Augment existing c_test to verify the specified cf_path. Reviewed By: hx235 Differential Revision: D51201888 Pulled By: ajkr fbshipit-source-id: 62a96451f26fab60ada2005ede3eea8e9b431f30	2023-11-10 11:36:11 -08:00
Yueh-Hsuan Chiang	73d223c4e2	Add auto_tuned option to RateLimiter C API (#12058 ) Summary: #### Problem While the RocksDB C API does have the RateLimiter API, it does not expose the auto_tuned option. #### Summary of Change This PR exposes auto_tuned RateLimiter option in RocksDB C API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12058 Test Plan: Augment the C API existing test to cover the new API. Reviewed By: cbi42 Differential Revision: D51201933 Pulled By: ajkr fbshipit-source-id: 5bc595a9cf9f88f50fee797b729ba96f09ed8266	2023-11-10 09:53:09 -08:00
Yu Zhang	dfaf4dc111	Stubs for piping write time (#12043 ) Summary: As titled. This PR contains the API and stubbed implementation for piping write time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12043 Reviewed By: pdillinger Differential Revision: D51076575 Pulled By: jowlyzhang fbshipit-source-id: 3b341263498351b9ccaff27cf35d5aeb5bdf0cf1	2023-11-09 15:58:07 -08:00
brodyhuang	e90e9825b4	Drop wal record when sequence is illegal (#11985 ) Summary: - Our database was corrupted, leaving some WAL records with invalid sequence numbers (although the `record_checksum` looks fine). - When we RecoverLogFiles in WALRecoveryMode::kPointInTimeRecovery, `assert(seq <= kMaxSequenceNumber)` fails. - When an illegal sequence number is found, can we drop the file to recover as much data as possible? Thanks! Pull Request resolved: https://github.com/facebook/rocksdb/pull/11985 Reviewed By: anand1976 Differential Revision: D50698039 Pulled By: ajkr fbshipit-source-id: 1e42113b58823088d7c0c3a92af5b3efbb5f5296	2023-11-09 10:43:16 -08:00
Hui Xiao	f337533b6f	Ensure and clarify how RocksDB calls TablePropertiesCollector's functions (#12053 ) Summary: Context/Summary: Given the word "finish", it's intuitive for users to assume `TablePropertiesCollector::Finish()` is called only once by RocksDB internally. However, this is currently not true as RocksDB also calls this function in `BlockBased/PlainTableBuilder::GetTableProperties()` to populate user collected properties on demand. This PR avoids that by moving that populating to where we first call `Finish()` (i.e., `NotifyCollectTableCollectorsOnFinish`) Bonus: clarified in the API that `GetReadableProperties()` will be called after `Finish()` and added UT to ensure that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12053 Test Plan: - Modified test `DBPropertiesTest.GetUserDefinedTableProperties` to ensure `Finish()` is only called once. - Existing tests, particularly `db_properties_test, table_properties_collector_test`, verify that the functionality of `NotifyCollectTableCollectorsOnFinish` and `GetReadableProperties()` is not broken by this change. Reviewed By: ajkr Differential Revision: D51095434 Pulled By: hx235 fbshipit-source-id: 1c6275258f9b99dedad313ee8427119126817973	2023-11-08 14:00:36 -08:00
Zaidoon Abd Al Hadi	58f2a29fb4	Expose Options::periodic_compaction_seconds through C API (#12019 ) Summary: fixes [11090](https://github.com/facebook/rocksdb/issues/11090) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12019 Reviewed By: jaykorean Differential Revision: D51076427 Pulled By: cbi42 fbshipit-source-id: de353ff66c7f73aba70ab3379e20d8c40f50d873	2023-11-07 12:46:50 -08:00
Jay Huh	2adef5367a	AttributeGroups - PutEntity Implementation (#11977 ) Summary: Write Path for AttributeGroup Support. The new `PutEntity()` API uses `WriteBatch` and atomically writes WideColumns entities in multiple Column Families. Combined the release note from PR https://github.com/facebook/rocksdb/issues/11925 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11977 Test Plan: - `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` updated - `WriteBatchTest::AttributeGroupTest` added - `WriteBatchTest::AttributeGroupSavePointTest` added Reviewed By: ltamasi Differential Revision: D50457122 Pulled By: jaykorean fbshipit-source-id: 4997b265e415588ce077933082dcd1ac3eeae2cd	2023-11-06 16:52:51 -08:00
Jay Huh	0ecfc4fbb4	AttributeGroups - GetEntity Implementation (#11943 ) Summary: Implementation of `GetEntity()` API that returns wide-column entities as AttributeGroups from multiple column families for a single key. Regarding the definition of Attribute groups, please see the detailed example description in PR https://github.com/facebook/rocksdb/issues/11925 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11943 Test Plan: - `DBWideBasicTest::GetEntityAsPinnableAttributeGroups` added will enable the new API in the `db_stress` after merging Reviewed By: ltamasi Differential Revision: D50195794 Pulled By: jaykorean fbshipit-source-id: 218d54841ac7e337de62e13b1233b0a99bd91af3	2023-11-06 15:04:41 -08:00
Jay Huh	2dab137182	Mark more files for periodic compaction during offpeak (#12031 ) Summary: - The struct previously named `OffpeakTimeInfo` has been renamed to `OffpeakTimeOption` to indicate that it's a user-configurable option. Additionally, a new struct, `OffpeakTimeInfo`, has been introduced, which includes two fields: `is_now_offpeak` and `seconds_till_next_offpeak_start`. This change prevents the need to parse the `daily_offpeak_time_utc` string twice. - It's worth noting that we may consider adding more fields to the `OffpeakTimeInfo` struct, such as `elapsed_seconds` and `total_seconds`, as needed for further optimization. - Within `VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction()`, we've adjusted the `allowed_time_limit` to include files that are expected to expire by the next offpeak start. - We might explore further optimizations, such as evenly distributing files to mark during offpeak hours, if the initial approach results in marking too many files simultaneously during the first scoring in offpeak hours. The primary objective of this PR is to prevent periodic compactions during non-offpeak hours when offpeak hours are configured. We'll start with this straightforward solution and assess whether it suffices for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12031 Test Plan: Unit Tests added - `DBCompactionTest::LevelPeriodicCompactionOffpeak` for Leveled - `DBTestUniversalCompaction2::PeriodicCompaction` for Universal Reviewed By: cbi42 Differential Revision: D50900292 Pulled By: jaykorean fbshipit-source-id: 267e7d3332d45a5d9881796786c8650fa0a3b43d	2023-11-06 11:43:59 -08:00
Changyu Bi	520c64fd2e	Add missing status check in ExternalSstFileIngestionJob and ImportColumnFamilyJob (#12042 ) Summary: .. and update some unit tests that failed with this change. See comment in ExternalSSTFileBasicTest.IngestFileWithCorruptedDataBlock for more explanation. The missing status check is not caught by `ASSERT_STATUS_CHECKED=1` due to this line: `8505b26db1/table/block_based/block.h (L394)`. Will explore if we can remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12042 Test Plan: existing unit tests. Reviewed By: ajkr Differential Revision: D50994769 Pulled By: cbi42 fbshipit-source-id: c91615bccd6094a91634c50b98401d456cbb927b	2023-11-06 07:41:36 -08:00
914022466	2648e0a747	Fix a bug when ingest plaintable sst file (#11969 ) Summary: Plaintable doesn't support SeekToLast. And GetIngestedFileInfo is using SeekToLast without checking the validity. We are using IngestExternalFile or CreateColumnFamilyWithImport with some sst files in PlainTable format. But after running for a while, compaction errors often happen, such as ![image](https://github.com/facebook/rocksdb/assets/13954644/b4fa49fc-73fc-49ce-96c6-f198a30800b8) I simply added some std::cerr logging to find out why. ![image](https://github.com/facebook/rocksdb/assets/13954644/2cf1d5ff-48cc-4125-b917-87090f764fcd) It shows that the smallest key is always equal to the largest key. ![image](https://github.com/facebook/rocksdb/assets/13954644/6d43e978-0be0-4306-aae3-f9e4ae366395) Then I found the root cause is that PlainTable does not support SeekToLast, so the smallest key is always the same as the largest. I tried to write a unit test, but it's not easy to reproduce this error. (This PR is similar to https://github.com/facebook/rocksdb/pull/11266. Sorry for opening another PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11969 Reviewed By: ajkr Differential Revision: D50933854 Pulled By: cbi42 fbshipit-source-id: 6c6af53c1388922cbabbe64ed3be1cdc58df5431	2023-11-02 13:45:37 -07:00
Yu Zhang	4b013dcbed	Remove VersionEdit's friends pattern (#12024 ) Summary: Almost every private member of VersionEdit has its own getter and setter. Current code accesses them through a combination of direct private member access and getters and setters. There is no obvious benefit to this pattern except potential performance gains. I tried this simple benchmark for removing the friends pattern completely, and there is no obvious regression. So I think it would be good to remove VersionEdit's friends completely. ```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num_column_families=10 -num=50000000``` With change: fillseq : 2.994 micros/op 333980 ops/sec 149.710 seconds 50000000 operations; 36.9 MB/s fillseq : 3.033 micros/op 329656 ops/sec 151.673 seconds 50000000 operations; 36.5 MB/s fillseq : 2.991 micros/op 334369 ops/sec 149.535 seconds 50000000 operations; 37.0 MB/s Without change: fillseq : 3.015 micros/op 331715 ops/sec 150.732 seconds 50000000 operations; 36.7 MB/s fillseq : 3.044 micros/op 328553 ops/sec 152.182 seconds 50000000 operations; 36.3 MB/s fillseq : 3.091 micros/op 323520 ops/sec 154.550 seconds 50000000 operations; 35.8 MB/s Pull Request resolved: https://github.com/facebook/rocksdb/pull/12024 Reviewed By: pdillinger Differential Revision: D50806066 Pulled By: jowlyzhang fbshipit-source-id: 35d287ce638a38c30f243f85992e615b4c90eb27	2023-11-01 12:04:11 -07:00
Jay Huh	04225a2cfa	Fix for RecoverFromRetryableBGIOError starting with recovery_in_prog_ false (#11991 ) Summary: cbi42 helped with the investigation and found a potential scenario where `RecoverFromRetryableBGIOError()` may start with `recovery_in_prog_` set to false (along with other booleans like `bg_error_` and `soft_error_no_bg_work_`). Thread 1 - `StartRecoverFromRetryableBGIOError()`: (mutex held) sets `recovery_in_prog_ = true` Thread 1's `recovery_thread_` - (waits for mutex and acquires it) - `RecoverFromRetryableBGIOError()` -> `ResumeImpl()` -> `ClearBGError()`: sets `recovery_in_prog_ = false` - `ClearBGError()` -> `NotifyOnErrorRecoveryEnd()`: releases `mutex` Thread 2 - `StartRecoverFromRetryableBGIOError()`: (mutex held) sets `recovery_in_prog_ = true` - Waits for Thread 1 (`recovery_thread_`) to finish Thread 1's `recovery_thread_` - re-lock mutex in `NotifyOnErrorRecoveryEnd()` - Still inside `RecoverFromRetryableBGIOError()`: sets `recovery_in_prog_ = false` - Done Thread 2's `recovery_thread_` - recovery thread started with `recovery_in_prog_` set as `false` # Fix - Remove double-clearing `bg_error_`, `recovery_in_prog_` and other fields after `ResumeImpl()` already returned `OK()`. - Minor typo and linter fixes in `DBErrorHandlingFSTest` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11991 Test Plan: - `DBErrorHandlingFSTest::MultipleRecoveryThreads` added to reproduce the scenario. - Adding `assert(recovery_in_prog_);` at the start of `ErrorHandler::RecoverFromRetryableBGIOError()` fails the test without the fix and succeeds with the fix as expected. Reviewed By: cbi42 Differential Revision: D50506113 Pulled By: jaykorean fbshipit-source-id: 6dabe01e9ecd3fc50bbe9019587f2f4858bed9c6	2023-10-31 16:13:36 -07:00
Yu Zhang	60df39e530	Rate limiting stale sst files' deletion during recovery (#12016 ) Summary: As titled. If SstFileManager is available, deleting stale sst files will be delegated to it so it can be rate limited. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12016 Reviewed By: hx235 Differential Revision: D50670482 Pulled By: jowlyzhang fbshipit-source-id: bde5b76ea1d98e67f6b4f08bfba3db48e46aab4e	2023-10-28 09:50:52 -07:00
Jay Huh	e230e4d248	Make OffpeakTimeInfo available in VersionSet (#12018 ) Summary: As mentioned in https://github.com/facebook/rocksdb/issues/11893, we are going to use the offpeak time information to pre-process TTL-based compactions. To do so, we need to access `daily_offpeak_time_utc` in `VersionStorageInfo::ComputeCompactionScore()` where we pick the files to compact. This PR is to make the offpeak time information available at the time of compaction-scoring. We are not changing any compaction scoring logic just yet. Will follow up in a separate PR. There were two ways to achieve what we want. 1. Make `MutableDBOptions` available in `ColumnFamilyData` and `ComputeCompactionScore()` take `MutableDBOptions` along with `ImmutableOptions` and `MutableCFOptions`. 2. Make `daily_offpeak_time_utc` and `IsNowOffpeak()` available in `VersionStorageInfo`. We chose the latter as it involves smaller changes. This change includes the following - Introduction of `OffpeakTimeInfo` and `IsNowOffpeak()` has been moved from `MutableDBOptions` - `OffpeakTimeInfo` added to `VersionSet` and it can be set during construction and by `ChangeOffpeakTimeInfo()` - During `SetDBOptions()`, if offpeak time info needs to change, it calls `MaybeScheduleFlushOrCompaction()` to re-compute compaction scores and process compactions as needed Pull Request resolved: https://github.com/facebook/rocksdb/pull/12018 Test Plan: - `DBOptionsTest::OffpeakTimes` changed to include checks for `MaybeScheduleFlushOrCompaction()` calls and `VersionSet`'s OffpeakTimeInfo value change during `SetDBOptions()`. - `VersionSetTest::OffpeakTimeInfoTest` added to test `ChangeOffpeakTimeInfo()`. `IsNowOffpeak()` tests moved from `DBOptionsTest::OffpeakTimes` Reviewed By: pdillinger Differential Revision: D50723881 Pulled By: jaykorean fbshipit-source-id: 3cff0291936f3729c0e9c7750834b9378fb435f6	2023-10-27 15:56:48 -07:00
Hui Xiao	0f141352d8	Fix race between flush error recovery and db destruction (#12002 ) Summary: Context: DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: `519f2a41fb/db/db_impl/db_impl.cc (L525)` -> `519f2a41fb/db/error_handler.cc (L250)` -> `519f2a41fb/db/error_handler.cc (L808-L823)` However, due to a race between flush error recovery and db destruction, recovery can actually start after such wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part of the db destruction. It then crashes the program as below. ``` std::terminate() std::default_delete<std::thread>::operator()(std::thread) const std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr() rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31) rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725) rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725) rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678) ``` Summary: This PR fixed it by considering whether EndAutoRecovery() has been called before creating such thread. This fix is similar to how we currently [handle](`519f2a41fb/db/error_handler.cc (L688-L694)`) such case inside the created recovery thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002 Test Plan: A new UT repro-ed the crash before this fix and passes after. Reviewed By: ajkr Differential Revision: D50586191 Pulled By: hx235 fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093	2023-10-25 11:59:09 -07:00
qiuchengxuan	f2c9075d16	Fix dead loop with kSkipAnyCorruptedRecords mode selected in some cases (#11955 ) (#11979 ) Summary: With a fragmented record spanning multiple blocks, if any following block is corrupted with arbitrary data and the interpreted log number is less than the current log number, the program falls into an infinite loop because the leading bytes of the buffer are not skipped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11979 Test Plan: existing unit tests Reviewed By: ajkr Differential Revision: D50604408 Pulled By: jowlyzhang fbshipit-source-id: e50a0c7e7c3d293fb9d5afec0a3eb4a1835b7a3b	2023-10-25 09:16:24 -07:00
Myth	0ff7665c95	Fix low priority write may cause crash when it is rate limited (#11932 ) Summary: Fixed https://github.com/facebook/rocksdb/issues/11902 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11932 Reviewed By: akankshamahajan15 Differential Revision: D50573356 Pulled By: hx235 fbshipit-source-id: adeb1abdc43b523b0357746055ce4a2eabde56a1	2023-10-24 14:41:46 -07:00
Peter Dillinger	4155087746	Use manifest to persist pre-allocated seqnos (#11995 ) Summary: ... and other fixes for crash test after https://github.com/facebook/rocksdb/issues/11922. * When pre-allocating sequence numbers for establishing a time history, record that last sequence number in the manifest so that it is (most likely) restored on recovery even if no user writes were made or were recovered (e.g. no WAL). * When pre-allocating sequence numbers for establishing a time history, only do this for actually new DBs. * Remove the feature that ensures non-zero sequence number on creating the first column family with preserve/preclude option after initial DB::Open. Until fixed in a way compatible with the crash test, this creates a gap where some data written with active preserve/preclude option won't have a known associated time. Together, these ensure we don't upset the crash test by manipulating sequence numbers after initial DB creation (esp when re-opening with different options). (The crash test expects that the seqno after re-open corresponds to a known point in time from previous crash test operation, matching an expected DB state.) Follow-up work: * Re-fill the gap to ensure all data written under preserve/preclude settings have a known time estimate. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11995 Test Plan: Added to unit test SeqnoTimeTablePropTest.PrePopulateInDB Verified fixes two crash test scenarios: ## 1st reproducer First apply ``` diff --git a/db_stress_tool/expected_state.cc b/db_stress_tool/expected_state.cc index b483e154c..ef63b8d6c 100644 --- a/db_stress_tool/expected_state.cc +++ b/db_stress_tool/expected_state.cc @@ -333,6 +333,7 @@ Status FileExpectedStateManager::SaveAtAndAfter(DB* db) { s = NewFileTraceWriter(Env::Default(), soptions, trace_file_path, &trace_writer); } + if (getenv("CRASH")) assert(false); if (s.ok()) { TraceOptions trace_opts; trace_opts.filter |= kTraceFilterGet; ``` Then ``` mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_expected mkdir -p /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_/ CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=36000 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --preserve_internal_time_seconds=0 ``` Without the fix you get ``` ... 
DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox] (Re-)verified 34 unique IDs Error restoring historical expected values: Corruption: DB is older than any restorable expected state ``` ## 2nd reproducer First apply ``` diff --git a/db_stress_tool/db_stress_test_base.cc b/db_stress_tool/db_stress_test_base.cc index 62ddead7b..f2654980f 100644 --- a/db_stress_tool/db_stress_test_base.cc +++ b/db_stress_tool/db_stress_test_base.cc @@ -1126,6 +1126,7 @@ void StressTest::OperateDb(ThreadState* thread) { // OPERATION write TestPut(thread, write_opts, read_opts, rand_column_families, rand_keys, value); + if (getenv("CRASH")) assert(false); } else if (prob_op < del_bound) { assert(write_bound <= prob_op); // OPERATION delete ``` Then ``` rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_/ CRASH=1 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=1 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=0 ./db_stress --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --destroy_db_initially=0 --manual_wal_flush_one_in=1000000 --clear_column_family_one_in=0 --disable_wal=1 --reopen=0 --preserve_internal_time_seconds=3600 ``` Without the fix you get ``` DB path: [/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox] (Re-)verified 34 unique IDs db_stress: db_stress_tool/expected_state.cc:380: virtual rocksdb::{anonymous}::ExpectedStateTraceRecordHandler::~ ExpectedStateTraceRecordHandler(): Assertion `IsDone()' failed. ``` Reviewed By: jowlyzhang Differential Revision: D50533346 Pulled By: pdillinger fbshipit-source-id: 1056be45c5b9e537c8c601b28c4b27431a782477	2023-10-23 09:20:59 -07:00
Hui Xiao	0836a2b26d	New tickers on deletion compactions grouped by reasons (#11957 ) Summary: Context/Summary: as titled Pull Request resolved: https://github.com/facebook/rocksdb/pull/11957 Test Plan: piggyback on existing tests; fixed a failed test due to adding new stats Reviewed By: ajkr, cbi42 Differential Revision: D50294310 Pulled By: hx235 fbshipit-source-id: d99b97ebac41efc1bdeaf9ca7a1debd2927d54cd	2023-10-18 18:00:07 -07:00
Changyu Bi	d5bc30befa	Enforce status checking after Valid() returns false for IteratorWrapper (#11975 ) Summary: ... when compiled with ASSERT_STATUS_CHECKED = 1. The main change is in iterator_wrapper.h. The remaining changes are just fixing existing unit tests. Adding this check to IteratorWrapper gives a good coverage as the class is used in many places, including child iterators under merging iterator, merging iterator under DB iter, file_iter under level iterator, etc. This change can catch the bug fixed in https://github.com/facebook/rocksdb/issues/11782. Future follow up: enable `ASSERT_STATUS_CHECKED=1` for stress test and for DEBUG_LEVEL=0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11975 Test Plan: * `ASSERT_STATUS_CHECKED=1 DEBUG_LEVEL=2 make -j32 J=32 check` * I tried to run stress test with `ASSERT_STATUS_CHECKED=1`, but there are a lot of existing stress code that ignore status checking, and fail without the change in this PR. So defer that to a follow up task. Reviewed By: ajkr Differential Revision: D50383790 Pulled By: cbi42 fbshipit-source-id: 1a28ce0f5fdf1890f93400b26b3b1b3a287624ce	2023-10-18 09:38:38 -07:00
Yu Zhang	933ee295f4	Fix a race condition between recovery and backup (#11955 ) Summary: A race condition between recovery and backup can happen with error messages like this: ```Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory``` PR https://github.com/facebook/rocksdb/issues/6949 introduced disabling file deletion during error handling of manifest IO errors. The aforementioned race condition is caused by this chain of events: [Backup engine] disable file deletion [Recovery] disable file deletion <= this is optional for the race condition, it may or may not get called [Backup engine] get list of file to copy/link [Recovery] force enable file deletion .... some files referred to by backup engine get deleted [Backup engine] copy/link file <= error no file found This PR fixes this with: 1) Recovery thread is currently forcing enabling file deletion as long as file deletion is disabled, regardless of whether the previous error handling is for a manifest IO error that disabled it in the first place. This means it could incorrectly enable file deletions that other threads, like backup threads or file snapshotting threads, intended to keep disabled. This PR does this check explicitly before making the call. 2) `disable_delete_obsolete_files_` is designed as a counter to allow different threads to enable and disable file deletion separately. The recovery thread currently does a force enable file deletion, because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error (details per PR https://github.com/facebook/rocksdb/issues/6949), resulting in `DBImpl::DisableFileDeletions` being called multiple times too. Making a force enable file deletion call that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. 
However, as it shows in the race condition, it can incorrectly suppress other threads like a backup thread's intention to keep the file deletion disabled. <strike>This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely</strike>. This PR tracks and caps file deletion enabling/disabling in error handler. 3) for recovery, the section to find obsolete files and purge them was moved to be done after the attempt to enable file deletion. The actual finding and purging is more likely to happen if file deletion was previously disabled and get re-enabled now. An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support change 2) and 3). Some useful logging was explicitly added to keep those log messages around. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11955 Test Plan: existing unit tests Reviewed By: anand1976 Differential Revision: D50290592 Pulled By: jowlyzhang fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b	2023-10-17 13:18:04 -07:00
Peter Dillinger	2fd850c7eb	Remove write queue synchronization from WriteOptionsFile (#11951 ) Summary: This has become obsolete with the new `options_mutex_` in https://github.com/facebook/rocksdb/pull/11929 * Remove now-unnecessary parameter from WriteOptionsFile * Rename (and negate) other parameter for better clarity (the caller shouldn't tell the callee what the callee needs, just what the caller knows, provides, and requests) * Move a ROCKS_LOG_WARN (I/O) in WriteOptionsFile to outside of holding DB mutex. * Also avoid (but not always eliminate) write queue synchronization in SetDBOptions. Still needed if there was a change to WAL size limit or other configuration. * Improve some comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/11951 Test Plan: existing unit tests and TSAN crash test local run Reviewed By: ajkr Differential Revision: D50247904 Pulled By: pdillinger fbshipit-source-id: 7dfe445c705ec013886a2adb7c50abe50d83af69	2023-10-16 08:58:47 -07:00
Changyu Bi	f3aef8cad7	Add write operation to tracer only after successful callback (#11954 ) Summary: We saw optimistic transaction stress test failures like the following: ``` Verification failed for column family 0 key 000000000001E9AF000000000000012B00000000000000B5 (12535491): value_from_db: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, value_from_expected: , msg: Iterator verification: Unexpected value found ``` With ajkr's repro (see test plan), I found that we record duplicated writes to tracer when an optimistic transaction conflict checking fails. This PR fixes it by checking callback status before recording a write operation to tracer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11954 Test Plan: this reproduces the failure consistently ``` #!/bin/bash db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected rm -rf $db $exp && mkdir -p $exp && while ./db_stress \ --atomic_flush=1 \ --clear_column_family_one_in=0 \ --db=$db \ --db_write_buffer_size=2097152 \ --delpercent=0 \ --delrangepercent=0 \ --destroy_db_initially=0 \ --disable_wal=1 \ --expected_values_dir=$exp \ --iterpercent=0 \ --max_bytes_for_level_base=2097152 \ --max_key=250000 \ --memtable_prefix_bloom_size_ratio=0.5 \ --memtable_whole_key_filtering=1 \ --occ_lock_bucket_count=100 \ --occ_validation_policy=0 \ --ops_per_thread=10 \ --prefixpercent=0 \ --readpercent=0 \ --reopen=0 \ --target_file_size_base=524288 \ --test_batches_snapshots=0 \ --use_optimistic_txn=1 \ --use_txn=1 \ --value_size_mult=32 \ --write_buffer_size=524288 \ --writepercent=100 ; do : ; done ``` Reviewed By: akankshamahajan15 Differential Revision: D50284976 Pulled By: cbi42 fbshipit-source-id: 793e3cee186c8b4f406b29166efd8d9028695206	2023-10-14 12:00:31 -07:00
Jay Huh	c9d8e6a5bf	AttributeGroups - MultiGetEntity Implementation (#11925 ) Summary: Introducing the notion of AttributeGroup by adding the `MultiGetEntity()` API retrieving `PinnableAttributeGroups`. An "attribute group" refers to a logical grouping of wide-column entities within RocksDB. These attribute groups are implemented using column families. Users can store WideColumns in different CFs for various reasons (e.g. similar access patterns, same types, etc.). This new API `MultiGetEntity()` takes keys and `PinnableAttributeGroups` per key. `PinnableAttributeGroups` is just a list of `PinnableAttributeGroup`s in which we have `ColumnFamilyHandle*`, `Status`, and `PinnableWideColumns`. Let's say a user stored "hot" wide columns in column family "hot_data_cf" and "cold" wide columns in column family "cold_data_cf" and all other columns in "common_cf". Prior to this PR, if the user wanted to query for two keys, "key_1" and "key_2", but was only interested in "common_cf" and "hot_data_cf" for "key_1", and "common_cf" and "cold_data_cf" for "key_2", the user would have to construct input like `keys = ["key_1", "key_1", "key_2", "key_2"]`, `column_families = ["common_cf", "hot_data_cf", "common_cf", "cold_data_cf"]` and get the flat list of `PinnableWideColumns` to find the corresponding <key,CF> combo. With the new `MultiGetEntity()` introduced in this PR, users can now query only `["common_cf", "hot_data_cf"]` for `"key_1"`, and only `["common_cf", "cold_data_cf"]` for `"key_2"`. The user will get `PinnableAttributeGroups` for each key, and `PinnableAttributeGroups` gives a list of `PinnableAttributeGroup`s where the user can find column family and corresponding `PinnableWideColumns` and the `Status`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11925 Test Plan: - `DBWideBasicTest::MultiCFMultiGetEntityAsPinnableAttributeGroups` added will enable this new API in the `db_stress` in a separate PR Reviewed By: ltamasi Differential Revision: D50017414 Pulled By: jaykorean fbshipit-source-id: 643611d1273c574bc81b94c6f5aeea24b40c4586	2023-10-13 15:58:03 -07:00
Changyu Bi	6e3429b8a6	Fix data race in accessing `recovery_in_prog_` (#11950 ) Summary: We saw the following TSAN stress test failure: ``` WARNING: ThreadSanitizer: data race (pid=17523) Write of size 1 at 0x7b8c000008b9 by thread T4243 (mutexes: write M0): #0 rocksdb::ErrorHandler::RecoverFromRetryableBGIOError() fbcode/internal_repo_rocksdb/repo/db/error_handler.cc:742 (db_stress+0x95f954) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d) #1 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (rocksdb::ErrorHandler::*)(), rocksdb::ErrorHandler*>>>::_M_run() fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:74 (db_stress+0x95fc2b) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d) #2 execute_native_thread_routine /home/engshare/third-party2/libgcc/11.x/src/gcc-11.x/x86_64-facebook-linux/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:82:18 (libstdc++.so.6+0xdf4e4) (BuildId: 452d1cdae868baeeb2fdf1ab140f1c219bf50c6e) Previous read of size 1 at 0x7b8c000008b9 by thread T22: #0 rocksdb::DBImpl::SyncClosedLogs(rocksdb::JobContext*, rocksdb::VersionEdit*) fbcode/internal_repo_rocksdb/repo/db/error_handler.h:76 (db_stress+0x84f69c) (BuildId: 35795dfb86ddc9c4f20ddf08a491f24d) ``` This is due to a data race in accessing `recovery_in_prog_`. This PR fixes it by accessing `recovery_in_prog_` under db mutex before calling `SyncClosedLogs()`. I think the original PR https://github.com/facebook/rocksdb/pull/10489 intended to clear the error if it's a recovery flush. So ideally we can also just check flush reason. I plan to keep a safer change in this PR and make that change in the future if needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11950 Test Plan: check future TSAN stress test results.
Reviewed By: anand1976 Differential Revision: D50242255 Pulled By: cbi42 fbshipit-source-id: 0d487948ef9546b038a34460f3bb037f6e5bfc58	2023-10-12 16:55:25 -07:00
Changyu Bi	648fe25bc0	Always clear files marked for compaction in `ComputeCompactionScore()` (#11946 ) Summary: We were seeing the following stress test failures: ```LevelCompactionBuilder::PickFileToCompact(const rocksdb::autovector<std::pair<int, rocksdb::FileMetaData*> >&, bool): Assertion `!level_file.second->being_compacted' failed``` This can happen when we are picking a file to be compacted from some files marked for compaction, but that file is already being_compacted. We prevent this by always calling `ComputeCompactionScore()` after we pick a compaction and mark some files as being_compacted. However, if SetOptions() is called to disable marking certain files to be compacted, say `enable_blob_garbage_collection`, we currently just skip the relevant logic in `ComputeCompactionScore()` without clearing the existing files already marked for compaction. This PR fixes this issue by always clearing these files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11946 Test Plan: existing tests. Reviewed By: akankshamahajan15 Differential Revision: D50232608 Pulled By: cbi42 fbshipit-source-id: 11e4fb5e9d48b0f946ad33b18f7c005f0161f496	2023-10-12 15:26:10 -07:00
Peter Dillinger	d010b02e86	Fix race in options taking effect (#11929 ) Summary: In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect. To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine. I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race. Intended follow-up to this: * Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929 Test Plan: Added two mini-stress style regression tests that fail with >1% probability before this change: DBOptionsTest::SetStatsDumpPeriodSecRace ColumnFamilyTest::CreateAndDropPeriodicRace I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway: DBOptionsTest::SetStatsDumpPeriodSecRace Reviewed By: ajkr Differential Revision: D50024506 Pulled By: pdillinger fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4	2023-10-12 10:05:23 -07:00
Andrew Kryczka	4bd5aa4f55	Fix two `ErrorHandler` race conditions (#11939 ) Summary: 1. Prevent a double join on a `port::Thread` 2. Ensure `recovery_in_prog_` and `bg_error_` are both set under same lock hold. This is useful for writers who see a non-OK `bg_error_` and are deciding whether to stall based on whether the error will be auto-recovered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11939 Reviewed By: cbi42 Differential Revision: D50155484 Pulled By: ajkr fbshipit-source-id: fbc1f85c50e7eaee27ee0e376aee688d8a06c93b	2023-10-11 09:42:48 -07:00
Andrew Kryczka	77d160ef47	Consolidate `ErrorHandler`'s recovery status variables (#11937 ) Summary: cbi42 pointed out a race condition in which `recovery_io_error_` and `recovery_error_` could be updated inconsistently due to releasing the DB mutex in `EventHelpers::NotifyOnBackgroundError()`. There doesn't seem to be a point to having two status objects, so this PR consolidates them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11937 Reviewed By: cbi42 Differential Revision: D50105793 Pulled By: ajkr fbshipit-source-id: 3de95baccfa44351a49a5c2aa0986c9bc81baa8f	2023-10-10 06:31:45 -07:00
Andrew Kryczka	8a9cfd5292	Make stopped writes block on recovery (#11879 ) Summary: Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush. This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879 Reviewed By: anand1976 Differential Revision: D49571196 Pulled By: ajkr fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0	2023-10-10 06:29:01 -07:00
darionyaphet	ee0829ba76	fix typo snapshto (#11817 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11817 Reviewed By: jaykorean Differential Revision: D50103497 Pulled By: ltamasi fbshipit-source-id: 77c5cf86ff7eb5021fc91b03225882536163af7b	2023-10-09 19:10:06 -07:00
Levi Tamasi	51d7e6a49e	Clean up WriteBatchWithIndexInternal a bit (#11930 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11930 The patch cleans up and refactors the logic in/around `WriteBatchWithIndexInternal` a bit as groundwork for further changes. Specifically, the class is turned back into a stateless collection of static helpers (which is the way it was before PR 6851). Note that there were two apparent reasons for introducing this instance state in PR 6851: a) encapsulating `MergeContext` and b) resolving objects like `Logger` and `Statistics` based on a variety of handles. However, neither reason seems justified at this point. Regarding a), the `MultiGetFromBatchAndDB` logic passes in its own `MergeContext` objects via a second set of methods that do not use the member `MergeContext`. As for b), `Logger` and friends are only needed for Merge, which is only supported if a column family handle is provided; in turn, the column family handle enables us to resolve all the necessary objects without the need for any other handles like `DB` or `DBOptions`. In addition to the above, the patch changes the type of `BaseDeltaIterator::merge_result_` to `std::string` from `PinnableSlice` (since no pinning is ever done) and makes some other small code quality improvements. Reviewed By: jaykorean Differential Revision: D50038302 fbshipit-source-id: 5f34abe2e808bdaea0f3a8033b5764ebd446b85d	2023-10-09 15:25:35 -07:00
Peter Dillinger	1d5bddbc58	Bootstrap, pre-populate seqno_to_time_mapping (#11922 ) Summary: This change has two primary goals (follow-up to https://github.com/facebook/rocksdb/issues/11917, https://github.com/facebook/rocksdb/issues/11920): * Ensure the DB seqno_to_time_mapping has entries that allow us to put a good time lower bound on any writes that happen after setting up preserve/preclude options (either in a new DB, new CF, SetOptions, etc.) and haven't yet aged out of that time window. This allows us to remove a bunch of work-arounds in tests. * For new DBs using preserve/preclude options, automatically reserve some sequence numbers and pre-map them to cover the time span back to the preserve/preclude cut-off time. In the future, this will allow us to import data from another DB by key, value, and write time by assigning an appropriate seqno in this DB for that write time. Note that the pre-population (historical mappings) does not happen if the original options at DB Open time do not have preserve/preclude, so it is recommended to create initial column families at that time with create_missing_column_families, to take advantage of this (future) feature. (Adding these historical mappings after DB Open would risk non-monotonic seqno_to_time_mapping, which is dubious if not dangerous.) Recommended follow-up: * Solve existing race conditions (not memory safety) where parallel operations like CreateColumnFamily or SetDBOptions could leave the wrong setting in effect. * Make SeqnoToTimeMapping more gracefully handle a possible case in which too many mappings are added for the time range of concern. It seems like there could be cases where data is massively excluded from the cold tier because of entries falling off the front of the mapping list (causing GetProximalSeqnoBeforeTime() to return 0). (More investigation needed.) No release note for the minor bug fix because this is still an experimental feature with limited usage. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11922 Test Plan: tests added / updated Reviewed By: jowlyzhang Differential Revision: D49956563 Pulled By: pdillinger fbshipit-source-id: 92beb918c3a298fae9ca8e509717b1067caa1519	2023-10-06 08:21:21 -07:00
Hui Xiao	8e949116f7	Fix comments about creation_time/oldest_ancester_time/oldest_key_time (#11921 ) Summary: Code reference for the comments change: `40b618f234/table/block_based/block_based_table_builder.cc`?fbclid=IwAR0JlfnG8wysclFP5wv0fSngFbi_j32BUCKbFayeGdr10tzDhyyk5QqpclA#L2093 `40b618f234/db/flush_job.cc`?fbclid=IwAR1ri6eTX3wyD_2fAEBRzFSwZItcbmDS8LaB11k1letDMQmB2L8nF6TfXDs#L945-L949 `40b618f234/db/compaction/compaction_job.cc (L1882-L1904)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11921 Reviewed By: cbi42 Differential Revision: D49921304 Pulled By: hx235 fbshipit-source-id: 2ae17e43c0fd52044404d7b63fea254d2d1f3595	2023-10-04 14:42:35 -07:00
Peter Dillinger	141b872bd4	Improve efficiency of create_missing_column_families, light refactor (#11920 ) Summary: In preparing some seqno_to_time_mapping improvements, I found that some of the wrap-up work for creating column families was unnecessarily repeated in the case of DB::Open with create_missing_column_families. This change fixes that (`CreateColumnFamily()` -> `CreateColumnFamilyImpl()` in `DBImpl::Open()`), motivated by avoiding repeated calls to `RegisterRecordSeqnoTimeWorker()` but with the side benefit of avoiding repeated calls to `WriteOptionsFile()` for each CF. Also in this change: * Add a `Status::UpdateIfOk()` function for combining statuses in a common pattern * Rename `max_time_duration` -> `min_preserve_seconds` (include units as much as possible) * Improved comments in several places Pull Request resolved: https://github.com/facebook/rocksdb/pull/11920 Test Plan: tests added / updated Reviewed By: jaykorean Differential Revision: D49919147 Pulled By: pdillinger fbshipit-source-id: 3d0318c1d070c842c5331da0a5b415caedc104f1	2023-10-04 14:14:22 -07:00
akankshamahajan	97f6f475bc	Fix various failures in auto_readahead_size (#11884 ) Summary: 1. Error in TestIterateAgainstExpected API - `Assertion index < pre_read_expected_values.size() && index < post_read_expected_values.size() failed.` Fix - `Prev` op is not supported with `auto_readahead_size`. So added support to Reseek in db_iter, if Prev is called. In BlockBasedTableIterator, index_iter_ already moves forward. So there is no way to do Prev from BlockBasedTableIterator. 2. Error - `void rocksdb::BlockBasedTableIterator::BlockCacheLookupForReadAheadSize(uint64_t, size_t, size_t&): Assertion index_iter_->value().handle.offset() == offset` Fix - Remove prefetch_buffer to be used when uncompressed dict is read. 3. Error in TestPrefixScan API - `db_stress: db/db_iter.cc:369: bool rocksdb::DBIter::FindNextUserEntryInternal(bool, const rocksdb::Slice): Assertion !skipping_saved_key || CompareKeyForSkip(ikey_.user_key, saved_key_.GetUserKey()) > 0 failed. Received signal 6 (Aborted) Invoking GDB for stack trace... db_stress: table/merging_iterator.cc:1036: bool rocksdb::MergingIterator::SkipNextDeleted(): Assertion comparator_->Compare(range_tombstone_iters_[i]->start_key(), pik) <= 0 failed` Fix - SeekPrev also calls 1) SeekPrev, 2) Seek and then 3) Prev in some cases in db_iter.cc leading to failure of Prev operation. These backward operations also call Seek. Added direction to disable lookup once direction is backwards in BlockBasedTableIterator.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/11884 Test Plan: Ran various flavors of crash tests locally for the whole duration Reviewed By: anand1976 Differential Revision: D49834201 Pulled By: akankshamahajan15 fbshipit-source-id: 9a007b4d46a48002c43dc4623a400ecf47d997fe	2023-10-02 17:47:24 -07:00
Jay Huh	5fbea87859	Disallow start_time == end_time in offpeak time and compare at minute level to allow 24hr offpeak (#11911 ) Summary: Since allowing 24hr offpeak by setting start_time = end_time is not so intuitive, we are not going to allow it (e.g. `00:00-00:00` doesn't look like a value that would cover 24hr.). Instead, we are going to compare at minute level (i.e. dropping the seconds to the nearest minute) so that `00:00-23:59` will cover 24hrs. The entire minute from 23:59:00 to 23:59:59 will be covered with this change. Minor fixes from previous PR - release build error - fixed random seed in test Pull Request resolved: https://github.com/facebook/rocksdb/pull/11911 Test Plan: `DBOptionsTest::OffPeakTimes` `make -j64 static_lib` to test release build issue that was fixed Reviewed By: pdillinger Differential Revision: D49787795 Pulled By: jaykorean fbshipit-source-id: e8d045b95f54f61d5dd5f1bb473579f8d55c18b3	2023-10-02 16:52:39 -07:00
Andrew Kryczka	10fd05e394	Give retry flushes their own functions (#11903 ) Summary: Recovery triggers flushes for very different scenarios: (1) `FlushReason::kErrorRecoveryRetryFlush`: a flush failed (2) `FlushReason::kErrorRecovery`: a WAL may be corrupted (3) `FlushReason::kCatchUpAfterErrorRecovery`: immutable memtables may have accumulated The old code called `FlushAllColumnFamilies()` in all cases, which uses manual flush functions: `AtomicFlushMemTables()` and `FlushMemTable()`. Forcing flushing the latest data on all CFs was useful for (2) because it ensures all CFs move past the corrupted WAL. However, those code paths were overkill for (1) and (3), where only already-immutable memtables need to be flushed. There were conditionals to exclude some of the extraneous logic but I found there was still too much happening. For example, both of the manual flush functions enter the write thread. Entering the write thread is inconvenient because then we can't allow stalled writes to wait on a retrying flush to finish. Instead of continuing down the path of adding more conditionals to the manual flush functions, this PR introduces a dedicated function for cases (1) and (3): `RetryFlushesForErrorRecovery()`. Also I cleaned up the manual flush functions to remove existing conditionals for these cases as they're no longer needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11903 Reviewed By: cbi42 Differential Revision: D49693812 Pulled By: ajkr fbshipit-source-id: 7630ac539b9d6c92052c13a3cdce53256134d990	2023-10-02 16:26:24 -07:00

5755 Commits