rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-12-03 14:52:53 +00:00

Author	SHA1	Message	Date
Hui Xiao	f910a0c025	Fix unreleased bug fix .md name (#12672 ) Summary: Context/Summary: as above Pull Request resolved: https://github.com/facebook/rocksdb/pull/12672 Test Plan: no code change Reviewed By: ajkr Differential Revision: D57505136 Pulled By: hx235 fbshipit-source-id: 0e216dc5974e9be10027b444eb6b4034f679dfd8	2024-05-20 09:41:11 -07:00
Hui Xiao	20213d01a3	Fix crash in CompactFiles() of conflict range under `preclude_last_level_data_seconds > 0` (#12628 ) Summary: Context/Summary: Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflict when sanitizing input files while later used `FilesRangeOverlapWithCompaction()` to assert for no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e, compaction can output to second-to-the-last level). So we ran into assertion violation in `CompactFiles()` like below ``` Assertion `output_level == 0 \|\| !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed. ``` This PR make `CompactFiles()` used `FilesRangeOverlapWithCompaction()` and return Aborted status upon range conflict instead of crashing (during debug build) or proceed incorrectly (during non-debug build). To do so cleanly, I included a refactoring to make `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628 Test Plan: New UT crashed before the fix and return correct status after the fix. Reviewed By: cbi42 Differential Revision: D57123536 Pulled By: hx235 fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430	2024-05-13 13:12:06 -07:00
Andrew Kryczka	933ac0e05c	Fix locking for `ColumnFamilyOptions::inplace_update_support` (#12624 ) Summary: In `SaveValue()`, the read lock needs to be obtained before `VerifyEntryChecksum()` because the KV checksum verification reads the entire value metadata+data, which is all mutable when `ColumnFamilyOptions::inplace_update_support == true`. In `MemTable::Update()`, the write lock needs to be obtained before mutating the value metadata (changing the value size) because it can be read concurrently. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12624 Test Plan: ``` $ make COMPILE_WITH_TSAN=1 -j56 db_stress ... $ python3 tools/db_crashtest.py blackbox --simple --max_key=10 --inplace_update_support=1 --interval=10 --allow_concurrent_memtable_write=0 ``` Reviewed By: cbi42 Differential Revision: D57034571 Pulled By: ajkr fbshipit-source-id: 3dddf881ad87923143acdf6bfec12ce47bb13a48	2024-05-08 08:30:12 -07:00
Andrew Kryczka	0fef690bd5	Sync non-latest WALs during flush for 2PC, single-CF DBs (#12622 ) Summary: Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs". This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622 Reviewed By: cbi42 Differential Revision: D56987303 Pulled By: ajkr fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48	2024-05-06 11:56:16 -07:00
Changyu Bi	6fdc4c5282	Fix a corruption bug in `CreateColumnFamilyWithImport()` (#12602 ) Summary: when importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires each original CF to have disjoint key range, after an intra-l0 compaction, we still can end up with L0 files with the same epoch number but overlapping key range. This PR attempt to fix this by reassigning epoch numbers when importing multiple CFs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602 Test Plan: a new repro unit test. Before this PR, it fails with ``` [ RUN ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF db/import_column_family_test.cc:1048: Failure db_->WaitForCompact(o) Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range https://github.com/facebook/rocksdb/issues/44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file https://github.com/facebook/rocksdb/issues/36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3 ``` Reviewed By: hx235 Differential Revision: D56851808 Pulled By: cbi42 fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84	2024-05-06 11:01:38 -07:00
anand76	6349da612b	Update HISTORY.md and version to 9.3.0 (#12601 ) Summary: Update HISTORY.md for 9.2 and version to 9.3. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12601 Reviewed By: jaykorean, jowlyzhang Differential Revision: D56845901 Pulled By: anand1976 fbshipit-source-id: 0d1137a6568e4712be2f8b705f4f7b438217dbed	2024-05-01 16:33:04 -07:00
Yu Zhang	241253053a	Fix delete obsolete files on recovery not rate limited (#12590 ) Summary: This PR fix the issue that deletion of obsolete files during DB::Open are not rate limited. The root cause is slow deletion is disabled if trash/db size ratio exceeds the configured `max_trash_db_ratio` `d610e14f93/include/rocksdb/sst_file_manager.h (L126)` however, the current handling in DB::Open starts with tracking nothing but the obsolete files. This will make the ratio always look like it's 1. In order for the deletion rate limiting logic to work properly, we should only start deleting files after `SstFileManager` has finished tracking the whole DB, so the main fix is to move these two places that attempts to delete file after the tracking are done: 1) the `DeleteScheduler::CleanupDirectory` call in `SanitizeOptions`, 2) the `DB::DeleteObsoleteFiles` call. There are some other aesthetic changes like refactoring collecting all the DB paths into a function, rename `DBImp::DeleteUnreferencedSstFiles` to `DBImpl:: MaybeUpdateNextFileNumber` as it doesn't actually delete the files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12590 Test Plan: Added unit test and verified with manual testing Reviewed By: anand1976 Differential Revision: D56830519 Pulled By: jowlyzhang fbshipit-source-id: 8a38a21b1ea11c5371924f2b88663648f7a17885	2024-05-01 12:26:54 -07:00
Hui Xiao	abd6751aba	Fix wrong padded bytes being used to generate file checksum (#12598 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/12542 introduced a bug where wrong padded bytes used to generate file checksum if flush happens during padding. This PR fixed it along with an existing same bug for `perform_data_verification_=true`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12598 Test Plan: - New UT that failed before this fix (`db->VerifyFileChecksums: ...Corruption: ...file checksum mismatch`) and passes after - Benchmark ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none ``` Pre-PR: fillseq [AVG 300 runs] : 421334 (± 4126) ops/sec; 46.6 (± 0.5) MB/sec Post-PR: (no regression observed but a slight improvement) fillseq [AVG 300 runs] : 425768 (± 4309) ops/sec; 47.1 (± 0.5) MB/sec Reviewed By: ajkr, anand1976 Differential Revision: D56725688 Pulled By: hx235 fbshipit-source-id: c1a700a95def8c65c0a21e44f8c1966164925ad5	2024-04-30 15:38:53 -07:00
Andrew Kryczka	b2931a5c53	Fixed `MultiGet()` error handling to not skip blob dereference (#12597 ) Summary: See comment at top of the test case and release note. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12597 Reviewed By: jaykorean Differential Revision: D56718786 Pulled By: ajkr fbshipit-source-id: 8dce185bb0d24a358372fc2b553d181793fc335f	2024-04-29 14:18:42 -07:00
anand76	e36b0a2da4	Fix corruption bug when recycle_log_file_num changed from 0 (#12591 ) Summary: When `recycle_log_file_num` is changed from 0 to non-zero and the DB is reopened, any log files from the previous session that are still alive get reused. However, the WAL records in those files are not in the recyclable format. If one of those files is reused and is empty, a subsequent re-open, in `RecoverLogFiles`, can replay those records and insert stale data into the memtable. Another manifestation of this is an assertion failure `first_seqno_ == 0 \|\| s >= first_seqno_` in `rocksdb::MemTable::Add`. We could fix this by either 1) Writing a special record when reusing a log file, or 2) Implement more rigorous checking in `RecoverLogFiles` to ensure we don't replay stale records, or 3) Not reuse files created by a previous DB session. We choose option 3 as its the simplest, and flipping `recycle_log_file_num` is expected to be a rare event. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12591 Test Plan: 1. Add a unit test to verify the bug and fix Reviewed By: jowlyzhang Differential Revision: D56655812 Pulled By: anand1976 fbshipit-source-id: aa3a26b4a5e892d39a54b5a0658233cbebebac87	2024-04-29 12:25:00 -07:00
Jay Huh	1fca175eec	MultiCFSnapshot for NewIterators() API (#12573 ) Summary: As mentioned in https://github.com/facebook/rocksdb/issues/12561 and https://github.com/facebook/rocksdb/issues/12566 , `NewIterators()` API has not been providing consistent view of the db across multiple column families. This PR addresses it by utilizing `MultiCFSnapshot()` function which has been used for `MultiGet()` APIs. To be able to obtain the thread-local super version with ref, `sv_exclusive_access` parameter has been added to `MultiCFSnapshot()` so that we could call `GetReferencedSuperVersion()` or `GetAndRefSuperVersion()` depending on the param and support `Refresh()` API for MultiCfIterators Pull Request resolved: https://github.com/facebook/rocksdb/pull/12573 Test Plan: Unit Tests Added ``` ./db_iterator_test --gtest_filter="IteratorsConsistentView" ``` ``` ./multi_cf_iterator_test -- --gtest_filter="ConsistentView" ``` Performance Check Setup ``` make -j64 release TEST_TMPDIR=/dev/shm/db_bench ./db_bench -benchmarks="filluniquerandom" -key_size=32 -value_size=512 -num=10000000 -compression_type=none ``` Run ``` TEST_TMPDIR=/dev/shm/db_bench ./db_bench -use_existing_db=1 -benchmarks="multireadrandom" -cache_size=10485760000 ``` Before the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 6.374 micros/op 156892 ops/sec 6.374 seconds 1000000 operations; (0 of 1000000 found) ``` After the change ``` DB path: [/dev/shm/db_bench/dbbench] multireadrandom : 6.265 micros/op 159627 ops/sec 6.265 seconds 1000000 operations; (0 of 1000000 found) ``` Reviewed By: jowlyzhang Differential Revision: D56444066 Pulled By: jaykorean fbshipit-source-id: 327ce73c072da30c221e18d4f3389f49115b8f99	2024-04-24 15:28:55 -07:00
Andrew Kryczka	6807da0b44	Fix `DisableManualCompaction()` hang (#12578 ) Summary: Prior to this PR the following sequence could happen: 1. `RunManualCompaction()` A schedules compaction to thread pool and waits 2. `RunManualCompaction()` B waits without scheduling anything due to conflict 3. `DisableManualCompaction()` bumps `manual_compaction_paused_` and wakes up both 4. `RunManualCompaction()` A (`scheduled && !unscheduled`) unschedules its compaction and marks itself done 5. `RunManualCompaction()` B (`!scheduled && !unscheduled`) schedules compaction to thread pool 6. `RunManualCompaction()` B (`scheduled && !unscheduled`) waits on its compaction 7. `RunManualCompaction()` B at some point wakes up and finishes, either by unscheduling or by compaction execution 8. `DisableManualCompaction()` returns as there are no more manual compactions running Between 6. and 7. the wait can be long while the compaction sits in the thread pool queue. That wait is unnecessary. This PR changes the behavior from step 5. onward: 5'. `RunManualCompaction()` B (`!scheduled && !unscheduled`) marks itself done 6'. `DisableManualCompaction()` returns as there are no more manual compactions running Pull Request resolved: https://github.com/facebook/rocksdb/pull/12578 Reviewed By: cbi42 Differential Revision: D56528144 Pulled By: ajkr fbshipit-source-id: 4da2467376d7d4ff435547aa74dd8f118db0c03b	2024-04-24 12:40:36 -07:00
Andrew Kryczka	3f3045a405	fix DeleteRange+memtable_insert_with_hint_prefix_extractor interaction (#12558 ) Summary: Previously `insert_hints_` was used for both point key table (`table_`) and range deletion table (`range_del_table_`). Hints include pointers to table data, so mixing hints for different tables together without tracking which hint corresponds to which table was problematic. We can just make the hints dedicated to the point key table only. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12558 Reviewed By: hx235 Differential Revision: D56279019 Pulled By: ajkr fbshipit-source-id: 00fe5ce72f9f11a1c1cba5f1977b908b2d518f29	2024-04-22 20:13:58 -07:00
Hui Xiao	7d83b4e3e5	Fix file checksum mismatch due to padded bytes when block_align=true (#12542 ) Summary: Context/Summary: When `BlockBasedTableOptions::block_align=true`, we pad bytes to align blocks `d41e568b1c/table/block_based/block_based_table_builder.cc (L1415-L1421)`. Those bytes are not included in generating the file checksum upon file creation. But `VerifyFileChecksums()` includes those bytes in generating the file check to compare against the checksum generating upon file creation. Therefore a file checksum mismatch is returned in `VerifyFileChecksums()`. We decided to include those padded bytes in generating the checksum upon file creation. Bonus: also fix surrounding code to use actual padded bytes for verification - see https://github.com/facebook/rocksdb/pull/12542#discussion_r1571429163 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12542 Test Plan: - New UT - Benchmark ``` TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq[-X300] --num=100000 --block_align=1 --compression_type=none ``` Pre-PR: fillseq [AVG 300 runs] : 422857 (± 3942) ops/sec; 46.8 (± 0.4) MB/sec Post-PR: fillseq [AVG 300 runs] : 424707 (± 3799) ops/sec; 47.0 (± 0.4) MB/sec Reviewed By: ajkr Differential Revision: D56168447 Pulled By: hx235 fbshipit-source-id: 96209ef950d42943d336f11968ae3fcf9872fc2c	2024-04-22 14:07:34 -07:00
Levi Tamasi	e82fe7c0b7	Fix the move semantics of PinnableWideColumns (#12557 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12557 Unlike for other sequence containers, the C++ standard allows moving an `std::string` to invalidate pointers/iterators/references. In practice, this happens with short strings which are stored "inline" in the `std::string` object (small string optimization). Since `PinnableSlice` uses `std::string` as its internal buffer, and `PinnableWideColumns` in turn is implemented in terms of `PinnableSlice`, this means that the default compiler-generated move operations can invalidate the column index stored in `PinnableWideColumns::columns_`. The PR fixes this by providing custom move constructor/move assignment implementations for `PinnableWideColumns` that recreate the `columns_` index upon move. Reviewed By: jaykorean Differential Revision: D56275054 fbshipit-source-id: e8648c003dbcf1c39ec122ad229780c28138e730	2024-04-17 18:56:23 -07:00
Andrew Kryczka	7027265417	Fix `max_successive_merges` counting CPU overhead regression (#12546 ) Summary: In https://github.com/facebook/rocksdb/issues/12365 we made `max_successive_merges` non-strict by default. Before https://github.com/facebook/rocksdb/issues/12365, `CountSuccessiveMergeEntries()`'s scan was implicitly limited to `max_successive_merges` entries for a given key, because after that the merge operator would be invoked and the merge chain would be collapsed. After https://github.com/facebook/rocksdb/issues/12365, the merge chain will not be collapsed no matter how long it is when the chain's operands are not all in memory. Since `CountSuccessiveMergeEntries()` scanned the whole merge chain, https://github.com/facebook/rocksdb/issues/12365 had a side effect that it would scan more memtable entries. This PR introduces a limit so it won't scan more entries than it could before. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12546 Reviewed By: jaykorean Differential Revision: D56193693 Pulled By: ajkr fbshipit-source-id: b070ba0703ef733e0ff230f89cd5cca5233b84da	2024-04-17 12:11:24 -07:00
akankshamahajan	1856734821	Branch cut 9.1.fb (#12476 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12476 Reviewed By: jowlyzhang Differential Revision: D55319508 Pulled By: akankshamahajan15 fbshipit-source-id: 2b6db671e027511282775c0fea155335d8e73cc2	2024-03-25 15:07:43 -07:00
Andrew Kryczka	bf98dcf9a8	Fix kBlockCacheTier read when merge-chain base value is in a blob file (#12462 ) Summary: The original goal is to propagate failures from `GetContext::SaveValue()` -> `GetContext::GetBlobValue()` -> `BlobFetcher::FetchBlob()` up to the user. This call sequence happens when a merge chain ends with a base value in a blob file. There's also fixes for bugs encountered along the way where non-ok statuses were ignored/overwritten, and a bit of plumbing work for functions that had no capability to return a status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12462 Test Plan: A repro command ``` db=/dev/shm/dbstress_db ; exp=/dev/shm/dbstress_exp ; rm -rf $db $exp ; mkdir -p $db $exp ./db_stress \ --clear_column_family_one_in=0 \ --test_batches_snapshots=0 \ --write_fault_one_in=0 \ --use_put_entity_one_in=0 \ --prefixpercent=0 \ --read_fault_one_in=0 \ --readpercent=0 \ --reopen=0 \ --set_options_one_in=10000 \ --delpercent=0 \ --delrangepercent=0 \ --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 \ --open_write_fault_one_in=0 \ --destroy_db_initially=0 \ --ingest_external_file_one_in=0 \ --iterpercent=0 \ --nooverwritepercent=0 \ --db=$db \ --enable_blob_files=1 \ --expected_values_dir=$exp \ --max_background_compactions=20 \ --max_bytes_for_level_base=2097152 \ --max_key=100000 \ --min_blob_size=0 \ --open_files=-1 \ --ops_per_thread=100000000 \ --prefix_size=-1 \ --target_file_size_base=524288 \ --use_merge=1 \ --value_size_mult=32 \ --write_buffer_size=524288 \ --writepercent=100 ``` It used to fail like: ``` ... frame https://github.com/facebook/rocksdb/issues/9: 0x00007fc63903bc93 libc.so.6`__GI___assert_fail(assertion="HasDefaultColumn(columns)", file="fbcode/internal_repo_rocksdb/repo/db/wide/wide_columns_helper.h", line=33, function="static const rocksdb::Slice &rocksdb::WideColumnsHelper::GetDefaultColumn(const rocksdb::WideColumns &)") at assert.c:101:3 frame https://github.com/facebook/rocksdb/issues/10: 0x00000000006f7e92 db_stress`rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice, rocksdb::PinnableWideColumns, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, rocksdb::Status, rocksdb::MergeContext, unsigned long, rocksdb::PinnedIteratorsManager, bool, bool, unsigned long, rocksdb::ReadCallback, bool, bool) [inlined] rocksdb::WideColumnsHelper::GetDefaultColumn(columns=size=0) at wide_columns_helper.h:33 frame https://github.com/facebook/rocksdb/issues/11: 0x00000000006f7e76 db_stress`rocksdb::Version::Get(this=0x00007fc5ec763000, read_options=<unavailable>, k=<unavailable>, value=0x0000000000000000, columns=0x00007fc6035fd1d8, timestamp=<unavailable>, status=0x00007fc6035fd250, merge_context=0x00007fc6035fce40, max_covering_tombstone_seq=0x00007fc6035fce90, pinned_iters_mgr=0x00007fc6035fcdf0, value_found=0x0000000000000000, key_exists=0x0000000000000000, seq=0x0000000000000000, callback=0x0000000000000000, is_blob=0x0000000000000000, do_merge=<unavailable>) at version_set.cc:2492 frame https://github.com/facebook/rocksdb/issues/12: 0x000000000051e245 db_stress`rocksdb::DBImpl::GetImpl(this=0x00007fc637a86000, read_options=0x00007fc6035fcf60, key=<unavailable>, get_impl_options=0x00007fc6035fd000) at db_impl.cc:2408 frame https://github.com/facebook/rocksdb/issues/13: 0x000000000050cec2 db_stress`rocksdb::DBImpl::GetEntity(this=0x00007fc637a86000, _read_options=<unavailable>, column_family=<unavailable>, key=0x00007fc6035fd3c8, columns=0x00007fc6035fd1d8) at db_impl.cc:2109 frame https://github.com/facebook/rocksdb/issues/14: 0x000000000074f688 db_stress`rocksdb::(anonymous namespace)::MemTableInserter::MergeCF(this=0x00007fc6035fd450, column_family_id=2, key=0x00007fc6035fd3c8, value=0x00007fc6035fd3a0) at write_batch.cc:2656 frame https://github.com/facebook/rocksdb/issues/15: 0x00000000007476fc db_stress`rocksdb::WriteBatchInternal::Iterate(wb=0x00007fc6035fe698, handler=0x00007fc6035fd450, begin=12, end=<unavailable>) at write_batch.cc:607 frame https://github.com/facebook/rocksdb/issues/16: 0x000000000074d7dd db_stress`rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables, rocksdb::FlushScheduler, rocksdb::TrimHistoryScheduler, bool, unsigned long, rocksdb::DB, bool, bool, bool) [inlined] rocksdb::WriteBatch::Iterate(this=<unavailable>, handler=0x00007fc6035fd450) const at write_batch.cc:505 frame https://github.com/facebook/rocksdb/issues/17: 0x000000000074d77b db_stress`rocksdb::WriteBatchInternal::InsertInto(write_group=<unavailable>, sequence=<unavailable>, memtables=<unavailable>, flush_scheduler=<unavailable>, trim_history_scheduler=<unavailable>, ignore_missing_column_families=<unavailable>, recovery_log_number=0, db=0x00007fc637a86000, concurrent_memtable_writes=<unavailable>, seq_per_batch=false, batch_per_txn=<unavailable>) at write_batch.cc:3084 frame https://github.com/facebook/rocksdb/issues/18: 0x0000000000631d77 db_stress`rocksdb::DBImpl::PipelinedWriteImpl(this=0x00007fc637a86000, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000) at db_impl_write.cc:807 frame https://github.com/facebook/rocksdb/issues/19: 0x000000000062ceeb db_stress`rocksdb::DBImpl::WriteImpl(this=<unavailable>, write_options=<unavailable>, my_batch=0x00007fc6035fe698, callback=0x0000000000000000, log_used=<unavailable>, log_ref=0, disable_memtable=<unavailable>, seq_used=0x0000000000000000, batch_cnt=0, pre_release_callback=0x0000000000000000, post_memtable_callback=0x0000000000000000) at db_impl_write.cc:312 frame https://github.com/facebook/rocksdb/issues/20: 0x000000000062c8ec db_stress`rocksdb::DBImpl::Write(this=0x00007fc637a86000, write_options=0x00007fc6035feca8, my_batch=0x00007fc6035fe698) at db_impl_write.cc:157 frame https://github.com/facebook/rocksdb/issues/21: 0x000000000062b847 db_stress`rocksdb::DB::Merge(this=0x00007fc637a86000, opt=0x00007fc6035feca8, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, value=0x00007fc6035fe830) at db_impl_write.cc:2544 frame https://github.com/facebook/rocksdb/issues/22: 0x000000000062b6ef db_stress`rocksdb::DBImpl::Merge(this=0x00007fc637a86000, o=<unavailable>, column_family=0x00007fc6370bf140, key=0x00007fc6035fe8d8, val=0x00007fc6035fe830) at db_impl_write.cc:72 frame https://github.com/facebook/rocksdb/issues/23: 0x00000000004d6397 db_stress`rocksdb::NonBatchedOpsStressTest::TestPut(this=0x00007fc637041000, thread=0x00007fc6370dbc00, write_opts=0x00007fc6035feca8, read_opts=0x00007fc6035fe9c8, rand_column_families=<unavailable>, rand_keys=size=1, value={P\xe9_\x03\xc6\x7f\0\0}) at no_batched_ops_stress.cc:1317 frame https://github.com/facebook/rocksdb/issues/24: 0x000000000049361d db_stress`rocksdb::StressTest::OperateDb(this=0x00007fc637041000, thread=0x00007fc6370dbc00) at db_stress_test_base.cc:1148 ... ``` Reviewed By: ltamasi Differential Revision: D55157795 Pulled By: ajkr fbshipit-source-id: 5f7c1380ead5794c29d41680028e34b839744764	2024-03-21 12:38:53 -07:00
anand76	63a105a481	Enable recycle_log_file_num option for point in time recovery (#12403 ) Summary: This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL. This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403 Test Plan: 1. Add new unit tests to verify behavior upon corruption 2. Re-enable disabled tests for verifying recycling behavior Reviewed By: ajkr Differential Revision: D54544824 Pulled By: anand1976 fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a	2024-03-21 12:29:35 -07:00
Andrew Kryczka	4d5ebad971	Fix kBlockCacheTier read with table cache miss (#12443 ) Summary: Thanks ltamasi for pointing out this bug. We were incorrectly overwriting `Status::Incomplete` with `Status::OK` after a table cache miss failed to open the file due to the read being memory-only (`kBlockCacheTier`). The fix is to simply stop overwriting the status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12443 Reviewed By: cbi42 Differential Revision: D54930128 Pulled By: ajkr fbshipit-source-id: 52f912a2e93b46e71d79fc5968f8ca35b299213d	2024-03-15 14:41:58 -07:00
Changyu Bi	096fb9b67d	Fix data race in WalManager (#12439 ) Summary: Crash tests were failing due to data race in accessing `purge_wal_files_last_run_`. This PR changes it to atomic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12439 Test Plan: - existing UT - not able to repro with `python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --WAL_ttl_seconds=1` and TSAN yet, will monitor internal crash tests Reviewed By: anand1976 Differential Revision: D54920817 Pulled By: cbi42 fbshipit-source-id: 80ee026b1785ad5dba11295ed35c88889df5f5a6	2024-03-14 21:24:06 -07:00
Yu Zhang	31dfc81e18	Start 9.1.0 release (#12360 ) Summary: with release notes for 9.0.fb, format_compatible test update, and version.h update. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12360 Test Plan: CI Reviewed By: cbi42 Differential Revision: D53879416 Pulled By: jowlyzhang fbshipit-source-id: 29598893d9ce2d0bb181345ddb78f9b1529aee75	2024-02-16 18:26:48 -08:00
Levi Tamasi	de1e3ff6ea	Fix a data race in DBImpl::RenameTempFileToOptionsFile (#12347 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12347 `DBImpl::disable_delete_obsolete_files_` should only be accessed while holding the DB mutex to prevent data races. There's a piece of logic in `DBImpl::RenameTempFileToOptionsFile` where this synchronization was previously missing. The patch fixes this issue similarly to how it's handled in `DisableFileDeletions` and `EnableFileDeletions`, that is, by saving the counter value while holding the mutex and then performing the actual file deletion outside the critical section. Note: this PR only fixes the race itself; as a followup, we can also look into cleaning up and optimizing the file deletion logic (which is currently inefficient on multiple different levels). Reviewed By: jowlyzhang Differential Revision: D53675153 fbshipit-source-id: 5358e894ee6829d3edfadac50a93d97f8819e481	2024-02-12 13:26:09 -08:00
anand76	95b41eec6d	Fix potential incorrect result for duplicate key in MultiGet (#12295 ) Summary: The RocksDB correctness testing has recently discovered a possible, but very unlikely, correctness issue with MultiGet. The issue happens when all of the below conditions are met - 1. Duplicate keys in a MultiGet batch 2. Key matches the last key in a non-zero, non-bottommost level file 3. Final value is not in the file (merge operand, not snapshot visible etc) 4. Multiple entries exist for the key in the file spanning more than 1 data block. This can happen due to snapshots, which would force multiple versions of the key in the file, and they may spill over to another data block 5. Lookup attempt in the SST for the first of the duplicates fails with IO error on a data block (NOT the first data block, but the second or subsequent uncached block), but no errors for the other duplicates 6. Value or merge operand for the key is present in the very next level The problem is, in FilePickerMultiGet, when looking up keys in a level we use FileIndexer and the overlapping file in the current level to determine the search bounds for that key in the file list in the next level. If the next level is empty, the search bounds are reset and we do a full binary search in the next non-empty level's LevelFilesBrief. However, under the conditions https://github.com/facebook/rocksdb/issues/1 and https://github.com/facebook/rocksdb/issues/2 listed above, only the first of the duplicates has its next-level search bounds updated, and the remaining duplicates are skipped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12295 Test Plan: Add unit tests that fail an assertion or return wrong result without the fix Reviewed By: hx235 Differential Revision: D53187634 Pulled By: anand1976 fbshipit-source-id: a5eadf4fede9bbdec784cd993b15e3341436d1ea	2024-02-02 11:48:35 -08:00
Chdy	fc48af33f5	fix some perf statistic in write (#12285 ) Summary: ### Summary: perf context lack statistics in some write steps ``` rocksdb::get_perf_context()->write_wal_time); rocksdb::get_perf_context()->write_memtable_time); rocksdb::get_perf_context()->write_pre_and_post_process_time); ``` #### case 1: when the unordered_write is true, the `write_memtable_time` is 0 ``` write_wal_time : 13.7012 write_memtable_time : 0 write_pre_and_post_process_time : 142.037 ``` Reason: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` during insert memtable, ```c++ Status DBImpl::UnorderedWriteMemtable(const WriteOptions& write_options, WriteBatch* my_batch, WriteCallback* callback, uint64_t log_ref, SequenceNumber seq, const size_t sub_batch_cnt) { ... if (w.CheckCallback(this) && w.ShouldWriteToMemtable()) { // need calculate write_memtable_time ColumnFamilyMemTablesImpl column_family_memtables( versions_->GetColumnFamilySet()); w.status = WriteBatchInternal::InsertInto( &w, w.sequence, &column_family_memtables, &flush_scheduler_, &trim_history_scheduler_, write_options.ignore_missing_column_families, 0 /log_number/, this, true /concurrent_memtable_writes/, seq_per_batch_, sub_batch_cnt, true /batch_per_txn/, write_options.memtable_insert_hint_per_batch); if (write_options.disableWAL) { has_unpersisted_data_.store(true, std::memory_order_relaxed); } } ... } ``` Fix: add perf function ``` write_wal_time : 14.3991 write_memtable_time : 19.3367 write_pre_and_post_process_time : 130.441 ``` #### case 2: when the enable_pipelined_write is true, the `write_memtable_time` is small ``` write_wal_time : 11.2986 write_memtable_time : 1.0205 write_pre_and_post_process_time : 140.131 ``` Fix: `DBImpl::UnorderedWriteMemtable` function has no statistical `write_memtable_time` when `w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER` ```c++ Status DBImpl::PipelinedWriteImpl(const WriteOptions& write_options, WriteBatch* my_batch, WriteCallback* callback, uint64_t* log_used, uint64_t log_ref, bool disable_memtable, uint64_t* seq_used) { ... if (w.state == WriteThread::STATE_PARALLEL_MEMTABLE_WRITER) { // need calculate write_memtable_time assert(w.ShouldWriteToMemtable()); ColumnFamilyMemTablesImpl column_family_memtables( versions_->GetColumnFamilySet()); w.status = WriteBatchInternal::InsertInto( &w, w.sequence, &column_family_memtables, &flush_scheduler_, &trim_history_scheduler_, write_options.ignore_missing_column_families, 0 /log_number/, this, true /concurrent_memtable_writes/, false /seq_per_batch/, 0 /batch_cnt/, true /batch_per_txn/, write_options.memtable_insert_hint_per_batch); if (write_thread_.CompleteParallelMemTableWriter(&w)) { MemTableInsertStatusCheck(w.status); versions_->SetLastSequence(w.write_group->last_sequence); write_thread_.ExitAsMemTableWriter(&w, w.write_group); } } if (seq_used != nullptr) { seq_used = w.sequence; } assert(w.state == WriteThread::STATE_COMPLETED); return w.FinalStatus(); } ``` FIx: add perf function ``` write_wal_time : 10.5201 write_memtable_time : 17.1048 write_pre_and_post_process_time : 114.313 ``` #### case3: `DBImpl::WriteImplWALOnly` function has no statistical `write_delay_time` ```c++ Status DBImpl::WriteImplWALOnly( WriteThread* write_thread, const WriteOptions& write_options, WriteBatch* my_batch, WriteCallback* callback, uint64_t* log_used, const uint64_t log_ref, uint64_t* seq_used, const size_t sub_batch_cnt, PreReleaseCallback* pre_release_callback, const AssignOrder assign_order, const PublishLastSeq publish_last_seq, const bool disable_memtable) { ... if (publish_last_seq == kDoPublishLastSeq) { } else { // need calculate write_delay_time InstrumentedMutexLock lock(&mutex_); Status status = DelayWrite(/num_bytes=/0ull, *write_thread, write_options); if (!status.ok()) { WriteThread::WriteGroup write_group; write_thread->EnterAsBatchGroupLeader(&w, &write_group); write_thread->ExitAsBatchGroupLeader(write_group, status); return status; } } } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12285 Reviewed By: ajkr Differential Revision: D53191765 Pulled By: cbi42 fbshipit-source-id: f78d5b280bea6a777f077c89c3e0b8fe98d3c860	2024-01-29 12:31:11 -08:00
Hui Xiao	1b2b16b38e	Fix bug of newer ingested data assigned with an older seqno (#12257 ) Summary: Context: We found an edge case where newer ingested data is assigned with an older seqno. This causes older data of that key to be returned for read. Consider the following lsm shape: ![image](https://github.com/facebook/rocksdb/assets/83968999/973fd160-5065-49cd-8b7b-b6ab4badae23) Then ingest a file to L5 containing new data of key_overlap. Because of [this](https://l.facebook.com/l.php?u=https%3A%2F%2Fgithub.com%2Ffacebook%2Frocksdb%2Fblob%2F5a26f392ca640818da0b8590be6119699e852b07%2Fdb%2Fexternal_sst_file_ingestion_job.cc%3Ffbclid%3DIwAR10clXxpUSrt6sYg12sUMeHfShS7XigFrsJHvZoUDroQpbj_Sb3dG_JZFc%23L951-L956&h=AT0m56P7O0ZML7jk1sdjgnZZyGPMXg9HkKvBEb8mE9ZM3fpJjPrArAMsaHWZQPt9Ki-Pn7lv7x-RT9NEd_202Y6D2juIVHOIt3EjCZptDKBLRBMG49F8iBUSM9ypiKe8XCfM-FNW2Hl4KbVq2e3nZRbMvUM), the file is assigned with seqno 2, older than the old data's seqno 4. After just another compaction, we will drop the new_v for key_overlap because of the seqno and cause older data to be returned. ![image](https://github.com/facebook/rocksdb/assets/83968999/a3ef95e4-e7ae-4c30-8d03-955cd4b5ed42) Summary: This PR removes the incorrect seqno assignment Pull Request resolved: https://github.com/facebook/rocksdb/pull/12257 Test Plan: - New unit test failed before the fix but passes after - python3 tools/db_crashtest.py --compaction_style=1 --ingest_external_file_one_in=10 --preclude_last_level_data_seconds=36000 --compact_files_one_in=10 --enable_blob_files=0 blackbox` - Rehearsal stress test Reviewed By: cbi42 Differential Revision: D52926092 Pulled By: hx235 fbshipit-source-id: 9e4dade0f6cc44e548db8fca27ccbc81a621cd6f	2024-01-24 11:21:05 -08:00
Peter Dillinger	800cfae987	Start 9.0.0 release (#12256 ) Summary: with release notes for 8.11.fb, format_compatible test update, and version.h update. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12256 Test Plan: CI Reviewed By: cbi42 Differential Revision: D52926051 Pulled By: pdillinger fbshipit-source-id: adcf7119b065758599e904c16cbdf1d28811e0b4	2024-01-20 08:38:20 -08:00
Peter Dillinger	cb08a682d4	Fix/cleanup SeqnoToTimeMapping (#12253 ) Summary: The SeqnoToTimeMapping class (RocksDB internal) used by the preserve_internal_time_seconds / preclude_last_level_data_seconds options was essentially in a prototype state with some significant flaws that would risk biting us some day. This is a big, complicated change because both the implementation and the behavioral requirements of the class needed to be upgraded together. In short, this makes SeqnoToTimeMapping more internally responsible for maintaining good invariants, so that callers don't easily encounter dangerous scenarios. * Some API functions were confusingly named and structured, so I fully refactored the APIs to use clear naming (e.g. `DecodeFrom` and `CopyFromSeqnoRange`), object states, function preconditions, etc. * Previously the object could informally be sorted / compacted or not, and there was limited checking or enforcement on these states. Now there's a well-defined "enforced" state that is consistently checked in debug mode for applicable operations. (I attempted to create a separate "builder" class for unenforced states, but IIRC found that more cumbersome for existing uses than it was worth.) * Previously operations would coalesce data in a way that was better for `GetProximalTimeBeforeSeqno` than for `GetProximalSeqnoBeforeTime` which is odd because the latter is the only one used by DB code currently (what is the seqno cut-off for data definitely older than this given time?). This is now reversed to consistently favor `GetProximalSeqnoBeforeTime`, with that logic concentrated in one place: `SeqnoToTimeMapping::SeqnoTimePair::Merge()`. Unfortunately, a lot of unit test logic was specifically testing the old, suboptimal behavior. * Previously, the natural behavior of SeqnoToTimeMapping was to THROW AWAY data needed to get reasonable answers to the important `GetProximalSeqnoBeforeTime` queries. This is because SeqnoToTimeMapping only had a FIFO policy for staying within the entry capacity (except in aggregate+sort+serialize mode). If the DB wasn't extremely careful to avoid gathering too many time mappings, it could lose track of where the seqno cutoff was for cold data (`GetProximalSeqnoBeforeTime()` returning 0) and preventing all further data migration to the cold tier--until time passes etc. for mappings to catch up with FIFO purging of them. (The problem is not so acute because SST files contain relevant snapshots of the mappings, but the problem would apply to long-lived memtables.) * Now the SeqnoToTimeMapping class has fully-integrated smarts for keeping a sufficiently complete history, within capacity limits, to give good answers to `GetProximalSeqnoBeforeTime` queries. * Fixes old `// FIXME: be smarter about how we erase to avoid data falling off the front prematurely.` * Fix an apparent bug in how entries are selected for storing into SST files. Previously, it only selected entries within the seqno range of the file, but that would easily leave a gap at the beginning of the timeline for data in the file for the purposes of answering GetProximalXXX queries with reasonable accuracy. This could probably lead to the same problem discussed above in naively throwing away entries in FIFO order in the old SeqnoToTimeMapping. The updated testing of GetProximalSeqnoBeforeTime in BasicSeqnoToTimeMapping relies on the fixed behavior. * Fix a potential compaction CPU efficiency/scaling issue in which each compaction output file would iterate over and sort all seqno-to-time mappings from all compaction input files. Now we distill the input file entries to a constant size before processing each compaction output file. Intended follow-up (me or others): * Expand some direct testing of SeqnoToTimeMapping APIs. Here I've focused on updating existing tests to make sense. * There are likely more gaps in availability of needed SeqnoToTimeMapping data when the DB shuts down and is restarted, at least with WAL. * The data tracked in the DB could be kept more accurate and limited if it used the oldest seqno of unflushed data. This might require some more API refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12253 Test Plan: unit tests updated Reviewed By: jowlyzhang Differential Revision: D52913733 Pulled By: pdillinger fbshipit-source-id: 020737fcbbe6212f6701191a6ab86565054c9593	2024-01-19 21:50:38 -08:00
Levi Tamasi	7e4406a171	Add a changelog entry for PR 12235 (#12238 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12238 Reviewed By: jowlyzhang Differential Revision: D52809593 fbshipit-source-id: 692852cdd3074275ef92bde83ff15a800d8ae3d5	2024-01-16 13:02:18 -08:00
akankshamahajan	cad76a2e1e	Fix bug in auto_readahead_size that returned wrong key (#12229 ) Summary: IndexType::kBinarySearchWithFirstKey + BlockCacheLookupForReadAheadSize enabled => FindNextUserEntryInternal assertion fails or iterator lands at a wrong key because BlockCacheLookupForReadAheadSize moves the index_iter_ and in internal_wrapper.h, result_.key didn't update and pointed to wrong key. Also ikey_ was also pointing to iter_.key() instead of copying the key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12229 Test Plan: ``` rm -rf /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 mkdir /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 /dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 ./db_stress -threads=1 --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_setting_blob_options_dynamically=0 --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=0 --backup_one_in=0 --batch_protection_bytes_per_key=0 --blob_cache_size=0 --blob_compaction_readahead_size=0 --blob_compression_type=lz4 --blob_file_size=0 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0 --blob_garbage_collection_force_threshold=0 --block_protection_bytes_per_key=0 --block_size=2048 --bloom_before_level=2147483646 --bloom_bits=15 --bottommost_compression_type=snappy --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_readahead_size=0 --compaction_ttl=0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox_alt3 --db_write_buffer_size=0 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected_alt3 --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=13 --index_type=3 --ingest_external_file_one_in=10 --initial_auto_readahead_size=0 --iterpercent=55 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=10 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_blob_size=8 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=10000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=1 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=0 --snapshot_hold_ops=0 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=600 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=0 --use_blob_cache=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=0 --verify_checksum_one_in=0 --verify_db_one_in=0 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=1 --verify_sst_unique_id_in_manifest=0 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=0 > repro.out Verification failed. Expected state has key 0000000000000077000000000000004178, iterator is at key 0000000000000077000000000000008A78 Column family: default, op_logs: S 0000000000000077000000000000003D7878787878 NNNN No writes or ops? Verification failed :( ``` Reviewed By: ajkr Differential Revision: D52710655 Pulled By: akankshamahajan15 fbshipit-source-id: 9d2e684e190fb0832bdce3337bce1c6548cd054d	2024-01-16 11:30:36 -08:00
Peter Dillinger	5da900f28a	Fix a case of ignored corruption in creating backups (#12200 ) Summary: We often need to read the table properties of an SST file when taking a backup. However, we currently do not check checksums for this step, and even with that enabled, we ignore failures. This change ensures we fail creating a backup if corruption is detected in that step of reading table properties. To get this working properly (with existing unit tests), we also add some temperature handling logic like already exists in BackupEngineImpl::ReadFileAndComputeChecksum and elsewhere in BackupEngine. Also, SstFileDumper needed a fix to its error handling logic. This was originally intended to help diagnose some mysterious failures (apparent corruptions) seen in taking backups in the crash test, though that is now fixed in https://github.com/facebook/rocksdb/pull/12206 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12200 Test Plan: unit test added that corrupts table properties, along with existing tests Reviewed By: ajkr Differential Revision: D52520674 Pulled By: pdillinger fbshipit-source-id: 032cfc0791428f3b8147d34c7d424ab128e28f42	2024-01-05 09:48:19 -08:00
Peter Dillinger	ed46981bea	Fix and defend against FilePrefetchBuffer combined with mmap reads (#12206 ) Summary: FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200. In this change, * Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption. * Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior. * Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer * Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues. Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206 Test Plan: * Manually verified that the new check aids in debugging. * Unit test added, that fails if either fix is missed. * Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200 Reviewed By: akankshamahajan15 Differential Revision: D52551701 Pulled By: pdillinger fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc	2024-01-04 18:39:05 -08:00
Hui Xiao	06e593376c	Group SST write in flush, compaction and db open with new stats (#11910 ) Summary: ## Context/Summary Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity. For that, this PR does the following: - Tag different write IOs by passing down and converting WriteOptions to IOOptions - Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH\|COMPACTION\|DB_OPEN}_MICROS Some related code refactory to make implementation cleaner: - Blob stats - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH\|COMPACTION\|DB_OPEN}_MICROS include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info. - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write. - Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority - Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification - Build table - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder. This parameter is used for dynamically changing file io priority for flush, see https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority ## Test ### db bench Flush ``` ./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100 rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377 rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 ``` compaction, db oopen ``` Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279 rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0 rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213 rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66 ``` blob stats - just to make sure they aren't broken by this PR ``` Integrated Blob DB Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact --db=../db_bench --use_existing_db=1 pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600 rocksdb.blobdb.blob.file.synced COUNT : 1 rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same) ``` ``` Stacked Blob DB Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench pre-PR: rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876 rocksdb.blobdb.blob.file.synced COUNT : 8 rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 post-PR: rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924 - COUNT is higher and values are smaller as it includes header and footer write - COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164 rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same) rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same) ``` ### Rehearsal CI stress test Trigger 3 full runs of all our CI stress tests ### Performance Flush ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark; enable_statistics = true Pre-pr: avg 507515519.3 ns 497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908, Post-pr: avg 511971266.5 ns, regressed 0.88% 502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408, ``` Compaction ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre\|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 495346098.30 ns 492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846 Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97% 502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007 ``` Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats) ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1 --benchmark_repetitions=1000 -- default: 1 thread is used to run benchmark Pre-pr: avg 3848.10 ns 3814,3838,3839,3848,3854,3854,3854,3860,3860,3860 Post-pr: avg 3874.20 ns, regressed 0.68% 3863,3867,3871,3874,3875,3877,3877,3877,3880,3881 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910 Reviewed By: ajkr Differential Revision: D49788060 Pulled By: hx235 fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff	2023-12-29 15:29:23 -08:00
Peter Dillinger	a771a47a1b	Fix leak or crash on failure in automatic atomic flush (#12176 ) Summary: Through code inspection in debugging an apparent leak of ColumnFamilyData in the crash test, I found a case where too few UnrefAndTryDelete() could be called on a cfd. This fixes that case, which would fail like this in the new unit test: ``` db_flush_test: db/column_family.cc:1648: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12176 Test Plan: unit test added Reviewed By: cbi42 Differential Revision: D52417071 Pulled By: pdillinger fbshipit-source-id: 4ee33c918409cf9c1968f138e273d3347a6cc8e5	2023-12-26 11:04:25 -08:00
Levi Tamasi	81765866c4	Update HISTORY/version/format compatibility script for the 8.10 release (#12154 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12154 Reviewed By: jaykorean, akankshamahajan15 Differential Revision: D52216271 Pulled By: ltamasi fbshipit-source-id: 13bab72802eeec8f6e3544be9ebcd7f725a64d2e	2023-12-15 14:44:23 -08:00
Akanksha Mahajan	cd577f6059	Fix WRITE_STALL start_time (#12147 ) Summary: `Delayed` is set true in two cases. One is when `delay` is specified. Other one is in the `while` loop - `cd21e4e69d/db/db_impl/db_impl_write.cc (L1876)` However start_time is not initialized in second case, resulting in time_delayed = immutable_db_options_.clock->NowMicros() - 0(start_time); Pull Request resolved: https://github.com/facebook/rocksdb/pull/12147 Test Plan: Existing CircleCI Reviewed By: cbi42 Differential Revision: D52173481 Pulled By: akankshamahajan15 fbshipit-source-id: fb9183b24c191d287a1d715346467bee66190f98	2023-12-14 13:45:06 -08:00
akankshamahajan	d926593df5	Fix stress tests failure for auto_readahead_size (#12131 ) Summary: When auto_readahead_size is enabled, Prev operation calls SeekForPrev in db_iter so that - BlockBasedTableIterator can point index_iter_ to the right block. - disable readahead_cache_lookup. However, there can be cases where SeekForPrev might not go through Version_set and call BlockBasedTableIterator SeekForPrev. In that case, when BlockBasedTableIterator::Prev is called, it returns NotSupported error. This more like a corner case. So to handle that case, removed SeekForPrev calling from db_iter and reseeking index_iter_ in Prev operation. block_iter_'s key already point to right block. So reseeking to index_iter_ solves the issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12131 Test Plan: - Tested on db_stress command that was failing - `./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --best_efforts_recovery=1 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=0 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_ttl=10 --compressed_secondary_cache_size=16777216 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/home/akankshamahajan/rocksdb_auto_tune/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=10000 --skip_verifydb=1 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=0 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35` - make crash_test -j32 Reviewed By: anand1976 Differential Revision: D51986326 Pulled By: akankshamahajan15 fbshipit-source-id: 90e11e63d1f1894770b457a44d8b213ae5512df9	2023-12-13 12:15:04 -08:00
anand76	c1b84d0437	Fix false negative in TieredSecondaryCache nvm cache lookup (#12134 ) Summary: There is a bug in the `TieredSecondaryCache` that can result in a false negative. This can happen when a MultiGet does a cache lookup that gets a hit in the `TieredSecondaryCache` local nvm cache tier, and the result is available before MultiGet calls `WaitAll` (i.e the nvm cache `SecondaryCacheResultHandle` `IsReady` returns true). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12134 Test Plan: Add a new unit test in tiered_secondary_cache_test Reviewed By: akankshamahajan15 Differential Revision: D52023309 Pulled By: anand1976 fbshipit-source-id: e5ae681226a0f12753fecb2f6acc7e5f254ae72b	2023-12-11 16:59:59 -08:00
Kevin Mingtarja	44fd914128	Fix double counting of BYTES_WRITTEN ticker (#12111 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/12061. We were double counting the `BYTES_WRITTEN` ticker when doing writes with transactions. During transactions, after writing, a client can call `Prepare()`, which writes the values to WAL but not to the Memtable. After that, they can call `Commit()`, which writes a commit marker to the WAL and the values to Memtable. The cause of this bug is previously during writes, we didn't take into account `writer->ShouldWriteToMemtable()` before adding to `total_byte_size`, so it is still added to during the `Prepare()` phase even though we're not writing to the Memtable, which was why we saw the value to be double of what's written to WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12111 Test Plan: Added a test in `db/db_statistics_test.cc` that tests writes with and without transactions, by comparing the values of `BYTES_WRITTEN` and `WAL_FILE_BYTES` after doing writes. Reviewed By: jaykorean Differential Revision: D51954327 Pulled By: jowlyzhang fbshipit-source-id: 57a0986a14e5b94eb5188715d819212529110d2c	2023-12-08 17:12:11 -08:00
Levi Tamasi	a143f93236	Turn the default Timer in PeriodicTaskScheduler into a leaky Meyers singleton (#12128 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12128 The patch turns the `Timer` Meyers singleton in `PeriodicTaskScheduler::Default()` into one of the leaky variety in order to prevent static destruction order issues. Reviewed By: akankshamahajan15 Differential Revision: D51963950 fbshipit-source-id: 0fc34113ad03c51fdc83bdb8c2cfb6c9f6913948	2023-12-08 10:34:07 -08:00
Jay Huh	ddb7df10ef	Update HISTORY.md and version.h for 8.9.fb release (#12074 ) Summary: Creating cut for 8.9 release Pull Request resolved: https://github.com/facebook/rocksdb/pull/12074 Test Plan: CI Reviewed By: ajkr Differential Revision: D51435289 Pulled By: jaykorean fbshipit-source-id: 3918a8250032839e5b71f67f26c8ba01cbc17a41	2023-11-21 18:07:19 -08:00
Yu Zhang	84a54e1e28	Fix some bugs in index builder and reader for the UDT in memtable only feature (#12062 ) Summary: These bugs surfaced while I was trying to add the stress test for the feature: Bug 1) On the index building path: the optimization to use user key instead of internal key as separator needed a bit tweak for when user defined timestamps can be removed. Because even though the user key look different now and eligible to be used as separator, when their user-defined timestamps are removed, they could be equal and that invariant no longer stands. Bug 2) On the index reading path: one path that builds the second level index iterator for `PartitionedIndexReader` are not passing the corresponding `user_defined_timestamps_persisted` flag. As a result, the default `true` value be used leading to no minimum timestamps padded when they should be. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12062 Test Plan: For bug 1): added separate unit test `BlockBasedTableReaderTest::Get` to exercise the `Get` API. It's a different code path from `MultiGet` so worth having its own test. Also in order to cover the bug, the test is modified to generate key values with the same user provided key, different timestamps and different sequence numbers. The test reads back different versions of the same user provided key. `MultiGet` takes one `ReadOptions` with one read timestamp so we cannot test retrieving different versions of the same key easily. For bug 2): simply added options `BlockBasedTableOptions.metadata_cache_options.partition_pinning = PinningTier::kAll` to exercise all the index iterator creating paths. Reviewed By: ltamasi Differential Revision: D51508280 Pulled By: jowlyzhang fbshipit-source-id: 8b174d3d70373c0599266ac1f467f2bd4d7ea6e5	2023-11-21 14:05:02 -08:00
Peter Dillinger	92dc5f3e67	AutoHCC: fix a bug with "blind" Insert (#12046 ) Summary: I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions. It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`). The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have * Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain. * Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after is it marked for (de)construction). And after the purge of the chain, the entry is marked empty. * Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain. This is easily fixed by holding on to a reference until inserted onto the chain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046 Test Plan: As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed. Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000) Reviewed By: jowlyzhang Differential Revision: D51016979 Pulled By: pdillinger fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b	2023-11-06 16:06:01 -08:00
Hui Xiao	0f141352d8	Fix race between flush error recovery and db destruction (#12002 ) Summary: Context: DB destruction will wait for ongoing error recovery through `EndAutoRecovery()` and join the recovery thread: `519f2a41fb/db/db_impl/db_impl.cc (L525)` -> `519f2a41fb/db/error_handler.cc (L250)` -> `519f2a41fb/db/error_handler.cc (L808-L823)` However, due to a race between flush error recovery and db destruction, recovery can actually start after such wait during the db shutdown. The consequence is that the recovery thread created as part of this recovery will not be properly joined upon its destruction as part the db destruction. It then crashes the program as below. ``` std::terminate() std::default_delete<std::thread>::operator()(std::thread) const std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr() rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31) rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725) rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725) rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678) ``` Summary:* This PR fixed it by considering whether EndAutoRecovery() has been called before creating such thread. This fix is similar to how we currently [handle](`519f2a41fb/db/error_handler.cc (L688-L694)`) such case inside the created recovery thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12002 Test Plan: A new UT repro-ed the crash before this fix and and pass after. Reviewed By: ajkr Differential Revision: D50586191 Pulled By: hx235 fbshipit-source-id: b372f6d7a94eadee4b9283b826cc5fb81779a093	2023-10-25 11:59:09 -07:00
Hui Xiao	ab15d33566	Update history, version and format testing for 8.8 (#12004 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12004 Reviewed By: cbi42 Differential Revision: D50586984 Pulled By: hx235 fbshipit-source-id: 1480a8c2757340ebf83510557104aaa0e437b3ae	2023-10-24 12:03:07 -07:00
rogertyang	543191f2ea	Add bounds checking to WBWIIteratorImpl and respect bounds of ReadOptions in Transaction (#11680 ) Summary: Fix https://github.com/facebook/rocksdb/issues/11607 Fix https://github.com/facebook/rocksdb/issues/11679 Fix https://github.com/facebook/rocksdb/issues/11606 Fix https://github.com/facebook/rocksdb/issues/2343 Add bounds checking to `WBWIIteratorImpl`, which will be reflected in `BaseDeltaIterator::delta_iterator_::Valid()`, just like `BaseDeltaIterator::base_iterator_::Valid()`. In this way, the two sub itertors become more aligned from `BaseDeltaIterator`'s perspective. Like `DBIter`, the added bounds checking caps in either bound when seeking and disvalidates the `WBWIIteratorImpl` iterator when the lower bound is past or the upper bound is reached. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11680 Test Plan: - A simple test added to write_batch_with_index_test.cc to exercise the bounds checking in `WBWIIteratorImpl`. - A sophisticated test added to transaction_test.cc to assert that `Transaction` with different write policies honor bounds in `ReadOptions`. It should be so as long as the `BaseDeltaIterator` is correctly coordinating the two sub iterators to perform iterating and bounds checking. Reviewed By: ajkr Differential Revision: D48125229 Pulled By: cbi42 fbshipit-source-id: c9acea52595aed1471a63d7ca6ef15d2a2af1367	2023-10-20 13:28:28 -07:00
akankshamahajan	9135a61ec6	Fix corruption error in stress test for auto_readahead_size enabled (#11961 ) Summary: Fix corruption error - "Corruption: first key in index doesn't match first key in block". when auto_readahead_size is enabled. Error is because of bug when index_iter_ moves forward, first_internal_key of that index_iter_ is not copied. So the Slice points to a different key resulting in wrong comparison when doing comparison. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11961 Test Plan: Ran stress test which reproduced this error. Reviewed By: anand1976 Differential Revision: D50310589 Pulled By: akankshamahajan15 fbshipit-source-id: 95d8320b8388f1e3822c32024f84754f3a20a631	2023-10-17 12:21:08 -07:00
anand76	90e160733e	Fix runtime error in UpdateTieredCache due to integer underflow (#11949 ) Summary: With the introduction of the `UpdateTieredCache` API, its possible to dynamically change the compressed secondary cache ratio of the total cache capacity. In order to optimize performance, we avoid using a mutex when inserting/releasing placeholder entries, which can result in some inaccuracy in the accounting during the dynamic update. This inaccuracy was causing a runtime error due to an integer underflow in `UpdateCacheReservationRatio`, causing ubsan crash tests to fail. This PR fixes it by explicitly checking for the underflow. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11949 Test Plan: 1. Added a unit test that fails without the fix 2. Run ubsan_crash Reviewed By: akankshamahajan15 Differential Revision: D50240217 Pulled By: anand1976 fbshipit-source-id: d2f7b79da54eec8b61aec2cc1f2943da5d5847ac	2023-10-12 15:09:40 -07:00
Peter Dillinger	d010b02e86	Fix race in options taking effect (#11929 ) Summary: In follow-up to https://github.com/facebook/rocksdb/issues/11922, fix a race in functions like CreateColumnFamily and SetDBOptions where the DB reports one option setting but a different one is left in effect. To fix, we can add an extra mutex around these rare operations. We don't want to hold the DB mutex during I/O or other slow things because of the many purposes it serves, but a mutex more limited to these cases should be fine. I believe this would fix a write-write race in https://github.com/facebook/rocksdb/issues/10079 but not the read-write race. Intended follow-up to this: * Should be able to remove write thread synchronization from DBImpl::WriteOptionsFile Pull Request resolved: https://github.com/facebook/rocksdb/pull/11929 Test Plan: Added two mini-stress style regression tests that fail with >1% probability before this change: DBOptionsTest::SetStatsDumpPeriodSecRace ColumnFamilyTest::CreateAndDropPeriodicRace I haven't reproduced such an inconsistency between in-memory options and on disk latest options, but this change at least improves safety and adds a test anyway: DBOptionsTest::SetStatsDumpPeriodSecRace Reviewed By: ajkr Differential Revision: D50024506 Pulled By: pdillinger fbshipit-source-id: 1e99a9ed4d96fdcf3ac5061ec6b3cee78aecdda4	2023-10-12 10:05:23 -07:00
Levi Tamasi	b00fa5597e	Fix the handling of wide-column base values in the max_successive_merges logic (#11913 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11913 The `max_successive_merges` logic currently does not handle wide-column base values correctly, since it uses the `Get` API, which only returns the value of the default column. The patch fixes this by switching to `GetEntity` and passing all columns (if applicable) to the merge operator. Reviewed By: jaykorean Differential Revision: D49795097 fbshipit-source-id: 75eb7cc9476226255062cdb3d43ab6bd1cc2faa3	2023-10-02 16:25:25 -07:00

1 2

77 commits