rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-12-04 11:03:22 +00:00

Author	SHA1	Message	Date
Peter Dillinger	590e248865	Special patch release 9.3.102	2024-11-01 08:20:18 -07:00
Peter Dillinger	11b3f09eac	Add a temporary hook for custom yielding in long-running op (#13103 ) Summary: This is a simplified version of https://github.com/facebook/rocksdb/issues/13096, which called for a way to hook into long-running loops completely within RocksDB to change their thread priority (or similar). The current prime hook point is `DBIter::FindNextUserEntryInternal` likely because of iterating over tombstones. This is implemented using the weak symbol hack for ease of back-porting/patching, and while we get to know potential future requirements better for integration into the public API. (Consider potential relationships to `Env::GetThreadStatusUpdater()` and `TransactionDBMutexFactory`.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/13103 Test Plan: Performance validated with db_bench and DEBUG_LEVEL=0: `./db_bench --benchmarks=fillseq,deleterandom,readseq[-X100] --value_size=1 --num=1000000` No consistent difference seen; variances likely in how DB / executable / memory were laid out. ``` With an empty hook: readseq [AVG 100 runs] : 1753018 (± 8850) ops/sec; 28.4 (± 0.1) MB/sec readseq [MEDIAN 100 runs] : 1763746 ops/sec; 28.6 MB/sec (recompile) readseq [AVG 100 runs] : 1789019 (± 10260) ops/sec; 29.0 (± 0.2) MB/sec readseq [MEDIAN 100 runs] : 1801849 ops/sec; 29.2 MB/sec Base: readseq [AVG 100 runs] : 1772196 (± 8240) ops/sec; 28.7 (± 0.1) MB/sec readseq [MEDIAN 100 runs] : 1780453 ops/sec; 28.9 MB/sec (recompile) readseq [AVG 100 runs] : 1777637 (± 7613) ops/sec; 28.8 (± 0.1) MB/sec readseq [MEDIAN 100 runs] : 1786657 ops/sec; 29.0 MB/sec With a functional hook (count number of calls into it): readseq [AVG 100 runs] : 1796733 (± 8854) ops/sec; 29.1 (± 0.1) MB/sec readseq [MEDIAN 100 runs] : 1804690 ops/sec; 29.3 MB/sec RocksDbThreadYield: 126915800 (recompile) readseq [AVG 100 runs] : 1775371 (± 10529) ops/sec; 28.8 (± 0.2) MB/sec readseq [MEDIAN 100 runs] : 1789046 ops/sec; 29.0 MB/sec RocksDbThreadYield: 126977000 Base: readseq [AVG 100 runs] : 1773071 (± 10657) ops/sec; 28.7 (± 0.2) MB/sec readseq [MEDIAN 100 runs] : 1783414 ops/sec; 28.9 MB/sec (recompile) readseq [AVG 100 runs] : 1750852 (± 10184) ops/sec; 28.4 (± 0.2) MB/sec readseq [MEDIAN 100 runs] : 1763587 ops/sec; 28.6 MB/sec ``` Reviewed By: george-reynya Differential Revision: D65235379 Pulled By: pdillinger fbshipit-source-id: 7829e4cc25a56d4c1801b8adf9c7f7aa49ab7aca	2024-11-01 08:11:34 -07:00
anand76	01393464ea	Update HISTORY for 9.3.2	2024-08-05 10:40:16 -07:00
anand76	97a152b841	Make transaction name conflict check more robust (#12895 ) Summary: The `PessimisticTransaction::SetName()` code checks for an existing txn of the given name before registering the new txn. However, this is not atomic, which could result in a race condition if two txns try to register with the same name. Both might succeed and lead to unpredictable behavior. This PR makes the test and set atomic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12895 Reviewed By: pdillinger Differential Revision: D60460482 Pulled By: anand1976 fbshipit-source-id: e8afeb2356e1b8f4e8df785cb73532739f82579d	2024-08-05 10:37:59 -07:00
Yu Zhang	537ccd2fbe	update HISTORY.md and version.h for 9.3.2	2024-08-05 09:25:52 -07:00
Yu Zhang	86e421fba9	Fix file deletions in DestroyDB not rate limited (#12891 ) Summary: Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`. It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it. This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891 Test Plan: Added unit tests, existing tests. Reviewed By: anand1976 Differential Revision: D60300220 Pulled By: jowlyzhang fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7	2024-08-05 09:24:20 -07:00
Andrew Kryczka	c5201abc4a	update HISTORY.md and version.h for 9.3.1	2024-06-04 11:43:02 -07:00
Fan Zhang(DevX)	4eda1b79fa	add export_file to rockdb TARGETS generator and re-gen Summary: we are converting the implicit loads to explicit loads, then remove the hidden loads in fbcode macroes. details see https://fb.workplace.com/groups/devx.build.bffs/permalink/7481848805183560/ Reviewed By: JakobDegen Differential Revision: D57800976 fbshipit-source-id: a893aa2aa9237704ba9eb998cba210222c95dd2f	2024-06-04 11:42:26 -07:00
Andrew Kryczka	030f7c4a7b	include bug fixes in 9.3.0 from after the branch cut until now	2024-06-02 16:52:07 -07:00
anand76	6c60c8f3f3	Fix stale memory access with FSBuffer and tiered sec cache (#12712 ) Summary: A `BlockBasedTable` with `TieredSecondaryCache` containing a NVM cache inserts blocks into the compressed cache and the corresponding compressed block into the NVM cache. The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e returning a FS allocated buffer rather than caller provided), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than local variable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712 Test Plan: 1. Add a unit test 2. Release validation stress test Reviewed By: jaykorean Differential Revision: D57974026 Pulled By: anand1976 fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd	2024-06-02 16:50:41 -07:00
Andrew Kryczka	2660754dbc	Fix recycled WAL detection when wal_compression is enabled (#12643 ) Summary: I think the point of the `if (end_of_buffer_offset_ - buffer_.size() == 0)` was to only set `recycled_` when the first record was read. However, the condition was false when reading the first record when the WAL began with a `kSetCompressionType` record because we had already dropped the `kSetCompressionType` record from `buffer_`. To fix this, I used `first_record_read_` instead. Also, it was pretty confusing to treat the WAL as non-recycled when a recyclable record first appeared in a non-first record. I changed it to return an error if that happens. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12643 Reviewed By: hx235 Differential Revision: D57238099 Pulled By: ajkr fbshipit-source-id: e20a2a0c9cf0c9510a7b6af463650a05d559239e	2024-06-02 16:50:29 -07:00
Peter Dillinger	94a867c845	Disallow memtable flush and sst ingest while WAL is locked (#12652 ) Summary: We recently noticed that some memtable flushed and file ingestions could proceed during LockWAL, in violation of its stated contract. (Note: we aren't 100% sure its actually needed by MySQL, but we want it to be in a clean state nonetheless.) Despite earlier skepticism that this could be done safely (https://github.com/facebook/rocksdb/issues/12666), I found a place to wait to wait for LockWAL to be cleared before allowing these operations to proceed: WaitForPendingWrites() Pull Request resolved: https://github.com/facebook/rocksdb/pull/12652 Test Plan: Added to unit tests. Extended how db_stress validates LockWAL and re-enabled combination of ingestion and LockWAL in crash test, in follow-up to https://github.com/facebook/rocksdb/issues/12642 Ran blackbox_crash_test for a long while with relevant features amplified. Suggested follow-up: fix FaultInjectionTestFS to report file sizes consistent with what the user has requested to be flushed. Reviewed By: jowlyzhang Differential Revision: D57622142 Pulled By: pdillinger fbshipit-source-id: aef265fce69465618974b4ec47f4636257c676ce	2024-06-02 16:50:15 -07:00
Hui Xiao	1b5474c8e1	Fix unreleased bug fix .md name (#12672 ) Summary: Context/Summary: as above Pull Request resolved: https://github.com/facebook/rocksdb/pull/12672 Test Plan: no code change Reviewed By: ajkr Differential Revision: D57505136 Pulled By: hx235 fbshipit-source-id: 0e216dc5974e9be10027b444eb6b4034f679dfd8	2024-06-02 16:49:53 -07:00
Andrew Kryczka	b135fa7a8b	9.3.0 release notes	2024-06-02 16:21:22 -07:00
anand76	0ed93552f4	Implement obsolete file deletion (GC) in follower (#12657 ) Summary: This PR implements deletion of obsolete files in a follower RocksDB instance. The follower tails the leader's MANIFEST and creates links to newly added SST files. These links need to be deleted once those files become obsolete in order to reclaim space. There are three cases to be considered - 1. New files added and links created, but the Version could not be installed due to some missing files. Those links need to be preserved so a subsequent catch up attempt can succeed. We insert the next file number in the `VersionSet` to `pending_outputs_` to prevent their deletion. 2. Files deleted from the previous successfully installed `Version`. These are deleted as usual in `PurgeObsoleteFiles`. 3. New files added by a `VersionEdit` and deleted by a subsequent `VersionEdit`, both processed in the same catchup attempt. Links will be created for the new files when verifying a candidate `Version`. Those need to be deleted explicitly as they're never added to `VersionStorageInfo`, and thus not deleted by `PurgeObsoleteFiles`. Test plan - New unit tests in `db_follower_test`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12657 Reviewed By: jowlyzhang Differential Revision: D57462697 Pulled By: anand1976 fbshipit-source-id: 898f15570638dd4930f839ffd31c560f9cb73916	2024-05-17 19:13:33 -07:00
Changyu Bi	ffd7930312	Add more debug print to `DBTestWithParam.ThreadStatusSingleCompaction` (#12661 ) Summary: This test is flaky and a recent failure prints the following: ``` [ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0 thread id: 1842811, thread status: thread id: 1842803, thread status: db/db_test.cc:4697: Failure Expected equality of these values: op_count Which is: 0 expected_count Which is: 1 [ FAILED ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (307 ms) ``` Empty thread status implies that operation_type of the threads are all OP_UNKNOWN. From `3ed46e0668/monitoring/thread_status_updater.cc (L197)`, this can be due to thread_data->operation_type being OP_UNKNOWN or that thread_data->cf_key it not in `cf_info_map_`, potentially due to how cf_key_ is accessed with relaxed memory order. This PR adds some debug print to print the cf_name to check this. This PR also prints num_running_compaction and lsm state to check if a compaction is indeed running, and removes some not needed options and ensures that exactly 4 L0 files are created. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12661 Test Plan: - Cannot repro the failure locally: `gtest-parallel --repeat=10000 --workers=200 ./db_test --gtest_filter="ThreadStatusSingleCompaction"` - New failure message will look like: ``` [ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0 op_count: 1, expected_count 2 thread id: 6104100864, thread status: , cf_name thread id: 6103527424, thread status: Compaction, cf_name default running compaction: 1 lsm state: 4 db/db_test.cc:4885: Failure Value of: match Actual: false Expected: true [ FAILED ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (115 ms) ``` Reviewed By: hx235 Differential Revision: D57422755 Pulled By: cbi42 fbshipit-source-id: 635663f26052b20e485dfa06a7c0f1f318ac1099	2024-05-16 17:23:56 -07:00
Peter Dillinger	131c8ccfcd	Add EntryType for TimedPut (#12669 ) Summary: Represent internal kTypeValuePreferredSeqno in the public API as kEntryTimedPut (because it is created by TimedPut, until the entry can be safely converted to a regular value entry in compaction) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12669 Test Plan: for follow-up work actually using it. But putting this in place in the public API gives us more flexibility in rolling out that follow-up work (e.g. as a user extension or patch if needed). Reviewed By: jowlyzhang Differential Revision: D57459637 Pulled By: pdillinger fbshipit-source-id: 160ccf7c4e524ee479558846b2a207d51b8b3d9c	2024-05-16 15:18:12 -07:00
Changyu Bi	2eb404de13	Print non-ok status for multi_ops_txns_stress test (#12660 ) Summary: Currently `assert(s.ok())` does not offer much information to debug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12660 Test Plan: revert https://github.com/facebook/rocksdb/issues/12639 and run `python3 ./tools/db_crashtest.py blackbox --write_policy write_prepared --recycle_log_file_num=1 --threads=1 --test_multiops_txn` before this PR: ``` Exit Before Killing stdout: stderr: db_stress: db_stress_tool/multi_ops_txns_stress.cc:1529: void rocksdb::MultiOpsTxnsStressTest::PreloadDb(rocksdb::SharedState, int, uint32_t, uint32_t, uint32_t, uint32_t): Assertion `s.ok()' failed. ``` after this PR: ``` Exit Before Killing stdout: stderr: Verification failed: PreloadDB failed: Invalid argument: WriteOptions::disableWAL option is not supported if DBOptions::recycle_log_file_num > 0 db_stress: db_stress_tool/db_stress_test_base.cc:517: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState, std::string, rocksdb::Status, bool) const: Assertion `false' failed. Received signal 6 (Aborted) ``` Reviewed By: ajkr Differential Revision: D57410819 Pulled By: cbi42 fbshipit-source-id: a03c2202c3fd666eb2f58bae24e0c9e3e6ed4265	2024-05-15 21:14:41 -07:00
Andrew Kryczka	4eaf628120	Add `Iterator` property "rocksdb.iterator.is-value-pinned" (#12659 ) Summary: `ReadOptions::pin_data` already has the effect of pinning the `Slice` returned by `Iterator::value()` when the value is stored inline (e.g., `kTypeValue`). This PR adds a bit of visibility into that via a new `Iterator` property, "rocksdb.iterator.is-value-pinned", as well as some documentation and tests. See also: https://github.com/facebook/rocksdb/issues/12658 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12659 Reviewed By: cbi42 Differential Revision: D57391200 Pulled By: ajkr fbshipit-source-id: 0caa8db27ca1aba86ee2addc3dfd6f0e003d32e2	2024-05-15 19:11:52 -07:00
Peter Dillinger	3ed46e0668	Handle early exit in DBErrorHandlingFSTests (#12655 ) Summary: To avoid use-after-free on custom env on ASSERT_WHATEVER failure. This is motivated by a rare crash seen in DBErrorHandlingFSTest.WALWriteError (VersionSet::GetObsoleteFiles in a SstFileManagerImpl::ClearError thread) and wanting to rule out this being related to that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12655 Test Plan: manually seeing ASSERT_WHATEVER failures, especially under ASAN Reviewed By: cbi42 Differential Revision: D57358202 Pulled By: pdillinger fbshipit-source-id: 4da2a0d73a54380b257e5cc1ab6c666e26b83973	2024-05-14 16:44:32 -07:00
Jay Huh	b4c6956a59	Add MultiGetEntity AttributeGroup API to stress test (#12640 ) Summary: Continuing from https://github.com/facebook/rocksdb/pull/12605, adding AttributeGroup `MultiGetEntity` API to stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12640 Test Plan: AttributeGroup Tests NonBatchOps ``` python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 --use_multi_get=1 ``` BatchOps ``` python3 tools/db_crashtest.py blackbox --test_batches_snapshots=1 --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 --use_multi_get=1 ``` CfConsistency Test ``` python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 --use_multi_get=1 ``` Non-AttributeGroup Tests NonBatchOps ``` python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 ``` BatchOps ``` python3 tools/db_crashtest.py blackbox --test_batches_snapshots=1 --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 ``` CfConsistency Test ``` python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 ``` Reviewed By: ltamasi Differential Revision: D57233931 Pulled By: jaykorean fbshipit-source-id: 8cea771ac2e5749050bf5319360c6c5aa476d7d5	2024-05-14 16:33:44 -07:00
Andrii Lysenko	b9fc13db69	Add padding before timestamp size record if it doesn't fit into a WAL block. (#12614 ) Summary: If timestamp size record doesn't fit into a block, without padding `Writer::EmitPhysicalRecord` fails on assert (either `assert(block_offset_ + kHeaderSize + n <= kBlockSize);` or `assert(block_offset_ + kRecyclableHeaderSize + n <= kBlockSize)`, depending on whether recycling log files is enabled) in debug build. In release, current block grows beyond 32K, `block_offset_` gets reset on next `AddRecord` and all the subsequent blocks are no longer aligned by block size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12614 Reviewed By: ltamasi Differential Revision: D57302140 Pulled By: jowlyzhang fbshipit-source-id: cacb5cefb7586885e52a8137ae23a95e1aefca2d	2024-05-14 15:54:02 -07:00
Yu Zhang	1567d50a27	Temporarily disable file ingestion if LockWAL is tested (#12642 ) Summary: As titled. A proper fix should probably be failing file ingestion if the DB is in a lock wal state as it promises to "Freezes the logical state of the DB". Pull Request resolved: https://github.com/facebook/rocksdb/pull/12642 Reviewed By: pdillinger Differential Revision: D57235869 Pulled By: jowlyzhang fbshipit-source-id: c70031463842220f865621eb6f53424df27d29e9	2024-05-14 09:27:48 -07:00
Yu Zhang	c110091d36	Support read timestamp in ldb (#12641 ) Summary: As titled. Also updated sst_dump to print out user-defined timestamp separately and in human readable format. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12641 Test Plan: manually tested: Example success run: ./ldb --db=$TEST_DB multi_get_entity 0x00000000000000950000000000000120 --key_hex --column_family=7 --read_timestamp=1115613683797942 --value_hex 0x00000000000000950000000000000120 ==> :0x0E0000000A0B080906070405020300011E1F1C1D1A1B181916171415121310112E2F2C2D2A2B282926272425222320213E3F3C3D3A3B383936373435323330314E4F4C4D4A4B484946474445424340415E5F5C5D5A5B58595657545552535051 Example failed run: Failed: GetEntity failed: Invalid argument: column family enables user-defined timestamp while --read_timestamp is not a valid uint64 value. sst_dump print out: '000000000000015D000000000000012B000000000000013B\|timestamp:1113554493256671' seq:2330405, type:1 => 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E Reviewed By: ltamasi Differential Revision: D57297006 Pulled By: jowlyzhang fbshipit-source-id: 8486d91468e4f6c0d42dca3c9629f1c45a92bf5a	2024-05-13 15:43:12 -07:00
Hui Xiao	20213d01a3	Fix crash in CompactFiles() of conflict range under `preclude_last_level_data_seconds > 0` (#12628 ) Summary: Context/Summary: Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflict when sanitizing input files while later used `FilesRangeOverlapWithCompaction()` to assert for no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e, compaction can output to second-to-the-last level). So we ran into assertion violation in `CompactFiles()` like below ``` Assertion `output_level == 0 \|\| !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed. ``` This PR make `CompactFiles()` used `FilesRangeOverlapWithCompaction()` and return Aborted status upon range conflict instead of crashing (during debug build) or proceed incorrectly (during non-debug build). To do so cleanly, I included a refactoring to make `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628 Test Plan: New UT crashed before the fix and return correct status after the fix. Reviewed By: cbi42 Differential Revision: D57123536 Pulled By: hx235 fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430	2024-05-13 13:12:06 -07:00
Peter Dillinger	7747abdc15	Disable PromoteL0 in crash test (#12651 ) Summary: Seeing way too many errors likely related to PromoteL0 from https://github.com/facebook/rocksdb/issues/12617, containing ``` Cannot delete table file #N from level 0 since it is on level X ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12651 Test Plan: watch crash test results Reviewed By: hx235 Differential Revision: D57286208 Pulled By: pdillinger fbshipit-source-id: f7f0560cc0804ca297373c8d20ebc34986cc19d0	2024-05-13 12:42:01 -07:00
Peter Dillinger	b75438f986	Allow disableWAL+recycle with WritePreparedTxnDB internals (#12639 ) Summary: Follow-up from https://github.com/facebook/rocksdb/issues/12403 The crash test was periodically failing with the "disableWAL option is not supported if recycle_log_file_num > 0" failure, despite not setting the disableWAL from the user side. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12639 Test Plan: db_stress reproducer now passes. Added WAL recycling to txn DB unit tests, which is generally more difficult for correctness. Many tests now cover this change and pass. Reviewed By: anand1976 Differential Revision: D57227617 Pulled By: pdillinger fbshipit-source-id: db9abefeb505bce624b45bc64009694d2a5baed9	2024-05-10 17:56:40 -07:00
Yu Zhang	7d9642d876	Add logging for read timestamp during VerifyDB (#12638 ) Summary: As titled. To help debug some verification failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12638 Test Plan: manually tested Reviewed By: ajkr Differential Revision: D57219549 Pulled By: jowlyzhang fbshipit-source-id: 59c05ac85fb1c24449e7394ea04172c855d86420	2024-05-10 12:34:53 -07:00
Levi Tamasi	b92d874c8b	Support MultiGetEntity in optimistic and WriteCommitted pessimistic transactions (#12634 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12634 The patch implements support for the `MultiGetEntity` API in optimistic transactions and pessimistic transactions with the WriteCommitted policy. Similarly to the other wide-column transaction APIs, the implementation leverages the `WriteBatchWithIndex` layer. Reviewed By: jaykorean Differential Revision: D57177638 fbshipit-source-id: 2d9f9f287fc97e7c126830b48d21457c7c35db3f	2024-05-09 16:49:38 -07:00
Jay Huh	1f2715d1d2	AttributeGroup APIs in stress test - PutEntity and GetEntity (#12605 ) Summary: Adding AttributeGroup APIs in stress test. This contains the following changes only. More PRs to follow. - Introduce `use_attribute_group` flag - AttributeGroup `PutEntity()` and `GetEntity()` are now used per `use_attribute_group` flag in BatchOps, NonBatchOps and CfConsistency tests In the next PRs I plan to add - AttributeGroup `MultiGetEntity()` in Stress Test - AttributeGroupIterator in Stress Test (along with CoalescingIterator) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12605 Test Plan: NonBatchOps ``` python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 ``` BatchOps ``` python3 tools/db_crashtest.py blackbox --test_batches_snapshots=1 --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 ``` CfConsistency Test ``` python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 ``` Reviewed By: ltamasi Differential Revision: D56916768 Pulled By: jaykorean fbshipit-source-id: 8555d9e0d05927740a10e4e8301e44beec59a6f5	2024-05-09 16:40:22 -07:00
Hui Xiao	9bddac0dcf	Add more public APIs to crash test (#12617 ) Summary: Context/Summary: As titled. Bonus: found that PromoteL0 called with other concurrent PromoteL0 will return non-okay error so clarify the API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12617 Test Plan: CI Reviewed By: ajkr Differential Revision: D56954428 Pulled By: hx235 fbshipit-source-id: 0e056153c515003fd241ffec59b0d8a27529db4c	2024-05-09 15:37:38 -07:00
Levi Tamasi	97e70906fa	Improve the sanity checks in (Multi)GetEntity and friends (#12630 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12630 The patch cleans up, improves, and brings into sync (to the extent possible without API signature changes) the sanity checks around the `GetEntity` / `MultiGetEntity` family of APIs, including the read-your-own-writes (`WriteBatchWithIndex`) and transaction layers. The checks are centralized in two main sets of entry points, namely in `DB(Impl)` and the "main" `GetEntityFromBatchAndDB` / `MultiGetEntityFromBatchAndDB` overloads in `WriteBatchWithIndex`. This eliminates the need to duplicate the checks in the transaction classes. Reviewed By: jaykorean Differential Revision: D57125741 fbshipit-source-id: 4dd059ef644a9b173fbba767538943397e4cc6cd	2024-05-09 12:25:19 -07:00
Wei Liu	1a3357648f	Error log update to db_impl_compaction_flush.cc (#12608 ) Summary: Make the error looks better. Some inconsistency here and [here](`e46ab9d4f0/db/db_impl/db_impl_compaction_flush.cc (L2701-L2702)`) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12608 Reviewed By: ajkr Differential Revision: D57134933 Pulled By: cbi42 fbshipit-source-id: 2f19f077f388d196652a4e3afd2526f18bf75b2d	2024-05-09 11:36:24 -07:00
Yu Zhang	9dc171e3bb	Fix issue that cause false alarm corruption report (#12626 ) Summary: The state of `saved_seq_for_penul_check_` is not correctly maintained with the current flow. It's supposed to store the original sequence number for a `kTypeValuePreferredSeqno` entry for use in the `DecideOutputLevel` function. However, it's not always properly cleared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12626 Test Plan: Added unit test that would fail before the fix ./tiered_compaction_test --gtest_filter="InterleavedTimedPutAndPut" Reviewed By: pdillinger Differential Revision: D57123469 Pulled By: jowlyzhang fbshipit-source-id: 8d73214b3b6dc152daf19b6bd6ee9063581dc277	2024-05-08 15:51:38 -07:00
Peter Dillinger	eeda54fe63	Fix db_crashtest.py for prefixpercent and enable_compaction_filter (#12627 ) Summary: After https://github.com/facebook/rocksdb/issues/12624 seeing db_stress failures due to db_crashtest.py calling it with --prefixpercent=5 --enable_compaction_filter=1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12627 Test Plan: watch crash test Reviewed By: ajkr Differential Revision: D57121592 Pulled By: pdillinger fbshipit-source-id: 55727355a7662e67efcd22d7e353153e78e24f59	2024-05-08 13:16:44 -07:00
Andrew Kryczka	933ac0e05c	Fix locking for `ColumnFamilyOptions::inplace_update_support` (#12624 ) Summary: In `SaveValue()`, the read lock needs to be obtained before `VerifyEntryChecksum()` because the KV checksum verification reads the entire value metadata+data, which is all mutable when `ColumnFamilyOptions::inplace_update_support == true`. In `MemTable::Update()`, the write lock needs to be obtained before mutating the value metadata (changing the value size) because it can be read concurrently. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12624 Test Plan: ``` $ make COMPILE_WITH_TSAN=1 -j56 db_stress ... $ python3 tools/db_crashtest.py blackbox --simple --max_key=10 --inplace_update_support=1 --interval=10 --allow_concurrent_memtable_write=0 ``` Reviewed By: cbi42 Differential Revision: D57034571 Pulled By: ajkr fbshipit-source-id: 3dddf881ad87923143acdf6bfec12ce47bb13a48	2024-05-08 08:30:12 -07:00
Hans Holmberg	b8400c9faf	Make linux file write life time hinting work (#12595 ) Summary: The life time hint fcntl takes a 64-bit unsigned int, so make sure to pass a uint64_t when doing the syscall. See: https://man7.org/linux/man-pages/man2/fcntl.2.html https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c75b1d9421f80f4143e389d2d50ddfc8a28c8c35 This is one of those "How did this ever work?", as Env::WriteLifeTimeHint hint is definitely not the same as an 64-bit unsigned int. What's surprising is that SetWriteLifeTimeHint does pass a valid hint from time to time. Thanks, Hans Pull Request resolved: https://github.com/facebook/rocksdb/pull/12595 Reviewed By: cbi42 Differential Revision: D56901280 Pulled By: ajkr fbshipit-source-id: f276348863cbc29a537bed9450b16b0cc513ea78	2024-05-07 17:54:50 -07:00
Changyu Bi	5bf2c00a35	Clarify manual compaction and file ingestion behavior with FIFO compaction (#12618 ) Summary: For manual compaction, FIFO compaction will always skip key range overlapping checking with SST files. If CompactRange() is called with CompactionRangeOptions::change_level=true, a CF with FIFO compaction will now return Status::NotSupported. For file ingestion, we will always ingest into L0. Previously, it's possible to ingest files into non-L0 levels with FIFO compaction. These changes also help to fix [this](`a178d15baf/db/db_impl/db_impl_compaction_flush.cc (L1269)`) assertion failure in crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12618 Test Plan: added unit tests to verify the new behavior. Reviewed By: hx235 Differential Revision: D56962401 Pulled By: cbi42 fbshipit-source-id: 19812a1509650b4162b379ca5bee02f2e9d9569d	2024-05-07 12:00:15 -07:00
Levi Tamasi	83d051a8d9	Add release note for GetEntity transaction support (#12625 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12625 Reviewed By: jaykorean Differential Revision: D57059775 fbshipit-source-id: 80b3ddb51d538c6c21b69cd589f4ee8dd13596c9	2024-05-07 11:38:04 -07:00
Levi Tamasi	eaa3226ef7	Add support for GetEntity in optimistic and WriteCommitted pessimistic transactions (#12623 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12623 The PR adds support for the `GetEntity` API to optimistic and WriteCommitted pessimistic transactions. `MultiGetEntity` support and the `ForUpdate` variants of these read APIs will be implemented in subsequent PRs. Reviewed By: jaykorean Differential Revision: D57030879 fbshipit-source-id: 1f0aed6418782975fe537b6b3d437fad31fcbd43	2024-05-07 10:20:26 -07:00
Zaidoon Abd Al Hadi	36ab251c07	Expose block based metadata cache options via C API (#12611 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12611 Reviewed By: jaykorean Differential Revision: D56961823 Pulled By: ajkr fbshipit-source-id: aa062cdb49a0bb2c1148a81d4c882a4733c7790e	2024-05-06 16:49:11 -07:00
Levi Tamasi	45c290660a	Add PutEntity support for optimistic and WritePrepared pessimistic transactions (#12606 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12606 The patch extends optimistic transactions and WriteCommitted pessimistic transactions with support for the `PutEntity` API. Similarly to the other APIs, `PutEntity` is available via both the `Transaction` and `TransactionDB` interfaces, where using the latter executes the write in a single-operation transaction as usual. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added in separate PRs. Reviewed By: jaykorean Differential Revision: D56911242 fbshipit-source-id: 57cf8bb6c6b1b40ba4a8a652831c13a617644289	2024-05-06 14:41:00 -07:00
Andrew Kryczka	0fef690bd5	Sync non-latest WALs during flush for 2PC, single-CF DBs (#12622 ) Summary: Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs". This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622 Reviewed By: cbi42 Differential Revision: D56987303 Pulled By: ajkr fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48	2024-05-06 11:56:16 -07:00
Changyu Bi	6fdc4c5282	Fix a corruption bug in `CreateColumnFamilyWithImport()` (#12602 ) Summary: when importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires each original CF to have disjoint key range, after an intra-l0 compaction, we still can end up with L0 files with the same epoch number but overlapping key range. This PR attempt to fix this by reassigning epoch numbers when importing multiple CFs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602 Test Plan: a new repro unit test. Before this PR, it fails with ``` [ RUN ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF db/import_column_family_test.cc:1048: Failure db_->WaitForCompact(o) Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range https://github.com/facebook/rocksdb/issues/44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file https://github.com/facebook/rocksdb/issues/36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3 ``` Reviewed By: hx235 Differential Revision: D56851808 Pulled By: cbi42 fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84	2024-05-06 11:01:38 -07:00
Patrik Valo	3fdc7243f3	Fix truncating last character in the StderrLogger (#12620 ) Summary: This PR fixes a bug in the StderrLogger that truncated the last character in the logline. The problem was that we provided an incorrect max size parameter into the vsnprintf function. The size didn't take into account the null byte that the function automatically adds. Before fix ``` File Read Latency Histogram By Level [default] 2024/05/04-18:50:24.209304 4788 [/db_impl/db_impl.cc:498] Shutdown: canceling all background wor 2024/05/04-18:50:24.209598 4788 [/db_impl/db_impl.cc:692] Shutdown complet ``` After fix ``` File Read Latency Histogram By Level [default] 2024/05/04-18:51:19.814584 4d4d [/db_impl/db_impl.cc:498] Shutdown: canceling all background work 2024/05/04-18:51:19.815528 4d4d [/db_impl/db_impl.cc:692] Shutdown complete ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12620 Test Plan: tested on examples/simple_example.cc with StderrLogger Fixes: https://github.com/facebook/rocksdb/issues/12576 Reviewed By: jaykorean Differential Revision: D56972332 Pulled By: ajkr fbshipit-source-id: 70405e8231ae6e90d24fe0b351bc8e749176bd15	2024-05-06 08:53:06 -07:00
Andrew Kryczka	7bf6d4c9d5	Lazily construct `BlockBasedTableIterator::block_handles_` (#12616 ) Summary: Our external benchmark attributed a CPU regression to https://github.com/facebook/rocksdb/issues/11860. Based on the CPU profile the new overhead is from `std::deque`. The deque is always empty for these scans so we do not need to construct it. This PR lazily constructs it only when it is needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12616 Test Plan: - Command: `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,seekrandom[-X10] -compression_type=none -disable_auto_compactions=true -write_buffer_size=524288 -value_size=1024 -num=10000 -reads=100000` - Results - Before this PR: `seekrandom [AVG 10 runs] : 47811 (± 431) ops/sec` - After this PR: `seekrandom [AVG 10 runs] : 51013 (± 632) ops/sec` Reviewed By: jaykorean Differential Revision: D56954136 Pulled By: ajkr fbshipit-source-id: b4d34c9b6c6c2e83d4fff06deacb9f0df2ad042f	2024-05-03 17:18:13 -07:00
Peter Dillinger	a178d15baf	More checks around num_entries vs. num_deletions (#12600 ) Summary: We've seen an internal crash test+sanitizer failure seemingly caused by underflow on `current_num_non_deletions_` which would happen if num_entries < num_deletions. (T186407810) This change adds an additional check (fail earlier?) and coerces read table properties to satisfy the invariant that is supposed to be provided by https://github.com/facebook/rocksdb/pull/4841 but could be violated by older files, due to https://github.com/facebook/rocksdb/pull/4016. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12600 Test Plan: existing tests Reviewed By: ajkr Differential Revision: D56796191 Pulled By: pdillinger fbshipit-source-id: 6d22cc40eb74974c42b311293ee2775c6af95afc	2024-05-03 16:40:07 -07:00
Hui Xiao	b312dbe920	Remove duplicate inplace_update_support crash/stress test flag (#12610 ) Summary: Context/Summary: As titled. There were two flags serving the same purpose so removed one of them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12610 Test Plan: CI Reviewed By: jaykorean, ajkr Differential Revision: D56916119 Pulled By: hx235 fbshipit-source-id: 011140a7945782cc613ca86d4b542db0cf7fb444	2024-05-03 11:33:33 -07:00
Changyu Bi	174b8294c8	Do not change compaction style for tiered_storage crash tests (#12609 ) Summary: tiered_storage crash tests are intended to run with universal compaction only: `e2ef349f56/tools/db_crashtest.py (L562)` Update the test script to not change compaction style for tiered_storage crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12609 Test Plan: `python3 tools/db_crashtest.py whitebox --test_tiered_storage --ops_per_thread=10 --max_key=100000 --duration=300 --reopen=1` Reviewed By: ajkr Differential Revision: D56917965 Pulled By: cbi42 fbshipit-source-id: 5ab693d7153832bae884788ff1f1762b8ec8262d	2024-05-03 10:34:56 -07:00
anand76	58627eff2e	Fix build error due to virtual Iterator destructor (#12612 ) Summary: Fix build error due to virtual Iterator destructor not marked as override. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12612 Reviewed By: jaykorean Differential Revision: D56939155 Pulled By: anand1976 fbshipit-source-id: 2921d6facc296c69215de45151b08e279a1a98a2	2024-05-03 10:34:05 -07:00

1 2 3 4 5 ...

12698 commits