Commit graph

12704 commits

Author SHA1 Message Date
Yu Zhang c628788abd Add a TransactionOptions to enable tracking timestamp size info inside WriteBatch (#12864)
Summary:
In normal use cases, meta info like column family's timestamp size is tracked at the transaction layer, so it's not necessary and even detrimental to track such info inside the internal WriteBatch because it may let anti-patterns like bypassing Transaction write APIs and directly write to its internal WriteBatch like this:
9d0a754dc9/storage/rocksdb/ha_rocksdb.cc (L4949-L4950)
Setting this option to true will keep aforementioned use case continue to work before it's refactored out. This option is only for this purpose and it will be gradually deprecated after aforementioned MyRocks use case are refactored.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12864

Test Plan: Added unit tests

Reviewed By: cbi42

Differential Revision: D60194094

Pulled By: jowlyzhang

fbshipit-source-id: 64a98822167e99aa7e4fa2a60085d44a5deaa45c
2024-08-07 12:10:22 -07:00
Yu Zhang 56a54b99f2 Add an option to toggle timestamp based validation for the whole DB (#12857)
Summary:
As titled. This PR adds a `TransactionDBOptions` field `enable_udt_validation` to allow user to toggle the timestamp based validation behavior across the whole DB. When it is true, which is the default value and the existing behavior. A recap of what this behavior is: `GetForUpdate` does timestamp based conflict checking to make sure no other transaction has committed a version of the key tagged with a timestamp equal to or newer than the calling transaction's `read_timestamp_` the user set via `SetReadTimestampForValidation`. When this field is set to false, we disable timestamp based validation for the whole DB. MyRocks find it hard to find a read timestamp for this validation API, so we added this flexibility.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12857

Test Plan: Added unit test

Reviewed By: ltamasi

Differential Revision: D60194134

Pulled By: jowlyzhang

fbshipit-source-id: b8507f8ddc37fc7a2948cf492ce5c599ae646fef
2024-08-07 11:46:34 -07:00
Yu Zhang 116dd52be4 Fix bug for recovering a prepared but not committed txn (#12856)
Summary:
This PR fix a bug for recovering a prepared Transaction that can contain user-defined timestamps.

The `Transaction::Put` type of APIs expect the key provided to be user key without timestamps. When the original transaction added a key for a column family that enables user-defined timestamps, say of size 8. Internally `WriteBatch::Put` will leave a placeholder 8 bytes for the final commit timestamp. For example:
cec28aa90f/db/write_batch.cc (L937)

When rebuilding this transaction from a `WriteBatch` from WAL log, we should consider this and remove the tailing 8 bytes of a key before adding it via the public Transaction write APIs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12856

Test Plan: Added unit test that would fail without this fix

Reviewed By: cbi42

Differential Revision: D59656399

Pulled By: jowlyzhang

fbshipit-source-id: c716aefa4d548770b691efe96ac8e6d7dab458b9
2024-08-07 11:44:10 -07:00
Yu Zhang c8ff39be33 Fix manual flush hanging on waiting for no stall for UDT in memtable … (#12771)
Summary:
This PR fix a possible manual flush hanging scenario because of its expectation that others will clear out excessive memtables was not met. The root cause is the FlushRequest rescheduling logic is using a stricter criteria for what a write stall is about to happen means than `WaitUntilFlushWouldNotStallWrites` does. Currently, the former thinks a write stall is about to happen when the last memtable is half full, and it will instead reschedule queued FlushRequest and not actually proceed with the flush. While the latter thinks if we already start to use the last memtable, we should wait until some other background flush jobs clear out some memtables before proceed this manual flush.

If we make them use the same criteria, we can guarantee that at any time when`WaitUntilFlushWouldNotStallWrites` is waiting, it's not because the rescheduling logic is holding it back.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12771

Test Plan: Added unit test

Reviewed By: ajkr

Differential Revision: D58603746

Pulled By: jowlyzhang

fbshipit-source-id: 9fa1c87c0175d47a40f584dfb1b497baa576755b
2024-08-07 11:40:13 -07:00
Yu Zhang b01951ed91 Change the behavior of manual flush to not retain UDT (#12737)
Summary:
When user-defined timestamps in Memtable only feature is enabled, all scheduled flushes go through a check to see if it's eligible to be rescheduled to retain user-defined timestamps. However when the user makes a manual flush request, their intention is for all the in memory data to be persisted into SST files as soon as possible. These two sides have some conflict of interest, the user can implement some workaround like https://github.com/facebook/rocksdb/issues/12631 to explicitly mark which one takes precedence. The implementation for this can be nuanced since the user needs to be aware of all the scenarios that can trigger a manual flush and handle the concurrency well etc.

In this PR, we updated the default behavior to give manual flush precedence when it's requested. The user-defined timestamps rescheduling mechanism is turned off when a manual flush is requested. Likewise, all error recovery triggered flushes skips the rescheduling mechanism too.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12737

Test Plan: Add unit tests

Reviewed By: ajkr

Differential Revision: D58538246

Pulled By: jowlyzhang

fbshipit-source-id: 0b9b3d1af3e8d882f2d6a2406adda19324ba0694
2024-08-07 11:39:14 -07:00
Yu Zhang aa44f959a6 Add a OnManualFlushScheduled callback in event listener (#12631)
Summary:
As titled. Also added the newest user-defined timestamp into the `MemTableInfo`. This can be a useful info in the callback.

Added some unit tests as examples for how users can use two separate approaches to allow manual flush / manual compactions to go through when the user-defined timestamps in memtable only feature is enabled. One approach relies on selectively increase cutoff timestamp in `OnMemtableSeal` callback when it's initiated by a manual flush. Another approach is to increase cutoff timestamp in `OnManualFlushScheduled` callback. The caveats of the approaches are also documented in the unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12631

Reviewed By: ajkr

Differential Revision: D58260528

Pulled By: jowlyzhang

fbshipit-source-id: bf446d7140affdf124744095e0a179fa6e427532
2024-08-07 11:38:10 -07:00
Yu Zhang 4a42da5759 Use extended file boundary for key range overlap check during file ingestion (#12735)
Summary:
When https://github.com/facebook/rocksdb/issues/12343 added support to bulk load external files while column family enables user-defined timestamps, it's a requirement that the external file doesn't overlap with the DB in key ranges. More specifically, the external file should not contain a user key (without timestamp) that already have some entries in the DB.

All the `*Overlap*` functions like `RangeOverlapWithMemtable`, `RangeOverlapWithCompaction` are using `CompareWithoutTimestamp` to check for overlap  already. One thing that is missing here is we need to extend the external file's user key boundary for this check to avoid missing the checks for the boundary user keys. For example, with the current way of checking things where `external_file_info.smallest.user_key()` is used as the left boundary, and `external_file_info.largest.user_key()` is used as the right boundary, a file with this entry: (b, 40) can fit into a DB with these two entries: (b, 30), (c, 20).

To avoid this, we extend the user key boundaries used for overlap check, by updating the left boundary with the maximum timestamp and the right boundary with the minimum timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12735

Test Plan: Added unit test

Reviewed By: ltamasi

Differential Revision: D58152117

Pulled By: jowlyzhang

fbshipit-source-id: 9cba61e7357f6d76ad44c258381c35073ebbf347
2024-08-07 11:37:12 -07:00
Yu Zhang 0435a9fc31 Add timestamp support in dump_wal/dump/idump (#12690)
Summary:
As titled.  For dumping wal files, since a mapping from column family id to the user comparator object is needed to print the timestamp in human readable format, option `[--db=<db_path>]` is added to `dump_wal` command to allow the user to choose to optionally open the DB as read only instance and dump the wal file with better timestamp formatting.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12690

Test Plan:
Manually tested

dump_wal:
[dump a wal file specified with --walfile]
```
>> ./ldb --walfile=$TEST_DB/000004.log dump_wal  --print_value
>>1,1,28,13,PUT(0) : 0x666F6F0100000000000000 : 0x7631
(Column family id: [0] contained in WAL are not opened in DB. Applied default hex formatting for user key. Specify --db=<db_path> to open DB for better user key formatting if it contains timestamp.)
```

[dump with --db specified for better timestamp formatting]
```
>> ./ldb --walfile=$TEST_DB/000004.log dump_wal  --db=$TEST_DB --print_value
>> 1,1,28,13,PUT(0) : 0x666F6F|timestamp:1 : 0x7631
```

dump:
[dump a file specified with --path]
```
>>./ldb --path=/tmp/rocksdbtest-501/column_family_test_75359_17910784957761284041/000004.log dump
Sequence,Count,ByteSize,Physical Offset,Key(s) : value
1,1,28,13,PUT(0) : 0x666F6F0100000000000000 : 0x7631
(Column family id: [0] contained in WAL are not opened in DB. Applied default hex formatting for user key. Specify --db=<db_path> to open DB for better user key formatting if it contains timestamp.)
```

[dump db specified with --db]
```
>> ./ldb --db=/tmp/rocksdbtest-501/column_family_test_75359_17910784957761284041 dump
>> foo|timestamp:1 ==> v1
Keys in range: 1
```

idump
```
./ldb --db=$TEST_DB idump
'foo|timestamp:1' seq:1, type:1 => v1
Internal keys in range: 1
```

Reviewed By: ltamasi

Differential Revision: D57755382

Pulled By: jowlyzhang

fbshipit-source-id: a0a2ef80c92801cbf7bfccc64769c1191824362e
2024-08-07 11:33:00 -07:00
anand76 01393464ea Update HISTORY for 9.3.2 2024-08-05 10:40:16 -07:00
anand76 97a152b841 Make transaction name conflict check more robust (#12895)
Summary:
The `PessimisticTransaction::SetName()` code checks for an existing txn of the given name before registering the new txn. However, this is not atomic, which could result in a race condition if two txns try to register with the same name. Both might succeed and lead to unpredictable behavior. This PR makes the test and set atomic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12895

Reviewed By: pdillinger

Differential Revision: D60460482

Pulled By: anand1976

fbshipit-source-id: e8afeb2356e1b8f4e8df785cb73532739f82579d
2024-08-05 10:37:59 -07:00
Yu Zhang 537ccd2fbe update HISTORY.md and version.h for 9.3.2 2024-08-05 09:25:52 -07:00
Yu Zhang 86e421fba9 Fix file deletions in DestroyDB not rate limited (#12891)
Summary:
Make `DestroyDB` slowly delete files if it's configured and enabled via `SstFileManager`.

It's currently not available mainly because of DeleteScheduler's logic related to tracked total_size_ and total_trash_size_. These accounting and logic should not be applied to `DestroyDB`. This PR adds a `DeleteUnaccountedDBFile` util for this purpose which deletes files without accounting it.  This util also supports assigning a file to a specified trash bucket so that user can later wait for a specific trash bucket to be empty. For `DestroyDB`, files with more than 1 hard links will be deleted immediately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12891

Test Plan: Added unit tests, existing tests.

Reviewed By: anand1976

Differential Revision: D60300220

Pulled By: jowlyzhang

fbshipit-source-id: 8b18109a177a3a9532f6dc2e40e08310c08ca3c7
2024-08-05 09:24:20 -07:00
Andrew Kryczka c5201abc4a update HISTORY.md and version.h for 9.3.1 2024-06-04 11:43:02 -07:00
Fan Zhang(DevX) 4eda1b79fa add export_file to rockdb TARGETS generator and re-gen
Summary:
we are converting the implicit loads to explicit loads, then remove the hidden loads in fbcode macroes.
details see https://fb.workplace.com/groups/devx.build.bffs/permalink/7481848805183560/

Reviewed By: JakobDegen

Differential Revision: D57800976

fbshipit-source-id: a893aa2aa9237704ba9eb998cba210222c95dd2f
2024-06-04 11:42:26 -07:00
Andrew Kryczka 030f7c4a7b include bug fixes in 9.3.0 from after the branch cut until now 2024-06-02 16:52:07 -07:00
anand76 6c60c8f3f3 Fix stale memory access with FSBuffer and tiered sec cache (#12712)
Summary:
A `BlockBasedTable` with `TieredSecondaryCache` containing a NVM cache inserts blocks  into the compressed cache and the corresponding compressed block into the NVM cache.  The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e returning a FS allocated buffer rather than caller provided), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than local variable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712

Test Plan:
1. Add a unit test
2. Release validation stress test

Reviewed By: jaykorean

Differential Revision: D57974026

Pulled By: anand1976

fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd
2024-06-02 16:50:41 -07:00
Andrew Kryczka 2660754dbc Fix recycled WAL detection when wal_compression is enabled (#12643)
Summary:
I think the point of the `if (end_of_buffer_offset_ - buffer_.size() == 0)` was to only set `recycled_` when the first record was read. However, the condition was false when reading the first record when the WAL began with a  `kSetCompressionType` record because we had already dropped the `kSetCompressionType` record from `buffer_`. To fix this, I used `first_record_read_` instead.

Also, it was pretty confusing to treat the WAL as non-recycled when a recyclable record first appeared in a non-first record. I changed it to return an error if that happens.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12643

Reviewed By: hx235

Differential Revision: D57238099

Pulled By: ajkr

fbshipit-source-id: e20a2a0c9cf0c9510a7b6af463650a05d559239e
2024-06-02 16:50:29 -07:00
Peter Dillinger 94a867c845 Disallow memtable flush and sst ingest while WAL is locked (#12652)
Summary:
We recently noticed that some memtable flushed and file
ingestions could proceed during LockWAL, in violation of its stated
contract. (Note: we aren't 100% sure its actually needed by MySQL, but
we want it to be in a clean state nonetheless.)

Despite earlier skepticism that this could be done safely (https://github.com/facebook/rocksdb/issues/12666), I
found a place to wait to wait for LockWAL to be cleared before allowing
these operations to proceed: WaitForPendingWrites()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12652

Test Plan:
Added to unit tests. Extended how db_stress validates LockWAL
and re-enabled combination of ingestion and LockWAL in crash test, in
follow-up to https://github.com/facebook/rocksdb/issues/12642

Ran blackbox_crash_test for a long while with relevant features
amplified.

Suggested follow-up: fix FaultInjectionTestFS to report file sizes
consistent with what the user has requested to be flushed.

Reviewed By: jowlyzhang

Differential Revision: D57622142

Pulled By: pdillinger

fbshipit-source-id: aef265fce69465618974b4ec47f4636257c676ce
2024-06-02 16:50:15 -07:00
Hui Xiao 1b5474c8e1 Fix unreleased bug fix .md name (#12672)
Summary:
Context/Summary: as above

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12672

Test Plan: no code change

Reviewed By: ajkr

Differential Revision: D57505136

Pulled By: hx235

fbshipit-source-id: 0e216dc5974e9be10027b444eb6b4034f679dfd8
2024-06-02 16:49:53 -07:00
Andrew Kryczka b135fa7a8b 9.3.0 release notes 2024-06-02 16:21:22 -07:00
anand76 0ed93552f4 Implement obsolete file deletion (GC) in follower (#12657)
Summary:
This PR implements deletion of obsolete files in a follower RocksDB instance. The follower tails the leader's MANIFEST and creates links to newly added SST files. These links need to be deleted once those files become obsolete in order to reclaim space. There are three cases to be considered -
1. New files added and links created, but the Version could not be installed due to some missing files. Those links need to be preserved so a subsequent catch up attempt can succeed. We insert the next file number in the `VersionSet` to `pending_outputs_` to prevent their deletion.
2. Files deleted from the previous successfully installed `Version`. These are deleted as usual in `PurgeObsoleteFiles`.
3. New files added by a `VersionEdit` and deleted by a subsequent `VersionEdit`, both processed in the same catchup attempt. Links will be created for the new files when verifying a candidate `Version`. Those need to be deleted explicitly as they're never added to `VersionStorageInfo`, and thus not deleted by `PurgeObsoleteFiles`.

Test plan -
New unit tests in `db_follower_test`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12657

Reviewed By: jowlyzhang

Differential Revision: D57462697

Pulled By: anand1976

fbshipit-source-id: 898f15570638dd4930f839ffd31c560f9cb73916
2024-05-17 19:13:33 -07:00
Changyu Bi ffd7930312 Add more debug print to DBTestWithParam.ThreadStatusSingleCompaction (#12661)
Summary:
This test is flaky and a recent failure prints the following:
```
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 1842811, thread status:
thread id: 1842803, thread status:
db/db_test.cc:4697: Failure
Expected equality of these values:
  op_count
    Which is: 0
  expected_count
    Which is: 1
[  FAILED  ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (307 ms)
```
Empty thread status implies that operation_type of the threads are all OP_UNKNOWN. From 3ed46e0668/monitoring/thread_status_updater.cc (L197), this can be due to thread_data->operation_type being OP_UNKNOWN or that thread_data->cf_key it not in `cf_info_map_`, potentially due to how cf_key_ is accessed with relaxed memory order. This PR adds some debug print to print the cf_name to check this.

This PR also prints num_running_compaction and lsm state to check if a compaction is indeed running, and removes some not needed options and ensures that exactly 4 L0 files are created.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12661

Test Plan:
- Cannot repro the failure locally: `gtest-parallel --repeat=10000 --workers=200 ./db_test --gtest_filter="*ThreadStatusSingleCompaction*"`
- New failure message will look like:
```
[ RUN      ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
op_count: 1, expected_count 2
thread id: 6104100864, thread status: , cf_name
thread id: 6103527424, thread status: Compaction, cf_name default
running compaction: 1 lsm state: 4
db/db_test.cc:4885: Failure
Value of: match
  Actual: false
Expected: true
[  FAILED  ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (115 ms)
```

Reviewed By: hx235

Differential Revision: D57422755

Pulled By: cbi42

fbshipit-source-id: 635663f26052b20e485dfa06a7c0f1f318ac1099
2024-05-16 17:23:56 -07:00
Peter Dillinger 131c8ccfcd Add EntryType for TimedPut (#12669)
Summary:
Represent internal kTypeValuePreferredSeqno in the public API as kEntryTimedPut (because it is created by TimedPut, until the entry can be safely converted to a regular value entry in compaction)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12669

Test Plan: for follow-up work actually using it. But putting this in place in the public API gives us more flexibility in rolling out that follow-up work (e.g. as a user extension or patch if needed).

Reviewed By: jowlyzhang

Differential Revision: D57459637

Pulled By: pdillinger

fbshipit-source-id: 160ccf7c4e524ee479558846b2a207d51b8b3d9c
2024-05-16 15:18:12 -07:00
Changyu Bi 2eb404de13 Print non-ok status for multi_ops_txns_stress test (#12660)
Summary:
Currently `assert(s.ok())` does not offer much information to debug.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12660

Test Plan:
revert https://github.com/facebook/rocksdb/issues/12639 and run `python3 ./tools/db_crashtest.py blackbox --write_policy write_prepared --recycle_log_file_num=1 --threads=1 --test_multiops_txn`

before this PR:
```
Exit Before Killing
stdout:

stderr:
 db_stress: db_stress_tool/multi_ops_txns_stress.cc:1529: void rocksdb::MultiOpsTxnsStressTest::PreloadDb(rocksdb::SharedState*, int, uint32_t, uint32_t, uint32_t, uint32_t): Assertion `s.ok()' failed.
```

after this PR:
```
Exit Before Killing
stdout:

stderr:
 Verification failed: PreloadDB failed: Invalid argument: WriteOptions::disableWAL option is not supported if DBOptions::recycle_log_file_num > 0
db_stress: db_stress_tool/db_stress_test_base.cc:517: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, rocksdb::Status, bool) const: Assertion `false' failed.
Received signal 6 (Aborted)
```

Reviewed By: ajkr

Differential Revision: D57410819

Pulled By: cbi42

fbshipit-source-id: a03c2202c3fd666eb2f58bae24e0c9e3e6ed4265
2024-05-15 21:14:41 -07:00
Andrew Kryczka 4eaf628120 Add Iterator property "rocksdb.iterator.is-value-pinned" (#12659)
Summary:
`ReadOptions::pin_data` already has the effect of pinning the `Slice` returned by `Iterator::value()` when the value is stored inline (e.g., `kTypeValue`). This PR adds a bit of visibility into that via a new `Iterator` property, "rocksdb.iterator.is-value-pinned", as well as some documentation and tests.

See also: https://github.com/facebook/rocksdb/issues/12658

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12659

Reviewed By: cbi42

Differential Revision: D57391200

Pulled By: ajkr

fbshipit-source-id: 0caa8db27ca1aba86ee2addc3dfd6f0e003d32e2
2024-05-15 19:11:52 -07:00
Peter Dillinger 3ed46e0668 Handle early exit in DBErrorHandlingFSTests (#12655)
Summary:
To avoid use-after-free on custom env on ASSERT_WHATEVER failure.

This is motivated by a rare crash seen in DBErrorHandlingFSTest.WALWriteError (VersionSet::GetObsoleteFiles in a SstFileManagerImpl::ClearError thread) and wanting to rule out this being related to that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12655

Test Plan: manually seeing ASSERT_WHATEVER failures, especially under ASAN

Reviewed By: cbi42

Differential Revision: D57358202

Pulled By: pdillinger

fbshipit-source-id: 4da2a0d73a54380b257e5cc1ab6c666e26b83973
2024-05-14 16:44:32 -07:00
Jay Huh b4c6956a59 Add MultiGetEntity AttributeGroup API to stress test (#12640)
Summary:
Continuing from https://github.com/facebook/rocksdb/pull/12605, adding AttributeGroup `MultiGetEntity` API to stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12640

Test Plan:
**AttributeGroup Tests**

NonBatchOps
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 --use_multi_get=1
```

BatchOps
```
python3 tools/db_crashtest.py blackbox  --test_batches_snapshots=1 --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 --use_multi_get=1
```

CfConsistency Test
```
python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1 --use_multi_get=1
```

**Non-AttributeGroup Tests**

NonBatchOps
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1
```

BatchOps
```
python3 tools/db_crashtest.py blackbox  --test_batches_snapshots=1 --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1
```

CfConsistency Test
```
python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1
```

Reviewed By: ltamasi

Differential Revision: D57233931

Pulled By: jaykorean

fbshipit-source-id: 8cea771ac2e5749050bf5319360c6c5aa476d7d5
2024-05-14 16:33:44 -07:00
Andrii Lysenko b9fc13db69 Add padding before timestamp size record if it doesn't fit into a WAL block. (#12614)
Summary:
If timestamp size record doesn't fit into a block, without padding `Writer::EmitPhysicalRecord` fails on assert (either `assert(block_offset_ + kHeaderSize + n <= kBlockSize);` or `assert(block_offset_ + kRecyclableHeaderSize + n <= kBlockSize)`, depending on whether recycling log files is enabled)  in debug build. In release, current block grows beyond 32K, `block_offset_` gets reset on next `AddRecord` and all the subsequent blocks are no longer aligned by block size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12614

Reviewed By: ltamasi

Differential Revision: D57302140

Pulled By: jowlyzhang

fbshipit-source-id: cacb5cefb7586885e52a8137ae23a95e1aefca2d
2024-05-14 15:54:02 -07:00
Yu Zhang 1567d50a27 Temporarily disable file ingestion if LockWAL is tested (#12642)
Summary:
As titled. A proper fix should probably be failing file ingestion if the DB is in a lock wal state as it promises to "Freezes the logical state of the DB".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12642

Reviewed By: pdillinger

Differential Revision: D57235869

Pulled By: jowlyzhang

fbshipit-source-id: c70031463842220f865621eb6f53424df27d29e9
2024-05-14 09:27:48 -07:00
Yu Zhang c110091d36 Support read timestamp in ldb (#12641)
Summary:
As titled. Also updated sst_dump to print out user-defined timestamp separately and in human readable format.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12641

Test Plan:
manually tested:
Example success run:
./ldb --db=$TEST_DB multi_get_entity 0x00000000000000950000000000000120 --key_hex --column_family=7  --read_timestamp=1115613683797942 --value_hex
0x00000000000000950000000000000120 ==> :0x0E0000000A0B080906070405020300011E1F1C1D1A1B181916171415121310112E2F2C2D2A2B282926272425222320213E3F3C3D3A3B383936373435323330314E4F4C4D4A4B484946474445424340415E5F5C5D5A5B58595657545552535051
Example failed run:
Failed: GetEntity failed: Invalid argument: column family enables user-defined timestamp while --read_timestamp is not a valid uint64 value.

sst_dump print out:
'000000000000015D000000000000012B000000000000013B|timestamp:1113554493256671' seq:2330405, type:1 => 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E

Reviewed By: ltamasi

Differential Revision: D57297006

Pulled By: jowlyzhang

fbshipit-source-id: 8486d91468e4f6c0d42dca3c9629f1c45a92bf5a
2024-05-13 15:43:12 -07:00
Hui Xiao 20213d01a3 Fix crash in CompactFiles() of conflict range under preclude_last_level_data_seconds > 0 (#12628)
Summary:
**Context/Summary:**

Previously `CompactFiles()` used `RangeOverlapWithCompaction()` to check for conflict when sanitizing input files while later used `FilesRangeOverlapWithCompaction()` to assert for no conflict. The latter function checks for more conflict scenarios than the former does, particularly the ones arising from `preclude_last_level_data_seconds > 0` (i.e, compaction can output to second-to-the-last level). So we ran into assertion violation in `CompactFiles()` like below
```
 Assertion `output_level == 0 || !FilesRangeOverlapWithCompaction( input_files, output_level, Compaction::EvaluatePenultimateLevel(vstorage, ioptions_, start_level, output_level))' failed.
```

This PR make `CompactFiles()` used `FilesRangeOverlapWithCompaction()` and return Aborted status upon range conflict instead of crashing (during debug build) or proceed incorrectly (during non-debug build). To do so cleanly, I included a refactoring to make `FilesRangeOverlapWithCompaction()` part of `SanitizeAndConvertCompactionInputFiles()`, replacing `RangeOverlapWithCompaction()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12628

Test Plan: New UT crashed before the fix and return correct status after the fix.

Reviewed By: cbi42

Differential Revision: D57123536

Pulled By: hx235

fbshipit-source-id: f963a2c9e7ba1a9927a67fcc87f0dce126d3a430
2024-05-13 13:12:06 -07:00
Peter Dillinger 7747abdc15 Disable PromoteL0 in crash test (#12651)
Summary:
Seeing way too many errors likely related to PromoteL0 from https://github.com/facebook/rocksdb/issues/12617, containing
```
Cannot delete table file #N from level 0 since it is on level X
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12651

Test Plan: watch crash test results

Reviewed By: hx235

Differential Revision: D57286208

Pulled By: pdillinger

fbshipit-source-id: f7f0560cc0804ca297373c8d20ebc34986cc19d0
2024-05-13 12:42:01 -07:00
Peter Dillinger b75438f986 Allow disableWAL+recycle with WritePreparedTxnDB internals (#12639)
Summary:
Follow-up from https://github.com/facebook/rocksdb/issues/12403

The crash test was periodically failing with the
"disableWAL option is not supported if recycle_log_file_num > 0" failure, despite not setting the disableWAL from the user side.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12639

Test Plan: db_stress reproducer now passes. Added WAL recycling to txn DB unit tests, which is generally more difficult for correctness. Many tests now cover this change and pass.

Reviewed By: anand1976

Differential Revision: D57227617

Pulled By: pdillinger

fbshipit-source-id: db9abefeb505bce624b45bc64009694d2a5baed9
2024-05-10 17:56:40 -07:00
Yu Zhang 7d9642d876 Add logging for read timestamp during VerifyDB (#12638)
Summary:
As titled. To help debug some verification failures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12638

Test Plan: manually tested

Reviewed By: ajkr

Differential Revision: D57219549

Pulled By: jowlyzhang

fbshipit-source-id: 59c05ac85fb1c24449e7394ea04172c855d86420
2024-05-10 12:34:53 -07:00
Levi Tamasi b92d874c8b Support MultiGetEntity in optimistic and WriteCommitted pessimistic transactions (#12634)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12634

The patch implements support for the `MultiGetEntity` API in optimistic transactions and pessimistic transactions with the WriteCommitted policy. Similarly to the other wide-column transaction APIs, the implementation leverages the `WriteBatchWithIndex` layer.

Reviewed By: jaykorean

Differential Revision: D57177638

fbshipit-source-id: 2d9f9f287fc97e7c126830b48d21457c7c35db3f
2024-05-09 16:49:38 -07:00
Jay Huh 1f2715d1d2 AttributeGroup APIs in stress test - PutEntity and GetEntity (#12605)
Summary:
Adding AttributeGroup APIs in stress test. This contains the following changes only. More PRs to follow.

- Introduce `use_attribute_group` flag
- AttributeGroup `PutEntity()` and `GetEntity()` are now used per `use_attribute_group` flag in BatchOps, NonBatchOps and CfConsistency tests

In the next PRs I plan to add
- AttributeGroup `MultiGetEntity()` in Stress Test
- AttributeGroupIterator in Stress Test (along with CoalescingIterator)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12605

Test Plan:
NonBatchOps
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1
```

BatchOps
```
python3 tools/db_crashtest.py blackbox --test_batches_snapshots=1 --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1
```

CfConsistency Test
```
python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=1 --use_put_entity_one_in=1
```

Reviewed By: ltamasi

Differential Revision: D56916768

Pulled By: jaykorean

fbshipit-source-id: 8555d9e0d05927740a10e4e8301e44beec59a6f5
2024-05-09 16:40:22 -07:00
Hui Xiao 9bddac0dcf Add more public APIs to crash test (#12617)
Summary:
**Context/Summary:**
As titled. Bonus: found that PromoteL0 called with other concurrent PromoteL0 will return non-okay error so clarify the API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12617

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D56954428

Pulled By: hx235

fbshipit-source-id: 0e056153c515003fd241ffec59b0d8a27529db4c
2024-05-09 15:37:38 -07:00
Levi Tamasi 97e70906fa Improve the sanity checks in (Multi)GetEntity and friends (#12630)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12630

The patch cleans up, improves, and brings into sync (to the extent possible without API signature changes) the sanity checks around the `GetEntity` / `MultiGetEntity` family of APIs, including the read-your-own-writes (`WriteBatchWithIndex`) and transaction layers. The checks are centralized in two main sets of entry points, namely in `DB(Impl)` and the "main" `GetEntityFromBatchAndDB` / `MultiGetEntityFromBatchAndDB` overloads in `WriteBatchWithIndex`. This eliminates the need to duplicate the checks in the transaction classes.

Reviewed By: jaykorean

Differential Revision: D57125741

fbshipit-source-id: 4dd059ef644a9b173fbba767538943397e4cc6cd
2024-05-09 12:25:19 -07:00
Wei Liu 1a3357648f Error log update to db_impl_compaction_flush.cc (#12608)
Summary:
Make the error looks better.

Some inconsistency here and [here](e46ab9d4f0/db/db_impl/db_impl_compaction_flush.cc (L2701-L2702))

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12608

Reviewed By: ajkr

Differential Revision: D57134933

Pulled By: cbi42

fbshipit-source-id: 2f19f077f388d196652a4e3afd2526f18bf75b2d
2024-05-09 11:36:24 -07:00
Yu Zhang 9dc171e3bb Fix issue that cause false alarm corruption report (#12626)
Summary:
The state of `saved_seq_for_penul_check_` is not correctly maintained with the current flow. It's supposed to store the original sequence number for a `kTypeValuePreferredSeqno` entry for use in the `DecideOutputLevel` function. However, it's not always properly cleared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12626

Test Plan:
Added unit test that would fail before the fix
./tiered_compaction_test --gtest_filter="*InterleavedTimedPutAndPut*"

Reviewed By: pdillinger

Differential Revision: D57123469

Pulled By: jowlyzhang

fbshipit-source-id: 8d73214b3b6dc152daf19b6bd6ee9063581dc277
2024-05-08 15:51:38 -07:00
Peter Dillinger eeda54fe63 Fix db_crashtest.py for prefixpercent and enable_compaction_filter (#12627)
Summary:
After https://github.com/facebook/rocksdb/issues/12624 seeing db_stress failures due to db_crashtest.py calling it with --prefixpercent=5 --enable_compaction_filter=1

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12627

Test Plan: watch crash test

Reviewed By: ajkr

Differential Revision: D57121592

Pulled By: pdillinger

fbshipit-source-id: 55727355a7662e67efcd22d7e353153e78e24f59
2024-05-08 13:16:44 -07:00
Andrew Kryczka 933ac0e05c Fix locking for ColumnFamilyOptions::inplace_update_support (#12624)
Summary:
In `SaveValue()`, the read lock needs to be obtained before `VerifyEntryChecksum()` because the KV checksum verification reads the entire value metadata+data, which is all mutable when `ColumnFamilyOptions::inplace_update_support == true`.

In `MemTable::Update()`, the write lock needs to be obtained before mutating the value metadata (changing the value size) because it can be read concurrently.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12624

Test Plan:
```
$ make COMPILE_WITH_TSAN=1 -j56 db_stress
...
$ python3 tools/db_crashtest.py blackbox --simple --max_key=10 --inplace_update_support=1 --interval=10 --allow_concurrent_memtable_write=0
```

Reviewed By: cbi42

Differential Revision: D57034571

Pulled By: ajkr

fbshipit-source-id: 3dddf881ad87923143acdf6bfec12ce47bb13a48
2024-05-08 08:30:12 -07:00
Hans Holmberg b8400c9faf Make linux file write life time hinting work (#12595)
Summary:
The life time hint fcntl takes a 64-bit unsigned int, so make sure to pass a uint64_t when doing the syscall.

See:

https://man7.org/linux/man-pages/man2/fcntl.2.html
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c75b1d9421f80f4143e389d2d50ddfc8a28c8c35

This is one of those "How did this ever work?", as Env::WriteLifeTimeHint hint is definitely not the same as an 64-bit unsigned int.
What's surprising is that SetWriteLifeTimeHint does pass a valid hint from time to time.

Thanks,
Hans

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12595

Reviewed By: cbi42

Differential Revision: D56901280

Pulled By: ajkr

fbshipit-source-id: f276348863cbc29a537bed9450b16b0cc513ea78
2024-05-07 17:54:50 -07:00
Changyu Bi 5bf2c00a35 Clarify manual compaction and file ingestion behavior with FIFO compaction (#12618)
Summary:
For manual compaction, FIFO compaction will always skip key range overlapping checking with SST files. If CompactRange() is called with CompactionRangeOptions::change_level=true, a CF with FIFO compaction will now return Status::NotSupported.

For file ingestion, we will always ingest into L0. Previously, it's possible to ingest files into non-L0 levels with FIFO compaction.

These changes also help to fix [this](a178d15baf/db/db_impl/db_impl_compaction_flush.cc (L1269)) assertion failure in crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12618

Test Plan: added unit tests to verify the new behavior.

Reviewed By: hx235

Differential Revision: D56962401

Pulled By: cbi42

fbshipit-source-id: 19812a1509650b4162b379ca5bee02f2e9d9569d
2024-05-07 12:00:15 -07:00
Levi Tamasi 83d051a8d9 Add release note for GetEntity transaction support (#12625)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12625

Reviewed By: jaykorean

Differential Revision: D57059775

fbshipit-source-id: 80b3ddb51d538c6c21b69cd589f4ee8dd13596c9
2024-05-07 11:38:04 -07:00
Levi Tamasi eaa3226ef7 Add support for GetEntity in optimistic and WriteCommitted pessimistic transactions (#12623)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12623

The PR adds support for the `GetEntity` API to optimistic and WriteCommitted pessimistic transactions. `MultiGetEntity` support and the `ForUpdate` variants of these read APIs will be implemented in subsequent PRs.

Reviewed By: jaykorean

Differential Revision: D57030879

fbshipit-source-id: 1f0aed6418782975fe537b6b3d437fad31fcbd43
2024-05-07 10:20:26 -07:00
Zaidoon Abd Al Hadi 36ab251c07 Expose block based metadata cache options via C API (#12611)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12611

Reviewed By: jaykorean

Differential Revision: D56961823

Pulled By: ajkr

fbshipit-source-id: aa062cdb49a0bb2c1148a81d4c882a4733c7790e
2024-05-06 16:49:11 -07:00
Levi Tamasi 45c290660a Add PutEntity support for optimistic and WritePrepared pessimistic transactions (#12606)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12606

The patch extends optimistic transactions and WriteCommitted pessimistic transactions with support for the `PutEntity` API. Similarly to the other APIs, `PutEntity` is available via both the `Transaction` and `TransactionDB` interfaces, where using the latter executes the write in a single-operation transaction as usual. Support for read APIs and other write policies (WritePrepared, WriteUnprepared) will be added in separate PRs.

Reviewed By: jaykorean

Differential Revision: D56911242

fbshipit-source-id: 57cf8bb6c6b1b40ba4a8a652831c13a617644289
2024-05-06 14:41:00 -07:00
Andrew Kryczka 0fef690bd5 Sync non-latest WALs during flush for 2PC, single-CF DBs (#12622)
Summary:
Previously we skipped syncing the non-latest WALs during memtable flush when the DB had only one column family. Normally that is fine because those non-latest WALs would not be read by recovery. However, in case of `DBOptions::allow_2pc == true`, there could be unmatched prepare records in those WALs making them needed by recovery. As a result, the missing sync could have resulted in the recovered WAL state falling behind the recovered SST state. When we detect that case, we return a `Status::Corruption` saying "SST file is ahead of WALs".

This PR proposes syncing the WAL in case of `DBOptions::allow_2pc`. This introduces the sync in some scenarios where it isn't needed (e.g., non-recent WALs contain no prepares) but I suspect the simplicity is worth it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12622

Reviewed By: cbi42

Differential Revision: D56987303

Pulled By: ajkr

fbshipit-source-id: 7fe9395458018a18d77e907a3b5429065c0e2e48
2024-05-06 11:56:16 -07:00
Changyu Bi 6fdc4c5282 Fix a corruption bug in CreateColumnFamilyWithImport() (#12602)
Summary:
when importing files from multiple CFs into a new CF, we were reusing the epoch numbers assigned by the original CFs. This means L0 files in the new CF can have the same epoch number (assigned originally by different CFs). While CreateColumnFamilyWithImport() requires each original CF to have disjoint key range, after an intra-l0 compaction, we still can end up with L0 files with the same epoch number but overlapping key range. This PR attempt to fix this by reassigning epoch numbers when importing multiple CFs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12602

Test Plan:
a new repro unit test. Before this PR, it fails with
```
[ RUN      ] ImportColumnFamilyTest.AssignEpochNumberToMultipleCF
db/import_column_family_test.cc:1048: Failure
db_->WaitForCompact(o)
Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 files of same epoch number but overlapping range https://github.com/facebook/rocksdb/issues/44 , smallest key: '6B6579303030303030' seq:511, type:1 , largest key: '6B6579303031303239' seq:510, type:1 , epoch number: 3 vs. file https://github.com/facebook/rocksdb/issues/36 , smallest key: '6B6579303030313030' seq:401, type:1 , largest key: '6B6579303030313939' seq:500, type:1 , epoch number: 3
```

Reviewed By: hx235

Differential Revision: D56851808

Pulled By: cbi42

fbshipit-source-id: 01b8c790c9f1f2a168047ead670e73633f705b84
2024-05-06 11:01:38 -07:00