Summary:
This PR adds a `DB::WriteWithCallback` API that does the same things as `DB::Write` while takes an argument `UserWriteCallback` to execute custom callback functions during the write.
We currently support two types of callback functions: `OnWriteEnqueued` and `OnWalWriteFinish`. The former is invoked after the write is enqueued, and the later is invoked after WAL write finishes when applicable.
These callback functions are intended for users to use to improve synchronization between concurrent writes, their execution is on the write's critical path so it will impact the write's latency if not used properly. The documentation for the callback interface mentioned this and suggest user to keep these callback functions' implementation minimum.
Although transaction interfaces' writes doesn't yet allow user to specify such a user write callback argument, the `DBImpl::Write*` type of APIs do not differentiate between regular DB writes or writes coming from the transaction layer when it comes to supporting this `UserWriteCallback`. These callbacks works for all the write modes including: default write mode, Options.two_write_queues, Options.unordered_write, Options.enable_pipelined_write
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12603
Test Plan: Added unit test in ./write_callback_test
Reviewed By: anand1976
Differential Revision: D58044638
Pulled By: jowlyzhang
fbshipit-source-id: 87a84a0221df8f589ec8fc4d74597e72ce97e4cd
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12717
The PR adds `Transaction::MultiGetEntity` to the stress tests. Similarly to what we do for `Transaction::MultiGet`, in this mode we open a transaction and randomly add writes for some of the queried keys to it while keeping track of the values written on a per-key basis. The results of `Transaction::MultiGetEntity` can then be validated against these expected values (in order to test the read-your-own-writes functionality) as well as the results returned by `Transaction::GetEntity` for the same keys.
Reviewed By: jaykorean
Differential Revision: D57990210
fbshipit-source-id: 9bf3bb292051c2c57757f86b517919197b03c524
Summary:
Introduce `use_multi_cf_iterator`, and when it's set, use `CoalescingIterator` in `TestIterate()`. Because all the column families contain the same data in today's Stress Test, we can compare `CoalescingIterator` against any `DBIter` from any of the column families. Currently, coalescing logic verification is done by unit tests, but we can extend the stress test to support different data in different column families in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12706
Test Plan:
```
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --use_attribute_group=0 --use_put_entity_one_in=1 --use_multi_get=1 --use_multi_cf_iterator=1
```
**More PRs to come**
- Use `AttributeGroupIterator` when both `use_multi_cf_iterator` and `use_attribute_group` are true
- Support `Refresh()` in `CoalescingIterator`
- Extend Stress Test to support different data in different CFs (Long-term)
Reviewed By: ltamasi
Differential Revision: D58020247
Pulled By: jaykorean
fbshipit-source-id: 8e2483b85cf2bb0f5a9bb44851601bbf063484ec
Summary:
As titled. This PR also makes the interactive query tool more permissive by allowing the user to continue to try out a different command after the previous command received some allowed errors, such as `Status::NotFound`, `Status::InvalidArgument`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12716
Test Plan:
Manually tested:
```
yuzhangyu@yuzhangyu-mbp rocksdb % ./ldb --db=$TEST_DB --key_hex --value_hex query
get 0x0000000000000000 --read_timestamp=1115559245398440
0x0000000000000000|timestamp:1115559245398440 ==> 0x07000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B
put 0x0000000000000000 0x0000
put 0x0000000000000000 => 0x0000 failed: Invalid argument: cannot call this method on column family default that enables timestamp
put 0x0000000000000000 aha 0x0000
put gets invalid argument: Invalid argument: user provided timestamp is not a valid uint64 value.
put 0x0000000000000000 1115559245398441 0x08000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B
put 0x0000000000000000 write_ts: 1115559245398441 => 0x08000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B succeeded
delete 0x0000000000000000
delete 0x0000000000000000 failed: Invalid argument: cannot call this method on column family default that enables timestamp
delete 0x0000000000000000 1115559245398442
delete 0x0000000000000000 write_ts: 1115559245398442 succeeded
get 0x0000000000000000 --read_timestamp=1115559245398442
get 0x0000000000000000 read_timestamp: 1115559245398442 status: NotFound:
get 0x0000000000000000 --read_timestamp=1115559245398441
0x0000000000000000|timestamp:1115559245398441 ==> 0x08000000000102030C0D0E0F08090A0B14151617101112131C1D1E1F18191A1B24252627202122232C2D2E2F28292A2B34353637303132333C3D3E3F38393A3B
count --from=0x0000000000000000 --to=0x0000000000000001
scan from 0x0000000000000000 to 0x0000000000000001failed: Invalid argument: cannot call this method on column family default that enables timestamp
count --from=0x0000000000000000 --to=0x0000000000000001 --read_timestamp=1115559245398442
0
count --from=0x0000000000000000 --to=0x0000000000000001 --read_timestamp=1115559245398441
1
```
Reviewed By: ltamasi
Differential Revision: D57992183
Pulled By: jowlyzhang
fbshipit-source-id: 720525de22412d16aa952870e088f2c371459ece
Summary:
These functions were very similar and did not make sense for maintaining separately. This is not a pure refactor but I think bringing the behaviors closer together should reduce long term risk of unintentionally divergent behavior. This change is motivated by some forthcoming WAL handling fixes for Checkpoint and Backups.
* Sync() is always used on closed WALs, like the old SyncClosedWals. SyncWithoutFlush() is only used on the active (maybe) WAL. Perhaps SyncWithoutFlush() should be used whenever available, but I don't know which is preferred, as the previous state of the code was inconsistent.
* Syncing the WAL dir is selective based on need, like old SyncWAL, rather than done always like old SyncClosedLogs. This could be a performance improvement that was never applied to SyncClosedLogs but now is. We might still sync the dir more times than necessary in the case of parallel SyncWAL variants, but on a good FileSystem that's probably not too different performance-wise from us implementing something to have threads wait on each other.
Cosmetic changes:
* Rename internal function SyncClosedLogs to SyncClosedWals
* Merging the sync points into the common implementation between the two entry points isn't pretty, but should be fine.
Recommended follow-up:
* Clean up more confusing naming like log_dir_synced_
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12707
Test Plan: existing tests
Reviewed By: anand1976
Differential Revision: D57870856
Pulled By: pdillinger
fbshipit-source-id: 5455fba016d25dd5664fa41b253f18db2ca8919a
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12715
The patch refactors/deduplicates the non-attribute-group and attribute-group code paths in `NonBatchedOpsStressTest::TestMultiGetEntity` by introducing two new generic lambdas `verify_expected_errors` and `check_results` (the latter of which subsumes the existing `handle_results`) that can handle both types of APIs. This change also serves as groundwork for the upcoming transactional `MultiGetEntity` stress tests.
Reviewed By: jaykorean
Differential Revision: D57977700
fbshipit-source-id: 83a18a9e57f46ea92ba07b2f0dca3e9bc353f257
Summary:
A `BlockBasedTable` with `TieredSecondaryCache` containing a NVM cache inserts blocks into the compressed cache and the corresponding compressed block into the NVM cache. The `BlockFetcher` is used to get the uncompressed and compressed blocks by calling `ReadBlockContents()` and `GetUncompressedBlock()` respectively. If the file system supports FSBuffer (i.e returning a FS allocated buffer rather than caller provided), that buffer gets freed between the two calls. This PR fixes it by making the FSBuffer unique pointer a member rather than local variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12712
Test Plan:
1. Add a unit test
2. Release validation stress test
Reviewed By: jaykorean
Differential Revision: D57974026
Pulled By: anand1976
fbshipit-source-id: cfa895914e74b4f628413b40e6e39d8d8e5286bd
Summary:
We tested on icelake server (vcpu=160). The default configuration is allow_concurrent_memtable_write=1, thread number =activate core number. With our optimizations, the improvement can reach up to 184% in fillseq case. op/s is as the performance indicator in db_bench, and the following are performance improvements in some cases in db_bench.
| case name | optimized/original |
|-------------------:|--------------------:|
| fillrandom | 182% |
| fillseq | 184% |
| fillsync | 136% |
| overwrite | 179% |
| randomreplacekeys | 180% |
| randomtransaction | 161% |
| updaterandom | 163% |
| xorupdaterandom | 165% |
With analysis, we find that although the process of writing memtable is processed in parallel, the process of waking up the writers is not processed in parallel, which means that only one writers is responsible for the sequential waking up other writers. The following is our method to optimize this process.
Assume that there are currently n threads in total, we parallelize SetState in LaunchParallelMemTableWriters. To wake up each writer to write its own memtable, the leader writer first wakes up the (n^0.5-1) caller writers, and then those callers and the leader will wake up n/x separately to write to the memtable. This reduces the number for the leader's to SetState n-1 writers to 2*(n^0.5) writers in turn.
A reproduction script:
./db_bench --benchmarks="fillrandom" --threads ${number of all activate vcpu} --seed 1708494134896523 --duration 60
![image](https://github.com/facebook/rocksdb/assets/22110918/c5eca02f-93b3-4434-bba2-5155fc892a97)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12545
Reviewed By: ajkr
Differential Revision: D57422827
Pulled By: cbi42
fbshipit-source-id: 94127937c0c61e4241720bd902c82c607b7b2431
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12711
The patch adds the missing other half of https://github.com/facebook/rocksdb/pull/12709: when there is no locking in a read test, we have to be more permissive when it comes to values returned by queries. In particular, any expected state value in a small window around the read call should be allowed, and discrepancies in the presence/absence of a key should only be treated as a failure if the key is guaranteed to have not existed/existed during the above window.
Reviewed By: hx235
Differential Revision: D57938678
fbshipit-source-id: cd5c8bc2e014ec12ea4daf441965f3ec2115663e
Summary:
https://github.com/facebook/rocksdb/issues/12512 added the sanity check for this incompatible combination. However, it does the check during memtable insertion which can turn the DB into read-only mode. This PR moves the check earlier so that this write failure will not turn the DB into read-only mode and affect other DB operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12710
Test Plan: * updated unit test `DBRangeDelTest.RowCache` to write to DB after a failed DeleteRange(). The test fails before this PR.
Reviewed By: ajkr
Differential Revision: D57925188
Pulled By: cbi42
fbshipit-source-id: 8bf001bd3fcf05635411ba28bc4a037321942879
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12709
This is most likely copypasta from `TestGet` from before https://github.com/facebook/rocksdb/pull/11058 . There is no need to lock the mutex for the key for reads; in fact, doing so is detrimental to test coverage since it locks out concurrent writers.
Reviewed By: jowlyzhang
Differential Revision: D57915207
fbshipit-source-id: eb0dbf6b84e5408b87d96dd47597511996e206a7
Summary:
Add the `--leader_path` option to specify the directory path of the leader for a follower RocksDB instance. This PR also adds a `count` command to the repl shell. While not specific to followers, it is useful for testing purposes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12682
Reviewed By: jowlyzhang
Differential Revision: D57642296
Pulled By: anand1976
fbshipit-source-id: 53767d496ecadc363ff92cd958b8e15a7bf3b151
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12699
The patch adds `PutEntity` to the potential write operations used in the read-your-own-writes tests for `Transaction::MultiGet`. Note that since the stress test generates wide-column structures which have the value returned by `GenerateValue` in the default column, this does not affect the results returned by the `MultiGet` API (unless we have a bug).
The wide-column entity is generated according to the usual rules based on the value base and the `use_put_entity_one_in` flag. The entire entity structure will be validated by the upcoming stress test for `Transaction::MultiGetEntity`, where we also plan to leverage this logic.
Reviewed By: jowlyzhang
Differential Revision: D57799075
fbshipit-source-id: 5f86c2b2b3ceee8e1b8bf7453c02f1f1b1b00751
Summary:
**Context/Summary:**
the flag --json of manifest_dump in ldb tool has no effect
The bug may be introduced by pr https://github.com/facebook/rocksdb/pull/8378
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12703
Reviewed By: cbi42
Differential Revision: D57848094
Pulled By: ajkr
fbshipit-source-id: 3d1ce65528bf4ce9c53593a7208406ab90e8994b
Summary:
This PR adds UpdateTimestamp API of WriteBatch and WBWI, create WB, WBWI with all options and Iterator Refresh in C API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10529
Reviewed By: cbi42
Differential Revision: D57826913
Pulled By: ajkr
fbshipit-source-id: d2ec840129f61a1d3a5a12e859728be98ebbad2f
Summary:
This change replaces the use of `std::unique_ptr` with `std::optional` for conditionally constructing a `ReadLock` object. The read lock object was recently introduced in PR https://github.com/facebook/rocksdb/issues/12624. This change makes the code more concise and clarifies that the lock is not meant to be transferred (as `std::unique_ptr` is movable). It also avoids a heap allocation.
There are no functional changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12704
Reviewed By: cbi42
Differential Revision: D57848192
Pulled By: ajkr
fbshipit-source-id: da48c77aac33b51ba5dcc238f98fc48ccf234a21
Summary:
These names are confusing with `Logger` etc. so moving to `WalFile` etc.
Other small, related name refactorings.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12695
Test Plan: Left most unit tests using old names as an API compatibility test. Non-test code compiles with deprecated names removed. No functional changes.
Reviewed By: ajkr
Differential Revision: D57747458
Pulled By: pdillinger
fbshipit-source-id: 7b77596b9c20d865d43b9dc66c30c8bd2b3b424f
Summary:
It should be no less than `level0_file_num_compaction_trigger`(which defaults to 4) when set to a positive value. Otherwise DB open will fail.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12701
Test Plan: crash test not failing DB open due to this option value.
Reviewed By: ajkr
Differential Revision: D57825062
Pulled By: cbi42
fbshipit-source-id: 22d8e12aeceb5cef815157845995a8448552e2d2
Summary:
we are converting the implicit loads to explicit loads, then remove the hidden loads in fbcode macroes.
details see https://fb.workplace.com/groups/devx.build.bffs/permalink/7481848805183560/
Reviewed By: JakobDegen
Differential Revision: D57800976
fbshipit-source-id: a893aa2aa9237704ba9eb998cba210222c95dd2f
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12697
As groundwork for stress testing `Transaction::MultiGetEntity`, the patch factors out the logic for adding transactional writes for some of the keys in a `MultiGet` batch into a separate helper method called `MaybeAddKeyToTxnForRYW`.
Reviewed By: jowlyzhang
Differential Revision: D57791830
fbshipit-source-id: ef347ba6e6e82dfe5cedb4cf67dd6d1503901d89
Summary:
As titled. For dumping wal files, since a mapping from column family id to the user comparator object is needed to print the timestamp in human readable format, option `[--db=<db_path>]` is added to `dump_wal` command to allow the user to choose to optionally open the DB as read only instance and dump the wal file with better timestamp formatting.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12690
Test Plan:
Manually tested
dump_wal:
[dump a wal file specified with --walfile]
```
>> ./ldb --walfile=$TEST_DB/000004.log dump_wal --print_value
>>1,1,28,13,PUT(0) : 0x666F6F0100000000000000 : 0x7631
(Column family id: [0] contained in WAL are not opened in DB. Applied default hex formatting for user key. Specify --db=<db_path> to open DB for better user key formatting if it contains timestamp.)
```
[dump with --db specified for better timestamp formatting]
```
>> ./ldb --walfile=$TEST_DB/000004.log dump_wal --db=$TEST_DB --print_value
>> 1,1,28,13,PUT(0) : 0x666F6F|timestamp:1 : 0x7631
```
dump:
[dump a file specified with --path]
```
>>./ldb --path=/tmp/rocksdbtest-501/column_family_test_75359_17910784957761284041/000004.log dump
Sequence,Count,ByteSize,Physical Offset,Key(s) : value
1,1,28,13,PUT(0) : 0x666F6F0100000000000000 : 0x7631
(Column family id: [0] contained in WAL are not opened in DB. Applied default hex formatting for user key. Specify --db=<db_path> to open DB for better user key formatting if it contains timestamp.)
```
[dump db specified with --db]
```
>> ./ldb --db=/tmp/rocksdbtest-501/column_family_test_75359_17910784957761284041 dump
>> foo|timestamp:1 ==> v1
Keys in range: 1
```
idump
```
./ldb --db=$TEST_DB idump
'foo|timestamp:1' seq:1, type:1 => v1
Internal keys in range: 1
```
Reviewed By: ltamasi
Differential Revision: D57755382
Pulled By: jowlyzhang
fbshipit-source-id: a0a2ef80c92801cbf7bfccc64769c1191824362e
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12696
Two fixes:
1) `Random::Uniform(n)` returns an integer from the interval [0, n - 1], so `Uniform(2)` returns 0 or 1, which means is that we have apparently never covered transactions with deletions in the test. (To prevent similar issues, the patch cleans this write logic up a bit using an `enum class` for the type of write.)
2) The keys passed in to `TestMultiGet` can have duplicates. What this boils down to is that we have to keep track of the latest expected values for read-your-own-writes on a per-key basis.
Reviewed By: jowlyzhang
Differential Revision: D57750212
fbshipit-source-id: e8ab603252c32331f8db0dfb2affcca1e188c790
Summary:
I think the point of the `if (end_of_buffer_offset_ - buffer_.size() == 0)` was to only set `recycled_` when the first record was read. However, the condition was false when reading the first record when the WAL began with a `kSetCompressionType` record because we had already dropped the `kSetCompressionType` record from `buffer_`. To fix this, I used `first_record_read_` instead.
Also, it was pretty confusing to treat the WAL as non-recycled when a recyclable record first appeared in a non-first record. I changed it to return an error if that happens.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12643
Reviewed By: hx235
Differential Revision: D57238099
Pulled By: ajkr
fbshipit-source-id: e20a2a0c9cf0c9510a7b6af463650a05d559239e
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12688
As a first step of covering the wide-column transaction APIs, the patch adds `PutEntity` to the optimistic and pessimistic transaction stress tests (for the latter, only when the WriteCommitted policy is utilized). Other APIs and the multi-operation transaction test will be covered by subsequent PRs.
Reviewed By: jaykorean
Differential Revision: D57675781
fbshipit-source-id: bfe062ec5f6ab48641cd99a70f239ce4aa39299c
Summary:
**Context/Summary:** https://github.com/facebook/rocksdb/pull/12556 `avoid_sync_during_shutdown=false` missed an edge case where `manual_wal_flush == true` so WAL sync will still miss unflushed WAL. This PR fixes it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12684
Test Plan: modified UT to include this case `manual_wal_flush==true`
Reviewed By: cbi42
Differential Revision: D57655861
Pulled By: hx235
fbshipit-source-id: c9f49fe260e8b38b3ea387558432dcd9a3dbec19
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12689
These should be in `snake_case` (not `camelCase`) per our style guide.
Reviewed By: jowlyzhang
Differential Revision: D57676418
fbshipit-source-id: 82ad6a87d1540f0b29c2f864ca0128287fe95a9e
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`
If the code compiles, this is safe to land.
Reviewed By: palmje
Differential Revision: D57632757
fbshipit-source-id: 1dbad2a2e185381e225df8b9027033e06aeaf01b
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12683
With optimistic transactions, the stress test parameter `txn_write_policy` is not applicable and is thus not set. When the parameter is subsequently checked, Python's dictionary `get` method returns `None`, which is not equal to zero. The net result of this is that currently, `sync_fault_injection` and `manual_wal_flush_one_in` are always disabled in optimistic transaction mode (most likely unintentionally).
Reviewed By: cbi42
Differential Revision: D57655339
fbshipit-source-id: 8b93a788f9b02307b6ea7b2129dc012271130334
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12681
When rebuilding transactions during recovery, `MemtableInserter::PutCFImpl` currently calls `WriteBatchInternal::Put` regardless of value type, which is incorrect for `PutEntity` entries, as well as `TimedPut`s and the blob indexes used by the old BlobDB implementation. The patch fixes the handling of `PutEntity` and returns `NotSupported` for `TimedPut`s and blob indices.
Reviewed By: jaykorean, jowlyzhang
Differential Revision: D57636355
fbshipit-source-id: 833de4e4aa0b42ff6638b72c4181f981d12d0f15
Summary:
We recently noticed that some memtable flushed and file
ingestions could proceed during LockWAL, in violation of its stated
contract. (Note: we aren't 100% sure its actually needed by MySQL, but
we want it to be in a clean state nonetheless.)
Despite earlier skepticism that this could be done safely (https://github.com/facebook/rocksdb/issues/12666), I
found a place to wait to wait for LockWAL to be cleared before allowing
these operations to proceed: WaitForPendingWrites()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12652
Test Plan:
Added to unit tests. Extended how db_stress validates LockWAL
and re-enabled combination of ingestion and LockWAL in crash test, in
follow-up to https://github.com/facebook/rocksdb/issues/12642
Ran blackbox_crash_test for a long while with relevant features
amplified.
Suggested follow-up: fix FaultInjectionTestFS to report file sizes
consistent with what the user has requested to be flushed.
Reviewed By: jowlyzhang
Differential Revision: D57622142
Pulled By: pdillinger
fbshipit-source-id: aef265fce69465618974b4ec47f4636257c676ce
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12677
The patch contains two fixes related to printing `PutEntity` records with `ldb dump_wal`:
1) It adds the key to the printout (it was missing earlier).
2) It restores the formatting flags of the output stream after dumping the wide-column structure so that any `hex` flag that might have been set does not affect subsequent printing of e.g. sequence numbers.
Reviewed By: jaykorean, jowlyzhang
Differential Revision: D57591295
fbshipit-source-id: af4e3e219f0082ad39bbdfd26f8c5a57ebb898be
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12676
The patch extends the RocksDB buckifier script so it also creates a `buck` target for the `ldb` tool and updates the `TARGETS` file with the results of the new version of the script.
Reviewed By: cbi42
Differential Revision: D57588789
fbshipit-source-id: 2ed58b405b3f216e802cf6bcbdbf9809e7386c8b
Summary:
the value of `inplace_update_support` option need to be fixed across runs of db_stress on the same DB (https://github.com/facebook/rocksdb/issues/12577). My recent fix (https://github.com/facebook/rocksdb/issues/12673) regressed this behavior. Also fix some existing places where this does not hold.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12675
Test Plan: monitor crash tests related to `inplace_update_support`.
Reviewed By: hx235
Differential Revision: D57576375
Pulled By: cbi42
fbshipit-source-id: 75b1bd233f03e5657984f5d5234dbbb1ffc35c27
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12668
The patch adds a new `GetEntityForUpdate` API to optimistic and WriteCommitted pessimistic transactions, which provides transactional wide-column point lookup functionality with concurrency control. For WriteCommitted transactions, user-defined timestamps are also supported similarly to the `GetForUpdate` API.
Reviewed By: jaykorean
Differential Revision: D57458304
fbshipit-source-id: 7eadbac531ca5446353e494abbd0635d63f62d24
Summary:
gcc 14.1 reports some warnings about dangling-reference occured in backup_engine_test.
```c++
/data/rocksdb/utilities/backup/backup_engine_test.cc: In member function 'virtual void rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test::TestBody()':
/data/rocksdb/utilities/backup/backup_engine_test.cc:4411:64: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
4411 | std::make_pair(alt_backup_engine, backup_engine_.get())}) {
| ^
/data/rocksdb/utilities/backup/backup_engine_test.cc:4410:23: note: the temporary was destroyed at the end of the full expression 'std::make_pair<rocksdb::BackupEngine*, rocksdb::BackupEngine*&>(((rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test*)this)->rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test::rocksdb::{anonymous}::BackupEngineTest.rocksdb::{anonymous}::BackupEngineTest::backup_engine_.std::unique_ptr<rocksdb::BackupEngine>::get(), alt_backup_engine)'
4410 | {std::make_pair(backup_engine_.get(), alt_backup_engine),
| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/data/rocksdb/utilities/backup/backup_engine_test.cc:4411:64: error: possibly dangling reference to a temporary [-Werror=dangling-reference]
4411 | std::make_pair(alt_backup_engine, backup_engine_.get())}) {
| ^
/data/rocksdb/utilities/backup/backup_engine_test.cc:4411:23: note: the temporary was destroyed at the end of the full expression 'std::make_pair<rocksdb::BackupEngine*&, rocksdb::BackupEngine*>(alt_backup_engine, ((rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test*)this)->rocksdb::{anonymous}::BackupEngineTest_ExcludeFiles_Test::rocksdb::{anonymous}::BackupEngineTest.rocksdb::{anonymous}::BackupEngineTest::backup_engine_.std::unique_ptr<rocksdb::BackupEngine>::get())'
4411 | std::make_pair(alt_backup_engine, backup_engine_.get())}) {
| ~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
It seems to be related to this update in gcc:
https://gcc.gnu.org/gcc-14/changes.html#:~:text=%2DWdangling%2Dreference%20false%20positives%20have%20been%20reduced.%20The%20warning%20does%20not%20warn%20about%20std%3A%3Aspan%2Dlike%20classes%3B%20there%20is%20also%20a%20new%20attribute%20gnu%3A%3Ano_dangling%20to%20suppress%20the%20warning.%20See%20the%20manual%20for%20more%20info.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12637
Reviewed By: cbi42
Differential Revision: D57263996
Pulled By: ajkr
fbshipit-source-id: 1e416c38240d3d1adda787fc484c0392e28bb7f1
Summary:
With unsynced data loss, we replay traces to recover expected state to DB's latest sequence number. With `inplace_update_support`, the largest sequence number of memtable may not reflect the latest update. This is because inplace updates in memtable do not update sequence number. So we disable `inplace_update_support` where traces need to be replayed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12673
Reviewed By: ltamasi
Differential Revision: D57512548
Pulled By: cbi42
fbshipit-source-id: 69278fe2e935874faf744d0ac4fd85263773c3ec
Summary:
This PR implements deletion of obsolete files in a follower RocksDB instance. The follower tails the leader's MANIFEST and creates links to newly added SST files. These links need to be deleted once those files become obsolete in order to reclaim space. There are three cases to be considered -
1. New files added and links created, but the Version could not be installed due to some missing files. Those links need to be preserved so a subsequent catch up attempt can succeed. We insert the next file number in the `VersionSet` to `pending_outputs_` to prevent their deletion.
2. Files deleted from the previous successfully installed `Version`. These are deleted as usual in `PurgeObsoleteFiles`.
3. New files added by a `VersionEdit` and deleted by a subsequent `VersionEdit`, both processed in the same catchup attempt. Links will be created for the new files when verifying a candidate `Version`. Those need to be deleted explicitly as they're never added to `VersionStorageInfo`, and thus not deleted by `PurgeObsoleteFiles`.
Test plan -
New unit tests in `db_follower_test`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12657
Reviewed By: jowlyzhang
Differential Revision: D57462697
Pulled By: anand1976
fbshipit-source-id: 898f15570638dd4930f839ffd31c560f9cb73916
Summary:
This test is flaky and a recent failure prints the following:
```
[ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
thread id: 1842811, thread status:
thread id: 1842803, thread status:
db/db_test.cc:4697: Failure
Expected equality of these values:
op_count
Which is: 0
expected_count
Which is: 1
[ FAILED ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (307 ms)
```
Empty thread status implies that operation_type of the threads are all OP_UNKNOWN. From 3ed46e0668/monitoring/thread_status_updater.cc (L197), this can be due to thread_data->operation_type being OP_UNKNOWN or that thread_data->cf_key it not in `cf_info_map_`, potentially due to how cf_key_ is accessed with relaxed memory order. This PR adds some debug print to print the cf_name to check this.
This PR also prints num_running_compaction and lsm state to check if a compaction is indeed running, and removes some not needed options and ensures that exactly 4 L0 files are created.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12661
Test Plan:
- Cannot repro the failure locally: `gtest-parallel --repeat=10000 --workers=200 ./db_test --gtest_filter="*ThreadStatusSingleCompaction*"`
- New failure message will look like:
```
[ RUN ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0
op_count: 1, expected_count 2
thread id: 6104100864, thread status: , cf_name
thread id: 6103527424, thread status: Compaction, cf_name default
running compaction: 1 lsm state: 4
db/db_test.cc:4885: Failure
Value of: match
Actual: false
Expected: true
[ FAILED ] DBTestWithParam/DBTestWithParam.ThreadStatusSingleCompaction/0, where GetParam() = (1, false) (115 ms)
```
Reviewed By: hx235
Differential Revision: D57422755
Pulled By: cbi42
fbshipit-source-id: 635663f26052b20e485dfa06a7c0f1f318ac1099
Summary:
Represent internal kTypeValuePreferredSeqno in the public API as kEntryTimedPut (because it is created by TimedPut, until the entry can be safely converted to a regular value entry in compaction)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12669
Test Plan: for follow-up work actually using it. But putting this in place in the public API gives us more flexibility in rolling out that follow-up work (e.g. as a user extension or patch if needed).
Reviewed By: jowlyzhang
Differential Revision: D57459637
Pulled By: pdillinger
fbshipit-source-id: 160ccf7c4e524ee479558846b2a207d51b8b3d9c
Summary:
`ReadOptions::pin_data` already has the effect of pinning the `Slice` returned by `Iterator::value()` when the value is stored inline (e.g., `kTypeValue`). This PR adds a bit of visibility into that via a new `Iterator` property, "rocksdb.iterator.is-value-pinned", as well as some documentation and tests.
See also: https://github.com/facebook/rocksdb/issues/12658
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12659
Reviewed By: cbi42
Differential Revision: D57391200
Pulled By: ajkr
fbshipit-source-id: 0caa8db27ca1aba86ee2addc3dfd6f0e003d32e2
Summary:
To avoid use-after-free on custom env on ASSERT_WHATEVER failure.
This is motivated by a rare crash seen in DBErrorHandlingFSTest.WALWriteError (VersionSet::GetObsoleteFiles in a SstFileManagerImpl::ClearError thread) and wanting to rule out this being related to that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/12655
Test Plan: manually seeing ASSERT_WHATEVER failures, especially under ASAN
Reviewed By: cbi42
Differential Revision: D57358202
Pulled By: pdillinger
fbshipit-source-id: 4da2a0d73a54380b257e5cc1ab6c666e26b83973