Summary:
Enable the set of crash test for when user defined timestamp is enabled in combination with BlobDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11163
Test Plan: `make check` and `db_stress`/`db_crashtest.py` with various combinations.
Reviewed By: ltamasi
Differential Revision: D42906457
Pulled By: jowlyzhang
fbshipit-source-id: 6bec6449a4213b536c787420ff30a7d17b676deb
Summary:
The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:
* Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
* Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
* Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
* Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
* Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
* Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
Test Plan: unit tests added / updated, added to stress/crash test
Reviewed By: ajkr
Differential Revision: D42848627
Pulled By: pdillinger
fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
Summary:
**Context/Summary:**
This reverts commit fc74abb436 and related HISTORY record.
The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004
Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D41572604
Pulled By: hx235
fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954
Test Plan: tests updated
Reviewed By: akankshamahajan15
Differential Revision: D41310123
Pulled By: pdillinger
fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
Summary:
This option allows a custom cleanup command line for a non-Posix file system to be used by db_crashtest.py to cleanup between runs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10883
Test Plan: Run the whitebox crash test
Reviewed By: pdillinger
Differential Revision: D40726424
Pulled By: anand1976
fbshipit-source-id: b827f6b583ff78f9ca75ced2d96f7e58f5200432
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress
./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```
**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
- [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
- For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
- This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
- For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
- This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
- Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.
Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only
Reviewed By: ajkr
Differential Revision: D40090485
Pulled By: hx235
fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
Summary:
This is a small follow-up to https://github.com/facebook/rocksdb/pull/10821. The goal of that PR was to hold `test_batches_snapshots` fixed across all `db_stress` invocations; however, that patch didn't address the case when `test_batches_snapshots` is unset due to a conflicting `enable_compaction_filter` or `prefix_size` setting. This PR updates the logic so the other parameter is sanitized instead in the case of such conflicts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10830
Reviewed By: riversand963
Differential Revision: D40444548
Pulled By: ltamasi
fbshipit-source-id: 0331265704904b729262adec37139292fcbb7805
Summary:
We have recently made some stress test improvements that rely on decoding the "value base" from the values stored in the database. This logic does not currently support the case when some KVs are written by a non-batched ops run and some by a batched ops run. The patch temporarily disables mixing these two.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10821
Reviewed By: riversand963
Differential Revision: D40367326
Pulled By: ltamasi
fbshipit-source-id: 66f2e0cbc097ab6b1f9e4b39b833bd466f1aaab5
Summary:
We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)
This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768
Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)
Reviewed By: anand1976
Differential Revision: D40034747
Pulled By: anand1976
fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
Summary:
`SstFileWriter` currently does not support the `PutEntity` API, so in `TestIngestExternalFile` all key-values are written using regular `Put`s. This violates the assumption that whether or not a key corresponds to a plain old key-value or a wide-column entity can be determined by solely looking at the "value base" used when generating the value. The patch fixes this issue by disabling ingestion when `PutEntity` is enabled in the stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10769
Test Plan: Ran a simple blackbox stress test.
Reviewed By: akankshamahajan15
Differential Revision: D40042132
Pulled By: ltamasi
fbshipit-source-id: 93e75ff55545b7b69fa4ddef1d96093c961158a0
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661
Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case.
| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|
Reviewed By: riversand963
Differential Revision: D39441192
Pulled By: cbi42
fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in`
Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
- It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false.
- As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698
Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs
Reviewed By: ajkr
Differential Revision: D39593752
Pulled By: ajkr
fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce
Summary:
The patch adds the `PutEntity` API to the non-batched, batched, and
CF consistency stress tests. Namely, when the new `db_stress` command
line parameter `use_put_entity_one_in` is greater than zero, one in
N writes on average is performed using `PutEntity` rather than `Put`.
The wide-column entity written has the generated value in its default
column; in addition, it contains up to three additional columns where
the original generated value is divided up between the column name and the
column value (with the column name containing the first k characters of
the generated value, and the column value containing the rest). Whether
`PutEntity` is used (and if so, how many columns the entity has) is completely
determined by the "value base" used to generate the value (that is, there is
no randomness involved). Assuming the same `use_put_entity_one_in` setting
is used across `db_stress` invocations, this enables us to reconstruct and
validate the entity during subsequent `db_stress` runs.
Note that `PutEntity` is currently incompatible with `Merge`, transactions, and
user-defined timestamps; these combinations are currently disabled/disallowed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10760
Test Plan: Ran some batched, non-batched, and CF consistency stress tests using the script.
Reviewed By: riversand963
Differential Revision: D39939032
Pulled By: ltamasi
fbshipit-source-id: eafdf124e95993fb7d73158e3b006d11819f7fa9
Summary:
An add-on to https://github.com/facebook/rocksdb/pull/6818 to complete adding single-level universal compaction to stress/crash testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10732
Test Plan:
- Locally run for 10 min `python3 ./tools/db_crashtest.py whitebox --simple --compaction_style=1 --num_levels=1 -max_key=1000000 -value_size_mult=33 -write_buffer_size=524288 -target_file_size_base=524288 -max_bytes_for_level_base=2097152 --duration=120 --interval=10 --ops_per_thread=1000 --random_kill_odd=887`
- Check LOG to confirm single-level universal compaction is called
- Manual testing and log checking to ensure destroy_db_initially=1 is correctly set across runs with different compaction styles (i.e, in the second half of whitebox testing).
- [ongoing]CI jobs stress test
Reviewed By: ajkr
Differential Revision: D39797612
Pulled By: ajkr
fbshipit-source-id: 16f5c40c3464c57360c06c8305f92118e426149c
Summary:
**Context:**
Prior to this PR, correctness testing with un-sync data loss [disabled](https://github.com/facebook/rocksdb/pull/10605) transaction (`use_txn=1`) thus all of the `txn_write_policy` . This PR improved that by adding support for one policy - WriteCommit (`txn_write_policy=0`).
**Summary:**
They key to this support is (a) handle Mark{Begin, End}Prepare/MarkCommit/MarkRollback in constructing ExpectedState under WriteCommit policy correctly and (b) monitor CI jobs and solve any test incompatibility issue till jobs are stable. (b) will be part of the test plan.
For (a)
- During prepare (i.e, between `MarkBeginPrepare()` and `MarkEndPrepare(xid)`), `ExpectedStateTraceRecordHandler` will buffer all writes by adding all writes to an internal `WriteBatch`.
- On `MarkEndPrepare()`, that `WriteBatch` will be associated with the transaction's `xid`.
- During the commit (i.e, on `MarkCommit(xid)`), `ExpectedStateTraceRecordHandler` will retrieve and iterate the internal `WriteBatch` and finally apply those writes to `ExpectedState`
- During the rollback (i.e, on `MarkRollback(xid)`), `ExpectedStateTraceRecordHandler` will erase the internal `WriteBatch` from the map.
For (b) - one major issue described below:
- TransactionsDB in db stress recovers prepared-but-not-committed txns from the previous crashed run by randomly committing or rolling back it at the start of the current run, see a historical [PR](6d06be22c0) predated correctness testing.
- And we will verify those processed keys in a recovered db against their expected state.
- However since now we turn on `sync_fault_injection=1` where the expected state is constructed from the trace instead of using the LATEST.state from previous run. The expected state now used to verify those processed keys won't contain UNKNOWN_SENTINEL as they should - see test 1 for a failed case.
- Therefore, we decided to manually update its expected state to be UNKNOWN_SENTINEL as part of the processing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10624
Test Plan:
1. Test exposed the major issue described above. This test will fail without setting UNKNOWN_SENTINEL in expected state during the processing and pass after
```
db=/dev/shm/rocksdb_crashtest_blackbox
exp=/dev/shm/rocksdb_crashtest_expected
dbt=$db.tmp
expt=$exp.tmp
rm -rf $db $exp
mkdir -p $exp
echo "RUN 1"
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
pid=$!
sleep 0.2
sleep 20
kill $pid
sleep 0.2
echo "RUN 2"
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
pid=$!
sleep 0.2
sleep 20
kill $pid
sleep 0.2
echo "RUN 3"
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
```
2. Manual testing to ensure ExpectedState is constructed correctly during recovery by verifying it against previously crashed TransactionDB's WAL.
- Run the following command to crash a TransactionDB with WriteCommit policy. Then `./ldb dump_wal` on its WAL file
```
db=/dev/shm/rocksdb_crashtest_blackbox
exp=/dev/shm/rocksdb_crashtest_expected
rm -rf $db $exp
mkdir -p $exp
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
pid=$!
sleep 30
kill $pid
sleep 1
```
- Run the following command to verify recovery of the crashed db under debugger. Compare the step-wise result with WAL records (e.g, WriteBatch content, xid, prepare/commit/rollback marker)
```
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
```
3. Automatic testing by triggering all RocksDB stress/crash test jobs for 3 rounds with no failure.
Reviewed By: ajkr, riversand963
Differential Revision: D39199373
Pulled By: hx235
fbshipit-source-id: 7a1dec0e3e2ee6ea86ddf5dd19ceb5543a3d6f0c
Summary:
As title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10717
Test Plan:
Unit Tests
CI
Reviewed By: riversand963
Differential Revision: D39700707
Pulled By: gitbw95
fbshipit-source-id: 54de27e695535a50159f5f6467da36aaf21bebae
Summary:
`enable_custom_split_merge` is added for enabling the custom split and merge feature, which split the compressed value into chunks so that they may better fit jemalloc bins.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690
Test Plan:
Unit Tests
Stress Tests
Reviewed By: anand1976
Differential Revision: D39567604
Pulled By: gitbw95
fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7
Summary:
This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
* We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
* We want to highlight the key feature: it's fast (especially under parallel load)
* Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
* We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)
Some API detail:
* To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
* Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.
For performance tests see https://github.com/facebook/rocksdb/pull/10626
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684
Test Plan: no interesting functional changes; tests updated
Reviewed By: anand1976
Differential Revision: D39547800
Pulled By: pdillinger
fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
Summary:
Hopefully, we can re-enable the combination of user-defined timestamp and subcompactions
after https://github.com/facebook/rocksdb/issues/10658.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10689
Test Plan:
Make sure the following succeeds on devserver.
make crash_test_with_ts
Reviewed By: ltamasi
Differential Revision: D39556558
Pulled By: riversand963
fbshipit-source-id: 4695f420b1bc9ebf3b24640b693746f4db82c149
Summary:
Same as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10632
Test Plan: make crash_test -j32
Reviewed By: anand1976
Differential Revision: D39241479
Pulled By: akankshamahajan15
fbshipit-source-id: 5db5b0c007da786bacc1b30d8926d36d6d029b87
Summary:
Expanded `all_params` to include all parameters crash test may set. Previously, `atomic_flush` was not included in `all_params` and thus was not visible to `finalize_and_sanitize()`. The consequence was manual crash test runs could provide unsafe combinations of parameters to `db_stress`. For example, running `db_crashtest.py` with `-atomic_flush=0` could cause `db_stress` to run with `-atomic_flush=0 -disable_wal=1`, which is known to produce inconsistencies across column families.
While expanding `all_params`, I found we cannot have an entry in it for both `db_stress` and `db_crashtest.py`. So I renamed `enable_tiered_storage` to `test_tiered_storage` for `db_crashtest.py`, which appears more conventional anyways.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10654
Reviewed By: hx235
Differential Revision: D39369349
Pulled By: ajkr
fbshipit-source-id: 31d9010c760c868b20d5e9bd78ba75c8ff3ce348
Summary:
**Context/Summary:**
`ExpectedState` is not aware of transaction-related concept so `use_txn=1 ` is not compatible with `sync_fault_injection=1`. Therefore this PR disabled this combination until we expand our correctness testing to transaction related features.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10605
Test Plan:
- Run the following commands to verify `--use_txn` is correctly sanitized
- `python3 ./tools/db_crashtest.py blackbox --use_txn=1 --sync_fault_injection=1 `
- `python3 ./tools/db_crashtest.py blackbox --use_txn=0 --sync_fault_injection=1 `
Reviewed By: ajkr
Differential Revision: D39121287
Pulled By: hx235
fbshipit-source-id: 7d5d6dd32479ea1c07df4f38322650f3a60def9c
Summary:
The features that cannot work with disable_wal=1 due to unsynced data dropping (ingest_external_file_one_in and enable_compaction_filter) similarly cannot work with sync_fault_injection=1. This PR prevents those features from being used together with sync_fault_injection=1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10559
Reviewed By: hx235
Differential Revision: D38953019
Pulled By: ajkr
fbshipit-source-id: 7e2c7644ec84d7323f632cf976bcee00502d0ed7
Summary:
As mentioned in https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913,
`db_stress` does not have much verification for iterator correctness.
It has a `TestIterate()` function, but that is mainly for comparing results
between two iterators, one with `total_order_seek` and the other optionally
sets auto_prefix, upper/lower bounds. Commit 49a0581ad2462e31aa3f768afa769e0d33390f33
added a new `TestIterateAgainstExpected()` function that compares iterator against
expected state. It locks a range of keys, creates an iterator, does
a random sequence of `Next/Prev` and compares against expected state.
This PR is based on that commit, the main changes include some logs
(for easier debugging if a test fails), a forward and backward scan to
cover the entire locked key range, and a flag for optionally turning on
this version of Iterator testing.
Added constraint that the checks against expected state in
`TestIterateAgainstExpected()` and in `TestGet()` are only turned on
when `--skip_verifydb` flag is not set.
Remove the change log introduced in https://github.com/facebook/rocksdb/issues/10553.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10538
Test Plan:
Run `db_stress` with `--verify_iterator_with_expected_state_one_in=1`,
and a large `--iterpercent` and `--num_iterations`. Checked `op_logs`
manually to ensure expected coverage. Tweaked part of the code in
https://github.com/facebook/rocksdb/issues/10449 and stress test was able to catch it.
- internally run various flavor of crash test
Reviewed By: ajkr
Differential Revision: D38847269
Pulled By: cbi42
fbshipit-source-id: 8b4402a9bba9f6cfa08051943cd672579d489599
Summary:
Currently user_defined_timestamp is failing in stress test with
subcompactions. So disabling it for now and will re enable it once its
fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10503
Test Plan: make crash_test_with_ts -j32
Reviewed By: riversand963
Differential Revision: D38510485
Pulled By: akankshamahajan15
fbshipit-source-id: 82fd0ec8cf86a96ff6653edd5bad7623cb9e0a15
Summary:
The secondary cache is randomly disabled or enabled with CompressedSecondaryCache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10442
Test Plan: - To test that the CompressedSecondaryCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test `
Reviewed By: anand1976
Differential Revision: D38290796
Pulled By: gitbw95
fbshipit-source-id: bb7027b39e0ed9c0c62835abe09e759898130ec8
Summary:
In this PR we bring ClockCache closer to production quality. We implement the following changes:
1. Fixed a few bugs in ClockCache.
2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table.
3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test where either the LRU policy is required, and a test that dynamically increases the table capacity.
4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.)
5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache.
As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418
Test Plan:
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
Reviewed By: pdillinger
Differential Revision: D38170673
Pulled By: guidotag
fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298
Reviewed By: ltamasi
Differential Revision: D37908743
Pulled By: gangliao
fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
Summary:
as title.
Test plan
- make check
- CI on PR
- TEST_TMPDIR=/dev/shm make crash_test_with_multiops_wp_txn (tested with successful run)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10350
Reviewed By: ajkr
Differential Revision: D37792872
Pulled By: riversand963
fbshipit-source-id: ff064093b7f715d0acf387af2e3ae87b1278b52b
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.
Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37522461
Pulled By: guidotag
fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
Summary:
Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter.
This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of
the presence of timestamps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259
Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts
Reviewed By: ajkr
Differential Revision: D37459692
Pulled By: riversand963
fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef
Summary:
Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252
Reviewed By: riversand963
Differential Revision: D37432948
Pulled By: ajkr
fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
Summary:
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202
Reviewed By: ltamasi
Differential Revision: D37325739
Pulled By: gangliao
fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues.
Fixes https://github.com/facebook/rocksdb/issues/10186
I intend to pick this into the 7.4.0 release candidate
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220
Test Plan:
Failure quickly reproduces in crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test I believe would be too tricky and fragile
to be worthwhile.
Reviewed By: anand1976
Differential Revision: D37315647
Pulled By: pdillinger
fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
Summary:
In FastLRUCache, we replace the current chained per-shard hash table by an open-addressing hash table. In particular, this allows us to preallocate all handles.
Because all handles are preallocated, this implementation doesn't support strict_capacity_limit = false (i.e., allowing insertions beyond the predefined capacity). This clashes with current assumptions of some tests, namely two tests in cache_test and the crash tests. We have disabled these for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10194
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37296770
Pulled By: guidotag
fbshipit-source-id: 232ff1b8260331d868ebf4e3e5d8ad709390b0ad
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.
There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
Test Plan:
- Manual
- Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
- Enabled in MyShadow for 1+ week
- Automated
- Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
- Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
Reviewed By: cbi42
Differential Revision: D36614569
Pulled By: ajkr
fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).
This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).
Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB
Reviewed By: riversand963
Differential Revision: D37212647
Pulled By: pdillinger
fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
Summary:
There is `Options::allow_data_in_errors` that controls whether RocksDB
is allowed to log data, e.g. key, value, etc in LOG files. It is false
by default. However, in db_bench and db_stress, it is often ok to log
data because there is no concern about privacy.
This PR allows db_stress and db_bench to set this option on the command
line, while it remains false by default. Furthermore, make
crash/recovery test driven by db_crashtest.py to opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171
Test Plan: Stress test and db_bench
Reviewed By: hx235
Differential Revision: D37163787
Pulled By: riversand963
fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.
This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```
If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.
Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();
// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
std::shared_ptr<TransactionNotifier> notifier,
TxnTimestamp ts,
std::shared_ptr<const Snapshot>* ret);
// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```
The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```
We also provide two additional APIs for stats collection and reporting purposes.
```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
TxnTimestamp ts_lb,
TxnTimestamp ts_ub,
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```
To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```
Before shutdown, RocksDB will release all timestamped snapshots.
Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.
Timestamped snapshots can also be used with user-defined timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879
Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```
Reviewed By: siying
Differential Revision: D35783919
Pulled By: riversand963
fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
Summary:
This PR updates secondary instance testing in stress test by default.
A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary.
Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan
to read some random keys with only very basic verification to make sure no assertion failure is triggered.
Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled.
Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic
catch-up.
In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not
work well when the underlying file is renamed by primary, e.g. SstFileManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush
```
Reviewed By: ajkr
Differential Revision: D36939458
Pulled By: riversand963
fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077
Reviewed By: ltamasi
Differential Revision: D36884156
Pulled By: gangliao
fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
Summary:
Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache.
Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags---when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081
Test Plan:
- To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`.
- To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`.
Reviewed By: anand1976
Differential Revision: D36839908
Pulled By: guidotag
fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725
Summary:
Thanks to https://github.com/facebook/rocksdb/issues/9919 and https://github.com/facebook/rocksdb/issues/10051 the known bugs in file ingestion (besides mmap read + file checksum) are fixed. Now we can try again to enable file ingestion in crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9357
Test Plan: stress file ingestion heavily for an hour: `$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=1000000 --ingest_external_file_one_in=100 --duration=3600 --interval=20 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152`
Reviewed By: riversand963
Differential Revision: D33410746
Pulled By: ajkr
fbshipit-source-id: d276431390995a67f68390d61c06a40945fdd280
Summary:
Start tracking SST unique id in MANIFEST, which is used to verify with
SST properties to make sure the SST file is not overwritten or
misplaced. A DB option `try_verify_sst_unique_id` is introduced to
enable/disable the verification, if enabled, it opens all SST files
during DB-open to read the unique_id from table properties (default is
false), so it's recommended to use it with `max_open_files = -1` to
pre-open the files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990
Test Plan: unittests, format-compatible test, mini-crash
Reviewed By: anand1976
Differential Revision: D36381863
Pulled By: jay-zhuang
fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
Summary:
**Context:**
Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.
Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.
**Summary:**
- Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that
- Added missing db bench/stress test for some memory charging features
- Renamed related test suite to indicate they are under the same theme of memory charging
- Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
- Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926
Test Plan:
- New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
- New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
- CI
- db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
#-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**
- db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal
Reviewed By: ajkr
Differential Revision: D36054712
Pulled By: hx235
fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
Summary:
This PR
- since we are testing with disable_wal = true and best_efforts_recovery, we should set column family count to 1, due to the requirement of `ExpectedState` tracking and replaying logic.
- during backup and checkpoint restore, disable best-efforts-recovery. This does not matter now because db_crashtest.py always disables wal when testing best-efforts-recovery. In the future, if we enable wal, then not setting `restore_opitions.best_efforts_recovery` will cause backup db not to recover the WALs, and differ from db (that enables WAL).
- during verification of backup and checkpoint restore, print the key where inconsistency exists between expected state and db.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9986
Test Plan: TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery
Reviewed By: siying
Differential Revision: D36353105
Pulled By: riversand963
fbshipit-source-id: a484da161273e6216a1f7e245bac15a349693917
Summary:
Previously all fault injection was ignored in release mode. This PR adds it back except for read fault injection (`--read_fault_one_in > 0`) since its dependency (`IGNORE_STATUS_IF_ERROR`) is unavailable in release mode.
Other notable changes include:
- Moved `EnableWriteErrorInjection()` for `--write_fault_one_in > 0` so it's after `DB::Open()` without depending on `SyncPoint`
- Made `--read_fault_one_in > 0` return an error in release mode
- Updated `db_crashtest.py` to always set `--read_fault_one_in=0` in release mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9957
Test Plan:
```
$ DEBUG_LEVEL=0 make -j24 db_stress
$ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox
```
Reviewed By: anand1976
Differential Revision: D36193830
Pulled By: ajkr
fbshipit-source-id: 0b97946b4e3f06e3e0f6e7833c2763da08ec5321
Summary:
`db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`.
Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947
Test Plan:
- regular blackbox, blackbox --simple
- various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16`
Reviewed By: riversand963
Differential Revision: D36152775
Pulled By: ajkr
fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8
Summary:
When compaction filter determines that a key should be removed, it updates the internal key's type
to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction
together with `SingleDelete`, it will cause compaction iterator to return Corruption.
To fix the issue, compaction filter should return more information in addition to the intention of removing
a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
`kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.
In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.
Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
not quite work with `MultiOpsTxnsStressTest`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929
Test Plan:
make check
make crash_test
make crash_test_with_txn
Reviewed By: anand1976
Differential Revision: D36069678
Pulled By: riversand963
fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
Summary:
Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions.
1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw.
2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true.
3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn
```
Reviewed By: pdillinger
Differential Revision: D35806678
Pulled By: riversand963
fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616
Summary:
I did another pass through running CI jobs. It is uncommon now to see
`db_stress` stuck in the setup phase but still happen.
One reason was repeatedly reading/verifying checksum on filter blocks when
`-cache_index_and_filter_blocks=1` and `-cache_size=1048576`. To address
that I increased the cache size.
Another reason was having a WAL with many range tombstones and every
`db_stress` run using `-avoid_flush_during_recovery=1` (in that
scenario, the setup phase spent too much CPU in
`rocksdb::MemTable::NewRangeTombstoneIteratorInternal()`). To address
that I fixed the `-avoid_flush_during_recovery` setting so it is
reevaluated for every `db_stress` run.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9483
Reviewed By: riversand963
Differential Revision: D33922929
Pulled By: ajkr
fbshipit-source-id: 0a298ec7c4df6f6b44620233996047a2dc7ee5f3
Summary:
Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393
**Context:**
(Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following:
a) set of keys to add to filter
b) set of hashes to add to filter (64-bit hash applied to each key)
c) set of Bloom indices to set in filter, with duplicates
d) set of Bloom indices to set in filter, deduplicated
e) final filter and its checksum
This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level.
- b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`)
- c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO.
Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default.
**Summary:**
- Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()`
- Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption
- See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design
- Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries`
- Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()`
- When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342
Test Plan:
- Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption`
- Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl
- For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 | grep 'Build avg'` to make sure nothing break.
- Added to `filter_bench`: increased filter construction time by **30%**, mostly by `MaybePostVerify()`
- FastLocalBloom
- Before change: `./filter_bench -impl=2 -quick -runs 10 | grep 'Build avg'`: **28.86643s**
- After change:
- `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): **27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (expect acceptable increase): **34.41159s (+20%)**
- `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): **27.13431s (-6%)**
- Standard128Ribbon
- Before change: `./filter_bench -impl=3 -quick -runs 10 | grep 'Build avg'`: **122.5384s**
- After change:
- `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 | grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): **124.3588s (+2%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(expect acceptable increase): **159.4946s (+30%)**
- `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 | grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : **125.258s (+2%)**
- Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true`
- Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status.
Reviewed By: pdillinger
Differential Revision: D33746928
Pulled By: hx235
fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57
Summary:
Despite attempts to optimize `db_stress` setup phase (i.e.,
pre-`OperateDb()`) latency in https://github.com/facebook/rocksdb/issues/9470 and https://github.com/facebook/rocksdb/issues/9475, it still always took tens
of seconds. Since we still aren't able to setup a 100M key `db_stress`
quickly, we should reduce the number of keys. This PR reduces it 4x
while increasing `value_size_mult` 4x (from its default value of 8) so
that memtables and SST files fill at a similar rate compared to before this PR.
Also disabled bzip2 compression since we'll probably never use it and
I noticed many CI runs spending majority of CPU on bzip2 decompression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9476
Reviewed By: siying
Differential Revision: D33898520
Pulled By: ajkr
fbshipit-source-id: 855021784ad9664f2be5bce21f0339a1cf93230d
Summary:
Even after https://github.com/facebook/rocksdb/issues/9461 could see
```
Error: please specify prefix_size for test_batches_snapshots test!
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9463
Test Plan:
run `make blackbox_crashtest` for a long time. (Unfortunately,
it's taking a long time to reproduce these failures)
Reviewed By: akankshamahajan15
Differential Revision: D33838152
Pulled By: pdillinger
fbshipit-source-id: b9a73c5bbb68df53f14c22b9b52f61d1f7ef38af
Summary:
Changes in https://github.com/facebook/rocksdb/issues/9453 could trigger
```
stderr:
Error: prefixpercent is non-zero while prefix_size is not positive!
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9461
Test Plan: run `make blackbox_crashtest` for a long time
Reviewed By: ajkr
Differential Revision: D33830751
Pulled By: pdillinger
fbshipit-source-id: be88377dcaa47e4bb7adb0347762639eff8f1476
Summary:
MemTable::MultiGet was not considering range tombstones before
querying Bloom filter. This means range tombstones would be skipped for
keys (or prefixes) with no other entries in the memtable. This could cause
old values for a key (in SST files) to still show up until the range tombstone
covering it has been flushed.
This is fixed by essentially disabling the memtable Bloom filter when there
are any range tombstones. (This could be better optimized in the future, but
good enough for now.)
Did some other cleanup/optimization in the same code to (more than) offset
the cost of checking on range tombstones in more cases. There is now
notable improvement when memtable_whole_key_filtering and prefix_extractor
are used together (unusual), and this makes MultiGet closer to the Get
implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9453
Test Plan:
new unit test added. Added memtable Bloom to crash test.
Performance testing
--------------------
Build WAL-only DB (recovers to memtable):
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=1000000 -write_buffer_size=250000000
```
Query test command, to maximize sensitivity to the changed code:
```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=multireadrandom -num=10000000 -write_buffer_size=250000000 -memtable_bloom_size_ratio=0.015 -multiread_batched -batch_size=24 -threads=8 -memtable_whole_key_filtering=$MWKF -prefix_size=$PXS
```
(Note -num here is 10x larger for mostly memtable misses)
Before & after run simultaneously, average over 10 iterations per data point, ops/sec.
MWKF=0 PXS=0 (Bloom disabled)
Before: 5724844
After: 6722066
MWKF=0 PXS=7 (prefixes hardly unique; Bloom not useful)
Before: 9981319
After: 10237990
MWKF=0 PXS=8 (prefixes unique; Bloom useful)
Before: 12081715
After: 12117603
MWKF=1 PXS=0 (whole key Bloom useful)
Before: 11944354
After: 12096085
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes not useful in old version)
Before: 9444299
After: 11826029
MWKF=1 PXS=7 (whole key Bloom useful in new version; prefixes useful in old version)
Before: 11784465
After: 11778591
Only in this last case is the 'before' *slightly* faster, perhaps because hashing prefixes is slightly faster than hashing whole keys. Otherwise, 'after' is faster.
Reviewed By: ajkr
Differential Revision: D33805025
Pulled By: pdillinger
fbshipit-source-id: 597523cae4f4eafdf6ae6bb2bc6cb46f83b017bf
Summary:
Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (https://github.com/facebook/rocksdb/issues/8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled.
To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py.
Depends on https://github.com/facebook/rocksdb/issues/9305.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9338
Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan.
Reviewed By: riversand963
Differential Revision: D33345333
Pulled By: ajkr
fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
Summary:
db_crashtest.py uses multiple CFs only when run without flag `--simple`.
The previous config set `-test_batches_snapshots=1` in that case for
blackbox mode. But `-test_batches_snapshots=1` cannot verify recovery
correctness, so it should not always be set for multi-CF blackbox tests.
We can instead randomly toggle it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9303
Reviewed By: riversand963
Differential Revision: D33155229
Pulled By: ajkr
fbshipit-source-id: 4a6fdc4eddccc8ece664063baf6393ce1c5de6b7
Summary:
When table_options.prepopulate_block_cache is set to
BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and
table_options.partition_filters is also set true, then there is
segmentation failure when top level filter is fetched because its
entered with wrong type in cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263
Test Plan:
Updated unit tests;
Ran db_stress: make crash_test -j32
Reviewed By: pdillinger
Differential Revision: D32936566
Pulled By: akankshamahajan15
fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
Test Plan: Ran `make check` and the stress/crash test.
Reviewed By: riversand963
Differential Revision: D32565512
Pulled By: ltamasi
fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
Summary:
XXH3 - latest hash function that is extremely fast on large
data, easily faster than crc32c on most any x86_64 hardware. In
integrating this hash function, I have handled the compression type byte
in a non-standard way to avoid using the streaming API (extra data
movement and active code size because of hash function complexity). This
approach got a thumbs-up from Yann Collet.
Existing functionality change:
* reject bad ChecksumType in options with InvalidArgument
This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is
likely to be handled through different configuration than ChecksumType.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069
Test Plan:
tests updated, and substantially expanded. Unit tests now check
that we don't accidentally change the values generated by the checksum
algorithms ("schema test") and that we properly handle
invalid/unrecognized checksum types in options or in file footer.
DBTestBase::ChangeOptions (etc.) updated from two to one configuration
changing from default CRC32c ChecksumType. The point of this test code
is to detect possible interactions among features, and the likelihood of
some bad interaction being detected by including configurations other
than XXH3 and CRC32c--and then not detected by stress/crash test--is
extremely low.
Stress/crash test also updated (manual run long enough to see it accepts
new checksum type). db_bench also updated for microbenchmarking
checksums.
### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3
crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op)
xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op)
xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op)
xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op)
crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op)
xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op)
xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op)
xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op)
crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op)
xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op)
xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op)
xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op)
As you can see, especially once warmed up, xxh3 is fastest.
### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor)
Test
for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 | grep 'micros/op' | tee -a results-$CHK & done; wait; done
Results (ops/sec)
for FILE in results*; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 * s / c; }' < $FILE; done
results-0 252118 # kNoChecksum
results-1 251588 # kCRC32c
results-2 251863 # kxxHash
results-3 252016 # kxxHash64
results-4 252038 # kXXH3
Reviewed By: mrambacher
Differential Revision: D31905249
Pulled By: pdillinger
fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.
In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).
These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)
This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
Reviewed By: riversand963
Differential Revision: D31489850
Pulled By: ltamasi
fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.
The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995
Test Plan:
- Verified it fixes the following failure:
```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```
- `make check -j48`
Reviewed By: ltamasi
Differential Revision: D31495388
Pulled By: ajkr
fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385
Summary:
Enable SingleDelete with user defined timestamp in db_bench,
db_stress and crash test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8971
Test Plan:
1. For db_stress, ran the command for full duration: i) python3 -u tools/db_crashtest.py
--enable_ts whitebox --nooverwritepercent=100
ii) make crash_test_with_ts
2. For db_bench, ran: ./db_bench -benchmarks=randomreplacekeys
-user_timestamp_size=8 -use_single_deletes=true
Reviewed By: riversand963
Differential Revision: D31246558
Pulled By: akankshamahajan15
fbshipit-source-id: 29cd8740c9921341e52f09242fca3c44d75a12b7
Summary:
This is a precursor refactoring to enable an upcoming feature: persistence failure correctness testing.
- Changed `--expected_values_path` to `--expected_values_dir` and migrated "db_crashtest.py" to use the new flag. For persistence failure correctness testing there are multiple possible correct states since unsynced data is allowed to be dropped. Making it possible to restore all these possible correct states will eventually involve files containing snapshots of expected values and DB trace files.
- The expected values directory is managed by an `ExpectedStateManager` instance. Managing expected state files is separated out of `SharedState` to prevent `SharedState` from becoming too complex when the new files and features (snapshotting, tracing, and restoring) are introduced.
- Migrated expected values file access/management out of `SharedState` into a separate class called `ExpectedState`. This is not exposed directly to the test but rather the `ExpectedState` for the latest values file is accessed via a pass-through API on `ExpectedStateManager`. This forces the test to always access the single latest `ExpectedState`.
- Changed the initialization of the latest expected values file to use a tempfile followed by rename, and also add cleanup logic for possible stranded tempfiles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8913
Test Plan:
run in several ways; try to make sure it's not obviously broken.
- crashtest blackbox without TEST_TMPDIR
```
$ python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest blackbox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest whitebox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py whitebox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none --random_kill_odd=88887
```
- db_stress without expected_values_dir
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true
```
- db_stress with expected_values_dir and manual corruption
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true --expected_values_dir=./
// modify one byte in "./LATEST.state"
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=false --expected_values_dir=./
...
Verification failed for column family 0 key 0000000000000000 (0): Value not found: NotFound:
...
```
Reviewed By: riversand963
Differential Revision: D30921951
Pulled By: ajkr
fbshipit-source-id: babfe218062e55d018c9b046536c0289fb78f41c
Summary:
For now, disable it since the below command indicates it can cause a
failure. Running that command with `-experimental_mempurge_threshold=0`
has been running successfully for several minutes, whereas before it
failed in seconds.
```
$ while rm -rf /dev/shm/single_stress && ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/single_stress --experimental_mempurge_threshold=5.493146827397074 --flush_one_in=10000 --reopen=0 --write_buffer_size=262144 --value_size_mult=33 --max_write_buffer_number=3 -ops_per_thread=10000; do : ; done
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8958
Reviewed By: ltamasi
Differential Revision: D31187059
Pulled By: ajkr
fbshipit-source-id: 04d5bfb4fcc4f5b66233e691427dfd940c67037f
Summary:
Several improvements to MultiRead:
1. Fix a bug in stress test which causes false positive when both MultiRead() return and individual read request have failure injected.
2. Add two more types of fault that should be handled: empty read results and checksum mismatch
3. Add a message indicating which type of fault is injected
4. Increase the failure rate
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8937
Reviewed By: anand1976
Differential Revision: D31085930
fbshipit-source-id: 3a04994a3cadebf9a64d25e1fe12b14b7a272fba
Summary:
Current implementation does not support user-defined timestamp when
block-based filter is used. Will implement the support in the future, or
wait to see if block-based filter can be deprecated and removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8703
Test Plan: make whitebox_crash_test_with_ts
Reviewed By: pdillinger
Differential Revision: D30528931
Pulled By: riversand963
fbshipit-source-id: 60dd74ee0a6194e69072069d8c4bd876f249f38d
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.
So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)
I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.
C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.
BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679
Test Plan: new + updated tests, including crash test
Reviewed By: jay-zhuang
Differential Revision: D30445797
Pulled By: pdillinger
fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
Reviewed By: pdillinger
Differential Revision: D30149315
Pulled By: bjlemaire
fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27