Summary:
wide_columns can now be pretty-printed in the following commands
- `./ldb dump_wal`
- `./ldb dump`
- `./ldb idump`
- `./ldb dump_live_files`
- `./ldb scan`
- `./sst_dump --command=scan`
There are opportunities to refactor to reduce some nearly identical code. This PR is initial change to add wide column support in `ldb` and `sst_dump` tool. More PRs to come for the refactor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11754
Test Plan:
**New Tests added**
- `WideColumnsHelperTest::DumpWideColumns`
- `WideColumnsHelperTest::DumpSliceAsWideColumns`
**Changes added to existing tests**
- `ExternalSSTFileTest::BasicMixed` added to cover mixed case (This test should have been added in https://github.com/facebook/rocksdb/issues/11688). This test does not verify the ldb or sst_dump output. This test was used to create test SST files having some rows with wide columns and some without and the generated SST files were used to manually test sst_dump_tool.
- `createSST()` in `sst_dump_test` now takes `wide_column_one_in` to add wide column value in SST
**dump_wal**
```
./ldb dump_wal --walfile=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/000004.log --print_value --header
```
```
Sequence,Count,ByteSize,Physical Offset,Key(s) : value
1,1,59,0,PUT_ENTITY(0) : 0x:0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
2,1,34,42,PUT_ENTITY(0) : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
3,1,17,7d,PUT(0) : 0x7468697264 : 0x62617A
```
**idump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ idump
```
```
'first' seq:1, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:2, type:22 => attr_one:two attr_three:four
'third' seq:3, type:1 => baz
Internal keys in range: 3
```
**SST Dump from dump_live_files**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ compact
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump_live_files
```
```
...
==============================
SST Files
==============================
/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst level:1
------------------------------
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
...
```
**dump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump
```
```
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz
Keys in range: 3
```
**scan**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ scan
```
```
first : :hello attr_name1:foo attr_name2:bar
second : attr_one:two attr_three:four
third : baz
```
**sst_dump**
```
./sst_dump --file=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst --command=scan
```
```
options.env is 0x7ff54b296000
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
from [] to []
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
```
Reviewed By: ltamasi
Differential Revision: D48837999
Pulled By: jaykorean
fbshipit-source-id: b0280f0589d2b9716bb9b50530ffcabb397d140f
Summary:
Same as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11729
Test Plan: make crash_test -j32
Reviewed By: anand1976
Differential Revision: D48534820
Pulled By: akankshamahajan15
fbshipit-source-id: 3a2a28af98dfad164b82ddaaf9fddb94c53a652e
Summary:
In blackbox tests, db_stress command always run with timeout. Timeout can happen during validation, leaving some of the keys not checked. Since key validation is done in order, it is quite likely that keys those are towards to the end of the set are never validated. This PR adds a final execution, without timeout, to ensure validation is executed for all keys, at least once.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11592
Reviewed By: cbi42
Differential Revision: D48003998
Pulled By: hx235
fbshipit-source-id: 72543475a932f12cf0f57534b7e3b6e07e87080f
Summary:
Currently the stress test does not support restoring expected state (to a specific sequence number) when there is unsynced data loss during the reopen phase. This causes a few internal stress test failure with errors like inconsistent value. This PR disables dropping unsynced data during reopen to avoid failures due to this issue. We can re-enable later after we decide to support unsynced data loss during DB reopen in stress test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11731
Test Plan:
* Running this test a few times can fail for inconsistent value before this change
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=20.57166126835524 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=6 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=1 --manual_wal_flush_one_in=1000000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=5 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=0 --secondary_cache_fault_one_in=32 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=3 --sync=0 --sync_fault_injection=1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35```
Reviewed By: hx235
Differential Revision: D48537494
Pulled By: cbi42
fbshipit-source-id: ddae21b9bb6ee8d67229121f58513e95f7ef6d8d
Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions.
* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701
Test Plan:
* Add two unit tests to test when bottommost_file_compaction_delay > 0.
* Ran crash test with the new option.
Reviewed By: jaykorean, ajkr
Differential Revision: D48331564
Pulled By: cbi42
fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
Summary:
It's a bit repetitive in order to give reasonably informative error messages.
I also removed total_order_seek in cases where it's not needed, just to make sure a case that shouldn't matter really doesn't.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11698
Test Plan:
run it -
```
$ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --duration=86400 --interval=10 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --blob_compression_type=none --writepercent=50 -iterpercent=45 -readpercent=0 -prefixpercent=0 --prefix_size=0 --verify_iterator_with_expected_state_one_in=10 --test_batches_snapshots=0 -enable_compaction_filter=0
```
Reviewed By: cbi42
Differential Revision: D48285036
Pulled By: ajkr
fbshipit-source-id: 51b147bd7c8011740629ae2fd8114d3d48ce7137
Summary:
* The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation.
* Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache).
* Simplify some existing tests not to depend directly on cache type.
* Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation.
* Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test.
* Add another templated class BaseHyperClockCache to reduce future copy-paste
* Added ReportProblems support to cache_bench
* Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC:
```
2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17
```
In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692
Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in.
Reviewed By: jowlyzhang
Differential Revision: D48247413
Pulled By: pdillinger
fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da
Summary:
**Context/Summary**
After https://github.com/facebook/rocksdb/pull/11058, we no longer lock the key range to iterate in TestIterateAgainstExpected, except for working with timestamp feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11695
Test Plan: no code change
Reviewed By: ajkr
Differential Revision: D48276668
Pulled By: hx235
fbshipit-source-id: dc92a3708b2281dc737c0877fb755548bf03a9fc
Summary:
We're still getting some rare cases of 5x TryAgains in a row. Here I'm boosting the failure threshold to 10 in a row and adding more info in the output, to help us manually verify whether there's anything suspicous about the sequence of TryAgains, such as if Rollback failed to reset to new sequence numbers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11691
Test Plan: By lowering the threshold to 2 and adjusting some other db_crashtest parameters, I was able to hit my new code and saw fresh sequence number on the subsequent TryAgain.
Reviewed By: cbi42
Differential Revision: D48236153
Pulled By: pdillinger
fbshipit-source-id: c0530e969ddcf8de7348e5cf7daf5d6d5dec24f4
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
- For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
- New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.
- Misc
- More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
- Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444
Test Plan:
- CI fake db crash/stress test
- Microbenchmarking
**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR
**Test**
Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```
Result
```
Coming soon
```
AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```
Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,
PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```
Reviewed By: ajkr
Differential Revision: D45918925
Pulled By: hx235
fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
Summary:
As titled, and also removed an undefined and unused member function in for ColumnFamilyData
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683
Reviewed By: ajkr
Differential Revision: D48156290
Pulled By: jowlyzhang
fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7
Summary:
I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache.
This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676
Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache
Reviewed By: ajkr
Differential Revision: D48103165
Pulled By: pdillinger
fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736
Summary:
Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358
Test Plan:
* New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
* Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.
Reviewed By: ajkr
Differential Revision: D46582680
Pulled By: cbi42
fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
Summary:
https://github.com/facebook/rocksdb/issues/11653 broke some crash tests.
Apparently these Rollbacks are needed for pessimistic transaction cases. (I'm still not sure if the API makes any sense with regard to safe usage. It's certainly not documented. Will consider in follow-up PRs.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11656
Test Plan: manual crash test runs, crash_test_with_multiops_wc_txn and crash_test_with_multiops_wp_txn
Reviewed By: cbi42
Differential Revision: D47906280
Pulled By: pdillinger
fbshipit-source-id: d058a01b6dbb47a4f08d199e335364168304f81b
Summary:
In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers.
Also, some of the APIs were not clear about Transaction semantics, so I have clarified:
* (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction).
* Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test.
Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress.
* Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function.
* Use unique_ptr (nicely complements removing unnecessary Rollbacks)
* Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".)
Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653
Test Plan:
this is the test :)
Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances.
Reviewed By: cbi42
Differential Revision: D47882995
Pulled By: pdillinger
fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d
Summary:
`StressTest::optimistic_txn_db_` is currently not initialized by the constructor, which
can lead to assertion failures down the line in `StressTest::Open`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11547
Reviewed By: cbi42
Differential Revision: D46845658
Pulled By: ltamasi
fbshipit-source-id: 578b0f24fc00e3e97f24221fcdd003cc529439c2
Summary:
Context:
OptimisticTransactionDB has not been covered by db_stress (including crash test) like TransactionDB.
1. Adding the following gflag options to to test OptimisticTransactionDB
- `use_optimistic_txn`: When true, open OptimisticTransactionDB to test
- `occ_validation_policy`: `OccValidationPolicy::kValidateParallel = 1` by default.
- `share_occ_lock_buckets`: Use shared occ locks
- `occ_lock_bucket_count`: 500 by default. Number of buckets to use for shared occ lock.
2. Opening OptimisticTransactionDB and NewTxn/Commit added per `use_optimistic_txn` flag in `db_stress_test_base.cc`
3. OptimisticTransactionDB blackbox/whitebox test added in crash_test.mk
Please note that the existing flag `use_txn` is being used here. When `use_txn == true` and `use_optimistic_txn == false`, we use `TransactionDB` (a.k.a. pessimistic transaction db). When both `use_txn` and `use_optimistic_txn` are true, we use `OptimisticTransactionDB`. If `use_txn == false` but `use_optimistic_txn == true` throw error with message _"You cannot set use_optimistic_txn true while use_txn is false. Please set use_txn true if you want to use OptimisticTransactionDB"_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11513
Test Plan:
**Crash Test**
Serial Validation
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=0"
make crash_test -j
```
Parallel Validation (no share bucket)
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=0"
make crash_test -j
```
Parallel Validation (share bucket)
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=1 --occ_lock_bucket_count=500"
make crash_test -j
```
**Stress Test**
```
./db_stress -use_optimistic_txn -threads=32
```
Reviewed By: pdillinger
Differential Revision: D46547387
Pulled By: jaykorean
fbshipit-source-id: ca19819ca6e0281694966998014b40d95d4e5960
Summary:
Context:
This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions`
1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB.
2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing.
1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct.
2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h`
3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436
Test Plan:
Following tests added
- `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish`
- `DBCompactionTest::WaitForCompactAbortOnPauseAborted`
- `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted`
- `DBCompactionTest::WaitForCompactShutdownWhileWaiting`
- `TransactionTest::WaitForCompactAbortOnPause`
NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place.
Reviewed By: pdillinger
Differential Revision: D45799659
Pulled By: jaykorean
fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b
Summary:
This patch remove the "stress" aspect from the WriteUnpreparedStressTest and leave it to be a unit test for some correctness testing w.r.t. snapshot functionality. I added some read-your-write verification to the transaction test in db_stress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11424
Test Plan:
`./write_unprepared_transaction_test`
`./db_crashtest.py whitebox --txn`
`./db_crashtest.py blackbox --txn`
Reviewed By: hx235
Differential Revision: D45551521
Pulled By: jowlyzhang
fbshipit-source-id: 20c3d510eb4255b08ddd7b6c85bdb4945436f6e8
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/11424 made me realize there are a couple gaps in my `ExpectedValue` comments so I updated them, along with separating `ExpectedValue` into separate files so it's clearer that `ExpectedValue` can be used without updating `ExpectedState` (e.g, TestMultiGet() where we care about value base of expected value but not updating the ExpectedState).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11456
Test Plan: CI
Reviewed By: jowlyzhang
Differential Revision: D45965070
Pulled By: hx235
fbshipit-source-id: dcee690c13b00a3119757ea9d43b646f9644e1a9
Summary:
Context:
In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.
In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.
This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.
Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.
Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443
Test Plan:
Existing tests that involve a shutdown in progress:
- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle
Reviewed By: pdillinger
Differential Revision: D45923426
Pulled By: jaykorean
fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
Summary:
**Context:**
Current `NonBatchedOpsStressTest` does not allow multi-thread read (i.e, Get, Iterator) and write (i.e, Put, Merge) or delete to the same key. Every read or write/delete operation will acquire lock (`GetLocksForKeyRange`) on the target key to gain exclusive access to it. This does not align with RocksDB's nature of allowing multi-thread read and write/delete to the same key, that is concurrent threads can issue read/write/delete to RocksDB without external locking. Therefore this is a gap in our testing coverage.
To close the gap, biggest challenge remains in verifying db value against expected state in presence of parallel read and write/delete. The challenge is due to read/write/delete to the db and read/write to expected state is not within one atomic operation. Therefore we may not know the exact expected state of a certain db read, as by the time we read the expected state for that db read, another write to expected state for another db write to the same key might have changed the expected state.
**Summary:**
Credited to ajkr's idea, we now solve this challenge by breaking the 32-bits expected value of a key into different parts that can be read and write to in parallel.
Basically we divide the 32-bits expected value into `value_base` (corresponding to the previous whole 32 bits but now with some shrinking in the value base range we allow), `pending_write` (i.e, whether there is an ongoing concurrent write), `del_counter` (i.e, number of times a value has been deleted, analogous to value_base for write), `pending_delete` (similar to pending_write) and `deleted` (i.e whether a key is deleted).
Also, we need to use incremental `value_base` instead of random value base as before because we want to control the range of value base a correct db read result can possibly be in presence of parallel read and write. In that way, we can verify the correctness of the read against expected state more easily. This is at the cost of reducing the randomness of the value generated in NonBatchedOpsStressTest we are willing to accept.
(For detailed algorithm of how to use these parts to infer expected state of a key, see the PR)
Misc: hide value_base detail from callers of ExpectedState by abstracting related logics into ExpectedValue class
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11058
Test Plan:
- Manual test of small number of keys (i.e, high chances of parallel read and write/delete to same key) with equally distributed read/write/deleted for 30 min
```
python3 tools/db_crashtest.py --simple {blackbox|whitebox} --sync_fault_injection=1 --skip_verifydb=0 --continuous_verification_interval=1000 --clear_column_family_one_in=0 --max_key=10 --column_families=1 --threads=32 --readpercent=25 --writepercent=25 --nooverwritepercent=0 --iterpercent=25 --verify_iterator_with_expected_state_one_in=1 --num_iterations=5 --delpercent=15 --delrangepercent=10 --range_deletion_width=5 --use_merge={0|1} --use_put_entity_one_in=0 --use_txn=0 --verify_before_write=0 --user_timestamp_size=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --flush_one_in=1000 --get_property_one_in=1000 --ingest_external_file_one_in=100 --backup_one_in=100 --checkpoint_one_in=100 --approximate_size_one_in=0 --acquire_snapshot_one_in=100 --use_multiget=0 --prefixpercent=0 --get_live_files_one_in=1000 --manual_wal_flush_one_in=1000 --pause_background_one_in=1000 --target_file_size_base=524288 --write_buffer_size=524288 --verify_checksum_one_in=1000 --verify_db_one_in=1000
```
- Rehearsal stress test for normal parameter and aggressive parameter to see if such change can find what existing stress test can find (i.e, no regression in testing capability)
- [Ongoing]Try to find new bugs with this change that are not found by current NonBatchedOpsStressTest with no parallel read and write/delete to same key
Reviewed By: ajkr
Differential Revision: D42257258
Pulled By: hx235
fbshipit-source-id: e6fdc18f1fad3753e5ac91731483a644d9b5b6eb
Summary:
add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are
1. checksum construction and verification in block.cc/h
2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
3. unit tests/crash test updates
Tests:
* Added unit tests
* Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`
Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.
Performance:
Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates):
```
SETUP
make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none
BENCHMARK
./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE
The readrandom ops/sec looks like the following:
Block cache size: 2GB 1.2GB * 0.9 1.2GB * 0.8 1.2GB * 0.5 8MB
Main 240805 223604 198176 161653 139040
PR prot_bytes=0 238691 226693 200127 161082 141153
PR prot_bytes=1 214983 193199 178532 137013 108211
prot_bytes=1 vs -10% -15% -10.8% -15% -23%
prot_bytes=0
```
The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287
Reviewed By: ajkr
Differential Revision: D43970708
Pulled By: cbi42
fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.
**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
- Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288
Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob)
- May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689
// PlainTable
Does not apply
```
- **Db bench 2: performance**
**Read**
SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`
**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none
rocksdb.sst.read.micros COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec`
Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec`
Reviewed By: ajkr
Differential Revision: D44007011
Pulled By: hx235
fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
Summary:
The old cleanup code had a race condition:
1. Test thread: DestroyDB() marked a file as trash
2. DeleteScheduler thread: Got the file's size and decided to delete it in chunks
3. Test thread: DestroyDir() deleted that trash file
4. DeleteScheduler thread: Began deleting in chunks starting by calling ReopenWritableFile(). Unfortunately this recreates the deleted trash file
5. Test thread: DestroyDir() fails to remove the parent directory because it contains the file created in 4.
6. Test thread: Checkpoint::Create() fails due to the directory already existing
It could be repro'd with the following patch/command.
Patch:
```
diff --git a/file/delete_scheduler.cc b/file/delete_scheduler.cc
index 8a2d1615d..337d24a60 100644
--- a/file/delete_scheduler.cc
+++ b/file/delete_scheduler.cc
@@ -317,6 +317,12 @@ Status DeleteScheduler::DeleteTrashFile(const std::string& path_in_trash,
&num_hard_links, nullptr);
if (my_status.ok()) {
if (num_hard_links == 1) {
+ // Give some time for DestroyDir() to delete file entries. Then, the
+ // below `ReopenWritableFile()` will recreate files, preventing the
+ // parent directory from being deleted.
+ if (rand() % 2 == 0) {
+ usleep(1000);
+ }
std::unique_ptr<FSWritableFile> wf;
my_status = fs_->ReopenWritableFile(path_in_trash, FileOptions(), &wf,
nullptr);
diff --git a/file/file_util.cc b/file/file_util.cc
index 43608fcdc..2cee1ad8e 100644
--- a/file/file_util.cc
+++ b/file/file_util.cc
@@ -263,6 +263,13 @@ Status DestroyDir(Env* env, const std::string& dir) {
}
}
+ // Give some time for the DeleteScheduler thread's ReopenWritableFile() to
+ // recreate deleted files
+ if (dir.find("checkpoint") != std::string::npos) {
+ fprintf(stderr, "waiting to destroy %s\n", dir.c_str());
+ usleep(10000);
+ }
+
if (s.ok()) {
s = env->DeleteDir(dir);
// DeleteDir might or might not report NotFound
```
Command:
```
TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=131072 --target_file_size_base=131072 --max_bytes_for_level_base=524288 --checkpoint_one_in=100 --clear_column_family_one_in=0 --max_key=1000 --value_size_mult=33 --sst_file_manager_bytes_per_truncate=4096 --sst_file_manager_bytes_per_sec=1048576 --interval=3 --compression_type=none --sync_fault_injection=1
```
Obviously we don't want to use scheduled deletion here as we need the checkpoint directory deleted immediately. I suspect the DestroyDir() was an attempt to fixup incomplete DestroyDB()s. Now that we expect DestroyDB() to be complete I removed that code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11389
Reviewed By: hx235
Differential Revision: D45137142
Pulled By: ajkr
fbshipit-source-id: 2af743d342c77cc414fd25fc4c9d7c9c6079ad24
Summary:
Similarly to `GetEntity` prior to https://github.com/facebook/rocksdb/issues/11303, the `MultiGetEntity` API is currently
only used in the DB verification logic of the stress tests. The patch introduces
a new mode where all point lookups are performed using `MultiGetEntity`,
and implements the corresponding logic in the non-batched, batched, and
CF consistency tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11336
Test Plan: Ran simple blackbox tests for the various stress test flavors.
Reviewed By: akankshamahajan15
Differential Revision: D44513285
Pulled By: ltamasi
fbshipit-source-id: c3db098501bf875b6a356b09fc676a0268d92c35
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.
## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.
## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
## Crash/stress test
Updated to use the new combination.
## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```
**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127
**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)
This is an acceptable difference IMHO.
Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec
So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
**Before**:
multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec
Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
Let's also look at Get() in db_bench:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec
**After**:
readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec
That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
**Before**:
readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec
**After**:
readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec
Reviewed By: ltamasi
Differential Revision: D44177044
Pulled By: pdillinger
fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
Summary:
The `GetEntity` API is currently used in the stress tests for verification purposes;
this patch extends the coverage by adding a mode where all point lookups in
the non-batched, batched, and CF consistency stress tests are done using this API.
The PR also includes a bit of refactoring to eliminate some boilerplate code around
the wide-column consistency checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11303
Test Plan: Ran stress tests of the batched, non-batched, and CF consistency varieties.
Reviewed By: akankshamahajan15
Differential Revision: D44148503
Pulled By: ltamasi
fbshipit-source-id: fecdbfd3e65a459bbf16ab7aa7b9173e19240077
Summary:
Stressing small DB with small number of keys and user-defined timestamp enabled usually fails pretty quickly in TestGet.
Example command to reproduce the failure:
` tools/db_crashtest.py blackbox --enable_ts --simple --delrangepercent=0 --delpercent=5 --max_key=100 --interval=3 --write_buffer_size=262144 --target_file_size_base=262144 --max_bytes_for_level_base=262144 --subcompactions=1`
Example failure: `error : inconsistent values for key 0000000000000009000000000000000A7878: expected state has the key, Get() returns NotFound.`
Fixes this test failure by refreshing the read up to timestamp to the most up to date timestamp, a.k.a now, after a key is locked. Without this, things could happen in this order and cause a test failure:
<table>
<tr>
<th>TestGet thread</th>
<th> A writing thread</th>
</tr>
<tr>
<td>read_opts.timestamp = GetNow()</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Lock key, do write</td>
</tr>
<tr>
<td>Lock key, read(read_opts) return NotFound</td>
<td></td>
</tr>
</table>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11249
Reviewed By: ltamasi
Differential Revision: D43551302
Pulled By: jowlyzhang
fbshipit-source-id: 26877ab379bdb97acd2682a2632bc29718427f38
Summary:
IO uring usage is causing crash test failures due to bad cqe data being returned in the uring. Revert the change to enable IO uring in db_stress, and also re-enable async_io in CircleCI so that code path can be tested. Added the -use_io_uring flag to db_stress that, when false, will wrap the default env in db_stress to emulate async IO.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11242
Reviewed By: akankshamahajan15
Differential Revision: D43470569
Pulled By: anand1976
fbshipit-source-id: 7c69ac3f53a79ade31d37313f815f1a4b6108b75
Summary:
- Return NotSupported in scan if IOUring not supported if async_io is enabled
- Enable IOUring in db_stress for async_io testing
- Disable async_io in circleci crash testing as circleci doesn't support IOUring
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11197
Test Plan: CircleCI jobs
Reviewed By: anand1976
Differential Revision: D43096313
Pulled By: akankshamahajan15
fbshipit-source-id: c2c53a87636950c0243038b9f5bd0d91608e4fda
Summary:
I missed a stress test code sanity check when enabling this combination of tests. This PR addresses that, the "iter_start_ts" function for user defined timestamp feature is not supported when BlobDB is enabled. It's disabled for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11199
Test Plan:
Locally always enable BlobDB and run
tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887
Reviewed By: ltamasi
Differential Revision: D43245657
Pulled By: jowlyzhang
fbshipit-source-id: 4cae19817bb1afd50a76f9e0e49f006fb5c0b211
Summary:
The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:
* Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
* Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
* Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
* Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
* Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
* Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143
Test Plan: unit tests added / updated, added to stress/crash test
Reviewed By: ajkr
Differential Revision: D42848627
Pulled By: pdillinger
fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
Summary:
Seeting this error in stress test:
db_stress: internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2459: void rocksdb::StressTest::Open(rocksdb::SharedState *): Assertion `txn_db_ == nullptr' failed. Received signal 6 (Aborted)
......
It doesn't appear that txn_db_ is set to nullptr at all. We set ithere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11164
Test Plan: Run db_stress transaction and non-transation with low kill rate and see restarting without assertion
Reviewed By: ajkr
Differential Revision: D42855662
fbshipit-source-id: 06816d37cce9c94a81cb54ab238fb73aa102ed46
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.
Most of changes were done through following comments:
unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
Test Plan: See CI
Reviewed By: pdillinger
Differential Revision: D42796341
fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11133
Test Plan:
- ran it a few times on a mismatching DB+expected state; verified error messages look right:
```
Verification failed for column family 0 key 000000000000D553000000000000014C0000000000000142 (163988): value_from_db: , value_from_expected: 25E7B53421202322, msg: GetMergeOperands verification: Value not found: NotFound:
Verification failed for column family 0 key 000000000000AAE2787878 (131123): value_from_db: , value_from_expected: B2A69C18B6B7B4B5BABBB8B9BEBFBCBDA2A3A0A1A6A7A4A5, msg: Iterator verification: Value not found: NotFound:
Verification failed for column family 0 key 00000000000080C6000000000000004C78787878 (98409): value_from_db: , value_from_expected: 67AB7E1E636261606F6E6D6C6B6A6968, msg: Get verification: Value not found: NotFound:
```
Reviewed By: hx235
Differential Revision: D42757072
Pulled By: ajkr
fbshipit-source-id: b0a4a0aaa5be5d110434324853ac92aaa6972d89
Summary:
Since compressed block cache is removed, those stats are not needed. They are removed in different PR in case there is a problem with it. The stats are removed in the same way in https://github.com/facebook/rocksdb/pull/11131/ . HISTORY.md was already updated by mistake, and it would be correct after merging this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11135
Test Plan: Watch CI
Reviewed By: ltamasi
Differential Revision: D42757616
fbshipit-source-id: bd7cb782585c8535ce5784295225c376f3011f35
Summary:
Compressed block cache is replaced by compressed secondary cache. Remove the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117
Test Plan: See CI passes
Reviewed By: pdillinger
Differential Revision: D42700164
fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d
Summary:
Before this pr, the destruction order is `shared` -> `db_`(StressTest destruction) -> `stress`, but `compaction_filter` of `db_` will hold the `shared` pointer, so `shared` maybe used after free.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11059
Reviewed By: hx235
Differential Revision: D42297366
Pulled By: ajkr
fbshipit-source-id: 17b314635359acacd5ba62f9db5f955f451133f7
Summary:
The IO uring usage is disabled in RocksDB by default and, as a result, PosixRandomAccessFile::ReadAsync returns a NotSupported() status. This was causing stress test failures with MultiGet and async_io combination. Fix it by relying on redirection of ReadAsync to Read when default Env is used in db_stress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11045
Reviewed By: akankshamahajan15
Differential Revision: D42136213
Pulled By: anand1976
fbshipit-source-id: fc7904d8ece74d7e8f2e1a34c3d70bd5774fb45f
Summary:
The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037
Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890.
Reviewed By: akankshamahajan15
Differential Revision: D42061290
Pulled By: anand1976
fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074
Summary:
**Context/Summary:**
This reverts commit fc74abb436 and related HISTORY record.
The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004
Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D41572604
Pulled By: hx235
fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954
Test Plan: tests updated
Reviewed By: akankshamahajan15
Differential Revision: D41310123
Pulled By: pdillinger
fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
Summary:
With the recent changes, `GetMergeOperands` is now supported for wide-column entities as well, so we can use it for verification purposes in the non-batched stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10952
Test Plan: Ran a simple non-batched ops blackbox crash test.
Reviewed By: riversand963
Differential Revision: D41292114
Pulled By: ltamasi
fbshipit-source-id: 70b4c756a4a1fecb445c16c7096aad805a51203c
Summary:
The patch adjusts the generation of values in batched ops stress tests so that the digits 0..9 are appended (instead of prepended) to the values written. This has the advantage of aligning the encoding of the "value base" into the value string across non-batched, batched, and CF consistency stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10872
Test Plan: Tested using some black box stress test runs.
Reviewed By: riversand963
Differential Revision: D40692847
Pulled By: ltamasi
fbshipit-source-id: 26bf8adff2944cbe416665f09c3bab89d80416b3
Summary:
Some lines of .h and .cc files are not properly fomatted. Clear them up with clang format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10868
Test Plan: Watch existing CI to pass
Reviewed By: ajkr
Differential Revision: D40683485
fbshipit-source-id: 491fbb78b2cdcb948164f306829909ad816d5d0b
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress
./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```
**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
- [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
- For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
- This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
- For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
- This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
- Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.
Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only
Reviewed By: ajkr
Differential Revision: D40090485
Pulled By: hx235
fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10818
Test Plan: Tested using some simple blackbox crash test runs in the various modes (non-batched, batched, CF consistency).
Reviewed By: riversand963
Differential Revision: D40349527
Pulled By: ltamasi
fbshipit-source-id: 2918bc26adbbeac314beaa958aafe770b01e5cc6
Summary:
There are two places in the stress test code where we compute the CRC
for a range of KVs for the purposes of checking consistency, namely in the
CF consistency test (to make sure CFs contain the same data), and when
performing `CompactRange` (to make sure the pre- and post-compaction
states are equivalent). The patch extends the logic so that wide columns
are also considered in both cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10788
Test Plan: Tested using some simple blackbox crash test runs.
Reviewed By: riversand963
Differential Revision: D40191134
Pulled By: ltamasi
fbshipit-source-id: 542c21cac9077c6d225780deb210319bb5eee955
Summary:
When the `iter_start_ts` read option is set, iterator exposes internal keys. This also includes tombstones, which by definition do not have a value (or columns). The patch makes sure we skip the wide-column consistency check in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10799
Test Plan: Tested using a simple blackbox crash test with timestamps enabled.
Reviewed By: jay-zhuang, riversand963
Differential Revision: D40235628
fbshipit-source-id: 49519fb55d8fe2bb9249ced809f7a81bff2b9df2
Summary:
The patch adds checks to the
`{NonBatchedOps,BatchedOps,CfConsistency}StressTest::TestPrefixScan` methods
to make sure the wide columns exposed by the iterators are as expected (based on
the value base encoded into the iterator value). It also makes some code hygiene
improvements in these methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10786
Test Plan:
Ran some simple blackbox tests in the various modes (non-batched, batched,
CF consistency).
Reviewed By: riversand963
Differential Revision: D40163623
Pulled By: riversand963
fbshipit-source-id: 72f4c3b51063e48c15f974c4ec64d751d3ed0a83
Summary:
As the first step of covering the wide-column functionality of iterators
in our stress tests, the patch adds verification logic to
`NonBatchedOpsStressTest::VerifyDb` that checks whether the
iterator's value and columns are in sync. Note: I plan to update the other
types of stress tests and add similar verification for prefix scans etc.
in separate PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10783
Test Plan: Ran some simple blackbox crash tests.
Reviewed By: riversand963
Differential Revision: D40152370
Pulled By: riversand963
fbshipit-source-id: 8f9d17d7af5da58ccf1bd2057cab53cc9645ac35
Summary:
We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)
This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768
Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)
Reviewed By: anand1976
Differential Revision: D40034747
Pulled By: anand1976
fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
Summary:
`SstFileWriter` currently does not support the `PutEntity` API, so in `TestIngestExternalFile` all key-values are written using regular `Put`s. This violates the assumption that whether or not a key corresponds to a plain old key-value or a wide-column entity can be determined by solely looking at the "value base" used when generating the value. The patch fixes this issue by disabling ingestion when `PutEntity` is enabled in the stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10769
Test Plan: Ran a simple blackbox stress test.
Reviewed By: akankshamahajan15
Differential Revision: D40042132
Pulled By: ltamasi
fbshipit-source-id: 93e75ff55545b7b69fa4ddef1d96093c961158a0
Summary:
added calls to `Iterator::Refresh()` in `NonBatchedOpsStressTest::TestIterateAgainstExpected()`. The testing key range is locked in `TestIterateAgainstExpected` so I do not expect this change to provide thorough stress test to `Iterator::Refresh()`. However, it can still be helpful for catching bugs like https://github.com/facebook/rocksdb/issues/10739. Will add calls to refresh in `TestIterate` once we support iterator refresh with snapshots.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10766
Test Plan: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=2`
Reviewed By: ajkr
Differential Revision: D40008320
Pulled By: ajkr
fbshipit-source-id: cec93b07f915ef6476d41c1fee9b23c115188085
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661
Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case.
| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|
Reviewed By: riversand963
Differential Revision: D39441192
Pulled By: cbi42
fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in`
Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
- It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false.
- As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698
Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs
Reviewed By: ajkr
Differential Revision: D39593752
Pulled By: ajkr
fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce
Summary:
If the `-expected_values_dir` argument to db_stress is empty, then verification against expected state is effectively disabled. But `RunStressTest` still calls `TrackExpectedState`, which returns `NotSupported` causing a the crash test to fail with a false alarm. Fix it by only calling `TrackExpectedState` if necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10764
Reviewed By: ajkr
Differential Revision: D39980129
Pulled By: anand1976
fbshipit-source-id: d02651746fe3a297877a4b2b2fbcb7274860f49c
Summary:
The patch adds the `PutEntity` API to the non-batched, batched, and
CF consistency stress tests. Namely, when the new `db_stress` command
line parameter `use_put_entity_one_in` is greater than zero, one in
N writes on average is performed using `PutEntity` rather than `Put`.
The wide-column entity written has the generated value in its default
column; in addition, it contains up to three additional columns where
the original generated value is divided up between the column name and the
column value (with the column name containing the first k characters of
the generated value, and the column value containing the rest). Whether
`PutEntity` is used (and if so, how many columns the entity has) is completely
determined by the "value base" used to generate the value (that is, there is
no randomness involved). Assuming the same `use_put_entity_one_in` setting
is used across `db_stress` invocations, this enables us to reconstruct and
validate the entity during subsequent `db_stress` runs.
Note that `PutEntity` is currently incompatible with `Merge`, transactions, and
user-defined timestamps; these combinations are currently disabled/disallowed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10760
Test Plan: Ran some batched, non-batched, and CF consistency stress tests using the script.
Reviewed By: riversand963
Differential Revision: D39939032
Pulled By: ltamasi
fbshipit-source-id: eafdf124e95993fb7d73158e3b006d11819f7fa9
Summary:
If all the keys in range [key_base, shared->GetMaxKey()) are non-overwritable `TestIngestExternalFile()` would attempt to ingest a file with zero keys, leading to the following error: "Cannot create sst file with no entries". This PR changes `TestIngestExternalFile()` to return early in that case instead of going through with the ingestion attempt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10754
Reviewed By: hx235
Differential Revision: D39909195
Pulled By: ajkr
fbshipit-source-id: e06e6b9cc24826fbd450e5130885e6f07164badd
Summary:
An add-on to https://github.com/facebook/rocksdb/pull/6818 to complete adding single-level universal compaction to stress/crash testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10732
Test Plan:
- Locally run for 10 min `python3 ./tools/db_crashtest.py whitebox --simple --compaction_style=1 --num_levels=1 -max_key=1000000 -value_size_mult=33 -write_buffer_size=524288 -target_file_size_base=524288 -max_bytes_for_level_base=2097152 --duration=120 --interval=10 --ops_per_thread=1000 --random_kill_odd=887`
- Check LOG to confirm single-level universal compaction is called
- Manual testing and log checking to ensure destroy_db_initially=1 is correctly set across runs with different compaction styles (i.e, in the second half of whitebox testing).
- [ongoing]CI jobs stress test
Reviewed By: ajkr
Differential Revision: D39797612
Pulled By: ajkr
fbshipit-source-id: 16f5c40c3464c57360c06c8305f92118e426149c
Summary:
**Context:**
Prior to this PR, correctness testing with un-sync data loss [disabled](https://github.com/facebook/rocksdb/pull/10605) transaction (`use_txn=1`) thus all of the `txn_write_policy` . This PR improved that by adding support for one policy - WriteCommit (`txn_write_policy=0`).
**Summary:**
They key to this support is (a) handle Mark{Begin, End}Prepare/MarkCommit/MarkRollback in constructing ExpectedState under WriteCommit policy correctly and (b) monitor CI jobs and solve any test incompatibility issue till jobs are stable. (b) will be part of the test plan.
For (a)
- During prepare (i.e, between `MarkBeginPrepare()` and `MarkEndPrepare(xid)`), `ExpectedStateTraceRecordHandler` will buffer all writes by adding all writes to an internal `WriteBatch`.
- On `MarkEndPrepare()`, that `WriteBatch` will be associated with the transaction's `xid`.
- During the commit (i.e, on `MarkCommit(xid)`), `ExpectedStateTraceRecordHandler` will retrieve and iterate the internal `WriteBatch` and finally apply those writes to `ExpectedState`
- During the rollback (i.e, on `MarkRollback(xid)`), `ExpectedStateTraceRecordHandler` will erase the internal `WriteBatch` from the map.
For (b) - one major issue described below:
- TransactionsDB in db stress recovers prepared-but-not-committed txns from the previous crashed run by randomly committing or rolling back it at the start of the current run, see a historical [PR](6d06be22c0) predated correctness testing.
- And we will verify those processed keys in a recovered db against their expected state.
- However since now we turn on `sync_fault_injection=1` where the expected state is constructed from the trace instead of using the LATEST.state from previous run. The expected state now used to verify those processed keys won't contain UNKNOWN_SENTINEL as they should - see test 1 for a failed case.
- Therefore, we decided to manually update its expected state to be UNKNOWN_SENTINEL as part of the processing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10624
Test Plan:
1. Test exposed the major issue described above. This test will fail without setting UNKNOWN_SENTINEL in expected state during the processing and pass after
```
db=/dev/shm/rocksdb_crashtest_blackbox
exp=/dev/shm/rocksdb_crashtest_expected
dbt=$db.tmp
expt=$exp.tmp
rm -rf $db $exp
mkdir -p $exp
echo "RUN 1"
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
pid=$!
sleep 0.2
sleep 20
kill $pid
sleep 0.2
echo "RUN 2"
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
pid=$!
sleep 0.2
sleep 20
kill $pid
sleep 0.2
echo "RUN 3"
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
```
2. Manual testing to ensure ExpectedState is constructed correctly during recovery by verifying it against previously crashed TransactionDB's WAL.
- Run the following command to crash a TransactionDB with WriteCommit policy. Then `./ldb dump_wal` on its WAL file
```
db=/dev/shm/rocksdb_crashtest_blackbox
exp=/dev/shm/rocksdb_crashtest_expected
rm -rf $db $exp
mkdir -p $exp
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 &
pid=$!
sleep 30
kill $pid
sleep 1
```
- Run the following command to verify recovery of the crashed db under debugger. Compare the step-wise result with WAL records (e.g, WriteBatch content, xid, prepare/commit/rollback marker)
```
./db_stress \
--clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \
--use_txn=1 --txn_write_policy=0 --sync_fault_injection=1
```
3. Automatic testing by triggering all RocksDB stress/crash test jobs for 3 rounds with no failure.
Reviewed By: ajkr, riversand963
Differential Revision: D39199373
Pulled By: hx235
fbshipit-source-id: 7a1dec0e3e2ee6ea86ddf5dd19ceb5543a3d6f0c
Summary:
The PR cleans up the logic in `NonBatchedOpsStressTest::VerifyDb` so that
the verification method is picked using a single random number generation.
It also eliminates some repeated key comparisons and makes some small
code hygiene improvements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10740
Test Plan: Ran a simple blackbox crash test.
Reviewed By: riversand963
Differential Revision: D39828646
Pulled By: ltamasi
fbshipit-source-id: 60ee5a3bb1851278f62c7d83b0c93b902ed9702e
Summary:
This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
* We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
* We want to highlight the key feature: it's fast (especially under parallel load)
* Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
* We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)
Some API detail:
* To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
* Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.
For performance tests see https://github.com/facebook/rocksdb/pull/10626
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684
Test Plan: no interesting functional changes; tests updated
Reviewed By: anand1976
Differential Revision: D39547800
Pulled By: pdillinger
fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
Summary:
One problem of the previous strategy was `NonBatchedOpsStressTest::TestIngestExternalFile()` could release the lock for `rand_keys[0]` in `rand_column_families[0]`, and then subsequent operations in the same loop iteration (e.g., `TestPut()`) would run without locking. This PR changes the strategy so each `Test*()` function is responsible for acquiring and releasing its own locks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10678
Reviewed By: hx235
Differential Revision: D39516401
Pulled By: ajkr
fbshipit-source-id: bf67f12ebbd293ba8c24fdf8754ff28737bcd758
Summary:
In `db_stress`, DB and expected state files containing changes leading up to a verification failure are often deleted, which makes debugging such failures difficult. On the DB side, flushed WAL files and compacted SST files are marked obsolete and then deleted. Without those files, we cannot pinpoint where a key that failed verification changed unexpectedly. On the expected state side, files for verifying prefix-recoverability in the presence of unsynced data loss are deleted before verification. These include a baseline state file containing the expected state at the time of the last successful verification, and a trace file containing all operations since then. Without those files, we cannot know the sequence of DB operations expected to be recovered.
This PR attempts to address this gap with a new `db_stress` flag: `preserve_unverified_changes`. Setting `preserve_unverified_changes=1` has two effects.
First, prior to startup verification, `db_stress` hardlinks all DB and expected state files in "unverified/" subdirectories of `FLAGS_db` and `FLAGS_expected_values_dir`. The separate directories are needed because the pre-verification opening process deletes files written by the previous `db_stress` run as described above. These "unverified/" subdirectories are cleaned up following startup verification success.
I considered other approaches for preserving DB files through startup verification, like using a read-only DB or preventing deletion of DB files externally, e.g., in the `Env` layer. However, I decided against it since such an approach would not work for expected state files, and I did not want to change the DB management logic. If there were a way to disable DB file deletions before regular DB open, I would have preferred to use that.
Second, `db_stress` attempts to keep all DB and expected state files that were live at some point since the start of the `db_stress` run. This is a bit tricky and involves the following changes.
- Open the DB with `disable_auto_compactions=1` and `avoid_flush_during_recovery=1`
- DisableFileDeletions()
- EnableAutoCompactions()
For this part, too, I would have preferred to use a hypothetical API that disables DB file deletion before regular DB open.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10659
Reviewed By: hx235
Differential Revision: D39407454
Pulled By: ajkr
fbshipit-source-id: 6e981025c7dce147649d2e770728471395a7fa53
Summary:
Same as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10632
Test Plan: make crash_test -j32
Reviewed By: anand1976
Differential Revision: D39241479
Pulled By: akankshamahajan15
fbshipit-source-id: 5db5b0c007da786bacc1b30d8926d36d6d029b87
Summary:
Currently, `db_bench` and `db_stress` print the blob cache options even if
a shared block/blob cache is configured, i.e. when they are not actually
in effect. The patch changes this so they are only printed when a separate blob
cache is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10614
Test Plan: Tested manually using `db_bench` and `db_stress`.
Reviewed By: akankshamahajan15
Differential Revision: D39144603
Pulled By: ltamasi
fbshipit-source-id: f714304c5d46186f8514746c27ee6f52aa3e4af8
Summary:
When verification fails for db_stress, print more information about
value read from the db and expected state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10587
Test Plan:
make check
./db_stress
Reviewed By: akankshamahajan15, hx235
Differential Revision: D39078511
Pulled By: riversand963
fbshipit-source-id: 77ac8ffae01fc3a9b58a02c2e7bbe141e1a18f0b
Summary:
https://github.com/facebook/rocksdb/issues/10538 added `TestIterateAgainstExpected()` in `no_batched_ops_test` to verify iterator correctness against the in memory expected state. It is not compatible when run after some other stress tests, e.g. `TestPut()` in `batched_op_stress`, that either do not set expected state when writing to DB or use keys that cannot be parsed by `GetIntVal()`. The assert [here](d17be55aab/db_stress_tool/db_stress_common.h (L520)) could fail. This PR fixed this issue by setting iterator upperbound to `max_key` when `destroy_db_initially=0` to avoid the key space that `batched_op_stress` touches.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10590
Test Plan:
```
# set up DB with batched_op_stress
./db_stress --test_batches_snapshots=1 --verify_iterator_with_expected_state_one_in=1 --max_key_len=3 --max_key=100000000 --skip_verifydb=1 --continuous_verification_interval=0 --writepercent=85 --delpercent=3 --delrangepercent=0 --iterpercent=10 --nooverwritepercent=1 --prefixpercent=0 --readpercent=2 --key_len_percent_dist=1,30,69
# Before this PR, the following test will fail the asserts with error msg like the following
# Assertion failed: (size_key <= key_gen_ctx.weights.size() * sizeof(uint64_t)), function GetIntVal, file db_stress_common.h, line 524.
./db_stress --verify_iterator_with_expected_state_one_in=1 --max_key_len=3 --max_key=100000000 --skip_verifydb=1 --continuous_verification_interval=0 --writepercent=0 --delpercent=3 --delrangepercent=0 --iterpercent=95 --nooverwritepercent=1 --prefixpercent=0 --readpercent=2 --key_len_percent_dist=1,30,69 --destroy_db_initially=0
```
Reviewed By: ajkr
Differential Revision: D39085243
Pulled By: cbi42
fbshipit-source-id: a7dfee2320c330773b623b442d730fd014ec7056
Summary:
As mentioned in https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913,
`db_stress` does not have much verification for iterator correctness.
It has a `TestIterate()` function, but that is mainly for comparing results
between two iterators, one with `total_order_seek` and the other optionally
sets auto_prefix, upper/lower bounds. Commit 49a0581ad2462e31aa3f768afa769e0d33390f33
added a new `TestIterateAgainstExpected()` function that compares iterator against
expected state. It locks a range of keys, creates an iterator, does
a random sequence of `Next/Prev` and compares against expected state.
This PR is based on that commit, the main changes include some logs
(for easier debugging if a test fails), a forward and backward scan to
cover the entire locked key range, and a flag for optionally turning on
this version of Iterator testing.
Added constraint that the checks against expected state in
`TestIterateAgainstExpected()` and in `TestGet()` are only turned on
when `--skip_verifydb` flag is not set.
Remove the change log introduced in https://github.com/facebook/rocksdb/issues/10553.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10538
Test Plan:
Run `db_stress` with `--verify_iterator_with_expected_state_one_in=1`,
and a large `--iterpercent` and `--num_iterations`. Checked `op_logs`
manually to ensure expected coverage. Tweaked part of the code in
https://github.com/facebook/rocksdb/issues/10449 and stress test was able to catch it.
- internally run various flavor of crash test
Reviewed By: ajkr
Differential Revision: D38847269
Pulled By: cbi42
fbshipit-source-id: 8b4402a9bba9f6cfa08051943cd672579d489599
Summary:
updated `TestGet()` in `no_batched_op_stress` to check the result of `Get()` operations against expected state (`expected_state_manager_`). More specifically, if `Get()` finds a key, expected state should not have `DELETION_SENTINEL` for the same key, and if `Get()` returns NotFound for a key, expected state should not have the key. One intention for this change it to verify correctness of code path change regarding range tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10553
Test Plan: run db_stress with nonzero readpercent: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`. When I initially used wrong column family in `thread->shared->Get`, the test reported inconsistencies.
Reviewed By: ajkr
Differential Revision: D38927007
Pulled By: cbi42
fbshipit-source-id: f9f61b312ad0b4c21a799329609ba8526169b048
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321
Reviewed By: ltamasi
Differential Revision: D37913590
Pulled By: gangliao
fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298
Reviewed By: ltamasi
Differential Revision: D37908743
Pulled By: gangliao
fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
Summary:
This complements https://github.com/facebook/rocksdb/issues/10351. This PR reverts NewClockCache's signature to an older version, expected by the users of the old (buggy) ClockCache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10358
Test Plan: ``make -j24 check`` and re-run the pre-release tests.
Reviewed By: siying
Differential Revision: D37832601
Pulled By: guidotag
fbshipit-source-id: 32a91d3da4119be187935003b7b897272ceb1950
Summary:
Before this PR, we call `now()` to get the wall time before performing point-lookup and range
scans when user-defined timestamp is enabled.
With this PR, we expand the coverage to:
- read with an older timestamp which is larger then the wall time when the process starts but potentially smaller than now()
- add coverage for `ReadOptions::iter_start_ts != nullptr`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280
Test Plan:
```bash
make check
```
Also,
```bash
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
```
So far, we have had four successful runs of the above
In addition,
```bash
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
```
Succeeded twice showing no regression.
Reviewed By: ltamasi
Differential Revision: D37539805
Pulled By: riversand963
fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.
Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37522461
Pulled By: guidotag
fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943