Commit Graph

370 Commits

Author SHA1 Message Date
Yu Zhang ba8fa0f546 internal_repo_rocksdb (4372117296613874540) (#12117)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/12117

Reviewed By: ajkr

Differential Revision: D51745846

Pulled By: jowlyzhang

fbshipit-source-id: 51c806a484b3b43d174b06d2cfe9499191d09914
2023-12-04 11:17:32 -08:00
anand76 4d04138512 Add dynamic disabling of compressed cache to db_stress (#12102)
Summary:
We now support re-enabling the compressed portion of the `TieredCache` after dynamically disabling it. Add it to db_stress for testing purposes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12102

Reviewed By: akankshamahajan15

Differential Revision: D51594259

Pulled By: anand1976

fbshipit-source-id: ea544e30a5ebd6290fc9ed46a241f09634764d2a
2023-11-27 13:00:15 -08:00
anand76 2222caec9e Make CacheWithSecondaryAdapter reservation accounting more robust (#12059)
Summary:
`CacheWithSecondaryAdapter` can distribute placeholder reservations across the primary and secondary caches. The current implementation of the accounting is quite complicated in order to avoid using a mutex. This may cause the accounting to be slightly off after changes to the cache capacity and ratio, resulting in assertion failures. There's also a bug in the unlikely event that the total reservation exceeds the cache capacity. Furthermore, the current implementation is difficult to reason about.

This PR simplifies it by doing the accounting while holding a mutex. The reservations are processed in 1MB chunks in order to avoid taking a lock too frequently. As a side effect, this also removes the restriction of not allowing to increase the compressed secondary cache capacity after decreasing it to 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12059

Test Plan: Existing unit tests, and a new test for capacity increase from 0

Reviewed By: pdillinger

Differential Revision: D51278686

Pulled By: anand1976

fbshipit-source-id: 7e1ad2c50694772997072dd59cab35c93c12ba4f
2023-11-14 16:25:52 -08:00
Changyu Bi b48480cfd0 Enable `TestIterateAgainstExpected()` in more crash tests (#12040)
Summary:
db_stress flag `verify_iterator_with_expected_state_one_in` is only enabled for in crash test if --simple flag is set. This PR enables it for all supported crash tests by enabling it by default. This adds coverage for --txn and --enable_ts crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12040

Test Plan:
ran crash tests that disabled this flag before for a few hours
```
python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --txn --txn_write_policy=[0,1,2]

python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --enable_ts
```

Reviewed By: ajkr, hx235

Differential Revision: D50980001

Pulled By: cbi42

fbshipit-source-id: 3daf6b4c32bdddc5df057240068162aa1a907587
2023-11-03 16:27:11 -07:00
Changyu Bi 8505b26db1 Fix stress test error message for black/whitebox test to catch failures (#12039)
Summary:
black/whitebox crash test relies on error/fail keyword in stderr to catch stress test failure. If a db_stress run prints an error message without these keyword, and then is killed before it graceful exits and prints out "Verification failed" here (2648e0a747/db_stress_tool/db_stress_driver.cc (L256)), the error won't be caught. This is more likely to happen if db_stress is printing a stack trace. This PR fixes some error messages. Ideally in the future we should not rely on searching for keywords in stderr to determine failed stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12039

Test Plan:
```
Added the following change on top of this PR to simulate exit without relevant keyword:

@@ -1586,6 +1587,8 @@ class NonBatchedOpsStressTest : public StressTest {
     assert(thread);
     assert(!rand_column_families.empty());
     assert(!rand_keys.empty());
+    fprintf(stderr, "Inconsistency");
+    thread->shared->SafeTerminate();

python3 ./tools/db_crashtest.py blackbox --simple --verify_iterator_with_expected_state_one_in=1 --interval=10

will print a stack trace but continue to run db_stress.
```

Reviewed By: jaykorean

Differential Revision: D50960076

Pulled By: cbi42

fbshipit-source-id: 5c60a1be04ce4a43adbd33f040d54434f2ae24c9
2023-11-03 09:53:22 -07:00
Yu Zhang a42910537d Save the correct user comparator name in OPTIONS file (#12037)
Summary:
I noticed the user comparator name in OPTIONS file can be incorrect when working on a recent stress test failure. The name of the comparator retrieved via the "Comparator::GetRootComparator" API is saved in OPTIONS file as the user comparator. The intention was to get the user comparator wrapped in the internal comparator. However `ImmutableCFOptions.user_comparator` has always been a user comparator of type `Comparator`. The corresponding `GetRootComparator` API is also defined only for user comparator type `Comparator`, not the internal key comparator type `InternalKeyComparator`.

For built in comparator `BytewiseComparator` and `ReverseBytewiseComparator`, there is no difference between `Comparator::Name` and `Comparator::GetRootComparator::Name` because these built in comparators' root comparator is themselves. However, for built in comparator `BytewiseComparatorWithU64Ts` and `ReverseBytewiseComparatorWithU64Ts`, there are differences. So this change update the logic to persist the user comparator's name, not its root comparator's name.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12037

Test Plan:
The restore flow in stress test, which relies on converting Options object to string and back to Options object is updated to help validate comparator object can be correctly serialized and deserialized with the OPTIONS file mechanism

Updated unit test to use a comparator that has a root comparator that is not itself.

Reviewed By: cbi42

Differential Revision: D50909750

Pulled By: jowlyzhang

fbshipit-source-id: 9086d7135c7a6f4b5565fb47fce194ea0a024f52
2023-11-02 13:27:59 -07:00
Yu Zhang 0b057a7acc Initialize comparator explicitly in PrepareOptionsForRestoredDB() (#12034)
Summary:
This is to fix below error seeing in stress test:
```
Failure in DB::Open in backup/restore with: Invalid argument: Cannot open a column family and disable user-defined timestamps feature if its existing persist_user_defined_timestamps flag is not false.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12034

Reviewed By: cbi42

Differential Revision: D50860689

Pulled By: jowlyzhang

fbshipit-source-id: ebc6cf0a75caa43d3d3bd58e3d5c2ac754cc637c
2023-10-31 16:10:48 -07:00
Changyu Bi 2818a74b95 Initialize merge operator explicitly in PrepareOptionsForRestoredDB() (#12033)
Summary:
We are seeing the following stress test failure: `Failure in DB::Get in backup/restore with: Invalid argument: merge_operator is not properly initialized. Verification failed: Backup/restore failed: Invalid argument: merge_operator is not properly initialized.`. The reason is likely that `GetColumnFamilyOptionsFromString()` does not set merge operator if it's a customized merge operator. Fixing it by initializing merge operator explicitly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12033

Test Plan:
this repro gives the error consistently before this PR
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=1048576000000 --backup_one_in=50 --batch_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=31.014388066505518 --bottommost_compression_type=lz4hc --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=0 --compaction_ttl=10 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=4095 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=100 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_max_range_deletions=1000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=0 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=10 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=0 --write_fault_one_in=0 --writepercent=35
```

Reviewed By: hx235

Differential Revision: D50825558

Pulled By: cbi42

fbshipit-source-id: 8468dc0444c112415a515af8291ef3abec8a42de
2023-10-31 07:39:41 -07:00
Hui Xiao 212b5bf826 Deep-copy Options in restored db for stress test to avoid race with SetOptions() (#12015)
Summary:
**Context**
DB open will persist the `Options` in memory to options file and verify the file right after the write. The verification is done by comparing the options from parsing the written options file against the `Options` object in memory. Upon inconsistency, corruption such as https://github.com/facebook/rocksdb/blob/main/options/options_parser.cc#L725 will be returned.

This verification assumes the `Options` object in memory is not changed from before the write till the verification. This assumption can break during [opening the restored db in stress test](0f141352d8/db_stress_tool/db_stress_test_base.cc (L1784-L1799)).

This [line](0f141352d8/db_stress_tool/db_stress_test_base.cc (L1770)) makes it shares some pointer options (e.g, `std::shared_ptr<const FilterPolicy> filter_policy`) with other threads (e.g, SetOptions()) in db stress.

And since https://github.com/facebook/rocksdb/pull/11838, filter_policy's field `bloom_before_level ` has now been mutable by SetOptions(). Therefore we started to see stress test failure like below:

```
Failure in DB::Open in backup/restore with: IO error: DB::Open() failed --- Unable to persist Options file: IO error: Unable to persist options.: Corruption: [RocksDBOptionsParser]:failed the verification on BlockBasedTable::: filter_policy.id

Verification failed: Backup/restore failed: IO error: DB::Open() failed --- Unable to persist Options file: IO error: Unable to persist options.: Corruption: [RocksDBOptionsParser]:failed the verification on BlockBasedTable::: filter_policy.id

db_stress: db_stress_tool/db_stress_test_base.cc:479: void rocksdb::StressTest::ProcessStatus(rocksdb::SharedState*, std::string, rocksdb::Status) const: Assertion `false' failed.
```

**Summary**
This PR uses "deep copy" of the `options_` by CreateXXXFromString() to avoid sharing pointer options.

**Test plan**
Run the below db stress command that failed before this PR and pass after
```
./db_stress --column_families=1 --threads=2 --preserve_unverified_changes=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=0 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=10 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=2147483646 --bloom_bits=0 --bottommost_compression_type=disable --bottommost_file_compaction_delay=86400 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=33554432 --cache_type=tiered_auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_readahead_size=0 --compaction_ttl=0 --compressed_secondary_cache_ratio=0.3333333333333333 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=2 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=14 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=1 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=500000 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=50 --recycle_log_file_num=0 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri= --set_options_one_in=5 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=600 --subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12015

Reviewed By: pdillinger

Differential Revision: D50666136

Pulled By: hx235

fbshipit-source-id: 804acc23aecb4eedfe5c44f732e86291f2420b2b
2023-10-27 17:07:39 -07:00
Changyu Bi 9ded9f789f Fix db_stress FaultInjectionTestFS set up before DB open (#11958)
Summary:
We saw frequent stress test failures with error messages like:
```
Verification failed for column family 0 key ...: value_from_db: , value_from_expected: ..., msg: GetEntity verification: Value not found: NotFound:
```
One cause for this is that data in WAL is lost after a crash. We initialize FaultInjectionTestFS to be not direct writable when write_fault_injection is enabled (see code change). This can cause the first WAL created during DB open to be lost if a db_stress is killed before the first WAL is synced. This PR initializes FaultInjectionTestFS to be direct writable. Note that FaultInjectionTestFS will be configured propertly for write fault injection after DB open in `RunStressTestImpl()`. So this change should not affect write fault injection coverage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11958

Test Plan:
a repro for the above bug:
```
Simulate crash before first WAL is sealed:

 --- a/db_stress_tool/db_stress_driver.cc
+++ b/db_stress_tool/db_stress_driver.cc
@@ -256,6 +256,7 @@ bool RunStressTestImpl(SharedState* shared) {
     fprintf(stderr, "Verification failed :(\n");
     return false;
   }
+  exit(1);
   return true;
 }

./db_stress --clear_column_family_one_in=0 --column_families=1 --preserve_internal_time_seconds=60 --destroy_db_initially=0 --db=/dev/shm/rocksdb_crashtest_blackbox --db_write_buffer_size=2097152 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --reopen=0 --test_batches_snapshots=0 --threads=1 --ops_per_thread=100 --write_fault_one_in=1000 --sync_fault_injection=0

./db_stress_main  --clear_column_family_one_in=0 --column_families=1 --preserve_internal_time_seconds=60 --destroy_db_initially=0 --db=/dev/shm/rocksdb_crashtest_blackbox --db_write_buffer_size=2097152 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --reopen=0 --test_batches_snapshots=0 --sync_fault_injection=1
```

Reviewed By: akankshamahajan15

Differential Revision: D50300347

Pulled By: cbi42

fbshipit-source-id: 3a4881d72197f5ece82364382a0100912e16c2d6
2023-10-14 13:33:55 -07:00
Changyu Bi 50b0879d50 Do not fail stress test when file ingestion return injected error (#11956)
Summary:
Currently, if file ingestion hit injected error, stress test is considered failed since it prints a message to stderr containing the keyword "error" and db_crashtest.py looks for it in stderr. This PR fixes it by print injected error to stdout.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11956

Test Plan: Check future stress test runs.

Reviewed By: akankshamahajan15

Differential Revision: D50293537

Pulled By: cbi42

fbshipit-source-id: e74915b1b3c6876a61ab6933c4529780362ec02b
2023-10-14 10:08:03 -07:00
Jay Huh d2daa10afc Fix crash_test_with_best_efforts_recovery (#11938)
Summary:
Thanks ltamasi and ajkr  for initial investigations on the test failure. Per the investigations, the following scenario is likely causing the test to fail.

1. Recovery is needed (could be any reason during crash test)
2. Trying to recover from the latest manifest fails (likely due to read error injection)
3. DB opens with recovery from the next manifest which is different from step 2.
4. Expected state is based on the manifest we tried and failed in step 2.
5. Two manifests used in step 2 and 3 are confirmed to have difference in LSM trees (Thanks ltamasi  again for the finding).

```
2023/10/05-11:24:18.942189 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007184
...
2023/10/05-11:24:18.978007 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007180
```
```
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007184_renamed_ > 2
[ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007180_renamed_ > 1
[ltamasi@devbig1024.prn1 /tmp/x]$ diff 1 2
 --- 1   2023-10-09 10:29:16.966215207 -0700
+++ 2   2023-10-09 10:29:11.984241645 -0700
@@ -13,7 +13,7 @@
  7174:3950254[1875617 .. 2203952]['000000000003415B000000000000012B000000000000007D' seq:1906214, type:1 .. '000000000003CA59000000000000012B000000000000005C' seq:2039838, type:1]
  7175:88060[2074748 .. 2203892]['000000000003CA6300000000000000CF78787878787878' seq:2167539, type:2 .. '000000000003D08F000000000000012B0000000000000130' seq:2112478, type:0]
 --- level 6 --- version# 1 ---
- 7057:3132633[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
+ 7219:2135565[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1]
  7061:827724[0 .. 2046131]['0000000000005F95000000000000000778787878787878' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000113' seq:0, type:1]
  6763:1352[0 .. 0]['000000000000784F000000000000012B0000000000000129' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000129' seq:0, type:1]
  7173:4812291[0 .. 2203957]['000000000000784F000000000000012B0000000000000138' seq:0, type:1 .. '0000000000020FAE787878787878' seq:0, type:1]
@@ -77,4 +77,4 @@
 --- level 61 --- version# 1 ---
 --- level 62 --- version# 1 ---
 --- level 63 --- version# 1 ---
-next_file_number 7182 last_sequence 2203963  prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
+next_file_number 7221 last_sequence 2203963  prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015
```

We have two options to fix this. Either skip verification against expected state or disable read injection when BE recovery is enabled. I chose to skip verification against expected state per discussion. (See comments in this PR)

Please note that some linter changes were included in this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11938

Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery
```

Reviewed By: ltamasi

Differential Revision: D50136341

Pulled By: jaykorean

fbshipit-source-id: ac7434d592aebc148bfc3a4fcaa34936f136b95c
2023-10-11 14:26:10 -07:00
anand76 20b4f1356e Enable write fault injection in db_stress (#11924)
Summary:
This PR depends on https://github.com/facebook/rocksdb/issues/11879 . Enable write fault injection for the basic whitebox, blackbox, and cf_consistency modes. For other test modes like multiops_txn, best_efforts_recovery etc., leave it disabled for now until we can do more testing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11924

Reviewed By: ajkr

Differential Revision: D50178252

Pulled By: anand1976

fbshipit-source-id: 5794f81c14cded1eb28762b2de818dfff1c1a34c
2023-10-11 11:28:00 -07:00
anand76 5b11f5a3a2 Add TieredCache and compressed cache capacity change to db_stress (#11935)
Summary:
Add `TieredCache` to the cache types tested by db_stress. Also add compressed secondary cache capacity change, and `WriteBufferManager` integration with `TieredCache` for memory charging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11935

Test Plan: Run whitebox/blackbox crash tests locally

Reviewed By: akankshamahajan15

Differential Revision: D50135365

Pulled By: anand1976

fbshipit-source-id: 7d73ed00c00a0953d86e49f35cce6bd550ba00f1
2023-10-10 13:12:18 -07:00
Andrew Kryczka 8a9cfd5292 Make stopped writes block on recovery (#11879)
Summary:
Relaxed the constraints for blocking when writes are stopped. When a recovery is already being attempted, we might as well let `!no_slowdown` writes wait on it in case it succeeds. This makes the user-visible behavior consistent across recovery flush and non-recovery flush.

This enables `db_stress` to inject retryable (soft) flush read errors without having to handle user write failures. I changed `db_stress` a bit to permit injected errors in much more foreground operations as more admin operations (like `GetLiveFiles()`) can fail on a retryable error during flush.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11879

Reviewed By: anand1976

Differential Revision: D49571196

Pulled By: ajkr

fbshipit-source-id: 5d516d6faf20d2c6bfe0594ab4f2706bca6d69b0
2023-10-10 06:29:01 -07:00
akankshamahajan 97f6f475bc Fix various failures in auto_readahead_size (#11884)
Summary:
1. **Error** in TestIterateAgainstExpected API - `Assertion index < pre_read_expected_values.size() && index < post_read_expected_values.size() failed.`
 **Fix** - `Prev` op is not supported with `auto_readahead_size`. So added support to Reseek in db_iter, if Prev is called. In BlockBasedTableIterator, index_iter_ already moves forward. So there is no way to do Prev from BlockBasedTableIterator.

2. **Error** - `void rocksdb::BlockBasedTableIterator::BlockCacheLookupForReadAheadSize(uint64_t, size_t, size_t&): Assertion index_iter_->value().handle.offset() == offset`
   **Fix** - Remove prefetch_buffer to be used when uncompressed dict is read.

3. ** Error in TestPrefixScan API - `db_stress: db/db_iter.cc:369: bool rocksdb::DBIter::FindNextUserEntryInternal(bool, const rocksdb::Slice*): Assertion !skipping_saved_key || CompareKeyForSkip(ikey_.user_key, saved_key_.GetUserKey()) > 0 failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
db_stress: table/merging_iterator.cc:1036: bool rocksdb::MergingIterator::SkipNextDeleted(): Assertion comparator_->Compare(range_tombstone_iters_[i]->start_key(), pik) <= 0 failed`
  **Fix** -  SeekPrev also calls 1) SeekPrev , 2)Seek and then 3)Prev in some cases in db_iter.cc leading to failure of Prev operation. These backward operations also call Seek.  Added direction to disable lookup once direction is backwards in BlockBasedTableIterator.cc

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11884

Test Plan: Ran various flavors of crash tests locally for the whole duration

Reviewed By: anand1976

Differential Revision: D49834201

Pulled By: akankshamahajan15

fbshipit-source-id: 9a007b4d46a48002c43dc4623a400ecf47d997fe
2023-10-02 17:47:24 -07:00
Andrew Kryczka be879cc56b stress test verification value mismatch message (#11912)
Summary:
Separate the message for value mismatch from the message for an extra value in the DB

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11912

Reviewed By: hx235

Differential Revision: D49792137

Pulled By: ajkr

fbshipit-source-id: 311bc1801843a15367f409ead88ef755acbde468
2023-10-02 16:07:39 -07:00
Levi Tamasi 01e2d33565 Add the wide-column aware merge API to the stress tests (#11906)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11906

The patch adds stress test coverage for the wide-column aware `FullMergeV3` API by implementing a new `DBStressWideMergeOperator`. This operator is similar to `PutOperator` / `PutOperatorV2` in the sense that its result is based on the last merge operand; however, the merge result can be either a plain value or a wide-column entity, depending on the value base encoded into the operand and the value of the `use_put_entity_one_in` stress test parameter. Following the same rule for merge results that we do for writes ensures that the queries issued by the validation logic receive the expected results. The new operator is used instead of `PutOperatorV2` whenever `use_put_entity_one_in` is positive. Note that the patch also makes it possible to set `use_put_entity_one_in` and `use_merge` (but not `use_full_merge_v1`) at the same time, giving `use_put_entity_one_in` precedence, so the stress test will use `PutEntity` for writes passing the `use_put_entity_one_in` check described above and `Merge` for any other writes.

Reviewed By: jaykorean

Differential Revision: D49760024

fbshipit-source-id: 3893602c3e7935381b484f4f5026f1983e3a04a9
2023-09-29 08:54:50 -07:00
akankshamahajan 3d67b5e8e5 Lookup ahead in block cache ahead to tune Readaheadsize (#11860)
Summary:
Implement block cache lookup to determine readahead_size during scans. It's enabled if auto_readahead_size, block_cache and iterate_upper_bound - all three are set.

Design -
1. Whenever there is a cache miss and FilePrefetchBuffer is called, a callback is made to determine readahead_size for that prefetching.
2. The callback iterates over index and do block cache lookup for each data block handle until existing readahead_size is reached. Then It removes the cache hit data blocks from end to calculate optimized readahead_size.
3. Since index_iter_ is moved, it stores block handles in a queue, and use that queue to get block handle instead of doing index_iter_->Next().
4. This is for Sync scans. Async scans support is in progress.

NOTE:
The issue right now is after Seek and Next, if Prev is called, there is no way to do Prev operation. index_iter_ is already pointing to a different block. So it returns "Not supported" in that case with error message - "auto tuning of readahead size is not supported with Prev op"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11860

Test Plan:
- Added new unit test
- crash_tests
- Running scans locally to check for any regression

Reviewed By: anand1976

Differential Revision: D49548118

Pulled By: akankshamahajan15

fbshipit-source-id: f1aee409a71b4ad9e5bf3610f43edf30c6630c78
2023-09-22 18:12:08 -07:00
Changyu Bi cfe0e0b037 Initialize FaultInjectionTestFS DirectWritable field (#11862)
Summary:
FaultInjectionTestFS is not directly writable by default. Should set it to direct writable if there is no write fault injection.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11862

Test Plan: internal stress test failure reduces.

Reviewed By: jaykorean

Differential Revision: D49428108

Pulled By: cbi42

fbshipit-source-id: 5dfa1fbb454272a14f8228a5c496d480d7138ef1
2023-09-19 12:23:38 -07:00
Changyu Bi ba5897ada8 Fix stress test failure due to write fault injections and disable write fault injection (#11859)
Summary:
This PR contains two fixes:

1. disable write fault injection since it caused several other kinds of internal stress test failures. I'll try to fix those separately before enabling it again.
2. Fix segfault like
```
https://github.com/facebook/rocksdb/issues/5  0x000000000083dc43 in rocksdb::port::Mutex::Lock (this=0x30) at internal_repo_rocksdb/repo/port/port_posix.cc:80
80	internal_repo_rocksdb/repo/port/port_posix.cc: No such file or directory.
https://github.com/facebook/rocksdb/issues/6  0x0000000000465142 in rocksdb::MutexLock::MutexLock (mu=0x30, this=<optimized out>) at internal_repo_rocksdb/repo/util/mutexlock.h:37
37	internal_repo_rocksdb/repo/util/mutexlock.h: No such file or directory.
https://github.com/facebook/rocksdb/issues/7  rocksdb::FaultInjectionTestFS::DisableWriteErrorInjection (this=0x0) at internal_repo_rocksdb/repo/utilities/fault_injection_fs.h:505
505	internal_repo_rocksdb/repo/utilities/fault_injection_fs.h: No such file or directory.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11859

Test Plan: db_stress with no fault injection: `./db_stress --write_fault_one_in=0 --read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --sync_fault_injection=0`

Reviewed By: jaykorean

Differential Revision: D49408247

Pulled By: cbi42

fbshipit-source-id: 0ca01f20e6e81bf52af77818b50d562ef7462165
2023-09-19 08:33:05 -07:00
Jay Huh 20dbf51247 DB Stress Fix - Commit pending value after checking for status (#11856)
Summary:
We've seen occasional crash test failure in optimistic transaction DB with the following error message.

```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Verification failed for column family 0 key 0000000000001EDB0000000000000036787878787878 (789064): value_from_db: 010000000504070609080B0A0D0C0F0E111013121514171619181B1A1D1C1F1E212023222524272629282B2A2D2C2F2E313033323534373639383B3A3D3C3F3E, value_from_expected: , msg: MultiGet verification: Unexpected value found
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```

There was a possibility if the db stress crashes after `pending_expected_value.Commit()` but before `thread->shared->SafeTerminate();`, we may have expected value committed while actual DB value was not.

Moving the `pending_expected_value.Commit()` after `s.ok()` check to fix the test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11856

Test Plan:
Ran the following in a script with while loop. (It doesn't always repro the issue even without this fix, though..)
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=8 --bloom_bits=11 --bottommost_compression_type=lz4 --bottommost_file_compaction_delay=0 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=0 --compaction_ttl=100 --compression_checksum=0 --compression_max_dict_buffer_bytes=2097151 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=3 --index_type=2 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --occ_lock_bucket_count=500 --occ_validation_policy=0 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=2 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --read_fault_one_in=0 --readahead_size=0 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=10000 --share_occ_lock_buckets=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=2 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_optimistic_txn=1 --use_put_entity_one_in=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35&
```
```
pid=$!
sleep 10
kill -9 $pid
sleep 1
```
```
db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=0 --atomic_flush=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=75.01353068032098 --bottommost_compression_type=xpress --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=4 --compaction_readahead_size=0 --compaction_ttl=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=134217728 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --occ_lock_bucket_count=10 --occ_validation_policy=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=36000 --progress_reports=0 --read_fault_one_in=32 --readahead_size=0 --readpercent=45 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=10000 --share_occ_lock_buckets=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=4 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=1 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_optimistic_txn=1 --use_put_entity_one_in=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=35
```

Reviewed By: cbi42

Differential Revision: D49403091

Pulled By: jaykorean

fbshipit-source-id: 5ee6136133bbdc46aa733e5101c1f998f658c200
2023-09-18 23:04:38 -07:00
Changyu Bi c90807d103 Inject retryable write IOError when writing to SST files in stress test (#11829)
Summary:
* db_crashtest.py now may set `write_fault_one_in` to 500 for blackbox and whitebox simple test.
* Error injection only applies to writing to SST files. Flush error will cause DB to pause background operations and auto-resume. Compaction error will just re-schedule later.
* File ingestion and back up tests are updated to check if the result status is due to an injected error.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11829

Test Plan:
a full round of whitebox simple and blackbox simple crash test
*  `python3 ./tools/db_crashtest.py whitebox/blackbox --simple  --write_fault_one_in=500`

Reviewed By: ajkr

Differential Revision: D49256962

Pulled By: cbi42

fbshipit-source-id: 68e0c9648d8e03bad39c7672b25d5500fc286d97
2023-09-18 16:23:26 -07:00
Jay Huh 99f8820054 Fix test on IOActivity check for MultiGetEntity (#11850)
Summary:
After https://github.com/facebook/rocksdb/issues/11842  merged, we started to see some crash_test failures.

There is a flow inside `TestMultiGetEntity()` that it calls `GetEntity()` to compare the result between `MultiGetEntity()` and `GetEntity()` 1c6faf3587/db_stress_tool/no_batched_ops_stress.cc (L1068-L1072)

However, IOActivity check inside DbStressRandomAccessFileWrapper was expecting IOActivity::MultiGet when GetEntity() was called. We are fixing the test by setting expected operation to be GetEntity before calling GetEntity()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11850

Test Plan:
Error repro'ed by the following run before fix and no more error after the fix.

```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --auto_readahead_size=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=1 --bloom_bits=9.880688060667444 --bottommost_compression_type=zstd --bottommost_file_compaction_delay=86400 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_ttl=0 --compression_checksum=0 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_blackbox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --enable_thread_tracking=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=0 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=6 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --memtable_max_range_deletions=0 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=1 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=1 --reopen=0 --secondary_cache_fault_one_in=0 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=1 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=0 --writepercent=35
```

Reviewed By: cbi42

Differential Revision: D49344996

Pulled By: jaykorean

fbshipit-source-id: 8059b8127c0e3cb8af96cf222f47398413c92c50
2023-09-15 22:50:49 -07:00
Peter Dillinger 1c6faf3587 Make RibbonFilterPolicy::bloom_before_level mutable (SetOptions()) (#11838)
Summary:
An internal user wants to be able to dynamically switch between Bloom and Ribbon filters, without a custom FilterPolicy. Making `filter_policy` mutable would actually make issue https://github.com/facebook/rocksdb/issues/10079 worse, because it would be a race on a pointer field, not just on scalars.

As a reasonable compromise until that is fixed, I am enabling dynamic control over Bloom vs. Ribbon choice by making
RibbonFilterPolicy::bloom_before_level mutable, and doing that safely by using an atomic.

I've also slightly tweaked the interpretation of that field so that setting it to INT_MAX really means "always Bloom."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11838

Test Plan: unit tests added/extended. crash test updated for SetOptions call and tested under TSAN with amplified probability (lower set_options_one_in).

Reviewed By: ajkr

Differential Revision: D49296284

Pulled By: pdillinger

fbshipit-source-id: e4251c077510df9a9c719876f482448c0d15402a
2023-09-15 15:46:10 -07:00
Jay Huh cff6490bc4 Add IOActivity.kMultiGetEntity (#11842)
Summary:
- As a follow up from https://github.com/facebook/rocksdb/issues/11799, adding `Env::IOActivity::kMultiGetEntity` support to `DBImpl::MultiGetEntity()`.

## Minor Refactor
- Because both `DBImpl::MultiGet()` and `DBImpl::MultiGetEntity()` call `DBImpl::MultiGetCommon()` which later calls `DBImpl::MultiGetWithCallback()` where we check `Env::IOActivity::kMultiGet`, minor refactor was needed so that we don't check `Env::IOActivity::kMultiGet` for  `DBImpl::MultiGetEntity()`.
- I still see more areas for refactoring to avoid duplicate code of checking IOActivity and setting it when Unknown, but this will be addressed separately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11842

Test Plan:
- Added the `ThreadStatus::OperationType::OP_MULTIGETENTITY` in `db_stress` to verify the pass-down IOActivity in a thread aligns with the actual activity the thread is doing.
```
python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --interval=10
```

Reviewed By: ltamasi

Differential Revision: D49329575

Pulled By: jaykorean

fbshipit-source-id: 05198f1d3f92e6be3d42a3d184bacb3ab2ce6923
2023-09-15 15:34:04 -07:00
leipeng 68ce5d84f6 Add new Iterator API Refresh(const snapshot*) (#10594)
Summary:
This PR resolves https://github.com/facebook/rocksdb/issues/10487 & https://github.com/facebook/rocksdb/issues/10536, user code needs to call Refresh() periodically.

The main code change is to support range deletions. A range tombstone iterator uses a sequence number as upper bound to decide which range tombstones are effective. During Iterator refresh, this sequence number upper bound needs to be updated for all range tombstone iterators under DBIter and LevelIterator. LevelIterator may create new table iterators and range tombstone iterator during scanning, so it needs to be aware of iterator refresh. The code path that propagates this change is `db_iter_->set_sequence(read_seq)  -> MergingIterator::SetRangeDelReadSeqno() -> TruncatedRangeDelIterator::SetRangeDelReadSeqno() and LevelIterator::SetRangeDelReadSeqno()`.

This change also fixes an issue where range tombstone iterators created by LevelIterator may access ReadOptions::snapshot, even though we do not explicitly require users to keep a snapshot alive after creating an Iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10594

Test Plan:
* New unit tests.
* Add Iterator::Refresh(snapshot) to stress test. Note that this change only adds tests for refreshing to the same snapshot since this is the main target use case.

TODO in a following PR:
* Stress test Iterator::Refresh() to different snapshots or no snapshot.

Reviewed By: ajkr

Differential Revision: D48456896

Pulled By: cbi42

fbshipit-source-id: 2e642c04e91235cc9542ef4cd37b3c20823bd779
2023-09-15 10:44:43 -07:00
Jay Huh f2b623bcc1 GetEntity Support for ReadOnlyDB and SecondaryDB (#11799)
Summary:
`GetEntity` API support for ReadOnly DB and Secondary DB.
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_readonly` and refactored current `Get()` logic into `GetImpl()` so that look up logic can be reused for `GetEntity()` (Following the same pattern as `DBImpl::Get()` and `DBImpl::GetEntity()`)
- Introduced `GetImpl()` with `GetImplOptions` in `db_impl_secondary` and refactored current `GetImpl()` logic. This is to make `DBImplSecondary::Get/GetEntity` consistent with `DBImpl::Get/GetEntity` and `DBImplReadOnly::Get/GetEntity`
- `GetImpl()` in `db_impl` is now virtual. both `db_impl_readonly` and `db_impl_secondary`'s `Get()` override are no longer needed since all three dbs now have the same `Get()` which calls `GetImpl()` internally.
- `GetImpl()` in `DBImplReadOnly` and `DBImplSecondary` now pass in `columns` instead of `nullptr` in lookup functions like `memtable->get()`
- Introduced `GetEntity()` API in `DBImplReadOnly` and `DBImplSecondary` which simply calls `GetImpl()` with `columns` set in `GetImplOptions`.
- Introduced `Env::IOActivity::kGetEntity` and set read_options.io_activity to `Env::IOActivity::kGetEntity` for `GetEntity()` operations (in db_impl)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11799

Test Plan:
**Unit Tests**
- Added verification in `DBWideBasicTest::PutEntity` by Reopening DB as ReadOnly with the same setup.
- Added verification in `DBSecondaryTest::ReopenAsSecondary` by calling `PutEntity()` and `GetEntity()` on top of existing `Put()` and `Get()`
- `make -j64 check`

**Crash Tests**
- `python3 tools/db_crashtest.py blackbox --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`
- `python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 `
- `python3 tools/db_crashtest.py blackbox --cf_consistency --max_key=25000000 --write_buffer_size=4194304 --max_bytes_for_level_base=2097152 --target_file_size_base=2097152 --periodic_compaction_seconds=0 --use_put_entity_one_in=10 --use_get_entity=1 --duration=60 --inter
val=10`

Reviewed By: ltamasi

Differential Revision: D49037040

Pulled By: jaykorean

fbshipit-source-id: a0648253ded6e91af7953de364ed3c6bf163626b
2023-09-15 08:30:44 -07:00
Levi Tamasi 1e63fc9925 Add a helper method WideColumnsHelper::SortColumns (#11823)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11823

Similarly to https://github.com/facebook/rocksdb/pull/11813, the patch is a small refactoring that eliminates some copy-paste around sorting the columns of entities by column name.

Reviewed By: jaykorean

Differential Revision: D49195504

fbshipit-source-id: d48c9f290e3203f838cc5949856c469ecf730008
2023-09-12 12:36:07 -07:00
Levi Tamasi 8fc78a3a9e Add helper methods WideColumnsHelper::{Has,Get}DefaultColumn (#11813)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11813

The patch adds a couple of helper methods `WideColumnsHelper::{Has,Get}DefaultColumn` to eliminate some code duplication.

Reviewed By: jaykorean

Differential Revision: D49166682

fbshipit-source-id: f229ca5b94599f7445a0112b10f8317292505c82
2023-09-11 16:32:32 -07:00
anand76 137cd4bb75 Disable error injection after compaction completion (#11798)
Summary:
https://github.com/facebook/rocksdb/issues/11789 added error injection during compaction to db_stress. However, error injection was not disabled after compaction completion, which resulted in some test failures due to stale errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11798

Reviewed By: cbi42

Differential Revision: D49022821

Pulled By: anand1976

fbshipit-source-id: 3cbfe18d55bee393697e063d05e7a7a7f88b7635
2023-09-06 12:57:10 -07:00
Andrew Kryczka 392d6957cd Added compaction read errors to `db_stress` (#11789)
Summary:
- Fixed misspellings of "inject"
- Made user read errors retryable when `FLAGS_inject_error_severity == 1`
- Added compaction read errors when `FLAGS_read_fault_one_in > 0`. These are always retryable so that the DB will keep accepting writes
- Reenabled setting `compaction_readahead_size` in crash test. The reason for disabling it was to "keep the test clean", which is not a good enough reason to skip testing it

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11789

Test Plan:
With https://github.com/facebook/rocksdb/issues/11782 reverted, reproduced the bug:
- Build: `make -j56 db_stress`
- Command: `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --interval=10 --max_key=1000000`
- Output:
```
stderr has error message:
***put or merge error: Corruption: Compaction number of input keys does not match number of keys processed.***
```

Reviewed By: cbi42

Differential Revision: D48939994

Pulled By: ajkr

fbshipit-source-id: a1efb799efecdfd5d9cfd185e4a6321db8fccfbb
2023-09-05 10:41:29 -07:00
Jay Huh ea9a5b2914 Wide Column support in ldb (#11754)
Summary:
wide_columns can now be pretty-printed in the following commands
- `./ldb dump_wal`
- `./ldb dump`
- `./ldb idump`
- `./ldb dump_live_files`
- `./ldb scan`
- `./sst_dump --command=scan`

There are opportunities to refactor to reduce some nearly identical code. This PR is initial change to add wide column support in `ldb` and `sst_dump` tool. More PRs to come for the refactor.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11754

Test Plan:
**New Tests added**
- `WideColumnsHelperTest::DumpWideColumns`
- `WideColumnsHelperTest::DumpSliceAsWideColumns`

**Changes added to existing tests**
- `ExternalSSTFileTest::BasicMixed` added to cover mixed case (This test should have been added in https://github.com/facebook/rocksdb/issues/11688). This test does not verify the ldb or sst_dump output. This test was used to create test SST files having some rows with wide columns and some without and the generated SST files were used to manually test sst_dump_tool.
- `createSST()` in `sst_dump_test` now takes `wide_column_one_in` to add wide column value in SST

**dump_wal**
```
./ldb dump_wal --walfile=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/000004.log --print_value --header
```
```
Sequence,Count,ByteSize,Physical Offset,Key(s) : value
1,1,59,0,PUT_ENTITY(0) : 0x:0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
2,1,34,42,PUT_ENTITY(0) : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
3,1,17,7d,PUT(0) : 0x7468697264 : 0x62617A
```

**idump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ idump
```
```
'first' seq:1, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:2, type:22 => attr_one:two attr_three:four
'third' seq:3, type:1 => baz
Internal keys in range: 3
```

**SST Dump from dump_live_files**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ compact
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump_live_files
```
```
...
==============================
SST Files
==============================
/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst level:1
------------------------------
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
...
```

**dump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump
```
```
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz
Keys in range: 3
```

**scan**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ scan
```
```
first : :hello attr_name1:foo attr_name2:bar
second : attr_one:two attr_three:four
third : baz
```

**sst_dump**
```
./sst_dump --file=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst --command=scan
```
```
options.env is 0x7ff54b296000
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
from [] to []
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
```

Reviewed By: ltamasi

Differential Revision: D48837999

Pulled By: jaykorean

fbshipit-source-id: b0280f0589d2b9716bb9b50530ffcabb397d140f
2023-08-30 12:45:52 -07:00
Akanksha Mahajan 6353c6e2fb Add new experimental ReadOption auto_readahead_size to db_bench and db_stress (#11729)
Summary:
Same as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11729

Test Plan: make crash_test -j32

Reviewed By: anand1976

Differential Revision: D48534820

Pulled By: akankshamahajan15

fbshipit-source-id: 3a2a28af98dfad164b82ddaaf9fddb94c53a652e
2023-08-24 14:58:27 -07:00
Fuat Basik bc448e9c89 Run db_stress for final time to ensure un-interrupted validation (#11592)
Summary:
In blackbox tests, db_stress command always run with timeout. Timeout can happen during validation, leaving some of the keys not checked. Since key validation is done in order, it is quite likely that keys those are towards to the end of the set are never validated. This PR adds a final execution, without timeout, to ensure validation is executed for all keys, at least once.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11592

Reviewed By: cbi42

Differential Revision: D48003998

Pulled By: hx235

fbshipit-source-id: 72543475a932f12cf0f57534b7e3b6e07e87080f
2023-08-23 15:24:23 -07:00
Changyu Bi 5e0584bd73 Do not drop unsynced data during reopen in stress test (#11731)
Summary:
Currently the stress test does not support restoring expected state (to a specific sequence number) when there is unsynced data loss during the reopen phase. This causes a few internal stress test failure with errors like inconsistent value. This PR disables dropping unsynced data during reopen to avoid failures due to this issue. We can re-enable later after we decide to support unsynced data loss during DB reopen in stress test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11731

Test Plan:
* Running this test a few times can fail for inconsistent value before this change
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=20.57166126835524 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=6 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=1 --manual_wal_flush_one_in=1000000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=5 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=0 --secondary_cache_fault_one_in=32 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=3 --sync=0 --sync_fault_injection=1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35```

Reviewed By: hx235

Differential Revision: D48537494

Pulled By: cbi42

fbshipit-source-id: ddae21b9bb6ee8d67229121f58513e95f7ef6d8d
2023-08-22 09:47:04 -07:00
Changyu Bi c2aad555c3 Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666)
Summary:
Optionally enable zstd checksum flag (d857369028/lib/zstd.h (L428)) to detect corruption during decompression. Main changes are in compression.h:
* User can set CompressionOptions::checksum to true to enable this feature.
* We enable this feature in ZSTD by setting the checksum flag in ZSTD compression context: `ZSTD_CCtx`.
* Uses `ZSTD_compress2()` to do compression since it supports frame parameter like the checksum flag. Compression level is also set in compression context as a flag.
* Error handling during decompression to propagate error message from ZSTD.
* Updated microbench to test read performance impact.

About compatibility, the current compression decoders should continue to work with the data created by the new compression API `ZSTD_compress2()`: https://github.com/facebook/zstd/issues/3711.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11666

Test Plan:
* Existing unit tests for zstd compression
* Add unit test `DBTest2.ZSTDChecksum` to test the corruption case
* Manually tested that compression levels, parallel compression, dictionary compression, index compression all work with the new ZSTD_compress2() API.
* Manually tested with `sst_dump --command=recompress` that different compression levels and dictionary compression settings all work.
* Manually tested compiling with older versions of ZSTD: v1.3.8, v1.1.0, v0.6.2.
* Perf impact: from public benchmark data: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for checksum and https://github.com/facebook/zstd#benchmarks, if decompression is 1700MB/s and checksum computation is 70000MB/s, checksum computation is an additional ~2.4% time for decompression. Compression is slower and checksumming should be less noticeable.
* Microbench:
```
TEST_TMPDIR=/dev/shm ./branch_db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:1048576/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:0/compression_type:7/compression_checksum:1/no_blockcache:1/iterations:10000/threads:1 --benchmark_repetitions=100

Min out of 100 runs:
Main:
10390 10436 10456 10484 10499 10535 10544 10545 10565 10568

After this PR, checksum=false
10285 10397 10503 10508 10515 10557 10562 10635 10640 10660

After this PR, checksum=true
10827 10876 10925 10949 10971 11052 11061 11063 11100 11109
```
* db_bench:
```
Write perf
TEST_TMPDIR=/dev/shm/ ./db_bench_ichecksum --benchmarks=fillseq[-X10] --compression_type=zstd --num=10000000 --compression_checksum=..

[FillSeq checksum=0]
fillseq [AVG    10 runs] : 281635 (± 31711) ops/sec;   31.2 (± 3.5) MB/sec
fillseq [MEDIAN 10 runs] : 294027 ops/sec;   32.5 MB/sec

[FillSeq checksum=1]
fillseq [AVG    10 runs] : 286961 (± 34700) ops/sec;   31.7 (± 3.8) MB/sec
fillseq [MEDIAN 10 runs] : 283278 ops/sec;   31.3 MB/sec

Read perf
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=readrandom[-X20] --num=100000000 --reads=1000000 --use_existing_db=true --readonly=1

[Readrandom checksum=1]
readrandom [AVG    20 runs] : 360928 (± 3579) ops/sec;    4.0 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 362468 ops/sec;    4.0 MB/sec

[Readrandom checksum=0]
readrandom [AVG    20 runs] : 380365 (± 2384) ops/sec;    4.2 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 379800 ops/sec;    4.2 MB/sec

Compression
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=compress[-X20] --compression_type=zstd --num=100000000 --compression_checksum=1

checksum=1
compress [AVG    20 runs] : 54074 (± 634) ops/sec;  211.2 (± 2.5) MB/sec
compress [MEDIAN 20 runs] : 54396 ops/sec;  212.5 MB/sec

checksum=0
compress [AVG    20 runs] : 54598 (± 393) ops/sec;  213.3 (± 1.5) MB/sec
compress [MEDIAN 20 runs] : 54592 ops/sec;  213.3 MB/sec

Decompression:
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=uncompress[-X20] --compression_type=zstd --compression_checksum=1

checksum = 0
uncompress [AVG    20 runs] : 167499 (± 962) ops/sec;  654.3 (± 3.8) MB/sec
uncompress [MEDIAN 20 runs] : 167210 ops/sec;  653.2 MB/sec
checksum = 1
uncompress [AVG    20 runs] : 167980 (± 924) ops/sec;  656.2 (± 3.6) MB/sec
uncompress [MEDIAN 20 runs] : 168465 ops/sec;  658.1 MB/sec
```

Reviewed By: ajkr

Differential Revision: D48019378

Pulled By: cbi42

fbshipit-source-id: 674120c6e1853c2ced1436ac8138559d0204feba
2023-08-18 15:01:59 -07:00
Changyu Bi d1ff401472 Delay bottommost level single file compactions (#11701)
Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions.

* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701

Test Plan:
* Add two unit tests to test when bottommost_file_compaction_delay > 0.
* Ran crash test with the new option.

Reviewed By: jaykorean, ajkr

Differential Revision: D48331564

Pulled By: cbi42

fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
2023-08-16 17:45:44 -07:00
Jay Huh b63018fb59 Wide Column Ingestion in CrashTest (#11697)
Summary:
`PutEntity` is now supported in SST file writer (https://github.com/facebook/rocksdb/issues/11688). This PR enables ingestion of wide column data in the stress/crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11697

Test Plan:
```
python3 tools/db_crashtest.py blackbox --simple --duration=300 --ingest_external_file_one_in=2 --use_put_entity_one_in=2 --max_key=1048576 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0 --key_len_percent_dist="1,30,69"
```

Reviewed By: ltamasi

Differential Revision: D48370719

Pulled By: jaykorean

fbshipit-source-id: 5855d3112b37b2fb300d05e6df110d899855d77d
2023-08-15 16:13:13 -07:00
Andrew Kryczka a09c141dde In TestIterateAgainstExpected(), verify iterator moves in expected direction (#11698)
Summary:
It's a bit repetitive in order to give reasonably informative error messages.

I also removed total_order_seek in cases where it's not needed, just to make sure a case that shouldn't matter really doesn't.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11698

Test Plan:
run it -

```
$ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --duration=86400 --interval=10 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --blob_compression_type=none --writepercent=50 -iterpercent=45 -readpercent=0 -prefixpercent=0 --prefix_size=0 --verify_iterator_with_expected_state_one_in=10 --test_batches_snapshots=0 -enable_compaction_filter=0
```

Reviewed By: cbi42

Differential Revision: D48285036

Pulled By: ajkr

fbshipit-source-id: 51b147bd7c8011740629ae2fd8114d3d48ce7137
2023-08-14 14:57:28 -07:00
Peter Dillinger ef6f025563 Placeholder for AutoHyperClockCache, more (#11692)
Summary:
* The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation.
* Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache).
* Simplify some existing tests not to depend directly on cache type.
* Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation.
* Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test.
* Add another templated class BaseHyperClockCache to reduce future copy-paste
* Added ReportProblems support to cache_bench
* Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC:
```
2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17
```

In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692

Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in.

Reviewed By: jowlyzhang

Differential Revision: D48247413

Pulled By: pdillinger

fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da
2023-08-11 16:27:38 -07:00
Hui Xiao 38ecfabed2 Remove comment about locking about TestIterateAgainstExpected (#11695)
Summary:
**Context/Summary**
After https://github.com/facebook/rocksdb/pull/11058, we no longer lock the key range to iterate in TestIterateAgainstExpected, except for working with timestamp feature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11695

Test Plan: no code change

Reviewed By: ajkr

Differential Revision: D48276668

Pulled By: hx235

fbshipit-source-id: dc92a3708b2281dc737c0877fb755548bf03a9fc
2023-08-11 13:14:04 -07:00
Peter Dillinger a85eccc6d6 Adjust db_stress handling of TryAgain from optimistic txn (#11691)
Summary:
We're still getting some rare cases of 5x TryAgains in a row. Here I'm boosting the failure threshold to 10 in a row and adding more info in the output, to help us manually verify whether there's anything suspicous about the sequence of TryAgains, such as if Rollback failed to reset to new sequence numbers.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11691

Test Plan: By lowering the threshold to 2 and adjusting some other db_crashtest parameters, I was able to hit my new code and saw fresh sequence number on the subsequent TryAgain.

Reviewed By: cbi42

Differential Revision: D48236153

Pulled By: pdillinger

fbshipit-source-id: c0530e969ddcf8de7348e5cf7daf5d6d5dec24f4
2023-08-10 13:05:45 -07:00
Hui Xiao 9a034801ce Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
2023-08-08 17:26:50 -07:00
Yu Zhang 9c2ebcc2c3 Log user_defined_timestamps_persisted flag in event logger (#11683)
Summary:
As titled, and also removed an undefined and unused member function in for ColumnFamilyData

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683

Reviewed By: ajkr

Differential Revision: D48156290

Pulled By: jowlyzhang

fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7
2023-08-08 12:25:21 -07:00
Peter Dillinger 99daea3481 Prepare tests for new HCC naming (#11676)
Summary:
I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache.

This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676

Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache

Reviewed By: ajkr

Differential Revision: D48103165

Pulled By: pdillinger

fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736
2023-08-07 18:17:12 -07:00
Vardhan 87a21d08fe Add an option to trigger flush when the number of range deletions reach a threshold (#11358)
Summary:
Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358

Test Plan:
* New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
* Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.

Reviewed By: ajkr

Differential Revision: D46582680

Pulled By: cbi42

fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
2023-08-02 19:58:56 -07:00
Peter Dillinger bb8fcc0044 db_stress: Reinstate Transaction::Rollback() calls before destruction (#11656)
Summary:
https://github.com/facebook/rocksdb/issues/11653 broke some crash tests.
Apparently these Rollbacks are needed for pessimistic transaction cases. (I'm still not sure if the API makes any sense with regard to safe usage. It's certainly not documented. Will consider in follow-up PRs.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11656

Test Plan: manual crash test runs, crash_test_with_multiops_wc_txn and crash_test_with_multiops_wp_txn

Reviewed By: cbi42

Differential Revision: D47906280

Pulled By: pdillinger

fbshipit-source-id: d058a01b6dbb47a4f08d199e335364168304f81b
2023-07-30 17:30:01 -07:00
Peter Dillinger b3c54186ab Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653)
Summary:
In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers.

Also, some of the APIs were not clear about Transaction semantics, so I have clarified:
* (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction).
* Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test.

Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress.
* Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function.
* Use unique_ptr (nicely complements removing unnecessary Rollbacks)
* Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".)

Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653

Test Plan:
this is the test :)

Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances.

Reviewed By: cbi42

Differential Revision: D47882995

Pulled By: pdillinger

fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d
2023-07-28 16:25:29 -07:00
Levi Tamasi b3edb87341 Initialize StressTest::optimistic_txn_db_ in ctor (#11547)
Summary:
`StressTest::optimistic_txn_db_` is currently not initialized by the constructor, which
can lead to assertion failures down the line in `StressTest::Open`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11547

Reviewed By: cbi42

Differential Revision: D46845658

Pulled By: ltamasi

fbshipit-source-id: 578b0f24fc00e3e97f24221fcdd003cc529439c2
2023-06-19 15:41:30 -07:00