rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-12-02 20:52:55 +00:00

Author	SHA1	Message	Date
Yu Zhang	f007b8fdea	Support iter_start_ts in integrated BlobDB (#11244 ) Summary: Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244 Test Plan: ```make check ./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs" tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1``` Reviewed By: ltamasi Differential Revision: D43506726 Pulled By: jowlyzhang fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0	2023-02-22 15:44:59 -08:00
anand76	476b01579c	Revert enabling IO uring in db_stress (#11242 ) Summary: IO uring usage is causing crash test failures due to bad cqe data being returned in the uring. Revert the change to enable IO uring in db_stress, and also re-enable async_io in CircleCI so that code path can be tested. Added the -use_io_uring flag to db_stress that, when false, will wrap the default env in db_stress to emulate async IO. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11242 Reviewed By: akankshamahajan15 Differential Revision: D43470569 Pulled By: anand1976 fbshipit-source-id: 7c69ac3f53a79ade31d37313f815f1a4b6108b75	2023-02-21 12:53:55 -08:00
akankshamahajan	68e4581c67	Return NotSupported in scan if IOUring not supported and enable IOUring in db_stress for async_io testing (#11197 ) Summary: - Return NotSupported in scan if IOUring not supported if async_io is enabled - Enable IOUring in db_stress for async_io testing - Disable async_io in circleci crash testing as circleci doesn't support IOUring Pull Request resolved: https://github.com/facebook/rocksdb/pull/11197 Test Plan: CircleCI jobs Reviewed By: anand1976 Differential Revision: D43096313 Pulled By: akankshamahajan15 fbshipit-source-id: c2c53a87636950c0243038b9f5bd0d91608e4fda	2023-02-16 18:33:06 -08:00
Levi Tamasi	ab22e79824	Support using MultiGetEntity as verification method in stress tests (#11228 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11228 Reviewed By: akankshamahajan15 Differential Revision: D43332120 Pulled By: ltamasi fbshipit-source-id: 15f32cf335aecb7e654da24ecafc6e010dc65194	2023-02-15 17:08:25 -08:00
Yu Zhang	c19672c187	Enable crash test to run BlobDB together with user-defined timestamp (#11199 ) Summary: I missed a stress test code sanity check when enabling this combination of tests. This PR addresses that, the "iter_start_ts" function for user defined timestamp feature is not supported when BlobDB is enabled. It's disabled for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11199 Test Plan: Locally always enable BlobDB and run tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 Reviewed By: ltamasi Differential Revision: D43245657 Pulled By: jowlyzhang fbshipit-source-id: 4cae19817bb1afd50a76f9e0e49f006fb5c0b211	2023-02-13 13:40:02 -08:00
Levi Tamasi	753d4d5078	Support using GetEntity as a verification method in the non-batched stress tests (#11144 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11144 Test Plan: Ran a simple blackbox crash test. Reviewed By: akankshamahajan15 Differential Revision: D42791464 Pulled By: ltamasi fbshipit-source-id: 8eb6e62f0bc47f709816136ff3ded0a41d04fab8	2023-01-31 10:17:48 -08:00
Peter Dillinger	94e3beec77	Cleanup, improve, stress test LockWAL() (#11143 ) Summary: The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change: * Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use. * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions) * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability) * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other. * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads. * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143 Test Plan: unit tests added / updated, added to stress/crash test Reviewed By: ajkr Differential Revision: D42848627 Pulled By: pdillinger fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4	2023-01-30 22:52:30 -08:00
sdong	36174d89a6	DB Stress to fix a false assertion (#11164 ) Summary: Seeting this error in stress test: db_stress: internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2459: void rocksdb::StressTest::Open(rocksdb::SharedState *): Assertion `txn_db_ == nullptr' failed. Received signal 6 (Aborted) ...... It doesn't appear that txn_db_ is set to nullptr at all. We set ithere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11164 Test Plan: Run db_stress transaction and non-transation with low kill rate and see restarting without assertion Reviewed By: ajkr Differential Revision: D42855662 fbshipit-source-id: 06816d37cce9c94a81cb54ab238fb73aa102ed46	2023-01-30 19:45:47 -08:00
sdong	4720ba4391	Remove RocksDB LITE (#11147 ) Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE \| egrep '[.](cc\|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2	2023-01-27 13:14:19 -08:00
Yu Zhang	6943ff6e50	Remove deprecated util functions in options_util.h (#11126 ) Summary: Remove the util functions in options_util.h that have previously been marked deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126 Test Plan: `make check` Reviewed By: ltamasi Differential Revision: D42757496 Pulled By: jowlyzhang fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb	2023-01-27 11:10:53 -08:00
Andrew Kryczka	97c1024d3e	Include db_stress verification method in failure message (#11133 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11133 Test Plan: - ran it a few times on a mismatching DB+expected state; verified error messages look right: ``` Verification failed for column family 0 key 000000000000D553000000000000014C0000000000000142 (163988): value_from_db: , value_from_expected: 25E7B53421202322, msg: GetMergeOperands verification: Value not found: NotFound: Verification failed for column family 0 key 000000000000AAE2787878 (131123): value_from_db: , value_from_expected: B2A69C18B6B7B4B5BABBB8B9BEBFBCBDA2A3A0A1A6A7A4A5, msg: Iterator verification: Value not found: NotFound: Verification failed for column family 0 key 00000000000080C6000000000000004C78787878 (98409): value_from_db: , value_from_expected: 67AB7E1E636261606F6E6D6C6B6A6968, msg: Get verification: Value not found: NotFound: ``` Reviewed By: hx235 Differential Revision: D42757072 Pulled By: ajkr fbshipit-source-id: b0a4a0aaa5be5d110434324853ac92aaa6972d89	2023-01-27 07:45:25 -08:00
sdong	e808858ae0	Remove Stats related to compressed block cache (#11135 ) Summary: Since compressed block cache is removed, those stats are not needed. They are removed in different PR in case there is a problem with it. The stats are removed in the same way in https://github.com/facebook/rocksdb/pull/11131/ . HISTORY.md was already updated by mistake, and it would be correct after merging this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11135 Test Plan: Watch CI Reviewed By: ltamasi Differential Revision: D42757616 fbshipit-source-id: bd7cb782585c8535ce5784295225c376f3011f35	2023-01-25 15:37:50 -08:00
sdong	2800aa069a	Remove compressed block cache (#11117 ) Summary: Compressed block cache is replaced by compressed secondary cache. Remove the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117 Test Plan: See CI passes Reviewed By: pdillinger Differential Revision: D42700164 fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d	2023-01-24 17:09:19 -08:00
ehds	4737e1d41b	fix shared state used after free (#11059 ) Summary: Before this pr, the destruction order is `shared` -> `db_`(StressTest destruction) -> `stress`, but `compaction_filter` of `db_` will hold the `shared` pointer, so `shared` maybe used after free. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11059 Reviewed By: hx235 Differential Revision: D42297366 Pulled By: ajkr fbshipit-source-id: 17b314635359acacd5ba62f9db5f955f451133f7	2023-01-04 19:35:34 -08:00
Hui Xiao	b965a5a80e	Add back Options::CompactionOptionsFIFO::allow_compaction to stress/crash test (#11063 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/10777 was reverted (https://github.com/facebook/rocksdb/pull/10999) due to internal blocker and replaced with a better fix https://github.com/facebook/rocksdb/pull/10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by the PR. It's an useful coverage cuz setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and then trigger more FIFO compaction. This speed up discovery of bug related to FIFO compaction like https://github.com/facebook/rocksdb/pull/10955. To see the speedup, compare the failure occurrence in following commands with `Options::CompactionOptionsFIFO::allow_compaction=true/false` ``` --fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65 ``` Therefore this PR is adding it back to stress/crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11063 Test Plan: Rehearsal stress test to make sure stress/crash test is stable Reviewed By: ajkr Differential Revision: D42283650 Pulled By: hx235 fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef	2023-01-03 11:54:58 -08:00
anand76	692d6be358	Prevent db_stress failure when io_uring is disabled (#11045 ) Summary: The IO uring usage is disabled in RocksDB by default and, as a result, PosixRandomAccessFile::ReadAsync returns a NotSupported() status. This was causing stress test failures with MultiGet and async_io combination. Fix it by relying on redirection of ReadAsync to Read when default Env is used in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11045 Reviewed By: akankshamahajan15 Differential Revision: D42136213 Pulled By: anand1976 fbshipit-source-id: fc7904d8ece74d7e8f2e1a34c3d70bd5774fb45f	2022-12-19 11:38:42 -08:00
anand76	c3f720c60d	Enable ReadAsync testing and fault injection in db_stress (#11037 ) Summary: The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037 Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890. Reviewed By: akankshamahajan15 Differential Revision: D42061290 Pulled By: anand1976 fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074	2022-12-15 15:48:50 -08:00
Hui Xiao	f1574a20ff	Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999 ) Summary: Context/Summary: This reverts commit `fc74abb436` and related HISTORY record. The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004 Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999 Test Plan: make check Reviewed By: anand1976 Differential Revision: D41572604 Pulled By: hx235 fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a	2022-11-29 10:56:42 -08:00
Yanqin Jin	a8a4ed52a4	Test Merge with timestamps in stress test (#10948 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10948 Test Plan: make crash_test_with_ts Reviewed By: ltamasi Differential Revision: D41390854 Pulled By: riversand963 fbshipit-source-id: 599e114da8e2b2bbff5628fb8c67fa0393a31c05	2022-11-17 20:43:50 -08:00
Peter Dillinger	32520df1d9	Remove prototype FastLRUCache (#10954 ) Summary: This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954 Test Plan: tests updated Reviewed By: akankshamahajan15 Differential Revision: D41310123 Pulled By: pdillinger fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c	2022-11-16 10:15:55 -08:00
Levi Tamasi	b644baa1eb	Support using GetMergeOperands for verification with wide columns (#10952 ) Summary: With the recent changes, `GetMergeOperands` is now supported for wide-column entities as well, so we can use it for verification purposes in the non-batched stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10952 Test Plan: Ran a simple non-batched ops blackbox crash test. Reviewed By: riversand963 Differential Revision: D41292114 Pulled By: ltamasi fbshipit-source-id: 70b4c756a4a1fecb445c16c7096aad805a51203c	2022-11-15 08:06:41 -08:00
Levi Tamasi	d484275230	Adjust value generation in batched ops stress tests (#10872 ) Summary: The patch adjusts the generation of values in batched ops stress tests so that the digits 0..9 are appended (instead of prepended) to the values written. This has the advantage of aligning the encoding of the "value base" into the value string across non-batched, batched, and CF consistency stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10872 Test Plan: Tested using some black box stress test runs. Reviewed By: riversand963 Differential Revision: D40692847 Pulled By: ltamasi fbshipit-source-id: 26bf8adff2944cbe416665f09c3bab89d80416b3	2022-10-25 17:51:20 -07:00
sdong	48fe921754	Run clang format against files under tools/ and db_stress_tool/ (#10868 ) Summary: Some lines of .h and .cc files are not properly fomatted. Clear them up with clang format. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10868 Test Plan: Watch existing CI to pass Reviewed By: ajkr Differential Revision: D40683485 fbshipit-source-id: 491fbb78b2cdcb948164f306829909ad816d5d0b	2022-10-25 14:29:41 -07:00
Hui Xiao	fc74abb436	Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777 ) Summary: Context: Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case Repro: ``` COERCE_CONTEXT_SWICH=1 make -j56 db_stress ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35 put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049 ``` Summary: FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving. - [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true` - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56 - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here. - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0` - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno` - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378) - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction. Some additional clean-up included in this PR: - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming - Added comment about `earliest_memtable_seqno` in related APIs - Made parameter `earliest_memtable_seqno` constant and required Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777 Test Plan: - make check - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix - Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only Reviewed By: ajkr Differential Revision: D40090485 Pulled By: hx235 fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f	2022-10-25 10:39:58 -07:00
Jay Zhuang	8124bc3526	Enable preclude_last_level_data_seconds in stress test (#10824 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10824 Reviewed By: siying Differential Revision: D40390535 Pulled By: jay-zhuang fbshipit-source-id: 700803a1aff8a1e77c038740d87931577e79bcf6	2022-10-16 09:28:43 -07:00
Levi Tamasi	2f3042d732	Check wide columns in TestIterateAgainstExpected (#10820 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10820 Reviewed By: riversand963 Differential Revision: D40363653 Pulled By: ltamasi fbshipit-source-id: d347547d8cdd3f8926b35b6af4d1fa0f827e4a10	2022-10-14 14:25:05 -07:00
Levi Tamasi	eae3a686ee	Check wide columns in TestIterate (#10818 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10818 Test Plan: Tested using some simple blackbox crash test runs in the various modes (non-batched, batched, CF consistency). Reviewed By: riversand963 Differential Revision: D40349527 Pulled By: ltamasi fbshipit-source-id: 2918bc26adbbeac314beaa958aafe770b01e5cc6	2022-10-13 12:06:36 -07:00
Levi Tamasi	23b7dc2f4f	Check columns in CfConsistencyStressTest::VerifyDb (#10804 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10804 Reviewed By: riversand963 Differential Revision: D40279057 Pulled By: ltamasi fbshipit-source-id: 9efc3dae7f5eaab162d55a41c58c2535b0a53054	2022-10-12 11:43:34 -07:00
Levi Tamasi	85399b14f7	Consider wide columns when checksumming in the stress tests (#10788 ) Summary: There are two places in the stress test code where we compute the CRC for a range of KVs for the purposes of checking consistency, namely in the CF consistency test (to make sure CFs contain the same data), and when performing `CompactRange` (to make sure the pre- and post-compaction states are equivalent). The patch extends the logic so that wide columns are also considered in both cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10788 Test Plan: Tested using some simple blackbox crash test runs. Reviewed By: riversand963 Differential Revision: D40191134 Pulled By: ltamasi fbshipit-source-id: 542c21cac9077c6d225780deb210319bb5eee955	2022-10-11 14:40:25 -07:00
Hui Xiao	f6a0065d54	Allow Flush(sync=true) not supported in DB::Open() and db_stress (#10784 ) Summary: Context: https://github.com/facebook/rocksdb/pull/10698 made `Flush(sync=true)` required for` DB::Open()` (to pass the original but now deleted assertion `impl->TEST_WALBufferIsEmpty()` under `manual_wal_flush=true`, see https://github.com/facebook/rocksdb/pull/10698 summary for more ) as well as db_stress to pass. However RocksDB users may not implement SyncWAL() (used inFlush(sync=true)). Therefore we replace such in DB::Open and db_stress in this PR and align with https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L1883-L1887 and https://github.com/facebook/rocksdb/blob/main/db_stress_tool/db_stress_test_base.cc#L847-L849 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10784 Test Plan: make check Reviewed By: anand1976 Differential Revision: D40193354 Pulled By: anand1976 fbshipit-source-id: e80d53880799ae01bdd717641d07997d3bfe2b54	2022-10-10 15:52:10 -07:00
Levi Tamasi	5182bf3f83	Skip column validation for non-value types when iter_start_ts is set (#10799 ) Summary: When the `iter_start_ts` read option is set, iterator exposes internal keys. This also includes tombstones, which by definition do not have a value (or columns). The patch makes sure we skip the wide-column consistency check in this case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10799 Test Plan: Tested using a simple blackbox crash test with timestamps enabled. Reviewed By: jay-zhuang, riversand963 Differential Revision: D40235628 fbshipit-source-id: 49519fb55d8fe2bb9249ced809f7a81bff2b9df2	2022-10-10 15:07:07 -07:00
Levi Tamasi	7af47c532b	Verify wide columns during prefix scan in stress tests (#10786 ) Summary: The patch adds checks to the `{NonBatchedOps,BatchedOps,CfConsistency}StressTest::TestPrefixScan` methods to make sure the wide columns exposed by the iterators are as expected (based on the value base encoded into the iterator value). It also makes some code hygiene improvements in these methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10786 Test Plan: Ran some simple blackbox tests in the various modes (non-batched, batched, CF consistency). Reviewed By: riversand963 Differential Revision: D40163623 Pulled By: riversand963 fbshipit-source-id: 72f4c3b51063e48c15f974c4ec64d751d3ed0a83	2022-10-07 11:17:57 -07:00
Levi Tamasi	d6d8c007ff	Verify columns in NonBatchedOpsStressTest::VerifyDb (#10783 ) Summary: As the first step of covering the wide-column functionality of iterators in our stress tests, the patch adds verification logic to `NonBatchedOpsStressTest::VerifyDb` that checks whether the iterator's value and columns are in sync. Note: I plan to update the other types of stress tests and add similar verification for prefix scans etc. in separate PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10783 Test Plan: Ran some simple blackbox crash tests. Reviewed By: riversand963 Differential Revision: D40152370 Pulled By: riversand963 fbshipit-source-id: 8f9d17d7af5da58ccf1bd2057cab53cc9645ac35	2022-10-06 15:07:16 -07:00
Peter Dillinger	b205c6d029	Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768 ) Summary: We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.) This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768 Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test) Reviewed By: anand1976 Differential Revision: D40034747 Pulled By: anand1976 fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68	2022-10-06 14:54:21 -07:00
Levi Tamasi	3ae00dec90	Disable ingestion in stress tests when PutEntity is used (#10769 ) Summary: `SstFileWriter` currently does not support the `PutEntity` API, so in `TestIngestExternalFile` all key-values are written using regular `Put`s. This violates the assumption that whether or not a key corresponds to a plain old key-value or a wide-column entity can be determined by solely looking at the "value base" used when generating the value. The patch fixes this issue by disabling ingestion when `PutEntity` is enabled in the stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10769 Test Plan: Ran a simple blackbox stress test. Reviewed By: akankshamahajan15 Differential Revision: D40042132 Pulled By: ltamasi fbshipit-source-id: 93e75ff55545b7b69fa4ddef1d96093c961158a0	2022-10-03 18:09:56 -07:00
Changyu Bi	8b430e01dc	Add iterator refresh to stress test (#10766 ) Summary: added calls to `Iterator::Refresh()` in `NonBatchedOpsStressTest::TestIterateAgainstExpected()`. The testing key range is locked in `TestIterateAgainstExpected` so I do not expect this change to provide thorough stress test to `Iterator::Refresh()`. However, it can still be helpful for catching bugs like https://github.com/facebook/rocksdb/issues/10739. Will add calls to refresh in `TestIterate` once we support iterator refresh with snapshots. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10766 Test Plan: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=2` Reviewed By: ajkr Differential Revision: D40008320 Pulled By: ajkr fbshipit-source-id: cec93b07f915ef6476d41c1fee9b23c115188085	2022-10-03 16:22:39 -07:00
Changyu Bi	9f2363f4c4	User-defined timestamp support for `DeleteRange()` (#10661 ) Summary: Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps. - Garbage collection of range tombstones and range tombstone covered keys during compaction. - Get()/MultiGet() to return the timestamp of a range tombstone when needed. - Get/Iterator with range tombstones bounded by readoptions.timestamp. - timestamp crash test now issues DeleteRange by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661 Test Plan: - Added unit test: `make check` - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case. \| micros/op \| fillrandom \| seekrandom \| \| --- \| --- \| --- \| \|main\| 2.58 \|10.96\| \|PR 10661\| 2.68 \|10.63\| Reviewed By: riversand963 Differential Revision: D39441192 Pulled By: cbi42 fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2	2022-09-30 16:13:03 -07:00
Hui Xiao	3b8164912e	Add manual_wal_flush, FlushWAL() to stress/crash test (#10698 ) Summary: Context/Summary: Introduce `manual_wal_flush_one_in` as titled. - When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in` Incompatibilities fixed along the way: ``` db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle>, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed. ``` - It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false. - As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698 Test Plan: - Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3` - Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs Reviewed By: ajkr Differential Revision: D39593752 Pulled By: ajkr fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce	2022-09-30 15:48:33 -07:00
anand76	793fd09783	Track expected state only if expected values dir is non-empty (#10764 ) Summary: If the `-expected_values_dir` argument to db_stress is empty, then verification against expected state is effectively disabled. But `RunStressTest` still calls `TrackExpectedState`, which returns `NotSupported` causing a the crash test to fail with a false alarm. Fix it by only calling `TrackExpectedState` if necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10764 Reviewed By: ajkr Differential Revision: D39980129 Pulled By: anand1976 fbshipit-source-id: d02651746fe3a297877a4b2b2fbcb7274860f49c	2022-09-30 13:37:05 -07:00
Levi Tamasi	9078fcccee	Add the PutEntity API to the stress/crash tests (#10760 ) Summary: The patch adds the `PutEntity` API to the non-batched, batched, and CF consistency stress tests. Namely, when the new `db_stress` command line parameter `use_put_entity_one_in` is greater than zero, one in N writes on average is performed using `PutEntity` rather than `Put`. The wide-column entity written has the generated value in its default column; in addition, it contains up to three additional columns where the original generated value is divided up between the column name and the column value (with the column name containing the first k characters of the generated value, and the column value containing the rest). Whether `PutEntity` is used (and if so, how many columns the entity has) is completely determined by the "value base" used to generate the value (that is, there is no randomness involved). Assuming the same `use_put_entity_one_in` setting is used across `db_stress` invocations, this enables us to reconstruct and validate the entity during subsequent `db_stress` runs. Note that `PutEntity` is currently incompatible with `Merge`, transactions, and user-defined timestamps; these combinations are currently disabled/disallowed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10760 Test Plan: Ran some batched, non-batched, and CF consistency stress tests using the script. Reviewed By: riversand963 Differential Revision: D39939032 Pulled By: ltamasi fbshipit-source-id: eafdf124e95993fb7d73158e3b006d11819f7fa9	2022-09-30 11:11:07 -07:00
Andrew Kryczka	dc9f499639	db_stress TestIngestExternalFile avoid empty files (#10754 ) Summary: If all the keys in range [key_base, shared->GetMaxKey()) are non-overwritable `TestIngestExternalFile()` would attempt to ingest a file with zero keys, leading to the following error: "Cannot create sst file with no entries". This PR changes `TestIngestExternalFile()` to return early in that case instead of going through with the ingestion attempt. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10754 Reviewed By: hx235 Differential Revision: D39909195 Pulled By: ajkr fbshipit-source-id: e06e6b9cc24826fbd450e5130885e6f07164badd	2022-09-28 16:21:43 -07:00
Andrew Kryczka	b0d8ccbbca	db_stress print TestMultiGet error value in hex (#10753 ) Summary: Without this fix, db_crashtest.py could fail with useless output such as: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 267: invalid start byte` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10753 Reviewed By: hx235 Differential Revision: D39905809 Pulled By: ajkr fbshipit-source-id: 50ba2cf20d206eeb168309cec137e827a34c8f0b	2022-09-28 15:17:12 -07:00
Hui Xiao	f3b359a549	Set options.num_levels in db_stress_test_base (#10732 ) Summary: An add-on to https://github.com/facebook/rocksdb/pull/6818 to complete adding single-level universal compaction to stress/crash testing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10732 Test Plan: - Locally run for 10 min `python3 ./tools/db_crashtest.py whitebox --simple --compaction_style=1 --num_levels=1 -max_key=1000000 -value_size_mult=33 -write_buffer_size=524288 -target_file_size_base=524288 -max_bytes_for_level_base=2097152 --duration=120 --interval=10 --ops_per_thread=1000 --random_kill_odd=887` - Check LOG to confirm single-level universal compaction is called - Manual testing and log checking to ensure destroy_db_initially=1 is correctly set across runs with different compaction styles (i.e, in the second half of whitebox testing). - [ongoing]CI jobs stress test Reviewed By: ajkr Differential Revision: D39797612 Pulled By: ajkr fbshipit-source-id: 16f5c40c3464c57360c06c8305f92118e426149c	2022-09-27 12:18:28 -07:00
Hui Xiao	aed30ddf21	Support WriteCommit policy with sync_fault_injection=1 (#10624 ) Summary: Context: Prior to this PR, correctness testing with un-sync data loss [disabled](https://github.com/facebook/rocksdb/pull/10605) transaction (`use_txn=1`) thus all of the `txn_write_policy` . This PR improved that by adding support for one policy - WriteCommit (`txn_write_policy=0`). Summary: They key to this support is (a) handle Mark{Begin, End}Prepare/MarkCommit/MarkRollback in constructing ExpectedState under WriteCommit policy correctly and (b) monitor CI jobs and solve any test incompatibility issue till jobs are stable. (b) will be part of the test plan. For (a) - During prepare (i.e, between `MarkBeginPrepare()` and `MarkEndPrepare(xid)`), `ExpectedStateTraceRecordHandler` will buffer all writes by adding all writes to an internal `WriteBatch`. - On `MarkEndPrepare()`, that `WriteBatch` will be associated with the transaction's `xid`. - During the commit (i.e, on `MarkCommit(xid)`), `ExpectedStateTraceRecordHandler` will retrieve and iterate the internal `WriteBatch` and finally apply those writes to `ExpectedState` - During the rollback (i.e, on `MarkRollback(xid)`), `ExpectedStateTraceRecordHandler` will erase the internal `WriteBatch` from the map. For (b) - one major issue described below: - TransactionsDB in db stress recovers prepared-but-not-committed txns from the previous crashed run by randomly committing or rolling back it at the start of the current run, see a historical [PR](`6d06be22c0`) predated correctness testing. - And we will verify those processed keys in a recovered db against their expected state. - However since now we turn on `sync_fault_injection=1` where the expected state is constructed from the trace instead of using the LATEST.state from previous run. The expected state now used to verify those processed keys won't contain UNKNOWN_SENTINEL as they should - see test 1 for a failed case. - Therefore, we decided to manually update its expected state to be UNKNOWN_SENTINEL as part of the processing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10624 Test Plan: 1. Test exposed the major issue described above. This test will fail without setting UNKNOWN_SENTINEL in expected state during the processing and pass after ``` db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected dbt=$db.tmp expt=$exp.tmp rm -rf $db $exp mkdir -p $exp echo "RUN 1" ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 & pid=$! sleep 0.2 sleep 20 kill $pid sleep 0.2 echo "RUN 2" ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 & pid=$! sleep 0.2 sleep 20 kill $pid sleep 0.2 echo "RUN 3" ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 ``` 2. Manual testing to ensure ExpectedState is constructed correctly during recovery by verifying it against previously crashed TransactionDB's WAL. - Run the following command to crash a TransactionDB with WriteCommit policy. Then `./ldb dump_wal` on its WAL file ``` db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected rm -rf $db $exp mkdir -p $exp ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 & pid=$! sleep 30 kill $pid sleep 1 ``` - Run the following command to verify recovery of the crashed db under debugger. Compare the step-wise result with WAL records (e.g, WriteBatch content, xid, prepare/commit/rollback marker) ``` ./db_stress \ --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=0 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --test_batches_snapshots=0 --value_size_mult=32 --writepercent=90 \ --use_txn=1 --txn_write_policy=0 --sync_fault_injection=1 ``` 3. Automatic testing by triggering all RocksDB stress/crash test jobs for 3 rounds with no failure. Reviewed By: ajkr, riversand963 Differential Revision: D39199373 Pulled By: hx235 fbshipit-source-id: 7a1dec0e3e2ee6ea86ddf5dd19ceb5543a3d6f0c	2022-09-26 18:01:59 -07:00
Levi Tamasi	2280b2612a	Small cleanup in NonBatchedOpsStressTest::VerifyDb (#10740 ) Summary: The PR cleans up the logic in `NonBatchedOpsStressTest::VerifyDb` so that the verification method is picked using a single random number generation. It also eliminates some repeated key comparisons and makes some small code hygiene improvements. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10740 Test Plan: Ran a simple blackbox crash test. Reviewed By: riversand963 Differential Revision: D39828646 Pulled By: ltamasi fbshipit-source-id: 60ee5a3bb1851278f62c7d83b0c93b902ed9702e	2022-09-26 15:33:36 -07:00
Jay Zhuang	00050d4634	Disable tiered storage + BlobDB stress test (#10699 ) Summary: There're 2 knobs to disable blobdb, adding that. Also print call stack when there's assert failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10699 Reviewed By: gitbw95 Differential Revision: D39596448 Pulled By: jay-zhuang fbshipit-source-id: 5ce9fd0630d8b6ff1e157a2685a1e80a99997098	2022-09-19 15:39:31 -07:00
Peter Dillinger	0f91c72adc	Call experimental new clock cache HyperClockCache (#10684 ) Summary: This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons: * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache." * We want to highlight the key feature: it's fast (especially under parallel load) * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements. * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.) Some API detail: * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`. * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated. For performance tests see https://github.com/facebook/rocksdb/pull/10626 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684 Test Plan: no interesting functional changes; tests updated Reviewed By: anand1976 Differential Revision: D39547800 Pulled By: pdillinger fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf	2022-09-16 12:47:29 -07:00
Andrew Kryczka	6ce782beaf	move db_stress locking to `StressTest::Test()` functions (#10678 ) Summary: One problem of the previous strategy was `NonBatchedOpsStressTest::TestIngestExternalFile()` could release the lock for `rand_keys[0]` in `rand_column_families[0]`, and then subsequent operations in the same loop iteration (e.g., `TestPut()`) would run without locking. This PR changes the strategy so each `Test()` function is responsible for acquiring and releasing its own locks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10678 Reviewed By: hx235 Differential Revision: D39516401 Pulled By: ajkr fbshipit-source-id: bf67f12ebbd293ba8c24fdf8754ff28737bcd758	2022-09-15 15:55:37 -07:00
Andrew Kryczka	03c4ea26bb	db_stress option to preserve all files until verification success (#10659 ) Summary: In `db_stress`, DB and expected state files containing changes leading up to a verification failure are often deleted, which makes debugging such failures difficult. On the DB side, flushed WAL files and compacted SST files are marked obsolete and then deleted. Without those files, we cannot pinpoint where a key that failed verification changed unexpectedly. On the expected state side, files for verifying prefix-recoverability in the presence of unsynced data loss are deleted before verification. These include a baseline state file containing the expected state at the time of the last successful verification, and a trace file containing all operations since then. Without those files, we cannot know the sequence of DB operations expected to be recovered. This PR attempts to address this gap with a new `db_stress` flag: `preserve_unverified_changes`. Setting `preserve_unverified_changes=1` has two effects. First, prior to startup verification, `db_stress` hardlinks all DB and expected state files in "unverified/" subdirectories of `FLAGS_db` and `FLAGS_expected_values_dir`. The separate directories are needed because the pre-verification opening process deletes files written by the previous `db_stress` run as described above. These "unverified/" subdirectories are cleaned up following startup verification success. I considered other approaches for preserving DB files through startup verification, like using a read-only DB or preventing deletion of DB files externally, e.g., in the `Env` layer. However, I decided against it since such an approach would not work for expected state files, and I did not want to change the DB management logic. If there were a way to disable DB file deletions before regular DB open, I would have preferred to use that. Second, `db_stress` attempts to keep all DB and expected state files that were live at some point since the start of the `db_stress` run. This is a bit tricky and involves the following changes. - Open the DB with `disable_auto_compactions=1` and `avoid_flush_during_recovery=1` - DisableFileDeletions() - EnableAutoCompactions() For this part, too, I would have preferred to use a hypothetical API that disables DB file deletion before regular DB open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10659 Reviewed By: hx235 Differential Revision: D39407454 Pulled By: ajkr fbshipit-source-id: 6e981025c7dce147649d2e770728471395a7fa53	2022-09-12 14:49:38 -07:00
Akanksha Mahajan	7a9ecdac3c	Add auto prefetching parameters to db_bench and db_stress (#10632 ) Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/10632 Test Plan: make crash_test -j32 Reviewed By: anand1976 Differential Revision: D39241479 Pulled By: akankshamahajan15 fbshipit-source-id: 5db5b0c007da786bacc1b30d8926d36d6d029b87	2022-09-09 12:52:27 -07:00
Andrew Kryczka	ccf822492f	Reenable sync_fault_injection in crash test (#10172 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10172 Reviewed By: riversand963 Differential Revision: D37164671 Pulled By: ajkr fbshipit-source-id: 40eb919b8dc261d502510e878ee8ac7874ab35d0	2022-08-31 14:27:23 -07:00
Levi Tamasi	228f2c5bf5	Adjust the blob cache printout in db_bench/db_stress (#10614 ) Summary: Currently, `db_bench` and `db_stress` print the blob cache options even if a shared block/blob cache is configured, i.e. when they are not actually in effect. The patch changes this so they are only printed when a separate blob cache is used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10614 Test Plan: Tested manually using `db_bench` and `db_stress`. Reviewed By: akankshamahajan15 Differential Revision: D39144603 Pulled By: ltamasi fbshipit-source-id: f714304c5d46186f8514746c27ee6f52aa3e4af8	2022-08-31 09:55:50 -07:00
Yanqin Jin	3613d862ba	print value when verification fails (#10587 ) Summary: When verification fails for db_stress, print more information about value read from the db and expected state. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10587 Test Plan: make check ./db_stress Reviewed By: akankshamahajan15, hx235 Differential Revision: D39078511 Pulled By: riversand963 fbshipit-source-id: 77ac8ffae01fc3a9b58a02c2e7bbe141e1a18f0b	2022-08-29 14:13:06 -07:00
Changyu Bi	5532b462c4	Verify Iterator/Get() against expected state in only `no_batched_ops_test` (#10590 ) Summary: https://github.com/facebook/rocksdb/issues/10538 added `TestIterateAgainstExpected()` in `no_batched_ops_test` to verify iterator correctness against the in memory expected state. It is not compatible when run after some other stress tests, e.g. `TestPut()` in `batched_op_stress`, that either do not set expected state when writing to DB or use keys that cannot be parsed by `GetIntVal()`. The assert [here](`d17be55aab/db_stress_tool/db_stress_common.h (L520)`) could fail. This PR fixed this issue by setting iterator upperbound to `max_key` when `destroy_db_initially=0` to avoid the key space that `batched_op_stress` touches. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10590 Test Plan: ``` # set up DB with batched_op_stress ./db_stress --test_batches_snapshots=1 --verify_iterator_with_expected_state_one_in=1 --max_key_len=3 --max_key=100000000 --skip_verifydb=1 --continuous_verification_interval=0 --writepercent=85 --delpercent=3 --delrangepercent=0 --iterpercent=10 --nooverwritepercent=1 --prefixpercent=0 --readpercent=2 --key_len_percent_dist=1,30,69 # Before this PR, the following test will fail the asserts with error msg like the following # Assertion failed: (size_key <= key_gen_ctx.weights.size() * sizeof(uint64_t)), function GetIntVal, file db_stress_common.h, line 524. ./db_stress --verify_iterator_with_expected_state_one_in=1 --max_key_len=3 --max_key=100000000 --skip_verifydb=1 --continuous_verification_interval=0 --writepercent=0 --delpercent=3 --delrangepercent=0 --iterpercent=95 --nooverwritepercent=1 --prefixpercent=0 --readpercent=2 --key_len_percent_dist=1,30,69 --destroy_db_initially=0 ``` Reviewed By: ajkr Differential Revision: D39085243 Pulled By: cbi42 fbshipit-source-id: a7dfee2320c330773b623b442d730fd014ec7056	2022-08-29 09:51:40 -07:00
Changyu Bi	d140fbfd7d	Add Iterator test against expected state to stress test (#10538 ) Summary: As mentioned in https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913, `db_stress` does not have much verification for iterator correctness. It has a `TestIterate()` function, but that is mainly for comparing results between two iterators, one with `total_order_seek` and the other optionally sets auto_prefix, upper/lower bounds. Commit 49a0581ad2462e31aa3f768afa769e0d33390f33 added a new `TestIterateAgainstExpected()` function that compares iterator against expected state. It locks a range of keys, creates an iterator, does a random sequence of `Next/Prev` and compares against expected state. This PR is based on that commit, the main changes include some logs (for easier debugging if a test fails), a forward and backward scan to cover the entire locked key range, and a flag for optionally turning on this version of Iterator testing. Added constraint that the checks against expected state in `TestIterateAgainstExpected()` and in `TestGet()` are only turned on when `--skip_verifydb` flag is not set. Remove the change log introduced in https://github.com/facebook/rocksdb/issues/10553. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10538 Test Plan: Run `db_stress` with `--verify_iterator_with_expected_state_one_in=1`, and a large `--iterpercent` and `--num_iterations`. Checked `op_logs` manually to ensure expected coverage. Tweaked part of the code in https://github.com/facebook/rocksdb/issues/10449 and stress test was able to catch it. - internally run various flavor of crash test Reviewed By: ajkr Differential Revision: D38847269 Pulled By: cbi42 fbshipit-source-id: 8b4402a9bba9f6cfa08051943cd672579d489599	2022-08-24 14:59:50 -07:00
Changyu Bi	198e5d8ee9	Update `TestGet()` to verify against expected state (#10553 ) Summary: updated `TestGet()` in `no_batched_op_stress` to check the result of `Get()` operations against expected state (`expected_state_manager_`). More specifically, if `Get()` finds a key, expected state should not have `DELETION_SENTINEL` for the same key, and if `Get()` returns NotFound for a key, expected state should not have the key. One intention for this change it to verify correctness of code path change regarding range tombstones. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10553 Test Plan: run db_stress with nonzero readpercent: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`. When I initially used wrong column family in `thread->shared->Get`, the test reported inconsistencies. Reviewed By: ajkr Differential Revision: D38927007 Pulled By: cbi42 fbshipit-source-id: f9f61b312ad0b4c21a799329609ba8526169b048	2022-08-23 17:08:14 -07:00
Changyu Bi	fd165c869d	Add memtable per key-value checksum (#10281 ) Summary: Append per key-value checksum to internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make DB become read-only upon memtable entry checksum verification failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281 Test Plan: - Added new unit test cases: `make check` - Benchmark on memtable insert ``` TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 # avg over 10 runs Baseline: 1166936 ops/sec memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%) memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%) write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%) ``` - Benchmark on only memtable read: ops/sec dropped 31% for `readseq` due to time spend on verifying checksum. ops/sec for `readrandom` dropped ~6.8%. ``` # Readseq sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 readseq [AVG 20 runs] : 7432840 (± 212005) ops/sec; 822.3 (± 23.5) MB/sec readseq [MEDIAN 20 runs] : 7573878 ops/sec; 837.9 MB/sec With -memtable_protection_bytes_per_key=2: readseq [AVG 20 runs] : 5134607 (± 119596) ops/sec; 568.0 (± 13.2) MB/sec readseq [MEDIAN 20 runs] : 5232946 ops/sec; 578.9 MB/sec # Readrandom sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100 readrandom [AVG 10 runs] : 140236 (± 3938) ops/sec; 9.8 (± 0.3) MB/sec readrandom [MEDIAN 10 runs] : 140545 ops/sec; 9.8 MB/sec With -memtable_protection_bytes_per_key=2: readrandom [AVG 10 runs] : 130632 (± 2738) ops/sec; 9.1 (± 0.2) MB/sec readrandom [MEDIAN 10 runs] : 130341 ops/sec; 9.1 MB/sec ``` - Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800` Reviewed By: ajkr Differential Revision: D37607896 Pulled By: cbi42 fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113	2022-08-12 13:51:32 -07:00
Jay Zhuang	1e86d424e4	Tiered storage stress test (#10493 ) Summary: Add Tiered storage stress test and db_bench option Pull Request resolved: https://github.com/facebook/rocksdb/pull/10493 Test Plan: new crashtest: https://app.circleci.com/pipelines/github/facebook/rocksdb/16905/workflows/68c2967c-9274-434f-8506-1403cf441ead Reviewed By: ajkr Differential Revision: D38481892 Pulled By: jay-zhuang fbshipit-source-id: 217a0be4acb93d420222e6ede2a1290d9f464776	2022-08-08 13:08:35 -07:00
Yanqin Jin	b443d24f4d	Stop operating on DB in a stress test background thread (#10373 ) Summary: Stress test background threads do not coordinate with test worker threads for db reopen in the middle of a test run, thus accessing db obj in a stress test bg thread can race with test workers. Remove the TimestampedSnapshotThread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10373 Test Plan: ``` ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 \ --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 \ --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 \ --block_size=16384 --bloom_bits=7.580319535285394 --bottommost_compression_type=disable \ --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache \ --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 \ --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=0 \ --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 \ --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 --create_timestamped_snapshot_one_in=20 --data_block_index_type=0 \ --db=/dev/shm/rocksdb/ --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=1 \ --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=1 --enable_pipelined_write=0 \ --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=2 \ --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 --index_block_restart_interval=11 --index_type=0 --ingest_external_file_one_in=0 \ --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \ --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 \ --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 \ --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 \ --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.5 \ --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True \ --nooverwritepercent=1 --open_files=500000 --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000 \ --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 \ --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 \ --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=1000 \ --readpercent=55 --recycle_log_file_num=0 --reopen=100 --ribbon_starting_level=8 \ --secondary_cache_fault_one_in=0 --secondary_cache_uri= --snapshot_hold_ops=100000 \ --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 \ --subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 \ --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=1 \ --txn_write_policy=0 --unordered_write=0 --unpartitioned_pinning=0 \ --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 \ --use_merge=1 --use_multiget=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 \ --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 \ --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none \ --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35 ``` make crash_test_with_txn make crash_test_with_multiops_wc_txn Reviewed By: jay-zhuang Differential Revision: D37903189 Pulled By: riversand963 fbshipit-source-id: cd1728ad7ba4ce4cf47af23c4f65dda0956744f9	2022-07-19 11:25:43 -07:00
Gang Liao	0b6bc101ba	Charge blob cache usage against the global memory limit (#10321 ) Summary: To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321 Reviewed By: ltamasi Differential Revision: D37913590 Pulled By: gangliao fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72	2022-07-18 23:26:57 -07:00
Gang Liao	ec4ebeff30	Support prepopulating/warming the blob cache (#10298 ) Summary: Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298 Reviewed By: ltamasi Differential Revision: D37908743 Pulled By: gangliao fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1	2022-07-17 07:13:59 -07:00
Guido Tagliavini Ponce	7e1b417824	Revert NewClockCache signature (#10358 ) Summary: This complements https://github.com/facebook/rocksdb/issues/10351. This PR reverts NewClockCache's signature to an older version, expected by the users of the old (buggy) ClockCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10358 Test Plan: ``make -j24 check`` and re-run the pre-release tests. Reviewed By: siying Differential Revision: D37832601 Pulled By: guidotag fbshipit-source-id: 32a91d3da4119be187935003b7b897272ceb1950	2022-07-13 17:43:39 -07:00
Yanqin Jin	2f13f5f7d0	Add coverage for timestamped snapshot to MultiOpsTxnsStressTest (#10325 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10325 Test Plan: ```bash TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_multiops_wc_txn TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_txn ``` Reviewed By: akankshamahajan15 Differential Revision: D37688742 Pulled By: riversand963 fbshipit-source-id: e198ace921898af63f99e869568c1a7bbf69f1a4	2022-07-11 12:44:08 -07:00
Yanqin Jin	caced09e79	Expand stress test coverage for user-defined timestamp (#10280 ) Summary: Before this PR, we call `now()` to get the wall time before performing point-lookup and range scans when user-defined timestamp is enabled. With this PR, we expand the coverage to: - read with an older timestamp which is larger then the wall time when the process starts but potentially smaller than now() - add coverage for `ReadOptions::iter_start_ts != nullptr` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280 Test Plan: ```bash make check ``` Also, ```bash TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts ``` So far, we have had four successful runs of the above In addition, ```bash TEST_TMPDIR=/dev/shm/rocksdb make crash_test ``` Succeeded twice showing no regression. Reviewed By: ltamasi Differential Revision: D37539805 Pulled By: riversand963 fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5	2022-07-05 13:30:15 -07:00
zczhu	e716bda010	Add FLAGS_compaction_pri into crash_test (#10255 ) Summary: Add FLAGS_compaction_pri into correctness test Pull Request resolved: https://github.com/facebook/rocksdb/pull/10255 Test Plan: run crash_test with FLAGS_compaction_pri Reviewed By: ajkr Differential Revision: D37510372 Pulled By: littlepig2013 fbshipit-source-id: 73d93a0a047d0c3993c8a512383dd6ee6acef641	2022-06-30 22:56:58 -07:00
Guido Tagliavini Ponce	57a0e2f304	Clock cache (#10273 ) Summary: This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work: - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash). - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores. - Middle insertions into the clock list. - A test that exercises the clock eviction policy. - Update the Java API of ClockCache and Java calls to C++. Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37522461 Pulled By: guidotag fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943	2022-06-29 21:50:39 -07:00
Gang Liao	d7ebb58cb5	Add blob cache tickers, perf context statistics, and DB properties (#10203 ) Summary: In order to be able to monitor the performance of the new blob cache, we made the follow changes: - Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics) - Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context) - Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10203 Reviewed By: ltamasi Differential Revision: D37478658 Pulled By: gangliao fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40	2022-06-28 13:52:35 -07:00
Yanqin Jin	d3de59255a	Enable compaction filter for db_stress with user-defined timestamp (#10259 ) Summary: Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter. This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of the presence of timestamps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259 Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts Reviewed By: ajkr Differential Revision: D37459692 Pulled By: riversand963 fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef	2022-06-27 11:53:09 -07:00
Andrew Kryczka	f322f273b0	Temporarily disable mempurge in crash test (#10252 ) Summary: Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252 Reviewed By: riversand963 Differential Revision: D37432948 Pulled By: ajkr fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a	2022-06-24 17:11:27 -07:00
Baptiste Lemaire	5879053fd0	Dynamically changeable `MemPurge` option (#10011 ) Summary: Summary Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled. Motivation RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible. Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement. Content of this PR This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes. Benchmarking I will add numbers to prove that there is no performance impact within the next 12 hours. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011 Reviewed By: pdillinger Differential Revision: D36462357 Pulled By: bjlemaire fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802	2022-06-23 09:42:18 -07:00
Gang Liao	2352e2dfda	Add the blob cache to the stress tests and the benchmarking tool (#10202 ) Summary: In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`. As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202 Reviewed By: ltamasi Differential Revision: D37325739 Pulled By: gangliao fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8	2022-06-22 16:04:03 -07:00
Peter Dillinger	84210c9489	Add data block hash index to crash test, fix MultiGet issue (#10220 ) Summary: There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data block hash index, which was not caught because data block hash index was never added to stress tests. This change fixes both issues. Fixes https://github.com/facebook/rocksdb/issues/10186 I intend to pick this into the 7.4.0 release candidate Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220 Test Plan: Failure quickly reproduces in crash test with kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing the failure with a unit test I believe would be too tricky and fragile to be worthwhile. Reviewed By: anand1976 Differential Revision: D37315647 Pulled By: pdillinger fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba	2022-06-21 16:23:58 -07:00
Hui Xiao	a5d773e077	Add rate-limiting support to batched MultiGet() (#10159 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not. However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159 Test Plan: - Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet` - Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds. - Stress test: - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work Reviewed By: ajkr, anand1976 Differential Revision: D37135172 Pulled By: hx235 fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd	2022-06-17 16:40:47 -07:00
Andrew Kryczka	5d6005c780	Add WriteOptions::protection_bytes_per_key (#10037 ) Summary: Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`. Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user. There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037 Test Plan: - Manual - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24` - Enabled in MyShadow for 1+ week - Automated - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions` - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection Reviewed By: cbi42 Differential Revision: D36614569 Pulled By: ajkr fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb	2022-06-16 23:10:07 -07:00
Peter Dillinger	126c223714	Remove deprecated block-based filter (#10184 ) Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80	2022-06-16 15:51:33 -07:00
Yanqin Jin	ce419c0f10	Allow db_bench and db_stress to set `allow_data_in_errors` (#10171 ) Summary: There is `Options::allow_data_in_errors` that controls whether RocksDB is allowed to log data, e.g. key, value, etc in LOG files. It is false by default. However, in db_bench and db_stress, it is often ok to log data because there is no concern about privacy. This PR allows db_stress and db_bench to set this option on the command line, while it remains false by default. Furthermore, make crash/recovery test driven by db_crashtest.py to opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171 Test Plan: Stress test and db_bench Reviewed By: hx235 Differential Revision: D37163787 Pulled By: riversand963 fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8	2022-06-15 12:38:04 -07:00
Hui Xiao	d665afdbf3	Account memory of FileMetaData in global memory limit (#9924 ) Summary: Context/Summary: As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - write micros/op: -0.24% : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 \| egrep 'fillseq'` - compact micros/op -0.87% : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 \| egrep 'compact'` table 1 - write #-run \| (pre-PR) avg micros/op \| std micros/op \| (post-PR) micros/op \| std micros/op \| change (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 3.9711 \| 0.264408 \| 3.9914 \| 0.254563 \| 0.5111933721 20 \| 3.83905 \| 0.0664488 \| 3.8251 \| 0.0695456 \| -0.3633711465 40 \| 3.86625 \| 0.136669 \| 3.8867 \| 0.143765 \| 0.5289363078 80 \| 3.87828 \| 0.119007 \| 3.86791 \| 0.115674 \| -0.2673865734 160 \| 3.87677 \| 0.162231 \| 3.86739 \| 0.16663 \| -0.2419539978 table 2 - compact #-run \| (pre-PR) avg micros/op \| std micros/op \| (post-PR) micros/op \| std micros/op \| change (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 2,399,650.00 \| 96,375.80 \| 2,359,537.00 \| 53,243.60 \| -1.67 20 \| 2,410,480.00 \| 89,988.00 \| 2,433,580.00 \| 91,121.20 \| 0.96 40 \| 2.41E+06 \| 121811 \| 2.39E+06 \| 131525 \| -0.96 80 \| 2.40E+06 \| 134503 \| 2.39E+06 \| 108799 \| -0.78 - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625	2022-06-14 13:06:40 -07:00
Guido Tagliavini Ponce	f105e1a501	Make the per-shard hash table fixed-size. (#10154 ) Summary: We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values. Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154 Test Plan: `make -j24 check` Reviewed By: pdillinger Differential Revision: D37124451 Pulled By: guidotag fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c	2022-06-13 20:29:00 -07:00
Yanqin Jin	bfaf8291c5	Fix a race condition in transaction stress test (#10157 ) Summary: Before this PR, there can be a race condition between the thread calling `StressTest::Open()` and a background compaction thread calling `MultiOpsTxnsStressTest::VerifyPkSkFast()`. ``` Time thread1 bg_compact_thr \| TransactionDB::Open(..., &txn_db_) \| db_ is still nullptr \| db_->GetSnapshot() // segfault \| db_ = txn_db_ V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10157 Test Plan: CI Reviewed By: akankshamahajan15 Differential Revision: D37121653 Pulled By: riversand963 fbshipit-source-id: 6a53117f958e9ee86f77297fdeb843e5160a9331	2022-06-13 18:54:38 -07:00
Yanqin Jin	1777e5f7e9	Snapshots with user-specified timestamps (#9879 ) Summary: In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable. It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29. This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with user-specified timestamps. Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps. In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot` object is created with the last published sequence number of the super-version. You can see that the reader actually has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called, an arbitrarily long period of time may have already elapsed since the last write, which is when the last published sequence number is written. This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will ensure any two snapshots with timestamps should satisfy the following: ``` snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts ``` If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create a snapshot with associated timestamp. Code example ```cpp // Create a timestamped snapshot when committing transaction. txn->SetCommitTimestamp(100); txn->SetSnapshotOnNextOperation(); txn->Commit(); // A wrapper API for convenience Status Transaction::CommitAndTryCreateSnapshot( std::shared_ptr<TransactionNotifier> notifier, TxnTimestamp ts, std::shared_ptr<const Snapshot>* ret); // Create a timestamped snapshot if caller guarantees no concurrent writes std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100); ``` The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp. ```cpp // Return the timestamped snapshot correponding to given timestamp. If ts is // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present. // Othersise, we return the snapshot whose timestamp is equal to `ts`. If no // such snapshot exists, then we return null. std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const; // Return the latest timestamped snapshot if present. std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const; ``` We also provide two additional APIs for stats collection and reporting purposes. ```cpp Status TransactionDB::GetAllTimestampedSnapshots( std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`. Status TransactionDB::GetTimestampedSnapshots( TxnTimestamp ts_lb, TxnTimestamp ts_ub, std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; ``` To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release timestamped snapshots whose timestamps are older than or equal to a given threshold. ```cpp void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts); ``` Before shutdown, RocksDB will release all timestamped snapshots. Comparison with user-defined timestamp and how they can be combined: User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile mapping between snapshots (sequence numbers) and timestamps. Different internal keys with the same user key but different timestamps will be treated as different by compaction, thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection. In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent). The timestamped snapshot supports the semantics of reading at an exact point in time. Timestamped snapshots can also be used with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879 Test Plan: ``` make check TEST_TMPDIR=/dev/shm make crash_test_with_txn ``` Reviewed By: siying Differential Revision: D35783919 Pulled By: riversand963 fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4	2022-06-10 16:07:03 -07:00
Yanqin Jin	f890527b16	Update test for secondary instance in stress test (#10121 ) Summary: This PR updates secondary instance testing in stress test by default. A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary. Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan to read some random keys with only very basic verification to make sure no assertion failure is triggered. Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled. Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic catch-up. In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not work well when the underlying file is renamed by primary, e.g. SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb make crash_test TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush ``` Reviewed By: ajkr Differential Revision: D36939458 Pulled By: riversand963 fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e	2022-06-07 21:07:47 -07:00
Andrew Kryczka	ff32346415	Set db_stress defaults for TSAN deadlock detector (#10131 ) Summary: After https://github.com/facebook/rocksdb/issues/9357 we began seeing the following error attempting to acquire locks for file ingestion: ``` FATAL: ThreadSanitizer CHECK failed: /home/engshare/third-party2/llvm-fb/12/src/llvm/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40) ``` The command was using default values for `ingest_external_file_width` (1000) and `log2_keys_per_lock` (2). The expected number of locks needed to update those keys is then (1000 / 2^2) = 250, which is above the 0x40 (64) limit. This PR reduces the default value of `ingest_external_file_width` to 100 so the expected number of locks is 25, which is within the limit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10131 Reviewed By: ltamasi Differential Revision: D36986307 Pulled By: ajkr fbshipit-source-id: e918cdb2fcc39517d585f1e5fd2539e185ada7c1	2022-06-07 15:15:09 -07:00
Gang Liao	e6432dfd4c	Make it possible to enable blob files starting from a certain LSM tree level (#10077 ) Summary: Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed. In order to achieve this, we would like to do the following: - Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic) - Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level` - Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` ) - Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh` - Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool) - Ideally extend the C and Java bindings with the new option - Update the BlobDB wiki to document the new option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077 Reviewed By: ltamasi Differential Revision: D36884156 Pulled By: gangliao fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d	2022-06-02 20:04:33 -07:00
Gang Liao	3dc6ebaf74	Support specifying blob garbage collection parameters when CompactRange() (#10073 ) Summary: Garbage collection is generally controlled by the BlobDB configuration options `enable_blob_garbage_collection` and `blob_garbage_collection_age_cutoff`. However, there might be use cases where we would want to temporarily override these options while performing a manual compaction. (One use case would be doing a full key-space manual compaction with full=100% garbage collection age cutoff in order to minimize the space occupied by the database.) Our goal here is to make it possible to override the configured GC parameters when using the `CompactRange` API to perform manual compactions. This PR would involve: - Extending the `CompactRangeOptions` structure so clients can both force-enable and force-disable GC, as well as use a different cutoff than what's currently configured - Storing whether blob GC should actually be enabled during a certain manual compaction and the cutoff to use in the `Compaction` object (considering the above overrides) and passing it to `CompactionIterator` via `CompactionProxy` - Updating the BlobDB wiki to document the new options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10073 Test Plan: Adding unit tests and adding the new options to the stress test tool. Reviewed By: ltamasi Differential Revision: D36848700 Pulled By: gangliao fbshipit-source-id: c878ef101d1c612429999f513453c319f75d78e9	2022-06-01 19:40:26 -07:00
Guido Tagliavini Ponce	b4d0e041d0	Add support for FastLRUCache in stress and crash tests. (#10081 ) Summary: Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache. Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags---when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081 Test Plan: - To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`. - To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`. Reviewed By: anand1976 Differential Revision: D36839908 Pulled By: guidotag fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725	2022-06-01 18:00:28 -07:00
Andrew Kryczka	91ba7837b7	Enable IngestExternalFile() in crash test (#9357 ) Summary: Thanks to https://github.com/facebook/rocksdb/issues/9919 and https://github.com/facebook/rocksdb/issues/10051 the known bugs in file ingestion (besides mmap read + file checksum) are fixed. Now we can try again to enable file ingestion in crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9357 Test Plan: stress file ingestion heavily for an hour: `$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=1000000 --ingest_external_file_one_in=100 --duration=3600 --interval=20 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152` Reviewed By: riversand963 Differential Revision: D33410746 Pulled By: ajkr fbshipit-source-id: d276431390995a67f68390d61c06a40945fdd280	2022-05-26 10:31:37 -07:00
Yanqin Jin	9901e7f681	Enable checkpoint and backup in db_stress when timestamp is enabled (#10047 ) Summary: After https://github.com/facebook/rocksdb/issues/10030 and https://github.com/facebook/rocksdb/issues/10004, we can enable checkpoint and backup in stress tests when user-defined timestamp is enabled. This PR has no production risk. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10047 Test Plan: ``` TEST_TMPDIR=/dev/shm make crash_test_with_ts ``` Reviewed By: jowlyzhang Differential Revision: D36641565 Pulled By: riversand963 fbshipit-source-id: d86c9d87efcc34c32d1aa176af691d32b897644a	2022-05-24 18:25:43 -07:00
Changyu Bi	cc23b46da1	Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857 ) Summary: An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API, ZDICT_finalizeDictionary(), can improve such a dictionary's effectiveness at low cost. This PR changes how dictionary is created by calling the ZSTD ZDICT_finalizeDictionary() API instead of creating raw content dictionary (when max_dict_buffer_bytes > 0), and pass in all buffered uncompressed data blocks as samples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857 Test Plan: #### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data: Set up: change the parameter [here](`fb9a167a55/tools/db_bench_tool.cc (L1766)`) to 16384 to make synthetic data more compressible. ``` # linked local ZSTD with version 1.5.2 # DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1 EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench dict_bytes=16384 train_bytes=1048576 echo "========== No Dictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 \| grep elapsed du -hc /dev/shm/dbbench/sst \| grep total echo "========== Raw Content Dictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 \| grep elapsed du -hc /dev/shm/dbbench/sst \| grep total echo "========== FinalizeDictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 \| grep elapsed du -hc /dev/shm/dbbench/sst \| grep total echo "========== TrainDictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 \| grep elapsed du -hc /dev/shm/dbbench/sst \| grep total # Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory. # before compression data size: 1.2GB dict_bytes=16384 max_dict_buffer_bytes = 1048576 space cpu/memory No Dictionary 468M 14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k Raw Dictionary 251M 15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k FinalizeDictionary 236M 11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k TrainDictionary 84M 7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k ``` #### Benchmark on 10 sample SST files for spacing saving and CPU time on compression: FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression. ``` dict_bytes=16384 train_bytes=1048576 for sst_file in `ls ../temp/myrock-sst/` do echo "******** $sst_file ********" echo "========== No Dictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD echo "========== Raw Content Dictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes echo "========== FinalizeDictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict echo "========== TrainDictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes done 010240.sst (Size/Time) 011029.sst 013184.sst 021552.sst 185054.sst 185137.sst 191666.sst 7560381.sst 7604174.sst 7635312.sst No Dictionary 28165569 / 2614419 32899411 / 2976832 32977848 / 3055542 31966329 / 2004590 33614351 / 1755877 33429029 / 1717042 33611933 / 1776936 33634045 / 2771417 33789721 / 2205414 33592194 / 388254 Raw Content Dictionary 28019950 / 2697961 33748665 / 3572422 33896373 / 3534701 26418431 / 2259658 28560825 / 1839168 28455030 / 1846039 28494319 / 1861349 32391599 / 3095649 33772142 / 2407843 33592230 / 474523 FinalizeDictionary 27896012 / 2650029 33763886 / 3719427 33904283 / 3552793 26008225 / 2198033 28111872 / 1869530 28014374 / 1789771 28047706 / 1848300 32296254 / 3204027 33698698 / 2381468 33592344 / 517433 TrainDictionary 28046089 / 2740037 33706480 / 3679019 33885741 / 3629351 25087123 / 2204558 27194353 / 1970207 27234229 / 1896811 27166710 / 1903119 32011041 / 3322315 32730692 / 2406146 33608631 / 570593 ``` #### Decompression/Read test: With FinalizeDictionary/TrainDictionary, some data structure used for decompression are in stored in dictionary, so they are expected to be faster in terms of decompression/reads. ``` dict_bytes=16384 train_bytes=1048576 echo "No Dictionary" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 \| grep MB/s echo "Raw Dictionary" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes 2>&1 \| grep MB/s echo "FinalizeDict" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 \| grep MB/s echo "Train Dictionary" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 \| grep MB/s No Dictionary readrandom : 12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations; 9.1 MB/s (1000000 of 1000000 found) Raw Dictionary readrandom : 12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations; 9.0 MB/s (1000000 of 1000000 found) FinalizeDict readrandom : 9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations; 11.3 MB/s (1000000 of 1000000 found) Train Dictionary readrandom : 9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations; 11.4 MB/s (1000000 of 1000000 found) ``` Reviewed By: ajkr Differential Revision: D35720026 Pulled By: cbi42 fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f	2022-05-20 12:09:09 -07:00
Jay Zhuang	c6d326d3d7	Track SST unique id in MANIFEST and verify (#9990 ) Summary: Start tracking SST unique id in MANIFEST, which is used to verify with SST properties to make sure the SST file is not overwritten or misplaced. A DB option `try_verify_sst_unique_id` is introduced to enable/disable the verification, if enabled, it opens all SST files during DB-open to read the unique_id from table properties (default is false), so it's recommended to use it with `max_open_files = -1` to pre-open the files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990 Test Plan: unittests, format-compatible test, mini-crash Reviewed By: anand1976 Differential Revision: D36381863 Pulled By: jay-zhuang fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f	2022-05-19 11:04:21 -07:00
Yaroslav Stepanchuk	0a43061f8d	Remove ROCKSDB_SUPPORT_THREAD_LOCAL define because it's a part of C++11 (#10015 ) Summary: ROCKSDB_SUPPORT_THREAD_LOCAL definition has been removed. `__thread`(#define) has been replaced with `thread_local`(C++ keyword) across the code base. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10015 Reviewed By: siying Differential Revision: D36485491 Pulled By: pdillinger fbshipit-source-id: 6522d212514ee190b90b4e2750c80c7e34013c78	2022-05-18 15:25:19 -07:00
Yanqin Jin	e3a3dbf2be	Avoid overwriting options loaded from OPTIONS (#9943 ) Summary: This is similar to https://github.com/facebook/rocksdb/issues/9862, including the following fixes/refactoring: 1. If OPTIONS file is specified via `-options_file`, majority of options will be loaded from the file. We should not overwrite options that have been loaded from the file. Instead, we configure only fields of options which are shared objects and not set by the OPTIONS file. We also configure a few fields, e.g. `create_if_missing` necessary for stress test to run. 2. Refactor options initialization into three functions, `InitializeOptionsFromFile()`, `InitializeOptionsFromFlags()` and `InitializeOptionsGeneral()` similar to db_bench. I hope they can be shared in the future. The high-level logic is as follows: ```cpp if (!InitializeOptionsFromFile(...)) { InitializeOptionsFromFlags(...); } InitializeOptionsGeneral(...); ``` 3. Currently, the setting for `block_cache_compressed` does not seem correct because it by default specifies a size of `numeric_limits<size_t>::max()` ((size_t)-1). According to code comments, `-1` indicates default value, which should be referring to `num_shard_bits` argument. 4. Clarify `fail_if_options_file_error`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9943 Test Plan: 1. make check 2. Run stress tests, and manually check generated OPTIONS file and compare them with input OPTIONS files Reviewed By: jay-zhuang Differential Revision: D36133769 Pulled By: riversand963 fbshipit-source-id: 35dacdc090a0a72c922907170cd132b9ecaa073e	2022-05-18 12:43:50 -07:00
Peter Dillinger	0070680cfd	Adjust public APIs to prefer 128-bit SST unique ID (#10009 ) Summary: 128 bits should suffice almost always and for tracking in manifest. Note that this changes the output of sst_dump --show_properties to only show 128 bits. Also introduces InternalUniqueIdToHumanString for presenting internal IDs for debugging purposes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10009 Test Plan: unit tests updated Reviewed By: jay-zhuang Differential Revision: D36458189 Pulled By: pdillinger fbshipit-source-id: 93ebc4a3b6f9c73ee154383a1f8b291a5d6bbef5	2022-05-17 18:43:48 -07:00
Hui Xiao	3573558ec5	Rewrite memory-charging feature's option API (#9926 ) Summary: Context: Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`. Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags. Summary: - Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that - Added missing db bench/stress test for some memory charging features - Renamed related test suite to indicate they are under the same theme of memory charging - Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication - Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926 Test Plan: - New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)` - New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)` - CI - db bench (in case querying new options introduces regression) +0.5% micros/op: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 \| egrep 'fillseq'` #-run \| (pre-PR) avg micros/op \| std micros/op \| (post-PR) micros/op \| std micros/op \| change (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 3.9711 \| 0.264408 \| 3.9914 \| 0.254563 \| 0.5111933721 20 \| 3.83905 \| 0.0664488 \| 3.8251 \| 0.0695456 \| -0.3633711465 40 \| 3.86625 \| 0.136669 \| 3.8867 \| 0.143765 \| 0.5289363078 - db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36054712 Pulled By: hx235 fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af	2022-05-17 15:01:51 -07:00
Yanqin Jin	f6d9730ea1	Fix stress test with best-efforts-recovery (#9986 ) Summary: This PR - since we are testing with disable_wal = true and best_efforts_recovery, we should set column family count to 1, due to the requirement of `ExpectedState` tracking and replaying logic. - during backup and checkpoint restore, disable best-efforts-recovery. This does not matter now because db_crashtest.py always disables wal when testing best-efforts-recovery. In the future, if we enable wal, then not setting `restore_opitions.best_efforts_recovery` will cause backup db not to recover the WALs, and differ from db (that enables WAL). - during verification of backup and checkpoint restore, print the key where inconsistency exists between expected state and db. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9986 Test Plan: TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery Reviewed By: siying Differential Revision: D36353105 Pulled By: riversand963 fbshipit-source-id: a484da161273e6216a1f7e245bac15a349693917	2022-05-13 12:29:20 -07:00
sdong	736a7b5433	Remove own ToString() (#9955 ) Summary: ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471	2022-05-06 13:03:58 -07:00
Andrew Kryczka	62d84e2a2b	db_stress fault injection in release mode (#9957 ) Summary: Previously all fault injection was ignored in release mode. This PR adds it back except for read fault injection (`--read_fault_one_in > 0`) since its dependency (`IGNORE_STATUS_IF_ERROR`) is unavailable in release mode. Other notable changes include: - Moved `EnableWriteErrorInjection()` for `--write_fault_one_in > 0` so it's after `DB::Open()` without depending on `SyncPoint` - Made `--read_fault_one_in > 0` return an error in release mode - Updated `db_crashtest.py` to always set `--read_fault_one_in=0` in release mode Pull Request resolved: https://github.com/facebook/rocksdb/pull/9957 Test Plan: ``` $ DEBUG_LEVEL=0 make -j24 db_stress $ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox ``` Reviewed By: anand1976 Differential Revision: D36193830 Pulled By: ajkr fbshipit-source-id: 0b97946b4e3f06e3e0f6e7833c2763da08ec5321	2022-05-06 11:17:08 -07:00
Andrew Kryczka	a62506aee2	Enable unsynced data loss in crash test (#9947 ) Summary: `db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`. Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947 Test Plan: - regular blackbox, blackbox --simple - various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16` Reviewed By: riversand963 Differential Revision: D36152775 Pulled By: ajkr fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8	2022-05-05 13:21:03 -07:00
sdong	49628c9a83	Use std::numeric_limits<> (#9954 ) Summary: Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954 Test Plan: See CI Runs. Reviewed By: riversand963 Differential Revision: D36173954 fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0	2022-05-05 13:08:21 -07:00
Yanqin Jin	9d634dd5b6	Rename kRemoveWithSingleDelete to kPurge (#9951 ) Summary: PR 9929 adds a new CompactionFilter::Decision, i.e. kRemoveWithSingleDelete so that CompactionFilter can indicate to CompactionIterator that a PUT can only be removed with SD. However, how CompactionIterator handles such a key is implementation detail which should not be implied in the public API. In fact, such a PUT can just be dropped. This is an optimization which we will apply in the near future. Discussion thread: https://github.com/facebook/rocksdb/pull/9929#discussion_r863198964 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9951 Test Plan: make check Reviewed By: ajkr Differential Revision: D36156590 Pulled By: riversand963 fbshipit-source-id: 7b7d01f47bba4cad7d9cca6ca52984f27f88b372	2022-05-05 08:16:20 -07:00
Andrew Kryczka	cda34dd64a	Allow consecutive SingleDelete() in stress/crash test (#9930 ) Summary: We need to support consecutive SingleDelete(), so this PR adds it to the stress/crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9930 Test Plan: `python3 tools/db_crashtest.py blackbox --simple --nooverwritepercent=50 --writepercent=90 --delpercent=10 --readpercent=0 --prefixpercent=0 --delrangepercent=0 --iterpercent=0 --max_key=1000000 --duration=3600 --interval=10 --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --value_size_mult=33` Reviewed By: riversand963 Differential Revision: D36081863 Pulled By: ajkr fbshipit-source-id: 3566cdbaed375b8003126fc298968eb1a854317f	2022-05-02 16:19:00 -07:00

1 2 3 4 5 ...

355 commits