rocksdb

mirror of https://github.com/facebook/rocksdb.git synced 2024-12-03 14:52:53 +00:00

Author	SHA1	Message	Date
Hui Xiao	8e6e8957fb	Disable `wal_bytes_per_sync` at one more place (#12492 ) Summary: Summary/Context: supplement to https://github.com/facebook/rocksdb/pull/12489 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12492 Test Plan: CI Reviewed By: jaykorean Differential Revision: D55612747 Pulled By: hx235 fbshipit-source-id: 5c8fbda3e6c8482f2a3363a98a545f1c11e4ea27	2024-04-02 09:44:37 -07:00
Hui Xiao	21d11de761	Temporarily disable `wal_bytes_per_sync` in crash test (#12489) Summary: Context/Summary: `wal_bytes_per_sync > 0` can sync a newer WAL but not an older WAL by its nature. This creates a hole in synced WAL data. Through our crash test, we recently discovered that our DB can recover past that hole, which resulted in a crash-recovery-verification error. Until we fix that recovery behavior, we will temporarily disable `wal_bytes_per_sync` in crash test. Bonus: updated the API to document the nature of this option more explicitly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12489 Test Plan: More stabilized crash test Reviewed By: ajkr Differential Revision: D55531589 Pulled By: hx235 fbshipit-source-id: 6dea6486420dc0f50550d488c15652f93972a0ea	2024-03-29 13:01:15 -07:00
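The "hole" mentioned in the entry above can be pictured with a toy model. This is only a sketch and not RocksDB's recovery code: the `Wal` record and the `first_sync_hole` helper are hypothetical names introduced here to show what it means for an older WAL to be unsynced while a newer one is synced.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Wal:
    """Toy stand-in for a WAL file (hypothetical, for illustration only)."""
    number: int
    bytes_written: int
    bytes_synced: int

    @property
    def fully_synced(self) -> bool:
        return self.bytes_synced >= self.bytes_written


def first_sync_hole(wals: List[Wal]) -> Optional[int]:
    """Return the number of the oldest WAL that still has unsynced data
    while some newer WAL already has synced data; None if no such hole."""
    ordered = sorted(wals, key=lambda w: w.number)
    for i, wal in enumerate(ordered):
        newer_has_synced = any(n.bytes_synced > 0 for n in ordered[i + 1:])
        if not wal.fully_synced and newer_has_synced:
            return wal.number
    return None


# WAL 8 is only partially synced, but the newer WAL 9 is fully synced.
example = [Wal(7, 100, 100), Wal(8, 100, 60), Wal(9, 100, 100)]
hole = first_sync_hole(example)
```

In this model, a crash could lose the tail of WAL 8 while WAL 9 survives, so replaying WAL 9 would mean recovering past the hole.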
Andrew Kryczka	3d4e78937a	Initialize `FaultInjectionTestFS::checksum_handoff_func_type_` to `kCRC32c` (#12485 ) Summary: Previously it was uninitialized. Setting `checksum_handoff_file_types` will cause `kCRC32c` checksums to be passed down in the `DataVerificationInfo`, so it makes sense for `kCRC32c` to be the default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12485 Test Plan: ran `db_stress` in a way that failed before. Building with ASAN was needed to ensure the uninitialized bytes are nonzero according to `malloc_fill_byte` (default 0xbe) ``` $ COMPILE_WITH_ASAN=1 make -j28 db_stress ... $ ./db_stress -sync_fault_injection=1 -enable_checksum_handoff=true ``` Reviewed By: jaykorean Differential Revision: D55450587 Pulled By: ajkr fbshipit-source-id: 53dc829b86e49b3fa80570032e83af0bb12adaad	2024-03-27 18:37:58 -07:00
anand76	63a105a481	Enable recycle_log_file_num option for point in time recovery (#12403 ) Summary: This option was previously disabled due to a bug in the recovery logic. The recovery code in `DBImpl::RecoverLogFiles` couldn't tell if an EoF reported by the log reader was really an EoF or a possible corruption that made a record look like an old log record. To fix this, the log reader now explicitly reports when it encounters what looks like an old record. The recovery code treats it as a possible corruption, and uses the next sequence number in the WAL to determine if it should continue replaying the WAL. This PR also fixes a couple of bugs that log file recycling exposed in the backup and checkpoint path. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12403 Test Plan: 1. Add new unit tests to verify behavior upon corruption 2. Re-enable disabled tests for verifying recycling behavior Reviewed By: ajkr Differential Revision: D54544824 Pulled By: anand1976 fbshipit-source-id: 12f5ce39bd6bc0d63b0bc6432dc4db510e0e802a	2024-03-21 12:29:35 -07:00
Changyu Bi	3d5be596a5	Fix a bug in iterator with UDT + `ReadOptions::pin_data` (#12451 ) Summary: with https://github.com/facebook/rocksdb/issues/12414 enabling `ReadOptions::pin_data`, this bug surfaced as corrupted per key-value checksum during crash test. `saved_key_.GetUserKey()` could be pinned user key, so DBIter should not overwrite it. In one case, it only surfaces when iterator skips many keys of the same user key. To stress that code path, this PR also added `max_sequential_skip_in_iterations` to crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12451 Test Plan: - Set ReadOptions::pin_data to true, the bug can be reproed quickly with `./db_stress --persist_user_defined_timestamps=1 --user_timestamp_size=8 --writepercent=35 --delpercent=4 --delrangepercent=1 --iterpercent=20 --nooverwritepercent=1 --prefix_size=8 --prefixpercent=10 --readpercent=30 --memtable_protection_bytes_per_key=8 --block_protection_bytes_per_key=2 --clear_column_family_one_in=0`. - Set max_sequential_skip_in_iterations to 1 for the other occurrence of the bug. Reviewed By: jowlyzhang Differential Revision: D55003766 Pulled By: cbi42 fbshipit-source-id: 23e1049129456684dafb028b6132b70e0afc07fb	2024-03-18 09:05:11 -07:00
Changyu Bi	ba022dd44c	Disable `enable_checksum_handoff` in crash test (#12431) Summary: It has been causing a few crash test failures, and I suspect it'll be easy to repro locally. Also fixed how its corruption message is printed so the script does not crash when the output cannot be UTF-8 decoded. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12431 Reviewed By: hx235 Differential Revision: D54881023 Pulled By: cbi42 fbshipit-source-id: 47208a637cd69b30d2545154849405e37db62ed3	2024-03-13 18:03:55 -07:00
Hui Xiao	30243c6573	Add missing db crash options (#12414 ) Summary: Context/Summary: We are doing a sweep in all public options, including but not limited to the `Options`, `Read/WriteOptions`, `IngestExternalFileOptions`, cache options.., to find and add the uncovered ones into db crash. The options included in this PR require minimum changes to db crash other than adding the options themselves. A bonus change: to surface new issues by improved coverage in stderror, we decided to fail/terminate crash test for manual compactions (CompactFiles, CompactRange()) on meaningful errors. See https://github.com/facebook/rocksdb/pull/12414/files#diff-5c4ced6afb6a90e27fec18ab03b2cd89e8f99db87791b4ecc6fa2694284d50c0R2528-R2532, https://github.com/facebook/rocksdb/pull/12414/files#diff-5c4ced6afb6a90e27fec18ab03b2cd89e8f99db87791b4ecc6fa2694284d50c0R2330-R2336 for more. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12414 Test Plan: - Run `python3 ./tools/db_crashtest.py --simple blackbox` for 10 minutes to ensure no trivial failure - Run `python3 tools/db_crashtest.py --simple blackbox --compact_files_one_in=1 --compact_range_one_in=1 --read_fault_one_in=1 --write_fault_one_in=1 --interval=50` for a while to ensure the bonus change does not result in trivial crash/termination of stress test Reviewed By: ajkr, jowlyzhang, cbi42 Differential Revision: D54691774 Pulled By: hx235 fbshipit-source-id: 50443dfb6aaabd8e24c79a2e42b68c6de877be88	2024-03-12 17:24:12 -07:00
Andrew Kryczka	27a2473668	Best-effort recovery support for atomic flush (#12406 ) Summary: This PR updates `VersionEditHandlerPointInTime` to recover all or none of the updates in an AtomicGroup. This makes best-effort recovery properly handle atomic flushes during recovery, so the features are now allowed to both be enabled at once. The new logic requires that AtomicGroups do not contain column family additions or removals. AtomicGroups are currently written for atomic flush, which does not include such edits. Column family additions or removals are recovered independently of AtomicGroups. The new logic needs to be aware of removal, though, so that a dropped CF does not prevent completion of an AtomicGroup recovery. The new logic treats each AtomicGroup as if it contains updates for all existing column families, even though it is possible to create AtomicGroups that only affect a subset of column families. This simplifies the logic at the expense of recovering less data in certain edge case scenarios. The usage of `MaybeCreateVersion()` is pretty tricky. The goal is to create a barrier at the start of an AtomicGroup such that all valid states up to that point will be applied to `versions_`. Here is a summary. - `MaybeCreateVersion(..., false)` creates a `Version` on a negative edge trigger (transition from valid to invalid). It was previously called when applying each update. Now, it is only called when applying non-AtomicGroup updates. - `MaybeCreateVersion(..., true)` creates a `Version` on a positive level trigger (valid state). It was previously called only at the end of iteration. Now, it is additionally called before processing an AtomicGroup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12406 Reviewed By: jaykorean, cbi42 Differential Revision: D54494904 Pulled By: ajkr fbshipit-source-id: 0114a9fe1d04b471d086dcab5978ea8a3a56ad52	2024-03-06 14:40:40 -08:00
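The two trigger modes described for `MaybeCreateVersion()` in the entry above can be sketched as a small predicate. This is a toy reading of the commit summary only, not the RocksDB implementation: `should_create_version` and its parameter names are hypothetical, and the real method takes other arguments.

```python
def should_create_version(was_valid: bool, is_valid: bool, force: bool) -> bool:
    """Toy model of the two trigger modes described in the summary.

    force=False: negative edge trigger, fires only on a valid -> invalid
                 transition.
    force=True:  positive level trigger, fires whenever the state is valid.
    """
    if force:
        return is_valid
    return was_valid and not is_valid


# Edge trigger: fires once, at the moment the state stops being valid.
edge_results = [
    should_create_version(True, True, force=False),    # still valid
    should_create_version(True, False, force=False),   # valid -> invalid
    should_create_version(False, False, force=False),  # still invalid
]

# Level trigger: fires whenever the current state is valid.
level_results = [
    should_create_version(True, True, force=True),
    should_create_version(False, False, force=True),
]
```

Under this reading, calling the level-triggered form before an AtomicGroup acts as the barrier the summary describes: any valid state reached so far is captured before the group is processed.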
Jay Huh	8c7c0a38f1	Minor refactor with printing stdout in blackbox tests (#12350 ) Summary: As title. Adding a missing stdout printing in `blackbox_crash_main()` # Test Blackbox ``` $> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 ``` ``` ... stdout: Choosing random keys with no overwrite DB path: [/tmp/jewoongh/rocksdb_crashtest_blackbox34jwn9of] (Re-)verified 0 unique IDs 2024/02/13-12:27:33 Initializing worker threads Crash-recovery verification passed :) 2024/02/13-12:27:36 Starting database operations ... jewoongh stdout test jewoongh stdout test ... jewoongh stdout test stderr: jewoongh injected error ``` Whitebox ``` $> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304 ``` ``` ... stdout: Choosing random keys with no overwrite Creating 24415 locks ... 2024/02/13-12:31:51 Initializing worker threads Crash-recovery verification passed :) 2024/02/13-12:31:54 Starting database operations jewoongh stdout test jewoongh stdout test jewoongh stdout test ... stderr: jewoongh injected error ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/12350 Reviewed By: akankshamahajan15, cbi42 Differential Revision: D53728910 Pulled By: jaykorean fbshipit-source-id: ec90ed3b5e6a1102d1fb55d357d0371e5072a173	2024-02-13 14:15:52 -08:00
Changyu Bi	b46f5707c4	Fix unexpected keyword argument 'print_as_stderr' in crash test (#12339 ) Summary: Fix crash test failure like https://github.com/facebook/rocksdb/actions/runs/7821514511/job/21338625372#step:5:530 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12339 Test Plan: CI Reviewed By: jaykorean Differential Revision: D53545053 Pulled By: cbi42 fbshipit-source-id: b466a8dc9c0ded0377e8677937199c6f959f96ef	2024-02-07 15:44:17 -08:00
Akanksha Mahajan	9a2d7485f0	Print stderr in crash test script and exit on stderr (#12335) Summary: Some errors, such as data races and heap-use-after-free, are only surfaced as failures because the crash test treats output on stderr as an error. So this reverts to the original behavior until we come up with a more reliable way to error out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12335 Reviewed By: cbi42 Differential Revision: D53534781 Pulled By: akankshamahajan15 fbshipit-source-id: b19aa560d1560ac2281f7bc04e13961ed751f178	2024-02-07 12:34:40 -08:00
Yu Zhang	071a146fa0	Add support for range deletion when user timestamps are not persisted (#12254 ) Summary: For the user defined timestamps in memtable only feature, some special handling for range deletion blocks are needed since both the key (start_key) and the value (end_key) of a range tombstone can contain user-defined timestamps. Handling for the key is taken care of in the same way as the other data blocks in the block based table. This PR adds the special handling needed for the value (end_key) part. This includes: 1) On the write path, when L0 SST files are first created from flush, user-defined timestamps are removed from an end key of a range tombstone. There are places where it's logically removed (replaced with a min timestamp) because there is still logic with the running comparator that expects a user key that contains timestamp. And in the block based builder, it is eventually physically removed before persisted in a block. 2) On the read path, when range deletion block is being read, we artificially pad a min timestamp to the end key of a range tombstone in `BlockBasedTableReader`. 3) For file boundary `FileMetaData.largest`, we artificially pad a max timestamp to it if it contains a range deletion sentinel. Anytime when range deletion end_key is used to update file boundaries, it's using max timestamp instead of the range tombstone's actual timestamp to mark it as an exclusive end. `d69628e6ce/db/dbformat.h (L923-L935)` This max timestamp is removed when in memory `FileMetaData.largest` is persisted into Manifest, we pad it back when it's read from Manifest while handling related `VersionEdit` in `VersionEditHandler`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12254 Test Plan: Added unit test and enabled this feature combination's stress test. Reviewed By: cbi42 Differential Revision: D52965527 Pulled By: jowlyzhang fbshipit-source-id: e8315f8a2c5268e2ae0f7aec8012c266b86df985	2024-01-29 11:37:34 -08:00
Jay Huh	8829ba9fe1	print stderr separately per option (#12301 ) Summary: While working on Meta's internal test triaging process, I found that `db_crashtest.py` was printing out `stdout` and `stderr` altogether. Adding an option to print `stderr` separately so that it's easy to extract only `stderr` from the test run. `print_stderr_separately` is introduced as an optional parameter with default value `False` to keep the existing behavior as is (except a few minor changes). Minor changes to the existing behavior - We no longer print `stderr has error message:` and `*` prefix to each line. We simply print `stderr:` before printing `stderr` if stderr is printed in stdout and print `stderr` as is. - We no longer print `times error occurred in output is ...` which doesn't appear to have any values Pull Request resolved: https://github.com/facebook/rocksdb/pull/12301 Test Plan: Default Behavior (blackbox) Run printed everything as is ``` $> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 2> /tmp/error.log Running blackbox-crash-test with interval_between_crash=120 total-duration=6000 ... 
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0 Integrated BlobDB: blob cache disabled DB path: [/tmp/jewoongh/rocksdb_crashtest_blackboxwh7yxpec] (Re-)verified 0 unique IDs 2024/01/29-09:16:30 Initializing worker threads Crash-recovery verification passed :) 2024/01/29-09:16:35 Starting database operations 2024/01/29-09:16:35 Starting verification Stress Test : 543.600 micros/op 8802 ops/sec : Wrote 0.00 MB (0.27 MB/sec) (50% of 10 ops) : Wrote 5 times : Deleted 1 times : Single deleted 0 times : 4 read and 0 found the key : Prefix scanned 0 times : Iterator size sum is 0 : Iterated 0 times : Deleted 0 key-ranges : Range deletions covered 0 keys : Got errors 0 times : 0 CompactFiles() succeed : 0 CompactFiles() did not succeed stderr: WARNING: prefix_size is non-zero but memtablerep != prefix_hash Error : jewoongh injected test error This is not a real failure. Verification failed :( ``` Nothing in stderr ``` $> cat /tmp/error.log ``` Default Behavior (whitebox) Run printed everything as is ``` $> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304 2> /tmp/error.log Running whitebox-crash-test with total-duration=10000 ... 
(Re-)verified 571 unique IDs 2024/01/29-09:33:53 Initializing worker threads Crash-recovery verification passed :) 2024/01/29-09:35:16 Starting database operations 2024/01/29-09:35:16 Starting verification Stress Test : 97248.125 micros/op 10 ops/sec : Wrote 0.00 MB (0.00 MB/sec) (12% of 8 ops) : Wrote 1 times : Deleted 0 times : Single deleted 0 times : 4 read and 1 found the key : Prefix scanned 1 times : Iterator size sum is 120868 : Iterated 4 times : Deleted 0 key-ranges : Range deletions covered 0 keys : Got errors 0 times : 0 CompactFiles() succeed : 0 CompactFiles() did not succeed stderr: WARNING: prefix_size is non-zero but memtablerep != prefix_hash Error : jewoongh injected test error This is not a real failure. New cache capacity = 4865393 Verification failed :( TEST FAILED. See kill option and exit code above!!! ``` Nothing in stderr ``` $> cat /tmp/error.log ``` New option added (blackbox) ``` $> python3 tools/db_crashtest.py blackbox --simple --max_key=25000000 --write_buffer_size=4194304 --print_stderr_separately 2> /tmp/error.log Running blackbox-crash-test with interval_between_crash=120 total-duration=6000 ... 
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0 Integrated BlobDB: blob cache disabled DB path: [/tmp/jewoongh/rocksdb_crashtest_blackbox7ybna32z] (Re-)verified 0 unique IDs Compaction filter factory: DbStressCompactionFilterFactory 2024/01/29-09:05:39 Initializing worker threads Crash-recovery verification passed :) 2024/01/29-09:05:46 Starting database operations 2024/01/29-09:05:46 Starting verification Stress Test : 235.917 micros/op 16000 ops/sec : Wrote 0.00 MB (0.16 MB/sec) (16% of 12 ops) : Wrote 2 times : Deleted 1 times : Single deleted 0 times : 9 read and 0 found the key : Prefix scanned 0 times : Iterator size sum is 0 : Iterated 0 times : Deleted 0 key-ranges : Range deletions covered 0 keys : Got errors 0 times : 0 CompactFiles() succeed : 0 CompactFiles() did not succeed ``` stderr printed separately ``` $> cat /tmp/error.log WARNING: prefix_size is non-zero but memtablerep != prefix_hash Error : jewoongh injected test error This is not a real failure. New cache capacity = 19461571 Verification failed :( ``` New option added (whitebox)** ``` $> python3 tools/db_crashtest.py whitebox --simple --max_key=25000000 --write_buffer_size=4194304 --print_stderr_separately 2> /tmp/error.log Running whitebox-crash-test with total-duration=10000 ... 
Integrated BlobDB: blob files enabled 0, min blob size 0, blob file size 268435456, blob compression type NoCompression, blob GC enabled 0, cutoff 0.250000, force threshold 1.000000, blob compaction readahead size 0, blob file starting level 0 Integrated BlobDB: blob cache disabled DB path: [/tmp/jewoongh/rocksdb_crashtest_whiteboxtwj0ihn6] (Re-)verified 157 unique IDs 2024/01/29-09:39:59 Initializing worker threads Crash-recovery verification passed :) 2024/01/29-09:40:16 Starting database operations 2024/01/29-09:40:16 Starting verification Stress Test : 742.474 micros/op 11801 ops/sec : Wrote 0.00 MB (0.27 MB/sec) (36% of 19 ops) : Wrote 7 times : Deleted 1 times : Single deleted 0 times : 8 read and 0 found the key : Prefix scanned 0 times : Iterator size sum is 0 : Iterated 4 times : Deleted 0 key-ranges : Range deletions covered 0 keys : Got errors 0 times : 0 CompactFiles() succeed : 0 CompactFiles() did not succeed TEST FAILED. See kill option and exit code above!!! ``` stderr printed separately ``` $> cat /tmp/error.log WARNING: prefix_size is non-zero but memtablerep != prefix_hash Error : jewoongh injected test error This is not a real failure. Error : jewoongh injected test error This is not a real failure. Error : jewoongh injected test error This is not a real failure. New cache capacity = 4865393 Verification failed :( ``` Reviewed By: akankshamahajan15 Differential Revision: D53187491 Pulled By: jaykorean fbshipit-source-id: 76f9100d08b96d014e41b7b88b206d69f0ae932b	2024-01-29 11:09:47 -08:00
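The behavior described for `print_stderr_separately` in the entry above can be sketched roughly as follows. This is a minimal illustration, not the code in `db_crashtest.py`: the `print_output` helper and its `out`/`err` parameters are hypothetical, and only the `stderr:` prefix and the option name come from the commit message.

```python
import io
import sys
from typing import TextIO


def print_output(stdout_text: str, stderr_text: str,
                 print_stderr_separately: bool = False,
                 out: TextIO = sys.stdout, err: TextIO = sys.stderr) -> None:
    """Print captured test output (hypothetical helper).

    Default: stderr is appended to the stdout stream after a `stderr:` line.
    With print_stderr_separately=True: stderr goes to the stderr stream as is.
    """
    out.write(stdout_text)
    if not stderr_text:
        return
    if print_stderr_separately:
        err.write(stderr_text)
    else:
        out.write("stderr:\n" + stderr_text)


# Capture both modes in memory to compare them.
combined_out, combined_err = io.StringIO(), io.StringIO()
print_output("Verification failed :(\n", "injected test error\n",
             False, combined_out, combined_err)

split_out, split_err = io.StringIO(), io.StringIO()
print_output("Verification failed :(\n", "injected test error\n",
             True, split_out, split_err)
```

With the option set, redirecting the script's stderr (as in `2> /tmp/error.log` in the test plan) is enough to extract only the error text.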
akankshamahajan	36704e9227	Improve crash test script to not rely on std::errors for failures. (#12265) Summary: Right now crash_test also relies on stderr output, in addition to verification, to detect errors/failures. However, that's not reliable: many internal services log benign errors/warnings, in which case our test script fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12265 Test Plan: Keep the stderr output but print it instead of failing, and monitor crash tests internally to see if any scenario relies solely on stderr, in which case the stress tests can be improved. Reviewed By: ajkr, cbi42 Differential Revision: D52967000 Pulled By: akankshamahajan15 fbshipit-source-id: 5328c8b69480c7946fe6a9c72f9ffeede70ac2ad	2024-01-26 11:39:47 -08:00
Jay Huh	d982260b63	Clean up after long-running whitebox crashtest (#12248 ) Summary: Currently, we treat the long-running whitebox_crash_test as passing. However, we were not cleaning up after ourselves when we killed the running test for running too long, which often caused out-of-space errors in subsequent tests (e.g., blackbox_crash_test after whitebox_crash_test). Unless we want to start treating these timeouts as failures and need the DB output for investigation now, we should properly clean up the tmp dir. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12248 Test Plan: ``` $> make crash_test -j ``` Reviewed By: ajkr Differential Revision: D52885342 Pulled By: jaykorean fbshipit-source-id: 7c1f2ca7cf03d0705bb14155ee44d5d7a411c132	2024-01-19 16:25:39 -08:00
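The cleanup step described in the entry above amounts to removing the test DB directory when a run is killed for taking too long. A minimal sketch, assuming a hypothetical helper name `cleanup_after_timeout` and a temporary directory standing in for the crash test DB path:

```python
import os
import shutil
import tempfile


def cleanup_after_timeout(db_dir: str) -> bool:
    """Remove the DB directory left behind by a killed run.

    Returns True if the directory no longer exists afterwards.
    """
    if os.path.isdir(db_dir):
        shutil.rmtree(db_dir, ignore_errors=True)
    return not os.path.exists(db_dir)


# Simulate a leftover DB directory containing one file.
leftover = tempfile.mkdtemp(prefix="rocksdb_crashtest_example_")
with open(os.path.join(leftover, "000001.log"), "w") as f:
    f.write("leftover data")

removed = cleanup_after_timeout(leftover)
```

Using `ignore_errors=True` keeps the cleanup best-effort, so a failure to delete one file does not itself fail the test run; whether that trade-off is right depends on how the timeouts are treated.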
Changyu Bi	9d58e3f63a	Disable LockWAL() for multiops_wp_txn stress test (#12221 ) Summary: We test LockWAL() and UnlockWAL() by checking that latest sequence number is not changed: `1a1f9f1660/db_stress_tool/db_stress_test_base.cc (L920-L937)`. With writeprepared transaction, sequence number can be advanced in SwitchMemtable::WriteRecoverableState() when writing recoverable state: `1a1f9f1660/db/db_impl/db_impl_write.cc (L1560)` This PR disables LockWAL() tests for writeprepared transaction for now. We probably need to change how we test LockWAL() for writeprepared before re-enabling this test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12221 Reviewed By: ajkr Differential Revision: D52677076 Pulled By: cbi42 fbshipit-source-id: 27ee694878edf63e8f4ad52f769d4db401f511bc	2024-01-11 15:54:11 -08:00
Yu Zhang	c5fbfd7ad8	Disable blobDB and UDT in memtable only combination in stress test (#12218 ) Summary: This feature combination is not fully working yet. Disable them so the stress tests have less noise. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12218 Reviewed By: cbi42 Differential Revision: D52643957 Pulled By: jowlyzhang fbshipit-source-id: 8815a18a3b5814cad4f7ec41f3fb94869302081e	2024-01-09 17:37:01 -08:00
Peter Dillinger	ea6ed0d56e	Re-enable ingest_external_file with mmap_read in crash test (#12201 ) Summary: I suspect the issue called out in https://github.com/facebook/rocksdb/issues/9357 was fixed in https://github.com/facebook/rocksdb/issues/11328 Pull Request resolved: https://github.com/facebook/rocksdb/pull/12201 Test Plan: `make blackbox_crash_test` for hours Reviewed By: ajkr Differential Revision: D52543075 Pulled By: pdillinger fbshipit-source-id: b705a6bdb2799a5f51ad2746df2083aa82f360a2	2024-01-04 13:46:07 -08:00
Hui Xiao	5b981b64f4	Intensify operations on same key in crash test (#12148 ) Summary: Context/Summary: Continued from https://github.com/facebook/rocksdb/pull/12127, we can randomly reduce the # max key to coerce more operations on the same key. My experimental run shows it surfaced more issue than just https://github.com/facebook/rocksdb/pull/12127. I also randomly reduce the related parameters, write buffer size and target file base, to adapt to randomly lower number of # max key. This creates 4 situations of testing, 3 of which are new: 1. high # max key with high write buffer size and target file base (existing) 2. high # max key with low write buffer size and target file base (new, will go through some rehearsal testing to ensure we don't run out of space with many files) 3. low # max key with high write buffer size and target file base (new, keys will stay in memory longer) 4. low # max key with low write buffer size and target file base (new, experimental runs show it surfaced even more issues) Pull Request resolved: https://github.com/facebook/rocksdb/pull/12148 Test Plan: - [Ongoing] Rehearsal stress test - Monitor production stress test Reviewed By: jaykorean Differential Revision: D52174980 Pulled By: hx235 fbshipit-source-id: bd5e11280826819ca9314c69bbbf05d481c6d105	2023-12-17 10:46:26 -08:00
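The four testing situations listed in the entry above come from two independent random choices. The sketch below only illustrates that combination: the concrete values and the helper names `params_for` and `pick_params` are assumptions made here, not the values used by the crash test script.

```python
import itertools
import random

LEVELS = ("high", "low")

# Hypothetical values, chosen only for illustration.
MAX_KEY = {"high": 25_000_000, "low": 100_000}
WRITE_BUFFER_SIZE = {"high": 32 * 1024 * 1024, "low": 1024 * 1024}
TARGET_FILE_SIZE_BASE = {"high": 16 * 1024 * 1024, "low": 512 * 1024}


def params_for(key_level: str, size_level: str) -> dict:
    """Build parameters for one situation; write buffer size and target
    file size base are lowered or raised together."""
    return {
        "max_key": MAX_KEY[key_level],
        "write_buffer_size": WRITE_BUFFER_SIZE[size_level],
        "target_file_size_base": TARGET_FILE_SIZE_BASE[size_level],
    }


def pick_params(rng: random.Random) -> dict:
    """Choose the key range and the size settings independently."""
    return params_for(rng.choice(LEVELS), rng.choice(LEVELS))


# All combinations: (high, high) is the existing one, the other three are new.
SITUATIONS = list(itertools.product(LEVELS, LEVELS))
sample = pick_params(random.Random(0))
```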
Yu Zhang	c2ab4e754b	Add initial support to stress test persist_user_defined_timestamps (#12124) Summary: This PR adds initial stress testing for the user-defined timestamps in memtable only feature. Each flavor of the `*_ts` crash test gets a 1 in 3 chance to run with timestamps not persisted; this setting is initialized once and kept consistent across the following re-runs. Besides disabling incompatible feature combinations to make the test run more stably, this initial stress test includes the following: 1) It currently only runs test methods that validate db state against expected state, not the ones that validate db state by comparing the result of one API to another, such as `TestMultiGet` (compared with `Get`), `TestMultiGetEntity`, and `TestIterate` (compares the source iterator to a control iterator). Because timestamps are removed, results from one API are not directly comparable to another's as things stand. More test logic is needed to handle that; it will be added in a follow-up. 2) Even when comparing db state to expected state, the db can sometimes return `InvalidArgument` because timestamps were flushed and removed. Added some logic to handle that. 3) When timestamps are not persisted, we don't try to read with an older timestamp, since that makes it easier to get `InvalidArgument`. This capability is not yet needed by our customer, so it's disabled for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12124 Test Plan: ran multiple flavors of this test on continuous run for some time before check-in Reviewed By: ltamasi Differential Revision: D51916267 Pulled By: jowlyzhang fbshipit-source-id: 3f3eb5f9618d05d296062820e0ef5cb8edc7c2b2	2023-12-12 09:35:29 -08:00
Hui Xiao	179d2c7646	Intensify "xxx_one_in"'s default value in crash test (#12127) Summary: Context/Summary: My experimental stress runs with more frequent "xxx_one_in" surfaced a couple of interesting bugs/issues with RocksDB or the crash test framework in the past. We now consider changing the default values so they are run more frequently in the production testing environment. Increase frequency by 2 orders of magnitude for most parameters, except for error-prone features, e.g., manual compaction and file ingestion (increased by 3 orders), and expensive features, e.g., checksum verification (increased by 1 order). Pull Request resolved: https://github.com/facebook/rocksdb/pull/12127 Test Plan: Monitor CI to see if it did surface more interesting bugs/issues. If not, we may consider intensifying even more. Reviewed By: pdillinger Differential Revision: D51954235 Pulled By: hx235 fbshipit-source-id: 92046cb7c52a37212f19ab7965b40f77b90b08b1	2023-12-08 10:22:14 -08:00
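Raising the frequency of a "1 in N" event by some orders of magnitude means dividing N by a power of ten. The arithmetic can be sketched as below; the `intensify` helper and the starting values are hypothetical, and treating `0` as "disabled" is an assumption about the flag convention rather than something stated in the commit.

```python
def intensify(one_in: int, orders: int) -> int:
    """Make a 1-in-N event 10**orders times more frequent.

    The result never goes below 1 (every operation). Values <= 0 are
    returned unchanged, assuming they mean "disabled".
    """
    if one_in <= 0:
        return one_in
    return max(1, one_in // 10 ** orders)


# Hypothetical starting values, one per category named in the summary.
most_params = intensify(1_000_000, 2)   # most parameters: 2 orders
error_prone = intensify(1_000_000, 3)   # e.g. manual compaction: 3 orders
expensive = intensify(1_000_000, 1)     # e.g. checksum verification: 1 order
```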
Yu Zhang	7eca51dfc3	Refactor crash test stderr parsing logic into a function (#12109 ) Summary: This is a simple refactor for the crash test script to put shared logic for parsing stderr into a function. There is no functional change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12109 Test Plan: manually tested the script Reviewed By: ajkr Differential Revision: D51692172 Pulled By: jowlyzhang fbshipit-source-id: d346d64e981d9c489c380ff6ce33296a224b5877	2023-12-01 11:01:29 -08:00
Peter Dillinger	92dc5f3e67	AutoHCC: fix a bug with "blind" Insert (#12046) Summary: I have finally tracked down and fixed a bug affecting AutoHCC that was causing CI crash test assertion failures in AutoHCC when using secondary cache, but I was only able to reproduce locally a couple of times, after very long runs/repetitions. It turns out that the essential feature used by secondary cache to trigger the bug is Insert without keeping a handle, which is otherwise rarely used in RocksDB and not incorporated into cache_bench (also used for targeted correctness stress testing) until this change (new option `-blind_insert_percent`). The problem was in copying some logic from FixedHCC that makes the entry "sharable" but unreferenced once populated, if no reference is to be saved. The problem in AutoHCC is that we can only add the entry to a chain after it is in the sharable state, and it must be removed from the chain while in the "under (de)construction" state and before it is back in the "empty" state. Also, it is possible for Lookup to find entries that are not connected to any chain, by design for efficiency, and for Release to erase_if_last_ref. Therefore, we could have * Thread 1 starts to Insert a cache entry without keeping ref, and pauses before adding to the chain. * Thread 2 finds it with Lookup optimizations, and then does Release with `erase_if_last_ref=true` causing it to trigger erasure on the entry. It successfully locks the home chain for the entry and purges any entries pending erasure. It is OK that this entry is not found on the chain, as another thread is allowed to remove it from the chain before we are able to (but after it is marked for (de)construction). And after the purge of the chain, the entry is marked empty. 
* Thread 1 resumes in adding the slot (presumed entry) to the home chain for what was being inserted, but that now violates invariants and sets up a race or double-chain-reference as another thread could insert a new entry in the slot and try to insert into a different chain. This is easily fixed by holding on to a reference until inserted onto the chain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12046 Test Plan: As I don't have a reliable local reproducer, I triggered 20 runs of internal CI on fbcode_blackbox_crash_test that were previously failing in AutoHCC with about 1/3 probability, and they all passed. Also re-enabling AutoHCC in the crash test with this change. (Revert https://github.com/facebook/rocksdb/issues/12000) Reviewed By: jowlyzhang Differential Revision: D51016979 Pulled By: pdillinger fbshipit-source-id: 3840fb829d65b97c779d8aed62a4a4a433aeff2b	2023-11-06 16:06:01 -08:00
Changyu Bi	b48480cfd0	Enable `TestIterateAgainstExpected()` in more crash tests (#12040) Summary: The db_stress flag `verify_iterator_with_expected_state_one_in` is only enabled in crash test if the --simple flag is set. This PR enables it for all supported crash tests by enabling it by default. This adds coverage for --txn and --enable_ts crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12040 Test Plan: ran crash tests that disabled this flag before for a few hours ``` python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --txn --txn_write_policy=[0,1,2] python3 ./tools/db_crashtest.py blackbox --verify_iterator_with_expected_state_one_in=1 --enable_ts ``` Reviewed By: ajkr, hx235 Differential Revision: D50980001 Pulled By: cbi42 fbshipit-source-id: 3daf6b4c32bdddc5df057240068162aa1a907587	2023-11-03 16:27:11 -07:00
Akanksha Mahajan	917fd87513	Error out in case of std errors in blackbox test and export file in TARGETS Summary: - Right now the blackbox test doesn't exit when there is output on stderr, as the whitebox crash test does. As a result, those errors are swallowed; the test only errors out if the state is unexpected. One example noticed in the blackbox crash test - ``` stderr has error message: *Error restoring historical expected values: Corruption: DB is older than any restorable expected state* Running db_stress with pid=30454: /packages/rocksdb_db_stress_internal_repo/rocks_db_stress .... ``` - This diff also adds support for exporting the db_crashtest.py file so it can be used by a different repo. Reviewed By: ajkr Differential Revision: D50564889 fbshipit-source-id: 7bafbbc6179dc79467ca2b680fe83afc7850616a	2023-10-24 11:46:18 -07:00
Peter Dillinger	4d9f9733b2	Disable AutoHCC in crash test (#12000 ) Summary: ... until I can reproduce and resolve assertion failures (mostly in PurgeImplLocked) seen in crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/12000 Test Plan: make blackbox_crash_test Reviewed By: hx235 Differential Revision: D50565984 Pulled By: pdillinger fbshipit-source-id: 5eea1638ff2683c41b4f65ee1ffc2398071911e7	2023-10-23 12:23:13 -07:00
anand76	84af7cf0bd	Sanitize db_stress arguments when secondary_cache_uri is not empty (#11967 ) Summary: When `secondary_cache_uri` is non-empty and the `cache_type` is not a tiered cache, then sanitize `compressed_secondary_cache_size` to 0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11967 Test Plan: Run crash test Reviewed By: akankshamahajan15 Differential Revision: D50346157 Pulled By: anand1976 fbshipit-source-id: 57bcbad2ec81fa736f1539a0a41ed6854ded2077	2023-10-16 17:28:36 -07:00
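The sanitization rule in the commit above can be sketched in Python, the language of db_crashtest.py. This is only an illustration: the function name is hypothetical, and identifying a tiered cache by a `tiered_` prefix on `cache_type` is an assumption, not something the commit message states.

```python
def sanitize_secondary_cache(params):
    """Return a copy of params with the compressed secondary cache size
    zeroed when a secondary cache URI is set and the cache is not tiered.

    Hypothetical helper; the "tiered_" prefix check is an assumption.
    """
    out = dict(params)
    uri = out.get("secondary_cache_uri", "")
    is_tiered = out.get("cache_type", "").startswith("tiered_")
    if uri and not is_tiered:
        out["compressed_secondary_cache_size"] = 0
    return out
```

Returning a copy rather than mutating the input keeps the rule easy to check in isolation.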
Jay Huh	d2daa10afc	Fix crash_test_with_best_efforts_recovery (#11938 ) Summary: Thanks ltamasi and ajkr for initial investigations on the test failure. Per the investigations, the following scenario is likely causing the test to fail. 1. Recovery is needed (could be any reason during crash test) 2. Trying to recover from the latest manifest fails (likely due to read error injection) 3. DB opens with recovery from the next manifest which is different from step 2. 4. Expected state is based on the manifest we tried and failed in step 2. 5. Two manifests used in step 2 and 3 are confirmed to have difference in LSM trees (Thanks ltamasi again for the finding). ``` 2023/10/05-11:24:18.942189 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007184 ... 2023/10/05-11:24:18.978007 56341 [db/version_set.cc:6079] Trying to recover from manifest: /dev/shm/rocksdb_test/rocksdb_crashtest_blackbox/MANIFEST-007180 ``` ``` [ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007184_renamed_ > 2 [ltamasi@devbig1024.prn1 /tmp/x]$ ldb manifest_dump --hex --path=MANIFEST-007180_renamed_ > 1 [ltamasi@devbig1024.prn1 /tmp/x]$ diff 1 2 --- 1 2023-10-09 10:29:16.966215207 -0700 +++ 2 2023-10-09 10:29:11.984241645 -0700 @@ -13,7 +13,7 @@ 7174:3950254[1875617 .. 2203952]['000000000003415B000000000000012B000000000000007D' seq:1906214, type:1 .. '000000000003CA59000000000000012B000000000000005C' seq:2039838, type:1] 7175:88060[2074748 .. 2203892]['000000000003CA6300000000000000CF78787878787878' seq:2167539, type:2 .. '000000000003D08F000000000000012B0000000000000130' seq:2112478, type:0] --- level 6 --- version# 1 --- - 7057:3132633[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. '0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1] + 7219:2135565[0 .. 2046144]['0000000000000009000000000000000978' seq:0, type:1 .. 
'0000000000005F8B000000000000012B00000000000002AC' seq:0, type:1] 7061:827724[0 .. 2046131]['0000000000005F95000000000000000778787878787878' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000113' seq:0, type:1] 6763:1352[0 .. 0]['000000000000784F000000000000012B0000000000000129' seq:0, type:1 .. '000000000000784F000000000000012B0000000000000129' seq:0, type:1] 7173:4812291[0 .. 2203957]['000000000000784F000000000000012B0000000000000138' seq:0, type:1 .. '0000000000020FAE787878787878' seq:0, type:1] @@ -77,4 +77,4 @@ --- level 61 --- version# 1 --- --- level 62 --- version# 1 --- --- level 63 --- version# 1 --- -next_file_number 7182 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015 +next_file_number 7221 last_sequence 2203963 prev_log_number 0 max_column_family 0 min_log_number_to_keep 7015 ``` We have two options to fix this. Either skip verification against expected state or disable read injection when BE recovery is enabled. I chose to skip verification against expected state per discussion. (See comments in this PR) Please note that some linter changes were included in this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11938 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery ``` Reviewed By: ltamasi Differential Revision: D50136341 Pulled By: jaykorean fbshipit-source-id: ac7434d592aebc148bfc3a4fcaa34936f136b95c	2023-10-11 14:26:10 -07:00
anand76	20b4f1356e	Enable write fault injection in db_stress (#11924 ) Summary: This PR depends on https://github.com/facebook/rocksdb/issues/11879 . Enable write fault injection for the basic whitebox, blackbox, and cf_consistency modes. For other test modes like multiops_txn, best_efforts_recovery etc., leave it disabled for now until we can do more testing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11924 Reviewed By: ajkr Differential Revision: D50178252 Pulled By: anand1976 fbshipit-source-id: 5794f81c14cded1eb28762b2de818dfff1c1a34c	2023-10-11 11:28:00 -07:00
anand76	5b11f5a3a2	Add TieredCache and compressed cache capacity change to db_stress (#11935 ) Summary: Add `TieredCache` to the cache types tested by db_stress. Also add compressed secondary cache capacity change, and `WriteBufferManager` integration with `TieredCache` for memory charging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11935 Test Plan: Run whitebox/blackbox crash tests locally Reviewed By: akankshamahajan15 Differential Revision: D50135365 Pulled By: anand1976 fbshipit-source-id: 7d73ed00c00a0953d86e49f35cce6bd550ba00f1	2023-10-10 13:12:18 -07:00
akankshamahajan	40b618f234	Enable auto_readahead_size in db_stress (#11916 ) Summary: Depends on https://github.com/facebook/rocksdb/pull/11884 This PR only enables the option in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11916 Reviewed By: anand1976 Differential Revision: D49834479 Pulled By: akankshamahajan15 fbshipit-source-id: 103a64fd7b23236493a8f3064d4c5af83656bd18	2023-10-03 14:41:26 -07:00
Levi Tamasi	01e2d33565	Add the wide-column aware merge API to the stress tests (#11906 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11906 The patch adds stress test coverage for the wide-column aware `FullMergeV3` API by implementing a new `DBStressWideMergeOperator`. This operator is similar to `PutOperator` / `PutOperatorV2` in the sense that its result is based on the last merge operand; however, the merge result can be either a plain value or a wide-column entity, depending on the value base encoded into the operand and the value of the `use_put_entity_one_in` stress test parameter. Following the same rule for merge results that we do for writes ensures that the queries issued by the validation logic receive the expected results. The new operator is used instead of `PutOperatorV2` whenever `use_put_entity_one_in` is positive. Note that the patch also makes it possible to set `use_put_entity_one_in` and `use_merge` (but not `use_full_merge_v1`) at the same time, giving `use_put_entity_one_in` precedence, so the stress test will use `PutEntity` for writes passing the `use_put_entity_one_in` check described above and `Merge` for any other writes. Reviewed By: jaykorean Differential Revision: D49760024 fbshipit-source-id: 3893602c3e7935381b484f4f5026f1983e3a04a9	2023-09-29 08:54:50 -07:00
akankshamahajan	bd655b9af3	Disable AutoReadaheadSize in stress tests (#11883 ) Summary: Crash tests are failing with recent change of auto_readahead_size. Disable it in stress tests and enable it with fix to clear the crash tests failures. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11883 Reviewed By: pdillinger Differential Revision: D49597854 Pulled By: akankshamahajan15 fbshipit-source-id: 0af8ca7414ee9b92f244ee0fb811579c3c052b41	2023-09-25 09:06:22 -07:00
Changyu Bi	ba5897ada8	Fix stress test failure due to write fault injections and disable write fault injection (#11859 ) Summary: This PR contains two fixes: 1. disable write fault injection since it caused several other kinds of internal stress test failures. I'll try to fix those separately before enabling it again. 2. Fix segfault like ``` https://github.com/facebook/rocksdb/issues/5 0x000000000083dc43 in rocksdb::port::Mutex::Lock (this=0x30) at internal_repo_rocksdb/repo/port/port_posix.cc:80 80 internal_repo_rocksdb/repo/port/port_posix.cc: No such file or directory. https://github.com/facebook/rocksdb/issues/6 0x0000000000465142 in rocksdb::MutexLock::MutexLock (mu=0x30, this=<optimized out>) at internal_repo_rocksdb/repo/util/mutexlock.h:37 37 internal_repo_rocksdb/repo/util/mutexlock.h: No such file or directory. https://github.com/facebook/rocksdb/issues/7 rocksdb::FaultInjectionTestFS::DisableWriteErrorInjection (this=0x0) at internal_repo_rocksdb/repo/utilities/fault_injection_fs.h:505 505 internal_repo_rocksdb/repo/utilities/fault_injection_fs.h: No such file or directory. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11859 Test Plan: db_stress with no fault injection: `./db_stress --write_fault_one_in=0 --read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --sync_fault_injection=0` Reviewed By: jaykorean Differential Revision: D49408247 Pulled By: cbi42 fbshipit-source-id: 0ca01f20e6e81bf52af77818b50d562ef7462165	2023-09-19 08:33:05 -07:00
Changyu Bi	c90807d103	Inject retryable write IOError when writing to SST files in stress test (#11829 ) Summary: * db_crashtest.py now may set `write_fault_one_in` to 500 for blackbox and whitebox simple test. * Error injection only applies to writing to SST files. Flush error will cause DB to pause background operations and auto-resume. Compaction error will just re-schedule later. * File ingestion and back up tests are updated to check if the result status is due to an injected error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11829 Test Plan: a full round of whitebox simple and blackbox simple crash test * `python3 ./tools/db_crashtest.py whitebox/blackbox --simple --write_fault_one_in=500` Reviewed By: ajkr Differential Revision: D49256962 Pulled By: cbi42 fbshipit-source-id: 68e0c9648d8e03bad39c7672b25d5500fc286d97	2023-09-18 16:23:26 -07:00
Peter Dillinger	1c6faf3587	Make RibbonFilterPolicy::bloom_before_level mutable (SetOptions()) (#11838 ) Summary: An internal user wants to be able to dynamically switch between Bloom and Ribbon filters, without a custom FilterPolicy. Making `filter_policy` mutable would actually make issue https://github.com/facebook/rocksdb/issues/10079 worse, because it would be a race on a pointer field, not just on scalars. As a reasonable compromise until that is fixed, I am enabling dynamic control over Bloom vs. Ribbon choice by making RibbonFilterPolicy::bloom_before_level mutable, and doing that safely by using an atomic. I've also slightly tweaked the interpretation of that field so that setting it to INT_MAX really means "always Bloom." Pull Request resolved: https://github.com/facebook/rocksdb/pull/11838 Test Plan: unit tests added/extended. crash test updated for SetOptions call and tested under TSAN with amplified probability (lower set_options_one_in). Reviewed By: ajkr Differential Revision: D49296284 Pulled By: pdillinger fbshipit-source-id: e4251c077510df9a9c719876f482448c0d15402a	2023-09-15 15:46:10 -07:00
Andrew Kryczka	392d6957cd	Added compaction read errors to `db_stress` (#11789 ) Summary: - Fixed misspellings of "inject" - Made user read errors retryable when `FLAGS_inject_error_severity == 1` - Added compaction read errors when `FLAGS_read_fault_one_in > 0`. These are always retryable so that the DB will keep accepting writes - Reenabled setting `compaction_readahead_size` in crash test. The reason for disabling it was to "keep the test clean", which is not a good enough reason to skip testing it Pull Request resolved: https://github.com/facebook/rocksdb/pull/11789 Test Plan: With https://github.com/facebook/rocksdb/issues/11782 reverted, reproduced the bug: - Build: `make -j56 db_stress` - Command: `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --interval=10 --max_key=1000000` - Output: ``` stderr has error message: *put or merge error: Corruption: Compaction number of input keys does not match number of keys processed.* ``` Reviewed By: cbi42 Differential Revision: D48939994 Pulled By: ajkr fbshipit-source-id: a1efb799efecdfd5d9cfd185e4a6321db8fccfbb	2023-09-05 10:41:29 -07:00
Akanksha Mahajan	6353c6e2fb	Add new experimental ReadOption auto_readahead_size to db_bench and db_stress (#11729 ) Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/11729 Test Plan: make crash_test -j32 Reviewed By: anand1976 Differential Revision: D48534820 Pulled By: akankshamahajan15 fbshipit-source-id: 3a2a28af98dfad164b82ddaaf9fddb94c53a652e	2023-08-24 14:58:27 -07:00
Fuat Basik	bc448e9c89	Run db_stress for final time to ensure un-interrupted validation (#11592 ) Summary: In blackbox tests, the db_stress command is always run with a timeout. The timeout can fire during validation, leaving some of the keys unchecked. Since key validation is done in order, it is quite likely that keys towards the end of the set are never validated. This PR adds a final execution, without a timeout, to ensure validation is executed for all keys at least once. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11592 Reviewed By: cbi42 Differential Revision: D48003998 Pulled By: hx235 fbshipit-source-id: 72543475a932f12cf0f57534b7e3b6e07e87080f	2023-08-23 15:24:23 -07:00
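The control flow described in the commit above, several timed runs followed by one run with no timeout, might look roughly like the sketch below. It is not the code from db_crashtest.py: both function names are hypothetical, and the command passed in stands in for a real db_stress invocation.

```python
import subprocess


def run_once(cmd, timeout=None):
    """Run cmd; return (hit_timeout, returncode).

    subprocess.run kills the child and raises TimeoutExpired when the
    timeout elapses, which stands in for the simulated crash.
    """
    try:
        proc = subprocess.run(cmd, timeout=timeout)
        return False, proc.returncode
    except subprocess.TimeoutExpired:
        return True, None


def blackbox_loop(cmd, iterations, interval_sec):
    """Hypothetical driver: timed runs, then one uninterrupted run."""
    for _ in range(iterations):
        # Each timed run may be cut off partway through validation.
        run_once(cmd, timeout=interval_sec)
    # The final run has no timeout, so in-order key validation can reach
    # the end of the key space.
    return run_once(cmd, timeout=None)
```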
Changyu Bi	c2aad555c3	Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666 ) Summary: Optionally enable zstd checksum flag (`d857369028/lib/zstd.h (L428)`) to detect corruption during decompression. Main changes are in compression.h: * User can set CompressionOptions::checksum to true to enable this feature. * We enable this feature in ZSTD by setting the checksum flag in ZSTD compression context: `ZSTD_CCtx`. * Uses `ZSTD_compress2()` to do compression since it supports frame parameter like the checksum flag. Compression level is also set in compression context as a flag. * Error handling during decompression to propagate error message from ZSTD. * Updated microbench to test read performance impact. About compatibility, the current compression decoders should continue to work with the data created by the new compression API `ZSTD_compress2()`: https://github.com/facebook/zstd/issues/3711. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11666 Test Plan: * Existing unit tests for zstd compression * Add unit test `DBTest2.ZSTDChecksum` to test the corruption case * Manually tested that compression levels, parallel compression, dictionary compression, index compression all work with the new ZSTD_compress2() API. * Manually tested with `sst_dump --command=recompress` that different compression levels and dictionary compression settings all work. * Manually tested compiling with older versions of ZSTD: v1.3.8, v1.1.0, v0.6.2. * Perf impact: from public benchmark data: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for checksum and https://github.com/facebook/zstd#benchmarks, if decompression is 1700MB/s and checksum computation is 70000MB/s, checksum computation is an additional ~2.4% time for decompression. Compression is slower and checksumming should be less noticeable. 
* Microbench: ``` TEST_TMPDIR=/dev/shm ./branch_db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:1048576/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:0/compression_type:7/compression_checksum:1/no_blockcache:1/iterations:10000/threads:1 --benchmark_repetitions=100 Min out of 100 runs: Main: 10390 10436 10456 10484 10499 10535 10544 10545 10565 10568 After this PR, checksum=false 10285 10397 10503 10508 10515 10557 10562 10635 10640 10660 After this PR, checksum=true 10827 10876 10925 10949 10971 11052 11061 11063 11100 11109 ``` * db_bench: ``` Write perf TEST_TMPDIR=/dev/shm/ ./db_bench_ichecksum --benchmarks=fillseq[-X10] --compression_type=zstd --num=10000000 --compression_checksum=.. [FillSeq checksum=0] fillseq [AVG 10 runs] : 281635 (± 31711) ops/sec; 31.2 (± 3.5) MB/sec fillseq [MEDIAN 10 runs] : 294027 ops/sec; 32.5 MB/sec [FillSeq checksum=1] fillseq [AVG 10 runs] : 286961 (± 34700) ops/sec; 31.7 (± 3.8) MB/sec fillseq [MEDIAN 10 runs] : 283278 ops/sec; 31.3 MB/sec Read perf TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=readrandom[-X20] --num=100000000 --reads=1000000 --use_existing_db=true --readonly=1 [Readrandom checksum=1] readrandom [AVG 20 runs] : 360928 (± 3579) ops/sec; 4.0 (± 0.0) MB/sec readrandom [MEDIAN 20 runs] : 362468 ops/sec; 4.0 MB/sec [Readrandom checksum=0] readrandom [AVG 20 runs] : 380365 (± 2384) ops/sec; 4.2 (± 0.0) MB/sec readrandom [MEDIAN 20 runs] : 379800 ops/sec; 4.2 MB/sec Compression TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=compress[-X20] --compression_type=zstd --num=100000000 --compression_checksum=1 checksum=1 compress [AVG 20 runs] : 54074 (± 634) ops/sec; 211.2 (± 2.5) MB/sec compress [MEDIAN 20 runs] : 54396 ops/sec; 212.5 MB/sec checksum=0 compress [AVG 20 runs] : 54598 (± 393) ops/sec; 213.3 (± 1.5) MB/sec compress [MEDIAN 20 runs] : 54592 ops/sec; 213.3 MB/sec Decompression: TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=uncompress[-X20] 
--compression_type=zstd --compression_checksum=1 checksum = 0 uncompress [AVG 20 runs] : 167499 (± 962) ops/sec; 654.3 (± 3.8) MB/sec uncompress [MEDIAN 20 runs] : 167210 ops/sec; 653.2 MB/sec checksum = 1 uncompress [AVG 20 runs] : 167980 (± 924) ops/sec; 656.2 (± 3.6) MB/sec uncompress [MEDIAN 20 runs] : 168465 ops/sec; 658.1 MB/sec ``` Reviewed By: ajkr Differential Revision: D48019378 Pulled By: cbi42 fbshipit-source-id: 674120c6e1853c2ced1436ac8138559d0204feba	2023-08-18 15:01:59 -07:00
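The "~2.4%" estimate in the test plan above follows from the two quoted throughputs alone. A short worked calculation, using only the figures from the commit message (the variable names are illustrative):

```python
# Throughputs quoted in the commit message (public benchmark data).
decompress_mb_per_s = 1700.0
checksum_mb_per_s = 70000.0

# Time spent per MB is the reciprocal of throughput, so the checksum adds
# (1 / 70000) seconds on top of (1 / 1700) seconds per MB decompressed.
overhead = (1.0 / checksum_mb_per_s) / (1.0 / decompress_mb_per_s)

print(f"{overhead:.1%}")  # prints 2.4%
```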
Changyu Bi	d1ff401472	Delay bottommost level single file compactions (#11701 ) Summary: For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small number of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions. * The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if oldest_snapshot is larger than its non-zero largest_seqno and the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701 Test Plan: * Add two unit tests to test when bottommost_file_compaction_delay > 0. * Ran crash test with the new option. 
Reviewed By: jaykorean, ajkr Differential Revision: D48331564 Pulled By: cbi42 fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8	2023-08-16 17:45:44 -07:00
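The marking condition described in the commit above can be written as a small predicate. This is a sketch, not the RocksDB implementation: the function and parameter names are hypothetical, and treating "old enough" as age greater than or equal to the delay (with both in seconds) is an assumption.

```python
def marked_for_bottommost_compaction(largest_seqno, oldest_snapshot,
                                     file_age_sec, delay_sec):
    """Hypothetical predicate for the rule in the commit message.

    A file is marked only if its largest_seqno is non-zero, oldest_snapshot
    is larger than that seqno, and the file is old enough. The >= comparison
    and the units are assumptions.
    """
    clearable = largest_seqno != 0 and oldest_snapshot > largest_seqno
    old_enough = file_age_sec >= delay_sec
    return clearable and old_enough
```

With `delay_sec=0` the age check always passes, so the predicate reduces to the seqno comparison alone.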
Andrew Kryczka	0b6ee88d51	clarify TODO for whitebox disable_wal=1 in db_crashtest.py (#11665 ) Summary: See https://github.com/facebook/rocksdb/issues/11613 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11665 Reviewed By: hx235 Differential Revision: D48010507 Pulled By: ajkr fbshipit-source-id: 65c6d87d2c6ffc9d25f1d17106eae467ec528082	2023-08-16 09:43:20 -07:00
Jay Huh	b63018fb59	Wide Column Ingestion in CrashTest (#11697 ) Summary: `PutEntity` is now supported in SST file writer (https://github.com/facebook/rocksdb/issues/11688). This PR enables ingestion of wide column data in the stress/crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11697 Test Plan: ``` python3 tools/db_crashtest.py blackbox --simple --duration=300 --ingest_external_file_one_in=2 --use_put_entity_one_in=2 --max_key=1048576 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0 --key_len_percent_dist="1,30,69" ``` Reviewed By: ltamasi Differential Revision: D48370719 Pulled By: jaykorean fbshipit-source-id: 5855d3112b37b2fb300d05e6df110d899855d77d	2023-08-15 16:13:13 -07:00
Peter Dillinger	ef6f025563	Placeholder for AutoHyperClockCache, more (#11692 ) Summary: * The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation. * Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache). * Simplify some existing tests not to depend directly on cache type. * Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation. * Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test. * Add another templated class BaseHyperClockCache to reduce future copy-paste * Added ReportProblems support to cache_bench * Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC: ``` 2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17 ``` In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692 Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in. Reviewed By: jowlyzhang Differential Revision: D48247413 Pulled By: pdillinger fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da	2023-08-11 16:27:38 -07:00
Hui Xiao	38ecfabed2	Remove comment about locking about TestIterateAgainstExpected (#11695 ) Summary: Context/Summary After https://github.com/facebook/rocksdb/pull/11058, we no longer lock the key range to iterate in TestIterateAgainstExpected, except for working with timestamp feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11695 Test Plan: no code change Reviewed By: ajkr Differential Revision: D48276668 Pulled By: hx235 fbshipit-source-id: dc92a3708b2281dc737c0877fb755548bf03a9fc	2023-08-11 13:14:04 -07:00
Hui Xiao	9a034801ce	Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444 ) Summary: Context/Summary: - Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`. - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation) - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main. - Misc - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority` - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444 Test Plan: - CI fake db crash/stress test - Microbenchmarking Build `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench` - google benchmark version: `604f6fd3f4` - db_basic_bench_base: upstream - db_basic_bench_pr: db_basic_bench_base + this PR - asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read) - asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR Test Get ``` TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat\|base\|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000 ``` Result ``` Coming soon ``` AsyncRead ``` TEST_TMPDIR=/dev/shm 
./asyncread_db_basic_bench_{base\|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base\|pr}.out ``` Result ``` Base: 1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008, PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before): 1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053, ``` Reviewed By: ajkr Differential Revision: D45918925 Pulled By: hx235 fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a	2023-08-08 17:26:50 -07:00
Vardhan	87a21d08fe	Add an option to trigger flush when the number of range deletions reach a threshold (#11358 ) Summary: Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358 Test Plan: * New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions` * Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions. Reviewed By: ajkr Differential Revision: D46582680 Pulled By: cbi42 fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da	2023-08-02 19:58:56 -07:00
Peter Dillinger	7a1b0207e6	format_version=6 and context-aware block checksums (#9058 ) Summary: ## Context checksum All RocksDB checksums currently use 32 bits of checking power, which should be 1 in 4 billion false negative (FN) probability (failing to detect corruption). This is true for random corruptions, and in some cases small corruptions are guaranteed to be detected. But some possible corruptions, such as in storage metadata rather than storage payload data, would have a much higher FN rate. For example: * Data larger than one SST block is replaced by data from elsewhere in the same or another SST file. Especially with block_align=true, the probability of exact block size match is probably around 1 in 100, making the FN probability around the same. Without `block_align=true` the probability of same block start location is probably around 1 in 10,000, for FN probability around 1 in a million. To solve this problem in new format_version=6, we add "context awareness" to block checksum checks. The stored and expected checksum value is modified based on the block's position in the file and which file it is in. The modifications are cleverly chosen so that, for example * blocks within about 4GB of each other are guaranteed to use different context * blocks that are offset by exactly some multiple of 4GiB are guaranteed to use different context * files generated by the same process are guaranteed to use different context for the same offsets, until wrap-around after 2^32 - 1 files Thus, with format_version=6, if a valid SST block and checksum is misplaced, its checksum FN probability should be essentially ideal, 1 in 4B. ## Footer checksum This change also adds checksum protection to the SST footer (with format_version=6), for the first time without relying on whole file checksum. To prevent a corruption of the format_version in the footer (e.g. 
6 -> 5) to defeat the footer checksum, we change much of the footer data format including an "extended magic number" in format_version 6 that would be interpreted as empty index and metaindex block handles in older footer versions. We also change the encoding of handles to free up space for other new data in footer. ## More detail: making space in footer In order to keep footer the same size in format_version=6 (avoid change to IO patterns), we have to free up some space for new data. We do this two ways: * Metaindex block handle is encoded down to 4 bytes (from 10) by assuming it immediately precedes the footer, and by assuming it is < 4GB. * Index block handle is moved into metaindex. (I don't know why it was in footer to begin with.) ## Performance In case of small performance penalty, I've made a "pay as you go" optimization to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep with the only field used in that structure after construction: `prefix_extractor`. This makes the PR an overall performance improvement (results below). Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6, even including that improvement for both. That's based on extreme case table write performance testing, many files with many blocks. This is relatively checksum intensive (small blocks) and salt generation intensive (small files). 
``` (for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 \| grep micros/op \| tee out awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out ``` Each value below is ops/s averaged over 100 runs, run simultaneously with competing configuration for load fairness Before -> after (both fv=5): 483530 -> 483673 (negligible) Re-run 1: 480733 -> 485427 (1.0% faster) Re-run 2: 483821 -> 484541 (0.1% faster) Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster) Re-run 1: 482212 -> 485075 (0.6% faster) Re-run 2: 483590 -> 484073 (0.1% faster) After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster) Re-run 1: 485331 -> 483385 (0.4% slower) Re-run 2: 485283 -> 483435 (0.4% slower) Re-run 3: 483647 -> 486109 (0.5% faster) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058 Test Plan: unit tests included (table_test, db_properties_test, salt in env_test). General DB tests and crash test updated to test new format_version. Also temporarily updated the default format version to 6 and saw some test failures. Almost all were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum, though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block checksum, just assuming it was verified in opening the index reader (probably usually true but probably not always true). Some other concerns about VerifyChecksum are left in FIXME comments. The only remaining test failure on change of default (in block_fetcher_test) now has a comment about how to upgrade the test. The format compatibility test does not need updating because we have not updated the default format_version. 
Reviewed By: ajkr, mrambacher Differential Revision: D33100915 Pulled By: pdillinger fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee	2023-07-30 16:40:01 -07:00
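The footer-space idea in the commit above (a 4-byte metaindex handle that relies on the block sitting immediately before the footer and being < 4GB) can be sketched roughly as follows. This is an illustrative Python sketch, not RocksDB's actual code: the function names, the little-endian fixed-width encoding, and the 5-byte block trailer constant are all assumptions made for the example.

```python
import struct
from typing import Tuple

# Assumption for illustration: every block is followed by a fixed-size
# trailer (1-byte compression type + 4-byte checksum).
BLOCK_TRAILER_SIZE = 5


def encode_metaindex_size(size: int) -> bytes:
    """Store only the block size, in 4 bytes; the offset is implied."""
    if not 0 <= size < 2**32:
        raise ValueError("metaindex block must be < 4GB to fit in 4 bytes")
    # Little-endian byte order is an assumption of this sketch.
    return struct.pack("<I", size)


def decode_metaindex_handle(encoded: bytes, footer_offset: int) -> Tuple[int, int]:
    """Recover (offset, size), assuming the block ends right before the footer."""
    (size,) = struct.unpack("<I", encoded)
    offset = footer_offset - BLOCK_TRAILER_SIZE - size
    return offset, size
```

Because the offset is derived from the footer position, only the size has to be stored, which is where the saving over a full (offset, size) handle comes from.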
Akanksha Mahajan	5187ac2af3	Add skip_tmpdir_check arg in crash script (#11539 ) Summary: Add `skip_tmpdir_check` argument in crash script. If `tmp_dir` is on remote storage and exists, `isdir` will be false (checking on local storage), leading to exit. By passing `skip_tmpdir_check` with `crashtest.py`, the dir check can be skipped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11539 Test Plan: Ran locally Reviewed By: anand1976 Differential Revision: D46740456 Pulled By: akankshamahajan15 fbshipit-source-id: 8726882ef53d2c84b604c7515e84eda6d1bf797c	2023-06-27 12:30:19 -07:00
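The behavior described in the commit above might look something like the sketch below. It is a minimal illustration under assumptions, not the crash script's real code: the helper name `tmpdir_is_usable` and the injectable `isdir` parameter are hypothetical and exist only to make the logic easy to exercise.

```python
import os
from typing import Callable


def tmpdir_is_usable(
    tmp_dir: str,
    skip_tmpdir_check: bool = False,
    isdir: Callable[[str], bool] = os.path.isdir,
) -> bool:
    """Hypothetical helper: decide whether the script may proceed with tmp_dir.

    A local isdir() check reports False for a directory that only exists on
    remote storage, so the caller can opt out of the check entirely.
    """
    if skip_tmpdir_check:
        # Trust the caller; do not consult the local filesystem at all.
        return True
    return isdir(tmp_dir)
```

With the check skipped, a remote path passes even though the local filesystem cannot see it; without the flag, the script would treat it as missing and exit.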
Jay Huh	17d5200504	Stress/Crash Test for OptimisticTransactionDB (#11513 ) Summary: Context: OptimisticTransactionDB has not been covered by db_stress (including crash test) like TransactionDB. 1. Adding the following gflag options to test OptimisticTransactionDB - `use_optimistic_txn`: When true, open OptimisticTransactionDB to test - `occ_validation_policy`: `OccValidationPolicy::kValidateParallel = 1` by default. - `share_occ_lock_buckets`: Use shared occ locks - `occ_lock_bucket_count`: 500 by default. Number of buckets to use for shared occ lock. 2. Opening OptimisticTransactionDB and NewTxn/Commit added per `use_optimistic_txn` flag in `db_stress_test_base.cc` 3. OptimisticTransactionDB blackbox/whitebox test added in crash_test.mk Please note that the existing flag `use_txn` is being used here. When `use_txn == true` and `use_optimistic_txn == false`, we use `TransactionDB` (a.k.a. pessimistic transaction db). When both `use_txn` and `use_optimistic_txn` are true, we use `OptimisticTransactionDB`. If `use_txn == false` but `use_optimistic_txn == true`, we throw an error with the message _"You cannot set use_optimistic_txn true while use_txn is false. Please set use_txn true if you want to use OptimisticTransactionDB"_. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11513 Test Plan: Crash Test Serial Validation ``` export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=0" make crash_test -j ``` Parallel Validation (no share bucket) ``` export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=0" make crash_test -j ``` Parallel Validation (share bucket) ``` export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=1 --occ_lock_bucket_count=500" make crash_test -j ``` Stress Test ``` ./db_stress -use_optimistic_txn -threads=32 ``` Reviewed By: pdillinger Differential Revision: D46547387 Pulled By: jaykorean fbshipit-source-id: ca19819ca6e0281694966998014b40d95d4e5960	2023-06-17 16:27:37 -07:00
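The flag combinations described in the commit above can be summarized with a small decision function. This is a hedged Python sketch rather than the actual `db_stress` logic (which is C++): the function name `select_txn_db` and the returned strings are hypothetical, and the case where both flags are false is an assumption, since the commit message does not spell it out.

```python
def select_txn_db(use_txn: bool, use_optimistic_txn: bool) -> str:
    """Hypothetical helper mapping the two flags to the DB type under test."""
    if use_optimistic_txn and not use_txn:
        # Error text quoted from the commit message.
        raise ValueError(
            "You cannot set use_optimistic_txn true while use_txn is false. "
            "Please set use_txn true if you want to use OptimisticTransactionDB"
        )
    if use_txn and use_optimistic_txn:
        return "OptimisticTransactionDB"
    if use_txn:
        return "TransactionDB"  # a.k.a. pessimistic transaction db
    # Assumption: with neither flag set, a plain non-transactional DB is used.
    return "DB"
```

Treating `use_optimistic_txn` as a modifier of `use_txn`, rather than an independent switch, is what makes the fourth combination an error instead of a silent fallback.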

340 commits