mirror of https://github.com/facebook/rocksdb.git
282 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Changyu Bi | c2aad555c3 |
Add `CompressionOptions::checksum` for enabling ZSTD checksum (#11666)
Summary:
Optionally enable zstd checksum flag (
|
|
Changyu Bi | d1ff401472 |
Delay bottommost level single file compactions (#11701)
Summary: For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions. * The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701 Test Plan: * Add two unit tests to test when bottommost_file_compaction_delay > 0. * Ran crash test with the new option. Reviewed By: jaykorean, ajkr Differential Revision: D48331564 Pulled By: cbi42 fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8 |
|
Peter Dillinger | ef6f025563 |
Placeholder for AutoHyperClockCache, more (#11692)
Summary: * The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation. * Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache). * Simplify some existing tests not to depend directly on cache type. * Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation. * Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test. * Add another templated class BaseHyperClockCache to reduce future copy-paste * Added ReportProblems support to cache_bench * Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC: ``` 2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17 ``` In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692 Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in. Reviewed By: jowlyzhang Differential Revision: D48247413 Pulled By: pdillinger fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da |
|
Peter Dillinger | a85eccc6d6 |
Adjust db_stress handling of TryAgain from optimistic txn (#11691)
Summary: We're still getting some rare cases of 5x TryAgains in a row. Here I'm boosting the failure threshold to 10 in a row and adding more info in the output, to help us manually verify whether there's anything suspicous about the sequence of TryAgains, such as if Rollback failed to reset to new sequence numbers. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11691 Test Plan: By lowering the threshold to 2 and adjusting some other db_crashtest parameters, I was able to hit my new code and saw fresh sequence number on the subsequent TryAgain. Reviewed By: cbi42 Differential Revision: D48236153 Pulled By: pdillinger fbshipit-source-id: c0530e969ddcf8de7348e5cf7daf5d6d5dec24f4 |
|
Hui Xiao | 9a034801ce |
Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
- For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
- New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.
- Misc
- More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
- Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444
Test Plan:
- CI fake db crash/stress test
- Microbenchmarking
**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version:
|
|
Yu Zhang | 9c2ebcc2c3 |
Log user_defined_timestamps_persisted flag in event logger (#11683)
Summary: As titled, and also removed an undefined and unused member function in for ColumnFamilyData Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683 Reviewed By: ajkr Differential Revision: D48156290 Pulled By: jowlyzhang fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7 |
|
Peter Dillinger | 99daea3481 |
Prepare tests for new HCC naming (#11676)
Summary: I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache. This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676 Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache Reviewed By: ajkr Differential Revision: D48103165 Pulled By: pdillinger fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736 |
|
Vardhan | 87a21d08fe |
Add an option to trigger flush when the number of range deletions reach a threshold (#11358)
Summary: Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358 Test Plan: * New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions` * Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions. Reviewed By: ajkr Differential Revision: D46582680 Pulled By: cbi42 fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da |
|
Peter Dillinger | b3c54186ab |
Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653)
Summary: In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers. Also, some of the APIs were not clear about Transaction semantics, so I have clarified: * (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction). * Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test. Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress. * Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function. * Use unique_ptr (nicely complements removing unnecessary Rollbacks) * Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".) Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653 Test Plan: this is the test :) Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances. Reviewed By: cbi42 Differential Revision: D47882995 Pulled By: pdillinger fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d |
|
Levi Tamasi | b3edb87341 |
Initialize StressTest::optimistic_txn_db_ in ctor (#11547)
Summary: `StressTest::optimistic_txn_db_` is currently not initialized by the constructor, which can lead to assertion failures down the line in `StressTest::Open`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11547 Reviewed By: cbi42 Differential Revision: D46845658 Pulled By: ltamasi fbshipit-source-id: 578b0f24fc00e3e97f24221fcdd003cc529439c2 |
|
Jay Huh | 17d5200504 |
Stress/Crash Test for OptimisticTransactionDB (#11513)
Summary: Context: OptimisticTransactionDB has not been covered by db_stress (including crash test) like TransactionDB. 1. Adding the following gflag options to to test OptimisticTransactionDB - `use_optimistic_txn`: When true, open OptimisticTransactionDB to test - `occ_validation_policy`: `OccValidationPolicy::kValidateParallel = 1` by default. - `share_occ_lock_buckets`: Use shared occ locks - `occ_lock_bucket_count`: 500 by default. Number of buckets to use for shared occ lock. 2. Opening OptimisticTransactionDB and NewTxn/Commit added per `use_optimistic_txn` flag in `db_stress_test_base.cc` 3. OptimisticTransactionDB blackbox/whitebox test added in crash_test.mk Please note that the existing flag `use_txn` is being used here. When `use_txn == true` and `use_optimistic_txn == false`, we use `TransactionDB` (a.k.a. pessimistic transaction db). When both `use_txn` and `use_optimistic_txn` are true, we use `OptimisticTransactionDB`. If `use_txn == false` but `use_optimistic_txn == true` throw error with message _"You cannot set use_optimistic_txn true while use_txn is false. Please set use_txn true if you want to use OptimisticTransactionDB"_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11513 Test Plan: **Crash Test** Serial Validation ``` export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=0" make crash_test -j ``` Parallel Validation (no share bucket) ``` export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=0" make crash_test -j ``` Parallel Validation (share bucket) ``` export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=1 --occ_lock_bucket_count=500" make crash_test -j ``` **Stress Test** ``` ./db_stress -use_optimistic_txn -threads=32 ``` Reviewed By: pdillinger Differential Revision: D46547387 Pulled By: jaykorean fbshipit-source-id: ca19819ca6e0281694966998014b40d95d4e5960 |
|
Jay Huh | 81aeb15988 |
Add WaitForCompact with WaitForCompactOptions to public API (#11436)
Summary: Context: This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions` 1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing. 1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct. 2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h` 3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436 Test Plan: Following tests added - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish` - `DBCompactionTest::WaitForCompactAbortOnPauseAborted` - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted` - `DBCompactionTest::WaitForCompactShutdownWhileWaiting` - `TransactionTest::WaitForCompactAbortOnPause` NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place. Reviewed By: pdillinger Differential Revision: D45799659 Pulled By: jaykorean fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b |
|
Jay Huh | 586d78b31e |
Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary: Context: In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue. In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code. This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`. Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`. Furthermore, all tests that previously called `waitForCompact(true)` have been fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443 Test Plan: Existing tests that involve a shutdown in progress: - DBCompactionTest::CompactRangeShutdownWhileDelayed - DBTestWithParam::PreShutdownMultipleCompaction - DBTestWithParam::PreShutdownCompactionMiddle Reviewed By: pdillinger Differential Revision: D45923426 Pulled By: jaykorean fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae |
|
Hui Xiao | 5fc57eec2b |
Support parallel read and write/delete to same key in NonBatchedOpsStressTest (#11058)
Summary: **Context:** Current `NonBatchedOpsStressTest` does not allow multi-thread read (i.e, Get, Iterator) and write (i.e, Put, Merge) or delete to the same key. Every read or write/delete operation will acquire lock (`GetLocksForKeyRange`) on the target key to gain exclusive access to it. This does not align with RocksDB's nature of allowing multi-thread read and write/delete to the same key, that is concurrent threads can issue read/write/delete to RocksDB without external locking. Therefore this is a gap in our testing coverage. To close the gap, biggest challenge remains in verifying db value against expected state in presence of parallel read and write/delete. The challenge is due to read/write/delete to the db and read/write to expected state is not within one atomic operation. Therefore we may not know the exact expected state of a certain db read, as by the time we read the expected state for that db read, another write to expected state for another db write to the same key might have changed the expected state. **Summary:** Credited to ajkr's idea, we now solve this challenge by breaking the 32-bits expected value of a key into different parts that can be read and write to in parallel. Basically we divide the 32-bits expected value into `value_base` (corresponding to the previous whole 32 bits but now with some shrinking in the value base range we allow), `pending_write` (i.e, whether there is an ongoing concurrent write), `del_counter` (i.e, number of times a value has been deleted, analogous to value_base for write), `pending_delete` (similar to pending_write) and `deleted` (i.e whether a key is deleted). Also, we need to use incremental `value_base` instead of random value base as before because we want to control the range of value base a correct db read result can possibly be in presence of parallel read and write. In that way, we can verify the correctness of the read against expected state more easily. This is at the cost of reducing the randomness of the value generated in NonBatchedOpsStressTest we are willing to accept. (For detailed algorithm of how to use these parts to infer expected state of a key, see the PR) Misc: hide value_base detail from callers of ExpectedState by abstracting related logics into ExpectedValue class Pull Request resolved: https://github.com/facebook/rocksdb/pull/11058 Test Plan: - Manual test of small number of keys (i.e, high chances of parallel read and write/delete to same key) with equally distributed read/write/deleted for 30 min ``` python3 tools/db_crashtest.py --simple {blackbox|whitebox} --sync_fault_injection=1 --skip_verifydb=0 --continuous_verification_interval=1000 --clear_column_family_one_in=0 --max_key=10 --column_families=1 --threads=32 --readpercent=25 --writepercent=25 --nooverwritepercent=0 --iterpercent=25 --verify_iterator_with_expected_state_one_in=1 --num_iterations=5 --delpercent=15 --delrangepercent=10 --range_deletion_width=5 --use_merge={0|1} --use_put_entity_one_in=0 --use_txn=0 --verify_before_write=0 --user_timestamp_size=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --flush_one_in=1000 --get_property_one_in=1000 --ingest_external_file_one_in=100 --backup_one_in=100 --checkpoint_one_in=100 --approximate_size_one_in=0 --acquire_snapshot_one_in=100 --use_multiget=0 --prefixpercent=0 --get_live_files_one_in=1000 --manual_wal_flush_one_in=1000 --pause_background_one_in=1000 --target_file_size_base=524288 --write_buffer_size=524288 --verify_checksum_one_in=1000 --verify_db_one_in=1000 ``` - Rehearsal stress test for normal parameter and aggressive parameter to see if such change can find what existing stress test can find (i.e, no regression in testing capability) - [Ongoing]Try to find new bugs with this change that are not found by current NonBatchedOpsStressTest with no parallel read and write/delete to same key Reviewed By: ajkr Differential Revision: D42257258 Pulled By: hx235 fbshipit-source-id: e6fdc18f1fad3753e5ac91731483a644d9b5b6eb |
|
Changyu Bi | 62fc15f009 |
Block per key-value checksum (#11287)
Summary: add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are 1. checksum construction and verification in block.cc/h 2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h) 3. unit tests/crash test updates Tests: * Added unit tests * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576` Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled. Performance: Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory. For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates): ``` SETUP make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none BENCHMARK ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE The readrandom ops/sec looks like the following: Block cache size: 2GB 1.2GB * 0.9 1.2GB * 0.8 1.2GB * 0.5 8MB Main 240805 223604 198176 161653 139040 PR prot_bytes=0 238691 226693 200127 161082 141153 PR prot_bytes=1 214983 193199 178532 137013 108211 prot_bytes=1 vs -10% -15% -10.8% -15% -23% prot_bytes=0 ``` The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287 Reviewed By: ajkr Differential Revision: D43970708 Pulled By: cbi42 fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940 |
|
Hui Xiao | 151242ce46 |
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132 |
|
Andrew Kryczka | 6cac4c79d4 |
Fix race condition in db_stress checkpoint cleanup (#11389)
Summary: The old cleanup code had a race condition: 1. Test thread: DestroyDB() marked a file as trash 2. DeleteScheduler thread: Got the file's size and decided to delete it in chunks 3. Test thread: DestroyDir() deleted that trash file 4. DeleteScheduler thread: Began deleting in chunks starting by calling ReopenWritableFile(). Unfortunately this recreates the deleted trash file 5. Test thread: DestroyDir() fails to remove the parent directory because it contains the file created in 4. 6. Test thread: Checkpoint::Create() fails due to the directory already existing It could be repro'd with the following patch/command. Patch: ``` diff --git a/file/delete_scheduler.cc b/file/delete_scheduler.cc index 8a2d1615d..337d24a60 100644 --- a/file/delete_scheduler.cc +++ b/file/delete_scheduler.cc @@ -317,6 +317,12 @@ Status DeleteScheduler::DeleteTrashFile(const std::string& path_in_trash, &num_hard_links, nullptr); if (my_status.ok()) { if (num_hard_links == 1) { + // Give some time for DestroyDir() to delete file entries. Then, the + // below `ReopenWritableFile()` will recreate files, preventing the + // parent directory from being deleted. + if (rand() % 2 == 0) { + usleep(1000); + } std::unique_ptr<FSWritableFile> wf; my_status = fs_->ReopenWritableFile(path_in_trash, FileOptions(), &wf, nullptr); diff --git a/file/file_util.cc b/file/file_util.cc index 43608fcdc..2cee1ad8e 100644 --- a/file/file_util.cc +++ b/file/file_util.cc @@ -263,6 +263,13 @@ Status DestroyDir(Env* env, const std::string& dir) { } } + // Give some time for the DeleteScheduler thread's ReopenWritableFile() to + // recreate deleted files + if (dir.find("checkpoint") != std::string::npos) { + fprintf(stderr, "waiting to destroy %s\n", dir.c_str()); + usleep(10000); + } + if (s.ok()) { s = env->DeleteDir(dir); // DeleteDir might or might not report NotFound ``` Command: ``` TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=131072 --target_file_size_base=131072 --max_bytes_for_level_base=524288 --checkpoint_one_in=100 --clear_column_family_one_in=0 --max_key=1000 --value_size_mult=33 --sst_file_manager_bytes_per_truncate=4096 --sst_file_manager_bytes_per_sec=1048576 --interval=3 --compression_type=none --sync_fault_injection=1 ``` Obviously we don't want to use scheduled deletion here as we need the checkpoint directory deleted immediately. I suspect the DestroyDir() was an attempt to fixup incomplete DestroyDB()s. Now that we expect DestroyDB() to be complete I removed that code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11389 Reviewed By: hx235 Differential Revision: D45137142 Pulled By: ajkr fbshipit-source-id: 2af743d342c77cc414fd25fc4c9d7c9c6079ad24 |
|
Levi Tamasi | 0efd7b4ba1 |
Extend the stress test coverage of MultiGetEntity (#11336)
Summary: Similarly to `GetEntity` prior to https://github.com/facebook/rocksdb/issues/11303, the `MultiGetEntity` API is currently only used in the DB verification logic of the stress tests. The patch introduces a new mode where all point lookups are performed using `MultiGetEntity`, and implements the corresponding logic in the non-batched, batched, and CF consistency tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11336 Test Plan: Ran simple blackbox tests for the various stress test flavors. Reviewed By: akankshamahajan15 Differential Revision: D44513285 Pulled By: ltamasi fbshipit-source-id: c3db098501bf875b6a356b09fc676a0268d92c35 |
|
Peter Dillinger | 204fcff751 |
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary: Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key. This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places. It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct. ## cache.h (public API) Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache. ## advanced_cache.h (advanced public API) * Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it. * Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper. These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations. * Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle) I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss. ## cacheable_entry.h A couple of functions are obsolete because Cache::Handle can no longer be pending. ## cache.cc Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches. ## secondary_cache_adapter.{h,cc} The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code. ## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc Simply updated for Cache API changes. ## lru_cache.{h,cc} Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality. ## clock_cache.{h,cc} Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring. ## block_based_table_reader* Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready. Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse). Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait. ## Intended follow-up work * Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet * Stacked secondary caches (see above discussion) * See if we can make up for the small MultiGet performance regression. * Study more performance with SecondaryCache * Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal. * Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup. * Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301 Test Plan: ## Unit tests Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them. ## Crash/stress test Updated to use the new combination. ## Performance First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0. ``` (while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }' ``` **Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type: HyperClockCache: 3168662 (average parallel ops/sec) LRUCache: 2940127 **After** this and #11299, running for about an hour: HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower) LRUCache: 2940928 (0.03% faster) This is an acceptable difference IMHO. Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits. Create DB and test (before and after tests running simultaneously): ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16 TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec **After**: multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache: **Before**: multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec **After**: multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately. Let's also look at Get() in db_bench: ``` TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16 ``` **Before**: readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec **After**: readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement: **Before**: readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec **After**: readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec Reviewed By: ltamasi Differential Revision: D44177044 Pulled By: pdillinger fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5 |
|
Levi Tamasi | a72d55c99d |
Increase the stress test coverage of GetEntity (#11303)
Summary: The `GetEntity` API is currently used in the stress tests for verification purposes; this patch extends the coverage by adding a mode where all point lookups in the non-batched, batched, and CF consistency stress tests are done using this API. The PR also includes a bit of refactoring to eliminate some boilerplate code around the wide-column consistency checks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11303 Test Plan: Ran stress tests of the batched, non-batched, and CF consistency varieties. Reviewed By: akankshamahajan15 Differential Revision: D44148503 Pulled By: ltamasi fbshipit-source-id: fecdbfd3e65a459bbf16ab7aa7b9173e19240077 |
|
Yu Zhang | f007b8fdea |
Support iter_start_ts in integrated BlobDB (#11244)
Summary: Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244 Test Plan: ```make check ./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs" tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1``` Reviewed By: ltamasi Differential Revision: D43506726 Pulled By: jowlyzhang fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0 |
|
Yu Zhang | c19672c187 |
Enable crash test to run BlobDB together with user-defined timestamp (#11199)
Summary: I missed a stress test code sanity check when enabling this combination of tests. This PR addresses that, the "iter_start_ts" function for user defined timestamp feature is not supported when BlobDB is enabled. It's disabled for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11199 Test Plan: Locally always enable BlobDB and run tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 Reviewed By: ltamasi Differential Revision: D43245657 Pulled By: jowlyzhang fbshipit-source-id: 4cae19817bb1afd50a76f9e0e49f006fb5c0b211 |
|
Peter Dillinger | 94e3beec77 |
Cleanup, improve, stress test LockWAL() (#11143)
Summary: The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change: * Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use. * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions) * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability) * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other. * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads. * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143 Test Plan: unit tests added / updated, added to stress/crash test Reviewed By: ajkr Differential Revision: D42848627 Pulled By: pdillinger fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4 |
|
sdong | 36174d89a6 |
DB Stress to fix a false assertion (#11164)
Summary: Seeting this error in stress test: db_stress: internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2459: void rocksdb::StressTest::Open(rocksdb::SharedState *): Assertion `txn_db_ == nullptr' failed. Received signal 6 (Aborted) ...... It doesn't appear that txn_db_ is set to nullptr at all. We set ithere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11164 Test Plan: Run db_stress transaction and non-transation with low kill rate and see restarting without assertion Reviewed By: ajkr Differential Revision: D42855662 fbshipit-source-id: 06816d37cce9c94a81cb54ab238fb73aa102ed46 |
|
sdong | 4720ba4391 |
Remove RocksDB LITE (#11147)
Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2 |
|
Yu Zhang | 6943ff6e50 |
Remove deprecated util functions in options_util.h (#11126)
Summary: Remove the util functions in options_util.h that have previously been marked deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126 Test Plan: `make check` Reviewed By: ltamasi Differential Revision: D42757496 Pulled By: jowlyzhang fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb |
|
sdong | 2800aa069a |
Remove compressed block cache (#11117)
Summary: Compressed block cache is replaced by compressed secondary cache. Remove the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117 Test Plan: See CI passes Reviewed By: pdillinger Differential Revision: D42700164 fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d |
|
Hui Xiao | b965a5a80e |
Add back Options::CompactionOptionsFIFO::allow_compaction to stress/crash test (#11063)
Summary: **Context/Summary:** https://github.com/facebook/rocksdb/pull/10777 was reverted (https://github.com/facebook/rocksdb/pull/10999) due to internal blocker and replaced with a better fix https://github.com/facebook/rocksdb/pull/10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by the PR. It's an useful coverage cuz setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and then trigger more FIFO compaction. This speed up discovery of bug related to FIFO compaction like https://github.com/facebook/rocksdb/pull/10955. To see the speedup, compare the failure occurrence in following commands with `Options::CompactionOptionsFIFO::allow_compaction=true/false` ``` --fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65 ``` Therefore this PR is adding it back to stress/crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11063 Test Plan: Rehearsal stress test to make sure stress/crash test is stable Reviewed By: ajkr Differential Revision: D42283650 Pulled By: hx235 fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef |
|
anand76 | c3f720c60d |
Enable ReadAsync testing and fault injection in db_stress (#11037)
Summary: The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037 Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890. Reviewed By: akankshamahajan15 Differential Revision: D42061290 Pulled By: anand1976 fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074 |
|
Hui Xiao | f1574a20ff |
Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999)
Summary:
**Context/Summary:**
This reverts commit
|
|
Yanqin Jin | a8a4ed52a4 |
Test Merge with timestamps in stress test (#10948)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10948 Test Plan: make crash_test_with_ts Reviewed By: ltamasi Differential Revision: D41390854 Pulled By: riversand963 fbshipit-source-id: 599e114da8e2b2bbff5628fb8c67fa0393a31c05 |
|
Peter Dillinger | 32520df1d9 |
Remove prototype FastLRUCache (#10954)
Summary: This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954 Test Plan: tests updated Reviewed By: akankshamahajan15 Differential Revision: D41310123 Pulled By: pdillinger fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c |
|
Levi Tamasi | d484275230 |
Adjust value generation in batched ops stress tests (#10872)
Summary: The patch adjusts the generation of values in batched ops stress tests so that the digits 0..9 are appended (instead of prepended) to the values written. This has the advantage of aligning the encoding of the "value base" into the value string across non-batched, batched, and CF consistency stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10872 Test Plan: Tested using some black box stress test runs. Reviewed By: riversand963 Differential Revision: D40692847 Pulled By: ltamasi fbshipit-source-id: 26bf8adff2944cbe416665f09c3bab89d80416b3 |
|
Hui Xiao | fc74abb436 |
Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777)
Summary: **Context:** Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case Repro: ``` COERCE_CONTEXT_SWICH=1 make -j56 db_stress ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35 put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049 ``` **Summary:** FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving. - [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true` - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56 - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here. - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0` - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno` - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378) - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction. Some additional clean-up included in this PR: - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming - Added comment about `earliest_memtable_seqno` in related APIs - Made parameter `earliest_memtable_seqno` constant and required Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777 Test Plan: - make check - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix - Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only Reviewed By: ajkr Differential Revision: D40090485 Pulled By: hx235 fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f |
|
Jay Zhuang | 8124bc3526 |
Enable preclude_last_level_data_seconds in stress test (#10824)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10824 Reviewed By: siying Differential Revision: D40390535 Pulled By: jay-zhuang fbshipit-source-id: 700803a1aff8a1e77c038740d87931577e79bcf6 |
|
Levi Tamasi | eae3a686ee |
Check wide columns in TestIterate (#10818)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10818 Test Plan: Tested using some simple blackbox crash test runs in the various modes (non-batched, batched, CF consistency). Reviewed By: riversand963 Differential Revision: D40349527 Pulled By: ltamasi fbshipit-source-id: 2918bc26adbbeac314beaa958aafe770b01e5cc6 |
|
Levi Tamasi | 85399b14f7 |
Consider wide columns when checksumming in the stress tests (#10788)
Summary: There are two places in the stress test code where we compute the CRC for a range of KVs for the purposes of checking consistency, namely in the CF consistency test (to make sure CFs contain the same data), and when performing `CompactRange` (to make sure the pre- and post-compaction states are equivalent). The patch extends the logic so that wide columns are also considered in both cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10788 Test Plan: Tested using some simple blackbox crash test runs. Reviewed By: riversand963 Differential Revision: D40191134 Pulled By: ltamasi fbshipit-source-id: 542c21cac9077c6d225780deb210319bb5eee955 |
|
Hui Xiao | f6a0065d54 |
Allow Flush(sync=true) not supported in DB::Open() and db_stress (#10784)
Summary: **Context:** https://github.com/facebook/rocksdb/pull/10698 made `Flush(sync=true)` required for` DB::Open()` (to pass the original but now deleted assertion `impl->TEST_WALBufferIsEmpty()` under `manual_wal_flush=true`, see https://github.com/facebook/rocksdb/pull/10698 summary for more ) as well as db_stress to pass. However RocksDB users may not implement SyncWAL() (used inFlush(sync=true)). Therefore we replace such in DB::Open and db_stress in this PR and align with https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L1883-L1887 and https://github.com/facebook/rocksdb/blob/main/db_stress_tool/db_stress_test_base.cc#L847-L849 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10784 Test Plan: make check Reviewed By: anand1976 Differential Revision: D40193354 Pulled By: anand1976 fbshipit-source-id: e80d53880799ae01bdd717641d07997d3bfe2b54 |
|
Levi Tamasi | d6d8c007ff |
Verify columns in NonBatchedOpsStressTest::VerifyDb (#10783)
Summary: As the first step of covering the wide-column functionality of iterators in our stress tests, the patch adds verification logic to `NonBatchedOpsStressTest::VerifyDb` that checks whether the iterator's value and columns are in sync. Note: I plan to update the other types of stress tests and add similar verification for prefix scans etc. in separate PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10783 Test Plan: Ran some simple blackbox crash tests. Reviewed By: riversand963 Differential Revision: D40152370 Pulled By: riversand963 fbshipit-source-id: 8f9d17d7af5da58ccf1bd2057cab53cc9645ac35 |
|
Peter Dillinger | b205c6d029 |
Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768)
Summary: We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.) This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768 Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test) Reviewed By: anand1976 Differential Revision: D40034747 Pulled By: anand1976 fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68 |
|
Changyu Bi | 9f2363f4c4 |
User-defined timestamp support for `DeleteRange()` (#10661)
Summary: Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps. - Garbage collection of range tombstones and range tombstone covered keys during compaction. - Get()/MultiGet() to return the timestamp of a range tombstone when needed. - Get/Iterator with range tombstones bounded by readoptions.timestamp. - timestamp crash test now issues DeleteRange by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661 Test Plan: - Added unit test: `make check` - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case. | micros/op | fillrandom | seekrandom | | --- | --- | --- | |main| 2.58 |10.96| |PR 10661| 2.68 |10.63| Reviewed By: riversand963 Differential Revision: D39441192 Pulled By: cbi42 fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2 |
|
Hui Xiao | 3b8164912e |
Add manual_wal_flush, FlushWAL() to stress/crash test (#10698)
Summary: **Context/Summary:** Introduce `manual_wal_flush_one_in` as titled. - When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in` Incompatibilities fixed along the way: ``` db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed. ``` - It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false. - As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698 Test Plan: - Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3` - Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs Reviewed By: ajkr Differential Revision: D39593752 Pulled By: ajkr fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce |
|
Levi Tamasi | 9078fcccee |
Add the PutEntity API to the stress/crash tests (#10760)
Summary: The patch adds the `PutEntity` API to the non-batched, batched, and CF consistency stress tests. Namely, when the new `db_stress` command line parameter `use_put_entity_one_in` is greater than zero, one in N writes on average is performed using `PutEntity` rather than `Put`. The wide-column entity written has the generated value in its default column; in addition, it contains up to three additional columns where the original generated value is divided up between the column name and the column value (with the column name containing the first k characters of the generated value, and the column value containing the rest). Whether `PutEntity` is used (and if so, how many columns the entity has) is completely determined by the "value base" used to generate the value (that is, there is no randomness involved). Assuming the same `use_put_entity_one_in` setting is used across `db_stress` invocations, this enables us to reconstruct and validate the entity during subsequent `db_stress` runs. Note that `PutEntity` is currently incompatible with `Merge`, transactions, and user-defined timestamps; these combinations are currently disabled/disallowed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10760 Test Plan: Ran some batched, non-batched, and CF consistency stress tests using the script. Reviewed By: riversand963 Differential Revision: D39939032 Pulled By: ltamasi fbshipit-source-id: eafdf124e95993fb7d73158e3b006d11819f7fa9 |
|
Hui Xiao | f3b359a549 |
Set options.num_levels in db_stress_test_base (#10732)
Summary: An add-on to https://github.com/facebook/rocksdb/pull/6818 to complete adding single-level universal compaction to stress/crash testing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10732 Test Plan: - Locally run for 10 min `python3 ./tools/db_crashtest.py whitebox --simple --compaction_style=1 --num_levels=1 -max_key=1000000 -value_size_mult=33 -write_buffer_size=524288 -target_file_size_base=524288 -max_bytes_for_level_base=2097152 --duration=120 --interval=10 --ops_per_thread=1000 --random_kill_odd=887` - Check LOG to confirm single-level universal compaction is called - Manual testing and log checking to ensure destroy_db_initially=1 is correctly set across runs with different compaction styles (i.e, in the second half of whitebox testing). - [ongoing]CI jobs stress test Reviewed By: ajkr Differential Revision: D39797612 Pulled By: ajkr fbshipit-source-id: 16f5c40c3464c57360c06c8305f92118e426149c |
|
Hui Xiao | aed30ddf21 |
Support WriteCommit policy with sync_fault_injection=1 (#10624)
Summary:
**Context:**
Prior to this PR, correctness testing with un-sync data loss [disabled](https://github.com/facebook/rocksdb/pull/10605) transaction (`use_txn=1`) thus all of the `txn_write_policy` . This PR improved that by adding support for one policy - WriteCommit (`txn_write_policy=0`).
**Summary:**
They key to this support is (a) handle Mark{Begin, End}Prepare/MarkCommit/MarkRollback in constructing ExpectedState under WriteCommit policy correctly and (b) monitor CI jobs and solve any test incompatibility issue till jobs are stable. (b) will be part of the test plan.
For (a)
- During prepare (i.e, between `MarkBeginPrepare()` and `MarkEndPrepare(xid)`), `ExpectedStateTraceRecordHandler` will buffer all writes by adding all writes to an internal `WriteBatch`.
- On `MarkEndPrepare()`, that `WriteBatch` will be associated with the transaction's `xid`.
- During the commit (i.e, on `MarkCommit(xid)`), `ExpectedStateTraceRecordHandler` will retrieve and iterate the internal `WriteBatch` and finally apply those writes to `ExpectedState`
- During the rollback (i.e, on `MarkRollback(xid)`), `ExpectedStateTraceRecordHandler` will erase the internal `WriteBatch` from the map.
For (b) - one major issue described below:
- TransactionsDB in db stress recovers prepared-but-not-committed txns from the previous crashed run by randomly committing or rolling back it at the start of the current run, see a historical [PR](
|
|
Peter Dillinger | 0f91c72adc |
Call experimental new clock cache HyperClockCache (#10684)
Summary: This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons: * We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache." * We want to highlight the key feature: it's fast (especially under parallel load) * Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements. * We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.) Some API detail: * To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`. * Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated. For performance tests see https://github.com/facebook/rocksdb/pull/10626 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684 Test Plan: no interesting functional changes; tests updated Reviewed By: anand1976 Differential Revision: D39547800 Pulled By: pdillinger fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf |
|
Andrew Kryczka | 6ce782beaf |
move db_stress locking to `StressTest::Test*()` functions (#10678)
Summary: One problem of the previous strategy was `NonBatchedOpsStressTest::TestIngestExternalFile()` could release the lock for `rand_keys[0]` in `rand_column_families[0]`, and then subsequent operations in the same loop iteration (e.g., `TestPut()`) would run without locking. This PR changes the strategy so each `Test*()` function is responsible for acquiring and releasing its own locks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10678 Reviewed By: hx235 Differential Revision: D39516401 Pulled By: ajkr fbshipit-source-id: bf67f12ebbd293ba8c24fdf8754ff28737bcd758 |
|
Andrew Kryczka | 03c4ea26bb |
db_stress option to preserve all files until verification success (#10659)
Summary: In `db_stress`, DB and expected state files containing changes leading up to a verification failure are often deleted, which makes debugging such failures difficult. On the DB side, flushed WAL files and compacted SST files are marked obsolete and then deleted. Without those files, we cannot pinpoint where a key that failed verification changed unexpectedly. On the expected state side, files for verifying prefix-recoverability in the presence of unsynced data loss are deleted before verification. These include a baseline state file containing the expected state at the time of the last successful verification, and a trace file containing all operations since then. Without those files, we cannot know the sequence of DB operations expected to be recovered. This PR attempts to address this gap with a new `db_stress` flag: `preserve_unverified_changes`. Setting `preserve_unverified_changes=1` has two effects. First, prior to startup verification, `db_stress` hardlinks all DB and expected state files in "unverified/" subdirectories of `FLAGS_db` and `FLAGS_expected_values_dir`. The separate directories are needed because the pre-verification opening process deletes files written by the previous `db_stress` run as described above. These "unverified/" subdirectories are cleaned up following startup verification success. I considered other approaches for preserving DB files through startup verification, like using a read-only DB or preventing deletion of DB files externally, e.g., in the `Env` layer. However, I decided against it since such an approach would not work for expected state files, and I did not want to change the DB management logic. If there were a way to disable DB file deletions before regular DB open, I would have preferred to use that. Second, `db_stress` attempts to keep all DB and expected state files that were live at some point since the start of the `db_stress` run. This is a bit tricky and involves the following changes. - Open the DB with `disable_auto_compactions=1` and `avoid_flush_during_recovery=1` - DisableFileDeletions() - EnableAutoCompactions() For this part, too, I would have preferred to use a hypothetical API that disables DB file deletion before regular DB open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10659 Reviewed By: hx235 Differential Revision: D39407454 Pulled By: ajkr fbshipit-source-id: 6e981025c7dce147649d2e770728471395a7fa53 |
|
Akanksha Mahajan | 7a9ecdac3c |
Add auto prefetching parameters to db_bench and db_stress (#10632)
Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/10632 Test Plan: make crash_test -j32 Reviewed By: anand1976 Differential Revision: D39241479 Pulled By: akankshamahajan15 fbshipit-source-id: 5db5b0c007da786bacc1b30d8926d36d6d029b87 |
|
Andrew Kryczka | ccf822492f |
Reenable sync_fault_injection in crash test (#10172)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10172 Reviewed By: riversand963 Differential Revision: D37164671 Pulled By: ajkr fbshipit-source-id: 40eb919b8dc261d502510e878ee8ac7874ab35d0 |
|
Levi Tamasi | 228f2c5bf5 |
Adjust the blob cache printout in db_bench/db_stress (#10614)
Summary: Currently, `db_bench` and `db_stress` print the blob cache options even if a shared block/blob cache is configured, i.e. when they are not actually in effect. The patch changes this so they are only printed when a separate blob cache is used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10614 Test Plan: Tested manually using `db_bench` and `db_stress`. Reviewed By: akankshamahajan15 Differential Revision: D39144603 Pulled By: ltamasi fbshipit-source-id: f714304c5d46186f8514746c27ee6f52aa3e4af8 |
|
Yanqin Jin | 3613d862ba |
print value when verification fails (#10587)
Summary: When verification fails for db_stress, print more information about value read from the db and expected state. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10587 Test Plan: make check ./db_stress Reviewed By: akankshamahajan15, hx235 Differential Revision: D39078511 Pulled By: riversand963 fbshipit-source-id: 77ac8ffae01fc3a9b58a02c2e7bbe141e1a18f0b |
|
Changyu Bi | d140fbfd7d |
Add Iterator test against expected state to stress test (#10538)
Summary: As mentioned in https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913, `db_stress` does not have much verification for iterator correctness. It has a `TestIterate()` function, but that is mainly for comparing results between two iterators, one with `total_order_seek` and the other optionally sets auto_prefix, upper/lower bounds. Commit 49a0581ad2462e31aa3f768afa769e0d33390f33 added a new `TestIterateAgainstExpected()` function that compares iterator against expected state. It locks a range of keys, creates an iterator, does a random sequence of `Next/Prev` and compares against expected state. This PR is based on that commit, the main changes include some logs (for easier debugging if a test fails), a forward and backward scan to cover the entire locked key range, and a flag for optionally turning on this version of Iterator testing. Added constraint that the checks against expected state in `TestIterateAgainstExpected()` and in `TestGet()` are only turned on when `--skip_verifydb` flag is not set. Remove the change log introduced in https://github.com/facebook/rocksdb/issues/10553. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10538 Test Plan: Run `db_stress` with `--verify_iterator_with_expected_state_one_in=1`, and a large `--iterpercent` and `--num_iterations`. Checked `op_logs` manually to ensure expected coverage. Tweaked part of the code in https://github.com/facebook/rocksdb/issues/10449 and stress test was able to catch it. - internally run various flavor of crash test Reviewed By: ajkr Differential Revision: D38847269 Pulled By: cbi42 fbshipit-source-id: 8b4402a9bba9f6cfa08051943cd672579d489599 |
|
Changyu Bi | fd165c869d |
Add memtable per key-value checksum (#10281)
Summary: Append per key-value checksum to internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make DB become read-only upon memtable entry checksum verification failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281 Test Plan: - Added new unit test cases: `make check` - Benchmark on memtable insert ``` TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 # avg over 10 runs Baseline: 1166936 ops/sec memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%) memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%) write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%) ``` - Benchmark on only memtable read: ops/sec dropped 31% for `readseq` due to time spend on verifying checksum. ops/sec for `readrandom` dropped ~6.8%. ``` # Readseq sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 readseq [AVG 20 runs] : 7432840 (± 212005) ops/sec; 822.3 (± 23.5) MB/sec readseq [MEDIAN 20 runs] : 7573878 ops/sec; 837.9 MB/sec With -memtable_protection_bytes_per_key=2: readseq [AVG 20 runs] : 5134607 (± 119596) ops/sec; 568.0 (± 13.2) MB/sec readseq [MEDIAN 20 runs] : 5232946 ops/sec; 578.9 MB/sec # Readrandom sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100 readrandom [AVG 10 runs] : 140236 (± 3938) ops/sec; 9.8 (± 0.3) MB/sec readrandom [MEDIAN 10 runs] : 140545 ops/sec; 9.8 MB/sec With -memtable_protection_bytes_per_key=2: readrandom [AVG 10 runs] : 130632 (± 2738) ops/sec; 9.1 (± 0.2) MB/sec readrandom [MEDIAN 10 runs] : 130341 ops/sec; 9.1 MB/sec ``` - Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800` Reviewed By: ajkr Differential Revision: D37607896 Pulled By: cbi42 fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113 |
|
Jay Zhuang | 1e86d424e4 |
Tiered storage stress test (#10493)
Summary: Add Tiered storage stress test and db_bench option Pull Request resolved: https://github.com/facebook/rocksdb/pull/10493 Test Plan: new crashtest: https://app.circleci.com/pipelines/github/facebook/rocksdb/16905/workflows/68c2967c-9274-434f-8506-1403cf441ead Reviewed By: ajkr Differential Revision: D38481892 Pulled By: jay-zhuang fbshipit-source-id: 217a0be4acb93d420222e6ede2a1290d9f464776 |
|
Yanqin Jin | b443d24f4d |
Stop operating on DB in a stress test background thread (#10373)
Summary: Stress test background threads do not coordinate with test worker threads for db reopen in the middle of a test run, thus accessing db obj in a stress test bg thread can race with test workers. Remove the TimestampedSnapshotThread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10373 Test Plan: ``` ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 \ --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 \ --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 \ --block_size=16384 --bloom_bits=7.580319535285394 --bottommost_compression_type=disable \ --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache \ --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 \ --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=0 \ --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 \ --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 --create_timestamped_snapshot_one_in=20 --data_block_index_type=0 \ --db=/dev/shm/rocksdb/ --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=1 \ --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=1 --enable_pipelined_write=0 \ --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=2 \ --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 --index_block_restart_interval=11 --index_type=0 --ingest_external_file_one_in=0 \ --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \ --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 \ --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 \ --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 \ --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.5 \ --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True \ --nooverwritepercent=1 --open_files=500000 --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000 \ --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 \ --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 \ --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=1000 \ --readpercent=55 --recycle_log_file_num=0 --reopen=100 --ribbon_starting_level=8 \ --secondary_cache_fault_one_in=0 --secondary_cache_uri= --snapshot_hold_ops=100000 \ --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 \ --subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 \ --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=1 \ --txn_write_policy=0 --unordered_write=0 --unpartitioned_pinning=0 \ --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 \ --use_merge=1 --use_multiget=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 \ --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 \ --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none \ --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35 ``` make crash_test_with_txn make crash_test_with_multiops_wc_txn Reviewed By: jay-zhuang Differential Revision: D37903189 Pulled By: riversand963 fbshipit-source-id: cd1728ad7ba4ce4cf47af23c4f65dda0956744f9 |
|
Gang Liao | 0b6bc101ba |
Charge blob cache usage against the global memory limit (#10321)
Summary: To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321 Reviewed By: ltamasi Differential Revision: D37913590 Pulled By: gangliao fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72 |
|
Gang Liao | ec4ebeff30 |
Support prepopulating/warming the blob cache (#10298)
Summary: Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298 Reviewed By: ltamasi Differential Revision: D37908743 Pulled By: gangliao fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1 |
|
Guido Tagliavini Ponce | 7e1b417824 |
Revert NewClockCache signature (#10358)
Summary: This complements https://github.com/facebook/rocksdb/issues/10351. This PR reverts NewClockCache's signature to an older version, expected by the users of the old (buggy) ClockCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10358 Test Plan: ``make -j24 check`` and re-run the pre-release tests. Reviewed By: siying Differential Revision: D37832601 Pulled By: guidotag fbshipit-source-id: 32a91d3da4119be187935003b7b897272ceb1950 |
|
Yanqin Jin | 2f13f5f7d0 |
Add coverage for timestamped snapshot to MultiOpsTxnsStressTest (#10325)
Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10325 Test Plan: ```bash TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_multiops_wc_txn TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_txn ``` Reviewed By: akankshamahajan15 Differential Revision: D37688742 Pulled By: riversand963 fbshipit-source-id: e198ace921898af63f99e869568c1a7bbf69f1a4 |
|
Yanqin Jin | caced09e79 |
Expand stress test coverage for user-defined timestamp (#10280)
Summary: Before this PR, we call `now()` to get the wall time before performing point-lookup and range scans when user-defined timestamp is enabled. With this PR, we expand the coverage to: - read with an older timestamp which is larger then the wall time when the process starts but potentially smaller than now() - add coverage for `ReadOptions::iter_start_ts != nullptr` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280 Test Plan: ```bash make check ``` Also, ```bash TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts ``` So far, we have had four successful runs of the above In addition, ```bash TEST_TMPDIR=/dev/shm/rocksdb make crash_test ``` Succeeded twice showing no regression. Reviewed By: ltamasi Differential Revision: D37539805 Pulled By: riversand963 fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5 |
|
zczhu | e716bda010 |
Add FLAGS_compaction_pri into crash_test (#10255)
Summary: Add FLAGS_compaction_pri into correctness test Pull Request resolved: https://github.com/facebook/rocksdb/pull/10255 Test Plan: run crash_test with FLAGS_compaction_pri Reviewed By: ajkr Differential Revision: D37510372 Pulled By: littlepig2013 fbshipit-source-id: 73d93a0a047d0c3993c8a512383dd6ee6acef641 |
|
Guido Tagliavini Ponce | 57a0e2f304 |
Clock cache (#10273)
Summary: This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work: - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash). - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores. - Middle insertions into the clock list. - A test that exercises the clock eviction policy. - Update the Java API of ClockCache and Java calls to C++. Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37522461 Pulled By: guidotag fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943 |
|
Gang Liao | d7ebb58cb5 |
Add blob cache tickers, perf context statistics, and DB properties (#10203)
Summary: In order to be able to monitor the performance of the new blob cache, we made the follow changes: - Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics) - Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context) - Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10203 Reviewed By: ltamasi Differential Revision: D37478658 Pulled By: gangliao fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40 |
|
Yanqin Jin | d3de59255a |
Enable compaction filter for db_stress with user-defined timestamp (#10259)
Summary: Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter. This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of the presence of timestamps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259 Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts Reviewed By: ajkr Differential Revision: D37459692 Pulled By: riversand963 fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef |
|
Andrew Kryczka | f322f273b0 |
Temporarily disable mempurge in crash test (#10252)
Summary: Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252 Reviewed By: riversand963 Differential Revision: D37432948 Pulled By: ajkr fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a |
|
Baptiste Lemaire | 5879053fd0 |
Dynamically changeable `MemPurge` option (#10011)
Summary: **Summary** Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled. **Motivation** RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible. Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement. **Content of this PR** This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes. **Benchmarking** I will add numbers to prove that there is no performance impact within the next 12 hours. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011 Reviewed By: pdillinger Differential Revision: D36462357 Pulled By: bjlemaire fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802 |
|
Gang Liao | 2352e2dfda |
Add the blob cache to the stress tests and the benchmarking tool (#10202)
Summary: In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`. As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202 Reviewed By: ltamasi Differential Revision: D37325739 Pulled By: gangliao fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8 |
|
Peter Dillinger | 84210c9489 |
Add data block hash index to crash test, fix MultiGet issue (#10220)
Summary: There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data block hash index, which was not caught because data block hash index was never added to stress tests. This change fixes both issues. Fixes https://github.com/facebook/rocksdb/issues/10186 I intend to pick this into the 7.4.0 release candidate Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220 Test Plan: Failure quickly reproduces in crash test with kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing the failure with a unit test I believe would be too tricky and fragile to be worthwhile. Reviewed By: anand1976 Differential Revision: D37315647 Pulled By: pdillinger fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba |
|
Andrew Kryczka | 5d6005c780 |
Add WriteOptions::protection_bytes_per_key (#10037)
Summary: Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`. Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user. There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037 Test Plan: - Manual - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24` - Enabled in MyShadow for 1+ week - Automated - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions` - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection Reviewed By: cbi42 Differential Revision: D36614569 Pulled By: ajkr fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb |
|
Peter Dillinger | 126c223714 |
Remove deprecated block-based filter (#10184)
Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. * Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80 |
|
Yanqin Jin | ce419c0f10 |
Allow db_bench and db_stress to set `allow_data_in_errors` (#10171)
Summary: There is `Options::allow_data_in_errors` that controls whether RocksDB is allowed to log data, e.g. key, value, etc in LOG files. It is false by default. However, in db_bench and db_stress, it is often ok to log data because there is no concern about privacy. This PR allows db_stress and db_bench to set this option on the command line, while it remains false by default. Furthermore, make crash/recovery test driven by db_crashtest.py to opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171 Test Plan: Stress test and db_bench Reviewed By: hx235 Differential Revision: D37163787 Pulled By: riversand963 fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8 |
|
Hui Xiao | d665afdbf3 |
Account memory of FileMetaData in global memory limit (#9924)
Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625 |
|
Guido Tagliavini Ponce | f105e1a501 |
Make the per-shard hash table fixed-size. (#10154)
Summary: We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values. Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154 Test Plan: `make -j24 check` Reviewed By: pdillinger Differential Revision: D37124451 Pulled By: guidotag fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c |
|
Yanqin Jin | bfaf8291c5 |
Fix a race condition in transaction stress test (#10157)
Summary: Before this PR, there can be a race condition between the thread calling `StressTest::Open()` and a background compaction thread calling `MultiOpsTxnsStressTest::VerifyPkSkFast()`. ``` Time thread1 bg_compact_thr | TransactionDB::Open(..., &txn_db_) | db_ is still nullptr | db_->GetSnapshot() // segfault | db_ = txn_db_ V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10157 Test Plan: CI Reviewed By: akankshamahajan15 Differential Revision: D37121653 Pulled By: riversand963 fbshipit-source-id: 6a53117f958e9ee86f77297fdeb843e5160a9331 |
|
Yanqin Jin | 1777e5f7e9 |
Snapshots with user-specified timestamps (#9879)
Summary: In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable. It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29. This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps. Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps. In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot` object is created with the last published sequence number of the super-version. You can see that the reader actually has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called, an arbitrarily long period of time may have already elapsed since the last write, which is when the last published sequence number is written. This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will ensure any two snapshots with timestamps should satisfy the following: ``` snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts ``` If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create a snapshot with associated timestamp. Code example ```cpp // Create a timestamped snapshot when committing transaction. txn->SetCommitTimestamp(100); txn->SetSnapshotOnNextOperation(); txn->Commit(); // A wrapper API for convenience Status Transaction::CommitAndTryCreateSnapshot( std::shared_ptr<TransactionNotifier> notifier, TxnTimestamp ts, std::shared_ptr<const Snapshot>* ret); // Create a timestamped snapshot if caller guarantees no concurrent writes std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100); ``` The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp. ```cpp // Return the timestamped snapshot correponding to given timestamp. If ts is // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present. // Othersise, we return the snapshot whose timestamp is equal to `ts`. If no // such snapshot exists, then we return null. std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const; // Return the latest timestamped snapshot if present. std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const; ``` We also provide two additional APIs for stats collection and reporting purposes. ```cpp Status TransactionDB::GetAllTimestampedSnapshots( std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`. Status TransactionDB::GetTimestampedSnapshots( TxnTimestamp ts_lb, TxnTimestamp ts_ub, std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; ``` To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release timestamped snapshots whose timestamps are older than or equal to a given threshold. ```cpp void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts); ``` Before shutdown, RocksDB will release all timestamped snapshots. Comparison with user-defined timestamp and how they can be combined: User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile mapping between snapshots (sequence numbers) and timestamps. Different internal keys with the same user key but different timestamps will be treated as different by compaction, thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection. In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent). The timestamped snapshot supports the semantics of reading at an exact point in time. Timestamped snapshots can also be used with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879 Test Plan: ``` make check TEST_TMPDIR=/dev/shm make crash_test_with_txn ``` Reviewed By: siying Differential Revision: D35783919 Pulled By: riversand963 fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4 |
|
Yanqin Jin | f890527b16 |
Update test for secondary instance in stress test (#10121)
Summary: This PR updates secondary instance testing in stress test by default. A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary. Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan to read some random keys with only very basic verification to make sure no assertion failure is triggered. Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled. Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic catch-up. In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not work well when the underlying file is renamed by primary, e.g. SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb make crash_test TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush ``` Reviewed By: ajkr Differential Revision: D36939458 Pulled By: riversand963 fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e |
|
Gang Liao | e6432dfd4c |
Make it possible to enable blob files starting from a certain LSM tree level (#10077)
Summary: Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed. In order to achieve this, we would like to do the following: - Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic) - Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level` - Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` ) - Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh` - Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool) - Ideally extend the C and Java bindings with the new option - Update the BlobDB wiki to document the new option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077 Reviewed By: ltamasi Differential Revision: D36884156 Pulled By: gangliao fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d |
|
Gang Liao | 3dc6ebaf74 |
Support specifying blob garbage collection parameters when CompactRange() (#10073)
Summary: Garbage collection is generally controlled by the BlobDB configuration options `enable_blob_garbage_collection` and `blob_garbage_collection_age_cutoff`. However, there might be use cases where we would want to temporarily override these options while performing a manual compaction. (One use case would be doing a full key-space manual compaction with full=100% garbage collection age cutoff in order to minimize the space occupied by the database.) Our goal here is to make it possible to override the configured GC parameters when using the `CompactRange` API to perform manual compactions. This PR would involve: - Extending the `CompactRangeOptions` structure so clients can both force-enable and force-disable GC, as well as use a different cutoff than what's currently configured - Storing whether blob GC should actually be enabled during a certain manual compaction and the cutoff to use in the `Compaction` object (considering the above overrides) and passing it to `CompactionIterator` via `CompactionProxy` - Updating the BlobDB wiki to document the new options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10073 Test Plan: Adding unit tests and adding the new options to the stress test tool. Reviewed By: ltamasi Differential Revision: D36848700 Pulled By: gangliao fbshipit-source-id: c878ef101d1c612429999f513453c319f75d78e9 |
|
Guido Tagliavini Ponce | b4d0e041d0 |
Add support for FastLRUCache in stress and crash tests. (#10081)
Summary: Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache. Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags---when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081 Test Plan: - To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`. - To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`. Reviewed By: anand1976 Differential Revision: D36839908 Pulled By: guidotag fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725 |
|
Yanqin Jin | 9901e7f681 |
Enable checkpoint and backup in db_stress when timestamp is enabled (#10047)
Summary: After https://github.com/facebook/rocksdb/issues/10030 and https://github.com/facebook/rocksdb/issues/10004, we can enable checkpoint and backup in stress tests when user-defined timestamp is enabled. This PR has no production risk. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10047 Test Plan: ``` TEST_TMPDIR=/dev/shm make crash_test_with_ts ``` Reviewed By: jowlyzhang Differential Revision: D36641565 Pulled By: riversand963 fbshipit-source-id: d86c9d87efcc34c32d1aa176af691d32b897644a |
|
Changyu Bi | cc23b46da1 |
Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857)
Summary:
An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API, ZDICT_finalizeDictionary(), can improve such a dictionary's effectiveness at low cost. This PR changes how dictionary is created by calling the ZSTD ZDICT_finalizeDictionary() API instead of creating raw content dictionary (when max_dict_buffer_bytes > 0), and pass in all buffered uncompressed data blocks as samples.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857
Test Plan:
#### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
Set up: change the parameter [here](
|
|
Jay Zhuang | c6d326d3d7 |
Track SST unique id in MANIFEST and verify (#9990)
Summary: Start tracking SST unique id in MANIFEST, which is used to verify with SST properties to make sure the SST file is not overwritten or misplaced. A DB option `try_verify_sst_unique_id` is introduced to enable/disable the verification, if enabled, it opens all SST files during DB-open to read the unique_id from table properties (default is false), so it's recommended to use it with `max_open_files = -1` to pre-open the files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990 Test Plan: unittests, format-compatible test, mini-crash Reviewed By: anand1976 Differential Revision: D36381863 Pulled By: jay-zhuang fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f |
|
Yanqin Jin | e3a3dbf2be |
Avoid overwriting options loaded from OPTIONS (#9943)
Summary: This is similar to https://github.com/facebook/rocksdb/issues/9862, including the following fixes/refactoring: 1. If OPTIONS file is specified via `-options_file`, majority of options will be loaded from the file. We should not overwrite options that have been loaded from the file. Instead, we configure only fields of options which are shared objects and not set by the OPTIONS file. We also configure a few fields, e.g. `create_if_missing` necessary for stress test to run. 2. Refactor options initialization into three functions, `InitializeOptionsFromFile()`, `InitializeOptionsFromFlags()` and `InitializeOptionsGeneral()` similar to db_bench. I hope they can be shared in the future. The high-level logic is as follows: ```cpp if (!InitializeOptionsFromFile(...)) { InitializeOptionsFromFlags(...); } InitializeOptionsGeneral(...); ``` 3. Currently, the setting for `block_cache_compressed` does not seem correct because it by default specifies a size of `numeric_limits<size_t>::max()` ((size_t)-1). According to code comments, `-1` indicates default value, which should be referring to `num_shard_bits` argument. 4. Clarify `fail_if_options_file_error`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9943 Test Plan: 1. make check 2. Run stress tests, and manually check generated OPTIONS file and compare them with input OPTIONS files Reviewed By: jay-zhuang Differential Revision: D36133769 Pulled By: riversand963 fbshipit-source-id: 35dacdc090a0a72c922907170cd132b9ecaa073e |
|
Hui Xiao | 3573558ec5 |
Rewrite memory-charging feature's option API (#9926)
Summary: **Context:** Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`. Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags. **Summary:** - Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that - Added missing db bench/stress test for some memory charging features - Renamed related test suite to indicate they are under the same theme of memory charging - Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication - Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926 Test Plan: - New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)` - New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)` - CI - db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465** 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078** - db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36054712 Pulled By: hx235 fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af |
|
Yanqin Jin | f6d9730ea1 |
Fix stress test with best-efforts-recovery (#9986)
Summary: This PR - since we are testing with disable_wal = true and best_efforts_recovery, we should set column family count to 1, due to the requirement of `ExpectedState` tracking and replaying logic. - during backup and checkpoint restore, disable best-efforts-recovery. This does not matter now because db_crashtest.py always disables wal when testing best-efforts-recovery. In the future, if we enable wal, then not setting `restore_opitions.best_efforts_recovery` will cause backup db not to recover the WALs, and differ from db (that enables WAL). - during verification of backup and checkpoint restore, print the key where inconsistency exists between expected state and db. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9986 Test Plan: TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery Reviewed By: siying Differential Revision: D36353105 Pulled By: riversand963 fbshipit-source-id: a484da161273e6216a1f7e245bac15a349693917 |
|
sdong | 736a7b5433 |
Remove own ToString() (#9955)
Summary: ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471 |
|
Andrew Kryczka | 62d84e2a2b |
db_stress fault injection in release mode (#9957)
Summary: Previously all fault injection was ignored in release mode. This PR adds it back except for read fault injection (`--read_fault_one_in > 0`) since its dependency (`IGNORE_STATUS_IF_ERROR`) is unavailable in release mode. Other notable changes include: - Moved `EnableWriteErrorInjection()` for `--write_fault_one_in > 0` so it's after `DB::Open()` without depending on `SyncPoint` - Made `--read_fault_one_in > 0` return an error in release mode - Updated `db_crashtest.py` to always set `--read_fault_one_in=0` in release mode Pull Request resolved: https://github.com/facebook/rocksdb/pull/9957 Test Plan: ``` $ DEBUG_LEVEL=0 make -j24 db_stress $ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox ``` Reviewed By: anand1976 Differential Revision: D36193830 Pulled By: ajkr fbshipit-source-id: 0b97946b4e3f06e3e0f6e7833c2763da08ec5321 |
|
Andrew Kryczka | a62506aee2 |
Enable unsynced data loss in crash test (#9947)
Summary: `db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`. Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947 Test Plan: - regular blackbox, blackbox --simple - various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16` Reviewed By: riversand963 Differential Revision: D36152775 Pulled By: ajkr fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8 |
|
sdong | 49628c9a83 |
Use std::numeric_limits<> (#9954)
Summary: Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954 Test Plan: See CI Runs. Reviewed By: riversand963 Differential Revision: D36173954 fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0 |
|
Yanqin Jin | 06394ff4e7 |
Fix a bug of CompactionIterator/CompactionFilter using `Delete` (#9929)
Summary: When compaction filter determines that a key should be removed, it updates the internal key's type to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction together with `SingleDelete`, it will cause compaction iterator to return Corruption. To fix the issue, compaction filter should return more information in addition to the intention of removing a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing `kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`. In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()` when calling from `Open()` so that we can set up the rollback deletion type callback for transactions. Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does not quite work with `MultiOpsTxnsStressTest`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929 Test Plan: make check make crash_test make crash_test_with_txn Reviewed By: anand1976 Differential Revision: D36069678 Pulled By: riversand963 fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92 |
|
Yanqin Jin | 94e245a14d |
Improve stress test for MultiOpsTxnsStressTest (#9829)
Summary: Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions. 1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw. 2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true. 3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn ``` Reviewed By: pdillinger Differential Revision: D35806678 Pulled By: riversand963 fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616 |
|
anand76 | c3d7e16252 |
Add WAL compression to stress tests (#9811)
Summary: Add the WAL compression feature to the stress test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9811 Reviewed By: riversand963 Differential Revision: D35414316 Pulled By: anand1976 fbshipit-source-id: 0c17b1ec55679a52f088ad368798b57139bd921a |
|
Hui Xiao | 49623f9c8e |
Account memory of big memory users in BlockBasedTable in global memory limit (#9748)
Summary: **Context:** Through heap profiling, we discovered that `BlockBasedTableReader` objects can accumulate and lead to high memory usage (e.g, `max_open_file = -1`). These memories are currently not saved, not tracked, not constrained and not cache evict-able. As a first step to improve this, similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track an estimate of `BlockBasedTableReader` object's memory in block cache and fail future creation if the memory usage exceeds the available space of cache at the time of creation. **Summary:** - Approximate big memory users (`BlockBasedTable::Rep` and `TableProperties` )' memory usage in addition to the existing estimated ones (filter block/index block/un-compression dictionary) - Charge all of these memory usages to block cache on `BlockBasedTable::Open()` and release them on `~BlockBasedTable()` as there is no memory usage fluctuation of concern in between - Refactor on CacheReservationManager (and its call-sites) to add concurrent support for BlockBasedTable used in this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9748 Test Plan: - New unit tests - db bench: `OpenDb` : **-0.52% in ms** - Setup `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -write_buffer_size=1048576` - Repeated run with pre-change w/o feature and post-change with feature, benchmark `OpenDb`: `./db_bench -benchmarks=readrandom -use_existing_db=1 -db=/dev/shm/testdb -reserve_table_reader_memory=true (remove this when running w/o feature) -file_opening_threads=3 -open_files=-1 -report_open_timing=true| egrep 'OpenDb:'` #-run | (feature-off) avg milliseconds | std milliseconds | (feature-on) avg milliseconds | std milliseconds | change (%) -- | -- | -- | -- | -- | -- 10 | 11.4018 | 5.95173 | 9.47788 | 1.57538 | -16.87382694 20 | 9.23746 | 0.841053 | 9.32377 | 1.14074 | 0.9343477536 40 | 9.0876 | 0.671129 | 9.35053 | 1.11713 | 2.893283155 80 | 9.72514 | 2.28459 | 9.52013 | 1.0894 | -2.108041632 160 | 9.74677 | 0.991234 | 9.84743 | 1.73396 | 1.032752389 320 | 10.7297 | 5.11555 | 10.547 | 1.97692 | **-1.70275031** 640 | 11.7092 | 2.36565 | 11.7869 | 2.69377 | **0.6635807741** - db bench on write with cost to cache in WriteBufferManager (just in case this PR's CRM refactoring accidentally slows down anything in WBM) : `fillseq` : **+0.54% in micros/op** `./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -disable_auto_compactions=1 -cost_write_buffer_to_cache=true -write_buffer_size=10000000000 | egrep 'fillseq'` #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) avg micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 6.15 | 0.260187 | 6.289 | 0.371192 | 2.260162602 20 | 7.28025 | 0.465402 | 7.37255 | 0.451256 | 1.267813605 40 | 7.06312 | 0.490654 | 7.13803 | 0.478676 | **1.060579461** 80 | 7.14035 | 0.972831 | 7.14196 | 0.92971 | **0.02254791432** - filter bench: `bloom filter`: **-0.78% in ms/key** - ` ./filter_bench -impl=2 -quick -reserve_table_builder_memory=true | grep 'Build avg'` #-run | (pre-PR) avg ns/key | std ns/key | (post-PR) ns/key | std ns/key | change (%) -- | -- | -- | -- | -- | -- 10 | 26.4369 | 0.442182 | 26.3273 | 0.422919 | **-0.4145720565** 20 | 26.4451 | 0.592787 | 26.1419 | 0.62451 | **-1.1465262** - Crash test `python3 tools/db_crashtest.py blackbox --reserve_table_reader_memory=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D35136549 Pulled By: hx235 fbshipit-source-id: 146978858d0f900f43f4eb09bfd3e83195e3be28 |
|
Peter Dillinger | 6534c6dea4 |
Fix remaining uses of "backupable" (#9792)
Summary: Various renaming and fixes to get rid of remaining uses of "backupable" which is terminology leftover from the original, flawed design of BackupableDB. Now any DB can be backed up, using BackupEngine. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9792 Test Plan: CI Reviewed By: ajkr Differential Revision: D35334386 Pulled By: pdillinger fbshipit-source-id: 2108a42b4575c8cccdfd791c549aae93ec2f3329 |
|
Akanksha Mahajan | fd66005628 |
Add 'adaptive_readahead' and 'async_io' options to db_stress (#9750)
Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/9750 Test Plan: export CRASH_TEST_EXT_ARGS=" --async_io=1 --adaptive_readahead=1; make -j crash_test Reviewed By: jay-zhuang Differential Revision: D35114326 Pulled By: akankshamahajan15 fbshipit-source-id: 8b05c95be09f7aff6cb9eb757aa20a6520349d45 |
|
Peter Dillinger | cff0d1e8e6 |
New backup meta schema, with file temperatures (#9660)
Summary: The primary goal of this change is to add support for backing up and restoring (applying on restore) file temperature metadata, without committing to either the DB manifest or the FS reported "current" temperatures being exclusive "source of truth". To achieve this goal, we need to add temperature information to backup metadata, which requires updated backup meta schema. Fortunately I prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version 6.19.0 for this kind of schema update. (Previously, backup meta schema was not extensible! Making this schema update public will allow some other "nice to have" features like taking backups with hard links, and avoiding crc32c checksum computation when another checksum is already available.) While schema version 2 is newly public, the default schema version is still 1. Until we change the default, users will need to set to 2 to enable features like temperature data backup+restore. New metadata like temperature information will be ignored with a warning in versions before this change and since 6.19.0. The metadata is considered ignorable because a functioning DB can be restored without it. Some detail: * Some renaming because "future schema" is now just public schema 2. * Initialize some atomics in TestFs (linter reported) * Add temperature hint support to SstFileDumper (used by BackupEngine) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660 Test Plan: related unit test majorly updated for the new functionality, including some shared testing support for tracking temperatures in a FS. Some other tests and testing hooks into production code also updated for making the backup meta schema change public. Reviewed By: ajkr Differential Revision: D34686968 Pulled By: pdillinger fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a |
|
Yanqin Jin | 5894761056 |
Improve stress test for transactions (#9568)
Summary: Test only, no change to functionality. Extremely low risk of library regression. Update test key generation by maintaining existing and non-existing keys. Update db_crashtest.py to drive multiops_txn stress test for both write-committed and write-prepared. Add a make target 'blackbox_crash_test_with_multiops_txn'. Running the following commands caught the bug exposed in https://github.com/facebook/rocksdb/issues/9571. ``` $rm -rf /tmp/rocksdbtest/* $./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 \ -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 \ -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 \ -ub_c=100 -destroy_db_initially=0 -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0 $./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 \ -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 -ub_c=100 -destroy_db_initially=0 \ -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0 ``` Running the following command caught a bug which will be fixed in https://github.com/facebook/rocksdb/issues/9648 . ``` $TEST_TMPDIR=/dev/shm make blackbox_crash_test_with_multiops_wc_txn ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9568 Reviewed By: jay-zhuang Differential Revision: D34308154 Pulled By: riversand963 fbshipit-source-id: 99ff1b65c19b46c471d2f2d3b47adcd342a1b9e7 |
|
Hui Xiao | ca0ef54f16 |
Rate-limit automatic WAL flush after each user write (#9607)
Summary: **Context:** WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR is to provide rate-limiting to auto WAL flush, the one that automatically happen after each user write operation (i.e, `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`. Note that we are NOT rate-limiting WAL flush that do NOT automatically happen after each user write, such as `Options::manual_wal_flush == true + manual FlushWAL()` (rate-limiting multiple WAL flushes), for the benefits of: - being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515) - being able to turn off some WAL flush's rate-limiting but not all (e.g, turn off specific the WAL flush of a critical user write like a service's heartbeat) `WriteOptions::rate_limiter_options` only accept `Env::IO_USER` and `Env::IO_TOTAL` currently due to an implementation constraint. - The constraint is that we currently queue parallel writes (including WAL writes) based on FIFO policy which does not factor rate limiter priority into this layer's scheduling. If we allow lower priorities such as `Env::IO_HIGH/MID/LOW` and such writes specified with lower priorities occurs before ones specified with higher priorities (even just by a tiny bit in arrival time), the former would have blocked the latter, leading to a "priority inversion" issue and contradictory to what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` right now before improving that scheduling. A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR. **Summary:** - Renamed test suite `DBRateLimiterTest to DBRateLimiterOnReadTest` for adding a new test suite - Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions - Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607 Test Plan: - Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`). - Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush` - `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100` - verified that WAL flush(i.e, system-call _write_) were chunked into 15 bytes and each _write_ was roughly 1 second apart - verified the chunking disappeared when `-rate_limit_auto_wal_flush=0` - crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0 --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal **Benchmarked on flush/compaction to ensure no performance regression:** - compaction with rate-limiting (see table 1, avg over 1280-run): pre-change: **915635 micros/op**; post-change: **907350 micros/op (improved by 0.106%)** ``` #!/bin/bash TEST_TMPDIR=/dev/shm/testdb START=1 NUM_DATA_ENTRY=8 N=10 rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt for i in $(eval echo "{$START..$NUM_DATA_ENTRY}") do NUM_RUN=$(($N*(2**($i-1)))) for j in $(eval echo "{$START..$NUM_RUN}") do ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact' done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt done ``` - compaction w/o rate-limiting (see table 2, avg over 640-run): pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)** ``` Same as above script, except that -rate_limiter_bytes_per_sec=0 ``` - flush with rate-limiting (see table 3, avg over 320-run, run on the [patch]( |
|
Peter Dillinger | 4a9ae4f713 |
Avoid .trash handling race in db_stress Checkpoint (#9673)
Summary: The shared SstFileManager in db_stress can create background work that races with TestCheckpoint such that DestroyDir fails because of file rename while it is running. Analogous to change already made for TestBackupRestore Pull Request resolved: https://github.com/facebook/rocksdb/pull/9673 Test Plan: make blackbox_crash_test for a while with checkpoint_one_in=100 Reviewed By: ajkr Differential Revision: D34702215 Pulled By: pdillinger fbshipit-source-id: ac3e166efa28cba6c6f4b9b391e799394603ebfd |