Commit graph

318 commits

Author SHA1 Message Date
Yu Zhang 68cc429be2 Fix stress test failure caused by #11424 (#11470)
Summary:
The `ryw_expected_values` check only applies to when transaction is used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11470

Reviewed By: hx235

Differential Revision: D46085614

Pulled By: jowlyzhang

fbshipit-source-id: 4757896c3a62975641adcf97db077a04a0f33030
2023-05-22 15:47:28 -07:00
Yu Zhang ffb5f1f445 Refactor WriteUnpreparedStressTest to be a unit test (#11424)
Summary:
This patch remove the "stress" aspect from the WriteUnpreparedStressTest and leave it to be a unit test for some correctness testing w.r.t. snapshot functionality. I added some read-your-write verification to the transaction test in db_stress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11424

Test Plan:
`./write_unprepared_transaction_test`
`./db_crashtest.py whitebox --txn`
`./db_crashtest.py blackbox --txn`

Reviewed By: hx235

Differential Revision: D45551521

Pulled By: jowlyzhang

fbshipit-source-id: 20c3d510eb4255b08ddd7b6c85bdb4945436f6e8
2023-05-22 12:31:52 -07:00
Hui Xiao 7263f51d50 Improve comment of ExpectedValue in db stress (#11456)
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/11424 made me realize there are a couple gaps in my `ExpectedValue` comments so I updated them, along with separating `ExpectedValue` into separate files so it's clearer that `ExpectedValue` can be used without updating `ExpectedState` (e.g, TestMultiGet() where we care about value base of expected value but not updating the ExpectedState).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11456

Test Plan: CI

Reviewed By: jowlyzhang

Differential Revision: D45965070

Pulled By: hx235

fbshipit-source-id: dcee690c13b00a3119757ea9d43b646f9644e1a9
2023-05-18 09:44:15 -07:00
Jay Huh 586d78b31e Remove wait_unscheduled from waitForCompact internal API (#11443)
Summary:
Context:

In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue.

In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code.

This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`.

Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`.

Furthermore, all tests that previously called `waitForCompact(true)` have been fixed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443

Test Plan:
Existing tests that involve a shutdown in progress:

- DBCompactionTest::CompactRangeShutdownWhileDelayed
- DBTestWithParam::PreShutdownMultipleCompaction
- DBTestWithParam::PreShutdownCompactionMiddle

Reviewed By: pdillinger

Differential Revision: D45923426

Pulled By: jaykorean

fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae
2023-05-17 18:13:50 -07:00
Hui Xiao 5fc57eec2b Support parallel read and write/delete to same key in NonBatchedOpsStressTest (#11058)
Summary:
**Context:**
Current `NonBatchedOpsStressTest` does not allow multi-thread read (i.e, Get, Iterator) and write (i.e, Put, Merge) or delete to the same key. Every read or write/delete operation will acquire lock (`GetLocksForKeyRange`) on the target key to gain exclusive access to it. This does not align with RocksDB's nature of allowing multi-thread read and write/delete to the same key, that is concurrent threads can issue read/write/delete to RocksDB without external locking. Therefore this is a gap in our testing coverage.

To close the gap, biggest challenge remains in verifying db value against expected state in presence of parallel read and write/delete. The challenge is due to read/write/delete to the db and read/write to expected state is not within one atomic operation. Therefore we may not know the exact expected state of a certain db read, as by the time we read the expected state for that db read, another write to expected state for another db write to the same key might have changed the expected state.

**Summary:**
Credited to ajkr's idea, we now solve this challenge by breaking the 32-bits expected value of a key into different parts that can be read and write to in parallel.

Basically we divide the 32-bits expected value into `value_base` (corresponding to the previous whole 32 bits but now with some shrinking in the value base range we allow), `pending_write` (i.e, whether there is an ongoing concurrent write), `del_counter` (i.e, number of times a value has been deleted, analogous to value_base for write), `pending_delete` (similar to pending_write) and `deleted` (i.e whether a key is deleted).

Also, we need to use incremental `value_base` instead of random value base as before because we want to control the range of value base a correct db read result can possibly be in presence of parallel read and write. In that way, we can verify the correctness of the read against expected state more easily. This is at the cost of reducing the randomness of the value generated in NonBatchedOpsStressTest we are willing to accept.

(For detailed algorithm of how to use these parts to infer expected state of a key, see the PR)

Misc: hide value_base detail from callers of ExpectedState by abstracting related logics into ExpectedValue class

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11058

Test Plan:
- Manual test of small number of keys (i.e, high chances of parallel read and write/delete to same key) with equally distributed read/write/deleted for 30 min
```
python3 tools/db_crashtest.py --simple {blackbox|whitebox} --sync_fault_injection=1 --skip_verifydb=0 --continuous_verification_interval=1000 --clear_column_family_one_in=0 --max_key=10 --column_families=1 --threads=32 --readpercent=25 --writepercent=25 --nooverwritepercent=0 --iterpercent=25 --verify_iterator_with_expected_state_one_in=1 --num_iterations=5 --delpercent=15 --delrangepercent=10 --range_deletion_width=5 --use_merge={0|1} --use_put_entity_one_in=0 --use_txn=0 --verify_before_write=0 --user_timestamp_size=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --flush_one_in=1000 --get_property_one_in=1000 --ingest_external_file_one_in=100 --backup_one_in=100 --checkpoint_one_in=100 --approximate_size_one_in=0 --acquire_snapshot_one_in=100 --use_multiget=0 --prefixpercent=0 --get_live_files_one_in=1000 --manual_wal_flush_one_in=1000 --pause_background_one_in=1000 --target_file_size_base=524288 --write_buffer_size=524288 --verify_checksum_one_in=1000 --verify_db_one_in=1000
```
- Rehearsal stress test for normal parameter and aggressive parameter to see if such change can find what existing stress test can find (i.e, no regression in testing capability)
- [Ongoing]Try to find new bugs with this change that are not found by current NonBatchedOpsStressTest with no parallel read and write/delete to same key

Reviewed By: ajkr

Differential Revision: D42257258

Pulled By: hx235

fbshipit-source-id: e6fdc18f1fad3753e5ac91731483a644d9b5b6eb
2023-05-15 15:34:22 -07:00
Changyu Bi 62fc15f009 Block per key-value checksum (#11287)
Summary:
add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are
1. checksum construction and verification in block.cc/h
2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h)
3. unit tests/crash test updates

Tests:
* Added unit tests
* Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576`

Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled.

Performance:
Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory.
For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates):

```
SETUP
make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench
./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none

BENCHMARK
./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE

The readrandom ops/sec looks like the following:
Block cache size:  2GB        1.2GB * 0.9    1.2GB * 0.8     1.2GB * 0.5   8MB
Main              240805     223604         198176           161653       139040
PR prot_bytes=0   238691     226693         200127           161082       141153
PR prot_bytes=1   214983     193199         178532           137013       108211
prot_bytes=1 vs    -10%        -15%          -10.8%          -15%        -23%
prot_bytes=0
```

The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287

Reviewed By: ajkr

Differential Revision: D43970708

Pulled By: cbi42

fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
2023-04-25 12:08:23 -07:00
Hui Xiao 3622cfa34a Add back io_uring stress test hack with DbStressFSWrapper for FS not supporting read async (#11404)
Summary:
**Context/Summary:**
To better utilize `DbStressFSWrapper` for some assertion, https://github.com/facebook/rocksdb/pull/11288 removed an io_uring stress test hack for POSIX FS not supporting read async added in https://github.com/facebook/rocksdb/pull/11242 = It was removed based on the assumption that a later PR https://github.com/facebook/rocksdb/pull/11296 is sufficient to serve as an alternative workaround.

But recent stress tests has shown the opposite, mostly because 11296  approach might be subjected to incompleteness when more `ReadOptions` are passed down as what https://github.com/facebook/rocksdb/pull/11288 has done.

As a short-term solution to both work around POSIX FS constraint above and utilize `DbStressFSWrapper` for 11288 assertion, I proposed this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11404

Test Plan:
- Stress test ensures 11288's assertion is still effective in `DbStressFSWrapper`
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=16 --bottommost_compression_type=disable --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=32767 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=$exp --fail_if_options_file_error=1 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --kill_random_test=888887 --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000000 --optimize_filters_for_memory=0 --paranoid_file_checks=0 --partition_filters=0 --partition_pinning=0 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=5 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=1 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=1 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=1 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=4194304 --write_dbid_to_manifest=1 --writepercent=35
```
- Monitor future stress test to show `MultiGet error: Not implemented: ReadAsync` is gone

Reviewed By: ltamasi

Differential Revision: D45242280

Pulled By: hx235

fbshipit-source-id: 9823e3fbd4e9672efdd31478a2f2cbd68a98bdf5
2023-04-24 15:14:23 -07:00
Hui Xiao 151242ce46 Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.

**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
   - Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288

Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
     - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689

// PlainTable
Does not apply
```
- **Db bench 2: performance**

**Read**

SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`

**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none

rocksdb.sst.read.micros  COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`

Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`

Reviewed By: ajkr

Differential Revision: D44007011

Pulled By: hx235

fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 09:07:18 -07:00
Andrew Kryczka 6cac4c79d4 Fix race condition in db_stress checkpoint cleanup (#11389)
Summary:
The old cleanup code had a race condition:

1. Test thread: DestroyDB() marked a file as trash
2. DeleteScheduler thread: Got the file's size and decided to delete it in chunks
3. Test thread: DestroyDir() deleted that trash file
4. DeleteScheduler thread: Began deleting in chunks starting by calling ReopenWritableFile(). Unfortunately this recreates the deleted trash file
5. Test thread: DestroyDir() fails to remove the parent directory because it contains the file created in 4.
6. Test thread: Checkpoint::Create() fails due to the directory already existing

It could be repro'd with the following patch/command.

Patch:

```
 diff --git a/file/delete_scheduler.cc b/file/delete_scheduler.cc
index 8a2d1615d..337d24a60 100644
 --- a/file/delete_scheduler.cc
+++ b/file/delete_scheduler.cc
@@ -317,6 +317,12 @@ Status DeleteScheduler::DeleteTrashFile(const std::string& path_in_trash,
                                            &num_hard_links, nullptr);
       if (my_status.ok()) {
         if (num_hard_links == 1) {
+          // Give some time for DestroyDir() to delete file entries. Then, the
+          // below `ReopenWritableFile()` will recreate files, preventing the
+          // parent directory from being deleted.
+          if (rand() % 2 == 0) {
+            usleep(1000);
+          }
           std::unique_ptr<FSWritableFile> wf;
           my_status = fs_->ReopenWritableFile(path_in_trash, FileOptions(), &wf,
                                               nullptr);
 diff --git a/file/file_util.cc b/file/file_util.cc
index 43608fcdc..2cee1ad8e 100644
 --- a/file/file_util.cc
+++ b/file/file_util.cc
@@ -263,6 +263,13 @@ Status DestroyDir(Env* env, const std::string& dir) {
     }
   }

+  // Give some time for the DeleteScheduler thread's ReopenWritableFile() to
+  // recreate deleted files
+  if (dir.find("checkpoint") != std::string::npos) {
+    fprintf(stderr, "waiting to destroy %s\n", dir.c_str());
+    usleep(10000);
+  }
+
   if (s.ok()) {
     s = env->DeleteDir(dir);
     // DeleteDir might or might not report NotFound
```

Command:

```
TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=131072 --target_file_size_base=131072 --max_bytes_for_level_base=524288 --checkpoint_one_in=100 --clear_column_family_one_in=0  --max_key=1000 --value_size_mult=33 --sst_file_manager_bytes_per_truncate=4096 --sst_file_manager_bytes_per_sec=1048576  --interval=3 --compression_type=none --sync_fault_injection=1
```

Obviously we don't want to use scheduled deletion here as we need the checkpoint directory deleted immediately. I suspect the DestroyDir() was an attempt to fixup incomplete DestroyDB()s. Now that we expect DestroyDB() to be complete I removed that code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11389

Reviewed By: hx235

Differential Revision: D45137142

Pulled By: ajkr

fbshipit-source-id: 2af743d342c77cc414fd25fc4c9d7c9c6079ad24
2023-04-20 12:48:53 -07:00
Levi Tamasi 0efd7b4ba1 Extend the stress test coverage of MultiGetEntity (#11336)
Summary:
Similarly to `GetEntity` prior to https://github.com/facebook/rocksdb/issues/11303, the `MultiGetEntity` API is currently
only used in the DB verification logic of the stress tests. The patch introduces
a new mode where all point lookups are performed using `MultiGetEntity`,
and implements the corresponding logic in the non-batched, batched, and
CF consistency tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11336

Test Plan: Ran simple blackbox tests for the various stress test flavors.

Reviewed By: akankshamahajan15

Differential Revision: D44513285

Pulled By: ltamasi

fbshipit-source-id: c3db098501bf875b6a356b09fc676a0268d92c35
2023-03-29 20:35:15 -07:00
Peter Dillinger 204fcff751 HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.

This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.

It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.

## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.

## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.

These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.

* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)

I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.

## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.

## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.

## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.

## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.

## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.

## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.

## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.

Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).

Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.

## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301

Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.

## Crash/stress test
Updated to use the new combination.

## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.

```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```

**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127

**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)

This is an acceptable difference IMHO.

Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.

Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec

So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:

**Before**:
multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec

Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.

Let's also look at Get() in db_bench:

```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
**After**:
readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec

That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:

**Before**:
readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
**After**:
readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec

Reviewed By: ltamasi

Differential Revision: D44177044

Pulled By: pdillinger

fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-17 20:23:49 -07:00
Levi Tamasi a72d55c99d Increase the stress test coverage of GetEntity (#11303)
Summary:
The `GetEntity` API is currently used in the stress tests for verification purposes;
this patch extends the coverage by adding a mode where all point lookups in
the non-batched, batched, and CF consistency stress tests are done using this API.
The PR also includes a bit of refactoring to eliminate some boilerplate code around
the wide-column consistency checks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11303

Test Plan: Ran stress tests of the batched, non-batched, and CF consistency varieties.

Reviewed By: akankshamahajan15

Differential Revision: D44148503

Pulled By: ltamasi

fbshipit-source-id: fecdbfd3e65a459bbf16ab7aa7b9173e19240077
2023-03-17 14:47:29 -07:00
Yu Zhang af7872ffd1 Fix a TestGet failure when user defined timestamp is enabled (#11249)
Summary:
Stressing small DB with small number of keys and user-defined timestamp enabled usually fails pretty quickly in TestGet.

Example command to reproduce the failure:

` tools/db_crashtest.py blackbox --enable_ts --simple --delrangepercent=0 --delpercent=5 --max_key=100 --interval=3 --write_buffer_size=262144 --target_file_size_base=262144 --max_bytes_for_level_base=262144 --subcompactions=1`

Example failure: `error : inconsistent values for key 0000000000000009000000000000000A7878: expected state has the key, Get() returns NotFound.`

Fixes this test failure by refreshing the read up to timestamp to the most up to date timestamp, a.k.a now, after a key is locked.  Without this, things could happen in this order and cause a test failure:

<table>
  <tr>
    <th>TestGet thread</th>
    <th> A writing thread</th>
  </tr>
  <tr>
    <td>read_opts.timestamp = GetNow()</td>
    <td></td>
  </tr>
  <tr>
    <td></td>
    <td>Lock key, do write</td>
  </tr>
  <tr>
    <td>Lock key, read(read_opts) return NotFound</td>
    <td></td>
  </tr>
</table>

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11249

Reviewed By: ltamasi

Differential Revision: D43551302

Pulled By: jowlyzhang

fbshipit-source-id: 26877ab379bdb97acd2682a2632bc29718427f38
2023-02-23 17:00:04 -08:00
Yu Zhang f007b8fdea Support iter_start_ts in integrated BlobDB (#11244)
Summary:
Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244

Test Plan:
```make check
./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs"
tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1```

Reviewed By: ltamasi

Differential Revision: D43506726

Pulled By: jowlyzhang

fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0
2023-02-22 15:44:59 -08:00
anand76 476b01579c Revert enabling IO uring in db_stress (#11242)
Summary:
IO uring usage is causing crash test failures due to bad cqe data being returned in the uring. Revert the change to enable IO uring in db_stress, and also re-enable async_io in CircleCI so that code path can be tested. Added the -use_io_uring flag to db_stress that, when false, will wrap the default env in db_stress to emulate async IO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11242

Reviewed By: akankshamahajan15

Differential Revision: D43470569

Pulled By: anand1976

fbshipit-source-id: 7c69ac3f53a79ade31d37313f815f1a4b6108b75
2023-02-21 12:53:55 -08:00
akankshamahajan 68e4581c67 Return NotSupported in scan if IOUring not supported and enable IOUring in db_stress for async_io testing (#11197)
Summary:
- Return NotSupported in scan if IOUring not supported if async_io is enabled
- Enable IOUring in db_stress for async_io testing
- Disable async_io in circleci crash testing as circleci doesn't support IOUring

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11197

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D43096313

Pulled By: akankshamahajan15

fbshipit-source-id: c2c53a87636950c0243038b9f5bd0d91608e4fda
2023-02-16 18:33:06 -08:00
Levi Tamasi ab22e79824 Support using MultiGetEntity as verification method in stress tests (#11228)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11228

Reviewed By: akankshamahajan15

Differential Revision: D43332120

Pulled By: ltamasi

fbshipit-source-id: 15f32cf335aecb7e654da24ecafc6e010dc65194
2023-02-15 17:08:25 -08:00
Yu Zhang c19672c187 Enable crash test to run BlobDB together with user-defined timestamp (#11199)
Summary:
I missed a stress test code sanity check when enabling this combination of tests. This PR addresses that, the "iter_start_ts" function for user defined timestamp feature is not supported when BlobDB is enabled. It's disabled for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11199

Test Plan:
Locally always enable BlobDB and run
tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887

Reviewed By: ltamasi

Differential Revision: D43245657

Pulled By: jowlyzhang

fbshipit-source-id: 4cae19817bb1afd50a76f9e0e49f006fb5c0b211
2023-02-13 13:40:02 -08:00
Levi Tamasi 753d4d5078 Support using GetEntity as a verification method in the non-batched stress tests (#11144)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11144

Test Plan: Ran a simple blackbox crash test.

Reviewed By: akankshamahajan15

Differential Revision: D42791464

Pulled By: ltamasi

fbshipit-source-id: 8eb6e62f0bc47f709816136ff3ded0a41d04fab8
2023-01-31 10:17:48 -08:00
Peter Dillinger 94e3beec77 Cleanup, improve, stress test LockWAL() (#11143)
Summary:
The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:

* Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
* Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
* Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
* Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
* Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
* Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143

Test Plan: unit tests added / updated, added to stress/crash test

Reviewed By: ajkr

Differential Revision: D42848627

Pulled By: pdillinger

fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
2023-01-30 22:52:30 -08:00
sdong 36174d89a6 DB Stress to fix a false assertion (#11164)
Summary:
Seeting this error in stress test:

db_stress: internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2459: void rocksdb::StressTest::Open(rocksdb::SharedState *): Assertion `txn_db_ == nullptr' failed. Received signal 6 (Aborted)
......

It doesn't appear that txn_db_ is set to nullptr at all. We set ithere.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11164

Test Plan: Run db_stress transaction and non-transation with low kill rate and see restarting without assertion

Reviewed By: ajkr

Differential Revision: D42855662

fbshipit-source-id: 06816d37cce9c94a81cb54ab238fb73aa102ed46
2023-01-30 19:45:47 -08:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Yu Zhang 6943ff6e50 Remove deprecated util functions in options_util.h (#11126)
Summary:
Remove the util functions in options_util.h that have previously been marked deprecated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126

Test Plan: `make check`

Reviewed By: ltamasi

Differential Revision: D42757496

Pulled By: jowlyzhang

fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb
2023-01-27 11:10:53 -08:00
Andrew Kryczka 97c1024d3e Include db_stress verification method in failure message (#11133)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11133

Test Plan:
- ran it a few times on a mismatching DB+expected state; verified error messages look right:

```
Verification failed for column family 0 key 000000000000D553000000000000014C0000000000000142 (163988): value_from_db: , value_from_expected: 25E7B53421202322, msg: GetMergeOperands verification: Value not found: NotFound:
Verification failed for column family 0 key 000000000000AAE2787878 (131123): value_from_db: , value_from_expected: B2A69C18B6B7B4B5BABBB8B9BEBFBCBDA2A3A0A1A6A7A4A5, msg: Iterator verification: Value not found: NotFound:
Verification failed for column family 0 key 00000000000080C6000000000000004C78787878 (98409): value_from_db: , value_from_expected: 67AB7E1E636261606F6E6D6C6B6A6968, msg: Get verification: Value not found: NotFound:
```

Reviewed By: hx235

Differential Revision: D42757072

Pulled By: ajkr

fbshipit-source-id: b0a4a0aaa5be5d110434324853ac92aaa6972d89
2023-01-27 07:45:25 -08:00
sdong e808858ae0 Remove Stats related to compressed block cache (#11135)
Summary:
Since compressed block cache is removed, those stats are not needed. They are removed in different PR in case there is a problem with it. The stats are removed in the same way in https://github.com/facebook/rocksdb/pull/11131/ . HISTORY.md was already updated by mistake, and it would be correct after merging this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11135

Test Plan: Watch CI

Reviewed By: ltamasi

Differential Revision: D42757616

fbshipit-source-id: bd7cb782585c8535ce5784295225c376f3011f35
2023-01-25 15:37:50 -08:00
sdong 2800aa069a Remove compressed block cache (#11117)
Summary:
Compressed block cache is replaced by compressed secondary cache. Remove the feature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117

Test Plan: See CI passes

Reviewed By: pdillinger

Differential Revision: D42700164

fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d
2023-01-24 17:09:19 -08:00
ehds 4737e1d41b fix shared state used after free (#11059)
Summary:
Before this pr,  the destruction order is `shared` -> `db_`(StressTest destruction) -> `stress`, but `compaction_filter` of `db_` will hold the `shared` pointer, so `shared` maybe used after free.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11059

Reviewed By: hx235

Differential Revision: D42297366

Pulled By: ajkr

fbshipit-source-id: 17b314635359acacd5ba62f9db5f955f451133f7
2023-01-04 19:35:34 -08:00
Hui Xiao b965a5a80e Add back Options::CompactionOptionsFIFO::allow_compaction to stress/crash test (#11063)
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/10777 was reverted (https://github.com/facebook/rocksdb/pull/10999) due to internal blocker and replaced with a better fix https://github.com/facebook/rocksdb/pull/10922. However, the revert also reverted the `Options::CompactionOptionsFIFO::allow_compaction` stress/crash coverage added by the PR.

It's an useful coverage cuz setting `Options::CompactionOptionsFIFO::allow_compaction=true` will [increase](https://github.com/facebook/rocksdb/blob/7.8.fb/db/version_set.cc#L3255) the compaction score of L0 files for FIFO and then trigger more FIFO compaction. This speed up discovery of bug related to FIFO compaction like https://github.com/facebook/rocksdb/pull/10955. To see the speedup, compare the failure occurrence in following commands with `Options::CompactionOptionsFIFO::allow_compaction=true/false`

```
--fifo_allow_compaction=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=8.869062094789008 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=100 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=15 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0  --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=65
```

Therefore this PR is adding it back to stress/crash test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11063

Test Plan: Rehearsal stress test to make sure stress/crash test is stable

Reviewed By: ajkr

Differential Revision: D42283650

Pulled By: hx235

fbshipit-source-id: 132e6396ab6e24d8dcb8fe51c62dd5211cdf53ef
2023-01-03 11:54:58 -08:00
anand76 692d6be358 Prevent db_stress failure when io_uring is disabled (#11045)
Summary:
The IO uring usage is disabled in RocksDB by default and, as a result, PosixRandomAccessFile::ReadAsync returns a NotSupported() status. This was causing stress test failures with MultiGet and async_io combination. Fix it by relying on redirection of ReadAsync to Read when default Env is used in db_stress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11045

Reviewed By: akankshamahajan15

Differential Revision: D42136213

Pulled By: anand1976

fbshipit-source-id: fc7904d8ece74d7e8f2e1a34c3d70bd5774fb45f
2022-12-19 11:38:42 -08:00
anand76 c3f720c60d Enable ReadAsync testing and fault injection in db_stress (#11037)
Summary:
The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037

Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890.

Reviewed By: akankshamahajan15

Differential Revision: D42061290

Pulled By: anand1976

fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074
2022-12-15 15:48:50 -08:00
Hui Xiao f1574a20ff Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999)
Summary:
**Context/Summary:**

This reverts commit fc74abb436 and related HISTORY record.

The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF.  This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004

Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D41572604

Pulled By: hx235

fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
2022-11-29 10:56:42 -08:00
Yanqin Jin a8a4ed52a4 Test Merge with timestamps in stress test (#10948)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10948

Test Plan: make crash_test_with_ts

Reviewed By: ltamasi

Differential Revision: D41390854

Pulled By: riversand963

fbshipit-source-id: 599e114da8e2b2bbff5628fb8c67fa0393a31c05
2022-11-17 20:43:50 -08:00
Peter Dillinger 32520df1d9 Remove prototype FastLRUCache (#10954)
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954

Test Plan: tests updated

Reviewed By: akankshamahajan15

Differential Revision: D41310123

Pulled By: pdillinger

fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
2022-11-16 10:15:55 -08:00
Levi Tamasi b644baa1eb Support using GetMergeOperands for verification with wide columns (#10952)
Summary:
With the recent changes, `GetMergeOperands` is now supported for wide-column entities as well, so we can use it for verification purposes in the non-batched stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10952

Test Plan: Ran a simple non-batched ops blackbox crash test.

Reviewed By: riversand963

Differential Revision: D41292114

Pulled By: ltamasi

fbshipit-source-id: 70b4c756a4a1fecb445c16c7096aad805a51203c
2022-11-15 08:06:41 -08:00
Levi Tamasi d484275230 Adjust value generation in batched ops stress tests (#10872)
Summary:
The patch adjusts the generation of values in batched ops stress tests so that the digits 0..9 are appended (instead of prepended) to the values written. This has the advantage of aligning the encoding of the "value base" into the value string across non-batched, batched, and CF consistency stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10872

Test Plan: Tested using some black box stress test runs.

Reviewed By: riversand963

Differential Revision: D40692847

Pulled By: ltamasi

fbshipit-source-id: 26bf8adff2944cbe416665f09c3bab89d80416b3
2022-10-25 17:51:20 -07:00
sdong 48fe921754 Run clang format against files under tools/ and db_stress_tool/ (#10868)
Summary:
Some lines of .h and .cc files are not properly fomatted. Clear them up with clang format.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10868

Test Plan: Watch existing CI to pass

Reviewed By: ajkr

Differential Revision: D40683485

fbshipit-source-id: 491fbb78b2cdcb948164f306829909ad816d5d0b
2022-10-25 14:29:41 -07:00
Hui Xiao fc74abb436 Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777)
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress

./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35

put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```

**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
-  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
   - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
   - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
  - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
  - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
    - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930  in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.

Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777

Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761  and on FIFO compaction only

Reviewed By: ajkr

Differential Revision: D40090485

Pulled By: hx235

fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
2022-10-25 10:39:58 -07:00
Jay Zhuang 8124bc3526 Enable preclude_last_level_data_seconds in stress test (#10824)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10824

Reviewed By: siying

Differential Revision: D40390535

Pulled By: jay-zhuang

fbshipit-source-id: 700803a1aff8a1e77c038740d87931577e79bcf6
2022-10-16 09:28:43 -07:00
Levi Tamasi 2f3042d732 Check wide columns in TestIterateAgainstExpected (#10820)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10820

Reviewed By: riversand963

Differential Revision: D40363653

Pulled By: ltamasi

fbshipit-source-id: d347547d8cdd3f8926b35b6af4d1fa0f827e4a10
2022-10-14 14:25:05 -07:00
Levi Tamasi eae3a686ee Check wide columns in TestIterate (#10818)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10818

Test Plan: Tested using some simple blackbox crash test runs in the various modes (non-batched, batched, CF consistency).

Reviewed By: riversand963

Differential Revision: D40349527

Pulled By: ltamasi

fbshipit-source-id: 2918bc26adbbeac314beaa958aafe770b01e5cc6
2022-10-13 12:06:36 -07:00
Levi Tamasi 23b7dc2f4f Check columns in CfConsistencyStressTest::VerifyDb (#10804)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10804

Reviewed By: riversand963

Differential Revision: D40279057

Pulled By: ltamasi

fbshipit-source-id: 9efc3dae7f5eaab162d55a41c58c2535b0a53054
2022-10-12 11:43:34 -07:00
Levi Tamasi 85399b14f7 Consider wide columns when checksumming in the stress tests (#10788)
Summary:
There are two places in the stress test code where we compute the CRC
for a range of KVs for the purposes of checking consistency, namely in the
CF consistency test (to make sure CFs contain the same data), and when
performing `CompactRange` (to make sure the pre- and post-compaction
states are equivalent). The patch extends the logic so that wide columns
are also considered in both cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10788

Test Plan: Tested using some simple blackbox crash test runs.

Reviewed By: riversand963

Differential Revision: D40191134

Pulled By: ltamasi

fbshipit-source-id: 542c21cac9077c6d225780deb210319bb5eee955
2022-10-11 14:40:25 -07:00
Hui Xiao f6a0065d54 Allow Flush(sync=true) not supported in DB::Open() and db_stress (#10784)
Summary:
**Context:**
https://github.com/facebook/rocksdb/pull/10698 made `Flush(sync=true)` required for` DB::Open()` (to pass the original but now deleted assertion `impl->TEST_WALBufferIsEmpty()` under `manual_wal_flush=true`, see https://github.com/facebook/rocksdb/pull/10698 summary for more ) as well as db_stress to pass.

However RocksDB users may not implement SyncWAL() (used inFlush(sync=true)). Therefore we replace such in DB::Open and db_stress in this PR and align with https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L1883-L1887 and https://github.com/facebook/rocksdb/blob/main/db_stress_tool/db_stress_test_base.cc#L847-L849

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10784

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D40193354

Pulled By: anand1976

fbshipit-source-id: e80d53880799ae01bdd717641d07997d3bfe2b54
2022-10-10 15:52:10 -07:00
Levi Tamasi 5182bf3f83 Skip column validation for non-value types when iter_start_ts is set (#10799)
Summary:
When the `iter_start_ts` read option is set, iterator exposes internal keys. This also includes tombstones, which by definition do not have a value (or columns). The patch makes sure we skip the wide-column consistency check in this case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10799

Test Plan: Tested using a simple blackbox crash test with timestamps enabled.

Reviewed By: jay-zhuang, riversand963

Differential Revision: D40235628

fbshipit-source-id: 49519fb55d8fe2bb9249ced809f7a81bff2b9df2
2022-10-10 15:07:07 -07:00
Levi Tamasi 7af47c532b Verify wide columns during prefix scan in stress tests (#10786)
Summary:
The patch adds checks to the
`{NonBatchedOps,BatchedOps,CfConsistency}StressTest::TestPrefixScan` methods
to make sure the wide columns exposed by the iterators are as expected (based on
the value base encoded into the iterator value). It also makes some code hygiene
improvements in these methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10786

Test Plan:
Ran some simple blackbox tests in the various modes (non-batched, batched,
CF consistency).

Reviewed By: riversand963

Differential Revision: D40163623

Pulled By: riversand963

fbshipit-source-id: 72f4c3b51063e48c15f974c4ec64d751d3ed0a83
2022-10-07 11:17:57 -07:00
Levi Tamasi d6d8c007ff Verify columns in NonBatchedOpsStressTest::VerifyDb (#10783)
Summary:
As the first step of covering the wide-column functionality of iterators
in our stress tests, the patch adds verification logic to
`NonBatchedOpsStressTest::VerifyDb` that checks whether the
iterator's value and columns are in sync. Note: I plan to update the other
types of stress tests and add similar verification for prefix scans etc.
in separate PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10783

Test Plan: Ran some simple blackbox crash tests.

Reviewed By: riversand963

Differential Revision: D40152370

Pulled By: riversand963

fbshipit-source-id: 8f9d17d7af5da58ccf1bd2057cab53cc9645ac35
2022-10-06 15:07:16 -07:00
Peter Dillinger b205c6d029 Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768)
Summary:
We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.)

This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768

Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test)

Reviewed By: anand1976

Differential Revision: D40034747

Pulled By: anand1976

fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68
2022-10-06 14:54:21 -07:00
Levi Tamasi 3ae00dec90 Disable ingestion in stress tests when PutEntity is used (#10769)
Summary:
`SstFileWriter` currently does not support the `PutEntity` API, so in `TestIngestExternalFile` all key-values are written using regular `Put`s. This violates the assumption that whether or not a key corresponds to a plain old key-value or a wide-column entity can be determined by solely looking at the "value base" used when generating the value. The patch fixes this issue by disabling ingestion when `PutEntity` is enabled in the stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10769

Test Plan: Ran a simple blackbox stress test.

Reviewed By: akankshamahajan15

Differential Revision: D40042132

Pulled By: ltamasi

fbshipit-source-id: 93e75ff55545b7b69fa4ddef1d96093c961158a0
2022-10-03 18:09:56 -07:00
Changyu Bi 8b430e01dc Add iterator refresh to stress test (#10766)
Summary:
added calls to `Iterator::Refresh()` in `NonBatchedOpsStressTest::TestIterateAgainstExpected()`. The testing key range is locked in `TestIterateAgainstExpected` so I do not expect this change to provide thorough stress test to `Iterator::Refresh()`. However, it can still be helpful for catching bugs like https://github.com/facebook/rocksdb/issues/10739. Will add calls to refresh in `TestIterate` once we support iterator refresh with snapshots.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10766

Test Plan: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=2`

Reviewed By: ajkr

Differential Revision: D40008320

Pulled By: ajkr

fbshipit-source-id: cec93b07f915ef6476d41c1fee9b23c115188085
2022-10-03 16:22:39 -07:00
Changyu Bi 9f2363f4c4 User-defined timestamp support for DeleteRange() (#10661)
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661

Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`.  Did not see consistent regression in no timestamp case.

| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|

Reviewed By: riversand963

Differential Revision: D39441192

Pulled By: cbi42

fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
2022-09-30 16:13:03 -07:00