Summary:
Same as title
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10632
Test Plan: make crash_test -j32
Reviewed By: anand1976
Differential Revision: D39241479
Pulled By: akankshamahajan15
fbshipit-source-id: 5db5b0c007da786bacc1b30d8926d36d6d029b87
Summary:
Expanded `all_params` to include all parameters crash test may set. Previously, `atomic_flush` was not included in `all_params` and thus was not visible to `finalize_and_sanitize()`. The consequence was manual crash test runs could provide unsafe combinations of parameters to `db_stress`. For example, running `db_crashtest.py` with `-atomic_flush=0` could cause `db_stress` to run with `-atomic_flush=0 -disable_wal=1`, which is known to produce inconsistencies across column families.
While expanding `all_params`, I found we cannot have an entry in it for both `db_stress` and `db_crashtest.py`. So I renamed `enable_tiered_storage` to `test_tiered_storage` for `db_crashtest.py`, which appears more conventional anyways.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10654
Reviewed By: hx235
Differential Revision: D39369349
Pulled By: ajkr
fbshipit-source-id: 31d9010c760c868b20d5e9bd78ba75c8ff3ce348
Summary:
**Context/Summary:**
`ExpectedState` is not aware of transaction-related concept so `use_txn=1 ` is not compatible with `sync_fault_injection=1`. Therefore this PR disabled this combination until we expand our correctness testing to transaction related features.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10605
Test Plan:
- Run the following commands to verify `--use_txn` is correctly sanitized
- `python3 ./tools/db_crashtest.py blackbox --use_txn=1 --sync_fault_injection=1 `
- `python3 ./tools/db_crashtest.py blackbox --use_txn=0 --sync_fault_injection=1 `
Reviewed By: ajkr
Differential Revision: D39121287
Pulled By: hx235
fbshipit-source-id: 7d5d6dd32479ea1c07df4f38322650f3a60def9c
Summary:
Currently, `db_bench` and `db_stress` print the blob cache options even if
a shared block/blob cache is configured, i.e. when they are not actually
in effect. The patch changes this so they are only printed when a separate blob
cache is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10614
Test Plan: Tested manually using `db_bench` and `db_stress`.
Reviewed By: akankshamahajan15
Differential Revision: D39144603
Pulled By: ltamasi
fbshipit-source-id: f714304c5d46186f8514746c27ee6f52aa3e4af8
Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.
This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.
The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```
```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```
**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
- Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
- `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573
Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.
Reviewed By: ajkr
Differential Revision: D39005886
Pulled By: hx235
fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
Summary:
CI benchmarks refine configuration
Run only “essential” benchmarks, but for longer
Fix (reduce) the NUM_KEYS to ensure cached behaviour
Reduce level size to try to ensure more levels
Refine test durations again, more time per test, but fewer tests.
In CI benchmark mode, the only read test is readrandom.
There are still 3 mostly-read tests.
Goal is to squeeze complete run a little bit inside 1 hour so it doesn’t clash with the next run (cron scheduled for main branch), but it gets to run as long as possible, so that results are as credible as possible.
Reduce thread count to physical capacity, in an attempt to reduce throughput variance for write heavy tests. See Mark Callaghan’s comments in related documentation..
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10514
Reviewed By: ajkr
Differential Revision: D38952469
Pulled By: jay-zhuang
fbshipit-source-id: 72fa6bba897cc47066ced65facd1fd36e28f30a8
Summary:
The features that cannot work with disable_wal=1 due to unsynced data dropping (ingest_external_file_one_in and enable_compaction_filter) similarly cannot work with sync_fault_injection=1. This PR prevents those features from being used together with sync_fault_injection=1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10559
Reviewed By: hx235
Differential Revision: D38953019
Pulled By: ajkr
fbshipit-source-id: 7e2c7644ec84d7323f632cf976bcee00502d0ed7
Summary:
As mentioned in https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913,
`db_stress` does not have much verification for iterator correctness.
It has a `TestIterate()` function, but that is mainly for comparing results
between two iterators, one with `total_order_seek` and the other optionally
sets auto_prefix, upper/lower bounds. Commit 49a0581ad2462e31aa3f768afa769e0d33390f33
added a new `TestIterateAgainstExpected()` function that compares iterator against
expected state. It locks a range of keys, creates an iterator, does
a random sequence of `Next/Prev` and compares against expected state.
This PR is based on that commit, the main changes include some logs
(for easier debugging if a test fails), a forward and backward scan to
cover the entire locked key range, and a flag for optionally turning on
this version of Iterator testing.
Added constraint that the checks against expected state in
`TestIterateAgainstExpected()` and in `TestGet()` are only turned on
when `--skip_verifydb` flag is not set.
Remove the change log introduced in https://github.com/facebook/rocksdb/issues/10553.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10538
Test Plan:
Run `db_stress` with `--verify_iterator_with_expected_state_one_in=1`,
and a large `--iterpercent` and `--num_iterations`. Checked `op_logs`
manually to ensure expected coverage. Tweaked part of the code in
https://github.com/facebook/rocksdb/issues/10449 and stress test was able to catch it.
- internally run various flavor of crash test
Reviewed By: ajkr
Differential Revision: D38847269
Pulled By: cbi42
fbshipit-source-id: 8b4402a9bba9f6cfa08051943cd672579d489599
Summary:
Optionally issue DeleteRange in `*whilewriting` benchmarks. This happens in `BGWriter` and uses similar logic as in `DoWrite` to issue DeleteRange operations. I added this when I was benchmarking https://github.com/facebook/rocksdb/issues/10547, but this should be an independent PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10552
Test Plan: ran some benchmarks with various delete range options, e.g. `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --writes=200000 --reads=1000000 --disable_auto_compactions --max_num_range_tombstones=10000`
Reviewed By: ajkr
Differential Revision: D38927020
Pulled By: cbi42
fbshipit-source-id: 31ee20cb8127f7173f0816ea0cc2a204ec02aad6
Summary:
After branch 7.6.fb branch is cut, following release process, upgrade version number to 7.7 and add 7.6.fb to format compatibility check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10546
Test Plan: Watch CI
Reviewed By: ajkr
Differential Revision: D38892023
Pulled By: gitbw95
fbshipit-source-id: 94e96dedbd973f5f9713e73d3bed336e4678565b
Summary:
This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535
Test Plan:
1. Ensure existing MultiGet unit tests pass, updating them as necessary
2. New unit tests - TODO
3. Run stress test - TODO
No noticeable regression (<1%) without async IO -
Without PR: `multireadrandom : 7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations; 571.6 MB/s (8168992 of 8168992 found)`
With PR: `multireadrandom : 7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations; 568.2 MB/s (8271992 of 8271992 found)`
For a fully cached DB, but with async IO option on, no regression observed (<1%) -
Without PR: `multireadrandom : 5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations; 797.9 MB/s (11540992 of 11540992 found) `
With PR: `multireadrandom : 5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations; 790.7 MB/s (11649992 of 11649992 found) `
Reviewed By: akankshamahajan15
Differential Revision: D38774009
Pulled By: anand1976
fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461
Reviewed By: siying
Differential Revision: D38672823
Pulled By: ltamasi
fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
Summary:
Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519
Reviewed By: jay-zhuang
Differential Revision: D38603291
fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1
Summary:
Moved linux builds to using docker to avoid CI instability caused by dependency installation site down.
Added the `Dockerfile` which is used to build the image.
The build time is also significantly reduced, because no dependencies installation and with using 2xlarge+ instance for slow build (like tsan test).
Also fixed a few issues detected while building this:
* `DestoryDB()` Status not checked for a few tests
* nullptr might be used in `inlineskiplist.cc`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D38554200
Pulled By: jay-zhuang
fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd
Summary:
Change tiered compaction feature from `bottommost_temperture` to
`last_level_temperture`. The old option is kept for migration purpose only,
which is behaving the same as `last_level_temperture` and it will be removed in
the next release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471
Test Plan: CI
Reviewed By: siying
Differential Revision: D38450621
Pulled By: jay-zhuang
fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
Summary:
Currently user_defined_timestamp is failing in stress test with
subcompactions. So disabling it for now and will re enable it once its
fixed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10503
Test Plan: make crash_test_with_ts -j32
Reviewed By: riversand963
Differential Revision: D38510485
Pulled By: akankshamahajan15
fbshipit-source-id: 82fd0ec8cf86a96ff6653edd5bad7623cb9e0a15
Summary:
- Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there is any.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable.
```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16464 is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.
Single thread result:
| Max # tombstones | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op | 99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |6.68 |6.57 |6.72 |4.72 |4.79 |4.54 |
| 1 |6.67 |6.58 |6.62 |5.41 |4.74 |4.72 |
| 10 |6.59 |6.5 |6.56 |7.83 |4.69 |4.59 |
| 100 |6.62 |6.75 |6.58 |29.57 |5.04 |5.09 |
| 1000 |6.54 |6.82 |6.61 |320.33 |5.22 |5.21 |
32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op | 99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |234.52 |260.25 |239.42 |5.06 |5.38 |5.09 |
| 1 |236.46 |262.0 |231.1 |19.57 |22.14 |5.45 |
| 10 |236.95 |263.84 |251.49 |151.73 |21.61 |5.73 |
| 100 |268.16 |296.8 |280.13 |2308.52 |22.27 |6.57 |
Reviewed By: ajkr
Differential Revision: D37916564
Pulled By: cbi42
fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
Summary:
This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` already included `PinnableSlice` but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns.
The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will cleanup the bundle, including unreferencing the `SuperVersion`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458
Test Plan:
- new DB level test
- measured benefit/regression in a number of memtable scenarios
Setup command:
```
$ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000
```
Benchmark command:
```
./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10
```
Worst regression is when a key has many tiny operands:
- Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1
- `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%)
The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact.
Best improvement is when a key has large operands and high concurrency:
- Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32
- `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%).
Reviewed By: cbi42
Differential Revision: D38336578
Pulled By: ajkr
fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258
Summary:
If a db_bench process gets hung or runaway on a machine, that
could prevent regression_test.sh from ever making progress. To fix that,
regression_test.sh will now kill any db_bench process that is >12 hours
old. Also made this more reliable by not using string matching (grep) to
get db_bench process IDs.
I also had to make some other updates to get local runs working
reliably:
* Fix some quoting hell and other dubious complexity with db_bench_cmd
* Only save a DB for re-use when building it passes
* Report failed command in more cases
* Add safeguards against "rm -rf ."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10441
Test Plan:
manual (local and remote), with temporary changes e.g. to have
a manageable age threshold etc.
Reviewed By: riversand963
Differential Revision: D38285537
Pulled By: pdillinger
fbshipit-source-id: 4d598876aedc38ac4bd9d8ddf32c5995d8e44db8
Summary:
The secondary cache is randomly disabled or enabled with CompressedSecondaryCache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10442
Test Plan: - To test that the CompressedSecondaryCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test `
Reviewed By: anand1976
Differential Revision: D38290796
Pulled By: gitbw95
fbshipit-source-id: bb7027b39e0ed9c0c62835abe09e759898130ec8
Summary:
deleterandom tests are too fast to get good signal, e.g.
--deletes=31250 in 0.170 seconds vs. --reads=1500000 in 288.491
seconds for readrandom. Removing the special handling (unknown
motivation in faa7eb3b99) should suffice.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10437
Test Plan: watch continuous results
Reviewed By: ltamasi
Differential Revision: D38261185
Pulled By: pdillinger
fbshipit-source-id: 0f1b1b19efccda5689027d36cc2f01307f36031d
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309
Reviewed By: ltamasi
Differential Revision: D38211655
Pulled By: gangliao
fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
Summary:
In this PR we bring ClockCache closer to production quality. We implement the following changes:
1. Fixed a few bugs in ClockCache.
2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table.
3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test where either the LRU policy is required, and a test that dynamically increases the table capacity.
4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.)
5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache.
As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418
Test Plan:
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
Reviewed By: pdillinger
Differential Revision: D38170673
Pulled By: guidotag
fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a
Summary:
If user runs `db_bench` with `-perf_level=2` or higher, db_bench should
print perf context after each of all benchmarks.
Or make `-perf_level` a per-benchmark switch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10396
Test Plan: ./db_bench -benchmarks=fillseq,readseq -perf_level=2
Reviewed By: ajkr
Differential Revision: D38016324
Pulled By: riversand963
fbshipit-source-id: d83ea4abc34d40ffea394ca6abf0814bc5c0a2e0
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321
Reviewed By: ltamasi
Differential Revision: D37913590
Pulled By: gangliao
fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
Summary:
After branch 7.5.fb branch is cut, following release process, upgrade version number to 7.6 and add 7.5.fb to format compatibility check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10376
Test Plan: Watch CI
Reviewed By: ajkr
Differential Revision: D37927694
fbshipit-source-id: 71b37ae55ebb7c95a1bcc0d7eee643d6ba5f8461
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298
Reviewed By: ltamasi
Differential Revision: D37908743
Pulled By: gangliao
fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
Summary:
ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable.
The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify.
A new version of the NewClockCache function was created for our internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351
Test Plan: ``make -j24 check`` and re-run the pre-release tests.
Reviewed By: pdillinger
Differential Revision: D37802685
Pulled By: guidotag
fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972
Summary:
as title.
Test plan
- make check
- CI on PR
- TEST_TMPDIR=/dev/shm make crash_test_with_multiops_wp_txn (tested with successful run)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10350
Reviewed By: ajkr
Differential Revision: D37792872
Pulled By: riversand963
fbshipit-source-id: ff064093b7f715d0acf387af2e3ae87b1278b52b
Summary:
Previously the version was displayed as $major.$minor
This changes it to $major.$minor.$path
This also adds the git hash for the time from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people
who consume/graph these results are OK with it.
Example output:
ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id githash
609488 244.1 1GB 0.0GB, 1.4 0.7 93.3 39 38 0 0 1.6 1.0 4 15 26 5365 15 0.0 0 0.1 0.0 0.5 fillseq.wal_disabled.v400 2022-06-29T13:36:05 7.5.0 6115254416
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277
Test Plan: Run it
Reviewed By: jay-zhuang
Differential Revision: D37532418
Pulled By: mdcallag
fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d
Summary:
This adds --undefok to support use of this script with BlobDB for db_bench versions prior
to 7.5 when the options land in a release.
While there is a limit to how far back this script can go WRT backwards compatiblity,
this is an easy change to support early 7.x releases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276
Test Plan: Run it with versions of db_bench that do not and then do support these options
Reviewed By: gangliao
Differential Revision: D37529299
Pulled By: mdcallag
fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.
Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37522461
Pulled By: guidotag
fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
Summary:
Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter.
This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of
the presence of timestamps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259
Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts
Reviewed By: ajkr
Differential Revision: D37459692
Pulled By: riversand963
fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef
Summary:
Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252
Reviewed By: riversand963
Differential Revision: D37432948
Pulled By: ajkr
fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a