Commit Graph

1505 Commits

Peter Dillinger d79be3dca2 Changes and enhancements to compression stats, thresholds (#11388)
Summary:
## Option API updates
* Add new CompressionOptions::max_compressed_bytes_per_kb, which corresponds to 1024.0 / min allowable compression ratio. This avoids the hard-coded minimum ratio of 8/7 (see the sketch after this list).
* Remove unnecessary constructor for CompressionOptions.
* Document undocumented CompressionOptions. Use idiom for default values shown clearly in one place (not precariously repeated).
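
As a usage illustration, here is a minimal sketch of the new option; the codec choice and the 2:1 threshold are arbitrary examples, not part of this change:

```cpp
#include "rocksdb/options.h"

// Sketch: require ~2:1 compression (at most 512 compressed bytes per 1024
// input bytes); blocks compressing worse than that are stored uncompressed.
// The old hard-coded 8/7 minimum corresponds to 1024 * 7 / 8 = 896.
rocksdb::Options MakeCompressionOptions() {
  rocksdb::Options options;
  options.compression = rocksdb::kZSTD;  // arbitrary example codec
  options.compression_opts.max_compressed_bytes_per_kb = 512;
  return options;
}
```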

## Stat API updates
* Deprecate the BYTES_COMPRESSED, BYTES_DECOMPRESSED histograms. Histograms incur substantial extra space & time costs compared to tickers, and the distribution of uncompressed data block sizes tends to be uninteresting. If we're interested in that distribution, I don't see why it should be limited to blocks stored as compressed.
* Deprecate the NUMBER_BLOCK_NOT_COMPRESSED ticker, because the name is very confusing.
* New or existing tickers relevant to compression:
  * BYTES_COMPRESSED_FROM
  * BYTES_COMPRESSED_TO
  * BYTES_COMPRESSION_BYPASSED
  * BYTES_COMPRESSION_REJECTED
  * COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES (both existing)
  * NUMBER_BLOCK_COMPRESSED (existing)
  * NUMBER_BLOCK_COMPRESSION_BYPASSED
  * NUMBER_BLOCK_COMPRESSION_REJECTED
  * BYTES_DECOMPRESSED_FROM
  * BYTES_DECOMPRESSED_TO

We can compute a number of things with these stats:
* "Successful" compression ratio: BYTES_COMPRESSED_FROM / BYTES_COMPRESSED_TO
* Compression ratio of data on which compression was attempted: (BYTES_COMPRESSED_FROM + BYTES_COMPRESSION_REJECTED) / (BYTES_COMPRESSED_TO + BYTES_COMPRESSION_REJECTED)
* Compression ratio of data that could be eligible for compression: (BYTES_COMPRESSED_FROM + X) / (BYTES_COMPRESSED_TO + X) where X = BYTES_COMPRESSION_BYPASSED + BYTES_COMPRESSION_REJECTED
* Overall SST compression ratio (compression disabled vs. actual): (Y - BYTES_COMPRESSED_TO + BYTES_COMPRESSED_FROM) / Y where Y = COMPACT_WRITE_BYTES + FLUSH_WRITE_BYTES

Keeping _REJECTED separate from _BYPASSED helps us to understand "wasted" CPU time in compression.
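
As a consumer-side sketch, assuming a Statistics object registered via Options::statistics, the first two ratios above can be computed like this (the ticker names are the ones introduced by this change):

```cpp
#include <cstdint>

#include "rocksdb/statistics.h"

// Sketch: "successful" compression ratio from the new tickers.
double SuccessfulCompressionRatio(const rocksdb::Statistics& stats) {
  uint64_t from = stats.getTickerCount(rocksdb::BYTES_COMPRESSED_FROM);
  uint64_t to = stats.getTickerCount(rocksdb::BYTES_COMPRESSED_TO);
  return to == 0 ? 0.0 : static_cast<double>(from) / to;
}

// Sketch: ratio over all data on which compression was attempted.
double AttemptedCompressionRatio(const rocksdb::Statistics& stats) {
  // Rejected blocks count the same bytes on both sides (stored uncompressed).
  uint64_t rejected =
      stats.getTickerCount(rocksdb::BYTES_COMPRESSION_REJECTED);
  uint64_t from =
      stats.getTickerCount(rocksdb::BYTES_COMPRESSED_FROM) + rejected;
  uint64_t to = stats.getTickerCount(rocksdb::BYTES_COMPRESSED_TO) + rejected;
  return to == 0 ? 0.0 : static_cast<double>(from) / to;
}
```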

## BlockBasedTableBuilder
Various small refactorings, optimizations, and name clean-ups.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11388

Test Plan:
unit tests added

* `options_settable_test.cc`: use non-deprecated idiom for configuring CompressionOptions from string. The old idiom is tested elsewhere and does not need to be updated to support the new field.

Reviewed By: ajkr

Differential Revision: D45128202

Pulled By: pdillinger

fbshipit-source-id: 5a652bf5c022b7ec340cf79018cccf0686962803
2023-04-21 21:57:40 -07:00
Hui Xiao 151242ce46 Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not distinguish between the compaction and flush cases but aggregates them, which is not so helpful for understanding the IO read behavior of each.

**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
   - Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertions in `DbStressFSWrapper`, for continuous testing that we are passing the correct `io_activity` under db open, flush and compaction (a sketch of reading the new histograms follows this list)
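
As an illustration of consuming the split stats, here is a minimal sketch; the `FILE_READ_FLUSH_MICROS` / `FILE_READ_COMPACTION_MICROS` enum identifiers are my assumption of the C++ names behind the stat strings shown in the test plan below:

```cpp
#include <cinttypes>
#include <cstdio>

#include "rocksdb/statistics.h"

// Sketch: compare flush vs. compaction SST read latencies.
void PrintSstReadBreakdown(const rocksdb::Statistics& stats) {
  rocksdb::HistogramData flush_reads, compaction_reads;
  stats.histogramData(rocksdb::FILE_READ_FLUSH_MICROS, &flush_reads);
  stats.histogramData(rocksdb::FILE_READ_COMPACTION_MICROS,
                      &compaction_reads);
  printf("flush reads: count=%" PRIu64 " p99=%.1fus\n", flush_reads.count,
         flush_reads.percentile99);
  printf("compaction reads: count=%" PRIu64 " p99=%.1fus\n",
         compaction_reads.count, compaction_reads.percentile99);
}
```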

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288

Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ the sum of the COUNTs of rocksdb.file.read.flush.micros and rocksdb.file.read.compaction.micros.** (without blob)
     - They may not be exactly equal because `HistogramStat::Add` only guarantees atomicity, not accuracy, across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689

// PlainTable
Does not apply
```
- **Db bench 2: performance**

**Read**

SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```
run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`

**Compaction/Flush**

run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none

rocksdb.sst.read.micros  COUNT : 33820
rocksdb.file.read.flush.micros COUNT : 1800
rocksdb.file.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`

Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`

Reviewed By: ajkr

Differential Revision: D44007011

Pulled By: hx235

fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 09:07:18 -07:00
Andrew Kryczka 0a774a102f Clarify `SstFileWriter::DeleteRange()` ordering requirements (#11390)
Summary:
As titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11390

Reviewed By: cbi42

Differential Revision: D45148830

Pulled By: ajkr

fbshipit-source-id: 9a8dfd040514bae3d8ed9e97a79cae7683f2749a
2023-04-20 13:02:16 -07:00
Peter Dillinger 9b698cda51 Update GeneralTableTest::ApproximateOffsetOfCompressed values (#11384)
Summary:
Because of this failure with snappy 1.1.8, ROCKSDB_NO_FBCODE=1

```
table/table_test.cc:4231: Failure
Value 3531 is not in range [2000, 3525]
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11384

Test Plan: run updated test in failing configuration

Reviewed By: ajkr

Differential Revision: D45057161

Pulled By: pdillinger

fbshipit-source-id: 397054f08033315e2e2bd9410f1fa32ddbf3b9c8
2023-04-17 14:17:18 -07:00
Murali Vilayannur 226ee25d30 Block fetch CPU time counters in perf context (#11342)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11342

Reviewed By: ajkr

Differential Revision: D45026838

Pulled By: mnv104

fbshipit-source-id: 099ed9579922b8fa6e7d3332bbb829d50ec47d91
2023-04-15 11:09:44 -07:00
Peter Dillinger f9db0c6e9c Refactor block cache tracing w/improved MultiGet (#11339)
Summary:
After https://github.com/facebook/rocksdb/issues/11301, I wasn't sure whether I had regressed block cache tracing with MultiGet. Demo PR https://github.com/facebook/rocksdb/issues/11330 shows the flawed state of tracing MultiGet before my change, and based on the unit test, there was essentially no change in tracing behavior with https://github.com/facebook/rocksdb/issues/11301. This change is to leave that code and behavior better than I found it.

This change is not intended to change any production behaviors except when block cache tracing is active, though might improve general read path efficiency by disabling some related tracking when such tracing is disabled.

More detail on production code:
* Refactoring to consolidate the construction of BlockCacheTraceRecord, and other related functionality, in block-based table reader, though it's somewhat awkward to preserve an optimization to avoid copying Slices into temporary strings in BlockCacheLookupContext.
* Accurately track cache hits and misses (etc.) for each data block accessed by a MultiGet(). (Previously reported hits as misses.)
* Reduced repeated checking of `block_cache_tracer_` state (by creating lookup_context only when active) for efficiency and to reduce the risk of corner case bugs where tracing is enabled or disabled for different parts of a read op. (See a TODO below)
* Improved estimate calculation for num_keys_in_block (see code comment)

Possible follow-up:
* `XXX:` use_cache=true means double cache query? (possible double-query of block cache when allow_mmap_reads=true)
* `TODO:` need more than one lookup_context here to track individual filter and index partition hits and misses
* `TODO:` optimize more state checks of `block_cache_tracer_` down to `lookup_context != nullptr`
* Pre-existing `XXX:` There appear to be 'break' statements above that bypass this writing of the block cache trace record
* Expand test coverage (see below)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11339

Test Plan:
* Added a basic unit test for block cache tracing MultiGet, for now just covering one data block with two keys.
* Added HitMissCountingCache to independently verify that the actual block cache trace and expected block cache trace also agree with the actual number of cache hits / misses (nothing missing or mislabeled). For now only used with MultiGet test.
* Better testing of num_keys_in_block, for now just with MultiGet
* Misc improvements to table_test to improve clarity, such as making it clear that certain keys are auto-inserted at the start of every test.

Performance test:
Testing multireadrandom as in https://github.com/facebook/rocksdb/issues/11301, except averaging over distinct runs rather than [-X30] which doesn't seem to sufficiently reset after each run to work as an independent test run.

Base with revert of 11301: 3148926 ops/sec
Base: 3019146 ops/sec
New: 2999529 ops/sec

Possibly a tiny MultiGet CPU regression with this change. We are now always allocating an additional vector for the LookupContexts. I'm still contemplating options to try to correct the regression in https://github.com/facebook/rocksdb/issues/11301.

Testing readrandom:
Base with revert of 11301: 2311988
Base: 2281726
New: 2299722

Possibly a tiny Get CPU improvement with this change. We are now avoiding some unnecessary LookupContext population.

Reviewed By: akankshamahajan15

Differential Revision: D44557845

Pulled By: pdillinger

fbshipit-source-id: b841691799d2a48fb59cc8880dc7cbb1e107ae3d
2023-04-07 12:55:56 -07:00
Changyu Bi f631138e1c Better support for merge operation with data block hash index (#11356)
Summary:
when data block hash index finds a key of op_type `kTypeMerge`, do not redo data block seek.
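
For context, a minimal sketch of enabling the data block hash index that this change optimizes (matching `--data_block_index_type=1` / `-use_data_block_hash_index=1` in the test plan below):

```cpp
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Sketch: enable the data block hash index for point lookups.
rocksdb::Options MakeHashIndexOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.data_block_index_type =
      rocksdb::BlockBasedTableOptions::kDataBlockBinaryAndHash;
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```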

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11356

Test Plan:
- added new unit test
- crash test: `python3 tools/db_crashtest.py whitebox --simple --use_merge=1 --data_block_index_type=1`
- benchmark see slight improvement in read throughput:
```
TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=mergerandom -use_existing_db=false -num=10000000 -compression_type=none -level_compaction_dynamic_level_bytes=1 -merge_operator=PutOperator -write_buffer_size=1000000 --use_data_block_hash_index=1

TEST_TMPDIR=/dev/shm/hashindex ./db_bench -benchmarks=readrandom[-X10] -use_existing_db=true -num=10000000 -merge_operator=PutOperator -readonly=1 -disable_auto_compactions=1 -reads=100000

Main: readrandom [AVG 10 runs] : 29526 (± 1118) ops/sec;    2.1 (± 0.1) MB/sec
Post-PR: readrandom [AVG 10 runs] : 31095 (± 662) ops/sec;    2.2 (± 0.0) MB/sec
```

Reviewed By: pdillinger

Differential Revision: D44759895

Pulled By: cbi42

fbshipit-source-id: 387f0c35938c7e0e96b810ca3babf1967fc68191
2023-04-07 10:06:03 -07:00
Wentian Guo 0578d9f951 Filter table files by timestamp: Get operator (#11332)
Summary:
If RocksDB enables user-defined timestamps, then the RocksDB read path can filter table files by the min/max timestamps of each file. If an application wants to look up a key that is the most recent and visible at a certain timestamp ts, then we can compare ts with the min_ts of each file. If ts < min_ts, then we know that no key in the file is visible at time ts, so we do not have to open the file. This also saves an IO.
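
A hypothetical sketch of the rule described above; `FileMeta` and its `min_timestamp` field are illustrative stand-ins, not RocksDB types:

```cpp
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

// Illustrative stand-in for per-file metadata (not a RocksDB type).
struct FileMeta {
  rocksdb::Slice min_timestamp;  // smallest timestamp of any key in the file
};

bool MayContainKeyVisibleAt(const rocksdb::Comparator* ucmp,
                            const FileMeta& file,
                            const rocksdb::Slice& read_ts) {
  // If read_ts < min_timestamp, every key in the file was written after
  // read_ts, so the file can be skipped without opening it.
  return ucmp->CompareTimestamp(read_ts, file.min_timestamp) >= 0;
}
```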

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11332

Reviewed By: pdillinger

Differential Revision: D44763497

Pulled By: guowentian

fbshipit-source-id: abde346b9f18480fe03c04e4006e7d62aa9c22a8
2023-04-06 15:39:38 -07:00
Andrew Kryczka b45738622a Use user-provided ReadOptions for metadata block reads more often (#11208)
Summary:
This is mostly taken from https://github.com/facebook/rocksdb/issues/10427 with my own comments addressed. This PR plumbs the user’s `ReadOptions` down to `GetOrReadIndexBlock()`, `GetOrReadFilterBlock()`, and `GetFilterPartitionBlock()`. Now those functions no longer have to make up a `ReadOptions` with incomplete information.

I also let `PartitionIndexReader::NewIterator()` pass through its caller's `ReadOptions::verify_checksums`, which was inexplicably dropped previously.
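
For illustration, a minimal sketch of the caller-side setting that now reaches metadata block reads as well:

```cpp
#include "rocksdb/options.h"

// Sketch: with this change, this setting also applies to index/filter
// (metadata) block reads outside of table open, not just data blocks.
rocksdb::ReadOptions MakeNoVerifyReadOptions() {
  rocksdb::ReadOptions read_options;
  read_options.verify_checksums = false;  // default is true
  return read_options;
}
```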

Fixes https://github.com/facebook/rocksdb/issues/10463

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11208

Test Plan:
Functional:
- Measured `-verify_checksum=false` applies to metadata blocks read outside of table open
  - setup command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56`
  - run command: `TEST_TMPDIR=/tmp/100M-DB/ ./db_bench -benchmarks=readrandom -use_existing_db=true -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -compression_type=none -num=1638400 -key_size=8 -value_size=56 -duration=10 -threads=32 -cache_size=131072 -statistics=true -verify_checksum=false -open_files=20 -cache_index_and_filter_blocks=true`
  - before: `rocksdb.block.checksum.compute.count COUNT : 384353`
  - after: `rocksdb.block.checksum.compute.count COUNT : 22`

Performance:
- Setup command (tmpfs, 128MB logical data size, cache indexes/filters without pinning so index/filter lookups go through table reader): `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=filluniquerandom,waitforcompaction -write_buffer_size=131072 -target_file_size_base=131072 -max_bytes_for_level_base=524288 -compression_type=none -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1`
- Measured point lookup performance. Database is fully cached to emphasize any new callstack overheads
  - Command: `TEST_TMPDIR=/dev/shm/128M-DB/ ./db_bench -benchmarks=readrandom[-W1][-X20] -use_existing_db=true -cache_index_and_filter_blocks=true -disable_auto_compactions=true -num=4194304 -key_size=8 -value_size=24 -bloom_bits=8 -whole_key_filtering=1 -duration=10 -cache_size=1048576000`
  - Before: `readrandom [AVG    20 runs] : 274848 (± 3717) ops/sec;    8.4 (± 0.1) MB/sec`
  - After: `readrandom [AVG    20 runs] : 277904 (± 4474) ops/sec;    8.5 (± 0.1) MB/sec`

Reviewed By: hx235

Differential Revision: D43145366

Pulled By: ajkr

fbshipit-source-id: 75ec062ece86a82cd788783de9de2c72df57f994
2023-04-04 16:53:14 -07:00
Peter Dillinger 3c17930ede Change default block cache from 8MB to 32MB (#11350)
Summary:
... which increases default number of shards from 16 to 64. Although the default block cache size is only recommended for applications where RocksDB is not performance-critical, under stress conditions, block cache mutex contention could become a performance bottleneck. This change of default should alleviate that.

Note that reducing the size of cache shards (recommended minimum 512MB) could cause thrashing, e.g. on filter blocks, so capacity needs to increase to safely increase the number of shards.

The 8MB default dates back to 2011 or earlier (f779e7a5), when the maximum number of simultaneous threads you could get from a single CPU socket was around 20 (e.g. Intel Xeon E7-8870). Now more than 100 are available.
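
For reference, a minimal sketch of configuring a cache matching the new default explicitly (performance-critical applications should size this deliberately rather than rely on the default):

```cpp
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Sketch: 32MB capacity with 2^6 = 64 shards, matching the new default.
rocksdb::Options MakeOptionsWithBlockCache() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache =
      rocksdb::NewLRUCache(/*capacity=*/32 << 20, /*num_shard_bits=*/6);
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```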

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11350

Test Plan: unit tests updated

Reviewed By: cbi42

Differential Revision: D44674873

Pulled By: pdillinger

fbshipit-source-id: 91ed3070789b42679283c7e6dc97c41a6a97bdf4
2023-04-04 15:33:24 -07:00
Andrew Kryczka 9f8cdc8ad6 validate SstFileWriter range tombstones cover positive ranges (#11322)
Summary:
As titled. This is the same as https://github.com/facebook/rocksdb/issues/6788 but for range tombstones written through `SstFileWriter` rather than through `DB`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11322

Reviewed By: cbi42

Differential Revision: D44317733

Pulled By: ajkr

fbshipit-source-id: f6eb8791ae2c09c169b6bfe0d047449d924b377e
2023-03-22 21:03:13 -07:00
Peter Dillinger 204fcff751 HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.

This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.

It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.

## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
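
A minimal sketch of what this enables; the sizes are arbitrary, and the compressed secondary cache is just one possible secondary:

```cpp
#include <memory>

#include "rocksdb/cache.h"

// Sketch: attach a secondary cache to HyperClockCache via the relocated
// ShardedCacheOptions::secondary_cache option.
std::shared_ptr<rocksdb::Cache> MakeHccWithSecondary() {
  rocksdb::CompressedSecondaryCacheOptions secondary_opts;
  secondary_opts.capacity = 256 << 20;
  rocksdb::HyperClockCacheOptions cache_opts(
      /*_capacity=*/1 << 30, /*_estimated_entry_charge=*/8 << 10);
  cache_opts.secondary_cache =
      rocksdb::NewCompressedSecondaryCache(secondary_opts);
  return cache_opts.MakeSharedCache();
}
```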

## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.

These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.

* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)

I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.

## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.

## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.

## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.

## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.

## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.

## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.

## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.

Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).

Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.

## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301

Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.

## Crash/stress test
Updated to use the new combination.

## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.

```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```

**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127

**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)

This is an acceptable difference IMHO.

Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.

Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec

So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:

**Before**:
multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec

Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.

Let's also look at Get() in db_bench:

```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
**After**:
readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec

That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:

**Before**:
readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
**After**:
readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec

Reviewed By: ltamasi

Differential Revision: D44177044

Pulled By: pdillinger

fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-17 20:23:49 -07:00
Peter Dillinger ccaa3225b0 Simplify tracking entries already in SecondaryCache (#11299)
Summary:
In preparation for factoring secondary cache support out of individual Cache implementations, we can get rid of the "in secondary cache" flag on entries through a workable hack: when an entry is promoted from secondary, it is inserted in primary using a helper that lacks secondary cache support, thus preventing re-insertion into secondary cache through existing logic.

This adds to the complexity of building CacheItemHelpers, because you always have to be able to get to an equivalent helper without secondary cache support, but that complexity is reasonably isolated within RocksDB typed_cache.h and test code.

gcc-7 seems to have problems with constexpr constructor referencing `this` so removed constexpr support on CacheItemHelper.

Also refactored some related test code to share common code / functionality.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11299

Test Plan: existing tests

Reviewed By: anand1976

Differential Revision: D44101453

Pulled By: pdillinger

fbshipit-source-id: 7a59d0a3938ee40159c90c3e65d7004f6a272345
2023-03-15 17:51:44 -07:00
Hui Xiao bab5f9a6f2 Add new stat rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit} (#11265)
Summary:
**Context/Summary:**
We are adding new stats to measure the prefetched tail size and lookups into this buffer.

The stat collection is done in FilePrefetchBuffer, but only for the prefetched tail buffer during table open for now, using a FilePrefetchBuffer enum. It's cleaner than the alternative of implementing it in the upper-level call places of FilePrefetchBuffer for table open. It also has the benefit of being extensible to other types of FilePrefetchBuffer if needed. See db bench below for the perf regression concern.
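
As a consumer-side sketch; the enum identifiers are my assumption of the C++ names behind the stat strings shown below:

```cpp
#include <cstdint>
#include <cstdio>

#include "rocksdb/statistics.h"

// Sketch: compute the hit rate of the prefetched tail buffer.
void PrintTailPrefetchHitRate(const rocksdb::Statistics& stats) {
  uint64_t hits =
      stats.getTickerCount(rocksdb::TABLE_OPEN_PREFETCH_TAIL_HIT);
  uint64_t misses =
      stats.getTickerCount(rocksdb::TABLE_OPEN_PREFETCH_TAIL_MISS);
  double rate = hits + misses == 0 ? 0.0 : 100.0 * hits / (hits + misses);
  printf("table-open tail prefetch hit rate: %.2f%%\n", rate);
}
```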

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11265

Test Plan:
**- Piggyback on existing test**
**- rocksdb.table.open.prefetch.tail.miss is harder to unit test, so I manually set the prefetch tail read bytes to be small and ran db bench.**
```
./db_bench -db=/tmp/testdb -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3  -use_direct_reads=true
```
```
rocksdb.table.open.prefetch.tail.read.bytes P50 : 4096.000000 P95 : 4096.000000 P99 : 4096.000000 P100 : 4096.000000 COUNT : 225 SUM : 921600
rocksdb.table.open.prefetch.tail.miss COUNT : 91
rocksdb.table.open.prefetch.tail.hit COUNT : 1034
```
**- No perf regression observed in db_bench**

SETUP command: create same db with ~900 files for pre-change/post-change.
```
./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=500000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none
```
TEST command 60 runs or til convergence: as suggested by anand1976 and akankshamahajan15, vary `seek_nexts` and `async_io` in testing.
```
./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom[-X60] -num=50000 -seek_nexts={10, 500, 1000} -async_io={0|1} -use_direct_reads=true
```
async io = 0, direct io read = true

  | seek_nexts = 10, 30 runs | seek_nexts = 500, 12 runs | seek_nexts = 1000, 6 runs
-- | -- | -- | --
pre-change | 4776 (± 28) ops/sec;   24.8 (± 0.1) MB/sec | 288 (± 1) ops/sec;   74.8 (± 0.4) MB/sec | 145 (± 4) ops/sec;   75.6 (± 2.2) MB/sec
post-change | 4790 (± 32) ops/sec;   24.9 (± 0.2) MB/sec | 288 (± 3) ops/sec;   74.7 (± 0.8) MB/sec | 143 (± 3) ops/sec;   74.5 (± 1.6) MB/sec

async io = 1, direct io read = true
  | seek_nexts = 10, 54 runs | seek_nexts = 500, 6 runs | seek_nexts = 1000, 4 runs
-- | -- | -- | --
pre-change | 3350 (± 36) ops/sec;   17.4 (± 0.2) MB/sec | 264 (± 0) ops/sec;   68.7 (± 0.2) MB/sec | 138 (± 1) ops/sec;   71.8 (± 1.0) MB/sec
post-change | 3358 (± 27) ops/sec;   17.4 (± 0.1) MB/sec  | 263 (± 2) ops/sec;   68.3 (± 0.8) MB/sec | 139 (± 1) ops/sec;   72.6 (± 0.6) MB/sec

Reviewed By: ajkr

Differential Revision: D43781467

Pulled By: hx235

fbshipit-source-id: a706a18472a8edb2b952bac3af40eec803537f2a
2023-03-15 14:02:43 -07:00
Levi Tamasi 49881921cd Rename a recently added PerfContext counter (#11294)
Summary:
The patch renames the counter added in https://github.com/facebook/rocksdb/issues/11284 for better consistency with the existing naming scheme.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11294

Test Plan: `make check`

Reviewed By: jowlyzhang

Differential Revision: D44035964

Pulled By: ltamasi

fbshipit-source-id: 8b1a2a03ee728148365367e0ecc1fcf462f62191
2023-03-13 18:43:27 -07:00
Levi Tamasi 1d52438504 Add a PerfContext counter for merge operands applied in point lookups (#11284)
Summary:
The existing PerfContext counter `internal_merge_count` only tracks the
Merge operands applied during range scans. The patch adds a new counter
called `internal_merge_count_point_lookups` to track the same metric
for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and
also fixes a couple of cases in the iterator where the existing counter wasn't
updated.
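
A minimal sketch of reading the new counter after a point lookup, using the name as introduced here (note that the follow-up commit above renames it):

```cpp
#include <cinttypes>
#include <cstdio>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/perf_context.h"
#include "rocksdb/perf_level.h"

// Sketch: count merge operands applied while serving a single Get().
void CountMergesInGet(rocksdb::DB* db, const rocksdb::Slice& key) {
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableCount);
  rocksdb::get_perf_context()->Reset();
  std::string value;
  db->Get(rocksdb::ReadOptions(), key, &value).PermitUncheckedError();
  printf("merge operands applied: %" PRIu64 "\n",
         rocksdb::get_perf_context()->internal_merge_count_point_lookups);
}
```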

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284

Test Plan: `make check`

Reviewed By: jowlyzhang

Differential Revision: D43926082

Pulled By: ltamasi

fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9
2023-03-08 18:22:11 -08:00
Peter Dillinger e01073252b Tests verifying non-zero checksums of zero bytes (#11260)
Summary:
Adds unit tests verifying that a block payload and checksum of all zeros is not falsely considered valid data. The tests exhaustively check that blocks of all zeros, up to some length (default 20K, more exhaustively 10M), do not produce a block checksum of all zeros.

Also small refactoring of an existing checksum test to use parameterized test. (Suggest hiding whitespace changes for review.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11260

Test Plan:
this is the test, manual run with
`ROCKSDB_THOROUGH_CHECKSUM_TEST=1` to verify up to 10M.

Reviewed By: hx235

Differential Revision: D43706192

Pulled By: pdillinger

fbshipit-source-id: 95e721c320ca928e7fa2400c2570fb359cc30b1f
2023-03-06 11:53:09 -08:00
Changyu Bi d053926fa2 Improve documentation for MergingIterator (#11161)
Summary:
Add some comments to try to explain how/why MergingIterator works. Made some small refactoring, mostly in MergingIterator::SkipNextDeleted() and MergingIterator::SeekImpl().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11161

Test Plan:
crash test with small key range:
```
python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=6000 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10
```

Reviewed By: ajkr

Differential Revision: D42860994

Pulled By: cbi42

fbshipit-source-id: 3f0c1c9c6481a7f468bf79d823998907a8116e9e
2023-03-03 12:17:30 -08:00
Levi Tamasi 3c9eed688e Enable moving a string or PinnableSlice into PinnableWideColumns (#11248)
Summary:
This makes it possible to eliminate some copies in `GetEntity` / `MultiGetEntity`,
in particular when `Merge`s or blobs are involved.
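
For context, a minimal sketch of the read side that benefits from the eliminated copies (the key is a placeholder):

```cpp
#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

// Sketch: read a wide-column entity; plain key-values come back wrapped
// in an entity with a single default column.
rocksdb::Status ReadEntity(rocksdb::DB* db, const rocksdb::Slice& key) {
  rocksdb::PinnableWideColumns columns;
  rocksdb::Status s = db->GetEntity(rocksdb::ReadOptions(),
                                    db->DefaultColumnFamily(), key, &columns);
  if (s.ok()) {
    for (const rocksdb::WideColumn& column : columns.columns()) {
      (void)column;  // column.name() / column.value() view pinned data
    }
  }
  return s;
}
```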

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11248

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43544215

Pulled By: ltamasi

fbshipit-source-id: bc4c8955a24bbd8bc4ab098e72133ead757f9707
2023-02-24 10:33:00 -08:00
Changyu Bi 229297d1b8 Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113)
Summary:
A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction output consists of a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).

The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvements after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone, to reduce overlap with grandparents.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113

Test Plan:
* added unit test in db_range_del_test
* crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`

Reviewed By: ajkr

Differential Revision: D42655709

Pulled By: cbi42

fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c
2023-02-22 12:28:18 -08:00
mrambacher b6640c3117 Remove FactoryFunc from LoadXXXObject (#11203)
Summary:
The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available.  With the removal of LITE mode, the function was no longer required.

Note that the MergeOperator had some private classes defined in source (.cc) files. To gain access to their constructors (and name methods), the class definitions were moved into header files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203

Reviewed By: cbi42

Differential Revision: D43160255

Pulled By: pdillinger

fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c
2023-02-17 12:54:07 -08:00
Andrew Kryczka 25e1365227 Merge operator failed subcode (#11231)
Summary:
From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.

This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions.
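
A minimal sketch of the caller-side distinction this enables:

```cpp
#include "rocksdb/status.h"

// Sketch: distinguish merge-operator failures from RocksDB-detected
// data corruption.
bool IsMergeOperatorFailure(const rocksdb::Status& s) {
  return s.IsCorruption() &&
         s.subcode() == rocksdb::Status::SubCode::kMergeOperatorFailed;
}
```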

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231

Test Plan: updated unit test

Reviewed By: akankshamahajan15

Differential Revision: D43396607

Pulled By: ajkr

fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9
2023-02-17 10:58:46 -08:00
Andrew Kryczka 6aef1a05d6 Use CacheDependencies() at start of ApproximateKeyAnchors() (#11230)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11230

Test Plan:
- setup command: `$ ./db_bench -benchmarks=fillrandom,compact -compression_type=none -num=1000000 -write_buffer_size=4194304 -target_file_size_base=4194304 -use_direct_io_for_flush_and_compaction=true -partition_index_and_filters=true -bloom_bits=10 -metadata_block_size=1024`
- measure small read count bucketed by size: `$ strace -fye pread64 ./db_bench.ctrl -use_existing_db=true -benchmarks=compact -compaction_readahead_size=4194304 -compression_type=none -num=1000000 -write_buffer_size=4194304 -target_file_size_base=4194304 -use_direct_io_for_flush_and_compaction=true -partition_index_and_filters=true -bloom_bits=10 -metadata_block_size=1024  -subcompactions=4 -cache_size=1048576000  2>&1 >/dev/null | awk '/= [0-9]+$/{print "[", int($NF / 1024), "KB,", int(1 + $NF / 1024), "KB)"}' | sort -n -k 2 | uniq -c | head -3`
- before:
```
   1119 [ 0 KB, 1 KB)
      1 [ 6 KB, 7 KB)
      2 [ 7 KB, 8 KB)
```
- after:
```
    242 [ 0 KB, 1 KB)
      1 [ 6 KB, 7 KB)
      2 [ 7 KB, 8 KB)
```

Reviewed By: pdillinger

Differential Revision: D43388507

Pulled By: ajkr

fbshipit-source-id: a02413c9f615b00784700646825a9870ee10f3a7
2023-02-17 09:03:37 -08:00
Levi Tamasi 9794acb597 Add a new MultiGetEntity API (#11222)
Summary:
The new `MultiGetEntity` API can be used to get a consistent view of
a batch of keys, with the results presented as wide-column entities.
Similarly to `GetEntity` and the iterator's `columns` API, if the entry
corresponding to the key is a wide-column entity to start with, it is
returned as-is, and if it is a plain key-value, it is wrapped into an entity
with a single default column.

Implementation-wise, the new API shares the logic of the batched `MultiGet`
API (via the `MultiGetCommon` methods). Both single-CF and multi-CF
`MultiGetEntity` APIs are provided, and blobs are also supported.
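
A minimal usage sketch of the single-CF overload (the batched signature is assumed to mirror `MultiGet`, per the summary above; the keys are placeholders):

```cpp
#include <array>

#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

// Sketch: fetch a consistent batch of two entities.
void ReadTwoEntities(rocksdb::DB* db) {
  std::array<rocksdb::Slice, 2> keys{{"key1", "key2"}};
  std::array<rocksdb::PinnableWideColumns, 2> results;
  std::array<rocksdb::Status, 2> statuses;
  db->MultiGetEntity(rocksdb::ReadOptions(), db->DefaultColumnFamily(),
                     keys.size(), keys.data(), results.data(),
                     statuses.data());
}
```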

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11222

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43256950

Pulled By: ltamasi

fbshipit-source-id: 47fb2cb7e2d0470e3580f43fdb2fe9e51f0e7005
2023-02-15 09:34:17 -08:00
Wentian Guo 42d6652ba2 remove dependency on options.h for port_posix.h andport_win.h (#11214)
Summary:
The files in `port/`, such as `port_posix.h`, are layering over the system libraries, so they shouldn't include DB-specific files like `options.h`. This PR removes that dependency.

# How
The reason that `port_posix.h` (or `port_win.h`) includes `options.h` is to use `CpuPriority`, as there is a method `SetCpuPriority()` in `port_posix.h` that uses `CpuPriority`.
- I think `SetCpuPriority()` makes sense to exist in `port_posix.h` as it has a platform-dependent implementation
- The `CpuPriority` enum is defined in `env.h`, but used in both `rocksdb/include` and `port/`.

Hence, let us define the `CpuPriority` enum in a common file, say `port_defs.h`, such that both directories `rocksdb/include` and `port/` can include it.

When we remove this dependency, some other files have compile errors because they can't find definitions, so we add the necessary header files to resolve them.

# Test
make all check -j

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11214

Reviewed By: pdillinger

Differential Revision: D43196910

Pulled By: guowentian

fbshipit-source-id: 70deccb72844cfb08fcc994f76c6ef6df5d55ab9
2023-02-13 02:21:38 -08:00
Peter Dillinger 3cacd4b4ec Put Cache and CacheWrapper in new public header (#11192)
Summary:
The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.

In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544

Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.
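
A minimal sketch of the customization pattern this exposes; derived classes inherit forwarding behavior for everything except Name():

```cpp
#include <memory>
#include <utility>

#include "rocksdb/advanced_cache.h"

// Sketch: a pass-through wrapper; CacheWrapper forwards all Cache methods
// to the wrapped target, and Name() must be provided by the subclass.
class PassThroughCache : public rocksdb::CacheWrapper {
 public:
  explicit PassThroughCache(std::shared_ptr<rocksdb::Cache> target)
      : CacheWrapper(std::move(target)) {}
  const char* Name() const override { return "PassThroughCache"; }
};
```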

Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192

Test Plan: `make check`

Reviewed By: anand1976

Differential Revision: D43055487

Pulled By: pdillinger

fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
2023-02-09 12:12:02 -08:00
Hui Xiao 6650ca244e Remove a couple deprecated convenience.h APIs (#11120)
Summary:
**Context/Summary:**
As instructed by convenience.h comments, a few deprecated APIs are removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11120

Test Plan:
- make check & CI
- eyeball check on test semantics.

Reviewed By: pdillinger

Differential Revision: D42937507

Pulled By: hx235

fbshipit-source-id: a9e4709387da01b1d0e9148c2e210f02e9746ee1
2023-02-07 14:11:53 -08:00
Yu Zhang 24ac53d81a Use user key on sst file for blob verification for Get and MultiGet (#11105)
Summary:
Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.

Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105

Test Plan:
make V=1 db_blob_basic_test
./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"

Reviewed By: ltamasi

Differential Revision: D42716487

Pulled By: jowlyzhang

fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
2023-01-30 10:21:21 -08:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively maintaining RocksDB LITE recently and its size must have gone up significantly. We are removing the support.

Most of the changes were done with the following command:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Other changes were manually applied to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Karim TAAM a1e92bd956 use verify checksum option in block based table reader Open() (#11099)
Summary:
## Description
In issue https://github.com/facebook/rocksdb/issues/11002 we found that when we use RocksDB with the `verify_checksums` read option set to false, the verification is done anyway.

By analyzing the code along the stacktrace, I saw that at the level of https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129f we are not keeping all the options and forget the `verify_checksums` setting.

The comment in this class suggests that it should be managed: https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fL581

(screenshot: https://user-images.githubusercontent.com/26581503/213152802-c46bc1c7-a3a2-4a6f-9bb1-bf92ee93af7a.png)

This PR just adds the line to manage `verify_checksums`.

## Tests

- Running unit tests
- Test without setting `verify checksum` and verifying that we are calling the checksum code
- Test by setting `verify checksum` to true and verifying that we are calling the checksum code
- Test by setting `verify checksum` to false and verifying that we are **not** calling the checksum code

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11099

Reviewed By: cbi42

Differential Revision: D42679881

Pulled By: ajkr

fbshipit-source-id: c7dd10768282fd0699f7e1bf397ceb7adbea4ab6
2023-01-26 17:38:59 -08:00
Heiko Becker 88edfbfb5e Fix build with gcc 13 by including <cstdint> (#11118)
Summary:
Like other versions before, gcc 13 moved some includes around and as a result <cstdint> is no longer transitively included [1]. Explicitly include it for uint{32,64}_t.

[1] https://gcc.gnu.org/gcc-13/porting_to.html#header-dep-changes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11118

Reviewed By: cbi42

Differential Revision: D42711356

Pulled By: ajkr

fbshipit-source-id: 5ea257b85b7017f40fd8fdbce965336da95c55b2
2023-01-25 14:30:32 -08:00
sdong 2800aa069a Remove compressed block cache (#11117)
Summary:
Compressed block cache is replaced by compressed secondary cache. Remove the feature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117

Test Plan: See CI passes

Reviewed By: pdillinger

Differential Revision: D42700164

fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d
2023-01-24 17:09:19 -08:00
Andrew Kryczka b7fbcefda8 Add API to limit blast radius of merge operator failure (#11092)
Summary:
Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.

To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.
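
A minimal sketch of a merge operator opting in (`TryMerge` is a hypothetical placeholder for the real merge logic):

```cpp
#include <string>

#include "rocksdb/merge_operator.h"

// Sketch: on failure under kMustMerge, flushes/compactions keep the
// unmerged operands instead of putting the DB into read-only mode.
class FailSoftOperator : public rocksdb::MergeOperator {
 public:
  bool FullMergeV2(const MergeOperationInput& merge_in,
                   MergeOperationOutput* merge_out) const override {
    if (!TryMerge(merge_in, &merge_out->new_value)) {
      merge_out->op_failure_scope = OpFailureScope::kMustMerge;
      return false;  // fail, but limit the blast radius
    }
    return true;
  }
  const char* Name() const override { return "FailSoftOperator"; }

 private:
  // Hypothetical placeholder for the actual merge logic.
  bool TryMerge(const MergeOperationInput&, std::string*) const {
    return true;
  }
};
```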

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092

Reviewed By: siying

Differential Revision: D42525673

Pulled By: ajkr

fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee
2023-01-20 14:40:30 -08:00
leipeng a5bcbcd8be remove unused InternalIteratorBase::is_mutable_ (#11104)
Summary:
`InternalIteratorBase::is_mutable_` is not used any more, remove it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11104

Reviewed By: ajkr

Differential Revision: D42582747

Pulled By: cbi42

fbshipit-source-id: d30bf75151fc8414df0ae112a6ec4943b5b7330b
2023-01-19 13:28:58 -08:00
Changyu Bi f515d9d203 Revert #10802 Consider range tombstone in compaction output file cutting (#11089)
Summary:
This reverts commit f02c708aa3 since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them), and I do not have high confidence that I can fix all of them and ensure no further ones before the next release branch cut. There are also some existing issues found during bug fixing. We will work on them and try to merge the change into the release after.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089

Test Plan: existing CI.

Reviewed By: ajkr

Differential Revision: D42505972

Pulled By: cbi42

fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153
2023-01-13 12:28:21 -08:00
Peter Dillinger 9f7801c5f1 Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).

The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.

* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)

## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
  * Simpler for implementations to deal with just one Insert and one Lookup.
  * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
  * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
  * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
  * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
  * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.

## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).

The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.

These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
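To illustrate the shape of these interfaces, here is a minimal, generic sketch of the typed-wrapper idea, not the real typed_cache.h declarations: a thin template over an untyped `void*` cache so that call sites get typed pointers without casts or explicit deleters.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

class UntypedCache {  // stand-in for the underlying untyped Cache
 public:
  // (Re-inserting an existing key is not handled in this sketch.)
  void Insert(const std::string& key, void* obj, void (*del)(void*)) {
    map_[key] = Entry{obj, del};
  }
  void* Lookup(const std::string& key) {
    auto it = map_.find(key);
    return it == map_.end() ? nullptr : it->second.obj;
  }
  ~UntypedCache() {
    for (auto& kv : map_) kv.second.del(kv.second.obj);
  }

 private:
  struct Entry { void* obj; void (*del)(void*); };
  std::unordered_map<std::string, Entry> map_;
};

template <typename TValue>
class BasicTypedCache {  // typed facade; the deleter is derived from the type
 public:
  explicit BasicTypedCache(UntypedCache* cache) : cache_(cache) {}
  void Insert(const std::string& key, std::unique_ptr<TValue> value) {
    cache_->Insert(key, value.release(),
                   [](void* p) { delete static_cast<TValue*>(p); });
  }
  TValue* Lookup(const std::string& key) {
    return static_cast<TValue*>(cache_->Lookup(key));
  }

 private:
  UntypedCache* cache_;
};
```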

Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.

## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.

## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.

The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.

## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, the code for warming caches was quite ugly. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)

## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975

Test Plan:
tests updated

Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):

34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83

Almost uniformly improving over the base revision, especially for hot paths with HyperClockCache, with up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement there is likely coming from the much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), possibly from less branching in block_based_table_reader, and likely a small improvement from not reconstituting the key for DeleterFn.

Reviewed By: anand1976

Differential Revision: D42417818

Pulled By: pdillinger

fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2023-01-11 14:20:40 -08:00
anand76 bec4264813 Avoid mixing sync and async prefetch (#11050)
Summary:
Reading the uncompression dict block always uses sync reads, while data blocks may use async reads and prefetching. This causes problems in FilePrefetchBuffer, so avoid mixing the two by reading the uncompression dict straight from the file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11050

Test Plan: Crash test

Reviewed By: akankshamahajan15

Differential Revision: D42194682

Pulled By: anand1976

fbshipit-source-id: aaa8b396fdfe966b157e210f5ef8501c45b7b69e
2022-12-21 22:42:19 -08:00
Changyu Bi f02c708aa3 Consider range tombstone in compaction output file cutting (#10802)
Summary:
This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that do not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR addresses the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvements after this PR include 1) cutting compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key, and 2) cutting a compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output).
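As a rough illustration of the idea, not the actual `CompactionOutputs` logic, the sketch below shows the cut decision applied uniformly; `GrandparentOverlap` and its method are hypothetical stand-ins.

```cpp
#include <cstdint>
#include <string>

// Hypothetical estimator (stubbed for illustration) of how many next-level
// bytes the current output would overlap if it grew to include `key`.
struct GrandparentOverlap {
  uint64_t BytesOverlappedUpTo(const std::string& /*key*/) const { return 0; }
};

// The cut decision itself is unchanged; what's new is *where* it is invoked.
bool ShouldStopBefore(const GrandparentOverlap& gp, const std::string& key,
                      uint64_t max_compaction_bytes) {
  return gp.BytesOverlappedUpTo(key) > max_compaction_bytes;
}

// In the compaction output loop, this check now runs not only for each point
// key but also for each range tombstone start key emitted by the input
// iterator, so a wide tombstone can trigger a file cut on its own.
```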

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802

Test Plan:
- added unit tests in db_range_del_test.
- stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercent=21 --duration=36000 --range_deletion_width=1000000`

Reviewed By: ajkr, jay-zhuang

Differential Revision: D40308827

Pulled By: cbi42

fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df
2022-12-15 09:11:54 -08:00
Arvid Lunnemark 00238a386b replace sprintf with its safe version snprintf (v2) (#11011)
Summary:
same motivations as https://github.com/facebook/rocksdb/pull/5475, applied to the last remaining `sprintf`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11011

Reviewed By: pdillinger

Differential Revision: D41673500

Pulled By: ajkr

fbshipit-source-id: 88618ea791cafad86a9a491799c45979d46e3544
2022-12-12 10:39:53 -08:00
Peter Dillinger 433d7e4594 Improve error messages for SST footer and size errors (#11009)
Summary:
Previously, you could get a format_version error if the SST file size in the manifest was too small, or a weird "too short" error if it was too big. Now we ensure:
* Magic number error is reported first if we attempt to open an SST file and the footer is completely bad.
* Footer errors are reported with affected file.
* If the file size in the manifest doesn't match the actual file size, the error includes the expected and actual sizes (when an error is reported at all; in some cases we allow the file to be too big)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009

Test Plan:
unit tests added, some manual

Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed).

Reviewed By: siying

Differential Revision: D41656272

Pulled By: pdillinger

fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df
2022-12-09 10:03:47 -08:00
Changyu Bi 6cdb7af9f8 Remove copying of range tombstones keys in iterator (#10878)
Summary:
In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding an internal key comparator that directly compares a `Slice` holding an encoded internal key with a `ParsedInternalKey`.
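The gist can be sketched generically (a simplified illustration, not the comparator added by this PR): compare the encoded bytes against the parsed fields in place, instead of re-encoding the parsed key into a temporary buffer.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Simplified parsed form; the real ParsedInternalKey also carries a value type.
struct ParsedKey {
  std::string user_key;
  uint64_t sequence;  // larger sequence numbers sort first for equal user keys
};

// Compare an encoded internal key (user key bytes + 8-byte little-endian
// trailer holding (sequence << 8) | type) against a parsed key, without
// materializing a temporary encoded copy of the parsed key.
int CompareEncodedToParsed(const std::string& encoded, const ParsedKey& parsed) {
  const size_t ulen = encoded.size() - 8;
  const size_t common = std::min(ulen, parsed.user_key.size());
  int r = std::memcmp(encoded.data(), parsed.user_key.data(), common);
  if (r != 0) return r;
  if (ulen != parsed.user_key.size()) {
    return ulen < parsed.user_key.size() ? -1 : 1;
  }
  uint64_t trailer = 0;
  std::memcpy(&trailer, encoded.data() + ulen, 8);  // assumes little-endian host
  const uint64_t seq = trailer >> 8;
  if (seq == parsed.sequence) return 0;
  return seq > parsed.sequence ? -1 : 1;  // descending by sequence number
}
```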

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878

Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmarks: I did not see an improvement when compiling with DEBUG_LEVEL=0, and the results were noisy. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see an improvement.
```
# Favorable set up: half of the writes are DeleteRange.
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50

# benchmark command
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472  --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25

# main
readseq [AVG    5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec
readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec

# this PR
readseq [AVG    5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec
readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec
```

Reviewed By: ajkr

Differential Revision: D40711170

Pulled By: cbi42

fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f
2022-11-28 19:27:22 -08:00
Changyu Bi 534fb06dd3 Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966)
Summary:
Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966

Test Plan:
- added unit test
- stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercent=15 --duration=36000 --range_deletion_width=100`
- ran different stress tests over sandcastle
- The Falcon team ran some test traffic and saw reduced CPU usage for processing range tombstones.

Reviewed By: ajkr

Differential Revision: D41414172

Pulled By: cbi42

fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981
2022-11-23 14:27:14 -08:00
Peter Dillinger f321e8fc98 Don't attempt to use SecondaryCache on block_cache_compressed (#10944)
Summary:
Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!

This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.

Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.

This problem was discovered during refactoring, but I didn't want to combine the bug fix with that refactoring.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944

Test Plan: test added that fails on base revision (at least with ASAN)

Reviewed By: akankshamahajan15

Differential Revision: D41205578

Pulled By: pdillinger

fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
2022-11-11 17:35:53 -08:00
Levi Tamasi 2ea109521f Revisit the interface of MergeHelper::TimedFullMerge(WithEntity) (#10932)
Summary:
The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
overload, which makes the responsibilities clearer by making sure the query
result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
`SaveValue` and `GetContext`. Second, it changes the interface of
`TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
is a more decoupled design which will come in handy when adding support
for `Merge` with wide-column entities to `DBIter`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41129399

Pulled By: ltamasi

fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
2022-11-09 12:54:05 -08:00
Levi Tamasi fbd9077d66 Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925)
Summary:
The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41096453

Pulled By: ltamasi

fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
2022-11-07 15:42:10 -08:00
Levi Tamasi 941d834739 Support Merge for wide-column entities during point lookups (#10916)
Summary:
The patch adds `Merge` support for wide-column entities to the point lookup
APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
update the iterator and compaction logic in separate PRs.) In terms of semantics,
the `Merge` operation is applied to the default (anonymous) column; any other
columns in the entity are unaffected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40962311

Pulled By: ltamasi

fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
2022-11-03 08:35:42 -07:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Yanqin Jin 9079895aae Fix deletion counting in memtable stats (#10886)
Summary:
Currently, a memtable's stat `num_deletes_` is incremented only if the entry is a regular delete (kTypeDeletion). This PR fixes it by also accounting for kTypeSingleDeletion and kTypeDeletionWithTimestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10886

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40740754

Pulled By: riversand963

fbshipit-source-id: 7bde62cd6df136585bc5bfb1c426c7a8276c08e1
2022-10-28 17:03:44 -07:00
Jay Zhuang 36f5e19e33 Fix a Windows build error (#10897)
Summary:
The for loop is marked as unreachable code because it will never call the increment. Switch it to `if`.

```
\table\merging_iterator.cc(823): error C2220: the following warning is treated as an error
\table\merging_iterator.cc(823): warning C4702: unreachable code
\table\merging_iterator.cc(1030): error C2220: the following warning is treated as an error
\table\merging_iterator.cc(1030): warning C4702: unreachable code
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10897

Reviewed By: cbi42

Differential Revision: D40811790

Pulled By: jay-zhuang

fbshipit-source-id: fe8fd3e7cf3d6f710360c402b79763854d5120df
2022-10-28 14:24:48 -07:00
Levi Tamasi 7867a1112b Handle Merges correctly in GetEntity (#10894)
Summary:
The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet
supported for wide-column entities written using `PutEntity`; this change is
about returning correct (i.e. consistent with `Get`) results in cases like when the
base value is a plain old key-value written using `Put` or when there is no real base
value because we hit either a tombstone or the beginning of history.

Implementation-wise, the patch introduces a new wrapper around the existing
`MergeHelper::TimedFullMerge` that can store the merge result in either a string
(for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40782708

Pulled By: ltamasi

fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90
2022-10-28 10:48:51 -07:00
Changyu Bi 56715350d9 Reduce heap operations for range tombstone keys in iterator (#10877)
Summary:
Right now in MergingIterator, for each range tombstone start and end key, we pop one end from the heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely case that a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them have already been dropped. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases.
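For reference, here is a generic sketch of the `replace_top()` technique on a binary max-heap (an illustration of the idea, not RocksDB's heap implementation):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Replace the top element in place with a single sift-down, instead of a
// pop (one sift-down) followed by a push (one sift-up).
template <typename T, typename Less>
void ReplaceTop(std::vector<T>& heap, T new_value, Less less) {
  heap[0] = std::move(new_value);
  size_t i = 0;
  const size_t n = heap.size();
  for (;;) {
    const size_t l = 2 * i + 1, r = l + 1;
    size_t largest = i;
    if (l < n && less(heap[largest], heap[l])) largest = l;
    if (r < n && less(heap[largest], heap[r])) largest = r;
    if (largest == i) break;  // heap property restored
    std::swap(heap[i], heap[largest]);
    i = largest;
  }
}
// When the new key sorts near the old top (e.g. an adjacent range tombstone
// endpoint), the sift-down terminates after very few iterations.
```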

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877

Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmark:
```
# Set up: --writes_per_range_tombstone=1 means one point write and one delete range

TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64

Level Files Size(MB)
--------------------
  0        8      152
  1        0        0
  2        0        0
  3        0        0
  4        0        0
  5        0        0
  6        0        0

# Benchmark
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true

# Pre PR
readseq [AVG    5 runs] : 1432116 (± 59664) ops/sec;  224.0 (± 9.3) MB/sec
readseq [MEDIAN 5 runs] : 1454886 ops/sec;  227.5 MB/sec

# Post PR
readseq [AVG    5 runs] : 1944425 (± 29521) ops/sec;  304.1 (± 4.6) MB/sec
readseq [MEDIAN 5 runs] : 1959430 ops/sec;  306.5 MB/sec
```

Reviewed By: ajkr

Differential Revision: D40710936

Pulled By: cbi42

fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
2022-10-27 14:28:50 -07:00
sdong 3e686c7cbe sst_dump --command=raw to add index offset information (#10873)
Summary:
Add some extra information to the output of "sst_dump --command=raw" to help debug some issues. Right now, the encoded block handle is printed out. It is more useful to print the offset and size directly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10873

Test Plan: Manually run it against a file and check the output.

Reviewed By: anand1976

Differential Revision: D40742289

fbshipit-source-id: 04d7de26e7f27e1595a7cc3ac1c1082e4e835b93
2022-10-27 11:56:09 -07:00
anand76 727bad78b8 Format files under table/ by clang-format (#10852)
Summary:
Run clang-format on files under the `table` directory.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10852

Reviewed By: ajkr

Differential Revision: D40650732

Pulled By: anand1976

fbshipit-source-id: 2023a958e37fd6274040c5181130284600c9e0ef
2022-10-25 11:50:38 -07:00
Changyu Bi 7a95938899 Improve FragmentTombstones() speed by lazily initializing `seq_set_` (#10848)
Summary:
FragmentedRangeTombstoneList has a member variable `seq_set_` that contains the sequence numbers of all range tombstones in a set. The set is constructed in `FragmentTombstones()` and is used only in `FragmentedRangeTombstoneList::ContainsRange()` which only happens during compaction. This PR moves the initialization of `seq_set_` to `FragmentedRangeTombstoneList::ContainsRange()`. This should speed up `FragmentTombstones()` when the range tombstone list is used for read/scan requests. Microbench shows the speed improvement to be ~45%.
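The underlying pattern is plain lazy initialization; a minimal, hypothetical sketch (not the actual FragmentedRangeTombstoneList code) might look like this:

```cpp
#include <cstdint>
#include <mutex>
#include <set>
#include <vector>

// Build the sequence-number set only on first use, so read/scan paths that
// never call the containment check skip the construction cost entirely.
class TombstoneSeqIndex {
 public:
  explicit TombstoneSeqIndex(std::vector<uint64_t> seqs)
      : seqs_(std::move(seqs)) {}

  bool ContainsSeq(uint64_t seq) const {
    std::call_once(once_, [this] {  // the first caller pays the one-time cost
      seq_set_.insert(seqs_.begin(), seqs_.end());
    });
    return seq_set_.count(seq) > 0;
  }

 private:
  std::vector<uint64_t> seqs_;
  mutable std::once_flag once_;
  mutable std::set<uint64_t> seq_set_;
};
```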

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10848

Test Plan:
- Existing tests and stress test: `python3 tools/db_crashtest.py whitebox --simple  --verify_iterator_with_expected_state_one_in=5`.
- Microbench: update `range_del_aggregator_bench` to benchmark speed of `FragmentTombstones()`:
```
./range_del_aggregator_bench --num_range_tombstones=1000 --tombstone_start_upper_bound=50000000 --num_runs=10000 --tombstone_width_mean=200 --should_deletes_per_run=100 --use_compaction_range_del_aggregator=true

Before this PR:
=========================
Fragment Tombstones:     270.286 us
AddTombstones:           1.28933 us
ShouldDelete (first):    0.525528 us
ShouldDelete (rest):     0.0797519 us

After this PR: time to fragment tombstones is pushed to AddTombstones() which only happen during compaction.
=========================
Fragment Tombstones:     149.879 us
AddTombstones:           102.131 us
ShouldDelete (first):    0.565871 us
ShouldDelete (rest):     0.0729444 us
```
- db_bench: this should improve speed for fragmenting range tombstones for mutable memtable:
```
./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=250000 --disable_auto_compactions --max_num_range_tombstones=100000 --finish_after_writes --write_buffer_size=1073741824 --threads=25

Before this PR:
readwhilewriting :      18.301 micros/op 1310445 ops/sec 4.769 seconds 6250000 operations;   28.1 MB/s (41001 of 250000 found)
After this PR:
readwhilewriting :      16.943 micros/op 1439376 ops/sec 4.342 seconds 6250000 operations;   23.8 MB/s (28977 of 250000 found)
```

Reviewed By: ajkr

Differential Revision: D40646227

Pulled By: cbi42

fbshipit-source-id: ea471667edb258f67d01cfd828588e80a89e4083
2022-10-25 11:33:04 -07:00
Changyu Bi deb6a24be2 Remove range tombstone test code from sst_file_reader (#10847)
Summary:
`#include "db/range_tombstone_fragmenter.h"` seems to break some internal test for 7.8 release. I'm removing it from sst_file_reader.h for now to unblock release. This should be fine as it is only used in a unit test for DeleteRange with timestamp. In addition, it does not seem to be useful to support delete range for sst file writer, since the range tombstone won't cover any key (its sequence number is 0). So maybe we can remove it in the future.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10847

Test Plan: CI.

Reviewed By: akankshamahajan15

Differential Revision: D40620865

Pulled By: cbi42

fbshipit-source-id: be44b2f31e062bff87ed1b8d94482c3f7eaa370c
2022-10-23 20:17:14 -07:00
Peter Dillinger 27c9705ac4 Use kXXH3 as default checksum (CPU efficiency) (#10778)
Summary:
Since this has been supported for about a year, I think it's time to make it the default. This should improve CPU efficiency slightly on most hardware.

A current DB performance comparison using buck+clang build:
```
TEST_TMPDIR=/dev/shm ./db_bench -checksum_type={1,4} -benchmarks=fillseq[-X1000] -num=3000000 -disable_wal
```
kXXH3 (+0.2% DB write throughput):
`fillseq [AVG    1000 runs] : 822149 (± 1004) ops/sec;   91.0 (± 0.1) MB/sec`
kCRC32c:
`fillseq [AVG    1000 runs] : 820484 (± 1203) ops/sec;   90.8 (± 0.1) MB/sec`

Micro benchmark comparison:
```
./db_bench --benchmarks=xxh3[-X20],crc32c[-X20]
```
Machine 1, buck+clang build:
`xxh3 [AVG    20 runs] : 3358616 (± 19091) ops/sec; 13119.6 (± 74.6) MB/sec`
`crc32c [AVG    20 runs] : 2578725 (± 7742) ops/sec; 10073.1 (± 30.2) MB/sec`

Machine 2, make+gcc build, DEBUG_LEVEL=0 PORTABLE=0:
`xxh3 [AVG    20 runs] : 6182084 (± 137223) ops/sec; 24148.8 (± 536.0) MB/sec`
`crc32c [AVG    20 runs] : 5032465 (± 42454) ops/sec; 19658.1 (± 165.8) MB/sec`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10778

Test Plan: make check, unit tests updated

Reviewed By: ajkr

Differential Revision: D40112510

Pulled By: pdillinger

fbshipit-source-id: e59a8d50a60346137732f8668ba7cfac93be2b37
2022-10-21 18:09:12 -07:00
akankshamahajan 0e7b27bfcf Refactor block cache tracing APIs (#10811)
Summary:
Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter.

The existing `DB::StartBlockTrace` now internally redirects to the updated `BlockCacheTrace::StartBlockCacheTrace`.
A new `DB::StartBlockTrace` overload is also added that directly takes a `BlockCacheTraceWriter` pointer.

This same philosophy can be applied to KV and IO tracing as well.
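A sketch of the abstraction described above, with simplified names and fields rather than the exact RocksDB declarations:

```cpp
#include <cstdint>
#include <iostream>
#include <string>

struct BlockCacheTraceRecord {  // structured, not pre-flattened
  uint64_t timestamp_micros;
  std::string block_key;
  bool is_hit;
};

class BlockCacheTraceWriter {
 public:
  virtual ~BlockCacheTraceWriter() = default;
  // Each implementation decides how to format and persist the record.
  virtual void WriteBlockAccess(const BlockCacheTraceRecord& rec) = 0;
};

// Example implementation that logs to stderr instead of a trace file.
class StderrTraceWriter : public BlockCacheTraceWriter {
 public:
  void WriteBlockAccess(const BlockCacheTraceRecord& rec) override {
    std::cerr << rec.timestamp_micros << ' ' << rec.block_key << ' '
              << (rec.is_hit ? "hit" : "miss") << '\n';
  }
};
```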

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811

Test Plan:
existing unit tests
Old API DB::StartBlockTrace checked with db_bench tool
create database
```
./db_bench --benchmarks="fillseq" \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000
```

To trace block cache accesses when running readrandom benchmark:
```
./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
--threads=16 \
-block_cache_trace_file="/tmp/binary_trace_test_example" \
-block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
-block_cache_trace_sampling_frequency=1

```

Reviewed By: anand1976

Differential Revision: D40435289

Pulled By: akankshamahajan15

fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282
2022-10-21 12:15:35 -07:00
Peter Dillinger 7555243bcf Refactor ShardedCache for more sharing, static polymorphism (#10801)
Summary:
The motivations for this change include
* Free up space in ClockHandle so that we can add data for secondary cache handling while still keeping within single cache line (64 byte) size.
  * This change frees up space by eliminating the need for the `hash` field: the fixed-size key itself is made a hash, using a 128-bit bijective (lossless) hash.
* Generally more customizability of ShardedCache (such as hashing) without worrying about virtual call overheads
  * ShardedCache now uses static polymorphism (template) instead of dynamic polymorphism (virtual overrides) for the CacheShard. No obvious performance benefit is seen from the change (as mostly expected; most calls to virtual functions in CacheShard could already be optimized to static calls), but offers more flexibility without incurring the runtime cost of adhering to a common interface (without type parameters or static callbacks).
  * You'll also notice less `reinterpret_cast`ing and other boilerplate in the Cache implementations, as this can go in ShardedCache.

More detail:
* Don't have LRUCacheShard maintain `std::shared_ptr<SecondaryCache>` copies (extra refcount) when LRUCache can be in charge of keeping a `shared_ptr`.
* Renamed `capacity_mutex_` to `config_mutex_` to better represent the scope of what it guards.
* Some preparation for 64-bit hash and indexing in LRUCache, but didn't include the full change because of slight performance regression.
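The static-polymorphism shape can be sketched generically (simplified; not the actual RocksDB templates): the shard type is a template parameter, so shard calls resolve at compile time with no virtual dispatch and no common base class.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

template <class ShardT>
class ShardedCache {
 public:
  explicit ShardedCache(size_t num_shards) : shards_(num_shards) {}

  bool Insert(const std::string& key, int value) {
    return GetShard(key).Insert(key, value);  // static call, inlinable
  }

 private:
  ShardT& GetShard(const std::string& key) {
    return shards_[std::hash<std::string>{}(key) % shards_.size()];
  }
  std::vector<ShardT> shards_;
};

// A shard only needs to provide the expected members; no base class required.
struct ToyShard {
  bool Insert(const std::string&, int) { return true; }
};

// Usage: ShardedCache<ToyShard> cache(16);
```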

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10801

Test Plan:
Unit test updates were non-trivial because of major changes to the ClockCacheShard interface in handling of key vs. hash.

Performance:
Create with `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16`

Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X1000] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=16
```

Before: `readrandom [AVG 150 runs] : 321147 (± 253) ops/sec`
After: `readrandom [AVG 150 runs] : 321530 (± 326) ops/sec`

So possibly ~0.1% improvement.

And with `-cache_type=hyper_clock_cache`:
Before: `readrandom [AVG 30 runs] : 614126 (± 7978) ops/sec`
After: `readrandom [AVG 30 runs] : 645349 (± 8087) ops/sec`

So roughly 5% improvement!

Reviewed By: anand1976

Differential Revision: D40252236

Pulled By: pdillinger

fbshipit-source-id: ff8fc70ef569585edc95bcbaaa0386f61355ae5b
2022-10-18 22:06:57 -07:00
Peter Dillinger e466173d5c Print stack traces on frozen tests in CI (#10828)
Summary:
Instead of the existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit-test-like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.

For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828

Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI

Reviewed By: hx235

Differential Revision: D40447634

Pulled By: pdillinger

fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
2022-10-18 00:35:35 -07:00
Changyu Bi 9f2363f4c4 User-defined timestamp support for `DeleteRange()` (#10661)
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`; a usage sketch follows the list below. Most of the change is to update the comparator to compare without timestamp. Other than that, the major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.
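A usage sketch based on the API shape above (the column family handle `cf` and encoded timestamps `ts1 < ts2` are assumptions for illustration):

```cpp
// assume ts1 < ts2, and that `cf` is a ColumnFamilyHandle*.
db->Put(WriteOptions(), cf, "key1", ts1, "v1");
db->DeleteRange(WriteOptions(), cf, "key1", "key9", ts2);

ReadOptions ro;
ro.timestamp = &ts2;
std::string value;
Status s = db->Get(ro, cf, "key1", &value);
// expected: s.IsNotFound(), as "key1" is covered by the range tombstone at ts2.
```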

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661

Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with the DB fitting in memory: `./db_bench --benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in the no-timestamp case.

| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|

Reviewed By: riversand963

Differential Revision: D39441192

Pulled By: cbi42

fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
2022-09-30 16:13:03 -07:00
Jay Zhuang f3cc66632b Align compaction output file boundaries to the next level ones (#10655)
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.

In level compaction, there is "wasted" data at the beginning and end of the
output level files. Aligning the file boundaries can avoid such "wasted" compaction.
With this PR, RocksDB tries to align the non-bottommost level file boundaries to
those of the next level. It may cut a file when the file size is large enough (at
least 50% of target_file_size) and not too large (at most 2x target_file_size).

db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432

# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds

# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```

The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR the SST file size can exceed
target_file_size, but it is capped at 2x target_file_size, and there will be
some smaller files. Here are file size statistics when loading 100GB with a
target file size of 32MB:
```
          baseline      this_PR
count  1.656000e+03  1.705000e+03
mean   3.116062e+07  3.028076e+07
std    7.145242e+06  8.046139e+06
```

The feature is enabled by default; to revert to the old behavior, disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`

Also includes https://github.com/facebook/rocksdb/issues/1963 to cut files before a skippable grandparent file.
This helps use cases where a user adds two or more non-overlapping data ranges at
the same time; it can reduce the overlap between the two datasets in the lower levels.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655

Reviewed By: cbi42

Differential Revision: D39552321

Pulled By: jay-zhuang

fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
2022-09-29 19:43:55 -07:00
Yanqin Jin 07249fea8f Fix DBImpl::GetLatestSequenceForKey() for Merge (#10724)
Summary:
Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during the optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724

Test Plan: make check

Reviewed By: cbi42

Differential Revision: D39756847

Pulled By: riversand963

fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876
2022-09-23 17:29:05 -07:00
Peter Dillinger ef443cead4 Refactor to avoid confusing "raw block" (#10408)
Summary:
We have a lot of confusing code because of mixed, sometimes
completely opposite uses of the term "raw block" or "raw contents",
sometimes within the same source file. For example, in `BlockBasedTableBuilder`,
`raw_block_contents` and `raw_size` generally referred to uncompressed block
contents and size, while `WriteRawBlock` referred to writing a block that
is already compressed if it is going to be. Meanwhile, in
`BlockBasedTable`, `raw_block_contents` either referred to a (maybe
compressed) block with trailer, or a maybe compressed block maybe
without trailer. (Note: left as follow-up work to use C++ typing to
better sort out the various kinds of BlockContents.)

This change primarily tries to apply some consistent terminology around
the kinds of block representations, avoiding the unclear "raw". (Any
meaning of "raw" assumes some bias toward the storage layer or toward
the logical data layer.) Preferred terminology:

* **Serialized block** - bytes that go into storage. For block-based table
(usually the case) this includes the block trailer. WART: block `size` may or
may not include the trailer; need to be clear about whether it does or not.
* **Maybe compressed block** - like a serialized block, but without the
trailer (or no promise of including a trailer). Must be accompanied by a
CompressionType.
* **Uncompressed block** - "payload" bytes that are either stored with no
compression, used as input to compression function, or result of
decompression function.
* **Parsed block** - an in-memory form of a block in block cache, as it is
used by the table reader. Different C++ types are used depending on the
block type (see block_like_traits.h).

Other refactorings:
* Misc corrections/improvements of internal API comments
* Remove a few misleading / unhelpful / redundant comments.
* Use move semantics in some places to simplify contracts
* Use better parameter names to indicate which parameters are used for
outputs
* Remove some extraneous `extern`
* Various clean-ups to `CacheDumperImpl` (mostly unnecessary code)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408

Test Plan: existing tests

Reviewed By: akankshamahajan15

Differential Revision: D38172617

Pulled By: pdillinger

fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c
2022-09-22 11:25:32 -07:00
anand76 fb9a025892 Fix platform 10 build with folly (#10708)
Summary:
Change the library order in PLATFORM_LDFLAGS to enable fbcode platform 10 build with folly. This PR also has a few fixes for platform 10 compiler errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10708

Test Plan:
ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_COROUTINES=1 make -j64 check
ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_FOLLY=1 make -j64 check

Reviewed By: ajkr

Differential Revision: D39666590

Pulled By: anand1976

fbshipit-source-id: 256a1127ef561399cd6299a6a392ca29bd68ca44
2022-09-21 14:43:44 -07:00
Changyu Bi 749b849a34 Fix memtable-only iterator regression (#10705)
Summary:
When there is a single memtable without range tombstones and no SST files in the database, DBIter should wrap the memtable iterator directly. Currently we create a merging iterator on top of the memtable iterator and have DBIter wrap around it. This causes an iterator performance regression, which this PR fixes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10705

Test Plan:
- `make check`
- Performance:
  - Set up: `./db_bench -benchmarks=filluniquerandom -write_buffer_size=$((1 << 30)) -num=10000`
  - Benchmark: `./db_bench -benchmarks=seekrandom -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=$((1 << 30)) -num=10000 -threads=16 -duration=60 -seek_nexts=$seek_nexts`
```
seek_nexts    main op/sec    https://github.com/facebook/rocksdb/issues/10705      RocksDB v7.6
0             5746568        5749033     5786180
30            2411690        3006466     2837699
1000          102556         128902      124667
```

Reviewed By: ajkr

Differential Revision: D39644221

Pulled By: cbi42

fbshipit-source-id: 8063ff611ba31b0e5670041da3927c8c54b2097d
2022-09-21 09:49:31 -07:00
Akanksha Mahajan bd2ad2f9a0 Fix stress test failure for async_io (#10660)
Summary:
Sanitize initial_auto_readahead_size if it is greater than max_auto_readahead_size in the case of async_io

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10660

Test Plan: Ran db_stress with initial_auto_readahead_size greater than max_auto_readahead_size.

Reviewed By: anand1976

Differential Revision: D39408095

Pulled By: akankshamahajan15

fbshipit-source-id: 07f933242f636cfbc7ccf042e0c8b959a8ec5f3a
2022-09-12 14:48:06 -07:00
Peter Dillinger 6de7081cf3 Always verify SST unique IDs on SST file open (#10532)
Summary:
Although we've been tracking SST unique IDs in the DB manifest
unconditionally, checking has been opt-in and with an extra pass at DB::Open
time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
check unique ID against manifest every time an SST file is opened through
table cache (normal DB operations), replacing the explicit pass over files
at DB::Open time. This change also enables the option by default and
removes the "EXPERIMENTAL" designation.

One possible criticism is that the option no longer ensures the integrity
of a DB at Open time. This is far from an all-or-nothing issue. Verifying
the IDs of all SST files hardly ensures all the data in the DB is readable.
(VerifyChecksum is supposed to do that.) Also, with
max_open_files=-1 (default, extremely common), all SST files are
opened at DB::Open time anyway.

Implementation details:
* `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
that is now removed.
* Unit tests that manipulate/corrupt table properties have to opt out of
this check, because that corrupts the "actual" unique id. (And even for
testing we don't currently have a mechanism to set "no unique id"
in the in-memory file metadata for new files.)
* A lot of other unit test churn relates to (a) default checking on, and
(b) checking on SST open even without DB::Open (e.g. on flush)
* Use `FileMetaData` for more `TableCache` operations (in place of
`FileDescriptor`) so that we have access to the unique_id whenever
we might need to open an SST file. **There is the possibility of
performance impact because we can no longer use the more
localized `fd` part of an `FdWithKeyRange` but instead follow the
`file_metadata` pointer. However, this change (possible regression)
is only done for `GetMemoryUsageByTableReaders`.**
* Removed a completely unnecessary constructor overload of
`TableReaderOptions`

Possible follow-up:
* Verification only happens when opening through table cache. Are there
more places where this should happen?
* Improve error message when there is a file size mismatch vs. manifest
(FIXME added in the appropriate place).
* I'm not sure there's a justification for `FileDescriptor` to be distinct from
`FileMetaData`.
* I'm skeptical that `FdWithKeyRange` really still makes sense for
optimizing some data locality by duplicating some data in memory, but I
could be wrong.
* An unnecessary overload of NewTableReader was recently added, in
the public API nonetheless (though unusable there). It should be cleaned
up to put most things under `TableReaderOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532

Test Plan:
updated unit tests

Performance test showing no significant difference (just noise I think):
`./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec

Reviewed By: jay-zhuang

Differential Revision: D38765551

Pulled By: pdillinger

fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
2022-09-07 22:52:42 -07:00
Changyu Bi 30bc495c03 Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449)
Summary:
Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and checking with `RangeDelAggregator`.

With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in the iterator:
- in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
- in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.

This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.

One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for a range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap are covered by the range tombstone's internal key range `[start, end)`.

Another major change, the *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file were iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advances the file iterator once the sentinel key is processed.
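To summarize the cascading seek in code, here is a greatly simplified sketch of the control flow (an illustration only, not the `SeekImpl()` implementation):

```cpp
#include <string>
#include <vector>

struct Tombstone {
  std::string start, end;  // covers [start, end)
};

struct LevelIter {
  std::vector<Tombstone> tombstones;        // tombstones owned by this level
  void Seek(const std::string& /*key*/) {}  // position at first key >= key
};

// Levels are ordered newest to oldest; a tombstone [start, end) at level L
// covers that key range in all older levels.
void CascadingSeek(std::vector<LevelIter>& levels, const std::string& target) {
  std::string seek_key = target;
  for (LevelIter& level : levels) {
    level.Seek(seek_key);
    for (const Tombstone& t : level.tombstones) {
      if (t.start <= seek_key && seek_key < t.end) {
        // Covered at this level: every older level may skip straight to end.
        seek_key = t.end;
      }
    }
  }
}
```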

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449

Test Plan:
- Added many unit tests in db_range_del_test
- Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
- Additional iterator stress test is added to verify against iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.

```
python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
```

- Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written".  As expected, the performance with this PR depends on the range tombstone width.
```
# Setup:
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50

# Scan entire DB
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true

# Short range scan (10 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true

# Long range scan(1000 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
```
Avg over of 10 runs (some slower tests had fews runs):

For the first column (tombstone), 0 means no range tombstone, 100-10000 means the width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The 1 tombstone case is to test for regression when there are very few range tombstones in the DB, since the no-range-tombstone case is likely to take a different code path than the case with range tombstones.

- Scan entire DB

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- |  ------------- |
| 0 range tombstone    |2525600 (± 43564)    |2486917 (± 33698)    |-1.53%               |
| 100   |1853835 (± 24736)    |2073884 (± 32176)    |+11.87%              |
| 1000  |422415 (± 7466)      |1115801 (± 22781)    |+164.15%             |
| 10000 |22384 (± 227)        |227919 (± 6647)      |+918.22%             |
| 1 range tombstone      |2176540 (± 39050)    |2434954 (± 24563)    |+11.87%              |
- Short range scan

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- |  ------------- |
| 0  range tombstone   |35398 (± 533)        |35338 (± 569)        |-0.17%               |
| 100   |28276 (± 664)        |31684 (± 331)        |+12.05%              |
| 1000  |7637 (± 77)          |25422 (± 277)        |+232.88%             |
| 10000 |1367                 |28667                |+1997.07%            |
| 1 range tombstone      |32618 (± 581)        |32748 (± 506)        |+0.4%                |

- Long range scan

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- |  ------------- |
| 0 range tombstone     |2262 (± 33)          |2353 (± 20)          |+4.02%               |
| 100   |1696 (± 26)          |1926 (± 18)          |+13.56%              |
| 1000  |410 (± 6)            |1255 (± 29)          |+206.1%              |
| 10000 |25                   |414                  |+1556.0%             |
| 1 range tombstone   |1957 (± 30)          |2185 (± 44)          |+11.65%              |

- Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61

Reviewed By: ajkr

Differential Revision: D38450331

Pulled By: cbi42

fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
2022-09-02 09:51:19 -07:00
Akanksha Mahajan 4cd16d65ae Add new option num_file_reads_for_auto_readahead in BlockBasedTableOptions (#10556)
Summary:
RocksDB does auto-readahead for iterators upon noticing more
than two reads for a table file if the user doesn't provide readahead_size and the reads are sequential.
A new option num_file_reads_for_auto_readahead is added which can be
configured and indicates after how many sequential reads prefetching should
start.
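A configuration sketch using the new option (the value is illustrative):

```cpp
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  // Start prefetching only after this many sequential reads on a table file.
  table_options.num_file_reads_for_auto_readahead = 4;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```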

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10556

Test Plan: Existing and new unit test

Reviewed By: anand1976

Differential Revision: D38947147

Pulled By: akankshamahajan15

fbshipit-source-id: c9eeab495f84a8df7f701c42f04894e46440ad97
2022-09-01 11:56:00 -07:00
anand76 35cdd3e71e MultiGet async IO across multiple levels (#10535)
Summary:
This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535

Test Plan:
1. Ensure existing MultiGet unit tests pass, updating them as necessary
2. New unit tests - TODO
3. Run stress test - TODO

No noticeable regression (<1%) without async IO -
Without PR: `multireadrandom :       7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations;  571.6 MB/s (8168992 of 8168992 found)`
With PR: `multireadrandom :       7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations;  568.2 MB/s (8271992 of 8271992 found)`

For a fully cached DB, but with async IO option on, no regression observed (<1%) -
Without PR: `multireadrandom :       5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations;  797.9 MB/s (11540992 of 11540992 found) `
With PR: `multireadrandom :       5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations;  790.7 MB/s (11649992 of 11649992 found) `

Reviewed By: akankshamahajan15

Differential Revision: D38774009

Pulled By: anand1976

fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
2022-08-19 16:52:52 -07:00
Levi Tamasi 81388b36e0 Add support for wide-column point lookups (#10540)
Summary:
The patch adds a new API `GetEntity` that can be used to perform
wide-column point lookups. It also extends the `Get` code path and
the `MemTable` / `MemTableList` and `Version` / `GetContext` logic
accordingly so that wide-column entities can be served from both
memtables and SSTs. If the result of a lookup is a wide-column entity
(`kTypeWideColumnEntity`), it is passed to the application in deserialized
form; if it is a plain old key-value (`kTypeValue`), it is presented as a
wide-column entity with a single default (anonymous) column.
(In contrast, regular `Get` returns plain old key-values as-is, and
returns the value of the default column for wide-column entities, see
https://github.com/facebook/rocksdb/issues/10483 .)

The result of `GetEntity` is a self-contained `PinnableWideColumns` object.
`PinnableWideColumns` contains a `PinnableSlice`, which either stores the
underlying data in its own buffer or holds on to a cache handle. It also contains
a `WideColumns` instance, which indexes the contents of the `PinnableSlice`,
so applications can access the values of columns efficiently.

There are several pieces of functionality which are currently not supported
for wide-column entities: there is currently no `MultiGetEntity` or wide-column
iterator; also, `Merge` and `GetMergeOperands` are not supported, and there
is no `GetEntity` implementation for read-only and secondary instances.
We plan to implement these in future PRs.
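
A minimal usage sketch of the new API (error handling elided):
```
#include <iostream>
#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

void PrintEntity(rocksdb::DB* db) {
  rocksdb::PinnableWideColumns result;
  rocksdb::Status s = db->GetEntity(
      rocksdb::ReadOptions(), db->DefaultColumnFamily(), "key1", &result);
  if (!s.ok()) {
    return;
  }
  for (const rocksdb::WideColumn& column : result.columns()) {
    // A plain old key-value shows up as a single default (anonymous) column.
    std::cout << column.name().ToString() << ": "
              << column.value().ToString() << '\n';
  }
}
```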

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D38847474

Pulled By: ltamasi

fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b
2022-08-19 11:51:12 -07:00
anand76 2553d1efa1 Revert "Avoid dynamic memory allocation on read path (#10453)" (#10541)
Summary:
This reverts commit 0d885e80d4. The original commit causes an ASAN stack-use-after-return failure because the `CreateCallback` is allocated on the stack and then used in another thread when a secondary cache object is promoted to the primary cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10541

Reviewed By: gitbw95

Differential Revision: D38850039

Pulled By: anand1976

fbshipit-source-id: 810c592b7de2523693f5bb267159b23b0ee9132c
2022-08-19 11:02:54 -07:00
Gang Liao 275cd80cdb Add a blob-specific cache priority (#10461)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
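
A hedged sketch of how a shared cache could be carved into the three priority pools (`low_pri_pool_ratio` is the knob this line of work extends; the values are illustrative):
```
#include "rocksdb/cache.h"

std::shared_ptr<rocksdb::Cache> MakeSharedCache() {
  rocksdb::LRUCacheOptions cache_opts;
  cache_opts.capacity = 1 << 30;         // shared by blocks and blobs
  cache_opts.high_pri_pool_ratio = 0.2;  // index/filter blocks
  cache_opts.low_pri_pool_ratio = 0.6;   // data blocks; the remainder forms
                                         // the bottom pool where blobs land
  return rocksdb::NewLRUCache(cache_opts);
}
```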

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461

Reviewed By: siying

Differential Revision: D38672823

Pulled By: ltamasi

fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
2022-08-12 17:59:06 -07:00
Peter Dillinger 86a1e3e0e7 Derive cache keys from SST unique IDs (#10394)
Summary:
... so that cache keys can be derived from DB manifest data
before reading the file from storage--so that every part of the file
can potentially go in a persistent cache.

See updated comments in cache_key.cc for technical details. Importantly,
the new cache key encoding uses some fancy but efficient math to pack
data into the cache key without depending on the sizes of the various
pieces. This simplifies some existing code creating cache keys, like
cache warming before the file size is known.

This should provide us an essentially permanent mapping between SST
unique IDs and base cache keys, with the ability to "upgrade" SST
unique IDs (and thus cache keys) with new SST format_versions.

These cache keys are of similar, perhaps indistinguishable quality to
the previous generation. Before this change (see "corrected" days
between collision):

```
./cache_bench -stress_cache_key -sck_keep_bits=43
18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
```

After this change (keep 43 bits, up through 50, to validate "trajectory"
is ok on "corrected" days between collision):
```
19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
```

However, the change does modify (probably weaken) the "guaranteed unique" promise from this

> SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86

to this (see https://github.com/facebook/rocksdb/issues/10388)

> With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits

I don't think this is a practical concern, though.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394

Test Plan: unit tests updated, see simulation results above

Reviewed By: jay-zhuang

Differential Revision: D38667529

Pulled By: pdillinger

fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
2022-08-12 13:49:49 -07:00
sdong 9277569ba3 Add some missing headers (#10519)
Summary:
Some files are missing headers and some headers are irregular. Fix them to make an internal checkup tool happy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519

Reviewed By: jay-zhuang

Differential Revision: D38603291

fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1
2022-08-11 12:45:50 -07:00
sdong 911c0208b9 WritableFileWriter tries to skip operations after failure (#10489)
Summary:
A flag in WritableFileWriter is introduced to remember that an error has happened. Subsequent operations will fail with an assertion. Those operations, except Close(), are not supposed to be called anyway. This change will help catch bugs in tests and stress tests and limit the damage of a potential bug of continuing to write to a file after a failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489

Test Plan: Fix existing unit tests and watch crash tests for a while.

Reviewed By: anand1976

Differential Revision: D38473277

fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366
2022-08-10 10:19:20 -07:00
Levi Tamasi 24bcab7d5d Make queries return the value of the default column for wide-column entities (#10483)
Summary:
The patch adds support for wide-column entities to the existing query
APIs (`Get`, `MultiGet`, and iterator). Namely, when during a query a
wide-column entity is encountered, we will return the value of the default
(anonymous) column as the result. Later, we plan to add wide-column
specific query APIs which will enable retrieving entire wide-column entities
or a subset of their columns.
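
For illustration, a sketch of the resulting behavior for a point lookup:
```
#include <string>
#include "rocksdb/db.h"

void DefaultColumnLookup(rocksdb::DB* db) {
  std::string value;
  // If "key1" stores a wide-column entity, value receives the entity's
  // default (anonymous) column instead of a NotSupported status.
  rocksdb::Status s = db->Get(rocksdb::ReadOptions(),
                              db->DefaultColumnFamily(), "key1", &value);
  (void)s;
}
```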

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10483

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D38441881

Pulled By: ltamasi

fbshipit-source-id: 6444e79a31aff2470e866698e3a97985bc2b3543
2022-08-08 16:10:08 -07:00
Jay Zhuang 0d885e80d4 Avoid dynamic memory allocation on read path (#10453)
Summary:
A lambda function dynamically allocates memory from the heap if it needs to
capture multiple values, which can be expensive.
Switch to explicitly using a local functor on the stack.
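
The pattern, in a hypothetical standalone sketch (not the actual RocksDB code): captures held by a `std::function` may end up on the heap, while an explicit functor keeps the same state on the stack.
```
#include <functional>
#include <string>

// Explicit functor: captured state is plain members, no heap allocation.
struct CreateValue {
  const std::string& prefix;
  int multiplier;
  std::string operator()(int id) const {
    return prefix + std::to_string(id * multiplier);
  }
};

int main() {
  std::string prefix = "block-";
  // Before: a std::function may heap-allocate to store multiple captures.
  std::function<std::string(int)> fn = [&prefix](int id) {
    return prefix + std::to_string(id * 3);
  };
  // After: the functor lives entirely on the stack.
  CreateValue create{prefix, 3};
  return fn(7) == create(7) ? 0 : 1;
}
```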

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10453

Test Plan:
CI
db_bench shows ~2-3% read improvement:
```
# before the change
TEST_TMPDIR=/tmp/dbbench4 ./db_bench_main --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000
readrandom   :       8.528 micros/op 117265 ops/sec 85.277 seconds 10000000 operations;   13.0 MB/s (10000000 of 10000000 found)
# after the change
TEST_TMPDIR=/tmp/dbbench5 ./db_bench_new --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000
readrandom   :       8.263 micros/op 121015 ops/sec 82.634 seconds 10000000 operations;   13.4 MB/s (10000000 of 10000000 found)
```
details: https://gist.github.com/jay-zhuang/5ac0628db8fc9cbcb499e056d4cb5918

Micro-benchmark shows a similar improvement ~1-2%:
before the change:
https://gist.github.com/jay-zhuang/9dc0ebf51bbfbf4af82f6193d43cf75b
after the change:
https://gist.github.com/jay-zhuang/fc061f1813cd8f441109ad0b0fe7c185

Reviewed By: ajkr

Differential Revision: D38345056

Pulled By: jay-zhuang

fbshipit-source-id: f3597aeeee338a804d37bf2e81386d5a100665e0
2022-08-08 12:59:31 -07:00
Changyu Bi 9d77bf8f7b Fragment memtable range tombstone in the write path (#10380)
Summary:
- Right now each read fragments the memtable range tombstones (see https://github.com/facebook/rocksdb/issues/4808). This PR explores the idea of fragmenting memtable range tombstones in the write path so that reads can use the cached fragmented tombstones without any fragmenting cost. This PR only does the caching for immutable memtables, and does so right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there is any.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380

Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes the immutable memtable path, the number of writes in the benchmark is chosen such that an immutable memtable is created and range tombstones are in that memtable.

```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100

multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16464, an earlier attempt where tombstones are fragmented on each write operation, is included in the benchmark results. There, reader threads share the fragmented tombstones via a shared_ptr, which slows down multi-threaded read performance, as seen below.
Results are averaged over 5 runs.

Single thread result:
| Max # tombstones  | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op |  99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
| 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
| 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
| 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
| 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |

32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones  | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op |  99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
| 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
| 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
| 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |

Reviewed By: ajkr

Differential Revision: D37916564

Pulled By: cbi42

fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
2022-08-05 12:02:33 -07:00
anand76 bf4532eb5c Break TableReader MultiGet into filter and lookup stages (#10432)
Summary:
This PR is the first step in enhancing the coroutines MultiGet to be able to lookup a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432

Reviewed By: akankshamahajan15

Differential Revision: D38245910

Pulled By: anand1976

fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de
2022-08-04 12:51:57 -07:00
anand76 54aebb2cc5 Fix cache metrics update when secondary cache is used (#10440)
Summary:
If a secondary cache is configured, it's possible that a cache lookup will get a hit in the secondary cache. In that case, ```LRUCacheShard::Lookup``` doesn't immediately update the ```total_charge``` for the item handle if the ```wait``` parameter is false (i.e. the caller will check completeness later). However, ```BlockBasedTable::GetEntryFromCache``` assumes the handle is complete and calls ```UpdateCacheHitMetrics```, which checks the usage of the cache item and fails the assert in https://github.com/facebook/rocksdb/blob/main/cache/lru_cache.h#L237 (```assert(total_charge >= meta_charge)```).

To fix this, we call ```UpdateCacheHitMetrics``` later in ```MultiGet```, after waiting for all cache lookup completions.

Test plan -
Run crash test with changes from https://github.com/facebook/rocksdb/issues/10160

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10440

Reviewed By: gitbw95

Differential Revision: D38283968

Pulled By: anand1976

fbshipit-source-id: 31c54ef43517726c6e5fdda81899b364241dd7e1
2022-07-29 14:24:44 -07:00
Bo Wang 1aab5b32ad Update passing rate_limiter_priority for a PartitionedFilterBlockReader function to FS (#10438)
Summary:
Add param rate_limiter_priority in PartitionedFilterBlockReader::GetFilterPartitionBlock.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10438

Test Plan: Unit Tests.

Reviewed By: anand1976

Differential Revision: D38266395

Pulled By: gitbw95

fbshipit-source-id: 3ed062a3b43d6df323371cb0d266f7fe869e9ad2
2022-07-29 11:32:54 -07:00
Peter Dillinger 65036e4217 Revert "Add a blob-specific cache priority (#10309)" (#10434)
Summary:
This reverts commit 8d178090be
because of a clear performance regression seen in internal dashboard
https://fburl.com/unidash/tpz75iee

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10434

Reviewed By: ltamasi

Differential Revision: D38256373

Pulled By: pdillinger

fbshipit-source-id: 134aa00f50dd7b1bbe037c227884a351342ec44b
2022-07-29 07:18:15 -07:00
Gang Liao 8d178090be Add a blob-specific cache priority (#10309)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309

Reviewed By: ltamasi

Differential Revision: D38211655

Pulled By: gangliao

fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
2022-07-27 19:09:24 -07:00
sdong 252bea405e Improve SubCompaction Partitioning (#10393)
Summary:
Unit tests still haven't been fixed and more tests need to be added, but some simple fillrandom db_bench runs show the partitioning is reasonable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393

Test Plan:
1. Make sure existing tests pass. This should cover some basic sub compaction logic to be correct and the partitioning result is reasonable;
2. Add a new unit test to ApproximateKeyAnchors()
3. Run some db_bench with max_subcompaction = 4 and watch the compaction is indeed partitioned evenly.

Reviewed By: jay-zhuang

Differential Revision: D38043783

fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f
2022-07-23 17:38:49 -07:00
Gang Liao 0b6bc101ba Charge blob cache usage against the global memory limit (#10321)
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
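
A hedged configuration sketch (the exact option shape is an assumption based on the block-based table's cache-usage options): separate block and blob caches, with blob cache usage charged against the block cache capacity.
```
#include "rocksdb/cache.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeBlobChargedOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  options.blob_cache = rocksdb::NewLRUCache(256 << 20);  // separate blob cache

  rocksdb::BlockBasedTableOptions table_options;
  table_options.block_cache = rocksdb::NewLRUCache(1 << 30);
  // Charge blob cache reservations against the block cache capacity.
  table_options.cache_usage_options.options_overrides.insert(
      {rocksdb::CacheEntryRole::kBlobCache,
       {rocksdb::CacheEntryRoleOptions::Decision::kEnabled}});
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```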

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321

Reviewed By: ltamasi

Differential Revision: D37913590

Pulled By: gangliao

fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
2022-07-18 23:26:57 -07:00
Jay Zhuang a3acf2ef87 Add seqno to time mapping (#10338)
Summary:
This will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, this adds a seqno-to-time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time mapping at a certain cadence. When a
memtable is flushed, the mapping information is stored in an sstable property.
During compaction, the mapping information is merged to get the
approximate time of a sequence number, which is used to determine whether a key
was recently inserted (within the `preclude_last_level_data_seconds`) and, if so,
preclude it from the last level.
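
A sketch of the option this mapping supports (the value is illustrative):
```
#include "rocksdb/options.h"

rocksdb::Options MakeTieredOptions() {
  rocksdb::Options options;
  // Keep data newer than ~3 days out of the last (cold) level.
  options.preclude_last_level_data_seconds = 3 * 24 * 60 * 60;
  return options;
}
```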

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338

Test Plan: CI

Reviewed By: siying

Differential Revision: D37810187

Pulled By: jay-zhuang

fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
2022-07-14 21:49:34 -07:00
sdong c8b20d469d Make InternalKeyComparator not configurable (#10342)
Summary:
InternalKeyComparator is an internal class which is a simple wrapper of Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizable. As a side effect, the internal key comparator was made configurable too. This introduces overhead to this simple wrapper. For example, every InternalKeyComparator has an std::vector attached to it, which consumes memory and incurs possible allocation overhead.
We remove InternalKeyComparator from being customizable by making InternalKeyComparator not a subclass of Comparator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342

Test Plan: Run existing CI tests and make sure it doesn't fail

Reviewed By: riversand963

Differential Revision: D37771351

fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee
2022-07-14 10:09:31 -07:00
sdong 769b156e65 Remove customized naming from InternalKeyComparator (#10343)
Summary:
InternalKeyComparator is a thin wrapper around the user comparator. Storing a string for the name is relatively expensive for this small wrapper, in both CPU and memory usage. Remove it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10343

Test Plan: Run existing tests

Reviewed By: ajkr

Differential Revision: D37772469

fbshipit-source-id: d2d106a8d022193058fd7f6b220108e3d94aca34
2022-07-12 13:30:35 -07:00
Peter Dillinger e6c5e0ab9a Have Cache use Status::MemoryLimit (#10262)
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)

I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.

HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`
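
A sketch of the updated caller-side check (illustrative):
```
#include "rocksdb/cache.h"

void NoopDeleter(const rocksdb::Slice& /*key*/, void* /*value*/) {}

bool TryInsert(rocksdb::Cache* cache, const rocksdb::Slice& key, void* value,
               size_t charge) {
  rocksdb::Status s = cache->Insert(key, value, charge, &NoopDeleter);
  // With strict_capacity_limit=true, a full cache now surfaces as
  // MemoryLimit rather than Incomplete.
  if (s.IsMemoryLimit()) {
    return false;
  }
  return s.ok();
}
```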

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262

Test Plan: updated unit tests

Reviewed By: hx235

Differential Revision: D37473155

Pulled By: pdillinger

fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
2022-07-06 14:41:46 -07:00
Akanksha Mahajan 2acbf386a3 Provide support for direct_reads with async_io (#10197)
Summary:
Provide support for use_direct_reads with async_io.
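
A minimal sketch of enabling the combination:
```
#include <memory>
#include <string>
#include "rocksdb/db.h"

void ScanWithDirectAsyncReads(const std::string& path) {
  rocksdb::Options options;
  options.use_direct_reads = true;  // bypass the OS page cache for reads
  rocksdb::DB* db = nullptr;
  if (!rocksdb::DB::Open(options, path, &db).ok()) {
    return;
  }
  rocksdb::ReadOptions read_options;
  read_options.async_io = true;  // now compatible with direct reads
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(read_options));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // process it->key() / it->value()
  }
  it.reset();
  delete db;
}
```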

Test Plan:
-  Updated unit tests
-  db_bench: Results in https://github.com/facebook/rocksdb/pull/10197#issuecomment-1159239420
- db_stress
```
export CRASH_TEST_EXT_ARGS=" --async_io=1 --use_direct_reads=1"
make crash_test -j
```
- Ran db_bench on a previous RocksDB version before any async_io implementation (as there have been many changes in different PRs in this area) https://github.com/facebook/rocksdb/pull/10197#issuecomment-1160781563.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10197

Reviewed By: anand1976

Differential Revision: D37255646

Pulled By: akankshamahajan15

fbshipit-source-id: fec61ae15bf4d625f79dea56e4f86e0e307ba920
2022-07-06 11:42:59 -07:00
Changyu Bi f9cfc6a808 Updated NewDataBlockIterator to not fetch compression dict for non-da… (#10310)
Summary:
…ta blocks

During MyShadow testing, ajkr helped me find out that with partitioned index and dictionary compression enabled, `PartitionedIndexIterator::InitPartitionedIndexBlock()` spent a considerable amount of time (1-2% CPU) on fetching the uncompression dictionary. Fetching the uncompression dict was not needed since the index blocks were not compressed (and even if they were, they would use an empty dictionary). This should only affect use cases with partitioned index and dictionary compression, and without the uncompression dictionary pinned. This PR updates NewDataBlockIterator to not fetch the uncompression dictionary when it is not for data blocks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10310

Test Plan:
1. `make check`
2. Perf benchmark: 1.5% (143950 -> 146176) improvement in op/sec for partitioned index + dict compression benchmark.
For default config without partitioned index and without dict compression, there is no regression in readrandom perf from multiple runs of db_bench.

```
# Set up for partitioned index with dictionary compression
TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -partition_index=true  -compression_max_dict_bytes=16384 -compression_zstd_max_train_bytes=1638400

# Pre PR
TEST_TMPDIR=/dev/shm ./db_bench_main -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
readrandom [AVG    50 runs] : 143950 (± 1108) ops/sec;   15.9 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 144406 ops/sec;   16.0 MB/sec

# Post PR
TEST_TMPDIR=/dev/shm ./db_bench_opt -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
readrandom [AVG    50 runs] : 146176 (± 1121) ops/sec;   16.2 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 146014 ops/sec;   16.2 MB/sec

# Set up for no partitioned index and no dictionary compression
TEST_TMPDIR=/dev/shm/baseline ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false
# Pre PR
TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_main --use_existing_db=true "--benchmarks=readrandom[-X50]"
readrandom [AVG    50 runs] : 158546 (± 1000) ops/sec;   17.5 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 158280 ops/sec;   17.5 MB/sec

# Post PR
TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_opt --use_existing_db=true "--benchmarks=readrandom[-X50]"
readrandom [AVG    50 runs] : 161061 (± 1520) ops/sec;   17.8 (± 0.2) MB/sec
readrandom [MEDIAN 50 runs] : 161596 ops/sec;   17.9 MB/sec
```

Reviewed By: ajkr

Differential Revision: D37631358

Pulled By: cbi42

fbshipit-source-id: 6ca2665e270e63871968e061ba4a99d3136785d9
2022-07-06 09:30:25 -07:00
Levi Tamasi c73d2a9d18 Add API for writing wide-column entities (#10242)
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.
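
A minimal usage sketch of the new API:
```
#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

rocksdb::Status WriteEntity(rocksdb::DB* db) {
  // An entity with two named columns.
  rocksdb::WideColumns columns{{"attr1", "value1"}, {"attr2", "value2"}};
  return db->PutEntity(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                       "key1", columns);
}
```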

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D37369748

Pulled By: ltamasi

fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
2022-06-25 15:30:47 -07:00
Bo Wang 8e63d90ff8 Pass rate_limiter_priority through filter block reader functions to FS (#10251)
Summary:
With https://github.com/facebook/rocksdb/pull/9996 , we can pass the rate_limiter_priority to FS for most cases. This PR is to update the code path for filter block reader.
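
On the application side, the priority being plumbed through is set per read (a sketch; a rate limiter must also be configured on the DB):
```
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/env.h"

void RateLimitedGet(rocksdb::DB* db) {
  rocksdb::ReadOptions read_options;
  // Now passed down through the filter block reader to the FS layer.
  read_options.rate_limiter_priority = rocksdb::Env::IO_USER;
  std::string value;
  db->Get(read_options, db->DefaultColumnFamily(), "key1", &value);
}
```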

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251

Test Plan: Current unit tests should pass.

Reviewed By: pdillinger

Differential Revision: D37427667

Pulled By: gitbw95

fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433
2022-06-24 16:13:44 -07:00
Bo Wang c073ed7601 Fix typo in comments and code (#10233)
Summary:
Fix typo in comments and code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10233

Test Plan: Existing unit tests should pass.

Reviewed By: jay-zhuang, anand1976

Differential Revision: D37356702

Pulled By: gitbw95

fbshipit-source-id: 32c019adcc6dcc95a9882b38147a310091368e51
2022-06-22 15:45:21 -07:00
Peter Dillinger 84210c9489 Add data block hash index to crash test, fix MultiGet issue (#10220)
Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues.
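
For reference, a sketch of enabling the data block hash index that the crash test now exercises:
```
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeHashIndexOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.data_block_index_type =
      rocksdb::BlockBasedTableOptions::kDataBlockBinaryAndHash;
  table_options.data_block_hash_table_util_ratio = 0.75;  // the default
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```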

Fixes https://github.com/facebook/rocksdb/issues/10186

I intend to pick this into the 7.4.0 release candidate

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220

Test Plan:
Failure quickly reproduces in crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test I believe would be too tricky and fragile
to be worthwhile.

Reviewed By: anand1976

Differential Revision: D37315647

Pulled By: pdillinger

fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
2022-06-21 16:23:58 -07:00
Peter Dillinger fff302d989 More testing w/prefix extractor, small refactor (#10122)
Summary:
There was an interesting code path not covered by testing that
is difficult to replicate in a unit test, which is now covered using a
sync point. Specifically, the case of table_prefix_extractor == null and
!need_upper_bound_check in `BlockBasedTable::PrefixMayMatch`, which
can happen if table reader is open before extractor is registered with global
object registry, but is later registered and re-set with SetOptions. (We
don't have sufficient testing control over object registry to set that up
repeatedly.)

Also, this function has been renamed to `PrefixRangeMayMatch` for clarity
vs. other functions that are not the same.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10122

Test Plan: unit tests expanded

Reviewed By: siying

Differential Revision: D36944834

Pulled By: pdillinger

fbshipit-source-id: 9e52d9da1929a3e42bbc230fcdc3599949de7bdb
2022-06-16 16:41:25 -07:00
Peter Dillinger 126c223714 Remove deprecated block-based filter (#10184)
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).

This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).

Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184

Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB

Reviewed By: riversand963

Differential Revision: D37212647

Pulled By: pdillinger

fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
2022-06-16 15:51:33 -07:00
anand76 a6691d0f65 Update stats to help users estimate MultiGet async IO impact (#10182)
Summary:
Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements -
1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for number of levels that required MultiGet to read from a file
2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used

The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization.
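
A sketch of reading the new stats:
```
#include "rocksdb/statistics.h"

void ReportMultiGetStats(rocksdb::Statistics* stats) {
  // Ticker: how many times the coroutine MultiGetFromSST path was used.
  uint64_t coroutine_count =
      stats->getTickerCount(rocksdb::MULTIGET_COROUTINE_COUNT);
  // Histogram: number of levels MultiGet had to read a file from.
  rocksdb::HistogramData levels_read;
  stats->histogramData(rocksdb::NUM_LEVEL_READ_PER_MULTIGET, &levels_read);
  (void)coroutine_count;
  (void)levels_read;
}
```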

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182

Reviewed By: akankshamahajan15

Differential Revision: D37213296

Pulled By: anand1976

fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb
2022-06-16 12:12:43 -07:00
Akanksha Mahajan 8353ae8b27 Add few optimizations in async_io for short scans (#10140)
Summary:
This PR adds few optimizations for async_io for shorter scans.
1.  If async_io is enabled, Seek creates a FilePrefetchBuffer object to fetch the data asynchronously. However, `FilePrefetchBuffer::num_file_reads_` wasn't taken into consideration when Next is called after Seek, so it would go for prefetching. This PR fixes that: Next goes for prefetching only if `FilePrefetchBuffer::num_file_reads_` is greater than 2 and the blocks are sequential. This scenario applies only to implicit auto readahead.
2. For Seek, when TryReadFromCacheAsync is called to poll, it made an async call as well because the TryReadFromCacheAsync flow wasn't changed. It is updated to return after the poll instead of prefetching any further data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10140

Test Plan:
1. Added a unit test
2. Ran crash_test with async_io = 1 to make sure nothing crashes.

Reviewed By: anand1976

Differential Revision: D37042242

Pulled By: akankshamahajan15

fbshipit-source-id: b8e6b7cb2ee0886f37a8f53951948b9084e8ffda
2022-06-15 20:17:35 -07:00