Commit Graph

55 Commits

Author SHA1 Message Date
Akanksha Mahajan 956f1dfde3 Change ReadAsync callback API to remove const from FSReadRequest (#11649)
Summary:
Modify ReadAsync callback API to remove const from FSReadRequest as const doesn't let to fs_scratch to move the ownership.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11649

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D53585309

Pulled By: akankshamahajan15

fbshipit-source-id: 3bff9035db0e6fbbe34721a5963443355807420d
2024-02-16 09:14:55 -08:00
akankshamahajan b5bb553d5e Fix PREFETCH_BYTES_USEFUL stat calculation (#12251)
Summary:
After refactoring of FilePrefetchBuffer, PREFETCH_BYTES_USEFUL was miscalculated. Instead of calculating how many requested bytes are already in the buffer, it took into account alignment as well because aligned_useful_len takes into consideration alignment too.

Also refactored the naming of chunk_offset_in_buffer to make it similar to aligned_useful_len

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12251

Test Plan:
1. Validated internally through release validation benchmarks.
2. Updated unit test that fails without the fix.

Reviewed By: ajkr

Differential Revision: D52891112

Pulled By: akankshamahajan15

fbshipit-source-id: 2526a0b0572d473beaf8b841f2f9c2f6275d9779
2024-01-18 19:09:49 -08:00
akankshamahajan 5cb2d09d47 Refactor FilePrefetchBuffer code (#12097)
Summary:
Summary - Refactor FilePrefetchBuffer code
- Implementation:
FilePrefetchBuffer maintains a deque of free buffers (free_bufs_) of size num_buffers_ and buffers (bufs_) which contains the prefetched data. Whenever a buffer is consumed or is outdated (w.r.t. to requested offset), that buffer is cleared and returned to free_bufs_.

 If a buffer is available in free_bufs_, it's moved to bufs_ and is sent for prefetching. num_buffers_ defines how many buffers are maintained that contains prefetched data.
If num_buffers_ == 1, it's a sequential read flow. Read API will be called on that one buffer whenever the data is requested and is not in the buffer.
If num_buffers_ > 1, then the data is prefetched asynchronosuly in the buffers whenever the data is consumed from the buffers and that buffer is freed.
If num_buffers > 1, then requested data can be overlapping between 2 buffers. To return the continuous buffer overlap_bufs_ is used. The requested data is copied from 2 buffers to the overlap_bufs_ and overlap_bufs_ is returned to
the caller.

- Merged Sync and Async code flow into one in FilePrefetchBuffer.

Test Plan -
- Crash test passed
- Unit tests
- Pending - Benchmarks

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12097

Reviewed By: ajkr

Differential Revision: D51759552

Pulled By: akankshamahajan15

fbshipit-source-id: 69a352945affac2ed22be96048d55863e0168ad5
2024-01-05 09:29:01 -08:00
Peter Dillinger ed46981bea Fix and defend against FilePrefetchBuffer combined with mmap reads (#12206)
Summary:
FilePrefetchBuffer makes an unchecked assumption about the behavior of RandomAccessFileReader::Read: that it will write to the provided buffer rather than returning the data in an alternate buffer. FilePrefetchBuffer has been quietly incompatible with mmap reads (e.g. allow_mmap_reads / use_mmap_reads) because in that case an alternate buffer is returned (mmapped memory). This incompatibility currently leads to quiet data corruption, as seen in amplified crash test failure in https://github.com/facebook/rocksdb/issues/12200.

In this change,
* Check whether RandomAccessFileReader::Read has the expected behavior, and fail if not. (Assertion failure in debug build, return Corruption in release build.) This will detect future regressions synchronously and precisely, rather than relying on debugging downstream data corruption.
  * Why not recover? My understanding is that FilePrefetchBuffer is not intended for use when RandomAccessFileReader::Read uses an alternate buffer, so quietly recovering could lead to undesirable (inefficient) behavior.
* Mention incompatibility with mmap-based readers in the internal API comments for FilePrefetchBuffer
* Fix two cases where FilePrefetchBuffer could be used with mmap, both stemming from SstFileDumper, though one fix is in BlockBasedTableReader. There is currently no way to ask a RandomAccessFileReader whether it's using mmap, so we currently have to rely on other options as clues.

Keeping separate from https://github.com/facebook/rocksdb/issues/12200 in part because this change is more appropriate for backport than that one.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12206

Test Plan:
* Manually verified that the new check aids in debugging.
* Unit test added, that fails if either fix is missed.
* Ran blackbox_crash_test for hours, with and without https://github.com/facebook/rocksdb/issues/12200

Reviewed By: akankshamahajan15

Differential Revision: D52551701

Pulled By: pdillinger

fbshipit-source-id: dea87c5782b7c484a6c6e424585c8832dfc580dc
2024-01-04 18:39:05 -08:00
akankshamahajan e7c6259447 Make auto_readahead_size default true (#12080)
Summary:
Make auto_readahead_size option default true

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12080

Test Plan: benchmarks and exisiting tests

Reviewed By: anand1976

Differential Revision: D52152132

Pulled By: akankshamahajan15

fbshipit-source-id: f1515563564e77df457dff2e865e4ede8c3ddf44
2023-12-14 11:25:51 -08:00
akankshamahajan c77b50a4fd Add AsyncIO support for tuning readahead_size by block cache lookup (#11936)
Summary:
Add support for tuning of readahead_size by block cache lookup for async_io.

**Design/ Implementation** -

**BlockBasedTableIterator.cc** -

`BlockCacheLookupForReadAheadSize` callback API lookups in the block cache and tries to reduce the start
and end offset passed. This function looks into the block cache for the blocks between `start_offset`
and `end_offset` and add all the handles in the queue.

It then iterates from the end in the handles to find first miss block and update the end offset to that block.
It also iterates from the start and find first miss block and update the start offset to that block.

```
_read_curr_block_ argument : True if this call was due to miss in the cache and caller wants to read that block
                             synchronously.
                             False if current call is to prefetch additional data in extra buffers
                            (due to ReadAsync call in FilePrefetchBuffer)
```
In case there is no data to be read in that callback (because of upper_bound or all blocks are in cache),
it updates start and end offset to be equal and that `FilePrefetchBuffer` interprets that as 0 length to be read.

**FilePrefetchBuffer.cc** -

FilePrefetchBuffer calls the callback - `ReadAheadSizeTuning` and pass the start and end offset to that
callback to get updated start and end offset to read based on cache hits/misses.

1. In case of Read calls (when offset passed to FilePrefetchBuffer is on cache miss and that data needs to be read), _read_curr_block_ is passed true.
2. In case of ReadAsync calls, when buffer is all consumed and can go for additional prefetching,  the start offset passed is the initial end offset of prev buffer (without any updated offset based on cache hit/miss).

Foreg. if following are the data blocks with cache hit/miss and start offset
and Read API found miss on DB1 and based on readahead_size (50)  it passes end offset to be 50.
 [DB1 - miss- 0 ] [DB2 - hit -10] [DB3 - miss -20] [DB4 - miss-30] [DB5 - hit-40]
 [DB6 - hit-50] [DB7 - miss-60] [DB8 - miss - 70] [DB9 - hit - 80] [DB6 - hit 90]

- For Read call - updated start offset remains 0 but end offset updates to DB4, as DB5 is in cache.
- Read calls saves initial end offset 50 as that was meant to be prefetched.
- Now for next ReadAsync call - the start offset will be 50 (previous buffer initial end offset) and based on readahead_size, end offset will be 100
- On callback, because of cache hits - callback will update the start offset to 60 and end offset to 80 to read only 2 data blocks (DB7 and DB8).
- And for that ReadAsync call - initial end offset will be set to 100 which will again used by next ReadAsync call as start offset.
-  `initial_end_offset_` in `BufferInfo` is used to save the initial end offset of that buffer.

- If let's say DB5 and DB6 overlaps in 2 buffers (because of alignment), `prev_buf_end_offset` is passed to make sure already prefetched data is not prefetched again in second buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11936

Test Plan:
- Ran crash_test several times.
-  New unit tests added.

Reviewed By: anand1976

Differential Revision: D50906217

Pulled By: akankshamahajan15

fbshipit-source-id: 0d75d3c98274e98aa34901b201b8fb05232139cf
2023-12-06 13:48:15 -08:00
anand76 e81393e81e Add some stats to observe the usefulness of scan prefetching (#11981)
Summary:
Add stats for better observability of scan prefetching. Its only implemented for sync scan right now. These stats can help inform future improvements in scan prefetching.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11981

Test Plan: Add a new unit test

Reviewed By: akankshamahajan15

Differential Revision: D50516505

Pulled By: anand1976

fbshipit-source-id: cb1cc6cf02df8295930a49c62b11870020df3f97
2023-10-23 14:42:44 -07:00
Akanksha Mahajan 018eede679 Remove assertion from PrefetchAsync (#11965)
Summary:
Remove assertion from PrefetchAsync (roundup_len2 >= alignment) as for non direct_io, buffer size can be less than alignment resulting in assertion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11965

Test Plan: Ran the issue causing db_stress without this assertion and the verification completes successfully.

Reviewed By: anand1976

Differential Revision: D50328955

Pulled By: akankshamahajan15

fbshipit-source-id: 65f55ca230d2bbc63f4e2cc34c7273b22b515879
2023-10-16 15:14:58 -07:00
akankshamahajan 3d67b5e8e5 Lookup ahead in block cache ahead to tune Readaheadsize (#11860)
Summary:
Implement block cache lookup to determine readahead_size during scans. It's enabled if auto_readahead_size, block_cache and iterate_upper_bound - all three are set.

Design -
1. Whenever there is a cache miss and FilePrefetchBuffer is called, a callback is made to determine readahead_size for that prefetching.
2. The callback iterates over index and do block cache lookup for each data block handle until existing readahead_size is reached. Then It removes the cache hit data blocks from end to calculate optimized readahead_size.
3. Since index_iter_ is moved, it stores block handles in a queue, and use that queue to get block handle instead of doing index_iter_->Next().
4. This is for Sync scans. Async scans support is in progress.

NOTE:
The issue right now is after Seek and Next, if Prev is called, there is no way to do Prev operation. index_iter_ is already pointing to a different block. So it returns "Not supported" in that case with error message - "auto tuning of readahead size is not supported with Prev op"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11860

Test Plan:
- Added new unit test
- crash_tests
- Running scans locally to check for any regression

Reviewed By: anand1976

Differential Revision: D49548118

Pulled By: akankshamahajan15

fbshipit-source-id: f1aee409a71b4ad9e5bf3610f43edf30c6630c78
2023-09-22 18:12:08 -07:00
akankshamahajan c1a97fe1f6 Fix Assertion `roundup_len2 >= alignment' failed in crash tests (#11852)
Summary:
When auto_readahead_size is enabled in async_io, during seek, first buffer will prefetch the data - (current block + readahead till upper_bound). There can be cases where
1.  first buffer prefetched all the data till upper bound, or
2.  first buffer already has the data from prev seek call
and second buffer prefetch further leading to alignment issues.

This PR fixes that assertion and second buffer won't go for prefetching if first buffer has already prefetched till upper_bound.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11852

Test Plan:
- Added new unit test that failed without this fix.
- crash tests passed locally

Reviewed By: pdillinger

Differential Revision: D49384138

Pulled By: akankshamahajan15

fbshipit-source-id: 54417e909e4d986f1e5a17dbaea059cd4962fd4d
2023-09-20 16:13:20 -07:00
akankshamahajan ed5b6c0d99 Avoid alignment in FilePrefetchBuffer during seek with async_io (#11793)
Summary:
During Seek, the iterator seeks every file on L0. In async_io, it submit the requests to seek on every file on L0 asynchronously using RocksDB FilePrefetchBuffer.
However, FilePrefetchBuffer does alignment and reads extra bytes then needed that can increase the throughput.
In case of non direct io, the alignment can be avoided.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11793

Test Plan:
- Added a unit test that fails without this PR.
- make crash_test -j32 completed successfully

Reviewed By: anand1976

Differential Revision: D48985051

Pulled By: akankshamahajan15

fbshipit-source-id: 2d130a9e7c3df9c4fcd0408406e6277ab75a4389
2023-09-11 11:41:44 -07:00
akankshamahajan f36394ff20 Fix seg fault in auto_readahead_size with async_io (#11769)
Summary:
Fix seg fault in auto_readahead_size with async_io when readahead_size = 0. If readahead_size is trimmed and is 0, it's not eligible for further prefetching and should return.

Error occured when the first buffer already contains data and it goes for prefetching in second buffer leading to assertion failure -

`assert(roundup_len1 >= alignment);
`
because roundup_len1 = length + readahead_size.
length is 0 and readahead_size is also 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11769

Test Plan: Reproducible with db_stress with async_io enabled.

Reviewed By: anand1976

Differential Revision: D48743031

Pulled By: akankshamahajan15

fbshipit-source-id: 0e08c41f862f6287ca223fbfaf6cd42fc97b3c87
2023-08-28 17:08:28 -07:00
akankshamahajan f65a0379f0 Implement trimming of readhead size when upper bound is specified (#11684)
Summary:
Implement trimming of readahead_size under a new option ReadOptions.auto_readahead_size. It'll trim the readahead_size during prefetching upto iterate_upper_bound offset only when ReadOptions.iterate_upper_bound is set, therefore reducing the prefetching of data beyond upper_bound.
It's enabled for both implicit auto readahead size and when ReadOptions.readahead_size is specified and for sync and async_io.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11684

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D48479723

Pulled By: akankshamahajan15

fbshipit-source-id: 2b1703579caf779105e836b580866ffd7db076fc
2023-08-18 15:52:04 -07:00
Hui Xiao 9a034801ce Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
2023-08-08 17:26:50 -07:00
akankshamahajan ff1cc8a63e Fix extra prefetching when num_file_reads_for_auto_readahead is 1 in async_io (#11560)
Summary:
When num_file_reads_for_auto_readahead = 1, during seek, it would go for prefetchingextra data in second buffer along with seek data, that would lead to increase in read data and
discarded bytes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11560

Test Plan: Added unit test

Reviewed By: anand1976

Differential Revision: D47008102

Pulled By: akankshamahajan15

fbshipit-source-id: 566c6131cb5f968d5efb81fd0ab233ff7e534ab0
2023-06-26 10:39:44 -07:00
Peter Dillinger 206fdea3d9 Change internal headers with duplicate names (#11408)
Summary:
In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.

No public API changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D45456696

Pulled By: pdillinger

fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
2023-05-17 11:27:09 -07:00
Hui Xiao bab5f9a6f2 Add new stat rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit} (#11265)
Summary:
**Context/Summary:**
We are adding new stats to measure behavior of prefetched tail size and look up into this buffer

The stat collection is done in FilePrefetchBuffer but only for prefetched tail buffer during table open for now using FilePrefetchBuffer enum. It's cleaner than the alternative of implementing in upper-level call places of FilePrefetchBuffer for table open. It also has the benefit of extensible to other types of FilePrefetchBuffer if needed. See db bench for perf regression concern.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11265

Test Plan:
**- Piggyback on existing test**
**- rocksdb.table.open.prefetch.tail.miss is harder to UT so I manually set prefetch tail read bytes to be small and run db bench.**
```
./db_bench -db=/tmp/testdb -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3  -use_direct_reads=true
```
```
rocksdb.table.open.prefetch.tail.read.bytes P50 : 4096.000000 P95 : 4096.000000 P99 : 4096.000000 P100 : 4096.000000 COUNT : 225 SUM : 921600
rocksdb.table.open.prefetch.tail.miss COUNT : 91
rocksdb.table.open.prefetch.tail.hit COUNT : 1034
```
**- No perf regression observed in db_bench**

SETUP command: create same db with ~900 files for pre-change/post-change.
```
./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=500000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none
```
TEST command 60 runs or til convergence: as suggested by anand1976 and akankshamahajan15, vary `seek_nexts` and `async_io` in testing.
```
./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom[-X60] -num=50000 -seek_nexts={10, 500, 1000} -async_io={0|1} -use_direct_reads=true
```
async io = 0, direct io read = true

  | seek_nexts = 10, 30 runs | seek_nexts = 500, 12 runs | seek_nexts = 1000, 6 runs
-- | -- | -- | --
pre-post change | 4776 (± 28) ops/sec;   24.8 (± 0.1) MB/sec | 288 (± 1) ops/sec;   74.8 (± 0.4) MB/sec | 145 (± 4) ops/sec;   75.6 (± 2.2) MB/sec
post-change | 4790 (± 32) ops/sec;   24.9 (± 0.2) MB/sec | 288 (± 3) ops/sec;   74.7 (± 0.8) MB/sec | 143 (± 3) ops/sec;   74.5 (± 1.6) MB/sec

async io = 1, direct io read = true
  | seek_nexts = 10, 54 runs | seek_nexts = 500, 6 runs | seek_nexts = 1000, 4 runs
-- | -- | -- | --
pre-post change | 3350 (± 36) ops/sec;   17.4 (± 0.2) MB/sec | 264 (± 0) ops/sec;   68.7 (± 0.2) MB/sec | 138 (± 1) ops/sec;   71.8 (± 1.0) MB/sec
post-change | 3358 (± 27) ops/sec;   17.4 (± 0.1) MB/sec  | 263 (± 2) ops/sec;   68.3 (± 0.8) MB/sec | 139 (± 1) ops/sec;   72.6 (± 0.6) MB/sec

Reviewed By: ajkr

Differential Revision: D43781467

Pulled By: hx235

fbshipit-source-id: a706a18472a8edb2b952bac3af40eec803537f2a
2023-03-15 14:02:43 -07:00
akankshamahajan 68e4581c67 Return NotSupported in scan if IOUring not supported and enable IOUring in db_stress for async_io testing (#11197)
Summary:
- Return NotSupported in scan if IOUring not supported if async_io is enabled
- Enable IOUring in db_stress for async_io testing
- Disable async_io in circleci crash testing as circleci doesn't support IOUring

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11197

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D43096313

Pulled By: akankshamahajan15

fbshipit-source-id: c2c53a87636950c0243038b9f5bd0d91608e4fda
2023-02-16 18:33:06 -08:00
akankshamahajan bde65052c4 Enhance async scan prefetch unit tests (#11087)
Summary:
Add more coverage in unit tests for async scan. The added unit test fails without PR https://github.com/facebook/rocksdb/pull/10939.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11087

Test Plan: CircleCI jobs status for new unit tests.

Reviewed By: anand1976

Differential Revision: D42487931

Pulled By: akankshamahajan15

fbshipit-source-id: d59ed7666599bd0d2733ac5d76bd70984b54c5a9
2023-01-20 10:17:57 -08:00
anand76 dbf37c290a Fix async prefetch heap use after free (#11049)
Summary:
This PR fixes a heap use after free bug in the async prefetch code that happens in the following scenario -
1. Scan thread starts 2 async reads for Seek, one for the seek block and one for prefetching
2. Before the first read in https://github.com/facebook/rocksdb/issues/1 completes, another thread reads and loads the block in cache
3. The first scan thread finds the block in cache, continues and the next block cache miss is for a block that spans the boundary of the 2 prefetch buffers, and the 1st read is complete but the 2nd one is not complete yet
4. The scan thread will reallocate (i.e free the old buffer and allocate a new one) the 2nd prefetch buffer, and the in-progress prefetch is orphaned
5. The orphaned prefetch finally completes, resulting in a use after free

Also add a few asserts to surface bugs earlier in the crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11049

Test Plan: Repro with db_stress and verify the fix

Reviewed By: akankshamahajan15

Differential Revision: D42181118

Pulled By: anand1976

fbshipit-source-id: 1ac55d2f64a89ce128c1c574262b8aa7d82eb8cc
2022-12-21 09:15:53 -08:00
akankshamahajan d1aca4a5ae Fix async_io regression in scans (#10939)
Summary:
Fix async_io regression in scans due to incorrect check which was causing the valid data in buffer to be cleared during seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10939

Test Plan:
- stress tests  export CRASH_TEST_EXT_ARGS="--async_io=1"
    make crash_test -j32
- Ran db_bench command which was caught the regression:
./db_bench --db=/rocksdb_async_io_testing/prefix_scan --disable_wal=1 --use_existing_db=true --benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=963 -duration=30 -ops_between_duration_checks=1 --async_io=true --compaction_readahead_size=4194304 --log_readahead_size=0 --blob_compaction_readahead_size=0 --initial_auto_readahead_size=65536 --num_file_reads_for_auto_readahead=0 --max_auto_readahead_size=524288

seekrandom   :    3777.415 micros/op 264 ops/sec 30.000 seconds 7942 operations;  132.3 MB/s (7942 of 7942 found)

Reviewed By: anand1976

Differential Revision: D41173899

Pulled By: akankshamahajan15

fbshipit-source-id: 2d75b06457d65b1851c92382565d9c3fac329dfe
2022-11-11 13:34:49 -08:00
akankshamahajan ff9ad2c39b Fix async_io failures in case there is error in reading data (#10890)
Summary:
Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if data is overlapping between two buffers. If there is IOError while reading the data, it leads to empty buffer and other buffer already in progress of async read goes again for reading causing the error.
Fix: Added check to abort IO in second buffer if curr_ got empty.

This PR also fixes db_stress failures which happened when buffers are not aligned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10890

Test Plan:
- Ran make crash_test -j32 with async_io enabled.
-  Ran benchmarks to make sure there is no regression.

Reviewed By: anand1976

Differential Revision: D40881731

Pulled By: akankshamahajan15

fbshipit-source-id: 39fcf2134c7b1bbb08415ede3e1ef261ac2dbc58
2022-11-01 16:06:51 -07:00
Akanksha Mahajan 03fc43976d Async optimization in scan path (#10602)
Summary:
Optimizations
1. In FilePrefetchBuffer, when data is overlapping between two buffers, it copies the data from first to third buffer, then from
second to third buffer to return continuous buffer. This optimization will call ReadAsync on first buffer as soon as buffer is empty instead of getting blocked by second buffer to copy the data.
2. For fixed size readahead_size, FilePrefetchBuffer will issues two async read calls. One with length + readahead_size_/2 on first buffer(if buffer is empty) and readahead_size_/2 on second buffer during seek.

- Add readahead_size to db_stress for stress testing these changes in https://github.com/facebook/rocksdb/pull/10632

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10602

Test Plan:
- CircleCI tests
- stress_test completed successfully
export CRASH_TEST_EXT_ARGS="--async_io=1"
make crash_test -j32
- db_bench showed no regression
   With this PR:
```
 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
Set seed to 1661876074584472 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:14:34 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  270878.018 micros/op 3 ops/sec 30.068 seconds 111 operations;  618.7 MB/s (111 of 111 found)

 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=true -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
Set seed to 1661875332862922 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:02:12 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  358352.488 micros/op 2 ops/sec 30.102 seconds 84 operations;  474.4 MB/s (84 of 84 found)
```

Without PR in main:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
Set seed to 1661876425983045 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:20:26 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  280881.953 micros/op 3 ops/sec 30.054 seconds 107 operations;  605.2 MB/s (107 of 107 found)

 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=0
Set seed to 1661876475267771 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:21:15 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  363155.084 micros/op 2 ops/sec 30.142 seconds 83 operations;  468.1 MB/s (83 of 83 found)
```

Reviewed By: anand1976

Differential Revision: D39141328

Pulled By: akankshamahajan15

fbshipit-source-id: 560655922c1a437a8569c228abb31b8c0b413120
2022-09-12 17:42:01 -07:00
Akanksha Mahajan 2acbf386a3 Provide support for direct_reads with async_io (#10197)
Summary:
Provide support for use_direct_reads with async_io.

TestPlan:
-  Updated unit tests
-  db_bench: Results in https://github.com/facebook/rocksdb/pull/10197#issuecomment-1159239420
- db_stress
```
export CRASH_TEST_EXT_ARGS=" --async_io=1 --use_direct_reads=1"
make crash_test -j
```
- Ran db_bench on previous RocksDB version before any async_io implementation (as there have many changes in different PRs in this area) https://github.com/facebook/rocksdb/pull/10197#issuecomment-1160781563.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10197

Reviewed By: anand1976

Differential Revision: D37255646

Pulled By: akankshamahajan15

fbshipit-source-id: fec61ae15bf4d625f79dea56e4f86e0e307ba920
2022-07-06 11:42:59 -07:00
Akanksha Mahajan 8353ae8b27 Add few optimizations in async_io for short scans (#10140)
Summary:
This PR adds few optimizations for async_io for shorter scans.
1.  If async_io is enabled, seek would create FilePrefetchBuffer object to fetch the data asynchronously. However `FilePrefetchbuffer::num_file_reads_` wasn't taken into consideration if it calls Next after Seek and would go for Prefetching.  This PR fixes that and Next will go for prefetching only if `FilePrefetchbuffer::num_file_reads_` is greater than 2 along with if blocks are sequential. This scenario is only for implicit auto readahead.
2. For seek, when it calls TryReadFromCacheAsync to poll it makes async call as well because TryReadFromCacheAsync flow wasn't changed. So I updated to return after poll instead of further prefetching any data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10140

Test Plan:
1. Added a unit test
                  2. Ran crash_test with async_io = 1 to make sure nothing crashes.

Reviewed By: anand1976

Differential Revision: D37042242

Pulled By: akankshamahajan15

fbshipit-source-id: b8e6b7cb2ee0886f37a8f53951948b9084e8ffda
2022-06-15 20:17:35 -07:00
Akanksha Mahajan a479c2c2b2 Fix stress test failure "Corruption: checksum mismatch" or "Iterator Diverged" with async_io enabled (#10032)
Summary:
In case of non sequential reads with `async_io`, `FilePRefetchBuffer::TryReadFromCacheAsync` can be called for previous blocks with `offset < bufs_[curr_].offset_` which wasn't handled correctly resulting wrong data being returned from buffer.

Since `FilePRefetchBuffer::PrefetchAsync` can be called for any data block, it sets `prev_len_` to 0  indicating `FilePRefetchBuffer::TryReadFromCacheAsync` to go for the prefetching even though offset < bufs_[curr_].offset_  This is because async prefetching is always done in second buffer (to avoid mutex) even though curr_ is empty leading to  offset < bufs_[curr_].offset_ in some cases.
If prev_len_ is non zero then `TryReadFromCacheAsync` returns false if `offset < bufs_[curr_].offset_ && prev_len != 0` indicating reads are not sequential and previous call wasn't PrefetchAsync.

-  This PR also simplifies `FilePRefetchBuffer::TryReadFromCacheAsync` as it was getting complicated covering different scenarios based on `async_io` enabled/disabled. If `for_compaction` is set true, it now calls `FilePRefetchBufferTryReadFromCache` following synchronous flow as before. Its decided in BlockFetcher.cc

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10032

Test Plan:
1.  export CRASH_TEST_EXT_ARGS=" --async_io=1"
     make crash_test -j completed successfully locally
2. make crash_test -j completed successfully locally
3. Reran CircleCi mini crashtest job 4 - 5 times.
4. Updated prefetch_test for more coverage.

Reviewed By: anand1976

Differential Revision: D36579858

Pulled By: akankshamahajan15

fbshipit-source-id: 0c428d62b45e12e082a83acf533a5e37a584bedf
2022-05-23 12:15:26 -07:00
Akanksha Mahajan 2db6a4a1d6 Seek parallelization (#9994)
Summary:
The RocksDB iterator is a hierarchy of iterators. MergingIterator maintains a heap of LevelIterators, one for each L0 file and for each non-zero level. The Seek() operation naturally lends itself to parallelization, as it involves positioning every LevelIterator on the correct data block in the correct SST file. It lookups a level for a target key, to find the first key that's >= the target key. This typically involves reading one data block that is likely to contain the target key, and scan forward to find the first valid key. The forward scan may read more data blocks. In order to find the right data block, the iterator may read some metadata blocks (required for opening a file and searching the index).
This flow can be parallelized.

Design: Seek will be called two times under async_io option. First seek will send asynchronous request to prefetch the data blocks at each level and second seek will follow the normal flow and in FilePrefetchBuffer::TryReadFromCacheAsync it will wait for the Poll() to get the results and add the iterator to min_heap.
- Status::TryAgain is passed down from FilePrefetchBuffer::PrefetchAsync to block_iter_.Status indicating asynchronous request has been submitted.
- If for some reason asynchronous request returns error in submitting the request, it will fallback to sequential reading of blocks in one pass.
- If the data already exists in prefetch_buffer, it will return the data without prefetching further and it will be treated as single pass of seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9994

Test Plan:
- **Run Regressions.**
```
./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
```
i) Previous release 7.0 run for normal prefetching with async_io disabled:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.0
Date:       Thu Mar 17 13:11:34 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
```

ii) normal prefetching after changes with async_io disable:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Set seed to 1652922591315307 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.3
Date:       Wed May 18 18:09:51 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  483080.466 micros/op 2 ops/sec 120.287 seconds 249 operations;  340.8 MB/s (249 of 249 found)
```
iii) db_bench with async_io enabled completed succesfully

```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1
Set seed to 1652924062021732 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.3
Date:       Wed May 18 18:34:22 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  553913.576 micros/op 1 ops/sec 120.199 seconds 217 operations;  293.6 MB/s (217 of 217 found)
```

- db_stress with async_io disabled completed succesfully
```
 export CRASH_TEST_EXT_ARGS=" --async_io=0"
 make crash_test -j
```

I**n Progress**: db_stress with async_io is failing and working on debugging/fixing it.

Reviewed By: anand1976

Differential Revision: D36459323

Pulled By: akankshamahajan15

fbshipit-source-id: abb1cd944abe712bae3986ae5b16704b3338917c
2022-05-20 16:09:33 -07:00
Akanksha Mahajan fce65e7e4f Fix bug in async_io path which reads incorrect length (#9916)
Summary:
In FilePrefetchBuffer, in case data is overlapping between two
buffers and more data is required to read and copy that to third buffer,
incorrect length was updated resulting in
```
Iterator diverged from control iterator which has value 00000000000310C3000000000000012B0000000000000274 total_order_seek: 1 auto_prefix_mode: 0 S 000000000002C37F000000000000012B000000000000001C NNNPPPPPNN; total_order_seek: 1 auto_prefix_mode: 0 S 000000000002F10B00000000000000BF78787878787878 NNNPNNNNPN; total_order_seek: 1 auto_prefix_mode: 0 S 00000000000310C3000000000000012B000000000000026B
iterator is not valid
Control CF default
db_stress: db_stress_tool/db_stress_test_base.cc:1388: void rocksdb::StressTest::VerifyIterator(rocksdb::ThreadState*, rocksdb::ColumnFamilyHandle*, const rocksdb::ReadOptions&, rocksdb::Iterator*, rocksdb::Iterator*, rocksdb::StressTest::LastIterateOp, const rocksdb::Slice&, const string&, bool*): Assertion `false' failed.
Aborted (core dumped)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9916

Test Plan:
```
- CircleCI jobs
- Ran db_stress with OPTIONS file which caught the bug
 ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=42.26248932628998 --bottommost_compression_type=disable --cache_index_and_filter_blocks=0 --cache_size=8388608 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_buffer_bytes=1073741823 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/ --db_write_buffer_size=134217728 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --fail_if_options_file_error=0 --file_checksum_impl=none --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=12 --index_type=2 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=0 --mark_for_compaction_one_file_in=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_prefix_bloom_size_ratio=0.001 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --read_only=0 --readpercent=50 --recycle_log_file_num=1 --reopen=0 --reserve_table_reader_memory=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --secondary_catch_up_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_blob_db=0 --use_block_based_filter=0 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=0 --use_txn=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35 --options_file=/home/akankshamahajan/OPTIONS.orig -column_families=1

db_bench with async_io enabled to make sure db_bench completes successfully without any failure.
- ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1
```

crash_test in progress

Reviewed By: anand1976

Differential Revision: D35985789

Pulled By: akankshamahajan15

fbshipit-source-id: 5abe185f34caa99ca587d4bdc8954bd0802b1bf9
2022-04-27 22:33:29 -07:00
Akanksha Mahajan 3653029dda Add stats related to async prefetching (#9845)
Summary:
Add stats PREFETCHED_BYTES_DISCARDED and POLL_WAIT_MICROS.
PREFETCHED_BYTES_DISCARDED records number of prefetched bytes discarded by
FilePrefetchBuffer. POLL_WAIT_MICROS records the time taken by underling
file_system Poll API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9845

Test Plan: Update existing tests

Reviewed By: anand1976

Differential Revision: D35909694

Pulled By: akankshamahajan15

fbshipit-source-id: e009ef940bb9ed72c9446f5529095caabb8a1e36
2022-04-25 21:58:22 -07:00
Akanksha Mahajan 63e68a4e77 Enable async prefetching for ReadOptions.readahead_size (#9827)
Summary:
Currently async prefetching is enabled for implicit internal auto readahead in FilePrefetchBuffer if `ReadOptions.async_io` is set. This PR enables async prefetching for `ReadOptions.readahead_size` when `ReadOptions.async_io` is set true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9827

Test Plan: Update unit test

Reviewed By: anand1976

Differential Revision: D35552129

Pulled By: akankshamahajan15

fbshipit-source-id: d9f9a96672852a591375a21eef15355cf3289f5c
2022-04-11 13:46:57 -07:00
Akanksha Mahajan f3bcac39a6 Fix stress test failure in ReadAsync. (#9824)
Summary:
Fix stress test failure in ReadAsync by ignoring errors
injected during async read by FaultInjectionFS.
Failure:
```
 WARNING: prefix_size is non-zero but memtablerep != prefix_hash
Didn't get expected error from MultiGet.
num_keys 14 Expected 1 errors, seen 0
Callstack that injected the fault
Injected error type = 32538
Message: error;
#0   ./db_stress() [0x6f7dd4] rocksdb::port::SaveStack(int*, int)	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/port/stack_trace.cc:152
https://github.com/facebook/rocksdb/issues/1   ./db_stress() [0x7f2bda] rocksdb::FaultInjectionTestFS::InjectThreadSpecificReadError(rocksdb::FaultInjectionTestFS::ErrorOperation, rocksdb::Slice*, bool, char*, bool, bool*)	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/utilities/fault_injection_fs.cc:891
https://github.com/facebook/rocksdb/issues/2   ./db_stress() [0x7f2e78] rocksdb::TestFSRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/utilities/fault_injection_fs.cc:367
https://github.com/facebook/rocksdb/issues/3   ./db_stress() [0x6483d7] rocksdb::(anonymous namespace)::CompositeRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/env/composite_env.cc:61
https://github.com/facebook/rocksdb/issues/4   ./db_stress() [0x654564] rocksdb::(anonymous namespace)::LegacyRandomAccessFileWrapper::Read(unsigned long, unsigned long, rocksdb::IOOptions const&, rocksdb::Slice*, char*, rocksdb::IODebugContext*) const	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/env/env.cc:152
https://github.com/facebook/rocksdb/issues/5   ./db_stress() [0x659b3b] rocksdb::FSRandomAccessFile::ReadAsync(rocksdb::FSReadRequest&, rocksdb::IOOptions const&, std::function<void (rocksdb::FSReadRequest const&, void*)>, void*, void**, std::function<void (void*)>*, rocksdb::IODebugContext*)	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/./include/rocksdb/file_system.h:896
https://github.com/facebook/rocksdb/issues/6   ./db_stress() [0x8b8bab] rocksdb::RandomAccessFileReader::ReadAsync(rocksdb::FSReadRequest&, rocksdb::IOOptions const&, std::function<void (rocksdb::FSReadRequest const&, void*)>, void*, void**, std::function<void (void*)>*, rocksdb::Env::IOPriority)	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/file/random_access_file_reader.cc:459
https://github.com/facebook/rocksdb/issues/7   ./db_stress() [0x8b501f] rocksdb::FilePrefetchBuffer::ReadAsync(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, rocksdb::Env::IOPriority, unsigned long, unsigned long, unsigned long, unsigned int)	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/file/file_prefetch_buffer.cc:124
https://github.com/facebook/rocksdb/issues/8   ./db_stress() [0x8b55fc] rocksdb::FilePrefetchBuffer::PrefetchAsync(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, unsigned long, unsigned long, unsigned long, rocksdb::Env::IOPriority, bool&)	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/file/file_prefetch_buffer.cc:363
https://github.com/facebook/rocksdb/issues/9   ./db_stress() [0x8b61f8] rocksdb::FilePrefetchBuffer::TryReadFromCacheAsync(rocksdb::IOOptions const&, rocksdb::RandomAccessFileReader*, unsigned long, unsigned long, rocksdb::Slice*, rocksdb::Status*, rocksdb::Env::IOPriority, bool)	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/file/file_prefetch_buffer.cc:482
https://github.com/facebook/rocksdb/issues/10  ./db_stress() [0x745e04] rocksdb::BlockFetcher::TryGetFromPrefetchBuffer()	/data/sandcastle/boxes/trunk-hg-fbcode-fbsource/fbcode/internal_repo_rocksdb/repo/table/block_fetcher.cc:76
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9824

Test Plan:
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_concurrent_memtable_write=0 --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 -- backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=5.037629726741734 --bottommost_compression_type=lz4hc --cache_index_and_filter_blocks=0 --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=100 --compression_max_dict_buffer_bytes=1073741823 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --db=/home/akankshamahajan/dev/shm/rocksdb/rocksdb_crashtest_blackbox --db_write_buffer_size=8388608 --delpercent=0 --delrangepercent=0 --destroy_db_initially=0 - detect_filter_construct_corruption=1 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=/home/akankshamahajan/dev/shm/rocksdb/rocksdb_crashtest_expected --experimental_mempurge_threshold=8.772789063014715 --fail_if_options_file_error=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --long_running_snapshots=0 --mark_for_compaction_one_file_in=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --memtable_prefix_bloom_size_ratio=0.001 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=1000 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=32 --readpercent=100 --recycle_log_file_num=1 --reopen=0 --reserve_table_reader_memory=1 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=2 --use_block_based_filter=0 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=1 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=0
```

Reviewed By: anand1976

Differential Revision: D35514566

Pulled By: akankshamahajan15

fbshipit-source-id: e2a868fdd7422604774c1419738f9926a21e92a4
2022-04-11 10:56:11 -07:00
Akanksha Mahajan 7ea26abb8b Fix reseting of async_read_in_progress_ variable in FilePrefetchBuffer to call Poll API (#9815)
Summary:
Currently RocksDB reset async_read_in_progress_ in callback
due to which underlying filesystem relying on Poll API won't be called
leading to stale memory access.
In order to fix it, async_read_in_progress_ will be reset after Poll API
is called to make sure underlying file_system waiting on Poll can clear
its state or take appropriate action.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9815

Test Plan: CircleCI tests

Reviewed By: anand1976

Differential Revision: D35451534

Pulled By: akankshamahajan15

fbshipit-source-id: b70ef6251a7aa9ed4876ba5e5100baa33d7d474c
2022-04-06 18:36:23 -07:00
Akanksha Mahajan 36bc3da97f Fix segfault in FilePrefetchBuffer with async_io enabled (#9777)
Summary:
If FilePrefetchBuffer object is destroyed and then later Poll() calls callback on object which has been destroyed, it gives segfault on accessing destroyed object. It was caught after adding unit tests that tests Posix implementation of ReadAsync and Poll APIs.
This PR also updates and fixes existing IOURing tests which were not running locally because RocksDbIOUringEnable function wasn't defined and IOUring was disabled for those tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9777

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D35254002

Pulled By: akankshamahajan15

fbshipit-source-id: 68e80054ffb14ae25c255920ebc6548ca5f130a1
2022-04-04 15:35:43 -07:00
Akanksha Mahajan 33f8a08af2 Fix some errors in async prefetching in FilePrefetchBuffer (#9734)
Summary:
In ReadOption `async_io` which prefetches the data asynchronously, db_bench and db_stress runs were failing  because wrong data was prefetched which resulted in Error: Checksum mismatched. Wrong data was copied because capacity was less than actual size needed. It has been fixed in this PR.

Since there are two separate methods for async and sync prefetching, these changes are in async prefetching methods and any changes would not effect normal prefetching. I ran the regressions to make sure normal prefetching is fine.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9734

Test Plan:
1. CircleCI jobs

2.  Ran db_bench
```
. /db_bench -use_existing_db=true
-db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32
-value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680
-duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1

```
3. Ran db_stress test
```
export CRASH_TEST_EXT_ARGS=" --async_io=1 --adaptive_readahead=1"
make crash_test -j
```

4. Run regressions for async_io disabled.

Old flow without any async changes:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.0
Date:       Thu Mar 17 13:11:34 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
```

With async prefetching changes and async_io disabled to make sure in normal prefetching there is no regression.
 ```
 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 --async_io=0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.1
Date:       Wed Mar 23 15:56:37 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  481819.816 micros/op 2 ops/sec;  340.2 MB/s (250 of 250 found)
```

Reviewed By: riversand963

Differential Revision: D35058471

Pulled By: akankshamahajan15

fbshipit-source-id: 9233a1e6d97cea0c7a8111bfb9e8ac3251c341ce
2022-03-25 18:26:22 -07:00
Akanksha Mahajan 49a10feb21 Provide implementation to prefetch data asynchronously in FilePrefetchBuffer (#9674)
Summary:
In FilePrefetchBuffer if reads are sequential, after prefetching call ReadAsync API to prefetch data asynchronously so that in next prefetching data will be available. Data prefetched asynchronously will be readahead_size/2. It uses two buffers, one for synchronous prefetching and one for asynchronous. In case, the data is overlapping, the data is copied from both buffers to third buffer to make it continuous.
This feature is under ReadOptions::async_io and is under experimental.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9674

Test Plan:
1. Add new unit tests
2. Run **db_stress** to make sure nothing crashes.

    -   Normal prefetch without `async_io` ran successfully:
```
export CRASH_TEST_EXT_ARGS=" --async_io=0"
 make crash_test -j
 ```

3. **Run Regressions**.
   i) Main branch without any change for normal prefetching with async_io disabled:

 ```
 ./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -
           use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
 ```

```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.0
Date:       Thu Mar 17 13:11:34 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom   :  483618.390 micros/op 2 ops/sec;  338.9 MB/s (249 of 249 found)
```

  ii) normal prefetching after changes with async_io disable:

```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_withchange -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 7.0
Date:       Thu Mar 17 14:11:31 2022
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_withchange]
seekrandom   :  471347.227 micros/op 2 ops/sec;  348.1 MB/s (255 of 255 found)
```

Reviewed By: anand1976

Differential Revision: D34731543

Pulled By: akankshamahajan15

fbshipit-source-id: 8e23aa93453d5fe3c672b9231ad582f60207937f
2022-03-21 07:12:43 -07:00
Andrew Kryczka babe56ddba Add rate limiter priority to ReadOptions (#9424)
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.

`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.

There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).

The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424

Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
  - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
  - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`

Reviewed By: hx235

Differential Revision: D33747386

Pulled By: ajkr

fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
2022-02-16 23:18:14 -08:00
Akanksha Mahajan 04b2c16f9b Fix bug in rocksdb internal automatic prefetching (#9234)
Summary:
After introducing adaptive_readahead, the original flow got
broken. Readahead size was set to 0 because of which rocksdb wasn't be
able to do automatic prefetching which it enables after seeing
sequential reads. This PR fixes it.

----------------------------------------------------------------------------------------------------

Before this patch:
b_bench -use_existing_db=true -db=/tmp/prefix_scan -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 6.27
Date:       Tue Nov 30 11:56:50 2021
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/prefix_scan]

seekrandom   : 5356367.174 micros/op 0 ops/sec;   29.4 MB/s (23 of 23 found)

----------------------------------------------------------------------------------------------------

After the patch:
./db_bench -use_existing_db=true -db=/tmp/prefix_scan -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB:    version 6.27
Date:       Tue Nov 30 14:38:33 2021
CPU:        24 * Intel Core Processor (Broadwell)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2594.0 MB (estimated)
FileSize:   1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/prefix_scan]
seekrandom   :  456504.277 micros/op 2 ops/sec;  359.8 MB/s (264 of 264 found)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9234

Test Plan:
Ran ./db_bench -db=/data/mysql/rocksdb/prefix_scan
-benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_d
irect_io_for_flush_and_compaction=true -target_file_size_base=16777216
and then ./db_bench -use_existing_db=true
-db=/data/mysql/rocksdb/prefix_scan -benchmarks="seekrandom"
-key_size=32 -value_siz
e=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680
-duration=120 -ops_between_duration_checks=1
 and compared the results.

Reviewed By: anand1976

Differential Revision: D32743965

Pulled By: akankshamahajan15

fbshipit-source-id: b950fba68c91963b7deb5c20acdf471bc60251f5
2021-11-30 22:53:10 -08:00
Levi Tamasi dc5de45af8 Support readahead during compaction for blob files (#9187)
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187

Test Plan: Ran `make check` and the stress/crash test.

Reviewed By: riversand963

Differential Revision: D32565512

Pulled By: ltamasi

fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
2021-11-19 17:53:47 -08:00
Akanksha Mahajan 17ce1ca48b Reuse internal auto readhead_size at each Level (expect L0) for Iterations (#9056)
Summary:
RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB.

This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size  at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block.

1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer),  readahead_size will decrease by 8KB.
2. It maintains readahead_size for L1 - Ln levels.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056

Test Plan:
Added new unit tests
Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled.

Reviewed By: anand1976

Differential Revision: D31773640

Pulled By: akankshamahajan15

fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
2021-11-10 16:20:04 -08:00
Akanksha Mahajan a0e0feca62 Improve BlockPrefetcher to prefetch only for sequential scans (#7394)
Summary:
BlockPrefetcher is used by iterators to prefetch data if they
anticipate more data to be used in future and this is valid for forward sequential
scans. But BlockPrefetcher tracks only num_file_reads_ and not if reads
are sequential. This presents problem for MultiGet with large number of
keys when it reseeks index iterator and data block. FilePrefetchBuffer
can end up doing large readahead for reseeks as readahead size
increases exponentially once readahead is enabled. Same issue is with
BlockBasedTableIterator.

Add previous length and offset read as well in BlockPrefetcher (creates
FilePrefetchBuffer) and FilePrefetchBuffer (does prefetching of data) to
determine if reads are sequential and then  prefetch.

Update the last block read after cache hit to take reads from cache also
in account.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7394

Test Plan: Add new unit test case

Reviewed By: anand1976

Differential Revision: D23737617

Pulled By: akankshamahajan15

fbshipit-source-id: 8e6917c25ed87b285ee495d1b68dc623d71205a3
2021-04-28 12:53:46 -07:00
Ziyue Yang 0c2d71edba Fix typo: replace readadhead with readahead (#7953)
Summary:
This PR replaces several "readadhead" typos with "readahead".

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7953

Reviewed By: ajkr

Differential Revision: D26518903

Pulled By: jay-zhuang

fbshipit-source-id: 6f7dece0e39ec4f71c4a936399bcb2e02574f42a
2021-02-18 14:31:20 -08:00
anand76 01298c8ff7 Return Status from FilePrefetchBuffer::TryReadFromCache() (#7816)
Summary:
Return the Status from TryReadFromCache() in an argument to make it easier to report prefetch errors to the user.

Tests:
make crash_test
make check

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7816

Reviewed By: akankshamahajan15

Differential Revision: D25717222

Pulled By: anand1976

fbshipit-source-id: c320d3c12d4146bda16df78ff6927eee584c1810
2020-12-30 09:25:09 -08:00
Akanksha Mahajan b4cd51d847 Fix for stress test failure (#7574)
Summary:
Ignore read error in 'FilePrefetchBuffer::TryReadFromCache' as status is ignored
and  bool value is returned. Return error if prefetch fails in
'PrefetchTail' as we have planned to return Prefetch failures to users.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7574

Test Plan:
make check -j64,
python -u tools/db_crashtest.py blackbox

Reviewed By: zhichao-cao

Differential Revision: D24408825

Pulled By: akankshamahajan15

fbshipit-source-id: feebda771415998253fbe54632f13e6e75b7a243
2020-10-20 09:13:42 -07:00
Akanksha Mahajan db87afbcb3 Return error if Get/Multi() fails in Prefetching Filter blocks (#7543)
Summary:
Right now all I/O failures under
PartitionFilterBlock::CacheDependencies() is swallowed. Return error in
case prefetch fails.

On returning error in PartitionedFilterBlockReader::CacheDependencies was causing stress test failure because PrefetchBuffer is initialized with enable_ = true, as result when PosixMmapReadableFile::Read is called from Prefetch, scratch is ignored causing buffer to fill with garbage values. Initializing prefetch buffer by CreatePrefetchBuffer that sets enable_ with !ioptions.allow_mmap_reads fixed the problem as it returns without prefetching data if allow_mmap_reads is set.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7543

Test Plan:
make check -j64;
python -u tools/db_crashtest.py --simple blackbox

Reviewed By: anand1976

Differential Revision: D24284596

Pulled By: akankshamahajan15

fbshipit-source-id: f3f0fd44b59dcf60645730436f28564a07884868
2020-10-14 10:45:36 -07:00
Levi Tamasi 5d16325ce3 Revert "Return error if Get() fails in Prefetching Filter blocks (#7463)" (#7505)
Summary:
This reverts commit 7d503e66a9.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7505

Reviewed By: ajkr

Differential Revision: D24100875

Pulled By: ltamasi

fbshipit-source-id: 8705e3e6e8be4b4fd175ffdb031baa6530b61151
2020-10-03 18:38:04 -07:00
Akanksha Mahajan 7d503e66a9 Return error if Get() fails in Prefetching Filter blocks (#7463)
Summary:
Right now all I/O failures under
PartitionFilterBlock::CacheDependencies() is swallowed. Return error in
case prefetch fails.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7463

Test Plan: make check -j64

Reviewed By: anand1976

Differential Revision: D24008226

Pulled By: akankshamahajan15

fbshipit-source-id: b65d63b2d01465db92500b78de7ad58650ec9b3b
2020-10-02 19:40:43 -07:00
Jay Zhuang c2485f2d81 Add buffer prefetch support for non directIO usecase (#7312)
Summary:
A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312

Reviewed By: anand1976

Differential Revision: D23329847

Pulled By: jay-zhuang

fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527
2020-08-27 18:16:53 -07:00
Anand Ananthabhotla 9a5886bd8c Extend Get/MultiGet deadline support to table open (#6982)
Summary:
Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.

The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.

Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982

Test Plan: Add new unit tests in db_basic_test

Reviewed By: riversand963

Differential Revision: D22219515

Pulled By: anand1976

fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
2020-06-29 14:53:17 -07:00
sdong 298b00a396 Reduce dependency on gtest dependency in release code (#6907)
Summary:
Release code now depends on gtest, indirectly through including "test_util/testharness.h". This creates multiple problems. One important reason is the definition of IGNORE_STATUS_IF_ERROR() in test_util/testharness.h. Move it to sync_point.h instead.
Note that utilities/cassandra/format.h still depends on "test_util/testharness.h". This will be resolved in a separate diff.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6907

Test Plan: Run all existing tests.

Reviewed By: ajkr

Differential Revision: D21829884

fbshipit-source-id: 9253c19ffde2936f3ae68998210f8e54f645a6e6
2020-06-02 12:11:24 -07:00
anand76 ab13d43e1d Pass a timeout to FileSystem for random reads (#6751)
Summary:
Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.

For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR.

Tests:
Update existing unit tests to verify the correct timeout value is being passed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751

Reviewed By: riversand963

Differential Revision: D21285631

Pulled By: anand1976

fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
2020-04-30 14:50:39 -07:00