Commit Graph

227 Commits

Author SHA1 Message Date
Akanksha Mahajan 018eede679 Remove assertion from PrefetchAsync (#11965)
Summary:
Remove assertion from PrefetchAsync (roundup_len2 >= alignment) as for non direct_io, buffer size can be less than alignment resulting in assertion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11965

Test Plan: Ran the issue causing db_stress without this assertion and the verification completes successfully.

Reviewed By: anand1976

Differential Revision: D50328955

Pulled By: akankshamahajan15

fbshipit-source-id: 65f55ca230d2bbc63f4e2cc34c7273b22b515879
2023-10-16 15:14:58 -07:00
akankshamahajan 97f6f475bc Fix various failures in auto_readahead_size (#11884)
Summary:
1. **Error** in TestIterateAgainstExpected API - `Assertion index < pre_read_expected_values.size() && index < post_read_expected_values.size() failed.`
 **Fix** - `Prev` op is not supported with `auto_readahead_size`. So added support to Reseek in db_iter, if Prev is called. In BlockBasedTableIterator, index_iter_ already moves forward. So there is no way to do Prev from BlockBasedTableIterator.

2. **Error** - `void rocksdb::BlockBasedTableIterator::BlockCacheLookupForReadAheadSize(uint64_t, size_t, size_t&): Assertion index_iter_->value().handle.offset() == offset`
   **Fix** - Remove prefetch_buffer to be used when uncompressed dict is read.

3. ** Error in TestPrefixScan API - `db_stress: db/db_iter.cc:369: bool rocksdb::DBIter::FindNextUserEntryInternal(bool, const rocksdb::Slice*): Assertion !skipping_saved_key || CompareKeyForSkip(ikey_.user_key, saved_key_.GetUserKey()) > 0 failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
db_stress: table/merging_iterator.cc:1036: bool rocksdb::MergingIterator::SkipNextDeleted(): Assertion comparator_->Compare(range_tombstone_iters_[i]->start_key(), pik) <= 0 failed`
  **Fix** -  SeekPrev also calls 1) SeekPrev , 2)Seek and then 3)Prev in some cases in db_iter.cc leading to failure of Prev operation. These backward operations also call Seek.  Added direction to disable lookup once direction is backwards in BlockBasedTableIterator.cc

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11884

Test Plan: Ran various flavors of crash tests locally for the whole duration

Reviewed By: anand1976

Differential Revision: D49834201

Pulled By: akankshamahajan15

fbshipit-source-id: 9a007b4d46a48002c43dc4623a400ecf47d997fe
2023-10-02 17:47:24 -07:00
akankshamahajan 3d67b5e8e5 Lookup ahead in block cache ahead to tune Readaheadsize (#11860)
Summary:
Implement block cache lookup to determine readahead_size during scans. It's enabled if auto_readahead_size, block_cache and iterate_upper_bound - all three are set.

Design -
1. Whenever there is a cache miss and FilePrefetchBuffer is called, a callback is made to determine readahead_size for that prefetching.
2. The callback iterates over index and do block cache lookup for each data block handle until existing readahead_size is reached. Then It removes the cache hit data blocks from end to calculate optimized readahead_size.
3. Since index_iter_ is moved, it stores block handles in a queue, and use that queue to get block handle instead of doing index_iter_->Next().
4. This is for Sync scans. Async scans support is in progress.

NOTE:
The issue right now is after Seek and Next, if Prev is called, there is no way to do Prev operation. index_iter_ is already pointing to a different block. So it returns "Not supported" in that case with error message - "auto tuning of readahead size is not supported with Prev op"

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11860

Test Plan:
- Added new unit test
- crash_tests
- Running scans locally to check for any regression

Reviewed By: anand1976

Differential Revision: D49548118

Pulled By: akankshamahajan15

fbshipit-source-id: f1aee409a71b4ad9e5bf3610f43edf30c6630c78
2023-09-22 18:12:08 -07:00
akankshamahajan c1a97fe1f6 Fix Assertion `roundup_len2 >= alignment' failed in crash tests (#11852)
Summary:
When auto_readahead_size is enabled in async_io, during seek, first buffer will prefetch the data - (current block + readahead till upper_bound). There can be cases where
1.  first buffer prefetched all the data till upper bound, or
2.  first buffer already has the data from prev seek call
and second buffer prefetch further leading to alignment issues.

This PR fixes that assertion and second buffer won't go for prefetching if first buffer has already prefetched till upper_bound.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11852

Test Plan:
- Added new unit test that failed without this fix.
- crash tests passed locally

Reviewed By: pdillinger

Differential Revision: D49384138

Pulled By: akankshamahajan15

fbshipit-source-id: 54417e909e4d986f1e5a17dbaea059cd4962fd4d
2023-09-20 16:13:20 -07:00
akankshamahajan 5b5b011cdd Avoid double block cache lookup during Seek with async_io option (#11616)
Summary:
With the async_io option, the Seek happens in 2 phases. Phase 1 starts an asynchronous read on a block cache miss, and phase 2 waits for it to complete and finishes the seek. In both phases, BlockBasedTable::NewDataBlockIterator is called, which tries to lookup the block cache for the data block first before looking in the prefetch buffer. It's optimized by doing the block cache lookup only in the first phase and save some CPU.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11616

Test Plan: Added unit test

Reviewed By: jaykorean

Differential Revision: D47477887

Pulled By: akankshamahajan15

fbshipit-source-id: 0355e0a68fc0ea2eb92340ae42735afcdbcbfd79
2023-09-18 11:32:30 -07:00
Hui Xiao ed913513bd Fix a bug of rocksdb.file.read.verify.file.checksums.micros not being populated (#11836)
Summary:
**Context/Summary:**
`rocksdb.file.read.verify.file.checksums.micros ` was added in https://github.com/facebook/rocksdb/pull/11444 but the related path was not populated with statistics and clock object correctly so the actual statistics collection didn't happen. This PR fixed it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11836

Test Plan:
Setup:
```
./db_bench --benchmarks="fillrandom" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb
```
Run:
```
./db_bench --use_existing_db=1  --benchmarks="verifyfilechecksums" --file_checksum=1 --num=100 --db=/dev/shm/rocksdb --statistics=1 --stats_level=4
```
Post-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 9.000000 P95 : 9.000000 P99 : 9.000000 P100 : 9.000000 COUNT : 1 SUM : 9
```

Pre-PR
```
rocksdb.file.read.verify.file.checksums.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

Reviewed By: ajkr

Differential Revision: D49293378

Pulled By: hx235

fbshipit-source-id: 1acd8b828c28e088d0c5d63897f53cd180b82f42
2023-09-15 10:36:14 -07:00
akankshamahajan 1e2fd343bb Update upper_bound_offset when reseek changes iterate_upper_bound dynamically (#11775)
Summary:
Update the logic in FilePrefetchBuffer to update `upper_bound_offset_` during reseek. During Reseek, `iterate_upper_bound` can be changed dynamically. So added an API to update that in FilePrefetchBuffer.
Added unit test to confirm the behavior.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11775

Test Plan:
- Check stress tests in case there is any failure after this diff.
- make crash_test -j32 with auto_readahead_size=1 passed locally

Reviewed By: anand1976

Differential Revision: D48815177

Pulled By: akankshamahajan15

fbshipit-source-id: 5f44fbb3af06c86a1c38f139c5fa4543891837f4
2023-09-15 10:05:56 -07:00
akankshamahajan ed5b6c0d99 Avoid alignment in FilePrefetchBuffer during seek with async_io (#11793)
Summary:
During Seek, the iterator seeks every file on L0. In async_io, it submit the requests to seek on every file on L0 asynchronously using RocksDB FilePrefetchBuffer.
However, FilePrefetchBuffer does alignment and reads extra bytes then needed that can increase the throughput.
In case of non direct io, the alignment can be avoided.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11793

Test Plan:
- Added a unit test that fails without this PR.
- make crash_test -j32 completed successfully

Reviewed By: anand1976

Differential Revision: D48985051

Pulled By: akankshamahajan15

fbshipit-source-id: 2d130a9e7c3df9c4fcd0408406e6277ab75a4389
2023-09-11 11:41:44 -07:00
Changyu Bi 458acf8169 Add some unit tests when file read returns error during compaction/scanning (#11788)
Summary:
Some repro unit tests for the bug fixed in https://github.com/facebook/rocksdb/pull/11782.

Ran on main without https://github.com/facebook/rocksdb/pull/11782:
```
./db_compaction_test --gtest_filter='*ErrorWhenReadFileHead'
Note: Google Test filter = *ErrorWhenReadFileHead
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN      ] DBCompactionTest.ErrorWhenReadFileHead
db/db_compaction_test.cc:10105: Failure
Value of: s.IsIOError()
  Actual: false
Expected: true
[  FAILED  ] DBCompactionTest.ErrorWhenReadFileHead (3960 ms)

./db_iterator_test --gtest_filter="*ErrorWhenReadFile*"
Note: Google Test filter = *ErrorWhenReadFile*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBIteratorTest
[ RUN      ] DBIteratorTest.ErrorWhenReadFile
db/db_iterator_test.cc:3399: Failure
Value of: (iter->status()).ok()
  Actual: true
Expected: false
[  FAILED  ] DBIteratorTest.ErrorWhenReadFile (280 ms)
[----------] 1 test from DBIteratorTest (280 ms total)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11788

Reviewed By: ajkr

Differential Revision: D48940284

Pulled By: cbi42

fbshipit-source-id: 06f3c5963f576db3f85d305ffb2745ee13d209bb
2023-09-06 10:23:41 -07:00
akankshamahajan f36394ff20 Fix seg fault in auto_readahead_size with async_io (#11769)
Summary:
Fix seg fault in auto_readahead_size with async_io when readahead_size = 0. If readahead_size is trimmed and is 0, it's not eligible for further prefetching and should return.

Error occured when the first buffer already contains data and it goes for prefetching in second buffer leading to assertion failure -

`assert(roundup_len1 >= alignment);
`
because roundup_len1 = length + readahead_size.
length is 0 and readahead_size is also 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11769

Test Plan: Reproducible with db_stress with async_io enabled.

Reviewed By: anand1976

Differential Revision: D48743031

Pulled By: akankshamahajan15

fbshipit-source-id: 0e08c41f862f6287ca223fbfaf6cd42fc97b3c87
2023-08-28 17:08:28 -07:00
Hui Xiao f53018c0c8 Improve PrefetchTest.Basic with explicit flush and file num variable (#11720)
Summary:
**Context/Summary:** as title, should be harmless. And it's a guessed fix to https://github.com/facebook/rocksdb/issues/11717 while no repro has obtained on my end yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11720

Test Plan: existing tests

Reviewed By: cbi42

Differential Revision: D48475661

Pulled By: hx235

fbshipit-source-id: 7c7390319f094c540e703fe2e78a8d601b7a894b
2023-08-18 17:47:22 -07:00
akankshamahajan f65a0379f0 Implement trimming of readhead size when upper bound is specified (#11684)
Summary:
Implement trimming of readahead_size under a new option ReadOptions.auto_readahead_size. It'll trim the readahead_size during prefetching upto iterate_upper_bound offset only when ReadOptions.iterate_upper_bound is set, therefore reducing the prefetching of data beyond upper_bound.
It's enabled for both implicit auto readahead size and when ReadOptions.readahead_size is specified and for sync and async_io.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11684

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D48479723

Pulled By: akankshamahajan15

fbshipit-source-id: 2b1703579caf779105e836b580866ffd7db076fc
2023-08-18 15:52:04 -07:00
Changyu Bi 76ed9a3990 Add missing status check when compiling with `ASSERT_STATUS_CHECKED=1` (#11686)
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by 9c2ebcc2c3/Makefile (L243)
Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks.

Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686

Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check`

Reviewed By: hx235

Differential Revision: D48176132

Pulled By: cbi42

fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd
2023-08-09 15:46:44 -07:00
Hui Xiao 9a034801ce Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
2023-08-08 17:26:50 -07:00
Hui Xiao 09882a52d6 Prepare for deprecation of Options::access_hint_on_compaction_start (#11658)
Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, file hint is not longer needed for compaction read. Therefore we can deprecate `Options::access_hint_on_compaction_start`. As this is a public API change, we should first mark the relevant APIs (including the Java's) deprecated and remove it in next major release 9.0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11658

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D47997856

Pulled By: hx235

fbshipit-source-id: 16e015ae7728c224b1caef73143aa9915668f4ac
2023-08-03 17:23:02 -07:00
Hui Xiao 629605d645 Move prefetching responsibility to page cache for compaction read under non directIO usecase (#11631)
Summary:
**Context/Summary**
As titled. The benefit of doing so is to explicitly call readahead() instead of relying page cache behavior for compaction read when we know that we most likely need readahead as compaction read is sequential read .

**Test**
Extended the existing UT to cover compaction read case

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11631

Reviewed By: ajkr

Differential Revision: D47681437

Pulled By: hx235

fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90
2023-07-21 14:52:52 -07:00
akankshamahajan 749b179c04 Remove reallocation of AlignedBuffer in direct_io sync reads if already aligned (#11600)
Summary:
Remove reallocation of AlignedBuffer in direct_io sync reads in RandomAccessFileReader::Read if buffer passed is already aligned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11600

Test Plan:
Setup: `TEST_TMPDIR=./tmp-db/ ./db_bench -benchmarks=filluniquerandom -disable_auto_compactions=true -target_file_size_base=1048576 -write_buffer_size=1048576 -compression_type=none`
Benchmark: `TEST_TMPDIR=./tmp-db/ perf record ./db_bench --cache_size=8388608 --use_existing_db=true --disable_auto_compactions=true --benchmarks=seekrandom --use_direct_reads=true -use_direct_io_for_flush_and_compaction=true -reads=1000 -seek_nexts=1 -max_auto_readahead_size=131072 -initial_auto_readahead_size=16384 -adaptive_readahead=true -num_file_reads_for_auto_readahead=0`

Perf profile-
Before:
```
8.73% db_bench libc.so.6 [.] __memmove_evex_unaligned_erms
3.34% db_bench [kernel.vmlinux] [k] filemap_get_read_batch
```

After:
```
2.50% db_bench [kernel.vmlinux] [k] filemap_get_read_batch
2.29% db_bench libc.so.6 [.] __memmove_evex_unaligned_erms
```

`make  crash_test -j `with direct_io enabled completed succesfully locally.

Ran few benchmarks with direct_io from seek_nexts varying between 912 to 327680 and different readahead_size parameters and it showed no regression so far.

Reviewed By: ajkr

Differential Revision: D47478598

Pulled By: akankshamahajan15

fbshipit-source-id: 6a48e21cb34696f5d09c22a6311a3a1cb5f9cf33
2023-07-14 20:08:05 -07:00
akankshamahajan ff1cc8a63e Fix extra prefetching when num_file_reads_for_auto_readahead is 1 in async_io (#11560)
Summary:
When num_file_reads_for_auto_readahead = 1, during seek, it would go for prefetchingextra data in second buffer along with seek data, that would lead to increase in read data and
discarded bytes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11560

Test Plan: Added unit test

Reviewed By: anand1976

Differential Revision: D47008102

Pulled By: akankshamahajan15

fbshipit-source-id: 566c6131cb5f968d5efb81fd0ab233ff7e534ab0
2023-06-26 10:39:44 -07:00
akankshamahajan fbd2f563bb Add an interface to provide support for underlying FS to pass their own buffer during reads (#11324)
Summary:
1. Public API change: Replace `use_async_io`  API in file_system with `SupportedOps` API which is used by underlying FileSystem to indicate to upper layers whether the FileSystem supports different operations introduced in `enum FSSupportedOps `. Right now operations are `async_io` and whether FS will provide its own buffer during reads or not. The api is changed to extend it to various FileSystem operations in one API rather than creating a separate API for each operation.

2. Provide support for underlying FS to pass their own buffer during Reads (async and sync read) instead of using RocksDB provided `scratch` (buffer) in `FSReadRequest`. Currently only MultiRead supports it and later will be extended to other reads as well (point lookup, scan etc). More details in how to enable in file_system.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324

Test Plan: Tested locally

Reviewed By: anand1976

Differential Revision: D44465322

Pulled By: akankshamahajan15

fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec
2023-06-23 11:48:49 -07:00
Hui Xiao 1da9ac2363 Add UT to test BG read qps behavior during upgrade for pr11406 (#11522)
Summary:
**Context/Summary:**
When db is upgrading to adopt [pr11406](https://github.com/facebook/rocksdb/pull/11406/), it's possible for RocksDB to infer a small tail size to prefetch for pre-upgrade files. Such small tail size would have caused 1 file read per index or filter partition if partitioned index or filer is used. This PR provides a UT to show this would not happen.

Misc: refactor the related UTs a bit to make this new UT more readable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11522

Test Plan:
- New UT
If logic of upgrade is wrong e.g,
```
 --- a/table/block_based/partitioned_index_reader.cc
+++ b/table/block_based/partitioned_index_reader.cc
@@ -166,7 +166,8 @@ Status PartitionIndexReader::CacheDependencies(
   uint64_t prefetch_len = last_off - prefetch_off;
   std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
   if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
-      tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
+      (false && tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off)) {
```
, then the UT will fail like below
```
[ RUN      ] PrefetchTailTest/PrefetchTailTest.UpgradeToTailSizeInManifest/0
file/prefetch_test.cc:461: Failure
Expected: (db_open_file_read.count) < (num_index_partition), actual: 38 vs 33
Received signal 11 (Segmentation fault)
```

Reviewed By: pdillinger

Differential Revision: D46546707

Pulled By: hx235

fbshipit-source-id: 9897b0a975e9055963edac5451fd1cd9d6c45d0e
2023-06-16 13:04:30 -07:00
Yu Zhang b421a8c21b Add a ticker to track number of trash files deleted in background thread (#11540)
Summary:
This ticker combined with `rocksdb.files.marked.trash` can help give a better picture of how DeleteScheduler is keeping up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11540

Test Plan:
```
./delete_scheduler_test
```

Reviewed By: ajkr

Differential Revision: D46746401

Pulled By: jowlyzhang

fbshipit-source-id: f3daa622aa3ddefe7d673e0cc257a47699d506df
2023-06-16 10:05:25 -07:00
Hui Xiao 3093d98c78 Fix higher read qps during db open caused by pr 11406 (#11516)
Summary:
**Context:**
[PR11406](https://github.com/facebook/rocksdb/pull/11406/) caused more frequent read during db open reading files with no `tail_size` in the manifest as part of the upgrade to 11406. This is due to that PR introduced
- [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers tail size to be 0 (usually happens when the stats does not have much historical data to infer early on)
-  more read (up to # of partitioned filter/index) when such small prefetch tail buffer does not contain all the partitioned filter/index needed in CacheDependencies() since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once will be [skipped](url) when such a small prefetch tail buffer is passed in

**Summary:**
- Revert the fallback prefetch buffer size change to preserve existing behavior fully during upgrading in `BlockBasedTable::PrefetchTail()`
- Use passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the the offset of first partition filter/index, that is, at least as good as the existing prefetching behavior

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11516

Test Plan:
- db bench

Create db with small files prior to PR 11406
```
./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd`
```
Read db to see if post-pr has lower read qps (i.e, rocksdb.file.read.db.open.micros count) during db open.
```
./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
```
Pre-PR:
```
rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
```

Post-PR:
```
rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
```

_Note: To control the starting offset of the prefetch tail buffer easier, I manually override the following to eliminate the effect of alignment_
```
class PosixRandomAccessFile : public FSRandomAccessFile {
virtual size_t GetRequiredBufferAlignment() const override {
-    return logical_sector_size_;
+    return 1;
  }
 ```

- CI

Reviewed By: pdillinger

Differential Revision: D46472566

Pulled By: hx235

fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e
2023-06-06 17:42:43 -07:00
Hui Xiao 50046869a4 Add `rocksdb.file.read.db.open.micros` (#11455)
Summary:
**Context/Summary:**
`rocksdb.file.read.db.open.micros` is left out in https://github.com/facebook/rocksdb/pull/11288

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11455

Test Plan:
- db bench
Setup: `./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3`
Run:
`./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/  -benchmarks=readrandom  -key_size=3200 -value_size=512 -num=0 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none -file_checksum=1 -cache_size=1`

```
rocksdb.sst.read.micros P50 : 3.979798 P95 : 9.738420 P99 : 19.566667 P100 : 39.000000 COUNT : 2360 SUM : 12148
rocksdb.file.read.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.read.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.read.db.open.micros P50 : 3.979798 P95 : 9.738420 P99 : 19.566667 P100 : 39.000000 COUNT : 2360 SUM : 12148
```

Reviewed By: ajkr

Differential Revision: D45951934

Pulled By: hx235

fbshipit-source-id: 6c88639dc1b10d98ecccc963ce32a8800495f55b
2023-05-18 09:26:29 -07:00
Peter Dillinger 206fdea3d9 Change internal headers with duplicate names (#11408)
Summary:
In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.

No public API changes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408

Test Plan: existing tests

Reviewed By: ltamasi

Differential Revision: D45456696

Pulled By: pdillinger

fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
2023-05-17 11:27:09 -07:00
Hui Xiao 8f763bdeab Record and use the tail size to prefetch table tail (#11406)
Summary:
**Context:**
We prefetch the tail part of a SST file (i.e, the blocks after data blocks till the end of the file) during each SST file open in hope to prefetch all the stuff at once ahead of time for later read e.g, footer, meta index, filter/index etc. The existing approach to estimate the tail size to prefetch is through `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky case (e.g,  small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug.  Therefore we decide to record the exact tail size and use it directly  to prefetch tail of the SST instead of relying heuristics.

**Summary:**
- Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()`
   - For backward compatibility, we fall back to TailPrefetchStats and last to simple heuristics that the tail size is a linear portion of the file size - see PR conversation for more.
- Make`tail_start_offset` part of the table properties and deduct tail size to record in manifest for external files (e.g, file ingestion, import CF) and db repair (with no access to manifest).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406

Test Plan:
1. New UT
2. db bench
Note: db bench on /tmp/ where direct read is supported is too slow to finish and the default pinning setting in db bench is not helpful to profile # sst read of Get. Therefore I hacked the following to obtain the following comparison.
```
 diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
 --- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
                            &tail_prefetch_size);

   // Try file system prefetch
-  if (!file->use_direct_io() && !force_direct_prefetch) {
+  if (false && !file->use_direct_io() && !force_direct_prefetch) {
     if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
              .IsNotSupported()) {
       prefetch_buffer->reset(new FilePrefetchBuffer(
 diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
 --- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
           std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
     } else {
       BlockBasedTableOptions block_based_options;
+      block_based_options.metadata_cache_options.partition_pinning =
+      PinningTier::kAll;
       block_based_options.checksum =
           static_cast<ChecksumType>(FLAGS_checksum_type);
       if (FLAGS_use_hash_search) {
```
Create DB
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
ReadRandom
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
(a) Existing (Use TailPrefetchStats for tail size + use seperate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```

(b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```

As we can see, we increase the prefetch tail hit count and decrease SST read count with this PR

3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.

Reviewed By: pdillinger

Differential Revision: D45413346

Pulled By: hx235

fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
2023-05-08 13:14:28 -07:00
clundro 50b33ebb1b remove redundant move (#11418)
Summary:
when I use g++-13 to exec the `make all` command,  the output throws the warnings.
```
db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’:
db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move]
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call
  376 |           env_, GenerateFileName(file_number), std::move(contents)));
      |                                                ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
  375 |       EXPECT_OK(mock_table_factory_->CreateMockTable(
      |       ^~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1
```

and I also add some `(void)unused_variable` statements because of the cmake argument `-Wunused-but-set-variable -Wunused-but-set-variable`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418

Reviewed By: akankshamahajan15

Differential Revision: D45528223

Pulled By: ajkr

fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc
2023-05-03 09:37:21 -07:00
Hui Xiao 151242ce46 Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288)
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.

**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
   - Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288

Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.**  (without blob)
     - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689

// PlainTable
Does not apply
```
- **Db bench 2: performance**

**Read**

SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`

**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655  -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none

rocksdb.sst.read.micros  COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec;    0.7 (± 0.1) MB/sec`

Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec;    0.7 (± 0.1) MB/sec`

Reviewed By: ajkr

Differential Revision: D44007011

Pulled By: hx235

fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2023-04-21 09:07:18 -07:00
anand76 0623c5b903 Ensure VerifyFileChecksums reads don't exceed readahead_size (#11328)
Summary:
VerifyFileChecksums currently interprets the readahead_size as a payload of readahead_size for calculating the checksum, plus a prefetch of an additional readahead_size. Hence each read is readahead_size * 2. This change treats it as chunks of readahead_size for checksum calculation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328

Test Plan: Add a unit test

Reviewed By: pdillinger

Differential Revision: D44718781

Pulled By: anand1976

fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
2023-04-05 16:22:08 -07:00
anand76 eac6b6d0cd Ignore async_io ReadOption if FileSystem doesn't support it (#11296)
Summary:
In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296

Test Plan: Add new unit tests

Reviewed By: akankshamahajan15

Differential Revision: D44045776

Pulled By: anand1976

fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
2023-03-17 14:57:09 -07:00
Hui Xiao bab5f9a6f2 Add new stat rocksdb.table.open.prefetch.tail.read.bytes, rocksdb.table.open.prefetch.tail.{miss|hit} (#11265)
Summary:
**Context/Summary:**
We are adding new stats to measure behavior of prefetched tail size and look up into this buffer

The stat collection is done in FilePrefetchBuffer but only for prefetched tail buffer during table open for now using FilePrefetchBuffer enum. It's cleaner than the alternative of implementing in upper-level call places of FilePrefetchBuffer for table open. It also has the benefit of extensible to other types of FilePrefetchBuffer if needed. See db bench for perf regression concern.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11265

Test Plan:
**- Piggyback on existing test**
**- rocksdb.table.open.prefetch.tail.miss is harder to UT so I manually set prefetch tail read bytes to be small and run db bench.**
```
./db_bench -db=/tmp/testdb -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3  -use_direct_reads=true
```
```
rocksdb.table.open.prefetch.tail.read.bytes P50 : 4096.000000 P95 : 4096.000000 P99 : 4096.000000 P100 : 4096.000000 COUNT : 225 SUM : 921600
rocksdb.table.open.prefetch.tail.miss COUNT : 91
rocksdb.table.open.prefetch.tail.hit COUNT : 1034
```
**- No perf regression observed in db_bench**

SETUP command: create same db with ~900 files for pre-change/post-change.
```
./db_bench -db=/tmp/testdb -benchmarks="fillseq" -key_size=32 -value_size=512 -num=500000 -write_buffer_size=655360  -disable_auto_compactions=true -target_file_size_base=16777216 -compression_type=none
```
TEST command 60 runs or til convergence: as suggested by anand1976 and akankshamahajan15, vary `seek_nexts` and `async_io` in testing.
```
./db_bench -use_existing_db=true -db=/tmp/testdb -statistics=false -cache_size=0 -cache_index_and_filter_blocks=false -benchmarks=seekrandom[-X60] -num=50000 -seek_nexts={10, 500, 1000} -async_io={0|1} -use_direct_reads=true
```
async io = 0, direct io read = true

  | seek_nexts = 10, 30 runs | seek_nexts = 500, 12 runs | seek_nexts = 1000, 6 runs
-- | -- | -- | --
pre-post change | 4776 (± 28) ops/sec;   24.8 (± 0.1) MB/sec | 288 (± 1) ops/sec;   74.8 (± 0.4) MB/sec | 145 (± 4) ops/sec;   75.6 (± 2.2) MB/sec
post-change | 4790 (± 32) ops/sec;   24.9 (± 0.2) MB/sec | 288 (± 3) ops/sec;   74.7 (± 0.8) MB/sec | 143 (± 3) ops/sec;   74.5 (± 1.6) MB/sec

async io = 1, direct io read = true
  | seek_nexts = 10, 54 runs | seek_nexts = 500, 6 runs | seek_nexts = 1000, 4 runs
-- | -- | -- | --
pre-post change | 3350 (± 36) ops/sec;   17.4 (± 0.2) MB/sec | 264 (± 0) ops/sec;   68.7 (± 0.2) MB/sec | 138 (± 1) ops/sec;   71.8 (± 1.0) MB/sec
post-change | 3358 (± 27) ops/sec;   17.4 (± 0.1) MB/sec  | 263 (± 2) ops/sec;   68.3 (± 0.8) MB/sec | 139 (± 1) ops/sec;   72.6 (± 0.6) MB/sec

Reviewed By: ajkr

Differential Revision: D43781467

Pulled By: hx235

fbshipit-source-id: a706a18472a8edb2b952bac3af40eec803537f2a
2023-03-15 14:02:43 -07:00
akankshamahajan 68e4581c67 Return NotSupported in scan if IOUring not supported and enable IOUring in db_stress for async_io testing (#11197)
Summary:
- Return NotSupported in scan if IOUring not supported if async_io is enabled
- Enable IOUring in db_stress for async_io testing
- Disable async_io in circleci crash testing as circleci doesn't support IOUring

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11197

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D43096313

Pulled By: akankshamahajan15

fbshipit-source-id: c2c53a87636950c0243038b9f5bd0d91608e4fda
2023-02-16 18:33:06 -08:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
akankshamahajan bde65052c4 Enhance async scan prefetch unit tests (#11087)
Summary:
Add more coverage in unit tests for async scan. The added unit test fails without PR https://github.com/facebook/rocksdb/pull/10939.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11087

Test Plan: CircleCI jobs status for new unit tests.

Reviewed By: anand1976

Differential Revision: D42487931

Pulled By: akankshamahajan15

fbshipit-source-id: d59ed7666599bd0d2733ac5d76bd70984b54c5a9
2023-01-20 10:17:57 -08:00
anand76 a510880346 Add a unit test for async prefetch fix in #11049 (#11084)
Summary:
Add a unit test in prefetch_test for https://github.com/facebook/rocksdb/issues/11049

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11084

Test Plan: Verify the test fails without https://github.com/facebook/rocksdb/issues/11049 and passes with it

Reviewed By: akankshamahajan15

Differential Revision: D42485828

Pulled By: anand1976

fbshipit-source-id: ae512f2d121745a1f5212645a9b58868976c1f83
2023-01-12 18:09:07 -08:00
anand76 dbf37c290a Fix async prefetch heap use after free (#11049)
Summary:
This PR fixes a heap use after free bug in the async prefetch code that happens in the following scenario -
1. Scan thread starts 2 async reads for Seek, one for the seek block and one for prefetching
2. Before the first read in https://github.com/facebook/rocksdb/issues/1 completes, another thread reads and loads the block in cache
3. The first scan thread finds the block in cache, continues and the next block cache miss is for a block that spans the boundary of the 2 prefetch buffers, and the 1st read is complete but the 2nd one is not complete yet
4. The scan thread will reallocate (i.e free the old buffer and allocate a new one) the 2nd prefetch buffer, and the in-progress prefetch is orphaned
5. The orphaned prefetch finally completes, resulting in a use after free

Also add a few asserts to surface bugs earlier in the crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11049

Test Plan: Repro with db_stress and verify the fix

Reviewed By: akankshamahajan15

Differential Revision: D42181118

Pulled By: anand1976

fbshipit-source-id: 1ac55d2f64a89ce128c1c574262b8aa7d82eb8cc
2022-12-21 09:15:53 -08:00
Akanksha Mahajan 1562524e63 Fix db_stress failure in async_io in FilePrefetchBuffer (#10949)
Summary:
Fix db_stress failure in async_io in FilePrefetchBuffer.

From the logs, assertion was caused when
- prev_offset_ = offset but somehow prev_len != 0 and explicit_prefetch_submitted_ = true. That scenario is when we send async request to prefetch buffer during seek but in second seek that data is found in cache. prev_offset_ and prev_len_ get updated but we were not setting explicit_prefetch_submitted_ = false because of which buffers were getting out of sync.
It's possible a read by another thread might have loaded the block into the cache in the meantime.

Particular assertion example:
```
prev_offset: 0, prev_len_: 8097 , offset: 0, length: 8097, actual_length: 8097 , actual_offset: 0 ,
curr_: 0, bufs_[curr_].offset_: 4096 ,bufs_[curr_].CurrentSize(): 48541 , async_len_to_read: 278528, bufs_[curr_].async_in_progress_: false
second: 1, bufs_[second].offset_: 282624 ,bufs_[second].CurrentSize(): 0, async_len_to_read: 262144 ,bufs_[second].async_in_progress_: true ,
explicit_prefetch_submitted_: true , copy_to_third_buffer: false
```
As we can see curr_ was expected to read 278528 but it read 48541. Also buffers are out of sync.
Also `explicit_prefetch_submitted_` is set true but prev_len not 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10949

Test Plan:
- Ran db_bench for regression to make sure there is no regression;
- Ran db_stress failing without this fix,
- Ran build-linux-mini-crashtest 7- 8 times locally + CircleCI

Reviewed By: anand1976

Differential Revision: D41257786

Pulled By: akankshamahajan15

fbshipit-source-id: 1d100f94f8c06bbbe4cc76ca27f1bbc820c2494f
2022-11-14 16:14:41 -08:00
akankshamahajan 8515437594 Update unit test to avoid timeout (#10950)
Summary:
Update unit test to avoid timeout

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10950

Reviewed By: hx235

Differential Revision: D41258892

Pulled By: akankshamahajan15

fbshipit-source-id: cbfe94da63e9e54544a307845deb79ba42458301
2022-11-14 11:39:22 -08:00
anand76 ecba6a320e Add some async read stats (#10947)
Summary:
Add stats for time spent in the ReadAsync call, and async read errors.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947

Test Plan: Run db_bench and look at stats

Reviewed By: akankshamahajan15

Differential Revision: D41236637

Pulled By: anand1976

fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf
2022-11-13 21:38:35 -08:00
akankshamahajan d1aca4a5ae Fix async_io regression in scans (#10939)
Summary:
Fix async_io regression in scans due to incorrect check which was causing the valid data in buffer to be cleared during seek.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10939

Test Plan:
- stress tests  export CRASH_TEST_EXT_ARGS="--async_io=1"
    make crash_test -j32
- Ran db_bench command which was caught the regression:
./db_bench --db=/rocksdb_async_io_testing/prefix_scan --disable_wal=1 --use_existing_db=true --benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=963 -duration=30 -ops_between_duration_checks=1 --async_io=true --compaction_readahead_size=4194304 --log_readahead_size=0 --blob_compaction_readahead_size=0 --initial_auto_readahead_size=65536 --num_file_reads_for_auto_readahead=0 --max_auto_readahead_size=524288

seekrandom   :    3777.415 micros/op 264 ops/sec 30.000 seconds 7942 operations;  132.3 MB/s (7942 of 7942 found)

Reviewed By: anand1976

Differential Revision: D41173899

Pulled By: akankshamahajan15

fbshipit-source-id: 2d75b06457d65b1851c92382565d9c3fac329dfe
2022-11-11 13:34:49 -08:00
akankshamahajan ff9ad2c39b Fix async_io failures in case there is error in reading data (#10890)
Summary:
Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if data is overlapping between two buffers. If there is IOError while reading the data, it leads to empty buffer and other buffer already in progress of async read goes again for reading causing the error.
Fix: Added check to abort IO in second buffer if curr_ got empty.

This PR also fixes db_stress failures which happened when buffers are not aligned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10890

Test Plan:
- Ran make crash_test -j32 with async_io enabled.
-  Ran benchmarks to make sure there is no regression.

Reviewed By: anand1976

Differential Revision: D40881731

Pulled By: akankshamahajan15

fbshipit-source-id: 39fcf2134c7b1bbb08415ede3e1ef261ac2dbc58
2022-11-01 16:06:51 -07:00
akankshamahajan 671753c43d Run Clang format on file folder (#10860)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10860

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D40656236

Pulled By: akankshamahajan15

fbshipit-source-id: 557600db5c2e0ab9b400655336c467307f7136de
2022-10-24 18:34:52 -07:00
Peter Dillinger e466173d5c Print stack traces on frozen tests in CI (#10828)
Summary:
Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.

For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828

Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI

Reviewed By: hx235

Differential Revision: D40447634

Pulled By: pdillinger

fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
2022-10-18 00:35:35 -07:00
akankshamahajan ae0f9c3339 Add new property in IOOptions to skip recursing through directories and list only files during GetChildren. (#10668)
Summary:
Add new property "do_not_recurse" in  IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files.
By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open.

Provided support in PosixFileSystem to use "do_not_recurse".

TestPlan:
- Existing tests

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668

Reviewed By: anand1976

Differential Revision: D39471683

Pulled By: akankshamahajan15

fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f
2022-10-03 10:59:45 -07:00
Hui Xiao 3b8164912e Add manual_wal_flush, FlushWAL() to stress/crash test (#10698)
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in  > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in`

Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
 - It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false.
    - As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698

Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs

Reviewed By: ajkr

Differential Revision: D39593752

Pulled By: ajkr

fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce
2022-09-30 15:48:33 -07:00
Akanksha Mahajan 03fc43976d Async optimization in scan path (#10602)
Summary:
Optimizations
1. In FilePrefetchBuffer, when data is overlapping between two buffers, it copies the data from first to third buffer, then from
second to third buffer to return continuous buffer. This optimization will call ReadAsync on first buffer as soon as buffer is empty instead of getting blocked by second buffer to copy the data.
2. For fixed size readahead_size, FilePrefetchBuffer will issues two async read calls. One with length + readahead_size_/2 on first buffer(if buffer is empty) and readahead_size_/2 on second buffer during seek.

- Add readahead_size to db_stress for stress testing these changes in https://github.com/facebook/rocksdb/pull/10632

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10602

Test Plan:
- CircleCI tests
- stress_test completed successfully
export CRASH_TEST_EXT_ARGS="--async_io=1"
make crash_test -j32
- db_bench showed no regression
   With this PR:
```
 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
Set seed to 1661876074584472 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:14:34 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  270878.018 micros/op 3 ops/sec 30.068 seconds 111 operations;  618.7 MB/s (111 of 111 found)

 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=true -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
Set seed to 1661875332862922 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:02:12 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  358352.488 micros/op 2 ops/sec 30.102 seconds 84 operations;  474.4 MB/s (84 of 84 found)
```

Without PR in main:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=1
Set seed to 1661876425983045 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:20:26 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  280881.953 micros/op 3 ops/sec 30.054 seconds 107 operations;  605.2 MB/s (107 of 107 found)

 ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main1 -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=327680 -duration=30 -ops_between_duration_checks=1 -async_io=0
Set seed to 1661876475267771 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 7.7.0
Date:       Tue Aug 30 09:21:15 2022
CPU:        32 * Intel Xeon Processor (Skylake)
CPUCache:   16384 KB
Keys:       32 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    50000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    25939.9 MB (estimated)
FileSize:   13732.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main1]
seekrandom   :  363155.084 micros/op 2 ops/sec 30.142 seconds 83 operations;  468.1 MB/s (83 of 83 found)
```

Reviewed By: anand1976

Differential Revision: D39141328

Pulled By: akankshamahajan15

fbshipit-source-id: 560655922c1a437a8569c228abb31b8c0b413120
2022-09-12 17:42:01 -07:00
Akanksha Mahajan 4cd16d65ae Add new option num_file_reads_for_auto_readahead in BlockBasedTableOptions (#10556)
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file if user doesn't provide readahead_size and reads are sequential.
A new option num_file_reads_for_auto_readahead is added which can be
configured and indicates after how many sequential reads prefetching should
be start.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10556

Test Plan: Existing and new unit test

Reviewed By: anand1976

Differential Revision: D38947147

Pulled By: akankshamahajan15

fbshipit-source-id: c9eeab495f84a8df7f701c42f04894e46440ad97
2022-09-01 11:56:00 -07:00
anand76 5fbcc8c54d Update MULTIGET_IO_BATCH_SIZE for non-async MultiGet (#10622)
Summary:
This stat was only getting updated in the async (coroutine) version of MultiGet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10622

Reviewed By: akankshamahajan15

Differential Revision: D39188790

Pulled By: anand1976

fbshipit-source-id: 7e231507f65fc94c8a006c38f79dfba182a2c24a
2022-08-31 21:03:52 -07:00
Hui Xiao e484b81eee Sync dir containing CURRENT after RenameFile on CURRENT as much as possible (#10573)
Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.

This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.

The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped  (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```

```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```

**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
    -  Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
    - `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573

Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr  "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.

Reviewed By: ajkr

Differential Revision: D39005886

Pulled By: hx235

fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
2022-08-29 17:35:21 -07:00
anand76 72a3fb3424 Update statistics for async scan readaheads (#10585)
Summary:
Imported a fix to "rocksdb.prefetched.bytes.discarded" stat from https://github.com/facebook/rocksdb/issues/10561, and added a new stat "rocksdb.async.prefetch.abort.micros" to measure time spent waiting for async reads to abort.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10585

Reviewed By: akankshamahajan15

Differential Revision: D39067000

Pulled By: anand1976

fbshipit-source-id: d7cda71abb48017239bd5fd832345a16c7024faf
2022-08-29 14:37:44 -07:00
sdong 4915f89513 WritableFileWriter to allow operation after failure when SyncWithoutFlush() is involved (#10555)
Summary:
https://github.com/facebook/rocksdb/pull/10489 adds an assertion in most functions in WritableFileWriter to check no previous error. However, it only works without calling SyncWithoutFlush(). The nature of SyncWithoutFlush() makes two concurrent call fails to check status code of each other and causing assertion failure. Fix the problem by skipping the check after SyncWithoutFlush() is called and not check status code in SyncWithoutFlush().

Since the original change was not officially released yet, the fix isn't added to HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10555

Test Plan: Make sure existing tests still pass

Reviewed By: anand1976

Differential Revision: D38946208

fbshipit-source-id: 63566732d3f25c8a8342840499cf7b7d745f27c2
2022-08-24 07:30:52 -07:00