Summary:
**Context/Summary**
As titled. The benefit of doing so is to explicitly call readahead() instead of relying page cache behavior for compaction read when we know that we most likely need readahead as compaction read is sequential read .
**Test**
Extended the existing UT to cover compaction read case
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11631
Reviewed By: ajkr
Differential Revision: D47681437
Pulled By: hx235
fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90
Summary:
When num_file_reads_for_auto_readahead = 1, during seek, it would go for prefetchingextra data in second buffer along with seek data, that would lead to increase in read data and
discarded bytes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11560
Test Plan: Added unit test
Reviewed By: anand1976
Differential Revision: D47008102
Pulled By: akankshamahajan15
fbshipit-source-id: 566c6131cb5f968d5efb81fd0ab233ff7e534ab0
Summary:
1. Public API change: Replace `use_async_io` API in file_system with `SupportedOps` API which is used by underlying FileSystem to indicate to upper layers whether the FileSystem supports different operations introduced in `enum FSSupportedOps `. Right now operations are `async_io` and whether FS will provide its own buffer during reads or not. The api is changed to extend it to various FileSystem operations in one API rather than creating a separate API for each operation.
2. Provide support for underlying FS to pass their own buffer during Reads (async and sync read) instead of using RocksDB provided `scratch` (buffer) in `FSReadRequest`. Currently only MultiRead supports it and later will be extended to other reads as well (point lookup, scan etc). More details in how to enable in file_system.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324
Test Plan: Tested locally
Reviewed By: anand1976
Differential Revision: D44465322
Pulled By: akankshamahajan15
fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec
Summary:
**Context/Summary:**
When db is upgrading to adopt [pr11406](https://github.com/facebook/rocksdb/pull/11406/), it's possible for RocksDB to infer a small tail size to prefetch for pre-upgrade files. Such small tail size would have caused 1 file read per index or filter partition if partitioned index or filer is used. This PR provides a UT to show this would not happen.
Misc: refactor the related UTs a bit to make this new UT more readable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11522
Test Plan:
- New UT
If logic of upgrade is wrong e.g,
```
--- a/table/block_based/partitioned_index_reader.cc
+++ b/table/block_based/partitioned_index_reader.cc
@@ -166,7 +166,8 @@ Status PartitionIndexReader::CacheDependencies(
uint64_t prefetch_len = last_off - prefetch_off;
std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
- tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
+ (false && tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off)) {
```
, then the UT will fail like below
```
[ RUN ] PrefetchTailTest/PrefetchTailTest.UpgradeToTailSizeInManifest/0
file/prefetch_test.cc:461: Failure
Expected: (db_open_file_read.count) < (num_index_partition), actual: 38 vs 33
Received signal 11 (Segmentation fault)
```
Reviewed By: pdillinger
Differential Revision: D46546707
Pulled By: hx235
fbshipit-source-id: 9897b0a975e9055963edac5451fd1cd9d6c45d0e
Summary:
This ticker combined with `rocksdb.files.marked.trash` can help give a better picture of how DeleteScheduler is keeping up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11540
Test Plan:
```
./delete_scheduler_test
```
Reviewed By: ajkr
Differential Revision: D46746401
Pulled By: jowlyzhang
fbshipit-source-id: f3daa622aa3ddefe7d673e0cc257a47699d506df
Summary:
**Context:**
[PR11406](https://github.com/facebook/rocksdb/pull/11406/) caused more frequent read during db open reading files with no `tail_size` in the manifest as part of the upgrade to 11406. This is due to that PR introduced
- [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers tail size to be 0 (usually happens when the stats does not have much historical data to infer early on)
- more read (up to # of partitioned filter/index) when such small prefetch tail buffer does not contain all the partitioned filter/index needed in CacheDependencies() since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once will be [skipped](url) when such a small prefetch tail buffer is passed in
**Summary:**
- Revert the fallback prefetch buffer size change to preserve existing behavior fully during upgrading in `BlockBasedTable::PrefetchTail()`
- Use passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the the offset of first partition filter/index, that is, at least as good as the existing prefetching behavior
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11516
Test Plan:
- db bench
Create db with small files prior to PR 11406
```
./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd`
```
Read db to see if post-pr has lower read qps (i.e, rocksdb.file.read.db.open.micros count) during db open.
```
./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
```
Pre-PR:
```
rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
```
Post-PR:
```
rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
```
_Note: To control the starting offset of the prefetch tail buffer easier, I manually override the following to eliminate the effect of alignment_
```
class PosixRandomAccessFile : public FSRandomAccessFile {
virtual size_t GetRequiredBufferAlignment() const override {
- return logical_sector_size_;
+ return 1;
}
```
- CI
Reviewed By: pdillinger
Differential Revision: D46472566
Pulled By: hx235
fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e
Summary:
In IDE navigation I find it annoying that there are two statistics.h files (etc.) and often land on the wrong one. Here I migrate several headers to use the blah.h <- blah_impl.h <- blah.cc idiom. Although clang-format wants "blah.h" to be the top include for "blah.cc", I think overall this is an improvement.
No public API changes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11408
Test Plan: existing tests
Reviewed By: ltamasi
Differential Revision: D45456696
Pulled By: pdillinger
fbshipit-source-id: 809d931253f3272c908cf5facf7e1d32fc507373
Summary:
**Context:**
We prefetch the tail part of a SST file (i.e, the blocks after data blocks till the end of the file) during each SST file open in hope to prefetch all the stuff at once ahead of time for later read e.g, footer, meta index, filter/index etc. The existing approach to estimate the tail size to prefetch is through `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky case (e.g, small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug. Therefore we decide to record the exact tail size and use it directly to prefetch tail of the SST instead of relying heuristics.
**Summary:**
- Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()`
- For backward compatibility, we fall back to TailPrefetchStats and last to simple heuristics that the tail size is a linear portion of the file size - see PR conversation for more.
- Make`tail_start_offset` part of the table properties and deduct tail size to record in manifest for external files (e.g, file ingestion, import CF) and db repair (with no access to manifest).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406
Test Plan:
1. New UT
2. db bench
Note: db bench on /tmp/ where direct read is supported is too slow to finish and the default pinning setting in db bench is not helpful to profile # sst read of Get. Therefore I hacked the following to obtain the following comparison.
```
diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
--- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
&tail_prefetch_size);
// Try file system prefetch
- if (!file->use_direct_io() && !force_direct_prefetch) {
+ if (false && !file->use_direct_io() && !force_direct_prefetch) {
if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
.IsNotSupported()) {
prefetch_buffer->reset(new FilePrefetchBuffer(
diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
--- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
} else {
BlockBasedTableOptions block_based_options;
+ block_based_options.metadata_cache_options.partition_pinning =
+ PinningTier::kAll;
block_based_options.checksum =
static_cast<ChecksumType>(FLAGS_checksum_type);
if (FLAGS_use_hash_search) {
```
Create DB
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
ReadRandom
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
(a) Existing (Use TailPrefetchStats for tail size + use seperate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```
(b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```
As we can see, we increase the prefetch tail hit count and decrease SST read count with this PR
3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.
Reviewed By: pdillinger
Differential Revision: D45413346
Pulled By: hx235
fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
Summary:
when I use g++-13 to exec the `make all` command, the output throws the warnings.
```
db/compaction/compaction_job_test.cc: In member function ‘void rocksdb::CompactionJobTestBase::AddMockFile(const rocksdb::mock::KVVector&, int)’:
db/compaction/compaction_job_test.cc:376:57: error: redundant move in initialization [-Werror=redundant-move]
376 | env_, GenerateFileName(file_number), std::move(contents)));
| ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
375 | EXPECT_OK(mock_table_factory_->CreateMockTable(
| ^~~~~~~~~
db/compaction/compaction_job_test.cc:376:57: note: remove ‘std::move’ call
376 | env_, GenerateFileName(file_number), std::move(contents)));
| ~~~~~~~~~^~~~~~~~~~
db/compaction/compaction_job_test.cc:375:7: note: in expansion of macro ‘EXPECT_OK’
375 | EXPECT_OK(mock_table_factory_->CreateMockTable(
| ^~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:2507: db/compaction/compaction_job_test.o] Error 1
```
and I also add some `(void)unused_variable` statements because of the cmake argument `-Wunused-but-set-variable -Wunused-but-set-variable`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11418
Reviewed By: akankshamahajan15
Differential Revision: D45528223
Pulled By: ajkr
fbshipit-source-id: fee1a77c30039a56b481de953f0a834cc788abbc
Summary:
**Context:**
The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them.
**Summary**
- Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros`
- Fixed the default histogram in `RandomAccessFileReader`
- New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader`
- Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288
Test Plan:
- **Stress test**
- **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob)
- May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads.
```
./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10)
```
```
// BlockBasedTable
rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805
rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116
rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689
// PlainTable
Does not apply
```
- **Db bench 2: performance**
**Read**
SETUP: db with 900 files
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none
```run till convergence
```
./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3
```
Pre-change
`readrandom [AVG 60 runs] : 21568 (± 248) ops/sec`
Post-change (no regression, -0.3%)
`readrandom [AVG 60 runs] : 21486 (± 236) ops/sec`
**Compaction/Flush**run till convergence
```
./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none
rocksdb.sst.read.micros COUNT : 33820
rocksdb.sst.read.flush.micros COUNT : 1800
rocksdb.sst.read.compaction.micros COUNT : 32020
```
Pre-change
`fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec`
Post-change (no regression, ~-0.4%)
`fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec`
Reviewed By: ajkr
Differential Revision: D44007011
Pulled By: hx235
fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
Summary:
VerifyFileChecksums currently interprets the readahead_size as a payload of readahead_size for calculating the checksum, plus a prefetch of an additional readahead_size. Hence each read is readahead_size * 2. This change treats it as chunks of readahead_size for checksum calculation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328
Test Plan: Add a unit test
Reviewed By: pdillinger
Differential Revision: D44718781
Pulled By: anand1976
fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
Summary:
In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296
Test Plan: Add new unit tests
Reviewed By: akankshamahajan15
Differential Revision: D44045776
Pulled By: anand1976
fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
Summary:
- Return NotSupported in scan if IOUring not supported if async_io is enabled
- Enable IOUring in db_stress for async_io testing
- Disable async_io in circleci crash testing as circleci doesn't support IOUring
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11197
Test Plan: CircleCI jobs
Reviewed By: anand1976
Differential Revision: D43096313
Pulled By: akankshamahajan15
fbshipit-source-id: c2c53a87636950c0243038b9f5bd0d91608e4fda
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.
Most of changes were done through following comments:
unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
Test Plan: See CI
Reviewed By: pdillinger
Differential Revision: D42796341
fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
Summary:
Add more coverage in unit tests for async scan. The added unit test fails without PR https://github.com/facebook/rocksdb/pull/10939.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11087
Test Plan: CircleCI jobs status for new unit tests.
Reviewed By: anand1976
Differential Revision: D42487931
Pulled By: akankshamahajan15
fbshipit-source-id: d59ed7666599bd0d2733ac5d76bd70984b54c5a9
Summary:
This PR fixes a heap use after free bug in the async prefetch code that happens in the following scenario -
1. Scan thread starts 2 async reads for Seek, one for the seek block and one for prefetching
2. Before the first read in https://github.com/facebook/rocksdb/issues/1 completes, another thread reads and loads the block in cache
3. The first scan thread finds the block in cache, continues and the next block cache miss is for a block that spans the boundary of the 2 prefetch buffers, and the 1st read is complete but the 2nd one is not complete yet
4. The scan thread will reallocate (i.e free the old buffer and allocate a new one) the 2nd prefetch buffer, and the in-progress prefetch is orphaned
5. The orphaned prefetch finally completes, resulting in a use after free
Also add a few asserts to surface bugs earlier in the crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11049
Test Plan: Repro with db_stress and verify the fix
Reviewed By: akankshamahajan15
Differential Revision: D42181118
Pulled By: anand1976
fbshipit-source-id: 1ac55d2f64a89ce128c1c574262b8aa7d82eb8cc
Summary:
Fix db_stress failure in async_io in FilePrefetchBuffer.
From the logs, assertion was caused when
- prev_offset_ = offset but somehow prev_len != 0 and explicit_prefetch_submitted_ = true. That scenario is when we send async request to prefetch buffer during seek but in second seek that data is found in cache. prev_offset_ and prev_len_ get updated but we were not setting explicit_prefetch_submitted_ = false because of which buffers were getting out of sync.
It's possible a read by another thread might have loaded the block into the cache in the meantime.
Particular assertion example:
```
prev_offset: 0, prev_len_: 8097 , offset: 0, length: 8097, actual_length: 8097 , actual_offset: 0 ,
curr_: 0, bufs_[curr_].offset_: 4096 ,bufs_[curr_].CurrentSize(): 48541 , async_len_to_read: 278528, bufs_[curr_].async_in_progress_: false
second: 1, bufs_[second].offset_: 282624 ,bufs_[second].CurrentSize(): 0, async_len_to_read: 262144 ,bufs_[second].async_in_progress_: true ,
explicit_prefetch_submitted_: true , copy_to_third_buffer: false
```
As we can see curr_ was expected to read 278528 but it read 48541. Also buffers are out of sync.
Also `explicit_prefetch_submitted_` is set true but prev_len not 0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10949
Test Plan:
- Ran db_bench for regression to make sure there is no regression;
- Ran db_stress failing without this fix,
- Ran build-linux-mini-crashtest 7- 8 times locally + CircleCI
Reviewed By: anand1976
Differential Revision: D41257786
Pulled By: akankshamahajan15
fbshipit-source-id: 1d100f94f8c06bbbe4cc76ca27f1bbc820c2494f
Summary:
Add stats for time spent in the ReadAsync call, and async read errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947
Test Plan: Run db_bench and look at stats
Reviewed By: akankshamahajan15
Differential Revision: D41236637
Pulled By: anand1976
fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf
Summary:
Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if data is overlapping between two buffers. If there is IOError while reading the data, it leads to empty buffer and other buffer already in progress of async read goes again for reading causing the error.
Fix: Added check to abort IO in second buffer if curr_ got empty.
This PR also fixes db_stress failures which happened when buffers are not aligned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10890
Test Plan:
- Ran make crash_test -j32 with async_io enabled.
- Ran benchmarks to make sure there is no regression.
Reviewed By: anand1976
Differential Revision: D40881731
Pulled By: akankshamahajan15
fbshipit-source-id: 39fcf2134c7b1bbb08415ede3e1ef261ac2dbc58
Summary:
Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.
For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828
Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI
Reviewed By: hx235
Differential Revision: D40447634
Pulled By: pdillinger
fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
Summary:
Add new property "do_not_recurse" in IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files.
By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open.
Provided support in PosixFileSystem to use "do_not_recurse".
TestPlan:
- Existing tests
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668
Reviewed By: anand1976
Differential Revision: D39471683
Pulled By: akankshamahajan15
fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with `manual_wal_flush_one_in`
Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
- It turns out that `Writer::AddCompressionTypeRecord` before this assertion `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but do not trigger flush if `manual_wal_flush` is set . This leads to `impl->TEST_WALBufferIsEmpty()' is false.
- As suggested, assertion is removed and violation case is handled by `FlushWAL(sync=true)` along with refactoring `TEST_WALBufferIsEmpty()` to be `WALBufferIsEmpty()` since it is used in prod code now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698
Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs
Reviewed By: ajkr
Differential Revision: D39593752
Pulled By: ajkr
fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce
Summary:
RocksDB does auto-readahead for iterators on noticing more
than two reads for a table file if user doesn't provide readahead_size and reads are sequential.
A new option num_file_reads_for_auto_readahead is added which can be
configured and indicates after how many sequential reads prefetching should
be start.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10556
Test Plan: Existing and new unit test
Reviewed By: anand1976
Differential Revision: D38947147
Pulled By: akankshamahajan15
fbshipit-source-id: c9eeab495f84a8df7f701c42f04894e46440ad97
Summary:
This stat was only getting updated in the async (coroutine) version of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10622
Reviewed By: akankshamahajan15
Differential Revision: D39188790
Pulled By: anand1976
fbshipit-source-id: 7e231507f65fc94c8a006c38f79dfba182a2c24a
Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.
This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.
The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```
```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```
**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
- Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
- `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573
Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.
Reviewed By: ajkr
Differential Revision: D39005886
Pulled By: hx235
fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
Summary:
Imported a fix to "rocksdb.prefetched.bytes.discarded" stat from https://github.com/facebook/rocksdb/issues/10561, and added a new stat "rocksdb.async.prefetch.abort.micros" to measure time spent waiting for async reads to abort.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10585
Reviewed By: akankshamahajan15
Differential Revision: D39067000
Pulled By: anand1976
fbshipit-source-id: d7cda71abb48017239bd5fd832345a16c7024faf
Summary:
https://github.com/facebook/rocksdb/pull/10489 adds an assertion in most functions in WritableFileWriter to check no previous error. However, it only works without calling SyncWithoutFlush(). The nature of SyncWithoutFlush() makes two concurrent call fails to check status code of each other and causing assertion failure. Fix the problem by skipping the check after SyncWithoutFlush() is called and not check status code in SyncWithoutFlush().
Since the original change was not officially released yet, the fix isn't added to HISTORY.md.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10555
Test Plan: Make sure existing tests still pass
Reviewed By: anand1976
Differential Revision: D38946208
fbshipit-source-id: 63566732d3f25c8a8342840499cf7b7d745f27c2
Summary:
A flag in WritableFileWriter is introduced to remember error has happened. Subsequent operations will fail with an assertion. Those operations, except Close() are not supposed to be called anyway. This change will help catch bug in tests and stress tests and limit damage of a potential bug of continue writing to a file after a failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489
Test Plan: Fix existing unit tests and watch crash tests for a while.
Reviewed By: anand1976
Differential Revision: D38473277
fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366
Summary:
`FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced.
The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`.
An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB.
```
Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185
Test Plan:
- repro unit test
- the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list"
Reviewed By: riversand963
Differential Revision: D37200993
Pulled By: ajkr
fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.
However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
- Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
Reviewed By: ajkr, anand1976
Differential Revision: D37135172
Pulled By: hx235
fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
Summary:
This PR adds few optimizations for async_io for shorter scans.
1. If async_io is enabled, seek would create FilePrefetchBuffer object to fetch the data asynchronously. However `FilePrefetchbuffer::num_file_reads_` wasn't taken into consideration if it calls Next after Seek and would go for Prefetching. This PR fixes that and Next will go for prefetching only if `FilePrefetchbuffer::num_file_reads_` is greater than 2 along with if blocks are sequential. This scenario is only for implicit auto readahead.
2. For seek, when it calls TryReadFromCacheAsync to poll it makes async call as well because TryReadFromCacheAsync flow wasn't changed. So I updated to return after poll instead of further prefetching any data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10140
Test Plan:
1. Added a unit test
2. Ran crash_test with async_io = 1 to make sure nothing crashes.
Reviewed By: anand1976
Differential Revision: D37042242
Pulled By: akankshamahajan15
fbshipit-source-id: b8e6b7cb2ee0886f37a8f53951948b9084e8ffda
Summary:
The default implementation of Close() function in Directory/FSDirectory classes returns `NotSupported` status. However, we don't want operations that worked in older versions to begin failing after upgrading when run on FileSystems that have not implemented Directory::Close() yet. So we require the upper level that calls Close() function should properly handle "NotSupported" status instead of treating it as an error status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10127
Reviewed By: ajkr
Differential Revision: D36971112
Pulled By: littlepig2013
fbshipit-source-id: 100f0e6ad1191e1acc1ba6458c566a11724cf466
Summary:
Closing https://github.com/facebook/rocksdb/issues/10080
When `SyncWAL()` calls `MarkLogsSynced()`, even if there is only one active WAL file,
this event should still be added to the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10087
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D36797580
Pulled By: riversand963
fbshipit-source-id: 24184c9dd606b3939a454ed41de6e868d1519999
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.
Reviewed By: ajkr, hx235
Differential Revision: D36722135
Pulled By: littlepig2013
fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
Summary:
Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).
The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973
Test Plan:
- `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
- Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
- Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
- Benchmark:
```
strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
```
- db bench on backup and restore to ensure no performance regression.
- backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
- restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)
```
# Set up
./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000
# benchmark
TEST_TMPDIR=/tmp/test_rocksdb
NUM_RUN=50
for ((j=0;j<$NUM_RUN;j++))
do
./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
# Restore
#./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
```
Reviewed By: hx235
Differential Revision: D36327418
Pulled By: cbi42
fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
Summary:
In case of non sequential reads with `async_io`, `FilePRefetchBuffer::TryReadFromCacheAsync` can be called for previous blocks with `offset < bufs_[curr_].offset_` which wasn't handled correctly resulting wrong data being returned from buffer.
Since `FilePRefetchBuffer::PrefetchAsync` can be called for any data block, it sets `prev_len_` to 0 indicating `FilePRefetchBuffer::TryReadFromCacheAsync` to go for the prefetching even though offset < bufs_[curr_].offset_ This is because async prefetching is always done in second buffer (to avoid mutex) even though curr_ is empty leading to offset < bufs_[curr_].offset_ in some cases.
If prev_len_ is non zero then `TryReadFromCacheAsync` returns false if `offset < bufs_[curr_].offset_ && prev_len != 0` indicating reads are not sequential and previous call wasn't PrefetchAsync.
- This PR also simplifies `FilePRefetchBuffer::TryReadFromCacheAsync` as it was getting complicated covering different scenarios based on `async_io` enabled/disabled. If `for_compaction` is set true, it now calls `FilePRefetchBufferTryReadFromCache` following synchronous flow as before. Its decided in BlockFetcher.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10032
Test Plan:
1. export CRASH_TEST_EXT_ARGS=" --async_io=1"
make crash_test -j completed successfully locally
2. make crash_test -j completed successfully locally
3. Reran CircleCi mini crashtest job 4 - 5 times.
4. Updated prefetch_test for more coverage.
Reviewed By: anand1976
Differential Revision: D36579858
Pulled By: akankshamahajan15
fbshipit-source-id: 0c428d62b45e12e082a83acf533a5e37a584bedf
Summary:
The RocksDB iterator is a hierarchy of iterators. MergingIterator maintains a heap of LevelIterators, one for each L0 file and for each non-zero level. The Seek() operation naturally lends itself to parallelization, as it involves positioning every LevelIterator on the correct data block in the correct SST file. It lookups a level for a target key, to find the first key that's >= the target key. This typically involves reading one data block that is likely to contain the target key, and scan forward to find the first valid key. The forward scan may read more data blocks. In order to find the right data block, the iterator may read some metadata blocks (required for opening a file and searching the index).
This flow can be parallelized.
Design: Seek will be called two times under async_io option. First seek will send asynchronous request to prefetch the data blocks at each level and second seek will follow the normal flow and in FilePrefetchBuffer::TryReadFromCacheAsync it will wait for the Poll() to get the results and add the iterator to min_heap.
- Status::TryAgain is passed down from FilePrefetchBuffer::PrefetchAsync to block_iter_.Status indicating asynchronous request has been submitted.
- If for some reason asynchronous request returns error in submitting the request, it will fallback to sequential reading of blocks in one pass.
- If the data already exists in prefetch_buffer, it will return the data without prefetching further and it will be treated as single pass of seek.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9994
Test Plan:
- **Run Regressions.**
```
./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
```
i) Previous release 7.0 run for normal prefetching with async_io disabled:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB: version 7.0
Date: Thu Mar 17 13:11:34 2022
CPU: 24 * Intel Core Processor (Broadwell)
CPUCache: 16384 KB
Keys: 32 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2594.0 MB (estimated)
FileSize: 1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom : 483618.390 micros/op 2 ops/sec; 338.9 MB/s (249 of 249 found)
```
ii) normal prefetching after changes with async_io disable:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Set seed to 1652922591315307 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB: version 7.3
Date: Wed May 18 18:09:51 2022
CPU: 32 * Intel Xeon Processor (Skylake)
CPUCache: 16384 KB
Keys: 32 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2594.0 MB (estimated)
FileSize: 1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom : 483080.466 micros/op 2 ops/sec 120.287 seconds 249 operations; 340.8 MB/s (249 of 249 found)
```
iii) db_bench with async_io enabled completed succesfully
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1 -adaptive_readahead=1
Set seed to 1652924062021732 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB: version 7.3
Date: Wed May 18 18:34:22 2022
CPU: 32 * Intel Xeon Processor (Skylake)
CPUCache: 16384 KB
Keys: 32 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2594.0 MB (estimated)
FileSize: 1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom : 553913.576 micros/op 1 ops/sec 120.199 seconds 217 operations; 293.6 MB/s (217 of 217 found)
```
- db_stress with async_io disabled completed succesfully
```
export CRASH_TEST_EXT_ARGS=" --async_io=0"
make crash_test -j
```
I**n Progress**: db_stress with async_io is failing and working on debugging/fixing it.
Reviewed By: anand1976
Differential Revision: D36459323
Pulled By: akankshamahajan15
fbshipit-source-id: abb1cd944abe712bae3986ae5b16704b3338917c
Summary:
1. The latest change of DecideRateLimiterPriority in https://github.com/facebook/rocksdb/pull/9988 is reverted.
2. For https://github.com/facebook/rocksdb/blob/main/db/builder.cc#L345-L349
2.1. Remove `we will regrad this verification as user reads` from the comments.
2.2. `Do not set` the read_options.rate_limiter_priority to Env::IO_USER . Flush should be a background job.
2.3. Update db_rate_limiter_test.cc.
3. In IOOptions, mark `prio` as deprecated for future removal.
4. In `file_system.h`, mark `IOPriority` as deprecated for future removal.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10020
Test Plan: Unit tests.
Reviewed By: ajkr
Differential Revision: D36525317
Pulled By: gitbw95
fbshipit-source-id: 011ba421822f8a124e6d25a2661c4e242df6ad36
Summary:
### Context:
Background compactions and flush generate large reads and writes, and can be long running, especially for universal compaction. In some cases, this can impact foreground reads and writes by users.
### Solution
User, Flush, and Compaction reads share some code path. For this task, we update the rate_limiter_priority in ReadOptions for code paths (e.g. FindTable (mainly in BlockBasedTable::Open()) and various iterators), and eventually update the rate_limiter_priority in IOOptions for FSRandomAccessFile.
**This PR is for the Read path.** The **Read:** dynamic priority for different state are listed as follows:
| State | Normal | Delayed | Stalled |
| ----- | ------ | ------- | ------- |
| Flush (verification read in BuildTable()) | IO_USER | IO_USER | IO_USER |
| Compaction | IO_LOW | IO_USER | IO_USER |
| User | User provided | User provided | User provided |
We will respect the read_options that the user provided and will not set it.
The only sst read for Flush is the verification read in BuildTable(). It claims to be "regard as user read".
**Details**
1. Set read_options.rate_limiter_priority dynamically:
- User: Do not update the read_options. Use the read_options that the user provided.
- Compaction: Update read_options in CompactionJob::ProcessKeyValueCompaction().
- Flush: Update read_options in BuildTable().
2. Pass the rate limiter priority to FSRandomAccessFile functions:
- After calling the FindTable(), read_options is passed through GetTableReader(table_cache.cc), BlockBasedTableFactory::NewTableReader(block_based_table_factory.cc), and BlockBasedTable::Open(). The Open() needs some updates for the ReadOptions variable and the updates are also needed for the called functions, including PrefetchTail(), PrepareIOOptions(), ReadFooterFromFile(), ReadMetaIndexblock(), ReadPropertiesBlock(), PrefetchIndexAndFilterBlocks(), and ReadRangeDelBlock().
- In RandomAccessFileReader, the functions to be updated include Read(), MultiRead(), ReadAsync(), and Prefetch().
- Update the downstream functions of NewIndexIterator(), NewDataBlockIterator(), and BlockBasedTableIterator().
### Test Plans
Add unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9996
Reviewed By: anand1976
Differential Revision: D36452483
Pulled By: gitbw95
fbshipit-source-id: 60978204a4f849bb9261cb78d9bc1cb56d6008cf