Find a file
Hui Xiao 3093d98c78 Fix higher read qps during db open caused by pr 11406 (#11516)
Summary:
**Context:**
[PR11406](https://github.com/facebook/rocksdb/pull/11406/) caused more frequent read during db open reading files with no `tail_size` in the manifest as part of the upgrade to 11406. This is due to that PR introduced
- [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers tail size to be 0 (usually happens when the stats does not have much historical data to infer early on)
-  more read (up to # of partitioned filter/index) when such small prefetch tail buffer does not contain all the partitioned filter/index needed in CacheDependencies() since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once will be [skipped](url) when such a small prefetch tail buffer is passed in

**Summary:**
- Revert the fallback prefetch buffer size change to preserve existing behavior fully during upgrading in `BlockBasedTable::PrefetchTail()`
- Use passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the the offset of first partition filter/index, that is, at least as good as the existing prefetching behavior

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11516

Test Plan:
- db bench

Create db with small files prior to PR 11406
```
./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd`
```
Read db to see if post-pr has lower read qps (i.e, rocksdb.file.read.db.open.micros count) during db open.
```
./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
```
Pre-PR:
```
rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
```

Post-PR:
```
rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
```

_Note: To control the starting offset of the prefetch tail buffer easier, I manually override the following to eliminate the effect of alignment_
```
class PosixRandomAccessFile : public FSRandomAccessFile {
virtual size_t GetRequiredBufferAlignment() const override {
-    return logical_sector_size_;
+    return 1;
  }
 ```

- CI

Reviewed By: pdillinger

Differential Revision: D46472566

Pulled By: hx235

fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e
2023-06-06 17:42:43 -07:00
.circleci Revert enabling IO uring in db_stress (#11242) 2023-02-21 12:53:55 -08:00
.github/workflows ci: add GitHub token permissions for workflow (#10549) 2022-10-04 12:10:30 -07:00
buckifier Fix duplicate symbols in linking with buck2 (#11421) 2023-05-01 10:45:36 -07:00
build_tools Repair/instate jemalloc build on M1 (#11257) 2023-05-22 11:06:41 -07:00
cache Integrate CacheReservationManager with compressed secondary cache (#11449) 2023-05-30 14:05:48 -07:00
cmake gcc-11 and cmake related cleanup (#9286) 2021-12-17 17:04:35 -08:00
coverage Remove platform009 and default to platform010 (#11333) 2023-03-30 09:56:37 -07:00
db Fix subcompaction bug to allow running two subcompactions (#11501) 2023-06-06 13:36:02 -07:00
db_stress_tool Add WaitForCompact with WaitForCompactOptions to public API (#11436) 2023-05-25 17:25:51 -07:00
docs Remove docs/Gemfile.lock and update github-pages version (#11173) 2023-02-14 12:17:23 -08:00
env Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
examples Remove RocksDB LITE (#11147) 2023-01-27 13:14:19 -08:00
file Fix higher read qps during db open caused by pr 11406 (#11516) 2023-06-06 17:42:43 -07:00
fuzz Block per key-value checksum (#11287) 2023-04-25 12:08:23 -07:00
include/rocksdb CompactRange() always compacts to bottommost level for leveled compaction (#11468) 2023-06-01 15:27:29 -07:00
java Implement missing compactrangeoptions from Java API (#10880) 2023-05-24 11:04:46 -07:00
logging Disabling some IO error assertion in EnvLogger (#11314) 2023-03-20 13:23:29 -07:00
memory Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
memtable remove redundant move (#11418) 2023-05-03 09:37:21 -07:00
microbench Small improvements to DBGet microbenchmark (#11498) 2023-06-02 16:39:14 -07:00
monitoring Much better stats for seeks and prefix filtering (#11460) 2023-05-19 15:25:49 -07:00
options Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
plugin Add initial CMake support to plugin (#9214) 2021-11-30 17:16:53 -08:00
port Simplify detection of x86 CPU features (#11419) 2023-05-09 22:25:45 -07:00
table Fix higher read qps during db open caused by pr 11406 (#11516) 2023-06-06 17:42:43 -07:00
test_util Add support to strip / pad timestamp when writing / reading a block (#11472) 2023-05-25 15:41:32 -07:00
third-party fix optimization-disabled test builds with platform010 (#11361) 2023-04-10 13:59:44 -07:00
tools Support single delete help message in ldb (#11493) 2023-05-31 14:24:54 -07:00
trace_replay Fix error maybe-uninitialized #11100 (#11101) 2023-01-19 13:59:48 -08:00
unreleased_history Fix subcompaction bug to allow running two subcompactions (#11501) 2023-06-06 13:36:02 -07:00
util switch to use RocksDB UnorderedMap (#11507) 2023-06-05 13:36:26 -07:00
utilities Flush option in WaitForCompact() (#11483) 2023-05-31 12:53:51 -07:00
.clang-format
.gitignore Tweak on IsTrivialMove() (#11467) 2023-05-26 16:40:50 -07:00
.lgtm.yml
.watchmanconfig
AUTHORS
CMakeLists.txt Add utils to use for handling user defined timestamp size record in WAL (#11451) 2023-05-22 14:28:58 -07:00
CODE_OF_CONDUCT.md
common.mk Clean up variables for temporary directory (#9961) 2022-05-06 16:38:06 -07:00
CONTRIBUTING.md
COPYING
crash_test.mk Allow a custom DB cleanup command to be passed to db_crashtest.py (#10883) 2022-10-27 19:47:01 -07:00
DEFAULT_OPTIONS_HISTORY.md Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363) 2022-01-18 17:31:03 -08:00
DUMP_FORMAT.md
HISTORY.md Better management of unreleased HISTORY (#11481) 2023-05-30 16:42:49 -07:00
INSTALL.md Simplify detection of x86 CPU features (#11419) 2023-05-09 22:25:45 -07:00
issue_template.md
LANGUAGE-BINDINGS.md Add grocksdb in Go language bindings (#10498) 2022-08-23 15:02:10 -07:00
LICENSE.Apache
LICENSE.leveldb
Makefile Add utils to use for handling user defined timestamp size record in WAL (#11451) 2023-05-22 14:28:58 -07:00
PLUGINS.md Added encryption plugin based on Intel open-source ipp-crypto library (#11429) 2023-05-08 12:13:43 -07:00
README.md Remove deprecated integration tests from README.md (#11354) 2023-04-07 16:52:50 -07:00
rocksdb.pc.in build: fix pkg-config file generation (#9953) 2022-05-30 12:46:40 -07:00
src.mk Add utils to use for handling user defined timestamp size record in WAL (#11451) 2023-05-22 14:28:58 -07:00
TARGETS Add utils to use for handling user defined timestamp size record in WAL (#11451) 2023-05-22 14:28:58 -07:00
thirdparty.inc
USERS.md Add more users (#11369) 2023-05-03 10:47:41 -07:00
Vagrantfile
WINDOWS_PORT.md Update branch name in WINDOWS_PORT.md (#8745) 2021-09-01 19:26:39 -07:00

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

CircleCI Status

RocksDB is developed and maintained by Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com)

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the github wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.