Commit Graph

16 Commits

Author SHA1 Message Date
liuhu 6fbd02f258 Add dump all keys for cache dumper impl (#12500)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/12501

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12500

Reviewed By: jowlyzhang

Differential Revision: D55922379

Pulled By: ajkr

fbshipit-source-id: 0759afcec148d256a2d1cd5ef76fd988fab4a9af
2024-04-12 10:47:13 -07:00
Andrew Kryczka 8897bf2d04 Drop unsynced data in `TestFSWritableFile::Close()` (#12528)
Summary:
Our `FileSystem` for simulating unsynced data loss should not sync during `Close()` because it masks bugs where we forgot to sync as long as we closed the file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12528

Test Plan:
Peeled back https://github.com/facebook/rocksdb/issues/10560 fix and verified it is caught much faster now (few seconds vs. ???) with command like

```
$ TEST_TMPDIR=./ python3 tools/db_crashtest.py blackbox --disable_wal=0 --max_key=1000 --write_buffer_size=131072 --max_bytes_for_level_base=524288 --target_file_size_base=131072 --interval=3 --sync_fault_injection=1 --enable_blob_files=0 --manual_wal_flush_one_in=10 --sync_wal_one_in=0 --get_live_files_one_in=0 --get_sorted_wal_files_one_in=0 --backup_one_in=0 --checkpoint_one_in=0 --write_fault_one_in=0 --read_fault_one_in=0 --open_write_fault_one_in=0 --compact_range_one_in=0 --compact_files_one_in=0 --open_read_fault_one_in=0 --get_property_one_in=0 --writepercent=100 -readpercent=0 -prefixpercent=0 -delpercent=0 -delrangepercent=0 -iterpercent=0
```

Reviewed By: anand1976

Differential Revision: D56033250

Pulled By: ajkr

fbshipit-source-id: 6bbf480d79a06c46f08f6214010937f6654af5ca
2024-04-12 09:57:56 -07:00
liuhu b7f1eeb0ca Cache dumper exit early due to deadline or max_dumped_size (#12491)
Summary:
In production, we need to control the duration time or max size of cache dumper to get better performance.
Fixes https://github.com/facebook/rocksdb/issues/12494

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12491

Reviewed By: hx235

Differential Revision: D55905826

Pulled By: ajkr

fbshipit-source-id: 9196a5e852c344d6783f7a8234e997c87215bd19
2024-04-11 21:56:45 -07:00
Peter Dillinger 76c834e441 Remove 'virtual' when implied by 'override' (#12319)
Summary:
... to follow modern C++ style / idioms.

Used this hack:
```
for FILE in `cat my_list_of_files`; do perl -pi -e 'BEGIN{undef $/;} s/ virtual( [^;{]* override)/$1/smg' $FILE; done
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12319

Test Plan: existing tests, CI

Reviewed By: jaykorean

Differential Revision: D53275303

Pulled By: pdillinger

fbshipit-source-id: bc0881af270aa8ef4d0ae4f44c5a6614b6407377
2024-01-31 13:14:42 -08:00
Hui Xiao 06e593376c Group SST write in flush, compaction and db open with new stats (#11910)
Summary:
## Context/Summary
Similar to https://github.com/facebook/rocksdb/pull/11288, https://github.com/facebook/rocksdb/pull/11444, categorizing SST/blob file write according to different io activities allows more insight into the activity.

For that, this PR does the following:
- Tag different write IOs by passing down and converting WriteOptions to IOOptions
- Add new SST_WRITE_MICROS histogram in WritableFileWriter::Append() and breakdown FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS

Some related code refactory to make implementation cleaner:
- Blob stats
   - Replace high-level write measurement with low-level WritableFileWriter::Append() measurement for BLOB_DB_BLOB_FILE_WRITE_MICROS. This is to make FILE_WRITE_{FLUSH|COMPACTION|DB_OPEN}_MICROS  include blob file. As a consequence, this introduces some behavioral changes on it, see HISTORY and db bench test plan below for more info.
   - Fix bugs where BLOB_DB_BLOB_FILE_SYNCED/BLOB_DB_BLOB_FILE_BYTES_WRITTEN include file failed to sync and bytes failed to write.
- Refactor WriteOptions constructor for easier construction with io_activity and rate_limiter_priority
- Refactor DBImpl::~DBImpl()/BlobDBImpl::Close() to bypass thread op verification
- Build table
   - TableBuilderOptions now includes Read/WriteOpitons so BuildTable() do not need to take these two variables
   - Replace the io_priority passed into BuildTable() with TableBuilderOptions::WriteOpitons::rate_limiter_priority. Similar for BlobFileBuilder.
This parameter is used for dynamically changing file io priority for flush, see  https://github.com/facebook/rocksdb/pull/9988?fbclid=IwAR1DtKel6c-bRJAdesGo0jsbztRtciByNlvokbxkV6h_L-AE9MACzqRTT5s for more
   - Update ThreadStatus::FLUSH_BYTES_WRITTEN to use io_activity to track flush IO in flush job and db open instead of io_priority

## Test
### db bench

Flush
```
./db_bench --statistics=1 --benchmarks=fillseq --num=100000 --write_buffer_size=100

rocksdb.sst.write.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.flush.micros P50 : 1.830863 P95 : 4.094720 P99 : 6.578947 P100 : 26.000000 COUNT : 7875 SUM : 20377
rocksdb.file.write.compaction.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.db.open.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
```

compaction, db oopen
```
Setup: ./db_bench --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

rocksdb.sst.write.micros P50 : 2.675325 P95 : 9.578788 P99 : 18.780000 P100 : 314.000000 COUNT : 638 SUM : 3279
rocksdb.file.write.flush.micros P50 : 0.000000 P95 : 0.000000 P99 : 0.000000 P100 : 0.000000 COUNT : 0 SUM : 0
rocksdb.file.write.compaction.micros P50 : 2.757353 P95 : 9.610687 P99 : 19.316667 P100 : 314.000000 COUNT : 615 SUM : 3213
rocksdb.file.write.db.open.micros P50 : 2.055556 P95 : 3.925000 P99 : 9.000000 P100 : 9.000000 COUNT : 23 SUM : 66
```

blob stats - just to make sure they aren't broken by this PR
```
Integrated Blob DB

Setup: ./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

Run:./db_bench --enable_blob_files=1 --statistics=1 --benchmarks=compact  --db=../db_bench --use_existing_db=1

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 7.298246 P95 : 9.771930 P99 : 9.991813 P100 : 16.000000 COUNT : 235 SUM : 1600
rocksdb.blobdb.blob.file.synced COUNT : 1
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 2.000000 P95 : 2.829360 P99 : 2.993779 P100 : 9.000000 COUNT : 707 SUM : 1614
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 1 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 34842 (stay the same)
```

```
Stacked Blob DB

Run: ./db_bench --use_blob_db=1 --statistics=1 --benchmarks=fillseq --num=10000 --disable_auto_compactions=1 -write_buffer_size=100 --db=../db_bench

pre-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 12.808042 P95 : 19.674497 P99 : 28.539683 P100 : 51.000000 COUNT : 10000 SUM : 140876
rocksdb.blobdb.blob.file.synced COUNT : 8
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445

post-PR:
rocksdb.blobdb.blob.file.write.micros P50 : 1.657370 P95 : 2.952175 P99 : 3.877519 P100 : 24.000000 COUNT : 30001 SUM : 67924
- COUNT is higher and values are smaller as it includes header and footer write
- COUNT is 3X higher due to each Append() count as one post-PR, while in pre-PR, 3 Append()s counts as one. See https://github.com/facebook/rocksdb/pull/11910/files#diff-32b811c0a1c000768cfb2532052b44dc0b3bf82253f3eab078e15ff201a0dabfL157-L164

rocksdb.blobdb.blob.file.synced COUNT : 8 (stay the same)
rocksdb.blobdb.blob.file.bytes.written COUNT : 1043445 (stay the same)
```

###  Rehearsal CI stress test
Trigger 3 full runs of all our CI stress tests

###  Performance

Flush
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=ManualFlush/key_num:524288/per_key_size:256 --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark; enable_statistics = true

Pre-pr: avg 507515519.3 ns
497686074,499444327,500862543,501389862,502994471,503744435,504142123,504224056,505724198,506610393,506837742,506955122,507695561,507929036,508307733,508312691,508999120,509963561,510142147,510698091,510743096,510769317,510957074,511053311,511371367,511409911,511432960,511642385,511691964,511730908,

Post-pr: avg 511971266.5 ns, regressed 0.88%
502744835,506502498,507735420,507929724,508313335,509548582,509994942,510107257,510715603,511046955,511352639,511458478,512117521,512317380,512766303,512972652,513059586,513804934,513808980,514059409,514187369,514389494,514447762,514616464,514622882,514641763,514666265,514716377,514990179,515502408,
```

Compaction
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{pre|post}_pr --benchmark_filter=ManualCompaction/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 495346098.30 ns
492118301,493203526,494201411,494336607,495269217,495404950,496402598,497012157,497358370,498153846

Post-pr: avg 504528077.20, regressed 1.85%. "ManualCompaction" include flush so the isolated regression for compaction should be around 1.85-0.88 = 0.97%
502465338,502485945,502541789,502909283,503438601,504143885,506113087,506629423,507160414,507393007
```

Put with WAL (in case passing WriteOptions slows down this path even without collecting SST write stats)
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_pre_pr --benchmark_filter=DBPut/comp_style:0/max_data:107374182400/per_key_size:256/enable_statistics:1/wal:1  --benchmark_repetitions=1000
-- default: 1 thread is used to run benchmark

Pre-pr: avg 3848.10 ns
3814,3838,3839,3848,3854,3854,3854,3860,3860,3860

Post-pr: avg 3874.20 ns, regressed 0.68%
3863,3867,3871,3874,3875,3877,3877,3877,3880,3881
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11910

Reviewed By: ajkr

Differential Revision: D49788060

Pulled By: hx235

fbshipit-source-id: 79e73699cda5be3b66461687e5147c2484fc5eff
2023-12-29 15:29:23 -08:00
Qiaolin Yu f799c73d28 Trace analyzer: replace number with enumeration type (#10827)
Summary:
Currently, some numbers in the `tracer_analyzer_tool` may be a little confusing and unfriendly for people who want to add new query types.

It may be better to replace them with the existing enumeration type to improve readability.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10827

Reviewed By: ajkr

Differential Revision: D40576023

Pulled By: hx235

fbshipit-source-id: 0eb16820a15f365d53e848a3a8efd92928420429
2023-12-27 10:38:53 -08:00
Hui Xiao 9a034801ce Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
2023-08-08 17:26:50 -07:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Peter Dillinger 9f7801c5f1 Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).

The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.

* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)

## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
  * Simpler for implementations to deal with just one Insert and one Lookup.
  * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
  * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
  * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
  * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
  * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.

## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).

The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.

These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)

Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.

## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.

## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.

The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.

## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)

## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975

Test Plan:
tests updated

Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):

34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83

Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.

Reviewed By: anand1976

Differential Revision: D42417818

Pulled By: pdillinger

fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2023-01-11 14:20:40 -08:00
Peter Dillinger e079d562af Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945)
Summary:
Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945

Test Plan: existing tests for CacheDumper, added basic unit test

Reviewed By: anand1976

Differential Revision: D41231497

Pulled By: pdillinger

fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068
2022-11-21 16:17:36 -08:00
Peter Dillinger ef443cead4 Refactor to avoid confusing "raw block" (#10408)
Summary:
We have a lot of confusing code because of mixed, sometimes
completely opposite uses of of the term "raw block" or "raw contents",
sometimes within the same source file. For example, in `BlockBasedTableBuilder`,
`raw_block_contents` and `raw_size` generally referred to uncompressed block
contents and size, while `WriteRawBlock` referred to writing a block that
is already compressed if it is going to be. Meanwhile, in
`BlockBasedTable`, `raw_block_contents` either referred to a (maybe
compressed) block with trailer, or a maybe compressed block maybe
without trailer. (Note: left as follow-up work to use C++ typing to
better sort out the various kinds of BlockContents.)

This change primarily tries to apply some consistent terminology around
the kinds of block representations, avoiding the unclear "raw". (Any
meaning of "raw" assumes some bias toward the storage layer or toward
the logical data layer.) Preferred terminology:

* **Serialized block** - bytes that go into storage. For block-based table
(usually the case) this includes the block trailer. WART: block `size` may or
may not include the trailer; need to be clear about whether it does or not.
* **Maybe compressed block** - like a serialized block, but without the
trailer (or no promise of including a trailer). Must be accompanied by a
CompressionType.
* **Uncompressed block** - "payload" bytes that are either stored with no
compression, used as input to compression function, or result of
decompression function.
* **Parsed block** - an in-memory form of a block in block cache, as it is
used by the table reader. Different C++ types are used depending on the
block type (see block_like_traits.h).

Other refactorings:
* Misc corrections/improvements of internal API comments
* Remove a few misleading / unhelpful / redundant comments.
* Use move semantics in some places to simplify contracts
* Use better parameter names to indicate which parameters are used for
outputs
* Remove some extraneous `extern`
* Various clean-ups to `CacheDumperImpl` (mostly unnecessary code)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408

Test Plan: existing tests

Reviewed By: akankshamahajan15

Differential Revision: D38172617

Pulled By: pdillinger

fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c
2022-09-22 11:25:32 -07:00
Peter Dillinger 126c223714 Remove deprecated block-based filter (#10184)
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).

This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).

Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184

Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB

Reviewed By: riversand963

Differential Revision: D37212647

Pulled By: pdillinger

fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
2022-06-16 15:51:33 -07:00
Peter Dillinger efd035164b Meta-internal folly integration with F14FastMap (#9546)
Summary:
Especially after updating to C++17, I don't see a compelling case for
*requiring* any folly components in RocksDB. I was able to purge the existing
hard dependencies, and it can be quite difficult to strip out non-trivial components
from folly for use in RocksDB. (The prospect of doing that on F14 has changed
my mind on the best approach here.)

But this change creates an optional integration where we can plug in
components from folly at compile time, starting here with F14FastMap to replace
std::unordered_map when possible (probably no public APIs for example). I have
replaced the biggest CPU users of std::unordered_map with compile-time
pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set.
USE_FOLLY is always set in the Meta-internal buck build, and a simulation of
that is in the Makefile for public CI testing. A full folly build is not needed, but
checking out the full folly repo is much simpler for getting the dependency,
and anything else we might want to optionally integrate in the future.

Some picky details:
* I don't think the distributed mutex stuff is actually used, so it was easy to remove.
* I implemented an alternative to `folly::constexpr_log2` (which is much easier
in C++17 than C++11) so that I could pull out the hard dependencies on
`ConstexprMath.h`
* I had to add noexcept move constructors/operators to some types to make
F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a
macro to make that easier in some common cases.
* Updated Meta-internal buck build to use folly F14Map (always)

No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a
production integration for open source users.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546

Test Plan:
CircleCI tests updated so that a couple of them use folly.

Most internal unit & stress/crash tests updated to use Meta-internal latest folly.
(Note: they should probably use buck but they currently use Makefile.)

Example performance improvement: when filter partitions are pinned in cache,
they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build
a test that exercises that heavily. Build DB with

```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters
```

and test with (simultaneous runs with & without folly, ~20 times each to see
convergence)

```
TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache
```

Average ops/s no folly: 26229.2
Average ops/s with folly: 26853.3 (+2.4%)

Reviewed By: ajkr

Differential Revision: D34181736

Pulled By: pdillinger

fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94
2022-04-13 07:34:01 -07:00
Andrew Kryczka babe56ddba Add rate limiter priority to ReadOptions (#9424)
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.

`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.

There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).

The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424

Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
  - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
  - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`

Reviewed By: hx235

Differential Revision: D33747386

Pulled By: ajkr

fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
2022-02-16 23:18:14 -08:00
Andrew Kryczka ee239df351 Initialize cache dumper `DumpUnit` in constructor (#9014)
Summary:
Should fix clang-analyze:

```
utilities/cache_dump_load_impl.cc:296:38: warning: The left operand of '!=' is a garbage value
  while (io_s.ok() && dump_unit.type != CacheDumpUnitType::kFooter) {
                      ~~~~~~~~~~~~~~ ^
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9014

Reviewed By: zhichao-cao

Differential Revision: D31546912

Pulled By: ajkr

fbshipit-source-id: a2e0dc7874e8c1c6abf190862b5d49e6a6ad6d01
2021-10-11 13:05:35 -07:00
Zhichao Cao 699f45049d Introduce a mechanism to dump out blocks from block cache and re-insert to secondary cache (#8912)
Summary:
Background: Cache warming up will cause potential read performance degradation due to reading blocks from storage to the block cache. Since in production, the workload and access pattern to a certain DB is stable, it is a potential solution to dump out the blocks belonging to a certain DB to persist storage (e.g., to a file) and bulk-load the blocks to Secondary cache before the DB is relaunched. For example, when migrating a DB form host A to host B, it will take a short period of time, the access pattern to blocks in the block cache will not change much. It is efficient to dump out the blocks of certain DB, migrate to the destination host and insert them to the Secondary cache before we relaunch the DB.

Design: we introduce the interface of CacheDumpWriter and CacheDumpRead for user to store the blocks dumped out from block cache. RocksDB will encode all the information and send the string to the writer. User can implement their own writer it they want. CacheDumper and CacheLoad are introduced to save the blocks and load the blocks respectively.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8912

Test Plan: add new tests to lru_cache_test and pass make check.

Reviewed By: pdillinger

Differential Revision: D31452871

Pulled By: zhichao-cao

fbshipit-source-id: 11ab4f5d03e383f476947116361d54188d36ec48
2021-10-07 11:42:31 -07:00