Commit graph

18 commits

Author SHA1 Message Date
Levi Tamasi fe56cb9aa0 Eliminate some allocations/copies around the blob cache (#10647)
Summary:
Historically, `BlobFileReader` has returned the blob(s) read from the file
in the `PinnableSlice` provided by the client. This interface was
preserved when caching was implemented for blobs, which meant that
the blob data was copied multiple times when caching was in use: first,
into the client-provided `PinnableSlice` (by `BlobFileReader::SaveValue`),
and then, into the object stored in the cache (by `BlobSource::PutBlobIntoCache`).
The patch eliminates these copies and the related allocations by changing
`BlobFileReader` so it returns its results in the form of heap-allocated `BlobContents`
objects that can be directly inserted into the cache. The allocations backing
these `BlobContents` objects are made using the blob cache's allocator if the
blobs are to be inserted into the cache (i.e. if a cache is configured and the
`fill_cache` read option is set). Note: this PR focuses on the common case when
blobs are compressed; some further small optimizations are possible for uncompressed
blobs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10647

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D39335185

Pulled By: ltamasi

fbshipit-source-id: 464503d60a5520d654c8273ffb8efd5d1bcd7b36
2022-09-08 12:40:18 -07:00
Bo Wang d490bfcdb6 Avoid recompressing cold block in CompressedSecondaryCache (#10527)
Summary:
**Summary:**
When a block is firstly `Lookup` from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don’t erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again, we erase it from the secondary cache and add it into the primary cache.

When a block is firstly evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.

**Implementation Details**
Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0.  (refs >= 1 && in_cache == false && IS_STANDALONE == true)

The behaviors of  `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache:
1. If a handle is found in primary cache:
  1.1. If the handle's value is not nullptr, it is returned immediately.
  1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
    - 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
    - 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
2. If a handle is not found in primary cache:
  2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
  2.2.  If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block)  and return a standalone handle.

The behaviors of `LRUCacheShard::Promote()` are updated as follows:
1. If `e->sec_handle` has value, one of the following steps can happen:
  1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle.
  1.2. Insert the item into the primary cache and return the handle to caller.
  1.3. Exception handling.
3. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released.

The behavior of  `CompressedSecondaryCache::Insert()` is updated:
1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
4. If a dummy item is found for a block, the block is inserted into the secondary cache.

The behavior of  `CompressedSecondaryCache:::Lookup()` is updated:
1. If a handle is not found or it is a dummy item, a nullptr is returned.
2. If `erase_handle` is true, the handle is erased.

The behaviors of  `LRUCacheShard::Release()` are adjusted for the standalone handles.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527

Test Plan:
1. stress tests.
5. unit tests.
6. CPU profiling for db_bench.

Reviewed By: siying

Differential Revision: D38747613

Pulled By: gitbw95

fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca
2022-09-07 19:00:27 -07:00
Levi Tamasi c8543296ca Support custom allocators for the blob cache (#10628)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10628

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D39228165

Pulled By: ltamasi

fbshipit-source-id: 591fdff08db400b170b26f0165551f86d33c1dbf
2022-09-06 13:31:48 -07:00
Levi Tamasi b07217da04 Pin the newly cached blob after insert (#10625)
Summary:
With the current code, when a blob isn't found in the cache and gets read
from the blob file and then inserted into the cache, the application gets
passed the self-contained `PinnableSlice` resulting from the blob file read.
The patch changes this so that the `PinnableSlice` pins the cache entry
instead in this case.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10625

Test Plan: `make check`

Reviewed By: pdillinger

Differential Revision: D39220904

Pulled By: ltamasi

fbshipit-source-id: cb9c62881e3523b1e9f614e00bf503bac2fe3b0a
2022-09-01 16:25:46 -07:00
Levi Tamasi 23376aa576 Improve the accounting of memory used by cached blobs (#10583)
Summary:
The patch improves the bookkeeping around the memory usage of
cached blobs in two ways: 1) it uses `malloc_usable_size`, which accounts
for allocator bin sizes etc., and 2) it also considers the memory usage
of the `BlobContents` object in addition to the blob itself. Note: some unit
tests had been relying on the cache charge being equal to the size of the
cached blob; these were updated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10583

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D39060680

Pulled By: ltamasi

fbshipit-source-id: 3583adce2b4ce6e84861f3fadccbfd2e5a3cc482
2022-08-26 15:53:08 -07:00
Levi Tamasi 3f57d84af4 Introduce a dedicated class to represent blob values (#10571)
Summary:
The patch introduces a new class called `BlobContents`, which represents
a single uncompressed blob value. We currently use `std::string` for this
purpose; `BlobContents` is somewhat smaller but the primary reason for a
dedicated class is that it enables certain improvements and optimizations
like eliding a copy when inserting a blob into the cache, using custom
allocators, or more control over and better accounting of the memory usage
of cached blobs (see https://github.com/facebook/rocksdb/issues/10484).
(We plan to implement these in subsequent PRs.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10571

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D39000965

Pulled By: ltamasi

fbshipit-source-id: f296eddf9dec4fc3e11cad525b462bdf63c78f96
2022-08-25 16:45:48 -07:00
Levi Tamasi 78bbdef530 Fix a typo in BlobSecondaryCacheTest (#10566)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10566

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D38989926

Pulled By: ltamasi

fbshipit-source-id: 6402635fe745e4e7eb3083ef9ad9f04c0177d762
2022-08-24 13:08:43 -07:00
Gang Liao 275cd80cdb Add a blob-specific cache priority (#10461)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461

Reviewed By: siying

Differential Revision: D38672823

Pulled By: ltamasi

fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
2022-08-12 17:59:06 -07:00
Peter Dillinger 86a1e3e0e7 Derive cache keys from SST unique IDs (#10394)
Summary:
... so that cache keys can be derived from DB manifest data
before reading the file from storage--so that every part of the file
can potentially go in a persistent cache.

See updated comments in cache_key.cc for technical details. Importantly,
the new cache key encoding uses some fancy but efficient math to pack
data into the cache key without depending on the sizes of the various
pieces. This simplifies some existing code creating cache keys, like
cache warming before the file size is known.

This should provide us an essentially permanent mapping between SST
unique IDs and base cache keys, with the ability to "upgrade" SST
unique IDs (and thus cache keys) with new SST format_versions.

These cache keys are of similar, perhaps indistinguishable quality to
the previous generation. Before this change (see "corrected" days
between collision):

```
./cache_bench -stress_cache_key -sck_keep_bits=43
18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
```

After this change (keep 43 bits, up through 50, to validate "trajectory"
is ok on "corrected" days between collision):
```
19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
```

However, the change does modify (probably weaken) the "guaranteed unique" promise from this

> SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86

to this (see https://github.com/facebook/rocksdb/issues/10388)

> With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits

I don't think this is a practical concern, though.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394

Test Plan: unit tests updated, see simulation results above

Reviewed By: jay-zhuang

Differential Revision: D38667529

Pulled By: pdillinger

fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
2022-08-12 13:49:49 -07:00
Peter Dillinger 65036e4217 Revert "Add a blob-specific cache priority (#10309)" (#10434)
Summary:
This reverts commit 8d178090be
because of a clear performance regression seen in internal dashboard
https://fburl.com/unidash/tpz75iee

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10434

Reviewed By: ltamasi

Differential Revision: D38256373

Pulled By: pdillinger

fbshipit-source-id: 134aa00f50dd7b1bbe037c227884a351342ec44b
2022-07-29 07:18:15 -07:00
Gang Liao 8d178090be Add a blob-specific cache priority (#10309)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309

Reviewed By: ltamasi

Differential Revision: D38211655

Pulled By: gangliao

fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
2022-07-27 19:09:24 -07:00
Gang Liao 0b6bc101ba Charge blob cache usage against the global memory limit (#10321)
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321

Reviewed By: ltamasi

Differential Revision: D37913590

Pulled By: gangliao

fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
2022-07-18 23:26:57 -07:00
Gang Liao 95ef007adc Support using secondary cache with the blob cache (#10349)
Summary:
RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used. The goals of this task are to add support for using a secondary cache with the blob cache and to measure the potential performance gains using `db_bench`.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10349

Reviewed By: ltamasi

Differential Revision: D37896773

Pulled By: gangliao

fbshipit-source-id: 7804619ce4a44b73d9e11ad606640f9385969c84
2022-07-16 03:54:37 -07:00
Gang Liao 056e08d6c4 Enable blob caching for MultiGetBlob in RocksDB (#10272)
Summary:
- [x] Enabled blob caching for MultiGetBlob in RocksDB
- [x] Refactored MultiGetBlob logic and interface in RocksDB
- [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource
- [x] Add End-to-end test cases in db_blob_basic_test and also add unit tests in blob_source_test

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10272

Reviewed By: ltamasi

Differential Revision: D37558112

Pulled By: gangliao

fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db
2022-06-30 13:24:35 -07:00
Gang Liao d7ebb58cb5 Add blob cache tickers, perf context statistics, and DB properties (#10203)
Summary:
In order to be able to monitor the performance of the new blob cache, we made the follow changes:
- Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics)
- Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context)
- Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10203

Reviewed By: ltamasi

Differential Revision: D37478658

Pulled By: gangliao

fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40
2022-06-28 13:52:35 -07:00
Gang Liao a1eb02f089 Change the semantics of bytes_read in GetBlob/MultiGetBlob for consistency (#10248)
Summary:
The `bytes_read` returned by the current BlobSource interface is ambiguous. The uncompressed blob size is returned if the cache hits. The size of the blob read from disk, presumably the compressed version, is returned if the cache misses. Two differing semantics might cause ambiguity and consistency issues. For example, this inconsistency causes the assertion failure (T124246362 and its hot fix is https://github.com/facebook/rocksdb/issues/10249).

This goal is to require that the value of `byte read` always be an on-disk blob record size.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10248

Reviewed By: ltamasi

Differential Revision: D37470292

Pulled By: gangliao

fbshipit-source-id: fbca521b2791d3674dbf2484cea5fcae2fdd94d2
2022-06-27 17:15:21 -07:00
Gang Liao ba1f62ddfb Read from blob cache first when MultiGetBlob() (#10225)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10225

Test Plan:
Add test cases for MultiGetBlob
In this task, we added the new API MultiGetBlob() for BlobSource.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Reviewed By: ltamasi

Differential Revision: D37358364

Pulled By: gangliao

fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6
2022-06-23 13:52:00 -07:00
Gang Liao c965c9ef65 Read blob from blob cache if exists when GetBlob() (#10178)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`.  `MultiGetBlob()` will be included in the next PR.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178

Reviewed By: ltamasi

Differential Revision: D37250507

Pulled By: gangliao

fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b
2022-06-17 15:22:59 -07:00