Commit graph

46 commits

Author SHA1 Message Date
Peter Dillinger 7555243bcf Refactor ShardedCache for more sharing, static polymorphism (#10801)
Summary:
The motivations for this change include
* Free up space in ClockHandle so that we can add data for secondary cache handling while still keeping within single cache line (64 byte) size.
  * This change frees up space by eliminating the need for the `hash` field by making the fixed-size key itself a hash, using a 128-bit bijective (lossless) hash.
* Generally more customizability of ShardedCache (such as hashing) without worrying about virtual call overheads
  * ShardedCache now uses static polymorphism (template) instead of dynamic polymorphism (virtual overrides) for the CacheShard. No obvious performance benefit is seen from the change (as mostly expected; most calls to virtual functions in CacheShard could already be optimized to static calls), but offers more flexibility without incurring the runtime cost of adhering to a common interface (without type parameters or static callbacks).
  * You'll also notice less `reinterpret_cast`ing and other boilerplate in the Cache implementations, as this can go in ShardedCache.

More detail:
* Don't have LRUCacheShard maintain `std::shared_ptr<SecondaryCache>` copies (extra refcount) when LRUCache can be in charge of keeping a `shared_ptr`.
* Renamed `capacity_mutex_` to `config_mutex_` to better represent the scope of what it guards.
* Some preparation for 64-bit hash and indexing in LRUCache, but didn't include the full change because of slight performance regression.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10801

Test Plan:
Unit test updates were non-trivial because of major changes to the ClockCacheShard interface in handling of key vs. hash.

Performance:
Create with `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16`

Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X1000] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=16
```

Before: `readrandom [AVG 150 runs] : 321147 (± 253) ops/sec`
After: `readrandom [AVG 150 runs] : 321530 (± 326) ops/sec`

So possibly ~0.1% improvement.

And with `-cache_type=hyper_clock_cache`:
Before: `readrandom [AVG 30 runs] : 614126 (± 7978) ops/sec`
After: `readrandom [AVG 30 runs] : 645349 (± 8087) ops/sec`

So roughly 5% improvement!

Reviewed By: anand1976

Differential Revision: D40252236

Pulled By: pdillinger

fbshipit-source-id: ff8fc70ef569585edc95bcbaaa0386f61355ae5b
2022-10-18 22:06:57 -07:00
Peter Dillinger e466173d5c Print stack traces on frozen tests in CI (#10828)
Summary:
Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.

For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828

Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI

Reviewed By: hx235

Differential Revision: D40447634

Pulled By: pdillinger

fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
2022-10-18 00:35:35 -07:00
Peter Dillinger 0f91c72adc Call experimental new clock cache HyperClockCache (#10684)
Summary:
This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
* We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
* We want to highlight the key feature: it's fast (especially under parallel load)
* Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
* We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)

Some API detail:
* To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
* Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.

For performance tests see https://github.com/facebook/rocksdb/pull/10626

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684

Test Plan: no interesting functional changes; tests updated

Reviewed By: anand1976

Differential Revision: D39547800

Pulled By: pdillinger

fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
2022-09-16 12:47:29 -07:00
Peter Dillinger 5724348689 Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
  * Duplicate Inserts will sometimes go undetected and the shadow duplicate
    will age out with eviction.
  * In many cases, the older Inserted value for a given cache key will be kept
  (i.e. Insert does not support overwrite).
  * Entries explicitly erased (rather than evicted) might not be freed
  immediately in some rare cases.
  * With strict_capacity_limit=false, capacity limit is not tracked/enforced as
  precisely as LRUCache, but is self-correcting and should only deviate by a
  very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.

## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
  Increment internal ref count at slot
  If possible hit:
    Check flags atomic (and non-atomic fields)
    If cache hit:
      Three distinct updates to 'flags' atomic
      Increment refs for internal-to-external
      Return
  Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
  Increment acquire counter in meta word (optimistic)
  If visible entry (already read meta word):
    If match (read non-atomic fields):
      Return
    Else:
      Decrement acquire counter in meta word
  Else if invisible entry (rare, already read meta word):
    Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
  Use CAS etc. to remove
  Return
Else:
  Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
  Use CAS etc. to remove
  Return
```

## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:

base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change

## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944

4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821

4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)

4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38

4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)

4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37

4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46

Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.

Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56

1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45

1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63

610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5

610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453

610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812

The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)

Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.

233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461

233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402

233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016

89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754

89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293

89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223

^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)

Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125

34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793

34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52

As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:

13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383

13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758

13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27

gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:

13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707

13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626

Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN

Reviewed By: anand1976

Differential Revision: D39368406

Pulled By: pdillinger

fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2022-09-16 00:24:11 -07:00
Bo Wang d490bfcdb6 Avoid recompressing cold block in CompressedSecondaryCache (#10527)
Summary:
**Summary:**
When a block is firstly `Lookup` from the secondary cache, we just insert a dummy block in the primary cache (charging the actual size of the block) and don’t erase the block from the secondary cache. A standalone handle is returned from `Lookup`. Only if the block is hit again, we erase it from the secondary cache and add it into the primary cache.

When a block is firstly evicted from the primary cache to the secondary cache, we just insert a dummy block (size 0) in the secondary cache. When the block is evicted again, it is treated as a hot block and is inserted into the secondary cache.

**Implementation Details**
Add a new state of LRUHandle: The handle is never inserted into the LRUCache (both hash table and LRU list) and it doesn't experience the above three states. The entry can be freed when refs becomes 0.  (refs >= 1 && in_cache == false && IS_STANDALONE == true)

The behaviors of  `LRUCacheShard::Lookup()` are updated if the secondary_cache is CompressedSecondaryCache:
1. If a handle is found in primary cache:
  1.1. If the handle's value is not nullptr, it is returned immediately.
  1.2. If the handle's value is nullptr, this means the handle is a dummy one. For a dummy handle, if it was retrieved from secondary cache, it may still exist in secondary cache.
    - 1.2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
    - 1.2.2. If the handle from secondary cache is valid, erase it from the secondary cache and add it into the primary cache.
2. If a handle is not found in primary cache:
  2.1. If no valid handle can be `Lookup` from secondary cache, return nullptr.
  2.2.  If the handle from secondary cache is valid, insert a dummy block in the primary cache (charging the actual size of the block)  and return a standalone handle.

The behaviors of `LRUCacheShard::Promote()` are updated as follows:
1. If `e->sec_handle` has value, one of the following steps can happen:
  1.1. Insert a dummy handle and return a standalone handle to caller when `secondary_cache_` is `CompressedSecondaryCache` and e is a standalone handle.
  1.2. Insert the item into the primary cache and return the handle to caller.
  1.3. Exception handling.
3. If `e->sec_handle` has no value, mark the item as not in cache and charge the cache as its only metadata that'll shortly be released.

The behavior of  `CompressedSecondaryCache::Insert()` is updated:
1. If a block is evicted from the primary cache for the first time, a dummy item is inserted.
4. If a dummy item is found for a block, the block is inserted into the secondary cache.

The behavior of  `CompressedSecondaryCache:::Lookup()` is updated:
1. If a handle is not found or it is a dummy item, a nullptr is returned.
2. If `erase_handle` is true, the handle is erased.

The behaviors of  `LRUCacheShard::Release()` are adjusted for the standalone handles.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10527

Test Plan:
1. stress tests.
5. unit tests.
6. CPU profiling for db_bench.

Reviewed By: siying

Differential Revision: D38747613

Pulled By: gitbw95

fbshipit-source-id: 74a1eba7e1957c9affb2bd2ae3e0194584fa6eca
2022-09-07 19:00:27 -07:00
Gang Liao 275cd80cdb Add a blob-specific cache priority (#10461)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461

Reviewed By: siying

Differential Revision: D38672823

Pulled By: ltamasi

fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
2022-08-12 17:59:06 -07:00
Peter Dillinger 86a1e3e0e7 Derive cache keys from SST unique IDs (#10394)
Summary:
... so that cache keys can be derived from DB manifest data
before reading the file from storage--so that every part of the file
can potentially go in a persistent cache.

See updated comments in cache_key.cc for technical details. Importantly,
the new cache key encoding uses some fancy but efficient math to pack
data into the cache key without depending on the sizes of the various
pieces. This simplifies some existing code creating cache keys, like
cache warming before the file size is known.

This should provide us an essentially permanent mapping between SST
unique IDs and base cache keys, with the ability to "upgrade" SST
unique IDs (and thus cache keys) with new SST format_versions.

These cache keys are of similar, perhaps indistinguishable quality to
the previous generation. Before this change (see "corrected" days
between collision):

```
./cache_bench -stress_cache_key -sck_keep_bits=43
18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
```

After this change (keep 43 bits, up through 50, to validate "trajectory"
is ok on "corrected" days between collision):
```
19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
```

However, the change does modify (probably weaken) the "guaranteed unique" promise from this

> SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86

to this (see https://github.com/facebook/rocksdb/issues/10388)

> With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits

I don't think this is a practical concern, though.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394

Test Plan: unit tests updated, see simulation results above

Reviewed By: jay-zhuang

Differential Revision: D38667529

Pulled By: pdillinger

fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
2022-08-12 13:49:49 -07:00
Peter Dillinger 65036e4217 Revert "Add a blob-specific cache priority (#10309)" (#10434)
Summary:
This reverts commit 8d178090be
because of a clear performance regression seen in internal dashboard
https://fburl.com/unidash/tpz75iee

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10434

Reviewed By: ltamasi

Differential Revision: D38256373

Pulled By: pdillinger

fbshipit-source-id: 134aa00f50dd7b1bbe037c227884a351342ec44b
2022-07-29 07:18:15 -07:00
Gang Liao 8d178090be Add a blob-specific cache priority (#10309)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309

Reviewed By: ltamasi

Differential Revision: D38211655

Pulled By: gangliao

fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
2022-07-27 19:09:24 -07:00
Guido Tagliavini Ponce 9d7de6517c Towards a production-quality ClockCache (#10418)
Summary:
In this PR we bring ClockCache closer to production quality. We implement the following changes:
1. Fixed a few bugs in ClockCache.
2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table.
3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test where either the LRU policy is required, and a test that dynamically increases the table capacity.
4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.)
5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache.

As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418

Test Plan:
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``

Reviewed By: pdillinger

Differential Revision: D38170673

Pulled By: guidotag

fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a
2022-07-26 17:42:03 -07:00
Gang Liao 0b6bc101ba Charge blob cache usage against the global memory limit (#10321)
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321

Reviewed By: ltamasi

Differential Revision: D37913590

Pulled By: gangliao

fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
2022-07-18 23:26:57 -07:00
Guido Tagliavini Ponce c277aeb42c Midpoint insertions in ClockCache (#10305)
Summary:
When an element is first inserted into the ClockCache, it is now assigned either medium or high clock priority, depending on whether its cache priority is low or high, respectively. This is a variant of LRUCache's midpoint insertions. The main difference is that LRUCache can specify the allocated capacity for high-priority elements via the ``high_pri_pool_ratio`` parameter. Contrarily, in ClockCache, low- and high-priority elements compete for all cache slots, and one group can take over the other (of course, it takes more low-priority insertions to push out high-priority elements). However, just as LRUCache, ClockCache provides the following guarantee: a high-priority element will not be evicted before a low-priority element that was inserted earlier in time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10305

Test Plan: ``make -j24 check``

Reviewed By: pdillinger

Differential Revision: D37607787

Pulled By: guidotag

fbshipit-source-id: 24d9f2523d2f4e6415e7f0029cc061fa275c2040
2022-07-06 18:28:35 -07:00
Guido Tagliavini Ponce 54f678cd86 Fix CalcHashBits (#10295)
Summary:
We fix two bugs in CalcHashBits. The first one is an off-by-one error: the desired number of table slots is the real number ``capacity / (kLoadFactor * handle_charge)``, which should not be rounded down. The second one is that we should disallow inputs that set the element charge to 0, namely ``estimated_value_size == 0 && metadata_charge_policy == kDontChargeCacheMetadata``.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10295

Test Plan: CalcHashBits is tested by CalcHashBitsTest (in lru_cache_test.cc). The test now iterates over many more inputs; it covers, in particular, the rounding error edge case. Overall, the test is now more robust. Run ``make -j24 check``.

Reviewed By: pdillinger

Differential Revision: D37573797

Pulled By: guidotag

fbshipit-source-id: ea4f4439f7196ab1c1afb88f566fe92850537262
2022-07-01 20:51:20 -07:00
Guido Tagliavini Ponce 57a0e2f304 Clock cache (#10273)
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.

Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273

Test Plan: ``make -j24 check``

Reviewed By: pdillinger

Differential Revision: D37522461

Pulled By: guidotag

fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
2022-06-29 21:50:39 -07:00
Guido Tagliavini Ponce c6055cba30 Calculate table size of FastLRUCache more accurately (#10235)
Summary:
Calculate the required size of the hash table in FastLRUCache more accurately.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10235

Test Plan: ``make -j24 check``

Reviewed By: gitbw95

Differential Revision: D37460546

Pulled By: guidotag

fbshipit-source-id: 7945128d6f002832f8ed922ef0151919f4350854
2022-06-27 21:04:59 -07:00
Guido Tagliavini Ponce f105e1a501 Make the per-shard hash table fixed-size. (#10154)
Summary:
We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.

Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154

Test Plan: `make -j24 check`

Reviewed By: pdillinger

Differential Revision: D37124451

Pulled By: guidotag

fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
2022-06-13 20:29:00 -07:00
Guido Tagliavini Ponce 415200d792 Assume fixed size key (#10137)
Summary:
FastLRUCache now only supports 16B keys. The tests have changed to reflect this.

Because the unit tests were designed for caches that accept any string as keys, some tests are no longer compatible with FastLRUCache. We have disabled those for runs with FastLRUCache. (We could potentially change all tests to use 16B keys, but we don't because the cache public API does not require this.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10137

Test Plan: make -j24 check

Reviewed By: gitbw95

Differential Revision: D37083934

Pulled By: guidotag

fbshipit-source-id: be1719cf5f8364a9a32bc4555bce1a0de3833b0d
2022-06-10 19:12:18 -07:00
gitbw95 f241d082b6 Prevent double caching in the compressed secondary cache (#9747)
Summary:
###  **Summary:**
When both LRU Cache and CompressedSecondaryCache are configured together, there possibly are some data blocks double cached.

**Changes include:**
1. Update IS_PROMOTED to IS_IN_SECONDARY_CACHE to prevent confusions.
2. This PR updates SecondaryCacheResultHandle and use IsErasedFromSecondaryCache to determine whether the handle is erased in the secondary cache. Then, the caller can determine whether to SetIsInSecondaryCache().
3. Rename LRUSecondaryCache to CompressedSecondaryCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9747

Test Plan:
**Test Scripts:**
1. Populate a DB. The on disk footprint is 482 MB. The data is set to be 50% compressible, so the total decompressed size is expected to be 964 MB.
./db_bench --benchmarks=fillrandom --num=10000000 -db=/db_bench_1

2. overwrite it to a stable state:
./db_bench --benchmarks=overwrite,stats --num=10000000 -use_existing_db -duration=10 --benchmark_write_rate_limit=2000000 -db=/db_bench_1

4. Run read tests with diffeernt cache setting:

T1:
./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000  --statistics -db=/db_bench_1

T2:
./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=320000000 -compressed_secondary_cache_size=400000000 --statistics -use_compressed_secondary_cache -db=/db_bench_1

T3:
./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000 -compressed_secondary_cache_size=400000000 --statistics -use_compressed_secondary_cache -db=/db_bench_1

T4:
./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=20000000 -compressed_secondary_cache_size=500000000 --statistics -use_compressed_secondary_cache -db=/db_bench_1

**Before this PR**
| Cache Size | Compressed Secondary Cache Size | Cache Hit Rate |
|------------|-------------------------------------|----------------|
|520 MB | 0 MB | 85.5% |
|320 MB | 400 MB | 96.2% |
|520 MB | 400 MB | 98.3% |
|20 MB | 500 MB | 98.8% |

**Before this PR**
| Cache Size | Compressed Secondary Cache Size | Cache Hit Rate |
|------------|-------------------------------------|----------------|
|520 MB | 0 MB | 85.5% |
|320 MB | 400 MB | 99.9% |
|520 MB | 400 MB | 99.9% |
|20 MB | 500 MB | 99.2% |

Reviewed By: anand1976

Differential Revision: D35117499

Pulled By: gitbw95

fbshipit-source-id: ea2657749fc13efebe91a8a1b56bc61d6a224a12
2022-04-11 13:28:33 -07:00
Andrew Kryczka 54fb2a8975 Change type of cache buffer passed to Cache::CreateCallback() to const void* (#9595)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9595

Reviewed By: riversand963

Differential Revision: D34329906

Pulled By: ajkr

fbshipit-source-id: 508601856fa9bee4d40f4a68d14d333ef2143d40
2022-02-17 21:09:56 -08:00
Peter Dillinger fd3e0f43b3 Require C++17 (#9481)
Summary:
Drop support for some old compilers by requiring C++17 standard
(or higher). See https://github.com/facebook/rocksdb/issues/9388

First modification based on this is to remove some conditional compilation in slice.h (also
better for ODR)

Also in this PR:
* Fix some Makefile formatting that seems to affect ASSERT_STATUS_CHECKED config in
some cases
* Add c_test to NON_PARALLEL_TEST in Makefile
* Fix a clang-analyze reported "potential leak" in lru_cache_test
* Better "compatibility" definition of DEFINE_uint32 for old versions of gflags
* Fix a linking problem with shared libraries in Makefile (`./random_test: error while loading shared libraries: librocksdb.so.6.29: cannot open shared object file: No such file or directory`)
* Always set ROCKSDB_SUPPORT_THREAD_LOCAL and use thread_local (from C++11)
  * TODO in later PR: clean up that obsolete flag
* Fix a cosmetic typo in c.h (https://github.com/facebook/rocksdb/issues/9488)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9481

Test Plan:
CircleCI config substantially updated.

* Upgrade to latest Ubuntu images for each release
* Generally prefer Ubuntu 20, but keep a couple Ubuntu 16 builds with oldest supported
compilers, to ensure compatibility
* Remove .circleci/cat_ignore_eagain except for Ubuntu 16 builds, because this is to work
around a kernel bug that should not affect anything but Ubuntu 16.
* Remove designated gcc-9 build, because the default linux build now uses GCC 9 from
Ubuntu 20.
* Add some `apt-key add` to fix some apt "couldn't be verified" errors
* Generally drop SKIP_LINK=1; work-around no longer needed
* Generally `add-apt-repository` before `apt-get update` as manual testing indicated the
reverse might not work.

Travis:
* Use gcc-7 by default (remove specific gcc-7 and gcc-4.8 builds)
* TODO in later PR: fix s390x "Assembler messages: Error: invalid switch -march=z14" failure

AppVeyor:
* Completely dropped because we are dropping VS2015 support and CircleCI covers
VS >= 2017

Also local testing with old gflags (out of necessity when using ROCKSDB_NO_FBCODE=1).

Reviewed By: mrambacher

Differential Revision: D33946377

Pulled By: pdillinger

fbshipit-source-id: ae077c823905b45370a26c0103ada119459da6c1
2022-02-04 17:13:10 -08:00
Peter Dillinger 0050a73a4f New stable, fixed-length cache keys (#9126)
Summary:
This change standardizes on a new 16-byte cache key format for
block cache (incl compressed and secondary) and persistent cache (but
not table cache and row cache).

The goal is a really fast cache key with practically ideal stability and
uniqueness properties without external dependencies (e.g. from FileSystem).
A fixed key size of 16 bytes should enable future optimizations to the
concurrent hash table for block cache, which is a heavy CPU user /
bottleneck, but there appears to be measurable performance improvement
even with no changes to LRUCache.

This change replaces a lot of disjointed and ugly code handling cache
keys with calls to a simple, clean new internal API (cache_key.h).
(Preserving the old cache key logic under an option would be very ugly
and likely negate the performance gain of the new approach. Complete
replacement carries some inherent risk, but I think that's acceptable
with sufficient analysis and testing.)

The scheme for encoding new cache keys is complicated but explained
in cache_key.cc.

Also: EndianSwapValue is moved to math.h to be next to other bit
operations. (Explains some new include "math.h".) ReverseBits operation
added and unit tests added to hash_test for both.

Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126

Test Plan:
### Basic correctness
Several tests needed updates to work with the new functionality, mostly
because we are no longer relying on filesystem for stable cache keys
so table builders & readers need more context info to agree on cache
keys. This functionality is so core, a huge number of existing tests
exercise the cache key functionality.

### Performance
Create db with
`TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters`
And test performance with
`TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4`
using DEBUG_LEVEL=0 and simultaneous before & after runs.
Before ops/sec, avg over 100 runs: 121924
After ops/sec, avg over 100 runs: 125385 (+2.8%)

### Collision probability
I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity
over many months, by making some pessimistic simplifying assumptions:
* Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys)
* All of every file is cached for its entire lifetime

We use a simple table with skewed address assignment and replacement on address collision
to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output
with `./cache_bench -stress_cache_key -sck_keep_bits=40`:

```
Total cache or DBs size: 32TiB  Writing 925.926 MiB/s or 76.2939TiB/day
Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached)
```

These come from default settings of 2.5M files per day of 32 MB each, and
`-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of
the 128-bit cache key.  With file size of 2\*\*25 contiguous keys (pessimistic), our simulation
is about 2\*\*(128-40-25) or about 9 billion billion times more prone to collision than reality.

More default assumptions, relatively pessimistic:
* 100 DBs in same process (doesn't matter much)
* Re-open DB in same process (new session ID related to old session ID) on average
every 100 files generated
* Restart process (all new session IDs unrelated to old) 24 times per day

After enough data, we get a result at the end:

```
(keep 40 bits)  17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected)
```

If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data:

```
(keep 41 bits)  16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected)
(keep 42 bits)  19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected)
```

The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases:

```
197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected)
```

I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data.

Reviewed By: zhichao-cao

Differential Revision: D33171746

Pulled By: pdillinger

fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f
2021-12-16 17:15:13 -08:00
Zhichao Cao 6d93b87588 Add lowest_used_cache_tier to ImmutableDBOptions to enable or disable Secondary Cache (#9050)
Summary:
Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By set it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050

Test Plan: added new tests

Reviewed By: anand1976

Differential Revision: D31744769

Pulled By: zhichao-cao

fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6
2021-10-19 15:54:23 -07:00
Zhichao Cao 699f45049d Introduce a mechanism to dump out blocks from block cache and re-insert to secondary cache (#8912)
Summary:
Background: Cache warming up will cause potential read performance degradation due to reading blocks from storage to the block cache. Since in production, the workload and access pattern to a certain DB is stable, it is a potential solution to dump out the blocks belonging to a certain DB to persist storage (e.g., to a file) and bulk-load the blocks to Secondary cache before the DB is relaunched. For example, when migrating a DB form host A to host B, it will take a short period of time, the access pattern to blocks in the block cache will not change much. It is efficient to dump out the blocks of certain DB, migrate to the destination host and insert them to the Secondary cache before we relaunch the DB.

Design: we introduce the interface of CacheDumpWriter and CacheDumpRead for user to store the blocks dumped out from block cache. RocksDB will encode all the information and send the string to the writer. User can implement their own writer it they want. CacheDumper and CacheLoad are introduced to save the blocks and load the blocks respectively.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8912

Test Plan: add new tests to lru_cache_test and pass make check.

Reviewed By: pdillinger

Differential Revision: D31452871

Pulled By: zhichao-cao

fbshipit-source-id: 11ab4f5d03e383f476947116361d54188d36ec48
2021-10-07 11:42:31 -07:00
anand76 f35042ca40 Add a PerfContext counter for secondary cache hits (#8685)
Summary:
Add a PerfContext counter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8685

Reviewed By: zhichao-cao

Differential Revision: D30453957

Pulled By: anand1976

fbshipit-source-id: 42888a3ced240e1c44446d52d3b04adfb01f5665
2021-08-20 15:17:30 -07:00
anand76 add68bd28a Add a stat to count secondary cache hits (#8666)
Summary:
Add a stat for secondary cache hits. The ```Cache::Lookup``` API had an unused ```stats``` parameter. This PR uses that to pass the pointer to a ```Statistics``` object that ```LRUCache``` uses to record the stat.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8666

Test Plan: Update a unit test in lru_cache_test

Reviewed By: zhichao-cao

Differential Revision: D30353816

Pulled By: anand1976

fbshipit-source-id: 2046f78b460428877a26ffdd2bb914ae47dfbe77
2021-08-16 21:01:14 -07:00
Drewryz 3b27725245 Fix a minor issue with initializing the test path (#8555)
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555

Reviewed By: ajkr

Differential Revision: D29758399

Pulled By: jay-zhuang

fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
2021-07-23 08:38:45 -07:00
mrambacher 570248aeff Make SecondaryCache Customizable (#8480)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8480

Reviewed By: zhichao-cao

Differential Revision: D29528740

Pulled By: mrambacher

fbshipit-source-id: fd0f70d15f66611c8498257a9973f7e98ca13839
2021-07-06 09:18:08 -07:00
anand76 8ea0a2c1bd Parallelize secondary cache lookup in MultiGet (#8405)
Summary:
Implement the ```WaitAll()``` interface in ```LRUCache``` to allow callers to issue multiple lookups in parallel and wait for all of them to complete. Modify ```MultiGet``` to use this to parallelize the secondary cache lookups in order to reduce the overall latency. A call to ```cache->Lookup()``` returns a handle that has an incomplete value (nullptr), and the caller can call ```cache->IsReady()``` to check whether the lookup is complete, and pass a vector of handles to ```WaitAll``` to wait for completion. If any of the lookups fail, ```MultiGet``` will read the block from the SST file.

Another change in this PR is to rename ```SecondaryCacheHandle``` to ```SecondaryCacheResultHandle``` as it more accurately describes the return result of the secondary cache lookup, which is more like a future.

Tests:
1. Add unit tests in lru_cache_test
2. Benchmark results with no secondary cache configured
Master -
```
readrandom   :      41.175 micros/op 388562 ops/sec;  106.7 MB/s (7277999 of 7277999 found)
readrandom   :      41.217 micros/op 388160 ops/sec;  106.6 MB/s (7274999 of 7274999 found)
multireadrandom :      10.309 micros/op 1552082 ops/sec; (28908992 of 28908992 found)
multireadrandom :      10.321 micros/op 1550218 ops/sec; (29081984 of 29081984 found)
```

This PR -
```
readrandom   :      41.158 micros/op 388723 ops/sec;  106.8 MB/s (7290999 of 7290999 found)
readrandom   :      41.185 micros/op 388463 ops/sec;  106.7 MB/s (7287999 of 7287999 found)
multireadrandom :      10.277 micros/op 1556801 ops/sec; (29346944 of 29346944 found)
multireadrandom :      10.253 micros/op 1560539 ops/sec; (29274944 of 29274944 found)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8405

Reviewed By: zhichao-cao

Differential Revision: D29190509

Pulled By: anand1976

fbshipit-source-id: 6f8eff6246712af8a297cfe22ea0d1c3b2a01bb0
2021-06-18 09:35:59 -07:00
Zhichao Cao f44e69c64a Use DbSessionId as cache key prefix when secondary cache is enabled (#8360)
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360

Test Plan: add the test to lru_cache_test

Reviewed By: pdillinger

Differential Revision: D29006215

Pulled By: zhichao-cao

fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
2021-06-10 11:02:43 -07:00
Zhichao Cao a4405fd981 fix lru caching test and fix reference binding to null pointer (#8326)
Summary:
Fix for https://github.com/facebook/rocksdb/issues/8315. Inhe lru caching test, 5100 is not enough to hold meta block and first block in some random case, increase to 6100. Fix the reference binding to null pointer, use template.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8326

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D28625666

Pulled By: zhichao-cao

fbshipit-source-id: 97b85306ae3d09bfb74addc7c65e57fe55a976a5
2021-05-24 08:37:00 -07:00
Zhichao Cao 7303d02bdf Use new Insert and Lookup APIs in table reader to support secondary cache (#8315)
Summary:
Secondary cache is implemented to achieve the secondary cache tier for block cache. New Insert and Lookup APIs are introduced in https://github.com/facebook/rocksdb/issues/8271  . To support and use the secondary cache in block based table reader, this PR introduces the corresponding callback functions that will be used in secondary cache, and update the Insert and Lookup APIs accordingly.

benchmarking:
./db_bench --benchmarks="fillrandom" -num=1000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/tmp/rocks_t/db -partition_index_and_filters=true

./db_bench -db=/tmp/rocks_t/db -use_existing_db=true -benchmarks=readrandom -num=1000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=5 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -stats_dump_period_sec=30 -reads=50000000

master benchmarking results:
readrandom   :       3.923 micros/op 254881 ops/sec;   33.4 MB/s (23849796 of 50000000 found)
rocksdb.db.get.micros P50 : 2.820992 P95 : 5.636716 P99 : 16.450553 P100 : 8396.000000 COUNT : 50000000 SUM : 179947064

Current PR benchmarking results
readrandom   :       4.083 micros/op 244925 ops/sec;   32.1 MB/s (23849796 of 50000000 found)
rocksdb.db.get.micros P50 : 2.967687 P95 : 5.754916 P99 : 15.665912 P100 : 8213.000000 COUNT : 50000000 SUM : 187250053

About 3.8% throughput reduction.
P50: 5.2% increasing, P95, 2.09% increasing, P99 4.77% improvement

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8315

Test Plan: added the testing case

Reviewed By: anand1976

Differential Revision: D28599774

Pulled By: zhichao-cao

fbshipit-source-id: 098c4df0d7327d3a546df7604b2f1602f13044ed
2021-05-21 18:29:12 -07:00
anand76 feb06e83b2 Initial support for secondary cache in LRUCache (#8271)
Summary:
Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and evicted from memory, its inserted into the secondary cache. On a LRU cache miss and successful lookup in the secondary cache, the item is promoted to the LRU cache. Only support synchronous lookup currently. The secondary cache would be used to implement a persistent (flash cache) or compressed cache.

Tests:
Results from cache_bench and db_bench don't show any regression due to these changes.

cache_bench results before and after this change -
Command
```./cache_bench -ops_per_thread=10000000 -threads=1```
Before
```Complete in 40.688 s; QPS = 245774```
```Complete in 40.486 s; QPS = 246996```
```Complete in 42.019 s; QPS = 237989```
After
```Complete in 40.672 s; QPS = 245869```
```Complete in 44.622 s; QPS = 224107```
```Complete in 42.445 s; QPS = 235599```

db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
Commands
```./db_bench  --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```

```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
Before
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      80.702 micros/op 198104 ops/sec;   54.4 MB/s (3708999 of 3708999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      87.124 micros/op 183625 ops/sec;   50.4 MB/s (3439999 of 3439999 found)
```
After
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      77.653 micros/op 206025 ops/sec;   56.6 MB/s (3866999 of 3866999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom   :      84.962 micros/op 188299 ops/sec;   51.7 MB/s (3535999 of 3535999 found)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271

Reviewed By: zhichao-cao

Differential Revision: D28357511

Pulled By: anand1976

fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
2021-05-13 22:58:40 -07:00
Peter Dillinger 78a309bf86 New Cache API for gathering statistics (#8225)
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:

* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)

I have enhanced cache_bench to validate this approach:

* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.

To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225

Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.

The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.

Baseline typical output:

```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662

Operation latency (ns):
Count: 80000000 Average: 11223.9494  StdDev: 29.61
Min: 0  Median: 7759.3973  Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[       0,       1 ]       68   0.000%   0.000%
(    2900,    4400 ]       89   0.000%   0.000%
(    4400,    6600 ] 33630240  42.038%  42.038% ########
(    6600,    9900 ] 18129842  22.662%  64.700% #####
(    9900,   14000 ]  7877533   9.847%  74.547% ##
(   14000,   22000 ] 15193238  18.992%  93.539% ####
(   22000,   33000 ]  3037061   3.796%  97.335% #
(   33000,   50000 ]  1626316   2.033%  99.368%
(   50000,   75000 ]   421532   0.527%  99.895%
(   75000,  110000 ]    56910   0.071%  99.966%
(  110000,  170000 ]    16134   0.020%  99.986%
(  170000,  250000 ]     5166   0.006%  99.993%
(  250000,  380000 ]     3017   0.004%  99.996%
(  380000,  570000 ]     1337   0.002%  99.998%
(  570000,  860000 ]      805   0.001%  99.999%
(  860000, 1200000 ]      319   0.000% 100.000%
( 1200000, 1900000 ]      231   0.000% 100.000%
( 1900000, 2900000 ]      100   0.000% 100.000%
( 2900000, 4300000 ]       39   0.000% 100.000%
( 4300000, 6500000 ]       16   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
```

New, gather_stats=false. Median thread ops/sec of 5 runs:

```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458

Operation latency (ns):
Count: 80000000 Average: 11298.1027  StdDev: 42.18
Min: 0  Median: 7722.0822  Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[       0,       1 ]      109   0.000%   0.000%
(    2900,    4400 ]      793   0.001%   0.001%
(    4400,    6600 ] 34054563  42.568%  42.569% #########
(    6600,    9900 ] 17482646  21.853%  64.423% ####
(    9900,   14000 ]  7908180   9.885%  74.308% ##
(   14000,   22000 ] 15032072  18.790%  93.098% ####
(   22000,   33000 ]  3237834   4.047%  97.145% #
(   33000,   50000 ]  1736882   2.171%  99.316%
(   50000,   75000 ]   446851   0.559%  99.875%
(   75000,  110000 ]    68251   0.085%  99.960%
(  110000,  170000 ]    18592   0.023%  99.983%
(  170000,  250000 ]     7200   0.009%  99.992%
(  250000,  380000 ]     3334   0.004%  99.997%
(  380000,  570000 ]     1393   0.002%  99.998%
(  570000,  860000 ]      700   0.001%  99.999%
(  860000, 1200000 ]      293   0.000% 100.000%
( 1200000, 1900000 ]      196   0.000% 100.000%
( 1900000, 2900000 ]       69   0.000% 100.000%
( 2900000, 4300000 ]       32   0.000% 100.000%
( 4300000, 6500000 ]       10   0.000% 100.000%
```

New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:

```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551

Operation latency (ns):
Count: 80000000 Average: 11311.2629  StdDev: 45.28
Min: 0  Median: 7686.5458  Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[       0,       1 ]       71   0.000%   0.000%
(    2900,    4400 ]      291   0.000%   0.000%
(    4400,    6600 ] 34492060  43.115%  43.116% #########
(    6600,    9900 ] 16727328  20.909%  64.025% ####
(    9900,   14000 ]  7845828   9.807%  73.832% ##
(   14000,   22000 ] 15510654  19.388%  93.220% ####
(   22000,   33000 ]  3216533   4.021%  97.241% #
(   33000,   50000 ]  1680859   2.101%  99.342%
(   50000,   75000 ]   439059   0.549%  99.891%
(   75000,  110000 ]    60540   0.076%  99.967%
(  110000,  170000 ]    14649   0.018%  99.985%
(  170000,  250000 ]     5242   0.007%  99.991%
(  250000,  380000 ]     3260   0.004%  99.995%
(  380000,  570000 ]     1599   0.002%  99.997%
(  570000,  860000 ]     1043   0.001%  99.999%
(  860000, 1200000 ]      471   0.001%  99.999%
( 1200000, 1900000 ]      275   0.000% 100.000%
( 1900000, 2900000 ]      143   0.000% 100.000%
( 2900000, 4300000 ]       60   0.000% 100.000%
( 4300000, 6500000 ]       27   0.000% 100.000%
( 6500000, 9800000 ]        7   0.000% 100.000%
( 9800000, 14000000 ]        1   0.000% 100.000%

Gather stats latency (us):
Count: 46 Average: 980387.5870  StdDev: 60911.18
Min: 879155  Median: 1033777.7778  Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
(  860000, 1200000 ]       45  97.826%  97.826% ####################
( 1200000, 1900000 ]        1   2.174% 100.000%

Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```

Reviewed By: mrambacher

Differential Revision: D28295742

Pulled By: pdillinger

fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
2021-05-11 16:17:10 -07:00
storagezhang 711881bc25 Fix some typos in comments (#8066)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8066

Reviewed By: jay-zhuang

Differential Revision: D27280799

Pulled By: mrambacher

fbshipit-source-id: 68f91f5af4ffe0a84be581961bf9366887f47702
2021-03-25 21:18:08 -07:00
mrambacher 56f468b356 Add more tests to ASSERT_STATUS_CHECKED (#7211)
Summary:
Added 4 more tests to those which pass ASSERT_STATUS_CHECKED (cache_test, lru_cache_test, filename_test, filelock_test).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7211

Reviewed By: ajkr

Differential Revision: D22982858

Pulled By: zhichao-cao

fbshipit-source-id: acdd071582ed6aa7447ed96c5732f10bf720d783
2020-08-06 17:19:41 -07:00
sdong fdf882ded2 Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433)
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433

Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.

Differential Revision: D19977691

fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
2020-02-20 12:09:57 -08:00
Maysam Yabandeh 638d239507 Charge block cache for cache internal usage (#5797)
Summary:
For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
This PR builds up on https://github.com/facebook/rocksdb/issues/4258
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797

Test Plan:
- Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
- The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.

Differential Revision: D17396833

Pulled By: maysamyabandeh

fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
2019-09-16 15:26:21 -07:00
Siying Dong e9e0101ca4 Move test related files under util/ to test_util/ (#5377)
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377

Differential Revision: D15551366

Pulled By: siying

fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
2019-05-30 11:25:51 -07:00
Levi Tamasi 34f8ac0c99 Make adaptivity of LRU cache mutexes configurable (#5054)
Summary:
The patch adds a new config option to LRUCacheOptions that enables
users to choose whether to use an adaptive mutex for the LRU block
cache (on platforms where adaptive mutexes are supported). The default
is true if RocksDB is compiled with -DROCKSDB_DEFAULT_TO_ADAPTIVE_MUTEX,
false otherwise.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5054

Differential Revision: D14542749

Pulled By: ltamasi

fbshipit-source-id: 0065715ab6cf91f10444b737fed8c8aee6a8a0d2
2019-03-20 12:33:44 -07:00
Michael Liu ca89ac2ba9 Apply modernize-use-override (2nd iteration)
Summary:
Use C++11’s override and remove virtual where applicable.
Change are automatically generated.

Reviewed By: Orvid

Differential Revision: D14090024

fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a
2019-02-14 14:41:36 -08:00
Yi Wu bc7e8d472e LRUCache midpoint insertion
Summary:
Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache.
Closes https://github.com/facebook/rocksdb/pull/3877

Differential Revision: D8100895

Pulled By: yiwu-arbug

fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da
2018-05-24 15:57:33 -07:00
Yi Wu 7a99c04311 refactor constructor of LRUCacheShard
Summary:
Update LRUCacheShard constructor so that adding new params to it don't need to add extra SetXXX() methods.
Closes https://github.com/facebook/rocksdb/pull/3896

Differential Revision: D8128618

Pulled By: yiwu-arbug

fbshipit-source-id: 6afa715de1493a50de413678761a765e3af9b83b
2018-05-23 18:57:42 -07:00
Daniel Black 16e0388205 LRUCacheShard cache line size alignment
Summary:
combining #2568 and #2612.
Closes https://github.com/facebook/rocksdb/pull/2620

Differential Revision: D5464394

Pulled By: IslamAbdelRahman

fbshipit-source-id: 9f71d3058dd6adaf02ce3b2de3a81a1228009778
2017-07-24 10:54:37 -07:00
Siying Dong 3c327ac2d0 Change RocksDB License
Summary: Closes https://github.com/facebook/rocksdb/pull/2589

Differential Revision: D5431502

Pulled By: siying

fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75
2017-07-15 16:11:23 -07:00
Siying Dong d616ebea23 Add GPLv2 as an alternative license.
Summary: Closes https://github.com/facebook/rocksdb/pull/2226

Differential Revision: D4967547

Pulled By: siying

fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4
2017-04-27 18:06:12 -07:00
Siying Dong d2dce5611a Move some files under util/ to separate dirs
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090

Differential Revision: D4833681

Pulled By: siying

fbshipit-source-id: 2fd8bef
2017-04-05 19:09:16 -07:00
Renamed from util/lru_cache_test.cc (Browse further)