Commit graph

1439 commits

Author SHA1 Message Date
Jay Zhuang 00050d4634 Disable tiered storage + BlobDB stress test (#10699)
Summary:
There're 2 knobs to disable blobdb, adding that.
Also print call stack when there's assert failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10699

Reviewed By: gitbw95

Differential Revision: D39596448

Pulled By: jay-zhuang

fbshipit-source-id: 5ce9fd0630d8b6ff1e157a2685a1e80a99997098
2022-09-19 15:39:31 -07:00
gitbw95 2cc5b39560 Add enable_split_merge option for CompressedSecondaryCache (#10690)
Summary:
`enable_custom_split_merge` is added for enabling the custom split and merge feature, which split the compressed value into chunks so that they may better fit jemalloc bins.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10690

Test Plan:
Unit Tests
Stress Tests

Reviewed By: anand1976

Differential Revision: D39567604

Pulled By: gitbw95

fbshipit-source-id: f6d1d46200f365220055f793514601dcb0edc4b7
2022-09-16 15:41:49 -07:00
Peter Dillinger 0f91c72adc Call experimental new clock cache HyperClockCache (#10684)
Summary:
This change establishes a distinctive name for the experimental new lock-free clock cache (originally developed by guidotag and revamped in PR https://github.com/facebook/rocksdb/issues/10626). A few reasons:
* We want to make it clear that this is a fundamentally different implementation vs. the old clock cache, to avoid people saying "I already tried clock cache."
* We want to highlight the key feature: it's fast (especially under parallel load)
* Because it requires an estimated charge per entry, it is not drop-in API compatible with old clock cache. This estimate might always be required for highest performance, and giving it a distinct name should reduce confusion about the distinct API requirements.
* We might develop a variant requiring the same estimate parameter but with LRU eviction. In that case, using the name HyperLRUCache should make things more clear. (FastLRUCache is just a prototype that might soon be removed.)

Some API detail:
* To reduce copy-pasting parameter lists, etc. as in LRUCache construction, I have a `MakeSharedCache()` function on `HyperClockCacheOptions` instead of `NewHyperClockCache()`.
* Changes -cache_type=clock_cache to -cache_type=hyper_clock_cache for applicable tools. I think this is more consistent / sustainable for reasons already stated.

For performance tests see https://github.com/facebook/rocksdb/pull/10626

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10684

Test Plan: no interesting functional changes; tests updated

Reviewed By: anand1976

Differential Revision: D39547800

Pulled By: pdillinger

fbshipit-source-id: 5c0fe1b5cf3cb680ab369b928c8569682b9795bf
2022-09-16 12:47:29 -07:00
Peter Dillinger 5724348689 Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
  * Duplicate Inserts will sometimes go undetected and the shadow duplicate
    will age out with eviction.
  * In many cases, the older Inserted value for a given cache key will be kept
  (i.e. Insert does not support overwrite).
  * Entries explicitly erased (rather than evicted) might not be freed
  immediately in some rare cases.
  * With strict_capacity_limit=false, capacity limit is not tracked/enforced as
  precisely as LRUCache, but is self-correcting and should only deviate by a
  very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.

## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
  Increment internal ref count at slot
  If possible hit:
    Check flags atomic (and non-atomic fields)
    If cache hit:
      Three distinct updates to 'flags' atomic
      Increment refs for internal-to-external
      Return
  Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
  Increment acquire counter in meta word (optimistic)
  If visible entry (already read meta word):
    If match (read non-atomic fields):
      Return
    Else:
      Decrement acquire counter in meta word
  Else if invisible entry (rare, already read meta word):
    Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
  Use CAS etc. to remove
  Return
Else:
  Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
  Use CAS etc. to remove
  Return
```

## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:

base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change

## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944

4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821

4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)

4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38

4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)

4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37

4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46

Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.

Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56

1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45

1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63

610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5

610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453

610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812

The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)

Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.

233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461

233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402

233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016

89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754

89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293

89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223

^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)

Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125

34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793

34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52

As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:

13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383

13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758

13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27

gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:

13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707

13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626

Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN

Reviewed By: anand1976

Differential Revision: D39368406

Pulled By: pdillinger

fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2022-09-16 00:24:11 -07:00
Yanqin Jin 088b9844d4 Re-enable user-defined timestamp and subcompactions (#10689)
Summary:
Hopefully, we can re-enable the combination of user-defined timestamp and subcompactions
after https://github.com/facebook/rocksdb/issues/10658.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10689

Test Plan:
Make sure the following succeeds on devserver.
make crash_test_with_ts

Reviewed By: ltamasi

Differential Revision: D39556558

Pulled By: riversand963

fbshipit-source-id: 4695f420b1bc9ebf3b24640b693746f4db82c149
2022-09-15 20:21:07 -07:00
Levi Tamasi 7dad485278 Support JemallocNodumpAllocator for the block/blob cache in db_bench (#10685)
Summary:
The patch makes it possible to use the `JemallocNodumpAllocator` with the
block/blob caches in `db_bench`. In addition to its stated purpose of excluding
cache contents from core dumps, `JemallocNodumpAllocator` also uses
a dedicated arena and jemalloc tcaches for cache allocations, which can
reduce fragmentation and thus memory usage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10685

Reviewed By: riversand963

Differential Revision: D39552261

Pulled By: ltamasi

fbshipit-source-id: b5c58eab6b7c1baa9a307d9f1248df1d7a77d2b5
2022-09-15 13:44:46 -07:00
Jay Zhuang 1cdc84114f Tiered Storage feature doesn't support BlobDB yet (#10681)
Summary:
Disable the tiered storage + BlobDB test.
Also enable different hot data setting for Tiered compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10681

Reviewed By: ajkr

Differential Revision: D39531941

Pulled By: jay-zhuang

fbshipit-source-id: aa0595eb38d03f17638d300d2e4cc9061429bf61
2022-09-15 08:17:16 -07:00
Akanksha Mahajan 7a9ecdac3c Add auto prefetching parameters to db_bench and db_stress (#10632)
Summary:
Same as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10632

Test Plan: make crash_test -j32

Reviewed By: anand1976

Differential Revision: D39241479

Pulled By: akankshamahajan15

fbshipit-source-id: 5db5b0c007da786bacc1b30d8926d36d6d029b87
2022-09-09 12:52:27 -07:00
Andrew Kryczka 4100eb3053 minor cleanups to db_crashtest.py (#10654)
Summary:
Expanded `all_params` to include all parameters crash test may set. Previously, `atomic_flush` was not included in `all_params` and thus was not visible to `finalize_and_sanitize()`. The consequence was manual crash test runs could provide unsafe combinations of parameters to `db_stress`. For example, running `db_crashtest.py` with `-atomic_flush=0` could cause `db_stress` to run with `-atomic_flush=0 -disable_wal=1`, which is known to produce inconsistencies across column families.

While expanding `all_params`, I found we cannot have an entry in it for both `db_stress` and `db_crashtest.py`. So I renamed `enable_tiered_storage` to `test_tiered_storage` for `db_crashtest.py`, which appears more conventional anyways.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10654

Reviewed By: hx235

Differential Revision: D39369349

Pulled By: ajkr

fbshipit-source-id: 31d9010c760c868b20d5e9bd78ba75c8ff3ce348
2022-09-08 17:39:22 -07:00
Andrew Kryczka ccf822492f Reenable sync_fault_injection in crash test (#10172)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10172

Reviewed By: riversand963

Differential Revision: D37164671

Pulled By: ajkr

fbshipit-source-id: 40eb919b8dc261d502510e878ee8ac7874ab35d0
2022-08-31 14:27:23 -07:00
Hui Xiao e7525a1fff Disable use_txn=1 with sync_fault_injection=1 in db_crashtest.py (#10605)
Summary:
**Context/Summary:**
`ExpectedState` is not aware of transaction-related concept so `use_txn=1 ` is not compatible with `sync_fault_injection=1`. Therefore this PR disabled this combination until we expand our correctness testing to transaction related features.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10605

Test Plan:
- Run the following commands to verify `--use_txn` is correctly sanitized
   - `python3 ./tools/db_crashtest.py blackbox --use_txn=1 --sync_fault_injection=1 `
   - `python3 ./tools/db_crashtest.py blackbox --use_txn=0 --sync_fault_injection=1 `

Reviewed By: ajkr

Differential Revision: D39121287

Pulled By: hx235

fbshipit-source-id: 7d5d6dd32479ea1c07df4f38322650f3a60def9c
2022-08-31 13:16:39 -07:00
Levi Tamasi 228f2c5bf5 Adjust the blob cache printout in db_bench/db_stress (#10614)
Summary:
Currently, `db_bench` and `db_stress` print the blob cache options even if
a shared block/blob cache is configured, i.e. when they are not actually
in effect. The patch changes this so they are only printed when a separate blob
cache is used.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10614

Test Plan: Tested manually using `db_bench` and `db_stress`.

Reviewed By: akankshamahajan15

Differential Revision: D39144603

Pulled By: ltamasi

fbshipit-source-id: f714304c5d46186f8514746c27ee6f52aa3e4af8
2022-08-31 09:55:50 -07:00
Hui Xiao e484b81eee Sync dir containing CURRENT after RenameFile on CURRENT as much as possible (#10573)
Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.

This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.

The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped  (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```

```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```

**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
    -  Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
    - `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573

Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr  "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.

Reviewed By: ajkr

Differential Revision: D39005886

Pulled By: hx235

fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a
2022-08-29 17:35:21 -07:00
bilyz 7670fdd690 fix trace_analyzer_tool args column position (#10576)
Summary:
The column  meaning explanation is not correct according to the parsed human-readable trace file.

Following are the results data from parsed trace human-readable file format.
The key is in the first column.

```
0x00000005 6 1 0 1661317998095439
0x00000007 0 1 0 1661317998095479
0x00000008 6 1 0 1661317998095493
0x0000000300000001 1 1 6 1661317998101508
0x0000000300000000 1 1 6 1661317998101508
0x0000000300000001 0 1 0 1661317998106486
0x0000000300000000 0 1 0 1661317998106498
0x0000000A 6 1 0 1661317998106515
0x00000007 0 1 0 1661317998111887
0x00000001 6 1 0 1661317998111923
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10576

Reviewed By: ajkr

Differential Revision: D39039110

Pulled By: jay-zhuang

fbshipit-source-id: eade6394c7870005b717846af09a848be6f677ce
2022-08-26 08:44:52 -07:00
Alan Paxton 7fbee01f0c CI benchmarks refine configuration (#10514)
Summary:
CI benchmarks refine configuration

Run only “essential” benchmarks, but for longer
Fix (reduce) the NUM_KEYS to ensure cached behaviour
Reduce level size to try to ensure more levels

Refine test durations again, more time per test, but fewer tests.
In CI benchmark mode, the only read test is readrandom.
There are still 3 mostly-read tests.

Goal is to squeeze complete run a little bit inside 1 hour so it doesn’t clash with the next run (cron scheduled for main branch), but it gets to run as long as possible, so that results are as credible as possible.

Reduce thread count to physical capacity, in an attempt to reduce throughput variance for write heavy tests. See Mark Callaghan’s comments in related documentation..

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10514

Reviewed By: ajkr

Differential Revision: D38952469

Pulled By: jay-zhuang

fbshipit-source-id: 72fa6bba897cc47066ced65facd1fd36e28f30a8
2022-08-25 09:47:03 -07:00
Andrew Kryczka d95e376368 Disable db_stress features incompatible with unsynced data dropping when sync_fault_injection=1 (#10559)
Summary:
The features that cannot work with disable_wal=1 due to unsynced data dropping (ingest_external_file_one_in and enable_compaction_filter) similarly cannot work with sync_fault_injection=1. This PR prevents those features from being used together with sync_fault_injection=1.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10559

Reviewed By: hx235

Differential Revision: D38953019

Pulled By: ajkr

fbshipit-source-id: 7e2c7644ec84d7323f632cf976bcee00502d0ed7
2022-08-24 21:50:34 -07:00
Changyu Bi d140fbfd7d Add Iterator test against expected state to stress test (#10538)
Summary:
As mentioned in https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913,
`db_stress` does not have much verification for iterator correctness.
It has a `TestIterate()` function, but that is mainly for comparing results
between two iterators, one with `total_order_seek` and the other optionally
sets auto_prefix, upper/lower bounds. Commit 49a0581ad2462e31aa3f768afa769e0d33390f33
added a new `TestIterateAgainstExpected()` function that compares iterator against
expected state. It locks a range of keys, creates an iterator, does
a random sequence of `Next/Prev` and compares against expected state.
This PR is based on that commit, the main changes include some logs
(for easier debugging if a test fails), a forward and backward scan to
cover the entire locked key range, and a flag for optionally turning on
this version of Iterator testing.

Added constraint that the checks against expected state in
`TestIterateAgainstExpected()` and in `TestGet()` are only turned on
when `--skip_verifydb` flag is not set.
Remove the change log introduced in https://github.com/facebook/rocksdb/issues/10553.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10538

Test Plan:
Run `db_stress` with `--verify_iterator_with_expected_state_one_in=1`,
and a large `--iterpercent` and `--num_iterations`. Checked `op_logs`
manually to ensure expected coverage. Tweaked part of the code in
https://github.com/facebook/rocksdb/issues/10449 and stress test was able to catch it.
- internally run various flavor of crash test

Reviewed By: ajkr

Differential Revision: D38847269

Pulled By: cbi42

fbshipit-source-id: 8b4402a9bba9f6cfa08051943cd672579d489599
2022-08-24 14:59:50 -07:00
Changyu Bi 7b9e970042 Optionally issue DeleteRange in *whilewriting benchmarks (#10552)
Summary:
Optionally issue DeleteRange in `*whilewriting` benchmarks. This happens in `BGWriter` and uses similar logic as in `DoWrite` to issue DeleteRange operations. I added this when I was benchmarking https://github.com/facebook/rocksdb/issues/10547, but this should be an independent PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10552

Test Plan: ran some benchmarks with various delete range options, e.g. `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --writes=200000 --reads=1000000 --disable_auto_compactions --max_num_range_tombstones=10000`

Reviewed By: ajkr

Differential Revision: D38927020

Pulled By: cbi42

fbshipit-source-id: 31ee20cb8127f7173f0816ea0cc2a204ec02aad6
2022-08-23 11:06:09 -07:00
Bo Wang b0048b673c Post 7.6 branch cut changes (#10546)
Summary:
After branch 7.6.fb branch is cut, following release process, upgrade version number to 7.7 and add 7.6.fb to format compatibility check.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10546

Test Plan: Watch CI

Reviewed By: ajkr

Differential Revision: D38892023

Pulled By: gitbw95

fbshipit-source-id: 94e96dedbd973f5f9713e73d3bed336e4678565b
2022-08-21 20:42:12 -07:00
anand76 35cdd3e71e MultiGet async IO across multiple levels (#10535)
Summary:
This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535

Test Plan:
1. Ensure existing MultiGet unit tests pass, updating them as necessary
2. New unit tests - TODO
3. Run stress test - TODO

No noticeable regression (<1%) without async IO -
Without PR: `multireadrandom :       7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations;  571.6 MB/s (8168992 of 8168992 found)`
With PR: `multireadrandom :       7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations;  568.2 MB/s (8271992 of 8271992 found)`

For a fully cached DB, but with async IO option on, no regression observed (<1%) -
Without PR: `multireadrandom :       5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations;  797.9 MB/s (11540992 of 11540992 found) `
With PR: `multireadrandom :       5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations;  790.7 MB/s (11649992 of 11649992 found) `

Reviewed By: akankshamahajan15

Differential Revision: D38774009

Pulled By: anand1976

fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
2022-08-19 16:52:52 -07:00
Bo Wang 13cb7a84b6 Fix the memory leak in db_stress tests that are caused by FaultInjectionSecondaryCache and add CompressedSecondaryCache into stress tests. (#10523)
Summary:
1. Fix the memory leak in db_stress tests that are caused by `FaultInjectionSecondaryCache`. To address the test requirements for both CompressedSecondaryCache and CachlibWrapper, a new class variable `base_is_compressed_sec_cache_` is added to determine the different behaviors in `Lookup()` and `WaitAll()`.
2. Add `CompressedSecondaryCache` into stress tests.

Before this PR, memory leak is reported during crash tests if  `CompressedSecondaryCache` is in stress tests. One example is shown as follows:
```
==70722==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 6648240 byte(s) in 83103 object(s) allocated from:
    #0 0x13de9d7 in operator new(unsigned long) (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dbgo/gen/aab7ed39/internal_repo_rocksdb/repo/db_stress+0x13de9d7)
    https://github.com/facebook/rocksdb/issues/1 0x9084c7 in rocksdb::BlocklikeTraits<rocksdb::Block>::Create(rocksdb::BlockContents&&, unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*) internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:128
    https://github.com/facebook/rocksdb/issues/2 0x9084c7 in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)::operator()(void const*, unsigned long, void**, unsigned long*) const internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:34
    https://github.com/facebook/rocksdb/issues/3 0x9082c9 in rocksdb::Block std::__invoke_impl<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::__invoke_other, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:61
    https://github.com/facebook/rocksdb/issues/4 0x90825d in std::enable_if<is_invocable_r_v<rocksdb::Block, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>, rocksdb::Block>::type std::__invoke_r<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:114
    https://github.com/facebook/rocksdb/issues/5 0x9081b0 in std::_Function_handler<rocksdb::Status (void const*, unsigned long, void**, unsigned long*), std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)>::_M_invoke(std::_Any_data const&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:291
    https://github.com/facebook/rocksdb/issues/6 0x991f2c in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)>::operator()(void const*, unsigned long, void**, unsigned long*) const third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:560
    https://github.com/facebook/rocksdb/issues/7 0x990277 in rocksdb::CompressedSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/cache/compressed_secondary_cache.cc:77
    https://github.com/facebook/rocksdb/issues/8 0xd3aa4d in rocksdb::FaultInjectionSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/utilities/fault_injection_secondary_cache.cc:92
    https://github.com/facebook/rocksdb/issues/9 0xeadaab in rocksdb::lru_cache::LRUCacheShard::Lookup(rocksdb::Slice const&, unsigned int, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/lru_cache.cc:445
    https://github.com/facebook/rocksdb/issues/10 0x1064573 in rocksdb::ShardedCache::Lookup(rocksdb::Slice const&, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/sharded_cache.cc:89
    https://github.com/facebook/rocksdb/issues/11 0x8be0df in rocksdb::BlockBasedTable::GetEntryFromCache(rocksdb::CacheTier const&, rocksdb::Cache*, rocksdb::Slice const&, rocksdb::BlockType, bool, rocksdb::GetContext*, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:389
    https://github.com/facebook/rocksdb/issues/12 0x905790 in rocksdb::Status rocksdb::BlockBasedTable::GetDataBlockFromCache<rocksdb::Block>(rocksdb::Slice const&, rocksdb::Cache*, rocksdb::Cache*, rocksdb::ReadOptions const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::UncompressionDict const&, rocksdb::BlockType, bool, rocksdb::GetContext*) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1263
    https://github.com/facebook/rocksdb/issues/13 0x8b9259 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1559
    https://github.com/facebook/rocksdb/issues/14 0x8b710c in rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool, bool, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1726
    https://github.com/facebook/rocksdb/issues/15 0x8c329f in rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::FilePrefetchBuffer*, bool, bool, rocksdb::Status&) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader_impl.h:58
    https://github.com/facebook/rocksdb/issues/16 0x920117 in rocksdb::BlockBasedTableIterator::InitDataBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:262
    https://github.com/facebook/rocksdb/issues/17 0x920d42 in rocksdb::BlockBasedTableIterator::MaterializeCurrentBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:332
    https://github.com/facebook/rocksdb/issues/18 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
    https://github.com/facebook/rocksdb/issues/19 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
    https://github.com/facebook/rocksdb/issues/20 0xef9f6c in rocksdb::MergingIterator::PrepareValue() internal_repo_rocksdb/repo/table/merging_iterator.cc:260
    https://github.com/facebook/rocksdb/issues/21 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
    https://github.com/facebook/rocksdb/issues/22 0xc67bcd in rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:326
    https://github.com/facebook/rocksdb/issues/23 0xc66d36 in rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:234
    https://github.com/facebook/rocksdb/issues/24 0xc7ab47 in rocksdb::DBIter::Next() internal_repo_rocksdb/repo/db/db_iter.cc:161
    https://github.com/facebook/rocksdb/issues/25 0x70d938 in rocksdb::BatchedOpsStressTest::TestPrefixScan(rocksdb::ThreadState*, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&) internal_repo_rocksdb/repo/db_stress_tool/batched_ops_stress.cc:320
    https://github.com/facebook/rocksdb/issues/26 0x6dc6a8 in rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:907
    https://github.com/facebook/rocksdb/issues/27 0x6867de in rocksdb::ThreadBody(void*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_driver.cc:33
    https://github.com/facebook/rocksdb/issues/28 0xce4cc2 in rocksdb::(anonymous namespace)::StartThreadWrapper(void*) internal_repo_rocksdb/repo/env/env_posix.cc:461
    https://github.com/facebook/rocksdb/issues/29 0x7f23f9068c0e in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434:8
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10523

Test Plan:
```
$COMPILE_WITH_ASAN=1  make -j 24
$db_stress J=40 crash_test_with_txn
```

Reviewed By: anand1976

Differential Revision: D38646839

Pulled By: gitbw95

fbshipit-source-id: 9452895c7dc95481a9d7afe83b15193cf5b1c43e
2022-08-18 21:53:27 -07:00
Akanksha Mahajan 5956ef0089 Add initial_auto_readahead_size and max_auto_readahead_size to db_bench (#10539)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10539

Reviewed By: anand1976

Differential Revision: D38837111

Pulled By: akankshamahajan15

fbshipit-source-id: eb845c6e15a3c823ff6113395817388ff15a20b1
2022-08-18 18:03:44 -07:00
Gang Liao 275cd80cdb Add a blob-specific cache priority (#10461)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461

Reviewed By: siying

Differential Revision: D38672823

Pulled By: ltamasi

fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
2022-08-12 17:59:06 -07:00
Changyu Bi fd165c869d Add memtable per key-value checksum (#10281)
Summary:
Append per key-value checksum to internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make DB become read-only upon memtable entry checksum verification failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281

Test Plan:
- Added new unit test cases: `make check`
- Benchmark on memtable insert
```
TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

# avg over 10 runs
Baseline: 1166936 ops/sec
memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%)
memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%)
write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%)
```
-  Benchmark on only memtable read: ops/sec dropped 31% for `readseq` due to time spend on verifying checksum.
ops/sec for `readrandom` dropped ~6.8%.
```
# Readseq
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100

readseq [AVG    20 runs] : 7432840 (± 212005) ops/sec;  822.3 (± 23.5) MB/sec
readseq [MEDIAN 20 runs] : 7573878 ops/sec;  837.9 MB/sec

With -memtable_protection_bytes_per_key=2:

readseq [AVG    20 runs] : 5134607 (± 119596) ops/sec;  568.0 (± 13.2) MB/sec
readseq [MEDIAN 20 runs] : 5232946 ops/sec;  578.9 MB/sec

# Readrandom
sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100
readrandom [AVG    10 runs] : 140236 (± 3938) ops/sec;    9.8 (± 0.3) MB/sec
readrandom [MEDIAN 10 runs] : 140545 ops/sec;    9.8 MB/sec

With -memtable_protection_bytes_per_key=2:
readrandom [AVG    10 runs] : 130632 (± 2738) ops/sec;    9.1 (± 0.2) MB/sec
readrandom [MEDIAN 10 runs] : 130341 ops/sec;    9.1 MB/sec
```

- Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800`

Reviewed By: ajkr

Differential Revision: D37607896

Pulled By: cbi42

fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113
2022-08-12 13:51:32 -07:00
Jay Zhuang f42fec2fab Add bash for running the script (#10521)
Summary:
workaround for scripts cannot be executed directly in docker /dev/shm
might be a permission configuration.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10521

Test Plan: run the format_compatible test: https://app.circleci.com/pipelines/github/facebook/rocksdb/17161/workflows/531cc2ce-188c-4e18-a050-5c5f4df76f5c/jobs/459757

Reviewed By: ltamasi

Differential Revision: D38630967

Pulled By: jay-zhuang

fbshipit-source-id: 501d2b48df4e04027a9d6e891af7edff73d571f3
2022-08-11 13:33:06 -07:00
sdong 9277569ba3 Add some missing headers (#10519)
Summary:
Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519

Reviewed By: jay-zhuang

Differential Revision: D38603291

fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1
2022-08-11 12:45:50 -07:00
Jay Zhuang 5d3aefb682 Migrate to docker for CI run (#10496)
Summary:
Moved linux builds to using docker to avoid CI instability caused by dependency installation site down.
Added the `Dockerfile` which is used to build the image.
The build time is also significantly reduced, because no dependencies installation and with using 2xlarge+ instance for slow build (like tsan test).
Also fixed a few issues detected while building this:
* `DestoryDB()` Status not checked for a few tests
* nullptr might be used in `inlineskiplist.cc`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D38554200

Pulled By: jay-zhuang

fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd
2022-08-10 17:34:38 -07:00
gitbw95 b57155a0bd Revert "Add CompressedSecondaryCache into stress test" #10442 (#10509)
Summary:
Revert https://github.com/facebook/rocksdb/pull/10442 before I find the root cause and fix the memory leak in db_stress tests that are caused by `FaultInjectionSecondaryCache`.

Memory leak is reported during crash tests and one example is shown as follows:
```
==70722==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 6648240 byte(s) in 83103 object(s) allocated from:
    #0 0x13de9d7 in operator new(unsigned long) (/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/fbcode/buck-out/dbgo/gen/aab7ed39/internal_repo_rocksdb/repo/db_stress+0x13de9d7)
    https://github.com/facebook/rocksdb/issues/1 0x9084c7 in rocksdb::BlocklikeTraits<rocksdb::Block>::Create(rocksdb::BlockContents&&, unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*) internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:128
    https://github.com/facebook/rocksdb/issues/2 0x9084c7 in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)::operator()(void const*, unsigned long, void**, unsigned long*) const internal_repo_rocksdb/repo/table/block_based/block_like_traits.h:34
    https://github.com/facebook/rocksdb/issues/3 0x9082c9 in rocksdb::Block std::__invoke_impl<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::__invoke_other, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:61
    https://github.com/facebook/rocksdb/issues/4 0x90825d in std::enable_if<is_invocable_r_v<rocksdb::Block, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>, rocksdb::Block>::type std::__invoke_r<rocksdb::Status, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*, unsigned long, void**, unsigned long*>(std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/invoke.h:114
    https://github.com/facebook/rocksdb/issues/5 0x9081b0 in std::_Function_handler<rocksdb::Status (void const*, unsigned long, void**, unsigned long*), std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> rocksdb::GetCreateCallback<rocksdb::Block>(unsigned long, rocksdb::Statistics*, bool, rocksdb::FilterPolicy const*)::'lambda'(void const*, unsigned long, void**, unsigned long*)>::_M_invoke(std::_Any_data const&, void const*&&, unsigned long&&, void**&&, unsigned long*&&) third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:291
    https://github.com/facebook/rocksdb/issues/6 0x991f2c in std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)>::operator()(void const*, unsigned long, void**, unsigned long*) const third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_function.h:560
    https://github.com/facebook/rocksdb/issues/7 0x990277 in rocksdb::CompressedSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/cache/compressed_secondary_cache.cc:77
    https://github.com/facebook/rocksdb/issues/8 0xd3aa4d in rocksdb::FaultInjectionSecondaryCache::Lookup(rocksdb::Slice const&, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, bool, bool&) internal_repo_rocksdb/repo/utilities/fault_injection_secondary_cache.cc:92
    https://github.com/facebook/rocksdb/issues/9 0xeadaab in rocksdb::lru_cache::LRUCacheShard::Lookup(rocksdb::Slice const&, unsigned int, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/lru_cache.cc:445
    https://github.com/facebook/rocksdb/issues/10 0x1064573 in rocksdb::ShardedCache::Lookup(rocksdb::Slice const&, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority, bool, rocksdb::Statistics*) internal_repo_rocksdb/repo/cache/sharded_cache.cc:89
    https://github.com/facebook/rocksdb/issues/11 0x8be0df in rocksdb::BlockBasedTable::GetEntryFromCache(rocksdb::CacheTier const&, rocksdb::Cache*, rocksdb::Slice const&, rocksdb::BlockType, bool, rocksdb::GetContext*, rocksdb::Cache::CacheItemHelper const*, std::function<rocksdb::Status (void const*, unsigned long, void**, unsigned long*)> const&, rocksdb::Cache::Priority) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:389
    https://github.com/facebook/rocksdb/issues/12 0x905790 in rocksdb::Status rocksdb::BlockBasedTable::GetDataBlockFromCache<rocksdb::Block>(rocksdb::Slice const&, rocksdb::Cache*, rocksdb::Cache*, rocksdb::ReadOptions const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::UncompressionDict const&, rocksdb::BlockType, bool, rocksdb::GetContext*) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1263
    https://github.com/facebook/rocksdb/issues/13 0x8b9259 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1559
    https://github.com/facebook/rocksdb/issues/14 0x8b710c in rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, bool, bool, bool, bool) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader.cc:1726
    https://github.com/facebook/rocksdb/issues/15 0x8c329f in rocksdb::DataBlockIter* rocksdb::BlockBasedTable::NewDataBlockIterator<rocksdb::DataBlockIter>(rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::DataBlockIter*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::FilePrefetchBuffer*, bool, bool, rocksdb::Status&) const internal_repo_rocksdb/repo/table/block_based/block_based_table_reader_impl.h:58
    https://github.com/facebook/rocksdb/issues/16 0x920117 in rocksdb::BlockBasedTableIterator::InitDataBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:262
    https://github.com/facebook/rocksdb/issues/17 0x920d42 in rocksdb::BlockBasedTableIterator::MaterializeCurrentBlock() internal_repo_rocksdb/repo/table/block_based/block_based_table_iterator.cc:332
    https://github.com/facebook/rocksdb/issues/18 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
    https://github.com/facebook/rocksdb/issues/19 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
    https://github.com/facebook/rocksdb/issues/20 0xef9f6c in rocksdb::MergingIterator::PrepareValue() internal_repo_rocksdb/repo/table/merging_iterator.cc:260
    https://github.com/facebook/rocksdb/issues/21 0xc6a201 in rocksdb::IteratorWrapperBase<rocksdb::Slice>::PrepareValue() internal_repo_rocksdb/repo/table/iterator_wrapper.h:78
    https://github.com/facebook/rocksdb/issues/22 0xc67bcd in rocksdb::DBIter::FindNextUserEntryInternal(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:326
    https://github.com/facebook/rocksdb/issues/23 0xc66d36 in rocksdb::DBIter::FindNextUserEntry(bool, rocksdb::Slice const*) internal_repo_rocksdb/repo/db/db_iter.cc:234
    https://github.com/facebook/rocksdb/issues/24 0xc7ab47 in rocksdb::DBIter::Next() internal_repo_rocksdb/repo/db/db_iter.cc:161
    https://github.com/facebook/rocksdb/issues/25 0x70d938 in rocksdb::BatchedOpsStressTest::TestPrefixScan(rocksdb::ThreadState*, rocksdb::ReadOptions const&, std::vector<int, std::allocator<int> > const&, std::vector<long, std::allocator<long> > const&) internal_repo_rocksdb/repo/db_stress_tool/batched_ops_stress.cc:320
    https://github.com/facebook/rocksdb/issues/26 0x6dc6a8 in rocksdb::StressTest::OperateDb(rocksdb::ThreadState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:907
    https://github.com/facebook/rocksdb/issues/27 0x6867de in rocksdb::ThreadBody(void*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_driver.cc:33
    https://github.com/facebook/rocksdb/issues/28 0xce4cc2 in rocksdb::(anonymous namespace)::StartThreadWrapper(void*) internal_repo_rocksdb/repo/env/env_posix.cc:461
    https://github.com/facebook/rocksdb/issues/29 0x7f23f9068c0e in start_thread /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/nptl/pthread_create.c:434:8
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10509

Test Plan:
```
$COMPILE_WITH_ASAN=1  make -j 24
$db_stress J=40 crash_test_with_txn
```

Reviewed By: siying

Differential Revision: D38540648

Pulled By: gitbw95

fbshipit-source-id: 703948e3a7ba40828a6445d00f3e73c184e34bf7
2022-08-09 17:49:01 -07:00
Jay Zhuang 3f763763aa Change bottommost_temperture to last_level_temperture (#10471)
Summary:
Change tiered compaction feature from `bottommost_temperture` to
`last_level_temperture`. The old option is kept for migration purpose only,
which is behaving the same as `last_level_temperture` and it will be removed in
the next release.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471

Test Plan: CI

Reviewed By: siying

Differential Revision: D38450621

Pulled By: jay-zhuang

fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527
2022-08-08 14:36:34 -07:00
Akanksha Mahajan 563f574372 Disable subcompactions for user_defined_timestamp (#10503)
Summary:
Currently user_defined_timestamp is failing in stress test with
subcompactions. So disabling it for now and will re enable it once its
fixed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10503

Test Plan: make crash_test_with_ts -j32

Reviewed By: riversand963

Differential Revision: D38510485

Pulled By: akankshamahajan15

fbshipit-source-id: 82fd0ec8cf86a96ff6653edd5bad7623cb9e0a15
2022-08-08 13:11:11 -07:00
Jay Zhuang 1e86d424e4 Tiered storage stress test (#10493)
Summary:
Add Tiered storage stress test and db_bench option

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10493

Test Plan:
new crashtest:
https://app.circleci.com/pipelines/github/facebook/rocksdb/16905/workflows/68c2967c-9274-434f-8506-1403cf441ead

Reviewed By: ajkr

Differential Revision: D38481892

Pulled By: jay-zhuang

fbshipit-source-id: 217a0be4acb93d420222e6ede2a1290d9f464776
2022-08-08 13:08:35 -07:00
Changyu Bi 9d77bf8f7b Fragment memtable range tombstone in the write path (#10380)
Summary:
- Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there is any.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380

Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such  an immutable memtable is created and range tombstones are in that memtable.

```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100

multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16464 is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.

Single thread result:
| Max # tombstones  | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op |  99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |6.68     |6.57     |6.72     |4.72     |4.79     |4.54     |
| 1    |6.67     |6.58     |6.62     |5.41     |4.74     |4.72     |
| 10   |6.59     |6.5      |6.56     |7.83     |4.69     |4.59     |
| 100  |6.62     |6.75     |6.58     |29.57    |5.04     |5.09     |
| 1000 |6.54     |6.82     |6.61     |320.33   |5.22     |5.21     |

32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones  | main fillrandom micros/op | 99cdf16464 | Post PR | main readrandom micros/op |  99cdf16464 | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0    |234.52   |260.25   |239.42   |5.06     |5.38     |5.09     |
| 1    |236.46   |262.0    |231.1    |19.57    |22.14    |5.45     |
| 10   |236.95   |263.84   |251.49   |151.73   |21.61    |5.73     |
| 100  |268.16   |296.8    |280.13   |2308.52  |22.27    |6.57     |

Reviewed By: ajkr

Differential Revision: D37916564

Pulled By: cbi42

fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
2022-08-05 12:02:33 -07:00
Andrew Kryczka 504fe4de80 Avoid allocations/copies for large GetMergeOperands() results (#10458)
Summary:
This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` already included `PinnableSlice` but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns.

The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will cleanup the bundle, including unreferencing the `SuperVersion`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458

Test Plan:
- new DB level test
- measured benefit/regression in a number of memtable scenarios

Setup command:
```
$ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000
```

Benchmark command:
```
./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10
```

Worst regression is when a key has many tiny operands:

- Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1
- `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%)

The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact.

Best improvement is when a key has large operands and high concurrency:

- Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32
- `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%).

Reviewed By: cbi42

Differential Revision: D38336578

Pulled By: ajkr

fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258
2022-08-04 00:42:13 -07:00
Peter Dillinger 9da97a3726 regression_test.sh: kill very old db_bench (and more) (#10441)
Summary:
If a db_bench process gets hung or runaway on a machine, that
could prevent regression_test.sh from ever making progress. To fix that,
regression_test.sh will now kill any db_bench process that is >12 hours
old. Also made this more reliable by not using string matching (grep) to
get db_bench process IDs.

I also had to make some other updates to get local runs working
reliably:
* Fix some quoting hell and other dubious complexity with db_bench_cmd
* Only save a DB for re-use when building it passes
* Report failed command in more cases
* Add safeguards against "rm -rf ."

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10441

Test Plan:
manual (local and remote), with temporary changes e.g. to have
a manageable age threshold etc.

Reviewed By: riversand963

Differential Revision: D38285537

Pulled By: pdillinger

fbshipit-source-id: 4d598876aedc38ac4bd9d8ddf32c5995d8e44db8
2022-08-02 09:16:17 -07:00
gitbw95 e1b176d274 Add CompressedSecondaryCache into stress test (#10442)
Summary:
The secondary cache is randomly disabled or enabled with CompressedSecondaryCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10442

Test Plan: - To test that the CompressedSecondaryCache is used and the stress test runs successfully, run  `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test `

Reviewed By: anand1976

Differential Revision: D38290796

Pulled By: gitbw95

fbshipit-source-id: bb7027b39e0ed9c0c62835abe09e759898130ec8
2022-08-01 11:01:03 -07:00
Peter Dillinger 15da225268 Fix regression_test.sh deleterandom duration (#10437)
Summary:
deleterandom tests are too fast to get good signal, e.g.
--deletes=31250 in 0.170 seconds vs. --reads=1500000 in 288.491
seconds for readrandom. Removing the special handling (unknown
motivation in faa7eb3b99) should suffice.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10437

Test Plan: watch continuous results

Reviewed By: ltamasi

Differential Revision: D38261185

Pulled By: pdillinger

fbshipit-source-id: 0f1b1b19efccda5689027d36cc2f01307f36031d
2022-07-29 10:39:22 -07:00
Peter Dillinger 65036e4217 Revert "Add a blob-specific cache priority (#10309)" (#10434)
Summary:
This reverts commit 8d178090be
because of a clear performance regression seen in internal dashboard
https://fburl.com/unidash/tpz75iee

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10434

Reviewed By: ltamasi

Differential Revision: D38256373

Pulled By: pdillinger

fbshipit-source-id: 134aa00f50dd7b1bbe037c227884a351342ec44b
2022-07-29 07:18:15 -07:00
Gang Liao 8d178090be Add a blob-specific cache priority (#10309)
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309

Reviewed By: ltamasi

Differential Revision: D38211655

Pulled By: gangliao

fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
2022-07-27 19:09:24 -07:00
Jay Zhuang 6a0010eb46 ldb to display public unique id and dump work with key range (#10417)
Summary:
2 ldb command improvements:
1. `ldb manifest_dump --verbose` display both the internal unique id and public id. which is useful to manually check sst_unique_id between manifest and SST;
2. `ldb dump` has `--from/to` option, but not working. Add support for that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10417

Test Plan:
run the command locally
```
$ ldb manifest_dump --path=MANIFEST-000026 --verbose
...
AddFile: 0 18 1023 'bar' seq:6, type:1 .. 'foo' seq:5, type:1 oldest_ancester_time:1658787615 file_creation_time:1658787615 file_checksum: file_checksum_func_name: Unknown unique_id(internal): {8800772265202404198,16149248642318466463} public_unique_id: F3E0A029B631D7D4-6E402DE08E771780
```
```
$ ldb dump --path=000036.sst --from=key000006 --to=key000009
Sst file format: block-based
'key000006' seq:2411, type:1 => value6
'key000007' seq:2412, type:1 => value7
'key000008' seq:2413, type:1 => value8
...
```

Reviewed By: ajkr

Differential Revision: D38136140

Pulled By: jay-zhuang

fbshipit-source-id: 8be6eeaa07ff9f089e33011ebe90fd0b69d33bf3
2022-07-26 20:40:18 -07:00
Guido Tagliavini Ponce 9d7de6517c Towards a production-quality ClockCache (#10418)
Summary:
In this PR we bring ClockCache closer to production quality. We implement the following changes:
1. Fixed a few bugs in ClockCache.
2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table.
3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test where either the LRU policy is required, and a test that dynamically increases the table capacity.
4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.)
5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache.

As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418

Test Plan:
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``
- ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush``

Reviewed By: pdillinger

Differential Revision: D38170673

Pulled By: guidotag

fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a
2022-07-26 17:42:03 -07:00
Alan Paxton e637470f64 Run new benchmark script in branch. (#10303)
Summary:
Configure CI to run modernised benchmark script

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10303

Reviewed By: ramvadiv

Differential Revision: D37719116

Pulled By: jay-zhuang

fbshipit-source-id: 79ecb1cd0abd4d800c6906ba6673268c2adee10e
2022-07-25 14:44:10 -07:00
Yanqin Jin dd759537d0 Print perf context for all benchmarks if enabled (#10396)
Summary:
If user runs `db_bench` with `-perf_level=2` or higher, db_bench should
print perf context after each of all benchmarks.

Or make `-perf_level` a per-benchmark switch.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10396

Test Plan: ./db_bench -benchmarks=fillseq,readseq -perf_level=2

Reviewed By: ajkr

Differential Revision: D38016324

Pulled By: riversand963

fbshipit-source-id: d83ea4abc34d40ffea394ca6abf0814bc5c0a2e0
2022-07-22 09:19:25 -07:00
Gang Liao 0b6bc101ba Charge blob cache usage against the global memory limit (#10321)
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321

Reviewed By: ltamasi

Differential Revision: D37913590

Pulled By: gangliao

fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
2022-07-18 23:26:57 -07:00
sdong d9deffba57 Post 7.5 branch cut changes (#10376)
Summary:
After branch 7.5.fb branch is cut, following release process, upgrade version number to 7.6 and add 7.5.fb to format compatibility check.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10376

Test Plan: Watch CI

Reviewed By: ajkr

Differential Revision: D37927694

fbshipit-source-id: 71b37ae55ebb7c95a1bcc0d7eee643d6ba5f8461
2022-07-18 12:58:04 -07:00
Gang Liao ec4ebeff30 Support prepopulating/warming the blob cache (#10298)
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.

This task is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298

Reviewed By: ltamasi

Differential Revision: D37908743

Pulled By: gangliao

fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
2022-07-17 07:13:59 -07:00
Guido Tagliavini Ponce 9645e66fc9 Temporarily return a LRUCache from NewClockCache (#10351)
Summary:
ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable.

The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify.

A new version of the NewClockCache function was created for our internal tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351

Test Plan: ``make -j24 check`` and re-run the pre-release tests.

Reviewed By: pdillinger

Differential Revision: D37802685

Pulled By: guidotag

fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972
2022-07-13 08:45:44 -07:00
Yanqin Jin b283f041f5 Stop tracking syncing live WAL for performance (#10330)
Summary:
With https://github.com/facebook/rocksdb/issues/10087, applications calling `SyncWAL()` or writing with `WriteOptions::sync=true` can suffer
from performance regression. This PR reverts to original behavior of tracking the syncing of closed WALs.
After we revert back to old behavior, recovery, whether kPointInTime or kAbsoluteConsistency, may fail to
detect corruption in synced WALs if the corruption is in the live WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10330

Test Plan:
make check

Before https://github.com/facebook/rocksdb/issues/10087
```bash
fillsync     :     750.269 micros/op 1332 ops/sec 75.027 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync     :     776.492 micros/op 1287 ops/sec 77.649 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 2 runs] : 1310 (± 44) ops/sec;    0.1 (± 0.0) MB/sec
fillsync     :     805.625 micros/op 1241 ops/sec 80.563 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [AVG    3 runs] : 1287 (± 51) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [MEDIAN 3 runs] : 1287 ops/sec;    0.1 MB/sec
```

Before this PR and after https://github.com/facebook/rocksdb/issues/10087
```bash
fillsync     :    1479.601 micros/op 675 ops/sec 147.960 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync     :    1626.080 micros/op 614 ops/sec 162.608 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 2 runs] : 645 (± 59) ops/sec;    0.1 (± 0.0) MB/sec
fillsync     :    1588.402 micros/op 629 ops/sec 158.840 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [AVG    3 runs] : 640 (± 35) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [MEDIAN 3 runs] : 629 ops/sec;    0.1 MB/sec
```

After this PR
```bash
fillsync     :     749.621 micros/op 1334 ops/sec 74.962 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync     :     865.577 micros/op 1155 ops/sec 86.558 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 2 runs] : 1244 (± 175) ops/sec;    0.1 (± 0.0) MB/sec
fillsync     :     845.837 micros/op 1182 ops/sec 84.584 seconds 100000 operations;    0.1 MB/s (100 ops)
fillsync [AVG 3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [AVG    3 runs] : 1223 (± 109) ops/sec;    0.1 (± 0.0) MB/sec
fillsync [MEDIAN 3 runs] : 1182 ops/sec;    0.1 MB/sec
```

Reviewed By: ajkr

Differential Revision: D37725212

Pulled By: riversand963

fbshipit-source-id: 8fa7d13b3c7662be5d56351c42caf3266af937ae
2022-07-12 17:16:57 -07:00
Yanqin Jin 7679f22a89 Add coverage for the combination of write-prepared and WAL recycling (#10350)
Summary:
as title.
Test plan
- make check
- CI on PR
- TEST_TMPDIR=/dev/shm make crash_test_with_multiops_wp_txn (tested with successful run)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10350

Reviewed By: ajkr

Differential Revision: D37792872

Pulled By: riversand963

fbshipit-source-id: ff064093b7f715d0acf387af2e3ae87b1278b52b
2022-07-12 13:17:21 -07:00
Yanqin Jin 2f13f5f7d0 Add coverage for timestamped snapshot to MultiOpsTxnsStressTest (#10325)
Summary:
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10325

Test Plan:
```bash
TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_multiops_wc_txn
TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_txn
```

Reviewed By: akankshamahajan15

Differential Revision: D37688742

Pulled By: riversand963

fbshipit-source-id: e198ace921898af63f99e869568c1a7bbf69f1a4
2022-07-11 12:44:08 -07:00
Mark Callaghan 177b2fa341 Set the value for --version, add --build_info (#10275)
Summary:
./db_bench --version
db_bench version 7.5.0

./db_bench --build_info
 (RocksDB) 7.5.0
    rocksdb_build_date: 2022-06-29 09:58:04
    rocksdb_build_git_sha: d96febeeaa
    rocksdb_build_git_tag: print_version_githash

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10275

Test Plan: run it

Reviewed By: ajkr

Differential Revision: D37524720

Pulled By: mdcallag

fbshipit-source-id: 0f6c819dbadf7b033a4a3ba2941992bb76b4ff99
2022-07-06 09:58:45 -07:00
Mark Callaghan 9eced1a344 Add the git hash and full RocksDB version to report.tsv (#10277)
Summary:
Previously the version was displayed as $major.$minor
This changes it to $major.$minor.$path

This also adds the git hash for the time from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people
who consume/graph these results are OK with it.

Example output:
ops_sec	mb_sec	lsm_sz	blob_sz	c_wgb	w_amp	c_mbps	c_wsecs	c_csecs	b_rgb	b_wgb	usec_op	p50	p99	p99.9	p99.99	pmax	uptime	stall%	Nstall	u_cpu	s_cpu	rss	test	date	version	job_id	githash
609488	244.1	1GB	0.0GB,	1.4	0.7	93.3	39	38	0	0	1.6	1.0	4	15	26	5365	15	0.0	0	0.1	0.0	0.5	fillseq.wal_disabled.v400	2022-06-29T13:36:05	7.5.0		6115254416

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277

Test Plan: Run it

Reviewed By: jay-zhuang

Differential Revision: D37532418

Pulled By: mdcallag

fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d
2022-07-05 11:46:36 -07:00
zczhu e716bda010 Add FLAGS_compaction_pri into crash_test (#10255)
Summary:
Add FLAGS_compaction_pri into correctness test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10255

Test Plan: run crash_test with FLAGS_compaction_pri

Reviewed By: ajkr

Differential Revision: D37510372

Pulled By: littlepig2013

fbshipit-source-id: 73d93a0a047d0c3993c8a512383dd6ee6acef641
2022-06-30 22:56:58 -07:00
Mark Callaghan 720ab355f9 Add undefok for BlobDB options not supported prior to 7.5 (#10276)
Summary:
This adds --undefok to support use of this script with BlobDB for db_bench versions prior
to 7.5 when the options land in a release.

While there is a limit to how far back this script can go WRT backwards compatiblity,
this is an easy change to support early 7.x releases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276

Test Plan: Run it with versions of db_bench that do not and then do support these options

Reviewed By: gangliao

Differential Revision: D37529299

Pulled By: mdcallag

fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8
2022-06-30 14:07:26 -07:00
Guido Tagliavini Ponce 57a0e2f304 Clock cache (#10273)
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.

Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273

Test Plan: ``make -j24 check``

Reviewed By: pdillinger

Differential Revision: D37522461

Pulled By: guidotag

fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
2022-06-29 21:50:39 -07:00
Mark Callaghan 28f2d3cca6 Benchmark fix write amplification computation (#10236)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10236

Reviewed By: ajkr

Differential Revision: D37489898

Pulled By: mdcallag

fbshipit-source-id: 4b4565973b1f2c47342b4d1b857c8f89e91da145
2022-06-29 07:22:22 -07:00
Yanqin Jin d3de59255a Enable compaction filter for db_stress with user-defined timestamp (#10259)
Summary:
Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter.

This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of
the presence of timestamps.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259

Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts

Reviewed By: ajkr

Differential Revision: D37459692

Pulled By: riversand963

fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef
2022-06-27 11:53:09 -07:00
Andrew Kryczka f322f273b0 Temporarily disable mempurge in crash test (#10252)
Summary:
Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252

Reviewed By: riversand963

Differential Revision: D37432948

Pulled By: ajkr

fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
2022-06-24 17:11:27 -07:00
Mark Callaghan 6061905790 Wrapper for benchmark.sh to run a sequence of db_bench tests (#10215)
Summary:
This provides two things:
1) Runs a sequence of db_bench tests. This sequence was chosen to provide
good coverage with less variance.
2) Makes it easier to do A/B testing for multiple binaries. This combines
the report.tsv files into summary.tsv to make it easier to compare results
across multiple binaries.

Example output for 2) is:

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
1115171 446.7   9GB             8.9     1.0     454.7   26      26      0       0       0.9     0.5     2       7       51      5547    20      0.0     0       0.1     0.1     0.2     fillseq.wal_disabled.v400       2022-04-12T08:53:51     6.0
1045726 418.9   8GB     0.0GB   8.4     1.0     432.4   27      26      0       0       1.0     0.5     2       6       102     5618    20      0.0     0       0.1     0.0     0.1     fillseq.wal_disabled.v400       2022-04-12T12:25:36     6.28

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
2969192 1189.3  16GB            0.0             0.0     0       0       0       0       10.8    9.3     25      33      49      13551   1781    0.0     0       48.2    6.8     16.8    readrandom.t32  2022-04-12T08:54:28     6.0
2692922 1078.6  16GB    0.0GB   0.0             0.0     0       0       0       0       11.9    10.2    30      38      56      49735   1781    0.0     0       47.8    6.7     16.8    readrandom.t32  2022-04-12T12:26:15     6.28

...

ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
180227  72.2    38GB            1126.4  8.7     643.2   3286    3218    0       0       177.6   50.2    2687    4083    6148    854083  1793    68.4    7804    17.0    5.9     0.5     overwrite.t32.s0        2022-04-12T11:55:21     6.0
236512  94.7    31GB    0.0GB   1502.9  8.9     862.2   5242    5125    0       0       135.3   59.9    2537    3268    5404    18545   1785    49.7    5112    25.5    8.0     9.4     overwrite.t32.s0        2022-04-12T15:27:25     6.28

Example output with formatting preserved is here:
https://gist.github.com/mdcallag/4432e5bbaf91915c916d46bd6ce3c313

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10215

Test Plan: run it

Reviewed By: jay-zhuang

Differential Revision: D37299892

Pulled By: mdcallag

fbshipit-source-id: e6e0ed638fd7e8deeb869d700593fdc3eba899c8
2022-06-23 18:07:14 -07:00
Gang Liao 2352e2dfda Add the blob cache to the stress tests and the benchmarking tool (#10202)
Summary:
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202

Reviewed By: ltamasi

Differential Revision: D37325739

Pulled By: gangliao

fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
2022-06-22 16:04:03 -07:00
Peter Dillinger 84210c9489 Add data block hash index to crash test, fix MultiGet issue (#10220)
Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues.

Fixes https://github.com/facebook/rocksdb/issues/10186

I intend to pick this into the 7.4.0 release candidate

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220

Test Plan:
Failure quickly reproduces in crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test I believe would be too tricky and fragile
to be worthwhile.

Reviewed By: anand1976

Differential Revision: D37315647

Pulled By: pdillinger

fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
2022-06-21 16:23:58 -07:00
Guido Tagliavini Ponce 3afed7408c Replace per-shard chained hash tables with open-addressing scheme (#10194)
Summary:
In FastLRUCache, we replace the current chained per-shard hash table by an open-addressing hash table. In particular, this allows us to preallocate all handles.

Because all handles are preallocated, this implementation doesn't support strict_capacity_limit = false (i.e., allowing insertions beyond the predefined capacity). This clashes with current assumptions of some tests, namely two tests in cache_test and the crash tests. We have disabled these for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10194

Test Plan: ``make -j24 check``

Reviewed By: pdillinger

Differential Revision: D37296770

Pulled By: guidotag

fbshipit-source-id: 232ff1b8260331d868ebf4e3e5d8ad709390b0ad
2022-06-21 08:45:04 -07:00
Gang Liao deff48bcef Add blob source to retrieve blobs in RocksDB (#10198)
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB.  BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.

Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!

This PR is a part of https://github.com/facebook/rocksdb/issues/10156

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198

Reviewed By: ltamasi

Differential Revision: D37294735

Pulled By: gangliao

fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
2022-06-20 20:58:11 -07:00
Peter Dillinger ccb4f047ae Add 7.4 to format compatibility test (#10209)
Summary:
Forgotten in https://github.com/facebook/rocksdb/issues/10204

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10209

Test Plan: local run with SHORT_TEST=1

Reviewed By: hx235

Differential Revision: D37284028

Pulled By: pdillinger

fbshipit-source-id: 631c1969906d002acc930662dcd5eefc0c758429
2022-06-20 13:13:37 -07:00
Hui Xiao a5d773e077 Add rate-limiting support to batched MultiGet() (#10159)
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.

However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159

Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of  `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
   - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work

Reviewed By: ajkr, anand1976

Differential Revision: D37135172

Pulled By: hx235

fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
2022-06-17 16:40:47 -07:00
Andrew Kryczka 5d6005c780 Add WriteOptions::protection_bytes_per_key (#10037)
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.

Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.

There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037

Test Plan:
- Manual
  - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
  - Enabled in MyShadow for 1+ week
- Automated
  - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
  - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection

Reviewed By: cbi42

Differential Revision: D36614569

Pulled By: ajkr

fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
2022-06-16 23:10:07 -07:00
Peter Dillinger 126c223714 Remove deprecated block-based filter (#10184)
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).

This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).

Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184

Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB

Reviewed By: riversand963

Differential Revision: D37212647

Pulled By: pdillinger

fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
2022-06-16 15:51:33 -07:00
Peter Dillinger 94329ae4ec Use only ASCII in source files (#10164)
Summary:
Fix existing usage of non-ASCII and add a check to prevent
future use. Added `-n` option to greps to provide line numbers.

Alternative to https://github.com/facebook/rocksdb/issues/10147

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10164

Test Plan:
used new checker to find & fix cases, manually check
db_bench output is preserved

Reviewed By: akankshamahajan15

Differential Revision: D37148792

Pulled By: pdillinger

fbshipit-source-id: 68c8b57e7ab829369540d532590bf756938855c7
2022-06-15 14:44:43 -07:00
Changyu Bi 9882652b0e Verify write batch checksum before WAL (#10114)
Summary:
Context: WriteBatch can have key-value checksums when it was created `with protection_bytes_per_key > 0`.
This PR added checksum verification for write batches before they are written to WAL.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10114

Test Plan:
- Added new unit tests to db_kv_checksum_test.cc: `make check -j32`
- benchmark on performance regression: `./db_bench --benchmarks=fillrandom[-X20] -db=/dev/shm/test_rocksdb -write_batch_protection_bytes_per_key=8`
  - Pre-PR:
`
fillrandom [AVG    20 runs] : 198875 (± 3006) ops/sec;   22.0 (± 0.3) MB/sec
`
  - Post-PR:
`
fillrandom [AVG    20 runs] : 196487 (± 2279) ops/sec;   21.7 (± 0.3) MB/sec
`
  Mean regressed about 1% (198875 -> 196487 ops/sec).

Reviewed By: ajkr

Differential Revision: D36917464

Pulled By: cbi42

fbshipit-source-id: 29beb74edf65f04b1a890b4f650d873dc7ed790d
2022-06-15 13:43:58 -07:00
Yanqin Jin ce419c0f10 Allow db_bench and db_stress to set allow_data_in_errors (#10171)
Summary:
There is `Options::allow_data_in_errors` that controls whether RocksDB
is allowed to log data, e.g. key, value, etc in LOG files. It is false
by default. However, in db_bench and db_stress, it is often ok to log
data because there is no concern about privacy.

This PR allows db_stress and db_bench to set this option on the command
line, while it remains false by default. Furthermore, make
crash/recovery test driven by db_crashtest.py to opt-in.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171

Test Plan: Stress test and db_bench

Reviewed By: hx235

Differential Revision: D37163787

Pulled By: riversand963

fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8
2022-06-15 12:38:04 -07:00
Hui Xiao d665afdbf3 Account memory of FileMetaData in global memory limit (#9924)
Summary:
**Context/Summary:**
As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924

Test Plan:
- Previous `make check` verified there are only 2 places where the memory of  the allocated `FileMetaData` can be released
- New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)`
- db bench (CPU cost of `charge_file_metadata` in write and compact)
   - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`
   - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'`

table 1 - write

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078
80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734**
160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978**

table 2 - compact

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67
20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96
40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96**
80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78**

- stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1  --cache_size=1` killed as normal

Reviewed By: ajkr

Differential Revision: D36055583

Pulled By: hx235

fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2022-06-14 13:06:40 -07:00
Guido Tagliavini Ponce f105e1a501 Make the per-shard hash table fixed-size. (#10154)
Summary:
We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.

Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154

Test Plan: `make -j24 check`

Reviewed By: pdillinger

Differential Revision: D37124451

Pulled By: guidotag

fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
2022-06-13 20:29:00 -07:00
Mark Callaghan 04bd347995 Increase num_levels for universal from 8 to 40 (#10158)
Summary:
See https://github.com/facebook/rocksdb/issues/10082 for more details. Trivial move
isn't done for universal when compaction is from L0 into L0. So a too small value for
num_levels with db_bench means there will be fewer trivial moves with universal and
that means that write-amp will increase.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10158

Test Plan: run it

Reviewed By: siying

Differential Revision: D37122519

Pulled By: mdcallag

fbshipit-source-id: 1cb39049676f68a6cc3ea8d105a9965f89d4d09e
2022-06-13 16:24:32 -07:00
Yanqin Jin 1777e5f7e9 Snapshots with user-specified timestamps (#9879)
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.

It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.

This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.

In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.

This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```

If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.

Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();

// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
    std::shared_ptr<TransactionNotifier> notifier,
    TxnTimestamp ts,
    std::shared_ptr<const Snapshot>* ret);

// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```

The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```

We also provide two additional APIs for stats collection and reporting purposes.

```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
    TxnTimestamp ts_lb,
    TxnTimestamp ts_ub,
    std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```

To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```

Before shutdown, RocksDB will release all timestamped snapshots.

Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.

Timestamped snapshots can also be used with user-defined timestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879

Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```

Reviewed By: siying

Differential Revision: D35783919

Pulled By: riversand963

fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
2022-06-10 16:07:03 -07:00
Akanksha Mahajan ecfd4aef0c Enable wal_compression in crash_tests (#10141)
Summary:
Same as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10141

Test Plan:
```
export CRASH_TEST_EXT_ARGS=" --wal_compression=zstd"
 make crash_test -j
```

Reviewed By: riversand963

Differential Revision: D37042810

Pulled By: akankshamahajan15

fbshipit-source-id: 53f0793d78241f1b5c954dcc808cb4c0a3e9172a
2022-06-09 12:08:01 -07:00
Mark Callaghan 9efae14428 Fix parsing of db_bench output (#10124)
Summary:
A recent diff add a few more fields to one of the db_bench output lines that gets parsed.
This diff updates tools/benchmark.sh to handle that.

overwrite    :       7.939 micros/op 125963 ops/sec;   50.5 MB/s

overwrite    :       7.854 micros/op 127320 ops/sec 1800.001 seconds 229176999 operations;   51.0 MB/s

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10124

Test Plan: Run it

Reviewed By: jay-zhuang

Differential Revision: D36945137

Pulled By: mdcallag

fbshipit-source-id: 9c96f79491411da997e369a3be9c6b921a21d0fa
2022-06-08 09:23:36 -07:00
Yanqin Jin f890527b16 Update test for secondary instance in stress test (#10121)
Summary:
This PR updates secondary instance testing in stress test by default.

A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary.

Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan
to read some random keys with only very basic verification to make sure no assertion failure is triggered.

Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled.

Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic
catch-up.

In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not
work well when the underlying file is renamed by primary, e.g. SstFileManager.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121

Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush
```

Reviewed By: ajkr

Differential Revision: D36939458

Pulled By: riversand963

fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e
2022-06-07 21:07:47 -07:00
Levi Tamasi 7d36bc4273 Fix some bugs in verify_random_db.sh (#10112)
Summary:
The patch attempts to fix three bugs in `verify_random_db.sh`:
1) https://github.com/facebook/rocksdb/pull/9937 changed the default for
`--try_load_options` to true in the script's use case, so we have to
explicitly set it to false if the corresponding argument of the script
is 0. This should fix the issue we've been seeing with our forward
compatibility tests where 7.3 is unable to open a database created by
the version on main after adding a new configuration option.
2) The script seems to support two "extra parameters"; however,
in practice, if the second one was set, only that one was passed on to
`ldb`. Now both get forwarded.
3) When running the `diff` command, the base DB directory was passed as
the second argument instead of the file containing the `ldb` output
(this actually seems to work, probably accidentally though).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10112

Reviewed By: pdillinger

Differential Revision: D36911363

Pulled By: ltamasi

fbshipit-source-id: fe29db4e28d373cee51a12322c59050fc50e926d
2022-06-03 16:35:13 -07:00
Guido Tagliavini Ponce cf85607795 Add support for FastLRUCache in db_bench. (#10096)
Summary:
db_bench can now run with FastLRUCache.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10096

Test Plan:
- Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 db_bench``. Then test the appropriate code is used by running ``./db_bench -cache_type=fast_lru_cache`` and checking that the assert is called. Repeat for LRUCache.
- Verify that FastLRUCache (currently a clone of LRUCache) produces similar benchmark data than LRUCache, by comparing the outputs of ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=fast_lru_cache`` and ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=lru_cache``.

Reviewed By: gitbw95

Differential Revision: D36898774

Pulled By: guidotag

fbshipit-source-id: f9f6b6f6da124f88b21b3c8dee742fbb04eff773
2022-06-03 11:16:49 -07:00
Yanqin Jin 2b3c50c429 Temporarily disable wal compression (#10108)
Summary:
Will re-enable after fixing the bug in https://github.com/facebook/rocksdb/issues/10099 and https://github.com/facebook/rocksdb/issues/10097.
Right now, the priority is https://github.com/facebook/rocksdb/issues/10087, but the bug in WAL compression prevents the mini crash test from passing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10108

Reviewed By: pdillinger

Differential Revision: D36897214

Pulled By: riversand963

fbshipit-source-id: d64dc52738222d5f66003f7731dc46eaeed812be
2022-06-03 10:22:52 -07:00
Mark Callaghan 5506954b1f Enhance to support more tuning options, and universal and integrated… (#9704)
Summary:
… BlobDB for all tests

This does two big things:
* provides more tuning options
* supports universal and integrated BlobDB for all of the benchmarks that are leveled-only

It does several smaller things, and I will list a few
* sets l0_slowdown_writes_trigger which wasn't set before this diff.
* improves readability in report.tsv by using smaller field names in the header
* adds more columns to report.tsv

report.tsv before this diff:
```
ops_sec mb_sec  total_size_gb   level0_size_gb  sum_gb  write_amplification     write_mbps      usec_op percentile_50   percentile_75   percentile_99   percentile_99.9 percentile_99.99        uptime  stall_time      stall_percent   test_name       test_date      rocksdb_version  job_id
823294  329.8   0.0     21.5    21.5    1.0     183.4   1.2     1.0     1.0     3       6       14      120     00:00:0.000     0.0     fillseq.wal_disabled.v400       2022-03-16T15:46:45.000-07:00   7.0
326520  130.8   0.0     0.0     0.0     0.0     0       12.2    139.8   155.1   170     234     250     60      00:00:0.000     0.0     multireadrandom.t4      2022-03-16T15:48:47.000-07:00   7.0
86313   345.7   0.0     0.0     0.0     0.0     0       46.3    44.8    50.6    75      84      108     60      00:00:0.000     0.0     revrangewhilewriting.t4 2022-03-16T15:50:48.000-07:00   7.0
101294  405.7   0.0     0.1     0.1     1.0     1.6     39.5    40.4    45.9    64      75      103     62      00:00:0.000     0.0     fwdrangewhilewriting.t4 2022-03-16T15:52:50.000-07:00   7.0
258141  103.4   0.0     0.1     1.2     18.2    19.8    15.5    14.3    18.1    28      34      48      62      00:00:0.000     0.0     readwhilewriting.t4     2022-03-16T15:54:51.000-07:00   7.0
334690  134.1   0.0     7.6     18.7    4.2     308.8   12.0    11.8    13.7    21      30      62      62      00:00:0.000     0.0     overwrite.t4.s0 2022-03-16T15:56:53.000-07:00   7.0
```
report.tsv with this diff:
```
ops_sec mb_sec  lsm_sz  blob_sz c_wgb   w_amp   c_mbps  c_wsecs c_csecs b_rgb   b_wgb   usec_op p50     p99     p99.9   p99.99  pmax    uptime  stall%  Nstall  u_cpu   s_cpu   rss     test    date    version job_id
831144  332.9   22GB    0.0GB,  21.7    1.0     185.1   264     262     0       0       1.2     1.0     3       6       14      9198    120     0.0     0       0.4     0.0     0.7     fillseq.wal_disabled.v400       2022-03-16T16:21:23     7.0
325229  130.3   22GB    0.0GB,  0.0             0.0     0       0       0       0       12.3    139.8   170     237     249     572     60      0.0     0       0.4     0.1     1.2     multireadrandom.t4      2022-03-16T16:23:25     7.0
312920  125.3   26GB    0.0GB,  11.1    2.6     189.3   115     113     0       0       12.8    11.8    21      34      1255    6442    60      0.2     1       0.7     0.1     0.6     overwritesome.t4.s0     2022-03-16T16:25:27     7.0
81698   327.2   25GB    0.0GB,  0.0             0.0     0       0       0       0       48.9    46.2    79      246     369     9445    60      0.0     0       0.4     0.1     1.4     revrangewhilewriting.t4 2022-03-16T16:30:21     7.0
92484   370.4   25GB    0.0GB,  0.1     1.5     1.1     1       0       0       0       43.2    42.3    75      103     110     9512    62      0.0     0       0.4     0.1     1.4     fwdrangewhilewriting.t4 2022-03-16T16:32:24     7.0
241661  96.8    25GB    0.0GB,  0.1     1.5     1.1     1       0       0       0       16.5    17.1    30      34      49      9092    62      0.0     0       0.4     0.1     1.4     readwhilewriting.t4     2022-03-16T16:34:27     7.0
305234  122.3   30GB    0.0GB,  12.1    2.7     201.7   127     124     0       0       13.1    11.8    21      128     1934    6339    62      0.0     0       0.7     0.1     0.7     overwrite.t4.s0 2022-03-16T16:36:30     7.0
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9704

Test Plan: run it

Reviewed By: jay-zhuang

Differential Revision: D36864627

Pulled By: mdcallag

fbshipit-source-id: d5af1cfc258a16865210163fa6fd1b803ab1a7d3
2022-06-03 08:20:10 -07:00
Gang Liao e6432dfd4c Make it possible to enable blob files starting from a certain LSM tree level (#10077)
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.

In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077

Reviewed By: ltamasi

Differential Revision: D36884156

Pulled By: gangliao

fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
2022-06-02 20:04:33 -07:00
Zichen Zhu 65893ad959 Explicitly closing all directory file descriptors (#10049)
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049

Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.

Reviewed By: ajkr, hx235

Differential Revision: D36722135

Pulled By: littlepig2013

fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
2022-06-01 18:03:34 -07:00
Guido Tagliavini Ponce b4d0e041d0 Add support for FastLRUCache in stress and crash tests. (#10081)
Summary:
Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache.

Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags---when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081

Test Plan:
- To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`.
- To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`.

Reviewed By: anand1976

Differential Revision: D36839908

Pulled By: guidotag

fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725
2022-06-01 18:00:28 -07:00
Andrew Kryczka f6e45382e9 Disable file ingestion in crash test for CF consistency (#10067)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10067

Reviewed By: jay-zhuang

Differential Revision: D36727948

Pulled By: ajkr

fbshipit-source-id: a3502730412c01ba63d822a5d4bf56f8bae8fcb2
2022-05-26 17:41:30 -07:00
Andrew Kryczka 91ba7837b7 Enable IngestExternalFile() in crash test (#9357)
Summary:
Thanks to https://github.com/facebook/rocksdb/issues/9919 and https://github.com/facebook/rocksdb/issues/10051 the known bugs in file ingestion (besides mmap read + file checksum) are fixed. Now we can  try again to enable file ingestion in crash test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9357

Test Plan: stress file ingestion heavily for an hour: `$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=1000000 --ingest_external_file_one_in=100 --duration=3600 --interval=20 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152`

Reviewed By: riversand963

Differential Revision: D33410746

Pulled By: ajkr

fbshipit-source-id: d276431390995a67f68390d61c06a40945fdd280
2022-05-26 10:31:37 -07:00
Peter Dillinger bd170dda03 Abort RocksDB performance regression test on failure in test setup (#10053)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10053

Need to exit if ldb command fails, to avoid running db_bench on
empty/bad DB and considering the results valid.

Reviewed By: jay-zhuang

Differential Revision: D36673200

fbshipit-source-id: e0d78a0d397e0e335d82d9349bfd612d38ffb552
2022-05-25 13:35:08 -07:00
Yanqin Jin 9901e7f681 Enable checkpoint and backup in db_stress when timestamp is enabled (#10047)
Summary:
After https://github.com/facebook/rocksdb/issues/10030 and https://github.com/facebook/rocksdb/issues/10004, we can enable checkpoint and backup in stress tests when
user-defined timestamp is enabled.

This PR has no production risk.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10047

Test Plan:
```
TEST_TMPDIR=/dev/shm make crash_test_with_ts
```

Reviewed By: jowlyzhang

Differential Revision: D36641565

Pulled By: riversand963

fbshipit-source-id: d86c9d87efcc34c32d1aa176af691d32b897644a
2022-05-24 18:25:43 -07:00
Changyu Bi 8515bd50c9 Support read rate-limiting in SequentialFileReader (#9973)
Summary:
Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).

The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973

Test Plan:
- `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
- Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
  - Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
  - Benchmark:
```
strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
```
- db bench on backup and restore to ensure no performance regression.
  - backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
  - restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)

```
# Set up
./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000

# benchmark
TEST_TMPDIR=/tmp/test_rocksdb
NUM_RUN=50
for ((j=0;j<$NUM_RUN;j++))
do
   ./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
  # Restore
  #./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
```

Reviewed By: hx235

Differential Revision: D36327418

Pulled By: cbi42

fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
2022-05-24 10:28:57 -07:00
Levi Tamasi 253ae017fa Update version on main to 7.4 and add 7.3 to the format compatibility checks (#10038)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10038

Reviewed By: riversand963

Differential Revision: D36604533

Pulled By: ltamasi

fbshipit-source-id: 54ccd0a4b32a320b5640a658ea6846ee897065d1
2022-05-23 14:55:33 -07:00
Changyu Bi cc23b46da1 Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857)
Summary:
An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API, ZDICT_finalizeDictionary(), can improve such a dictionary's effectiveness at low cost. This PR changes how dictionary is created by calling the ZSTD ZDICT_finalizeDictionary() API instead of creating raw content dictionary (when max_dict_buffer_bytes > 0), and pass in all buffered uncompressed data blocks as samples.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857

Test Plan:
#### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data:
Set up: change the parameter [here](fb9a167a55/tools/db_bench_tool.cc (L1766)) to 16384 to make synthetic data more compressible.
```
# linked local ZSTD with version 1.5.2
# DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1  EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench

dict_bytes=16384
train_bytes=1048576
echo "========== No Dictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

echo "========== Raw Content Dictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

echo "========== FinalizeDictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

echo "========== TrainDictionary =========="
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 | grep elapsed
du -hc /dev/shm/dbbench/*sst | grep total

# Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory.
# before compression data size: 1.2GB
dict_bytes=16384
max_dict_buffer_bytes =  1048576
                    space   cpu/memory
No Dictionary       468M    14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k
Raw Dictionary      251M    15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k
FinalizeDictionary  236M    11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k
TrainDictionary     84M     7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k
```

#### Benchmark on 10 sample SST files for spacing saving and CPU time on compression:
FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression.
```
dict_bytes=16384
train_bytes=1048576

for sst_file in `ls ../temp/myrock-sst/`
do
  echo "********** $sst_file **********"
  echo "========== No Dictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD

  echo "========== Raw Content Dictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes

  echo "========== FinalizeDictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict

  echo "========== TrainDictionary =========="
  ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes
done

                         010240.sst (Size/Time) 011029.sst              013184.sst              021552.sst              185054.sst              185137.sst              191666.sst              7560381.sst             7604174.sst             7635312.sst
No Dictionary           28165569 / 2614419      32899411 / 2976832      32977848 / 3055542      31966329 / 2004590      33614351 / 1755877      33429029 / 1717042      33611933 / 1776936      33634045 / 2771417      33789721 / 2205414      33592194 / 388254
Raw Content Dictionary  28019950 / 2697961      33748665 / 3572422      33896373 / 3534701      26418431 / 2259658      28560825 / 1839168      28455030 / 1846039      28494319 / 1861349      32391599 / 3095649      33772142 / 2407843      33592230 / 474523
FinalizeDictionary      27896012 / 2650029      33763886 / 3719427      33904283 / 3552793      26008225 / 2198033      28111872 / 1869530      28014374 / 1789771      28047706 / 1848300      32296254 / 3204027      33698698 / 2381468      33592344 / 517433
TrainDictionary         28046089 / 2740037      33706480 / 3679019      33885741 / 3629351      25087123 / 2204558      27194353 / 1970207      27234229 / 1896811      27166710 / 1903119      32011041 / 3322315      32730692 / 2406146      33608631 / 570593
```

#### Decompression/Read test:
With FinalizeDictionary/TrainDictionary, some data structure used for decompression are in stored in dictionary, so they are expected to be faster in terms of decompression/reads.
```
dict_bytes=16384
train_bytes=1048576
echo "No Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 | grep MB/s

echo "Raw Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd  -compression_max_dict_bytes=$dict_bytes 2>&1 | grep MB/s

echo "FinalizeDict"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false  > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 | grep MB/s

echo "Train Dictionary"
TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1
TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 | grep MB/s

No Dictionary
readrandom   :      12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations;    9.1 MB/s (1000000 of 1000000 found)
Raw Dictionary
readrandom   :      12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations;    9.0 MB/s (1000000 of 1000000 found)
FinalizeDict
readrandom   :       9.787 micros/op 102180 ops/sec 9.787 seconds 1000000 operations;   11.3 MB/s (1000000 of 1000000 found)
Train Dictionary
readrandom   :       9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations;   11.4 MB/s (1000000 of 1000000 found)
```

Reviewed By: ajkr

Differential Revision: D35720026

Pulled By: cbi42

fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f
2022-05-20 12:09:09 -07:00
Peter Dillinger 280b9f371a Fix auto_prefix_mode performance with partitioned filters (#10012)
Summary:
Essentially refactored the RangeMayExist implementation in
FullFilterBlockReader to FilterBlockReaderCommon so that it applies to
partitioned filters as well. (The function is not called for the
block-based filter case.) RangeMayExist is essentially a series of checks
around a possible PrefixMayExist, and I'm confident those checks should
be the same for partitioned as for full filters. (I think it's likely
that bugs remain in those checks, but this change is overall a simplifying
one.)

Added auto_prefix_mode support to db_bench

Other small fixes as well

Fixes https://github.com/facebook/rocksdb/issues/10003

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10012

Test Plan:
Expanded unit test that uses statistics to check for filter
optimization, fails without the production code changes here

Performance: populate two DBs with
```
TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters
```

Observe no measurable change in non-partitioned performance
```
TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
```
Before: seekrandom [AVG 15 runs] : 11798 (± 331) ops/sec
After: seekrandom [AVG 15 runs] : 11724 (± 315) ops/sec

Observe big improvement with partitioned (also supported by bloom use statistics)
```
TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
```
Before: seekrandom [AVG 12 runs] : 2942 (± 57) ops/sec
After: seekrandom [AVG 12 runs] : 7489 (± 184) ops/sec

Reviewed By: siying

Differential Revision: D36469796

Pulled By: pdillinger

fbshipit-source-id: bcf1e2a68d347b32adb2b27384f945434e7a266d
2022-05-19 13:09:03 -07:00
Jay Zhuang c6d326d3d7 Track SST unique id in MANIFEST and verify (#9990)
Summary:
Start tracking SST unique id in MANIFEST, which is used to verify with
SST properties to make sure the SST file is not overwritten or
misplaced. A DB option `try_verify_sst_unique_id` is introduced to
enable/disable the verification, if enabled, it opens all SST files
during DB-open to read the unique_id from table properties (default is
false), so it's recommended to use it with `max_open_files = -1` to
pre-open the files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990

Test Plan: unittests, format-compatible test, mini-crash

Reviewed By: anand1976

Differential Revision: D36381863

Pulled By: jay-zhuang

fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f
2022-05-19 11:04:21 -07:00
Hui Xiao 3573558ec5 Rewrite memory-charging feature's option API (#9926)
Summary:
**Context:**
Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`.

Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags.

**Summary:**
- Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that
- Added missing db bench/stress test for some memory charging features
- Renamed related test suite to indicate they are under the same theme of memory charging
- Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication
- Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926

Test Plan:
- New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)`
- New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)`
- CI
- db bench (in case querying new options introduces regression) **+0.5% micros/op**: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR  -charge_compression_dictionary_building_buffer=1(remove this for comparison)  -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'`

#-run | (pre-PR) avg micros/op | std micros/op | (post-PR)  micros/op | std micros/op | change (%)
-- | -- | -- | -- | -- | --
10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721
20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | **-0.3633711465**
40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | **0.5289363078**

- db_stress: `python3 tools/db_crashtest.py blackbox  -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal

Reviewed By: ajkr

Differential Revision: D36054712

Pulled By: hx235

fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af
2022-05-17 15:01:51 -07:00
Yanqin Jin f6d9730ea1 Fix stress test with best-efforts-recovery (#9986)
Summary:
This PR

- since we are testing with disable_wal = true and best_efforts_recovery, we should set column family count to 1, due to the requirement of `ExpectedState` tracking and replaying logic.
- during backup and checkpoint restore, disable best-efforts-recovery. This does not matter now because db_crashtest.py always disables wal when testing best-efforts-recovery. In the future, if we enable wal, then not setting `restore_opitions.best_efforts_recovery` will cause backup db not to recover the WALs, and differ from db (that enables WAL).
- during verification of backup and checkpoint restore, print the key where inconsistency exists between expected state and db.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9986

Test Plan: TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery

Reviewed By: siying

Differential Revision: D36353105

Pulled By: riversand963

fbshipit-source-id: a484da161273e6216a1f7e245bac15a349693917
2022-05-13 12:29:20 -07:00
Andrew Kryczka e943bbdd2f Temporarily disable sync_fault_injection (#9979)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9979

Reviewed By: siying

Differential Revision: D36301555

Pulled By: ajkr

fbshipit-source-id: ed298d3484b6aad3ef19746e984bf4c52be33a9f
2022-05-11 12:19:07 -07:00
yaphet 26768edb65 Support single delete in ldb (#9469)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9469

Reviewed By: riversand963

Differential Revision: D33953484

fbshipit-source-id: f4e84a2d9865957d744c7e84ff02ffbb0a62b0a8
2022-05-10 16:37:19 -07:00
Peter Dillinger c5c58708db Fix format_compatible blowing away its TEST_TMPDIR (#9970)
Summary:
https://github.com/facebook/rocksdb/issues/9961 broke format_compatible check because of `make clean`
referencing TEST_TMPDIR. The Makefile behavior seems reasonable to me,
so here's a fix in check_format_compatible.sh

Apparently I also included removing a redundant part of our CircleCI config.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9970

Test Plan: manual run: SHORT_TEST=1 ./tools/check_format_compatible.sh

Reviewed By: riversand963

Differential Revision: D36258172

Pulled By: pdillinger

fbshipit-source-id: d46507f04614e888b414ff23b88d040ae2b5c294
2022-05-09 13:38:46 -07:00
sdong 736a7b5433 Remove own ToString() (#9955)
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955

Test Plan: Watch CI tests

Reviewed By: riversand963

Differential Revision: D36176799

fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
2022-05-06 13:03:58 -07:00
Andrew Kryczka 62d84e2a2b db_stress fault injection in release mode (#9957)
Summary:
Previously all fault injection was ignored in release mode. This PR adds it back except for read fault injection (`--read_fault_one_in > 0`) since its dependency (`IGNORE_STATUS_IF_ERROR`) is unavailable in release mode.

Other notable changes include:

- Moved `EnableWriteErrorInjection()` for `--write_fault_one_in > 0` so it's after `DB::Open()` without depending on `SyncPoint`
- Made `--read_fault_one_in > 0` return an error in release mode
- Updated `db_crashtest.py` to always set `--read_fault_one_in=0` in release mode

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9957

Test Plan:
```
$ DEBUG_LEVEL=0 make -j24 db_stress
$ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox
```

Reviewed By: anand1976

Differential Revision: D36193830

Pulled By: ajkr

fbshipit-source-id: 0b97946b4e3f06e3e0f6e7833c2763da08ec5321
2022-05-06 11:17:08 -07:00
Andrew Kryczka a62506aee2 Enable unsynced data loss in crash test (#9947)
Summary:
`db_stress` already tracks expected state history to verify prefix-recoverability when `sync_fault_injection` is enabled. This PR enables `sync_fault_injection` in `db_crashtest.py`.

Previously enabling `sync_fault_injection` would cause whole unsynced files to be dropped. This PR adds a more interesting case of losing only the tail of unsynced data by implementing `TestFSWritableFile::RangeSync()` and enabling `{wal_,}bytes_per_sync`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9947

Test Plan:
- regular blackbox, blackbox --simple
- various commands to stress this new case, such as `TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --write_buffer_size=2097152 --avoid_flush_during_recovery=1 --disable_wal=0 --interval=10 --db_write_buffer_size=0 --sync_fault_injection=1 --wal_compression=none --delpercent=0 --delrangepercent=0 --prefixpercent=0 --iterpercent=0 --writepercent=100 --readpercent=0 --wal_bytes_per_sync=131072 --duration=36000 --sync=0 --open_write_fault_one_in=16`

Reviewed By: riversand963

Differential Revision: D36152775

Pulled By: ajkr

fbshipit-source-id: 44b68a7fad0a4cf74af9fe1f39be01baab8141d8
2022-05-05 13:21:03 -07:00