Commit Graph

5342 Commits

Author SHA1 Message Date
Wentian Guo 0578d9f951 Filter table files by timestamp: Get operator (#11332)
Summary:
If RocksDB enables user-defined timestamp, then RocksDB read path can filter table files by the min/max timestamps of each file. If application wants to lookup a key that is the most recent and visible to a certain timestamp ts, then we can compare ts with the min_ts of each file. If ts < min_ts, then we know all keys in the file is not visible at time ts, then we do not have to open the file. This can also save an IO.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11332

Reviewed By: pdillinger

Differential Revision: D44763497

Pulled By: guowentian

fbshipit-source-id: abde346b9f18480fe03c04e4006e7d62aa9c22a8
2023-04-06 15:39:38 -07:00
Changyu Bi b3c43a5b99 Drain unnecessary levels when `level_compaction_dynamic_level_bytes=true` (#11340)
Summary:
When a user migrates to level compaction + `level_compaction_dynamic_level_bytes=true`, or when a DB shrinks, there can be unnecessary levels in the DB. Before this PR, this is no way to remove these levels except a manual compaction. These extra unnecessary levels make it harder to guarantee max_bytes_for_level_multiplier and can cause extra space amp. This PR boosts compaction score for these levels to allow RocksDB to automatically drain these levels. Together with https://github.com/facebook/rocksdb/issues/11321, this makes migration to `level_compaction_dynamic_level_bytes=true` automatic without needing user to do a one time full manual compaction. Credit: this PR is modified from https://github.com/facebook/rocksdb/issues/3921.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11340

Test Plan:
- New unit tests
- `python3 tools/db_crashtest.py whitebox --simple` which randomly sets level_compaction_dynamic_level_bytes in each run.

Reviewed By: ajkr

Differential Revision: D44563884

Pulled By: cbi42

fbshipit-source-id: e20d3620bd73dff22be18c5a91a07f340740bcc8
2023-04-06 11:20:43 -07:00
anand76 0623c5b903 Ensure VerifyFileChecksums reads don't exceed readahead_size (#11328)
Summary:
VerifyFileChecksums currently interprets the readahead_size as a payload of readahead_size for calculating the checksum, plus a prefetch of an additional readahead_size. Hence each read is readahead_size * 2. This change treats it as chunks of readahead_size for checksum calculation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11328

Test Plan: Add a unit test

Reviewed By: pdillinger

Differential Revision: D44718781

Pulled By: anand1976

fbshipit-source-id: 79bae1ebaa27de2a13bc86f5910bf09356936e63
2023-04-05 16:22:08 -07:00
Hui Xiao 7f5b9f40cb Fix initialization-order-fiasco in write_stall_stats.cc (#11355)
Summary:
**Context/Summary:**
As title.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11355

Test Plan:
- Ran previously failed tests and they succeed
- Perf
`./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3`

Reviewed By: ajkr

Differential Revision: D44719333

Pulled By: hx235

fbshipit-source-id: 23d22f314144071d97f7106ff1241c31c0bdf08b
2023-04-05 14:42:31 -07:00
Niklas Fiekas e5a560ec98 Expose cache occupancy via C API (#11327)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11327

Reviewed By: cbi42

Differential Revision: D44422225

Pulled By: ajkr

fbshipit-source-id: 3bfcf47290b3133c151bdfdd181896ba2e6be520
2023-04-03 14:42:43 -07:00
Hui Xiao 39c29372bf Add `SetAllowStall()` (#11335)
Summary:
**Context/Summary:**
- Allow runtime changes to whether `WriteBufferManager` allows stall or not by calling `SetAllowStall()`
- Misc: some clean up - see PR conversation

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11335

Test Plan: - New UT

Reviewed By: akankshamahajan15

Differential Revision: D44502555

Pulled By: hx235

fbshipit-source-id: 24b5cc57df7734b11d42e4870c06c87b95312b5e
2023-03-30 09:43:33 -07:00
Hui Xiao c14eb134ed Add experimental PerfContext counters for db iterator Prev/Next/Seek* APIs (#11320)
Summary:
**Context/Summary:**
Motived by user need of investigating db iterator behavior during an interval of any time length of a certain thread, we decide to collect and expose related counters in `PerfContext` as an experimental feature, in addition to the existing db-scope ones (i.e, tickers)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11320

Test Plan:
- new UT
- db bench

Setup
```
./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=1000000 -compression_type=none -bloom_bits=3
```
Test till converges
```
./db_bench -seed=1679526311157283 -use_existing_db=1 -perf_level=2 -db=/dev/shm/testdb/ -benchmarks="seekrandom[-X60]"
```
pre-change
`seekrandom [AVG 33 runs] : 7545 (± 100) ops/sec`
post-change (no regression)
`seekrandom [AVG 33 runs] : 7688 (± 67) ops/sec`

Reviewed By: cbi42

Differential Revision: D44321931

Pulled By: hx235

fbshipit-source-id: f98a254ba3e3ced95eb5928884e33f1b99dca401
2023-03-28 10:23:12 -07:00
Changyu Bi 601320164b Trivially move files down when opening db with level_compaction_dynamic_l… (#11321)
Summary:
…evel_bytes

 During DB open, if a column family uses level compaction with level_compaction_dynamic_level_bytes=true, trivially move its files down in the LSM such that the bottommost files are in Lmax, the second from bottommost level files are in Lmax-1 and so on. This is aimed to make it easier to migrate level_compaction_dynamic_level_bytes from false to true.  Before this change, a full manual compaction is suggested for such migration. After this change, user can just restart DB to turn on this option. db_crashtest.py is updated to randomly choose value for level_compaction_dynamic_level_bytes.

Note that there may still be too many unnecessary levels if a user is migrating from universal compaction or level compaction with a smaller level multiplier. A full manual compaction may still be needed in that case before some PR that automatically drain unnecessary levels like https://github.com/facebook/rocksdb/issues/3921 lands. Eventually we may want to change the default value of option level_compaction_dynamic_level_bytes to true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11321

Test Plan:
1. Added unit tests.
2. Crash test: ran a variation of db_crashtest.py (like 32516507e77521ae887e45091b69139e32e8efb7) that turns level_compaction_dynamic_level_bytes on and off and switches between LC and UC for the same DB.

TODO: Update `OptionChangeMigration`, either after this PR or https://github.com/facebook/rocksdb/issues/3921.

Reviewed By: ajkr

Differential Revision: D44341930

Pulled By: cbi42

fbshipit-source-id: 013de19a915c6a0502be569f07c4cc8f1c3c6be2
2023-03-27 14:55:16 -07:00
karemta-orday 40c2ec6d08 Add in-transaction multi-get-for-update to the C interface (#11107)
Summary:
Hi, this is basically a part of https://github.com/facebook/rocksdb/pull/6488 that only adds `multi_get_for_update` functionality to C API (I'd like to call it from Rust), since `multi_get` was already added here https://github.com/facebook/rocksdb/pull/9252

https://github.com/facebook/rocksdb/pull/6488 has conflicts, so I guess it might be easier to get this one in

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11107

Reviewed By: pdillinger

Differential Revision: D42680764

Pulled By: ajkr

fbshipit-source-id: a50f96e1c7f3d470b4ab07e9ff5a283e5cf44865
2023-03-27 12:14:18 -07:00
Andrew Kryczka 9f8cdc8ad6 validate SstFileWriter range tombstones cover positive ranges (#11322)
Summary:
As titled. This is the same as https://github.com/facebook/rocksdb/issues/6788 but for range tombstones written through `SstFileWriter` rather than through `DB`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11322

Reviewed By: cbi42

Differential Revision: D44317733

Pulled By: ajkr

fbshipit-source-id: f6eb8791ae2c09c169b6bfe0d047449d924b377e
2023-03-22 21:03:13 -07:00
sdong b92bc04ab0 Deflake DBCompactionTest.CancelCompactionWaitingOnConflict (#11318)
Summary:
In DBCompactionTest::CancelCompactionWaitingOnConflict, when generating SST files to trigger a compaction, we don't wait after each file, which may cause multiple memtables going to the same SST file, causing insufficient files to trigger the compaction. We do the waiting instead, except the last one, which would trigger compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11318

Test Plan: Run DBCompactionTest.CancelCompactionWaitingOnConflict multiple times.

Reviewed By: ajkr

Differential Revision: D44267273

fbshipit-source-id: 86af49b05fc67ea3335312f0f5f3d22df1520bf8
2023-03-21 15:38:33 -07:00
Andrew Kryczka 8c445407b7 Specify precedence in `SstFileWriter::DeleteRange()` API contract (#11309)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11309

Reviewed By: cbi42

Differential Revision: D44198501

Pulled By: ajkr

fbshipit-source-id: d603aca37b56aac5df255833793a3300807d63cf
2023-03-18 17:37:17 -07:00
Hui Xiao cb58477185 New stat rocksdb.{cf|db}-write-stall-stats exposed in a structural way (#11300)
Summary:
**Context/Summary:**
Users are interested in figuring out what has caused write stall.
- Refactor write stall related stats from property `kCFStats` into its own db property `rocksdb.cf-write-stall-stats` as a map or string. For now, this only contains count of different combination of (CF-scope `WriteStallCause`) + (`WriteStallCondition`)
- Add new `WriteStallCause::kWriteBufferManagerLimit` to reflect write stall caused by write buffer manager
- Add new `rocksdb.db-write-stall-stats`. For now, this only contains `WriteStallCause::kWriteBufferManagerLimit` + `WriteStallCondition::kStopped`

- Expose functions in new class `WriteStallStatsMapKeys` for examining the above two properties returned as map
- Misc: rename/comment some write stall InternalStats for clarity

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11300

Test Plan:
- New UT
- Stress test
`python3 tools/db_crashtest.py blackbox --simple --get_property_one_in=1`
- Perf test: Both converge very slowly at similar rates but post-change has higher average ops/sec than pre-change even though they are run at the same time.
```
./db_bench -seed=1679014417652004 -db=/dev/shm/testdb/ -statistics=false -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=100000 -db_write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3
```
pre-change:
```
fillseq [AVG 15 runs] : 1176 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1052.671 micros/op 949 ops/sec 105.267 seconds 100000 operations;    0.5 MB/s
fillseq [AVG 16 runs] : 1162 (± 685) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1387.330 micros/op 720 ops/sec 138.733 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 17 runs] : 1136 (± 646) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1232.011 micros/op 811 ops/sec 123.201 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 18 runs] : 1118 (± 610) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1282.567 micros/op 779 ops/sec 128.257 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 19 runs] : 1100 (± 578) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1914.336 micros/op 522 ops/sec 191.434 seconds 100000 operations;    0.3 MB/s
fillseq [AVG 20 runs] : 1071 (± 551) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1227.510 micros/op 814 ops/sec 122.751 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 21 runs] : 1059 (± 525) ops/sec;    0.5 (± 0.3) MB/sec
```
post-change:
```
fillseq [AVG 15 runs] : 1226 (± 732) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1323.825 micros/op 755 ops/sec 132.383 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 16 runs] : 1196 (± 687) ops/sec;    0.6 (± 0.4) MB/sec
fillseq      :    1223.905 micros/op 817 ops/sec 122.391 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 17 runs] : 1174 (± 647) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1168.996 micros/op 855 ops/sec 116.900 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 18 runs] : 1156 (± 611) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1348.729 micros/op 741 ops/sec 134.873 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 19 runs] : 1134 (± 579) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1196.887 micros/op 835 ops/sec 119.689 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 20 runs] : 1119 (± 550) ops/sec;    0.6 (± 0.3) MB/sec
fillseq      :    1193.697 micros/op 837 ops/sec 119.370 seconds 100000 operations;    0.4 MB/s
fillseq [AVG 21 runs] : 1106 (± 524) ops/sec;    0.6 (± 0.3) MB/sec
```

Reviewed By: ajkr

Differential Revision: D44159541

Pulled By: hx235

fbshipit-source-id: 8d29efb70001fdc52d34535eeb3364fc3e71e40b
2023-03-18 09:51:58 -07:00
Peter Dillinger 204fcff751 HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.

This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.

It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.

## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.

## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.

These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.

* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)

I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.

## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.

## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.

## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.

## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.

## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.

## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.

## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.

Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).

Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.

## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301

Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.

## Crash/stress test
Updated to use the new combination.

## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.

```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```

**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127

**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)

This is an acceptable difference IMHO.

Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.

Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
multireadrandom [AVG    30 runs] : 3444202 (± 57049) ops/sec;  240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec;  245.8 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3291022 (± 58851) ops/sec;  230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec;  235.4 MB/sec

So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:

**Before**:
multireadrandom [AVG    30 runs] : 3933777 (± 41840) ops/sec;  275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec;  277.7 MB/sec
**After**:
multireadrandom [AVG    30 runs] : 3755338 (± 30391) ops/sec;  262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec;  264.8 MB/sec

Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.

Let's also look at Get() in db_bench:

```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```

**Before**:
readrandom [AVG    30 runs] : 2198685 (± 13412) ops/sec;  153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec;  154.5 MB/sec
**After**:
readrandom [AVG    30 runs] : 2292814 (± 43508) ops/sec;  160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec;  165.4 MB/sec

That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:

**Before**:
readrandom [AVG    30 runs] : 2272333 (± 9992) ops/sec;  158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec;  159.0 MB/sec
**After**:
readrandom [AVG    30 runs] : 2332407 (± 11252) ops/sec;  163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec;  163.3 MB/sec

Reviewed By: ltamasi

Differential Revision: D44177044

Pulled By: pdillinger

fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2023-03-17 20:23:49 -07:00
anand76 eac6b6d0cd Ignore async_io ReadOption if FileSystem doesn't support it (#11296)
Summary:
In PosixFileSystem, IO uring support is opt-in. If the support is not enabled by the user, then ignore the async_io ReadOption in MultiGet and iteration at the top, rather than follow the async_io codepath and transparently switch to sync IO at the FileSystem layer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11296

Test Plan: Add new unit tests

Reviewed By: akankshamahajan15

Differential Revision: D44045776

Pulled By: anand1976

fbshipit-source-id: a0881bf763ca2fde50b84063d0068bb521edd8b9
2023-03-17 14:57:09 -07:00
hackingthekernel 291300ece8 add c-api for allowing FIFO compaction (#11156)
Summary:
Addressing issue https://github.com/facebook/rocksdb/issues/11079

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11156

Reviewed By: pdillinger

Differential Revision: D42869964

Pulled By: ajkr

fbshipit-source-id: 58214901d4e072c568d4c5cf0944a0b1c60de897
2023-03-16 16:57:03 -07:00
Peter Dillinger ccaa3225b0 Simplify tracking entries already in SecondaryCache (#11299)
Summary:
In preparation for factoring secondary cache support out of individual Cache implementations, we can get rid of the "in secondary cache" flag on entries through a workable hack: when an entry is promoted from secondary, it is inserted in primary using a helper that lacks secondary cache support, thus preventing re-insertion into secondary cache through existing logic.

This adds to the complexity of building CacheItemHelpers, because you always have to be able to get to an equivalent helper without secondary cache support, but that complexity is reasonably isolated within RocksDB typed_cache.h and test code.

gcc-7 seems to have problems with constexpr constructor referencing `this` so removed constexpr support on CacheItemHelper.

Also refactored some related test code to share common code / functionality.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11299

Test Plan: existing tests

Reviewed By: anand1976

Differential Revision: D44101453

Pulled By: pdillinger

fbshipit-source-id: 7a59d0a3938ee40159c90c3e65d7004f6a272345
2023-03-15 17:51:44 -07:00
Peter Dillinger 601efe3cf2 Misc cleanup of block cache code (#11291)
Summary:
... ahead of a larger change.
* Rename confusingly named `is_in_sec_cache` to `kept_in_sec_cache`
* Unify naming of "standalone" block cache entries (was "detached" in clock_cache)
* Remove some unused definitions in clock_cache.h (leftover from a previous revision)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11291

Test Plan: usual tests and CI, no behavior changes

Reviewed By: anand1976

Differential Revision: D43984642

Pulled By: pdillinger

fbshipit-source-id: b8bf0c5b90a932a88bcbdb413b2f256834aedf97
2023-03-15 12:08:17 -07:00
Hui Xiao 11cb6af6e5 Fix bug of prematurely excluded CF in atomic flush contains unflushed data that should've been included in the atomic flush (#11148)
Summary:
**Context:**
Atomic flush should guarantee recoverability of all data of seqno up to the max seqno of the flush. It achieves this by ensuring all such data are flushed by the time this atomic flush finishes through `SelectColumnFamiliesForAtomicFlush()`. However, our crash test exposed the following case where an excluded CF from an atomic flush contains unflushed data of seqno less than the max seqno of that atomic flush and loses its data with `WriteOptions::DisableWAL=true` in face of a crash right after the atomic flush finishes .
```
./db_stress --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
    pid=$!
    sleep 0.2
    sleep 10
    kill $pid
    sleep 0.2
./db_stress --ops_per_thread=1 --preserve_unverified_changes=1 --reopen=0 --acquire_snapshot_one_in=0 --adaptive_readahead=1 --allow_data_in_errors=True --async_io=1 --atomic_flush=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=15 --bottommost_compression_type=none --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=0 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=134217727 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=1 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=$exp --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=none --flush_one_in=0 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=2 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=524288 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=1 --manual_wal_flush_one_in=100 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=4 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=0 --periodic_compaction_seconds=100 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --ribbon_starting_level=6 --secondary_cache_fault_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --stats_dump_period_sec=10 --subcompactions=1 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=0 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=1 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=none --write_buffer_size=524288 --write_dbid_to_manifest=1 --write_fault_one_in=0 --writepercent=30 &
    pid=$!
    sleep 0.2
    sleep 40
    kill $pid
    sleep 0.2

Verification failed for column family 6 key 0000000000000239000000000000012B0000000000000138 (56622): value_from_db: , value_from_expected: 4A6331754E4F4C4D42434041464744455A5B58595E5F5C5D5253505156575455, msg: Value not found: NotFound:
Crash-recovery verification failed :(
No writes or ops?
Verification failed :(
```

The bug is due to the following:
- When atomic flush is used, an empty CF is legally [excluded](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L39) in `SelectColumnFamiliesForAtomicFlush` as the first step of `DBImpl::FlushForGetLiveFiles` before [passing](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_filesnapshot.cc#L42) the included CFDs to `AtomicFlushMemTables`.
- But [later](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2133) in `AtomicFlushMemTables`, `WaitUntilFlushWouldNotStallWrites` will [release the db mutex](https://github.com/facebook/rocksdb/blob/7.10.fb/db/db_impl/db_impl_compaction_flush.cc#L2403), during which data@seqno N can be inserted into the excluded CF and data@seqno M can be inserted into one of the included CFs, where M > N.
- However, data@seqno N in an already-excluded CF is thus excluded from this atomic flush while we seqno N is less than seqno M.

**Summary:**
- Replace `SelectColumnFamiliesForAtomicFlush()`-before-`AtomicFlushMemTables()` with `SelectColumnFamiliesForAtomicFlush()`-after-wait-within-`AtomicFlushMemTables()` so we ensure no write affecting the recoverability of this atomic job (i.e, change to max seqno of this atomic flush or insertion of data with less seqno than the max seqno of the atomic flush to excluded CF) can happen after calling `SelectColumnFamiliesForAtomicFlush()`.
- For above, refactored and clarified comments on `SelectColumnFamiliesForAtomicFlush()` and `AtomicFlushMemTables()` for clearer semantics of passed-in CFDs to atomic-flush

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11148

Test Plan:
- New unit test failed before the fix and passes after
- Make check
- Rehearsal stress test

Reviewed By: ajkr

Differential Revision: D42799871

Pulled By: hx235

fbshipit-source-id: 13636b63e9c25c5895857afc36ea580d57f6d644
2023-03-14 16:53:20 -07:00
Levi Tamasi 49881921cd Rename a recently added PerfContext counter (#11294)
Summary:
The patch renames the counter added in https://github.com/facebook/rocksdb/issues/11284 for better consistency with the existing naming scheme.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11294

Test Plan: `make check`

Reviewed By: jowlyzhang

Differential Revision: D44035964

Pulled By: ltamasi

fbshipit-source-id: 8b1a2a03ee728148365367e0ecc1fcf462f62191
2023-03-13 18:43:27 -07:00
Peter Dillinger 648e972f30 Document DB::Resume(), fix LockWALInEffect test (#11290)
Summary:
In rare cases seeing failures like this

```
[ RUN      ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/2
db/db_write_test.cc:653: Failure
Put("key3", "value")
Corruption: Not active
```

in a test with no explicit threading. This is likely because of the unpredictability of background auto-resume. I didn't really know this feature, in part because DB::Resume() was undocumented. So I believe I have fixed the test and documented the API function.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11290

Test Plan: 1000s of stress runs of the test with gtest-parallel

Reviewed By: anand1976

Differential Revision: D43984583

Pulled By: pdillinger

fbshipit-source-id: d30dec120b4864e193751b2e33ff16834d313db3
2023-03-13 14:19:59 -07:00
Changyu Bi 9aa3b6f9ae Support range deletion tombstones in `CreateColumnFamilyWithImport` (#11252)
Summary:
CreateColumnFamilyWithImport() did not support range tombstones for two reasons:
1. it uses point keys of a input file to determine its boundary (smallest and largest internal key), which means range tombstones outside of the point key range will be effectively dropped.
2. it does not handle files with no point keys.

Also included a fix in external_sst_file_ingestion_job.cc where the blocks read in `GetIngestedFileInfo()` can be added to block cache now (issue fixed in https://github.com/facebook/rocksdb/pull/6429).

This PR adds support for exporting and importing column family with range tombstones. The main change is to add smallest internal key and largest internal key to `SstFileMetaData` that will be part of the output of `ExportColumnFamily()`. Then during `CreateColumnFamilyWithImport(...,const ExportImportFilesMetaData& metadata,...)`, file boundaries can be set from `metadata` directly. This is needed since when file boundaries are extended by range tombstones, sometimes they cannot be deduced from a file's content alone.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11252

Test Plan:
- added unit tests that fails before this change

Closes https://github.com/facebook/rocksdb/issues/11245

Reviewed By: ajkr

Differential Revision: D43577443

Pulled By: cbi42

fbshipit-source-id: 6bff78e583cc50c44854994dea0a8dd519398f2f
2023-03-13 11:06:59 -07:00
Jaepil Jeong 969d4e1dd2 Fix compile errors in Clang due to unused variables depending on the build configuration (#11234)
Summary:
This PR fixes compilation errors in Clang due to unused variables like the below:
```
[109/329] Building CXX object CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o
FAILED: CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o
ccache /opt/homebrew/opt/llvm/bin/clang++ -DGFLAGS=1 -DGFLAGS_IS_A_DLL=0 -DHAVE_FULLFSYNC -DJEMALLOC_NO_DEMANGLE -DLZ4 -DOS_MACOSX -DROCKSDB_JEMALLOC -DROCKSDB_LIB_IO_POSIX -DROCKSDB_NO_DYNAMIC_EXTENSION -DROCKSDB_PLATFORM_POSIX -DSNAPPY -DTBB -DZLIB -DZSTD -I/Users/jaepil/work/deepsearch/deps/cpp/rocksdb -I/Users/jaepil/work/deepsearch/deps/cpp/rocksdb/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk -I/Users/jaepil/app/include -I/opt/homebrew/include -I/opt/homebrew/opt/llvm/include -I/opt/homebrew/opt/llvm/include/c++/v1 -W -Wextra -Wall -pthread -Wsign-compare -Wshadow -Wno-unused-parameter -Wno-unused-variable -Woverloaded-virtual -Wnon-virtual-dtor -Wno-missing-field-initializers -Wno-strict-aliasing -Wno-invalid-offsetof -fno-omit-frame-pointer -momit-leaf-frame-pointer -march=armv8-a+crc+crypto -Wno-unused-function -Werror -O2 -g -DNDEBUG -arch arm64 -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.1.sdk -std=gnu++20 -MD -MT CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o -MF CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o.d -o CMakeFiles/rocksdb.dir/db/version_edit_handler.cc.o -c /Users/jaepil/work/deepsearch/deps/cpp/rocksdb/db/version_edit_handler.cc
/Users/jaepil/work/deepsearch/deps/cpp/rocksdb/db/version_edit_handler.cc:30:10: error: variable 'recovered_edits' set but not used [-Werror,-Wunused-but-set-variable]
  size_t recovered_edits = 0;
         ^
1 error generated.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11234

Reviewed By: cbi42

Differential Revision: D43458604

Pulled By: ajkr

fbshipit-source-id: d8c50e1a108887b037a120cd9f19374ddaeee817
2023-03-09 16:42:57 -08:00
Levi Tamasi 1d52438504 Add a PerfContext counter for merge operands applied in point lookups (#11284)
Summary:
The existing PerfContext counter `internal_merge_count` only tracks the
Merge operands applied during range scans. The patch adds a new counter
called `internal_merge_count_point_lookups` to track the same metric
for point lookups (`Get` / `MultiGet` / `GetEntity` / `MultiGetEntity`), and
also fixes a couple of cases in the iterator where the existing counter wasn't
updated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11284

Test Plan: `make check`

Reviewed By: jowlyzhang

Differential Revision: D43926082

Pulled By: ltamasi

fbshipit-source-id: 321566d8b4cf0a3b6c9b73b7a5c984fb9bb492e9
2023-03-08 18:22:11 -08:00
Levi Tamasi a1a3b23346 Deflake/fix BlobSourceCacheReservationTest.IncreaseCacheReservationOnFullCache (#11273)
Summary:
`BlobSourceCacheReservationTest.IncreaseCacheReservationOnFullCache` is both flaky and also doesn't do what its name says. The patch changes this test so it actually tests increasing the cache reservation, hopefully also deflaking it in the process.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11273

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43800935

Pulled By: ltamasi

fbshipit-source-id: 5eb54130dfbe227285b0e14f2084aa4b89f0b107
2023-03-06 09:50:39 -08:00
Peter Dillinger e168c1b1a4 Use FaultInjectionTestFS in DBWriteTest.LockWALInEffect (#11271)
Summary:
Existing use of FaultInjectionTestEnv shows rare TSAN errors with parallel Sync and Flush. This appears to be fixed in FaultInjectionTestFS. (Sigh, code duplication and divergence.)

Example failure:
https://app.circleci.com/pipelines/github/facebook/rocksdb/24631/workflows/fc2a66f0-f21c-48d6-a944-3885bcff50a4/jobs/571928

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11271

Test Plan: wasn't able to reproduce locally but stress tested the updated test with gtest-parallel -r1000 and TSAN.

Reviewed By: ajkr

Differential Revision: D43779477

Pulled By: pdillinger

fbshipit-source-id: a019b0f1d4045a26a15ab08aab63828a398f6d3e
2023-03-05 08:21:16 -08:00
Igor Canadi ddde1e6af8 Avoid ColumnFamilyDescriptor copy (#10978)
Summary:
Hi. :) Noticed we are copying ColumnFamilyDescriptor here because my process crashed during copy constructor (cause unrelated)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10978

Reviewed By: cbi42

Differential Revision: D41473924

Pulled By: ajkr

fbshipit-source-id: 58a3473f2d7b24918f79d4b2726c20081c5e95b4
2023-03-03 20:55:31 -08:00
Changyu Bi d053926fa2 Improve documentation for MergingIterator (#11161)
Summary:
Add some comments to try to explain how/why MergingIterator works. Made some small refactoring, mostly in MergingIterator::SkipNextDeleted() and MergingIterator::SeekImpl().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11161

Test Plan:
crash test with small key range:
```
python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=6000 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10
```

Reviewed By: ajkr

Differential Revision: D42860994

Pulled By: cbi42

fbshipit-source-id: 3f0c1c9c6481a7f468bf79d823998907a8116e9e
2023-03-03 12:17:30 -08:00
Yu Zhang 8dfcfd4e90 Fix backward iteration issue when user defined timestamp is enabled in BlobDB (#11258)
Summary:
During backward iteration, blob verification would fail because the user key (ts included) in `saved_key_` doesn't match the blob. This happens because during`FindValueForCurrentKey`, `saved_key_` is not updated when the user key(ts not included) is the same for all cases except when `timestamp_lb_` is specified. This breaks the blob verification logic when user defined timestamp is enabled and `timestamp_lb_` is not specified. Fix this by always updating `saved_key_` when a smaller user key (ts included) is seen.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11258

Test Plan:
`make check`
`./db_blob_basic_test --gtest_filter=DBBlobWithTimestampTest.IterateBlobs`

Run db_bench (built with DEBUG_LEVEL=0) to demonstrate that no overhead is introduced with:

`./db_bench -user_timestamp_size=8  -db=/dev/shm/rocksdb -disable_wal=1 -benchmarks=fillseq,seekrandom[-W1-X6] -reverse_iterator=1 -seek_nexts=5`

Baseline:

- seekrandom [AVG    6 runs] : 72188 (± 1481) ops/sec;   37.2 (± 0.8) MB/sec

With this PR:

- seekrandom [AVG    6 runs] : 74171 (± 1427) ops/sec;   38.2 (± 0.7) MB/sec

Reviewed By: ltamasi

Differential Revision: D43675642

Pulled By: jowlyzhang

fbshipit-source-id: 8022ae8522d1f66548821855e6eed63640c14e04
2023-03-01 13:28:54 -08:00
Levi Tamasi 3c9eed688e Enable moving a string or PinnableSlice into PinnableWideColumns (#11248)
Summary:
This makes it possible to eliminate some copies in `GetEntity` / `MultiGetEntity`,
in particular when `Merge`s or blobs are involved.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11248

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43544215

Pulled By: ltamasi

fbshipit-source-id: bc4c8955a24bbd8bc4ab098e72133ead757f9707
2023-02-24 10:33:00 -08:00
Yu Zhang f007b8fdea Support iter_start_ts in integrated BlobDB (#11244)
Summary:
Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244

Test Plan:
```make check
./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs"
tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1```

Reviewed By: ltamasi

Differential Revision: D43506726

Pulled By: jowlyzhang

fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0
2023-02-22 15:44:59 -08:00
Changyu Bi 229297d1b8 Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113)
Summary:
A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction outputs is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)).

The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvement after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113

Test Plan:
* added unit test in db_range_del_test
* crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10`

Reviewed By: ajkr

Differential Revision: D42655709

Pulled By: cbi42

fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c
2023-02-22 12:28:18 -08:00
ywave 9fa9becf53 fix -Wrange-loop-analysis in Apple clang version 12.0.0 (clang-1200.0.32.29) (#11240)
Summary:
Fix complain
```
db/db_impl/db_impl_compaction_flush.cc:417:19: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type
      'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis]
  for (const auto bg_flush_arg : bg_flush_args) {
                  ^
db/db_impl/db_impl_compaction_flush.cc:417:8: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying
  for (const auto bg_flush_arg : bg_flush_args) {
       ^~~~~~~~~~~~~~~~~~~~~~~~~
                  &
db/db_impl/db_impl_compaction_flush.cc:2911:21: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type
      'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis]
    for (const auto bg_flush_arg : bg_flush_args) {
                    ^
db/db_impl/db_impl_compaction_flush.cc:2911:10: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying
    for (const auto bg_flush_arg : bg_flush_args) {
         ^~~~~~~~~~~~~~~~~~~~~~~~~
                    &
```
from

```sh
xxx@MacBook-Pro / % g++ -v
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1
Apple clang version 12.0.0 (clang-1200.0.32.29)
Target: x86_64-apple-darwin21.6.0
Thread model: posix
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11240

Reviewed By: cbi42

Differential Revision: D43458729

Pulled By: ajkr

fbshipit-source-id: 26e110f83451509463a1bc308f737ccb693c9f45
2023-02-22 05:44:03 -08:00
Changyu Bi 1b48ecc2c6 Fix an assertion failure in DBIter::SeekToLast() when user-defined timestamp is enabled (#11223)
Summary:
in DBIter::SeekToLast(), key() can be called when iter is invalid and fails the following assertion:
```
./db/db_iter.h:153: virtual rocksdb::Slice rocksdb::DBIter::key() const: Assertion `valid_' failed.
```
This happens when `iterate_upper_bound` and timestamp_lb_ are set. SeekForPrev(*iterate_upper_bound_) positions the iterator on the same user key as *iterate_upper_bound_. A subsequent PrevInternal() call makes the iterator invalid just be the call to key().

This PR fixes this issue by setting updating the seek key to have max sequence number AND max timestamp when the seek key has the same user key as *iterate_upper_bound_.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11223

Test Plan: - Added a unit test that would fail the above assertion before this fix.

Reviewed By: jowlyzhang

Differential Revision: D43283600

Pulled By: cbi42

fbshipit-source-id: 0dd3999845b722584679bbc95be2664b266005ba
2023-02-21 11:57:58 -08:00
leipeng ea85148b78 DBIter::FindNextUserEntryInternal: do not PrepareValue for `Delete` (#11211)
Summary:
`kTypeDeletion/kTypeDeletionWithTimestamp/kTypeSingleDeletion` does not need access iter value, so omit `PrepareValue`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11211

Reviewed By: ajkr

Differential Revision: D43253068

Pulled By: cbi42

fbshipit-source-id: 1945c7f8a90b6909128a0553b62d9fd1078b0a08
2023-02-21 11:26:30 -08:00
HuangYi 83bc03a99a add c api to set option fail_if_not_bottommost_level (#11158)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11158

Reviewed By: cbi42

Differential Revision: D42870647

Pulled By: ajkr

fbshipit-source-id: 1b71a1dd415c34c332cecf60c68ce37fe4393e2a
2023-02-21 10:52:09 -08:00
HuangYi cfe50f7e77 add c api for HyperClockCache (#11110)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11110

Reviewed By: cbi42

Differential Revision: D42660941

Pulled By: ajkr

fbshipit-source-id: e977d9b76dfd5d8c62335f961c275f3b810503d7
2023-02-21 10:00:43 -08:00
Matt Jurik 142b18d00b C-API: Support multi-CF flush (#11112)
Summary:
This PR adds support to the c-api bindings for calling `Flush()` with multiple column families, which is useful for performing atomic flushes (assuming also that the db has been opened with `atomic_flush = true`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11112

Reviewed By: cbi42

Differential Revision: D42666382

Pulled By: ajkr

fbshipit-source-id: 82f05bf32d28452d85c79ea42411c8fea961fd87
2023-02-21 09:10:03 -08:00
mrambacher b6640c3117 Remove FactoryFunc from LoadXXXObject (#11203)
Summary:
The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available.  With the removal of LITE mode, the function was no longer required.

Note that the MergeOperator had some private classes defined in header files.  To gain access to their constructors (and name methods), the class definitions were moved into header files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203

Reviewed By: cbi42

Differential Revision: D43160255

Pulled By: pdillinger

fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c
2023-02-17 12:54:07 -08:00
Andrew Kryczka 25e1365227 Merge operator failed subcode (#11231)
Summary:
From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions.

This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231

Test Plan: updated unit test

Reviewed By: akankshamahajan15

Differential Revision: D43396607

Pulled By: ajkr

fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9
2023-02-17 10:58:46 -08:00
Levi Tamasi 9794acb597 Add a new MultiGetEntity API (#11222)
Summary:
The new `MultiGetEntity` API can be used to get a consistent view of
a batch of keys, with the results presented as wide-column entities.
Similarly to `GetEntity` and the iterator's `columns` API, if the entry
corresponding to the key is a wide-column entity to start with, it is
returned as-is, and if it is a plain key-value, it is wrapped into an entity
with a single default column.

Implementation-wise, the new API shares the logic of the batched `MultiGet`
API (via the `MultiGetCommon` methods). Both single-CF and multi-CF
`MultiGetEntity` APIs are provided, and blobs are also supported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11222

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43256950

Pulled By: ltamasi

fbshipit-source-id: 47fb2cb7e2d0470e3580f43fdb2fe9e51f0e7005
2023-02-15 09:34:17 -08:00
Peter Dillinger 3cacd4b4ec Put Cache and CacheWrapper in new public header (#11192)
Summary:
The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment.

In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544

Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it.

Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192

Test Plan: `make check`

Reviewed By: anand1976

Differential Revision: D43055487

Pulled By: pdillinger

fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110
2023-02-09 12:12:02 -08:00
Peter Dillinger b7747bbc9f Attempt fix flaky DBWriteTest.LockWALInEffect (#11209)
Summary:
Example failure:
```
[ RUN      ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/1
db/db_write_test.cc:646: Failure
Put("key3", "value")
Corruption: Not active
```
Presumably from a background compaction prior to Put.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11209

Test Plan: watch CI

Reviewed By: akankshamahajan15

Differential Revision: D43147727

Pulled By: pdillinger

fbshipit-source-id: a1c34ac5ab124bfe2f23205a30777990056e9082
2023-02-09 09:21:55 -08:00
anand76 77b61abc7b Fix bug in WAL streaming uncompression (#11198)
Summary:
Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug is when consecutive fragments of a compressed record are located at the same offset in the log reader buffer, the second fragment input buffer is treated as a leftover from the previous input buffer. As a result, the offset in the `ZSTD_inBuffer` is not reset.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198

Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it.

Reviewed By: ajkr, cbi42

Differential Revision: D43102692

Pulled By: anand1976

fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd
2023-02-08 12:05:49 -08:00
Levi Tamasi 876d281592 Add compaction filter support for wide-column entities (#11196)
Summary:
The patch adds compaction filter support for wide-column entities by introducing
a new `CompactionFilter` API called `FilterV3`. This API is called for regular
key-values, merge operands, and wide-column entities as well. It is passed the
existing value/operand or wide-column structure and it can update the value or
columns or keep/delete/etc. the key-value as usual. For compatibility, the default
implementation of `FilterV3` keeps all wide-column entities and falls back to calling
`FilterV2` for plain old key-values and merge operands.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D43094147

Pulled By: ltamasi

fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd
2023-02-07 16:17:39 -08:00
Hui Xiao 9b66331388 Simplify TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL) (#11186)
Summary:
**Context/Summary**:
Simplify `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` based on https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134 and delete unused sync points.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11186

Test Plan:
- UT failed before fix in https://github.com/facebook/rocksdb/pull/10892 and passes after
- Check UT not flaky when running with https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878

Reviewed By: ajkr

Differential Revision: D43034723

Pulled By: hx235

fbshipit-source-id: f503774987b8f3718505f99e95080a7fad28ac66
2023-02-06 16:10:03 -08:00
Peter Dillinger 0cf1008fe3 Deprecate write_global_seqno and default to false (#11179)
Summary:
This option has long been intended to be set to false by default and deprecated. It might never be practical to completely remove the feature, so that we can continue to test for backward compatibility by keeping the ability to generate DBs in the old way.

Also improved API comments.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11179

Test Plan: existing tests (with one tiny update)

Reviewed By: hx235

Differential Revision: D42973927

Pulled By: pdillinger

fbshipit-source-id: e9bc161cb933266e094aea2dff8cc03753c39dab
2023-02-03 13:00:04 -08:00
Peter Dillinger 390cc0b156 Ensure LockWAL() stall cleared for UnlockWAL() return (#11172)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11160

By counting the number of stalls placed on a write queue, we can check in UnlockWAL() whether the stall present at the start of UnlockWAL() has been cleared by the end, or wait until it's cleared.

More details in code comments and new unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11172

Test Plan: unit test added. Yes, it uses sleep to amplify failure on buggy behavior if present, but using a sync point to only allow new behavior would fail with the old code only because it doesn't contain the new sync point. Basically, using a sync point in UnlockWAL() could easily mask a regression by artificially limiting key behaviors. The test would only check that UnlockWAL() invokes code that *should* do the right thing, without checking that it *does* the right thing.

Reviewed By: ajkr

Differential Revision: D42894341

Pulled By: pdillinger

fbshipit-source-id: 15c9da0ca383e6aec845b29f5447d76cecbf46c3
2023-02-03 12:08:37 -08:00
anand76 63da9cfa26 Return any errors returned by ReadAsync to the MultiGet caller (#11171)
Summary:
Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user.

Tests:
Update existing unit tests and add a new one for this scenario

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171

Reviewed By: akankshamahajan15

Differential Revision: D42950057

Pulled By: anand1976

fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869
2023-02-02 16:35:27 -08:00
Levi Tamasi df680b24ef Clean up InvokeFilterIfNeeded a bit (#11174)
Summary:
The patch makes some code quality enhancements in `CompactionIterator::InvokeFilterIfNeeded`
including the renaming of `filter` (which is most likely a remnant of the days before the `FilterV2`
API when the compaction filter used to return a boolean) to `decision`, the removal of some
outdated comments, the elimination of an `error` flag which was only used in one failure case
out of many, as well as some small stylistic improvements. (Some the above will also come in
handy when adding compaction filter support for wide-column entities.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11174

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D42901408

Pulled By: ltamasi

fbshipit-source-id: ab382d59a4990c5dfe1cee219d49e1d80902b666
2023-02-01 10:03:07 -08:00
Andrew Kryczka 071c33846d Allow canceling manual compaction while waiting for conflicting compaction (#11165)
Summary:
This PR adds logic to the `RunManualCompaction()` loop to check for cancellation before waiting on any conflicting compactions to finish. In case of cancellation, `RunManualCompaction()` no longer waits on conflicting compactions

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11165

Test Plan: repro test case

Reviewed By: cbi42

Differential Revision: D42864058

Pulled By: ajkr

fbshipit-source-id: ea4dd1a8f294abe212905495a8fbe8f07fca3f5a
2023-01-31 16:57:49 -08:00
Levi Tamasi a82021c3d0 Fix a bug where GetEntity would expose a blob reference (#11162)
Summary:
The patch fixes a feature interaction bug between BlobDB and the `GetEntity` API:
without the patch, `GetEntity` would return the blob reference (wrapped into a
single-column entity) instead of the actual blob value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11162

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D42854092

Pulled By: ltamasi

fbshipit-source-id: f750d0ff57def107da16f545077ddce9860ff21a
2023-01-31 09:59:25 -08:00
Peter Dillinger 94e3beec77 Cleanup, improve, stress test LockWAL() (#11143)
Summary:
The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change:

* Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use.
* Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions)
* Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability)
* Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other.
* Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads.
* Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143

Test Plan: unit tests added / updated, added to stress/crash test

Reviewed By: ajkr

Differential Revision: D42848627

Pulled By: pdillinger

fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4
2023-01-30 22:52:30 -08:00
Yu Zhang 24ac53d81a Use user key on sst file for blob verification for Get and MultiGet (#11105)
Summary:
Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller.

Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105

Test Plan:
make V=1 db_blob_basic_test
./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*"

Reviewed By: ltamasi

Differential Revision: D42716487

Pulled By: jowlyzhang

fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a
2023-01-30 10:21:21 -08:00
akankshamahajan 79e57a39a3 Move ExternalSSTTestEnv to FileSystemWrapper (#11139)
Summary:
Migrate ExternalSSTTestEnv to FileSystemWrapper

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11139

Reviewed By: anand1976

Differential Revision: D42780180

Pulled By: akankshamahajan15

fbshipit-source-id: 9a4448c9fe5186b518235fe11e1a34dcad897cdd
2023-01-27 14:51:39 -08:00
sdong 4720ba4391 Remove RocksDB LITE (#11147)
Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.

Most of changes were done through following comments:

unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`

by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147

Test Plan: See CI

Reviewed By: pdillinger

Differential Revision: D42796341

fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
2023-01-27 13:14:19 -08:00
Yu Zhang 6943ff6e50 Remove deprecated util functions in options_util.h (#11126)
Summary:
Remove the util functions in options_util.h that have previously been marked deprecated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126

Test Plan: `make check`

Reviewed By: ltamasi

Differential Revision: D42757496

Pulled By: jowlyzhang

fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb
2023-01-27 11:10:53 -08:00
Karim TAAM a1e92bd956 use verify checksum option in block based table reader Open() (#11099)
Summary:
## Description
In this issue https://github.com/facebook/rocksdb/issues/11002 we found that when we use rocksdb with the `verify checksum` read_option to false the verification is done anyway

By analyzing the code along the stacktrace I saw that at the level of https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129f we are not keeping all the options and we forget the `verify_checksum`

the comment in this class suggests that it should be managed https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fL581

<img width="1724" alt="204511641-86ab4b9b-45e5-4a2b-a13d-81fa26435d38" src="https://user-images.githubusercontent.com/26581503/213152802-c46bc1c7-a3a2-4a6f-9bb1-bf92ee93af7a.png">

this PR just adds the line to manage the `verify checksum`

## Tests

- Running unit tests
- Test without setting `verify checksum` and verifying that we are calling the checksum code
- Test by setting `verify checksum` to true and verifying that we are calling the checksum code
- Test by setting `verify checksum` to false and verifying that we are **not** calling the checksum code

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11099

Reviewed By: cbi42

Differential Revision: D42679881

Pulled By: ajkr

fbshipit-source-id: c7dd10768282fd0699f7e1bf397ceb7adbea4ab6
2023-01-26 17:38:59 -08:00
Andrew Kryczka b44cbbf709 Fix GetMergeOperands() returning MergeInProgress (#11136)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11136

Test Plan: the provided unit test used to fail due to `GetMergeOperands()` returning `Status::MergeInProgress()`; it passes now because the `GetMergeOperands()` call returns `Status::OK()`

Reviewed By: pdillinger

Differential Revision: D42759198

Pulled By: ajkr

fbshipit-source-id: 878f9f40ccc1d7e2fe7b1352814bae3a49c19939
2023-01-26 15:11:19 -08:00
akankshamahajan 986c5b9d4e Migrate TestEnv in listener_test.cc to FileSystemWrapper (#11125)
Summary:
Migrate derived classes from EnvWrapper to FileSystemWrapper so we can eventually deprecate the storage methods in Env.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11125

Test Plan: CircleCI jobs

Reviewed By: anand1976

Differential Revision: D42732241

Pulled By: akankshamahajan15

fbshipit-source-id: c89a70a79fcfb13e158bf8919b1a87a9de133222
2023-01-25 22:42:22 -08:00
Levi Tamasi 6da2e20df3 Remove more obsolete statistics (#11131)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11131

Test Plan: `make check`

Reviewed By: pdillinger

Differential Revision: D42753997

Pulled By: ltamasi

fbshipit-source-id: ce8b84c1e55374257e93ed74fd255c9b759723ce
2023-01-25 15:14:13 -08:00
Peter Dillinger 546e213c4f Fix DelayWrite() calls for two_write_queues (#11130)
Summary:
PR https://github.com/facebook/rocksdb/issues/11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been about to generate a unit test that reproduces this failure but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. https://github.com/facebook/rocksdb/issues/11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader.

This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that dimishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue.

HISTORY not updated because this is intended for the same release where the bug was introduced.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11130

Test Plan:
Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case.

Will also run crash_test_with_multiops_wc_txn for a while looking for issues.

Reviewed By: ajkr

Differential Revision: D42749330

Pulled By: pdillinger

fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b
2023-01-25 14:18:27 -08:00
anand76 bcbab59c55 Migrate ErrorEnv from EnvWrapper to FileSystemWrapper (#11124)
Summary:
Migrate ErrorEnv from EnvWrapper to FileSystemWrapper so we can eventually deprecate the storage methods in Env.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11124

Reviewed By: akankshamahajan15

Differential Revision: D42727791

Pulled By: anand1976

fbshipit-source-id: e8362ad624dc28e55c99fc35eda12866755f62c6
2023-01-24 17:14:35 -08:00
sdong 2800aa069a Remove compressed block cache (#11117)
Summary:
Compressed block cache is replaced by compressed secondary cache. Remove the feature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117

Test Plan: See CI passes

Reviewed By: pdillinger

Differential Revision: D42700164

fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d
2023-01-24 17:09:19 -08:00
Peter Dillinger 4a9185340d A better contract for best_efforts_recovery (#11085)
Summary:
Capture more of the original intent at a high level, without getting bogged down in low-level details.

The old text made some weak promises about handling of LOCK files. There should be no specific concern for LOCK files, because we already rely on LockFile() to create the file if it's not present already. And the lock file is generally size 0, so don't have to worry about truncation. Added a unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11085

Test Plan: existing tests, and a new one.

Reviewed By: siying

Differential Revision: D42713233

Pulled By: pdillinger

fbshipit-source-id: 2fce7c974d35fac065037c9c4c7326a59c9fe340
2023-01-24 12:55:03 -08:00
Hui Xiao 86fa2592be Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job owns flush_reason instead of CFD (#11111)
Summary:
**Context:**
Concurrent flushes on the same CF can set on `ColumnFamilyData::flush_reason` before each other flush finishes. An symptom is one CF has different flush_reason with others though all of them are in an atomic flush  `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed. `

**Summary:**
Suggested by ltamasi, we now refactor and let FlushRequest/Job to own flush_reason as there is no good way to define `ColumnFamilyData::flush_reason` in face of concurrent flushes on the same CF (which wasn't the case a long time ago when `ColumnFamilyData::flush_reason ` first introduced`)

**Tets:**
- new unit test
- make check
- aggressive crash test rehearsal

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111

Reviewed By: ajkr

Differential Revision: D42644600

Pulled By: hx235

fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf
2023-01-24 09:54:04 -08:00
Andrew Kryczka b7fbcefda8 Add API to limit blast radius of merge operator failure (#11092)
Summary:
Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage.

To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092

Reviewed By: siying

Differential Revision: D42525673

Pulled By: ajkr

fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee
2023-01-20 14:40:30 -08:00
Changyu Bi e9d6a0d7ce Fix asan failure caused by range tombstone start key use-after-free (#11106)
Summary:
the `last_tombstone_start_user_key` variable in `BuildTable()` and in `CompactionOutputs::AddRangeDels()` may point to a start key that is freed if user-defined timestamp is enabled. This was causing ASAN failure and this PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11106

Test Plan: Added UT for repro.

Reviewed By: ajkr

Differential Revision: D42590862

Pulled By: cbi42

fbshipit-source-id: c493265ececdf89636d801d55ae929806c4d4b2c
2023-01-18 16:38:07 -08:00
Changyu Bi 4d0f9a995c Consider TTL compaction file cutting earlier to prevent small output file (#11075)
Summary:
in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075

Test Plan: - Added new unit test.

Reviewed By: hx235

Differential Revision: D42398739

Pulled By: cbi42

fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7
2023-01-17 16:42:41 -08:00
Changyu Bi 6a82b68788 Avoid counting extra range tombstone compensated size in `AddRangeDels()` (#11091)
Summary:
in `CompactionOutputs::AddRangeDels()`, range tombstones with the same start and end key but different sequence numbers all contribute to compensated range tombstone size. This PR removes this redundancy. This PR also includes a fix from https://github.com/facebook/rocksdb/issues/11067 where a range tombstone that is not within a file's range was being added to the file. This fixes an assertion failure for `icmp.Compare(start, end) <= 0` in VersionSet::ApproximateSize() when calculating compensated range tombstone size. Assertions and a comment/essay was added to reason that no such range tombstone will be added after this fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11091

Test Plan:
- Added unit tests
- Stress test with small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10`

Reviewed By: ajkr

Differential Revision: D42521588

Pulled By: cbi42

fbshipit-source-id: 5bda3fe38997995314e1f7592319af12b69bc4f8
2023-01-17 12:47:44 -08:00
Changyu Bi f515d9d203 Revert #10802 Consider range tombstone in compaction output file cutting (#11089)
Summary:
This reverts commit f02c708aa3 since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them) and that I do not have a high confidence to fix all of them and ensure no further ones before the next release branch cut. There are also come existing issue found during bug fixing. We will work on it and try to merge it to the release after.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089

Test Plan: existing CI.

Reviewed By: ajkr

Differential Revision: D42505972

Pulled By: cbi42

fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153
2023-01-13 12:28:21 -08:00
Peter Dillinger 9f7801c5f1 Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).

The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.

* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)

## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
  * Simpler for implementations to deal with just one Insert and one Lookup.
  * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
  * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
  * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
  * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
  * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.

## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).

The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.

These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)

Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.

## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.

## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.

The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.

## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)

## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975

Test Plan:
tests updated

Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):

34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83

Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.

Reviewed By: anand1976

Differential Revision: D42417818

Pulled By: pdillinger

fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2023-01-11 14:20:40 -08:00
Changyu Bi 0a2d3b663a Fix some unit test failure in ExternalSSTFileBasicTest (#11070)
Summary:
valgrind build for `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithMixedValueType` and `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno` started failing (see error message in T141554665). I could not repro but I suspect it is due to file ingestion range overlapping with ongoing compaction, which caused a new global seqno being assigned after https://github.com/facebook/rocksdb/issues/10988.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11070

Test Plan: monitor future valgrind tests result.

Reviewed By: hx235

Differential Revision: D42319056

Pulled By: cbi42

fbshipit-source-id: acbcd841a2a15e36b278f39ba514f4b9a6ee43ca
2023-01-05 12:10:02 -08:00
Niklas Fiekas ff04fb154b Add C API for ReadOptions::async_io (#11062)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11062

Reviewed By: hx235

Differential Revision: D42297489

Pulled By: ajkr

fbshipit-source-id: 03fe1477c1ae1f8af73dc77a6986fdc7025edf4f
2023-01-04 19:36:43 -08:00
ywave 7f71880de9 Fix typo in flushing stats CF (#11055)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11055

Test Plan: make check

Reviewed By: cbi42

Differential Revision: D42232828

Pulled By: ajkr

fbshipit-source-id: 3b46514aebff4da7e47b9954b90800ba4a3ba30b
2022-12-30 16:55:55 -08:00
HuangYi 33aca893c2 add c-api for setting option optimize_filters_for_memory (#11044)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11044

Reviewed By: cbi42

Differential Revision: D42152851

Pulled By: ajkr

fbshipit-source-id: 81710d9503ba4f23f112c72ebf16a48112e27158
2022-12-30 16:53:00 -08:00
Hui Xiao 9502856edd Add missing range conflict check between file ingestion and RefitLevel() (#10988)
Summary:
**Context:**
File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998.

RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions.
- Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason.
- But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in 0f88160f67, 5c64fb67d2 and 87dfc1d23e.
- Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR.

Above are bugs resulting in two bad consequences:
- If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true`
- If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](c62f322169/db/db_iter.cc (L342-L343)) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build.

Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions.

**Summary:**
- Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them.  File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level.
- Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing.
- Replace scattered fixes (0f88160f67, 5c64fb67d2 and 87dfc1d23e.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook.
- Misc: logic cleanup, see PR comments

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988

Test Plan:
- New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards.
- Made compatible with existing tests, see PR comments
- make check
- [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: cbi42

Differential Revision: D41535685

Pulled By: hx235

fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258
2022-12-29 15:05:36 -08:00
Changyu Bi cc6f323705 Include estimated bytes deleted by range tombstones in compensated file size (#10734)
Summary:
compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734

Test Plan:
- Added unit tests.
- benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used:
```
./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000
```

In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size.

Compaction stats from this PR:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      2/0   31.54 MB   0.5      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     63.4    135.56            110.94       544    0.249       0      0       0.0       0.0
  L4      3/0   96.55 MB   0.8     18.5     6.7     11.8      18.4      6.6       0.0   2.7     65.3     64.9    290.08            284.03       108    2.686    284M  1957K       0.0       0.0
  L5     15/0   404.41 MB   1.0     19.1     7.7     11.4      18.8      7.4       0.3   2.5     66.6     65.7    292.93            285.34       220    1.332    293M  3808K       0.0       0.0
  L6    143/0    4.12 GB   0.0     45.0     7.5     37.5      41.6      4.1       0.0   5.5     71.2     65.9    647.00            632.66       251    2.578    739M    47M       0.0       0.0
 Sum    163/0    4.64 GB   0.0     82.6    21.9     60.7      87.2     26.5       0.3  10.4     61.9     65.4   1365.58           1312.97      1123    1.216   1318M    52M       0.0       0.0
```

Compaction stats from main:
```
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      0/0    0.00 KB   0.0      0.0     0.0      0.0       8.4      8.4       0.0   1.0      0.0     60.5    142.12            115.89       569    0.250       0      0       0.0       0.0
  L4      3/0   85.68 MB   1.0     17.7     6.8     10.9      17.6      6.7       0.0   2.6     62.7     62.3    289.05            281.79       112    2.581    272M  2309K       0.0       0.0
  L5     11/0   293.73 MB   1.0     18.8     7.5     11.2      18.5      7.2       0.5   2.5     64.9     63.9    296.07            288.50       220    1.346    288M  4365K       0.0       0.0
  L6    130/0    3.94 GB   0.0     51.5     7.6     43.9      47.9      3.9       0.0   6.3     67.2     62.4    784.95            765.92       258    3.042    848M    51M       0.0       0.0
 Sum    144/0    4.31 GB   0.0     88.0    21.9     66.0      92.3     26.3       0.5  11.0     59.6     62.5   1512.19           1452.09      1159    1.305   1409M    58M       0.0       0.0```

Reviewed By: ajkr

Differential Revision: D39834713

Pulled By: cbi42

fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b
2022-12-29 13:28:24 -08:00
Peter Dillinger e6b6e74154 Make CompactRange() more aware of SstPartitionerFactory (#11032)
Summary:
Some users are at least considering using SstPartitioner to support efficient physical migration of specific key ranges between RocksDB instances. One might expect manual `CompactRange()` over a narrow key range across some partition to enforce partitioning of any SST files crossing that partition boundary, but that currently only works if there are keys within that range.

This change makes the overlap logic in CompactRange more aware of the partitioner to automatically select relevant files crossing a partition boundary, even when they otherwise would not be selected due to the compaction range falling in a gap between entries.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11032

Test Plan: unit test included

Reviewed By: hx235

Differential Revision: D41981380

Pulled By: pdillinger

fbshipit-source-id: 2fe445bdddc73c00276c20f295cc1fa33d15b05a
2022-12-21 15:41:10 -08:00
Changyu Bi 53b703eafe Fix an assertion failure in `CompactionOutputs::AddRangeDels()` (#11040)
Summary:
the [assertion](c3f720c60d/db/compaction/compaction_outputs.cc (L643)) in `CompactionOutputs::AddRangeDels()` can fail after https://github.com/facebook/rocksdb/pull/10802. The assertion fails when `lower_bound_from_range_tombstone` is true during `AddRangeDels()` for a new compaction output file, while the lower bound range tombstone key has seqno 0 and op_type kTypeRangeDeletion. It can have seqno 0 when it was truncated at a point key whose seqno was zeroed out during compaction, the seqno and op_type could be set [here](c3f720c60d/db/compaction/compaction_outputs.cc (L594)). This PR fixes the assertion excluding the case when `lower_bound_from_range_tombstone` is true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11040

Test Plan: CI

Reviewed By: ajkr

Differential Revision: D42119914

Pulled By: cbi42

fbshipit-source-id: 0897e71b5304cb02aac30f71667b590c37b72baf
2022-12-19 16:36:39 -08:00
ehds ddad943c29 snapshots of FragmentedRangeTombstoneList must in ascending order (#11046)
Summary:
`snapshots` argument of `FragmentedRangeTombstoneList` should be in ascending order.

If we pass it in descending order order, it will not work.

for example:

```
  auto range_del_iter = MakeRangeDelIter({{"a", "e", 3},{"a","e", 6}});

  FragmentedRangeTombstoneList fragment_list(
      std::move(range_del_iter), bytewise_icmp, true /* for_compaction */,
      {8 ,7 ,4} /* snapshots */);
    FragmentedRangeTombstoneIterator iter(&fragment_list, bytewise_icmp,
                                        kMaxSequenceNumber /* upper_bound */);
  VerifyFragmentedRangeDels(&iter, {{"a", "e", 6}, {"a", "e", 3}});
```
VerifyFragmentedRangeDels will fail.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11046

Reviewed By: ajkr

Differential Revision: D42148654

Pulled By: cbi42

fbshipit-source-id: a2e76f96dccf56fcca1a91cb8da9b99145f68026
2022-12-19 15:06:22 -08:00
Changyu Bi f02c708aa3 Consider range tombstone in compaction output file cutting (#10802)
Summary:
This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802

Test Plan:
- added unit tests in db_range_del_test.
- stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000`

Reviewed By: ajkr, jay-zhuang

Differential Revision: D40308827

Pulled By: cbi42

fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df
2022-12-15 09:11:54 -08:00
Yanqin Jin c93ba7db5d Revise LockWAL/UnlockWAL implementation (#11020)
Summary:
RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and
releases the internal `DBImpl::log_write_mutex_`.

According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288
> Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_.

This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application
should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing
the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()`
is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`.

There are two issues:
1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after
calling `LockWAL()` because log_write_mutex is not recursive.
2. Two threads may dead lock due to lock order inversion.

To fix these issues, we can modify the implementation of LockWAL so that it does not keep
`log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can
instead manually inject a write stall so that all future writes will be stopped.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D41785203

Pulled By: riversand963

fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce
2022-12-13 21:45:00 -08:00
Hui Xiao 98d5db5c2e Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
-  File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
    - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`

Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number`  ordering check. This will result in file ordering s1 <  s2 <  s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.

**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file  is assigned with the minimum `epoch_number` among input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or  by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922

Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run 36a5686ec0 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2022-12-13 13:29:37 -08:00
Jay Zhuang 1078d860a9 Add an unittest for Periodic compaction conflict with ongoing compaction (#10908)
Summary:
Add a tiered storage migration test which would conflict with
an ongoing penultimate level compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10908

Test Plan: Test only change

Reviewed By: anand1976

Differential Revision: D40864509

Pulled By: ajkr

fbshipit-source-id: e316e849a01a6c71a41be130101f909b6c0498cb
2022-12-12 10:37:55 -08:00
Andrew Kryczka 4d60cbc629 Use VersionBuilder for CF import ordering/validation (#11028)
Summary:
Besides the existing ordering and validation, more is coming to VersionBuilder/VersionStorageInfo, like migration of epoch_numbers from older RocksDB versions. We should start using those common classes for importing CFs, instead of duplicating their ordering, validation, and migration logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11028

Test Plan: rely on existing tests

Reviewed By: hx235

Differential Revision: D41865427

Pulled By: ajkr

fbshipit-source-id: 873f5cd87b8902a2380c3b71373ce0b0db3a0c50
2022-12-10 15:07:42 -08:00
Peter Dillinger 433d7e4594 Improve error messages for SST footer and size errors (#11009)
Summary:
Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure:
* Magic number error is reported first if we attempt to open an SST file and the footer is completely bad.
* Footer errors are reported with affected file.
* If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009

Test Plan:
unit tests added, some manual

Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed).

Reviewed By: siying

Differential Revision: D41656272

Pulled By: pdillinger

fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df
2022-12-09 10:03:47 -08:00
Hui Xiao 15bb4ea084 Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL (#11016)
Summary:
**Context/Summary:**
Credit to ajkr's https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134,
flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to `Flush()` called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true).  Fix this by making checking `sync_point_called == true` after obsoleted WAL is found and `SyncWAL()` is called. Also rename the "sync_point_called" to be something more specific.

Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11016

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D41717786

Pulled By: hx235

fbshipit-source-id: ad1e4701a987285bbe6c8e7d9b05c4db06b4edf4
2022-12-06 18:31:43 -08:00
Changyu Bi 23af6786a9 Fix an assertion failure in `TimestampTablePropertiesCollector` for empty output (#11015)
Summary:
when the compaction output file is empty, the assertion in `TimestampTablePropertiesCollector::Finish()` breaks. This PR fixes this assert and added unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11015

Test Plan: added UT.

Reviewed By: ajkr

Differential Revision: D41716719

Pulled By: cbi42

fbshipit-source-id: d891d46be4c4805e3d49be6b80c9d75f1bd51080
2022-12-05 13:46:27 -08:00
anand76 8ffabdc226 Fix table cache leak in MultiGet with async_io (#10997)
Summary:
When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997

Test Plan:
1. New unit test in db_basic_test
2. asan_crash

Reviewed By: pdillinger

Differential Revision: D41587244

Pulled By: anand1976

fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64
2022-12-04 22:58:25 -08:00
Hui Xiao 2f76ab150d Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892)
Summary:
**Context**
`Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown.

`DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest.
And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open.
- Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`.

As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown.

**Summary**
This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest.

**Test**
- Make check
- Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after
- [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears
- [Ongoing]Regular CI stress test to confirm such fix didn't harm anything

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892

Reviewed By: ajkr

Differential Revision: D40778965

Pulled By: hx235

fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f
2022-11-29 14:14:43 -08:00
Hui Xiao f1574a20ff Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999)
Summary:
**Context/Summary:**

This reverts commit fc74abb436 and related HISTORY record.

The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF.  This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004

Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D41572604

Pulled By: hx235

fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
2022-11-29 10:56:42 -08:00
Changyu Bi 6cdb7af9f8 Remove copying of range tombstones keys in iterator (#10878)
Summary:
In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding InternalKey comparator that compares `Slice` for internal key and `ParsedInternalKey` directly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878

Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmarks: I did not get improvement when compiling with DEBUG_LEVEL=0, and saw many noise. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see improvement.
```
# Favorable set up: half of the writes are DeleteRange.
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50

# benchmark command
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472  --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25

# main
readseq [AVG    5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec
readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec

# this PR
readseq [AVG    5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec
readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec
```

Reviewed By: ajkr

Differential Revision: D40711170

Pulled By: cbi42

fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f
2022-11-28 19:27:22 -08:00
Hui Xiao d8c043f7ad Trigger FIFO file deletion in non L0 only if exceeding max_table_files_size (#10955)
Summary:
**Context**

https://github.com/facebook/rocksdb/pull/10348 allows multi-level FIFO but accidentally made change to the logic of deleting files in `FIFOCompactionPicker::PickSizeCompaction`. With [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR156) and [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR235) together, it deletes one file in non-L0 even when `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`, which is incorrect.

As a consequence, FIFO exercises more file deletion in our crash testing, which is not able to verify correctly on deleted keys in the file deleted by compaction. This results in errors  `error : inconsistent values for key 000000000000239F000000000000012B000000000000028B: expected state has the key, Get() returns NotFound.
Verification failed :(` or `Expected state has key 00000000000023A90000000000000003787878, iterator is at key 00000000000023A9000000000000004178
Column family: default, op_logs: S 00000000000023A90000000000000003787878`

**Summary**:
- Delete file for non-L0 only if `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`
- Add some helpful log to LOG file

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10955

Test Plan:
- Errors repro-ed by
```
./db_stress --preserve_unverified_changes=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=10 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=1000000 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=0 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=65 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=20
```
is gone after this fix
- CI

Reviewed By: ajkr

Differential Revision: D41319441

Pulled By: hx235

fbshipit-source-id: 6939753767007f7449ea7055b1420aabd03d7709
2022-11-28 15:45:03 -08:00
Changyu Bi 534fb06dd3 Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966)
Summary:
Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966

Test Plan:
- added unit test
- stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100`
- ran different stress tests over sandcastle
- Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones.

Reviewed By: ajkr

Differential Revision: D41414172

Pulled By: cbi42

fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981
2022-11-23 14:27:14 -08:00
Andrew Kryczka 54c2542df2 Support tiering when file endpoints overlap (#10961)
Summary:
Enabled output to penultimate level when file endpoints overlap. This is probably only possible when range tombstones span files. Otherwise the overlapping files would all be included in the penultimate level inputs thanks to our atomic compaction unit logic.

Also, corrected `penultimate_output_range_type_`, which is a minor fix as it appears only used for logging.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10961

Test Plan: updated unit test

Reviewed By: cbi42

Differential Revision: D41370615

Pulled By: ajkr

fbshipit-source-id: 7e75ec369a3b41b8382b336446c81825a4c4f572
2022-11-23 09:20:58 -08:00
Yanqin Jin 3d0d6b8140 Make best-efforts recovery verify SST unique ID before Version construction (#10962)
Summary:
The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true).

With best_efforts_recovery being true, RocksDB will recover to the latest point in
MANIFEST such that all valid SST files included up to this point pass unique ID checks as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962

Test Plan: make check

Reviewed By: pdillinger

Differential Revision: D41378241

Pulled By: riversand963

fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c
2022-11-22 22:53:31 -08:00
Andrew Kryczka db9cbddc6f Deflake DBTest2.TraceAndReplay by relaxing latency checks (#10979)
Summary:
Since the latency measurement uses real time it is possible for the operation to complete in zero microseconds and then fail these checks. We saw this with the operation that invokes Get() on an invalid CF. This PR relaxes the assertions to allow for operations completing in zero microseconds.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10979

Reviewed By: riversand963

Differential Revision: D41478300

Pulled By: ajkr

fbshipit-source-id: 50ef096bd8f0162b31adb46f54ae6ddc337d0a5e
2022-11-22 13:07:17 -08:00
Changyu Bi 6c5ec92070 Set correct temperature for range tombstone only file in penultimate level (#10972)
Summary:
before this PR, if there is a range tombstone-only file generated in penultimate level, it is marked the `last_level_temperature`. This PR fixes this issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10972

Test Plan: added unit test for this scenario.

Reviewed By: ajkr

Differential Revision: D41449215

Pulled By: cbi42

fbshipit-source-id: 1e06b5ae3bc0183db2991a45965a9807a7e8be0c
2022-11-21 17:08:50 -08:00
Andrew Kryczka 097f9f4425 Fix CompactionIterator flag for penultimate level output (#10967)
Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967

Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
  - Before this fix, Y was always zero
  - After this fix, Y gradually increased throughout the benchmark

Reviewed By: riversand963

Differential Revision: D41417726

Pulled By: ajkr

fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
2022-11-21 16:14:03 -08:00
Peter Dillinger 3182beeffc Observe and warn about misconfigured HyperClockCache (#10965)
Summary:
Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused.

Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965

Test Plan:
unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.

Also manual testing with db_bench. Create db with
```
./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
```
Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
```
./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
```
Seeing warnings / errors or not as expected.

Reviewed By: anand1976

Differential Revision: D41406932

Pulled By: pdillinger

fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7
2022-11-21 12:08:21 -08:00
Peter Dillinger 32520df1d9 Remove prototype FastLRUCache (#10954)
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954

Test Plan: tests updated

Reviewed By: akankshamahajan15

Differential Revision: D41310123

Pulled By: pdillinger

fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
2022-11-16 10:15:55 -08:00
Levi Tamasi 5e8947057b Support Merge for wide-column entities in the compaction logic (#10946)
Summary:
The patch extends the compaction logic to handle `Merge`s in conjunction with wide-column entities. As usual, the merge operation is applied to the anonymous default column, and any other columns are unaffected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10946

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41233722

Pulled By: ltamasi

fbshipit-source-id: dfd9b1362222f01bafcecb139eb48480eb279fed
2022-11-11 16:32:32 -08:00
Levi Tamasi dbc4101b89 Support Merge with wide-column entities in iterator (#10941)
Summary:
The patch adds `Merge` support for wide-column entities in `DBIter`. As before, the `Merge` operation is applied to the default column of the entity; any other columns are unchanged. As a small cleanup, the PR also changes the signature of `DBIter::Merge` to simply return a boolean instead of the `Merge` operation's `Status` since the actual `Status` is already stored in a member variable.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10941

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41195471

Pulled By: ltamasi

fbshipit-source-id: 362cf555897296e252c3de5ddfbd569ef34f85ef
2022-11-10 18:00:08 -08:00
Levi Tamasi 9460d4b77e Refactor MergeHelper::MergeUntil a bit (#10943)
Summary:
The patch untangles some nested ifs in `MergeHelper::MergeUntil`. This will come in handy when extending the compaction logic to support `Merge` for wide-column entities, and also enables us to eliminate some repeated branching on value type and to decrease the scope of some variables.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10943

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41201946

Pulled By: ltamasi

fbshipit-source-id: 890bd3d4e31cdccadca614489a94686d76485ba9
2022-11-10 17:29:57 -08:00
Levi Tamasi 2ea109521f Revisit the interface of MergeHelper::TimedFullMerge(WithEntity) (#10932)
Summary:
The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
overload, which makes the responsibilities clearer by making sure the query
result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
`SaveValue` and `GetContext`. Second, it changes the interface of
`TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
is a more decoupled design which will come in handy when adding support
for `Merge` with wide-column entities to `DBIter`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41129399

Pulled By: ltamasi

fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
2022-11-09 12:54:05 -08:00
Levi Tamasi c62f322169 Clear saved value in DBIter::{Next, Prev} (#10934)
Summary:
`DBIter::saved_value_` stores the result of any `Merge` that was performed to compute the iterator's current value. This value can be ditched whenever the iterator's position is changed, and is already cleared in `Seek`, `SeekForPrev`, `SeekToFirst`, and `SeekToLast`. With the patch, it is also cleared in `Next` and `Prev`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10934

Test Plan: `make check`

Reviewed By: akankshamahajan15

Differential Revision: D41133473

Pulled By: ltamasi

fbshipit-source-id: cf9e936f48151e64e455cc1664d6e9f4a03aa308
2022-11-08 14:49:16 -08:00
Levi Tamasi fbd9077d66 Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925)
Summary:
The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D41096453

Pulled By: ltamasi

fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
2022-11-07 15:42:10 -08:00
Andrew Kryczka aa0a11e1b9 Fix flush picking non-consecutive memtables (#10921)
Summary:
Prevents `MemTableList::PickMemtablesToFlush()` from picking non-consecutive memtables. It leads to wrong ordering in L0 if the files are committed, or an error like below if force_consistency_checks=true catches it:

```
Corruption: force_consistency_checks: VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/25 with seqno 320416 368066 vs. file https://github.com/facebook/rocksdb/issues/24 with seqno 336037 352068
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10921

Test Plan: fix the expectation in the existing test of this behavior

Reviewed By: riversand963

Differential Revision: D41046935

Pulled By: ajkr

fbshipit-source-id: 783696bff56115063d5dc5856dfaed6a9881d1ab
2022-11-04 15:55:54 -07:00
Yanqin Jin 18cb731f27 Fix a bug in range scan with merge and deletion with timestamp (#10915)
Summary:
When performing Merge during range scan, iterator should understand value types of kDeletionWithTimestamp.

Also add an additional check in debug mode to MergeHelper, and account for the presence of compaction filter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10915

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40960039

Pulled By: riversand963

fbshipit-source-id: dd79d86d7c79d05755bb939a3d94e0c53ddd7f59
2022-11-03 13:02:06 -07:00
Levi Tamasi 941d834739 Support Merge for wide-column entities during point lookups (#10916)
Summary:
The patch adds `Merge` support for wide-column entities to the point lookup
APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
update the iterator and compaction logic in separate PRs.) In terms of semantics,
the `Merge` operation is applied to the default (anonymous) column; any other
columns in the entity are unaffected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40962311

Pulled By: ltamasi

fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
2022-11-03 08:35:42 -07:00
Andrew Kryczka 5cf6ab6f31 Ran clang-format on db/ directory (#10910)
Summary:
Ran `find ./db/ -type f | xargs clang-format -i`. Excluded minor changes it tried to make on db/db_impl/. Everything else it changed was directly under db/ directory. Included minor manual touchups mentioned in PR commit history.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10910

Reviewed By: riversand963

Differential Revision: D40880683

Pulled By: ajkr

fbshipit-source-id: cfe26cda05b3fb9a72e3cb82c286e21d8c5c4174
2022-11-02 14:34:24 -07:00
Yanqin Jin 7d26e4c5a3 Basic Support for Merge with user-defined timestamp (#10819)
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.

Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```

Some code comments are added for clarity.

Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40603195

Pulled By: riversand963

fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
2022-10-31 22:28:58 -07:00
Denis Hananein 9f3475eccf Fix compilation errors, clang++-15 (#10907)
Summary:
I've tried to compile the main branch, but there are two minor things which are make CE.
I'm not sure about the second one (`num_empty_non_l0_level`), probably there is should be additional assert.

```
-c ../cache/clock_cache.cc
[build] ../cache/clock_cache.cc:855:15: error: variable 'i' set but not used [-Werror,-Wunused-but-set-variable]
[build]   for (size_t i = 0; &array_[current] != h; i++) {
[build]               ^
```

```
[build] ../db/version_set.cc:3665:7: error: variable 'num_empty_non_l0_level' set but not used [-Werror,-Wunused-but-set-variable]
[build]   int num_empty_non_l0_level = 0;
[build]       ^
[build] 1 error generated.
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10907

Reviewed By: jay-zhuang

Differential Revision: D40866667

Pulled By: ajkr

fbshipit-source-id: 963b7bd56859d0b3b2779cd36fad229425cb7b17
2022-10-31 18:24:44 -07:00
sdong d989300ad1 Avoid repeat periodic stats printing when there is no change (#10891)
Summary:
When there is a column family that doesn't get any traffic, its stats are still dumped when options.options.stats_dump_period_sec triggers. This sometimes spam the information logs. With this change, we skip the printing if there is not change, until 8 periods.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10891

Test Plan: Manually test the behavior with hacked db_bench setups.

Reviewed By: jay-zhuang

Differential Revision: D40777183

fbshipit-source-id: ef0b9a793e4f6282df099b464f01d1fb4c5a2cab
2022-10-31 09:51:38 -07:00
Yanqin Jin 9079895aae Fix deletion counting in memtable stats (#10886)
Summary:
Currently, a memtable's stats `num_deletes_` is incremented only if the entry is a regular delete (kTypeDeletion). We need to fix it by accounting for kTypeSingleDeletion and kTypeDeletionWithTimestamp.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10886

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40740754

Pulled By: riversand963

fbshipit-source-id: 7bde62cd6df136585bc5bfb1c426c7a8276c08e1
2022-10-28 17:03:44 -07:00
Levi Tamasi 7867a1112b Handle Merges correctly in GetEntity (#10894)
Summary:
The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet
supported for wide-column entities written using `PutEntity`; this change is
about returning correct (i.e. consistent with `Get`) results in cases like when the
base value is a plain old key-value written using `Put` or when there is no real base
value because we hit either a tombstone or the beginning of history.

Implementation-wise, the patch introduces a new wrapper around the existing
`MergeHelper::TimedFullMerge` that can store the merge result in either a string
(for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40782708

Pulled By: ltamasi

fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90
2022-10-28 10:48:51 -07:00
anand76 5fef34fd3a Fix a potential std::vector use after move bug (#10845)
Summary:
The call to `folly::coro::collectAllRange()` should move the input `mget_tasks`. But just in case, assert and clear the std::vector before reusing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10845

Reviewed By: akankshamahajan15

Differential Revision: D40611719

Pulled By: anand1976

fbshipit-source-id: 0f32b387cf5a2894b13389016c020b01ab479b5e
2022-10-26 22:34:36 -07:00
Jay Zhuang b36ec37a4b clang-format for db/compaction (#10882)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10882

Reviewed By: riversand963

Differential Revision: D40724867

Pulled By: jay-zhuang

fbshipit-source-id: 7f387724f8cd07d8d2b90566a515a4e9078d21f1
2022-10-26 12:35:12 -07:00
Yanqin Jin 84563a2701 Run clang-format on some files in db/db_impl directory (#10869)
Summary:
Run clang-format on some files in db/db_impl/ directory

```
clang-format -i <file>
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10869

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40685390

Pulled By: riversand963

fbshipit-source-id: 64449ccb21b0d61c5142eb2bcbff828acb45c154
2022-10-25 13:49:09 -07:00
Changyu Bi 7a95938899 Improve FragmentTombstones() speed by lazily initializing `seq_set_` (#10848)
Summary:
FragmentedRangeTombstoneList has a member variable `seq_set_` that contains the sequence numbers of all range tombstones in a set. The set is constructed in `FragmentTombstones()` and is used only in `FragmentedRangeTombstoneList::ContainsRange()` which only happens during compaction. This PR moves the initialization of `seq_set_` to `FragmentedRangeTombstoneList::ContainsRange()`. This should speed up `FragmentTombstones()` when the range tombstone list is used for read/scan requests. Microbench shows the speed improvement to be ~45%.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10848

Test Plan:
- Existing tests and stress test: `python3 tools/db_crashtest.py whitebox --simple  --verify_iterator_with_expected_state_one_in=5`.
- Microbench: update `range_del_aggregator_bench` to benchmark speed of `FragmentTombstones()`:
```
./range_del_aggregator_bench --num_range_tombstones=1000 --tombstone_start_upper_bound=50000000 --num_runs=10000 --tombstone_width_mean=200 --should_deletes_per_run=100 --use_compaction_range_del_aggregator=true

Before this PR:
=========================
Fragment Tombstones:     270.286 us
AddTombstones:           1.28933 us
ShouldDelete (first):    0.525528 us
ShouldDelete (rest):     0.0797519 us

After this PR: time to fragment tombstones is pushed to AddTombstones() which only happen during compaction.
=========================
Fragment Tombstones:     149.879 us
AddTombstones:           102.131 us
ShouldDelete (first):    0.565871 us
ShouldDelete (rest):     0.0729444 us
```
- db_bench: this should improve speed for fragmenting range tombstones for mutable memtable:
```
./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=250000 --disable_auto_compactions --max_num_range_tombstones=100000 --finish_after_writes --write_buffer_size=1073741824 --threads=25

Before this PR:
readwhilewriting :      18.301 micros/op 1310445 ops/sec 4.769 seconds 6250000 operations;   28.1 MB/s (41001 of 250000 found)
After this PR:
readwhilewriting :      16.943 micros/op 1439376 ops/sec 4.342 seconds 6250000 operations;   23.8 MB/s (28977 of 250000 found)
```

Reviewed By: ajkr

Differential Revision: D40646227

Pulled By: cbi42

fbshipit-source-id: ea471667edb258f67d01cfd828588e80a89e4083
2022-10-25 11:33:04 -07:00
Hui Xiao fc74abb436 Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777)
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress

./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35

put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```

**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
-  [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
   - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
   - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
  - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
  - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
    - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930  in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.

Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777

Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761  and on FIFO compaction only

Reviewed By: ajkr

Differential Revision: D40090485

Pulled By: hx235

fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
2022-10-25 10:39:58 -07:00
Levi Tamasi bb5ac1b524 Run clang-format on db/blob/ (#10856)
Summary:
Apply the formatting changes suggested by `clang-format`, except
where they would ruin the ASCII art in `blob_log_format.h`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10856

Test Plan: `make check`

Reviewed By: siying

Differential Revision: D40652224

Pulled By: ltamasi

fbshipit-source-id: 8c1f5757b758474ea3e8102a7c5a1cf9e6dc1402
2022-10-24 16:00:32 -07:00
Jay Zhuang f726d29a82 Allow penultimate level output for the last level only compaction (#10822)
Summary:
Allow the last level only compaction able to output result to penultimate level if the penultimate level is empty. Which will also block the other compaction output to the penultimate level.
(it includes the PR https://github.com/facebook/rocksdb/issues/10829)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10822

Reviewed By: siying

Differential Revision: D40389180

Pulled By: jay-zhuang

fbshipit-source-id: 4e5dcdce307795b5e07b5dd1fa29dd75bb093bad
2022-10-22 08:57:38 -07:00
Peter Dillinger 27c9705ac4 Use kXXH3 as default checksum (CPU efficiency) (#10778)
Summary:
Since this has been supported for about a year, I think it's time to make it the default. This should improve CPU efficiency slightly on most hardware.

A current DB performance comparison using buck+clang build:
```
TEST_TMPDIR=/dev/shm ./db_bench -checksum_type={1,4} -benchmarks=fillseq[-X1000] -num=3000000 -disable_wal
```
kXXH3 (+0.2% DB write throughput):
`fillseq [AVG    1000 runs] : 822149 (± 1004) ops/sec;   91.0 (± 0.1) MB/sec`
kCRC32c:
`fillseq [AVG    1000 runs] : 820484 (± 1203) ops/sec;   90.8 (± 0.1) MB/sec`

Micro benchmark comparison:
```
./db_bench --benchmarks=xxh3[-X20],crc32c[-X20]
```
Machine 1, buck+clang build:
`xxh3 [AVG    20 runs] : 3358616 (± 19091) ops/sec; 13119.6 (± 74.6) MB/sec`
`crc32c [AVG    20 runs] : 2578725 (± 7742) ops/sec; 10073.1 (± 30.2) MB/sec`

Machine 2, make+gcc build, DEBUG_LEVEL=0 PORTABLE=0:
`xxh3 [AVG    20 runs] : 6182084 (± 137223) ops/sec; 24148.8 (± 536.0) MB/sec`
`crc32c [AVG    20 runs] : 5032465 (± 42454) ops/sec; 19658.1 (± 165.8) MB/sec`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10778

Test Plan: make check, unit tests updated

Reviewed By: ajkr

Differential Revision: D40112510

Pulled By: pdillinger

fbshipit-source-id: e59a8d50a60346137732f8668ba7cfac93be2b37
2022-10-21 18:09:12 -07:00
sdong 5d17297b76 Make UserComparatorWrapper not Customizable (#10837)
Summary:
Right now UserComparatorWrapper is a Customizable object, although it is not, which introduces some intialization overhead for the object. In some benchmarks, it shows up in CPU profiling. Make it not configurable by defining most functions needed by UserComparatorWrapper to an interface and implement the interface.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10837

Test Plan: Make sure existing tests pass

Reviewed By: pdillinger

Differential Revision: D40528511

fbshipit-source-id: 70eaac89ecd55401a26e8ed32abbc413a9617c62
2022-10-21 12:27:50 -07:00
akankshamahajan 0e7b27bfcf Refactor block cache tracing APIs (#10811)
Summary:
Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter.

`DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`.
New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer.

This same philosophy can be applied to KV and IO tracing as well.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811

Test Plan:
existing unit tests
Old API DB::StartBlockTrace checked with db_bench tool
create database
```
./db_bench --benchmarks="fillseq" \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000
```

To trace block cache accesses when running readrandom benchmark:
```
./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
--threads=16 \
-block_cache_trace_file="/tmp/binary_trace_test_example" \
-block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
-block_cache_trace_sampling_frequency=1

```

Reviewed By: anand1976

Differential Revision: D40435289

Pulled By: akankshamahajan15

fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282
2022-10-21 12:15:35 -07:00
Changyu Bi 333abe9c55 Ignore max_compaction_bytes for compaction input that are within output key-range (#10835)
Summary:
When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compaction should be fine as files from input level are usually 10X smaller than files from output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835

Test Plan:
- CI, a unit test on max_compaction_bytes fails before turning this flag off.
- Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432`
```
main:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   91.59 MB   0.8     70.9     0.0     70.9     200.8    129.9       0.0   1.5     25.2     71.2   2886.55           2463.45      9725    0.297   1093M   254K       0.0       0.0
  L1      9/0   248.03 MB   1.0    392.0   129.8    262.2     391.7    129.5       0.0   3.0     69.0     68.9   5821.71           5536.90       804    7.241   6029M  5814K       0.0       0.0
  L2     87/0    2.50 GB   1.0    537.0   128.5    408.5     533.8    125.2       0.7   4.2     69.5     69.1   7912.24           7323.70      4417    1.791   8299M    36M       0.0       0.0
  L3    836/0   24.99 GB   1.0    616.9   118.3    498.7     594.5     95.8       5.2   5.0     66.9     64.5   9442.38           8490.28      4204    2.246   9749M   306M       0.0       0.0
  L4   2355/0   62.95 GB   0.3     67.3    37.1     30.2      54.2     24.0      38.9   1.5     72.2     58.2    954.37            821.18       917    1.041   1076M   173M       0.0       0.0
 Sum   3290/0   90.77 GB   0.0   1684.2   413.7   1270.5    1775.0    504.5      44.9  13.7     63.8     67.3  27017.25          24635.52     20067    1.346     26G   522M       0.0       0.0

Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds

This PR:
** Compaction Stats [default] **
Level    Files   Size     Score Read(GB)  Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  L0      3/0   45.71 MB   0.8     72.9     0.0     72.9     202.8    129.9       0.0   1.6     25.4     70.7   2938.16           2510.36      9741    0.302   1124M   265K       0.0       0.0
  L1      8/0   234.54 MB   0.9    384.5   129.8    254.7     384.2    129.6       0.0   3.0     69.0     68.9   5708.08           5424.43       791    7.216   5913M  5753K       0.0       0.0
  L2     84/0    2.47 GB   1.0    543.1   128.6    414.5     539.9    125.4       0.7   4.2     69.6     69.2   7989.31           7403.13      4418    1.808   8393M    36M       0.0       0.0
  L3    839/0   24.96 GB   1.0    615.6   118.4    497.2     593.2     96.0       5.1   5.0     66.6     64.1   9471.23           8489.31      4193    2.259   9726M   306M       0.0       0.0
  L4   2360/0   63.04 GB   0.3     67.6    37.3     30.3      54.4     24.1      38.9   1.5     71.5     57.6    967.30            827.99       907    1.066   1080M   173M       0.0       0.0
 Sum   3294/0   90.75 GB   0.0   1683.8   414.2   1269.6    1774.5    504.9      44.8  13.7     63.7     67.1  27074.08          24655.22     20050    1.350     26G   522M       0.0       0.0

Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds
```

Reviewed By: ajkr

Differential Revision: D40518319

Pulled By: cbi42

fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60
2022-10-21 10:22:41 -07:00
Levi Tamasi 8dd4bf6cff Separate the handling of value types in SaveValue (#10840)
Summary:
Currently, the code in `SaveValue` that handles `kTypeValue` and
`kTypeBlobIndex` (and more recently, `kTypeWideColumnEntity`) is
mostly shared. This made sense originally; however, by now the
handling of these three value types has diverged significantly. The
patch makes the logic cleaner and also eliminates quite a bit of branching
by giving each value type its own `case` and removing a fall-through.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10840

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40568420

Pulled By: ltamasi

fbshipit-source-id: 2e614606afd1c3d9c76d9b5f1efa0959fc174103
2022-10-21 10:05:46 -07:00
Jay Zhuang 1663f77d2a Fix no internal time recorded for small preclude_last_level (#10829)
Summary:
When the `preclude_last_level_data_seconds` or
`preserve_internal_time_seconds` is smaller than 100 (seconds), no seqno->time information was recorded.
Also make sure all data will be compacted to the last level even if there's no write to record the time information.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10829

Test Plan: added unittest

Reviewed By: siying

Differential Revision: D40443934

Pulled By: jay-zhuang

fbshipit-source-id: 2ecf1361daf9f3e5c3385aee6dc924fa59e2813a
2022-10-20 17:11:38 -07:00
Levi Tamasi 865d5576ad Support providing the default column separately when serializing columns (#10839)
Summary:
The patch makes it possible to provide the value of the default column
separately when calling `WideColumnSerialization::Serialize`. This eliminates
the need to construct a new `WideColumns` vector in certain cases
(for example, it will come in handy when implementing `Merge`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10839

Test Plan: `make check`

Reviewed By: riversand963

Differential Revision: D40561448

Pulled By: ltamasi

fbshipit-source-id: 69becdd510e6a83ab1feb956c12772110e1040d6
2022-10-20 16:00:58 -07:00
Andrew Kryczka 33ceea9b76 Add DB property for fast block cache stats collection (#10832)
Summary:
This new property allows users to trigger the background block cache stats collection mode through the `GetProperty()` and `GetMapProperty()` APIs. The background mode has much lower overhead at the expense of returning stale values in more cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10832

Test Plan: updated unit test

Reviewed By: pdillinger

Differential Revision: D40497883

Pulled By: ajkr

fbshipit-source-id: bdcc93402f426463abb2153756aad9e295447343
2022-10-20 15:04:29 -07:00
Yueh-Hsuan Chiang e267909ecf Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348)
Summary:
FIFO compaction can theoretically open a DB with any compaction style.
However, the current code only allows FIFO compaction to open a DB with
a single level.

This PR relaxes the limitation of FIFO compaction and allows it to open a
DB with multiple levels.  Below is the read / write / compaction behavior:

* The read behavior is untouched, and it works like a regular rocksdb instance.
* The write behavior is untouched as well.  When a FIFO compacted DB
is opened with multiple levels, all new files will still be in level 0, and no files
will be moved to a different level.
* Compaction logic is extended.  It will first identify the bottom-most non-empty level.
Then, it will delete the oldest file in that level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348

Test Plan:
Added a new test to verify the migration from level to FIFO where the db has multiple levels.
Extended existing test cases in db_test and db_basic_test to also verify
all entries of a key after reopening the DB with FIFO compaction.

Reviewed By: jay-zhuang

Differential Revision: D40233744

fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1
2022-10-18 14:38:13 -07:00
Peter Dillinger e466173d5c Print stack traces on frozen tests in CI (#10828)
Summary:
Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.

For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828

Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI

Reviewed By: hx235

Differential Revision: D40447634

Pulled By: pdillinger

fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
2022-10-18 00:35:35 -07:00
Peter Dillinger 8367f0d2d7 Improve / refactor anonymous mmap capabilities (#10810)
Summary:
The motivation for this change is a planned feature (related to HyperClockCache) that will depend on a large array that can essentially grow automatically, up to some bound, without the pointer address changing and with guaranteed zero-initialization of the data. Anonymous mmaps provide such functionality, and this change provides an internal API for that.

The other existing use of anonymous mmap in RocksDB is for allocating in huge pages. That code and other related Arena code used some awkward non-RAII and pre-C++11 idioms, so I cleaned up much of that as well, with RAII, move semantics, constexpr, etc.

More specifcs:
* Minimize conditional compilation
* Add Windows support for anonymous mmaps
* Use std::deque instead of std::vector for more efficient bag

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10810

Test Plan: unit test added for new functionality

Reviewed By: riversand963

Differential Revision: D40347204

Pulled By: pdillinger

fbshipit-source-id: ca83fcc47e50fabf7595069380edd2954f4f879c
2022-10-17 17:10:16 -07:00
Peter Dillinger 1ee747d795 Deflake^2 DBBloomFilterTest.OptimizeFiltersForHits (#10816)
Summary:
This reverts https://github.com/facebook/rocksdb/issues/10792 and uses a different strategy to stabilize the test: remove the unnecessary randomness by providing a constant seed for shuffling keys.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10816

Test Plan: `gtest-parallel ./db_bloom_filter_test -r1000 --gtest_filter=*ForHits*`

Reviewed By: jay-zhuang

Differential Revision: D40347957

Pulled By: pdillinger

fbshipit-source-id: a270e157485cbd94ed03b80cdd21b954ebd57d57
2022-10-13 09:08:09 -07:00
Jay Zhuang 5a5f21c489 Allow the last level data moving up to penultimate level (#10782)
Summary:
Lock the penultimate level for the whole compaction inputs range, so any
key in that compaction is safe to move up from the last level to
penultimate level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10782

Reviewed By: siying

Differential Revision: D40231540

Pulled By: siying

fbshipit-source-id: ca115cc8b4018b35d797329fa85a19b06cc8c13e
2022-10-10 22:50:34 -07:00
Peter Dillinger 2d0380adbe Allow manifest fix-up without requiring prior state (#10796)
Summary:
This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown), but ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`) where it would open some SST files during recovery and expect them to match the prior manifest state, even if not required by the intended new state.

Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796

Test Plan: tests added/updated, that fail before the change(s) and now pass

Reviewed By: jay-zhuang

Differential Revision: D40232645

Pulled By: jay-zhuang

fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be
2022-10-10 17:59:17 -07:00
Hui Xiao f6a0065d54 Allow Flush(sync=true) not supported in DB::Open() and db_stress (#10784)
Summary:
**Context:**
https://github.com/facebook/rocksdb/pull/10698 made `Flush(sync=true)` required for` DB::Open()` (to pass the original but now deleted assertion `impl->TEST_WALBufferIsEmpty()` under `manual_wal_flush=true`, see https://github.com/facebook/rocksdb/pull/10698 summary for more ) as well as db_stress to pass.

However RocksDB users may not implement SyncWAL() (used inFlush(sync=true)). Therefore we replace such in DB::Open and db_stress in this PR and align with https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L1883-L1887 and https://github.com/facebook/rocksdb/blob/main/db_stress_tool/db_stress_test_base.cc#L847-L849

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10784

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D40193354

Pulled By: anand1976

fbshipit-source-id: e80d53880799ae01bdd717641d07997d3bfe2b54
2022-10-10 15:52:10 -07:00
akankshamahajan ebf8c454fd Provide support for async_io with tailing iterators (#10781)
Summary:
Provide support for async_io if ReadOptions.tailing is set true.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10781

Test Plan:
- Update unit tests
- Ran db_bench: ./db_bench --benchmarks="readrandom" --use_existing_db --use_tailing_iterator=1 --async_io=1

Reviewed By: anand1976

Differential Revision: D40128882

Pulled By: anand1976

fbshipit-source-id: 55e17855536871a5c47e2de92d238ae005c32d01
2022-10-10 15:48:48 -07:00
Changyu Bi a6ce1955b1 Fix flaky test ShuttingDownNotBlockStalledWrites (#10800)
Summary:
DBTest::ShuttingDownNotBlockStalledWrites is flaky, added new sync point dependency to fix it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10800

Test Plan: gtest-parallel --repeat=1000 ./db_test --gtest_filter="*ShuttingDownNotBlockStalledWrites"

Reviewed By: jay-zhuang

Differential Revision: D40239116

Pulled By: jay-zhuang

fbshipit-source-id: 8c2d7e7df58f202d287bd9f5c9b60b7eff270d0c
2022-10-10 13:58:55 -07:00
Jay Zhuang 62ba5c8034 Deflake DBBloomFilterTest.OptimizeFiltersForHits (#10792)
Summary:
The test may fail because the L5 files may only cover small portion of the whole key range.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10792

Test Plan:
```
gtest-parallel ./db_bloom_filter_test --gtest_filter=DBBloomFilterTest.OptimizeFiltersForHits -r 1000 -w 100
```

Reviewed By: siying

Differential Revision: D40217600

Pulled By: siying

fbshipit-source-id: 18db549184bccf5e513eaa7e31ab17385b71ef71
2022-10-10 12:34:25 -07:00
Qingping Wang a45e6878f3 fix issue 10751 (#10765)
Summary:
Fix https://github.com/facebook/rocksdb/issues/10751 where a stalled write could be blocked forever when DB shutdown.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10765

Reviewed By: ajkr

Differential Revision: D40110069

Pulled By: ajkr

fbshipit-source-id: 598c05777db9be85913a0a85e421b3295ecdff5e
2022-10-10 09:46:09 -07:00
Jay Zhuang c401f285c3 Add option `preserve_internal_time_seconds` to preserve the time info (#10747)
Summary:
Add option `preserve_internal_time_seconds` to preserve the internal
time information.
It's mostly for the migration of the existing data to tiered storage (
`preclude_last_level_data_seconds`). When the tiering feature is just
enabled, the existing data won't have the time information to decide if
it's hot or cold. Enabling this feature will start collect and preserve
the time information for the new data.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747

Reviewed By: siying

Differential Revision: D39910141

Pulled By: siying

fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138
2022-10-07 18:49:40 -07:00
Yanqin Jin 11943e8b27 Exclude timestamp when checking compaction boundaries (#10787)
Summary:
When checking if a range [start, end) overlaps with a compaction whose range is [start1, end1), always exclude timestamp from start, end, start1 and end1, otherwise some versions of one user key may be compacted to bottommost layer while others remain in the original level.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10787

Test Plan: make check

Reviewed By: ltamasi

Differential Revision: D40187672

Pulled By: ltamasi

fbshipit-source-id: 81226267fd3e33ffa79665c62abadf2ebec45496
2022-10-07 14:11:23 -07:00
Jay Zhuang 23fa5b7789 Use `sstableKeyCompare()` for compaction output boundary check (#10763)
Summary:
To make it consistent with the compaction picker which uses the `sstableKeyCompare()` to pick the overlap files. For example, without this change, it may cut L1 files like:
```
 L1: [2-21]  [22-30]
 L2: [1-10] [21-30]
```
Because "21" on L1 is smaller than "21" on L2. But for compaction, these 2 files are overlapped.
`sstableKeyCompare()` also take range delete into consideration which may cut file for the same key.
It also makes the `max_compaction_bytes` calculation more accurate for cases like above, the overlapped bytes was under estimated. Also make sure the 2 keys won't be splitted to 2 files because of reaching `max_compaction_bytes`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10763

Reviewed By: cbi42

Differential Revision: D39971904

Pulled By: cbi42

fbshipit-source-id: bcc309e9c3dc61a8f50667a6f633e6132c0154a8
2022-10-06 15:54:58 -07:00
Levi Tamasi d6d8c007ff Verify columns in NonBatchedOpsStressTest::VerifyDb (#10783)
Summary:
As the first step of covering the wide-column functionality of iterators
in our stress tests, the patch adds verification logic to
`NonBatchedOpsStressTest::VerifyDb` that checks whether the
iterator's value and columns are in sync. Note: I plan to update the other
types of stress tests and add similar verification for prefix scans etc.
in separate PRs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10783

Test Plan: Ran some simple blackbox crash tests.

Reviewed By: riversand963

Differential Revision: D40152370

Pulled By: riversand963

fbshipit-source-id: 8f9d17d7af5da58ccf1bd2057cab53cc9645ac35
2022-10-06 15:07:16 -07:00
Yanqin Jin 4d82b94896 Sanitize min_write_buffer_number_to_merge to 1 with atomic_flush (#10773)
Summary:
With current implementation, within the same RocksDB instance, all column families with non-empty memtables will be scheduled for flush if RocksDB determines that any column family needs to be flushed, e.g. memtable full, write buffer manager, etc., if atomic flush is enabled. Not doing so can lead to data loss and inconsistency when WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting a per-column-family knob, min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush, and should be sanitized during column family creation and db open.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773

Test Plan:
Reproduce: D39993203 has detailed steps.
Run the test with and without the fix.

Reviewed By: cbi42

Differential Revision: D40077955

Pulled By: cbi42

fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240
2022-10-05 12:24:39 -07:00
Changyu Bi eca47fb696 Ignore kBottommostFiles compaction logic when allow_ingest_behind (#10767)
Summary:
fix for https://github.com/facebook/rocksdb/issues/10752 where RocksDB could be in an infinite compaction loop (with compaction reason kBottommostFiles)  if allow_ingest_behind is enabled and the bottommost level is unfilled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10767

Test Plan: Added a unit test to reproduce the compaction loop.

Reviewed By: ajkr

Differential Revision: D40031861

Pulled By: ajkr

fbshipit-source-id: 71c4b02931fbe507a847632905404c9b8fa8c96b
2022-10-05 09:27:14 -07:00
Changyu Bi ffde463a5f Cleanup SuperVersion in Iterator::Refresh() (#10770)
Summary:
Fix a bug in Iterator::Refresh() where the local SV it obtained could be obsolete upon return, and should be cleaned up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10770

Test Plan: added a unit test to reproduce the issue.

Reviewed By: ajkr

Differential Revision: D40063809

Pulled By: ajkr

fbshipit-source-id: 619e728eb0f1ac9540b4d0ad38e43acc37a514b2
2022-10-04 22:23:24 -07:00