Commit graph

12078 commits

Author SHA1 Message Date
Peter Dillinger fe3405e80f Automatic table sizing for HyperClockCache (AutoHCC) (#11738)
Summary:
This change add an experimental next-generation HyperClockCache (HCC) with automatic sizing of the underlying hash table. Both the existing version (stable) and the new version (experimental for now) of HCC are available depending on whether an estimated average entry charge is provided in HyperClockCacheOptions.

Internally, we call the two implementations AutoHyperClockCache (new) and FixedHyperClockCache (existing). The performance characteristics and much of the underlying logic are similar enough that AutoHCC is likely to make FixedHCC obsolete, and so it's best considered an evolution of the same technology or solution rather than an alternative. More specifically, both implementations share essentially the same logic for managing the state of individual entries in the cache, including metadata for reference counting and counting clocks for eviction. This metadata, which I like to call the "low-level HCC protocol," includes a read-write lock on entries, but relaxed consistency requirements on the cache (e.g. allowing rare duplication) means high-level cache operations never wait for these low-level per-entry locks. FixedHCC is fully wait-free.

AutoHCC is different in how entries are indexed into an efficient hash table. AutoHCC is "essentially wait-free" as there is no pattern of typical high-level operations on a large cache that can lead to one thread waiting on another to complete some work, though it can happen in some unusual/unlucky cases, or atypical uses such as erasing specific cache keys. Table growth and entry reclamation is more complex in AutoHCC compared to FixedHCC, so uses some localized locking to manage that. AutoHCC uses linear hashing to grow the table as needed, with low latency and to a precise size. AutoHCC depends on anonymous mmap support from the OS (currently verified working on Linux, MacOS, and Windows) to allow the array underlying a hash table to grow in place without wasting resident memory on space reserved but unused. AutoHCC uses a form of chaining while FixedHCC uses open addressing and double hashing.

More specifics:
* In developing this PR, a rare availability bug (minor) was noticed in the existing HCC implementation of Release()+erase_if_last_ref, which is now inherited into AutoHCC. Fixing this without a performance regression will not be simple, so is left for follow-up work.
* Some existing unit tests required adjustment of operational parameters or conditions to work with the new behaviors of AutoHCC. A number of bugs were found and fixed in the validation process, including getting unit tests in good working order.
* Added an option to cache_bench, `-degenerate_hash_bits` for correctness stress testing described below. For this, the tool uses the reverse-engineered hash function for HCC to generate keys in which the specified number of hash bits, in critical positions, have a fixed value. Essentially each degenerate hash bit will half the number of chain heads utilized and double the average chain length.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11738

Test Plan:
unit tests updated, and already added to db crash test. Also

## Correctness
The code includes generous assertions to check for unexpected states, especially at destruction time, so should be able to detect critical concurrency bugs. Less serious "availability bugs" in which cache data is hidden or cleanly lost are more difficult to detect, but also less scary for data correctness (as long as performance is good and the design is sound).

In average operation, the structure is extremely low stress and low contention (see next section) so stressing the corner case logic requires artificially stressing the operating conditions. First, we keep the structure small to increase the number of threads hitting the same chain or entry, and just one cache shard. Second, we artificially degrade the hashing so that chains are much longer than typical, using the new `-degenerate_hash_bits` option to cache_bench. Third, we re-create the structure from scratch frequently in order to exercise the Grow logic repeatedly and to get the benefit of the consistency checks in the structure's destructor in debug builds. For cache_bench this also means disabling the single-threaded "populate cache" step (normally used for steady state performance testing). And of course use many more threads than cores to have many preemptions.

An effective test for working out bugs was this (using debug build of course):
```
while ./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -cache_size=8000000 -threads=100 -populate_cache=0 -ops_per_thread=10000 -degenerate_hash_bits=6 -num_shard_bits=0; do :; done
```

Or even smaller cases. This setup has around 27 utilized chains, with around 35 entries each, and yield-waits more than 1 million times per second (very high contention; see next section). I have let this run for hours searching for any lingering issues.

I've also run cache_bench under ASAN, UBSAN, and TSAN.

## Essentially wait free
There is a counter for number of yield() calls when one thread is waiting on another. When we pre-populate the structure in a single thread,
```
./cache_bench -cache_type=auto_hyper_clock_cache -histograms=0 -populate_cache=1 -ops_per_thread=200000 2>&1 | grep Yield
```
We see something on the order of 1 yield call per second across 16 threads, even when we load the system other other jobs (parallel compilation). With -populate_cache=0, there are more yield opportunities with parallel table growth. On an otherwise unloaded system, we still see very small (single digit) yield counts, with a chance of getting into the thousands, and getting into 10s of thousands per second during table growth phase if the system is loaded with other jobs. However, I am not worried about this if performance is still good (see next section).

## Overall performance
Although cache_bench initially suggested performance very close to FixedHCC, there was a very noticeable performance hit under a db_bench setup like used in validating https://github.com/facebook/rocksdb/issues/10626. Much of the difference has been reduced by optimizing Lookup with a "naive" pass that will almost always find entries quickly, and only falling back to the careful Lookup algorithm when not found in the first pass.

Setups (chosen to be sensitive to block cache performance), and compiled with USE_CLANG=1 JEMALLOC=1 PORTABLE=0 DEBUG_LEVEL=0:
```
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```

### No regression on FixedHCC
Running before & after builds at the same time on a 48 core machine.
```
TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=readrandom[-X10],block_cache_entry_stats,cache_report_problems -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=24 -cache_type=fixed_hyper_clock_cache -seed=1234
```

Before:
readrandom [AVG    10 runs] : 847234 (± 8150) ops/sec;   59.2 (± 0.6) MB/sec
703MB max RSS

After:
readrandom [AVG    10 runs] : 851021 (± 7929) ops/sec;   59.5 (± 0.6) MB/sec
706MB max RSS

Probably no material difference.

### Single-threaded performance
Using `[-X2]` and `-threads=1` and `-duration=30`, running all three at the same time:

lru_cache: 55100 ops/sec, then 55862 ops/sec  (627MB max RSS)
fixed_hyper_clock_cache: 60496 ops/sec, then 61231 ops/sec (626MB max RSS)
auto_hyper_clock_cache: 47560 ops/sec, then 56081 ops/sec (626MB max RSS)

So AutoHCC has more ramp-up cost in the first pass as the cache grows to the appropriate size. (In single-threaded operation, the parallelizability and per-op low latency of table growth is overall slower.) However, once up to size, its performance is comparable to LRUCache. FixedHCC's lean operations still win overall when a good estimate is available.

If we look at HCC table stats, we can see that this configuration is not favorable to AutoHCC (and I have verified that other memory sizes do not yield substantially different results, until shards are under-sized for the full filters):

FixedHCC:
Slot occupancy stats: Overall 47% (124991/262144), Min/Max/Window = 28%/64%/500, MaxRun{Pos/Neg} = 17/22

AutoHCC:
Slot occupancy stats: Overall 59% (125781/209682), Min/Max/Window = 43%/82%/500, MaxRun{Pos/Neg} = 76/16
Head occupancy stats: Overall 43% (92259/209682), Min/Max/Window = 24%/74%/500, MaxRun{Pos/Neg} = 19/26
Entries at home count: 53350

FixedHCC configuration is relatively good for speed, and not ideal for space utilization. As is typical, AutoHCC has tighter control on metadata usage (209682 x 64 bytes rather than 262144 x 64 bytes), and the higher load factor is slightly worse for speed. LRUCache also has more metadata usage, at 199680 x 96 bytes of tracked metadata (plus roughly another 10% of that untracked in the head pointers), and that metadata is subject to fragmentation.

### Parallel performance, high hit rate
Now using `[-X10]` and `-threads=10`, all three at the same time

lru_cache: [AVG    10 runs] : 263629 (± 1425) ops/sec;   18.4 (± 0.1) MB/sec
655MB max RSS, 97.1% cache hit rate
fixed_hyper_clock_cache: [AVG    10 runs] : 479590 (± 8114) ops/sec;   33.5 (± 0.6) MB/sec
651MB max RSS, 97.1% cache hit rate
auto_hyper_clock_cache: [AVG    10 runs] : 418687 (± 5915) ops/sec;   29.3 (± 0.4) MB/sec
657MB max RSS, 97.1% cache hit rate

Even with just 10-way parallelism for each cache (though 30+/48 cores busy overall), LRUCache is already showing performance degradation, while AutoHCC is in the neighborhood of FixedHCC. And that brings us to the question of how AutoHCC holds up under extreme parallelism, so now independent runs with `-threads=100` (overloading 48 cores).

lru_cache: 438613 ops/sec, 827MB max RSS
fixed_hyper_clock_cache: 1651310 ops/sec, 812MB max RSS
auto_hyper_clock_cache: 1505875 ops/sec, 821MB max RSS (Yield count: 1089 over 30s)

Clearly, AutoHCC holds up extremely well under extreme parallelism, even closing some of the modest performance gap with  FixedHCC.

### Parallel performance, low hit rate
To get down to roughly 50% cache hit rate, we use `-cache_index_and_filter_blocks=0 -cache_size=1650000000` with `-threads=10`. Here the extra cost of running counting clock eviction, especially on the chains of AutoHCC, are evident, especially with the lower contention of cache_index_and_filter_blocks=0:

lru_cache: 725231 ops/sec, 1770MB max RSS, 51.3% hit rate
fixed_hyper_clock_cache: 638620 ops/sec, 1765MB max RSS, 50.2% hit rate
auto_hyper_clock_cache: 541018 ops/sec, 1777MB max RSS, 50.8% hit rate

Reviewed By: jowlyzhang

Differential Revision: D48784755

Pulled By: pdillinger

fbshipit-source-id: e79813dc087474ac427637dd282a14fa3011a6e4
2023-09-01 15:44:38 -07:00
Changyu Bi 9bd1a6fa29 Fix a bug where iterator can return incorrect data for DeleteRange() users (#11786)
Summary:
This should only affect iterator when
- user uses DeleteRange(),
- An iterator from level L has a non-ok status (such non-ok status may not be caught before the bug fix in https://github.com/facebook/rocksdb/pull/11783), and
- A range tombstone covers a key from level > L and triggers a reseek sets the status_ to OK in SeekImpl()/SeekPrevImpl() e.g. bd6a8340c3/table/merging_iterator.cc (L801)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11786

Differential Revision: D48908830

Pulled By: cbi42

fbshipit-source-id: eb564be375af4e33dc27542eff753260186e6d5d
2023-09-01 11:33:15 -07:00
Changyu Bi bd6a8340c3 Fix a bug where iterator status is not checked (#11782)
Summary:
This happens in (Compaction)MergingIterator layer, and can cause data loss during compaction or read/scan return incorrect result

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11782

Reviewed By: ajkr

Differential Revision: D48880575

Pulled By: cbi42

fbshipit-source-id: 2294ad284a6d653d3674bebe55380f12ee4b645b
2023-09-01 09:34:08 -07:00
Jay Huh 47be3ffffb Minor refactor on LDB command for wide column support and release note (#11777)
Summary:
As mentioned in https://github.com/facebook/rocksdb/issues/11754 , refactor to clean up some nearly identical logic. This PR changes the existing debugging string format of Scan command as the following.

```
❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex
```

Before
```
0x6669727374 : :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 : 0x62617A
```
After
```
0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 ==> 0x62617A
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11777

Test Plan:
```
❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz
Keys in range: 3

❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz

❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ dump --hex
0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 ==> 0x62617A
Keys in range: 3

❯ ./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/ scan --hex
0x6669727374 ==> :0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
0x7365636F6E64 ==> 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
0x7468697264 ==> 0x62617A
```

Reviewed By: jowlyzhang

Differential Revision: D48876755

Pulled By: jaykorean

fbshipit-source-id: b1c608a810fe038999ac528b690d398abf5f21d7
2023-08-31 16:17:03 -07:00
Peter Dillinger 83eb7b8c2c Log host name (#11776)
Summary:
... in info_log. Becoming more important with disaggregated storage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11776

Test Plan: manual

Reviewed By: jaykorean

Differential Revision: D48849471

Pulled By: pdillinger

fbshipit-source-id: 9a8fd8b2564a4f133526ecd7c1414cb667e4ba54
2023-08-31 08:39:09 -07:00
Hui Xiao 05daa12332 Change compaction_readahead_size default value to 2MB (#11762)
Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, we rely on `compaction_readahead_size` for how much to read ahead for compaction read under non-direct IO case. https://github.com/facebook/rocksdb/pull/11658 therefore also sanitized 0 `compaction_readahead_size` to 2MB under non-direct IO, which is consistent with the existing sanitization with direct IO.

However, this makes disabling compaction readahead impossible as well as add one more scenario to the inconsistent effects between `Options.compaction_readahead_size=0` during DB open and `SetDBOptions("compaction_readahead_size", "0")` .
- `SetDBOptions("compaction_readahead_size", "0")` will disable compaction readahead as its logic never goes through sanitization above while `Options.compaction_readahead_size=0` will go through sanitization.

Therefore we decided to do this PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11762

Test Plan: Modified existing UTs to cover this PR

Reviewed By: ajkr

Differential Revision: D48759560

Pulled By: hx235

fbshipit-source-id: b3f85e58bda362a6fa1dc26bd8a87aa0e171af79
2023-08-30 14:57:08 -07:00
Yu Zhang fc58c7c62a Add UDT support in SstFileDumper (#11757)
Summary:
For a SST file that uses user-defined timestamp aware comparators, if a lower or upper bound is set, sst_dump tool doesn't handle it well. This PR adds support for that. While working on this `MaybeAddTimestampsToRange` is moved to the udt_util.h file to be shared.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11757

Test Plan:
make all check
for changes in db_impl.cc and db_impl_compaction_flush.cc

for changes in sst_file_dumper.cc, I manually tested this change handles specifying bounds for UDT use cases. It probably should have a unit test file eventually.

Reviewed By: ltamasi

Differential Revision: D48668048

Pulled By: jowlyzhang

fbshipit-source-id: 1560465f40e44668d6d82a7439fe9012be0e74a8
2023-08-30 13:42:04 -07:00
Jay Huh ea9a5b2914 Wide Column support in ldb (#11754)
Summary:
wide_columns can now be pretty-printed in the following commands
- `./ldb dump_wal`
- `./ldb dump`
- `./ldb idump`
- `./ldb dump_live_files`
- `./ldb scan`
- `./sst_dump --command=scan`

There are opportunities to refactor to reduce some nearly identical code. This PR is initial change to add wide column support in `ldb` and `sst_dump` tool. More PRs to come for the refactor.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11754

Test Plan:
**New Tests added**
- `WideColumnsHelperTest::DumpWideColumns`
- `WideColumnsHelperTest::DumpSliceAsWideColumns`

**Changes added to existing tests**
- `ExternalSSTFileTest::BasicMixed` added to cover mixed case (This test should have been added in https://github.com/facebook/rocksdb/issues/11688). This test does not verify the ldb or sst_dump output. This test was used to create test SST files having some rows with wide columns and some without and the generated SST files were used to manually test sst_dump_tool.
- `createSST()` in `sst_dump_test` now takes `wide_column_one_in` to add wide column value in SST

**dump_wal**
```
./ldb dump_wal --walfile=/tmp/rocksdbtest-226125/db_wide_basic_test_2675429_2308393776696827948/000004.log --print_value --header
```
```
Sequence,Count,ByteSize,Physical Offset,Key(s) : value
1,1,59,0,PUT_ENTITY(0) : 0x:0x68656C6C6F 0x617474725F6E616D6531:0x666F6F 0x617474725F6E616D6532:0x626172
2,1,34,42,PUT_ENTITY(0) : 0x617474725F6F6E65:0x74776F 0x617474725F7468726565:0x666F7572
3,1,17,7d,PUT(0) : 0x7468697264 : 0x62617A
```

**idump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ idump
```
```
'first' seq:1, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:2, type:22 => attr_one:two attr_three:four
'third' seq:3, type:1 => baz
Internal keys in range: 3
```

**SST Dump from dump_live_files**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ compact
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump_live_files
```
```
...
==============================
SST Files
==============================
/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst level:1
------------------------------
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
...
```

**dump**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ dump
```
```
first ==> :hello attr_name1:foo attr_name2:bar
second ==> attr_one:two attr_three:four
third ==> baz
Keys in range: 3
```

**scan**
```
./ldb --db=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/ scan
```
```
first : :hello attr_name1:foo attr_name2:bar
second : attr_one:two attr_three:four
third : baz
```

**sst_dump**
```
./sst_dump --file=/tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst --command=scan
```
```
options.env is 0x7ff54b296000
Process /tmp/rocksdbtest-226125/db_wide_basic_test_3481961_2308393776696827948/000013.sst
Sst file format: block-based
from [] to []
'first' seq:0, type:22 => :hello attr_name1:foo attr_name2:bar
'second' seq:0, type:22 => attr_one:two attr_three:four
'third' seq:0, type:1 => baz
```

Reviewed By: ltamasi

Differential Revision: D48837999

Pulled By: jaykorean

fbshipit-source-id: b0280f0589d2b9716bb9b50530ffcabb397d140f
2023-08-30 12:45:52 -07:00
Hui Xiao c073c2edde Revert "Clarify comment about compaction_readahead_size's sanitizatio… (#11773)
Summary:
…n change (https://github.com/facebook/rocksdb/issues/11755)"

This reverts commit 451316597f.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11773

Reviewed By: ajkr

Differential Revision: D48832320

Pulled By: hx235

fbshipit-source-id: 96cef26a885134360766a83505f6717598eac6a9
2023-08-30 12:29:04 -07:00
Yu Zhang 4234a6a301 Increase full_history_ts_low when flush happens during recovery (#11774)
Summary:
This PR adds a missing piece for the UDT in memtable only feature, which is to automatically increase `full_history_ts_low` when flush happens during recovery.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11774

Test Plan:
Added unit test
make all check

Reviewed By: ltamasi

Differential Revision: D48799109

Pulled By: jowlyzhang

fbshipit-source-id: fd681ed66d9d40904ca2c919b2618eb692686035
2023-08-30 09:34:31 -07:00
jsteemann c1e6ffc40a remove a sub-condition that is always true (#11746)
Summary:
the value of `done` is always false here, so the sub-condition `!done` will always be true and the check can be removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11746

Reviewed By: anand1976

Differential Revision: D48656845

Pulled By: ajkr

fbshipit-source-id: 523ba3d07b3af7880c8c8ccb20442fd7c0f49417
2023-08-29 18:40:13 -07:00
Andrew Kryczka e373685dab Add SystemClock::TimedWait() function (#11753)
Summary:
Having a synthetic implementation of `TimedWait()` in `SystemClock` will allow us to add `SyncPoint`s while mutex is released, which was previously impossible since the lock was released and reacquired all within `pthread_cond_timedwait()`. Additionally, integrating `TimedWait()` with `MockSystemClock` allows us to cleanup some workarounds in the test code. In this PR I only cleaned up the `GenericRateLimiter` test workaround.

This is related to the intended follow-up mentioned in https://github.com/facebook/rocksdb/issues/7101's description. There are a couple differences:

(1) This PR does not include removing the particular workaround that initially motivated it. Actually, the `Timer` class uses `InstrumentedCondVar`, so the interface introduced here is inadequate to remove that workaround. On the bright side, the interface introduced in this PR can be changed as needed since it can neither be used nor extended externally, due to using forward-declared `port::CondVar*` in the interface.
(2) This PR only makes the change in `SystemClock` not `Env`. Older revisions of this PR included `Env::TimedWait()` and `SpecialEnv::TimedWait()`; however, since they were unused it probably makes sense to defer adding them until when they are needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11753

Reviewed By: pdillinger

Differential Revision: D48654995

Pulled By: ajkr

fbshipit-source-id: 15e19f2454b64d4ec7f50e328691c66ca9911122
2023-08-29 18:39:10 -07:00
jsteemann 0b8b17a9d1 avoid find() -> insert() sequence (#11743)
Summary:
when a key is recorded for locking in a pessimistic transaction, the key is first looked up in a map, and then inserted into the map if it was not already contained.
this can be simplified to an unconditional insert. in the ideal case that all keys are unique, this saves all the find() operations.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11743

Reviewed By: anand1976

Differential Revision: D48656798

Pulled By: ajkr

fbshipit-source-id: d0150de2db757e0c05e1797cfc24380790c71276
2023-08-29 18:34:59 -07:00
Yu Zhang ecbeb305a0 Removing some checks for UDT in memtable only feature (#11732)
Summary:
The user-defined timestamps feature only enforces that for the same key, user-defined timestamps should be non-decreasing. For the user-defined timestamps in memtable only feature, during flush, we check the user-defined timestamps in each memtable to examine if the data is considered expired with regard to `full_history_ts_low`. In this process, it's assuming that a newer memtable should not have smaller user-defined timestamps than an older memtable. This check however is enforcing ordering of user-defined timestamps across keys, as apposed to the vanilla UDT feature, that only enforce ordering of user-defined timestamps for the same key.

This more strict user-defined timestamp ordering requirement could be an issue for secondary instances where commits can be out of order. And after thinking more about it, this requirement is really an overkill to keep the invariants of `full_history_ts_low` which are:

1) users cannot read below `full_history_ts_low`
2) users cannot write at or below `full_history_ts_low`
3) `full_history_ts_low` can only be increasing

As long as RocksDB enforces these 3 checks, we can prohibit inconsistent read that returns a different value. And these three checks are covered in existing APIs.

So this PR removes the extra checks in the UDT in memtable only feature that requires user-defined timestamps to be non decreasing across keys.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11732

Reviewed By: ltamasi

Differential Revision: D48541466

Pulled By: jowlyzhang

fbshipit-source-id: 95453c6e391cbd511c0feab05f0b11c312d17186
2023-08-29 16:51:48 -07:00
akankshamahajan f36394ff20 Fix seg fault in auto_readahead_size with async_io (#11769)
Summary:
Fix seg fault in auto_readahead_size with async_io when readahead_size = 0. If readahead_size is trimmed and is 0, it's not eligible for further prefetching and should return.

Error occured when the first buffer already contains data and it goes for prefetching in second buffer leading to assertion failure -

`assert(roundup_len1 >= alignment);
`
because roundup_len1 = length + readahead_size.
length is 0 and readahead_size is also 0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11769

Test Plan: Reproducible with db_stress with async_io enabled.

Reviewed By: anand1976

Differential Revision: D48743031

Pulled By: akankshamahajan15

fbshipit-source-id: 0e08c41f862f6287ca223fbfaf6cd42fc97b3c87
2023-08-28 17:08:28 -07:00
Andrew Kryczka 310a242c57 Fix GenericRateLimiter hanging bug (#11763)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11742

Even after performing duty (1) ("Waiting for the next refill time"), it is possible the remaining threads are all in `Wait()`. Waking up at least one thread is enough to ensure progress continues, even if no new requests arrive.

The repro unit test (https://github.com/facebook/rocksdb/commit/bb54245e6) is not included as it depends on an unlanded PR (https://github.com/facebook/rocksdb/issues/11753)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11763

Reviewed By: jaykorean

Differential Revision: D48710130

Pulled By: ajkr

fbshipit-source-id: 9d166bd577ea3a96ccd81dde85871fec5e85a4eb
2023-08-28 13:36:25 -07:00
Jan ba59751430 remove an unused typedef (#11286)
Summary:
`VersionBuilderMap` type alias definition seem unused.
If this PR can be compiled fine then the alias is probably not needed anymore.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11286

Reviewed By: jaykorean

Differential Revision: D48656747

Pulled By: ajkr

fbshipit-source-id: ac8554922aead7dc3d24fe7e6544a4622578c514
2023-08-25 18:01:14 -07:00
Richard Barnes 38e9e6903e Del (object) from 200 inc instagram-server/distillery/slipstream/thrift_models/StoryFeedMediaSticker/ttypes.py
Summary: Python3 makes the use of `(object)` in class inheritance unnecessary. Let's modernize our code by eliminating this.

Reviewed By: itamaro

Differential Revision: D48673915

fbshipit-source-id: a1a6ae8572271eb2898b748c8216ea68e362f06a
2023-08-25 16:22:09 -07:00
akankshamahajan 6cbb104663 Fix seg fault in auto_readahead_size during IOError (#11761)
Summary:
Fix seg fault in auto_readahead_size
```
db_stress:
internal_repo_rocksdb/repo/table/block_based/partitioned_index_iterator.h:70: virtual rocksdb::IndexValue rocksdb::PartitionedIndexIterator::value() const: Assertion `Valid()' failed.
```

During seek, after calculating readahead_size, db_stress can inject IOError resulting in failure to index_iter_->Seek and making index_iter_ invalid.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11761

Test Plan: Reproducible locally and passed with this fix

Reviewed By: anand1976

Differential Revision: D48696248

Pulled By: akankshamahajan15

fbshipit-source-id: 2be43bf56ad0fc2f95f9093c19c9a1b15a716091
2023-08-25 13:50:48 -07:00
Peter Dillinger d3420464c3 cache_bench enhancements for jemalloc etc. (#11758)
Summary:
* Add some options to cache_bench to use JemallocNodumpAllocator
* Make num_shard_bits option use and report cache-specific defaults
* Add a usleep option to sleep between operations, for simulating a workload with more CPU idle/wait time.
* Use const& for JemallocAllocatorOptions, to improve API usability (e.g. can bind to temporary `{}`)
* InstallStackTraceHandler()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11758

Test Plan: manual

Reviewed By: jowlyzhang

Differential Revision: D48668479

Pulled By: pdillinger

fbshipit-source-id: b6032fbe09444cdb8f1443a5e017d2eea4f6205a
2023-08-24 19:14:38 -07:00
Akanksha Mahajan 6353c6e2fb Add new experimental ReadOption auto_readahead_size to db_bench and db_stress (#11729)
Summary:
Same as title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11729

Test Plan: make crash_test -j32

Reviewed By: anand1976

Differential Revision: D48534820

Pulled By: akankshamahajan15

fbshipit-source-id: 3a2a28af98dfad164b82ddaaf9fddb94c53a652e
2023-08-24 14:58:27 -07:00
Hui Xiao 451316597f Clarify comment about compaction_readahead_size's sanitization change (#11755)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11755

Reviewed By: anand1976

Differential Revision: D48656627

Pulled By: hx235

fbshipit-source-id: 568fa7749cbf6ecf65102b4513fa3af975fd91b8
2023-08-24 14:55:48 -07:00
Fuat Basik bc448e9c89 Run db_stress for final time to ensure un-interrupted validation (#11592)
Summary:
In blackbox tests, db_stress command always run with timeout. Timeout can happen during validation, leaving some of the keys not checked. Since key validation is done in order, it is quite likely that keys those are towards to the end of the set are never validated. This PR adds a final execution, without timeout, to ensure validation is executed for all keys, at least once.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11592

Reviewed By: cbi42

Differential Revision: D48003998

Pulled By: hx235

fbshipit-source-id: 72543475a932f12cf0f57534b7e3b6e07e87080f
2023-08-23 15:24:23 -07:00
Hui Xiao f833ca3878 Pick files from the last sorted run in size amp compaction picker (#11740)
Summary:
**Context/Summary:**
Same intention as https://github.com/facebook/rocksdb/pull/2693 - basically we now pick from the last sorted run and expand forward till we can't

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11740

Test Plan:
Existing UT
Stress test

Reviewed By: ajkr

Differential Revision: D48586475

Pulled By: hx235

fbshipit-source-id: 3eb3c3ee1d5f7e0b0d6d649baaeb8c6990fee398
2023-08-23 11:27:48 -07:00
Alexander Bulimov 2b6bcfe590 Add C API for WaitForCompact (#11737)
Summary:
Add a bunch of C API functions to expose new `WaitForCompact` function and related options.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11737

Test Plan: unit tests

Reviewed By: jaykorean

Differential Revision: D48568239

Pulled By: abulimov

fbshipit-source-id: 1ff35972d7abacd7e1e17fe2ada1e20cdc88d8de
2023-08-22 14:32:35 -07:00
chuhao zeng 1303573589 Reverse sort order in dedup to enable iter checking in callback (#11725)
Summary:
Fix https://github.com/facebook/rocksdb/issues/6470

Ensure TransactionLogIter being initialized correctly with SYNC_POINT API when calling `GetSortedWALFiles`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11725

Reviewed By: akankshamahajan15

Differential Revision: D48529411

Pulled By: ajkr

fbshipit-source-id: 970ca1a6259ed996c6d87f7fcd40f95acf441517
2023-08-22 11:22:35 -07:00
Changyu Bi 5e0584bd73 Do not drop unsynced data during reopen in stress test (#11731)
Summary:
Currently the stress test does not support restoring expected state (to a specific sequence number) when there is unsynced data loss during the reopen phase. This causes a few internal stress test failure with errors like inconsistent value. This PR disables dropping unsynced data during reopen to avoid failures due to this issue. We can re-enable later after we decide to support unsynced data loss during DB reopen in stress test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11731

Test Plan:
* Running this test a few times can fail for inconsistent value before this change
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_concurrent_memtable_write=1 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=20.57166126835524 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=1 --compaction_ttl=100 --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=0 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --enable_thread_tracking=0 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=big --flush_one_in=1000000 --format_version=3 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=6 --index_type=3 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=1000000 --log2_keys_per_lock=10 --long_running_snapshots=1 --manual_wal_flush_one_in=1000000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_max_range_deletions=100 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=5 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=0 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=524288 --readpercent=50 --recycle_log_file_num=0 --reopen=20 --ribbon_starting_level=0 --secondary_cache_fault_one_in=32 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --subcompactions=3 --sync=0 --sync_fault_injection=1 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=2 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=1 --use_merge=0 --use_multi_get_entity=0 --use_multiget=1 --use_put_entity_one_in=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --wal_compression=zstd --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=35```

Reviewed By: hx235

Differential Revision: D48537494

Pulled By: cbi42

fbshipit-source-id: ddae21b9bb6ee8d67229121f58513e95f7ef6d8d
2023-08-22 09:47:04 -07:00
Yu Zhang 2a9f3b6cc5 Try to use a db's OPTIONS file for some ldb commands (#11721)
Summary:
For some ldb commands that doesn't need to open the DB, it's still useful to use the DB's existing OPTIONS file if it's available.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11721

Reviewed By: pdillinger

Differential Revision: D48485540

Pulled By: jowlyzhang

fbshipit-source-id: 2d2db837523044066f1a2c4b59a5c03f6cd35e6b
2023-08-21 15:04:22 -07:00
anand76 4b53520709 Update HISTORY.md and version.h for 8.6 (#11728)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11728

Reviewed By: jaykorean, jowlyzhang

Differential Revision: D48527100

Pulled By: anand1976

fbshipit-source-id: c48baa44e538fb6bfd3fe7f19046746d3540763f
2023-08-21 13:25:04 -07:00
Jay Huh 4fa2c01719 Replace existing waitforcompaction with new WaitForCompact API in db_bench_tool (#11727)
Summary:
As the new API to wait for compaction is available (https://github.com/facebook/rocksdb/issues/11436), we can now replace the existing logic of waiting in db_bench_tool with the new API.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11727

Test Plan:
```
./db_bench --benchmarks="fillrandom,compactall,waitforcompaction,readrandom"
```
**Before change**
```
Set seed to 1692635571470041 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 8.6.0
Date:       Mon Aug 21 09:33:40 2023
CPU:        80 * Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
CPUCache:   28160 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
WARNING: Optimization is disabled: benchmarks unnecessarily slow
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/tmp/rocksdbtest-226125/dbbench]
fillrandom   :      51.826 micros/op 19295 ops/sec 51.826 seconds 1000000 operations;    2.1 MB/s
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): started
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): finished
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): started
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): finished
DB path: [/tmp/rocksdbtest-226125/dbbench]
readrandom   :      39.042 micros/op 25613 ops/sec 39.042 seconds 1000000 operations;    1.8 MB/s (632886 of 1000000 found)
```
**After change**
```
Set seed to 1692636574431745 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 8.6.0
Date:       Mon Aug 21 09:49:34 2023
CPU:        80 * Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
CPUCache:   28160 KB
Keys:       16 bytes each (+ 0 bytes user-defined timestamp)
Values:     100 bytes each (50 bytes after compression)
Entries:    1000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    110.6 MB (estimated)
FileSize:   62.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
WARNING: Optimization is disabled: benchmarks unnecessarily slow
WARNING: Assertions are enabled; benchmarks unnecessarily slow
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
DB path: [/tmp/rocksdbtest-226125/dbbench]
fillrandom   :      51.271 micros/op 19504 ops/sec 51.271 seconds 1000000 operations;    2.2 MB/s
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): started
waitforcompaction(/tmp/rocksdbtest-226125/dbbench): finished with status (OK)
DB path: [/tmp/rocksdbtest-226125/dbbench]
readrandom   :      39.264 micros/op 25468 ops/sec 39.264 seconds 1000000 operations;    1.8 MB/s (632921 of 1000000 found)
```

Reviewed By: ajkr

Differential Revision: D48524667

Pulled By: jaykorean

fbshipit-source-id: 1052a15b2ed79a35165ec4d9998d0454b2552ef4
2023-08-21 12:14:57 -07:00
Yu Zhang 03a74411c0 Add unit test for default temperature (#11722)
Summary:
This piggy back the existing last level file temperature statistics test to test the default temperature becoming effective.

While adding this unit test, I found that the approach to swap out and use default temperature in `VersionBuilder::LoadTableHandlers` will miss the L0 files created from flush, and only work for existing SST files, SST files created by compaction. So this PR moves that logic to `TableCache::GetTableReader`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11722

Test Plan:
```
./db_test2 --gtest_filter="*LastLevelStatistics*"
make all check
```

Reviewed By: pdillinger

Differential Revision: D48489171

Pulled By: jowlyzhang

fbshipit-source-id: ac29f7d484916f3218729594c5bb35c4f2979ac2
2023-08-21 12:14:03 -07:00
Levi Tamasi a9770b185d Circleci macos sunset (#11633)
Summary:
[draft] this PR is created in order to test CI changes
Closes: https://github.com/facebook/rocksdb/pull/11543

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11633

Reviewed By: akankshamahajan15

Differential Revision: D48525552

Pulled By: cbi42

fbshipit-source-id: 758d57f248304213228af459789459cc2f0bf419
2023-08-21 11:53:40 -07:00
Hui Xiao f53018c0c8 Improve PrefetchTest.Basic with explicit flush and file num variable (#11720)
Summary:
**Context/Summary:** as title, should be harmless. And it's a guessed fix to https://github.com/facebook/rocksdb/issues/11717 while no repro has obtained on my end yet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11720

Test Plan: existing tests

Reviewed By: cbi42

Differential Revision: D48475661

Pulled By: hx235

fbshipit-source-id: 7c7390319f094c540e703fe2e78a8d601b7a894b
2023-08-18 17:47:22 -07:00
akankshamahajan f65a0379f0 Implement trimming of readhead size when upper bound is specified (#11684)
Summary:
Implement trimming of readahead_size under a new option ReadOptions.auto_readahead_size. It'll trim the readahead_size during prefetching upto iterate_upper_bound offset only when ReadOptions.iterate_upper_bound is set, therefore reducing the prefetching of data beyond upper_bound.
It's enabled for both implicit auto readahead size and when ReadOptions.readahead_size is specified and for sync and async_io.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11684

Test Plan: Added new unit test

Reviewed By: anand1976

Differential Revision: D48479723

Pulled By: akankshamahajan15

fbshipit-source-id: 2b1703579caf779105e836b580866ffd7db076fc
2023-08-18 15:52:04 -07:00
Changyu Bi c2aad555c3 Add CompressionOptions::checksum for enabling ZSTD checksum (#11666)
Summary:
Optionally enable zstd checksum flag (d857369028/lib/zstd.h (L428)) to detect corruption during decompression. Main changes are in compression.h:
* User can set CompressionOptions::checksum to true to enable this feature.
* We enable this feature in ZSTD by setting the checksum flag in ZSTD compression context: `ZSTD_CCtx`.
* Uses `ZSTD_compress2()` to do compression since it supports frame parameter like the checksum flag. Compression level is also set in compression context as a flag.
* Error handling during decompression to propagate error message from ZSTD.
* Updated microbench to test read performance impact.

About compatibility, the current compression decoders should continue to work with the data created by the new compression API `ZSTD_compress2()`: https://github.com/facebook/zstd/issues/3711.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11666

Test Plan:
* Existing unit tests for zstd compression
* Add unit test `DBTest2.ZSTDChecksum` to test the corruption case
* Manually tested that compression levels, parallel compression, dictionary compression, index compression all work with the new ZSTD_compress2() API.
* Manually tested with `sst_dump --command=recompress` that different compression levels and dictionary compression settings all work.
* Manually tested compiling with older versions of ZSTD: v1.3.8, v1.1.0, v0.6.2.
* Perf impact: from public benchmark data: http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html for checksum and https://github.com/facebook/zstd#benchmarks, if decompression is 1700MB/s and checksum computation is 70000MB/s, checksum computation is an additional ~2.4% time for decompression. Compression is slower and checksumming should be less noticeable.
* Microbench:
```
TEST_TMPDIR=/dev/shm ./branch_db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:1048576/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:0/compression_type:7/compression_checksum:1/no_blockcache:1/iterations:10000/threads:1 --benchmark_repetitions=100

Min out of 100 runs:
Main:
10390 10436 10456 10484 10499 10535 10544 10545 10565 10568

After this PR, checksum=false
10285 10397 10503 10508 10515 10557 10562 10635 10640 10660

After this PR, checksum=true
10827 10876 10925 10949 10971 11052 11061 11063 11100 11109
```
* db_bench:
```
Write perf
TEST_TMPDIR=/dev/shm/ ./db_bench_ichecksum --benchmarks=fillseq[-X10] --compression_type=zstd --num=10000000 --compression_checksum=..

[FillSeq checksum=0]
fillseq [AVG    10 runs] : 281635 (± 31711) ops/sec;   31.2 (± 3.5) MB/sec
fillseq [MEDIAN 10 runs] : 294027 ops/sec;   32.5 MB/sec

[FillSeq checksum=1]
fillseq [AVG    10 runs] : 286961 (± 34700) ops/sec;   31.7 (± 3.8) MB/sec
fillseq [MEDIAN 10 runs] : 283278 ops/sec;   31.3 MB/sec

Read perf
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=readrandom[-X20] --num=100000000 --reads=1000000 --use_existing_db=true --readonly=1

[Readrandom checksum=1]
readrandom [AVG    20 runs] : 360928 (± 3579) ops/sec;    4.0 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 362468 ops/sec;    4.0 MB/sec

[Readrandom checksum=0]
readrandom [AVG    20 runs] : 380365 (± 2384) ops/sec;    4.2 (± 0.0) MB/sec
readrandom [MEDIAN 20 runs] : 379800 ops/sec;    4.2 MB/sec

Compression
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=compress[-X20] --compression_type=zstd --num=100000000 --compression_checksum=1

checksum=1
compress [AVG    20 runs] : 54074 (± 634) ops/sec;  211.2 (± 2.5) MB/sec
compress [MEDIAN 20 runs] : 54396 ops/sec;  212.5 MB/sec

checksum=0
compress [AVG    20 runs] : 54598 (± 393) ops/sec;  213.3 (± 1.5) MB/sec
compress [MEDIAN 20 runs] : 54592 ops/sec;  213.3 MB/sec

Decompression:
TEST_TMPDIR=/dev/shm ./db_bench_ichecksum --benchmarks=uncompress[-X20] --compression_type=zstd --compression_checksum=1

checksum = 0
uncompress [AVG    20 runs] : 167499 (± 962) ops/sec;  654.3 (± 3.8) MB/sec
uncompress [MEDIAN 20 runs] : 167210 ops/sec;  653.2 MB/sec
checksum = 1
uncompress [AVG    20 runs] : 167980 (± 924) ops/sec;  656.2 (± 3.6) MB/sec
uncompress [MEDIAN 20 runs] : 168465 ops/sec;  658.1 MB/sec
```

Reviewed By: ajkr

Differential Revision: D48019378

Pulled By: cbi42

fbshipit-source-id: 674120c6e1853c2ced1436ac8138559d0204feba
2023-08-18 15:01:59 -07:00
Jay Huh 0fa0c97d3e Timeout in microsecond option in WaitForCompactOptions (#11711)
Summary:
While it's rare, we may run into a scenario where `WaitForCompact()` waits for background jobs indefinitely. For example, not enough space error will add the job back to the queue while WaitForCompact() waits for _all jobs_ including the jobs that are in the queue to be completed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11711

Test Plan:
`DBCompactionWaitForCompactTest::WaitForCompactToTimeout` added
`timeout` option added to the variables for all of the existing DBCompactionWaitForCompactTests

Reviewed By: pdillinger, jowlyzhang

Differential Revision: D48416390

Pulled By: jaykorean

fbshipit-source-id: 7b6a12f705ab6c6dfaf8ad736a484ca654a86106
2023-08-18 11:21:45 -07:00
anand76 a1743e85be Implement a allow cache hits admission policy for the compressed secondary cache (#11713)
Summary:
This PR implements a new admission policy for the compressed secondary cache, which includes the functionality of the existing policy, and also admits items evicted from the primary block cache with the hit bit set. Effectively, the new policy works as follows -
1. When an item is demoted from the primary cache without a hit, a placeholder is inserted in the compressed cache. A second demotion will insert the full entry.
2. When an item is promoted from the compressed cache to the primary cache for the first time, a placeholder is inserted in the primary. The second promotion inserts the full entry, while erasing it form the compressed cache.
3. If an item is demoted from the primary cache with the hit bit set, it is immediately inserted in the compressed secondary cache.
The ```TieredVolatileCacheOptions``` has been updated with a new option, ```adm_policy```, which allows the policy to be selected.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11713

Reviewed By: pdillinger

Differential Revision: D48444512

Pulled By: anand1976

fbshipit-source-id: b4cbf8c169a88097dff08e36e8bc4b3088de1492
2023-08-18 11:19:48 -07:00
Han Zhu a67ef998dc Explicitly instantiate MaybeReadBlockAndLoadToCache as well (#11714)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11714

Fixes T161017540.

The staging build starts failing with an undefined symbol error:
```
ld.lld: error: undefined symbol: std::enable_if<rocksdb::ParsedFullFilterBlock::kCacheEntryRole == (rocksdb::CacheEntryRole)13 || true, rocksdb::Status>::type rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, rocksdb::CachableEntry<rocksdb::ParsedFullFilterBlock>*, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*, bool) const
```
This is the `MaybeReadBlockAndLoadToCache` function where `TBlocklike = ParsedFullFilterBlock`. The trigger was an FDO profile update D48261413.

`MaybeReadBlockAndLoadToCache` is used in the same translation unit `block_based_table_reader.cc`, and also in another file `partitioned_filter_block.cc`. The later was the file that couldn't find the symbol. It seems after the FDO profile update, `MaybeReadBlockAndLoadToCache` may've got inlined into its caller in `block_based_table_reader.cc`. And with no knowledge of other usages, the symbol got stripped.

Explicitly instantiate the template similar to how `RetrieveBlock` was handled.

Reviewed By: pdillinger, akankshamahajan15

Differential Revision: D48400574

fbshipit-source-id: d4a80999bfb6ce4afa80678444139fcd8ae84aa4
2023-08-18 10:19:33 -07:00
Yu Zhang 1e77e35d26 Add a per column family default temperature option for accounting (#11708)
Summary:
Add a column family option `default_temperature` that will be used for file reading accounting purpose, such as io statistics, for files that don't have an explicitly set temperature.

This options is not a mutable one, changing its value would require a DB restart. This is to avoid the confusion that had the option being a mutable one, the users may expect it to take effect on all files immediately, while in reality, it would only become effective for SST files opened in the future.

This `default_temperature` also just affect accounting during one DB session. It won't be recorded in manifest as the file's temperature and can be different across different DB sessions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11708

Test Plan:
```
make all check
```

Reviewed By: pdillinger

Differential Revision: D48375763

Pulled By: jowlyzhang

fbshipit-source-id: eb756696c14a694c6e2a93d2bb6f040563194981
2023-08-17 17:06:57 -07:00
Peter Dillinger 966be1cc4e Clean up some FastRange calls (#11707)
Summary:
* JemallocNodumpAllocator was passing a size_t to FastRange32, which could cause compilation errors or warnings (seen with clang)
* Fixed the order of arguments to match what would be used with modulo operator (%), for clarity.

Fixes https://github.com/facebook/rocksdb/issues/11006

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11707

Test Plan: no functional change, existing tests

Reviewed By: ajkr

Differential Revision: D48435149

Pulled By: pdillinger

fbshipit-source-id: e6e8b107ded4eceda37db20df59985c846a2546b
2023-08-17 11:52:38 -07:00
Changyu Bi d1ff401472 Delay bottommost level single file compactions (#11701)
Summary:
For leveled compaction, RocksDB has a special kind of compaction with reason "kBottommmostFiles" that compacts bottommost level files to clear data held by snapshots (more detail in https://github.com/facebook/rocksdb/issues/3009). Such compactions can happen soon after a relevant snapshot is released. For some use cases, a bottommost file may contain only a small amount of keys that can be cleared, so compacting such a file has a high write amp. In addition, these bottommost files may be compacted in compactions with reason other than "kBottommmostFiles" if we wait for some time (so that enough data is ingested to trigger such a compaction). This PR introduces an option `bottommost_file_compaction_delay` to specify the delay of these bottommost level single file compactions.

* The main change is in `VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction()` where we only add a file to `bottommost_files_marked_for_compaction_` if it oldest_snapshot is larger than its non-zero largest_seqno **and** the file is old enough. Note that if a file is not old enough but its largest_seqno is less than oldest_snapshot, we exclude it from the calculation of `bottommost_files_mark_threshold_`. This makes the change simpler, but such a file's eligibility for compaction will only be checked the next time `ComputeBottommostFilesMarkedForCompaction()` is called. This happens when a new Version is created (compaction, flush, SetOptions()...), a new enough snapshot is released (`VersionStorageInfo::UpdateOldestSnapshot()`) or when a compaction is picked and compaction score has to be re-calculated.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11701

Test Plan:
* Add two unit tests to test when bottommost_file_compaction_delay > 0.
* Ran crash test with the new option.

Reviewed By: jaykorean, ajkr

Differential Revision: D48331564

Pulled By: cbi42

fbshipit-source-id: c584f3dc5f6354fce3ed65f4c6366dc450b15ba8
2023-08-16 17:45:44 -07:00
Andrew Kryczka 0b6ee88d51 clarify TODO for whitebox disable_wal=1 in db_crashtest.py (#11665)
Summary:
See https://github.com/facebook/rocksdb/issues/11613

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11665

Reviewed By: hx235

Differential Revision: D48010507

Pulled By: ajkr

fbshipit-source-id: 65c6d87d2c6ffc9d25f1d17106eae467ec528082
2023-08-16 09:43:20 -07:00
Jay Huh b63018fb59 Wide Column Ingestion in CrashTest (#11697)
Summary:
`PutEntity` is now supported in SST file writer (https://github.com/facebook/rocksdb/issues/11688). This PR enables ingestion of wide column data in the stress/crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11697

Test Plan:
```
python3 tools/db_crashtest.py blackbox --simple --duration=300 --ingest_external_file_one_in=2 --use_put_entity_one_in=2 --max_key=1048576 -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 --interval=10 -value_size_mult=33 -column_families=1 -reopen=0 --key_len_percent_dist="1,30,69"
```

Reviewed By: ltamasi

Differential Revision: D48370719

Pulled By: jaykorean

fbshipit-source-id: 5855d3112b37b2fb300d05e6df110d899855d77d
2023-08-15 16:13:13 -07:00
Yu Zhang 407efb021c Expose the root comparator for built-in With64Ts comparators (#11704)
Summary:
As titled. User-defined timestamp feature users sometimes directly call the user comparator to do validation on their side too. Having access to the root comparator can help make their code consistent for when UDT is enabled and disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11704

Reviewed By: ltamasi

Differential Revision: D48355090

Pulled By: jowlyzhang

fbshipit-source-id: 26bc73543bfb379ef548d1361803d6f8c308cef6
2023-08-15 13:44:13 -07:00
Yu Zhang 6a3da5635e Add documentation to some formatting util functions (#11674)
Summary:
As titled, mostly adding documentation. While updating one usage of these util functions in the external file ingestion job based on code inspection.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11674

Test Plan:
```
make check
```

Note that no unit test was added or updated to check the change in the external file ingestion flow works. This is because user-defined timestamp doesn't support bulk loading yet. There could be other missing pieces that are needed to make this flow functional and testable. That work is separately tracked and unit tests will be added then.

Reviewed By: cbi42

Differential Revision: D48271338

Pulled By: jowlyzhang

fbshipit-source-id: c05c3440f1c08632dd0de51b563a30b44b4eb8b5
2023-08-14 22:04:18 -07:00
Andrew Kryczka a09c141dde In TestIterateAgainstExpected(), verify iterator moves in expected direction (#11698)
Summary:
It's a bit repetitive in order to give reasonably informative error messages.

I also removed total_order_seek in cases where it's not needed, just to make sure a case that shouldn't matter really doesn't.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11698

Test Plan:
run it -

```
$ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=100000 --duration=86400 --interval=10 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --blob_compression_type=none --writepercent=50 -iterpercent=45 -readpercent=0 -prefixpercent=0 --prefix_size=0 --verify_iterator_with_expected_state_one_in=10 --test_batches_snapshots=0 -enable_compaction_filter=0
```

Reviewed By: cbi42

Differential Revision: D48285036

Pulled By: ajkr

fbshipit-source-id: 51b147bd7c8011740629ae2fd8114d3d48ce7137
2023-08-14 14:57:28 -07:00
Jay Huh 793a786fa3 Fix for unchecked status in CancelAllBackgroundWork (#11699)
Summary:
## Summary
PR https://github.com/facebook/rocksdb/issues/11497 introduced this. Status from `CancelPeriodicTaskScheduler()` is unchecked and causing test failure like https://app.circleci.com/pipelines/github/facebook/rocksdb/30743/workflows/24443a9b-6fc3-41e6-86c1-992d766eb1ec/jobs/642419

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11699

Test Plan: Existing tests

Reviewed By: cbi42

Differential Revision: D48287188

Pulled By: jaykorean

fbshipit-source-id: b6bcf6e3c3c47f126c34c24a3dfed2649635cc8c
2023-08-11 19:59:56 -07:00
Peter Dillinger ef6f025563 Placeholder for AutoHyperClockCache, more (#11692)
Summary:
* The plan is for AutoHyperClockCache to be selected when HyperClockCacheOptions::estimated_entry_charge == 0, and in that case to use a new configuration option min_avg_entry_charge for determining an extreme case maximum size for the hash table. For the placeholder, a hack is in place in HyperClockCacheOptions::MakeSharedCache() to make the unit tests happy despite the new options not really making sense with the current implementation.
* Mostly updating and refactoring tests to test both the current HCC (internal name FixedHyperClockCache) and a placeholder for the new version (internal name AutoHyperClockCache).
* Simplify some existing tests not to depend directly on cache type.
* Type-parameterize the shard-level unit tests, which unfortunately requires more syntax like `this->` in places for disambiguation.
* Added means of choosing auto_hyper_clock_cache to cache_bench, db_bench, and db_stress, including add to crash test.
* Add another templated class BaseHyperClockCache to reduce future copy-paste
* Added ReportProblems support to cache_bench
* Added a DEBUG-level diagnostic to ReportProblems for the variance in load factor throughout the table, which will become more of a concern with linear hashing to be used in the Auto implementation. Example with current Fixed HCC:
```
2023/08/10-13:41:41.602450 6ac36 [DEBUG] [che/clock_cache.cc:1507] Slot occupancy stats: Overall 49% (129008/262144), Min/Max/Window = 39%/60%/500, MaxRun{Pos/Neg} = 18/17
```

In other words, with overall occupancy of 49%, the lowest across any 500 contiguous cells is 39% and highest 60%. Longest run of occupied is 18 and longest run of unoccupied is 17. This seems consistent with random samples from a uniform distribution.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11692

Test Plan: Shouldn't be any meaningful changes yet to production code or to what is tested, but there is temporary redundancy in testing until the new implementation is plugged in.

Reviewed By: jowlyzhang

Differential Revision: D48247413

Pulled By: pdillinger

fbshipit-source-id: 11541f996d97af403c2e43c92fb67ff22dd0b5da
2023-08-11 16:27:38 -07:00
Hui Xiao 38ecfabed2 Remove comment about locking about TestIterateAgainstExpected (#11695)
Summary:
**Context/Summary**
After https://github.com/facebook/rocksdb/pull/11058, we no longer lock the key range to iterate in TestIterateAgainstExpected, except for working with timestamp feature.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11695

Test Plan: no code change

Reviewed By: ajkr

Differential Revision: D48276668

Pulled By: hx235

fbshipit-source-id: dc92a3708b2281dc737c0877fb755548bf03a9fc
2023-08-11 13:14:04 -07:00
Jay Huh 52816ff64d Close DB option in WaitForCompact() (#11497)
Summary:
Context:

As mentioned in https://github.com/facebook/rocksdb/issues/11436, introducing `close_db` option in `WaitForCompactOptions` to close DB after waiting for compactions to finish. Must be set to true to close the DB upon compactions finishing.
1. `bool close_db = false` added to `WaitForCompactOptions`
2. Introduced `CancelPeriodicTaskSchedulers()` and moved unregistering PeriodicTaskSchedulers to it.`CancelAllBackgroundWork()` calls it now.
3. When close_db option is on, unpersisted data (data in memtable when WAL is disabled) will be flushed in `WaitForCompact()` if flush option is not on (and `mutable_db_options_.avoid_flush_during_shutdown` is not true). The unpersisted data flush in `CancelAllBackgroundWork()` will be skipped because `shutting_down_` flag will be set true before calling `Close()`.
4. Atomic boolean `reject_new_background_jobs_` is introduced to prevent new background jobs from being added during the short period of time after waiting is done and before `shutting_down_` is set by `Close()`.
5. `WaitForCompact()` now waits for recovery in progress to complete as well. (flush operations from WAL -> L0 files)
6. Added `close_db_` cases to all existing `WaitForCompactTests`
7. Added a scenario to `DBBasicTest::DBClose`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11497

Test Plan:
- Existing DBCompactionTests
- `WaitForCompactWithOptionToFlushAndCloseDB` added
- Added a scenario to `DBBasicTest::DBClose`

Reviewed By: pdillinger, jowlyzhang

Differential Revision: D46337560

Pulled By: jaykorean

fbshipit-source-id: 0f8c7ee09394847f2af5ea4bdd331b47bcdef0b0
2023-08-11 12:30:48 -07:00