Commit Graph

12222 Commits

Author SHA1 Message Date
Yu Zhang c751583c03 Set default cf ts sz for a reused transaction (#11685)
Summary:
Set up the default column family timestamp size for a reused write committed transaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11685

Test Plan: Added unit test.

Reviewed By: ltamasi

Differential Revision: D48195129

Pulled By: jowlyzhang

fbshipit-source-id: 54faa900c123fc6daa412c01490e36c10a24a678
2023-08-09 13:49:42 -07:00
Hui Xiao 9a034801ce Group rocksdb.sst.read.micros stat by different user read IOActivity + misc (#11444)
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
   - For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
   - New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.

- Misc
   - More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
   - Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444

Test Plan:
- CI fake db crash/stress test
- Microbenchmarking

**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR

**Test**

Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```

Result
```
Coming soon
```

AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```

Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,

PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```

Reviewed By: ajkr

Differential Revision: D45918925

Pulled By: hx235

fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
2023-08-08 17:26:50 -07:00
Yu Zhang 9c2ebcc2c3 Log user_defined_timestamps_persisted flag in event logger (#11683)
Summary:
As titled, and also removed an undefined and unused member function in for ColumnFamilyData

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683

Reviewed By: ajkr

Differential Revision: D48156290

Pulled By: jowlyzhang

fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7
2023-08-08 12:25:21 -07:00
Peter Dillinger e214964f40 Fix a potential memory leak on row_cache insertion failure (#11682)
Summary:
Although the built-in Cache implementations never return failure on Insert without keeping a reference (Handle), a custom implementation could. The code for inserting into row_cache does not keep a reference but does not clean up appropriately on non-OK. This is a fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11682

Test Plan: unit test added that previously fails under ASAN

Reviewed By: ajkr

Differential Revision: D48153831

Pulled By: pdillinger

fbshipit-source-id: 86eb7387915c5b38b6ff5dd8deb4e1e223b7d020
2023-08-08 11:34:41 -07:00
Peter Dillinger 99daea3481 Prepare tests for new HCC naming (#11676)
Summary:
I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache.

This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676

Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache

Reviewed By: ajkr

Differential Revision: D48103165

Pulled By: pdillinger

fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736
2023-08-07 18:17:12 -07:00
tabokie 6d1effaf01 exclude uninitialized files when estimating compression ratio (#11664)
Summary:
Exclude files with uninitialized table properties when estimating compression ratio.

Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/335

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11664

Reviewed By: cbi42

Differential Revision: D48002518

Pulled By: ajkr

fbshipit-source-id: 931fac8a06b4ed7b7b605cf79903302f1b8babfd
2023-08-07 12:35:42 -07:00
Xinye Tao d2b0652b32 compute compaction score once for a batch of range file deletes (#10744)
Summary:
Only re-calculate compaction score once for a batch of deletions. Fix performance regression brought by https://github.com/facebook/rocksdb/pull/8434.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10744

Test Plan:
In one of our production cluster that recently upgraded to RocksDB 6.29, it takes more than 10 minutes to delete files in 30,000 ranges. The RocksDB instance contains approximately 80,000 files. After this patch, the duration reduces to 100+ ms, which is on par with RocksDB 6.4.

Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/316

Signed-off-by: tabokie <xy.tao@outlook.com>

Reviewed By: cbi42

Differential Revision: D48002581

Pulled By: ajkr

fbshipit-source-id: 7245607ee3ad79c53b648a6396c9159f166b9437
2023-08-07 12:29:31 -07:00
Peter Dillinger cdb11f5ce6 More minor HCC refactoring + typed mmap (#11670)
Summary:
More code leading up to dynamic HCC.
* Small enhancements to cache_bench
* Extra assertion in Unref
* Improve a CAS loop in ChargeUsageMaybeEvictStrict
* Put load factor constants in appropriate class
* Move `standalone` field to HyperClockTable::HandleImpl because it can be encoded differently in the upcoming dynamic HCC.
* Add a typed version of MemMapping to simplify some future code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11670

Test Plan: existing tests, unit test added for TypedMemMapping

Reviewed By: jowlyzhang

Differential Revision: D48056464

Pulled By: pdillinger

fbshipit-source-id: 186b7d3105c5d6d2eb6a592369bc10a97ee14a15
2023-08-07 12:20:23 -07:00
Andrew Kryczka 4500a0d6ec Avoid an std::map copy in persistent stats (#11681)
Summary:
An internal user reported this copy showing up in a CPU profile. We can use move instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11681

Differential Revision: D48103170

Pulled By: ajkr

fbshipit-source-id: 083d6470181a0041bb5275b657aa61bee23a3729
2023-08-06 18:01:08 -07:00
Changyu Bi eca48bc166 Avoid shifting component too large error in FileTtlBooster (#11673)
Summary:
When `num_levels` > 65, we may be shifting more than 63 bits in FileTtlBooster. This can give errors like: `runtime error: shift exponent 98 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')`. This PR makes a quick fix for this issue by taking a min in the shifting component. This issue should be rare since it requires a user using a large `num_levels`. I'll follow up with a more complex fix if needed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11673

Test Plan: * Add a unit test that produce the above error before this PR. Need to compile it with ubsan: `COMPILE_WITH_UBSAN=1 OPT="-fsanitize-blacklist=.circleci/ubsan_suppression_list.txt" ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 compaction_picker_test`

Reviewed By: hx235

Differential Revision: D48074386

Pulled By: cbi42

fbshipit-source-id: 25e59df7e93f20e0793cffb941de70ac815d9392
2023-08-04 14:29:50 -07:00
Hui Xiao 09882a52d6 Prepare for deprecation of Options::access_hint_on_compaction_start (#11658)
Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, file hint is not longer needed for compaction read. Therefore we can deprecate `Options::access_hint_on_compaction_start`. As this is a public API change, we should first mark the relevant APIs (including the Java's) deprecated and remove it in next major release 9.0.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11658

Test Plan: No code change

Reviewed By: ajkr

Differential Revision: D47997856

Pulled By: hx235

fbshipit-source-id: 16e015ae7728c224b1caef73143aa9915668f4ac
2023-08-03 17:23:02 -07:00
Vardhan 87a21d08fe Add an option to trigger flush when the number of range deletions reach a threshold (#11358)
Summary:
Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358

Test Plan:
* New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
* Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.

Reviewed By: ajkr

Differential Revision: D46582680

Pulled By: cbi42

fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
2023-08-02 19:58:56 -07:00
Peter Dillinger f9de217353 Some cache_bench enhancements (#11661)
Summary:
... used in validating some HyperClockCache development in progress.

* Revamp the "populate cache" step to avoid redundant insertions (very rare in practice) and more consistently approach the desired resident_ratio while maintaining appropriate skew (still not perfect).
* Track and print hit ratio on lookups, to ensure a fair comparison is happening between implementations etc.
* Add an option to disable tracking and printing histograms (lots of output)
* Add an option to specify a random seed (for more reproducibility)
* Remove confusing/redundant "-skewed" option

Uses BitwiseAnd from https://github.com/facebook/rocksdb/issues/11660 (tested there)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11661

Test Plan: manual

Reviewed By: akankshamahajan15, jowlyzhang

Differential Revision: D47937671

Pulled By: pdillinger

fbshipit-source-id: 85a2bb881b1bca4f63e015bac684105fd91c9f35
2023-08-02 13:19:20 -07:00
Andrew Kryczka cf95821fb6 Update for 8.5.fb branch cut (#11642)
Summary:
Updated the main branch for the 8.5.fb branch cut. Also made unreleased_history/release.sh backdate to the last commit instead of the current date in case the release manager is a laggard like myself.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11642

Reviewed By: cbi42

Differential Revision: D47783574

Pulled By: ajkr

fbshipit-source-id: 4e2a80f5ccd542dc7dd0d22dfd7e59cb136325a1
2023-08-02 12:34:11 -07:00
Peter Dillinger f4e4039f00 Add some more bit operations to internal APIs (#11660)
Summary:
BottomNBits() - there is a single fast instruction for this on x86 since BMI2, but testing with godbolt indicates you need at least GCC 10 for the compiler to choose that instruction from the obvious C++ code. https://godbolt.org/z/5a7Ysd41h

BitwiseAnd() - this is a convenience function that works around the language flaw that the type of the result of x & y is the larger of the two input types, when it should be the smaller. This can save some ugly static_cast.

I expect to use both of these in coming HyperClockCache developments, and have applied them in a couple of places in existing code.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11660

Test Plan: unit tests added

Reviewed By: jowlyzhang

Differential Revision: D47935531

Pulled By: pdillinger

fbshipit-source-id: d148c43a1e51df4a1c549b93aaf2725a3f8d3bd6
2023-08-02 11:30:10 -07:00
amatveev-cf 946d1009bc Expand Statistics support in the C API (#11263)
Summary:
Adds a few missing features to the C API:
1) Statistics level
2) Getting individual values instead of a serialized string

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11263

Test Plan: unit tests

Reviewed By: ajkr

Differential Revision: D47309963

Pulled By: hx235

fbshipit-source-id: 84df59db4045fc0fb3ea4aec451bc5c2afd2a248
2023-08-02 10:53:40 -07:00
Jay Huh 9a2a6db2a9 Use C++17 [[fallthrough]] in transaction_test.cc (#11663)
Summary:
(Copied from https://www.internalfb.com/diff/D46606060)

This diff makes its files safe for use with -Wimplicit-fallthrough. Now that we're using C+20 there's no reason not to use this C++17 feature to make our code safer.
It's currently possible to write code like this:
```
switch(x){
  case 1:
    foo1();
  case 2:
    foo2();
    break;
  case 3:
    foo3();
}
```
But that's scary because we don't know whether the fallthrough from case 1 was intentional or not.
The -Wimplicit-fallthrough flag will make this an error. The solution is to either  fix the bug by inserting break or indicating intention by using [[fallthrough]]; (from C++17).
```
switch(x){
  case 1:
    foo1();
    [[fallthrough]]; // Solution if we intended to fallthrough
    break;           // Solution if we did not intend to fallthrough
  case 2:
    foo2();
    break;
  case 3:
    foo3();
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11663

Test Plan: Existing tests

Reviewed By: jowlyzhang

Differential Revision: D47961248

Pulled By: jaykorean

fbshipit-source-id: 0d374c721bf1b328c14949dc5c17693da7311d03
2023-08-01 14:49:06 -07:00
Peter Dillinger bb8fcc0044 db_stress: Reinstate Transaction::Rollback() calls before destruction (#11656)
Summary:
https://github.com/facebook/rocksdb/issues/11653 broke some crash tests.
Apparently these Rollbacks are needed for pessimistic transaction cases. (I'm still not sure if the API makes any sense with regard to safe usage. It's certainly not documented. Will consider in follow-up PRs.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11656

Test Plan: manual crash test runs, crash_test_with_multiops_wc_txn and crash_test_with_multiops_wp_txn

Reviewed By: cbi42

Differential Revision: D47906280

Pulled By: pdillinger

fbshipit-source-id: d058a01b6dbb47a4f08d199e335364168304f81b
2023-07-30 17:30:01 -07:00
Peter Dillinger 7a1b0207e6 format_version=6 and context-aware block checksums (#9058)
Summary:
## Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without `block_align=true` the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.

To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example
* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
* files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.

## Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.

## More detail: making space in footer
In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:
* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)

## Performance
In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).

Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).

```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```

Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness

Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058

Test Plan:
unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.

Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably *usually* true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.

The format compatibility test does not need updating because we have not updated the default
format_version.

Reviewed By: ajkr, mrambacher

Differential Revision: D33100915

Pulled By: pdillinger

fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
2023-07-30 16:40:01 -07:00
Peter Dillinger b3c54186ab Allow TryAgain in db_stress with optimistic txn, and refactoring (#11653)
Summary:
In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers.

Also, some of the APIs were not clear about Transaction semantics, so I have clarified:
* (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction).
* Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test.

Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress.
* Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function.
* Use unique_ptr (nicely complements removing unnecessary Rollbacks)
* Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".)

Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653

Test Plan:
this is the test :)

Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances.

Reviewed By: cbi42

Differential Revision: D47882995

Pulled By: pdillinger

fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d
2023-07-28 16:25:29 -07:00
Peter Dillinger c205a217e6 Strip leading and trailing whitespace for unreleased_history entries (#11652)
Summary:
Some trailing whitespace has leaked into HISTORY.md entries. This can lead to unexpected changes when modifying HISTORY.md with hygienic editors (e.g. for a patch release). This change should protect against future cases.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11652

Test Plan: manual

Reviewed By: akankshamahajan15

Differential Revision: D47882814

Pulled By: pdillinger

fbshipit-source-id: 148c3746d3b298cb6e1f655f0416d46619969086
2023-07-28 14:57:07 -07:00
Changyu Bi 6a0f637633 Compare the number of input keys and processed keys for compactions (#11571)
Summary:
... to improve data integrity validation during compaction.

A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed.

For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code.

For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571

Test Plan:
* Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case.
* All other unit tests for non-corrupted case.
* Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple`

Reviewed By: ajkr

Differential Revision: D47131965

Pulled By: cbi42

fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92
2023-07-28 09:47:31 -07:00
Yu Zhang 5dd8c114bb Add a UDT comparator for ReverseBytewiseComparator to object library (#11647)
Summary:
Add a built-in comparator that supports uint64_t style user-defined timestamps for ReverseBytewiseComparator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11647

Test Plan:
Added a test wrapper for retrieving this comparator from registry and used it in this test:
`./udt_util_test`

Reviewed By: ltamasi

Differential Revision: D47848303

Pulled By: jowlyzhang

fbshipit-source-id: 5af5534a8c2d9195997d0308c8e194c1c797548c
2023-07-27 15:31:22 -07:00
akankshamahajan 63a5125a52 Fix use_after_free bug when underlying FS enables kFSBuffer (#11645)
Summary:
Fix use_after_free bug in async_io MultiReads when underlying FS enabled kFSBuffer. kFSBuffer is when underlying FS pass their own buffer instead of using RocksDB scratch in FSReadRequest
Since it's an experimental feature, added a hack for now to fix the bug.
Planning to make public API change to remove const from the callback as it doesn't make sense to use const.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11645

Test Plan: tested locally

Reviewed By: ltamasi

Differential Revision: D47819907

Pulled By: akankshamahajan15

fbshipit-source-id: 1faf5ef795bf27e2b3a60960374d91274931df8d
2023-07-27 12:02:03 -07:00
Yu Zhang c24ef26ca7 Support switching on / off UDT together with in-Memtable-only feature (#11623)
Summary:
Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.

To do this, this PR includes:
1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry.

2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid:
     a) no user comparator change and no effective `persist_user_defined_timestamp` flag change.
     b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag
         is false.
     c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is
         already false.
3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp.

4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623

Test Plan:
```
make all check
./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*"
./db_wal_test --gtest_filter="*EnableDisableUDT*"
```

Reviewed By: ltamasi

Differential Revision: D47636862

Pulled By: jowlyzhang

fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322
2023-07-26 20:16:32 -07:00
Yu Zhang 4ea7b796b7 Respect cutoff timestamp during flush (#11599)
Summary:
Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval.

This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`.

For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`.

Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it.

Tet plan:
```
./column_family_test --gtest_filter="*RetainUDT*"
./memtable_list_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599

Reviewed By: ajkr

Differential Revision: D47561586

Pulled By: jowlyzhang

fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89
2023-07-26 16:25:06 -07:00
Changyu Bi 5c2a063c49 Clarify usage for options `ttl` and `periodic_compaction_seconds` for universal compaction (#11552)
Summary:
this is stacked on https://github.com/facebook/rocksdb/issues/11550 to further clarify usage of these two options for universal compaction. Similar to FIFO, the two options have the same meaning for universal compaction, which can be confusing to use. For example, for universal compaction, dynamically changing the value of `ttl` has no impact on periodic compactions. Users should dynamically change `periodic_compaction_seconds` instead. From feature matrix (https://fburl.com/daiquery/5s647hwh), there are instances where users set `ttl` to non-zero value and `periodic_compaction_seconds` to 0. For backward compatibility reason, instead of deprecating `ttl`, comments are added to mention that `periodic_compaction_seconds` are preferred. In `SanitizeOptions()`, we update the value of `periodic_compaction_seconds` to take into account value of `ttl`. The logic is documented in relevant option comment.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11552

Test Plan: * updated existing unit test `DBTestUniversalCompaction2.PeriodicCompactionDefault`

Reviewed By: ajkr

Differential Revision: D47381434

Pulled By: cbi42

fbshipit-source-id: bc41f29f77318bae9a96be84dd89bf5617c7fd57
2023-07-26 11:31:54 -07:00
ywave 9cc0986ae2 Fix comment in WriteBatchWithIndex::NewIteratorWithBase (#11636)
Summary:
Remove obsolete comment.

Support for WriteBatchWithIndex::NewIteratorWithBase when overwrite_key=false is added in https://github.com/facebook/rocksdb/pull/8135, as you can clearly see in the HISTORY.md.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11636

Reviewed By: jowlyzhang

Differential Revision: D47722955

Pulled By: ajkr

fbshipit-source-id: 4fa44a309d9708e9f4a1530918a9aaf7114c9032
2023-07-24 10:19:37 -07:00
Peter Dillinger c41122b1a0 Even more HyperClockCache refactoring (#11630)
Summary:
... ahead of dynamic variant.

* Introduce an Unref function for a common pattern. Cases that were previously using std::memory_order_acq_rel we doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value so do not need the acquire semantic, just release.
* Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function.
* Based on a linter suggesting, use const Func& parameter type instead of Func for templated callable parameters.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11630

Test Plan: existing tests, and performance test with release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster.

Reviewed By: jowlyzhang

Differential Revision: D47657334

Pulled By: pdillinger

fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3
2023-07-24 09:36:09 -07:00
zhangyuxiang.ax 1567108fc1 Add missing table properties in plaintable GetTableProperties() (#11267)
Summary:
Plaintable will miss properties.
It should have some behavior like blockbasedtable.
Here is a unit test for reproduce this bug.

```
#include <gflags/gflags.h>
#include "rocksdb/db.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h"
#include "rocksdb/slice_transform.h"
#include <iostream>
#include <thread>
#include <csignal>
const std::string kKey = "key";

DEFINE_bool(use_plaintable, true, "use plain table");
DEFINE_string(db_path, "/dev/shm/test_zyx_path", "db_path");

rocksdb::DB* db = nullptr;

class NoopTransform : public rocksdb::SliceTransform {
public:
    explicit NoopTransform() {
    }

    virtual const char* Name() const override {
        return "rocksdb.Noop";
    }

    virtual rocksdb::Slice Transform(const rocksdb::Slice& src) const override {
        return src;
    }

    virtual bool InDomain(const rocksdb::Slice& src) const override {
        return true;
    }

    virtual bool InRange(const rocksdb::Slice& dst) const override {
        return true;
    }

    virtual bool SameResultWhenAppended(const rocksdb::Slice& prefix) const override {
        return false;
    }
};

class TestPropertiesCollector : public ::rocksdb::TablePropertiesCollector {
public:
    explicit TestPropertiesCollector() {
    }

private:
    ::rocksdb::Status AddUserKey(const ::rocksdb::Slice& key, const ::rocksdb::Slice& value, ::rocksdb::EntryType type,
                                 ::rocksdb::SequenceNumber seq, uint64_t file_size) override {
        count++;
        return ::rocksdb::Status::OK();
    }

    ::rocksdb::Status Finish(::rocksdb::UserCollectedProperties* properties) override {
        properties->insert({kKey, std::to_string(count)});
        return ::rocksdb::Status::OK();
    }

    ::rocksdb::UserCollectedProperties GetReadableProperties() const override {
        ::rocksdb::UserCollectedProperties properties;
        properties.insert({kKey, std::to_string(count)});
        return properties;
    }

    const char* Name() const override {
        return "TestPropertiesCollector";
    }
    int count = 0;
};

class TestTablePropertiesCollectorFactory : public ::rocksdb::TablePropertiesCollectorFactory {
public:
    explicit TestTablePropertiesCollectorFactory() {
    }

private:
    ::rocksdb::TablePropertiesCollector* CreateTablePropertiesCollector(
            ::rocksdb::TablePropertiesCollectorFactory::Context context) override {
        return new TestPropertiesCollector();
    }

    const char* Name() const override {
        return "test.TablePropertiesCollectorFactory";
    }
};

class TestFlushListener : rocksdb::EventListener {
public:
    const char* Name() const override {
        return "TestFlushListener";
    }
    void OnFlushCompleted(rocksdb::DB* /*db*/, const rocksdb::FlushJobInfo& flush_job_info) override {
        if (flush_job_info.table_properties.user_collected_properties.find(kKey) ==
            flush_job_info.table_properties.user_collected_properties.end()) {
            std::cerr << "OnFlushCompleted: properties not found" << std::endl;
            return;
        }
        std::cerr << "OnFlushCompleted: properties found "
                  << flush_job_info.table_properties.user_collected_properties.at(kKey) << std::endl;
    }
    explicit TestFlushListener() {
    }
};

int main(int argc, char* argv[]) {
    gflags::ParseCommandLineFlags(&argc, &argv, true);
    rocksdb::DBOptions rocksdb_options;
    std::shared_ptr<rocksdb::EventListener> flush_offset;
    rocksdb_options.create_if_missing = true;
    rocksdb_options.create_missing_column_families = true;
    std::shared_ptr<::rocksdb::TablePropertiesCollectorFactory> properties_collector(
            new TestTablePropertiesCollectorFactory());
    rocksdb::ColumnFamilyOptions cfoptions;
    cfoptions.table_properties_collector_factories.emplace_back(properties_collector);
    std::shared_ptr<rocksdb::EventListener> test_cleaner;
    test_cleaner.reset((rocksdb::EventListener*)new TestFlushListener());
    rocksdb_options.listeners.emplace_back(test_cleaner);

    std::vector<rocksdb::ColumnFamilyDescriptor> cf_desc_;
    cf_desc_.emplace_back(rocksdb::kDefaultColumnFamilyName, cfoptions);
    std::vector<rocksdb::ColumnFamilyHandle*> cfhs;
    cfoptions.prefix_extractor.reset(new NoopTransform());
    if (FLAGS_use_plaintable) {
        cfoptions.table_factory.reset(rocksdb::NewPlainTableFactory());
        std::cerr << "use plaintable" << std::endl;
    } else {
        cfoptions.table_factory.reset(rocksdb::NewBlockBasedTableFactory());
        std::cerr << "use blockbasedtable" << std::endl;
    }

    auto s = rocksdb::DB::Open(rocksdb_options, FLAGS_db_path, cf_desc_, &cfhs, &db);
    if (s.ok()) {
        rocksdb::WriteOptions wops;
        wops.disableWAL = true;
        for (int i = 0; i < 1000000; i++) {
            auto status = db->Put(wops, std::to_string(i), std::string(1024, '3'));
            if (!status.ok()) {
                std::cerr << "write fail " << status.getState() << std::endl;
            }
        }
    } else {
        std::cerr << "open rocksdb failed" << s.getState() << std::endl;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1000));
    delete db;
}
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11267

Reviewed By: jowlyzhang

Differential Revision: D47689943

Pulled By: hx235

fbshipit-source-id: 585589cc48f8b26c7dd2323fc7ac4a0c3d4df6bb
2023-07-21 17:55:25 -07:00
Hui Xiao 629605d645 Move prefetching responsibility to page cache for compaction read under non directIO usecase (#11631)
Summary:
**Context/Summary**
As titled. The benefit of doing so is to explicitly call readahead() instead of relying page cache behavior for compaction read when we know that we most likely need readahead as compaction read is sequential read .

**Test**
Extended the existing UT to cover compaction read case

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11631

Reviewed By: ajkr

Differential Revision: D47681437

Pulled By: hx235

fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90
2023-07-21 14:52:52 -07:00
darionyaphet df543460d5 Remove some useless qualifier (#11596)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11596

Reviewed By: ajkr

Differential Revision: D47635614

Pulled By: jowlyzhang

fbshipit-source-id: 651a06049a54d15fd4b4f010bb4b82f53ff9c9d4
2023-07-20 13:43:26 -07:00
Rémi Calixte 6628ff12d6 Extend C API to expose base db of transaction db (#11562)
Summary:
Add `rocksdb_transactiondb_get_base_db` and `rocksdb_transactiondb_close_base_db` functions to the C API modeled after `rocksdb_optimistictransactiondb_get_base_db` and `rocksdb_optimistictransactiondb_close_base_db`:
ca50ccc71a/include/rocksdb/c.h (L2711-L2716)

With this pair of functions, it is possible to get a `rocksdb_t *` from a `rocksdb_transactiondb_t *`. The main goal is to be able to use the approximate memory usage API, only accessible to the `rocksdb_t *` type:
ca50ccc71a/include/rocksdb/c.h (L2821-L2833)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11562

Reviewed By: ajkr

Differential Revision: D47603343

Pulled By: jowlyzhang

fbshipit-source-id: c70cf6af5834026e232fe7791634db3a396f7d5e
2023-07-20 12:03:09 -07:00
ywave 86634885eb Fix typo in comment (#11617)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11617

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D47599209

Pulled By: jowlyzhang

fbshipit-source-id: 00e96266c75128875663083a2877d27fd7392eea
2023-07-19 13:52:41 -07:00
darionyaphet 64b0439bc1 fix typo (#11595)
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11595

Reviewed By: ajkr

Differential Revision: D47600701

Pulled By: jowlyzhang

fbshipit-source-id: 22375b51c726b176e4bc502b49cf3343f45f8a0a
2023-07-19 13:04:48 -07:00
shuzz 2f712235ab optimized code (#11614)
Summary:
improvement code by std::move and c++17

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11614

Reviewed By: ajkr

Differential Revision: D47599519

Pulled By: jowlyzhang

fbshipit-source-id: 6b897876f4e87e94a74c53d8db2a01303d500bff
2023-07-19 12:52:39 -07:00
zhutao aeda36e925 add exe and script path check (#11621)
Summary:
Add path existence check in the script to avoid script running even when db_bench executable does not exist or relative path is not right.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11621

Reviewed By: jowlyzhang

Differential Revision: D47552590

Pulled By: ajkr

fbshipit-source-id: f09ea069f69e067212b249a22ad755b76bc6063a
2023-07-19 12:05:24 -07:00
huangmengbin 98d0f6ec08 fix: VersionSet::DumpManifest (#11605)
Summary:
Fixes https://github.com/facebook/rocksdb/issues/11604

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11605

Reviewed By: jowlyzhang

Differential Revision: D47459254

Pulled By: ajkr

fbshipit-source-id: 4420e443fbf4bd01ddaa2b47285fc4445bf36246
2023-07-19 10:44:10 -07:00
Dan Wang 8a7b9888d4 Fix the sync point SanitizeOptions::AfterChangeMaxOpenFiles which is not executed in db_compaction_test (#11583)
Summary:
In [db_impl_open.cc](https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc), the sync point `SanitizeOptions::AfterChangeMaxOpenFiles` is used to set `max_open_files` with some specified "**invalid**" value even if it has been sanitized.

However,  in [db_compaction_test.cc](https://github.com/facebook/rocksdb/blob/main/db/db_compaction_test.cc), `SanitizeOptions::AfterChangeMaxOpenFiles` would not be executed since `SyncPoint::EnableProcessing()` is run after `DBTestBase::Reopen()`.  To enable `SanitizeOptions::AfterChangeMaxOpenFiles`,  `SyncPoint::EnableProcessing()` should be put ahead of `DBTestBase::Reopen()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11583

Test Plan:
run unit tests locally as below:
```
make J=1 check

[ RUN      ] DBCompactionTest.LevelTtlCascadingCompactions
[       OK ] DBCompactionTest.LevelTtlCascadingCompactions (85 ms)
[ RUN      ] DBCompactionTest.LevelPeriodicCompaction
[       OK ] DBCompactionTest.LevelPeriodicCompaction (57 ms)
```

Reviewed By: jowlyzhang

Differential Revision: D47311827

Pulled By: ajkr

fbshipit-source-id: 99165e87a8129e404af06fdf9b4c96eca540fd23
2023-07-19 10:41:09 -07:00
Muhammad 977aae53d2 Allow rocksdb library to be usable with CMake's `FetchContent` API (#11575)
Summary:
This adds proper support for using rocksdb with FetchContent, without this PR the user must include the following with their own `CMakeLists.txt` file:
```cmake
include_directories(./build/_deps/rocksdb-src/include)
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11575

Reviewed By: jowlyzhang

Differential Revision: D47163520

Pulled By: ajkr

fbshipit-source-id: a202dcf435ecc9dd8d51c88f90e98c04814721ca
2023-07-19 10:39:30 -07:00
Andrew Kryczka 05c3b8ecac Prepare for specialized interface for row cache (#11620)
Summary:
An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema.

This PR prepares for the specialized `RowCache` interface by updating the migration plan of https://github.com/facebook/rocksdb/issues/11450. I added a release note for the removed APIs and didn't mention the added ones for now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11620

Reviewed By: pdillinger

Differential Revision: D47536962

Pulled By: ajkr

fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691
2023-07-18 19:12:58 -07:00
Chad Austin ff0d618c7f add a missing include (#11624)
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11624

<queue> must be included to use std::queue.

Reviewed By: pdillinger

Differential Revision: D47562433

fbshipit-source-id: 7c5b19fd9e411694c782dfc0dff0231d4f92ef24
2023-07-18 15:38:52 -07:00
Peter Dillinger 846db9d7b1 Refactor ClockCache ApplyToEntries (#11609)
Summary:
... ahead of planned dynamic HCC variant. This changes
simplifies some logic while still enabling future code sharing between
implementation variants.

Detail: For complicated reasons, using a std::function parameter to
`ConstApplyToEntriesRange` with a lambda argument does not play
nice with templated HandleImpl. An explicit conversion to std::function
would be needed for it to compile. Templating the function type is the
easy work-around.

Also made some functions from https://github.com/facebook/rocksdb/issues/11572 private as recommended

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11609

Test Plan: existing tests

Reviewed By: jowlyzhang

Differential Revision: D47407415

Pulled By: pdillinger

fbshipit-source-id: 0f65954db16335999b78fb7d2563ec627624cef0
2023-07-18 12:09:27 -07:00
Changyu Bi 662a1c99f6 Verify number of keys flushed during DB open (#11611)
Summary:
Extend the coverage for option `flush_verify_memtable_count`. The verification code is similar to the ones for regular flush: c3c84b3397/db/flush_job.cc (L956-L965)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11611

Test Plan: existing tests.

Reviewed By: ajkr

Differential Revision: D47478893

Pulled By: cbi42

fbshipit-source-id: ca580c9dbcd6e91facf2e49210661336a79a248e
2023-07-18 10:39:11 -07:00
akankshamahajan 749b179c04 Remove reallocation of AlignedBuffer in direct_io sync reads if already aligned (#11600)
Summary:
Remove reallocation of AlignedBuffer in direct_io sync reads in RandomAccessFileReader::Read if buffer passed is already aligned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11600

Test Plan:
Setup: `TEST_TMPDIR=./tmp-db/ ./db_bench -benchmarks=filluniquerandom -disable_auto_compactions=true -target_file_size_base=1048576 -write_buffer_size=1048576 -compression_type=none`
Benchmark: `TEST_TMPDIR=./tmp-db/ perf record ./db_bench --cache_size=8388608 --use_existing_db=true --disable_auto_compactions=true --benchmarks=seekrandom --use_direct_reads=true -use_direct_io_for_flush_and_compaction=true -reads=1000 -seek_nexts=1 -max_auto_readahead_size=131072 -initial_auto_readahead_size=16384 -adaptive_readahead=true -num_file_reads_for_auto_readahead=0`

Perf profile-
Before:
```
8.73% db_bench libc.so.6 [.] __memmove_evex_unaligned_erms
3.34% db_bench [kernel.vmlinux] [k] filemap_get_read_batch
```

After:
```
2.50% db_bench [kernel.vmlinux] [k] filemap_get_read_batch
2.29% db_bench libc.so.6 [.] __memmove_evex_unaligned_erms
```

`make  crash_test -j `with direct_io enabled completed succesfully locally.

Ran few benchmarks with direct_io from seek_nexts varying between 912 to 327680 and different readahead_size parameters and it showed no regression so far.

Reviewed By: ajkr

Differential Revision: D47478598

Pulled By: akankshamahajan15

fbshipit-source-id: 6a48e21cb34696f5d09c22a6311a3a1cb5f9cf33
2023-07-14 20:08:05 -07:00
Peter Dillinger b1b6f87fbe Some small improvements to HyperClockCache (#11601)
Summary:
Stacked on https://github.com/facebook/rocksdb/issues/11572
* Minimize use of std::function and lambdas to minimize chances of
compiler heap-allocating closures (unnecessary stress on allocator). It
appears that converting FindSlot to a template enables inlining the
lambda parameters, avoiding heap allocations.
* Clean up some logic with FindSlot (FIXMEs from https://github.com/facebook/rocksdb/issues/11572)
* Fix handling of rare case of probing all slots, with new unit test.
(Previously Insert would not roll back displacements in that case, which
would kill performance if it were to happen.)
* Add an -early_exit option to cache_bench for gathering memory stats
before deallocation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11601

Test Plan:
unit test added for probing all slots

## Seeing heap allocations
Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache`
before https://github.com/facebook/rocksdb/issues/11572 vs. after this change. Before, we see this in the
interesting bin statistics:

```
size  nrequests
----  ---------
  32     578460
  64      24340
8192     578460
```
And after:
```
size  nrequests
----  ---------
  32  (insignificant)
  64      24370
8192     579130
```

## Performance test
Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench`

Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000`
in before and after configurations, simultaneously:

```
Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442
After:  Complete in 32.773 s; Rough parallel ops/sec = 2441019
```

Reviewed By: jowlyzhang

Differential Revision: D47375092

Pulled By: pdillinger

fbshipit-source-id: 46f0f57257ddb374290a0a38c651764ea60ba410
2023-07-14 16:19:22 -07:00
leipeng bc0db33483 Optimize about sstableKeyCompare (#11610)
Summary:
We observed `CompactionOutputs::UpdateGrandparentBoundaryInfo` consumes much time for `InternalKey::DecodeFrom` and `InternalKey::~InternalKey` in flame graph.

This PR omit the InternalKey object in `CompactionOutputs::UpdateGrandparentBoundaryInfo` .

![image](https://github.com/facebook/rocksdb/assets/1574991/661eaeec-2f46-46c6-a6a8-9738d6c191de)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11610

Reviewed By: ajkr

Differential Revision: D47426971

Pulled By: cbi42

fbshipit-source-id: f0d3a8186d778294515c0685032f5b395c4d6a62
2023-07-13 22:26:55 -07:00
Peter Dillinger c3c84b3397 Refactor (Hyper)ClockCache code for upcoming changes (#11572)
Summary:
Separate out some functionality that will be common to both static and dynamic HCC into BaseClockTable. Table::InsertState and GrowIfNeeded will be used by the dynamic HCC so don't make much sense right now.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11572

Test Plan:
existing tests. No functional changes intended.

Performance test in subsequent PR https://github.com/facebook/rocksdb/issues/11601

Reviewed By: jowlyzhang

Differential Revision: D47110496

Pulled By: pdillinger

fbshipit-source-id: 379bd433322a42ea28c0043b41ec24956d21e7aa
2023-07-12 14:05:34 -07:00
Changyu Bi 854eb76a8c Improve error message when an SST file in MANIFEST is not found (#11573)
Summary:
I got the following error message when an SST file is recorded in MANIFEST but is missing from the db folder.
It's confusing in two ways:
1. The part about file "./074837.ldb" which RocksDB will attempt to open only after ./074837.sst is not found.
2. The last part about "No such file or directory in file ./MANIFEST-074507" sounds like `074837.ldb` is not found in manifest.

```
ldb --hex --db=. get some_key

Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.ldb: No such file or directory in file ./MANIFEST-074507
```

Improving the error message a little bit:

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11573

Test Plan:
run the same command after this PR
```
Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.sst: No such file or directory  The file ./MANIFEST-074507 may be corrupted.
```

Reviewed By: ajkr

Differential Revision: D47192056

Pulled By: cbi42

fbshipit-source-id: 06863f376cc4455803cffb2250c41399b4c39467
2023-07-10 15:52:38 -07:00
weedge 1a7c741977 fix: std::optional value() build error on older macOS SDK (#11574)
Summary:
`PORTABLE=1 USE_SSE=1 USE_PCLMUL=1 WITH_JEMALLOC_FLAG=1 JEMALLOC=1 make static_lib`  on MacOS

clang --version:

Apple clang version 12.0.0 (clang-1200.0.32.29)
Target: x86_64-apple-darwin22.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

compile err like this:

util/udt_util.cc:39:39: error: 'value' is unavailable: introduced in macOS 10.14
  if (running_ts_sz != recorded_ts_sz.value()) {
                                      ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/optional:944:33: note: 'value' has been explicitly marked
      unavailable here
    constexpr value_type const& value() const&
                                ^
util/udt_util.cc:217:62: error: 'value' is unavailable: introduced in macOS 10.14
      *new_key = StripTimestampFromUserKey(key, record_ts_sz.value());
                                                             ^
/Library/Developer/CommandLineTools/usr/bin/../include/c++/v1/optional:953:27: note: 'value' has been explicitly marked
      unavailable here
    constexpr value_type& value() &
                          ^
2 errors generated.
make: *** [util/udt_util.o] Error 1

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11574

Reviewed By: ajkr

Differential Revision: D47269519

Pulled By: cbi42

fbshipit-source-id: da49d90cdf00a0af519f91c0cf7d257401eb395f
2023-07-10 14:21:34 -07:00