Summary:
Expose the functions that creates these UDT aware comparators so that users can create all the RocksDB builtin comparators in consistent ways.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11690
Reviewed By: ltamasi
Differential Revision: D48212021
Pulled By: jowlyzhang
fbshipit-source-id: a17a9a11e36e4267551e193f1b22647414acf467
Summary:
We're still getting some rare cases of 5x TryAgains in a row. Here I'm boosting the failure threshold to 10 in a row and adding more info in the output, to help us manually verify whether there's anything suspicous about the sequence of TryAgains, such as if Rollback failed to reset to new sequence numbers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11691
Test Plan: By lowering the threshold to 2 and adjusting some other db_crashtest parameters, I was able to hit my new code and saw fresh sequence number on the subsequent TryAgain.
Reviewed By: cbi42
Differential Revision: D48236153
Pulled By: pdillinger
fbshipit-source-id: c0530e969ddcf8de7348e5cf7daf5d6d5dec24f4
Summary:
It seems the flag `-fno-elide-constructors` is incorrectly overwritten in Makefile by 9c2ebcc2c3/Makefile (L243)
Applying the change in PR https://github.com/facebook/rocksdb/issues/11675 shows a lot of missing status checks. This PR adds the missing status checks.
Most of changes are just adding asserts in unit tests. I'll add pr comment around more interesting changes that need review.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11686
Test Plan: change Makefile as in https://github.com/facebook/rocksdb/issues/11675, and run `ASSERT_STATUS_CHECKED=1 TEST_UINT128_COMPAT=1 ROCKSDB_MODIFY_NPHASH=1 LIB_MODE=static OPT="-DROCKSDB_NAMESPACE=alternative_rocksdb_ns" make V=1 -j24 J=24 check`
Reviewed By: hx235
Differential Revision: D48176132
Pulled By: cbi42
fbshipit-source-id: 6758946cfb1c6ff84c4c1e0ca540d05e6fc390bd
Summary:
Set up the default column family timestamp size for a reused write committed transaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11685
Test Plan: Added unit test.
Reviewed By: ltamasi
Differential Revision: D48195129
Pulled By: jowlyzhang
fbshipit-source-id: 54faa900c123fc6daa412c01490e36c10a24a678
Summary:
**Context/Summary:**
- Similar to https://github.com/facebook/rocksdb/pull/11288 but for user read such as `Get(), MultiGet(), DBIterator::XXX(), Verify(File)Checksum()`.
- For this, I refactored some user-facing `MultiGet` calls in `TransactionBase` and various types of `DB` so that it does not call a user-facing `Get()` but `GetImpl()` for passing the `ReadOptions::io_activity` check (see PR conversation)
- New user read stats breakdown are guarded by `kExceptDetailedTimers` since measurement shows they have 4-5% regression to the upstream/main.
- Misc
- More refactoring: with https://github.com/facebook/rocksdb/pull/11288, we complete passing `ReadOptions/IOOptions` to FS level. So we can now replace the previously [added](https://github.com/facebook/rocksdb/pull/9424) `rate_limiter_priority` parameter in `RandomAccessFileReader`'s `Read/MultiRead/Prefetch()` with `IOOptions::rate_limiter_priority`
- Also, `ReadAsync()` call time is measured in `SST_READ_MICRO` now
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11444
Test Plan:
- CI fake db crash/stress test
- Microbenchmarking
**Build** `make clean && ROCKSDB_NO_FBCODE=1 DEBUG_LEVEL=0 make -jN db_basic_bench`
- google benchmark version: 604f6fd3f4
- db_basic_bench_base: upstream
- db_basic_bench_pr: db_basic_bench_base + this PR
- asyncread_db_basic_bench_base: upstream + [db basic bench patch for IteratorNext](https://github.com/facebook/rocksdb/compare/main...hx235:rocksdb:micro_bench_async_read)
- asyncread_db_basic_bench_pr: asyncread_db_basic_bench_base + this PR
**Test**
Get
```
TEST_TMPDIR=/dev/shm ./db_basic_bench_{null_stat|base|pr} --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/negative_query:0/enable_filter:0/mmap:1/threads:1 --benchmark_repetitions=1000
```
Result
```
Coming soon
```
AsyncRead
```
TEST_TMPDIR=/dev/shm ./asyncread_db_basic_bench_{base|pr} --benchmark_filter=IteratorNext/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:1/async_io:1/include_detailed_timers:0 --benchmark_repetitions=1000 > syncread_db_basic_bench_{base|pr}.out
```
Result
```
Base:
1956,1956,1968,1977,1979,1986,1988,1988,1988,1990,1991,1991,1993,1993,1993,1993,1994,1996,1997,1997,1997,1998,1999,2001,2001,2002,2004,2007,2007,2008,
PR (2.3% regression, due to measuring `SST_READ_MICRO` that wasn't measured before):
1993,2014,2016,2022,2024,2027,2027,2028,2028,2030,2031,2031,2032,2032,2038,2039,2042,2044,2044,2047,2047,2047,2048,2049,2050,2052,2052,2052,2053,2053,
```
Reviewed By: ajkr
Differential Revision: D45918925
Pulled By: hx235
fbshipit-source-id: 58a54560d9ebeb3a59b6d807639692614dad058a
Summary:
As titled, and also removed an undefined and unused member function in for ColumnFamilyData
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11683
Reviewed By: ajkr
Differential Revision: D48156290
Pulled By: jowlyzhang
fbshipit-source-id: cc99aaafe69db6611af3854cb2b2ebc5044941f7
Summary:
Although the built-in Cache implementations never return failure on Insert without keeping a reference (Handle), a custom implementation could. The code for inserting into row_cache does not keep a reference but does not clean up appropriately on non-OK. This is a fix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11682
Test Plan: unit test added that previously fails under ASAN
Reviewed By: ajkr
Differential Revision: D48153831
Pulled By: pdillinger
fbshipit-source-id: 86eb7387915c5b38b6ff5dd8deb4e1e223b7d020
Summary:
I'm anticipating using the public name HyperClockCache for both the current version with a fixed-size table and the upcoming version with an automatically growing table. However, for simplicity of testing them as substantially distinct implementations, I want to give them distinct internal names, like FixedHyperClockCache and AutoHyperClockCache.
This change anticipates that by renaming to FixedHyperClockCache and assuming for now that all the unit tests run on HCC will run and behave similarly for the automatic HCC. Obviously updates will need to be made, but I'm trying to avoid uninteresting find & replace updates in what will be a large and engineering-heavy PR for AutoHCC
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11676
Test Plan: no behavior change intended, except logging will now use the name FixedHyperClockCache
Reviewed By: ajkr
Differential Revision: D48103165
Pulled By: pdillinger
fbshipit-source-id: a33f1901488fea102164c2318e2f2b156aaba736
Summary:
Only re-calculate compaction score once for a batch of deletions. Fix performance regression brought by https://github.com/facebook/rocksdb/pull/8434.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10744
Test Plan:
In one of our production cluster that recently upgraded to RocksDB 6.29, it takes more than 10 minutes to delete files in 30,000 ranges. The RocksDB instance contains approximately 80,000 files. After this patch, the duration reduces to 100+ ms, which is on par with RocksDB 6.4.
Cherry-picking downstream PR: https://github.com/tikv/rocksdb/pull/316
Signed-off-by: tabokie <xy.tao@outlook.com>
Reviewed By: cbi42
Differential Revision: D48002581
Pulled By: ajkr
fbshipit-source-id: 7245607ee3ad79c53b648a6396c9159f166b9437
Summary:
More code leading up to dynamic HCC.
* Small enhancements to cache_bench
* Extra assertion in Unref
* Improve a CAS loop in ChargeUsageMaybeEvictStrict
* Put load factor constants in appropriate class
* Move `standalone` field to HyperClockTable::HandleImpl because it can be encoded differently in the upcoming dynamic HCC.
* Add a typed version of MemMapping to simplify some future code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11670
Test Plan: existing tests, unit test added for TypedMemMapping
Reviewed By: jowlyzhang
Differential Revision: D48056464
Pulled By: pdillinger
fbshipit-source-id: 186b7d3105c5d6d2eb6a592369bc10a97ee14a15
Summary:
An internal user reported this copy showing up in a CPU profile. We can use move instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11681
Differential Revision: D48103170
Pulled By: ajkr
fbshipit-source-id: 083d6470181a0041bb5275b657aa61bee23a3729
Summary:
When `num_levels` > 65, we may be shifting more than 63 bits in FileTtlBooster. This can give errors like: `runtime error: shift exponent 98 is too large for 64-bit type 'uint64_t' (aka 'unsigned long')`. This PR makes a quick fix for this issue by taking a min in the shifting component. This issue should be rare since it requires a user using a large `num_levels`. I'll follow up with a more complex fix if needed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11673
Test Plan: * Add a unit test that produce the above error before this PR. Need to compile it with ubsan: `COMPILE_WITH_UBSAN=1 OPT="-fsanitize-blacklist=.circleci/ubsan_suppression_list.txt" ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 compaction_picker_test`
Reviewed By: hx235
Differential Revision: D48074386
Pulled By: cbi42
fbshipit-source-id: 25e59df7e93f20e0793cffb941de70ac815d9392
Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, file hint is not longer needed for compaction read. Therefore we can deprecate `Options::access_hint_on_compaction_start`. As this is a public API change, we should first mark the relevant APIs (including the Java's) deprecated and remove it in next major release 9.0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11658
Test Plan: No code change
Reviewed By: ajkr
Differential Revision: D47997856
Pulled By: hx235
fbshipit-source-id: 16e015ae7728c224b1caef73143aa9915668f4ac
Summary:
Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358
Test Plan:
* New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
* Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.
Reviewed By: ajkr
Differential Revision: D46582680
Pulled By: cbi42
fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
Summary:
... used in validating some HyperClockCache development in progress.
* Revamp the "populate cache" step to avoid redundant insertions (very rare in practice) and more consistently approach the desired resident_ratio while maintaining appropriate skew (still not perfect).
* Track and print hit ratio on lookups, to ensure a fair comparison is happening between implementations etc.
* Add an option to disable tracking and printing histograms (lots of output)
* Add an option to specify a random seed (for more reproducibility)
* Remove confusing/redundant "-skewed" option
Uses BitwiseAnd from https://github.com/facebook/rocksdb/issues/11660 (tested there)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11661
Test Plan: manual
Reviewed By: akankshamahajan15, jowlyzhang
Differential Revision: D47937671
Pulled By: pdillinger
fbshipit-source-id: 85a2bb881b1bca4f63e015bac684105fd91c9f35
Summary:
Updated the main branch for the 8.5.fb branch cut. Also made unreleased_history/release.sh backdate to the last commit instead of the current date in case the release manager is a laggard like myself.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11642
Reviewed By: cbi42
Differential Revision: D47783574
Pulled By: ajkr
fbshipit-source-id: 4e2a80f5ccd542dc7dd0d22dfd7e59cb136325a1
Summary:
BottomNBits() - there is a single fast instruction for this on x86 since BMI2, but testing with godbolt indicates you need at least GCC 10 for the compiler to choose that instruction from the obvious C++ code. https://godbolt.org/z/5a7Ysd41h
BitwiseAnd() - this is a convenience function that works around the language flaw that the type of the result of x & y is the larger of the two input types, when it should be the smaller. This can save some ugly static_cast.
I expect to use both of these in coming HyperClockCache developments, and have applied them in a couple of places in existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11660
Test Plan: unit tests added
Reviewed By: jowlyzhang
Differential Revision: D47935531
Pulled By: pdillinger
fbshipit-source-id: d148c43a1e51df4a1c549b93aaf2725a3f8d3bd6
Summary:
Adds a few missing features to the C API:
1) Statistics level
2) Getting individual values instead of a serialized string
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11263
Test Plan: unit tests
Reviewed By: ajkr
Differential Revision: D47309963
Pulled By: hx235
fbshipit-source-id: 84df59db4045fc0fb3ea4aec451bc5c2afd2a248
Summary:
(Copied from https://www.internalfb.com/diff/D46606060)
This diff makes its files safe for use with -Wimplicit-fallthrough. Now that we're using C+20 there's no reason not to use this C++17 feature to make our code safer.
It's currently possible to write code like this:
```
switch(x){
case 1:
foo1();
case 2:
foo2();
break;
case 3:
foo3();
}
```
But that's scary because we don't know whether the fallthrough from case 1 was intentional or not.
The -Wimplicit-fallthrough flag will make this an error. The solution is to either fix the bug by inserting break or indicating intention by using [[fallthrough]]; (from C++17).
```
switch(x){
case 1:
foo1();
[[fallthrough]]; // Solution if we intended to fallthrough
break; // Solution if we did not intend to fallthrough
case 2:
foo2();
break;
case 3:
foo3();
}
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11663
Test Plan: Existing tests
Reviewed By: jowlyzhang
Differential Revision: D47961248
Pulled By: jaykorean
fbshipit-source-id: 0d374c721bf1b328c14949dc5c17693da7311d03
Summary:
https://github.com/facebook/rocksdb/issues/11653 broke some crash tests.
Apparently these Rollbacks are needed for pessimistic transaction cases. (I'm still not sure if the API makes any sense with regard to safe usage. It's certainly not documented. Will consider in follow-up PRs.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11656
Test Plan: manual crash test runs, crash_test_with_multiops_wc_txn and crash_test_with_multiops_wp_txn
Reviewed By: cbi42
Differential Revision: D47906280
Pulled By: pdillinger
fbshipit-source-id: d058a01b6dbb47a4f08d199e335364168304f81b
Summary:
## Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without `block_align=true` the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.
To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example
* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
* files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files
Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.
## Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.
## More detail: making space in footer
In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:
* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)
## Performance
In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).
Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).
```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```
Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness
Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058
Test Plan:
unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.
Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably *usually* true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.
The format compatibility test does not need updating because we have not updated the default
format_version.
Reviewed By: ajkr, mrambacher
Differential Revision: D33100915
Pulled By: pdillinger
fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
Summary:
In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers.
Also, some of the APIs were not clear about Transaction semantics, so I have clarified:
* (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction).
* Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test.
Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress.
* Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function.
* Use unique_ptr (nicely complements removing unnecessary Rollbacks)
* Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".)
Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653
Test Plan:
this is the test :)
Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances.
Reviewed By: cbi42
Differential Revision: D47882995
Pulled By: pdillinger
fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d
Summary:
Some trailing whitespace has leaked into HISTORY.md entries. This can lead to unexpected changes when modifying HISTORY.md with hygienic editors (e.g. for a patch release). This change should protect against future cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11652
Test Plan: manual
Reviewed By: akankshamahajan15
Differential Revision: D47882814
Pulled By: pdillinger
fbshipit-source-id: 148c3746d3b298cb6e1f655f0416d46619969086
Summary:
... to improve data integrity validation during compaction.
A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed.
For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code.
For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571
Test Plan:
* Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case.
* All other unit tests for non-corrupted case.
* Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple`
Reviewed By: ajkr
Differential Revision: D47131965
Pulled By: cbi42
fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92
Summary:
Add a built-in comparator that supports uint64_t style user-defined timestamps for ReverseBytewiseComparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11647
Test Plan:
Added a test wrapper for retrieving this comparator from registry and used it in this test:
`./udt_util_test`
Reviewed By: ltamasi
Differential Revision: D47848303
Pulled By: jowlyzhang
fbshipit-source-id: 5af5534a8c2d9195997d0308c8e194c1c797548c
Summary:
Fix use_after_free bug in async_io MultiReads when underlying FS enabled kFSBuffer. kFSBuffer is when underlying FS pass their own buffer instead of using RocksDB scratch in FSReadRequest
Since it's an experimental feature, added a hack for now to fix the bug.
Planning to make public API change to remove const from the callback as it doesn't make sense to use const.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11645
Test Plan: tested locally
Reviewed By: ltamasi
Differential Revision: D47819907
Pulled By: akankshamahajan15
fbshipit-source-id: 1faf5ef795bf27e2b3a60960374d91274931df8d
Summary:
Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.
To do this, this PR includes:
1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry.
2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid:
a) no user comparator change and no effective `persist_user_defined_timestamp` flag change.
b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag
is false.
c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is
already false.
3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp.
4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623
Test Plan:
```
make all check
./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*"
./db_wal_test --gtest_filter="*EnableDisableUDT*"
```
Reviewed By: ltamasi
Differential Revision: D47636862
Pulled By: jowlyzhang
fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322
Summary:
Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval.
This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`.
For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`.
Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it.
Tet plan:
```
./column_family_test --gtest_filter="*RetainUDT*"
./memtable_list_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599
Reviewed By: ajkr
Differential Revision: D47561586
Pulled By: jowlyzhang
fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89
Summary:
this is stacked on https://github.com/facebook/rocksdb/issues/11550 to further clarify usage of these two options for universal compaction. Similar to FIFO, the two options have the same meaning for universal compaction, which can be confusing to use. For example, for universal compaction, dynamically changing the value of `ttl` has no impact on periodic compactions. Users should dynamically change `periodic_compaction_seconds` instead. From feature matrix (https://fburl.com/daiquery/5s647hwh), there are instances where users set `ttl` to non-zero value and `periodic_compaction_seconds` to 0. For backward compatibility reason, instead of deprecating `ttl`, comments are added to mention that `periodic_compaction_seconds` are preferred. In `SanitizeOptions()`, we update the value of `periodic_compaction_seconds` to take into account value of `ttl`. The logic is documented in relevant option comment.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11552
Test Plan: * updated existing unit test `DBTestUniversalCompaction2.PeriodicCompactionDefault`
Reviewed By: ajkr
Differential Revision: D47381434
Pulled By: cbi42
fbshipit-source-id: bc41f29f77318bae9a96be84dd89bf5617c7fd57
Summary:
Remove obsolete comment.
Support for WriteBatchWithIndex::NewIteratorWithBase when overwrite_key=false is added in https://github.com/facebook/rocksdb/pull/8135, as you can clearly see in the HISTORY.md.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11636
Reviewed By: jowlyzhang
Differential Revision: D47722955
Pulled By: ajkr
fbshipit-source-id: 4fa44a309d9708e9f4a1530918a9aaf7114c9032
Summary:
... ahead of dynamic variant.
* Introduce an Unref function for a common pattern. Cases that were previously using std::memory_order_acq_rel we doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value so do not need the acquire semantic, just release.
* Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function.
* Based on a linter suggesting, use const Func& parameter type instead of Func for templated callable parameters.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11630
Test Plan: existing tests, and performance test with release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster.
Reviewed By: jowlyzhang
Differential Revision: D47657334
Pulled By: pdillinger
fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3
Summary:
**Context/Summary**
As titled. The benefit of doing so is to explicitly call readahead() instead of relying page cache behavior for compaction read when we know that we most likely need readahead as compaction read is sequential read .
**Test**
Extended the existing UT to cover compaction read case
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11631
Reviewed By: ajkr
Differential Revision: D47681437
Pulled By: hx235
fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90
Summary:
Add `rocksdb_transactiondb_get_base_db` and `rocksdb_transactiondb_close_base_db` functions to the C API modeled after `rocksdb_optimistictransactiondb_get_base_db` and `rocksdb_optimistictransactiondb_close_base_db`:
ca50ccc71a/include/rocksdb/c.h (L2711-L2716)
With this pair of functions, it is possible to get a `rocksdb_t *` from a `rocksdb_transactiondb_t *`. The main goal is to be able to use the approximate memory usage API, only accessible to the `rocksdb_t *` type:
ca50ccc71a/include/rocksdb/c.h (L2821-L2833)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11562
Reviewed By: ajkr
Differential Revision: D47603343
Pulled By: jowlyzhang
fbshipit-source-id: c70cf6af5834026e232fe7791634db3a396f7d5e
Summary:
Add path existence check in the script to avoid script running even when db_bench executable does not exist or relative path is not right.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11621
Reviewed By: jowlyzhang
Differential Revision: D47552590
Pulled By: ajkr
fbshipit-source-id: f09ea069f69e067212b249a22ad755b76bc6063a
Summary:
In [db_impl_open.cc](https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc), the sync point `SanitizeOptions::AfterChangeMaxOpenFiles` is used to set `max_open_files` with some specified "**invalid**" value even if it has been sanitized.
However, in [db_compaction_test.cc](https://github.com/facebook/rocksdb/blob/main/db/db_compaction_test.cc), `SanitizeOptions::AfterChangeMaxOpenFiles` would not be executed since `SyncPoint::EnableProcessing()` is run after `DBTestBase::Reopen()`. To enable `SanitizeOptions::AfterChangeMaxOpenFiles`, `SyncPoint::EnableProcessing()` should be put ahead of `DBTestBase::Reopen()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11583
Test Plan:
run unit tests locally as below:
```
make J=1 check
[ RUN ] DBCompactionTest.LevelTtlCascadingCompactions
[ OK ] DBCompactionTest.LevelTtlCascadingCompactions (85 ms)
[ RUN ] DBCompactionTest.LevelPeriodicCompaction
[ OK ] DBCompactionTest.LevelPeriodicCompaction (57 ms)
```
Reviewed By: jowlyzhang
Differential Revision: D47311827
Pulled By: ajkr
fbshipit-source-id: 99165e87a8129e404af06fdf9b4c96eca540fd23
Summary:
This adds proper support for using rocksdb with FetchContent, without this PR the user must include the following with their own `CMakeLists.txt` file:
```cmake
include_directories(./build/_deps/rocksdb-src/include)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11575
Reviewed By: jowlyzhang
Differential Revision: D47163520
Pulled By: ajkr
fbshipit-source-id: a202dcf435ecc9dd8d51c88f90e98c04814721ca
Summary:
An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema.
This PR prepares for the specialized `RowCache` interface by updating the migration plan of https://github.com/facebook/rocksdb/issues/11450. I added a release note for the removed APIs and didn't mention the added ones for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11620
Reviewed By: pdillinger
Differential Revision: D47536962
Pulled By: ajkr
fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11624
<queue> must be included to use std::queue.
Reviewed By: pdillinger
Differential Revision: D47562433
fbshipit-source-id: 7c5b19fd9e411694c782dfc0dff0231d4f92ef24
Summary:
... ahead of planned dynamic HCC variant. This changes
simplifies some logic while still enabling future code sharing between
implementation variants.
Detail: For complicated reasons, using a std::function parameter to
`ConstApplyToEntriesRange` with a lambda argument does not play
nice with templated HandleImpl. An explicit conversion to std::function
would be needed for it to compile. Templating the function type is the
easy work-around.
Also made some functions from https://github.com/facebook/rocksdb/issues/11572 private as recommended
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11609
Test Plan: existing tests
Reviewed By: jowlyzhang
Differential Revision: D47407415
Pulled By: pdillinger
fbshipit-source-id: 0f65954db16335999b78fb7d2563ec627624cef0
Summary:
Extend the coverage for option `flush_verify_memtable_count`. The verification code is similar to the ones for regular flush: c3c84b3397/db/flush_job.cc (L956-L965)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11611
Test Plan: existing tests.
Reviewed By: ajkr
Differential Revision: D47478893
Pulled By: cbi42
fbshipit-source-id: ca580c9dbcd6e91facf2e49210661336a79a248e
Summary:
Stacked on https://github.com/facebook/rocksdb/issues/11572
* Minimize use of std::function and lambdas to minimize chances of
compiler heap-allocating closures (unnecessary stress on allocator). It
appears that converting FindSlot to a template enables inlining the
lambda parameters, avoiding heap allocations.
* Clean up some logic with FindSlot (FIXMEs from https://github.com/facebook/rocksdb/issues/11572)
* Fix handling of rare case of probing all slots, with new unit test.
(Previously Insert would not roll back displacements in that case, which
would kill performance if it were to happen.)
* Add an -early_exit option to cache_bench for gathering memory stats
before deallocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11601
Test Plan:
unit test added for probing all slots
## Seeing heap allocations
Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache`
before https://github.com/facebook/rocksdb/issues/11572 vs. after this change. Before, we see this in the
interesting bin statistics:
```
size nrequests
---- ---------
32 578460
64 24340
8192 578460
```
And after:
```
size nrequests
---- ---------
32 (insignificant)
64 24370
8192 579130
```
## Performance test
Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench`
Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000`
in before and after configurations, simultaneously:
```
Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442
After: Complete in 32.773 s; Rough parallel ops/sec = 2441019
```
Reviewed By: jowlyzhang
Differential Revision: D47375092
Pulled By: pdillinger
fbshipit-source-id: 46f0f57257ddb374290a0a38c651764ea60ba410
Summary:
We observed `CompactionOutputs::UpdateGrandparentBoundaryInfo` consumes much time for `InternalKey::DecodeFrom` and `InternalKey::~InternalKey` in flame graph.
This PR omit the InternalKey object in `CompactionOutputs::UpdateGrandparentBoundaryInfo` .
![image](https://github.com/facebook/rocksdb/assets/1574991/661eaeec-2f46-46c6-a6a8-9738d6c191de)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11610
Reviewed By: ajkr
Differential Revision: D47426971
Pulled By: cbi42
fbshipit-source-id: f0d3a8186d778294515c0685032f5b395c4d6a62